Netflix and containers

Andrew Spyker (@aspyker) - Engineering Manager
About Netflix
● 86.7M members
● A few thousand employees
● 190+ countries
● > ⅓ of North American internet download traffic
● 500+ microservices
● Many tens of thousands of VMs
● 3 regions across the world
Netflix has an elastic, cloud-native, immutable microservice architecture, using full DevOps, built on VMs!
Why are we messing around with containers?
Technical motivating factors for containers
● Simpler management of compute resources
● Simpler deployment packaging artifacts for compute jobs
● Need for a consistent local developer environment
Sampling of realized container benefits
Media Encoding - encoding research development time
● VM platform to container platform: 1 month vs. 1 week

Continuous Integration Testing
● Build all Netflix codebases in hours
● Saves development hundreds of hours of debugging

Edge re-architecture using NodeJS
● Focus returns to app development
● Simplifies and speeds test and deployment
Batch applications
Multi-tenant compute (Linux cgroups, via Mesos) has historically been used for batch
What do batch users want?
● Simple shared resources, run till done, job files
● NOT
  ○ EC2 instance sizes, autoscaling, AMI OSes
● WHY
  ○ Offloads resource-management ops; simpler
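As a sketch of what "simple shared resources, run till done, job files" means in practice, here is a minimal, hypothetical job description and validator. The field names are illustrative and are not the Titus API; the point is what a batch user does (and does not) have to specify.

```python
# Hypothetical, minimal batch job description: resources plus an image and an
# entrypoint -- no instance types, AMIs, or autoscaling policies. Field names
# are illustrative only, not the Titus API.

def validate_job(job):
    """Reject jobs that ask for anything beyond simple shared resources."""
    required = {"name", "image", "entrypoint", "cpus", "memory_mb"}
    missing = required - job.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if job["cpus"] <= 0 or job["memory_mb"] <= 0:
        raise ValueError("resources must be positive")
    return job

job = validate_job({
    "name": "encode-title-123",
    "image": "registry.example.com/media-encoder:latest",
    "entrypoint": ["encode", "--profile", "h264"],
    "cpus": 4,
    "memory_mb": 8192,
})
```

Everything the user does not specify (host selection, bin packing, retries) is the scheduler's job, which is exactly the ops offload the slide describes.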
Introducing Titus
(Diagram.) Upstream systems (workflow, data analysis, ad-hoc) submit batch jobs into Titus's layers: Job Management, Resource Management & Optimization, and Container Execution & Integration.
Netflix Batch Job Examples
● Algorithm model training (with GPUs)
● Media encoding
● Digital watermarking
● Open Connect CDN reporting
● Ad-hoc reporting
Lessons Learned from Batch
● Docker helped generalize use cases
● Scheduling required (GPU, elastic)
● Initially ignored failures (with retries)
● Time-sensitive batch came later
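The "initially ignored failures (with retries)" posture can be sketched as a generic retry wrapper with exponential backoff: treat every failure as transient and rerun the task, which works for run-till-done batch but not for time-sensitive jobs. This is an illustrative sketch, not Titus code.

```python
import time

def run_with_retries(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run a batch task, retrying failures with exponential backoff.

    Treating every failure as retryable is fine for run-till-done batch;
    time-sensitive batch (which came later) needs smarter handling.
    Generic sketch, not Titus code.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...

# Usage: a task that fails twice, then succeeds on the third attempt.
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("transient failure")
    return "done"

assert run_with_retries(flaky, sleep=lambda s: None) == "done"
assert len(attempts) == 3
```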
Current Container Usage - Batch
● 1000's of container hosts (g2, m4, r3 instances)
● 1000's of containers/hour on average
● Large spikes from CI testing and digital watermarking

From the day of 10/26:
● 47K containers
● Bursts of 1,000 containers in a minute
Service applications
Why Services in containers?
(Diagram contrasting the theory vs. the reality of the developer experience.)
Opportunities to evolve our baking
● Java-focused supported AMI - baking works well!
● However, we wanted to allow
  ○ other stacks to evolve independently of OS updates
  ○ simplified builds (vs. Java- and OS-based tooling)
  ○ reliable, smaller instances for dynamic languages
  ○ the ability to develop locally with the same image
Services are just long-running batch?

(Diagram.) Service apps join batch jobs as inputs to the same Titus layers: Job Management, Resource Management & Optimization, and Container Execution & Integration.
Nope, not that easy - Titus Details

(Architecture diagram.) The Titus UI and CI/CD integrations talk to the Titus API (Rhea), backed by Cassandra. The Titus Master, running job management & scheduling (built on Fenzo), coordinates through Zookeeper and drives the Mesos Master, the EC2 Autoscaling API, and S3, with images pulled from the Docker Registry. Each Titus Agent host runs the Mesos agent, Docker, and the Titus executor alongside metrics and logging agents, ZFS, pod & VPC network drivers, and an AWS metadata proxy, hosting the containers themselves. The whole system integrates with AWS VMs and CI/CD.
Services are more complex
● Services resize constantly and run forever
  ○ Autoscaling
  ○ Hard to upgrade underlying hosts
● Require IPC integration
  ○ Routable IPs, service discovery
  ○ Ready for traffic vs. just started/stopped
● Existing well-defined dev, deploy, runtime & ops tools
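The "ready for traffic vs. just started/stopped" distinction can be sketched as a readiness gate: the container process can be up while the app inside is still warming, and a discovery system (Eureka-style) should only route traffic once every health check passes. The check names below are illustrative.

```python
# "Started" is not "ready for traffic": the container process may be up while
# the app is still warming. Sketch of a readiness gate a discovery system
# could consult before routing. Check names are illustrative.

def is_ready(checks):
    """Routable only when every registered health check passes."""
    return all(check() for check in checks)

class App:
    def __init__(self):
        self.process_up = True      # container started
        self.caches_warm = False    # app still warming up

app = App()
checks = [lambda: app.process_up, lambda: app.caches_warm]

assert not is_ready(checks)   # started, but not yet ready for traffic
app.caches_warm = True
assert is_ready(checks)       # now safe to route traffic
```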
Real networking is hard
● Multi-tenant
● Need an IP per container - in the VPC
● Using security groups
● Using IAM roles
● Considering network resource isolation
Enabling VPC Networking

(Diagram.) A Titus EC2 host VM attaches multiple ENIs, each with its own IPs and security groups: eth1/ENI1 (SecGrp A), eth2/ENI2 (SecGrp X), eth3/ENI3 (SecGrps Y, Z). Tasks run in pods wired to an ENI through veth&lt;id&gt; pairs: Task 0 runs with no routable IP on SecGrp A, Tasks 1 and 2 share ENI2 under SecGrp X, and Task 3 uses ENI3 under SecGrps Y and Z. Linux policy-based routing plus traffic control steers each container's traffic out its ENI, and an IPTables NAT redirects the non-routable 169.254.169.254 metadata address to the Titus EC2 metadata proxy.
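The scheduling problem behind the diagram can be sketched as ENI placement: each task needs an IP from an ENI whose security groups match, and both ENIs per host and IPs per ENI are limited. The limits and names below are illustrative, not the real AWS limits for any particular instance type.

```python
# Sketch of ENI placement: reuse an attached ENI whose security groups match
# the task's, else attach a new one, within per-host limits. Limits are
# illustrative, not real AWS per-instance-type values.

class Eni:
    def __init__(self, sec_groups, max_ips):
        self.sec_groups = frozenset(sec_groups)
        self.max_ips = max_ips
        self.assigned = 0  # IPs handed out to tasks so far

def place_task(enis, wanted_groups, max_enis=3, max_ips=4):
    """Assign the task an IP on a matching ENI; return None if host is full."""
    wanted = frozenset(wanted_groups)
    for eni in enis:
        if eni.sec_groups == wanted and eni.assigned < eni.max_ips:
            eni.assigned += 1
            return eni
    if len(enis) < max_enis:
        eni = Eni(wanted, max_ips)
        eni.assigned = 1
        enis.append(eni)
        return eni
    return None  # scheduler must try another host

# Usage: two tasks with the same security group share one ENI;
# a different security-group set forces a second ENI.
enis = []
a = place_task(enis, {"sg-app"})
b = place_task(enis, {"sg-app"})
c = place_task(enis, {"sg-db", "sg-cache"})
assert a is b and a.assigned == 2 and len(enis) == 2
```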
Solutions
● VPC networking driver
  ○ Supports ENIs - full IP functionality
  ○ Scheduled security groups
  ○ Supports traffic control (resource isolation)
● EC2 metadata proxy
  ○ Adds container "node" identity
  ○ Delivers IAM roles
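The metadata-proxy idea can be sketched as follows: container requests to 169.254.169.254 are NATed to a local proxy, which answers identity and IAM paths per container instead of exposing the host's role. The paths mirror the EC2 metadata layout; the IP-to-role mapping and the credential body are illustrative.

```python
# Sketch of a per-container EC2 metadata proxy: the source IP identifies the
# container, and IAM paths are answered with that container's role rather
# than the host's. The mapping and responses are illustrative.

CONTAINER_ROLES = {
    "172.16.0.11": "media-encoder-role",
    "172.16.0.12": "watermarking-role",
}

def handle_metadata(source_ip, path):
    """Answer a metadata request on behalf of the container at source_ip."""
    role = CONTAINER_ROLES.get(source_ip)
    if role is None:
        return 404, ""
    if path == "/latest/meta-data/iam/security-credentials/":
        return 200, role  # the role listing: exactly one role per container
    if path == f"/latest/meta-data/iam/security-credentials/{role}":
        # A real proxy would obtain temporary credentials scoped to the
        # container's role here (e.g. via STS), never the host's role.
        return 200, '{"Code": "Success"}'
    return 404, ""

status, body = handle_metadata(
    "172.16.0.11", "/latest/meta-data/iam/security-credentials/")
assert (status, body) == (200, "media-encoder-role")
```

Combined with the IPTables NAT from the networking slide, each container sees only its own identity at the standard metadata address.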
Reuse existing infrastructure services

(Diagram.) The AWS AutoScaler manages app VMs in EC2/VPC; each VM runs its app on the cloud platform (metrics, IPC, health). All of it sits on the Netflix cloud infrastructure (VMs + containers), shared with Atlas, Eureka, and Edda.
Enable them for containers

(Diagram.) Alongside the AWS AutoScaler-managed VMs, Titus Job Control now manages service and batch containers; container apps run on the same cloud platform (metrics, IPC, health) and register with the same Atlas, Eureka, and Edda services on the shared Netflix cloud infrastructure.
Spinnaker
● Deploy based on new image tags
● Basic resource requirements
● IAM roles & security groups per container
● Deploy strategies - same as VMs
● Easily see health & discovery
Secure Multi-tenancy
Common to VMs, and tiered security needed
● Protect the reduced host IAM role; allow containers to have specific IAM roles
● Needed to support the same security groups in container networking as on VMs

User namespacing
● Docker 1.10 introduced user namespaces
● Didn't work with shared networking namespaces
● Docker 1.11 fixed shared networking namespaces
● But namespacing is per daemon, not per container as hoped
● Waiting on Linux
● Considering mass chmod / ZFS clones
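The user-namespace remapping being discussed works by translating container uids into an unprivileged host range (the /etc/subuid mechanism). A per-container mapping would let root in each container land on a different host uid; a per-daemon mapping gives every container the same range. The arithmetic is simple; the values below are illustrative.

```python
# User-namespace uid translation: container uids map onto a contiguous,
# unprivileged host range defined by a subuid entry like
# "dockremap:100000:65536". Values are illustrative.

def host_uid(container_uid, subuid_base, subuid_count):
    """Translate a uid inside the container to the host uid it maps to."""
    if not 0 <= container_uid < subuid_count:
        raise ValueError("uid outside the mapped range")
    return subuid_base + container_uid

# Container root (uid 0) becomes an unprivileged host uid:
assert host_uid(0, 100000, 65536) == 100000
# An ordinary container user shifts by the same base:
assert host_uid(1000, 100000, 65536) == 101000
```

This is also why "mass chmod / ZFS clones" appears as a workaround: files written under one uid mapping must be re-owned before being used under another.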
Titus Advanced Scheduling
● Support for AZ balancing
● Multiple instance types selected based on workload
● Elastic underlying common resource pool
  ○ Bin packing managed transparently across all apps
● Hard and soft constraints
● Resource affinity and task affinity
● Capacity guarantees (critical tier)
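The interplay of hard constraints, soft constraints, and bin packing can be sketched in a few lines: hard constraints filter hosts, then soft constraints and a packing score rank the survivors. The host and task shapes are illustrative, not Titus data structures.

```python
# Sketch of scheduling with hard and soft constraints: hard constraints are
# filters, soft constraints and a bin-packing score are rankers.
# Host/task shapes are illustrative.

def pick_host(hosts, task, hard, soft):
    candidates = [h for h in hosts if all(c(h, task) for c in hard)]
    if not candidates:
        return None  # no feasible placement on this pass
    def score(h):
        fitness = sum(c(h, task) for c in soft)     # satisfied soft constraints
        packing = 1.0 - h["free_cpus"] / h["cpus"]  # prefer fuller hosts
        return (fitness, packing)
    return max(candidates, key=score)

fits = lambda h, t: h["free_cpus"] >= t["cpus"]   # hard: capacity must fit
same_az = lambda h, t: h["az"] == t.get("az")     # soft: AZ affinity

hosts = [
    {"name": "a", "cpus": 16, "free_cpus": 12, "az": "us-east-1a"},
    {"name": "b", "cpus": 16, "free_cpus": 4,  "az": "us-east-1b"},
]
task = {"cpus": 2, "az": "us-east-1b"}
# Both hosts fit; the soft AZ-affinity constraint breaks the tie toward "b".
assert pick_host(hosts, task, [fits], [same_az])["name"] == "b"
```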
Fenzo - Keep resource scheduling extensible

Fenzo - Extensible Scheduling Library

Features:
● Heterogeneous resources & tasks
● Autoscaling of the Mesos cluster
  ○ Multiple instance types
● Plugin-based scheduling objectives
  ○ Bin packing, etc.
● Plugin-based constraints evaluators
  ○ Resource affinity, task locality, etc.
● Visibility into scheduling actions
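The "plugin-based scheduling objectives" idea means fitness functions are registered rather than hard-coded, so new objectives can be added without touching the scheduler core. Here is a Python sketch of that extension point; the interface is illustrative and is not Fenzo's actual Java API.

```python
# Sketch of plugin-based scheduling objectives: fitness calculators register
# themselves by name, and the scheduler looks them up rather than hard-coding
# one policy. Illustrative only, not Fenzo's Java API.

FITNESS_PLUGINS = {}

def fitness_plugin(name):
    """Decorator registering a fitness calculator under a name."""
    def register(fn):
        FITNESS_PLUGINS[name] = fn
        return fn
    return register

@fitness_plugin("cpu_bin_packing")
def cpu_bin_packing(host, task):
    """Score in [0, 1]: prefer hosts that end up fullest after placement."""
    used_after = host["cpus"] - host["free_cpus"] + task["cpus"]
    return used_after / host["cpus"]

def evaluate(host, task, objective):
    """Scheduler-side hook: score a placement under a named objective."""
    return FITNESS_PLUGINS[objective](host, task)

host = {"cpus": 16, "free_cpus": 8}
assert evaluate(host, {"cpus": 4}, "cpu_bin_packing") == 0.75
```

Adding, say, a network-bandwidth packing objective would be another decorated function, with no change to the scheduler loop.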
Current Container Usage - Service
● Still small: ~2,000 long-running containers
● NodeJS device UI script apps
● Stream processing jobs - Flink
● Various internal dashboards
Questions?
Andrew Spyker (@aspyker) - Engineering Manager