Velocity NYC 2016 - Containers @ Netflix
Netflix Open Source Software
http://netflix.github.io
What do batch users want?
● Simple shared resources, run till done, job files
● NOT
○ EC2 Instance sizes, autoscaling, AMI OS’s
● WHY
○ Offloads resource management ops, Simpler
Titus
Batch
Job Management
Resource Management & Optimization
Container Execution
Integration
Workflow, Data Analysis, Adhoc Upstream Systems
Lessons Learned from Batch
● Docker helped generalize use cases
● Advanced scheduling required
● Initially ignored failures (with retries)
● Time sensitive batch came later
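The "ignored failures (with retries)" approach can be illustrated with a minimal sketch. This is not Titus code: `run_with_retries`, its `max_attempts` parameter, and the convention that a job returns an exit code are all assumptions made for illustration.

```python
from typing import Callable


def run_with_retries(job: Callable[[], int], max_attempts: int = 3) -> int:
    """Re-run a batch job until it exits with code 0.

    Returns the number of attempts used. The failure cause is deliberately
    not inspected, mirroring the "ignore failures, just retry" approach.
    """
    for attempt in range(1, max_attempts + 1):
        if job() == 0:  # hypothetical convention: 0 means success
            return attempt
    raise RuntimeError(f"job failed after {max_attempts} attempts")
```

Blind retries like this are simple, but they hide why a job failed and add latency, which fits with time sensitive batch being a later concern.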
Current Container Usage - Batch
● 100 containers / hour
● Peaks of 1000’s per hour
● Large spikes of CI testing and Digital Watermarking
A random day’s worth of containers
“Why is Apache and Tomcat running on my NodeJS server”
Problem: BaseAMI optimized for Java, not easily customizable
“Why do I need java, gradle, ospackage after my non-Java build?”
Problem: Reuse of Java-centric AMI tooling
“I want an instance with a single core to run my lightweight server”
Problem: Small instances are not reliable
Enter Docker
● Have a new language?
● Have a build tool you like?
● Want to resource isolate easily?
Come one, come all
Services are just long running batch?
Services
Job Management
Resource Management & Optimization
Container Execution
Integration
Service Apps
Batch
Services more complex
● Services resize constantly and run forever
○ Autoscaling
○ Hard to upgrade underlying hosts
● Have more state
○ Ready for traffic vs. just started/stopped
○ Even harder to upgrade
● Existing well defined dev, deploy, runtime & ops tools
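The "more state" point can be made concrete with a small sketch. The state names and the `routable` helper below are hypothetical, chosen only to show why "started" is not enough for a service container: traffic should go only to instances that have reported themselves ready.

```python
from enum import Enum
from typing import Dict, List


class ServiceState(Enum):
    # Hypothetical states: batch only needs started/stopped,
    # a service also needs "ready for traffic".
    STARTED = "started"
    READY_FOR_TRAFFIC = "ready_for_traffic"
    STOPPED = "stopped"


def routable(states: Dict[str, ServiceState]) -> List[str]:
    """Return the ids of containers that may receive traffic, sorted by id."""
    return sorted(
        container_id
        for container_id, state in states.items()
        if state is ServiceState.READY_FOR_TRAFFIC
    )
```

A host upgrade then has to wait for replacement containers to reach the ready state before old ones are stopped, which is one reason upgrades are harder for services than for batch.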
Multi-tenant
Need IP per container - in VPC
Using security groups
Using IAM roles
Considering network resource isolation
Solutions
● VPC Networking driver
○ Supports ENI’s - full IP functionality
○ Scheduled security groups
○ Supports traffic control (isolation)
● EC2 Metadata proxy
○ Adds container “node” identity
○ Delivers IAM roles
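The idea behind a metadata proxy can be sketched as follows. This is an illustration under stated assumptions, not the Titus implementation: the function name, the IP-to-role table, and the status codes are hypothetical. Only the path `/latest/meta-data/iam/security-credentials/` is taken from the standard EC2 instance metadata service, where it returns the name of the role attached to the instance.

```python
from typing import Dict, Tuple

# Standard EC2 instance metadata path that lists the attached IAM role name.
CREDENTIALS_PATH = "/latest/meta-data/iam/security-credentials/"


def handle_metadata_request(
    path: str, source_ip: str, role_by_container_ip: Dict[str, str]
) -> Tuple[int, str]:
    """Answer a metadata request on behalf of the calling container.

    The caller is identified by its per-container IP (hypothetical lookup
    table), so each container sees its own IAM role rather than the host's.
    """
    role = role_by_container_ip.get(source_ip)
    if role is None:
        return 403, "unknown container"
    if path == CREDENTIALS_PATH:
        return 200, role
    return 404, "not handled in this sketch"
```

Because each container has its own IP in the VPC, the source address is enough in this sketch to tell containers on the same host apart.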
Reuse existing infrastructure services
[Diagram: EC2 VMs under the AWS AutoScaler, each running an App on the Cloud Platform (metrics, IPC, health), in a VPC. Netflix Cloud Infrastructure (VM’s + Containers): Atlas, Eureka, Edda]
Enable them for containers
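[Diagram: the same layout extended for containers. EC2 VMs under the AWS AutoScaler run an App on the Cloud Platform (metrics, IPC, health); alongside them, Titus Job Control runs Containers and Batch Containers on VMs, with each containerized App also on the Cloud Platform (metrics, IPC, health). All sit in a VPC. Netflix Cloud Infrastructure (VM’s + Containers): Atlas, Eureka, Edda]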
Current Container Usage - Service
● Still small - 100’s of containers
● NodeJS Device UI Scripts Apps
● Stream Processing Jobs - Flink
● Various Internal Dashboards
Node.js support
before Newt:
● Install Java
● Install Nebula (Netflix Gradle)
● Add a build.gradle
● Run gradlew wrapper
● Add deb instr to build.gradle
● Install Vagrant + VBox
● Test deb locally
● Create Stash repo
● Create Jenkins job
● Create Spinnaker pipelines
● git push
after Newt:
● Install Newt
● newt init --app-type nodejs
● git push
Nebula ospackage is hidden inside a local Docker container managed by Newt
https://en.wikipedia.org/wiki/Data_visualization#/media/File:Social_Network_Analysis_Visualization.png
dependencies @ scale
Future of containers @ Netflix
● More scale!
● Guaranteed capacity (service)
● Fair scheduling (batch)
● Local integration test env (devex)
● Next generation CI (devex)
● Internal RI spot market of trough