Velocity NYC 2016 - Containers @ Netflix

Containers @ Netflix: How they add to a proven cloud architecture

Upload: aspyker

Post on 16-Apr-2017


TRANSCRIPT

Containers @ Netflix: How they add to a proven cloud architecture

https://www.flickr.com/photos/hinnosaar/2655128664

Datacenters

Java monolithic app

Oracle database

AWS cloud

Java microservices

Cassandra

Netflix Open Source Software

http://netflix.github.io

Containers @ Netflix: How they add to a proven cloud architecture

batch applications

Multi-tenant (cgroups/Mesos) historically used for batch

Linux cgroups

What do batch users want?

● Simple shared resources, run till done, job files

● NOT
  ○ EC2 instance sizes, autoscaling, AMI OSes

● WHY
  ○ Offloads resource management ops, simpler

Titus

Batch

Job Management

Resource Management & Optimization

Container Execution

Integration

Workflow, Data Analysis, Adhoc Upstream Systems
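The "Resource Management & Optimization" layer is described only at the box level here. As a rough illustration of what placing batch jobs on shared resources can look like, below is a minimal first-fit sketch. It is not Titus's scheduler; the `Host`, `Job`, and `place_first_fit` names and the two-resource model (CPU and memory) are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Host:
    """A shared agent with free CPU and memory (hypothetical model)."""
    name: str
    cpus: float
    memory_mb: int
    placed: List[str] = field(default_factory=list)

    def fits(self, cpus: float, memory_mb: int) -> bool:
        return self.cpus >= cpus and self.memory_mb >= memory_mb

    def reserve(self, job_id: str, cpus: float, memory_mb: int) -> None:
        # Subtract the job's request from the host's remaining capacity.
        self.cpus -= cpus
        self.memory_mb -= memory_mb
        self.placed.append(job_id)


@dataclass
class Job:
    """A batch job that asks for resources, not for an instance type."""
    job_id: str
    cpus: float
    memory_mb: int


def place_first_fit(jobs: List[Job], hosts: List[Host]) -> Dict[str, Optional[str]]:
    """Assign each job to the first host with enough free capacity.

    Returns job_id -> host name, or None when no host can take the job.
    """
    assignment: Dict[str, Optional[str]] = {}
    for job in jobs:
        assignment[job.job_id] = None
        for host in hosts:
            if host.fits(job.cpus, job.memory_mb):
                host.reserve(job.job_id, job.cpus, job.memory_mb)
                assignment[job.job_id] = host.name
                break
    return assignment
```

The point of the sketch is the user-facing contract: a job states CPU and memory, and the platform decides where it runs, which is what lets batch users ignore instance sizes and AMIs.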

Netflix Batch Job Examples

● Algorithm Model Training (with GPUs)

Netflix Batch Job Examples

● Media Encoding

● Digital Watermarking

Netflix Batch Job Examples

Open Connect CDN Reporting

Adhoc Reporting

Lessons Learned from Batch

● Docker helped generalize use cases
● Advanced scheduling required
● Initially ignored failures (with retries)
● Time sensitive batch came later
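The "ignored failures (with retries)" lesson can be pictured with a small sketch: rerun a failed task a bounded number of times and only surface an error once the attempts run out. This is a generic retry loop, not code from Titus; `run_with_retries`, `max_attempts`, and `backoff_seconds` are names invented for this example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_retries(task: Callable[[], T], max_attempts: int = 3,
                     backoff_seconds: float = 0.0) -> T:
    """Run task until it succeeds or max_attempts is exhausted."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as error:
            # The failure is swallowed here and the task is simply rerun.
            last_error = error
            if attempt < max_attempts:
                time.sleep(backoff_seconds * attempt)
    # Only after the final attempt does the caller see the failure.
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error
```

Blind retries work while jobs only need to finish eventually. Once time sensitive batch arrives, every retry costs wall-clock time, so the cause of a failure starts to matter.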

Current Container Usage - Batch

● 100 containers / hour
● Peaks of 1000s per hour
● Large spikes of CI testing and Digital Watermarking

A random day’s worth of containers

service applications

Why Services in containers?

Theory Reality

“Why is Apache and Tomcat running on my NodeJS server”

Problem: BaseAMI optimized for Java, not easily customizable

“Why do I need java, gradle, ospackage after my non-Java build?”

Problem: Reuse of Java-centric AMI tooling

“I want an instance with a single core to run my lightweight server”

Problem: Small instances are not reliable

Enter Docker

● Have a new language?
● Have a build tool you like?
● Want to resource isolate easily?

Come one, come all

Services are just long running batch?

Services

Job Management

Resource Management & Optimization

Container Execution

Integration

Service Apps

Batch

Services more complex

● Services resize constantly and run forever
  ○ Autoscaling
  ○ Hard to upgrade underlying hosts
● Have more state
  ○ Ready for traffic vs. just started/stopped
  ○ Even harder to upgrade
● Existing well defined dev, deploy, runtime & ops tools

Real networking is hard

Multi-tenant

Need IP per container - in VPC

Using security groups

Using IAM roles

Considering network resource isolation

Solutions

● VPC Networking driver
  ○ Supports ENIs - full IP functionality
  ○ Scheduled security groups
  ○ Support traffic control (isolation)
● EC2 Metadata proxy
  ○ Adds container “node” identity
  ○ Delivers IAM roles
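To make the EC2 Metadata proxy idea concrete, here is a minimal routing sketch. It assumes the proxy identifies a container by the source IP of the request (plausible given one IP per container) and hands out only the IAM role assigned to that container. The function name, the IP-to-role map, and the returned action strings are hypothetical. The only real detail is the path prefix, which is the standard EC2 instance metadata location for role credentials.

```python
from typing import Dict, Tuple

# Standard EC2 instance metadata path prefix for IAM role credentials.
CREDENTIALS_PREFIX = "/latest/meta-data/iam/security-credentials/"


def route_metadata_request(source_ip: str, path: str,
                           roles_by_container_ip: Dict[str, str]) -> Tuple[int, str]:
    """Decide how a proxy might answer a metadata request from a container.

    Returns (HTTP status, body or action). The action strings are
    placeholders for this sketch, not real responses.
    """
    role = roles_by_container_ip.get(source_ip)
    if role is None:
        # The request did not come from a container the proxy knows about.
        return 403, "unknown container"
    if not path.startswith(CREDENTIALS_PREFIX):
        # Assumption: non-IAM paths are answered from container-level identity.
        return 200, "serve-container-identity"
    requested = path[len(CREDENTIALS_PREFIX):]
    if requested == "":
        # Listing the prefix returns the role name, as on a plain EC2 instance.
        return 200, role
    if requested == role:
        return 200, f"credentials-for:{role}"
    # A container may not read credentials for a role it was not given.
    return 404, "role not assigned to this container"
```

With a map such as `{"10.0.0.5": "appRole"}`, a request from `10.0.0.5` for the bare prefix yields `appRole`, while a request for any other role name is refused. That refusal is the isolation property a multi-tenant host needs.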

Reuse existing infrastructure services

[Diagram labels: VMs, EC2, AWS AutoScaler, App, Cloud Platform (metrics, IPC, health), VPC, Netflix Cloud Infrastructure (VMs + Containers), Atlas, Eureka, Edda]

Enable them for containers

[Diagram labels: VMs, EC2, AWS AutoScaler, App, Cloud Platform (metrics, IPC, health), VPC, Netflix Cloud Infrastructure (VMs + Containers), Atlas, Eureka, Edda, Titus Job Control, Containers, Batch Containers]

Spinnaker

Deploy based on new image tags

Basic resource requirements

IAM Roles & Sec Groups per container

Deploy Strategies

Same as VMs

Easily see health & discovery

Current Container Usage - Service

● Still small - 100s of containers
● NodeJS Device UI Scripts Apps
● Stream Processing Jobs - Flink
● Various Internal Dashboards

developer experience

● Consistent Mac setup
● Consistent workflows
● Netflix integration

The Docker experience

NEWT (Netflix Workflow Toolkit)

dev machine bootstrap

project scaffolding

setup dev pipelines

run locally

build history

start pipeline

beyond java

Node.js support

before Newt:
● Install Java
● Install Nebula (Netflix Gradle)
● Add a build.gradle
● Run gradlew wrapper
● Add deb instr to build.gradle
● Install Vagrant + VBox
● Test deb locally
● Create Stash repo
● Create Jenkins job
● Create Spinnaker pipelines
● git push

after Newt:
● Install Newt
● newt init --app-type nodejs
● git push

Nebula ospackage is hidden inside a local Docker container managed by Newt

https://en.wikipedia.org/wiki/Data_visualization#/media/File:Social_Network_Analysis_Visualization.png

dependencies @ scale

Project Niagara

100,000 builds a day, peak

where are we going?

Future of containers @ Netflix

● More scale!
● Guaranteed capacity (service)
● Fair scheduling (batch)
● Local integration test env (devex)
● Next generation CI (devex)
● Internal RI spot market of trough

Questions?