Practical Container Scheduling: Juggling Optimizations, Guarantees, and Trade-Offs at Netflix

Sharma Podila, Senior Software Engineer, Netflix

MesosCon North America, September 2017

TRANSCRIPT

Page 1

Practical Container Scheduling: Juggling Optimizations, Guarantees, and Trade-Offs at Netflix

Sharma Podila, Senior Software Engineer, Netflix

Page 2

Consider this

You got yourselves an Apache Mesos cluster. Yeah!

Can you
– Guarantee capacity for all your applications?
– Optimize assignments for locality, affinity?
– Keep the cluster size elastic?
– Minimize total usage footprint?

Page 3

About me

• Works in Edge Engineering at Netflix
– Distributed resource scheduling
– Worked on the Mantis and Titus projects

• Created Netflix OSS Fenzo

• Previously, built resource scheduling for HPC-style batch processing in data center environments

Page 4

Agenda

• What are we trying to solve?
• Why juggle?
• Scheduling challenges in large clusters
• A look into what we created, how it works
• What’s next?

Page 5

Agenda

• What are we trying to solve?
• Why juggle?
• Scheduling challenges in large clusters
• A look into what we created, how it works
• What’s next?

Page 6

Reactive stream processing: Mantis

[Diagram: a cloud-native service in which events flow from the Zuul and API clusters into Mantis stream-processing jobs such as anomaly detection]

● Configurable message delivery guarantees
● Heterogeneous workloads
○ Real-time dashboarding, alerting
○ Anomaly detection, metric generation
○ Interactive exploration of streaming data

Page 7

Container deployment: Titus

[Diagram: the Titus Job Control plane launches app and batch containers on EC2 VMs inside a VPC; app containers integrate with the Netflix cloud platform (metrics, IPC, health) and with Eureka, Edda, and Atlas & Insight]

Page 8

What the cluster needs to support

• Heterogeneous mix of workloads
– Vary in # of CPUs, memory, network, local disk
– Vary in criticality and runtime duration

• Resource demand variation over time
– Data volume variation in Mantis
– Number of containers in Titus

Page 9

Agenda

• What are we trying to solve?
• Why juggle?
• Scheduling challenges in large clusters
• A look into what we created, how it works
• What’s next?

Page 10

Why juggle at all?

If we had unlimited resources for all workloads, there would be no need to juggle

Page 11

Why juggle at all?

If we had unlimited resources for all workloads, there would be no need to juggle

If you are running on an elastic cloud, don’t you have unlimited resources?

Page 12

Why juggle at all?

• Demand vs. Supply

Page 14

Why juggle at all?

• Demand vs. Supply
• Efficiency

About 50% utilized

Page 15

Why juggle at all?

• Demand vs. Supply
• Efficiency
• Workload types
– critical user facing, pre-compute for production, experimentation, testing, “idle-soak”
– services, batch, stream processing

Page 16

Agenda

• What are we trying to solve?
• Why juggle?
• Scheduling challenges in large clusters
• A look into what we created, how it works
• What’s next?

Page 17

Scheduling challenge in large clusters

• Complexity

[Diagram: real-world trade-offs lie on a spectrum from speed (first-fit assignment) to accuracy (optimal assignment)]

Page 18

Scheduling challenge in large clusters

• Complexity
• Speed of scheduling; a slow scheduler can
– Leave servers idle longer
– Make inefficient and incorrect assignments

Page 19

Our initial goals for a cluster scheduler

• Multi-goal optimization for task placement
• Cluster autoscaling
• Extensibility

Page 20

Our initial goals for a cluster scheduler

• Multi-goal optimization for task placement
• Cluster autoscaling
• Extensibility
• Security
• Capacity guarantees
• Reasoning about allocation failures

Page 21

Multi-goal task placement

[Diagram: task placement balances the DC/cloud operator’s concerns, such as cost and security, with the application owner’s concerns]

Move in the generally right direction

Page 22

Cluster autoscaling

Large variation in peak-to-trough resource requirements
– Mantis: 2M to 12M events/sec
– Titus: 10s to 1000s of concurrent containers

Page 23

Cluster autoscaling

[Diagram: tasks spread across Host 1 through Host 4]

• Scaling up a cluster is relatively easy

Page 24

Cluster autoscaling

[Diagram: tasks bin-packed onto Hosts 1 and 2, leaving Hosts 3 and 4 idle, vs. the same tasks spread across all four hosts]

• Scaling up a cluster is relatively easy
• Scaling down requires bin packing

Page 26

Security

[Diagram: Host foo runs app tasks with different security groups side by side, e.g., Task 0 with SecGrp A, Task 1 with SecGrp Y,Z, and other tasks with SecGrp X]

Mixing tasks with different security access on a single host

Page 27

Capacity guarantees

Guarantee capacity to all applications per SLA

[Diagram: a resource allocation order interleaving Critical and Flex tiers, contrasting quotas vs. priorities]

Page 28

Reasoning about allocation failures

• Why is a job not running?
• What resources are we not able to allocate?
• How many servers are failing the resource requests?
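A minimal sketch of how such questions can be answered (hypothetical types, not Fenzo's actual failure-reporting API): record each host's rejection reason per task during a scheduling pass, then aggregate per task.

// Hypothetical sketch: record why each host rejected a task so operators can
// answer "why is my job not running?" and "how many hosts failed which ask?".
import java.util.*;

class AllocationFailure {
    final String hostname;
    final String resource;   // e.g., "CPU", "memory", "network"
    final double asking;     // amount the task requested
    final double available;  // amount the host had free
    AllocationFailure(String hostname, String resource, double asking, double available) {
        this.hostname = hostname; this.resource = resource;
        this.asking = asking; this.available = available;
    }
}

class FailureReport {
    // taskId -> per-host failures collected during one scheduling pass
    private final Map<String, List<AllocationFailure>> failures = new HashMap<>();

    void record(String taskId, AllocationFailure f) {
        failures.computeIfAbsent(taskId, k -> new ArrayList<>()).add(f);
    }

    // For a task, count how many hosts failed each resource request
    Map<String, Long> summarize(String taskId) {
        Map<String, Long> counts = new HashMap<>();
        for (AllocationFailure f : failures.getOrDefault(taskId, List.of()))
            counts.merge(f.resource, 1L, Long::sum);
        return counts;
    }
}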

Page 29

Agenda

• What are we trying to solve?
• Why juggle?
• Scheduling challenges in large clusters
• A look into what we created, how it works
• What’s next?

Page 30

Core scheduling, job control plane

[Diagram: a layered stack in which the Batch Job Mgr and Service Job Mgr sit on Fenzo inside the Titus/Mantis framework, which runs on Apache Mesos over AWS EC2]

Page 31

Fenzo scheduling strategy

[Diagram: Fenzo balances urgency (moving tasks from pending to assigned) against fitness when assigning N tasks from M possible agents]

Page 32

Fenzo, OSS scheduling library

Benefits for any JVM Mesos framework:
• Extensibility via plugins
• Cluster autoscaling
• Tiered queues with weighted DRF
• Control for speed vs. optimal assignments
• Ease of experimentation

github.com/Netflix/Fenzo

Page 33

Fenzo scheduling strategy

For each (ordered) task
    On each available host
        Validate hard constraints
        Eval score for fitness and soft constraints
    Until score good enough, and
        a minimum # of hosts evaluated
    Pick host with highest score

Fitness and constraints are plugins
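A minimal sketch of this loop in Java (illustrative types and thresholds, not Fenzo's actual API): hosts are scored for each task in turn, stopping early once the best score is good enough and a minimum number of hosts have been evaluated.

// Hypothetical sketch of Fenzo-style placement: greedy per-task host selection
// with an early-exit threshold. Names and parameters are illustrative.
import java.util.*;
import java.util.function.*;

class Placement {
    static final double GOOD_ENOUGH = 0.9;  // assumed early-exit threshold
    static final int MIN_HOSTS = 5;         // assumed minimum hosts to evaluate

    // hardConstraints: all must pass; scorer: fitness + soft constraints in [0, 1]
    static Optional<String> placeTask(String taskId,
                                      List<String> hosts,
                                      BiPredicate<String, String> hardConstraints,
                                      ToDoubleBiFunction<String, String> scorer) {
        String bestHost = null;
        double bestScore = -1.0;
        int evaluated = 0;
        for (String host : hosts) {
            if (!hardConstraints.test(taskId, host)) continue;  // skip infeasible hosts
            double score = scorer.applyAsDouble(taskId, host);
            evaluated++;
            if (score > bestScore) { bestScore = score; bestHost = host; }
            // stop early once the score is good enough AND enough hosts were seen
            if (bestScore >= GOOD_ENOUGH && evaluated >= MIN_HOSTS) break;
        }
        return Optional.ofNullable(bestHost);
    }
}

The early-exit threshold is one knob behind the speed vs. optimal-assignments control listed on the previous slide: a lower threshold gives faster, rougher placements.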

Page 35

Fitness functions we use

• CPU, memory, and network bin packing

Page 36

Fitness functions we use

• CPU, memory, and network bin packing

E.g., CPU fitness = usedCPUs / totalCPUs

[Chart: example per-host fitness, with Host1 through Host5 scoring 0.25, 0.5, 0.75, 1.0, and 0.0]
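As a sketch, the slide's CPU fitness could be computed like this (illustrative names; Fenzo's real fitness calculators are plugins with a richer interface):

// Hypothetical sketch of a CPU bin-packing fitness plugin: fuller hosts score
// higher, so tasks pack onto fewer hosts instead of spreading.
class CpuBinPackFitness {
    // Returns a score in [0, 1]; 1.0 means the host's CPUs would be fully used.
    static double calculateFitness(double taskCpus, double usedCpus, double totalCpus) {
        if (taskCpus + usedCpus > totalCpus) return 0.0;  // task doesn't fit at all
        // Slide formula: usedCPUs / totalCPUs, here counting the incoming task's
        // CPUs as used so that fuller placements score higher.
        return (usedCpus + taskCpus) / totalCpus;
    }
}

For example, a 2-CPU task on a 16-CPU host already using 10 CPUs would score (10 + 2) / 16 = 0.75.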

Page 37

Fitness functions we use

• CPU, memory, and network bin packing
• Task runtime profile type - perpetual vs. finite time

Page 38

Fitness functions we use

• CPU, memory, and network bin packing
• Task runtime profile type - perpetual vs. finite time
• Minimize concurrent launch of tasks on an individual host

Page 39

Fitness functions we use

• CPU, memory, and network bin packing
• Task runtime profile type - perpetual vs. finite time
• Minimize concurrent launch of tasks on an individual host

fitness = binPacking * w1 + runtime * w2 + launch * w3
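A minimal sketch of that weighted combination (the weights w1, w2, w3 are operator-chosen; assuming each component score is normalized to [0, 1] and the weights sum to 1 so the result stays in [0, 1]):

// Hypothetical sketch of combining fitness components per the slide's formula.
class CompositeFitness {
    static double fitness(double binPacking, double runtime, double launch,
                          double w1, double w2, double w3) {
        // e.g., fitness(0.75, 1.0, 0.5, 0.5, 0.3, 0.2) = 0.375 + 0.3 + 0.1 = 0.775
        return binPacking * w1 + runtime * w2 + launch * w3;
    }
}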

Page 40

Hard constraints we use

• GPU server matching
– Use agent with GPU only if task requires one
• Match tasks with resources earmarked for queue tiers
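A hard constraint is a pass/fail check; the GPU matching rule above could be sketched as follows (illustrative names, not Fenzo's constraint interface):

// Hypothetical sketch of a GPU-matching hard constraint: eligibility requires
// the task's GPU requirement and the host's GPU capability to match in both
// directions, keeping GPU agents free for GPU work.
class GpuHardConstraint {
    static boolean evaluate(boolean taskNeedsGpu, boolean hostHasGpu) {
        // GPU tasks must land on GPU hosts; non-GPU tasks must stay off them
        return taskNeedsGpu == hostHasGpu;
    }
}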

Page 41

Soft constraints we use

• Specified by individual jobs at submit time
• Balance tasks of a job across availability zones
• Balance tasks of services across hosts
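A soft constraint returns a score rather than pass/fail; the zone-balancing constraint above could be sketched like this (hypothetical names and scoring, not Fenzo's actual API):

// Hypothetical sketch of an availability-zone balancing soft constraint: score
// a candidate host higher when its zone holds fewer of the job's tasks,
// nudging new tasks toward an even spread.
import java.util.*;

class ZoneBalanceSoftConstraint {
    // tasksPerZone: how many of this job's tasks each zone already hosts
    static double score(String candidateZone, Map<String, Integer> tasksPerZone) {
        int inZone = tasksPerZone.getOrDefault(candidateZone, 0);
        int maxInAnyZone = tasksPerZone.values().stream()
                .max(Integer::compare).orElse(0);
        if (maxInAnyZone == 0) return 1.0;  // no tasks placed yet: any zone is fine
        // Fewer tasks in the candidate zone -> score closer to 1.0
        return 1.0 - ((double) inZone / (maxInAnyZone + 1));
    }
}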

Page 42

Mixing fitness with soft constraints

Agent score = fitness score * 0.4 + soft constraint score * 0.6
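For example, under these weights an agent with a fitness score of 0.75 and a soft-constraint score of 0.5 scores 0.75 * 0.4 + 0.5 * 0.6 = 0.6; any weighting that sums to 1 keeps agent scores in [0, 1].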

Page 43

Our queues setup

[Diagram: resource allocation order flows through the Critical (Tier 0) buckets AppC1 … AppCN, then the Flex (Tier 1) buckets AppF1 … AppFM]

Separate tiers based on how quickly resources need to be allocated

Weighted DRF across buckets in a tier
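A minimal sketch of weighted DRF ordering across the buckets of a tier (illustrative structures, not Fenzo's actual queue API): each bucket's dominant share is its largest per-resource usage fraction divided by its weight, and the bucket with the smallest value is served next.

// Hypothetical sketch of weighted dominant resource fairness (DRF): pick the
// next bucket to allocate to by comparing weighted dominant shares.
import java.util.*;

class WeightedDrf {
    // usage/capacity arrays are parallel, e.g., [cpus, memoryGB, networkMbps]
    static double dominantShare(double[] usage, double[] capacity, double weight) {
        double max = 0.0;
        for (int i = 0; i < usage.length; i++)
            max = Math.max(max, usage[i] / capacity[i]);
        return max / weight;  // heavier-weighted buckets get lower shares
    }

    // Choose the bucket with the lowest weighted dominant share in the tier
    static String nextBucket(Map<String, double[]> usageByBucket,
                             Map<String, Double> weights,
                             double[] tierCapacity) {
        String best = null;
        double bestShare = Double.MAX_VALUE;
        for (Map.Entry<String, double[]> e : usageByBucket.entrySet()) {
            double share = dominantShare(e.getValue(), tierCapacity,
                                         weights.getOrDefault(e.getKey(), 1.0));
            if (share < bestShare) { bestShare = share; best = e.getKey(); }
        }
        return best;
    }
}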

Page 44

User interface for capacity guarantee

• Application setup
– Specify total capacity needs for an application
E.g., “4-CPU, 8GB, 512 Mbps” times 120 containers
• User specifies “application name”
• A “default” catch-all bucket supports experimentation
• Cluster admin maps applications to tiers
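In that example, the guaranteed capacity works out to 120 * 4 = 480 CPUs, 120 * 8 GB = 960 GB of memory, and 120 * 512 Mbps = 61,440 Mbps (about 61.4 Gbps) of network.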

Page 45

Sizing agent clusters for capacity

[Diagram: Tier 0 keeps used capacity plus idle capacity up to the cluster min size (the guaranteed capacity), with autoscaled capacity beyond it up to the cluster max size. Tier 1 keeps used capacity at the cluster desired size, autoscaled up to the cluster max size, with idle size kept near zero]
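A minimal sketch of the sizing rule implied here (hypothetical names and parameters): each tier targets its used hosts plus an idle headroom, floored at the guaranteed minimum for tier 0 and capped at the cluster max.

// Hypothetical sketch of a per-tier cluster sizing rule: keep idle headroom
// for tier 0 (guaranteed capacity) and keep tier 1 idle near zero.
class ClusterScaler {
    static int desiredSize(int usedHosts, int targetIdle, int minSize, int maxSize) {
        int desired = usedHosts + targetIdle;   // aim for usage plus idle headroom
        desired = Math.max(desired, minSize);   // tier 0: guaranteed-capacity floor
        return Math.min(desired, maxSize);      // never above the cluster max size
    }
    // e.g., tier 0: desiredSize(used, idleHeadroom, guaranteedMin, max);
    //       tier 1: desiredSize(used, 0, 0, max)
}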

Page 46

Reasoning about allocation failures

Page 47

Agenda

• What are we trying to solve?
• Why juggle?
• Scheduling challenges in large clusters
• A look into what we created, how it works
• What’s next?

Page 48

What’s next?

• Task evictions

Page 49

What’s next?

• Task evictions

• “Noisy neighbors” feedback from agents

Page 50

What’s next?

• Task evictions

• “Noisy neighbors” feedback from agents

• Automated rollout of new agent code

Page 51

Questions?

Practical Container Scheduling: Juggling Optimizations, Guarantees, and Trade-Offs at Netflix

Sharma Podila @podila