Practical Container Scheduling: Juggling Optimizations, Guarantees, and Trade-Offs at Netflix

Sharma Podila, Senior Software Engineer, Netflix

MesosCon North America, September 2017

TRANSCRIPT

Page 1

Practical Container Scheduling: Juggling Optimizations, Guarantees, and Trade-Offs at Netflix

Sharma Podila, Senior Software Engineer, Netflix

Page 2

Consider this

You got yourselves an Apache Mesos cluster. Yeah!

Can you
– Guarantee capacity for all your applications?
– Optimize assignments for locality, affinity?
– Keep the cluster size elastic?
– Minimize total usage footprint?

Page 3

About me

• Works in Edge Engineering at Netflix
– Distributed resource scheduling
– Worked on the Mantis and Titus projects

• Created Netflix OSS Fenzo

• Previously, built resource scheduling for HPC-style batch processing in data center environments

Page 4

Agenda

• What are we trying to solve?
• Why juggle?
• Scheduling challenges in large clusters
• A look into what we created, how it works
• What’s next?

Page 5

Agenda

• What are we trying to solve?
• Why juggle?
• Scheduling challenges in large clusters
• A look into what we created, how it works
• What’s next?

Page 6

Reactive stream processing: Mantis

[Diagram: a cloud-native service in which events flow from the Zuul and API clusters into Mantis stream-processing jobs such as anomaly detection]

● Configurable message delivery guarantees
● Heterogeneous workloads
○ Real-time dashboarding, alerting
○ Anomaly detection, metric generation
○ Interactive exploration of streaming data

Page 7

Container deployment: Titus

[Diagram: the Titus Job Control plane launches app and batch containers on EC2 VMs inside a VPC; app containers integrate with the Netflix cloud platform (metrics, IPC, health) and with Eureka, Edda, and Atlas & Insight]

Page 8

What the cluster needs to support

• Heterogeneous mix of workloads
– Vary in # of CPUs, memory, network, local disk
– Vary in criticality and runtime duration

• Resource demand variation over time
– Data volume variation in Mantis
– Number of containers in Titus

Page 9

Agenda

• What are we trying to solve?
• Why juggle?
• Scheduling challenges in large clusters
• A look into what we created, how it works
• What’s next?

Page 10

Why juggle at all?

If we had unlimited resources for all workloads, there would be no need to juggle

Page 11

Why juggle at all?

If we had unlimited resources for all workloads, there would be no need to juggle

If you are running on an elastic cloud, don’t you have unlimited resources?

Page 12

Why juggle at all?

• Demand vs. Supply

Page 14

Why juggle at all?

• Demand vs. Supply
• Efficiency

About 50% utilized

Page 15

Why juggle at all?

• Demand vs. Supply
• Efficiency
• Workload types
– critical user facing, pre-compute for production, experimentation, testing, “idle-soak”
– services, batch, stream processing

Page 16

Agenda

• What are we trying to solve?
• Why juggle?
• Scheduling challenges in large clusters
• A look into what we created, how it works
• What’s next?

Page 17

Scheduling challenge in large clusters

• Complexity

[Diagram: real-world trade-offs lie on a spectrum from speed (first-fit assignment) to accuracy (optimal assignment)]

Page 18

Scheduling challenge in large clusters

• Complexity
• Speed of scheduling; a slow scheduler can
– Leave servers idle longer
– Make inefficient and incorrect assignments

Page 19

Our initial goals for a cluster scheduler

• Multi-goal optimization for task placement
• Cluster autoscaling
• Extensibility

Page 20

Our initial goals for a cluster scheduler

• Multi-goal optimization for task placement
• Cluster autoscaling
• Extensibility
• Security
• Capacity guarantees
• Reasoning about allocation failures

Page 21

Multi-goal task placement

[Diagram: task placement balances the DC/cloud operator’s concerns, such as cost and security, with the application owner’s concerns]

Move in the generally right direction

Page 22

Cluster autoscaling

Large variation in peak-to-trough resource requirements
– Mantis: 2M to 12M events/sec
– Titus: 10s to 1000s of concurrent containers

Page 23

Cluster autoscaling

[Diagram: tasks spread across Host 1 through Host 4]

• Scaling up a cluster is relatively easy

Page 24

Cluster autoscaling

[Diagram: tasks bin-packed onto Hosts 1 and 2, leaving Hosts 3 and 4 idle, vs. the same tasks spread across all four hosts]

• Scaling up a cluster is relatively easy
• Scaling down requires bin packing

Page 26

Security

[Diagram: Host foo runs app tasks with different security groups side by side, e.g., Task 0 with SecGrp A, Task 1 with SecGrp Y,Z, and other tasks with SecGrp X]

Mixing tasks with different security access on a single host

Page 27

Capacity guarantees

Guarantee capacity to all applications per SLA

[Diagram: a resource allocation order interleaving Critical and Flex tiers, contrasting quotas vs. priorities]

Page 28

Reasoning about allocation failures

• Why is a job not running?
• What resources are we not able to allocate?
• How many servers are failing the resource requests?
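A minimal sketch of how such questions can be answered (hypothetical types, not Fenzo's actual failure-reporting API): record each host's rejection reason per task during a scheduling pass, then aggregate per task.

// Hypothetical sketch: record why each host rejected a task so operators can
// answer "why is my job not running?" and "how many hosts failed which ask?".
import java.util.*;

class AllocationFailure {
    final String hostname;
    final String resource;   // e.g., "CPU", "memory", "network"
    final double asking;     // amount the task requested
    final double available;  // amount the host had free
    AllocationFailure(String hostname, String resource, double asking, double available) {
        this.hostname = hostname; this.resource = resource;
        this.asking = asking; this.available = available;
    }
}

class FailureReport {
    // taskId -> per-host failures collected during one scheduling pass
    private final Map<String, List<AllocationFailure>> failures = new HashMap<>();

    void record(String taskId, AllocationFailure f) {
        failures.computeIfAbsent(taskId, k -> new ArrayList<>()).add(f);
    }

    // For a task, count how many hosts failed each resource request
    Map<String, Long> summarize(String taskId) {
        Map<String, Long> counts = new HashMap<>();
        for (AllocationFailure f : failures.getOrDefault(taskId, List.of()))
            counts.merge(f.resource, 1L, Long::sum);
        return counts;
    }
}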

Page 29

Agenda

• What are we trying to solve?
• Why juggle?
• Scheduling challenges in large clusters
• A look into what we created, how it works
• What’s next?

Page 30

Core scheduling, job control plane

[Diagram: a layered stack in which the Batch Job Mgr and Service Job Mgr sit on Fenzo inside the Titus/Mantis framework, which runs on Apache Mesos over AWS EC2]

Page 31

Fenzo scheduling strategy

[Diagram: Fenzo balances urgency (moving tasks from pending to assigned) against fitness when assigning N tasks from M possible agents]

Page 32

Fenzo, OSS scheduling library

Benefits for any JVM Mesos framework:
• Extensibility via plugins
• Cluster autoscaling
• Tiered queues with weighted DRF
• Control for speed vs. optimal assignments
• Ease of experimentation

github.com/Netflix/Fenzo

Page 33

Fenzo scheduling strategy

For each (ordered) task
    On each available host
        Validate hard constraints
        Eval score for fitness and soft constraints
    Until score good enough, and
        a minimum # of hosts evaluated
    Pick host with highest score

Fitness and constraints are plugins
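A minimal sketch of this loop in Java (illustrative types and thresholds, not Fenzo's actual API): hosts are scored for each task in turn, stopping early once the best score is good enough and a minimum number of hosts have been evaluated.

// Hypothetical sketch of Fenzo-style placement: greedy per-task host selection
// with an early-exit threshold. Names and parameters are illustrative.
import java.util.*;
import java.util.function.*;

class Placement {
    static final double GOOD_ENOUGH = 0.9;  // assumed early-exit threshold
    static final int MIN_HOSTS = 5;         // assumed minimum hosts to evaluate

    // hardConstraints: all must pass; scorer: fitness + soft constraints in [0, 1]
    static Optional<String> placeTask(String taskId,
                                      List<String> hosts,
                                      BiPredicate<String, String> hardConstraints,
                                      ToDoubleBiFunction<String, String> scorer) {
        String bestHost = null;
        double bestScore = -1.0;
        int evaluated = 0;
        for (String host : hosts) {
            if (!hardConstraints.test(taskId, host)) continue;  // skip infeasible hosts
            double score = scorer.applyAsDouble(taskId, host);
            evaluated++;
            if (score > bestScore) { bestScore = score; bestHost = host; }
            // stop early once the score is good enough AND enough hosts were seen
            if (bestScore >= GOOD_ENOUGH && evaluated >= MIN_HOSTS) break;
        }
        return Optional.ofNullable(bestHost);
    }
}

The early-exit threshold is one knob behind the speed vs. optimal-assignments control listed on the previous slide: a lower threshold gives faster, rougher placements.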

Page 35

Fitness functions we use

• CPU, memory, and network bin packing

Page 36

Fitness functions we use

• CPU, memory, and network bin packing

E.g., CPU fitness = usedCPUs / totalCPUs

[Chart: example per-host fitness, with Host1 through Host5 scoring 0.25, 0.5, 0.75, 1.0, and 0.0]
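As a sketch, the slide's CPU fitness could be computed like this (illustrative names; Fenzo's real fitness calculators are plugins with a richer interface):

// Hypothetical sketch of a CPU bin-packing fitness plugin: fuller hosts score
// higher, so tasks pack onto fewer hosts instead of spreading.
class CpuBinPackFitness {
    // Returns a score in [0, 1]; 1.0 means the host's CPUs would be fully used.
    static double calculateFitness(double taskCpus, double usedCpus, double totalCpus) {
        if (taskCpus + usedCpus > totalCpus) return 0.0;  // task doesn't fit at all
        // Slide formula: usedCPUs / totalCPUs, here counting the incoming task's
        // CPUs as used so that fuller placements score higher.
        return (usedCpus + taskCpus) / totalCpus;
    }
}

For example, a 2-CPU task on a 16-CPU host already using 10 CPUs would score (10 + 2) / 16 = 0.75.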

Page 37

Fitness functions we use

• CPU, memory, and network bin packing
• Task runtime profile type - perpetual vs. finite time

Page 38

Fitness functions we use

• CPU, memory, and network bin packing
• Task runtime profile type - perpetual vs. finite time
• Minimize concurrent launch of tasks on an individual host

Page 39

Fitness functions we use

• CPU, memory, and network bin packing
• Task runtime profile type - perpetual vs. finite time
• Minimize concurrent launch of tasks on an individual host

fitness = binPacking * w1 + runtime * w2 + launch * w3
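A minimal sketch of that weighted combination (the weights w1, w2, w3 are operator-chosen; assuming each component score is normalized to [0, 1] and the weights sum to 1 so the result stays in [0, 1]):

// Hypothetical sketch of combining fitness components per the slide's formula.
class CompositeFitness {
    static double fitness(double binPacking, double runtime, double launch,
                          double w1, double w2, double w3) {
        // e.g., fitness(0.75, 1.0, 0.5, 0.5, 0.3, 0.2) = 0.375 + 0.3 + 0.1 = 0.775
        return binPacking * w1 + runtime * w2 + launch * w3;
    }
}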

Page 40

Hard constraints we use

• GPU server matching
– Use agent with GPU only if task requires one
• Match tasks with resources earmarked for queue tiers
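A hard constraint is a pass/fail check; the GPU matching rule above could be sketched as follows (illustrative names, not Fenzo's constraint interface):

// Hypothetical sketch of a GPU-matching hard constraint: eligibility requires
// the task's GPU requirement and the host's GPU capability to match in both
// directions, keeping GPU agents free for GPU work.
class GpuHardConstraint {
    static boolean evaluate(boolean taskNeedsGpu, boolean hostHasGpu) {
        // GPU tasks must land on GPU hosts; non-GPU tasks must stay off them
        return taskNeedsGpu == hostHasGpu;
    }
}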

Page 41

Soft constraints we use

• Specified by individual jobs at submit time
• Balance tasks of a job across availability zones
• Balance tasks of services across hosts
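A soft constraint returns a score rather than pass/fail; the zone-balancing constraint above could be sketched like this (hypothetical names and scoring, not Fenzo's actual API):

// Hypothetical sketch of an availability-zone balancing soft constraint: score
// a candidate host higher when its zone holds fewer of the job's tasks,
// nudging new tasks toward an even spread.
import java.util.*;

class ZoneBalanceSoftConstraint {
    // tasksPerZone: how many of this job's tasks each zone already hosts
    static double score(String candidateZone, Map<String, Integer> tasksPerZone) {
        int inZone = tasksPerZone.getOrDefault(candidateZone, 0);
        int maxInAnyZone = tasksPerZone.values().stream()
                .max(Integer::compare).orElse(0);
        if (maxInAnyZone == 0) return 1.0;  // no tasks placed yet: any zone is fine
        // Fewer tasks in the candidate zone -> score closer to 1.0
        return 1.0 - ((double) inZone / (maxInAnyZone + 1));
    }
}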

Page 42

Mixing fitness with soft constraints

Agent score = fitness score * 0.4 + soft constraint score * 0.6
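For example, under these weights an agent with a fitness score of 0.75 and a soft-constraint score of 0.5 scores 0.75 * 0.4 + 0.5 * 0.6 = 0.6; any weighting that sums to 1 keeps agent scores in [0, 1].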

Page 43

Our queues setup

[Diagram: resource allocation order flows through the Critical (Tier 0) buckets AppC1 … AppCN, then the Flex (Tier 1) buckets AppF1 … AppFM]

Separate tiers based on how quickly resources need to be allocated

Weighted DRF across buckets in a tier
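A minimal sketch of weighted DRF ordering across the buckets of a tier (illustrative structures, not Fenzo's actual queue API): each bucket's dominant share is its largest per-resource usage fraction divided by its weight, and the bucket with the smallest value is served next.

// Hypothetical sketch of weighted dominant resource fairness (DRF): pick the
// next bucket to allocate to by comparing weighted dominant shares.
import java.util.*;

class WeightedDrf {
    // usage/capacity arrays are parallel, e.g., [cpus, memoryGB, networkMbps]
    static double dominantShare(double[] usage, double[] capacity, double weight) {
        double max = 0.0;
        for (int i = 0; i < usage.length; i++)
            max = Math.max(max, usage[i] / capacity[i]);
        return max / weight;  // heavier-weighted buckets get lower shares
    }

    // Choose the bucket with the lowest weighted dominant share in the tier
    static String nextBucket(Map<String, double[]> usageByBucket,
                             Map<String, Double> weights,
                             double[] tierCapacity) {
        String best = null;
        double bestShare = Double.MAX_VALUE;
        for (Map.Entry<String, double[]> e : usageByBucket.entrySet()) {
            double share = dominantShare(e.getValue(), tierCapacity,
                                         weights.getOrDefault(e.getKey(), 1.0));
            if (share < bestShare) { bestShare = share; best = e.getKey(); }
        }
        return best;
    }
}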

Page 44

User interface for capacity guarantee

• Application setup
– Specify total capacity needs for an application
E.g., “4-CPU, 8GB, 512 Mbps” times 120 containers
• User specifies “application name”
• A “default” catch-all bucket supports experimentation
• Cluster admin maps applications to tiers
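In that example, the guaranteed capacity works out to 120 * 4 = 480 CPUs, 120 * 8 GB = 960 GB of memory, and 120 * 512 Mbps = 61,440 Mbps (about 61.4 Gbps) of network.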

Page 45

Sizing agent clusters for capacity

[Diagram: Tier 0 keeps used capacity plus idle capacity up to the cluster min size (the guaranteed capacity), with autoscaled capacity beyond it up to the cluster max size. Tier 1 keeps used capacity at the cluster desired size, autoscaled up to the cluster max size, with idle size kept near zero]
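A minimal sketch of the sizing rule implied here (hypothetical names and parameters): each tier targets its used hosts plus an idle headroom, floored at the guaranteed minimum for tier 0 and capped at the cluster max.

// Hypothetical sketch of a per-tier cluster sizing rule: keep idle headroom
// for tier 0 (guaranteed capacity) and keep tier 1 idle near zero.
class ClusterScaler {
    static int desiredSize(int usedHosts, int targetIdle, int minSize, int maxSize) {
        int desired = usedHosts + targetIdle;   // aim for usage plus idle headroom
        desired = Math.max(desired, minSize);   // tier 0: guaranteed-capacity floor
        return Math.min(desired, maxSize);      // never above the cluster max size
    }
    // e.g., tier 0: desiredSize(used, idleHeadroom, guaranteedMin, max);
    //       tier 1: desiredSize(used, 0, 0, max)
}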

Page 46

Reasoning about allocation failures

Page 47

Agenda

• What are we trying to solve?
• Why juggle?
• Scheduling challenges in large clusters
• A look into what we created, how it works
• What’s next?

Page 48

What’s next?

• Task evictions

Page 49

What’s next?

• Task evictions

• “Noisy neighbors” feedback from agents

Page 50

What’s next?

• Task evictions

• “Noisy neighbors” feedback from agents

• Automated rollout of new agent code

Page 51

Questions?

Practical Container Scheduling: Juggling Optimizations, Guarantees, and Trade-Offs at Netflix

Sharma Podila @podila