

Cluster Scheduling at Microsoft Scale

MIT, 10/31/2019

Computer Networks (6.829)

Konstantinos Karanasos

GSL (CISL)

Mission: applied research lab working on systems for big data, cloud, and machine learning

Office of the CTO for Azure Data

~15 researchers/engineers in the Bay Area, Redmond, and Madison

The three hats we wear

Applied research group

Collaborating with database, big data, and AI infra groups at MS

Open-sourcing our code: Apache Hadoop, ONNX Runtime, MLflow, REEF, Heron

Current Focus Areas

Resource management

Systems for ML

ML for Systems

Query optimization

Provenance

What is a Resource Manager?

[Diagram: an application (1) sends a resource request to the Resource Manager, (2) receives an allocation, and (3) starts containers on the Node Managers.]

Applications acquire cluster resources in the form of containers.
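To make the request/allocation/start-container flow concrete, here is a toy Python sketch; the ResourceManager and NodeManager classes below are illustrative stand-ins, not the real YARN interfaces (which are Java):

```python
# Toy sketch of the three-step container protocol above; all names are
# illustrative, not the real YARN API.

from dataclasses import dataclass

@dataclass
class Container:
    node: str
    cores: int
    memory_mb: int

class NodeManager:
    def __init__(self, name, cores, memory_mb):
        self.name, self.free_cores, self.free_mem = name, cores, memory_mb

    def start_container(self, c):
        # 3. Start container: the application launches its task here.
        print(f"{self.name}: starting container ({c.cores} cores, {c.memory_mb} MB)")

class ResourceManager:
    """Tracks free capacity per node and hands out containers."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, cores, memory_mb):
        # 1. Request: the application asks for (cores, memory).
        for nm in self.nodes:
            if nm.free_cores >= cores and nm.free_mem >= memory_mb:
                nm.free_cores -= cores
                nm.free_mem -= memory_mb
                # 2. Allocation: the RM grants a container on a specific node.
                return Container(nm.name, cores, memory_mb)
        return None  # no capacity right now

nodes = [NodeManager("nm1", 8, 32768), NodeManager("nm2", 8, 32768)]
rm = ResourceManager(nodes)
grant = rm.allocate(cores=2, memory_mb=4096)
if grant:
    next(n for n in nodes if n.name == grant.node).start_container(grant)
```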

Do we really need a Resource Manager?

Lessons learned: Abstracting out the RM layer

[Diagram: Hadoop 1 vs. Hadoop 2 software stacks.]

Hadoop 1 world: users (ad-hoc apps, Hive / Pig) sit directly on the monolithic Hadoop 1.x (MapReduce) layer, which bundles the programming model (MR v1) with resource management, on top of HDFS 1 and the hardware.

Hadoop 2 world: users and application frameworks (ad-hoc apps, Hive / Pig, Tez, Giraph, Storm, Dryad, REEF, Spark, Heron, Scope-on-YARN, ...) run over programming models such as MR v2, with YARN factored out as the cluster OS (resource management) layer on top of HDFS 2.

Reuse of the RM component and layered abstractions replace the monolithic design.

Cosmos: Microsoft’s Analytics Stack

Hydra

Cluster(s): >50K nodes
Job(s): >2M tasks, >5PB input
Tasks: 10 sec duration at the 50th percentile
Utilization: ~60% avg CPU util
Scheduler: >70K QPS

The Scale and Utilization Challenge

Scheduling in Analytics Clusters: a Journey…

[Timeline: 2003 → 2008 → 2013 → 2019]

2003: Hadoop MR, centralized scheduling
2008: Scope, centralized scheduling [eurosys07, vldb08]
2013: YARN (MR, Tez, Spark): +multi-framework, +security, +scheduler expressivity [socc13]
2013: Scope (Scope1, Scope2, Scope3): distributed scheduling, +tooling/optimizer, +scale, +high utilization [osdi14]
2019: Hydra (MR, Tez, Spark, and Scope over federated YARN)

Teaser…

>99% of tenants migrated
>250K servers
>500K daily jobs
>1 zettabyte of data processed
>1 trillion tasks scheduled

Agenda

Overview of legacy systems (Apollo, YARN)

Scale: Federated YARN

Resource utilization: Opportunistic Containers

Production Experience

Distributed Scheduling: Legacy Cosmos (Apollo)

Legacy Cosmos (Apollo)

Distributed scheduling got us scale.

[Diagram: a Scope script (e.g., "T = SELECT * FROM blah; SELECT T.blah FROM T;") is submitted to the Cosmos Job Service, which compiles/optimizes it into an execution plan; after admission control (gang-queue), per-job Job Managers (JM1, JM2, JM3) schedule tasks via distributed scheduling plus node-level queuing.]

* JM: Job Manager

Legacy Cosmos

Problem: (required) static resource allocation yields low utilization; resource fragmentation within and across queues.

Solution: opportunistically share resources via "guaranteed" containers and "opportunistic" containers.

Opportunistic scheduling got us utilization.

Legacy Cosmos: Pros/Cons

+ Great scalability

+ Good resource utilization

- No support for multiple frameworks

- Limited control over scheduling (e.g., fairness, load balancing, locality, special HW)

Centralized Scheduling: Apache Hadoop YARN

Legacy YARN

[Diagram: job submission to the RM, AM placement on an NM]

1) Submit job
2) Admission control (fairness, quotas, constraints, SLOs, …)
3) Schedule Application Master (AM)
4) Start AM container

Legacy YARN

[Diagram: the AM acquiring containers for its tasks on the NMs]

4) AM requests more containers on each heartbeat
5) RM grants a "token"
6) Start container on the NM
7) AM-Task communication
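The heartbeat-driven protocol can be sketched as a simple loop; the rm_client and nm_client helpers below are hypothetical stand-ins for YARN's Java AMRMClient and NMClient:

```python
# Minimal sketch of the AM's heartbeat loop (steps 4-7 above), assuming
# hypothetical rm_client / nm_client helpers.

import time

def run_application_master(rm_client, nm_client, tasks, heartbeat_secs=1.0):
    pending, running = list(tasks), []
    while pending or running:
        # 4) On each heartbeat, ask for one container per pending task.
        rm_client.request_containers(len(pending))
        # 5) The heartbeat response carries tokens for newly granted containers.
        for token in rm_client.heartbeat():
            if not pending:
                break
            task = pending.pop(0)
            # 6) Present the token to the NM to start the container.
            nm_client.start_container(token, task)
            running.append(task)
        # 7) AM-task communication (progress, completion) happens out of band.
        running = [t for t in running if not t.is_done()]
        time.sleep(heartbeat_secs)
```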

Legacy YARN: Pros/Cons

+ Support for arbitrary frameworks

+ Rich scheduling invariants

+ Non-tech advantage: OSS/mind-share

- Scale limits (~5K nodes in 2013)

- Utilization: heartbeats lead to idle resources

Main challenges

High resource utilization: Mercury [atc15, eurosys16]
Scalability: Federation [nsdi19]
Rich placement constraints: Medea [eurosys18]
Production jobs and predictability: Morpheus [osdi16]

OSS + production!

Improving YARN’s scalability: Federated Architecture

Need scale and utilization? Go distributed!
Need scheduling control and multi-framework support? Go centralized!

Want it all?

Hydra Architecture

[Diagram, built up across four slides: the cluster is divided into sub-clusters, each managed by its own RM. Routers, backed by StateStore proxies and a central StateStore, direct each job submission to a sub-cluster. The job's AM talks to an AMRMProxy on its node, which transparently federates the AM's container requests across the sub-cluster RMs; NMs run the resulting tasks. A GlobalPolicyGenerator computes the policies that drive the Routers and AMRMProxies.]

Scheduling Desiderata: Global Goals

High utilization

Scheduling invariants (e.g., fairness)

Locality (e.g., machine preferences)

[Example: queues QA and QB under root, each entitled to 50% of the cluster; demands vary (e.g., QA: 12 and QB: 4, or QA: 12 and QB: 12). Naive placements satisfy only two of the three goals at a time:]

✓ Locality, fairness / ✗ utilization
✓ Utilization, fairness / ✗ locality
✓ Utilization, locality / ✗ fairness

Goal: ✓ utilization, ✓ fairness, ✓ locality

Policies:

AMRMProxy: routing of requests (enforce locality?)
Per-cluster RM: scheduling decisions (enforce quotas?)

Key Idea

Decouple:

Share determination: how many resources should each queue get? (Sketched below.)
Placement: on which machines should each task run?
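Share determination on its own can be as simple as weighted max-min fairness. The sketch below assumes max-min (water-filling) semantics with illustrative weights and demands; it is not the exact policy Hydra uses:

```python
# Minimal sketch of share determination as weighted max-min fairness:
# how much each queue gets, independent of where its tasks run.

def max_min_shares(capacity, weights, demands):
    """Split `capacity` across queues in proportion to `weights`, never
    giving a queue more than its demand; leftover is redistributed."""
    shares = {q: 0.0 for q in weights}
    active = {q for q in weights if demands[q] > 0}
    remaining = capacity
    while active and remaining > 1e-9:
        total_w = sum(weights[q] for q in active)
        satisfied = set()
        for q in active:
            grant = min(remaining * weights[q] / total_w, demands[q] - shares[q])
            shares[q] += grant
            if shares[q] >= demands[q] - 1e-9:
                satisfied.add(q)
        remaining = capacity - sum(shares.values())
        active -= satisfied
        if not satisfied:   # everyone capped at fair share: done
            break
    return shares

# QA and QB each entitled to 50% of 16 slots; QB only demands 4, so QA's
# unmet demand absorbs the slack (utilization and fairness together).
print(max_min_shares(16, {"QA": 0.5, "QB": 0.5}, {"QA": 12, "QB": 4}))
# {'QA': 12.0, 'QB': 4.0}
```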

Proposed Solution*

[Diagram: the GlobalPolicyGenerator periodically pulls cluster state from the sub-cluster RMs through the StateStore, runs share determination globally, and pushes updated per-sub-cluster queue allocations back to the RMs, which handle placement locally.]

* More advanced than what is in production (details in the paper).

Handling GPG downtime

If the GPG is down, we fall back to local decisions. This is problematic if they "diverge" too much from the global ones.

Solution: leverage LP-based "tuning" of local queue allocations, with historical demand as a predictor of future demand.
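As a crude stand-in for that LP (the exact formulation is in the paper), a sub-cluster could rescale its local queue allocations in proportion to historical demand; everything below is illustrative:

```python
# Illustrative stand-in for LP-based tuning: while the GPG is unreachable,
# split a sub-cluster's capacity across queues in proportion to each
# queue's historical demand (used as a predictor of future demand).

def tune_local_allocation(local_capacity, historical_demand):
    total = sum(historical_demand.values())
    if total == 0:
        even = local_capacity / len(historical_demand)
        return {q: even for q in historical_demand}
    return {q: local_capacity * d / total for q, d in historical_demand.items()}

# Sub-cluster with 100 slots; queue A historically used 3x queue B.
print(tune_local_allocation(100, {"A": 750, "B": 250}))  # {'A': 75.0, 'B': 25.0}
```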

Improving YARN’s Utilization: Opportunistic Containers

Suboptimal Utilization in Scope-on-YARN

[Diagram: RM scheduling jobs j1, j2 on NM1, NM2]

Due to: gang scheduling; feedback delays.

Utilization by task duration:
5 sec: 60.59%
10 sec: 78.35%
50 sec: 92.38%
Mixed-5-50: 78.54%

Opportunistic Containers in YARN

[Diagram: AM, RM, NMs, Task]

Mask feedback delays by queuing work at the nodes.
Only opportunistic (OPP) containers can be queued.
Promotion/demotion of containers between guaranteed and opportunistic (sketched below).
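A minimal sketch of NM-side behavior under the two container types, based on the semantics described above (guaranteed containers start immediately, opportunistic ones may queue or be evicted); class and method names are illustrative:

```python
# Toy NM with two container types. GUARANTEED containers start right away,
# evicting OPPORTUNISTIC ones if needed; OPPORTUNISTIC containers queue
# when the node is full, masking the RM feedback delay.

from collections import deque

class NodeManager:
    def __init__(self, slots):
        self.slots = slots
        self.running = []           # (container_id, type)
        self.opp_queue = deque()    # only OPPORTUNISTIC containers queue

    def start(self, cid, ctype):
        if ctype == "GUARANTEED":
            while len(self.running) >= self.slots:
                victim = next((c for c in self.running if c[1] == "OPPORTUNISTIC"), None)
                if victim is None:
                    raise RuntimeError("over-committed on guaranteed containers")
                self.running.remove(victim)   # evicted; retried elsewhere
            self.running.append((cid, ctype))
        elif len(self.running) < self.slots:
            self.running.append((cid, ctype))
        else:
            self.opp_queue.append(cid)        # node full: queue the OPP task

    def promote(self, cid):
        # Promotion: the RM may upgrade a running OPP container to GUARANTEED.
        self.running = [(c, "GUARANTEED" if c == cid else t) for c, t in self.running]

    def on_container_finished(self, cid):
        self.running = [c for c in self.running if c[0] != cid]
        if self.opp_queue and len(self.running) < self.slots:
            self.running.append((self.opp_queue.popleft(), "OPPORTUNISTIC"))

nm = NodeManager(slots=2)
nm.start("g1", "GUARANTEED")
nm.start("o1", "OPPORTUNISTIC")   # runs: one slot free
nm.start("o2", "OPPORTUNISTIC")   # node full: queued
nm.start("g2", "GUARANTEED")      # evicts o1 to start immediately
nm.on_container_finished("g1")    # frees a slot: o2 dequeues and runs
```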

Utilization Gains with Node-Side Queuing

But are long queues all we need?

Job Completion Times with Node-Side Queuing

Naïve node-side queuing can be detrimental to job completion times

Proper queue management techniques are required

Problems with Node-Side Queuing

Load imbalance across nodes: suboptimal task placement
Head-of-line blocking: especially for heterogeneous tasks
Early binding of tasks to nodes

Queue management techniques

Place tasks in node queues
Prioritize task execution (queue reordering)
Bound queue lengths

Placement of Tasks to Queues

Placement based on queue length: agnostic of task characteristics; suboptimal placement for heterogeneous workloads.
Placement based on queue wait time: better for heterogeneous workloads; requires task duration estimates. (See the sketch below.)

[Diagram: the RM choosing among node queues at N1, N2, N3 with differing amounts of queued work (e.g., 24, 100, 253 units).]
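The following sketch contrasts the two placement policies; node names, durations, and helper functions are illustrative:

```python
# Queue-length-based placement is task-agnostic; wait-time-based placement
# needs per-task duration estimates but handles heterogeneous tasks better.

def place_by_queue_length(node_queues):
    """Pick the node with the fewest queued tasks."""
    return min(node_queues, key=lambda n: len(node_queues[n]))

def place_by_wait_time(node_queues, est_duration):
    """Pick the node whose queued work (sum of estimated durations)
    is smallest, i.e., where a new task would wait least."""
    return min(node_queues, key=lambda n: sum(est_duration[t] for t in node_queues[n]))

# Three short tasks on N1 mean less wait than one long task on N2:
queues = {"N1": ["a", "b", "c"], "N2": ["d"]}
est = {"a": 8, "b": 8, "c": 8, "d": 100}
print(place_by_queue_length(queues))      # N2 (shortest queue, but 100s of work)
print(place_by_wait_time(queues, est))    # N1 (only 24s of queued work)
```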

Task Prioritization

Queue reordering strategies:
Shortest Remaining Job First (SRJF)
Least Remaining Tasks First (LRTF)
Shortest Task First (STF)
Earliest Job First (EJF)

SRJF and LRTF are job-aware: they dynamically reorder tasks based on job progress.
Starvation freedom: give priority to tasks waiting more than X secs. (See the sketch below.)

[Diagram: RM with j1: 21 tasks, j2: 5 tasks, j3: 9 tasks across node queues N1, N2, N3.]
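A sketch of SRJF reordering with the starvation guard; the remaining_work map and the 30-second threshold are illustrative stand-ins for the job-progress information the real system tracks:

```python
# SRJF queue reordering with starvation freedom: tasks of the job with the
# least remaining work run first, but any task waiting longer than
# max_wait_secs jumps to the front.

import time

def reorder_queue(queue, remaining_work, max_wait_secs=30):
    """queue: list of (task_id, job_id, enqueue_time)."""
    now = time.time()
    def key(task):
        _, job_id, enq = task
        starved = (now - enq) > max_wait_secs
        # Starved tasks sort first; the rest by their job's remaining work.
        return (0 if starved else 1, remaining_work[job_id])
    return sorted(queue, key=key)

q = [("t1", "j1", time.time() - 40),   # waiting 40s: starved
     ("t2", "j2", time.time()),
     ("t3", "j3", time.time())]
work = {"j1": 21, "j2": 5, "j3": 9}    # j2 is closest to finishing
print([t[0] for t in reorder_queue(q, work)])  # ['t1', 't2', 't3']
```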

Bounding Queue Lengths

Determine the max number of tasks at a queue: a trade-off between short and long queues.
Short queues: resource idling → lower throughput.
Long queues: high queuing delays and early binding of tasks to queues → longer job completion times.

Static and dynamic queue bounding (sketched below).
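One plausible dynamic bound (an illustrative formula, not the paper's exact rule) queues just enough work per node to cover one RM feedback interval:

```python
# Dynamic queue bounding sketch: queue roughly enough tasks to keep the
# node's slots busy for one heartbeat interval, given current estimates
# of mean task duration, plus a small slack factor.

import math

def queue_bound(slots, heartbeat_secs, mean_task_secs, slack=1.2):
    per_slot = heartbeat_secs / max(mean_task_secs, 1e-9)
    return max(1, math.ceil(slots * per_slot * slack))

print(queue_bound(8, heartbeat_secs=3, mean_task_secs=10))  # 3: long tasks need shallow queues
print(queue_bound(8, heartbeat_secs=3, mean_task_secs=1))   # 29: short tasks need deeper queues
```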

Queuing Techniques: Evaluation

1.7x improvement in median JCT over YARN

Both bounding and reordering are crucial

Production Experience

Workload

Scale: high scheduling rate at low allocation latency.

Utilization: the federated design improves load balancing while retaining utilization.

Task performance (comparison vs. YARN): jobs perform just as well (and tasks are as efficient).

Qualitative experience: operability.

Recap

Overview of legacy systems (Apollo, YARN)

Scale: Federated YARN

Resource utilization: Opportunistic Containers

Production Experience

Conclusion

Research in Industry is a lot of fun

Fewer, bigger projects with massive wins; big force multipliers.

Failure is expected (otherwise we are not trying hard enough)

Modus Operandi

Be picky in choosing problems. Engage early, engage deep.

Be inclusive with prod/OSS counterparts

Thank you!

Konstantinos.Karanasos@microsoft.com
