GoodFit: Multi-Resource Packing of Tasks with Dependencies
TRANSCRIPT

Cluster Scheduling for Jobs
Machines, file-system, network
Cluster Scheduler matches tasks to resources
Goals:
• High cluster utilization
• Fast job completion time
• Predictable performance / fairness
• Efficient (milliseconds…)
E.g., BigData (Hive, SCOPE, Spark)
E.g., CloudBuild
Tasks
Dependencies
• Need not keep resource "buffers"
• More dynamic than VM placement (tasks last seconds)
• Aggregate properties are important (e.g., all tasks in a job should finish)
Need careful multi-resource planning
Problem 1: current schedulers suffer from fragmentation and over-allocation of net/disk.
[Figure: Current Schedulers vs. Packer Scheduler]
• Fragmentation: 2 tasks/T vs. 3 tasks/T (+50%)
• Over-allocation: 2 tasks/2T vs. 2 tasks/T (+100%)
Problem 2: … worse with dependencies.
[Figure: a DAG with label = {duration, resource demand} on each task, e.g. {t, r}, {t, 1-r}, {(T-2)t, r}; the resource-time plots show critical-path scheduling needs ~nTt while the best schedule needs only ~Tt.]
Critical-path scheduling can be n times off because it ignores resource demands.
Packers can be d times off because they ignore future work (d = number of resources).
Typical job scheduler infrastructure, extended with: + packing, + bounded unfairness, + merge schedules, + overbook.
[Figure: per-DAG Application Masters (AMs), each with a Schedule Constructor, talk to a central Resource Manager (RM), which exchanges node heartbeats and task assignments with the Node Managers (NMs).]
Main ideas in multi-resource packing. Task packing ~ multi-dimensional bin packing, but:
• it is a very hard problem (APX-hard);
• available heuristics do not directly apply (task demands change with placement).
Packing heuristic: alignment score A = D · R, where D is the task's resource demand vector and R is the machine's available resource vector; A is computed only over machines where the task fits (D ≤ R).
Job completion time heuristic: shortest remaining work first, where
P = (remaining # tasks) × (tasks' avg. duration) × (tasks' avg. resource demand).
Trade-off: favoring packing efficiency alone delays job completion; favoring job completion time alone loses packing efficiency.
Fairness trade-offs: insisting on exact fairness can lose both performance and packing efficiency. We show that: {best "perf" | bounded unfairness} ~ best "perf".
Main ideas in packing dependent tasks:
1. Identify troublesome tasks (the "meat") and place them first.
2. Systematically place the other tasks without deadlocks.
3. At runtime, use a precedence order from the computed schedule plus heuristics to (a) overbook and (b) apply the packing and fairness ideas above.
4. Derive better lower bounds for DAG completion time.
[Figure: a resource-time space with the meat (M) placed between "meat begin" and "meat end", parents (P) before it, children (C) after it, and other tasks (O) filling the holes.]
Results 1 [20K DAGs from Cosmos]: comparing Packing, Packing + Deps., and the lower bound.
Results 2 [200 jobs from TPC-DS, 200-server cluster]: Tez + Packing vs. Tez + Packing + Deps.
Bundling: temporal relaxation of fairness.
[Figure: two identical jobs, each a disk-bound Map phase then a network-bound Reduce phase. Instantaneous fairness (50% share each) finishes both jobs around 4T; relaxing fairness to run phases at 100% finishes them around 2T and 3T.]
1) Temporal relaxation of fairness: a job will finish within x times the time it would take given its strict share.
2) Optimal trade-off with performance: x in fairness slack trades against x on makespan.
3) A simple (offline) algorithm achieves this trade-off.
Problem: instantaneous fairness can be up to d times worse on makespan (d resources).

Fairness slack      | Perf. loss vs. best
0 (perfectly fair)  | 2x
1 (<2x longer)      | 1.1x
2 (<3x longer)      | 1.07x
Spectrum: bare metal → VM allocation (e.g., HDInsight, AzureBatch) → data-parallel jobs (e.g., BigData: Yarn, Cosmos, Spark; CloudBuild).
Job = tasks + dependencies.
Scale: CloudBuild: 3,500 servers, 3,500 users, >20M targets/day; Yarn: ~100K servers (40K at Yahoo); Cosmos: >50K servers, >2 EB stored, >6K devs.
Job traits:
• Tasks are short-lived (10s of seconds)
• Have peculiarly shaped demands
• Composites are important (a job needs all of its tasks to finish)
• OK to kill and restart tasks
• Locality matters
Takeaways:
1) Job scheduling has specific aspects.
2) Resource-aware scheduling will speed up the average job (and reduce resource cost); it improves SLOs and return/$.
3) It spans research + practice.
Main ideas in packing dependent tasks (restated):
1. Identify troublesome tasks (T) and place them first.
2. Systematically place the other tasks without dead-ends.
3. At runtime, enforce the computed schedule plus heuristics to (a) overbook and (b) apply the packing and fairness ideas above.
4. Derive better lower bounds for DAG completion time.
[Figure: a resource-time space with the troublesome tasks (T) placed between "trouble begin" and "trouble end", parents (P) before, children (C) after, and other tasks (O) filling the holes.]
Results 1 [20K DAGs from Cosmos]: Packing, and Packing + Deps., vs. the lower bound (roughly 1.5X and 2X gains).
Results 2 [200 jobs from TPC-DS, 200-server cluster]: Tez + Packing vs. Tez + Packing + Deps.
Tetris: Multi-Resource Packing for Cluster Schedulers

Performance of cluster schedulers. We observe that:
• Resources are fragmented, i.e., machines run below capacity.
• Even at 100% usage, goodput is much smaller due to over-allocation.
• Even Pareto-efficient multi-resource fair schemes result in much lower performance.
Tetris: up to 40% improvement in makespan¹ and job completion time, with near-perfect fairness.
¹ makespan = time to finish a set of jobs
Findings from Bing and Facebook trace analysis:
• Tasks need varying amounts of each resource.
• Demands for resources are weakly correlated.
• Diversity in multi-resource requirements: multiple resources become tight.
This matters because there is no single bottleneck resource: there is enough cross-rack network bandwidth to use all CPU cores.
Upper-bounding the potential gains: reduce makespan by up to 49%; reduce avg. job completion time by up to 46%.
Why so bad, #1: production schedulers neither pack tasks nor consider all their relevant resource demands.
#1 Resource Fragmentation
#2 Over-allocation
[Figure: Resource Fragmentation (RF). Machines A and B each have 4 GB of memory; tasks T1 (2 GB), T2 (2 GB), T3 (4 GB). Current schedulers allocate resources in terms of slots and leave free resources that cannot be assigned to tasks (avg. task completion time = 1.33t); a "packer" scheduler fits all three tasks (avg. task completion time = 1t). RF increases with the number of resources being allocated.]
[Figure: Over-allocation. Machine A has 4 GB memory and 20 MB/s network; T1, T2, T3 each need 2 GB memory, and T1 and T2 each also need 20 MB/s network. Because not all task resource demands are explicitly allocated, current schedulers over-allocate disk and network (avg. task completion time = 2.33t); a "packer" scheduler avoids this (avg. task completion time = 1.33t).]
Why so bad, #2: multi-resource fairness schemes do not help either.
• Work conserving != no fragmentation or over-allocation: treating the cluster as one big bag of resources hides the impact of resource fragmentation.
• They assume a job has a fixed resource profile, but different tasks in the same job have different demands; the schedule itself impacts a job's current resource profile, so we can schedule to create complementary profiles.
• Pareto¹ efficient != performant: a packer scheduler beats DRF by 50% on avg. job completion time and 33% on makespan.
¹ no job can increase its share without decreasing the share of another
Competing objectives: job completion time vs. fairness vs. cluster efficiency.
Current schedulers: 1. resource fragmentation; 2. over-allocation; 3. fair allocations sacrifice performance.

Tetris idea #1: pack tasks along multiple resources to improve cluster efficiency and reduce makespan.
Theory and practice: multi-resource packing of tasks is similar to multi-dimensional bin packing (balls = tasks; bins = machine × time), which is APX-hard¹. Existing heuristics do not directly apply here: they assume balls of a fixed size that are known a priori, whereas task demands vary with time and machine placement (they are elastic), and the scheduler must cope with the online arrival of jobs, dependencies, and cluster activity. Avoiding fragmentation looks like tight bin packing: reducing the number of bins used reduces makespan.
¹ APX-hard is a strict subset of NP-hard
Tetris packing heuristic:
1. Check for fit, ensuring no over-allocation.
2. Alignment score A = (task's resource demand vector) · (machine's free resource vector), computed only over machines where the task fits.
"A" works because:
• bigger balls get bigger scores;
• abundant resources get used first, countering resource fragmentation;
• load can still be spread across machines.
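A minimal sketch of this alignment heuristic (names and vectors are illustrative, not from Tetris's actual code): score a task on a machine as the dot product of its demand vector and the machine's free-resource vector, considering only machines where the task fits.

```python
def alignment_score(demand, free):
    """Dot product of task demand and machine free resources.

    Returns None when the task does not fit (the over-allocation check)."""
    if any(demand[r] > free[r] for r in demand):
        return None  # no fit -> never over-allocate
    return sum(demand[r] * free[r] for r in demand)

# Larger tasks get larger scores, and machines with abundant free
# resources score higher, which reduces fragmentation.
task = {"cpu": 2, "mem_gb": 4, "net_mbps": 20}
m1 = {"cpu": 8, "mem_gb": 16, "net_mbps": 100}   # mostly idle machine
m2 = {"cpu": 2, "mem_gb": 4, "net_mbps": 20}     # exact fit, nearly full
print(alignment_score(task, m1) > alignment_score(task, m2))  # True
```

Note the asymmetry with best-fit bin packing: the dot product prefers the machine with the most headroom, spreading load while still rewarding large tasks.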
Tetris idea #2: faster average job completion time.

Job completion time heuristic: Shortest Remaining Time First¹ (SRTF). Q: what is the shortest "remaining time"? We use remaining work:
P = (remaining # tasks) × (tasks' durations) × (tasks' resource demands).
The heuristic gives a score P to every job, extending SRTF to incorporate multiple resources.
¹ SRTF schedules jobs in ascending order of their remaining time; M. Harchol-Balter et al., Connection Scheduling in Web Servers [USITS'99]
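A sketch of the "remaining work" score (illustrative; the exact aggregation in Tetris may differ): weight the remaining task count by the average task duration and the average total resource demand, so SRTF generalizes to multiple resources.

```python
def remaining_work(tasks):
    """tasks: list of (duration, demand_vector) for a job's unfinished tasks."""
    if not tasks:
        return 0.0
    n = len(tasks)
    avg_duration = sum(d for d, _ in tasks) / n
    # Collapse the multi-resource demand to a scalar by summing dimensions.
    avg_demand = sum(sum(vec.values()) for _, vec in tasks) / n
    return n * avg_duration * avg_demand

# SRTF order: the job with the SMALLEST remaining work is preferred.
job_a = [(10, {"cpu": 1}), (10, {"cpu": 1})]                    # little left
job_b = [(30, {"cpu": 2}), (30, {"cpu": 2}), (30, {"cpu": 2})]  # lots left
print(remaining_work(job_a) < remaining_work(job_b))  # True
```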
Combine the A and P scores:
1: among J runnable jobs
2: score(j) = A(t, R) + P(j)
3: over the max task t in j with demand(t) ≤ R (resources free)
4: pick j*, t* = argmax score(j)
Using A alone delays job completion time; using P alone loses packing efficiency.
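A runnable sketch of that combined choice (simplified: the real system normalizes and weights the two scores before adding them, and the helper names here are illustrative):

```python
def pick(jobs, free):
    """jobs: {name: {"tasks": [(duration, demand_vec), ...]}}; free: machine vector.
    Returns (job_name, task) maximizing A + P', where P' rewards jobs
    with LESS remaining work, or None if nothing fits."""
    best, best_score = None, float("-inf")
    for name, job in jobs.items():
        # Negative remaining work: less work left -> higher score (SRTF-like).
        p = -sum(d * sum(v.values()) for d, v in job["tasks"])
        for task in job["tasks"]:
            _, demand = task
            if any(demand[r] > free[r] for r in demand):
                continue  # fit check: never over-allocate
            a = sum(demand[r] * free[r] for r in demand)  # alignment score
            if a + p > best_score:
                best, best_score = (name, task), a + p
    return best

jobs = {"short": {"tasks": [(1, {"cpu": 1})]},
        "long":  {"tasks": [(100, {"cpu": 1}), (100, {"cpu": 1})]}}
print(pick(jobs, {"cpu": 4})[0])  # "short": least remaining work, and it fits
```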
Tetris idea #3: achieve both performance and fairness.

A says: "task i should go here to improve packing efficiency."
P says: "schedule job j next to improve job completion time."
Fairness says: "this set of jobs should be scheduled next."
A feasible solution can typically satisfy all of them. Performance and fairness do not mix well in general, but we can get near-perfect fairness and much better performance.

Fairness heuristic: a fairness knob F ∈ [0, 1). F = 0 gives the most efficient scheduling; F → 1 is close to perfect fairness. Pick the best-for-performance task from among the 1-F fraction of jobs furthest from their fair share. Fairness is not a tight constraint: aim for long-term rather than short-term fairness, and lose a bit of fairness for a lot of gains in performance.
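A sketch of the fairness knob (illustrative; how "distance from fair share" is measured is an assumption here): restrict the packer's choice to the (1 - F) fraction of jobs furthest below their fair share, then apply the performance heuristics only within that set.

```python
def eligible_jobs(deficits, F):
    """deficits: {job: fair share minus current allocation}; bigger deficit
    means further from fair share. F = 0 keeps every job (most efficient);
    F -> 1 keeps only the most deprived jobs (closest to perfect fairness)."""
    ranked = sorted(deficits, key=lambda j: deficits[j], reverse=True)
    keep = max(1, round((1 - F) * len(ranked)))
    return ranked[:keep]

deficits = {"a": 0.5, "b": 0.3, "c": 0.1, "d": 0.0}
print(eligible_jobs(deficits, 0.0))   # all four jobs are candidates
print(eligible_jobs(deficits, 0.75))  # only "a", the most deprived job
```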
Putting it all together. We saw: packing efficiency; preferring small remaining work; the fairness knob. Other things in the paper: estimating task demands; dealing with inaccuracies and barriers; ingestion/evacuation.

[Figure: Yarn architecture with the changes to add Tetris shown in orange. Job Managers send multi-resource asks with barrier hints to the cluster-wide Resource Manager, which runs new logic to match tasks to machines (+packing, +SRTF, +fairness) and returns allocations; Node Managers track resource usage, enforce allocations, and send resource-availability reports and offers.]
Evaluation: pluggable scheduler in Yarn 2.4; 250-machine cluster deployment; replay of Bing and Facebook traces.
Efficiency results. Tetris vs. DRF: makespan 28% better, avg. job completion time 35% better. Tetris vs. the Capacity Scheduler: 29% and 30%. The gains come from avoiding fragmentation and avoiding over-allocation.
[Figure: cluster utilization (%) over time for CPU, memory, network-in, and storage; values above 100% indicate over-allocation, and lower values indicate higher resource fragmentation.]
Fairness: the fairness knob F quantifies the extent to which Tetris adheres to fair allocation.

Improvement                  | No fairness (F = 0) | F = 0.25 | Full fairness (F → 1)
Makespan                     | 50%                 | 25%      | 10%
Job completion time          | 40%                 | 35%      | 23%
Avg. slowdown [impacted jobs]| 25%                 | 5%       | 2%
Tetris summary:
• Pack efficiently along multiple resources; prefer jobs with less "remaining work"; incorporate fairness.
• Combine heuristics that improve packing efficiency with those that lower average job completion time.
• Achieving desired amounts of fairness can coexist with improving cluster performance.
• Implemented inside YARN; trace-driven simulations and deployment show encouraging initial results.
We are working towards a Yarn check-in: http://research.microsoft.com/en-us/UM/redmond/projects/tetris/
Backup slides
Estimating resource demands. Tetris estimates peak usage demands from:
• finished tasks in the same phase;
• statistics collected from recurring jobs (peak demand);
• the size/location of a task's inputs.
The Resource Tracker reports unused resources and is aware of other cluster activities, such as ingestion and evacuation. Over-estimates cost under-utilization.
[Figure: Machine 1 inbound network over time (MBytes/sec, 0-1024 scale), showing used vs. free In-Network bandwidth.]
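A sketch of the peak-usage estimator (illustrative; the sample format is an assumption): estimate a task's demand per resource as the peak usage observed across already-finished tasks of the same phase.

```python
def estimate_demand(finished_usage):
    """finished_usage: one list of observed usage vectors per finished task
    in the phase. Returns the per-resource peak across all samples."""
    estimate = {}
    for samples in finished_usage:
        for sample in samples:
            for r, v in sample.items():
                estimate[r] = max(estimate.get(r, 0), v)
    return estimate

# Two finished tasks from the same phase, each with periodic usage samples.
phase = [
    [{"mem_gb": 1.5, "net_mbps": 40}, {"mem_gb": 2.0, "net_mbps": 10}],
    [{"mem_gb": 1.8, "net_mbps": 55}],
]
print(estimate_demand(phase))  # {'mem_gb': 2.0, 'net_mbps': 55}
```

Using the peak (rather than the mean) is conservative: it avoids over-allocation at the cost of some under-utilization, which matches the trade-off on this slide.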
Placement impacts network/disk requirements.
Packer scheduler vs. DRF.
Cluster: 18 cores, 36 GB memory. Jobs (task profile, # tasks): A (1 core, 2 GB) × 18; B (3 cores, 1 GB) × 6; C (3 cores, 1 GB) × 6.
Dominant Resource Fairness (DRF) computes the dominant share DS = max(CPU share, memory share) of every user and seeks to maximize the minimum DS across all users, maximizing allocations subject to:
  1·qA + 3·qB + 3·qC ≤ 18 (CPU constraint)
  2·qA + 1·qB + 1·qC ≤ 36 (memory constraint)
[Figure: DRF runs 6 tasks of A, 2 of B, and 2 of C in each time slot (18 cores, 16 GB used), so all three jobs finish at 3t (durations A: 3t, B: 3t, C: 3t). A packer runs all 18 of A's tasks in the first slot (18 cores, 36 GB), then B's 6 tasks, then C's 6 tasks (durations A: t, B: 2t, C: 3t): a 33% improvement in avg. job completion time.]
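The dominant-share arithmetic for this example can be checked with a short sketch (the allocation below is the DRF-equalizing one for these profiles, not output of a real DRF solver):

```python
CLUSTER = {"cpu": 18, "mem": 36}
PROFILE = {"A": {"cpu": 1, "mem": 2},
           "B": {"cpu": 3, "mem": 1},
           "C": {"cpu": 3, "mem": 1}}

def dominant_share(job, n_tasks):
    """Largest per-resource share consumed by n_tasks tasks of this job."""
    return max(n_tasks * PROFILE[job][r] / CLUSTER[r] for r in CLUSTER)

# DRF equalizes dominant shares: 6 A's, 2 B's, 2 C's at a time gives every
# job a dominant share of 1/3 and saturates the CPU constraint.
alloc = {"A": 6, "B": 2, "C": 2}
print([round(dominant_share(j, n), 3) for j, n in alloc.items()])  # [0.333, 0.333, 0.333]
cpu_used = sum(n * PROFILE[j]["cpu"] for j, n in alloc.items())
print(cpu_used)  # 18: the CPU constraint is tight
```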
Packing efficiency does not achieve everything: it does not necessarily improve job completion time.
Machines 1, 2: each 2 cores, 4 GB. Jobs (task profile, # tasks): A (2 cores, 3 GB) × 6; B (1 core, 2 GB) × 2.
[Figure: "Pack" runs two A tasks per slot (4 cores, 6 GB) for 3t, then B's two tasks: durations A = 3t, B = 4t. "No pack" runs B's two tasks alongside A first (using only 2 cores, 4 GB on one machine), then finishes A: durations B = t, A = 4t, a 29% improvement in avg. job completion time.]
Ingestion / evacuation: other cluster activities that produce background traffic.
• Ingestion = storing incoming data for later analytics; e.g., some clusters report volumes of up to 10 TB per hour.
• Evacuation = data evacuated and re-replicated before maintenance operations; e.g., rack decommission for machine re-imaging.
The Resource Tracker reports these activities, and Tetris uses the reports to avoid contention between its tasks and them.
Workload analysis
Alternative packing heuristics
Fairness vs. efficiency
Virtual machine packing != Tetris. VM packing consolidates VMs with multi-dimensional resource requirements onto the fewest servers, but it focuses on different challenges, not task packing:
• balance load across servers;
• ensure VM availability in spite of failures;
• allow for quick software and hardware updates.
There is no entity corresponding to a job, so job completion time is inexpressible; and explicit resource requirements (e.g., "small VM") make VM packing simpler.
Barrier knob, b ∈ [0, 1). Tetris gives preference to the last tasks in a stage: it offers resources to tasks in a stage preceding a barrier once a b fraction of that stage's tasks have finished. b = 1 means no tasks are preferentially treated.
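A tiny sketch of the barrier-knob test (illustrative helper, not Tetris's actual code): a stage preceding a barrier is boosted once a b fraction of its tasks have completed.

```python
def prefers_barrier_stage(finished, total, b):
    """True when a pre-barrier stage has finished at least a b fraction of
    its tasks and so should get preferential offers."""
    return total > 0 and finished / total >= b

# With b = 0.8, a 10-task stage is boosted once 8 tasks are done,
# pulling the stragglers forward so the barrier clears sooner.
print(prefers_barrier_stage(8, 10, 0.8))  # True
print(prefers_barrier_stage(5, 10, 0.8))  # False
```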
Starvation prevention: could it take a long time to accommodate large tasks? In practice, no:
1. most tasks have demands within one order of magnitude of one another;
2. machines report resource availability to the scheduler periodically, so the scheduler learns about all the resources freed by tasks that finished in the preceding period together, and can make reservations for large tasks.
Cluster load vs. Tetris performance
Graphene: Packing and Dependency-aware Scheduling for Data-Parallel Clusters

Performance of cluster schedulers. We observe that cluster schedulers typically do dependency-aware scheduling OR multi-resource packing; none of the existing solutions are close to optimal for more than 50% of production jobs.
Graphene: >30% improvements in makespan¹ and job completion time for more than 50% of the jobs.
¹ makespan = time to finish a set of jobs
Findings from Bing trace analysis:
• Job structures have evolved into complex DAGs of tasks; the median job DAG has depth 7 and ~10³ tasks. A good cluster scheduler should be aware of dependencies.
• Applications have (very) diverse resource needs across CPU, memory, network, and disk: a high coefficient of variation (~1) for many resources, weakly correlated demands, and multiple resources becoming tight. This matters because there is no single bottleneck resource: there is enough cross-rack network bandwidth to use all CPU cores. A good cluster scheduler should pack resources.
Why so bad: production schedulers don't pack tasks AND consider dependencies; they do one OR the other.
• Dependency-aware schedulers, e.g., Breadth First Search (BFS) and Critical Path Scheduling (CPSched), consider the DAG structure during scheduling but do not account for tasks' resource demands (or assume tasks have homogeneous demands).
• Packers, e.g., Tetris, handle tasks with multiple resource requirements but ignore dependencies and take local greedy choices.
Any scheduler that does not pack can be up to n × optimal (n = number of tasks); any scheduler that ignores dependencies can be d × optimal (d = number of resource dimensions).
Where does the "work" lie in a DAG? ("Work" = the stages in a DAG where the most resources × time is spent.) Production DAGs are large, and neither a bunch of unrelated stages nor a chain of stages:
• >40% of the DAGs have most of the "work" on the critical path, where CPSched performs well;
• >30% of the DAGs have most of the "work" arranged such that packers perform well;
• for ~50% of the DAGs, neither packers nor criticality-based schedulers may perform well.
Graphene: pack tasks along multiple resources while considering task dependencies.
Outline: state-of-the-art techniques are suboptimal; key ideas in Graphene; conclusion.
State-of-the-art scheduling techniques are suboptimal: both CPSched and Tetris can be ~3x worse than optimal on a small example.
[Figure: a six-task DAG, each task labeled duration{rsrc.1, rsrc.2}, with total capacity 1 in each dimension. Reading the labels in order: t0: 1{.7, .31}, t1: .01{.95, .01}, t2: .01{.1, .7}, t3: .96{.2, .68}, t4: .98{.1, .01}, t5: .01{.01, .01}. CPSched finishes in ~3T and Tetris in ~3T, but the optimal schedule finishes in ~T.]
Key insight: t0, t2, and t5 are troublesome tasks; schedule them as soon as possible.
Graphene idea #1: schedule construction. Identify troublesome tasks and place them accordingly on a virtual resource-time space.
• Identify tasks that can lead to a poor schedule (troublesome tasks, T): those more likely to be on the critical path, and those more difficult to pack.
• Break the other tasks into parents (P), children (C), and other (O) sets based on their relationship to the tasks in T.
• Place the tasks in T on the virtual resource-time space first; overlay the others to fill any resultant holes in this space.
This is nearly optimal for over three quarters of our analyzed production DAGs.
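A simplified sketch of the T/P/C/O partitioning step (the DAG encoding and the rule for choosing T are assumptions; Graphene's real construction also resolves overlaps and orders the placement):

```python
def partition(dag, troublesome):
    """dag: {task: set(parent_tasks)}; troublesome: the chosen T set.
    Returns (parents P, children C, others O) relative to T."""
    parents, children = set(), set()

    def ancestors(t):                      # everything T depends on -> P
        for p in dag[t]:
            if p not in parents:
                parents.add(p)
                ancestors(p)

    def descendants(t):                    # everything depending on T -> C
        for u, ps in dag.items():
            if t in ps and u not in children:
                children.add(u)
                descendants(u)

    for t in troublesome:
        ancestors(t)
        descendants(t)
    parents -= troublesome
    children -= troublesome
    others = set(dag) - troublesome - parents - children
    return parents, children, others

# a -> b -> t -> c, with "o" unrelated; T = {t}.
dag = {"a": set(), "b": {"a"}, "t": {"b"}, "c": {"t"}, "o": set()}
print([sorted(s) for s in partition(dag, {"t"})])  # [['a', 'b'], ['c'], ['o']]
```

P is then placed before T's slot in the virtual space, C after it, and O wherever holes remain.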
Graphene idea #2: an online component enforces the desired schedules of the various DAGs.
[Figure: each DAG's Schedule Constructor emits a preference order; the runtime component in the Resource Manager merges schedules and assigns tasks on node heartbeats.]
• Job completion time: prefer jobs with less remaining work.
• Makespan: enforce the priority ordering with local placement, multi-resource packing, and judicious overbooking of malleable resources.
• Being fair: deficit counters bound unfairness and enable implementing different fairness schemes.
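A sketch of bounding unfairness with deficit counters (illustrative mechanism, not Graphene's exact implementation): track how far each job has fallen behind its fair share, and force-schedule any job whose deficit exceeds the bound, regardless of the packing/performance scores.

```python
class DeficitCounter:
    def __init__(self, bound):
        self.bound = bound      # maximum tolerated unfairness per job
        self.deficit = {}

    def update(self, job, fair_share, got):
        """Accumulate how much this job's allocation lags its fair share."""
        self.deficit[job] = self.deficit.get(job, 0.0) + fair_share - got

    def must_schedule(self):
        """Return the job whose unfairness exceeded the bound, if any."""
        lagging = {j: d for j, d in self.deficit.items() if d > self.bound}
        return max(lagging, key=lagging.get) if lagging else None

dc = DeficitCounter(bound=2.0)
for _ in range(5):
    dc.update("a", fair_share=0.5, got=0.0)  # "a" keeps losing out
    dc.update("b", fair_share=0.5, got=1.0)  # "b" is over-served
print(dc.must_schedule())  # a
```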
Evaluation: implemented in Yarn and Tez; 250-machine cluster deployment; replay of Bing traces and TPC-DS / TPC-H workloads.
Graphene vs. Tetris: makespan 29% better, avg. job completion time 27% better; vs. Critical Path: 31% and 33%; vs. BFS: 23% and 24%. The gains come from viewing the entire DAG and placing the troublesome tasks first, yielding a more compact schedule, better packing, and overbooking.
Graphene conclusion:
• combines various mechanisms to improve packing efficiency while considering task dependencies;
• constructs a good schedule by placing tasks on a virtual resource-time space;
• online heuristics softly enforce the desired schedules;
• implemented inside YARN and Tez; trace-driven simulations and deployment show encouraging initial results.