GoodFit: Multi-Resource Packing of Tasks with Dependencies
TRANSCRIPT

Cluster Scheduling for Jobs
Machines, file-system, network
Cluster Scheduler matches tasks to resources
Goals:
• High cluster utilization
• Fast job completion time
• Predictable performance / fairness
• Efficient (milliseconds…)
E.g., BigData (Hive, SCOPE, Spark)
E.g., CloudBuild
Tasks
Dependencies
• Need not keep resource "buffers"
• More dynamic than VM placement (tasks last seconds)
• Aggregate properties are important (e.g., all tasks in a job should finish)
Need careful multi-resource planning
Problem 1: current schedulers suffer from fragmentation and over-allocation of net/disk.
[Figure: Current Schedulers vs. Packer Scheduler]
• Fragmentation: 2 tasks/T vs. 3 tasks/T (+50%)
• Over-allocation: 2 tasks/2T vs. 2 tasks/T (+100%)
Problem 2: … worse with dependencies.
[Figure: a DAG with label = {duration, resource demand} on each task, e.g. {t, r}, {t, 1-r}, {(T-2)t, r}; the resource-time plots show critical-path scheduling needs ~nTt while the best schedule needs only ~Tt.]
Critical-path scheduling can be n times off because it ignores resource demands.
Packers can be d times off because they ignore future work (d = number of resources).
Typical job scheduler infrastructure, extended with: + packing, + bounded unfairness, + merge schedules, + overbook.
[Figure: per-DAG Application Masters (AMs), each with a Schedule Constructor, talk to a central Resource Manager (RM), which exchanges node heartbeats and task assignments with the Node Managers (NMs).]
Main ideas in multi-resource packing. Task packing ~ multi-dimensional bin packing, but:
• it is a very hard problem (APX-hard);
• available heuristics do not directly apply (task demands change with placement).
Packing heuristic: alignment score A = D · R, where D is the task's resource demand vector and R is the machine's available resource vector; A is computed only over machines where the task fits (D ≤ R).
Job completion time heuristic: shortest remaining work first, where
P = (remaining # tasks) × (tasks' avg. duration) × (tasks' avg. resource demand).
Trade-off: favoring packing efficiency alone delays job completion; favoring job completion time alone loses packing efficiency.
Fairness trade-offs: insisting on exact fairness can lose both performance and packing efficiency. We show that: {best "perf" | bounded unfairness} ~ best "perf".
Main ideas in packing dependent tasks:
1. Identify troublesome tasks (the "meat") and place them first.
2. Systematically place the other tasks without deadlocks.
3. At runtime, use a precedence order from the computed schedule plus heuristics to (a) overbook and (b) apply the packing and fairness ideas above.
4. Derive better lower bounds for DAG completion time.
[Figure: a resource-time space with the meat (M) placed between "meat begin" and "meat end", parents (P) before it, children (C) after it, and other tasks (O) filling the holes.]
Results 1 [20K DAGs from Cosmos]: comparing Packing, Packing + Deps., and the lower bound.
Results 2 [200 jobs from TPC-DS, 200-server cluster]: Tez + Packing vs. Tez + Packing + Deps.
Bundling: temporal relaxation of fairness.
[Figure: two identical jobs, each a disk-bound Map phase then a network-bound Reduce phase. Instantaneous fairness (50% share each) finishes both jobs around 4T; relaxing fairness to run phases at 100% finishes them around 2T and 3T.]
1) Temporal relaxation of fairness: a job will finish within x times the time it would take given its strict share.
2) Optimal trade-off with performance: x in fairness slack trades against x on makespan.
3) A simple (offline) algorithm achieves this trade-off.
Problem: instantaneous fairness can be up to d times worse on makespan (d resources).

Fairness slack      | Perf. loss vs. best
0 (perfectly fair)  | 2x
1 (<2x longer)      | 1.1x
2 (<3x longer)      | 1.07x
Spectrum: bare metal → VM allocation (e.g., HDInsight, AzureBatch) → data-parallel jobs (e.g., BigData: Yarn, Cosmos, Spark; CloudBuild).
Job = tasks + dependencies.
Scale: CloudBuild: 3,500 servers, 3,500 users, >20M targets/day; Yarn: ~100K servers (40K at Yahoo); Cosmos: >50K servers, >2 EB stored, >6K devs.
Job traits:
• Tasks are short-lived (10s of seconds)
• Have peculiarly shaped demands
• Composites are important (a job needs all of its tasks to finish)
• OK to kill and restart tasks
• Locality matters
Takeaways:
1) Job scheduling has specific aspects.
2) Resource-aware scheduling will speed up the average job (and reduce resource cost); it improves SLOs and return/$.
3) It spans research + practice.
Main ideas in packing dependent tasks (restated):
1. Identify troublesome tasks (T) and place them first.
2. Systematically place the other tasks without dead-ends.
3. At runtime, enforce the computed schedule plus heuristics to (a) overbook and (b) apply the packing and fairness ideas above.
4. Derive better lower bounds for DAG completion time.
[Figure: a resource-time space with the troublesome tasks (T) placed between "trouble begin" and "trouble end", parents (P) before, children (C) after, and other tasks (O) filling the holes.]
Results 1 [20K DAGs from Cosmos]: Packing, and Packing + Deps., vs. the lower bound (roughly 1.5X and 2X gains).
Results 2 [200 jobs from TPC-DS, 200-server cluster]: Tez + Packing vs. Tez + Packing + Deps.
Tetris: Multi-Resource Packing for Cluster Schedulers

Performance of cluster schedulers. We observe that:
• Resources are fragmented, i.e., machines run below capacity.
• Even at 100% usage, goodput is much smaller due to over-allocation.
• Even Pareto-efficient multi-resource fair schemes result in much lower performance.
Tetris: up to 40% improvement in makespan¹ and job completion time, with near-perfect fairness.
¹ makespan = time to finish a set of jobs
Findings from Bing and Facebook trace analysis:
• Tasks need varying amounts of each resource.
• Demands for resources are weakly correlated.
• Diversity in multi-resource requirements: multiple resources become tight.
This matters because there is no single bottleneck resource: there is enough cross-rack network bandwidth to use all CPU cores.
Upper-bounding the potential gains: reduce makespan by up to 49%; reduce avg. job completion time by up to 46%.
Why so bad, #1: production schedulers neither pack tasks nor consider all their relevant resource demands.
#1 Resource Fragmentation
#2 Over-allocation
[Figure: Resource Fragmentation (RF). Machines A and B each have 4 GB of memory; tasks T1 (2 GB), T2 (2 GB), T3 (4 GB). Current schedulers allocate resources in terms of slots and leave free resources that cannot be assigned to tasks (avg. task completion time = 1.33t); a "packer" scheduler fits all three tasks (avg. task completion time = 1t). RF increases with the number of resources being allocated.]
[Figure: Over-allocation. Machine A has 4 GB memory and 20 MB/s network; T1, T2, T3 each need 2 GB memory, and T1 and T2 each also need 20 MB/s network. Because not all task resource demands are explicitly allocated, current schedulers over-allocate disk and network (avg. task completion time = 2.33t); a "packer" scheduler avoids this (avg. task completion time = 1.33t).]
Why so bad, #2: multi-resource fairness schemes do not help either.
• Work conserving != no fragmentation or over-allocation: treating the cluster as one big bag of resources hides the impact of resource fragmentation.
• They assume a job has a fixed resource profile, but different tasks in the same job have different demands; the schedule itself impacts a job's current resource profile, so we can schedule to create complementary profiles.
• Pareto¹ efficient != performant: a packer scheduler beats DRF by 50% on avg. job completion time and 33% on makespan.
¹ no job can increase its share without decreasing the share of another
Competing objectives: job completion time vs. fairness vs. cluster efficiency.
Current schedulers: 1. resource fragmentation; 2. over-allocation; 3. fair allocations sacrifice performance.

Tetris idea #1: pack tasks along multiple resources to improve cluster efficiency and reduce makespan.
Theory and practice: multi-resource packing of tasks is similar to multi-dimensional bin packing (balls = tasks; bins = machine × time), which is APX-hard¹. Existing heuristics do not directly apply here: they assume balls of a fixed size that are known a priori, whereas task demands vary with time and machine placement (they are elastic), and the scheduler must cope with the online arrival of jobs, dependencies, and cluster activity. Avoiding fragmentation looks like tight bin packing: reducing the number of bins used reduces makespan.
¹ APX-hard is a strict subset of NP-hard
Tetris packing heuristic:
1. Check for fit, ensuring no over-allocation.
2. Alignment score A = (task's resource demand vector) · (machine's free resource vector), computed only over machines where the task fits.
"A" works because:
• bigger balls get bigger scores;
• abundant resources get used first, countering resource fragmentation;
• load can still be spread across machines.
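A minimal sketch of this alignment heuristic (names and vectors are illustrative, not from Tetris's actual code): score a task on a machine as the dot product of its demand vector and the machine's free-resource vector, considering only machines where the task fits.

```python
def alignment_score(demand, free):
    """Dot product of task demand and machine free resources.

    Returns None when the task does not fit (the over-allocation check)."""
    if any(demand[r] > free[r] for r in demand):
        return None  # no fit -> never over-allocate
    return sum(demand[r] * free[r] for r in demand)

# Larger tasks get larger scores, and machines with abundant free
# resources score higher, which reduces fragmentation.
task = {"cpu": 2, "mem_gb": 4, "net_mbps": 20}
m1 = {"cpu": 8, "mem_gb": 16, "net_mbps": 100}   # mostly idle machine
m2 = {"cpu": 2, "mem_gb": 4, "net_mbps": 20}     # exact fit, nearly full
print(alignment_score(task, m1) > alignment_score(task, m2))  # True
```

Note the asymmetry with best-fit bin packing: the dot product prefers the machine with the most headroom, spreading load while still rewarding large tasks.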
Tetris idea #2: faster average job completion time.

Job completion time heuristic: Shortest Remaining Time First¹ (SRTF). Q: what is the shortest "remaining time"? We use remaining work:
P = (remaining # tasks) × (tasks' durations) × (tasks' resource demands).
The heuristic gives a score P to every job, extending SRTF to incorporate multiple resources.
¹ SRTF schedules jobs in ascending order of their remaining time; M. Harchol-Balter et al., Connection Scheduling in Web Servers [USITS'99]
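A sketch of the "remaining work" score (illustrative; the exact aggregation in Tetris may differ): weight the remaining task count by the average task duration and the average total resource demand, so SRTF generalizes to multiple resources.

```python
def remaining_work(tasks):
    """tasks: list of (duration, demand_vector) for a job's unfinished tasks."""
    if not tasks:
        return 0.0
    n = len(tasks)
    avg_duration = sum(d for d, _ in tasks) / n
    # Collapse the multi-resource demand to a scalar by summing dimensions.
    avg_demand = sum(sum(vec.values()) for _, vec in tasks) / n
    return n * avg_duration * avg_demand

# SRTF order: the job with the SMALLEST remaining work is preferred.
job_a = [(10, {"cpu": 1}), (10, {"cpu": 1})]                    # little left
job_b = [(30, {"cpu": 2}), (30, {"cpu": 2}), (30, {"cpu": 2})]  # lots left
print(remaining_work(job_a) < remaining_work(job_b))  # True
```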
Combine the A and P scores:
1: among J runnable jobs
2: score(j) = A(t, R) + P(j)
3: over the max task t in j with demand(t) ≤ R (resources free)
4: pick j*, t* = argmax score(j)
Using A alone delays job completion time; using P alone loses packing efficiency.
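A runnable sketch of that combined choice (simplified: the real system normalizes and weights the two scores before adding them, and the helper names here are illustrative):

```python
def pick(jobs, free):
    """jobs: {name: {"tasks": [(duration, demand_vec), ...]}}; free: machine vector.
    Returns (job_name, task) maximizing A + P', where P' rewards jobs
    with LESS remaining work, or None if nothing fits."""
    best, best_score = None, float("-inf")
    for name, job in jobs.items():
        # Negative remaining work: less work left -> higher score (SRTF-like).
        p = -sum(d * sum(v.values()) for d, v in job["tasks"])
        for task in job["tasks"]:
            _, demand = task
            if any(demand[r] > free[r] for r in demand):
                continue  # fit check: never over-allocate
            a = sum(demand[r] * free[r] for r in demand)  # alignment score
            if a + p > best_score:
                best, best_score = (name, task), a + p
    return best

jobs = {"short": {"tasks": [(1, {"cpu": 1})]},
        "long":  {"tasks": [(100, {"cpu": 1}), (100, {"cpu": 1})]}}
print(pick(jobs, {"cpu": 4})[0])  # "short": least remaining work, and it fits
```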
Tetris idea #3: achieve both performance and fairness.

A says: "task i should go here to improve packing efficiency."
P says: "schedule job j next to improve job completion time."
Fairness says: "this set of jobs should be scheduled next."
A feasible solution can typically satisfy all of them. Performance and fairness do not mix well in general, but we can get near-perfect fairness and much better performance.

Fairness heuristic: a fairness knob F ∈ [0, 1). F = 0 gives the most efficient scheduling; F → 1 is close to perfect fairness. Pick the best-for-performance task from among the 1-F fraction of jobs furthest from their fair share. Fairness is not a tight constraint: aim for long-term rather than short-term fairness, and lose a bit of fairness for a lot of gains in performance.
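A sketch of the fairness knob (illustrative; how "distance from fair share" is measured is an assumption here): restrict the packer's choice to the (1 - F) fraction of jobs furthest below their fair share, then apply the performance heuristics only within that set.

```python
def eligible_jobs(deficits, F):
    """deficits: {job: fair share minus current allocation}; bigger deficit
    means further from fair share. F = 0 keeps every job (most efficient);
    F -> 1 keeps only the most deprived jobs (closest to perfect fairness)."""
    ranked = sorted(deficits, key=lambda j: deficits[j], reverse=True)
    keep = max(1, round((1 - F) * len(ranked)))
    return ranked[:keep]

deficits = {"a": 0.5, "b": 0.3, "c": 0.1, "d": 0.0}
print(eligible_jobs(deficits, 0.0))   # all four jobs are candidates
print(eligible_jobs(deficits, 0.75))  # only "a", the most deprived job
```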
Putting it all together. We saw: packing efficiency; preferring small remaining work; the fairness knob. Other things in the paper: estimating task demands; dealing with inaccuracies and barriers; ingestion/evacuation.

[Figure: Yarn architecture with the changes to add Tetris shown in orange. Job Managers send multi-resource asks with barrier hints to the cluster-wide Resource Manager, which runs new logic to match tasks to machines (+packing, +SRTF, +fairness) and returns allocations; Node Managers track resource usage, enforce allocations, and send resource-availability reports and offers.]
Evaluation: pluggable scheduler in Yarn 2.4; 250-machine cluster deployment; replay of Bing and Facebook traces.
Efficiency results. Tetris vs. DRF: makespan 28% better, avg. job completion time 35% better. Tetris vs. the Capacity Scheduler: 29% and 30%. The gains come from avoiding fragmentation and avoiding over-allocation.
[Figure: cluster utilization (%) over time for CPU, memory, network-in, and storage; values above 100% indicate over-allocation, and lower values indicate higher resource fragmentation.]
Fairness: the fairness knob F quantifies the extent to which Tetris adheres to fair allocation.

Improvement                  | No fairness (F = 0) | F = 0.25 | Full fairness (F → 1)
Makespan                     | 50%                 | 25%      | 10%
Job completion time          | 40%                 | 35%      | 23%
Avg. slowdown [impacted jobs]| 25%                 | 5%       | 2%
Tetris summary:
• Pack efficiently along multiple resources; prefer jobs with less "remaining work"; incorporate fairness.
• Combine heuristics that improve packing efficiency with those that lower average job completion time.
• Achieving desired amounts of fairness can coexist with improving cluster performance.
• Implemented inside YARN; trace-driven simulations and deployment show encouraging initial results.
We are working towards a Yarn check-in: http://research.microsoft.com/en-us/UM/redmond/projects/tetris/
Backup slides
Estimating resource demands. Tetris estimates peak usage demands from:
• finished tasks in the same phase;
• statistics collected from recurring jobs (peak demand);
• the size/location of a task's inputs.
The Resource Tracker reports unused resources and is aware of other cluster activities, such as ingestion and evacuation. Over-estimates cost under-utilization.
[Figure: Machine 1 inbound network over time (MBytes/sec, 0-1024 scale), showing used vs. free In-Network bandwidth.]
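A sketch of the peak-usage estimator (illustrative; the sample format is an assumption): estimate a task's demand per resource as the peak usage observed across already-finished tasks of the same phase.

```python
def estimate_demand(finished_usage):
    """finished_usage: one list of observed usage vectors per finished task
    in the phase. Returns the per-resource peak across all samples."""
    estimate = {}
    for samples in finished_usage:
        for sample in samples:
            for r, v in sample.items():
                estimate[r] = max(estimate.get(r, 0), v)
    return estimate

# Two finished tasks from the same phase, each with periodic usage samples.
phase = [
    [{"mem_gb": 1.5, "net_mbps": 40}, {"mem_gb": 2.0, "net_mbps": 10}],
    [{"mem_gb": 1.8, "net_mbps": 55}],
]
print(estimate_demand(phase))  # {'mem_gb': 2.0, 'net_mbps': 55}
```

Using the peak (rather than the mean) is conservative: it avoids over-allocation at the cost of some under-utilization, which matches the trade-off on this slide.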
Placement impacts network/disk requirements.
Packer scheduler vs. DRF.
Cluster: 18 cores, 36 GB memory. Jobs (task profile, # tasks): A (1 core, 2 GB) × 18; B (3 cores, 1 GB) × 6; C (3 cores, 1 GB) × 6.
Dominant Resource Fairness (DRF) computes the dominant share DS = max(CPU share, memory share) of every user and seeks to maximize the minimum DS across all users, maximizing allocations subject to:
  1·qA + 3·qB + 3·qC ≤ 18 (CPU constraint)
  2·qA + 1·qB + 1·qC ≤ 36 (memory constraint)
[Figure: DRF runs 6 tasks of A, 2 of B, and 2 of C in each time slot (18 cores, 16 GB used), so all three jobs finish at 3t (durations A: 3t, B: 3t, C: 3t). A packer runs all 18 of A's tasks in the first slot (18 cores, 36 GB), then B's 6 tasks, then C's 6 tasks (durations A: t, B: 2t, C: 3t): a 33% improvement in avg. job completion time.]
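The dominant-share arithmetic for this example can be checked with a short sketch (the allocation below is the DRF-equalizing one for these profiles, not output of a real DRF solver):

```python
CLUSTER = {"cpu": 18, "mem": 36}
PROFILE = {"A": {"cpu": 1, "mem": 2},
           "B": {"cpu": 3, "mem": 1},
           "C": {"cpu": 3, "mem": 1}}

def dominant_share(job, n_tasks):
    """Largest per-resource share consumed by n_tasks tasks of this job."""
    return max(n_tasks * PROFILE[job][r] / CLUSTER[r] for r in CLUSTER)

# DRF equalizes dominant shares: 6 A's, 2 B's, 2 C's at a time gives every
# job a dominant share of 1/3 and saturates the CPU constraint.
alloc = {"A": 6, "B": 2, "C": 2}
print([round(dominant_share(j, n), 3) for j, n in alloc.items()])  # [0.333, 0.333, 0.333]
cpu_used = sum(n * PROFILE[j]["cpu"] for j, n in alloc.items())
print(cpu_used)  # 18: the CPU constraint is tight
```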
Packing efficiency does not achieve everything: it does not necessarily improve job completion time.
Machines 1, 2: each 2 cores, 4 GB. Jobs (task profile, # tasks): A (2 cores, 3 GB) × 6; B (1 core, 2 GB) × 2.
[Figure: "Pack" runs two A tasks per slot (4 cores, 6 GB) for 3t, then B's two tasks: durations A = 3t, B = 4t. "No pack" runs B's two tasks alongside A first (using only 2 cores, 4 GB on one machine), then finishes A: durations B = t, A = 4t, a 29% improvement in avg. job completion time.]
Ingestion / evacuation: other cluster activities that produce background traffic.
• Ingestion = storing incoming data for later analytics; e.g., some clusters report volumes of up to 10 TB per hour.
• Evacuation = data evacuated and re-replicated before maintenance operations; e.g., rack decommission for machine re-imaging.
The Resource Tracker reports these activities, and Tetris uses the reports to avoid contention between its tasks and them.
Workload analysis
Alternative packing heuristics
Fairness vs. efficiency
Virtual machine packing != Tetris. VM packing consolidates VMs with multi-dimensional resource requirements onto the fewest servers, but it focuses on different challenges, not task packing:
• balance load across servers;
• ensure VM availability in spite of failures;
• allow for quick software and hardware updates.
There is no entity corresponding to a job, so job completion time is inexpressible; and explicit resource requirements (e.g., "small VM") make VM packing simpler.
Barrier knob, b ∈ [0, 1). Tetris gives preference to the last tasks in a stage: it offers resources to tasks in a stage preceding a barrier once a b fraction of that stage's tasks have finished. b = 1 means no tasks are preferentially treated.
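A tiny sketch of the barrier-knob test (illustrative helper, not Tetris's actual code): a stage preceding a barrier is boosted once a b fraction of its tasks have completed.

```python
def prefers_barrier_stage(finished, total, b):
    """True when a pre-barrier stage has finished at least a b fraction of
    its tasks and so should get preferential offers."""
    return total > 0 and finished / total >= b

# With b = 0.8, a 10-task stage is boosted once 8 tasks are done,
# pulling the stragglers forward so the barrier clears sooner.
print(prefers_barrier_stage(8, 10, 0.8))  # True
print(prefers_barrier_stage(5, 10, 0.8))  # False
```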
Starvation prevention: could it take a long time to accommodate large tasks? In practice, no:
1. most tasks have demands within one order of magnitude of one another;
2. machines report resource availability to the scheduler periodically, so the scheduler learns about all the resources freed by tasks that finished in the preceding period together, and can make reservations for large tasks.
Cluster load vs. Tetris performance
Graphene: Packing and Dependency-aware Scheduling for Data-Parallel Clusters

Performance of cluster schedulers. We observe that cluster schedulers typically do dependency-aware scheduling OR multi-resource packing; none of the existing solutions are close to optimal for more than 50% of production jobs.
Graphene: >30% improvements in makespan¹ and job completion time for more than 50% of the jobs.
¹ makespan = time to finish a set of jobs
Findings from Bing trace analysis:
• Job structures have evolved into complex DAGs of tasks; the median job DAG has depth 7 and ~10³ tasks. A good cluster scheduler should be aware of dependencies.
• Applications have (very) diverse resource needs across CPU, memory, network, and disk: a high coefficient of variation (~1) for many resources, weakly correlated demands, and multiple resources becoming tight. This matters because there is no single bottleneck resource: there is enough cross-rack network bandwidth to use all CPU cores. A good cluster scheduler should pack resources.
Why so bad: production schedulers don't pack tasks AND consider dependencies; they do one OR the other.
• Dependency-aware schedulers, e.g., Breadth First Search (BFS) and Critical Path Scheduling (CPSched), consider the DAG structure during scheduling but do not account for tasks' resource demands (or assume tasks have homogeneous demands).
• Packers, e.g., Tetris, handle tasks with multiple resource requirements but ignore dependencies and take local greedy choices.
Any scheduler that does not pack can be up to n × optimal (n = number of tasks); any scheduler that ignores dependencies can be d × optimal (d = number of resource dimensions).
Where does the "work" lie in a DAG? ("Work" = the stages in a DAG where the most resources × time is spent.) Production DAGs are large, and neither a bunch of unrelated stages nor a chain of stages:
• >40% of the DAGs have most of the "work" on the critical path, where CPSched performs well;
• >30% of the DAGs have most of the "work" arranged such that packers perform well;
• for ~50% of the DAGs, neither packers nor criticality-based schedulers may perform well.
Graphene: pack tasks along multiple resources while considering task dependencies.
Outline: state-of-the-art techniques are suboptimal; key ideas in Graphene; conclusion.
State-of-the-art scheduling techniques are suboptimal: both CPSched and Tetris can be ~3x worse than optimal on a small example.
[Figure: a six-task DAG, each task labeled duration{rsrc.1, rsrc.2}, with total capacity 1 in each dimension. Reading the labels in order: t0: 1{.7, .31}, t1: .01{.95, .01}, t2: .01{.1, .7}, t3: .96{.2, .68}, t4: .98{.1, .01}, t5: .01{.01, .01}. CPSched finishes in ~3T and Tetris in ~3T, but the optimal schedule finishes in ~T.]
Key insight: t0, t2, and t5 are troublesome tasks; schedule them as soon as possible.
Graphene idea #1: schedule construction. Identify troublesome tasks and place them accordingly on a virtual resource-time space.
• Identify tasks that can lead to a poor schedule (troublesome tasks, T): those more likely to be on the critical path, and those more difficult to pack.
• Break the other tasks into parents (P), children (C), and other (O) sets based on their relationship to the tasks in T.
• Place the tasks in T on the virtual resource-time space first; overlay the others to fill any resultant holes in this space.
This is nearly optimal for over three quarters of our analyzed production DAGs.
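A simplified sketch of the T/P/C/O partitioning step (the DAG encoding and the rule for choosing T are assumptions; Graphene's real construction also resolves overlaps and orders the placement):

```python
def partition(dag, troublesome):
    """dag: {task: set(parent_tasks)}; troublesome: the chosen T set.
    Returns (parents P, children C, others O) relative to T."""
    parents, children = set(), set()

    def ancestors(t):                      # everything T depends on -> P
        for p in dag[t]:
            if p not in parents:
                parents.add(p)
                ancestors(p)

    def descendants(t):                    # everything depending on T -> C
        for u, ps in dag.items():
            if t in ps and u not in children:
                children.add(u)
                descendants(u)

    for t in troublesome:
        ancestors(t)
        descendants(t)
    parents -= troublesome
    children -= troublesome
    others = set(dag) - troublesome - parents - children
    return parents, children, others

# a -> b -> t -> c, with "o" unrelated; T = {t}.
dag = {"a": set(), "b": {"a"}, "t": {"b"}, "c": {"t"}, "o": set()}
print([sorted(s) for s in partition(dag, {"t"})])  # [['a', 'b'], ['c'], ['o']]
```

P is then placed before T's slot in the virtual space, C after it, and O wherever holes remain.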
Graphene idea #2: an online component enforces the desired schedules of the various DAGs.
[Figure: each DAG's Schedule Constructor emits a preference order; the runtime component in the Resource Manager merges schedules and assigns tasks on node heartbeats.]
• Job completion time: prefer jobs with less remaining work.
• Makespan: enforce the priority ordering with local placement, multi-resource packing, and judicious overbooking of malleable resources.
• Being fair: deficit counters bound unfairness and enable implementing different fairness schemes.
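A sketch of bounding unfairness with deficit counters (illustrative mechanism, not Graphene's exact implementation): track how far each job has fallen behind its fair share, and force-schedule any job whose deficit exceeds the bound, regardless of the packing/performance scores.

```python
class DeficitCounter:
    def __init__(self, bound):
        self.bound = bound      # maximum tolerated unfairness per job
        self.deficit = {}

    def update(self, job, fair_share, got):
        """Accumulate how much this job's allocation lags its fair share."""
        self.deficit[job] = self.deficit.get(job, 0.0) + fair_share - got

    def must_schedule(self):
        """Return the job whose unfairness exceeded the bound, if any."""
        lagging = {j: d for j, d in self.deficit.items() if d > self.bound}
        return max(lagging, key=lagging.get) if lagging else None

dc = DeficitCounter(bound=2.0)
for _ in range(5):
    dc.update("a", fair_share=0.5, got=0.0)  # "a" keeps losing out
    dc.update("b", fair_share=0.5, got=1.0)  # "b" is over-served
print(dc.must_schedule())  # a
```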
Evaluation: implemented in Yarn and Tez; 250-machine cluster deployment; replay of Bing traces and TPC-DS / TPC-H workloads.
Graphene vs. Tetris: makespan 29% better, avg. job completion time 27% better; vs. Critical Path: 31% and 33%; vs. BFS: 23% and 24%. The gains come from viewing the entire DAG and placing the troublesome tasks first, yielding a more compact schedule, better packing, and overbooking.
Graphene conclusion:
• combines various mechanisms to improve packing efficiency while considering task dependencies;
• constructs a good schedule by placing tasks on a virtual resource-time space;
• online heuristics softly enforce the desired schedules;
• implemented inside YARN and Tez; trace-driven simulations and deployment show encouraging initial results.