Scheduling on Large-Scale Platforms
Henri Casanova ([email protected])
Lecture Outline
- What is Scheduling?
- Application Scheduling
  - State-of-the-art
  - Case study: divisible loads
- Modeling and Evaluation
- Conclusion
What is Scheduling?
- Context: parallel/distributed (scientific) applications that run on distributed computing platforms
  - a matrix-multiplication parallel algorithm on a cluster
  - a complex, multi-component climate modeling application on a grid that consists of 6 clusters and a data warehouse
- Scheduling: the decision process by which application components (computation, data) are assigned to resources (compute node, disk, network link)
What is Scheduling?
- Scheduling is a well-known hard problem in computer science
  - NP-hard in most instances (with a few exceptions)
- Many scheduling heuristics have been proposed
  - various application models
  - various platform models
- Most general application model: a DAG
- Most general platform model: a graph
  - Not to say it's a good model, and there are many subtleties there
Where do DAGs come from?
- Example: solving a triangular system of linear equations
- A x = b, where A is lower triangular

for i = 1 to n
    x(i) = b(i) / a(i,i)
    for j = i+1 to n
        b(j) = b(j) - a(j,i) * x(i)
Where do DAGs come from?
- Unrolling the loop shows how values flow between iterations:

i=1:  x(1) = b(1) / a(1,1)          // easily compute x(1)
      b(2) = b(2) - a(2,1) * x(1)   // remove it from
      b(3) = b(3) - a(3,1) * x(1)   // the system ...
i=2:  x(2) = b(2) / a(2,2)          // easily compute x(2)
      b(3) = b(3) - a(3,2) * x(2)   // remove it from
      b(4) = b(4) - a(4,2) * x(2)   // the system ...
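The pseudocode above is plain forward substitution. As a sketch (function and variable names are mine, not from the lecture), it can be written as runnable Python:

```python
def forward_substitution(a, b):
    """Solve A x = b for lower-triangular A, mirroring the pseudocode."""
    n = len(b)
    b = list(b)                 # work on a copy; b is updated in place
    x = [0.0] * n
    for i in range(n):
        x[i] = b[i] / a[i][i]           # Task T(i,i): compute x(i)
        for j in range(i + 1, n):
            b[j] -= a[j][i] * x[i]      # Task T(i,j): eliminate x(i) from row j
    return x

# [2 0] [x1] = [4]  ->  x1 = 2,  x2 = (8 - 1*2) / 3 = 2
# [1 3] [x2]   [8]
print(forward_substitution([[2.0, 0.0], [1.0, 3.0]], [4.0, 8.0]))  # -> [2.0, 2.0]
```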
DAG and Linear System Solve

for i = 1 to n
    Task Ti,i: x(i) = b(i) / a(i,i)
    for j = i+1 to n
        Task Ti,j: b(j) = b(j) - a(j,i) * x(i)

- Let '<' denote the sequential order: T1,1 < T1,2 < ... < T1,n < T2,2 < T2,3 < ... < Tn,n
- Independent tasks can be done in parallel
- How do we determine that tasks are independent?
Bernstein Condition
- Let In(T) be the input variables of task T
- Let Out(T) be the output variables of task T
- Two tasks T and T' are not independent when
  - In(T) ∩ Out(T') ≠ ∅,
  - or Out(T) ∩ In(T') ≠ ∅,
  - or Out(T) ∩ Out(T') ≠ ∅
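The condition is easy to check mechanically. A sketch (the encoding of a task as a pair of In/Out sets is mine):

```python
def independent(task_a, task_b):
    """Bernstein's condition: two tasks are independent iff
    In∩Out, Out∩In, and Out∩Out are all empty."""
    in_a, out_a = task_a
    in_b, out_b = task_b
    return not (in_a & out_b or out_a & in_b or out_a & out_b)

# Tasks from the triangular-solve example (i = 1):
T11 = ({"b1", "a11"}, {"x1"})        # x(1) = b(1) / a(1,1)
T12 = ({"b2", "a21", "x1"}, {"b2"})  # b(2) = b(2) - a(2,1) * x(1)
T13 = ({"b3", "a31", "x1"}, {"b3"})  # b(3) = b(3) - a(3,1) * x(1)

print(independent(T11, T12))  # False: T12 reads x1, which T11 writes
print(independent(T12, T13))  # True: both read x1, but reads do not conflict
```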
Back to the Example

for i = 1 to n
    Task Ti,i: x(i) = b(i) / a(i,i)
    for j = i+1 to n
        Task Ti,j: b(j) = b(j) - a(j,i) * x(i)

- Task Ti,i:
  - In(Ti,i) = {b(i), a(i,i)}
  - Out(Ti,i) = {x(i)}
- Task Ti,j, for j > i:
  - In(Ti,j) = {b(j), a(j,i), x(i)}
  - Out(Ti,j) = {b(j)}
- Dependences?
  - Out(Ti,i) ∩ In(Ti,j) = {x(i)}, for all j > i
  - Out(Ti,j) ∩ Out(Ti+1,j) = {b(j)}, for all j > i+1
Back to the Example

for i = 1 to 5
    Task Ti,i: x(i) = b(i) / a(i,i)
    for j = i+1 to 5
        Task Ti,j: b(j) = b(j) - a(j,i) * x(i)

[Figure: the resulting DAG, obtained from instruction-level analysis: root T1,1 with successors T1,2 ... T1,5; each Ti,i feeds the tasks Ti,j for j > i; end node T5,5]
More DAGs
- A DAG (or Task Graph) may just be inherent to an application at a higher level
- Workflow applications explicitly couple components together into a "coarse-grain" DAG
  - image processing
  - climate modeling
  - database operations
Scheduling Problem
- Although there is a lot of formalism, it comes down to putting slots into Gantt charts

[Figure: Gantt chart of tasks T1 through T7, with dependencies (no transfers), placed over time on three identical processors P1, P2, P3; the chart shows a task's weight (e.g., w(T6)) and the resulting makespan]
DAG Scheduling Problem
- Theorem: there exists a valid schedule iff the DAG has no cycles
- Critical path:
  - Let X be a path in the DAG from the root to the end node, and let F(X) be the sum of all node weights along the path
  - The makespan is larger than F(X) for all paths X
  - The path that maximizes F(X) is called the critical path
- If we have an infinite number of processors, then "schedule as early as possible" is optimal
- Otherwise, it's NP-hard
  - NP-hard even on 2 identical processors
- Things can get more complicated
  - cost of communication over some network
  - heterogeneous processors
- People have proposed many heuristics
- One common approach: list scheduling
  - compute task "priorities" and schedule tasks in order of these priorities
  - priorities may, for instance, be higher for tasks on the critical path
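A minimal list-scheduling sketch (task weights and helper names are mine; identical processors, communication costs ignored). Priorities are "bottom levels": a task's weight plus the heaviest path below it, which is largest along the critical path:

```python
def bottom_level(tasks, succs):
    """Bottom level of each task: its weight plus the heaviest path below it.
    With positive weights, decreasing bottom level is a valid topological
    order, and the maximum value is the critical-path length."""
    bl = {}
    def rec(t):
        if t not in bl:
            bl[t] = tasks[t] + max((rec(s) for s in succs.get(t, [])), default=0.0)
        return bl[t]
    for t in tasks:
        rec(t)
    return bl

def list_schedule(tasks, succs, n_procs):
    """Greedy list scheduling on identical processors; returns the makespan."""
    preds = {t: [] for t in tasks}
    for t, ss in succs.items():
        for s in ss:
            preds[s].append(t)
    bl = bottom_level(tasks, succs)
    avail = [0.0] * n_procs                          # next free time per processor
    finish = {}
    for t in sorted(tasks, key=lambda u: -bl[u]):    # highest priority first
        est = max((finish[p] for p in preds[t]), default=0.0)
        i = min(range(n_procs), key=lambda k: max(avail[k], est))
        finish[t] = max(avail[i], est) + tasks[t]
        avail[i] = finish[t]
    return max(finish.values())

# Diamond DAG, unit weights: T1 -> {T2, T3} -> T4; critical path = 3
tasks = {"T1": 1.0, "T2": 1.0, "T3": 1.0, "T4": 1.0}
succs = {"T1": ["T2", "T3"], "T2": ["T4"], "T3": ["T4"]}
print(list_schedule(tasks, succs, 2))  # -> 3.0
```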
Job Scheduling
- So far we've talked about a single application
- But in many cases, resources are shared by multiple users
  - "jobs" arrive at different times
- The performance metric should be some type of aggregate, rather than the makespan
  - minimize average "turn-around time"
  - minimize average "slow-down"
    - slow-down is "time in the system / time in the system if alone in the system"
  - minimize maximum "slow-down"
  - try to enforce some notion of fairness
- There is a wealth of complexity results, as well as open problems
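These metrics are straightforward to compute from a job trace. A sketch with made-up numbers (names are mine):

```python
def metrics(jobs, finish_times):
    """jobs: list of (arrival, time-if-alone); finish_times: completion times
    under some schedule. Returns (avg turn-around, avg slow-down, max slow-down)."""
    turnaround = [f - a for (a, _), f in zip(jobs, finish_times)]
    slowdown = [t / alone for t, (_, alone) in zip(turnaround, jobs)]
    n = len(jobs)
    return sum(turnaround) / n, sum(slowdown) / n, max(slowdown)

# Three jobs on one machine, FCFS: all arrive at 0, alone times 10, 1, 1.
# FCFS finishes them at 10, 11, 12: the short jobs see slow-downs of 11 and 12.
print(metrics([(0, 10), (0, 1), (0, 1)], [10, 11, 12]))  # -> (11.0, 8.0, 12.0)
```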
Scheduling in Practice
- Very large literature
- Very large gap between theory and practice
- Application scheduling:
  - Most published algorithms make unrealistic assumptions, especially when looking at large-scale Grid platforms
  - Therefore, they are almost never used
  - Question: should one just give up and use simple (e.g., greedy) approaches, because there is no hope of doing anything clever in practice?
- Job scheduling:
  - State of the art: batch schedulers (PBS, LQS, LoadLeveler, SGE)
  - NOT designed with user-centric metrics in mind (e.g., resource utilization vs. average slowdown)
  - Very rigid request model: "I want N processors for H hours"
  - Question: should we have more flexible interfaces to job schedulers, and should metrics be more user-centric?
Lecture Outline
- What is Scheduling?
- Application Scheduling
  - State-of-the-art
  - Case study: divisible loads
- Modeling and Evaluation
- Conclusion
Application Scheduling
- Context: "Grid" platforms
  - The application should run on multiple resources over the wide area
- The "scheduler" component is present in many systems
  - But: "we'll put something clever in there at some point"
- Why is practical application scheduling difficult?
  - "The algorithm in paper A assumes homogeneous resources"
  - "The algorithm in paper B assumes a fully-connected network with no bandwidth sharing"
  - "The algorithm in paper C assumes that all resources are stable throughout application execution"
  - "The algorithm in paper D assumes that we know everything about the application"
  - "The algorithm in paper E assumes that there are no latencies"
- Assumptions in the literature do not match the real world
Lecture Outline
- What is Scheduling?
- Application Scheduling
  - State-of-the-art
  - Case study: divisible loads
- Modeling and Evaluation
- Conclusion
Divisible Load Applications
- Definition
  - The application consists of "many" independent elemental tasks
  - Each task has a data load and a computation load
  - The application is approximated as a continuous load W, measured in "load units"
  - The amount of data to transfer is significant (i.e., the application is not purely compute-intensive)
- Application examples
  - MPEG encoding: 1 load unit = 1 frame
  - Bioinformatics: 1 load unit = 1 sequence
  - Volume rendering: 1 load unit = 1 voxel
  - Data mining, image processing, file compression, database joins, genetic algorithms, etc.
- Great candidates for Grid deployment
  - No dependencies (MUCH simpler than DAGs)
  - Should be deployable on a large scale
  - Like a data-intensive SETI@home
Divisible Load Scheduling
- The DLS problem:
  - pick a number of chunks
  - pick a size for each chunk
  - decide which chunk goes to which compute resource, and when
- Variations
  - Objective
    - minimize the "makespan"
    - maximize steady-state throughput
  - Compute and network resources
    - homogeneous
    - heterogeneous
  - Topology
    - linear array, star, tree, hypercube, mesh, ...
Single-Round DLS
- N workers, N "chunks"

[Figure: star topology (a master connected to three workers) and the corresponding Gantt chart, showing each worker's chunk transfer followed by its chunk computation]

- Some assumptions
  - Communication from the master is serialized
  - Communication and computation "costs" are commensurate to chunk size (in load units)
Single-Round DLS
- Optimal schedule: chunk sizes are chosen so that all workers finish at the same time

[Figure: Gantt chart in which each worker sits idle while earlier workers' chunks are being transferred]

- Drawback: idle time, the most common bane of parallel application performance
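Under the slide's assumptions (serialized communication from the master, linear costs, identical workers), requiring that consecutive workers finish simultaneously forces a geometric decrease in chunk sizes with ratio B / (B + S), where B is the transfer rate and S the compute rate in load units per second. A sketch (function name and parameters are mine):

```python
def single_round_chunks(W, N, B, S):
    """Chunk sizes (in load units) so that all N workers finish together:
    worker k's compute time must cover worker k+1's transfer plus compute,
    which gives chunk_{k+1} = chunk_k * B / (B + S)."""
    q = B / (B + S)
    raw = [q ** k for k in range(N)]
    scale = W / sum(raw)               # normalize so the chunks sum to W
    return [scale * r for r in raw]

# With B = S, each worker gets half the previous worker's chunk:
print([round(c, 2) for c in single_round_chunks(100.0, 3, 10.0, 10.0)])
# -> [57.14, 28.57, 14.29]
```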
Multi-Round DLS
- N workers, M×N chunks (M: number of rounds)

[Figure: Gantt chart in which smaller chunks at the start and end of the schedule shrink the idle time]

- Increasing and then decreasing chunk sizes
- Well-known "pipelining" technique
- Additional assumption
  - overlap of communication and computation at each worker

Multi-Installment (MI) Algorithm
- Multi-installment algorithm [Veeravalli'95]

[Figure: Gantt chart for workers #1, #2, #3]

- A worker finishes receiving the data for a chunk exactly when the computation for that chunk can start
  - leads to a "tight" schedule: no gaps in network or worker utilization
- Induction on chunk sizes to compute the schedule
MI for Large-Scale Platforms?
- Assumes linear costs
  - Tcomm = chunk_size / transfer_rate (in load units)
  - Tcomp = chunk_size / compute_rate (in load units)
- Assumes a homogeneous platform
  - homogeneous network and workers
- Assumes 100% accurate performance prediction for both network and workers
- Conclusion: MI is not applicable to large-scale platforms
Affine Costs
- Known to be more realistic than linear costs
- Tcomm = β + g / B
  - β: overhead of establishing a connection, network latency, authentication/authorization, etc.
  - Can be significant on wide-area platforms (e.g., TeraGrid) using Grid middleware (e.g., GridFTP): up to several seconds
- Tcomp = α + g / S
  - α: overhead of authentication/authorization, process initiation, etc.
  - In practice α can be several seconds
  - Grid benchmarking projects report high values (up to 45 seconds on some production Grid platforms)
XMI: eXtended Multi-Installment

[Figure: Gantt chart on a homogeneous platform, with chunks numbered 0 through 8 dispatched in round-robin fashion to workers #1, #2, #3]
XMI: eXtended Multi-Installment
- Recursion (R = B / S):
  - ∀ i ≥ N:      α + g_i = (g_{i-1} + g_{i-2} + … + g_{i-N}) / R + N β
  - ∀ 0 ≤ i < N:  α + g_i = (g_{i-1} + … + g_0) / R + i β + g_0 + α
  - ∀ i < 0:      g_i = 0

[Figure: the same Gantt chart, annotated with the chunk indices 0 through 8]
Solving the XMI recursion
- How to solve
  - ∀ i ≥ N:      α + g_i = (g_{i-1} + g_{i-2} + … + g_{i-N}) / R + N β
  - ∀ 0 ≤ i < N:  α + g_i = (g_{i-1} + … + g_0) / R + i β + g_0 + α
  - ∀ i < 0:      g_i = 0
- Generating function: G(x) = ∑_{i=0}^{∞} g_i x^i
- Compute G(x)?
  - sum over the equations above
  - property: ∑_{i=0}^{n} g_i is the nth coefficient of G(x) / (1 - x)

G(x) = [ g_0 (1 - x^N) + α x^N + β (x + x^2 + … + x^N) ] / [ (1 - x) - x (1 - x^N) / R ]

Solving the XMI recursion
- The problem is reduced to finding the coefficients of G(x)
- Simple rational expansion theorem [GKP'94]
  - roots of the denominator polynomial
  - only works if all roots have multiplicity 1
    - true if R ≠ N
    - if R = N, it turns out one can use the general theorem
- Roots can be computed numerically
- We obtain the chunk sizes as linear combinations of geometric series:

g_i = g_0 ∑_{j=0}^{N} η_j θ_j^i + ∑_{j=0}^{N} ξ_j θ_j^i    (θ_j: inverse of the jth root)
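Rather than inverting the generating function, the recursion can also simply be iterated numerically. A sketch (the stopping rule and parameter names are my additions; the last-round adjustment from the lecture is omitted, and α is assumed small enough that chunk sizes stay positive):

```python
def xmi_chunks(g0, alpha, beta, N, R, W):
    """Iterate the XMI recursion directly until the chunks cover load W:
    start-up rounds use the 0 <= i < N equation, then steady state."""
    g = [g0]
    while sum(g) < W:
        i = len(g)
        if i < N:   # start-up: alpha cancels on both sides
            gi = sum(g[:i]) / R + i * beta + g[0]
        else:       # steady state: g_i depends on the previous N chunks
            gi = sum(g[i - N:i]) / R + N * beta - alpha
        g.append(gi)
    return g

# With alpha = beta = 0, N = 2, R = 1 the recursion is Fibonacci-like:
print(xmi_chunks(g0=1.0, alpha=0.0, beta=0.0, N=2, R=1.0, W=10.0))
# -> [1.0, 2.0, 3.0, 5.0]
```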
XMI: Still not there
- Remaining limitations
  - only applicable to homogeneous platforms
    - difficult to develop the recursion in the heterogeneous case
  - assumes 100% accurate performance prediction for both network and workers
UMR: Uniform Multi-Round
- Key issue: make the chunk size recursion simpler to solve
- Key idea: use "uniform" chunks: chunks are the same size within a round
- Homogeneous case:

[Figure: Gantt chart for workers #1, #2, #3]
UMR Recursion
- chunk_j: chunk size at round j (M rounds)
- ∀ 0 ≤ j < M-1:  α + chunk_j / S = N (chunk_{j+1} / B) + N β
  (last round: same as for XMI)
- if N S = B we obtain an arithmetic series
- if N S ≠ B we obtain a geometric series

[Figure: Gantt chart for workers #1, #2, #3]

UMR Recursion
- Solving the recursion yields every chunk_j as a function of chunk_0 [closed-form equations shown on the slide]
- Therefore we can compute all chunk sizes
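Rewriting the recursion as chunk_{j+1} = (B/N)(α + chunk_j/S - N β) makes the arithmetic/geometric behavior visible. A sketch (names are mine; the special last round is not modeled):

```python
def umr_chunks(chunk0, M, N, B, S, alpha, beta):
    """Round sizes from the UMR recursion
        alpha + chunk_j / S = N * (chunk_{j+1} / B) + N * beta,
    i.e. computing round j exactly covers sending round j+1 to all N workers."""
    chunks = [chunk0]
    for _ in range(M - 1):
        chunks.append((B / N) * (alpha + chunks[-1] / S - N * beta))
    return chunks

# With N*S != B the sizes form a geometric series with ratio B / (N*S):
print(umr_chunks(chunk0=1.0, M=3, N=2, B=4.0, S=1.0, alpha=0.0, beta=0.0))
# -> [1.0, 2.0, 4.0]
```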
UMR: Computing M
- One can compute the makespan [equation shown on the slide]
  - a 1/2 factor accounts for the last round

[Figure: Gantt chart for workers #1, #2, #3]

UMR: Computing M
- Constrained minimization problem: minimize the makespan subject to the chunks covering the total load
- Lagrange multiplier method
- Solving the Lagrange system gives [equations shown on the slide]:
  - the first equation can be solved numerically
  - we then obtain M and chunk_0
  - with the recursion we have all the chunk sizes

UMR vs. XMI

[Figure: performance comparison of UMR and XMI]
Heterogeneous UMR
- Chunks within a round are "uniform"
  - different chunk sizes
  - same processing time on all workers
- Main issue: selection and ordering of workers
  - We know that selection is NP-hard
  - We reuse a heuristic from [Beaumont'02]: sort workers by decreasing bandwidth
- Effective: within 20% of the "ideal" makespan for up to 1000-fold heterogeneity, compared to 80+% with random selection

Robust UMR (RUMR)
- All this is great, but...
- We still assume 100% accurate performance prediction for both network and workers
- RUMR: Robust UMR
Performance Prediction Errors
- Uncertainty on chunk transfer and computation times
  - they are random variables
- Several possible causes
  - undedicated platform
    - common for networks, and for CPUs on networks of workstations
  - data-dependent chunk computation
    - e.g., MPEG encoding: 10% uncertainty
    - e.g., HMMER: 9% uncertainty
    - e.g., VFleet volume rendering: 1% uncertainty
Decreasing chunk sizes
- Question: how can we account for uncertainties that disrupt the schedule?
- Factoring idea [Hummel'92]
  - start with large chunks and decrease chunk sizes
  - N chunks for 1/2 of the load, N chunks for 1/4 of the load, etc.
  - greedy allocation of chunks

[Figure: example with 3 rounds on workers #1, #2, #3: each worker grabs a round-1 chunk, then a round-2 chunk; a worker that finishes early processes 2 chunks in "round" 3]
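A sketch of the factoring allocation (the function name and the choice to lump all leftover load into the final round are mine):

```python
def factoring_chunks(W, N, rounds):
    """Factoring sketch: each round hands out N equal chunks covering half
    of the remaining load; the final round distributes what is left."""
    chunks, remaining = [], W
    for _ in range(rounds - 1):
        batch = remaining / 2.0
        chunks.append([batch / N] * N)
        remaining -= batch
    chunks.append([remaining / N] * N)   # last round: all that remains
    return chunks

# 120 load units, 3 workers, 3 rounds: 1/2 of W, then 1/4, then the rest
print(factoring_chunks(120.0, 3, 3))
# -> [[20.0, 20.0, 20.0], [10.0, 10.0, 10.0], [10.0, 10.0, 10.0]]
```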
RUMR: UMR + Factoring
- Limitation of Factoring
  - does not account for network transfers
  - starting with large chunks hinders pipelining

[Figure: chunk size vs. time for UMR (chunk sizes grow: good overlap of communication and computation) and for Factoring (chunk sizes shrink: good robustness to performance prediction errors)]
RUMR: 2 phases
- Execution is split into two phases
  - a UMR phase
  - a Factoring phase
- When should the UMR phase end?
  - large uncertainty: short UMR phase
  - small uncertainty: long UMR phase
- Heuristic: give an amount of load to each phase "proportionally" to the error
  - error defined as the standard deviation of actual/predicted times

RUMR: Evaluation [HPDC'03]

[Figure: evaluation results]
What now?
- We have addressed the four limitations of previous work on multi-round divisible load scheduling
  - affine costs
  - number of rounds
  - heterogeneous platforms
  - robustness to uncertainty
- Implemented as part of real software, for real applications, on real platforms

Lesson
- Going from theoretical scheduling to practical scheduling is ... A LOT of work
- In some cases it is worth it
  - divisible loads
- In some cases, it is unclear
  - workflow applications?
  - Should one try to reuse the DAG scheduling literature?
Lecture Outline
- What is Scheduling?
- Application Scheduling
  - State-of-the-art
  - Case study: divisible loads
- Modeling and Evaluation
- Conclusion
Platform Modeling
- Platform modeling is key to the relevance of a scheduling algorithm
- One reason published scheduling algorithms go unused is that they assume unrealistic platform models
  - simplistic models make it easier to find scheduling "results"
  - but the sad truth is that the real world is complex
- Particularly true for large-scale networks
  - we've seen latencies
  - what about bandwidth sharing, using something like TCP on wide-area networks?
Bandwidth Sharing
- Traditional assumption: fair sharing
  - good approximation for LANs
- Some people look at network paths [F. Vivien's talk this morning]
- But what about WANs?
  - That's what Grids are made of, so we should probably pay attention

[Figure: a 20 Mb/sec link shared by two connections, each getting 10 Mb/sec]

A Simple Experiment
- Sent files of 500 MB up to 1.5 GB with TTCP
- Used from 1 up to 16 simultaneous connections
- Recorded the data transfer rate per connection
Experimental Results

[Figure: normalized data rate per connection vs. number of concurrent TCP connections]
Bandwidth Sharing
- Explanations:
  - TCP congestion windows
  - backbone links carry so many connections that interference among a few additional connections is negligible
  - none of this is surprising to networking people...
- This empirical data makes it possible to develop a simple model of bandwidth sharing over single network paths, usable at the application level:
  - bw(i) = bw(1) / (1 + (i - 1) * gamma)
  - gamma = 0: ideal WAN
  - gamma = 1: ideal LAN
  - in between: fits the experimental data within 10%
Why do we care?
- It has an impact on even simple scheduling problems:
  - 20 tasks to do
  - can do them locally in 5 seconds
  - or do them remotely in 5 seconds, but pay the communication time (1000 Kb at 100 Kb/sec), with possible simultaneous connections

[Figure: experimental results (UCSB)]
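The one-parameter model from the previous slide is a one-liner. A sketch that tabulates per-connection rates for a hypothetical 100 Kb/sec path (numbers are mine):

```python
def bw(i, bw1, gamma):
    """Per-connection rate when i connections share a path.
    gamma = 0: ideal WAN (no interference); gamma = 1: ideal LAN (bw1 / i)."""
    return bw1 / (1 + (i - 1) * gamma)

for gamma in (0.0, 0.5, 1.0):
    print(gamma, [round(bw(i, 100.0, gamma), 1) for i in (1, 2, 4, 8)])
# gamma = 0.0: every connection keeps 100.0
# gamma = 1.0: fair sharing, i.e. 100.0, 50.0, 25.0, 12.5
```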
But it's more complex...
- WANs are what large-scale systems (Grids, P2P) are made of
- Therefore, we have to account for TCP WAN properties to conduct relevant scheduling research, until other protocols arise
- But we only looked at two end-points!
- We need a model of bandwidth sharing that will work over arbitrary topologies
Bandwidth Sharing
- What about more complex topologies?
- Packet-level simulation
  - too low-level to think about when designing a scheduling algorithm
  - slow simulations (e.g., NS)
- Macroscopic TCP models (it's a field in itself)
  - "fluid in pipe" analogy
  - it turns out TCP implements some type of proportional fairness

[Naïve example: a path of 10 Mb/sec links shared with cross traffic: one flow gets 8 Mb/sec on each link, the other 2 Mb/sec]

Bandwidth Sharing
- Many details are necessary for dealing with a complex topology
  - round-trip times
  - bottleneck links
  - etc.
- One can develop a fast algorithm that computes bandwidth shares according to accurate macroscopic TCP models [Kelly, Massoulié, Roberts]
- Question: what do we do with such a model?
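Computing shares under proportional fairness requires solving an optimization problem; as a simpler illustration of computing bandwidth shares over a topology, here is progressive filling for max-min fairness (a different sharing model than the one the lecture attributes to TCP; all names are mine):

```python
def max_min_shares(flows, capacity):
    """Progressive filling for max-min fair shares: grow all rates together,
    freezing a flow once one of the links on its path saturates.
    flows: list of paths (lists of link ids); capacity: link id -> Mb/sec."""
    rate = [0.0] * len(flows)
    frozen = [False] * len(flows)
    cap = dict(capacity)
    while not all(frozen):
        growing = [i for i in range(len(flows)) if not frozen[i]]
        load = {}                       # growing flows per link
        for i in growing:
            for l in flows[i]:
                load[l] = load.get(l, 0) + 1
        inc = min(cap[l] / load[l] for l in load)   # largest uniform increase
        for i in growing:
            rate[i] += inc
        for l in load:
            cap[l] -= inc * load[l]
        for i in growing:               # freeze flows whose path saturated
            if any(cap[l] <= 1e-12 for l in flows[i]):
                frozen[i] = True
    return rate

# One flow crossing links 0-1-2, one flow on link 1 only, all 10 Mb/sec;
# max-min splits the shared link evenly (proportional fairness would not):
print(max_min_shares([[0, 1, 2], [1]], {0: 10.0, 1: 10.0, 2: 10.0}))  # -> [5.0, 5.0]
```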
Modeling for Scheduling

[Figure: a physical topology with hosts and routers]

- The physical topology is complex, which means complex bandwidth sharing
- It is unlikely that one can design a scheduling algorithm that accounts for all details effectively
- It is unlikely that one can obtain all the necessary information anyway!
- Therefore, the scheduling algorithm uses an abstract model
Modeling for Scheduling

[Figure: successive abstractions of the physical topology, from the full host/router graph down to a fully connected, heterogeneous graph, and finally one with no latencies]
Major (Open) Questions
- How abstract can we get?
  - When does the abstraction lead to unreasonable schedules?
  - Different authors take different approaches
  - Depends on the scheduling problem
  - Trend: going less abstract
- How real can we get?
  - How much information are we able to provide the scheduling algorithm with?
  - There are tools for network measurements, topology discovery, etc.
- How real do we want to get?
  - It is not clear one can design a good scheduling algorithm that accounts for every router
  - Besides, information is never perfect or static, and trying to account for all details may in fact be harmful
- What is the right trade-off?
Evaluation
- How does one evaluate/compare scheduling algorithms on large-scale platforms?
- Option #1: real-world experiments
  - labor-intensive at best ("give me your platform for 10 days so that I can run experiments")
  - often unrepeatable (prevents back-to-back runs)
  - limited to the physical infrastructure (limits the statistical significance of results)
  - limited by time, since thousands of runs are needed (limits the statistical significance of results)
- One has to resort to simulation
Need for a Simulation model!
- What is the right model?
  - nobody agrees
  - validation is VERY difficult
- Close to physical:
  - Network
    - packet-level simulation (DaSSF, MicroGrid, ModelNet)
    - abstract model (SimGrid)
  - Computation
    - emulation (MicroGrid)
    - abstract model (SimGrid)
- Trade-off: accuracy vs. speed
  - not such an obvious trade-off

Two Approaches
- Scenario #1
  - fully connected, heterogeneous, no latencies
  - macroscopic bandwidth-sharing model, latencies, abstract model of computation
  - physical
- Scenario #2
  - macroscopic bandwidth-sharing model, latencies
  - packet-level simulation, emulation of computation
  - physical
Two Fundamental Questions

[Figure 1: simulation accuracy vs. simulation speed: a few data points; still far from known; a gigantic amount of work; how to validate in the real world?]

[Figure 2: application performance vs. abstraction of the platform model used by the scheduling algorithm: a few data points; still far from known; application dependent]

Many other questions
- Simulated network topologies?
  - discovered from real networks? (a lot of difficulty there)
  - topology generators (BRITE, Tiers, etc.): incomplete vision
  - several research efforts
- Fundamentally, simulation models must come from real-world data sets
  - measurements, benchmarking
  - we're doing this for desktop availability, for instance
Lecture Outline
- What is Scheduling?
- Application Scheduling
  - State-of-the-art
  - Case study: divisible loads
- Modeling and Evaluation
- Conclusion