Scheduling on Large-Scale Platforms
Henri Casanova ([email protected])
Lecture Outline
- What is Scheduling?
- Application Scheduling
  - State-of-the-art
  - Case study: divisible loads
- Modeling and Evaluation
- Conclusion
What is Scheduling?
- Context: parallel/distributed (scientific) applications that run on distributed computing platforms
  - a matrix-multiplication parallel algorithm on a cluster
  - a complex, multi-component climate modeling application on a grid that consists of 6 clusters and a data warehouse
- Scheduling: the decision process by which application components (computation, data) are assigned to resources (compute node, disk, network link)
What is Scheduling?
- Scheduling is a well-known hard problem in computer science
  - NP-hard in most instances (with a few exceptions)
- Many scheduling heuristics have been proposed
  - various application models
  - various platform models
- Most general application model: a DAG
- Most general platform model: a graph
  - Not to say it's a good model, and there are many subtleties there
Where do DAGs come from?
- Example: solving a triangular system of linear equations
- A x = b, where A is lower triangular

for i = 1 to n
    x(i) = b(i) / a(i,i)
    for j = i+1 to n
        b(j) = b(j) - a(j,i) * x(i)
Where do DAGs come from?
- Unrolling the loop shows how values flow between iterations:

i=1:  x(1) = b(1) / a(1,1)          // easily compute x(1)
      b(2) = b(2) - a(2,1) * x(1)   // remove it from
      b(3) = b(3) - a(3,1) * x(1)   // the system ...
i=2:  x(2) = b(2) / a(2,2)          // easily compute x(2)
      b(3) = b(3) - a(3,2) * x(2)   // remove it from
      b(4) = b(4) - a(4,2) * x(2)   // the system ...
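The pseudocode above is plain forward substitution. As a sketch (function and variable names are mine, not from the lecture), it can be written as runnable Python:

```python
def forward_substitution(a, b):
    """Solve A x = b for lower-triangular A, mirroring the pseudocode."""
    n = len(b)
    b = list(b)                 # work on a copy; b is updated in place
    x = [0.0] * n
    for i in range(n):
        x[i] = b[i] / a[i][i]           # Task T(i,i): compute x(i)
        for j in range(i + 1, n):
            b[j] -= a[j][i] * x[i]      # Task T(i,j): eliminate x(i) from row j
    return x

# [2 0] [x1] = [4]  ->  x1 = 2,  x2 = (8 - 1*2) / 3 = 2
# [1 3] [x2]   [8]
print(forward_substitution([[2.0, 0.0], [1.0, 3.0]], [4.0, 8.0]))  # -> [2.0, 2.0]
```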
DAG and Linear System Solve

for i = 1 to n
    Task Ti,i: x(i) = b(i) / a(i,i)
    for j = i+1 to n
        Task Ti,j: b(j) = b(j) - a(j,i) * x(i)

- Let '<' denote the sequential order: T1,1 < T1,2 < ... < T1,n < T2,2 < T2,3 < ... < Tn,n
- Independent tasks can be done in parallel
- How do we determine that tasks are independent?
Bernstein Condition
- Let In(T) be the input variables of task T
- Let Out(T) be the output variables of task T
- Two tasks T and T' are not independent when
  - In(T) ∩ Out(T') ≠ ∅,
  - or Out(T) ∩ In(T') ≠ ∅,
  - or Out(T) ∩ Out(T') ≠ ∅
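The condition is easy to check mechanically. A sketch (the encoding of a task as a pair of In/Out sets is mine):

```python
def independent(task_a, task_b):
    """Bernstein's condition: two tasks are independent iff
    In∩Out, Out∩In, and Out∩Out are all empty."""
    in_a, out_a = task_a
    in_b, out_b = task_b
    return not (in_a & out_b or out_a & in_b or out_a & out_b)

# Tasks from the triangular-solve example (i = 1):
T11 = ({"b1", "a11"}, {"x1"})        # x(1) = b(1) / a(1,1)
T12 = ({"b2", "a21", "x1"}, {"b2"})  # b(2) = b(2) - a(2,1) * x(1)
T13 = ({"b3", "a31", "x1"}, {"b3"})  # b(3) = b(3) - a(3,1) * x(1)

print(independent(T11, T12))  # False: T12 reads x1, which T11 writes
print(independent(T12, T13))  # True: both read x1, but reads do not conflict
```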
Back to the Example

for i = 1 to n
    Task Ti,i: x(i) = b(i) / a(i,i)
    for j = i+1 to n
        Task Ti,j: b(j) = b(j) - a(j,i) * x(i)

- Task Ti,i:
  - In(Ti,i) = {b(i), a(i,i)}
  - Out(Ti,i) = {x(i)}
- Task Ti,j, for j > i:
  - In(Ti,j) = {b(j), a(j,i), x(i)}
  - Out(Ti,j) = {b(j)}
- Dependences?
  - Out(Ti,i) ∩ In(Ti,j) = {x(i)}, for all j > i
  - Out(Ti,j) ∩ Out(Ti+1,j) = {b(j)}, for all j > i+1
Back to the Example

for i = 1 to 5
    Task Ti,i: x(i) = b(i) / a(i,i)
    for j = i+1 to 5
        Task Ti,j: b(j) = b(j) - a(j,i) * x(i)

[Figure: the resulting DAG, obtained from instruction-level analysis: root T1,1 with successors T1,2 ... T1,5; each Ti,i feeds the tasks Ti,j for j > i; end node T5,5]
More DAGs
- A DAG (or Task Graph) may just be inherent to an application at a higher level
- Workflow applications explicitly couple components together into a "coarse-grain" DAG
  - image processing
  - climate modeling
  - database operations
Scheduling Problem
- Although there is a lot of formalism, it comes down to putting slots into Gantt charts

[Figure: Gantt chart of tasks T1 through T7, with dependencies (no transfers), placed over time on three identical processors P1, P2, P3; the chart shows a task's weight (e.g., w(T6)) and the resulting makespan]
DAG Scheduling Problem
- Theorem: there exists a valid schedule iff the DAG has no cycles
- Critical path:
  - Let X be a path in the DAG from the root to the end node, and let F(X) be the sum of all node weights along the path
  - The makespan is larger than F(X) for all paths X
  - The path that maximizes F(X) is called the critical path
- If we have an infinite number of processors, then "schedule as early as possible" is optimal
- Otherwise, it's NP-hard
  - NP-hard even on 2 identical processors
- Things can get more complicated
  - cost of communication over some network
  - heterogeneous processors
- People have proposed many heuristics
- One common approach: list scheduling
  - compute task "priorities" and schedule tasks in order of these priorities
  - priorities may, for instance, be higher for tasks on the critical path
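A minimal list-scheduling sketch (task weights and helper names are mine; identical processors, communication costs ignored). Priorities are "bottom levels": a task's weight plus the heaviest path below it, which is largest along the critical path:

```python
def bottom_level(tasks, succs):
    """Bottom level of each task: its weight plus the heaviest path below it.
    With positive weights, decreasing bottom level is a valid topological
    order, and the maximum value is the critical-path length."""
    bl = {}
    def rec(t):
        if t not in bl:
            bl[t] = tasks[t] + max((rec(s) for s in succs.get(t, [])), default=0.0)
        return bl[t]
    for t in tasks:
        rec(t)
    return bl

def list_schedule(tasks, succs, n_procs):
    """Greedy list scheduling on identical processors; returns the makespan."""
    preds = {t: [] for t in tasks}
    for t, ss in succs.items():
        for s in ss:
            preds[s].append(t)
    bl = bottom_level(tasks, succs)
    avail = [0.0] * n_procs                          # next free time per processor
    finish = {}
    for t in sorted(tasks, key=lambda u: -bl[u]):    # highest priority first
        est = max((finish[p] for p in preds[t]), default=0.0)
        i = min(range(n_procs), key=lambda k: max(avail[k], est))
        finish[t] = max(avail[i], est) + tasks[t]
        avail[i] = finish[t]
    return max(finish.values())

# Diamond DAG, unit weights: T1 -> {T2, T3} -> T4; critical path = 3
tasks = {"T1": 1.0, "T2": 1.0, "T3": 1.0, "T4": 1.0}
succs = {"T1": ["T2", "T3"], "T2": ["T4"], "T3": ["T4"]}
print(list_schedule(tasks, succs, 2))  # -> 3.0
```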
Job Scheduling
- So far we've talked about a single application
- But in many cases, resources are shared by multiple users
  - "jobs" arrive at different times
- The performance metric should be some type of aggregate, rather than the makespan
  - minimize average "turn-around time"
  - minimize average "slow-down"
    - slow-down is "time in the system / time in the system if alone in the system"
  - minimize maximum "slow-down"
  - try to enforce some notion of fairness
- There is a wealth of complexity results, as well as open problems
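These metrics are straightforward to compute from a job trace. A sketch with made-up numbers (names are mine):

```python
def metrics(jobs, finish_times):
    """jobs: list of (arrival, time-if-alone); finish_times: completion times
    under some schedule. Returns (avg turn-around, avg slow-down, max slow-down)."""
    turnaround = [f - a for (a, _), f in zip(jobs, finish_times)]
    slowdown = [t / alone for t, (_, alone) in zip(turnaround, jobs)]
    n = len(jobs)
    return sum(turnaround) / n, sum(slowdown) / n, max(slowdown)

# Three jobs on one machine, FCFS: all arrive at 0, alone times 10, 1, 1.
# FCFS finishes them at 10, 11, 12: the short jobs see slow-downs of 11 and 12.
print(metrics([(0, 10), (0, 1), (0, 1)], [10, 11, 12]))  # -> (11.0, 8.0, 12.0)
```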
Scheduling in Practice
- Very large literature
- Very large gap between theory and practice
- Application scheduling:
  - Most published algorithms make unrealistic assumptions, especially when looking at large-scale Grid platforms
  - Therefore, they are almost never used
  - Question: should one just give up and use simple (e.g., greedy) approaches, because there is no hope of doing anything clever in practice?
- Job scheduling:
  - State of the art: batch schedulers (PBS, LQS, LoadLeveler, SGE)
  - NOT designed with user-centric metrics in mind (e.g., resource utilization vs. average slowdown)
  - Very rigid request model: "I want N processors for H hours"
  - Question: should we have more flexible interfaces to job schedulers, and should metrics be more user-centric?
Lecture Outline
- What is Scheduling?
- Application Scheduling
  - State-of-the-art
  - Case study: divisible loads
- Modeling and Evaluation
- Conclusion
Application Scheduling
- Context: "Grid" platforms
  - The application should run on multiple resources over the wide area
- The "scheduler" component is present in many systems
  - But: "we'll put something clever in there at some point"
- Why is practical application scheduling difficult?
  - "The algorithm in paper A assumes homogeneous resources"
  - "The algorithm in paper B assumes a fully-connected network with no bandwidth sharing"
  - "The algorithm in paper C assumes that all resources are stable throughout application execution"
  - "The algorithm in paper D assumes that we know everything about the application"
  - "The algorithm in paper E assumes that there are no latencies"
- Assumptions in the literature do not match the real world
Lecture Outline
- What is Scheduling?
- Application Scheduling
  - State-of-the-art
  - Case study: divisible loads
- Modeling and Evaluation
- Conclusion
Divisible Load Applications
- Definition
  - The application consists of "many" independent elemental tasks
  - Each task has a data load and a computation load
  - The application is approximated as a continuous load W, measured in "load units"
  - The amount of data to transfer is significant (i.e., the application is not purely compute-intensive)
- Application examples
  - MPEG encoding: 1 load unit = 1 frame
  - Bioinformatics: 1 load unit = 1 sequence
  - Volume rendering: 1 load unit = 1 voxel
  - Data mining, image processing, file compression, database joins, genetic algorithms, etc.
- Great candidates for Grid deployment
  - No dependencies (MUCH simpler than DAGs)
  - Should be deployable on a large scale
  - Like a data-intensive SETI@home
Divisible Load Scheduling
- The DLS problem:
  - pick a number of chunks
  - pick a size for each chunk
  - decide which chunk goes to which compute resource, and when
- Variations
  - Objective
    - minimize the "makespan"
    - maximize steady-state throughput
  - Compute and network resources
    - homogeneous
    - heterogeneous
  - Topology
    - linear array, star, tree, hypercube, mesh, ...
Single-Round DLS
- N workers, N "chunks"

[Figure: star topology (a master connected to three workers) and the corresponding Gantt chart, showing each worker's chunk transfer followed by its chunk computation]

- Some assumptions
  - Communication from the master is serialized
  - Communication and computation "costs" are commensurate to chunk size (in load units)
Single-Round DLS
- Optimal schedule: chunk sizes are chosen so that all workers finish at the same time

[Figure: Gantt chart in which each worker sits idle while earlier workers' chunks are being transferred]

- Drawback: idle time, the most common bane of parallel application performance
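Under the slide's assumptions (serialized communication from the master, linear costs, identical workers), requiring that consecutive workers finish simultaneously forces a geometric decrease in chunk sizes with ratio B / (B + S), where B is the transfer rate and S the compute rate in load units per second. A sketch (function name and parameters are mine):

```python
def single_round_chunks(W, N, B, S):
    """Chunk sizes (in load units) so that all N workers finish together:
    worker k's compute time must cover worker k+1's transfer plus compute,
    which gives chunk_{k+1} = chunk_k * B / (B + S)."""
    q = B / (B + S)
    raw = [q ** k for k in range(N)]
    scale = W / sum(raw)               # normalize so the chunks sum to W
    return [scale * r for r in raw]

# With B = S, each worker gets half the previous worker's chunk:
print([round(c, 2) for c in single_round_chunks(100.0, 3, 10.0, 10.0)])
# -> [57.14, 28.57, 14.29]
```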
Multi-Round DLS
- N workers, M×N chunks (M: number of rounds)

[Figure: Gantt chart in which smaller chunks at the start and end of the schedule shrink the idle time]

- Increasing and then decreasing chunk sizes
- Well-known "pipelining" technique
- Additional assumption
  - overlap of communication and computation at each worker

Multi-Installment (MI) Algorithm
- Multi-installment algorithm [Veeravalli'95]

[Figure: Gantt chart for workers #1, #2, #3]

- A worker finishes receiving the data for a chunk exactly when the computation for that chunk can start
  - leads to a "tight" schedule: no gaps in network or worker utilization
- Induction on chunk sizes to compute the schedule
MI for Large-Scale Platforms?
- Assumes linear costs
  - Tcomm = chunk_size / transfer_rate (in load units)
  - Tcomp = chunk_size / compute_rate (in load units)
- Assumes a homogeneous platform
  - homogeneous network and workers
- Assumes 100% accurate performance prediction for both network and workers
- Conclusion: MI is not applicable to large-scale platforms
Affine Costs
- Known to be more realistic than linear costs
- Tcomm = β + g / B
  - β: overhead of establishing a connection, network latency, authentication/authorization, etc.
  - Can be significant on wide-area platforms (e.g., TeraGrid) using Grid middleware (e.g., GridFTP): up to several seconds
- Tcomp = α + g / S
  - α: overhead of authentication/authorization, process initiation, etc.
  - In practice α can be several seconds
  - Grid benchmarking projects report high values (up to 45 seconds on some production Grid platforms)
XMI: eXtended Multi-Installment

[Figure: Gantt chart on a homogeneous platform, with chunks numbered 0 through 8 dispatched in round-robin fashion to workers #1, #2, #3]
XMI: eXtended Multi-Installment
- Recursion (R = B / S):
  - ∀ i ≥ N:      α + g_i = (g_{i-1} + g_{i-2} + … + g_{i-N}) / R + N β
  - ∀ 0 ≤ i < N:  α + g_i = (g_{i-1} + … + g_0) / R + i β + g_0 + α
  - ∀ i < 0:      g_i = 0

[Figure: the same Gantt chart, annotated with the chunk indices 0 through 8]
Solving the XMI recursion
- How to solve
  - ∀ i ≥ N:      α + g_i = (g_{i-1} + g_{i-2} + … + g_{i-N}) / R + N β
  - ∀ 0 ≤ i < N:  α + g_i = (g_{i-1} + … + g_0) / R + i β + g_0 + α
  - ∀ i < 0:      g_i = 0
- Generating function: G(x) = ∑_{i=0}^{∞} g_i x^i
- Compute G(x)?
  - sum over the equations above
  - property: ∑_{i=0}^{n} g_i is the nth coefficient of G(x) / (1 - x)

G(x) = [ g_0 (1 - x^N) + α x^N + β (x + x^2 + … + x^N) ] / [ (1 - x) - x (1 - x^N) / R ]

Solving the XMI recursion
- The problem is reduced to finding the coefficients of G(x)
- Simple rational expansion theorem [GKP'94]
  - roots of the denominator polynomial
  - only works if all roots have multiplicity 1
    - true if R ≠ N
    - if R = N, it turns out one can use the general theorem
- Roots can be computed numerically
- We obtain the chunk sizes as linear combinations of geometric series:

g_i = g_0 ∑_{j=0}^{N} η_j θ_j^i + ∑_{j=0}^{N} ξ_j θ_j^i    (θ_j: inverse of the jth root)
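Rather than inverting the generating function, the recursion can also simply be iterated numerically. A sketch (the stopping rule and parameter names are my additions; the last-round adjustment from the lecture is omitted, and α is assumed small enough that chunk sizes stay positive):

```python
def xmi_chunks(g0, alpha, beta, N, R, W):
    """Iterate the XMI recursion directly until the chunks cover load W:
    start-up rounds use the 0 <= i < N equation, then steady state."""
    g = [g0]
    while sum(g) < W:
        i = len(g)
        if i < N:   # start-up: alpha cancels on both sides
            gi = sum(g[:i]) / R + i * beta + g[0]
        else:       # steady state: g_i depends on the previous N chunks
            gi = sum(g[i - N:i]) / R + N * beta - alpha
        g.append(gi)
    return g

# With alpha = beta = 0, N = 2, R = 1 the recursion is Fibonacci-like:
print(xmi_chunks(g0=1.0, alpha=0.0, beta=0.0, N=2, R=1.0, W=10.0))
# -> [1.0, 2.0, 3.0, 5.0]
```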
XMI: Still not there
- Remaining limitations
  - only applicable to homogeneous platforms
    - difficult to develop the recursion in the heterogeneous case
  - assumes 100% accurate performance prediction for both network and workers
UMR: Uniform Multi-Round
- Key issue: make the chunk size recursion simpler to solve
- Key idea: use "uniform" chunks: chunks are the same size within a round
- Homogeneous case:

[Figure: Gantt chart for workers #1, #2, #3]
UMR Recursion
- chunk_j: chunk size at round j (M rounds)
- ∀ 0 ≤ j < M-1:  α + chunk_j / S = N (chunk_{j+1} / B) + N β
  (last round: same as for XMI)
- if N S = B we obtain an arithmetic series
- if N S ≠ B we obtain a geometric series

[Figure: Gantt chart for workers #1, #2, #3]

UMR Recursion
- Solving the recursion yields every chunk_j as a function of chunk_0 [closed-form equations shown on the slide]
- Therefore we can compute all chunk sizes
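Rewriting the recursion as chunk_{j+1} = (B/N)(α + chunk_j/S - N β) makes the arithmetic/geometric behavior visible. A sketch (names are mine; the special last round is not modeled):

```python
def umr_chunks(chunk0, M, N, B, S, alpha, beta):
    """Round sizes from the UMR recursion
        alpha + chunk_j / S = N * (chunk_{j+1} / B) + N * beta,
    i.e. computing round j exactly covers sending round j+1 to all N workers."""
    chunks = [chunk0]
    for _ in range(M - 1):
        chunks.append((B / N) * (alpha + chunks[-1] / S - N * beta))
    return chunks

# With N*S != B the sizes form a geometric series with ratio B / (N*S):
print(umr_chunks(chunk0=1.0, M=3, N=2, B=4.0, S=1.0, alpha=0.0, beta=0.0))
# -> [1.0, 2.0, 4.0]
```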
UMR: Computing M
- One can compute the makespan [equation shown on the slide]
  - a 1/2 factor accounts for the last round

[Figure: Gantt chart for workers #1, #2, #3]

UMR: Computing M
- Constrained minimization problem: minimize the makespan subject to the chunks covering the total load
- Lagrange multiplier method
- Solving the Lagrange system gives [equations shown on the slide]:
  - the first equation can be solved numerically
  - we then obtain M and chunk_0
  - with the recursion we have all the chunk sizes

UMR vs. XMI

[Figure: performance comparison of UMR and XMI]
Heterogeneous UMR
- Chunks within a round are "uniform"
  - different chunk sizes
  - same processing time on all workers
- Main issue: selection and ordering of workers
  - We know that selection is NP-hard
  - We reuse a heuristic from [Beaumont'02]: sort workers by decreasing bandwidth
- Effective: within 20% of the "ideal" makespan for up to 1000-fold heterogeneity, compared to 80+% with random selection

Robust UMR (RUMR)
- All this is great, but...
- We still assume 100% accurate performance prediction for both network and workers
- RUMR: Robust UMR
Performance Prediction Errors
- Uncertainty on chunk transfer and computation times
  - they are random variables
- Several possible causes
  - undedicated platform
    - common for networks, and for CPUs on networks of workstations
  - data-dependent chunk computation
    - e.g., MPEG encoding: 10% uncertainty
    - e.g., HMMER: 9% uncertainty
    - e.g., VFleet volume rendering: 1% uncertainty
Decreasing chunk sizes
- Question: how can we account for uncertainties that disrupt the schedule?
- Factoring idea [Hummel'92]
  - start with large chunks and decrease chunk sizes
  - N chunks for 1/2 of the load, N chunks for 1/4 of the load, etc.
  - greedy allocation of chunks

[Figure: example with 3 rounds on workers #1, #2, #3: each worker grabs a round-1 chunk, then a round-2 chunk; a worker that finishes early processes 2 chunks in "round" 3]
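A sketch of the factoring allocation (the function name and the choice to lump all leftover load into the final round are mine):

```python
def factoring_chunks(W, N, rounds):
    """Factoring sketch: each round hands out N equal chunks covering half
    of the remaining load; the final round distributes what is left."""
    chunks, remaining = [], W
    for _ in range(rounds - 1):
        batch = remaining / 2.0
        chunks.append([batch / N] * N)
        remaining -= batch
    chunks.append([remaining / N] * N)   # last round: all that remains
    return chunks

# 120 load units, 3 workers, 3 rounds: 1/2 of W, then 1/4, then the rest
print(factoring_chunks(120.0, 3, 3))
# -> [[20.0, 20.0, 20.0], [10.0, 10.0, 10.0], [10.0, 10.0, 10.0]]
```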
RUMR: UMR + Factoring
- Limitation of Factoring
  - does not account for network transfers
  - starting with large chunks hinders pipelining

[Figure: chunk size vs. time for UMR (chunk sizes grow: good overlap of communication and computation) and for Factoring (chunk sizes shrink: good robustness to performance prediction errors)]
RUMR: 2 phases
- Execution is split into two phases
  - a UMR phase
  - a Factoring phase
- When should the UMR phase end?
  - large uncertainty: short UMR phase
  - small uncertainty: long UMR phase
- Heuristic: give an amount of load to each phase "proportionally" to the error
  - error defined as the standard deviation of actual/predicted times

RUMR: Evaluation [HPDC'03]

[Figure: evaluation results]
What now?
- We have addressed the four limitations of previous work on multi-round divisible load scheduling
  - affine costs
  - number of rounds
  - heterogeneous platforms
  - robustness to uncertainty
- Implemented as part of real software, for real applications, on real platforms

Lesson
- Going from theoretical scheduling to practical scheduling is ... A LOT of work
- In some cases it is worth it
  - divisible loads
- In some cases, it is unclear
  - workflow applications?
  - Should one try to reuse the DAG scheduling literature?
Lecture Outline
- What is Scheduling?
- Application Scheduling
  - State-of-the-art
  - Case study: divisible loads
- Modeling and Evaluation
- Conclusion
Platform Modeling
- Platform modeling is key to the relevance of a scheduling algorithm
- One reason published scheduling algorithms go unused is that they assume unrealistic platform models
  - simplistic models make it easier to find scheduling "results"
  - but the sad truth is that the real world is complex
- Particularly true for large-scale networks
  - we've seen latencies
  - what about bandwidth sharing, using something like TCP on wide-area networks?
Bandwidth Sharing
- Traditional assumption: fair sharing
  - good approximation for LANs
- Some people look at network paths [F. Vivien's talk this morning]
- But what about WANs?
  - That's what Grids are made of, so we should probably pay attention

[Figure: a 20 Mb/sec link shared by two connections, each getting 10 Mb/sec]

A Simple Experiment
- Sent files of 500 MB up to 1.5 GB with TTCP
- Used from 1 up to 16 simultaneous connections
- Recorded the data transfer rate per connection
Experimental Results

[Figure: normalized data rate per connection vs. number of concurrent TCP connections]
Bandwidth Sharing
- Explanations:
  - TCP congestion windows
  - backbone links carry so many connections that interference among a few additional connections is negligible
  - none of this is surprising to networking people...
- This empirical data makes it possible to develop a simple model of bandwidth sharing over single network paths, usable at the application level:
  - bw(i) = bw(1) / (1 + (i - 1) * gamma)
  - gamma = 0: ideal WAN
  - gamma = 1: ideal LAN
  - in between: fits the experimental data within 10%
Why do we care?
- It has an impact on even simple scheduling problems:
  - 20 tasks to do
  - can do them locally in 5 seconds
  - or do them remotely in 5 seconds, but pay the communication time (1000 Kb at 100 Kb/sec), with possible simultaneous connections

[Figure: experimental results (UCSB)]
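The one-parameter model from the previous slide is a one-liner. A sketch that tabulates per-connection rates for a hypothetical 100 Kb/sec path (numbers are mine):

```python
def bw(i, bw1, gamma):
    """Per-connection rate when i connections share a path.
    gamma = 0: ideal WAN (no interference); gamma = 1: ideal LAN (bw1 / i)."""
    return bw1 / (1 + (i - 1) * gamma)

for gamma in (0.0, 0.5, 1.0):
    print(gamma, [round(bw(i, 100.0, gamma), 1) for i in (1, 2, 4, 8)])
# gamma = 0.0: every connection keeps 100.0
# gamma = 1.0: fair sharing, i.e. 100.0, 50.0, 25.0, 12.5
```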
But it's more complex...
- WANs are what large-scale systems (Grids, P2P) are made of
- Therefore, we have to account for TCP WAN properties to conduct relevant scheduling research, until other protocols arise
- But we only looked at two end-points!
- We need a model of bandwidth sharing that will work over arbitrary topologies
Bandwidth Sharing
- What about more complex topologies?
- Packet-level simulation
  - too low-level to think about when designing a scheduling algorithm
  - slow simulations (e.g., NS)
- Macroscopic TCP models (it's a field in itself)
  - "fluid in pipe" analogy
  - it turns out TCP implements some type of proportional fairness

[Naïve example: a path of 10 Mb/sec links shared with cross traffic: one flow gets 8 Mb/sec on each link, the other 2 Mb/sec]

Bandwidth Sharing
- Many details are necessary for dealing with a complex topology
  - round-trip times
  - bottleneck links
  - etc.
- One can develop a fast algorithm that computes bandwidth shares according to accurate macroscopic TCP models [Kelly, Massoulié, Roberts]
- Question: what do we do with such a model?
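Computing shares under proportional fairness requires solving an optimization problem; as a simpler illustration of computing bandwidth shares over a topology, here is progressive filling for max-min fairness (a different sharing model than the one the lecture attributes to TCP; all names are mine):

```python
def max_min_shares(flows, capacity):
    """Progressive filling for max-min fair shares: grow all rates together,
    freezing a flow once one of the links on its path saturates.
    flows: list of paths (lists of link ids); capacity: link id -> Mb/sec."""
    rate = [0.0] * len(flows)
    frozen = [False] * len(flows)
    cap = dict(capacity)
    while not all(frozen):
        growing = [i for i in range(len(flows)) if not frozen[i]]
        load = {}                       # growing flows per link
        for i in growing:
            for l in flows[i]:
                load[l] = load.get(l, 0) + 1
        inc = min(cap[l] / load[l] for l in load)   # largest uniform increase
        for i in growing:
            rate[i] += inc
        for l in load:
            cap[l] -= inc * load[l]
        for i in growing:               # freeze flows whose path saturated
            if any(cap[l] <= 1e-12 for l in flows[i]):
                frozen[i] = True
    return rate

# One flow crossing links 0-1-2, one flow on link 1 only, all 10 Mb/sec;
# max-min splits the shared link evenly (proportional fairness would not):
print(max_min_shares([[0, 1, 2], [1]], {0: 10.0, 1: 10.0, 2: 10.0}))  # -> [5.0, 5.0]
```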
Modeling for Scheduling

[Figure: a physical topology with hosts and routers]

- The physical topology is complex, which means complex bandwidth sharing
- It is unlikely that one can design a scheduling algorithm that accounts for all details effectively
- It is unlikely that one can obtain all the necessary information anyway!
- Therefore, the scheduling algorithm uses an abstract model
Modeling for Scheduling

[Figure: successive abstractions of the physical topology, from the full host/router graph down to a fully connected, heterogeneous graph, and finally one with no latencies]
Major (Open) Questions
- How abstract can we get?
  - When does the abstraction lead to unreasonable schedules?
  - Different authors take different approaches
  - Depends on the scheduling problem
  - Trend: going less abstract
- How real can we get?
  - How much information are we able to provide the scheduling algorithm with?
  - There are tools for network measurements, topology discovery, etc.
- How real do we want to get?
  - It is not clear one can design a good scheduling algorithm that accounts for every router
  - Besides, information is never perfect or static, and trying to account for all details may in fact be harmful
- What is the right trade-off?
Evaluation
- How does one evaluate/compare scheduling algorithms on large-scale platforms?
- Option #1: real-world experiments
  - labor-intensive at best ("give me your platform for 10 days so that I can run experiments")
  - often unrepeatable (prevents back-to-back runs)
  - limited to the physical infrastructure (limits the statistical significance of results)
  - limited by time, since thousands of runs are needed (limits the statistical significance of results)
- One has to resort to simulation
Need for a Simulation model!
- What is the right model?
  - nobody agrees
  - validation is VERY difficult
- Close to physical:
  - Network
    - packet-level simulation (DaSSF, MicroGrid, ModelNet)
    - abstract model (SimGrid)
  - Computation
    - emulation (MicroGrid)
    - abstract model (SimGrid)
- Trade-off: accuracy vs. speed
  - not such an obvious trade-off

Two Approaches
- Scenario #1
  - fully connected, heterogeneous, no latencies
  - macroscopic bandwidth-sharing model, latencies, abstract model of computation
  - physical
- Scenario #2
  - macroscopic bandwidth-sharing model, latencies
  - packet-level simulation, emulation of computation
  - physical
Two Fundamental Questions

[Figure 1: simulation accuracy vs. simulation speed: a few data points; still far from known; a gigantic amount of work; how to validate in the real world?]

[Figure 2: application performance vs. abstraction of the platform model used by the scheduling algorithm: a few data points; still far from known; application dependent]

Many other questions
- Simulated network topologies?
  - discovered from real networks? (a lot of difficulty there)
  - topology generators (BRITE, Tiers, etc.): incomplete vision
  - several research efforts
- Fundamentally, simulation models must come from real-world data sets
  - measurements, benchmarking
  - we're doing this for desktop availability, for instance
Lecture Outline
- What is Scheduling?
- Application Scheduling
  - State-of-the-art
  - Case study: divisible loads
- Modeling and Evaluation
- Conclusion