
Dynamic Multi Phase Scheduling for Heterogeneous Clusters

Florina M. Ciorba†, Theodore Andronikos†, Ioannis Riakiotakis†,

Anthony T. Chronopoulos‡ and George Papakonstantinou†

† National Technical University of Athens

Computing Systems Laboratory

‡ University of Texas at San Antonio

cflorina@cslab.ece.ntua.gr

www.cslab.ece.ntua.gr

20th International Parallel and Distributed Processing Symposium

25-29 April 2006

April 27, 2006 IPDPS 2006 2

Outline

• Introduction

• Notation

• Some existing self-scheduling algorithms

• Dynamic self-scheduling for dependence loops

• Implementation and test results

• Conclusions

• Future work


Introduction

Motivation for dynamically scheduling loops with dependencies:

• Existing dynamic algorithms cannot cope with dependencies, because they lack inter-slave communication

• Static algorithms are not always efficient

• In their original form, if dynamic algorithms are applied to loops with dependencies, they yield a serial/invalid execution


Notation

Algorithmic model:

    FOR (i1=l1; i1<=u1; i1++)
      FOR (i2=l2; i2<=u2; i2++)
        …
          FOR (in=ln; in<=un; in++)
            Loop Body
          ENDFOR
        …
      ENDFOR
    ENDFOR

• Perfectly nested loops

• Constant flow data dependencies

• General program statements within the loop body

• J – index space of an n-dimensional uniform dependence loop

$J = \{\, j \in \mathbb{N}^n \mid l_r \le i_r \le u_r,\ 1 \le r \le n \,\}$
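
As an illustration of the index space J and of uniform dependence vectors, the following Python sketch enumerates J for a small 2-dimensional loop and lists the iterations that a given point depends on. The function names, the loop bounds and the dependence vectors (1,0) and (0,1) are assumptions chosen for the example; they do not come from the slides.

```python
from itertools import product


def index_space(lower, upper):
    """Enumerate every point j = (i1, ..., in) with l_r <= i_r <= u_r."""
    ranges = [range(l, u + 1) for l, u in zip(lower, upper)]
    return list(product(*ranges))


def predecessors(point, dependence_vectors, lower, upper):
    """Return the points inside J that `point` depends on (point - d)."""
    result = []
    for d in dependence_vectors:
        source = tuple(p - x for p, x in zip(point, d))
        inside = all(l <= s <= u for s, l, u in zip(source, lower, upper))
        if inside:
            result.append(source)
    return result


# Hypothetical 2-dimensional loop: 1 <= i1 <= 3, 1 <= i2 <= 4.
LOWER, UPPER = (1, 1), (3, 4)
DEPENDENCE_VECTORS = [(1, 0), (0, 1)]  # assumed for illustration only

J = index_space(LOWER, UPPER)
print(len(J), predecessors((2, 3), DEPENDENCE_VECTORS, LOWER, UPPER))
```

Points on the lower boundary of J have fewer (or no) predecessors, which is why the first chunk can start without waiting for data from another slave.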

April 27, 2006 IPDPS 2006 6

Notation

• u1 – synchronization dimension, un – scheduling dimension

• $DS = \{d_1, \ldots, d_p\}$ – set of dependence vectors

• PE – processing element

• P1,...,Pm – slaves

• N – number of scheduling steps

• Ci – chunk size at the i-th scheduling step

• Vi – size (iteration-wise) of Ci along scheduling dimension un

• VPk – virtual computing power of slave Pk

• Qk – number of processes in the run-queue of slave Pk

• $A_k = VP_k / Q_k$ – available computing power of slave Pk

• $A = \sum_{k=1}^{m} A_k$ – total available computing power of the cluster
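
The two power definitions above can be sketched in Python as follows. The function names and the sample values are hypothetical, and the sketch assumes that Qk counts the scheduled loop process itself, so that Qk ≥ 1.

```python
def available_power(virtual_power, run_queue_length):
    """A_k = VP_k / Q_k for one slave (assumes Q_k >= 1)."""
    if run_queue_length < 1:
        raise ValueError("run-queue length must be at least 1")
    return virtual_power / run_queue_length


def total_available_power(slaves):
    """A = sum of A_k over all (VP_k, Q_k) pairs."""
    return sum(available_power(vp, q) for vp, q in slaves)


# Hypothetical cluster: two unloaded slaves and one slave that
# shares its CPU with a second process (Q_k = 2).
cluster = [(1.5, 1), (0.5, 1), (1.5, 2)]
print(total_available_power(cluster))
```

A slave whose run-queue doubles sees its Ak halved, which is the situation described later for the non-dedicated experiments.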


Some existing self-scheduling algorithms

3 self-scheduling algorithms:

• CSS – Chunk Self-Scheduling: Ci = constant

• TSS – Trapezoid Self-Scheduling: Ci = Ci-1 – D, where D is the decrement, the first chunk is F = |J|/(2×m) and the last chunk is L = 1

• DTSS – Distributed TSS: Ci = Ci-1 – D, where D is the decrement, the first chunk is F = |J|/(2×A) and the last chunk is L = 1

CSS and TSS are devised for homogeneous systems.

DTSS improves on TSS for heterogeneous systems by selecting the chunk sizes according to:

• the virtual computational power of the slaves, VPk

• the number of processes in the run-queue of each PE, Qk
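
The chunk-size rules for CSS and TSS can be sketched in Python as follows. The slides give only F, L and the recurrence Ci = Ci-1 – D; the number of steps N = ⌈2I/(F+L)⌉ and the decrement D = ⌊(F–L)/(N–1)⌋ used here follow the usual TSS formulation and are assumptions with respect to this talk. The function names and the example values (I = 5000 iterations along the scheduling dimension, 9 workers, a CSS chunk of 300) are likewise assumptions.

```python
import math


def css_chunks(total, chunk):
    """Chunk Self-Scheduling: constant chunks, the last one may be smaller."""
    sizes, remaining = [], total
    while remaining > 0:
        size = min(chunk, remaining)
        sizes.append(size)
        remaining -= size
    return sizes


def tss_chunks(total, workers, last=1):
    """Trapezoid Self-Scheduling: chunks shrink linearly from F towards L."""
    first = max(total // (2 * workers), last)            # F = I / (2m)
    steps = math.ceil(2 * total / (first + last))        # N (assumed formula)
    decrement = (first - last) // (steps - 1) if steps > 1 else 0  # D
    sizes, remaining, size = [], total, first
    while remaining > 0:
        current = min(max(size, last), remaining)  # never below L or above what is left
        sizes.append(current)
        remaining -= current
        size -= decrement
    return sizes


print(css_chunks(5000, 300))
print(tss_chunks(5000, 9))
```

With these assumed values, `tss_chunks(5000, 9)` starts at 277 and decreases by 7 per step; the final chunk is whatever remains. DTSS is not sketched here because the slides do not state how each slave's Ak enters the per-request chunk size.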

April 27, 2006 IPDPS 2006 9

Some existing self-scheduling algorithms

|J| = 5000×10000, m = 10 slaves

Algorithm: Chunk sizes

CSS: 300 300 300 300 300 300 300 300 300 300 300 300 300 300 300 300 200

TSS: 277 270 263 256 249 242 235 228 221 214 207 200 193 186 179 172 165 158 151 144 137 130 123 116 109 102 73

DTSS (dedicated): 392 253 368 237 344 221 108 211 103 300 192 276 176 176 252 160 77 149 72 207 130 183 114 159 98 46 87 41 44

DTSS (non-dedicated): 263 383 369 355 229 112 219 107 209 203 293 279 265 169 33 96 46 89 86 83 80 77 74 24 69 66 31 59 56 53 50 47 44 20 39 20 33 30 27 24 21 20 20 20 20 20 20 20 20 8

• CSS and TSS each give the same chunk sizes in both dedicated and non-dedicated systems

• DTSS adjusts the chunk sizes to match the different Ak of the slaves

April 27, 2006 IPDPS 2006 10

• Notation

• Conclusions

• Future work

April 27, 2006 IPDPS 2006 11

More notation

• SP – synchronization point

• M – number of SPs inserted along synchronization dimension u1

• H – interval (iteration-wise) between two SPs along u1

• H is the same for every chunk

• SCi,j – the set of iterations of Ci between SPj-1 and SPj

• Ci = Vi × M × H

• Current slave – the slave assigned chunk Ci

• Previous slave – the slave assigned chunk Ci-1
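
A small Python sketch of how M and the chunk size follow from H. It assumes that H divides the extent of the synchronization dimension exactly, so that Ci = Vi × M × H holds without a remainder; the function names and the numbers are hypothetical.

```python
def sync_points(u1_extent, h):
    """Return M and the positions of the SPs along u1 for interval H."""
    if h <= 0 or u1_extent % h != 0:
        raise ValueError("this sketch assumes H > 0 and H divides the u1 extent")
    m = u1_extent // h
    positions = [j * h for j in range(1, m + 1)]
    return m, positions


def chunk_iterations(v_i, m, h):
    """C_i = V_i * M * H: total number of iterations in chunk i."""
    return v_i * m * h


# Hypothetical values: 10000 iterations along u1, H = 100, V_i = 277.
m, positions = sync_points(10000, 100)
print(m, positions[:3], chunk_iterations(277, m, 100))
```

A smaller H gives more synchronization points (more, smaller messages); a larger H gives fewer points but makes the next slave wait longer before it can start.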

April 27, 2006 IPDPS 2006 12

Self-scheduling with synchronization

• Chunks are formed along scheduling dimension, here say u2

• SPs are inserted along synchronization dimension, u1

• Phase 1: Apply self-scheduling algorithms to the scheduling dimension

• Phase 2: Insert synchronization points along the synchronization dimension

April 27, 2006 IPDPS 2006 13

The inter-slave communication scheme

• Ci-1 is assigned to Pk-1, Ci assigned to Pk and Ci+1 to Pk+1

• When Pk reaches SPj+1, it sends to Pk+1 only the data Pk+1 requires (i.e., those iterations imposed by the existing dependence vectors)

• Afterwards, Pk receives from Pk-1 the data required for the current computation

Slaves do not reach an SP at the same time, which leads to a wavefront-style execution

[Figure: wavefront execution across chunks. Legend: communication set; set of points computed at moment t; set of points computed at moment t+1; arrows indicate communication; auxiliary explanations.]
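
The wavefront order can be illustrated with an idealized Python model. It assumes that every sub-chunk SCi,j takes one time unit, that each chunk is handled by a different slave, and that SCi,j can start only after SCi-1,j (data from the previous slave) and SCi,j-1 (the slave's own previous interval) are finished. These assumptions and the function name are illustrative, not taken from the slides.

```python
def wavefront_steps(num_chunks, num_intervals):
    """Earliest time step at which each sub-chunk SC[i][j] can be computed."""
    steps = [[0] * num_intervals for _ in range(num_chunks)]
    for i in range(num_chunks):
        for j in range(num_intervals):
            # Data from the previous slave for the same interval.
            after_previous_slave = steps[i - 1][j] + 1 if i > 0 else 0
            # The slave's own work on the preceding interval.
            after_own_interval = steps[i][j - 1] + 1 if j > 0 else 0
            steps[i][j] = max(after_previous_slave, after_own_interval)
    return steps


for row in wavefront_steps(3, 4):
    print(row)
```

In this model all sub-chunks with the same value of i + j run at the same moment, which is the diagonal front that gives the wavefront its name.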

April 27, 2006 IPDPS 2006 14

Dynamic Multi-Phase Scheduling DMPS(x)

INPUT: (a) An n-dimensional dependence nested loop.

(b) The choice of the algorithm CSS, TSS or DTSS.

(c) If CSS is chosen, then chunk size Ci.

(d) The synchronization interval H.

(e) The number of slaves m; in case of DTSS, the virtual power VPk of every slave.

Master:

Initialization: (M.a) Register slaves. In case of DTSS, slaves report their Ak.

(M.b) Calculate F, L, N, D for TSS and DTSS. For CSS use the given Ci.

While there are unassigned iterations do:

(M.1) If a request arrives, put it in the queue.

(M.2) Pick a request from the queue, and compute the next chunk size using CSS, TSS or DTSS.

(M.3) Update the current and previous slave ids.

(M.4) Send the id of the current slave to the previous one.

April 27, 2006 IPDPS 2006 15

Slave Pk:

Initialization: (S.a) Register with the master. In case of DTSS, report Ak.

(S.b) Compute M according to the given H.

(S.1) Send request to the master.

(S.2) Wait for reply; if received chunk from master, go to step 3, else go to OUTPUT.

(S.3) While the next SP is not reached, compute chunk i.

(S.4) If id of the send-to slave is known, go to step 5, else go to step 6.

(S.5) Send computed data to send-to slave.

(S.6) Receive data from the receive-from slave and go to step 3.

OUTPUT

Master: If there are no more chunks to be assigned, terminate.

Slave Pk: If no more tasks come from master, terminate.
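
The master's bookkeeping in steps (M.2) to (M.4) can be sketched sequentially in Python. This is a simplified model under stated assumptions: requests are served in arrival order, chunk sizes are precomputed, and no real message passing takes place. The function and field names are hypothetical.

```python
def assign_chunks(chunk_sizes, requests):
    """Pair chunks with requesting slaves and link neighbouring slaves.

    chunk_sizes: sizes produced by CSS, TSS or DTSS, in scheduling order.
    requests:    slave ids in the order their requests are served.
    """
    assignments = []
    previous_slave = None
    for index, (size, slave) in enumerate(zip(chunk_sizes, requests)):
        entry = {
            "chunk": index,
            "size": size,
            "slave": slave,
            "receive_from": previous_slave,  # owner of chunk i-1
            "send_to": None,                 # unknown until chunk i+1 is assigned
        }
        if assignments:
            # (M.4) tell the previous slave who holds the next chunk.
            assignments[-1]["send_to"] = slave
        assignments.append(entry)
        previous_slave = slave               # (M.3) update current/previous ids
    return assignments


for entry in assign_chunks([300, 300, 200], ["P1", "P2", "P1"]):
    print(entry)
```

The `send_to` field of the last assigned chunk stays empty until a later request is served, which corresponds to the case in step (S.4) where the send-to slave is not yet known.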

April 27, 2006 IPDPS 2006 16

Advantages of DMPS(x)

• Can take as input any self-scheduling algorithm, without any modifications

• Phase 2 is independent of Phase 1

• Phase 1 deals with the heterogeneity & load variation in the system

• Phase 2 deals with minimizing the inter-slave communication cost

• Suitable for any type of heterogeneous system

April 27, 2006 IPDPS 2006 17

• Notation

• Conclusions

• Future work

April 27, 2006 IPDPS 2006 18

Implementation and testing setup

• The algorithms are implemented in C and C++

• MPI is used for master-slave and inter-slave communication

• The heterogeneous system consists of 10 machines:

  • 4 Intel Pentium III, 1266 MHz with 1GB RAM (called zealots), assumed to have VPk = 1.5 (one of them is the master)

  • 6 Intel Pentium III, 500 MHz with 512MB RAM (called kids), assumed to have VPk = 0.5

• Interconnection network is Fast Ethernet, at 100Mbit/sec

• Dedicated system: all machines are dedicated to running the program and no other loads are interposed during the execution

• Non-dedicated system: at the beginning of the program’s execution, a resource-expensive process is started on some of the slaves, halving their Ak

April 27, 2006 IPDPS 2006 19

Implementation and testing setup

• System configuration: zealot1 (master), zealot2, kid1, zealot3, kid2, zealot4, kid3, kid4, kid5, kid6

• Three series of experiments for both dedicated & non-dedicated systems, for m = 3,4,5,6,7,8,9 slaves:

  1) DMPS(CSS)

  2) DMPS(TSS)

  3) DMPS(DTSS)

• Two real-life applications: heat equation, Floyd-Steinberg computation

• Speedup Sp is computed with:

$S_p = \dfrac{\min\{T_{P_1}, T_{P_2}, \ldots, T_{P_m}\}}{T_{PAR}}$

where TPi – serial execution time on slave Pi, 1 ≤ i ≤ m, and TPAR – parallel execution time (on m slaves)

• In the plotting of Sp, VP is used instead of m on the x-axis
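
The speedup definition and the virtual-power x-axis can be sketched in Python. The slave order and the VPk values (1.5 for zealots, 0.5 for kids) are taken from the setup slides; the function names and the timing values in the usage line are hypothetical, and the sketch assumes that the first m slaves of the listed configuration are used for each m.

```python
from itertools import accumulate

# Slave order from the system configuration (zealot1 is the master).
SLAVE_ORDER = ["zealot2", "kid1", "zealot3", "kid2", "zealot4",
               "kid3", "kid4", "kid5", "kid6"]
VIRTUAL_POWER = {"zealot": 1.5, "kid": 0.5}


def cumulative_virtual_power(slaves):
    """Total VP of the first m slaves, for m = 1 .. len(slaves)."""
    powers = [VIRTUAL_POWER[name.rstrip("0123456789")] for name in slaves]
    return list(accumulate(powers))


def speedup(serial_times, parallel_time):
    """S_p = min(T_P1, ..., T_Pm) / T_PAR."""
    return min(serial_times) / parallel_time


# Total virtual power for m = 3 .. 9 slaves.
print(cumulative_virtual_power(SLAVE_ORDER)[2:])
# Hypothetical timings: serial times of two slaves and one parallel run.
print(speedup([10.0, 30.0], 5.0))
```

For m = 3 to 9 the cumulative virtual power is 3.5, 4, 5.5, 6, 6.5, 7 and 7.5, the same values that label the x-axis of the result plots. Dividing by the fastest serial time keeps the speedup from being inflated by the slow machines.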

April 27, 2006 IPDPS 2006 20

Performance results – Heat equation

Dedicated system

    Sync. interval  Series tested   m=3   m=4   m=5   m=6   m=7   m=8   m=9
                    1) DMPS(CSS)    2.32  1.75  1.73  1.23  1.21  1.21  1.18
                    2) DMPS(TSS)    2.20  1.73  1.56  1.38  1.25  1.14  1.02
                    3) DMPS(DTSS)   1.42  1.14  1.00  0.95  0.91  0.85  0.78
    150             1) DMPS(CSS)    2.31  1.74  1.71  1.21  1.22  1.21  1.18
                    2) DMPS(TSS)    2.18  1.72  1.54  1.38  1.25  1.14  1.02
                    3) DMPS(DTSS)   1.42  1.13  0.99  0.93  0.90  0.84  0.78
                    1) DMPS(CSS)    2.30  1.74  1.73  1.22  1.23  1.22  1.19
                    2) DMPS(TSS)    2.21  1.74  1.55  1.38  1.25  1.14  1.02
                    3) DMPS(DTSS)   1.42  1.13  0.99  0.94  0.90  0.83  0.78

[Figure: Heat Equation, dedicated heterogeneous cluster. x-axis: virtual powers (3.5, 4, 5.5, 6, 6.5, 7, 7.5); series: DMPS(CSS), DMPS(TSS), DMPS(DTSS).]

April 27, 2006 IPDPS 2006 21

Performance results – Heat equation

Non-dedicated system

    Sync. interval  Series tested   m=3   m=4   m=5   m=6   m=7   m=8   m=9
                    1) DMPS(CSS)    2.33  1.76  1.73  2.46  2.45  2.38  2.06
                    2) DMPS(TSS)    2.20  1.74  1.56  2.52  2.56  2.18  2.10
                    3) DMPS(DTSS)   1.95  1.45  1.30  1.31  1.33  1.38  1.25
    150             1) DMPS(CSS)    2.33  1.74  1.72  2.46  2.49  2.43  2.05
                    2) DMPS(TSS)    2.19  1.72  1.54  2.42  2.23  2.31  2.06
                    3) DMPS(DTSS)   1.94  1.47  1.30  1.30  1.28  1.36  1.23
                    1) DMPS(CSS)    2.30  1.74  1.73  2.39  2.36  2.38  2.10
                    2) DMPS(TSS)    2.22  1.75  1.56  1.79  2.32  2.10  2.02
                    3) DMPS(DTSS)   1.96  1.44  1.29  1.29  1.27  1.32  1.21

[Figure: Heat Equation, non-dedicated heterogeneous cluster. x-axis: virtual powers (3.5, 4, 5.5, 6, 6.5, 7, 7.5).]

April 27, 2006 IPDPS 2006 22

Performance results – Floyd-Steinberg

Dedicated system

Columns: Sync. interval, Series tested, Number of slaves m (3 4 5 6 7 8 9)

    1) DMPS(CSS)    27.79
    2) DMPS(TSS)    25.32
    3) DMPS(DTSS)   19.63

    1) DMPS(CSS)    27.52
    2) DMPS(TSS)    25.22
    3) DMPS(DTSS)   19.63

    1) DMPS(CSS)    27.58
    2) DMPS(TSS)    25.22
    3) DMPS(DTSS)   19.62

[Figure: Floyd-Steinberg, dedicated heterogeneous cluster. x-axis: virtual powers (3.5, 4, 5.5, 6, 6.5, 7, 7.5).]

April 27, 2006 IPDPS 2006 23

Performance results – Floyd-Steinberg

Non-dedicated system

Columns: Sync. interval, Series tested, Number of slaves m (3 4 5 6 7 8 9)

    1) DMPS(CSS)    27.72
    2) DMPS(TSS)    25.18
    3) DMPS(DTSS)   21.88

    1) DMPS(CSS)    27.49
    2) DMPS(TSS)    25.18
    3) DMPS(DTSS)   21.85

    1) DMPS(CSS)    27.57
    2) DMPS(TSS)    25.17
    3) DMPS(DTSS)   21.86

[Figure: Floyd-Steinberg, non-dedicated heterogeneous cluster. x-axis: virtual power (3.5, 4, 5.5, 6, 6.5, 7, 7.5).]

April 27, 2006 IPDPS 2006 24

Interpretation of the results

• Dedicated system:

  • as expected, all algorithms perform better on a dedicated system than on a non-dedicated one

  • DMPS(TSS) slightly outperforms DMPS(CSS) for parallel loops, because it provides better load balancing

  • DMPS(DTSS) outperforms both other algorithms because it explicitly accounts for the system’s heterogeneity

• Non-dedicated system:

  • DMPS(DTSS) stands out even more, since the other algorithms cannot handle extra load variations

  • The speedup for DMPS(DTSS) increases in all cases

• H must be chosen so as to maintain the comm/comp ratio < 1, for every test case

• Even then, small variations of the value of H do not significantly affect the overall performance

April 27, 2006 IPDPS 2006 25

• Notation

• Conclusions

• Future work

April 27, 2006 IPDPS 2006 26

Conclusions

• Loops with dependencies can now be dynamically scheduled on heterogeneous dedicated & non-dedicated systems

• Distributed algorithms efficiently compensate for the system’s heterogeneity for loops with dependencies, especially in non-dedicated systems

April 27, 2006 IPDPS 2006 27

• Notation

• Conclusions

• Future work

April 27, 2006 IPDPS 2006 28

Future work

• Establish a model for predicting the optimal synchronization interval H and minimize the communication

• Extend all other self-scheduling algorithms, such that they can handle loops with dependencies and account for the system’s heterogeneity

April 27, 2006 IPDPS 2006 29

Thank you

Questions?
