Dynamic Multi Phase Scheduling for Heterogeneous Clusters
Florina M. Ciorba†, Theodore Andronikos†, Ioannis Riakiotakis†,
Anthony T. Chronopoulos‡ and George Papakonstantinou†
† National Technical University of Athens
Computing Systems Laboratory
‡ University of Texas at San Antonio
cflorina@cslab.ece.ntua.gr
www.cslab.ece.ntua.gr
20th International Parallel and Distributed Processing Symposium
25-29 April 2006
April 27, 2006, IPDPS 2006
Outline
• Introduction
• Notation
• Some existing self-scheduling algorithms
• Dynamic self-scheduling for dependence loops
• Implementation and test results
• Conclusions
• Future work
Introduction
Motivation for dynamically scheduling loops with dependencies:
• Existing dynamic algorithms cannot cope with dependencies, because they lack inter-slave communication
• Static algorithms are not always efficient
• In their original form, dynamic algorithms applied to loops with dependencies yield a serial or invalid execution
Notation
Algorithmic model:
FOR (i1=l1; i1<=u1; i1++)
  FOR (i2=l2; i2<=u2; i2++)
    …
      FOR (in=ln; in<=un; in++)
        Loop Body
      ENDFOR
    …
  ENDFOR
ENDFOR
• Perfectly nested loops
• Constant flow data dependencies
• General program statements within the loop body
• J – index space of an n-dimensional uniform dependence loop
• $J = \{\, j \in \mathbb{N}^n \mid l_r \le i_r \le u_r,\ 1 \le r \le n \,\}$, where $j = (i_1, \ldots, i_n)$
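As a concrete illustration of the algorithmic model, the sketch below runs a small 2-dimensional uniform dependence loop sequentially. The loop body, the boundary value 1, the dependence vectors (1, 0) and (0, 1), and the name `run_nest` are assumptions chosen for illustration; none of them come from the slides.

```python
def run_nest(l1, u1, l2, u2):
    """Sequentially execute a 2-D perfectly nested loop with flow dependencies."""
    # a[(i1, i2)] reads a[(i1 - 1, i2)] and a[(i1, i2 - 1)], so the
    # dependence vectors are (1, 0) and (0, 1) (illustrative choice).
    a = {}
    for i1 in range(l1, u1 + 1):
        for i2 in range(l2, u2 + 1):
            north = a.get((i1 - 1, i2), 1)  # boundary value 1 outside J
            west = a.get((i1, i2 - 1), 1)
            a[(i1, i2)] = north + west      # the "loop body"
    return a


grid = run_nest(0, 3, 0, 3)
index_space_size = len(grid)  # |J| = (u1 - l1 + 1) * (u2 - l2 + 1)
```

Because every iteration reads values produced by its neighbours, the iterations cannot simply be handed out to slaves independently, which is the problem the rest of the talk addresses.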
Notation
• u1 – synchronization dimension, un – scheduling dimension
• $DS = \{d_1, \ldots, d_p\}$ – set of dependence vectors
• PE – processing element
• P1,...,Pm – slaves
• N – number of scheduling steps
• Ci – chunk size at the i-th scheduling step
• Vi – size (iteration-wise) of Ci along scheduling dimension un
• VPk – virtual computing power of slave Pk
• Qk – number of processes in the run-queue of slave Pk
• $A_k = VP_k / Q_k$ – available computing power of slave Pk
• $A = \sum_{k=1}^{m} A_k$ – total available computing power of the cluster
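The two power definitions can be sketched in a few lines of Python. The function names and the three-slave example are hypothetical. The sketch uses plain division for Ak, and whether the authors' implementation rounds the result is not stated on the slides.

```python
def available_power(vp, q):
    """A_k = VP_k / Q_k for one slave."""
    if q < 1:
        raise ValueError("run-queue length Q_k must be at least 1")
    return vp / q


def total_power(slaves):
    """A = sum of A_k over all slaves; `slaves` is a list of (VP_k, Q_k) pairs."""
    return sum(available_power(vp, q) for vp, q in slaves)


# Hypothetical three-slave cluster given as (VP_k, Q_k) pairs.
dedicated = [(1.5, 1), (0.5, 1), (1.5, 1)]
loaded = [(1.5, 2), (0.5, 1), (1.5, 1)]  # first slave runs one extra process
```

In the `loaded` example the extra process halves the first slave's available power, so A falls from 3.5 to 2.75.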
Some existing self-scheduling algorithms
Three self-scheduling algorithms:
• CSS – Chunk Self-Scheduling: Ci = constant
• TSS – Trapezoid Self-Scheduling: Ci = Ci-1 – D, where D is the decrement, the first chunk is F = |J|/(2×m) and the last chunk is L = 1
• DTSS – Distributed TSS: Ci = Ci-1 – D, where D is the decrement, the first chunk is F = |J|/(2×A) and the last chunk is L = 1
CSS and TSS are devised for homogeneous systems.
DTSS improves on TSS for heterogeneous systems by selecting the chunk sizes according to:
• the virtual computational power of the slaves, VPk
• the number of processes in the run-queue of each PE, Qk
[Figure: index space with axes u1 and u2, divided into chunks Ci-1, Ci, Ci+1 of sizes V1, ..., Vi-1, Vi, Vi+1, ..., VN; labels CSS, TSS and DTSS]
Some existing self-scheduling algorithms
Chunk sizes for |J| = 5000×10000 and m = 10 slaves:
Algorithm | Chunk sizes
CSS | 300 300 300 300 300 300 300 300 300 300 300 300 300 300 300 300 200
TSS | 277 270 263 256 249 242 235 228 221 214 207 200 193 186 179 172 165 158 151 144 137 130 123 116 109 102 73
DTSS (dedicated) | 392 253 368 237 344 221 108 211 103 300 192 276 176 176 252 160 77 149 72 207 130 183 114 159 98 46 87 41 44
DTSS (non-dedicated) | 263 383 369 355 229 112 219 107 209 203 293 279 265 169 33 96 46 89 86 83 80 77 74 24 69 66 31 59 56 53 50 47 44 20 39 20 33 30 27 24 21 20 20 20 20 20 20 20 20 8
• CSS and TSS each give the same chunk sizes in dedicated and non-dedicated systems
• DTSS adjusts the chunk sizes to match the different Ak of the slaves
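The CSS and TSS rules can be sketched as follows. The slides give only F and L for TSS. The number of steps N = ceil(2|J|/(F+L)) and the decrement D = floor((F–L)/(N–1)) used here are the usual TSS definitions and are an assumption, as are the function names. The sketch also assumes that the scheduling dimension has 5000 iterations, which is what the CSS row above sums to.

```python
import math


def css_chunks(total, chunk):
    """Chunk Self-Scheduling: constant chunk size; the last chunk takes the remainder."""
    sizes = []
    remaining = total
    while remaining > 0:
        size = min(chunk, remaining)
        sizes.append(size)
        remaining -= size
    return sizes


def tss_chunks(total, m, last=1):
    """Trapezoid Self-Scheduling with F = total/(2m) and L = last (assumed N and D)."""
    first = max(last, math.ceil(total / (2 * m)))
    n_steps = math.ceil(2 * total / (first + last))
    dec = (first - last) // (n_steps - 1) if n_steps > 1 else 0
    sizes = []
    remaining = total
    current = first
    while remaining > 0:
        size = min(max(current, last), remaining)  # never below L, never past the end
        sizes.append(size)
        remaining -= size
        current -= dec
    return sizes


css = css_chunks(5000, 300)
tss = tss_chunks(5000, 10)
```

`css_chunks(5000, 300)` yields sixteen chunks of 300 followed by one of 200, the same as the CSS row. With these textbook parameters the first TSS chunk is 250 rather than the 277 listed in the table, so the table was presumably produced with different parameter choices.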
More notation
• SP – synchronization point
• M – number of SPs inserted along synchronization dimension u1
• H – interval (iteration-wise) between two SPs along u1
• H is the same for every chunk
• SCi,j – the set of iterations of Ci between SPj-1 and SPj
• Ci = Vi × M × H
• Current slave – the slave assigned chunk Ci
• Previous slave – the slave assigned chunk Ci-1
Self-scheduling with synchronization
• Chunks are formed along the scheduling dimension, here say u2
• SPs are inserted along the synchronization dimension, u1
• Phase 1: Apply self-scheduling algorithms to the scheduling dimension
• Phase 2: Insert synchronization points along the synchronization dimension
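The two phases can be pictured as cutting a 2-dimensional index space into a grid of sub-chunks SCi,j. The sketch below assumes that the chunk widths Vi are already known from Phase 1, that SPs sit at multiples of H, and that M = ceil(|u1|/H). The function name `partition` and the example sizes are hypothetical.

```python
def partition(u1_size, chunk_sizes, h):
    """Return (M, blocks) where blocks[(i, j)] = (u2_start, u2_end, u1_start, u1_end).

    Ranges are half-open. Phase 1 fixes the chunk widths V_i along the
    scheduling dimension u2; Phase 2 places an SP every h iterations along u1.
    """
    m_sps = -(-u1_size // h)  # ceil(u1_size / h): the number of SPs, M
    blocks = {}
    u2_start = 0
    for i, v in enumerate(chunk_sizes, start=1):
        for j in range(1, m_sps + 1):
            u1_start = (j - 1) * h
            u1_end = min(j * h, u1_size)
            blocks[(i, j)] = (u2_start, u2_start + v, u1_start, u1_end)
        u2_start += v
    return m_sps, blocks


m_sps, blocks = partition(u1_size=600, chunk_sizes=[300, 200, 100], h=200)
```

Here M = 3, so the first chunk contains V1 × M × H = 300 × 3 × 200 iterations, split into three sub-chunks of 300 × 200 iterations each.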
The inter-slave communication scheme
• Ci-1 is assigned to Pk-1, Ci to Pk and Ci+1 to Pk+1
• When Pk reaches SPj+1, it sends to Pk+1 only the data Pk+1 requires (i.e., those iterations imposed by the existing dependence vectors)
• Afterwards, Pk receives from Pk-1 the data required for the current computation
• Slaves do not reach an SP at the same time, which leads to a wavefront-style execution
[Figure: chunks Ci-1, Ci, Ci+1 assigned to slaves Pk-1, Pk, Pk+1, with synchronization points SPj, SPj+1, SPj+2 and sub-chunks SCi-1,j+1 and SCi,j+1. Legend: communication set; set of points computed at moment t; set of points computed at moment t+1; arrows indicate communication; auxiliary explanations]
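The wavefront pattern follows directly from the communication scheme and can be reproduced with a small timing model. The sketch assumes unit cost per sub-chunk, free communication and one slave per chunk, none of which is claimed by the slides. `wavefront_schedule` is a hypothetical name.

```python
def wavefront_schedule(n_chunks, m_sps):
    """Earliest time step at which each sub-chunk SC(i, j) can be computed.

    SC(i, j) needs the boundary data of SC(i-1, j) from the previous slave
    and the result of SC(i, j-1) computed by the same slave.
    """
    t = {}
    for i in range(1, n_chunks + 1):
        for j in range(1, m_sps + 1):
            after_prev_slave = t.get((i - 1, j), 0)  # 0 when there is no predecessor
            after_own_prev = t.get((i, j - 1), 0)
            t[(i, j)] = max(after_prev_slave, after_own_prev) + 1
    return t


steps = wavefront_schedule(n_chunks=3, m_sps=4)
```

Under these assumptions SC(i, j) is computed at step i + j – 1, so all sub-chunks on the same anti-diagonal run concurrently, which is the wavefront.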
Dynamic Multi-Phase Scheduling DMPS(x)
INPUT: (a) An n-dimensional dependence nested loop.
(b) The choice of the algorithm CSS, TSS or DTSS.
(c) If CSS is chosen, then chunk size Ci.
(d) The synchronization interval H.
(e) The number of slaves m; in case of DTSS, the virtual power VPk of
every slave.
Master:
Initialization: (M.a) Register slaves. In case of DTSS, slaves report their Ak.
(M.b) Calculate F, L, N, D for TSS and DTSS. For CSS use the given Ci.
While there are unassigned iterations do:
(M.1) If a request arrives, put it in the queue.
(M.2) Pick a request from the queue, and compute the next chunk size using CSS,
TSS or DTSS.
(M.3) Update the current and previous slave ids.
(M.4) Send the id of the current slave to the previous one.
Slave Pk:
Initialization: (S.a) Register with the master. In case of DTSS, report Ak.
(S.b) Compute M according to the given H.
(S.1) Send request to the master.
(S.2) Wait for reply; if received chunk from master, go to step 3, else go to
OUTPUT.
(S.3) While the next SP is not reached, compute chunk i.
(S.4) If id of the send-to slave is known, go to step 5, else go to step 6.
(S.5) Send computed data to send-to slave
(S.6) Receive data from the receive-from slave and go to step 3.
OUTPUT
Master: If there are no more chunks to be assigned, terminate.
Slave Pk: If no more tasks come from master, terminate.
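The master side of the algorithm, steps (M.1) to (M.4), can be simulated in a single process. In the real system requests arrive asynchronously over MPI. This sketch replaces that with a deterministic round-robin request queue, which is an assumption, and `simulate_master` and `next_chunk` are hypothetical names.

```python
from collections import deque


def simulate_master(total, m, next_chunk):
    """Simulate master steps (M.1)-(M.4) for slaves 1..m.

    next_chunk(remaining) returns the size of the next chunk (CSS, TSS, ...).
    Returns (chunk_size, slave_id, send_to_id) tuples; send_to_id is filled in
    once the following chunk has been assigned, and stays None for the last one.
    """
    requests = deque(range(1, m + 1))  # (M.1) queued requests
    assignments = []
    remaining = total
    previous = None
    while remaining > 0:
        current = requests.popleft()            # (M.2) pick a request
        size = min(next_chunk(remaining), remaining)
        remaining -= size
        assignments.append([size, current, None])
        if previous is not None:                # (M.4) tell the previous slave
            assignments[-2][2] = current        #       where to send its data
        previous = current                      # (M.3) update the slave ids
        requests.append(current)  # assumption: the slave requests again later
    return [tuple(a) for a in assignments]


# CSS with a constant chunk size of 300 on three slaves.
plan = simulate_master(total=1000, m=3, next_chunk=lambda remaining: 300)
```

Each tuple tells a slave which chunk it computes and, once known, which slave receives its boundary data at every SP, mirroring the send-to id checked in step (S.4).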
Advantages of DMPS(x)
• Can take as input any self-scheduling algorithm, without any modifications
• Phase 2 is independent of Phase 1
• Phase 1 deals with the heterogeneity and load variation in the system
• Phase 2 deals with minimizing the inter-slave communication cost
• Suitable for any type of heterogeneous system
Implementation and testing setup
• The algorithms are implemented in C and C++
• MPI is used for master-slave and inter-slave communication
• The heterogeneous system consists of 10 machines:
  • 4 Intel Pentium III, 1266 MHz with 1 GB RAM (called zealots), assumed to have VPk = 1.5 (one of them is the master)
  • 6 Intel Pentium III, 500 MHz with 512 MB RAM (called kids), assumed to have VPk = 0.5
• The interconnection network is Fast Ethernet, at 100 Mbit/sec
• Dedicated system: all machines are dedicated to running the program and no other loads are interposed during the execution
• Non-dedicated system: at the beginning of the program’s execution, a resource-expensive process is started on some of the slaves, halving their Ak
Implementation and testing setup
• System configuration: zealot1 (master), zealot2, kid1, zealot3, kid2, zealot4, kid3, kid4, kid5, kid6
• Three series of experiments for both dedicated and non-dedicated systems, for m = 3, 4, 5, 6, 7, 8, 9 slaves:
  1) DMPS(CSS)
  2) DMPS(TSS)
  3) DMPS(DTSS)
• Two real-life applications: heat equation and Floyd-Steinberg computation
• Speedup Sp is computed with:
  $S_p = \frac{\min\{T_{P_1}, T_{P_2}, \ldots, T_{P_m}\}}{T_{PAR}}$
  where TPi is the serial execution time on slave Pi, 1 ≤ i ≤ m, and TPAR is the parallel execution time (on m slaves)
• In the plots of Sp, VP is used instead of m on the x-axis
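The x-axis values and the speedup formula can be checked with a short sketch. The slave order and the VPk values are taken from the setup above, while the function names and the serial and parallel times passed to `speedup` are hypothetical.

```python
VP = {"zealot": 1.5, "kid": 0.5}

# Slave order from the system configuration (zealot1 is the master):
# zealot2, kid1, zealot3, kid2, zealot4, kid3, kid4, kid5, kid6.
SLAVES = ["zealot", "kid", "zealot", "kid", "zealot", "kid", "kid", "kid", "kid"]


def cumulative_virtual_power(m):
    """Total virtual power of the first m slaves."""
    return sum(VP[kind] for kind in SLAVES[:m])


def speedup(serial_times, t_par):
    """S_p = min(T_P1, ..., T_Pm) / T_PAR."""
    return min(serial_times) / t_par


x_axis = [cumulative_virtual_power(m) for m in range(3, 10)]
example = speedup([10.0, 30.0, 12.0], 4.0)  # made-up times, for illustration only
```

For m = 3, ..., 9 the cumulative virtual power is 3.5, 4, 5.5, 6, 6.5, 7 and 7.5, which matches the tick values on the "Virtual powers" axis of the plots. Dividing by the fastest serial time is the more conservative choice on a heterogeneous cluster, since it compares the parallel run against the best single machine.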
Performance results – Heat equation
Dedicated system
Sync. interval H | Series tested | m = 3 | m = 4 | m = 5 | m = 6 | m = 7 | m = 8 | m = 9
100 | 1) DMPS(CSS) | 2.32 | 1.75 | 1.73 | 1.23 | 1.21 | 1.21 | 1.18
100 | 2) DMPS(TSS) | 2.20 | 1.73 | 1.56 | 1.38 | 1.25 | 1.14 | 1.02
100 | 3) DMPS(DTSS) | 1.42 | 1.14 | 1.00 | 0.95 | 0.91 | 0.85 | 0.78
150 | 1) DMPS(CSS) | 2.31 | 1.74 | 1.71 | 1.21 | 1.22 | 1.21 | 1.18
150 | 2) DMPS(TSS) | 2.18 | 1.72 | 1.54 | 1.38 | 1.25 | 1.14 | 1.02
150 | 3) DMPS(DTSS) | 1.42 | 1.13 | 0.99 | 0.93 | 0.90 | 0.84 | 0.78
200 | 1) DMPS(CSS) | 2.30 | 1.74 | 1.73 | 1.22 | 1.23 | 1.22 | 1.19
200 | 2) DMPS(TSS) | 2.21 | 1.74 | 1.55 | 1.38 | 1.25 | 1.14 | 1.02
200 | 3) DMPS(DTSS) | 1.42 | 1.13 | 0.99 | 0.94 | 0.90 | 0.83 | 0.78
[Figure: Heat Equation, dedicated heterogeneous cluster. Speedup versus virtual powers (3.5 to 7.5) for DMPS(CSS), DMPS(TSS) and DMPS(DTSS)]
Performance results – Heat equation
Non-dedicated system
Sync. interval H | Series tested | m = 3 | m = 4 | m = 5 | m = 6 | m = 7 | m = 8 | m = 9
100 | 1) DMPS(CSS) | 2.33 | 1.76 | 1.73 | 2.46 | 2.45 | 2.38 | 2.06
100 | 2) DMPS(TSS) | 2.20 | 1.74 | 1.56 | 2.52 | 2.56 | 2.18 | 2.10
100 | 3) DMPS(DTSS) | 1.95 | 1.45 | 1.30 | 1.31 | 1.33 | 1.38 | 1.25
150 | 1) DMPS(CSS) | 2.33 | 1.74 | 1.72 | 2.46 | 2.49 | 2.43 | 2.05
150 | 2) DMPS(TSS) | 2.19 | 1.72 | 1.54 | 2.42 | 2.23 | 2.31 | 2.06
150 | 3) DMPS(DTSS) | 1.94 | 1.47 | 1.30 | 1.30 | 1.28 | 1.36 | 1.23
200 | 1) DMPS(CSS) | 2.30 | 1.74 | 1.73 | 2.39 | 2.36 | 2.38 | 2.10
200 | 2) DMPS(TSS) | 2.22 | 1.75 | 1.56 | 1.79 | 2.32 | 2.10 | 2.02
200 | 3) DMPS(DTSS) | 1.96 | 1.44 | 1.29 | 1.29 | 1.27 | 1.32 | 1.21
[Figure: Heat Equation, non-dedicated heterogeneous cluster. Speedup versus virtual powers (3.5 to 7.5) for DMPS(CSS), DMPS(TSS) and DMPS(DTSS)]
Performance results – Floyd-Steinberg
Dedicated system
Sync. interval H | Series tested | m = 3 | m = 4 | m = 5 | m = 6 | m = 7 | m = 8 | m = 9
50 | 1) DMPS(CSS) | 27.79 | 22.14 | 16.78 | 16.69 | 16.53 | 11.38 | 11.36
50 | 2) DMPS(TSS) | 25.32 | 19.77 | 17.30 | 15.41 | 13.80 | 12.43 | 11.40
50 | 3) DMPS(DTSS) | 19.63 | 14.87 | 13.28 | 12.72 | 11.57 | 11.45 | 10.73
100 | 1) DMPS(CSS) | 27.52 | 22.01 | 16.70 | 16.65 | 16.43 | 11.34 | 11.33
100 | 2) DMPS(TSS) | 25.22 | 19.70 | 17.24 | 15.35 | 13.75 | 12.38 | 11.38
100 | 3) DMPS(DTSS) | 19.63 | 14.80 | 13.21 | 12.66 | 11.52 | 11.34 | 10.64
150 | 1) DMPS(CSS) | 27.58 | 22.03 | 16.75 | 16.70 | 16.44 | 11.43 | 11.43
150 | 2) DMPS(TSS) | 25.22 | 19.70 | 17.22 | 15.34 | 13.75 | 12.39 | 11.38
150 | 3) DMPS(DTSS) | 19.62 | 14.82 | 13.24 | 12.67 | 11.53 | 11.34 | 10.65
[Figure: Floyd-Steinberg, dedicated heterogeneous cluster. Speedup versus virtual powers (3.5 to 7.5) for DMPS(CSS), DMPS(TSS) and DMPS(DTSS)]
Performance results – Floyd-Steinberg
Non-dedicated system
Sync. interval H | Series tested | m = 3 | m = 4 | m = 5 | m = 6 | m = 7 | m = 8 | m = 9
50 | 1) DMPS(CSS) | 27.72 | 22.13 | 16.76 | 23.81 | 22.32 | 22.47 | 22.44
50 | 2) DMPS(TSS) | 25.18 | 19.72 | 17.24 | 22.34 | 24.14 | 22.26 | 20.95
50 | 3) DMPS(DTSS) | 21.88 | 16.06 | 14.38 | 13.74 | 13.26 | 13.02 | 11.71
100 | 1) DMPS(CSS) | 27.49 | 21.99 | 16.67 | 22.61 | 22.42 | 22.59 | 22.35
100 | 2) DMPS(TSS) | 25.18 | 19.66 | 17.17 | 19.23 | 24.15 | 22.24 | 20.88
100 | 3) DMPS(DTSS) | 21.85 | 15.96 | 14.32 | 13.65 | 13.11 | 12.80 | 11.58
150 | 1) DMPS(CSS) | 27.57 | 22.01 | 16.74 | 22.49 | 22.48 | 22.32 | 22.46
150 | 2) DMPS(TSS) | 25.17 | 19.65 | 17.20 | 26.20 | 24.14 | 22.02 | 20.82
150 | 3) DMPS(DTSS) | 21.86 | 15.96 | 14.31 | 13.58 | 13.18 | 12.80 | 11.59
[Figure: Floyd-Steinberg, non-dedicated heterogeneous cluster. Speedup versus virtual power (3.5 to 7.5) for DMPS(CSS), DMPS(TSS) and DMPS(DTSS)]
Interpretation of the results
• Dedicated system:
  • As expected, all algorithms perform better on a dedicated system than on a non-dedicated one
  • DMPS(TSS) slightly outperforms DMPS(CSS) for parallel loops, because it provides better load balancing
  • DMPS(DTSS) outperforms both other algorithms because it explicitly accounts for the system’s heterogeneity
• Non-dedicated system:
  • DMPS(DTSS) stands out even more, since the other algorithms cannot handle extra load variations
  • The speedup for DMPS(DTSS) increases in all cases
• H must be chosen so as to keep the communication-to-computation ratio below 1 for every test case
• Even then, small variations in the value of H do not significantly affect the overall performance
Conclusions
• Loops with dependencies can now be dynamically scheduled on heterogeneous dedicated and non-dedicated systems
• Distributed algorithms efficiently compensate for the system’s heterogeneity for loops with dependencies, especially in non-dedicated systems
Future work
• Establish a model for predicting the optimal synchronization interval H and minimizing the communication
• Extend all other self-scheduling algorithms so that they can handle loops with dependencies and account for the system’s heterogeneity