TRANSCRIPT
CGrid 2005, slide 1
Empirical Evaluation of Shared Parallel Execution on
Independently Scheduled Clusters
Mala Ghanesh, Satish Kumar, Jaspal Subhlok
University of Houston
CCGrid, May 2005
CGrid 2005, slide 2
Scheduling Parallel Threads
Space Sharing / Gang Scheduling
• All parallel threads of an application scheduled together by a global scheduler
Independent Scheduling
• Threads scheduled independently on each node of a parallel system by the local scheduler
CGrid 2005, slide 3
Space Sharing and Gang Scheduling
[Figure: two schedule diagrams with nodes N1-N4 across and time slices T1-T6 down, one for space sharing and one for gang scheduling. Threads of application A are a1, a2, a3, a4; threads of application B are b1, b2, b3, b4.]
CGrid 2005, slide 4
Independent Scheduling and Gang Scheduling
[Figure: two schedule diagrams with nodes N1-N4 across and time slices T1-T6 down, one for independent scheduling and one for gang scheduling. Under gang scheduling every node runs the same application in a given time slice; under independent scheduling threads of applications A and B are interleaved differently on each node.]
CGrid 2005, slide 5
Gang versus Independent Scheduling
Gang scheduling is the de facto standard for parallel computation clusters
How does independent scheduling compare?
+ More flexible – no central scheduler required
+ Potentially uses resources more efficiently
- Potentially increases synchronization overhead
CGrid 2005, slide 6
Synchronization/Communication with Independent Scheduling
[Figure: schedule diagram with nodes N1-N4 across and time slices T1-T6 down; application A's threads on nodes N1-N2 run in the opposite time slices from its threads on N3-N4.]
With strict independent round-robin scheduling, parallel threads may never be able to communicate!
Fortunately, scheduling is never strictly round robin, but this is a significant performance issue
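To make this concrete, here is a minimal sketch (our own illustration of the scenario in the figure above, not code from the paper): each node strictly alternates between its application-A thread and its application-B thread, and the cycles on two of the nodes are shifted by one slice. Counting the slices in which all four of A's threads run at once gives zero, so a collective operation among them could never complete.

```python
# Minimal sketch (not from the paper): strict round-robin scheduling on
# 4 independently scheduled nodes, each running one thread of app A and
# one thread of app B. Nodes N3 and N4 are shifted by one time slice,
# as in the figure above.

def running_app(phase, t):
    """Which application's thread a node runs during time slice t."""
    return "A" if (t + phase) % 2 == 0 else "B"

node_phases = [0, 0, 1, 1]   # per-node phase of the round-robin cycle
num_slices = 1000

co_scheduled = sum(
    1 for t in range(num_slices)
    if all(running_app(p, t) == "A" for p in node_phases)
)

print(f"time slices where all of A's threads run together: {co_scheduled}")
# Prints 0: A's threads on N1-N2 and on N3-N4 never overlap, so a
# collective operation among all four threads would never complete.
```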
CGrid 2005, slide 7
Research in This Paper
How does node sharing with independent scheduling perform in practice?
• Improved resource utilization versus higher synchronization overhead?
• Dependence on application characteristics?
• Dependence on CPU time slice values?
CGrid 2005, slide 8
Experiments
All experiments with NAS benchmarks on 2 clusters
Benchmark programs executed:
1. Dedicated mode on a cluster
2. With node sharing with competing applications
3. Slowdown due to sharing analyzed
Above experiments conducted with:
– Various node and thread counts
– Various CPU time slice values
CGrid 2005, slide 9
Experimental Setup
Two clusters are used:
1. 10 nodes, 1 GB RAM, dual Pentium Xeon processors, RedHat Linux 7.2, GigE interconnect
2. 18 nodes, 1 GB RAM, dual AMD Athlon processors, RedHat Linux 7.3, GigE interconnect
NAS Parallel Benchmarks 2.3, Class B, MPI versions
• CG, EP, IS, LU, MG compiled for 4, 8, 16, 32 threads
• SP and BT compiled for 4, 9, 16, 36 threads
IS (Integer Sort) and CG (Conjugate Gradient) are the most communication-intensive benchmarks.
EP (Embarrassingly Parallel) has no communication.
CGrid 2005, slide 10
Experiment # 1
NAS Benchmarks compiled for 4, 8/9 and 16 threads
1. Benchmarks first executed in dedicated mode with one thread per node
2. Then executed with 2 additional competing threads on each node
• Each node has 2 CPUs – a minimum of 3 total threads is needed to cause contention
• Competing load threads are simple compute loops with no communication
3. Slowdown (percentage increase in execution time) plotted
• Nominal slowdown is 50% – used for comparison as the gang scheduling slowdown (see the sketch below)
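The 50% reference value follows from the setup: one application thread plus two compute-bound load threads makes 3 runnable threads on a 2-CPU node, so a fair scheduler gives each thread 2/3 of a CPU and run time stretches by 3/2. A minimal sketch of this arithmetic and of the slowdown metric (variable names and the example times are ours, not from the paper):

```python
def percentage_slowdown(t_dedicated, t_shared):
    """Percentage increase in execution time due to node sharing."""
    return 100.0 * (t_shared - t_dedicated) / t_dedicated

# Nominal slowdown on a 2-CPU node: 1 app thread + 2 load threads
# = 3 runnable threads sharing 2 CPUs, so each thread gets 2/3 of a
# CPU and runs 3/2 times as long.
runnable_threads, cpus = 3, 2
nominal_factor = runnable_threads / cpus            # 1.5
print(percentage_slowdown(1.0, nominal_factor))     # 50.0

# With hypothetical measured times (seconds):
print(percentage_slowdown(t_dedicated=120.0, t_shared=168.0))  # 40.0
```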
CGrid 2005, slide 11
Results: 10 node cluster
[Bar chart: percentage slowdown for CG, EP, IS, LU, MG, SP, BT, and the average on 4 nodes and 8/9 nodes, with a reference line for the expected slowdown with gang scheduling.]
• Slowdown ranges around 50%
• Some increase in slowdown going from 4 to 8 nodes
CGrid 2005, slide 12
Results: 18 node cluster
[Bar chart: percentage slowdown for BT, CG, EP, IS, LU, MG, SP, and the suite average on 4, 8/9, and 16 nodes.]
• Broadly similar
• Slow increase in slowdown from 4 to 16 nodes
CGrid 2005, slide 13
Remarks
Why is slowdown not much higher?
• Scheduling is not strict round robin – a blocked application thread will get scheduled again on message arrival
• This leads to self-synchronization – threads of the same application across nodes get scheduled together
• Applications often have significant wait times that are used by competing applications with sharing
Increase in slowdown with more nodes is expected, as communication operations are more complex
• The rate of increase is modest
CGrid 2005, slide 14
Experiment # 2
Similar to the previous batch of experiments, except…
• 2 application threads per node
• 1 load thread per node
• Nominal slowdown is still 50% (2 app threads + 1 load thread = 3 threads on 2 CPUs, as before)
CGrid 2005, slide 15
Performance: 1 and 2 app threads/node
[Bar chart: percentage slowdown for CG, EP, IS, LU, MG, SP, BT, and the average with 1 app thread per node (4 and 8/9 nodes) and 2 app threads per node (4/5 and 8 nodes), with a reference line for the expected slowdown with gang scheduling.]
Slowdown is lower for 2 threads/node
CGrid 2005, slide 16
Performance: 1 and 2 app threads/node
[Bar chart repeated from the previous slide: percentage slowdown with 1 and 2 app threads per node.]
Slowdown is lower for 2 threads/node:
• competing with one 100% compute thread (not 2)
• scaling a fixed-size problem to more threads means each thread uses the CPU less efficiently
• hence more free cycles are available
CGrid 2005, slide 17
Experiment # 3
Similar to the previous batch of experiments, except…
• CPU time slice quantum varied from 30 to 200 ms (default was 50 ms)
The CPU time slice quantum is the amount of time a process gets when others are waiting in the ready queue.
Intuitively, a longer time slice quantum means:
• a communication operation between nodes is less likely to be interrupted due to swapping – good
• a node may have to wait longer for a peer to be scheduled before communicating – bad (see the sketch below)
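As a rough, back-of-the-envelope illustration of this trade-off (our own model, not from the paper): if a message arrives at a uniformly random point in the peer node's scheduling cycle, the expected wait until the receiving thread runs grows with the quantum, while the chance that a communication phase is cut by the end of a quantum shrinks with it. A sketch, with made-up parameter values:

```python
# Crude illustration (our own model, not from the paper) of the two
# effects of the CPU time slice quantum, in milliseconds.

def expected_wait_for_peer(quantum_ms, other_runnable_threads=2):
    """Mean wait until a receiving thread is scheduled, if the message
    arrives at a random point in the other threads' combined slice."""
    return other_runnable_threads * quantum_ms / 2.0

def interruption_probability(quantum_ms, comm_phase_ms=5.0):
    """Chance that a communication phase of comm_phase_ms is cut by a
    quantum boundary, if it starts at a random point in the quantum."""
    return min(1.0, comm_phase_ms / quantum_ms)

for q in (30, 50, 100, 200):
    print(q, expected_wait_for_peer(q), round(interruption_probability(q), 3))
# Longer quanta: communication is interrupted less often (good), but a
# node may wait longer for its peer to be scheduled (bad).
```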
CGrid 2005, slide 18
Performance with different CPU time slice quanta
[Bar chart: percentage slowdown for CG, EP, IS, LU, MG, SP, BT with CPU time slices of 30 ms, 50 ms, 100 ms, and 200 ms.]
• Small time slices are uniformly bad
• Medium time slices (50 ms and 100 ms) generally good
• Longer time slices good for communication-intensive codes
CGrid 2005, slide 19
Conclusions
• Performance with independent scheduling is competitive with gang scheduling for small clusters
– Key is passive self-synchronization of application threads across the cluster
• Steady but slow increase in slowdown with a larger number of nodes
• Given the flexibility of independent scheduling, it may be a good choice for some scenarios
CGrid 2005, slide 20
Broader Picture: Distributed Applications on Networks: Resource selection, Mapping, Adapting
[Diagram: an application graph with components Data, Stream, Model, Pre, Sim 1, Sim 2, and Vis mapped onto a network. Which nodes offer the best performance?]
CGrid 2005, slide 21
End of Talk!
FOR MORE INFORMATION:
www.cs.uh.edu/~jaspal [email protected]
CGrid 2005, slide 22
Mapping Distributed Applications on Networks: “state of the art”
[Diagram: the same application graph (Data, Stream, Model, Pre, Sim 1, Sim 2, Vis) mapped onto a network.]
Mapping for Best Performance
1. Measure and model network properties, such as available bandwidth and CPU loads (with tools like NWS, Remos)
2. Find “best” nodes for execution based on network status (see the sketch below)
But the approach has significant limitations…
• Knowing network status is not the same as knowing how an application will perform
• Frequent measurements are expensive; less frequent measurements mean stale data
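A minimal sketch of step 2 as described above; the node names, numbers, and scoring rule are hypothetical, and in practice the measurements would come from tools like NWS or Remos:

```python
# Sketch of step 2 above: rank nodes by measured network/CPU status.
# Node names, numbers, and the scoring rule are hypothetical; tools like
# NWS or Remos would supply the real measurements.

measurements = {
    # node: (available bandwidth in Mbit/s, CPU load average)
    "node01": (800.0, 0.2),
    "node02": (450.0, 0.1),
    "node03": (900.0, 1.5),
    "node04": (700.0, 0.4),
}

def score(bandwidth, load):
    """Higher bandwidth and lower load are better (arbitrary weighting)."""
    return bandwidth / (1.0 + load)

def best_nodes(measurements, k):
    ranked = sorted(measurements, key=lambda n: score(*measurements[n]), reverse=True)
    return ranked[:k]

print(best_nodes(measurements, 2))   # e.g. ['node01', 'node04']
# Caveat from the slide: nodes that look good by these measurements may
# still be a poor match for a particular application's behavior.
```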
CGrid 2005, slide 23
Discovered Communication Structure of NAS Benchmarks
[Diagram: discovered communication graphs over threads 0-3 for BT, CG, IS, EP, LU, MG, and SP.]