TRANSCRIPT
CGrid 2005, slide 1
Empirical Evaluation of Shared Parallel Execution on
Independently Scheduled Clusters
Mala Ghanesh, Satish Kumar, Jaspal Subhlok
University of Houston
CCGrid, May 2005
CGrid 2005, slide 2
Scheduling Parallel Threads
Space Sharing / Gang Scheduling
• All parallel threads of an application scheduled together by a global scheduler
Independent Scheduling
• Threads scheduled independently on each node of a parallel system by the local scheduler
CGrid 2005, slide 3
Space Sharing and Gang Scheduling
[Figure: two schedule diagrams with nodes N1-N4 across and time slices T1-T6 down, one for space sharing and one for gang scheduling. Threads of application A are a1, a2, a3, a4; threads of application B are b1, b2, b3, b4.]
CGrid 2005, slide 4
Independent Scheduling and Gang Scheduling
[Figure: two schedule diagrams with nodes N1-N4 across and time slices T1-T6 down, one for independent scheduling and one for gang scheduling. Under gang scheduling every node runs the same application in a given time slice; under independent scheduling threads of applications A and B are interleaved differently on each node.]
CGrid 2005, slide 5
Gang versus Independent Scheduling
Gang scheduling is the de facto standard for parallel computation clusters
How does independent scheduling compare?
+ More flexible – no central scheduler required
+ Potentially uses resources more efficiently
- Potentially increases synchronization overhead
CGrid 2005, slide 6
Synchronization/Communication with Independent Scheduling
[Figure: schedule diagram with nodes N1-N4 across and time slices T1-T6 down; application A's threads on nodes N1-N2 run in the opposite time slices from its threads on N3-N4.]
With strict independent round-robin scheduling, parallel threads may never be able to communicate!
Fortunately, scheduling is never strictly round robin, but this is a significant performance issue
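To make this concrete, here is a minimal sketch (our own illustration of the scenario in the figure above, not code from the paper): each node strictly alternates between its application-A thread and its application-B thread, and the cycles on two of the nodes are shifted by one slice. Counting the slices in which all four of A's threads run at once gives zero, so a collective operation among them could never complete.

```python
# Minimal sketch (not from the paper): strict round-robin scheduling on
# 4 independently scheduled nodes, each running one thread of app A and
# one thread of app B. Nodes N3 and N4 are shifted by one time slice,
# as in the figure above.

def running_app(phase, t):
    """Which application's thread a node runs during time slice t."""
    return "A" if (t + phase) % 2 == 0 else "B"

node_phases = [0, 0, 1, 1]   # per-node phase of the round-robin cycle
num_slices = 1000

co_scheduled = sum(
    1 for t in range(num_slices)
    if all(running_app(p, t) == "A" for p in node_phases)
)

print(f"time slices where all of A's threads run together: {co_scheduled}")
# Prints 0: A's threads on N1-N2 and on N3-N4 never overlap, so a
# collective operation among all four threads would never complete.
```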
CGrid 2005, slide 7
Research in This Paper
How does node sharing with independent scheduling perform in practice?
• Improved resource utilization versus higher synchronization overhead?
• Dependence on application characteristics?
• Dependence on CPU time slice values?
CGrid 2005, slide 8
Experiments
All experiments with NAS benchmarks on 2 clusters
Benchmark programs executed:
1. Dedicated mode on a cluster
2. With node sharing with competing applications
3. Slowdown due to sharing analyzed
Above experiments conducted with:
– Various node and thread counts
– Various CPU time slice values
CGrid 2005, slide 9
Experimental Setup
Two clusters are used:
1. 10 nodes, 1 GB RAM, dual Pentium Xeon processors, RedHat Linux 7.2, GigE interconnect
2. 18 nodes, 1 GB RAM, dual AMD Athlon processors, RedHat Linux 7.3, GigE interconnect
NAS Parallel Benchmarks 2.3, Class B, MPI versions
• CG, EP, IS, LU, MG compiled for 4, 8, 16, 32 threads
• SP and BT compiled for 4, 9, 16, 36 threads
IS (Integer Sort) and CG (Conjugate Gradient) are the most communication-intensive benchmarks.
EP (Embarrassingly Parallel) has no communication.
CGrid 2005, slide 10
Experiment # 1
NAS Benchmarks compiled for 4, 8/9 and 16 threads
1. Benchmarks first executed in dedicated mode with one thread per node
2. Then executed with 2 additional competing threads on each node
• Each node has 2 CPUs – a minimum of 3 total threads is needed to cause contention
• Competing load threads are simple compute loops with no communication
3. Slowdown (percentage increase in execution time) plotted
• Nominal slowdown is 50% – used for comparison as the gang scheduling slowdown (see the sketch below)
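The 50% reference value follows from the setup: one application thread plus two compute-bound load threads makes 3 runnable threads on a 2-CPU node, so a fair scheduler gives each thread 2/3 of a CPU and run time stretches by 3/2. A minimal sketch of this arithmetic and of the slowdown metric (variable names and the example times are ours, not from the paper):

```python
def percentage_slowdown(t_dedicated, t_shared):
    """Percentage increase in execution time due to node sharing."""
    return 100.0 * (t_shared - t_dedicated) / t_dedicated

# Nominal slowdown on a 2-CPU node: 1 app thread + 2 load threads
# = 3 runnable threads sharing 2 CPUs, so each thread gets 2/3 of a
# CPU and runs 3/2 times as long.
runnable_threads, cpus = 3, 2
nominal_factor = runnable_threads / cpus            # 1.5
print(percentage_slowdown(1.0, nominal_factor))     # 50.0

# With hypothetical measured times (seconds):
print(percentage_slowdown(t_dedicated=120.0, t_shared=168.0))  # 40.0
```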
CGrid 2005, slide 11
Results: 10 node cluster
[Bar chart: percentage slowdown for CG, EP, IS, LU, MG, SP, BT, and the average on 4 nodes and 8/9 nodes, with a reference line for the expected slowdown with gang scheduling.]
• Slowdown ranges around 50%
• Some increase in slowdown going from 4 to 8 nodes
CGrid 2005, slide 12
Results: 18 node cluster
[Bar chart: percentage slowdown for BT, CG, EP, IS, LU, MG, SP, and the suite average on 4, 8/9, and 16 nodes.]
• Broadly similar
• Slow increase in slowdown from 4 to 16 nodes
CGrid 2005, slide 13
Remarks
Why is slowdown not much higher?
• Scheduling is not strict round robin – a blocked application thread will get scheduled again on message arrival
• This leads to self-synchronization – threads of the same application across nodes get scheduled together
• Applications often have significant wait times that are used by competing applications with sharing
Increase in slowdown with more nodes is expected, as communication operations are more complex
• The rate of increase is modest
CGrid 2005, slide 14
Experiment # 2
Similar to the previous batch of experiments, except…
• 2 application threads per node
• 1 load thread per node
• Nominal slowdown is still 50% (2 app threads + 1 load thread = 3 threads on 2 CPUs, as before)
CGrid 2005, slide 15
Performance: 1 and 2 app threads/node
[Bar chart: percentage slowdown for CG, EP, IS, LU, MG, SP, BT, and the average with 1 app thread per node (4 and 8/9 nodes) and 2 app threads per node (4/5 and 8 nodes), with a reference line for the expected slowdown with gang scheduling.]
Slowdown is lower for 2 threads/node
CGrid 2005, slide 16
Performance: 1 and 2 app threads/node
[Bar chart repeated from the previous slide: percentage slowdown with 1 and 2 app threads per node.]
Slowdown is lower for 2 threads/node:
• competing with one 100% compute thread (not 2)
• scaling a fixed-size problem to more threads means each thread uses the CPU less efficiently
• hence more free cycles are available
CGrid 2005, slide 17
Experiment # 3
Similar to the previous batch of experiments, except…
• CPU time slice quantum varied from 30 to 200 ms (default was 50 ms)
The CPU time slice quantum is the amount of time a process gets when others are waiting in the ready queue.
Intuitively, a longer time slice quantum means:
• a communication operation between nodes is less likely to be interrupted due to swapping – good
• a node may have to wait longer for a peer to be scheduled before communicating – bad (see the sketch below)
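As a rough, back-of-the-envelope illustration of this trade-off (our own model, not from the paper): if a message arrives at a uniformly random point in the peer node's scheduling cycle, the expected wait until the receiving thread runs grows with the quantum, while the chance that a communication phase is cut by the end of a quantum shrinks with it. A sketch, with made-up parameter values:

```python
# Crude illustration (our own model, not from the paper) of the two
# effects of the CPU time slice quantum, in milliseconds.

def expected_wait_for_peer(quantum_ms, other_runnable_threads=2):
    """Mean wait until a receiving thread is scheduled, if the message
    arrives at a random point in the other threads' combined slice."""
    return other_runnable_threads * quantum_ms / 2.0

def interruption_probability(quantum_ms, comm_phase_ms=5.0):
    """Chance that a communication phase of comm_phase_ms is cut by a
    quantum boundary, if it starts at a random point in the quantum."""
    return min(1.0, comm_phase_ms / quantum_ms)

for q in (30, 50, 100, 200):
    print(q, expected_wait_for_peer(q), round(interruption_probability(q), 3))
# Longer quanta: communication is interrupted less often (good), but a
# node may wait longer for its peer to be scheduled (bad).
```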
CGrid 2005, slide 18
Performance with different CPU time slice quanta
[Bar chart: percentage slowdown for CG, EP, IS, LU, MG, SP, BT with CPU time slices of 30 ms, 50 ms, 100 ms, and 200 ms.]
• Small time slices are uniformly bad
• Medium time slices (50 ms and 100 ms) generally good
• Longer time slices good for communication-intensive codes
CGrid 2005, slide 19
Conclusions
• Performance with independent scheduling is competitive with gang scheduling for small clusters
– Key is passive self-synchronization of application threads across the cluster
• Steady but slow increase in slowdown with a larger number of nodes
• Given the flexibility of independent scheduling, it may be a good choice for some scenarios
CGrid 2005, slide 20
Broader Picture: Distributed Applications on Networks: Resource selection, Mapping, Adapting
[Diagram: an application graph with components Data, Stream, Model, Pre, Sim 1, Sim 2, and Vis mapped onto a network. Which nodes offer the best performance?]
CGrid 2005, slide 21
End of Talk!
FOR MORE INFORMATION:
www.cs.uh.edu/~jaspal [email protected]
CGrid 2005, slide 22
Mapping Distributed Applications on Networks: “state of the art”
[Diagram: the same application graph (Data, Stream, Model, Pre, Sim 1, Sim 2, Vis) mapped onto a network.]
Mapping for Best Performance
1. Measure and model network properties, such as available bandwidth and CPU loads (with tools like NWS, Remos)
2. Find “best” nodes for execution based on network status (see the sketch below)
But the approach has significant limitations…
• Knowing network status is not the same as knowing how an application will perform
• Frequent measurements are expensive; less frequent measurements mean stale data
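A minimal sketch of step 2 as described above; the node names, numbers, and scoring rule are hypothetical, and in practice the measurements would come from tools like NWS or Remos:

```python
# Sketch of step 2 above: rank nodes by measured network/CPU status.
# Node names, numbers, and the scoring rule are hypothetical; tools like
# NWS or Remos would supply the real measurements.

measurements = {
    # node: (available bandwidth in Mbit/s, CPU load average)
    "node01": (800.0, 0.2),
    "node02": (450.0, 0.1),
    "node03": (900.0, 1.5),
    "node04": (700.0, 0.4),
}

def score(bandwidth, load):
    """Higher bandwidth and lower load are better (arbitrary weighting)."""
    return bandwidth / (1.0 + load)

def best_nodes(measurements, k):
    ranked = sorted(measurements, key=lambda n: score(*measurements[n]), reverse=True)
    return ranked[:k]

print(best_nodes(measurements, 2))   # e.g. ['node01', 'node04']
# Caveat from the slide: nodes that look good by these measurements may
# still be a poor match for a particular application's behavior.
```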
CGrid 2005, slide 23
Discovered Communication Structure of NAS Benchmarks
[Diagram: discovered communication graphs over threads 0-3 for BT, CG, IS, EP, LU, MG, and SP.]