
CGrid 2005, slide 1

Empirical Evaluation of Shared Parallel Execution on

Independently Scheduled Clusters

Mala Ghanesh, Satish Kumar, Jaspal Subhlok

University of Houston

CCGrid, May 2005

CGrid 2005, slide 2

Scheduling Parallel Threads

Space Sharing/Gang Scheduling
• All parallel threads of an application scheduled together by a global scheduler

Independent Scheduling
• Threads scheduled independently on each node of a parallel system by the local scheduler (sketched below)
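To make the distinction concrete, here is a minimal scheduling sketch (illustrative only; the four-node, two-application setup mirrors the figures on the next two slides, and all names and phases are hypothetical):

```python
# Toy schedule generator: gang scheduling picks one application per time
# slice globally; independent scheduling lets each node's local scheduler
# alternate its resident threads from its own arbitrary starting phase.

NODES, SLICES = 4, 6

def gang(t, n):
    # Global decision: the whole slice belongs to app A or app B.
    return f"{'a' if t % 2 == 0 else 'b'}{n + 1}"

def independent(t, n, phase=(0, 0, 1, 1)):
    # Local decision: each node alternates a/b from its own phase.
    return f"{'a' if (t + phase[n]) % 2 == 0 else 'b'}{n + 1}"

for name, policy in (("Gang", gang), ("Independent", independent)):
    print(f"{name} scheduling (rows = slices T1..T6, cols = nodes N1..N4):")
    for t in range(SLICES):
        print("  " + " ".join(policy(t, n) for n in range(NODES)))
```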

CGrid 2005, slide 3

Space Sharing and Gang Scheduling

[Figure: two panels, Gang scheduling and Space sharing, showing which thread runs on nodes N1-N4 in time slices T1-T6. Threads of application A are a1, a2, a3, a4; threads of application B are b1, b2, b3, b4. Under gang scheduling, each time slice runs all four threads of one application (all a's or all b's) across the nodes.]

CGrid 2005, slide 4

Independent Scheduling and Gang Scheduling

[Figure: Independent scheduling vs. gang scheduling, same layout and thread legend as the previous slide. Under gang scheduling, each time slice again runs all four threads of one application; under independent scheduling, each node's local scheduler picks its own thread per slice, so a single slice may mix threads of A and B (e.g. b1 b2 a3 b4).]

CGrid 2005, slide 5

Gang versus Independent Scheduling

Gang scheduling is the de facto standard for parallel computation clusters

How does independent scheduling compare?

+ More flexible – no central scheduler required

+ Potentially uses resources more efficiently

- Potentially increases synchronization overhead

CGrid 2005, slide 6

Synchronization/Communication with Independent Scheduling

[Figure: strict round-robin independent scheduling on nodes N1-N4 over time slices T1-T6. In every slice, nodes N1 and N2 run threads a1, a2 while N3 and N4 run b3, b4, or vice versa, so the four threads of application A are never scheduled at the same time.]

With strict independent round robin scheduling, parallel threads may never be able to communicate!

Fortunately, scheduling is never strictly round robin, but this is a significant performance issue
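A few lines make the pathological case explicit (a sketch under the assumption of a strict, fixed-phase round robin on every node; the phase choice is hypothetical):

```python
# Strict round robin on 4 nodes, one thread of app A and one of app B per
# node.  With the phases below, the four A threads are never all runnable
# in the same time slice, so a collective operation in A could never
# complete under a strictly round-robin schedule.

NODES = 4
PHASES = [0, 0, 1, 1]   # which application each node starts with

def runs_a(node, t):
    """True if `node` runs its app-A thread in time slice `t`."""
    return (t + PHASES[node]) % 2 == 0

all_a_coscheduled = any(all(runs_a(n, t) for n in range(NODES))
                        for t in range(1000))
print("All A threads ever co-scheduled:", all_a_coscheduled)   # False
```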

CGrid 2005, slide 7

Research in This Paper

How does node sharing with independent scheduling perform in practice?

• Improved resource utilization versus higher synchronization overhead?

• Dependence on application characteristics?

• Dependence on CPU time slice values?

CGrid 2005, slide 8

Experiments

All experiments with NAS benchmarks on 2 clusters

Benchmark programs executed:
1. Dedicated mode on a cluster
2. With node sharing with competing applications
3. Slowdown due to sharing analyzed

Above experiments conducted with
– Various node and thread counts
– Various CPU time slice values

CGrid 2005, slide 9

Experimental Setup

Two clusters are used:
1. 10 nodes, 1 GB RAM, dual Pentium Xeon processors, RedHat Linux 7.2, GigE interconnect
2. 18 nodes, 1 GB RAM, dual AMD Athlon processors, RedHat Linux 7.3, GigE interconnect

NAS Parallel Benchmarks 2.3, Class B, MPI versions
• CG, EP, IS, LU, MG compiled for 4, 8, 16, 32 threads
• SP and BT compiled for 4, 9, 16, 36 threads

IS (Integer Sort) and CG (Conjugate Gradient) are the most communication intensive benchmarks.

EP (Embarrassingly Parallel) has no communication.

CGrid 2005, slide 10

Experiment # 1

NAS Benchmarks compiled for 4, 8/9 and 16 threads

1. Benchmarks first executed in dedicated mode with one thread per node

2. Then executed with 2 additional competing threads on each node
• Each node has 2 CPUs – a minimum of 3 total threads is needed to cause contention
• Competing load threads are simple compute loops with no communication

3. Slowdown (percentage increase in execution time) plotted
• Nominal slowdown is 50% – used for comparison as the gang scheduling slowdown (see the sketch below)
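The slowdown metric and the 50% nominal figure can be written out explicitly (a sketch of the arithmetic; the dedicated run time below is a made-up number): with one application thread and two compute-bound load threads sharing two CPUs, a fair scheduler gives each thread 2/3 of a CPU, so the run is stretched by a factor of 1.5.

```python
def slowdown_pct(t_dedicated, t_shared):
    """Percentage increase in execution time due to sharing."""
    return 100.0 * (t_shared - t_dedicated) / t_dedicated

# Nominal model: 3 compute-bound threads on a 2-CPU node -> each thread
# gets 2/3 of a CPU, so execution time grows by a factor of 3/2.
t_dedicated = 100.0                 # hypothetical dedicated run time (s)
t_shared = t_dedicated * (3 / 2)
print(slowdown_pct(t_dedicated, t_shared))   # -> 50.0
```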

CGrid 2005, slide 11

Results: 10 node cluster

[Bar chart: percentage slowdown for CG, EP, IS, LU, MG, SP, BT and the suite average, with bars for 4 nodes and 8/9 nodes; a horizontal line marks the expected slowdown with gang scheduling. Y-axis: 0-80%.]

• Slowdown ranges around 50%
• Some increase in slowdown going from 4 to 8 nodes

CGrid 2005, slide 12

Results: 18 node cluster

• Broadly similar
• Slow increase in slowdown from 4 to 16 nodes

[Bar chart: percentage slowdown for BT, CG, EP, IS, LU, MG, SP and the suite average, with bars for 4 nodes, 8/9 nodes, and 16 nodes. Y-axis: 0-90%.]

CGrid 2005, slide 13

Remarks

Why is slowdown not much higher?
• Scheduling is not strict round robin – a blocked application thread will get scheduled again on message arrival (sketched below)
• This leads to self synchronization – threads of the same application across nodes get scheduled together
• Applications often have significant wait times that are used by competing applications with sharing

Increase in slowdown with more nodes is expected, as communication operations are more complex
• The rate of increase is modest
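The self-synchronization argument rests on the fact that a thread blocked waiting for a message consumes essentially no CPU, so the competing load gets those cycles, and the message's arrival makes the thread runnable again. A minimal Unix-only sketch of that behavior (raw sockets stand in for the MPI blocking used by the benchmarks; all names are illustrative):

```python
# One process blocks in recv() while another burns CPU for a second, then
# sends a message.  The blocked process reports close to zero CPU time:
# its wait cost the competing compute loop almost nothing.
import os
import socket
import time

def receiver(conn):
    start = time.process_time()      # CPU time used by this process
    conn.recv(4)                     # blocks until the message arrives
    used = time.process_time() - start
    print(f"receiver CPU seconds while blocked: {used:.4f}")

if __name__ == "__main__":
    a, b = socket.socketpair()
    if os.fork() == 0:               # child: the "blocked" application thread
        receiver(a)
        os._exit(0)
    deadline = time.monotonic() + 1.0
    while time.monotonic() < deadline:   # parent: competing compute loop
        pass
    b.send(b"ping")                  # message arrival wakes the child
    os.wait()
```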

CGrid 2005, slide 14

Experiment # 2

Similar to the previous batch of experiments, except…

• 2 application threads per node
• 1 load thread per node
• Nominal slowdown is still 50%

CGrid 2005, slide 15

Performance: 1 and 2 app threads/node

[Bar chart: percentage slowdown for CG, EP, IS, LU, MG, SP, BT and the average, with bars for 1 app thread per node (4 nodes; 8/9 nodes) and 2 app threads per node (4/5 nodes; 8 nodes); a horizontal line marks the expected slowdown with gang scheduling. Y-axis: 0-80%.]

Slowdown is lower for 2 threads/node

CGrid 2005, slide 16

Performance: 1 and 2 app threads/node

[Same bar chart as the previous slide: percentage slowdown for 1 and 2 app threads per node across CG, EP, IS, LU, MG, SP, BT and the average.]

Slowdown is lower for 2 threads/node
• competing with one 100% compute thread (not 2)
• scaling a fixed-size problem to more threads means each thread uses CPU less efficiently
• hence more free cycles available

CGrid 2005, slide 17

Experiment # 3

Similar to the previous batch of experiments, except…

• CPU time slice quantum varied from 30 to 200 ms (default was 50 ms)

The CPU time slice quantum is the amount of time a process gets when others are waiting in the ready queue (see the sketch below)

Intuitively, a longer time slice quantum means
• a communication operation between nodes is less likely to be interrupted due to swapping – good
• a node may have to wait longer for a peer to be scheduled before communicating – bad
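The quantum can be observed indirectly (a rough sketch, not the methodology of the paper; the 2 ms threshold and durations are arbitrary choices): a process that spins and timestamps continuously sees gaps roughly one competitor-time-slice long whenever it is descheduled, so running this next to a compute-bound loop on the same CPU gives a feel for the quantum in effect.

```python
# Spin for a while, timestamping continuously; any gap between consecutive
# timestamps longer than `threshold_s` means this process was off the CPU,
# typically for about one scheduler time slice of the competing work.
import time

def observe_gaps(duration_s=2.0, threshold_s=0.002):
    gaps = []
    end = time.perf_counter() + duration_s
    prev = time.perf_counter()
    while prev < end:
        now = time.perf_counter()
        if now - prev > threshold_s:
            gaps.append(now - prev)
        prev = now
    return gaps

if __name__ == "__main__":
    gaps = observe_gaps()
    if gaps:
        print(f"{len(gaps)} gaps, longest {1000 * max(gaps):.1f} ms")
    else:
        print("no descheduling gaps above threshold observed")
```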

CGrid 2005, slide 18

Performance with different CPU time slice quanta

[Bar chart: percentage slowdown for CG, EP, IS, LU, MG, SP, BT with bars for CPU time slice = 30 ms, 50 ms, 100 ms, and 200 ms. Y-axis: 0-100%.]

• Small time slices are uniformly bad
• Medium time slices (50 ms and 100 ms) generally good
• Longer time slice good for communication intensive codes

CGrid 2005, slide 19

Conclusions

• Performance with independent scheduling competitive with gang scheduling for small clusters
– Key is passive self synchronization of application threads across the cluster

• Steady but slow increase in slowdown with larger number of nodes

• Given the flexibility of independent scheduling, it may be a good choice for some scenarios

CGrid 2005, slide 20

Broader Picture: Distributed Applications on Networks: Resource selection, Mapping, Adapting

[Diagram: an application composed of components – Data, Pre, Sim 1, Sim 2, Model, Stream, Vis – to be mapped onto a network. Which nodes offer the best performance?]

CGrid 2005, slide 21

End of Talk!

FOR MORE INFORMATION:

www.cs.uh.edu/~jaspal [email protected]

CGrid 2005, slide 22

Mapping Distributed Applications on Networks: “state of the art”

[Diagram: the same application components – Data, Pre, Sim 1, Sim 2, Model, Stream, Vis – mapped onto a network.]

Mapping for Best Performance

1. Measure and model network properties, such as available bandwidth and CPU loads (with tools like NWS, Remos)

2. Find “best” nodes for execution based on network status

But the approach has significant limitations…

• Knowing network status is not the same as knowing how an application will perform

• Frequent measurements are expensive, less frequent measurements mean stale data

CGrid 2005, slide 23

Discovered Communication Structure of NAS Benchmarks

[Diagram: discovered communication graphs among threads 0-3 for BT, CG, EP, IS, LU, MG, and SP.]

CGrid 2005, slide 24

CPU Behavior of NAS Benchmarks

[Stacked bar chart: for CG, IS, MG, SP, LU, BT, and EP, the percentage of time (0-100%) spent in computation, communication, and idle.]