delivering high performance to parallel applications using advanced scheduling nikolaos drosinos,...

40
Delivering High Performance to Delivering High Performance to Parallel Applications Using Parallel Applications Using Advanced Scheduling Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris National Technical University of Athens Computing Systems Laboratory {ndros,goumas,maria,nkoziris}@cslab.ece.ntua.gr www.cslab.ece.ntua.gr

Upload: nichole-blurton

Post on 14-Dec-2015

219 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

Delivering High Performance to Delivering High Performance to Parallel Applications Using Parallel Applications Using

Advanced SchedulingAdvanced Scheduling

Nikolaos Drosinos, Georgios GoumasMaria Athanasaki and Nectarios Koziris

National Technical University of Athens

Computing Systems Laboratory

{ndros,goumas,maria,nkoziris}@cslab.ece.ntua.grwww.cslab.ece.ntua.gr

Page 2: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 2

OverviewOverview

Introduction Background Code Generation

• Computation/Data Distribution• Communication Schemes• Summary

Experimental Results Conclusions – Future Work

Page 3: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 3

IntroductionIntroduction

Motivation:

• A lot of theoretical work has been done on arbitrary tiling but there are no actual experimental results!• There is no complete method to generate code for non-rectangular tiles

Page 4: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 4

IntroductionIntroduction

Contribution:

• Complete end-to-end SPMD code generation method for arbitrarily tiled iteration spaces• Simulation of blocking and non-blocking communication primitives• Experimental evaluation of proposed scheduling scheme

Page 5: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 5

OverviewOverview

Introduction Background Code Generation

• Computation/Data Distribution• Communication Schemes• Summary

Experimental Results Conclusions – Future Work

Page 6: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 6

BackgroundBackground

Algorithmic Model:

FOR j1 = min1 TO max1 DO

FOR jn = minn TO maxn DO

Computation(j1,…,jn);

ENDFOR

ENDFOR

Perfectly nested loops Constant flow data dependencies (D)

Page 7: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 7

BackgroundBackground

Tiling:

Popular loop transformation Groups iterations into atomic units Enhances locality in uniprocessors Enables coarse-grain parallelism in distributed memory systems Valid tiling matrix H: 0DH

Page 8: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 8

Tiling TransformationTiling Transformation

Example:

FOR j1=0 TO 11 DO FOR j2=0 TO 8 DO A[j1,j2]:=A[j1-1,j2] + A[j1-1,j2-1]; ENDFORENDFOR

10

11D

Page 9: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 9

Rectangular Tiling Rectangular Tiling TransformationTransformation

j1

j2

3 0P = 0 3

1/3 0H = 0 1/3

Page 10: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 10

Non-rectangular Tiling Non-rectangular Tiling TransformationTransformation

j1

j2

3 3P = 0 3

1/3 -1/3H = 0 1/3

Page 11: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 11

Why Non-rectangular Tiling?Why Non-rectangular Tiling?

Reduces communication

Enables more efficient scheduling schemes

8 communication points

6 communication points

6 time steps 5 time steps

Page 12: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 12

OverviewOverview

Introduction Background Code Generation

• Computation/Data Distribution• Communication Schemes• Summary

Experimental Results Conclusions – Future Work

Page 13: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 13

Computation DistributionComputation Distribution

We map tiles along the longest dimension to the same processor because:

• It reduces the number of processors required• It simplifies message-passing• It reduces total execution times when overlapping computation with communication

Page 14: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 14

Computation DistributionComputation Distribution

P3

P2

P1

j1

j2

Page 15: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 15

Data DistributionData Distribution

Computer-owns rule: Each processor owns the data it computes Arbitrary convex iteration space, arbitrary tiling Rectangular local iteration and data spaces

Page 16: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 16

Data DistributionData Distribution

j1

j2

Tile Iteration SpaceTIS

j'2

Transformed Tile Iteration SpaceTTIS

j'1

H'

P'

Page 17: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 17

Data DistributionData Distribution

ma

pp

ing

dim

en

sio

n

t=0

t=1

t=2

t=3

j'1

j''2

j''1

j'2

map-1

map

LDS TTIS...

off1

off2

Computation Storage

Communication Storage

Unused Space

Page 18: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 18

Data DistributionData Distribution

)( ypidLDS

Jn

)( xpidLDS

loc()

loc–1()

fw()

DS

loc()loc–1()

j2

j1

w1

w2

j2''

j2''

j1''

j1''

Page 19: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 19

OverviewOverview

Introduction Background Code Generation

• Computation/Data Distribution• Communication Schemes• Summary

Experimental Results Conclusions – Future Work

Page 20: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 20

P3

P2

P1

Communication SchemesCommunication Schemes

With whom do I communicate?

j1

j2

Page 21: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 21

P3

P2

P1

With whom do I communicate?

j1

j2

Communication SchemesCommunication Schemes

Page 22: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 22

What do I send?

Tile Iteration SpaceTIS

j1

j2

21

13D

j'2

Transformed Tile Iteration SpaceTTIS

j'1

50

05D

Communication SchemesCommunication Schemes

Page 23: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 23

Blocking SchemeBlocking Scheme

P3

P2

P1

j1

j2

12 time steps

Page 24: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 24

Non-blocking SchemeNon-blocking Scheme

P3

P2

P1

j1

j2

6 time steps

Page 25: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 25

OverviewOverview

Introduction Background Code Generation

• Computation/Data Distribution• Communication Schemes• Summary

Experimental Results Conclusions – Future Work

Page 26: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 26

Code Generation SummaryCode Generation Summary

Advanced Scheduling = Suitable Tiling + Non-blocking Communication Scheme

Sequential Code

Dependence Analysis

ParallelizationParallel SPMD

Code

Tiling Transformation

Sequential Tiled Code

Tiling

• Computation Distribution

• Data Distribution

• Communication Primitives

Page 27: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 27

Code Summary – Blocking Code Summary – Blocking SchemeSchemeFORACROSS Spid 11 min TO S

1max DO … FORACROSS S

nnpid 11 min TO Sn 1max DO

/*Sequential execution of tiles*/ FOR S

nSt min TO S

nmax DO /*Receive data from neighbouring tiles*/

BLOCKING_RECV( CCDtpid SS ,,, ); /*Traverse the internal of the tile*/ FOR 11 lj TO 1u STEP= 1c DO … FOR nn lj TO nu STEP= nc DO /*Perform computations on LDS*/ t := St - S

nl ;

LA[map( tj , )]:=F(LA[map( tdj ,1 )],…,

LA[map( tdj q , )]); ENDFOR … ENDFOR /*Send data to neighbouring processors*/

BLOCKING_SEND( CCDtpid mS ,,, ); ENDFOR ENDFORACROSS … ENDFORACROSS

Page 28: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 28

FORACROSS Spid 11 min TO S1max DO

… FORACROSS S

nnpid 11 min TO Sn 1max DO

/*Sequential execution of tiles*/ FOR S

nSt min TO S

nmax DO /*Receive data for next tile*/

NON_BLOCKING_RECV( CCDtpid SS ,,1, ); /*Send data computed at previous tile*/

NON_BLOCKING_SEND( CCDtpid mS ,,1, ); /*Compute current tile*/ FOR 11 lj TO 1u STEP= 1c DO … FOR nn lj TO nu STEP= nc DO /*Perform computations on LDS*/ t := St - S

nl ;

LA[map( tj , )]:=F(LA[map( tdj ,1 )],…,

LA[map( tdj q , )]);

ENDFOR … ENDFOR /*Wait for communication completion*/ WAIT_ALL ENDFOR ENDFORACROSS … ENDFORACROSS

Code Summary – Non-blocking Code Summary – Non-blocking SchemeScheme

Page 29: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 29

OverviewOverview

Introduction Background Code Generation

• Computation/Data Distribution• Communication Schemes• Summary

Experimental Results Conclusions – Future Work

Page 30: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 30

Experimental ResultsExperimental Results

8-node SMP Linux Cluster (800 MHz PIII, 128 MB RAM, kernel 2.4.20) MPICH v.1.2.5 (--with-device=p4, --with-comm=shared) g++ compiler v.2.95.4 (-O3) FastEthernet interconnection 2 micro-kernel benchmarks (3D):

• Gauss Successive Over-Relaxation (SOR)• Texture Smoothing Code (TSC)

Simulation of communication schemes

Page 31: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 31

SORSOR

Iteration space M x N x N Dependence matrix:

01010

00101

11100

D

Rectangular Tiling:

Non-rectangular Tiling:

z

y

x

Pr00

00

00

zx

y

x

Pnr0

00

00

Page 32: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 32

SORSOR

Page 33: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 33

SORSOR

Page 34: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 34

TSCTSC

Iteration space T x N x N Dependence matrix:

10111101

11100111

11110000

D

Rectangular Tiling:

Non-rectangular Tiling:

z

y

x

Pr00

00

00

zyx

yx

x

Pnr 0

00

Page 35: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 35

TSCTSC

Page 36: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 36

TSCTSC

Page 37: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 37

OverviewOverview

Introduction Background Code Generation

• Computation/Data Distribution• Communication Schemes• Summary

Experimental Results Conclusions – Future Work

Page 38: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 38

ConclusionsConclusions

Automatic code generation for arbitrary tiled spaces can be efficient High performance can be achieved by means of

a suitable tiling transformation overlapping computation with communication

Page 39: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 39

Future WorkFuture Work

Application of methodology to imperfectly nested loops and non-constant dependencies Investigation of hybrid programming models (MPI+OpenMP) Performance evaluation on advanced interconnection networks (SCI, Myrinet)

Page 40: Delivering High Performance to Parallel Applications Using Advanced Scheduling Nikolaos Drosinos, Georgios Goumas Maria Athanasaki and Nectarios Koziris

5/9/2003 Parallel Computing 2003 40

Questions?Questions?

http://www.cslab.ece.ntua.gr/~ndros