
Delivering High Performance to Parallel Applications Using Advanced Scheduling

Nikolaos Drosinos, Georgios Goumas, Maria Athanasaki and Nectarios Koziris

National Technical University of Athens

Computing Systems Laboratory

{ndros,goumas,maria,nkoziris}@cslab.ece.ntua.gr
www.cslab.ece.ntua.gr

5/9/2003 Parallel Computing 2003

Overview

Introduction
Background
Code Generation
• Computation/Data Distribution
• Communication Schemes
• Summary
Experimental Results
Conclusions – Future Work


Introduction

Motivation:

• A lot of theoretical work has been done on arbitrary tiling, but there are no actual experimental results!
• There is no complete method to generate code for non-rectangular tiles


Introduction

Contribution:

• Complete end-to-end SPMD code generation method for arbitrarily tiled iteration spaces
• Simulation of blocking and non-blocking communication primitives
• Experimental evaluation of proposed scheduling scheme



Background

Algorithmic Model:

FOR j_1 = min_1 TO max_1 DO
  …
    FOR j_n = min_n TO max_n DO
      Computation(j_1, …, j_n);
    ENDFOR
  …
ENDFOR

• Perfectly nested loops
• Constant flow data dependencies (D)


Background

Tiling:

• Popular loop transformation
• Groups iterations into atomic units
• Enhances locality in uniprocessors
• Enables coarse-grain parallelism in distributed memory systems
• Valid tiling matrix H: $HD \geq 0$


Tiling Transformation

Example:

FOR j1=0 TO 11 DO
  FOR j2=0 TO 8 DO
    A[j1,j2] := A[j1-1,j2] + A[j1-1,j2-1];
  ENDFOR
ENDFOR

$D = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$
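As a small illustration of how the dependence matrix follows from the loop body, the sketch below derives D from the array subscripts of the example. The helper name `dependence_matrix` and the way the read offsets are encoded are assumptions made for this illustration only.

```python
def dependence_matrix(read_offsets):
    """Return D with one column per dependence vector.

    A read A[j1 + o1, j2 + o2] of a value written by an earlier
    iteration gives the dependence vector d = (-o1, -o2).
    """
    vectors = [tuple(-o for o in offset) for offset in read_offsets]
    # Transpose so that the dependence vectors become the columns of D.
    return [list(row) for row in zip(*vectors)]


# Reads in the example body: A[j1-1, j2] and A[j1-1, j2-1].
D = dependence_matrix([(-1, 0), (-1, -1)])
print(D)
```

The two reads give the dependence vectors (1, 0) and (1, 1), which appear as the columns of D.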


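Rectangular Tiling Transformation

$P = \begin{pmatrix} 3 & 0 \\ 0 & 3 \end{pmatrix}$, $H = \begin{pmatrix} 1/3 & 0 \\ 0 & 1/3 \end{pmatrix}$

[Figure: rectangular tiles in the (j1, j2) iteration space]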
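To make the role of H concrete, here is a minimal sketch that assigns each iteration of the 12 x 9 example to a tile using the usual definition of tile coordinates, floor(H j). The function name `tile_of` is hypothetical, and exact fractions are used to avoid rounding issues.

```python
from fractions import Fraction
from math import floor

# Rectangular tiling matrix H of the example (3 x 3 tiles).
H_RECT = [[Fraction(1, 3), Fraction(0)],
          [Fraction(0), Fraction(1, 3)]]


def tile_of(H, j):
    """Tile coordinates of iteration j: floor(H * j), component-wise."""
    return tuple(floor(sum(h * x for h, x in zip(row, j))) for row in H)


# All tiles touched by the 12 x 9 iteration space of the example loop.
tiles = {tile_of(H_RECT, (j1, j2)) for j1 in range(12) for j2 in range(9)}

print(tile_of(H_RECT, (7, 5)))  # (2, 1)
print(len(tiles))               # 12
```

With 3 x 3 tiles, the 12 x 9 space splits into a 4 x 3 grid of tiles.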


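Non-rectangular Tiling Transformation

$P = \begin{pmatrix} 3 & 3 \\ 0 & 3 \end{pmatrix}$, $H = \begin{pmatrix} 1/3 & -1/3 \\ 0 & 1/3 \end{pmatrix}$

[Figure: non-rectangular tiles in the (j1, j2) iteration space]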
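The validity condition HD ≥ 0 can be checked mechanically. The sketch below does so for both tiling matrices of the example, using the dependence vectors (1, 0) and (1, 1) of the example loop as the columns of D. `H_BAD` is a made-up counterexample, not taken from the slides, and the helper names are assumptions.

```python
from fractions import Fraction


def mat_mul(A, B):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def is_valid_tiling(H, D):
    """A tiling is valid when every entry of H * D is non-negative."""
    return all(entry >= 0 for row in mat_mul(H, D) for entry in row)


third = Fraction(1, 3)
D = [[1, 1], [0, 1]]                        # columns: (1, 0) and (1, 1)
H_RECT = [[third, 0], [0, third]]
H_NONRECT = [[third, -third], [0, third]]
H_BAD = [[third, 0], [-third, third]]       # hypothetical invalid tiling

print(is_valid_tiling(H_RECT, D))      # True
print(is_valid_tiling(H_NONRECT, D))   # True
print(is_valid_tiling(H_BAD, D))       # False
```

For the non-rectangular tiling, H D has a single non-zero entry per column, so each dependence crosses tile boundaries in only one direction.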


Why Non-rectangular Tiling?

• Reduces communication
• Enables more efficient scheduling schemes

[Figure: rectangular tiling with 8 communication points and 6 time steps; non-rectangular tiling with 6 communication points and 5 time steps]
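The effect on the schedule length can be reproduced with a short sketch. It assumes the step counts refer to the 12 x 9 example loop with dependence vectors (1, 0) and (1, 1), that every tile takes one time step, that there are enough processors, and that a tile starts as soon as all tiles it depends on have finished. The function names are hypothetical.

```python
from fractions import Fraction
from math import floor


def tile_of(H, j):
    """Tile coordinates of iteration j: floor(H * j), component-wise."""
    return tuple(floor(sum(h * x for h, x in zip(row, j))) for row in H)


def time_steps(H, deps, extents=(12, 9)):
    """Length of the longest chain of dependent tiles (unit-time tiles)."""
    n1, n2 = extents
    preds = {}
    for j1 in range(n1):
        for j2 in range(n2):
            tile = tile_of(H, (j1, j2))
            preds.setdefault(tile, set())
            for d1, d2 in deps:
                s1, s2 = j1 - d1, j2 - d2      # iteration this one reads from
                if 0 <= s1 < n1 and 0 <= s2 < n2:
                    src = tile_of(H, (s1, s2))
                    if src != tile:
                        preds[tile].add(src)   # tile-level dependence

    finish = {}

    def step(tile):
        # A tile finishes one step after its latest predecessor.
        if tile not in finish:
            finish[tile] = 1 + max((step(p) for p in preds[tile]), default=0)
        return finish[tile]

    return max(step(tile) for tile in preds)


third = Fraction(1, 3)
DEPS = [(1, 0), (1, 1)]
H_RECT = [[third, 0], [0, third]]
H_NONRECT = [[third, -third], [0, third]]

print(time_steps(H_RECT, DEPS))     # 6
print(time_steps(H_NONRECT, DEPS))  # 5
```

Under these assumptions the rectangular tiling needs 6 steps and the non-rectangular one 5, because the skewed tiles remove the diagonal tile-to-tile dependence.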



Computation Distribution

We map tiles along the longest dimension to the same processor because:

• It reduces the number of processors required
• It simplifies message-passing
• It reduces total execution times when overlapping computation with communication
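A minimal sketch of this mapping, assuming a rectangular tile space described only by its extents: the longest dimension becomes the mapping dimension, and the remaining tile coordinates form the processor id. The names `distribute` and `pid_of` are hypothetical.

```python
def distribute(tile_extents):
    """Pick the longest tile-space dimension as the mapping dimension.

    Returns the index of that dimension and a function that maps a tile
    to its processor id (the tile coordinates without that dimension).
    """
    m = max(range(len(tile_extents)), key=lambda k: tile_extents[k])

    def pid_of(tile):
        return tuple(c for k, c in enumerate(tile) if k != m)

    return m, pid_of


# A 4 x 3 tile space, as obtained from 3 x 3 tiles on a 12 x 9 loop.
m, pid_of = distribute((4, 3))
pids = {pid_of((a, b)) for a in range(4) for b in range(3)}

print(m)             # 0
print(sorted(pids))  # [(0,), (1,), (2,)]
```

Here the four tiles along dimension 0 run sequentially on the same processor, so three processors suffice instead of twelve.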


Computation Distribution

[Figure: tiles in the (j1, j2) space assigned to processors P1, P2 and P3]


Data Distribution

Computer-owns rule: Each processor owns the data it computes
Arbitrary convex iteration space, arbitrary tiling
Rectangular local iteration and data spaces



[Figure: Tile Iteration Space (TIS), axes j1 and j2, and Transformed Tile Iteration Space (TTIS), axes j'1 and j'2, related by H' and P']

[Figure: TTIS (j'1, j'2) and LDS (j''1, j''2) related by map and map-1; tiles at t=0 to t=3 along the mapping dimension; offsets off1 and off2; legend: Computation Storage, Communication Storage, Unused Space]

[Figure: J^n, DS, LDS(pid_x) and LDS(pid_y) related by loc(), loc-1() and fw(); axes j1, j2, w1, w2, j1'', j2'']



Communication Schemes

With whom do I communicate?

[Figure: processors P1, P2 and P3 over the (j1, j2) tile space]



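What do I send?

[Figure: Tile Iteration Space (TIS), axes j1 and j2, annotated with a 2 x 2 matrix D whose entries 2, 1, 1, 3 appear in an order the extraction does not preserve; Transformed Tile Iteration Space (TTIS), axes j'1 and j'2, annotated with $D = \begin{pmatrix} 5 & 0 \\ 0 & 5 \end{pmatrix}$]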



Blocking Scheme

[Figure: blocking schedule on processors P1, P2 and P3 over the (j1, j2) tile space: 12 time steps]


Non-blocking Scheme

[Figure: non-blocking schedule on processors P1, P2 and P3 over the (j1, j2) tile space: 6 time steps]
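The gain from overlapping can be pictured with a deliberately simplified cost model. The sketch assumes that every step of the schedule has the same computation cost and the same communication cost, and it ignores pipeline start-up; the slides' own accounting of time steps is not recoverable from the figures, so the function names and the model are assumptions.

```python
def blocking_time(steps, t_comp, t_comm):
    """Each step communicates and computes one after the other."""
    return steps * (t_comp + t_comm)


def overlapped_time(steps, t_comp, t_comm):
    """Communication proceeds while the current tile is being computed."""
    return steps * max(t_comp, t_comm)


# A schedule of 6 steps with unit computation and unit communication cost.
print(blocking_time(6, 1, 1))    # 12
print(overlapped_time(6, 1, 1))  # 6
```

In this model the benefit is largest when computation and communication costs are balanced, and it disappears when one of them dominates.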



Code Generation Summary

Advanced Scheduling = Suitable Tiling + Non-blocking Communication Scheme

Sequential Code
→ Tiling (Dependence Analysis, Tiling Transformation)
→ Sequential Tiled Code
→ Parallelization (Computation Distribution, Data Distribution, Communication Primitives)
→ Parallel SPMD Code


Code Summary – Blocking Scheme

FORACROSS pid_1 = min^S_1 TO max^S_1 DO
  …
    FORACROSS pid_{n-1} = min^S_{n-1} TO max^S_{n-1} DO
      /*Sequential execution of tiles*/
      FOR t^S = min^S_n TO max^S_n DO
        /*Receive data from neighbouring tiles*/
        BLOCKING_RECV(pid, t^S, D^S, CC);
        /*Traverse the interior of the tile*/
        FOR j_1 = l_1 TO u_1 STEP = c_1 DO
          …
            FOR j_n = l_n TO u_n STEP = c_n DO
              /*Perform computations on LDS*/
              t := t^S - l^S_n;
              LA[map(j, t)] := F(LA[map(j - d_1, t)], …, LA[map(j - d_q, t)]);
            ENDFOR
          …
        ENDFOR
        /*Send data to neighbouring processors*/
        BLOCKING_SEND(pid, t^S, D^m, CC);
      ENDFOR
    ENDFORACROSS
  …
ENDFORACROSS
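The receive, compute, send structure of the blocking scheme can be imitated on one machine with threads and queues. This is only a sketch under several assumptions: each "processor" is a thread, a tile is reduced to a single addition, messages travel along a one-dimensional chain of processors, and `queue.Queue` stands in for the message-passing layer. All names are hypothetical.

```python
import queue
import threading


def run_blocking(num_procs, num_tiles):
    """Run a chain of threads, each following the blocking scheme."""
    # links[p] carries messages from processor p-1 to processor p.
    links = [queue.Queue() for _ in range(num_procs)]
    results = [None] * num_procs

    def spmd(pid):
        own = 1                        # boundary value before the first tile
        for _ in range(num_tiles):     # sequential execution of tiles
            # Blocking receive: wait for the neighbour's data for this tile.
            left = links[pid].get() if pid > 0 else 1
            own = own + left           # stand-in for computing the tile
            # Send to the next processor. put() on an unbounded queue
            # returns at once, so it models a buffered blocking send.
            if pid + 1 < num_procs:
                links[pid + 1].put(own)
        results[pid] = own

    threads = [threading.Thread(target=spmd, args=(p,))
               for p in range(num_procs)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results


print(run_blocking(3, 4))  # [5, 15, 35]
```

Each thread cannot start tile t before its left neighbour has finished the same tile, which is the serialization that the blocking scheme imposes.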


Code Summary – Non-blocking Scheme

FORACROSS pid_1 = min^S_1 TO max^S_1 DO
  …
    FORACROSS pid_{n-1} = min^S_{n-1} TO max^S_{n-1} DO
      /*Sequential execution of tiles*/
      FOR t^S = min^S_n TO max^S_n DO
        /*Receive data for next tile*/
        NON_BLOCKING_RECV(pid, t^S + 1, D^S, CC);
        /*Send data computed at previous tile*/
        NON_BLOCKING_SEND(pid, t^S - 1, D^m, CC);
        /*Compute current tile*/
        FOR j_1 = l_1 TO u_1 STEP = c_1 DO
          …
            FOR j_n = l_n TO u_n STEP = c_n DO
              /*Perform computations on LDS*/
              t := t^S - l^S_n;
              LA[map(j, t)] := F(LA[map(j - d_1, t)], …, LA[map(j - d_q, t)]);
            ENDFOR
          …
        ENDFOR
        /*Wait for communication completion*/
        WAIT_ALL
      ENDFOR
    ENDFORACROSS
  …
ENDFORACROSS
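The non-blocking structure can be imitated in the same spirit: while tile t is computed, the receive for tile t+1 and the send of tile t-1 are in flight, and the loop body ends by waiting for both. The sketch assumes a one-dimensional chain of "processors" run as threads, at least one tile per processor, a tile reduced to one addition, and a small thread pool per processor that plays the role of the communication layer. The prologue and epilogue, which the pseudocode leaves implicit, are written out. All names are hypothetical.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor, wait


def run_non_blocking(num_procs, num_tiles):
    """Run a chain of threads, each overlapping compute and communication."""
    links = [queue.Queue() for _ in range(num_procs)]  # links[p]: p-1 -> p
    results = [None] * num_procs

    def spmd(pid):
        has_left, has_right = pid > 0, pid + 1 < num_procs
        with ThreadPoolExecutor(max_workers=2) as comm:
            # Prologue: the data for the first tile must already be here.
            incoming = links[pid].get() if has_left else 1
            own, unsent = 1, None
            for t in range(num_tiles):
                pending = []
                recv_next = None
                if has_left and t + 1 < num_tiles:
                    # Post the receive for the next tile.
                    recv_next = comm.submit(links[pid].get)
                    pending.append(recv_next)
                if has_right and unsent is not None:
                    # Post the send of the previous tile's result.
                    pending.append(comm.submit(links[pid + 1].put, unsent))
                own = own + incoming   # compute the current tile meanwhile
                wait(pending)          # wait for communication completion
                unsent = own
                incoming = recv_next.result() if recv_next else 1
            # Epilogue: the last tile's result still has to be sent.
            if has_right:
                links[pid + 1].put(unsent)
        results[pid] = own

    threads = [threading.Thread(target=spmd, args=(p,))
               for p in range(num_procs)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results


print(run_non_blocking(3, 4))  # [5, 15, 35]
```

The computed values are the same as in a blocking run of the same recurrence; only the placement of the communication calls relative to the computation changes.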



Experimental Results

8-node SMP Linux Cluster (800 MHz PIII, 128 MB RAM, kernel 2.4.20)
MPICH v.1.2.5 (--with-device=p4, --with-comm=shared)
g++ compiler v.2.95.4 (-O3)
FastEthernet interconnection
2 micro-kernel benchmarks (3D):
• Gauss Successive Over-Relaxation (SOR)
• Texture Smoothing Code (TSC)
Simulation of communication schemes


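SOR

Iteration space: M x N x N

Dependence matrix D (3 x 5; rows as extracted, entry order and signs not preserved): 01010, 00101, 11100

Rectangular Tiling: $P_r = \begin{pmatrix} x & 0 & 0 \\ 0 & y & 0 \\ 0 & 0 & z \end{pmatrix}$

Non-rectangular Tiling: $P_{nr}$, a 3 x 3 matrix with non-zero entries x, y, z and a second x (positions not preserved)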



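TSC

Iteration space: T x N x N

Dependence matrix D (3 x 8; rows as extracted, entry order and signs not preserved): 10111101, 11100111, 11110000

Rectangular Tiling: $P_r = \begin{pmatrix} x & 0 & 0 \\ 0 & y & 0 \\ 0 & 0 & z \end{pmatrix}$

Non-rectangular Tiling: $P_{nr}$, a 3 x 3 matrix with six non-zero entries in x, y and z and three zeros (positions not preserved)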





Conclusions

Automatic code generation for arbitrary tiled spaces can be efficient
High performance can be achieved by means of
• a suitable tiling transformation
• overlapping computation with communication


Future Work

Application of methodology to imperfectly nested loops and non-constant dependencies
Investigation of hybrid programming models (MPI+OpenMP)
Performance evaluation on advanced interconnection networks (SCI, Myrinet)


Questions?

http://www.cslab.ece.ntua.gr/~ndros
