The MAP3S Static-and-Regular Mesh Simulation and Wavefront Parallel-Programming Patterns

DESCRIPTION
The MAP3S Static-and-Regular Mesh Simulation and Wavefront Parallel-Programming Patterns. By Robert Niewiadomski, José Nelson Amaral, and Duane Szafron, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada.

TRANSCRIPT
The MAP3S Static-and-Regular Mesh Simulation and Wavefront Parallel-Programming Patterns
By Robert Niewiadomski, José Nelson Amaral, and Duane Szafron
Department of Computing Science, University of Alberta
Edmonton, Alberta, Canada
Pattern-based parallel-programming
• Observation:
  – Many seemingly different parallel programs share a common parallel computation-communication-synchronization pattern.
• A parallel-programming pattern instance:
  – Is a parallel program that adheres to a certain parallel computation-communication-synchronization pattern.
  – Consists of engine-side code and user-side code:
    • Engine-side code is complete and handles all communication and synchronization.
    • User-side code is incomplete and handles all computation; the user completes the incomplete portions.
• MAP3S targets distributed-memory systems.
MAP3S
• MAP3S = MPI/C Advanced Pattern-based Parallel Programming System
• Roles, ranging from technical expertise to domain knowledge:
  – Engine designer
  – Pattern designer
  – Application developer
Pattern-based parallel-programming
• The MAP3S usage scheme:
  1. Select pattern.
  2. Create specification file (e.g., dimensions of the mesh, data dependences, etc.).
  3. Generate pattern instance (automatic, performed by the pattern-instance generator).
  4. Write user-side code (domain-specific computation code).
The Simulation and Wavefront computations
• Both computations operate on a k-dimensional mesh of elements.
• Simulation:
  – Multiple mesh instances M0, M1, … are computed.
  – In iteration i = 0, the elements of M0 are initialized.
  – In iteration i > 0, certain elements of Mi are computed using elements of Mi-1 that were initialized or computed in the previous iteration.
  – Execution proceeds until a terminating condition is met.
  – Example: cellular-automata computations.
• Wavefront:
  – A single mesh instance M is computed.
  – In iteration i = 0, certain elements of M are initialized.
  – In iteration i > 0, the elements of M whose data dependences are satisfied are computed.
  – Execution proceeds until there are no elements left to compute.
  – Example: dynamic-programming computations.
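As a minimal illustration of the Simulation scheme, the sketch below computes one mesh instance Mi from Mi-1 on a small one-dimensional mesh. The update rule f and the periodic boundary are hypothetical stand-ins for an application's own stencil, not part of MAP3S:

```c
#include <assert.h>

/* One Simulation iteration on a small 1D mesh: every element of the next
 * mesh instance M_i is computed from elements of M_{i-1} that were
 * initialized/computed in the previous iteration. */
#define N 8

static int f(int left, int self, int right) {
    return left + self + right;        /* hypothetical update rule */
}

void simulate_step(const int prev[N], int next[N]) {
    for (int x = 0; x < N; x++) {
        int l = prev[(x + N - 1) % N]; /* periodic boundary, for the sketch */
        int r = prev[(x + 1) % N];
        next[x] = f(l, prev[x], r);
    }
}
```

A real instance would iterate this step until the terminating condition decided by the user-side code is met.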
Mesh-blocks
• Computation proceeds at granularity of mesh-blocks.
[Figure: a 6x6 mesh of elements 0–35 partitioned into nine mesh-blocks labelled A–I.]
• A k-dimensional mesh is logically partitioned into k-dimensional sub-meshes called mesh-blocks.
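The element-to-block mapping behind such a partitioning can be sketched as follows; the row-major block numbering and the block dimensions are illustrative choices, not something MAP3S prescribes:

```c
#include <assert.h>

/* Map a 2D element (x, y) to the index of its mesh-block, assuming blocks
 * of block_w x block_h elements numbered row-major.  For a 6x6 mesh split
 * into 2x2 blocks there are 3 blocks per row, giving nine block indices
 * 0..8 (the blocks A..I of the figure). */
int block_of(int x, int y, int block_w, int block_h, int blocks_per_row) {
    return (y / block_h) * blocks_per_row + (x / block_w);
}
```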
User-side code: Simulation
• Prelude: process command-line arguments.
• Prologue: initialize the first mesh, possibly at the granularity of mesh-blocks.
• BodyLocal: compute the next mesh at the granularity of mesh-blocks.
• BodyGlobal: decide whether to compute another mesh or to terminate.
• Epilogue: process the last computed mesh, possibly at the granularity of mesh-blocks.
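How the engine-side code might drive these hooks can be sketched as below. The hook names mirror the slide, but the signatures and the trivial bodies are hypothetical; a generated pattern instance supplies the real interfaces:

```c
#include <assert.h>

/* Illustrative driver for the Simulation-pattern user-side hooks.
 * All signatures and bodies here are invented for the sketch. */

static int iterations;                /* state consulted by bodyGlobal */

static void prelude(int argc)            { (void)argc; iterations = 0; }
static void prologue(int *mesh, int n)   { for (int i = 0; i < n; i++) mesh[i] = 0; }
static void bodyLocal(int *mesh, int n)  { for (int i = 0; i < n; i++) mesh[i]++; }
static int  bodyGlobal(void)             { return ++iterations < 3; } /* stop after 3 */
static int  epilogue(const int *mesh, int n) { return mesh[n - 1]; }

int run_simulation(int *mesh, int n) {
    prelude(0);
    prologue(mesh, n);            /* initialize first mesh instance      */
    do {
        bodyLocal(mesh, n);       /* compute next mesh instance          */
    } while (bodyGlobal());       /* compute another mesh, or terminate? */
    return epilogue(mesh, n);     /* process last computed mesh          */
}
```

In MAP3S the engine-side code, not the user, owns this control flow and interleaves it with communication and synchronization.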
User-side code: Wavefront
• Prelude: process command-line arguments.
• Prologue: initialize the mesh, possibly at the granularity of mesh-blocks.
• Body: continue computing the mesh at the granularity of mesh-blocks.
• Epilogue: process the mesh, possibly at the granularity of mesh-blocks.
Data-dependency specification
• The computation of an element depends on the values of certain other elements.
• In MAP3S, the user specifies these data dependencies using conditional shape-lists at pattern-instance generation time.
  – Syntax: given an element p(c0, c1, …, ck-1), if a certain condition is met, then the computation of p requires the values of all elements falling into the specified k-dimensional volumes of the k-dimensional mesh, each of which is specified relative to position (c0, c1, …, ck-1).
  – A simple example, expressing that an element with x > 0 and y > 0 depends on elements (x-1, y-1), (x-1, y), and (x, y-1):

{"x > 0 && y > 0", {(["x-1","x-1"], ["y-1","y-1"]), (["x-1","x-1"], ["y","y"]), (["x","x"], ["y-1","y-1"])}};
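Unrolled by hand, that shape-list reads as the C check below. The function name and output layout are illustrative only; a generated pattern instance derives this logic from the specification automatically:

```c
#include <assert.h>

/* Dependences of element (x, y) under the shape-list
 * {"x > 0 && y > 0", {(x-1, y-1), (x-1, y), (x, y-1)}}.
 * Writes the dependence coordinates into deps and returns their count. */
int num_dependences(int x, int y, int deps[][2]) {
    if (!(x > 0 && y > 0))
        return 0;                       /* condition not met: no dependences */
    deps[0][0] = x - 1; deps[0][1] = y - 1;
    deps[1][0] = x - 1; deps[1][1] = y;
    deps[2][0] = x;     deps[2][1] = y - 1;
    return 3;
}
```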
Data-dependency specification
[Figure: data dependencies on a 10x10 mesh.]

{"y>x", {(["0","x-1"],["y","y"]), (["x","x"],["0","x-1"]), (["x","x"],["x","x"])}};
{"y<=x", {(["0","y-1"],["y","y"]), (["x","x"],["0","y-1"])}};

• In this example, the conditional shape-lists specify the data dependencies of the Lower/Upper Matrix-Decomposition Wavefront computation.
• The strengths of conditional shape-lists:
  – the user is not limited to pre-defined data-dependency specifications;
  – the user can express irregular data-dependency specifications.
Direct mesh-access
• In the user-side code, all mesh elements can be accessed directly:

void computeMeshBlock(double **mesh, int xMin, int xMax, int yMin, int yMax) {
    for (int x = xMin; x <= xMax; x++) {
        for (int y = yMin; y <= yMax; y++) {
            mesh[x][y] = f(mesh[x-1][y-1], mesh[x][y-1], mesh[x-1][y]);
        }
    }
}
• With direct mesh-access, the user does not need to refactor their sequential code with respect to mesh access. In contrast, indirect mesh-access requires such refactoring, since input elements are accessed through auxiliary data structures.
Engine-side code
• Engine-side code in the Wavefront pattern: element-level data dependencies, specified by the user, are automatically extended to mesh-block-level data dependencies.
Engine-side code
• The mesh-block-level data dependencies are used to establish a parallel-computation schedule.
Engine-side code
• The parallel computation schedule is refined by assigning mesh-blocks to the processors in round-robin fashion (shown).
• The parallel computation schedule is then complemented with a parallel communication schedule (not shown).
• The engine-side code executes the user-side code in accordance with the parallel computation and communication schedules.
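The round-robin refinement can be sketched as a simple cyclic deal of mesh-blocks over CPUs. The function below illustrates the policy only; it is not MAP3S's actual scheduler code:

```c
#include <assert.h>

/* Deal nblocks mesh-blocks out to nproc CPUs in round-robin order.
 * owner[b] receives the index of the CPU that will compute block b. */
void assign_round_robin(int nblocks, int nproc, int owner[]) {
    for (int b = 0; b < nblocks; b++)
        owner[b] = b % nproc;
}
```

With the nine blocks A–I and two CPUs, this alternates ownership block by block.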
Engine-side code
• Execution of user-side code by the engine-side code when using a sequential prologue and epilogue.
[Figure: execution timeline, time flowing downward, for CPU 0 and CPU 1. Both CPUs run the Prelude; CPU 0 runs Prologue(A), Body(A), Body(B), Body(C), Body(G) while CPU 1 runs Body(D) and Body(E); Epilogue(A,B,C,D,E,G) concludes.]
[Figure: a 2D mesh in dense mesh-representation and in sparse mesh-representation.]
Mesh representation
• The mesh can be represented using either a dense mesh-representation or a sparse mesh-representation.
• The sparse representation can have better locality and can distribute the memory footprint of the mesh among the nodes.
• A mesh's memory footprint can be as much of a problem as performance. Combining a parallel prologue and epilogue with the sparse mesh-representation both minimizes and distributes the mesh-storage memory footprint.
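One way to realize a sparse mesh-representation is a table of block pointers in which only locally needed blocks are ever allocated; dead or non-owned blocks stay NULL and cost no storage. The sketch below uses illustrative sizes and is not MAP3S's actual data structure:

```c
#include <assert.h>
#include <stdlib.h>

#define BLOCKS   9      /* e.g. a 3x3 grid of mesh-blocks A..I (illustrative) */
#define BLOCK_SZ 4      /* e.g. 2x2 elements per block (illustrative)        */

typedef struct {
    double *block[BLOCKS];              /* NULL = block not stored locally */
} sparse_mesh;

void sm_init(sparse_mesh *m) {
    for (int b = 0; b < BLOCKS; b++) m->block[b] = NULL;
}

/* Allocate a block's storage on first use; later touches reuse it. */
double *sm_touch(sparse_mesh *m, int b) {
    if (m->block[b] == NULL)
        m->block[b] = calloc(BLOCK_SZ, sizeof(double));
    return m->block[b];
}

/* Number of blocks actually resident, i.e. the local footprint in blocks. */
int sm_stored(const sparse_mesh *m) {
    int n = 0;
    for (int b = 0; b < BLOCKS; b++) n += (m->block[b] != NULL);
    return n;
}
```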
[Figure: sparse-representation footprint reduction across CPU 0 and CPU 1: the original mesh; the mesh without dead mesh-blocks stored; and the mesh storing only those non-owned mesh-blocks that are used by owned mesh-blocks.]
• Memory-footprint reduction varies. It is most effective for large Simulation computations.
Mesh representation
Experimental evaluation
• Problems:
  – 2D problems:
    • GoL: Game of Life (Simulation)
    • LUMD: lower/upper matrix decomposition (Wavefront)
  – 3D problems:
    • RTA: room-temperature annealing (Simulation)
    • MSA: multiple-sequence alignment (Wavefront)
• Hardware:
  – GigE: a 16-node cluster with Gigabit Ethernet
  – IB: a 128-node cluster with InfiniBand (limited to 64 nodes)
Experimental evaluation
• Speedup on GigE (x-axis: number of nodes; y-axis: speedup).
[Figure: speedup plots for GoL (2D Simulation), RTA (3D Simulation), LUMD (2D Wavefront), and MSA (3D Wavefront).]
• Performance gains on LUMD and MSA are worse than on GoL and RTA:
  – LUMD has non-uniform computation intensity, which limits parallelism.
  – MSA has limited computation granularity, which increases the relative overhead of communication and synchronization.
Experimental evaluation
• Speedup on IB (x-axis: number of nodes; y-axis: speedup).
[Figure: speedup plots for GoL (2D Simulation), RTA (3D Simulation), LUMD (2D Wavefront), and MSA (3D Wavefront).]
• Performance gains on LUMD and MSA are worse than on GoL and RTA, for the same reasons as on GigE.
Experimental evaluation
• Capability: the sparse mesh-representation distributes the mesh memory footprint across multiple nodes, which allows handling of meshes whose memory footprint exceeds the memory capacity of a single node.
• Using 16 nodes on GigE:

  Problem instance                Global mesh            Maximum local mesh
                                  memory-footprint (GB)  memory-footprint (GB)
  GoL  (131,072 x 131,072)        32                     3.0
  RTA  (1,024 x 1,024 x 1,024)    32                     4.4
  LUMD (40,132 x 40,132)          12                     3.0
  MSA  (2,048 x 2,048 x 2,048)    32                     3.0

• The LUMD memory-footprint reduction is limited because the computation of each element depends on a larger number of elements.
Experimental evaluation
• What we learned:
  – Dense meshes with large computation granularity: MAP3S delivers speedups in the range of 10 to 12 on 16 nodes, and in the range of 10 to 43 on 64 nodes.
  – Sparse meshes: smaller speedups, but per-node memory consumption is reduced by 20% to 50%.
The End