The MAP3S Static-and-Regular Mesh Simulation and Wavefront Parallel-Programming Patterns

DESCRIPTION
The MAP3S Static-and-Regular Mesh Simulation and Wavefront Parallel-Programming Patterns. By Robert Niewiadomski, José Nelson Amaral, and Duane Szafron, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada.

TRANSCRIPT
The MAP3S Static-and-Regular Mesh Simulation and Wavefront Parallel-Programming Patterns
By Robert Niewiadomski, José Nelson Amaral, and Duane Szafron
Department of Computing Science, University of Alberta
Edmonton, Alberta, Canada
Pattern-based parallel-programming
• Observation:
  – Many seemingly different parallel programs share a common parallel computation-communication-synchronization pattern.
• A parallel-programming pattern instance:
  – Is a parallel program that adheres to a certain parallel computation-communication-synchronization pattern.
  – Consists of engine-side code and user-side code:
    • Engine-side code is complete and handles all communication and synchronization.
    • User-side code is incomplete and handles all computation; the user completes the incomplete portions.
• MAP3S targets distributed-memory systems.
MAP3S
• MAP3S = MPI/C Advanced Pattern-based Parallel Programming System
• Roles, ranging from technical expertise to domain knowledge:
  – Engine designer
  – Pattern designer
  – Application developer
Pattern-based parallel-programming
• The MAP3S usage scheme:
  1. Select pattern.
  2. Create specification file (e.g., dimensions of the mesh, data dependences, etc.).
  3. Generate pattern instance (automatic, performed by the pattern-instance generator).
  4. Write user-side code (domain-specific computation code).
The Simulation and Wavefront computations
• Both computations operate on a k-dimensional mesh of elements.
• Simulation:
  – Multiple mesh instances M0, M1, … are computed.
  – In iteration i = 0, the elements of M0 are initialized.
  – In iteration i > 0, certain elements of Mi are computed using elements of Mi-1 that were initialized or computed in the previous iteration.
  – Execution proceeds until a terminating condition is met.
  – Example: cellular-automata computations.
• Wavefront:
  – A single mesh instance M is computed.
  – In iteration i = 0, certain elements of M are initialized.
  – In iteration i > 0, the elements of M whose data dependences are satisfied are computed.
  – Execution proceeds until there are no elements left to compute.
  – Example: dynamic-programming computations.
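As a minimal illustration of the Simulation scheme, the sketch below computes one mesh instance Mi from Mi-1 on a small one-dimensional mesh. The update rule f and the periodic boundary are hypothetical stand-ins for an application's own stencil, not part of MAP3S:

```c
#include <assert.h>

/* One Simulation iteration on a small 1D mesh: every element of the next
 * mesh instance M_i is computed from elements of M_{i-1} that were
 * initialized/computed in the previous iteration. */
#define N 8

static int f(int left, int self, int right) {
    return left + self + right;        /* hypothetical update rule */
}

void simulate_step(const int prev[N], int next[N]) {
    for (int x = 0; x < N; x++) {
        int l = prev[(x + N - 1) % N]; /* periodic boundary, for the sketch */
        int r = prev[(x + 1) % N];
        next[x] = f(l, prev[x], r);
    }
}
```

A real instance would iterate this step until the terminating condition decided by the user-side code is met.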
Mesh-blocks
• Computation proceeds at granularity of mesh-blocks.
[Figure: a 6x6 mesh of elements 0–35 partitioned into nine mesh-blocks labelled A–I.]
• A k-dimensional mesh is logically partitioned into k-dimensional sub-meshes called mesh-blocks.
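The element-to-block mapping behind such a partitioning can be sketched as follows; the row-major block numbering and the block dimensions are illustrative choices, not something MAP3S prescribes:

```c
#include <assert.h>

/* Map a 2D element (x, y) to the index of its mesh-block, assuming blocks
 * of block_w x block_h elements numbered row-major.  For a 6x6 mesh split
 * into 2x2 blocks there are 3 blocks per row, giving nine block indices
 * 0..8 (the blocks A..I of the figure). */
int block_of(int x, int y, int block_w, int block_h, int blocks_per_row) {
    return (y / block_h) * blocks_per_row + (x / block_w);
}
```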
User-side code: Simulation
• Prelude: process command-line arguments.
• Prologue: initialize the first mesh, possibly at the granularity of mesh-blocks.
• BodyLocal: compute the next mesh at the granularity of mesh-blocks.
• BodyGlobal: decide whether to compute another mesh or to terminate.
• Epilogue: process the last computed mesh, possibly at the granularity of mesh-blocks.
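How the engine-side code might drive these hooks can be sketched as below. The hook names mirror the slide, but the signatures and the trivial bodies are hypothetical; a generated pattern instance supplies the real interfaces:

```c
#include <assert.h>

/* Illustrative driver for the Simulation-pattern user-side hooks.
 * All signatures and bodies here are invented for the sketch. */

static int iterations;                /* state consulted by bodyGlobal */

static void prelude(int argc)            { (void)argc; iterations = 0; }
static void prologue(int *mesh, int n)   { for (int i = 0; i < n; i++) mesh[i] = 0; }
static void bodyLocal(int *mesh, int n)  { for (int i = 0; i < n; i++) mesh[i]++; }
static int  bodyGlobal(void)             { return ++iterations < 3; } /* stop after 3 */
static int  epilogue(const int *mesh, int n) { return mesh[n - 1]; }

int run_simulation(int *mesh, int n) {
    prelude(0);
    prologue(mesh, n);            /* initialize first mesh instance      */
    do {
        bodyLocal(mesh, n);       /* compute next mesh instance          */
    } while (bodyGlobal());       /* compute another mesh, or terminate? */
    return epilogue(mesh, n);     /* process last computed mesh          */
}
```

In MAP3S the engine-side code, not the user, owns this control flow and interleaves it with communication and synchronization.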
User-side code: Wavefront
• Prelude: process command-line arguments.
• Prologue: initialize the mesh, possibly at the granularity of mesh-blocks.
• Body: continue computing the mesh at the granularity of mesh-blocks.
• Epilogue: process the mesh, possibly at the granularity of mesh-blocks.
Data-dependency specification
• The computation of an element depends on the values of certain other elements.
• In MAP3S, the user specifies these data dependencies using conditional shape-lists at pattern-instance generation time.
  – Syntax: given an element p(c0, c1, …, ck-1), if a certain condition is met, then the computation of p requires the values of all elements falling into the specified k-dimensional volumes of the k-dimensional mesh, each of which is specified relative to position (c0, c1, …, ck-1).
  – A simple example, expressing that an element with x > 0 and y > 0 depends on elements (x-1, y-1), (x-1, y), and (x, y-1):

{"x > 0 && y > 0", {(["x-1","x-1"], ["y-1","y-1"]), (["x-1","x-1"], ["y","y"]), (["x","x"], ["y-1","y-1"])}};
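Unrolled by hand, that shape-list reads as the C check below. The function name and output layout are illustrative only; a generated pattern instance derives this logic from the specification automatically:

```c
#include <assert.h>

/* Dependences of element (x, y) under the shape-list
 * {"x > 0 && y > 0", {(x-1, y-1), (x-1, y), (x, y-1)}}.
 * Writes the dependence coordinates into deps and returns their count. */
int num_dependences(int x, int y, int deps[][2]) {
    if (!(x > 0 && y > 0))
        return 0;                       /* condition not met: no dependences */
    deps[0][0] = x - 1; deps[0][1] = y - 1;
    deps[1][0] = x - 1; deps[1][1] = y;
    deps[2][0] = x;     deps[2][1] = y - 1;
    return 3;
}
```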
Data-dependency specification
[Figure: data dependencies on a 10x10 mesh.]

{"y>x", {(["0","x-1"],["y","y"]), (["x","x"],["0","x-1"]), (["x","x"],["x","x"])}};
{"y<=x", {(["0","y-1"],["y","y"]), (["x","x"],["0","y-1"])}};

• In this example, the conditional shape-lists specify the data dependencies of the Lower/Upper Matrix-Decomposition Wavefront computation.
• The strengths of conditional shape-lists:
  – the user is not limited to pre-defined data-dependency specifications;
  – the user can express irregular data-dependency specifications.
Direct mesh-access
• In the user-side code, all mesh elements can be accessed directly:

void computeMeshBlock(double **mesh, int xMin, int xMax, int yMin, int yMax) {
    for (int x = xMin; x <= xMax; x++) {
        for (int y = yMin; y <= yMax; y++) {
            mesh[x][y] = f(mesh[x-1][y-1], mesh[x][y-1], mesh[x-1][y]);
        }
    }
}
• With direct mesh-access, the user does not need to refactor their sequential code with respect to mesh access. In contrast, indirect mesh-access requires such refactoring, since input elements are accessed through auxiliary data structures.
Engine-side code
• Engine-side code in the Wavefront pattern: element-level data dependencies, specified by the user, are automatically extended to mesh-block-level data dependencies.
Engine-side code
• The mesh-block-level data dependencies are used to establish a parallel-computation schedule.
Engine-side code
• The parallel computation schedule is refined by assigning mesh-blocks to the processors in round-robin fashion (shown).
• The parallel computation schedule is then complemented with a parallel communication schedule (not shown).
• The engine-side code executes the user-side code in accordance with the parallel computation and communication schedules.
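The round-robin refinement can be sketched as a simple cyclic deal of mesh-blocks over CPUs. The function below illustrates the policy only; it is not MAP3S's actual scheduler code:

```c
#include <assert.h>

/* Deal nblocks mesh-blocks out to nproc CPUs in round-robin order.
 * owner[b] receives the index of the CPU that will compute block b. */
void assign_round_robin(int nblocks, int nproc, int owner[]) {
    for (int b = 0; b < nblocks; b++)
        owner[b] = b % nproc;
}
```

With the nine blocks A–I and two CPUs, this alternates ownership block by block.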
Engine-side code
• Execution of user-side code by the engine-side code when using a sequential prologue and epilogue.
[Figure: execution timeline, time flowing downward, for CPU 0 and CPU 1. Both CPUs run the Prelude; CPU 0 runs Prologue(A), Body(A), Body(B), Body(C), Body(G) while CPU 1 runs Body(D) and Body(E); Epilogue(A,B,C,D,E,G) concludes.]
[Figure: a 2D mesh in dense mesh-representation and in sparse mesh-representation.]
Mesh representation
• The mesh can be represented using either a dense mesh-representation or a sparse mesh-representation.
• The sparse representation can have better locality and can distribute the memory footprint of the mesh among the nodes.
• A mesh's memory footprint can be as much of a problem as performance. Combining a parallel prologue and epilogue with the sparse mesh-representation both minimizes and distributes the mesh-storage memory footprint.
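One way to realize a sparse mesh-representation is a table of block pointers in which only locally needed blocks are ever allocated; dead or non-owned blocks stay NULL and cost no storage. The sketch below uses illustrative sizes and is not MAP3S's actual data structure:

```c
#include <assert.h>
#include <stdlib.h>

#define BLOCKS   9      /* e.g. a 3x3 grid of mesh-blocks A..I (illustrative) */
#define BLOCK_SZ 4      /* e.g. 2x2 elements per block (illustrative)        */

typedef struct {
    double *block[BLOCKS];              /* NULL = block not stored locally */
} sparse_mesh;

void sm_init(sparse_mesh *m) {
    for (int b = 0; b < BLOCKS; b++) m->block[b] = NULL;
}

/* Allocate a block's storage on first use; later touches reuse it. */
double *sm_touch(sparse_mesh *m, int b) {
    if (m->block[b] == NULL)
        m->block[b] = calloc(BLOCK_SZ, sizeof(double));
    return m->block[b];
}

/* Number of blocks actually resident, i.e. the local footprint in blocks. */
int sm_stored(const sparse_mesh *m) {
    int n = 0;
    for (int b = 0; b < BLOCKS; b++) n += (m->block[b] != NULL);
    return n;
}
```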
[Figure: sparse-representation footprint reduction across CPU 0 and CPU 1: the original mesh; the mesh without dead mesh-blocks stored; and the mesh storing only those non-owned mesh-blocks that are used by owned mesh-blocks.]
• Memory-footprint reduction varies. It is most effective for large Simulation computations.
Mesh representation
Experimental evaluation
• Problems:
  – 2D problems:
    • GoL: Game of Life (Simulation)
    • LUMD: lower/upper matrix decomposition (Wavefront)
  – 3D problems:
    • RTA: room-temperature annealing (Simulation)
    • MSA: multiple-sequence alignment (Wavefront)
• Hardware:
  – GigE: a 16-node cluster with Gigabit Ethernet
  – IB: a 128-node cluster with InfiniBand (limited to 64 nodes)
Experimental evaluation
• Speedup on GigE (x-axis: number of nodes; y-axis: speedup).
[Figure: speedup plots for GoL (2D Simulation), RTA (3D Simulation), LUMD (2D Wavefront), and MSA (3D Wavefront).]
• Performance gains on LUMD and MSA are worse than on GoL and RTA:
  – LUMD has non-uniform computation intensity, which limits parallelism.
  – MSA has limited computation granularity, which increases the relative overhead of communication and synchronization.
Experimental evaluation
• Speedup on IB (x-axis: number of nodes; y-axis: speedup).
[Figure: speedup plots for GoL (2D Simulation), RTA (3D Simulation), LUMD (2D Wavefront), and MSA (3D Wavefront).]
• Performance gains on LUMD and MSA are worse than on GoL and RTA, for the same reasons as on GigE.
Experimental evaluation
• Capability: the sparse mesh-representation distributes the mesh memory footprint across multiple nodes, which allows handling of meshes whose memory footprint exceeds the memory capacity of a single node.
• Using 16 nodes on GigE:

  Problem instance                Global mesh            Maximum local mesh
                                  memory-footprint (GB)  memory-footprint (GB)
  GoL  (131,072 x 131,072)        32                     3.0
  RTA  (1,024 x 1,024 x 1,024)    32                     4.4
  LUMD (40,132 x 40,132)          12                     3.0
  MSA  (2,048 x 2,048 x 2,048)    32                     3.0

• The LUMD memory-footprint reduction is limited because the computation of each element depends on a larger number of elements.
Experimental evaluation
• What we learned:
  – Dense meshes with large computation granularity: MAP3S delivers speedups in the range of 10 to 12 on 16 nodes, and in the range of 10 to 43 on 64 nodes.
  – Sparse meshes: smaller speedups, but per-node memory consumption is reduced by 20% to 50%.
The End