Creating Coarse-grained Parallelism for Loop Nests
Chapter 6, Sections 6.3 through 6.9
Yaniv Carmeli
Last time…

Single loop methods
Privatization
Loop distribution
Alignment
Loop fusion
This time…

Perfect loop nests:
Loop Interchange
Loop Selection
Loop Reversal
Loop Skewing
Profitability-Based Methods
This time… (cont.)

Imperfectly Nested Loops:
Multilevel Loop Fusion
Parallel Code Generation

Packaging Parallelism:
Strip Mining
Pipeline Parallelism
Guided Self-Scheduling
Loop Interchange

Vectorization: OK. Parallelization: problematic.

DO I = 1, N
  DO J = 1, M
    A(I+1, J) = A(I, J) + B(I, J)
  ENDDO
ENDDO

D = ( < , = )

After interchange – vectorization: bad; parallelization: good.

PARALLEL DO J = 1, M
  DO I = 1, N
    A(I+1, J) = A(I, J) + B(I, J)
  ENDDO
END PARALLEL DO
Loop Interchange (Cont.)

DO I = 1, N
  DO J = 1, M
    A(I+1, J+1) = A(I, J) + B(I, J)
  ENDDO
ENDDO

D = ( < , < )

Loop interchange doesn't help, as both loops carry the dependence!!

Best we can do:

DO I = 1, N
  PARALLEL DO J = 1, M
    A(I+1, J+1) = A(I, J) + B(I, J)
  END PARALLEL DO
ENDDO

When can a loop be moved to the outermost position in the nest, and be guaranteed to be parallel?
Loop Interchange (Cont.)

Theorem: In a perfect nest of loops, a particular loop can be parallelized at the outermost level if and only if the column of the direction matrix for that loop contains only '=' entries.

Proof.
If: a column with only '=' entries represents a loop that can be interchanged to the outermost position and carries no dependence.
Only if: suppose there is a non-'=' entry in that column. If it is '>', the loop can't be interchanged to the outermost position (the dependence would be reversed). If it is '<', the loop can be interchanged, but it still carries the dependence, so parallelization is not possible anyway.
Loop Interchange (Cont.)

Working with the direction matrix:
1. Move a loop whose column contains only '=' entries into the outermost position and parallelize it. Remove that column from the matrix.
2. Move the loop with the most '<' entries into the next outermost position and sequentialize it. Eliminate its column and any rows representing the dependences it carries.
3. Repeat from step 1.
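The strategy above can be sketched in Python (an illustrative reconstruction, not code from the book; the list-of-lists matrix encoding and loop names are my own):

```python
def parallelize_nest(matrix, loops):
    """Order loops by the direction-matrix strategy: an all-'=' column may go
    outermost as a parallel loop; otherwise sequentialize the column with the
    most '<' entries and drop the dependence rows it carries."""
    matrix = [list(row) for row in matrix]
    loops = list(loops)
    schedule = []                      # (loop name, 'parallel' | 'sequential')
    while loops:
        # Step 1: a column of all '=' entries carries no dependence.
        col = next((j for j in range(len(loops))
                    if all(row[j] == '=' for row in matrix)), None)
        if col is not None:
            kind = 'parallel'
        else:
            # Step 2: sequentialize the loop with the most '<' entries.
            col = max(range(len(loops)),
                      key=lambda j: sum(row[j] == '<' for row in matrix))
            kind = 'sequential'
            # Rows with '<' in this column are now carried dependences.
            matrix = [row for row in matrix if row[col] != '<']
        schedule.append((loops[col], kind))
        loops.pop(col)
        matrix = [row[:col] + row[col + 1:] for row in matrix]
    return schedule

# The three-loop example on the next slide, columns (I, J, K):
m = [['<', '=', '='],
     ['=', '=', '<'],
     ['<', '<', '<']]
print(parallelize_nest(m, ['I', 'J', 'K']))
```

On this matrix the sketch sequentializes I, then parallelizes J, then sequentializes K, matching the transformed code on the next slide.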
Loop Interchange (Cont.) Example:

DO I = 1, N
  DO J = 1, M
    DO K = 1, L
      A(I+1, J, K) = A(I, J, K) + X1
      B(I, J, K+1) = B(I, J, K) + X2
      C(I+1, J+1, K+1) = C(I, J, K) + X3
    ENDDO
  ENDDO
ENDDO

Direction matrix (columns I, J, K):
< = =
= = <
< < <

No column is all '='. The I column has the most '<' entries, so I is sequentialized and the rows it carries (the first and third) are eliminated; the J column then contains only '=' and can be parallelized:

DO I = 1, N
  PARALLEL DO J = 1, M
    DO K = 1, L
      A(I+1, J, K) = A(I, J, K) + X1
      B(I, J, K+1) = B(I, J, K) + X2
      C(I+1, J+1, K+1) = C(I, J, K) + X3
    ENDDO
  END PARALLEL DO
ENDDO
Loop Selection – Optimal?

Is the approach of selecting the loop with the most '<' directions optimal?

< < = =
< = < =
< = = <
= < = =
= = < =
= = = <

For this matrix it will result in NO parallelization: the first column has the most '<' entries, but after sequentializing it every remaining column still contains a '<'. Other selections do allow parallelization – sequentializing the last three loops covers all six dependences and leaves the first loop parallel.

Is it possible to derive a selection heuristic that provides optimal code?
Loop Selection

The problem of loop selection is NP-complete, so loop selection is best done by a heuristic:

Favor the selection of loops that must be sequentialized before parallelism can be uncovered.
Heuristic Loop Selection (Cont.)

Example of the principles involved in heuristic loop selection:

DO I = 2, N
  DO J = 2, M
    DO K = 2, L
      A(I, J, K) = A(I, J-1, K) + A(I-1, J, K-1) + A(I, J+1, K+1) + A(I-1, J, K+1)
    ENDDO
  ENDDO
ENDDO

Direction matrix (columns I, J, K):
= < =
< = <
= < <
< = >

The I-loop must be sequentialized because of the fourth dependence.
The J-loop must be sequentialized because of the first dependence.

DO J = 2, M
  DO I = 2, N
    PARALLEL DO K = 2, L
      A(I, J, K) = A(I, J-1, K) + A(I-1, J, K-1) + A(I, J+1, K+1) + A(I-1, J, K+1)
    END PARALLEL DO
  ENDDO
ENDDO
Loop Reversal

Using loop reversal to create coarse-grained parallelism. Consider:

DO I = 2, N+1
  DO J = 2, M+1
    DO K = 1, L
      A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1)
    ENDDO
  ENDDO
ENDDO

Direction matrix (columns I, J, K):
= < >
< = >

Reversing the K loop turns the '>' entries into '<':

DO I = 2, N+1
  DO J = 2, M+1
    DO K = L, 1, -1
      A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1)
    ENDDO
  ENDDO
ENDDO

= < <
< = <

Now K can be interchanged to the outermost position, where it carries both dependences, leaving I and J parallel:

DO K = L, 1, -1
  PARALLEL DO I = 2, N+1
    PARALLEL DO J = 2, M+1
      A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1)
    END PARALLEL DO
  END PARALLEL DO
ENDDO
Loop Skewing

DO I = 2, N+1
  DO J = 2, M+1
    DO K = 1, L
      A(I, J, K) = A(I, J-1, K) + A(I-1, J, K)
      B(I, J, K+1) = B(I, J, K) + A(I, J, K)
    ENDDO
  ENDDO
ENDDO

Direction matrix (columns I, J, K) and distance matrix:

= < =        0 1 0
< = =        1 0 0
= = <        0 0 1
= = =        0 0 0

Skewing with k = K + I + J yields:

DO I = 2, N+1
  DO J = 2, M+1
    DO k = I+J+1, I+J+L
      A(I, J, k-I-J) = A(I, J-1, k-I-J) + A(I-1, J, k-I-J)
      B(I, J, k-I-J+1) = B(I, J, k-I-J) + A(I, J, k-I-J)
    ENDDO
  ENDDO
ENDDO

New direction matrix:
= < <
< = <
= = <
= = =
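As a sanity check (mine, not the book's): the skew k = K + I + J is a pure re-indexing of the K loop, so the skewed nest must compute exactly the same values. A small Python simulation of the two nests, with arbitrary sizes and 1.0-initialized arrays, confirms this:

```python
def run(skew, N=4, M=4, L=4):
    """Run the loop nest either in its original form (DO K = 1, L) or in
    the skewed form (DO k = I+J+1, I+J+L with K = k - I - J)."""
    A = [[[1.0] * (L + 2) for _ in range(M + 2)] for _ in range(N + 2)]
    B = [[[1.0] * (L + 2) for _ in range(M + 2)] for _ in range(N + 2)]
    for I in range(2, N + 2):                 # DO I = 2, N+1
        for J in range(2, M + 2):             # DO J = 2, M+1
            ks = (range(I + J + 1, I + J + L + 1) if skew
                  else range(1, L + 1))
            for k in ks:
                K = k - I - J if skew else k  # undo the skew
                A[I][J][K] = A[I][J - 1][K] + A[I - 1][J][K]
                B[I][J][K + 1] = B[I][J][K] + A[I][J][K]
    return A, B

assert run(skew=False) == run(skew=True)
```

Within a fixed (I, J) iteration the skewed k visits the same K values in the same order, which is why the results are identical; the payoff of skewing only appears after the skewed loop is interchanged outward.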
Loop Skewing - Main Benefits

Eliminates '>' signs in the matrix.

Transforms the skewed loop in such a way that, after outward interchange, it carries all dependences formerly carried by the loops with respect to which it was skewed.
Loop Skewing - Drawback

The resulting parallelism is usually unbalanced (the resulting loops execute a varying number of iterations each time). As we shall see, this is not really a problem for asynchronous parallelism (unlike vectorization).
Loop Skewing (Cont.) - Updated strategy

1. Parallelize the outermost loop if possible.
2. Sequentialize at most one outer loop to find parallelism in the next loop.
3. If 1 and 2 fail, try skewing.
4. If 3 fails, sequentialize the loop that can be moved to the outermost position and covers the most other loops.
In Practice

Sometimes we get much worse execution times than we would have gotten by parallelizing fewer or different loops.
Profitability-Based Methods

Use a static performance estimation function. It doesn't need to be accurate, just good at selecting the better of two alternatives.

Key considerations:
Cost of memory references
Sufficiency of granularity
Profitability-Based Methods (Cont.)

It is impractical to choose from all possible arrangements. Instead, consider only a subset of the possible code arrangements, based on properties of the cost function. In our case: consider only the innermost loop.
Profitability-Based Methods (Cont.)

A possible cost evaluation heuristic:

1. Subdivide all the references in the loop body into reference groups. Two references are in the same group if:
   there is a loop-independent dependence between them, or
   there is a constant-distance loop-carried dependence between them.
Profitability-Based Methods (Cont.)

A possible cost evaluation heuristic:

2. Determine whether subsequent accesses to the same reference are:
   Loop invariant – cost = 1
   Unit stride – cost = number of iterations / cache line size
   Non-unit stride – cost = number of iterations
Profitability-Based Methods (Cont.)

A possible cost evaluation heuristic:

3. Compute the loop cost for a candidate innermost loop l: aggregate the reference costs over all reference groups, and multiply by the number of iterations of all the other loops:

   loop_cost(l) = ( sum over reference groups g of cost(g, l) ) × ( product over loops h ≠ l of the number of iterations of h )
Profitability-Based Methods: Example

DO I = 1, N
  DO J = 1, N
    DO K = 1, N
      C(I, J) = C(I, J) + A(I, K) * B(K, J)
    ENDDO
  ENDDO
ENDDO
Profitability-Based Methods: Example (Cont.)

Reference costs for each choice of innermost loop (L = cache line size; Fortran arrays are stored column-major, so the first subscript is the fast-varying one):

Innermost loop    C      A      B      COST
I  (best)         N/L    N/L    1      2N³/L + N²
J  (worst)        N      1      N      2N³ + N²
K                 1      N      N/L    N³(1 + 1/L) + N²
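The table can be reproduced with a toy Python encoding of the heuristic (my own formulation, not the book's code: each reference is represented by the tuple of loops appearing in its subscripts, column-major storage assumed, and costs are kept symbolic):

```python
def ref_cost(subscripts, inner):
    """Symbolic cost of one reference group for a candidate innermost loop."""
    if inner not in subscripts:
        return '1'        # loop invariant
    if subscripts[0] == inner:
        return 'N/L'      # unit stride: the first subscript varies fastest
    return 'N'            # non-unit stride

def loop_cost(refs, inner):
    # Sum of reference costs, times N^2 iterations of the two other loops.
    return ' + '.join(ref_cost(r, inner) for r in refs) + '  (x N^2)'

refs = [('I', 'J'),   # C(I, J)
        ('I', 'K'),   # A(I, K)
        ('K', 'J')]   # B(K, J)
for inner in 'IJK':
    print(inner, ':', loop_cost(refs, inner))
```

This prints the per-reference costs of the table (e.g. N/L + N/L + 1 for innermost I), matching the COST column once multiplied out.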
Profitability-Based Methods: Example (Cont.)

Reorder the loops from innermost to outermost by increasing loop cost: I, K, J.

We can't always achieve the desired loop order (some permutations are illegal), so we try to find the legal permutation closest to the desired one:

DO J = 1, N
  DO K = 1, N
    DO I = 1, N
      C(I, J) = C(I, J) + A(I, K) * B(K, J)
    ENDDO
  ENDDO
ENDDO
Profitability-Based Methods (Cont.)

Goal: Given a desired loop order and a direction matrix for a loop nest, find the legal permutation closest to the desired one.

Method: Until there are no more loops, choose, from all the loops that can be interchanged to the outermost position, the one that is outermost in the desired permutation; then drop that loop.

It can be shown that if a legal permutation with the desired innermost loop in the innermost position exists, this algorithm will find such a permutation.
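A possible Python rendering of this method (my own formulation: a loop may legally move outermost when no remaining dependence row has a '>' entry in its column, and rows carried by a chosen loop are dropped):

```python
def closest_permutation(matrix, loops, desired):
    """Greedily build the legal loop permutation closest to `desired`.
    `matrix` holds direction rows over `loops`; returns None if stuck."""
    rows = [dict(zip(loops, r)) for r in matrix]
    remaining = list(loops)
    perm = []
    while remaining:
        # Loops that may legally be placed outermost right now.
        legal = [l for l in remaining if all(r[l] != '>' for r in rows)]
        if not legal:
            return None
        # Among those, pick the one that is outermost in the desired order.
        pick = min(legal, key=desired.index)
        perm.append(pick)
        remaining.remove(pick)
        # Dependences carried by the chosen loop are now satisfied.
        rows = [r for r in rows if r[pick] != '<']
    return perm

# Matrix multiply: the only dependence, on C(I, J), has direction (=, =, <)
# in (I, J, K) order, so the desired order J, K, I is itself legal.
print(closest_permutation([['=', '=', '<']], ['I', 'J', 'K'], ['J', 'K', 'I']))
```

A case where the desired order is illegal: for a single row ('>', '<') over loops (I, J), I cannot go outermost, so the sketch returns ['J', 'I'] even when ['I', 'J'] is desired.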
Profitability-Based Methods (Cont.)

DO J = 1, N
  DO K = 1, N
    DO I = 1, N
      C(I, J) = C(I, J) + A(I, K) * B(K, J)
    ENDDO
  ENDDO
ENDDO

For performance reasons, the compiler may mark the inner loop as "not meant for parallelization" (sequential execution of the innermost loop exploits locality in memory accesses).
Multilevel Loop Fusion
Commonly used for imperfect loop nests
Used after maximal loop distribution
Multilevel Loop Fusion

DO I = 1, N
  DO J = 1, M
    A(I, J+1) = A(I, J) + C
    B(I+1, J) = B(I, J) + D
  ENDDO
ENDDO

After maximal distribution:

DO I = 1, N
  DO J = 1, M
    A(I, J+1) = A(I, J) + C
  ENDDO
ENDDO
DO I = 1, N
  DO J = 1, M
    B(I+1, J) = B(I, J) + D
  ENDDO
ENDDO

Parallelized:

PARALLEL DO I = 1, N
  DO J = 1, M
    A(I, J+1) = A(I, J) + C
  ENDDO
END PARALLEL DO
PARALLEL DO J = 1, M
  DO I = 1, N
    B(I+1, J) = B(I, J) + D
  ENDDO
END PARALLEL DO

After distribution each nest is better with a different outer loop – can't fuse!
Multilevel Loop Fusion (Cont.)

DO I = 1, N
  DO J = 1, M
    A(I, J) = A(I, J) + X
    B(I+1, J) = A(I, J) + B(I, J)
    C(I, J+1) = A(I, J) + C(I, J)
    D(I+1, J) = B(I+1, J) + C(I, J) + D(I, J)
  ENDDO
ENDDO

After maximal distribution:

DO I = 1, N
  DO J = 1, M
    A(I, J) = A(I, J) + X
  ENDDO
ENDDO
DO I = 1, N
  DO J = 1, M
    B(I+1, J) = A(I, J) + B(I, J)
  ENDDO
ENDDO
DO I = 1, N
  DO J = 1, M
    C(I, J+1) = A(I, J) + C(I, J)
  ENDDO
ENDDO
DO I = 1, N
  DO J = 1, M
    D(I+1, J) = B(I+1, J) + C(I, J) + D(I, J)
  ENDDO
ENDDO

Parallel loops available: A – both I and J; B – J; C – I; D – J.

Which loop should be fused into the A loop?
Multilevel Loop Fusion (Cont.)

Fusing the A loop with the B loop (groups: AB – J; C – I; D – J):

PARALLEL DO J = 1, M
  DO I = 1, N
    A(I, J) = A(I, J) + X
    B(I+1, J) = A(I, J) + B(I, J)
  ENDDO
END PARALLEL DO

PARALLEL DO I = 1, N
  DO J = 1, M
    C(I, J+1) = A(I, J) + C(I, J)
  ENDDO
END PARALLEL DO

PARALLEL DO J = 1, M
  DO I = 1, N
    D(I+1, J) = B(I+1, J) + C(I, J) + D(I, J)
  ENDDO
END PARALLEL DO

2 barriers.
Multilevel Loop Fusion (Cont.)

Fusing the A loop with the C loop (groups: AC – I; B – J; D – J):

PARALLEL DO I = 1, N
  DO J = 1, M
    A(I, J) = A(I, J) + X
    C(I, J+1) = A(I, J) + C(I, J)
  ENDDO
END PARALLEL DO

PARALLEL DO J = 1, M
  DO I = 1, N
    B(I+1, J) = A(I, J) + B(I, J)
  ENDDO
END PARALLEL DO

PARALLEL DO J = 1, M
  DO I = 1, N
    D(I+1, J) = B(I+1, J) + C(I, J) + D(I, J)
  ENDDO
END PARALLEL DO

Now we can also fuse B with D:

PARALLEL DO I = 1, N
  DO J = 1, M
    A(I, J) = A(I, J) + X
    C(I, J+1) = A(I, J) + C(I, J)
  ENDDO
END PARALLEL DO

PARALLEL DO J = 1, M
  DO I = 1, N
    B(I+1, J) = A(I, J) + B(I, J)
    D(I+1, J) = B(I+1, J) + C(I, J) + D(I, J)
  ENDDO
END PARALLEL DO

1 barrier.
Multilevel Loop Fusion (Cont.)

Decision making needs look-ahead. Strategy: fuse with the loop that cannot be fused with one of its successors.

Rationale: if a loop can't be fused with its successors, a barrier will be formed after it anyway. In the example, C (parallel in I) cannot be fused with its successor D (parallel in J), so a barrier after C is inevitable – hence fuse A with C.
Parallel Code Generation

Code generation scheme – Parallelize(l, D):
1. Try the methods for perfect nests (loop interchange, loop skewing, loop reversal), and stop if parallelism is found.
2. If the nest can be distributed: distribute, run recursively on the distributed nests, and merge.
3. Else sequentialize the outer loop, eliminate the dependences it carries, and try recursively on each of the loops nested in it.
Parallel Code Generation

procedure Parallelize(l, Dl);
  ParallelizeNest(l, success);  // try methods for perfect nests
  if ¬success then begin
    if l can be distributed then begin
      distribute l into loop nests l1, l2, …, ln;
      for i := 1 to n do begin
        Parallelize(li, Di);
      end
      Merge({l1, l2, …, ln});
    end
Parallel Code Generation (Cont.)

    else begin  // if l cannot be distributed
      for each outer loop l0 nested in l do begin
        let D0 be the set of dependences between statements in l0, less the dependences carried by l;
        Parallelize(l0, D0);
      end
      let S be the set of outer loops and statements left in l;
      if ||S|| > 1 then Merge(S);
    end
  end
end Parallelize
Parallel Code Generation (Cont.)

DO J = 1, M
  DO I = 1, N
    A(I+1, J+1) = A(I+1, J) + C
    X(I, J) = A(I, J) + C
  ENDDO
ENDDO

Both loops carry dependence – loop interchange will not find sufficient parallelism. Try distribution:

DO J = 1, M
  DO I = 1, N
    A(I+1, J+1) = A(I+1, J) + C
  ENDDO
ENDDO
DO J = 1, M
  DO I = 1, N
    X(I, J) = A(I, J) + C
  ENDDO
ENDDO

In the first nest the I loop can be parallelized:

PARALLEL DO I = 1, N
  DO J = 1, M
    A(I+1, J+1) = A(I+1, J) + C
  ENDDO
END PARALLEL DO

Type: (I-loop, parallel)

In the second nest both loops can be parallelized:

PARALLEL DO J = 1, M
  DO I = 1, N   ! Left sequential for memory hierarchy
    X(I, J) = A(I, J) + C
  ENDDO
END PARALLEL DO

Type: (J-loop, parallel)

Now fusing… Different types – can't fuse.
Parallel Code Generation (Cont.)

DO I = 1, N
  DO J = 1, M
    A(I, J) = A(I, J) + X
    B(I+1, J) = A(I, J) + B(I, J)
    C(I, J+1) = A(I, J) + C(I, J)
    D(I+1, J) = B(I+1, J) + C(I, J) + D(I, J)
  ENDDO
ENDDO

After distribution and parallelization, the types are: A – (J-loop, parallel); B – (J-loop, parallel); C – (I-loop, parallel); D – (J-loop, parallel):

PARALLEL DO J = 1, M
  DO I = 1, N   ! Sequentialized for memory hierarchy
    A(I, J) = A(I, J) + X
  ENDDO
END PARALLEL DO
PARALLEL DO J = 1, M
  DO I = 1, N
    B(I+1, J) = A(I, J) + B(I, J)
  ENDDO
END PARALLEL DO
PARALLEL DO I = 1, N
  DO J = 1, M
    C(I, J+1) = A(I, J) + C(I, J)
  ENDDO
END PARALLEL DO
PARALLEL DO J = 1, M
  DO I = 1, N
    D(I+1, J) = B(I+1, J) + C(I, J) + D(I, J)
  ENDDO
END PARALLEL DO

A and B have the same type and are fused:

PARALLEL DO J = 1, M
  DO I = 1, N   ! Sequentialized for memory hierarchy
    A(I, J) = A(I, J) + X
    B(I+1, J) = A(I, J) + B(I, J)
  ENDDO
END PARALLEL DO
PARALLEL DO I = 1, N
  DO J = 1, M
    C(I, J+1) = A(I, J) + C(I, J)
  ENDDO
END PARALLEL DO
PARALLEL DO J = 1, M
  DO I = 1, N
    D(I+1, J) = B(I+1, J) + C(I, J) + D(I, J)
  ENDDO
END PARALLEL DO
Erlebacher

DO J = 1, JMAXD
  DO I = 1, IMAXD
    F(I, J, 1) = F(I, J, 1) * B(1)
  ENDDO
ENDDO
DO K = 2, N-1
  DO J = 1, JMAXD
    DO I = 1, IMAXD
      F(I, J, K) = (F(I, J, K) - A(K) * F(I, J, K-1)) * B(K)
    ENDDO
  ENDDO
ENDDO
DO J = 1, JMAXD
  DO I = 1, IMAXD
    TOT(I, J) = 0.0
  ENDDO
ENDDO
DO J = 1, JMAXD
  DO I = 1, IMAXD
    TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
  ENDDO
ENDDO
DO K = 2, N-1
  DO J = 1, JMAXD
    DO I = 1, IMAXD
      TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
    ENDDO
  ENDDO
ENDDO

The J loop of every nest can be parallelized (PARALLEL DO J = 1, JMAXD … END PARALLEL DO in each of the five nests).
Erlebacher (Cont.)

All five nests have the same type, so they are merged into a single parallel J loop:

PARALLEL DO J = 1, JMAXD
L1:  DO I = 1, IMAXD
       F(I, J, 1) = F(I, J, 1) * B(1)
     ENDDO
L2:  DO K = 2, N-1
       DO I = 1, IMAXD
         F(I, J, K) = (F(I, J, K) - A(K) * F(I, J, K-1)) * B(K)
       ENDDO
     ENDDO
L3:  DO I = 1, IMAXD
       TOT(I, J) = 0.0
     ENDDO
L4:  DO I = 1, IMAXD
       TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
     ENDDO
L5:  DO K = 2, N-1
       DO I = 1, IMAXD
         TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
       ENDDO
     ENDDO
END PARALLEL DO

Fusion graph over the inner nests: L1 → L2, L1 → L4, L3 → L4, and L2, L4 → L5.
Erlebacher (Cont.)

PARALLEL DO J = 1, JMAXD
  DO I = 1, IMAXD
    F(I, J, 1) = F(I, J, 1) * B(1)
    TOT(I, J) = 0.0
    TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
  ENDDO
  DO K = 2, N-1
    DO I = 1, IMAXD
      F(I, J, K) = (F(I, J, K) - A(K) * F(I, J, K-1)) * B(K)
      TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
    ENDDO
  ENDDO
END PARALLEL DO
Packaging of Parallelism

There is a trade-off between parallelism and granularity of synchronization: with larger-granularity work units, synchronization needs to be done less frequently, but at the cost of less parallelism and poorer load balance.
Strip Mining

Converts available parallelism into a form more suitable for the hardware.

DO I = 1, N
  A(I) = A(I) + B(I)
ENDDO

becomes:

k = CEIL(N / P)
PARALLEL DO I = 1, N, k
  DO i = I, MIN(I + k - 1, N)
    A(i) = A(i) + B(i)
  ENDDO
END PARALLEL DO

Interruptions may be disastrous: with one strip per processor, an interrupted processor delays the whole loop. The value of P is unknown until runtime, so strip mining is often handled by special hardware (Convex C2 and C3).
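In Python the same strip-mined decomposition might look like this (illustrative sketch; the outer strip loop is the one that would run in parallel, one strip per processor):

```python
from math import ceil

def strip_mined_add(a, b, p=4):
    """Compute a[i] += b[i] in strips of size ceil(n / p), mirroring the
    Fortran strip-mined loop above (here executed serially)."""
    n = len(a)
    k = ceil(n / p)                                 # k = CEIL(N / P)
    for start in range(0, n, k):                    # PARALLEL DO I = 1, N, k
        for i in range(start, min(start + k, n)):   # DO i = I, MIN(I+k-1, N)
            a[i] = a[i] + b[i]
    return a

print(strip_mined_add([1, 2, 3, 4, 5], [10, 10, 10, 10, 10]))
# -> [11, 12, 13, 14, 15]
```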
Strip Mining (Cont.)

What if the execution time varies among iterations?

PARALLEL DO I = 1, N
  DO J = 2, I
    A(J, I) = A(J-1, I) * 2.0
  ENDDO
END PARALLEL DO

Here the work grows with I, so equal-sized strips carry unequal work. Solution: smaller unit size to allow a more balanced distribution.

(Figure: the iteration space split at N/8, 3N/8, 5N/8 and 7N/8.)
Pipeline Parallelism

The Fortran DOACROSS construct pipelines parallel loop iterations with cross-iteration synchronization.

Useful where parallelization is not available; has high synchronization costs.

DOACROSS I = 2, N
S1:   A(I) = B(I) + C(I)
      POST(EV(I))
      IF (I > 2) WAIT(EV(I-1))
S2:   C(I) = A(I-1) + A(I)
ENDDO
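The DOACROSS pattern can be imitated with threads and per-iteration events (an illustrative sketch, not how a Fortran runtime implements it; indices are shifted to 0-based, with A[0] playing the role of the input value A(1)):

```python
import threading

def doacross(A, B, C):
    """Pipelined iterations: each thread POSTs its event after S1 and WAITs
    on the previous iteration's event before S2, as in the DOACROSS loop."""
    n = len(A)
    ev = [threading.Event() for _ in range(n)]

    def body(i):
        A[i] = B[i] + C[i]          # S1: A(I) = B(I) + C(I)
        ev[i].set()                 # POST(EV(I))
        if i > 1:                   # IF (I > 2) WAIT(EV(I-1))
            ev[i - 1].wait()
        C[i] = A[i - 1] + A[i]      # S2: C(I) = A(I-1) + A(I)

    threads = [threading.Thread(target=body, args=(i,)) for i in range(1, n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return A, C
```

The event protocol makes the result deterministic even though S1 of all iterations may run concurrently: iteration i only reads A[i-1] after iteration i-1 has posted its event.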
Scheduling Parallel Work

Trade-off: load balance vs. little synchronization.
Scheduling Parallel Work (Cont.)

Bakery-counter scheduling: moderate synchronization overhead.

With N = number of iterations, B = execution time of one iteration, p = number of processors, and σ0 = constant overhead per processor, parallel execution is slower than serial execution if

  N·B/p + σ0 > N·B
Guided Self-Scheduling

Incorporates some level of static scheduling to guide dynamic self-scheduling. Schedules groups of iterations, going from large to small chunks of work.

The number of iterations dispensed at time t follows:

  x_t = ⌈N_t / p⌉,   N_{t+1} = N_t − x_t,   N_1 = N
Guided Self-Scheduling (Cont.)

GSS example (20 iterations, 4 processors): the dispensed chunk sizes are 5, 4, 3, 2, 2, 1, 1, 1, 1.

Not completely balanced.

Required synchronizations: 9. With a bakery counter: 20.
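The dispatch rule can be simulated to reproduce these numbers (assuming the standard GSS rule x_t = ⌈remaining / p⌉; the optional k gives a minimum block size, as in GSS(k)):

```python
from math import ceil

def gss_chunks(n, p, k=1):
    """Chunk sizes dispensed by guided self-scheduling for n iterations on
    p processors; GSS(k) enforces a minimum block size of k."""
    chunks = []
    while n > 0:
        x = min(max(k, ceil(n / p)), n)  # never dispense more than remains
        chunks.append(x)
        n -= x
    return chunks

print(gss_chunks(20, 4))      # -> [5, 4, 3, 2, 2, 1, 1, 1, 1]  (9 dispatches)
print(gss_chunks(20, 4, k=2)) # no chunk smaller than 2
```

For 20 iterations on 4 processors this yields 9 dispatches (hence 9 synchronizations), versus 20 for a bakery counter that hands out one iteration at a time.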
Guided Self-Scheduling (Cont.)

In the example, the last 4 allocations are each a single iteration. Coincidence? The last p−1 allocations will always be of a single iteration.

GSS(2): no block of iterations smaller than 2.
GSS(k): no block is smaller than k; the dispatch rule becomes

  x_t = max(k, ⌈N_t / p⌉),   N_{t+1} = N_t − x_t
Yaniv Carmeli
B.A. in CS

Thanks for your attention!