![Page 1: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/1.jpg)
COMP 515: Advanced Compilation for Vector and Parallel Processors
Vivek SarkarDepartment of Computer ScienceRice [email protected]
http://www.cs.rice.edu/~vsarkar/comp515
COMP 515 Lecture 10 22 September, 2011
1
![Page 2: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/2.jpg)
Acknowledgments• Slides from previous offerings of COMP 515 by Prof. Ken
Kennedy—http://www.cs.rice.edu/~ken/comp515/
2
![Page 3: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/3.jpg)
Enhancing Fine-Grained Parallelism (contd)
Chapter 5 of Allen and Kennedy
3
![Page 4: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/4.jpg)
Run-time Symbolic Resolution
• “Breaking Conditions”
DO I = 1, N !
! A(I+L) = A(I) + B(I)
ENDDO
Transformed to..
IF(L.LE.0 .OR. L.GT.N) THEN
! A(L+1:N+L) = A(1:N) + B(1:N)
ELSE
! DO I = 1, N !
! A(I+L) = A(I) + B(I)
! ENDDO
ENDIF
4
![Page 5: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/5.jpg)
Run-time Symbolic Resolution
• Identifying minimum number of breaking conditions to break a recurrence is NP-hard
• Heuristic:—Identify when a critical dependence can be conditionally eliminated
via a breaking condition
5
![Page 6: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/6.jpg)
Loop Skewing
• Reshape Iteration Space to uncover parallelism
DO I = 1, N
! DO J = 1, N
! ! ! (=,<)!
S:! A(I,J) = A(I-1,J) + A(I,J-1)
! ! (<,=)
! ENDDO
ENDDO
Parallelism not apparent
6
![Page 7: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/7.jpg)
Loop Skewing• Dependence Pattern before loop skewing
7
![Page 8: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/8.jpg)
Loop Skewing• Do the following transformation called loop skewing
jj=J+I or J=jj-I
DO I = 1, N
! DO jj = I+1, I+N !
J = jj - I (=,<)
S:! A(I,J) = A(I-1,J) + A(I,J-1)
! ! (<,<)
! ENDDO
ENDDO
Note: Direction Vector Changes, but statement body remains the same
(Examples in textbook usually copy propagate J=jj-I in all uses of J)
8
![Page 9: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/9.jpg)
Loop Skewing• Dependence pattern after loop skewing
9
![Page 10: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/10.jpg)
Loop Skewing
• Reshape Iteration Space to uncover parallelism
DO I = 1, N
! DO J = 1, N
! ! ! (=,<)!
S:! A(I,J) = A(I-1,J) + A(I,J-1)
! ! ! (<,=)
! ENDDO
ENDDO
Parallelism not apparent
10
![Page 11: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/11.jpg)
Loop SkewingDO I = 1, N ! DV = { (<,<), (=, <) }
! DO j = I+1, I+N
S:!! A(I,j-I) = A(I-1,j-I) + A(I,j-I-1)
! ENDDO
ENDDO
Loop interchange to..
DO j = 2, N+N ! DV = { (<,<), (<, =) }
! DO I = max(1,j-N), min(N,j-1)
S:!! A(I,j-I) = A(I-1,j-I) + A(I,j-I-1)
! ENDDO
ENDDO
Vectorize to..
DO j = 2, N+N
! FORALL I = max(1,j-N), min(N,j-1)
S:!! A(I,j-I) = A(I-1,j-I) + A(I,j-I-1)
! END FORALL
ENDDO
11
![Page 12: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/12.jpg)
Loop Skewing• Disadvantages:
— Varying vector length
– Not profitable if N is small— If vector startup time is more than speedup time, this is not profitable— Vector bounds must be recomputed on each iteration of outer loop
• Apply loop skewing if everything else fails
• See Unimodular transformations in Sarkar-Thekkath-92 paper (and related work) for generalization of loop skewing
12
![Page 13: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/13.jpg)
Chapter 5: Putting It All Together• Good Part
—Many transformations imply more choices to exploit parallelism
• Bad Part—Choosing the right transformation—How to automate transformation selection process?—Interference between transformations
13
![Page 14: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/14.jpg)
Putting It All Together• Example of Interference
DO I = 1, N
! DO J = 1, M !
! ! S(I) = S(I) + A(I,J)
! ENDDO
ENDDO
Sum Reduction gives..
DO I = 1, N !
! S(I) = S(I) + SUM (A(I,1:M))
ENDDO
While Loop Interchange and Vectorization gives..
DO J = 1, N !
! S(1:N) = S(1:N) + A(1:N,J)
ENDDO
14
![Page 15: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/15.jpg)
Putting It All Together• Any algorithm which tries to tie all transformations must
—Take a global view of transformed code—Know the architecture of the target machine
• Goal of our algorithm—Finding ONE good vector loop in each loop nest [works well for most
vector register architectures]
15
![Page 16: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/16.jpg)
Unified Framework• Detection: finding ALL loops for EACH statement that can be
run in vector
• Selection: choosing best loop for vector execution for EACH statement
• Transformation: carrying out the transformations necessary to vectorize the selected loop
• See Section 5.10 for details
16
![Page 17: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/17.jpg)
Performance on Benchmarks
PFC = Parallel Fortran Converter tool developed at Rice by Allen & Kennedy17
![Page 18: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/18.jpg)
Test 171: One example that PFC was unable to vectorize
DO I = 1, N
! ! A(I*N) = A(I*N) + B(I)
ENDDO
18
![Page 19: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/19.jpg)
Coarse-Grain Parallelism
Chapter 6 of Allen and Kennedy
19
![Page 20: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/20.jpg)
Introduction
• Previously, our transformations targeted vector and superscalar architectures.
• In this lecture, we worry about transformations for symmetric multiprocessor machines.
• The difference between these transformations tends to be one of granularity.
20
![Page 21: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/21.jpg)
p1
Memory
Bus
p2 p3 p4
Review• SMP machines have multiple
processors all accessing a central memory.
• The processors are unrelated, and can run separate processes.
• Starting processes and synchonrization between proccesses is expensive.
21
![Page 22: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/22.jpg)
Synchronization• A basic synchronization element is the barrier at the end of a
parallel loop.
• A barrier in a program forces all processes to reach a certain point before execution continues.
• Bus contention can cause slowdowns.
22
![Page 23: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/23.jpg)
Techniques for parallelizing a single loop
• Single loop methods—Privatization—Loop distribution—Loop fusion—Alignment—Code replication
![Page 24: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/24.jpg)
DO I = 1,N
S1 T = A(I)
S2 A(I) = B(I)
S3 B(I) = T
ENDDO
PARALLEL DO I = 1,N
PRIVATE t
S1 t = A(I)
S2 A(I) = B(I)
S3 B(I) = t
ENDDO
Single Loops• The analog of scalar expansion is privatization.
• Temporaries can be given separate namespaces for each iteration.
24
![Page 25: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/25.jpg)
Definition: A scalar variable x in a loop L is said to be privatizable if every path from the loop entry to a use of x inside the loop passes through a definition of x.
Privatizability can be stated as a data-flow problem:
We can also do this by declaring a variable x private if its SSA graph doesn’t contain a phi function at the entry.
up(x) = use(x)∪ (!def (x)∩ up(y))y∈succ( x )U
private(L) =!up(entry)∩ ( def (y))y∈LU
Privatization
25
![Page 26: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/26.jpg)
Example of Privatizable Scalar Variable
• “Method of, system for, and computer program product for efficient identification of private variables in program loops by an optimizing compiler”, US Patent 5,790,859, issued Aug 1998.
26
![Page 27: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/27.jpg)
We need to privatize array variables.
For iteration J, upwards exposed variables are those exposed due to loop body without variables defined earlier.
DO I = 1,100
S0 T(1)=X
L1 DO J = 2,N
S1 T(J) = T(J-1)+B(I,J)
S2 A(I,J) = T(J)
ENDDO
ENDDO
up(L1) = ({T (J −1)} \ {T(n) : 2 ≤ n ≤ j})
J= 2
N
U
So for this fragment, T(1) is the only exposed variable.
Array Privatization
27
![Page 28: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/28.jpg)
PARALLEL DO I = 1,100 PRIVATE t(N)
S0 t(1) = X
L1 DO J = 2,N
S1 t(J) = t(J-1)+B(I,J)
S2 A(I,J)=t(J)
ENDDO
ENDDO
Array Privatization• Using this analysis, we get the following code:
28
![Page 29: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/29.jpg)
Loop Distribution• Loop distribution can convert loop-carried dependences to loop-
independent dependences.
• Consequently, it often creates opportunity for outer-loop parallelism.
• However, we must add extra barriers to keep distributed loops from executing out of order, so the overhead may override the parallel savings.
29
![Page 30: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/30.jpg)
DO I = 2,N
A(I) = B(I)+C(I)
D(I) = A(I-1)*2.0
ENDDO
DO I = 1,N ! Aligned loop
IF (I .GT. 1) A(I) = B(I)+C(I)
IF (I .LT. N) D(I+1) = A(I)*2.0
ENDDO
Loop Alignment• Many carried dependencies are due to array alignment issues.
• If we can align all references, then dependencies would go away, and parallelism is possible.
• This is also related to Software Pipelining
30
D(2) = A(1)*2.0
DO I = 2,N-1 ! Pipelined loop
A(I) = B(I)+C(I)
D(I+1) = A(I)*2.0
ENDDO
A(N) = B(N)+C(N)
![Page 31: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/31.jpg)
DO I = 2,N
J = MOD(I+N-4,N-1)+2
A(J) = B(J)+C
D(I)=A(I-1)*2.0
ENDDO
D(2) = A(1)*2.0DO I = 2,N-1 A(I) = B(I)+C(I) D(I+1) = A(I)*2.0ENDDOA(N) = B(N)+C(N)
Alignment• There are other ways to align the loop:
31
![Page 32: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/32.jpg)
DO I = 1,N
A(I+1) = B(I)+C
X(I) = A(I+1)+A(I)
ENDDO
DO I = 1,N A(I+1) = B(I)+C ! Replicated Statement IF (I .EQ 1) THEN t = A(I) ELSE t = B(I-1)+C END IF X(I) = A(I+1)+tENDDO
Code Replication• If an array is involved in a recurrence, then alignment isn’t
possible.
• If two dependencies between the same statements have different dependency distances, then alignment doesn’t work.
• We can fix the second case by replicating code:
32
![Page 33: COMP 515: Advanced Compilation for Vector and Parallel ...vs3/PDF//comp515-lec10-f11-v1.pdf · 33 COMP 515, Fall 2011 (V.Sarkar) REMINDER: Homework #4 (Written Assignment) 1. Solve](https://reader030.vdocument.in/reader030/viewer/2022040905/5e7a7a0d8785941e4a633511/html5/thumbnails/33.jpg)
COMP 515, Fall 2011 (V.Sarkar)33
REMINDER: Homework #4 (Written Assignment)
1. Solve exercise 5.6 in book—Your solution should be legal for all values of K (note that the value of K
is invariant in loop I)
• Due by 5pm on Wednesday, Sep 28th
• Homework should be turned into Amanda Nokleby, Duncan Hall 3137
• Honor Code Policy: All submitted homeworks are expected to be the result of your individual effort. You are free to discuss course material and approaches to problems with your other classmates, the teaching assistants and the professor, but you should never misrepresent someone else’s work as your own. If you use any material from external sources, you must provide proper attribution.