Post on 19-Dec-2015
Loop Tiling for Iterative Stencil Computations
Marta Jiménez
What is an Iterative Stencil Computation?
• ISC often performed for PDE, GM, IP
– swim, tomcatv, mgrid (from SPEC95 benchmark)
– Jacobi

DO K = 1, NITER          /* time-step loop */
  do J = ...
    do I = ...
      {A(I,J), A(I+1,J),…}
    enddo
  enddo
  {wrapped-around computations}
ENDDO
Matrix A
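As a concrete instance of this pattern, here is a minimal Python sketch of a 2D Jacobi iteration: an outer time-step loop around a JI-loop nest that reads neighbouring elements of A. The function name jacobi_2d, the 4-point average and the fixed boundary are illustrative assumptions, not taken from the slides; wrapped-around computations are left out.

```python
def jacobi_2d(a, niter):
    """Run niter Jacobi sweeps over the interior of grid a (list of rows)."""
    n, m = len(a), len(a[0])
    cur = [row[:] for row in a]
    for _ in range(niter):                 # time-step loop (K)
        nxt = [row[:] for row in cur]      # boundary values are carried over
        for j in range(1, n - 1):          # J loop
            for i in range(1, m - 1):      # I loop
                # each point is rebuilt from its four neighbours
                nxt[j][i] = 0.25 * (cur[j - 1][i] + cur[j + 1][i]
                                    + cur[j][i - 1] + cur[j][i + 1])
        cur = nxt
    return cur
```

Every time step sweeps the whole matrix, so once A is larger than the cache each sweep reloads it from memory. That reuse across time steps is what tiling tries to capture.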
Loop Tiling
• Loop Tiling
– divides the iteration space (IS) into regular tiles so that the working set fits in the memory level being exploited
– can be applied hierarchically (Multilevel Tiling)
• Current algorithms for Loop Tiling are limited to loops that:
– are “perfectly” nested
– are fully permutable
– define a rectangular IS
• However, in iterative stencil computations, loops are:
– NOT perfectly nested
– NOT fully permutable
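For contrast, a minimal Python sketch of conventional tiling on a loop nest that does meet the three conditions. The transposition kernel, the tile sizes tj and ti, and the function names are assumptions picked for illustration only.

```python
def transpose_plain(a):
    """Reference version: perfectly nested, fully permutable, rectangular IS."""
    n, m = len(a), len(a[0])
    b = [[0] * n for _ in range(m)]
    for j in range(n):
        for i in range(m):
            b[i][j] = a[j][i]
    return b


def transpose_tiled(a, tj=2, ti=2):
    """Same computation, visiting the iteration space tile by tile."""
    n, m = len(a), len(a[0])
    b = [[0] * n for _ in range(m)]
    for jj in range(0, n, tj):                      # loops over tiles
        for ii in range(0, m, ti):
            for j in range(jj, min(jj + tj, n)):    # loops inside one tile
                for i in range(ii, min(ii + ti, m)):
                    b[i][j] = a[j][i]
    return b
```

Reordering the iterations is safe here because no iteration depends on another. The time-step loop of a stencil code does carry dependencies, which is why the rest of the talk needs extra machinery.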
Today’s talk
• Show how Loop Tiling can be applied to iterative stencil computations
– based on Song & Li’s paper [PLDI99]
• define a Program Model
• 1 Level of 1D-Tiling (cache)
– program example: SWIM
• 2 levels of Tiling
– 2D-Tiling at the cache level
– 1D-Tiling at the register level (based on Jiménez et al. [ICS98][HPCA98])
• Performance Results
– Loop Tiling on EV5 & EV6
Steps
1- Apply a set of transformations to the original program to achieve the desired program model defined by Song & Li
2- Perform 2D-Tiling for the Cache Level
3- Perform 1D-Tiling for the Register Level
1st Step: achieve desired program model
Program Model:
DO K = 1, NITER          /* time-step loop */
  do J1 = LJ1, UJ1
    do I1 = LI1, UI1
      {A(I,J), A(I+1,J),…}
    enddo
  enddo
  . . .
  do Jm = LJm, UJm
    do Im = LIm, UIm
      {A(I,J), A(I+1,J),…}
    enddo
  enddo
ENDDO
Usually, programs are NOT directly written in this form
– We must apply a set of transformations to achieve this program model
SWIM original code
      initializations
   90 NCYCLE = NCYCLE + 1
      CALL CALC1
      CALL CALC2
      IF (NCYCLE >= ITMAX) STOP
      IF (NCYCLE <= 1) THEN
        CALL CALC3Z
      ELSE
        CALL CALC3
      ENDIF
      GO TO 90
Transformations
– Inline subroutines
– Convert GO TO into DO-loop
– Peel iterations of the time-step loop to eliminate IF-statements guarded by NCYCLE
      SUBROUTINE CALCX
      do J = 1,N
        do I = 1,M
          ...
        enddo
      enddo
c     wrapped-around computations
      do J = 1, N
        ...
      enddo
      do I = 1, M
        ...
      enddo
      ...
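The GO TO conversion and the peeling step can be checked on a trace of the subroutine calls. The Python sketch below is illustrative only: it assumes NCYCLE starts at 0 and ITMAX >= 2, and the function names swim_goto and swim_peeled are hypothetical.

```python
def swim_goto(itmax):
    """Call trace of the original GO TO loop (NCYCLE assumed to start at 0)."""
    trace = []
    ncycle = 0
    while True:                      # label 90 ... GO TO 90
        ncycle += 1
        trace += ["CALC1", "CALC2"]
        if ncycle >= itmax:          # STOP
            break
        if ncycle <= 1:
            trace.append("CALC3Z")
        else:
            trace.append("CALC3")
    return trace


def swim_peeled(itmax):
    """Same trace with the first and last iterations peeled (itmax >= 2)."""
    trace = ["CALC1", "CALC2", "CALC3Z"]          # peeled iteration NCYCLE = 1
    for _k in range(2, itmax):                    # DO K = 2, ITMAX-1, no IFs left
        trace += ["CALC1", "CALC2", "CALC3"]
    trace += ["CALC1", "CALC2"]                   # peeled iteration NCYCLE = ITMAX
    return trace
```

After peeling, the body of the K loop is the same straight-line sequence of loop nests in every iteration, which is the shape the program model asks for.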
Wrapped-around Computations
DO K = 2, ITMAX-1
  do J = 1,N
    do I = 1,M
      ...
    enddo
  enddo
  wrapped-around comp
  do J = 1, N
    ...
  enddo
  do I = 1, M
    ...
  enddo
  ...
  do J = 1,N
    do I = 1,M
      ...
    enddo
  enddo
  ...
  ...
ENDDO
[Figure: J-I iteration spaces of CALC1, CALC2 and CALC3]
Projection along direction I
DO K = 2, ITMAX-1
  do J = 1,N
    ...
  enddo
  wrapped-around comp
  do J = 1, N
    ...
  enddo
  do J = 1,N
    ...
  enddo
  wrapped-around comp
  do J = 1, N
    ...
  enddo
  ...
ENDDO
Wrapped-around Computations
Another way of dealing with the wrapped-around computations is to perform code sinking
DO K = 2, ITMAX-1
  do J = 1,N
    ...
  enddo
  wrapped-around
  do J = 1,N
    ...
  enddo
  wrapped-around
  do J = 1,N
    ...
  enddo
  wrapped-around
ENDDO
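One common reading of code sinking is that a statement standing between loops is moved into the neighbouring loop and guarded so that it still runs exactly once. The Python sketch below illustrates that reading only. The periodic copy a[n] = a[0] and both function names are assumptions, not SWIM code.

```python
def separate_wrap(n):
    """Main loop followed by a separate wrapped-around statement."""
    a = [0] * (n + 1)
    for j in range(n):
        a[j] = j * j + 1
    a[n] = a[0]                  # wrapped-around computation (periodic copy)
    return a


def sunk_wrap(n):
    """Wrapped-around statement sunk into the loop under a guard."""
    a = [0] * (n + 1)
    for j in range(n):
        a[j] = j * j + 1
        if j == n - 1:           # guard: run once, on the last iteration
            a[n] = a[0]
    return a
```

The guard costs a test per iteration, but the K-loop body is then made up of loop nests only.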
1st Step: achieved program model
Flow dependencies & iteration space for SWIM (projection along direction I)
[Figure: J (1 to N) against the K-loop (time) at K=2 and K=3, with flow dependencies between CALC1, CALC2 and CALC3]
1D-Tiling
[Figure: two views of the iteration space, J (1 to N) against time steps K=2, K=3, K=4]
• Dependencies are violated
• Tiling parameters: SLOPE, OFFSET-i
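To make the role of SLOPE concrete, here is a minimal Python sketch of skewed 1D tiling for a much simpler case than SWIM: a 3-point Jacobi stencil with a single loop nest and two buffers. All names are illustrative. With a single loop nest no OFFSET-i is needed, and slope=1 is enough for this stencil because every dependence reaches at most one position back in J per time step; neither choice comes from the slides.

```python
def jacobi_plain(a, steps):
    """Untiled reference: every time step sweeps the whole array."""
    buf = [list(a), list(a)]
    n = len(a)
    for t in range(steps):
        src, dst = buf[t % 2], buf[(t + 1) % 2]
        for j in range(1, n - 1):
            dst[j] = (src[j - 1] + src[j] + src[j + 1]) / 3.0
    return buf[steps % 2]


def jacobi_tiled(a, steps, tile, slope=1):
    """Tile the skewed coordinate js = j + slope*t, all time steps per tile."""
    buf = [list(a), list(a)]
    n = len(a)
    js_max = n - 2 + slope * (steps - 1)       # largest skewed coordinate
    for jj in range(1, js_max + 1, tile):      # loop over tiles
        for t in range(steps):                 # time-step loop inside a tile
            # the tile slides left by `slope` positions per time step
            lo = max(1, jj - slope * t)
            hi = min(n - 2, jj + tile - 1 - slope * t)
            src, dst = buf[t % 2], buf[(t + 1) % 2]
            for j in range(lo, hi + 1):
                dst[j] = (src[j - 1] + src[j] + src[j + 1]) / 3.0
    return buf[steps % 2]
```

With slope=0 the tiles would be rectangular and a tile would read neighbours that a later tile has not yet computed, which is the violated-dependencies case.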
2D-Tiling
[Figure: I (1 to M) by J (1 to N) iteration spaces along K (time-step loop), tiled in both dimensions]
Tiling parameters: SLOPE, OFFSET-i for each tiled dimension (J and I)
Computed using the JI-loop distance subgraph
JI-loop Distance Subgraph
Each node represents a JI-loop nest
Each edge represents a dependence (distance vector)
[Figure: distance subgraph with nodes JI1-loop, JI2-loop and JI3-loop; edges are flow dependencies, anti-dependencies and output dependencies, labeled with the distance vectors [1,-1,0], [1,0,-1], [1,-1,-1], [1,0,0] and [0,0,0]]
SWIM: Projection along direction I
Wrapped-around Computations
Backward dependencies with large distances make Tiling not profitable
– apply Circular Loop Skewing to shorten backward dependencies
DO K = 2, ITMAX-1
  do J = 1,N
    ...
  enddo
  wrapped-around
  do J = 1,N
    ...
  enddo
  wrapped-around
  do J = 1,N
    ...
  enddo
  wrapped-around
ENDDO
[Figure: J (1 to N) against the K-loop (time) at K=2 and K=3]
Circular Loop Skewing
Shortens backward dependencies by changing the iteration order
CLS parameters: BETA-i, DELTA (computed using the JI-loop distance subgraph)
[Figure: iteration space J (1 to N) at K=2 and K=3, annotated with BETA-i and DELTA]
Circular Loop Skewing
DO K = 2, ITMAX-1
  do JX = 1+BETA1+DELTA*(K-2), N+BETA1+DELTA*(K-2)
    J = MOD(JX-1, N) + 1
    ...
  enddo
  wrapped-around
  do JX = 1+BETA2+DELTA*(K-2), N+BETA2+DELTA*(K-2)
    J = MOD(JX-1, N) + 1
    ...
  enddo
  wrapped-around
  do JX = 1+BETA3+DELTA*(K-2), N+BETA3+DELTA*(K-2)
    J = MOD(JX-1, N) + 1
    ...
  enddo
  wrapped-around
ENDDO
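The index arithmetic of the skewed loops can be checked in isolation. This Python sketch lists the J values one CLS loop visits, following the JX bounds and the MOD expression in the code; the function name cls_order and the sample values of N, BETA and DELTA are illustrative assumptions.

```python
def cls_order(n, beta, delta, k):
    """J values visited, in order, by one circularly skewed loop at time step k."""
    start = 1 + beta + delta * (k - 2)            # lower bound of JX
    # JX runs from start to start + n - 1; J wraps around into 1..n
    return [(jx - 1) % n + 1 for jx in range(start, start + n)]
```

For example, cls_order(8, 1, 2, 3) starts at J = 4 and wraps around to finish at J = 3. Each loop still visits every J exactly once per time step; only the starting point rotates, by BETA between loop nests and by DELTA between time steps.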
2nd Step: 2D-Tiling for cache level
DO JJ = ...
  DO II = ...
    DO K = ...
      if (first tile) then
        do JX = ...
          offsets iter.
        enddo
      endif
      do JX = ...
        Iter. inside tile
      enddo
      do JX = ...
        Iter. inside tile
      enddo
      do JX = ...
        Iter. inside tile
      enddo
    ENDDO
  ENDDO
ENDDO
SWIM: projection along direction I
CLS parameters: DELTA=2, BETA1=0, BETA2=1, BETA3=2
Tiling parameters: SLOPE=2, OFFSET1=1, OFFSET2=OFFSET3=0
[Figure: J (1 to N) at K=2, K=3 and K=4, with tiles numbered 1, 2, 3]
3rd Step: 1D-Tiling for register level
DO JJ = ...
  DO II = ...
    DO K = ...
      ...
      do JX = LJ, UJ
        J = MOD(JX-1, N) + 1
        do IX = LI, UI
          I = MOD(IX-1, M) + 1
          [loop body: {I,J}]
        enddo
      enddo
      ...
    ENDDO
  ENDDO
ENDDO
The MOD operation introduced by CLS prevents us from fully unrolling the loop
First apply Index Set Splitting to loop J
[Figure: I-J iteration space, I from 1 to M and J from 1 to N, with the unrolled region marked]
Index Set Splitting
ISS splits a loop into two new loops that iterate over non-intersecting portions of the iteration space
DO JJ = ...
  DO II = ...
    DO K = ...
      ...
      do JX = LJ, min(N,UJ)
        J = JX
        do IX = ...
        enddo
      enddo
      do JX = max(N+1,LJ), UJ
        J = JX-N
        do IX = ...
        enddo
      enddo
      ...
    ENDDO
  ENDDO
ENDDO
[Figure: I-J iteration space, I from 1 to M and J from 1 to N, split by ISS]
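The split can be checked against the MOD form directly. The Python sketch below compares the J sequence produced by the single loop with MOD against the two MOD-free loops. It assumes 1 <= LJ and UJ <= 2N, so that JX wraps around at most once; the function names are illustrative.

```python
def j_with_mod(lj, uj, n):
    """J sequence of the original loop: J = MOD(JX-1, N) + 1."""
    return [(jx - 1) % n + 1 for jx in range(lj, uj + 1)]


def j_after_iss(lj, uj, n):
    """J sequence after Index Set Splitting (assumes 1 <= lj, uj <= 2*n)."""
    js = []
    for jx in range(lj, min(n, uj) + 1):        # part with JX <= N: J = JX
        js.append(jx)
    for jx in range(max(n + 1, lj), uj + 1):    # part with JX > N: J = JX - N
        js.append(jx - n)
    return js
```

In each new loop J advances by exactly one per iteration, so J+1 and J+2 are valid neighbours and the loop can be unrolled.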
3rd Step: 1D-Tiling for register level
DO JJ = ...
  DO II = ...
    DO K = ...
      ...
      do JX = LJ, min(N,UJ)-3+1, 3
        J = JX
        do IX = ...
          [loop body: {J}]
          [loop body: {J+1}]
          [loop body: {J+2}]
        enddo
      enddo
      do JX = JX, min(N,UJ)
        J = JX
        do IX = ...
          [loop body: {J}]
        enddo
      enddo
      ...
    ENDDO
  ENDDO
ENDDO
[Figure: I-J iteration space after ISS, I from 1 to M and J from 1 to N]
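The register-level tiling of loop J can be sketched as an unroll of the JX loop whose copies are merged into the IX loop, followed by a cleanup loop for the leftover iterations. The Python version below records the (J, I) visiting order only. The function name and the unroll parameter are assumptions, with a default of 3 to match the three body copies in the code.

```python
def register_tiled_order(lj, uj, li, ui, n, unroll=3):
    """(J, I) visiting order of the first ISS loop after unrolling JX."""
    visited = []
    upper = min(n, uj)
    jx = lj
    while jx <= upper - unroll + 1:        # do JX = LJ, min(N,UJ)-3+1, 3
        for ix in range(li, ui + 1):
            for u in range(unroll):        # stands for the explicit body copies
                visited.append((jx + u, ix))
        jx += unroll
    while jx <= upper:                     # cleanup loop: do JX = JX, min(N,UJ)
        for ix in range(li, ui + 1):
            visited.append((jx, ix))
        jx += 1
    return visited
```

Inside the unrolled part, consecutive J values are handled for the same I, so values shared between the body copies can stay in registers.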
Code Transformations Summary
1- Apply a set of transformations to the original program to achieve the program model defined by Song & Li
– Inline subroutines
– Convert GO TO into DO-loop
– Peel iterations of the time-step loop to eliminate IF-statements
2- Perform 2D-Tiling for the Cache Level
– Construct JI-loop distance subgraph
– Compute DELTA and BETAs and apply CLS to shorten backward dependencies
– Update JI-loop distance subgraph
– Compute OFFSETs and SLOPE and tile the IS
3- Perform 1D-Tiling for the Register Level
– Index Set Splitting
– Tiling in a straightforward manner
Performance Results (SWIM)
• Architecture: EV56 (500MHz, L1:8KB, L2:96KB), EV6 (500MHz, L1:64KB, L2:4MB)
• Compiler Invocation:
– f77 -O5 -arch ev56 (EV5)
– kf77 -O5 -arch ev6 -notransform_loop -unroll 1 (EV6)
• Programs:
– 1D-Tiling for the Cache Level: loop J, TS=4 (EV5), TS=8 (EV6)
– 2D-Tiling for the Cache Level: TSIxJ=32x16 (EV5), TSIxJ=40x12 (EV6)
– 1D-Tiling for the register level: loop J, TS=4 (EV5 & EV6)
[Figure: bar chart of speedup (axis 0.5 to 2.5) on EV6 and EV5 for ORI, ORI + RT, 1D, 1D + RT, 2D, 2D + RT]
Execution time:
EV6: 439s 658s 294s 371s 578s 296s
EV5: 1519s 1533s 1023s 999s 1009s 677s
Performance Results EV5 (SWIM)
• Architecture: EV56 (500MHz, L1:8KB, L2:96KB)
• Compiler invocations:
– base: kf77 -O5 -arch ev56
– no_prefetch: kf77 -O5 -arch ev56 -switch nolu_prefetch_fetch …..
[Figure: bar chart of speedup over ORI (base), axis 0.5 to 2.5, for base and no_prefetch across ORI, ORI + RT, 1D, 1D + RT, 2D, 2D + RT]
Performance Results EV6 (SWIM)
• Architecture: EV6 (500MHz, L1:64KB, L2:4MB)
• Compiler invocations:
– base: f77 -O5 -arch ev6
– no_prefetch: f77 -O5 -arch ev6 -switch nolu_prefetch_fetch …..
[Figure: bar chart of speedup over ORI (base), axis 0 to 2.5, for base and no_prefetch across ORI, ORI + RT, 1D, 1D + RT, 2D, 2D + RT]
Code for Result Verification
DO K = 2, ITMAX-1
  ...
  do J = 1,N
    ...
  enddo
  result verification
  IF (MOD(K,MPRINT).eq.0) THEN
    do I =
      do J =
        UCHECK = UCHECK + {UNEW(I,J)}
      enddo
      UNEW (I,I) = . . .
    enddo
    PRINTS
  ENDIF
  do J = 1,N
    ...
  enddo
ENDDO
Apply strip-mining to loop K (only useful if MPRINT is large)
NEW in SPEC2000!!
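A minimal Python sketch of strip-mining loop K around the verification code. To keep it short, the check is assumed to run at the end of a time step rather than between two loop nests as in the code above, and only a trace of events is recorded. The function names are illustrative.

```python
def k_loop_original(itmax, mprint):
    """Trace of the K loop with the verification test inside every iteration."""
    trace = []
    for k in range(2, itmax):                    # DO K = 2, ITMAX-1
        trace.append(("step", k))
        if k % mprint == 0:
            trace.append(("verify", k))
    return trace


def k_loop_strip_mined(itmax, mprint):
    """Same trace; each strip ends at the next multiple of mprint."""
    trace = []
    k_start = 2
    while k_start <= itmax - 1:
        next_multiple = ((k_start + mprint - 1) // mprint) * mprint
        k_end = min(next_multiple, itmax - 1)
        for k in range(k_start, k_end + 1):      # verification-free strip
            trace.append(("step", k))
        if k_end % mprint == 0:                  # verification between strips
            trace.append(("verify", k_end))
        k_start = k_end + 1
    return trace
```

Each strip is a verification-free run of time steps that can be tiled as before. The strips are at most MPRINT steps long, which is why a small MPRINT leaves little to gain.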