single-dimension software pipelining for multi-dimensional loops ifip tele-seminar june 1, 2004...
Single-dimension Software Pipelining for Multi-dimensional Loops
IFIP Tele-seminar June 1, 2004
Hongbo Rong
Zhizhong Tang
Alban Douillet
Ramaswamy Govindarajan
Guang R. Gao
Presented by: Hongbo Rong
Introduction
• Loops and software pipelining are important
• Innermost loops are not enough [Burger&Goodman04]
• Billion-transistor architectures tend to have much more parallelism
• Previous methods for scheduling multi-dimensional loops are meeting new challenges
Motivating Example
int U[N1+1][N2+1], V[N1+1][N2+1];
L1: for (i1=0; i1<N1; i1++) {
L2:   for (i2=0; i2<N2; i2++) {
a:      U[i1+1][i2] = V[i1][i2] + U[i1][i2];
b:      V[i1][i2+1] = U[i1+1][i2];
      }
    }
Dependence graph: a → b with distance <0,0>, b → a with distance <0,1>, a → a with distance <1,0>.
A strong cycle in the inner loop: no parallelism.
Loop Interchange Followed by Modulo Scheduling of the Inner Loop
• Why not select a better loop to software pipeline?
• Which and how?
Dependence graph after interchange: a → b with distance <0,0>, b → a with distance <1,0>, a → a with distance <0,1>.
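As a sanity check on the interchange idea, the sketch below runs the motivating loop nest in its original order and with the two loops interchanged, then compares the resulting arrays. The function names (`run_original`, `run_interchanged`, `interchange_matches`), the concrete sizes, and the initial values are illustrative assumptions; the slides leave N1 and N2 symbolic.

```c
#include <string.h>

/* Illustrative sizes (assumption); the slides keep N1 and N2 symbolic. */
enum { N1 = 6, N2 = 3 };

typedef int Grid[N1 + 1][N2 + 1];

/* Fill U and V with arbitrary but reproducible starting values. */
static void init(Grid U, Grid V) {
    for (int i = 0; i <= N1; i++)
        for (int j = 0; j <= N2; j++) {
            U[i][j] = i + 2 * j;
            V[i][j] = 3 * i - j;
        }
}

/* Original order: i1 outer, i2 inner (statements a and b from the slide). */
static void run_original(Grid U, Grid V) {
    for (int i1 = 0; i1 < N1; i1++)
        for (int i2 = 0; i2 < N2; i2++) {
            U[i1 + 1][i2] = V[i1][i2] + U[i1][i2];   /* a */
            V[i1][i2 + 1] = U[i1 + 1][i2];           /* b */
        }
}

/* Interchanged order: i2 outer, i1 inner. */
static void run_interchanged(Grid U, Grid V) {
    for (int i2 = 0; i2 < N2; i2++)
        for (int i1 = 0; i1 < N1; i1++) {
            U[i1 + 1][i2] = V[i1][i2] + U[i1][i2];   /* a */
            V[i1][i2 + 1] = U[i1 + 1][i2];           /* b */
        }
}

/* Returns 1 if both loop orders leave identical U and V, 0 otherwise. */
int interchange_matches(void) {
    Grid U1, V1, U2, V2;
    init(U1, V1);
    init(U2, V2);
    run_original(U1, V1);
    run_interchanged(U2, V2);
    return memcmp(U1, U2, sizeof(Grid)) == 0 &&
           memcmp(V1, V2, sizeof(Grid)) == 0;
}
```

The two orders agree because every distance vector stays lexicographically non-negative after the interchange. The check says nothing about which order pipelines better, which is the question the slide raises.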
Starting from A Naïve Approach
[Figure: naive schedule of a(i1, i2) and b(i1, i2); axes: cycle (0 to 12) and i1 (0 to 5). Dependence graph: a → b <0,0>, b → a <0,1>, a → a <1,0>.]
Assumptions: 2 function units; a: 1 cycle; b: 2 cycles; N2=3
Resource conflicts
Looking from Another Angle
[Figure: schedule of a(i1, i2) and b(i1, i2) partitioned into Slice 1, Slice 2, and Slice 3; axes: cycle (0 to 12) and i1 (0 to 5).]
Initiation interval T=1
Kernel, with S=3 stages
Resource conflicts
SSP (Single-dimension Software Pipelining)
[Figure: SSP schedule of a(i1, i2) and b(i1, i2); axes: cycle (0 to 12) and i1 (0 to 5).]
Initiation interval T=1
Kernel, with S=3 stages
Delay = (N2-1)*S*T
SSP (Single-dimension Software Pipelining)
• An iteration point per cycle
• Filling & draining naturally overlapped
• Dependences are still respected!
• Resources fully used
• Data reuse exploited!
[Figure: SSP schedule of a(i1, i2) and b(i1, i2) after delaying groups; axes: cycle (0 to 12) and i1 (0 to 5).]
Initiation interval T=1
Kernel, with S=3 stages
Delay = (N2-1)*S*T
Loop Rewriting
int U[N1+1][N2+1], V[N1+1][N2+1];
L1': for (i1=0; i1<N1; i1+=3) {
       b(i1-1, N2-1)  a(i1, 0)
       b(i1, 0)       a(i1+1, 0)
       b(i1+1, 0)     a(i1+2, 0)
L2':   for (i2=1; i2<N2; i2++) {
         a(i1, i2)    b(i1+2, i2-1)
         b(i1, i2)    a(i1+1, i2)
         b(i1+1, i2)  a(i1+2, i2)
       }
     }
     b(i1-1, N2-1)
Outline
• Motivation
• Problem Formulation & Perspective
• Properties
• Extensions
• Current and Future work
• Code Generation and experiments
Problem Formulation
Given a loop nest L composed of n loops L1, …, Ln, identify the most profitable loop level Lx with 1<= x<=n, and software pipeline it.
• Which loop to software pipeline?
• How to software pipeline the selected loop?
• How to handle the n-D dependences?
• How to enforce resource constraints?
• How can we guarantee that repeating patterns will definitely appear?
Single-dimension Software Pipelining
• A resource-constrained scheduling method for loop nests
• Can schedule at an arbitrary level
• Simplify n-D dependences to 1-D
• 3 steps
  • Loop Selection
  • Dependence Simplification and 1-D Schedule Construction
  • Final schedule computation
Perspective
• Which loop to software pipeline?
  • Most profitable one in terms of parallelism, data reuse, or others
• How to software pipeline the selected loop?
  • Allocate iteration points to slices
  • Software pipeline each slice
  • Partition slices into groups
  • Delay groups until resources available
• Enforce resource constraints in two steps
Perspective (Cont.)
• How to handle dependences?
  • If a dependence is respected before pushing down the groups, it will be respected afterwards
  • Simplify dependences from n-D to 1-D
How to handle dependences?
[Figure: SSP schedule of a(i1, i2) and b(i1, i2); axes: cycle (0 to 12) and i1 (0 to 5). Dependence graph: a → b <0,0>, b → a <0,1>, a → a <1,0>.]
• Dependences within a slice
• Dependences between slices
• Still respected after pushing down
Simplify n-D Dependences
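[Figure: along the cycle axis, iteration points (i1, 0, …, 0, 0) and (i1+1, 0, …, 0, 0), followed by (i1, 0, …, 0, 1) and (i1+1, 0, …, 0, 1).]
Only the first distance is useful.
Dependence graph: a → b <0,0>, a → a <1,0>, b → a <0,1> (ignorable).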
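A minimal sketch of the simplification, assuming the pipelined loop is the outermost one and assuming the rule suggested by the example: a dependence whose first distance is zero while a later distance is nonzero is treated as ignorable, and every other dependence keeps only its first distance. The type `Dep1D` and the function `simplify_dependence` are hypothetical names, not taken from the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Result of projecting an n-D distance vector onto the pipelined loop. */
typedef struct {
    bool ignorable;  /* true if the 1-D scheduler may drop the dependence */
    int  distance;   /* first component of the vector; use when !ignorable */
} Dep1D;

/* Assumed rule: keep only dist[0]; a vector of the form <0, ..., nonzero, ...>
 * is marked ignorable for 1-D scheduling. */
Dep1D simplify_dependence(const int *dist, size_t n) {
    assert(dist != NULL && n >= 1);

    bool rest_zero = true;
    for (size_t k = 1; k < n; k++)
        if (dist[k] != 0)
            rest_zero = false;

    Dep1D out;
    out.ignorable = (dist[0] == 0 && !rest_zero);
    out.distance = dist[0];
    return out;
}
```

Applied to the running example, <0,0> becomes <0>, <1,0> becomes <1>, and <0,1> is marked ignorable.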
Step 1: Loop Selection
• Scan each loop.
• Evaluate parallelism
  • Recurrence Minimum II (RecMII) from the cycles in 1-D DDG
• Evaluate data reuse
  • Average memory accesses of an S*S tile from the future final schedule (optimized iteration space).
Example: Evaluate Parallelism
Inner loop: RecMII=3 (1-D DDG: a → b <0>, b → a <1>)
Outer loop: RecMII=1 (1-D DDG: a → b <0>, a → a <1>)
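The two RecMII values can be reproduced with the usual recurrence bound, the maximum over dependence cycles of ceil(total latency / total distance), using the latencies given earlier in the talk (a: 1 cycle, b: 2 cycles). The sketch below assumes each cycle of the 1-D DDG has already been summarised by those two totals; `DepCycle` and `rec_mii` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* One dependence cycle of a 1-D DDG, summarised by two totals. */
typedef struct {
    int latency;   /* sum of operation latencies around the cycle */
    int distance;  /* sum of 1-D dependence distances around the cycle */
} DepCycle;

/* RecMII = max over cycles of ceil(latency / distance). */
int rec_mii(const DepCycle *cycles, size_t count) {
    int best = 1;  /* an initiation interval is at least one cycle */
    for (size_t k = 0; k < count; k++) {
        assert(cycles[k].latency > 0 && cycles[k].distance > 0);
        int bound = (cycles[k].latency + cycles[k].distance - 1)
                    / cycles[k].distance;
        if (bound > best)
            best = bound;
    }
    return best;
}
```

For the inner loop the only cycle is a → b → a with latency 1 + 2 = 3 and distance 1, which gives 3. For the outer loop the only cycle is a → a with latency 1 and distance 1, which gives 1.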
Evaluate Data Reuse
• Symbolic parameters
  • S: total stages
  • l: cache line size
• Evaluate data reuse [WolfLam91]
  • Localized space = span{(0,1),(1,0)}
  • Calculate equivalence classes for temporal and spatial reuse space
  • Average accesses = 2/l
[Figure: optimized iteration space; axes: i1 (0, 1, …, S-1, S, S+1, …, 2S-1, …, N1-1) and cycle.]
Step 2: Dependence Simplification and 1-D Schedule Construction
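• Dependence simplification: the n-D DDG (a → b <0,0>, b → a <0,1>, a → a <1,0>) becomes the 1-D DDG (a → b <0>, a → a <1>)
• 1-D schedule construction, subject to:
  • Modulo property
  • Resource constraints
  • Sequential constraints
  • Dependence constraints
[Figure: 1-D schedule of a and b with initiation interval T.]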
Final Schedule Computation
Example: a(5,2)
[Figure: SSP schedule of a(i1, i2) and b(i1, i2); axes: cycle (0 to 12) and i1 (0 to 5).]
Modulo schedule time = 5
Distance = i2*S*T = 2*3*1 = 6
Delay = (N2-1)*S*T = (3-1)*3*1 = 6
Final schedule time = 5+6+6 = 17
Step 3: Final Schedule Computation
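For any operation o and iteration point I = (i1, i2, …, in):

$$f(o, I) = \sigma(o, i_1) + \Big(\sum_{2 \le x \le n} i_x \prod_{x < y \le n} N_y\Big) \cdot S \cdot T + \Big\lfloor \frac{i_1}{S} \Big\rfloor \cdot \Big(\prod_{2 \le x \le n} N_x - 1\Big) \cdot S \cdot T$$

• First term: modulo schedule time
• Second term: distance between o(i1, 0, …, 0) and o(i1, i2, …, in)
• Third term: delay from pushing down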
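A minimal sketch of the three terms (modulo schedule time, distance, delay from pushing down) for a nest pipelined at the outermost level. It assumes the modulo schedule time has the usual form sigma(o, i1) = i1*T + offset(o), where the offset is the issue cycle of o within one outer iteration. The function name `ssp_final_time` and its parameter layout are hypothetical.

```c
#include <assert.h>

/* Final schedule time f(o, I).
 * sigma_offset: issue cycle of operation o inside one outer iteration (assumed)
 * idx[0..n-1]:  iteration point (i1, ..., in)
 * N[0..n-1]:    trip counts (N1, ..., Nn); N[0] is not needed by the formula
 * S: number of kernel stages, T: initiation interval */
long ssp_final_time(int sigma_offset, const int *idx, const int *N,
                    int n, int S, int T) {
    assert(n >= 1 && S > 0 && T > 0);

    /* Modulo schedule time, assuming sigma(o, i1) = i1 * T + offset. */
    long modulo_time = (long)idx[0] * T + sigma_offset;

    /* Distance between o(i1, 0, ..., 0) and o(i1, i2, ..., in). */
    long distance = 0;
    long inner_points = 1;  /* running product of N_y for y > x */
    for (int x = n - 1; x >= 1; x--) {
        distance += (long)idx[x] * inner_points;
        inner_points *= N[x];
    }
    distance *= (long)S * T;

    /* Here inner_points equals N2 * ... * Nn.
     * Delay from pushing down the group that contains i1. */
    long delay = (long)(idx[0] / S) * (inner_points - 1) * S * T;

    return modulo_time + distance + delay;
}
```

With the values from the worked example for a(5,2), namely offset 0, N2=3, S=3, and T=1, the three terms are 5, 6, and 6, for a final schedule time of 17.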
Correctness of the Final Schedule
• Respects the original n-D dependences
  • Although we use 1-D dependences in scheduling
• No resource competition
• Repeating patterns definitely appear
Efficiency of the Final Schedule
• Schedule length <= the innermost-centric approach
  • One iteration point per T cycles
  • Draining and filling of pipelines naturally overlapped
• Execution time: even better
  • Data reuse exploited from outermost and innermost dimensions
Relation with Modulo Scheduling
• The classical MS for single loops is subsumed as a special case of SSP
  • No sequential constraints
  • f(o,I) = Modulo schedule time (σ(o, i1))
SSP for Imperfect Loop Nest
• Loop selection
• Dependence simplification and 1-D schedule construction
  • Sequential constraints
• Final schedule
SSP for Imperfect Loop Nest (Cont.)
[Figure: SSP schedule of operations a, b, c, and d for an imperfect loop nest; axes: cycle (0 to 12) and i1 (0 to 5); two points are marked "Push from here".]
Initiation interval T=1
Kernel, with S=3 stages
Compiler Platform Under Construction
[Figure: compiler flow diagram.]
• Stages: Front End, Middle End, Back End
• Input: C/C++/Fortran, through gfec/gfecc/f90
• IR levels: Very High WHIRL, High WHIRL, Middle WHIRL, Low WHIRL, Very Low WHIRL
• SSP phases and their outputs:
  • Pre-Loop Selection
  • Consistency Maintenance
  • Loop Selection: Selected Loop
  • Dependence Simplification: 1-D DDG
  • 1-D Schedule Construction: Intermediate kernel
  • Bundling: Bundled kernel
  • Register Allocation: Register-allocated kernel
  • Code generation: Assembly code
Current and Future Work
• Register allocation
• Implementation and evaluation
• Interaction and comparison with pre-transforming the loop nest
  • Unroll-and-jam
  • Tiling
  • Loop interchange
  • Loop skewing and Peeling
  • …….
An (Incomplete) Taxonomy of Software Pipelining
Software Pipelining
• For 1-dimensional loops
  • Modulo scheduling and others
• For n-dimensional loops
  • Resource-constrained
    • Innermost-loop centric
      • Hierarchical reduction [Lam88]
      • Pipelining-dovetailing [WangGao96]
      • Outer Loop Pipelining [MuthukumarDoshi01]
    • SSP
  • Parallelism-oriented
    • Affine-by-statement scheduling [DarteEtal00,94]
    • Statement-level rational affine scheduling [Ramanujam94]
    • Linear scheduling with constants [DarteEtal00,94]
    • r-periodic scheduling [GaoEtAl93]
    • Juggling problem [DarteEtAl02]
Code Generation
[Figure: flow from loop nest in CGIR, through SSP, intermediate kernel, register allocation, register-allocated kernel, and code generation, to final code.]
Code generation issues
• Register assignment
• Predicated execution
• Loop and drain control
• Generating prolog and epilog
• Generating outermost loop pattern
• Generating innermost loop pattern
• Code-size optimizations
Problem Statement
Given a register-allocated kernel generated by SSP and a target architecture, generate the SSP final schedule while reducing code size and loop control overheads.
Code Generation: Challenges
• Multiple repeating patterns
  • Code emission algorithms
• Register Assignment
  • Lack of multiple rotating register files
  • Mix of rotating registers and static register renaming techniques
• Loop and drain control
  • Predicated execution
  • Loop counters
  • Branch instructions
• Code size increase
  • Code compression techniques
Experiments: Setup
• Stand-alone module at assembly level.
• Software-pipelining using Huff's modulo-scheduling.
• SSP kernel generation & register allocation by hand.
• Scheduling algorithms: MS, xMS, SSP, CS-SSP
• Other optimizations: unroll-and-jam, loop tiling
• Benchmarks: MM, HD, LU, SOR
• Itanium workstation 733MHz, 16KB/96KB/2MB/2GB
Experiments: Relative Speedup
• Speedup between 1.1 and 4.24, average 2.1.
• Better performance: better parallelism and/or better data reuse.
• Code-size optimized version performs as well as original version.
• Code duplication and code size do not degrade performance.
Experiments: Bundle Density
• Bundle density measures the average number of non-NOP instructions in a bundle.
• Average: MS-xMS: 1.90, SSP: 1.91, CS-SSP: 2.1
• CS-SSP produces denser code.
• CS-SSP makes better use of available resources.
Experiments: Relative Code Size
• SSP code is between 3.6 and 9.0 times bigger than MS/xMS.
• CS-SSP code is between 2 and 6.85 times bigger than MS/xMS.
• The increase comes from multiple patterns and code duplication in the innermost loop.
• However, the entire code (~4KB) easily fits in the L1 instruction cache.
Acknowledgement
• Prof. Bogong Su, Dr. Hongbo Yang
• Anonymous reviewers
• Chan, Sun C.
• NSF, DOE agencies
Appendix
The following slides are for the detailed performance analysis of SSP.
Exploiting Parallelism from the Whole Iteration Space
• Represents a class of important applications
• Strong dependence cycle in the innermost loop
• The middle loop has a negative dependence, but it can be removed.
(Matrix size is N*N)
Exploiting Data Reuse from the Whole Iteration Space
Advantage of Code Generation
[Figure: speedup versus N.]
Exploiting Parallelism from the Whole Iteration Space (Cont.)
Both have dependence cycles in the innermost loop
Exploiting Data Reuse from the Whole Iteration Space
Exploiting Data Reuse from the Whole Iteration Space (Cont.)
Exploiting Data Reuse from the Whole Iteration Space (Cont.)
(Matrix size is jn*jn)
Advantage of Code Generation
SSP considers all operations when constructing the 1-D schedule, and thus effectively offsets the overhead of operations outside the innermost loop.
[Figure: speedup versus N.]
Performance Analysis from L2 Cache misses
[Figure: bar chart of L2 cache misses relative to MS (axis 0 to 4.5) for MS, xMS, SSP, and CS-SSP on ijk, jik, ikj, jki, kij, kji, HD, LU, SOR, jki+RT, and jik+T.]
Performance Analysis from L3 Cache misses
[Figure: bar chart of L3 cache misses relative to MS (axis 0 to 1.2) for MS, xMS, SSP, and CS-SSP on ijk, jik, ikj, jki, kij, kji, HD, LU, SOR, jki+RT, and jik+T.]
Comparison with Linear Schedule
• Linear schedule
  • Traditionally applied to multi-processing, systolic arrays, etc., not to uniprocessors
  • Parallelism oriented. Does not consider
    • Fine-grain resource constraints
    • Register usage
    • Data reuse
  • Code generation
    • Communicate values through memory, or message passing, etc.
Optimized Iteration Space of A Linear Schedule
[Figure: optimized iteration space of a linear schedule; axes: i1 (0 to 9) and cycle.]