Polyhedral Code Generation In The Real World
Nicolas VASILACHE
Cédric BASTOUL
Albert COHEN
Outline
• Introduction
• Affine schedules
• Formal General Form
• Contributions
• Focus on Modulo Conditional Removal (speed & quality)
• Experimental Results
Introduction – Polyhedral Model
• Powerful expressiveness for high level transformations (parallelism, locality)
• Can express any composition of usual loop transformations [Pugh91]
• Compact representation of all legal transformations [Feautrier90]
• Code Generation was the weakest link [Griebl & al. 98]
• Until recent algorithm [Quilleré00] without transformations
• However, still problematic on long, parametric sequences on SPECs
Introduction – Transformations
WHY TRANSFORM ???
• Cholesky factorization, 6 statements, optimal allocation functions [McKin92]
• SwimFP2000 [ICS05]: ~ 30 polyhedral loop transformations, 40% speedup wrt best peak perf. on AMD64
• Huge code generation times (ex: full Swim ~ 421 2267 lines, 20 min / 300 MB)
• In the context of complex transformations
Goal: generation time comparable to the back end (BE) of a real compiler (EKOPath)
Introduction – Context & Notations
Code Generation : syntactic loops from matrix representation
Affine Schedule – Trivial Example
(i,j) → (t1=i, t2=j)
• Bijection between domain and time iterations
• Time iterations determine the generated loops (nesting, bounds)
• Execution follows lexicographic order on time dimensions
• Domain values touched by the statement: i=t1, j=t2
Domain:
for(i=1;i<=3;i++)
  for(j=1;j<=3;j++)
    S(i,j)
Time:
for(t1=1;t1<=3;t1++)
  for(t2=1;t2<=3;t2++)
    S(i=t1,j=t2)
$\begin{pmatrix} t_1 \\ t_2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix}$
[Figure: domain grid (axes i and j, each from 1 to 3) mapped to time grid (axes t1 and t2, each from 1 to 3)]
Statement instances: S(1,1) S(1,2) S(1,3) S(2,1) S(2,2) S(2,3) S(3,1) S(3,2) S(3,3)
Affine Schedule – Loop Interchange
(i,j) → (t1=j, t2=i)
• Another bijection between domain and time iterations
• New bounds computation
• Lexicographic order on time dimensions
• Domain values touched by the statement: i=t2, j=t1
Domain:
for(i=1;i<=3;i++)
  for(j=1;j<=3;j++)
    S(i,j)
Time:
for(t1=1;t1<=3;t1++)
  for(t2=1;t2<=3;t2++)
    S(i=t2,j=t1)
$\begin{pmatrix} t_1 \\ t_2 \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix}$
[Figure: domain grid (axes i and j, each from 1 to 3) mapped to time grid (axes t1 and t2, each from 1 to 3)]
Statement instances: S(1,1) S(1,2) S(1,3) S(2,1) S(2,2) S(2,3) S(3,1) S(3,2) S(3,3)
Affine Schedule – Parallel Wavefronts
(i,j) → (t1=i+j)
• NOT a bijection (just a surjection)
• New bounds computation (t1: [2, 6])
• Domain values touched by the statement: {(i,j)|i+j==t1}
Domain:
for(i=1;i<=3;i++)
  for(j=1;j<=3;j++)
    S(i,j)
Time:
for(t1=2;t1<=6;t1++)
  DOALL{(i,j)|i+j==t1}
    S(i,j)
$t_1 = \begin{pmatrix} 1 & 1 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix}$
[Figure: domain grid (axes i and j, each from 1 to 3) mapped to a single time axis t1 from 2 to 6]
Statement instances: S(1,1) S(1,2) S(1,3) S(2,1) S(2,2) S(2,3) S(3,1) S(3,2) S(3,3)
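The DOALL set of the wavefront schedule can be turned into an ordinary loop once bounds for i are derived from 1 <= i <= n and 1 <= t1 - i <= n. The sketch below is only an illustration under that derivation: the names `Visit` and `wavefront_scan` are hypothetical, the domain is generalized to n x n, and the parallel loop is run sequentially.

```c
#include <assert.h>

typedef struct { int t1, i, j; } Visit;

static int imin(int a, int b) { return a < b ? a : b; }
static int imax(int a, int b) { return a > b ? a : b; }

/* Scans the n x n domain by anti-diagonals t1 = i + j and records each
   visit in out (capacity n*n). Returns the number of visits. */
int wavefront_scan(int n, Visit *out) {
    int count = 0;
    for (int t1 = 2; t1 <= 2 * n; t1++) {
        /* 1 <= i <= n and 1 <= j = t1 - i <= n bound the inner loop. */
        int lo = imax(1, t1 - n);
        int hi = imin(n, t1 - 1);
        /* The slide marks this loop DOALL; it runs sequentially here. */
        for (int i = lo; i <= hi; i++) {
            assert(count < n * n);
            out[count].t1 = t1;
            out[count].i = i;
            out[count].j = t1 - i;
            count++;
        }
    }
    return count;
}
```

With n = 3 the outer loop runs over t1 in [2, 6], matching the bounds on the slide.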
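Affine Schedule – Statement Shifting
S1: (i,j) → (t1=i, t2=j)
S2: (i,j) → (t1=i+1, t2=j)
• New bounds computation (S1: [1,3]x[1,3], S2: [2,4]x[1,3]) have disjoint parts
• Separation phase needed on each time dimension ($3^{nb\_stmt}$ worst-case complexity)
Domain:
for(i=1;i<=3;i++)
  for(j=1;j<=3;j++)
    S1(i,j) S2(i,j)
Time:
Prologue (P):
for(t2=1;t2<=3;t2++)
  S1(i=1,j=t2)
Kernel (K):
for(t1=2;t1<=3;t1++)
  for(t2=1;t2<=3;t2++)
    S1(i=t1,j=t2) S2(i=t1-1,j=t2)
Epilogue (E):
for(t2=1;t2<=3;t2++)
  S2(i=4-1,j=t2)
S1: $\begin{pmatrix} t_1 \\ t_2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix}$
S2: $\begin{pmatrix} t_1 \\ t_2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix} + \begin{pmatrix} 1 \\ 0 \end{pmatrix}$
[Figure: two domain grids (axes i and j, each from 1 to 3) mapped to a time grid with t1 from 1 to 4 and t2 from 1 to 3]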
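The prologue / kernel / epilogue structure above can be checked by recording the order in which statement instances run. This is a minimal sketch, assuming an n x n domain and the shift of S2 by one along t1; the `Event` record and the function names are hypothetical.

```c
#include <assert.h>

typedef struct { int stmt, i, j; } Event;   /* stmt is 1 for S1, 2 for S2 */

static void emit(Event *ev, int *count, int cap, int stmt, int i, int j) {
    assert(*count < cap);
    ev[*count].stmt = stmt;
    ev[*count].i = i;
    ev[*count].j = j;
    (*count)++;
}

/* Runs S1 at time (i, j) and S2 at time (i + 1, j) on an n x n domain.
   ev must have room for 2*n*n events. Returns the number of events. */
int shifted_scan(int n, Event *ev) {
    int count = 0, cap = 2 * n * n;
    for (int t2 = 1; t2 <= n; t2++)            /* prologue: t1 = 1, S1 only */
        emit(ev, &count, cap, 1, 1, t2);
    for (int t1 = 2; t1 <= n; t1++)            /* kernel: both statements */
        for (int t2 = 1; t2 <= n; t2++) {
            emit(ev, &count, cap, 1, t1, t2);
            emit(ev, &count, cap, 2, t1 - 1, t2);
        }
    for (int t2 = 1; t2 <= n; t2++)            /* epilogue: t1 = n + 1, S2 only */
        emit(ev, &count, cap, 2, n, t2);
    return count;
}
```

For n = 3 the epilogue runs S2 with i = 3, which is the `i=4-1` of the slide.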
General Case
• Schedules: $\mathbb{Z}^{m_i} \to \mathbb{Z}^{n_i}$ for each statement $S_i$
• Schedules associate logical time to each iteration domain point
• Time value sets need to be separated ⇒ scattering functions
• Time part used for separation and ordering (Polylib computations $2^{dim}$ [Wilde93])
• Domain part determines the values spanned by time dimensions
• Quilleré separation phase [Quilleré00, Bastoul04]
[Figure: scattering matrix layout; Time part over the time iterators, Domain part over the domain iterators]
Separation Principles
Consider statements with domain and schedule functions such that:
• S1 has time dimensions (t1, t2, t3) spanning [2,5]x[2,7]x[5,8]
• S2 has time dimensions (t1, t2) spanning [0,3]x[5,7]
• S3 has time dimensions (t1, t2) spanning [-2,6]x[5,9]
Considering t1:
• t1 ranges: [2,5] (S1), [0,3] (S2)
• Polyhedral inter / diff ($2^{dim}$) gives [0,1] [2,3] [4,5]
[Figure labels: worklist, remaining]
Separation Principles
Considering t1, with S3 added:
• t1 ranges: [0,1] [2,3] [4,5] (from S1 and S2), [-2,6] (S3)
• Polyhedral inter / diff ($2^{dim}$) gives [-2,-1] [0,1] [2,3] [4,5] [6,6]
• $3^{nb\_stmt}$ worst-case complexity
[Figure labels: worklist, remaining, kernel]
Separation Principles
• That was for the first time dimension
• Recursively for all time dimensions
• Result is a syntax tree of the generated loops
Considering t1:
for(t1=-2;t1<=-1;t1++)
  for(t2=5;t2<=9;t2++)
    S3(…)
for(t1=0;t1<=1;t1++)
  for(t2=5;t2<=7;t2++)
    S2(…) S3(…)
  for(t2=8;t2<=9;t2++)
    S3(…)
...
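On a single time dimension the separation step reduces to cutting intervals into disjoint pieces and recording which statements are active on each piece. The sketch below is a one-dimensional stand-in for illustration only: it is not the polyhedral algorithm of [Quilleré00], and `Interval`, `Segment` and `separate_1d` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_STMTS 16

typedef struct { int lo, hi; } Interval;                /* closed [lo, hi] */
typedef struct { int lo, hi; unsigned mask; } Segment;  /* bit k set: statement k active */

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Splits the union of n intervals into disjoint segments on which the set
   of active statements is constant. out needs room for 2*n - 1 segments.
   Returns the number of segments, in increasing order. */
int separate_1d(const Interval *in, int n, Segment *out) {
    int cuts[2 * MAX_STMTS];
    int ncuts = 0, nseg = 0;
    assert(n <= MAX_STMTS);
    for (int k = 0; k < n; k++) {
        cuts[ncuts++] = in[k].lo;       /* a segment may start here */
        cuts[ncuts++] = in[k].hi + 1;   /* and the next one right after hi */
    }
    qsort(cuts, (size_t)ncuts, sizeof cuts[0], cmp_int);
    for (int c = 0; c + 1 < ncuts; c++) {
        int lo = cuts[c], hi = cuts[c + 1] - 1;
        unsigned mask = 0;
        if (lo > hi) continue;          /* duplicate cut point */
        for (int k = 0; k < n; k++)
            if (in[k].lo <= lo && hi <= in[k].hi) mask |= 1u << k;
        if (mask == 0) continue;        /* gap covered by no statement */
        out[nseg].lo = lo;
        out[nseg].hi = hi;
        out[nseg].mask = mask;
        nseg++;
    }
    return nseg;
}
```

Fed with the t1 ranges [2,5], [0,3] and [-2,6] of S1, S2 and S3, the first two segments are [-2,-1] with S3 only and [0,1] with S2 and S3, which are the first two t1 loops of the generated code above.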
Contributions
Real World Issues
• Problems provided by different sources (academia, industry, SPECFP2000)
• Exhibit different challenging issues
Code Generation Speed
• Node fusion (exploiting transformations’ “locality”)
• Exploiting scalar dimensions (replacing exponential computations with trivial ones)
• Domain iterator mapping improvement (replacing exponential by matrix inversions)
Code Quality
• Faster If-Hoisting yielding much smaller code (conditional factorization)
• Modulo Conditional removal by strip-mining (stride issue) (detailed)
State of the art polyhedral code generator: CLooG [Bastoul04]. ALL PERFORMANCE COMPARISONS WILL BE CLooG vs URGenT
Generation Speed – Node Fusion
• Multidimensional schedules allow expression of non affine (polynomial) quantities as affine ones with more dimensions ⇒ improved flexibility
• Drawback: pressure on code generation (height of the tree)
• Add parameters ⇒ add dimensions (polyhedral operation complexity)
HOWEVER
• Loop level transformations affect blocks of statements (tiling, interchange…)
• Polyhedron inclusion check is NOT exponential
Before each separation phase, fuse consecutive nodes with equal scattering polyhedra.
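The fusion rule can be sketched on a simplified representation. The code below assumes each scattering polyhedron is an axis-aligned box, so that equality is a bound-by-bound comparison; a real generator would use a polyhedral equality or inclusion check instead. `Box`, `Node` and `fuse_nodes` are hypothetical names.

```c
#include <assert.h>

#define MAX_DIM 4

typedef struct { int dim; int lo[MAX_DIM], hi[MAX_DIM]; } Box;
typedef struct { Box poly; unsigned stmts; } Node;   /* stmts: bit k set for statement k */

static int box_equal(const Box *a, const Box *b) {
    if (a->dim != b->dim) return 0;
    for (int d = 0; d < a->dim; d++)
        if (a->lo[d] != b->lo[d] || a->hi[d] != b->hi[d]) return 0;
    return 1;
}

/* Fuses runs of consecutive nodes whose boxes are equal, merging their
   statement sets in place. Returns the new number of nodes. */
int fuse_nodes(Node *nodes, int n) {
    int last = 0;
    assert(n >= 0);
    if (n == 0) return 0;
    for (int k = 1; k < n; k++) {
        if (box_equal(&nodes[last].poly, &nodes[k].poly))
            nodes[last].stmts |= nodes[k].stmts;   /* same polyhedron: one node */
        else
            nodes[++last] = nodes[k];              /* different: keep separate */
    }
    return last + 1;
}
```

Only consecutive nodes are compared, so the relative order of the statements is preserved and the separation phase that follows sees fewer inputs.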
Generation Speed – Scalar Dimensions
• Some multidimensional schedules have scalar dimensions (UTF, URUK [ICS05])
• Scalar dimensions express strict statement interleaving
• Comparison of integers, no need for polyhedral separation
• Syntactic tree height reduction (potentially half the height)
• Marginal overhead for detection and computation
• Combines well with Node Fusion
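When a time dimension is a constant for every statement, ordering the statements needs only integer comparisons. A minimal sketch, assuming each statement carries the constant value of its schedule at the current depth (the `Stmt` record and `order_by_scalar` are hypothetical):

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int id; int scalar; } Stmt;   /* scalar: constant schedule value at this depth */

static int by_scalar(const void *a, const void *b) {
    const Stmt *x = a, *y = b;
    if (x->scalar != y->scalar)
        return (x->scalar > y->scalar) - (x->scalar < y->scalar);
    return (x->id > y->id) - (x->id < y->id);   /* deterministic tie-break */
}

/* Sorts the statements by scalar value and returns the number of distinct
   values, that is the number of sibling nodes at this depth. */
int order_by_scalar(Stmt *s, int n) {
    int groups = 1;
    assert(n >= 0);
    if (n == 0) return 0;
    qsort(s, (size_t)n, sizeof s[0], by_scalar);
    for (int k = 1; k < n; k++)
        if (s[k].scalar != s[k - 1].scalar) groups++;
    return groups;
}
```

Statements that share a scalar value stay in the same group and continue to the next time dimension together.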
Generation Speed – Domain Iterator Regen.
• Generation of sequential loops for non invertible schedules (wavefronts)
• CLooG [Bas04] handles it with polyhedral projection on domain iterators
• Drawback: adds dimensions (polyhedral operation complexity)
• Drawback: additional polyhedral computations on each leaf
Use transformation invertibility (ideally, given the rank, mix of projections and invertibility)
ST after Quilleré separation phase ($3^{nb\_stmts}$)
Code Quality – If Hoisting
• Quilleré separation phase leaves conditionals on triangular loops
• Need for the so-called backtracking phase, which is too aggressive (code bloat)
• Potentially tremendous amount of useless work
[Figure: Backtracking illustration and If-Hoisting illustration; syntax trees with nodes for t1, for t2, for t3 and conditions cond: t1 <= 4, cond: 5 <= t1 <= 10, cond: 11 <= t1 over the (t1, t2) plane; labels: Code Bloat, Useless Work]
• Smaller code
• No useless work (simplification IS needed)
• Explains the generation speedup on dreamupT3
Code Quality – If Hoisting
• Previous example doesn’t take place in real life (just an illustration)
Matrix Mult. with URUK:
• strip-mine by factor 4 (x3)
• interchange loops (x2)
• unroll
Backtrack + 50%
• Let $\Theta$ be the transformation function for a statement
• Suppose $\Theta$ is invertible, and let $D$ be the matrix of denominators of $\Theta^{-1}$
• Let ISM be the Inverse Scatter Matrix built from $D\,\Theta^{-1}$ and $-D$
• Inverse Scatter Matrix expresses domain iterators from time iterators
• $D$ ensures all coefficients are integral
• Replaces leaf polyhedral projections by matrix inversions
[Figure: ISM column blocks labeled Time iterators and Domain iterators]
Substitute for usual Hermite Normal Form in stride computations
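For a 2 x 2 transformation the integral form of the inverse can be computed from the adjugate and the determinant, with one denominator per row. The sketch below is an illustration under that assumption, not the generator's actual routine; `inverse_scatter_2x2` and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

static int gcd_i(int a, int b) {
    a = abs(a);
    b = abs(b);
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

/* For an invertible integer matrix theta, computes per-row denominators d[r]
   and integers num[r][c] = d[r] * inverse(theta)[r][c].
   Returns 0 if theta is singular, 1 otherwise. */
int inverse_scatter_2x2(int theta[2][2], int num[2][2], int d[2]) {
    int det = theta[0][0] * theta[1][1] - theta[0][1] * theta[1][0];
    int adj[2][2];
    int sign;
    if (det == 0) return 0;
    adj[0][0] = theta[1][1];  adj[0][1] = -theta[0][1];
    adj[1][0] = -theta[1][0]; adj[1][1] = theta[0][0];
    sign = det < 0 ? -1 : 1;
    for (int r = 0; r < 2; r++) {
        /* Reduce the row adj[r] / det to lowest terms. */
        int g = gcd_i(gcd_i(adj[r][0], adj[r][1]), det);
        d[r] = abs(det) / g;
        assert(d[r] > 0);   /* denominators are kept positive */
        for (int c = 0; c < 2; c++)
            num[r][c] = sign * adj[r][c] / g;
    }
    return 1;
}
```

For the matrix with rows (2, 3) and (1, 0) this gives denominators 1 and 3 and integer rows (0, 1) and (1, -2), that is i = t2 and 3*j = t1 - 2*t2.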
Removing Modulos – Domain Iterator Regen.
Problem since 91: [Irigoin91], [Pingali92], [Ramanujam95], [Xue96], [Griebl98] and others …
for(t1=5;t1<=2*M+3*N;t1++)
  for(t2=?;t2<=?;t2++)
    if(t1%3 == 0) S(i=t2,j=t1/3-2*k)         (t2 = 3k)
    if(t1%3 == 2) S(i=t2,j=(t1-2)/3-2*k)     (t2 = 3k+1)
    if(t1%3 == 1) S(i=t2,j=(t1-1)/3-2*k-1)   (t2 = 3k+2)
Removing Modulos - Inverse Scatter Matrix
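• Consider S with 2 domain iterators, $\Theta = \begin{pmatrix} 2 & 3 \\ 1 & 0 \end{pmatrix}$ and $\Theta^{-1} = \begin{pmatrix} 0 & 1 \\ 1/3 & -2/3 \end{pmatrix}$
• We have the denominator matrix $D = \begin{pmatrix} 1 & 0 \\ 0 & 3 \end{pmatrix}$ and $\mathrm{ISM} = \begin{pmatrix} 0 & 1 & -1 & 0 \\ 1 & -2 & 0 & -3 \end{pmatrix}$ (columns: time iterators t1, t2, then domain iterators i, j)
• INTEGRAL. Meaning: i = t2, 3*j = t1-2*t2
Domain:
for(i=1;i<=M;i++)
  for(j=1;j<=N;j++)
    S(i,j)
$\begin{pmatrix} t_1 \\ t_2 \end{pmatrix} = \begin{pmatrix} 2 & 3 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix}$
Time:
for(t1=5;t1<=2*M+3*N;t1++)
  for(t2=1;t2<=min(M,t1/2);t2++)
    if((t1-2*t2)%3 == 0)
      S(i=t2,j=(t1-2*t2)/3)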
OUCH !!!
• SM & unroll t2 by (3 / gcd(2,3))
• SM & unroll t1 by (3 / gcd(1,3))
for(t1=?;t1<=?;t1++)
  for(t2=?;t2<=?;t2++)
    S(i=t2,j=l-k)     (t1 = 3l)
    S(i=t2,j=l-k-1)   (t1 = 3l+1)
    S(i=t2,j=l-k-2)   (t1 = 3l+2)
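The effect of strip-mining and unrolling by 3 can be checked by comparing a guarded scan with a guard-free one for t1 = 2i + 3j, t2 = i. The sketch below is a simplified variant, not the generator's output: it derives its own index expressions by writing t1 = 3l + s and t2 = 3k + r, which gives j = l - 2k + (s - 2r)/3, and it uses exact bounds per unrolled copy instead of a prologue and epilogue. All names (`Trace`, `scan_guarded`, `scan_unrolled`, `kernel_copy`) are hypothetical.

```c
#include <assert.h>

#define MAX_VISITS 1024

typedef struct { int i, j; } Point;
typedef struct { Point p[MAX_VISITS]; int n; } Trace;

static void visit(Trace *tr, int i, int j) {
    assert(tr->n < MAX_VISITS);
    tr->p[tr->n].i = i;
    tr->p[tr->n].j = j;
    tr->n++;
}

static int min_i(int a, int b) { return a < b ? a : b; }
static int max_i(int a, int b) { return a > b ? a : b; }
/* Floor and ceiling division for b > 0 (C's '/' truncates toward zero). */
static int floor_div(int a, int b) { return a / b - (a % b < 0); }
static int ceil_div(int a, int b) { return a / b + (a % b > 0); }

/* Modulo guard in the innermost loop; the extra range check on j is
   needed because the t2 bounds taken from the slide are loose. */
void scan_guarded(int M, int N, Trace *tr) {
    tr->n = 0;
    for (int t1 = 5; t1 <= 2 * M + 3 * N; t1++)
        for (int t2 = 1; t2 <= min_i(M, t1 / 2); t2++)
            if ((t1 - 2 * t2) % 3 == 0) {
                int j = (t1 - 2 * t2) / 3;
                if (1 <= j && j <= N) visit(tr, t2, j);
            }
}

/* One unrolled copy: t1 = 3*l + s, t2 = 3*k + r with r = (2*s) % 3.
   r and c are compile-time constants once s is fixed by unrolling. */
static void kernel_copy(int M, int N, int l, int s, Trace *tr) {
    int r = (2 * s) % 3;
    int c = (s - 2 * r) / 3;   /* exact: 0, -1, 0 for s = 0, 1, 2 */
    int k_lo = max_i(ceil_div(1 - r, 3), ceil_div(l + c - N, 2));
    int k_hi = min_i(floor_div(M - r, 3), floor_div(l + c - 1, 2));
    for (int k = k_lo; k <= k_hi; k++)
        visit(tr, 3 * k + r, l - 2 * k + c);   /* no modulo in the loop body */
}

void scan_unrolled(int M, int N, Trace *tr) {
    tr->n = 0;
    for (int l = 0; l <= (2 * M + 3 * N) / 3; l++) {
        kernel_copy(M, N, l, 0, tr);   /* t1 = 3l     */
        kernel_copy(M, N, l, 1, tr);   /* t1 = 3l + 1 */
        kernel_copy(M, N, l, 2, tr);   /* t1 = 3l + 2 */
    }
}
```

Both scans visit the points in lexicographic (t1, t2) order, so the two traces can be compared element by element.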
Removing Modulos – There is a CATCH
• Previous example flowed nicely ⇒ What about the loops’ bounds ???
• “Issue” (feature) with our SM + unroll transformation (strip-mine NOT strided)
• Modulos are indeed removed from the kernels only
[Figure: Prologue (P), Kernel (K), Epilogue (E)]
for(i=M;i<=N;i++)
  S(i,j)
V.S.
for(i=M;i<=N;i+=2)
  for(ii=i;ii<=min(i+1,N);ii++)
    S(ii,j)
• Code Size
• Transformation quality issue
HOWEVER:
• P and E have marginal execution time when SM factor is “decent”
• PROLOGUE gives us ALIGNMENT on %2 (strip-mine factor) !!!
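The alignment remark can be illustrated with a strip-mine by 2 whose prologue runs until the iterator is even, so the unrolled kernel always starts on a multiple of the factor. This is a minimal sketch under that reading of the slide; `scan_aligned` is a hypothetical name and the statement is replaced by recording the iterator value.

```c
#include <assert.h>

/* Visits lo..hi in order using prologue / kernel / epilogue with a
   strip-mine factor of 2. out needs room for hi - lo + 1 values.
   Returns the number of values written. */
int scan_aligned(int lo, int hi, int *out) {
    int n = 0;
    int start = lo + ((2 - lo % 2) % 2);   /* first even value >= lo */
    int i = lo;
    for (; i < start && i <= hi; i++)      /* prologue: at most 1 iteration */
        out[n++] = i;
    for (; i + 1 <= hi; i += 2) {          /* kernel: i is even, unrolled by 2 */
        assert(i % 2 == 0);
        out[n++] = i;
        out[n++] = i + 1;
    }
    for (; i <= hi; i++)                   /* epilogue: at most 1 iteration */
        out[n++] = i;
    return n;
}
```

Inside the kernel the parity of each unrolled copy is known statically, which is what lets a modulo guard on the iterator be dropped there.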
Removing Modulos – Hermite Normal Form
• Our solution unrolls modulo guards from kernels after strip-mining
• Hermite Normal Form: mathematical decomposition of the transformation matrix as U.H
• Where U is unimodular (skewing matrix)
• H is diagonal (stride in transformed space: diagonal coefficients)
• Suppresses the need for internal modulo guards
BUT
• If U is not the same, skewings are different
• Deal with non parallel lattices … how ?
• In practice, used for 1 statement or “simple” examples
All statements need to have the same transformation ⇒ TOO RESTRICTIVE
Putting it all Together – Code Size Experiments
[Table: code size; columns CL04, CL06, Improv.]
State of the art polyhedral code generator: CLooG [Bas04]. PERFORMANCE COMPARISONS: CLooG04 vs CLooG06
Generation Speed – Experiments
We compare the original CLooG (CL04) from the [Bastoul04] PACT paper with our optimized CLooG (CL06)
Domain Iterators (Swim):
• 36% time reduction
• 58% memory reduction
[Figure: Node Fusion; bar chart, y-axis % of CL04 (0 to 100); benchmarks DreamupT3, Classen, QR, Swim; series T Orig, T Scalar, M Orig, M Scalar]
[Figure: Scalar Dimensions; bar chart, y-axis % of CL04 (0 to 100); benchmarks DreamupT3, Classen (3/7), QR (4/7), Swim (6/11); series T Orig, T Scalar, M Orig, M Scalar]
Putting it all Together – Code Generation Speed Experiments
[Table: columns CL, UR, CL, UR]
• Affine Schedule: 412 2267 lines (40% execution speedup wrt best peak)
• PathScale -Ofast needs ~22s to process the AST (LNO OFF)
State of the art polyhedral code generator: CLooG [Bas04]. PERFORMANCE COMPARISONS: CLooG vs URGenT
Conclusion / Future Works
• Implemented as the Code Generation phase of the URUK framework [ICS05]
• Generation Speed Goal achieved (up to 56x, withstands the comparison with PathScale)
• Greatly improved code size with improved if-hoisting technique (up to 5.8x)
• Modulo Conditionals are removed (from kernel) ⇒ mix with HNF
• Still room for speeding up generation (caches, memory pools, parallelization)
• Focus on Code Generation Friendly transformations
Thank you !!!
www.cloog.org for full presentation & more