
Polyhedral Code Generation in the Real World

Nicolas VASILACHE, Cédric BASTOUL, Albert COHEN

http://www.lri.fr/


Outline

• Introduction

• Affine schedules

• Formal General Form

• Contributions

• Focus on Modulo Conditional Removal (speed & quality)

• Experimental Results


Introduction – Polyhedral Model

• Powerful expressiveness for high level transformations (parallelism, locality)

• Can express any composition of usual loop transformations [Pugh91]

• Compact representation of all legal transformations [Feautrier90]

• Code generation was the weakest link [Griebl et al. 98]

• Until the recent algorithm of [Quilleré00] (without transformations)

• However, code generation remains problematic on long, parametric transformation sequences on SPEC benchmarks


Introduction – Transformations


WHY TRANSFORM ???

Cholesky factorization, 6 statements, optimal allocation functions [McKin92]

Swim (SPEC FP2000) [ICS05]: ~ 30 polyhedral loop transformations, 40% speedup wrt best peak perf. on AMD64

• Huge code generation times (e.g. full Swim: 421 → 2267 lines, 20 min / 300 MB)
• In the context of complex transformations

Goal: generation time comparable to the back end (BE) of a real compiler (EKOPath)

Introduction – Context & Notations

Code Generation: syntactic loops from matrix representation

Affine Schedules


Affine Schedule – Trivial Example

Schedule: (i,j) → (t1=i, t2=j)

$$\begin{pmatrix} t_1 \\ t_2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix}$$

Original code:
    for(i=1;i<=3;i++)
      for(j=1;j<=3;j++) S(i,j)

Generated code:
    for(t1=1;t1<=3;t1++)
      for(t2=1;t2<=3;t2++) S(i=t1,j=t2)

[Figure: domain space (i, j) and time space (t1, t2), both spanning [1,3] x [1,3], with statement instances S(1,1) to S(3,3)]

• Bijection between domain and time iterations
• Time iterations determine the generated loops (nesting, bounds)
• Execution follows lexicographic order on time dimensions
• Domain values touched by the statement: i=t1, j=t2

Affine Schedule – Loop Interchange

Schedule: (i,j) → (t1=j, t2=i)

$$\begin{pmatrix} t_1 \\ t_2 \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix}$$

Generated code:
    for(t1=1;t1<=3;t1++)
      for(t2=1;t2<=3;t2++) S(i=t2,j=t1)

[Figure: domain space (i, j) and time space (t1, t2), both spanning [1,3] x [1,3], with statement instances S(1,1) to S(3,3)]

• Another bijection between domain and time iterations
• New bounds computation
• Lexicographic order on time dimensions
• Domain values touched by the statement: i=t2, j=t1
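
The identity and interchange schedules differ only in the matrix that links domain and time iterators. As a minimal sketch, assuming hypothetical helpers `run_schedule` and `kth_instance` and the 3 x 3 domain of the slides, the time space can be scanned in lexicographic order and each domain point recovered through the inverse matrix (both example matrices are their own inverse):

```c
#include <assert.h>

#define N 3  /* domain is [1,N] x [1,N], as on the slides */

/* Hypothetical helper: scan the time space in lexicographic order and
 * recover the domain point with the inverse schedule matrix inv.
 * order[k] receives 10*i + j for the k-th executed instance S(i,j). */
static void run_schedule(const int inv[2][2], int order[N * N])
{
    int k = 0;
    for (int t1 = 1; t1 <= N; t1++) {
        for (int t2 = 1; t2 <= N; t2++) {
            int i = inv[0][0] * t1 + inv[0][1] * t2;
            int j = inv[1][0] * t1 + inv[1][1] * t2;
            assert(i >= 1 && i <= N && j >= 1 && j <= N);
            order[k++] = 10 * i + j;
        }
    }
}

/* Returns the k-th executed instance (encoded as 10*i + j) under the
 * identity schedule (interchange == 0) or the interchange schedule. */
int kth_instance(int interchange, int k)
{
    static const int identity[2][2] = {{1, 0}, {0, 1}};
    static const int swap[2][2]     = {{0, 1}, {1, 0}};
    int order[N * N];
    assert(k >= 0 && k < N * N);
    run_schedule(interchange ? swap : identity, order);
    return order[k];
}
```

With the identity schedule the second executed instance is S(1,2); with interchange it is S(2,1), while the loop bounds happen to stay the same because the domain is square.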

Affine Schedule – Parallel Wavefronts

Schedule: (i,j) → (t1=i+j)

$$t_1 = \begin{pmatrix} 1 & 1 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix}$$

Generated code:
    for(t1=2;t1<=6;t1++)
      DOALL{(i,j)|i+j==t1} S(i,j)

[Figure: domain space (i, j) spanning [1,3] x [1,3] and one-dimensional time space t1 spanning [2,6], with statement instances S(1,1) to S(3,3)]

• NOT a bijection (just a surjection)
• New bounds computation (t1: [2,6])
• Domain values touched by the statement: {(i,j)|i+j==t1}
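
The DOALL set {(i,j)|i+j==t1} can be scanned with one explicit loop once bounds on i are derived from 1 <= j <= 3. The sketch below is one possible way to write it in C; the function names are hypothetical and the inner loop is written sequentially, although its iterations are the ones the slide marks as parallel:

```c
#include <assert.h>

#define N 3  /* domain is [1,N] x [1,N] */

static int imax(int a, int b) { return a > b ? a : b; }
static int imin(int a, int b) { return a < b ? a : b; }

/* Number of instances S(i,j) on the wavefront t1 = i + j. */
int wavefront_size(int t1)
{
    int count = 0;
    /* i ranges over the values for which j = t1 - i stays in [1,N]. */
    for (int i = imax(1, t1 - N); i <= imin(N, t1 - 1); i++) {
        int j = t1 - i;
        assert(j >= 1 && j <= N);
        count++;  /* S(i, j) would execute here */
    }
    return count;
}

/* Total number of instances executed by the sequential t1 loop. */
int total_instances(void)
{
    int total = 0;
    for (int t1 = 2; t1 <= 2 * N; t1++)
        total += wavefront_size(t1);
    return total;
}
```

The wavefronts have sizes 1, 2, 3, 2, 1 for t1 = 2..6, so all 9 instances are executed exactly once.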

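Affine Schedule – Statement Shifting

Schedules:
S1: (i,j) → (t1=i, t2=j)
S2: (i,j) → (t1=i+1, t2=j)

$$\begin{pmatrix} t_1 \\ t_2 \end{pmatrix}_{S1} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix}, \qquad \begin{pmatrix} t_1 \\ t_2 \end{pmatrix}_{S2} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix} + \begin{pmatrix} 1 \\ 0 \end{pmatrix}$$

Original code:
    for(i=1;i<=3;i++)
      for(j=1;j<=3;j++) S1(i,j) S2(i,j)

Generated code, with prologue (P), kernel (K) and epilogue (E):
    for(t2=1;t2<=3;t2++) S1(i=1,j=t2)                      (P)
    for(t1=2;t1<=3;t1++)
      for(t2=1;t2<=3;t2++) S1(i=t1,j=t2) S2(i=t1-1,j=t2)   (K)
    for(t2=1;t2<=3;t2++) S2(i=4-1,j=t2)                    (E)

[Figure: domain spaces (i, j) spanning [1,3] x [1,3] and time space with t1 spanning [1,4] and t2 spanning [1,3]]

• New bounds computation (S1: [1,3]x[1,3], S2: [2,4]x[1,3]): the time domains have disjoint parts
• Separation phase needed on each time dimension (3^nb_stmt worst-case complexity)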
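
A quick way to convince oneself that the prologue / kernel / epilogue code is a valid scan of both statements is to count executions. The following sketch, with hypothetical counters and helper names, runs the separated loop nest for S1 at t1 = i and S2 at t1 = i + 1 and checks that every instance executes exactly once:

```c
#include <assert.h>
#include <string.h>

#define N 3  /* domain is [1,N] x [1,N] for both statements */

/* seen[s][i][j] counts executions of statement S(s+1) at (i,j). */
static int seen[2][N + 2][N + 2];

static void S1(int i, int j)
{
    assert(i >= 1 && i <= N && j >= 1 && j <= N);
    seen[0][i][j]++;
}

static void S2(int i, int j)
{
    assert(i >= 1 && i <= N && j >= 1 && j <= N);
    seen[1][i][j]++;
}

/* Separated code for S1 scheduled at t1 = i and S2 at t1 = i + 1. */
static void shifted_nest(void)
{
    /* Prologue: t1 = 1, only S1 is scheduled. */
    for (int t2 = 1; t2 <= N; t2++) S1(1, t2);
    /* Kernel: t1 in [2,N], both statements are scheduled. */
    for (int t1 = 2; t1 <= N; t1++)
        for (int t2 = 1; t2 <= N; t2++) { S1(t1, t2); S2(t1 - 1, t2); }
    /* Epilogue: t1 = N + 1, only S2 is scheduled. */
    for (int t2 = 1; t2 <= N; t2++) S2(N, t2);
}

/* Returns 1 if every instance of S1 and S2 ran exactly once. */
int shifted_nest_is_complete(void)
{
    memset(seen, 0, sizeof seen);
    shifted_nest();
    for (int s = 0; s < 2; s++)
        for (int i = 1; i <= N; i++)
            for (int j = 1; j <= N; j++)
                if (seen[s][i][j] != 1) return 0;
    return 1;
}
```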

General Case

• Schedules: $\mathbb{Z}^{m_i} \to \mathbb{Z}^{n_i}$ for each statement $S_i$

• Schedules associate a logical time to each iteration domain point
• Time value sets need to be separated → scattering functions

[Figure: scattering matrix layout, with a time part (time iterators) and a domain part (domain iterators)]

• Time part used for separation and ordering (PolyLib computations, 2^dim [Wilde93])
• Domain part determines the values spanned by time dimensions
• Quilleré separation phase [Quilleré00, Bastoul04]

Quilleré separation phase


Separation Principles

Consider statements with domain and schedule functions such that:
• S1 has time dimensions (t1, t2, t3) spanning [2,5]x[2,7]x[5,8]
• S2 has time dimensions (t1, t2) spanning [0,3]x[5,7]
• S3 has time dimensions (t1, t2) spanning [-2,6]x[5,9]

[Figure: considering t1. Polyhedral intersection / difference (2^dim) of [2,5] and [0,3] yields [0,1], [2,3], [4,5]; the rest of the worklist remains]


[Figure: considering t1, continued. Polyhedral intersection / difference (2^dim) of [0,1], [2,3], [4,5] with the remaining worklist entry [-2,6]; the resulting intervals include [-2,-1], [0,1], [2,3], with the kernel marked]

3^nb_stmt worst-case complexity


• That was for the first time dimension
• Recursively for all time dimensions
• Result is a syntax tree of the generated loops

Considering t1:
    for(t1=-2;t1<=-1;t1++)
      for(t2=5;t2<=9;t2++) S3(…)
    for(t1=0;t1<=1;t1++)
      for(t2=5;t2<=7;t2++) S2(…) S3(…)
      for(t2=8;t2<=9;t2++) S3(…)
    ...
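
On a single time dimension, separation amounts to cutting overlapping ranges into disjoint segments on which the set of active statements is constant. The sketch below is a deliberately simplified stand-in: it works on closed integer intervals rather than on polyhedra, so it replaces the polyhedral intersection / difference of the actual separation phase with plain comparisons. The `segment` type and `separate` function are hypothetical names.

```c
#include <assert.h>

#define MAX_STMT 8

/* One output segment: an interval on t1 and the set of statements active
 * on it, stored as a bit mask (bit s set for statement number s). */
struct segment { int lo, hi; unsigned stmts; };

/* Separate n closed integer intervals [lo[s], hi[s]] into disjoint segments
 * with a constant statement set, in increasing order.
 * out must have room for 2*n - 1 segments. Returns the segment count. */
int separate(int n, const int lo[], const int hi[], struct segment out[])
{
    int cuts[2 * MAX_STMT];
    int ncuts = 0, nseg = 0;
    assert(n > 0 && n <= MAX_STMT);
    /* Each interval contributes two cut points: its start and one past its end. */
    for (int s = 0; s < n; s++) {
        assert(lo[s] <= hi[s]);
        cuts[ncuts++] = lo[s];
        cuts[ncuts++] = hi[s] + 1;
    }
    /* Insertion sort of the cut points (n is small). */
    for (int a = 1; a < ncuts; a++) {
        int v = cuts[a], b = a - 1;
        while (b >= 0 && cuts[b] > v) { cuts[b + 1] = cuts[b]; b--; }
        cuts[b + 1] = v;
    }
    /* Between two distinct consecutive cuts the active set is constant. */
    for (int c = 0; c + 1 < ncuts; c++) {
        unsigned mask = 0;
        if (cuts[c] == cuts[c + 1]) continue;
        for (int s = 0; s < n; s++)
            if (lo[s] <= cuts[c] && cuts[c] <= hi[s]) mask |= 1u << s;
        if (mask != 0) {
            out[nseg].lo = cuts[c];
            out[nseg].hi = cuts[c + 1] - 1;
            out[nseg].stmts = mask;
            nseg++;
        }
    }
    return nseg;
}
```

Applied to the t1 ranges of the example (S1 on [2,5], S2 on [0,3], S3 on [-2,6]), this yields five segments whose first two, [-2,-1] with S3 only and [0,1] with S2 and S3, match the outer loops of the generated code. The real algorithm then recurses on the next time dimension inside each segment.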

Contributions

Real World Issues

• Problems provided by different sources (academia, industry, SPECFP2000)
• Exhibit different challenging issues

Code Generation Speed

• Node fusion (exploiting transformations’ “locality”)
• Exploiting scalar dimensions (replacing exponential computations with trivial ones)
• Domain iterator mapping improvement (replacing exponential computations by matrix inversions)

Code Quality

• Faster if-hoisting yielding much smaller code (conditional factorization)
• Modulo conditional removal by strip-mining (stride issue) (detailed)

State of the art polyhedral code generator: CLooG [Bastoul04]. All performance comparisons will be CLooG vs URGenT.

Generation Speed Improvements


Generation Speed – Node Fusion

• Multidimensional schedules allow expression of non affine (polynomial) quantities as affine ones with more dimensions → improved flexibility
• Drawback → pressure on code generation (height of the tree)
• Add parameters → add dimensions (polyhedral operation complexity)

HOWEVER

• Loop level transformations affect blocks of statements (tiling, interchange…)
• Polyhedron inclusion check is NOT exponential

Before each separation phase, fuse consecutive nodes with equal scattering polyhedra.
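
The fusion step can be pictured as a single pass over the node list. In the sketch below, a scattering polyhedron is simplified to a fixed-size constraint matrix and two polyhedra count as equal when their matrices are identical; this is an assumption made for brevity, since a real implementation would test inclusion in both directions. The `node` type and `fuse_nodes` function are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define ROWS 4
#define COLS 3

/* A worklist node: a scattering polyhedron, simplified here to a fixed-size
 * constraint matrix, and the statements attached to it (bit mask). */
struct node {
    int constraints[ROWS][COLS];
    unsigned stmts;
};

/* Fuse runs of consecutive nodes with identical constraint matrices,
 * in place. Returns the new number of nodes. */
int fuse_nodes(struct node nodes[], int n)
{
    int out = 0;
    assert(n >= 0);
    for (int k = 0; k < n; k++) {
        if (out > 0 && memcmp(nodes[out - 1].constraints, nodes[k].constraints,
                              sizeof nodes[k].constraints) == 0) {
            nodes[out - 1].stmts |= nodes[k].stmts;  /* same polyhedron: merge */
        } else {
            nodes[out++] = nodes[k];                 /* start a new node */
        }
    }
    return out;
}
```

Only consecutive nodes are merged, so the relative order of statements is preserved; the separation phase then works on fewer nodes.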


Generation Speed – Scalar Dimensions

• Some multidimensional schedules have scalar dimensions (UTF, URUK [ICS05])
• Scalar dimensions express strict statement interleaving

• Comparison of integers, no need for polyhedral separation
• Syntactic tree height reduction (potentially half the height)
• Marginal overhead for detection and computation
• Combines well with Node Fusion

Generation Speed – Domain Iterator Regen.

• Generation of sequential loops for non invertible schedules (wavefronts)
• CLooG [Bas04] handles it with polyhedral projection on domain iterators
• Drawback → adds dimensions (polyhedral operation complexity)
• Drawback → additional polyhedral computations on each leaf

Use transformation invertibility (ideally, given the rank, a mix of projections and invertibility)

[Figure: syntax tree (ST) after the Quilleré separation phase (3^nb_stmts)]

Code Quality Improvements


Code Quality – If Hoisting

• Quilleré separation phase leaves conditionals on triangular loops
• Need of the so-called backtracking phase → too aggressive (code bloat)
• Potentially tremendous amount of useless work

[Figure: backtracking illustration. Loop nest for t1 / for t2 / for t3 with conditions t1 <= 4, 5 <= t1 <= 10 and 11 <= t1; labels: code bloat, useless work]

[Figure: if-hoisting illustration on the loop nest for t1 / for t2 / for t3]

• Smaller code
• No useless work (simplification IS needed)
• Explains the generation speedup on dreamupT3

Code Quality – If Hoisting

• Previous example doesn’t take place in real life (just an illustration)

Matrix Mult. with URUK:
• strip-mine by factor 4 (x3)
• interchange loops (x2)
• unroll

Backtrack: +50%

Removing Modulos – Domain Iterator Regen.

Problem since 91: [Irigoin91], [Pingali92], [Ramanujam95], [Xue96], [Griebl98] and others …

• Let $T$ be the transformation function for a statement
• Suppose $T$ is invertible, and let $D$ be the matrix of denominators of $T^{-1}$
• Let $T' = D\,T^{-1}$ and $ISM = [\,T' \mid -D\,]$

• The Inverse Scatter Matrix (ISM) expresses domain iterators from time iterators
• $D$ ensures all coefficients are integral
• Replaces leaf polyhedral projections by matrix inversions

[Figure: ISM layout, with columns for time iterators followed by columns for domain iterators]

Substitute for the usual Hermite Normal Form in stride computations
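
For a 2 x 2 schedule the inverse scatter matrix can be built with integer arithmetic only. The sketch below assumes the layout [D·T⁻¹ | −D], with one denominator per row of T⁻¹, and the symbol names T and D; `inverse_scatter` and `gcd` are hypothetical helpers, not functions of an existing library.

```c
#include <assert.h>
#include <stdlib.h>

static int gcd(int a, int b)
{
    a = abs(a); b = abs(b);
    while (b != 0) { int r = a % b; a = b; b = r; }
    return a;
}

/* Build the 2x4 inverse scatter matrix [D * T^-1 | -D] of an invertible
 * 2x2 integer matrix T. Columns are (t1, t2, i, j); each row r reads
 *   ism[r][0]*t1 + ism[r][1]*t2 + ism[r][2]*i + ism[r][3]*j = 0. */
void inverse_scatter(const int T[2][2], int ism[2][4])
{
    int det = T[0][0] * T[1][1] - T[0][1] * T[1][0];
    /* Adjugate of T: T^-1 = adj / det. */
    int adj[2][2] = {{ T[1][1], -T[0][1] }, { -T[1][0], T[0][0] }};
    assert(det != 0);
    for (int r = 0; r < 2; r++) {
        /* Reduce row r of adj/det to lowest terms; d is its denominator. */
        int g = gcd(det, gcd(adj[r][0], adj[r][1]));
        int sign = det < 0 ? -1 : 1;
        int d = abs(det) / g;
        ism[r][0] = sign * adj[r][0] / g;
        ism[r][1] = sign * adj[r][1] / g;
        ism[r][2] = (r == 0) ? -d : 0;
        ism[r][3] = (r == 1) ? -d : 0;
    }
}
```

For the schedule t1 = 2i + 3j, t2 = i, the two rows read t2 − i = 0 and t1 − 2·t2 − 3j = 0: every coefficient is an integer, and the factor 3 in front of j is where a stride or modulo condition comes from.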

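Removing Modulos – Inverse Scatter Matrix

Original code:
    for(i=1;i<=M;i++)
      for(j=1;j<=N;j++) S(i,j)

• Consider S with 2 domain iterators, transformation $T = \begin{pmatrix} 2 & 3 \\ 1 & 0 \end{pmatrix}$ and inverse $T^{-1} = \begin{pmatrix} 0 & 1 \\ 1/3 & -2/3 \end{pmatrix}$
• We have denominators $D = \begin{pmatrix} 1 & 0 \\ 0 & 3 \end{pmatrix}$ and $ISM = \begin{pmatrix} 0 & 1 & -1 & 0 \\ 1 & -2 & 0 & -3 \end{pmatrix}$ (columns: time iterators t1, t2, then domain iterators i, j)

INTEGRAL. Meaning: i = t2, 3*j = t1-2*t2

Generated code (ceild and floord denote the ceiling and floor of the integer division):
    for(t1=5;t1<=2*M+3*N;t1++)
      for(t2=max(1,ceild(t1-3*N,2));t2<=min(M,floord(t1-3,2));t2++)
        if((t1-2*t2)%3 == 0) S(i=t2,j=(t1-2*t2)/3)

OUCH !!! The modulo guard sits in the innermost loop.

Strip-mine (SM) & unroll t2 by (3 / gcd(2,3)):
    for(t1=5;t1<=2*M+3*N;t1++)
      for(t2=?;t2<=?;t2++)
        if(t1%3 == 0) S(i=t2,j=t1/3-2*k)         (t2 = 3k)
        if(t1%3 == 2) S(i=t2,j=(t1-2)/3-2*k)     (t2 = 3k+1)
        if(t1%3 == 1) S(i=t2,j=(t1-1)/3-2*k-1)   (t2 = 3k+2)

SM & unroll t1 by (3 / gcd(1,3)):
    for(t1=?;t1<=?;t1++)
      for(t2=?;t2<=?;t2++)
        S(i=t2,j=l-k)     (t1 = 3l)
        S(i=t2,j=l-k-1)   (t1 = 3l+1)
        S(i=t2,j=l-k-2)   (t1 = 3l+2)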
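
The effect of strip-mining and unrolling t2 can be checked on the schedule t1 = 2i + 3j, t2 = i. The sketch below is one possible realization under stated assumptions, not the generator's actual output: the t2 bounds are derived here from 1 <= j <= N, tiles are aligned on multiples of 3, partial tiles fall back to the modulo-guarded body, and all helper names (`guarded`, `scan`, `covers_domain_once`, `MAXD`) are hypothetical. In the kernel, the only remaining tests depend on t1 alone, so they no longer vary inside the tile loop.

```c
#include <assert.h>
#include <string.h>

#define MAXD 16                       /* hypothetical cap on M and N */
static int hits[MAXD + 1][MAXD + 1];  /* hits[i][j]: executions of S(i,j) */
static int dom_m, dom_n;

static int imax(int a, int b) { return a > b ? a : b; }
static int imin(int a, int b) { return a < b ? a : b; }
/* floor(a / b) and ceil(a / b) for b > 0, also correct for negative a. */
static int floord(int a, int b) { return a >= 0 ? a / b : -((-a + b - 1) / b); }
static int ceild(int a, int b) { return -floord(-a, b); }

static void S(int i, int j)
{
    assert(i >= 1 && i <= dom_m && j >= 1 && j <= dom_n);
    hits[i][j]++;
}

/* Modulo-guarded body for t2 in [lo, hi]: the whole t2 range in the
 * reference version, only partial tiles in the strip-mined one. */
static void guarded(int t1, int lo, int hi)
{
    for (int t2 = lo; t2 <= hi; t2++)
        if ((t1 - 2 * t2) % 3 == 0) S(t2, (t1 - 2 * t2) / 3);
}

static void scan(int M, int N, int strip_mine)
{
    for (int t1 = 5; t1 <= 2 * M + 3 * N; t1++) {
        int lb = imax(1, ceild(t1 - 3 * N, 2));   /* from j <= N */
        int ub = imin(M, floord(t1 - 3, 2));      /* from j >= 1 */
        int klo = ceild(lb, 3), khi = floord(ub - 2, 3);
        if (!strip_mine || klo > khi) { guarded(t1, lb, ub); continue; }
        guarded(t1, lb, 3 * klo - 1);             /* prologue */
        for (int k = klo; k <= khi; k++) {        /* kernel: t2 = 3k, 3k+1, 3k+2 */
            if (t1 % 3 == 0) S(3 * k,     t1 / 3 - 2 * k);
            if (t1 % 3 == 2) S(3 * k + 1, (t1 - 2) / 3 - 2 * k);
            if (t1 % 3 == 1) S(3 * k + 2, (t1 - 1) / 3 - 2 * k - 1);
        }
        guarded(t1, 3 * khi + 3, ub);             /* epilogue */
    }
}

/* Returns 1 if every S(i,j), 1<=i<=M, 1<=j<=N, executed exactly once. */
int covers_domain_once(int M, int N, int strip_mine)
{
    assert(M >= 1 && M <= MAXD && N >= 1 && N <= MAXD);
    dom_m = M; dom_n = N;
    memset(hits, 0, sizeof hits);
    scan(M, N, strip_mine);
    for (int i = 1; i <= M; i++)
        for (int j = 1; j <= N; j++)
            if (hits[i][j] != 1) return 0;
    return 1;
}
```

Both the guarded and the strip-mined scans execute every instance of the original M x N domain exactly once; the bounds check in `S` would trip if either version left the domain.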

Removing Modulos – There is a CATCH

• Previous example flowed nicely → what about the loops’ bounds ???
• “Issue” (feature) with our SM + unroll transformation (strip-mine NOT strided)
• Modulos are indeed removed from the kernels only

[Figure: prologue (P), kernel (K), epilogue (E); label: code size]

V.S. strided strip-mining of for(i=M;i<=N;i++) S(i,j):
    for(i=M;i<=N;i+=2)
      for(ii=i;ii<=min(i+1,N);ii++) S(ii,j)

HOWEVER:
• P and E have marginal execution time when the SM factor is “decent”
• PROLOGUE gives us ALIGNMENT on %2 (strip-mine factor) !!!

Transformation quality issue
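
The difference with a strided strip-mine is that tiles start at multiples of the strip-mine factor rather than at the lower bound. A minimal sketch of such an aligned scan, with a hypothetical `strip_mined_scan` helper and the factor 2 of the slide:

```c
#include <assert.h>

#define SM 2  /* strip-mine factor, as on the slide */

/* floor(a / b) for b > 0, also correct for negative a. */
static int floord(int a, int b) { return a >= 0 ? a / b : -((-a + b - 1) / b); }

/* Visit i = lo..hi as prologue, aligned kernel tiles, epilogue.
 * visited[] receives the iterations in execution order; *full_tiles receives
 * the number of kernel tiles, each starting at a multiple of SM.
 * Returns the number of iterations executed. */
int strip_mined_scan(int lo, int hi, int visited[], int *full_tiles)
{
    int n = 0;
    int first = SM * (-floord(-lo, SM));     /* first multiple of SM >= lo */
    int last  = SM * floord(hi + 1, SM);     /* one past the last full tile */
    *full_tiles = 0;
    if (first >= last) {                     /* no full tile: plain loop */
        for (int i = lo; i <= hi; i++) visited[n++] = i;
        return n;
    }
    for (int i = lo; i < first; i++) visited[n++] = i;          /* prologue */
    for (int i = first; i < last; i += SM) {                    /* kernel */
        assert(i % SM == 0);
        for (int ii = i; ii < i + SM; ii++) visited[n++] = ii;  /* no min() */
        (*full_tiles)++;
    }
    for (int i = last; i <= hi; i++) visited[n++] = i;          /* epilogue */
    return n;
}
```

The kernel's inner loop has constant bounds and a known alignment, which is what lets modulo guards be unrolled away there; the price is the extra prologue and epilogue code.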

Removing Modulos – Hermite Normal Form

• Our solution unrolls modulo guards from kernels after strip-mining
• Hermite Normal Form: mathematical decomposition of the transformation $T = U.H$
• Where U is unimodular (skewing matrix)
• H is triangular (strides in the transformed space are its diagonal coefficients)
• Suppresses the need for internal modulo guards

BUT
• If U is not the same, skewings are different
• Deal with non parallel lattices … how ?
• In practice, used for 1 statement or “simple” examples

All statements need to have the same transformation → TOO RESTRICTIVE
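
As a worked instance, the schedule t1 = 2i + 3j, t2 = i admits a decomposition of this form that can be checked by hand. The symbols T and y below are assumptions introduced for the illustration:

```latex
T = \begin{pmatrix} 2 & 3 \\ 1 & 0 \end{pmatrix}
  = \underbrace{\begin{pmatrix} 2 & 1 \\ 1 & 0 \end{pmatrix}}_{U,\ \det U = -1}
    \underbrace{\begin{pmatrix} 1 & 0 \\ 0 & 3 \end{pmatrix}}_{H},
\qquad
y = U^{-1} \begin{pmatrix} t_1 \\ t_2 \end{pmatrix}
  = \begin{pmatrix} t_2 \\ t_1 - 2 t_2 \end{pmatrix}
  = H \begin{pmatrix} i \\ j \end{pmatrix}
  = \begin{pmatrix} i \\ 3j \end{pmatrix}
```

In the skewed coordinates y, the second coordinate only takes multiples of 3, so a loop over it can use a stride of 3 instead of a modulo guard. With a second statement whose U differs, the skewed spaces no longer coincide, which is the restriction listed above.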

Experimental Results


Putting it all Together – Code Size Experiments

[Table: columns CL04, CL06, Improv.]

State of the art polyhedral code generator: CLooG [Bas04]. Performance comparisons: CLooG04 vs CLooG06

Generation Speed – Experiments

We compare the original CLooG (CL04) from the [Bastoul04] PACT paper with our optimized CLooG (CL06)

[Figure: Node Fusion. Series T Orig, T Scalar, M Orig, M Scalar as % of CL04 (axis 0 to 100) for DreamupT3, Classen, QR, Swim]

[Figure: Scalar Dimensions. Series T Orig, T Scalar, M Orig, M Scalar as % of CL04 (axis 0 to 100) for DreamupT3, Classen (3/7), QR (4/7), Swim (6/11)]

Domain Iterators, Swim:
• 36% time reduction
• 58% memory reduction

Putting it all Together – Code Generation Speed Experiments

[Table: columns CL, UR, CL, UR]

• Affine Schedule: 412 → 2267 lines (40% execution speedup wrt best peak)
• PathScale -Ofast needs ~22s to process the AST (LNO OFF)

State of the art polyhedral code generator: CLooG [Bas04]. Performance comparisons: CLooG vs URGenT

Conclusion / Future Works

• Implemented as the Code Generation phase of the URUK framework [ICS05]

• Generation speed goal achieved (up to 56x, stands the comparison with PathScale)

• Greatly improved code size with improved if-hoisting technique (up to 5.8x)

• Modulo conditionals are removed (from kernel) → mix with HNF

• Still room for speeding up generation (caches, memory pools, parallelization)

• Focus on Code Generation Friendly transformations


Thank you !!!

www.cloog.org for full presentation & more

http://www.cloog.org/
