Polyhedral Code Generation In The Real World Nicolas VASILACHE Cédric BASTOUL Albert COHEN


Post on 19-Jan-2018


Page 1: Polyhedral Code Generation In The Real World Nicolas VASILACHE Cédric BASTOUL Albert COHEN

Polyhedral Code Generation In The Real World

Nicolas VASILACHE
Cédric BASTOUL
Albert COHEN

Page 2

Outline

• Introduction

• Affine schedules

• Formal General Form

• Contributions

• Focus on Modulo Conditional Removal (speed & quality)

• Experimental Results

Page 3

Introduction – Polyhedral Model

• Powerful expressiveness for high level transformations (parallelism, locality)

• Can express any composition of usual loop transformations [Pugh91]

• Compact representation of all legal transformations [Feautrier90]

• Code Generation was the weakest link [Griebl & al. 98]

• Until recent algorithm [Quilleré00] without transformations

• However, still problematic on long, parametric sequences on SPECs

Page 4

Introduction – Transformations

Why transform?

• Cholesky factorization, 6 statements: optimal allocation functions [McKin92]
• Swim (SPEC FP2000) [ICS05]: ~30 polyhedral loop transformations, 40% speedup w.r.t. best peak performance on AMD64

• Huge code generation times (e.g. full Swim ~ 421 2267 lines, 20 min / 300 MB)
• In the context of complex transformations

Goal: generation time comparable to the back end (BE) of a real compiler (EKOPath)

Page 5

Introduction – Context & Notations

Code Generation: syntactic loops from matrix representation

Page 6

Affine Schedules

Page 7

Affine Schedule – Trivial Example

Schedule: (i,j) → (t1=i, t2=j), with schedule matrix [1 0; 0 1]

Original code (domain):
for(i=1;i<=3;i++)
  for(j=1;j<=3;j++)
    S(i,j)

Generated code (time):
for(t1=1;t1<=3;t1++)
  for(t2=1;t2<=3;t2++)
    S(i=t1,j=t2)

[Figure: the iteration domain (i, j) and its image in time space (t1, t2); every axis ranges over [1,3]]

• Bijection between domain and time iterations
• Time iterations determine the generated loops (nesting, bounds)
• Execution follows lexicographic order on time dimensions
• Domain values touched by the statement: i=t1, j=t2

Statement instances: S(1,1) S(1,2) S(1,3) S(2,1) S(2,2) S(2,3) S(3,1) S(3,2) S(3,3)

Page 8

Affine Schedule – Loop Interchange

Schedule: (i,j) → (t1=j, t2=i), with schedule matrix [0 1; 1 0]

Original code (domain):
for(i=1;i<=3;i++)
  for(j=1;j<=3;j++)
    S(i,j)

Generated code (time):
for(t1=1;t1<=3;t1++)
  for(t2=1;t2<=3;t2++)
    S(i=t2,j=t1)

[Figure: the iteration domain (i, j) and its image in time space (t1, t2); every axis ranges over [1,3]]

• Another bijection between domain and time iterations
• New bounds computation
• Lexicographic order on time dimensions
• Domain values touched by the statement: i=t2, j=t1

Statement instances: S(1,1) S(1,2) S(1,3) S(2,1) S(2,2) S(2,3) S(3,1) S(3,2) S(3,3)

Page 9

Affine Schedule – Parallel Wavefronts

Schedule: (i,j) → (t1=i+j), with schedule matrix [1 1]

Original code (domain):
for(i=1;i<=3;i++)
  for(j=1;j<=3;j++)
    S(i,j)

Generated code (time):
for(t1=2;t1<=6;t1++)
  DOALL{(i,j)|i+j==t1}
    S(i,j)

[Figure: the iteration domain (i, j) over [1,3]x[1,3] and the one-dimensional time axis t1 over [2,6]]

• NOT a bijection (just a surjection)
• New bounds computation (t1: [2, 6])
• Domain values touched by the statement: {(i,j)|i+j==t1}

Statement instances: S(1,1) S(1,2) S(1,3) S(2,1) S(2,2) S(2,3) S(3,1) S(3,2) S(3,3)
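The DOALL set {(i,j)|i+j==t1} can be scanned with ordinary loop bounds once j is eliminated. The sketch below is one possible rendering in plain C, not the output of any particular code generator: the names `run_wavefronts`, `wavefront_of` and the recording array are hypothetical, and the inner loop is written sequentially although its iterations are independent.

```c
#include <assert.h>

enum { N = 3 };               /* domain is [1,N] x [1,N], as on the slide */

static int executed_at[N + 1][N + 1];   /* wavefront in which (i,j) ran */

static int imax(int a, int b) { return a > b ? a : b; }
static int imin(int a, int b) { return a < b ? a : b; }

/* Hypothetical statement body: only records its wavefront number. */
static void S(int i, int j, int t1) {
    assert(i >= 1 && i <= N && j >= 1 && j <= N);
    executed_at[i][j] = t1;
}

/* Scans the schedule t1 = i + j and returns the number of instances run. */
int run_wavefronts(void) {
    int count = 0;
    for (int t1 = 2; t1 <= 2 * N; t1++) {
        /* i + j == t1 with 1 <= j <= N gives max(1,t1-N) <= i <= min(N,t1-1) */
        for (int i = imax(1, t1 - N); i <= imin(N, t1 - 1); i++) {
            S(i, t1 - i, t1);   /* iterations of this loop are independent */
            count++;
        }
    }
    return count;
}

/* Accessor for the recorded wavefront of instance (i,j). */
int wavefront_of(int i, int j) { return executed_at[i][j]; }
```

Calling `run_wavefronts()` visits each of the nine instances exactly once; `wavefront_of(i, j)` then returns i+j for every point of the domain.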

Page 10

Affine Schedule – Statement Shifting

Schedules: (i,j)1 → (t1=i, t2=j) and (i,j)2 → (t1=i+1, t2=j)
Schedule matrices: [1 0; 0 1] for both statements, with a constant shift of 1 on t1 for S2

Original code (domain):
for(i=1;i<=3;i++)
  for(j=1;j<=3;j++) {
    S1(i,j)
    S2(i,j)
  }

Generated code (time), split into prologue (P), kernel (K) and epilogue (E):
for(t2=1;t2<=3;t2++)              (P)
  S1(i=1,j=t2)
for(t1=2;t1<=3;t1++)              (K)
  for(t2=1;t2<=3;t2++) {
    S1(i=t1,j=t2)
    S2(i=t1-1,j=t2)
  }
for(t2=1;t2<=3;t2++)              (E)
  S2(i=4-1,j=t2)

[Figure: the domains of S1 and S2 over [1,3]x[1,3] and their images in time space, t1 in [1,4], t2 in [1,3]]

• New bounds computation (S1: [1,3]x[1,3], S2: [2,4]x[1,3]): the time ranges have disjoint parts
• Separation phase needed on each time dimension (3^nb_stmt worst-case complexity)
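As a sanity check of the shifted loop nest, the following sketch runs the prologue, kernel and epilogue and records the time step at which each instance executes. It is an illustration only: `run_shifted`, the counters and the recording arrays are hypothetical names, and the statement bodies do nothing but log.

```c
#include <assert.h>

enum { N = 3 };                 /* both domains are [1,N] x [1,N] */

int s1_time[N + 1][N + 1];      /* t1 at which S1(i,j) executed */
int s2_time[N + 1][N + 1];      /* t1 at which S2(i,j) executed */
int s1_runs, s2_runs;

static void S1(int i, int j, int t1) {
    assert(i >= 1 && i <= N && j >= 1 && j <= N);
    s1_time[i][j] = t1;
    s1_runs++;
}

static void S2(int i, int j, int t1) {
    assert(i >= 1 && i <= N && j >= 1 && j <= N);
    s2_time[i][j] = t1;
    s2_runs++;
}

/* S1 scheduled at t1 = i, S2 at t1 = i + 1, both with t2 = j. */
void run_shifted(void) {
    s1_runs = s2_runs = 0;
    /* Prologue: t1 = 1, only S1 is active */
    for (int t2 = 1; t2 <= N; t2++)
        S1(1, t2, 1);
    /* Kernel: t1 in [2,N], both statements are active */
    for (int t1 = 2; t1 <= N; t1++)
        for (int t2 = 1; t2 <= N; t2++) {
            S1(t1, t2, t1);
            S2(t1 - 1, t2, t1);
        }
    /* Epilogue: t1 = N + 1, only S2 is active */
    for (int t2 = 1; t2 <= N; t2++)
        S2(N, t2, N + 1);
}
```

After `run_shifted()`, each statement has executed nine times, and for every (i,j) the recorded times satisfy `s2_time[i][j] == s1_time[i][j] + 1`, which is exactly the shift expressed by the schedule.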

Page 11

General Case

• Schedules: Z^mi → Z^ni for each statement Si
• Schedules associate a logical time to each iteration domain point
• Time value sets need to be separated → scattering functions

• Time part used for separation and ordering (PolyLib computations, 2^dim [Wilde93])
• Domain part determines the values spanned by time dimensions
• Quilleré separation phase [Quilleré00, Bastoul04]

[Figure: layout of the scattering representation, with a time part (time iterators) and a domain part (domain iterators)]

Page 12

Quilleré separation phase

Page 13

Separation Principles

Consider statements with domain and schedule functions such that:
• S1 has time dimensions (t1, t2, t3) spanning [2,5]x[2,7]x[5,8]
• S2 has time dimensions (t1, t2) spanning [0,3]x[5,7]
• S3 has time dimensions (t1, t2) spanning [-2,6]x[5,9]

Considering t1: S1 spans [2,5] and S2 spans [0,3]

Polyhedral intersection / difference (2^dim) yields the pieces [0,1] [2,3] [4,5]

[Figure: worklist and remaining statements]

Page 14

Separation Principles

Considering t1: the pieces [0,1] [2,3] [4,5] obtained so far are separated against the range [-2,6] of S3

Polyhedral intersection / difference (2^dim); 3^nb_stmt worst-case complexity

[Figure: worklist, remaining and kernel; pieces shown include [-2,-1], [0,1] and [2,3]]

Page 15

Separation Principles

• That was for the first time dimension
• Recursively for all time dimensions
• Result is a syntax tree of the generated loops

Considering t1:
for(t1=-2;t1<=-1;t1++)
  for(t2=5;t2<=9;t2++)
    S3(…)
for(t1=0;t1<=1;t1++) {
  for(t2=5;t2<=7;t2++) {
    S2(…)
    S3(…)
  }
  for(t2=8;t2<=9;t2++)
    S3(…)
}
...
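For a single time dimension, separation reduces to cutting integer intervals into disjoint pieces and noting which statements are active in each piece. The sketch below does this with sorted cut points rather than with the polyhedral intersection and difference operations used on real scattering polyhedra, so it is a simplified stand-in for the Quilleré separation step; `Interval`, `Piece` and `separate_1d` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int lo, hi; } Interval;                 /* inclusive bounds */
typedef struct { int lo, hi; unsigned stmts; } Piece;    /* bit k set: statement k active */

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Splits n intervals (n <= 16) into disjoint pieces, in increasing order.
 * out must have room for 2*n - 1 pieces. Returns the number of pieces. */
int separate_1d(const Interval *iv, int n, Piece *out) {
    int cuts[32];
    assert(n > 0 && n <= 16);
    for (int k = 0; k < n; k++) {
        cuts[2 * k] = iv[k].lo;          /* a piece may start here */
        cuts[2 * k + 1] = iv[k].hi + 1;  /* and the next one right after hi */
    }
    qsort(cuts, (size_t)(2 * n), sizeof cuts[0], cmp_int);

    int np = 0;
    for (int c = 0; c + 1 < 2 * n; c++) {
        int lo = cuts[c], hi = cuts[c + 1] - 1;
        if (lo > hi) continue;           /* duplicate cut point */
        unsigned mask = 0;
        for (int k = 0; k < n; k++)
            if (iv[k].lo <= lo && hi <= iv[k].hi)
                mask |= 1u << k;
        if (mask != 0)                   /* skip gaps covered by no statement */
            out[np++] = (Piece){ lo, hi, mask };
    }
    return np;
}
```

With the t1 ranges of the slide, [2,5] for S1, [0,3] for S2 and [-2,6] for S3, this produces [-2,-1] (S3), [0,1] (S2, S3), [2,3] (S1, S2, S3), [4,5] (S1, S3) and [6,6] (S3). Each piece would then be processed recursively on t2.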

Page 16

Contributions

Real World Issues
• Problems provided by different sources (academia, industry, SPECFP2000)
• Exhibit different challenging issues

Code Generation Speed
• Node fusion (exploiting transformations’ “locality”)
• Exploiting scalar dimensions (replacing exponential computations with trivial ones)
• Domain iterator mapping improvement (replacing exponential by matrix inversions)

Code Quality
• Faster If-Hoisting yielding much smaller code (conditional factorization)
• Modulo Conditional removal by strip-mining (stride issue) (detailed)

State of the art polyhedral code generator: CLooG [Bastoul04]
ALL PERFORMANCE COMPARISONS WILL BE CLooG vs URGenT

Page 17

Generation Speed Improvements

Page 18

Generation Speed – Node Fusion

• Multidimensional schedules allow expression of non-affine (polynomial) quantities as affine ones with more dimensions → improved flexibility
• Drawback → pressure on code generation (height of the tree)
• Add parameters → add dimensions (polyhedral operation complexity)

HOWEVER

• Loop-level transformations affect blocks of statements (tiling, interchange…)
• Polyhedron inclusion check is NOT exponential

Before each separation phase, fuse consecutive nodes with equal scattering polyhedra.

Page 19

Generation Speed – Scalar Dimensions

• Some multidimensional schedules have scalar dimensions (UTF, URUK [ICS05])
• Scalar dimensions express strict statement interleaving

• Comparison of integers, no need for polyhedral separation
• Syntactic tree height reduction (potentially half the height)
• Marginal overhead for detection and computation
• Combines well with Node Fusion

Page 20

Generation Speed – Domain Iterator Regen.

• Generation of sequential loops for non-invertible schedules (wavefronts)
• CLooG [Bas04] handles it with polyhedral projection on domain iterators
• Drawback → adds dimensions (polyhedral operation complexity)
• Drawback → additional polyhedral computations on each leaf

Use transformation invertibility (ideally, given the rank, mix of projections and invertibility)

[Figure: ST after Quilleré separation phase (3^nb_stmts)]

Page 21

Code Quality Improvements

Page 22

Code Quality – If Hoisting

• Quilleré separation phase leaves conditionals on triangular loops
• Need of the so-called backtracking phase → too aggressive (code bloat)
• Potentially tremendous amount of useless work

[Figure: Backtracking illustration (labels: code bloat, useless work) and If-Hoisting illustration, both on a for t1 / for t2 / for t3 nest with the conditions t1 <= 4, 5 <= t1 <= 10 and 11 <= t1]

• Smaller code
• No useless work (simplification IS needed)
• Explains the generation speedup on dreamupT3

Page 23

Code Quality – If Hoisting

• Previous example doesn’t take place in real life (just an illustration)

Matrix Mult. with URUK:
• strip-mine by factor 4 (x3)
• interchange loops (x2)
• unroll

Backtrack + 50%

Page 24

Removing Modulos – Domain Iterator Regen.

Problem since 91: [Irigoin91], [Pingali92], [Ramanujam95], [Xue96], [Griebl98] and others …

• Let Θ be the transformation function for a statement
• Suppose Θ is invertible, and let D be the matrix of denominators of Θ^-1
• Let the Inverse Scatter Matrix be ISM = [D·Θ^-1 | -D]

• Inverse Scatter Matrix expresses domain iterators from time iterators
• D ensures all coefficients are integral
• Replaces leaf polyhedral projections by matrix inversions

[Figure: ISM column layout, time iterators followed by domain iterators]

Substitute for usual Hermite Normal Form in stride computations
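To make the construction concrete, here is a small sketch that builds an inverse scatter matrix for a 2x2 integer transformation. It assumes the form [D·Θ^-1 | -D], with D the diagonal matrix of row denominators of the inverse, which is inferred from the numeric example on the next slide (transformation [2 3; 1 0], result [0 1 -1 0; 1 -2 0 -3]); the function name `inverse_scatter_2x2` and the fixed 2x2 size are choices made for this illustration only.

```c
#include <assert.h>
#include <stdlib.h>

/* Greatest common divisor of |a| and |b|; gcd(0, x) == |x|. */
static int gcd(int a, int b) {
    a = abs(a);
    b = abs(b);
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* theta: 2x2 integer transformation (time = theta * domain).
 * ism:   2x4 result; columns 0-1 multiply the time iterators,
 *        columns 2-3 multiply the domain iterators, each row equals 0.
 * Returns 0 if theta is singular, 1 otherwise. */
int inverse_scatter_2x2(int theta[2][2], int ism[2][4]) {
    int det = theta[0][0] * theta[1][1] - theta[0][1] * theta[1][0];
    if (det == 0)
        return 0;

    /* theta^-1 = adj / det */
    int adj[2][2] = { {  theta[1][1], -theta[0][1] },
                      { -theta[1][0],  theta[0][0] } };
    int sign = det > 0 ? 1 : -1;

    for (int r = 0; r < 2; r++) {
        /* Reduce row r of adj/det to lowest terms; den is its denominator. */
        int g = gcd(det, gcd(adj[r][0], adj[r][1]));
        int den = abs(det) / g;
        ism[r][0] = sign * adj[r][0] / g;    /* den * (theta^-1)[r][0] */
        ism[r][1] = sign * adj[r][1] / g;    /* den * (theta^-1)[r][1] */
        ism[r][2] = 0;
        ism[r][3] = 0;
        ism[r][2 + r] = -den;                /* the -D part */
    }
    return 1;
}
```

For the transformation [2 3; 1 0] the rows come out as (0, 1, -1, 0) and (1, -2, 0, -3), that is t2 - i = 0 and t1 - 2*t2 - 3*j = 0, with only integer coefficients.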

Page 25

Removing Modulos - Inverse Scatter Matrix

• Consider S with 2 domain iterators, Θ = [2 3; 1 0] and Θ^-1 = [0 1; 1/3 -2/3]
• We have D = [1 0; 0 3] and ISM = [0 1 -1 0; 1 -2 0 -3]

ISM columns: time iterators (t1, t2), then domain iterators (i, j). All coefficients are INTEGRAL.
Meaning: i = t2, 3*j = t1-2*t2

Original code:
for(i=1;i<=M;i++)
  for(j=1;j<=N;j++)
    S(i,j)

Generated code with Θ = [2 3; 1 0]:
for(t1=5;t1<=2*M+3*N;t1++)
  for(t2=1;t2<=min(M,t1/2);t2++)
    if((t1-2*t2)%3 == 0)
      S(i=t2,j=(t1-2*t2)/3)

OUCH !!!

SM & unroll t2 by (3 / gcd(2,3)):
for(t1=5;t1<=2*M+3*N;t1++)
  for(t2=?;t2<=?;t2++)
    if(t1%3 == 0) S(i=t2,j=t1/3-k)          (t2 = 3k)
    if(t1%3 == 2) S(i=t2,j=(t1-2)/3-k)      (t2 = 3k+1)
    if(t1%3 == 1) S(i=t2,j=(t1-1)/3-k-1)    (t2 = 3k+2)

SM & unroll t1 by (3 / gcd(1,3)):
for(t1=?;t1<=?;t1++)
  for(t2=?;t2<=?;t2++)
    S(i=t2,j=l-k)      (t1 = 3l)
    S(i=t2,j=l-k-1)    (t1 = 3l+1)
    S(i=t2,j=l-k-2)    (t1 = 3l+2)
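The effect of the two strip-mine and unroll steps can be checked on the schedule t1 = 2*i + 3*j, t2 = i. The sketch below scans the domain once with the modulo guard and once with three guard-free copies (t1 = 3l, 3l+1, 3l+2, each paired with the matching residue of t2), then compares the two execution traces. Everything here is an illustration under stated assumptions: the loop bounds, which the slide leaves as "?", are derived by hand from 1 <= i <= M and 1 <= j <= N, and the j expressions are worked out from 3*j = t1 - 2*t2 with t2 = 3k + c, which gives offsets in 2*k rather than the k written on the slide. `Trace`, `scan_guarded`, `scan_modulo_free` and `same_trace` are hypothetical names.

```c
#include <assert.h>

enum { MAXT = 4096 };
typedef struct { int i, j; } Inst;
typedef struct { Inst at[MAXT]; int len; } Trace;

static int imin(int a, int b) { return a < b ? a : b; }
static int imax(int a, int b) { return a > b ? a : b; }
/* Floor and ceiling of a / b for b > 0, valid for negative a too. */
static int floord(int a, int b) { return a >= 0 ? a / b : -((-a + b - 1) / b); }
static int ceild(int a, int b) { return -floord(-a, b); }

/* Hypothetical statement body: appends the instance to the trace. */
static void S(Trace *tr, int i, int j) {
    assert(tr->len < MAXT);
    tr->at[tr->len++] = (Inst){ i, j };
}

/* Reference scan: one modulo guard per (t1, t2) point. */
void scan_guarded(Trace *tr, int M, int N) {
    tr->len = 0;
    for (int t1 = 5; t1 <= 2 * M + 3 * N; t1++)
        for (int t2 = 1; t2 <= imin(M, (t1 - 3) / 2); t2++) {
            int r = t1 - 2 * t2;            /* r = 3*j, and r >= 3 here */
            if (r % 3 == 0 && r / 3 <= N)
                S(tr, t2, r / 3);
        }
}

/* One unrolled copy: t2 = 3*k + c2 and j = l - 2*k + off, no modulo. */
static void unrolled_copy(Trace *tr, int M, int N, int l, int c2, int off) {
    int klo = imax(ceild(1 - c2, 3), ceild(l + off - N, 2));   /* i >= 1, j <= N */
    int khi = imin(floord(M - c2, 3), floord(l + off - 1, 2)); /* i <= M, j >= 1 */
    for (int k = klo; k <= khi; k++)
        S(tr, 3 * k + c2, l - 2 * k + off);
}

/* Scan after strip-mining t1 and t2 by 3 and unrolling both. */
void scan_modulo_free(Trace *tr, int M, int N) {
    tr->len = 0;
    for (int l = 1; 3 * l <= 2 * M + 3 * N; l++) {
        unrolled_copy(tr, M, N, l, 0, 0);    /* t1 = 3l,   t2 = 3k   */
        unrolled_copy(tr, M, N, l, 2, -1);   /* t1 = 3l+1, t2 = 3k+2 */
        unrolled_copy(tr, M, N, l, 1, 0);    /* t1 = 3l+2, t2 = 3k+1 */
    }
}

/* Returns 1 when both traces list the same instances in the same order. */
int same_trace(const Trace *a, const Trace *b) {
    if (a->len != b->len)
        return 0;
    for (int n = 0; n < a->len; n++)
        if (a->at[n].i != b->at[n].i || a->at[n].j != b->at[n].j)
            return 0;
    return 1;
}
```

Both scans visit all M*N instances in the same order. In this sketch the bounds absorb the partial strips through floor and ceiling divisions; a generator that splits prologue, kernel and epilogue would instead peel those partial strips into separate loops.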

Page 26

Removing Modulos – There is a CATCH

• Previous example flowed nicely → what about the loops’ bounds ???
• “Issue” (feature) with our SM + unroll transformation (strip-mine NOT strided)
• Modulos are indeed removed from the kernels only

Original loop:
for(i=M;i<=N;i++)
  S(i,j)

V.S. strip-mined loop:
for(i=M;i<=N;i+=2)
  for(ii=i;ii<=min(i+1,N);ii++)
    S(ii,j)

[Figure: prologue (P), kernel (K) and epilogue (E) regions; labels: code size, transformation quality issue]

HOWEVER:
• P and E have marginal execution time when SM factor is “decent”
• PROLOGUE gives us ALIGNMENT on %2 (strip-mine factor) !!!
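The prologue, kernel and epilogue split can be sketched for a strip-mine factor of 2 as follows. This is a hand-written illustration of the idea, not generated code: the function `strip_mined_scan`, the output array and the `kernel_start` out-parameter are hypothetical, and the statement is reduced to recording its index.

```c
#include <assert.h>

enum { F = 2 };   /* strip-mine factor */

/* Visits i = M..N in order, writing each visited index to out.
 * *kernel_start receives the first index handled by the kernel.
 * Returns the number of indices written (out needs N - M + 1 slots). */
int strip_mined_scan(int M, int N, int *out, int *kernel_start) {
    int n = 0;
    int i = M;

    /* Prologue: peel iterations until i is aligned on the factor. */
    for (; i <= N && i % F != 0; i++)
        out[n++] = i;
    *kernel_start = i;

    /* Kernel: full strips only, so no min() bound and no modulo guard;
     * the strip body is unrolled by hand for F == 2. */
    for (; i + F - 1 <= N; i += F) {
        out[n++] = i;
        out[n++] = i + 1;
    }

    /* Epilogue: at most F - 1 leftover iterations. */
    for (; i <= N; i++)
        out[n++] = i;
    return n;
}
```

For M = 3 and N = 10 the prologue handles i = 3, the kernel handles the strips starting at 4, 6 and 8, and the epilogue handles i = 10. Every kernel strip starts on a multiple of the factor, which is the alignment property noted on the slide.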

Page 27

Removing Modulos – Hermite Normal Form

• Our solution unrolls modulo guards from kernels after strip-mining
• Hermite Normal Form: mathematical decomposition of Θ = U.H
• Where U is unimodular (skewing matrix)
• H is diagonal (stride in transformed space → diagonal coefficients)
• Suppresses the need for internal modulo guards

BUT
• If U is not the same, skewings are different
• Deal with non-parallel lattices … how ?
• In practice, used for 1 statement or “simple” examples

All statements need to have the same transformation → TOO RESTRICTIVE

Page 28

Experimental Results

Page 29

Putting it all Together – Code Size Experiments

State of the art polyhedral code generator CLooG [Bas04]
PERFORMANCE COMPARISONS: CLooG04 vs CLooG06

[Table: code size results, columns CL04, CL06, Improv.]

Page 30

Generation Speed – Experiments

We compare the original CLooG (CL04) from the [Bastoul04] PACT paper with our optimized CLooG (CL06)

[Figure: Node Fusion. Bar chart, y-axis: % of CL04 (0 to 100); benchmarks: DreamupT3, Classen, QR, Swim; series: T Orig, T Scalar, M Orig, M Scalar]

[Figure: Scalar Dimensions. Bar chart, y-axis: % of CL04 (0 to 100); benchmarks: DreamupT3, Classen (3/7), QR (4/7), Swim (6/11); series: T Orig, T Scalar, M Orig, M Scalar]

Domain Iterators, Swim:
• 36% time reduction
• 58% memory reduction

Page 31

Putting it all Together – Code Generation Speed Experiments

State of the art polyhedral code generator CLooG [Bas04]
PERFORMANCE COMPARISONS: CLooG vs URGenT

[Table: generation speed results, columns CL, UR, CL, UR]

• Affine Schedule: 412 2267 lines (40% execution speedup wrt best peak)
• PathScale -Ofast needs ~22s to process the AST (LNO OFF)

Page 32

Conclusion / Future Works

• Implemented as the Code Generation phase of the URUK framework [ICS05]
• Generation speed goal achieved (up to 56x, stands the PathScale comparison)
• Greatly improved code size with improved if-hoisting technique (up to 5.8x)
• Modulo conditionals are removed (from kernel) → mix with HNF
• Still room for speeding up generation (caches, memory pools, parallelization)
• Focus on code-generation-friendly transformations

Page 33

Thank you !!!
www.cloog.org for full presentation & more