Compiler-directed Synthesis of Programmable Loop Accelerators


Page 1: Compiler-directed Synthesis of Programmable Loop Accelerators

University of Michigan, Electrical Engineering and Computer Science

Compiler-directed Synthesis of Programmable Loop Accelerators

Kevin Fan, Hyunchul Park, Scott Mahlke
September 25, 2004
EDCEP Workshop

Page 2: Compiler-directed Synthesis of Programmable Loop Accelerators


Loop Accelerators

• Hardware implementation of a critical loop nest
  – Hardwired state machine
  – Digital camera application – 1000x vs. Pentium III
  – Multiple accelerators hooked up in a pipeline

• Loop accelerator vs. customized processor
  – 1 block of code vs. multiple blocks
  – Trivial control flow vs. handling generic branches
  – Traditionally state machine vs. instruction driven

Page 3: Compiler-directed Synthesis of Programmable Loop Accelerators


Programmable Loop Accelerators

• Goals
  – Multifunction accelerators – accelerator hardware can handle multiple loops (re-use)
  – Post-programmable – to a degree, allow changes to the application
  – Use the compiler as the architecture synthesis tool

• But …
  – Don't build a customized processor
  – Maintain ASIC-level efficiency

Page 4: Compiler-directed Synthesis of Programmable Loop Accelerators


NPA (Nonprogrammable Accelerator) Synthesis in PICO

[Figure: a sequential loop nest and a performance requirement are synthesized into a systolic array (datapath plus controller) that sits behind a coprocessor interface on the external bus, exchanging data in/out and command/timing/done signals. A second diagram, "Systolic Processor Datapath", shows load units for yii, wjj, and xii-jj feeding an adder and a store unit through registers t1–t3, Xr-1, and Yr-1.]

Page 5: Compiler-directed Synthesis of Programmable Loop Accelerators


PICO Frontend

Input loop nest:

  for i = 1 to ni
    for j = 1 to nj
      y[i] += w[j] * x[i+j]

Transformed loop nest:

  for jt = 1 to 100 step 10
    for t = 0 to 502
      for p = 0 to 1
        (i,j) = function of (t,p)
        if (i>1) W[t][p] = W[t-5][p] else w[jt+j]
        if (i>1 && j<bj) X[t][p] = X[t-4][p+1] else x[i+jt+j]
        Y[t][p] += W[t][p] * X[t][p]

• Goals
  – Exploit loop-level parallelism
  – Map loop to abstract hardware
  – Manage global memory BW

• Steps
  – Tiling (see the sketch below)
  – Load/store elimination
  – Iteration mapping
  – Iteration scheduling
  – Virtual processor clustering
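The tiling step is the easiest to picture concretely. Below is a minimal C sketch of tiling the inner loop of the FIR-style kernel above; the tile size of 10 (matching the "step 10" loop on the slide) and the bounds NI/NJ are illustrative assumptions, and the later steps (load/store elimination, mapping onto the (t,p) time/processor space, clustering) are not shown.

  /* Minimal sketch of the tiling step on the FIR-style loop nest above.
   * Tile size and bounds are assumptions for illustration only. */
  #define NI   100
  #define NJ   100
  #define TILE 10                                  /* assumed tile size */

  void fir_tiled(int y[NI], const int w[NJ], const int x[NI + NJ])
  {
      for (int jt = 0; jt < NJ; jt += TILE)        /* tile the j loop   */
          for (int i = 0; i < NI; i++)             /* original i loop   */
              for (int j = jt; j < jt + TILE; j++) /* intra-tile j loop */
                  y[i] += w[j] * x[i + j];
  }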

Page 6: Compiler-directed Synthesis of Programmable Loop Accelerators


PICO Backend

[Figure: operation graph for the kernel – select operations (FSelectT/TSelectF), loads, an add, a store, and copies.]

• Resource allocation (II, operation graph)
• Synthesize a machine description for a "fake" fully connected processor with the allocated resources

Page 7: Compiler-directed Synthesis of Programmable Loop Accelerators


Reduced VLIW Processor after Modulo Scheduling

[Figure: the scheduled operation graph (selects, loads, add, store) bound to a reduced VLIW datapath, with registers t1–t3, Xr-1, and Yr-1 carrying yii, wjj, and xii-jj.]

Page 8: Compiler-directed Synthesis of Programmable Loop Accelerators


Data/control-path Synthesis NPA

[Figure: the synthesized NPA datapath – load units for yii, wjj, and xii-jj feeding an adder and a store unit through registers t1–t3, Xr-1, and Yr-1.]

Page 9: Compiler-directed Synthesis of Programmable Loop Accelerators


PICO Methodology – Why it Works

• Systematic design methodology
  1. Parameterized meta-architecture – all NPAs have the same general organization
  2. Performance/throughput is an input
  3. Abstract architecture – we know how to build compilers for this
  4. Mapping mechanism – determine architecture specifics from the schedule for the abstract architecture

Page 10: Compiler-directed Synthesis of Programmable Loop Accelerators


Direct Generalization of PICO?

[Figure: the same operation graph (selects, loads, add, store, copies) mapped onto hardware.]

• Programmability would require full interconnect between elements
• Back to the meta-architecture!

• Generalize connectivity to enable post-programmability
• But stylize it

Page 11: Compiler-directed Synthesis of Programmable Loop Accelerators


Programmable Loop Accelerator – Design Strategy

• Compile for a partially defined architecture
  – Build long-distance communication into the schedule
  – Limit global communication bandwidth

• Proposed meta-architecture
  – Multi-cluster VLIW
    • Explicit inter-cluster transfers (varying latency/BW)
    • Intra-cluster communication is complete
  – Hardware partially defined – expensive units

Page 12: Compiler-directed Synthesis of Programmable Loop Accelerators


Programmable Loop Accelerator Schema

[Figure: accelerator schema – an accelerator datapath of FUs with shift registers and intra-cluster communication, plus a MEM unit and an inter-cluster register file, driven by a control unit; stream units move data between the datapath and SRAM/DRAM through stream buffers, and multiple accelerators can be composed into a pipeline of tiled or clustered accelerators.]

Page 13: Compiler-directed Synthesis of Programmable Loop Accelerators


Flow Diagram

FU Allocation → Partitioning → Modulo Scheduling → Loop Accelerator

Parameters determined along the flow: number of clusters, number of expensive FUs, number of cheap FUs, assignment of FUs to clusters, assembly code and II, shift register depth/width/porting, and inter-cluster bandwidth.
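Taken together, the three stages fill in the free parameters of the meta-architecture. A rough C sketch of such a parameter record is shown below; the field names and fixed array bounds are illustrative assumptions, not the actual PICO/CCCP data structures.

  /* Sketch of the parameters the flow above has to pin down: clusters,
   * FU mix, shift register sizing, and inter-cluster bandwidth.
   * All names/bounds are assumptions for illustration. */
  enum fu_kind { FU_ADD, FU_LOGIC, FU_SHIFT, FU_MPY, FU_DIV, FU_MEM, FU_BRANCH };

  struct cluster_params {
      int          num_fus;          /* FUs assigned to this cluster (e.g. 4) */
      enum fu_kind fu_kinds[8];      /* kind of each FU                       */
      int          sr_depth[8];      /* shift register depth per FU           */
      int          sr_width[8];      /* shift register width in bits          */
      int          sr_read_ports[8]; /* read ports per shift register         */
  };

  struct accel_params {
      int                   ii;              /* initiation interval           */
      int                   num_clusters;
      struct cluster_params clusters[8];
      int                   intercluster_bw; /* inter-cluster moves per cycle */
  };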

Page 14: Compiler-directed Synthesis of Programmable Loop Accelerators


Sobel Kernel

  for (i = 0; i < N1; i++) {
    for (j = 0; j < N2; j++) {
      int t00, t01, t02, t10, t12, t20, t21, t22;
      int e1, e2, e12, e22, e, tmp;

      t00 = x[i  ][j  ]; t01 = x[i  ][j+1]; t02 = x[i  ][j+2];
      t10 = x[i+1][j  ];                    t12 = x[i+1][j+2];
      t20 = x[i+2][j  ]; t21 = x[i+2][j+1]; t22 = x[i+2][j+2];

      e1 = ((t00 + t01) + (t01 + t02)) - ((t20 + t21) + (t21 + t22));
      e2 = ((t00 + t10) + (t10 + t20)) - ((t02 + t12) + (t12 + t22));

      e12 = e1*e1;
      e22 = e2*e2;
      e = e12 + e22;
      if (e > threshold) tmp = 1; else tmp = 0;
      edge[i][j] = tmp;
    }
  }

Page 15: Compiler-directed Synthesis of Programmable Loop Accelerators


FU Allocation

• Determine the number of clusters:

    #clusters = ceil( #ops / (4 × II) )

• Determine the number of expensive FUs (MPY, DIV, memory):

    #FUs of a type = ceil( #ops of that type / II )

• Sobel with II = 4
  – 41 ops → 3 clusters
  – 2 MPY ops → 1 multiplier
  – 9 memory ops → 3 memory units
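The Sobel numbers are consistent with simple ceiling formulas, sketched below in C; the 4 × II per-cluster operation budget comes from the load-balance rule on the partitioning slide, and the helper names are my own.

  #include <stdio.h>

  /* Allocation rules as ceiling formulas implied by the Sobel numbers
   * (41 ops, II=4 -> 3 clusters; 2 MPYs -> 1 multiplier; 9 mem ops -> 3
   * memory units).  Treat as an illustration of the slide's arithmetic. */
  static int ceil_div(int a, int b) { return (a + b - 1) / b; }

  static int num_clusters(int total_ops, int ii)   { return ceil_div(total_ops, 4 * ii); }
  static int num_fus_of_type(int type_ops, int ii) { return ceil_div(type_ops, ii); }

  int main(void)
  {
      int ii = 4;                                            /* Sobel design */
      printf("clusters:     %d\n", num_clusters(41, ii));    /* 41 ops -> 3  */
      printf("multipliers:  %d\n", num_fus_of_type(2, ii));  /* 2 MPYs -> 1  */
      printf("memory units: %d\n", num_fus_of_type(9, ii));  /* 9 mem  -> 3  */
      return 0;
  }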

Page 16: Compiler-directed Synthesis of Programmable Loop Accelerators


Partitioning

• Multi-level approach consists of two phases
  – Coarsening
  – Refinement

• Minimize inter-cluster communication
• Load balance
  – Max of 4 × II operations per cluster

• Take FU allocation into account
  – Restricted number of expensive units
  – Number of cheap units (ADD, logic) determined from the partition

Page 17: Compiler-directed Synthesis of Programmable Loop Accelerators


Coarsening

• Group highly related operations together
  – Pair operations together at each step (a sketch of one pairing pass follows)
  – Forces the partitioner to consider several operations as a single unit
• Coarsening a Sobel subgraph into 2 groups:

[Figure: a Sobel subgraph of loads (L) and adds (+) shown at successive coarsening steps, ending in two groups.]
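One plausible way to implement a single coarsening pass is greedy heavy-edge pairing, sketched below in C; the slides only say that operations are paired at each step, so the exact pairing policy here is an assumption.

  /* One coarsening pass as greedy heavy-edge pairing: every unmatched
   * operation is grouped with the unmatched neighbor it shares the most
   * data edges with.  weight[a][b] = number of edges between ops a and b. */
  #define MAX_OPS 64

  void coarsen_once(int n_ops, const int weight[MAX_OPS][MAX_OPS],
                    int group_of[MAX_OPS])
  {
      int matched[MAX_OPS] = {0};
      int next_group = 0;

      for (int a = 0; a < n_ops; a++) {
          if (matched[a])
              continue;
          int best = -1;                       /* heaviest unmatched neighbor */
          for (int b = 0; b < n_ops; b++)
              if (b != a && !matched[b] && weight[a][b] > 0 &&
                  (best < 0 || weight[a][b] > weight[a][best]))
                  best = b;

          group_of[a] = next_group;            /* a starts a new group        */
          matched[a]  = 1;
          if (best >= 0) {                     /* ... and absorbs its partner */
              group_of[best] = next_group;
              matched[best]  = 1;
          }
          next_group++;
      }
  }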

Page 18: Compiler-directed Synthesis of Programmable Loop Accelerators


Refinement

• Move operations between clusters
• Good moves:
  – Reduce inter-cluster communication
  – Improve load balance
  – Reduce hardware cost
    • Reduce the number of expensive units to meet the limit
    • Collect similar-bitwidth operations together

[Figure: a load/add subgraph split across two clusters, with a candidate operation move marked "?".]

Page 19: Compiler-directed Synthesis of Programmable Loop Accelerators


Partitioning Example

• From sobel, II = 4
• Place MPYs together
• Place each tree of ADD-LOAD-ADDs together
• Cuts 6 edges

Page 20: Compiler-directed Synthesis of Programmable Loop Accelerators


Modulo Scheduling

• Determines shift register width, depth, and number of read ports

• Sobel II=4

[Figure: a small Sobel subgraph of loads and adds and its modulo schedule at II = 4 on FU0–FU3, with a table listing, per FU and issue cycle, the maximum result lifetime and the required shift register depth and read ports.]
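A hedged C sketch of how such a table could be turned into shift register requirements for one FU: the depth covers the longest result lifetime, and the ports cover the largest number of same-cycle readers. The sizing rules are inferred from the slide's table, so treat the helper below as an illustration only.

  /* Size the shift register behind one FU from its modulo schedule.
   * Each result records when it is defined, when it is last read, and
   * how many consumers read it in its busiest cycle. */
  struct result {
      int def_cycle;            /* cycle the value is produced            */
      int last_use_cycle;       /* last cycle any consumer reads it       */
      int max_same_cycle_reads; /* most consumers reading it in one cycle */
  };

  void size_shift_register(int n, const struct result r[],
                           int *depth, int *ports)
  {
      *depth = 1;
      *ports = 1;
      for (int i = 0; i < n; i++) {
          int lifetime = r[i].last_use_cycle - r[i].def_cycle;
          if (lifetime > *depth)
              *depth = lifetime;                 /* cover longest lifetime  */
          if (r[i].max_same_cycle_reads > *ports)
              *ports = r[i].max_same_cycle_reads; /* cover same-cycle reads */
      }
  }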

Page 21: Compiler-directed Synthesis of Programmable Loop Accelerators


Test Cases

• Sobel and fsed kernels, II = 4 designs
• Each machine has 4 clusters with 4 FUs per cluster

[Figure: FU mix of the sobel and fsed machines – four clusters of four FUs each, drawn from memory (M), branch (B), add/subtract (+/-), shift (<<), multiply (*), and logic (&) units; the two designs differ in how these units are distributed across clusters.]

Page 22: Compiler-directed Synthesis of Programmable Loop Accelerators


Cross Compile Results

• Computation is localized
  – sobel: 1.5 moves/cycle
  – fsed: 1 move/cycle

• Cross compile
  – Can still achieve II = 4
  – More inter-cluster communication
  – May require more units
  – sobel on the fsed machine: ~2 moves/cycle
  – fsed on the sobel machine: ~3 moves/cycle

Page 23: Compiler-directed Synthesis of Programmable Loop Accelerators


Concluding Remarks

• Programmable loop accelerator design strategy
  – Meta-architecture with stylized interconnect
  – Systematic compiler-directed design flow

• Costs of programmability
  – Interconnect, inter-cluster communication
  – Control – "micro-instructions" are necessary

• Just scratching the surface of this work
• For more, see the CCCP group webpage: http://cccp.eecs.umich.edu