Compiler-directed Synthesis of Programmable Loop Accelerators


Page 1: Compiler-directed Synthesis of Programmable Loop Accelerators

University of Michigan, Electrical Engineering and Computer Science

Compiler-directed Synthesis of Programmable Loop Accelerators

Kevin Fan, Hyunchul Park, Scott Mahlke
September 25, 2004
EDCEP Workshop

Page 2: Compiler-directed Synthesis of Programmable Loop Accelerators


Loop Accelerators

• Hardware implementation of a critical loop nest
  – Hardwired state machine
  – Digital camera application – 1000x vs. Pentium III
  – Multiple accelerators hooked up in a pipeline

• Loop accelerator vs. customized processor
  – 1 block of code vs. multiple blocks
  – Trivial control flow vs. handling generic branches
  – Traditionally state machine vs. instruction driven

Page 3: Compiler-directed Synthesis of Programmable Loop Accelerators


Programmable Loop Accelerators

• Goals
  – Multifunction accelerators – accelerator hardware can handle multiple loops (re-use)
  – Post-programmable – to a degree, allow changes to the application
  – Use the compiler as the architecture synthesis tool

• But …
  – Don't build a customized processor
  – Maintain ASIC-level efficiency

Page 4: Compiler-directed Synthesis of Programmable Loop Accelerators


NPA (Nonprogrammable Accelerator) Synthesis in PICO

[Figure: a sequential loop nest and a performance requirement are synthesized into a systolic array (datapath plus controller) that sits behind a coprocessor interface on the external bus, exchanging data in/out and command/timing/done signals. A second diagram, "Systolic Processor Datapath", shows load units for yii, wjj, and xii-jj feeding an adder and a store unit through registers t1–t3, Xr-1, and Yr-1.]

Page 5: Compiler-directed Synthesis of Programmable Loop Accelerators


PICO Frontend

Input loop nest:

  for i = 1 to ni
    for j = 1 to nj
      y[i] += w[j] * x[i+j]

Transformed loop nest:

  for jt = 1 to 100 step 10
    for t = 0 to 502
      for p = 0 to 1
        (i,j) = function of (t,p)
        if (i>1) W[t][p] = W[t-5][p] else w[jt+j]
        if (i>1 && j<bj) X[t][p] = X[t-4][p+1] else x[i+jt+j]
        Y[t][p] += W[t][p] * X[t][p]

• Goals
  – Exploit loop-level parallelism
  – Map loop to abstract hardware
  – Manage global memory BW

• Steps
  – Tiling (see the sketch below)
  – Load/store elimination
  – Iteration mapping
  – Iteration scheduling
  – Virtual processor clustering
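The tiling step is the easiest to picture concretely. Below is a minimal C sketch of tiling the inner loop of the FIR-style kernel above; the tile size of 10 (matching the "step 10" loop on the slide) and the bounds NI/NJ are illustrative assumptions, and the later steps (load/store elimination, mapping onto the (t,p) time/processor space, clustering) are not shown.

  /* Minimal sketch of the tiling step on the FIR-style loop nest above.
   * Tile size and bounds are assumptions for illustration only. */
  #define NI   100
  #define NJ   100
  #define TILE 10                                  /* assumed tile size */

  void fir_tiled(int y[NI], const int w[NJ], const int x[NI + NJ])
  {
      for (int jt = 0; jt < NJ; jt += TILE)        /* tile the j loop   */
          for (int i = 0; i < NI; i++)             /* original i loop   */
              for (int j = jt; j < jt + TILE; j++) /* intra-tile j loop */
                  y[i] += w[j] * x[i + j];
  }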

Page 6: Compiler-directed Synthesis of Programmable Loop Accelerators


PICO Backend

[Figure: operation graph for the kernel – select operations (FSelectT/TSelectF), loads, an add, a store, and copies.]

• Resource allocation (II, operation graph)
• Synthesize a machine description for a "fake" fully connected processor with the allocated resources

Page 7: Compiler-directed Synthesis of Programmable Loop Accelerators


Reduced VLIW Processor after Modulo Scheduling

[Figure: the scheduled operation graph (selects, loads, add, store) bound to a reduced VLIW datapath, with registers t1–t3, Xr-1, and Yr-1 carrying yii, wjj, and xii-jj.]

Page 8: Compiler-directed Synthesis of Programmable Loop Accelerators


Data/control-path Synthesis NPA

[Figure: the synthesized NPA datapath – load units for yii, wjj, and xii-jj feeding an adder and a store unit through registers t1–t3, Xr-1, and Yr-1.]

Page 9: Compiler-directed Synthesis of Programmable Loop Accelerators


PICO Methodology – Why it Works

• Systematic design methodology
  1. Parameterized meta-architecture – all NPAs have the same general organization
  2. Performance/throughput is an input
  3. Abstract architecture – we know how to build compilers for this
  4. Mapping mechanism – determine architecture specifics from the schedule for the abstract architecture

Page 10: Compiler-directed Synthesis of Programmable Loop Accelerators


Direct Generalization of PICO?

[Figure: the same operation graph (selects, loads, add, store, copies) mapped onto hardware.]

• Programmability would require full interconnect between elements
• Back to the meta-architecture!

• Generalize connectivity to enable post-programmability
• But stylize it

Page 11: Compiler-directed Synthesis of Programmable Loop Accelerators


Programmable Loop Accelerator – Design Strategy

• Compile for a partially defined architecture
  – Build long-distance communication into the schedule
  – Limit global communication bandwidth

• Proposed meta-architecture
  – Multi-cluster VLIW
    • Explicit inter-cluster transfers (varying latency/BW)
    • Intra-cluster communication is complete
  – Hardware partially defined – expensive units

Page 12: Compiler-directed Synthesis of Programmable Loop Accelerators


Programmable Loop Accelerator Schema

[Figure: accelerator schema – an accelerator datapath of FUs with shift registers and intra-cluster communication, plus a MEM unit and an inter-cluster register file, driven by a control unit; stream units move data between the datapath and SRAM/DRAM through stream buffers, and multiple accelerators can be composed into a pipeline of tiled or clustered accelerators.]

Page 13: Compiler-directed Synthesis of Programmable Loop Accelerators


Flow Diagram

FU Allocation → Partitioning → Modulo Scheduling → Loop Accelerator

Parameters determined along the flow: number of clusters, number of expensive FUs, number of cheap FUs, assignment of FUs to clusters, assembly code and II, shift register depth/width/porting, and inter-cluster bandwidth.
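Taken together, the three stages fill in the free parameters of the meta-architecture. A rough C sketch of such a parameter record is shown below; the field names and fixed array bounds are illustrative assumptions, not the actual PICO/CCCP data structures.

  /* Sketch of the parameters the flow above has to pin down: clusters,
   * FU mix, shift register sizing, and inter-cluster bandwidth.
   * All names/bounds are assumptions for illustration. */
  enum fu_kind { FU_ADD, FU_LOGIC, FU_SHIFT, FU_MPY, FU_DIV, FU_MEM, FU_BRANCH };

  struct cluster_params {
      int          num_fus;          /* FUs assigned to this cluster (e.g. 4) */
      enum fu_kind fu_kinds[8];      /* kind of each FU                       */
      int          sr_depth[8];      /* shift register depth per FU           */
      int          sr_width[8];      /* shift register width in bits          */
      int          sr_read_ports[8]; /* read ports per shift register         */
  };

  struct accel_params {
      int                   ii;              /* initiation interval           */
      int                   num_clusters;
      struct cluster_params clusters[8];
      int                   intercluster_bw; /* inter-cluster moves per cycle */
  };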

Page 14: Compiler-directed Synthesis of Programmable Loop Accelerators


Sobel Kernel

  for (i = 0; i < N1; i++) {
    for (j = 0; j < N2; j++) {
      int t00, t01, t02, t10, t12, t20, t21, t22;
      int e1, e2, e12, e22, e, tmp;

      t00 = x[i  ][j  ]; t01 = x[i  ][j+1]; t02 = x[i  ][j+2];
      t10 = x[i+1][j  ];                    t12 = x[i+1][j+2];
      t20 = x[i+2][j  ]; t21 = x[i+2][j+1]; t22 = x[i+2][j+2];

      e1 = ((t00 + t01) + (t01 + t02)) - ((t20 + t21) + (t21 + t22));
      e2 = ((t00 + t10) + (t10 + t20)) - ((t02 + t12) + (t12 + t22));

      e12 = e1*e1;
      e22 = e2*e2;
      e = e12 + e22;
      if (e > threshold) tmp = 1; else tmp = 0;
      edge[i][j] = tmp;
    }
  }

Page 15: Compiler-directed Synthesis of Programmable Loop Accelerators


FU Allocation

• Determine the number of clusters:

    #clusters = ceil( #ops / (4 × II) )

• Determine the number of expensive FUs (MPY, DIV, memory):

    #FUs of a type = ceil( #ops of that type / II )

• Sobel with II = 4
  – 41 ops → 3 clusters
  – 2 MPY ops → 1 multiplier
  – 9 memory ops → 3 memory units
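The Sobel numbers are consistent with simple ceiling formulas, sketched below in C; the 4 × II per-cluster operation budget comes from the load-balance rule on the partitioning slide, and the helper names are my own.

  #include <stdio.h>

  /* Allocation rules as ceiling formulas implied by the Sobel numbers
   * (41 ops, II=4 -> 3 clusters; 2 MPYs -> 1 multiplier; 9 mem ops -> 3
   * memory units).  Treat as an illustration of the slide's arithmetic. */
  static int ceil_div(int a, int b) { return (a + b - 1) / b; }

  static int num_clusters(int total_ops, int ii)   { return ceil_div(total_ops, 4 * ii); }
  static int num_fus_of_type(int type_ops, int ii) { return ceil_div(type_ops, ii); }

  int main(void)
  {
      int ii = 4;                                            /* Sobel design */
      printf("clusters:     %d\n", num_clusters(41, ii));    /* 41 ops -> 3  */
      printf("multipliers:  %d\n", num_fus_of_type(2, ii));  /* 2 MPYs -> 1  */
      printf("memory units: %d\n", num_fus_of_type(9, ii));  /* 9 mem  -> 3  */
      return 0;
  }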

Page 16: Compiler-directed Synthesis of Programmable Loop Accelerators


Partitioning

• Multi-level approach consists of two phases
  – Coarsening
  – Refinement

• Minimize inter-cluster communication
• Load balance
  – Max of 4 × II operations per cluster

• Take FU allocation into account
  – Restricted number of expensive units
  – Number of cheap units (ADD, logic) determined from the partition

Page 17: Compiler-directed Synthesis of Programmable Loop Accelerators


Coarsening

• Group highly related operations together
  – Pair operations together at each step (a sketch of one pairing pass follows)
  – Forces the partitioner to consider several operations as a single unit
• Coarsening a Sobel subgraph into 2 groups:

[Figure: a Sobel subgraph of loads (L) and adds (+) shown at successive coarsening steps, ending in two groups.]
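One plausible way to implement a single coarsening pass is greedy heavy-edge pairing, sketched below in C; the slides only say that operations are paired at each step, so the exact pairing policy here is an assumption.

  /* One coarsening pass as greedy heavy-edge pairing: every unmatched
   * operation is grouped with the unmatched neighbor it shares the most
   * data edges with.  weight[a][b] = number of edges between ops a and b. */
  #define MAX_OPS 64

  void coarsen_once(int n_ops, const int weight[MAX_OPS][MAX_OPS],
                    int group_of[MAX_OPS])
  {
      int matched[MAX_OPS] = {0};
      int next_group = 0;

      for (int a = 0; a < n_ops; a++) {
          if (matched[a])
              continue;
          int best = -1;                       /* heaviest unmatched neighbor */
          for (int b = 0; b < n_ops; b++)
              if (b != a && !matched[b] && weight[a][b] > 0 &&
                  (best < 0 || weight[a][b] > weight[a][best]))
                  best = b;

          group_of[a] = next_group;            /* a starts a new group        */
          matched[a]  = 1;
          if (best >= 0) {                     /* ... and absorbs its partner */
              group_of[best] = next_group;
              matched[best]  = 1;
          }
          next_group++;
      }
  }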

Page 18: Compiler-directed Synthesis of Programmable Loop Accelerators


Refinement

• Move operations between clusters
• Good moves:
  – Reduce inter-cluster communication
  – Improve load balance
  – Reduce hardware cost
    • Reduce the number of expensive units to meet the limit
    • Collect similar-bitwidth operations together

[Figure: a load/add subgraph split across two clusters, with a candidate operation move marked "?".]

Page 19: Compiler-directed Synthesis of Programmable Loop Accelerators


Partitioning Example

• From sobel, II = 4
• Place MPYs together
• Place each tree of ADD-LOAD-ADDs together
• Cuts 6 edges

Page 20: Compiler-directed Synthesis of Programmable Loop Accelerators


Modulo Scheduling

• Determines shift register width, depth, and number of read ports

• Sobel II=4

[Figure: a small Sobel subgraph of loads and adds and its modulo schedule at II = 4 on FU0–FU3, with a table listing, per FU and issue cycle, the maximum result lifetime and the required shift register depth and read ports.]
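A hedged C sketch of how such a table could be turned into shift register requirements for one FU: the depth covers the longest result lifetime, and the ports cover the largest number of same-cycle readers. The sizing rules are inferred from the slide's table, so treat the helper below as an illustration only.

  /* Size the shift register behind one FU from its modulo schedule.
   * Each result records when it is defined, when it is last read, and
   * how many consumers read it in its busiest cycle. */
  struct result {
      int def_cycle;            /* cycle the value is produced            */
      int last_use_cycle;       /* last cycle any consumer reads it       */
      int max_same_cycle_reads; /* most consumers reading it in one cycle */
  };

  void size_shift_register(int n, const struct result r[],
                           int *depth, int *ports)
  {
      *depth = 1;
      *ports = 1;
      for (int i = 0; i < n; i++) {
          int lifetime = r[i].last_use_cycle - r[i].def_cycle;
          if (lifetime > *depth)
              *depth = lifetime;                 /* cover longest lifetime  */
          if (r[i].max_same_cycle_reads > *ports)
              *ports = r[i].max_same_cycle_reads; /* cover same-cycle reads */
      }
  }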

Page 21: Compiler-directed Synthesis of Programmable Loop Accelerators


Test Cases

• Sobel and fsed kernels, II = 4 designs
• Each machine has 4 clusters with 4 FUs per cluster

[Figure: FU mix of the sobel and fsed machines – four clusters of four FUs each, drawn from memory (M), branch (B), add/subtract (+/-), shift (<<), multiply (*), and logic (&) units; the two designs differ in how these units are distributed across clusters.]

Page 22: Compiler-directed Synthesis of Programmable Loop Accelerators


Cross Compile Results

• Computation is localized
  – sobel: 1.5 moves/cycle
  – fsed: 1 move/cycle

• Cross compile
  – Can still achieve II = 4
  – More inter-cluster communication
  – May require more units
  – sobel on the fsed machine: ~2 moves/cycle
  – fsed on the sobel machine: ~3 moves/cycle

Page 23: Compiler-directed Synthesis of Programmable Loop Accelerators


Concluding Remarks

• Programmable loop accelerator design strategy
  – Meta-architecture with stylized interconnect
  – Systematic compiler-directed design flow

• Costs of programmability
  – Interconnect, inter-cluster communication
  – Control – "micro-instructions" are necessary

• Just scratching the surface of this work
• For more, see the CCCP group webpage: http://cccp.eecs.umich.edu