University of Michigan Electrical Engineering and Computer Science Compiler-directed Synthesis of Programmable Loop Accelerators Kevin Fan, Hyunchul Park, Scott Mahlke September 25, 2004 EDCEP Workshop


Page 1

University of Michigan
Electrical Engineering and Computer Science

Compiler-directed Synthesis of Programmable Loop Accelerators

Kevin Fan, Hyunchul Park, Scott Mahlke
September 25, 2004
EDCEP Workshop

Page 2

Loop Accelerators

• Hardware implementation of a critical loop nest
  – Hardwired state machine
  – Digital camera application: 1000x vs. Pentium III
  – Multiple accelerators hooked up in a pipeline
• Loop accelerator vs. customized processor
  – 1 block of code vs. multiple blocks
  – Trivial control flow vs. handling generic branches
  – Traditionally state machine vs. instruction driven

Page 3

Programmable Loop Accelerators

• Goals
  – Multifunction accelerators: accelerator hardware can handle multiple loops (re-use)
  – Post-programmable: to a degree, allow changes to the application
  – Use compiler as architecture synthesis tool
• But …
  – Don’t build a customized processor
  – Maintain ASIC-level efficiency

Page 4

NPA (Nonprogrammable Accelerator) Synthesis in PICO

[Figure: NPA synthesis block diagram. Labels: Sequential Loop Nest, Performance Requirement, Systolic Array Datapath, Systolic Array Controller, Systolic Array, Coprocessor Interface, External Bus, Data In, Data Out, commands, timing, done. Inset: Systolic Processor Datapath with Load y_ii, Load w_jj, Load x_(ii-jj), +, Store, t1, t2, t3, Xr-1, Yr-1.]

Page 5

PICO Frontend

for i = 1 to ni
  for j = 1 to nj
    y[i] += w[j] * x[i+j]

for jt = 1 to 100 step 10
  for t = 0 to 502
    for p = 0 to 1
      (i,j) = function of (t,p)
      if (i>1) W[t][p] = W[t-5][p] else w[jt+j]
      if (i>1 && j<bj) X[t][p] = X[t-4][p+1] else x[i+jt+j]
      Y[t][p] += W[t][p] * X[t][p]

• Goals
  – Exploit loop-level parallelism
  – Map loop to abstract hardware
  – Manage global memory BW
• Steps
  – Tiling
  – Load/store elimination
  – Iteration mapping
  – Iteration scheduling
  – Virtual processor clustering
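As a rough illustration of the tiling step only, the sketch below writes the slide's first loop nest in C and then tiles the j loop. It is not PICO's transformation: the function names, the 0-based indexing, and the `tile` parameter are assumptions made for the example, and the later steps (load/store elimination, iteration mapping and scheduling) are not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Reference form of the slide's loop nest, using 0-based indices.
 * x must hold at least ni + nj - 1 elements. */
void fir_reference(int *y, const int *w, const int *x, size_t ni, size_t nj)
{
    for (size_t i = 0; i < ni; i++)
        for (size_t j = 0; j < nj; j++)
            y[i] += w[j] * x[i + j];
}

/* Same computation with the j loop split into tiles of `tile` iterations
 * (hypothetical parameter; the slide uses a step of 10). */
void fir_tiled(int *y, const int *w, const int *x,
               size_t ni, size_t nj, size_t tile)
{
    assert(tile > 0);
    for (size_t jt = 0; jt < nj; jt += tile) {
        size_t jend = (jt + tile < nj) ? jt + tile : nj;  /* clamp last tile */
        for (size_t i = 0; i < ni; i++)
            for (size_t j = jt; j < jend; j++)
                y[i] += w[j] * x[i + j];
    }
}
```

Both functions accumulate the same products into `y`, only in a different order, so with integer data (and no overflow) they give identical results.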

Page 6

PICO Backend

[Figure: operation graph with FSelectT, TSelectF, Load, +, Store, and Copy nodes]

• Resource allocation (II, operation graph)
• Synthesize machine description for “fake” fully connected processor with allocated resources

Page 7

Reduced VLIW Processor after Modulo Scheduling

[Figure: reduced VLIW processor datapath. Labels: FSelectT, TSelectF, Load, +, Store, Xr-1, Yr-1, t1, t2, t3, y_ii, w_jj, x_(ii-jj).]

Page 8

Data/control-path Synthesis NPA

[Figure: synthesized NPA datapath. Labels: Load y_ii, Load w_jj, Load x_(ii-jj), +, Store, t1, t2, t3, Xr-1, Yr-1.]

Page 9

PICO Methodology – Why it Works?

• Systematic design methodology
  – 1. Parameterized meta-architecture: all NPAs have the same general organization
  – 2. Performance/throughput is input
  – 3. Abstract architecture: we know how to build compilers for this
  – 4. Mapping mechanism: determine architecture specifics from schedule for abstract architecture

Page 10

Direct Generalization of PICO?

[Figure: the PICO backend operation graph with FSelectT, TSelectF, Load, +, Store, and Copy nodes]

• Programmability would require full interconnect between elements
• Back to the meta architecture!
• Generalize connectivity to enable post-programmability
• But stylize it

Page 11

Programmable Loop Accelerator – Design Strategy

• Compile for partially defined architecture
  – Build long distance communication into schedule
  – Limit global communication bandwidth
• Proposed meta-architecture
  – Multi-cluster VLIW
    • Explicit inter-cluster transfers (varying latency/BW)
    • Intra-cluster communication is complete
  – Hardware partially defined: expensive units

Page 12

Programmable Loop Accelerator Schema

[Figure: accelerator schema. Labels: DRAM, SRAM, Stream Buffer, Pipeline of Tiled or Clustered Accelerators, Accelerator Datapath, Control Unit, Stream Unit, II, FU, MEM, Shift Register, Intra-cluster Communication, Inter-cluster Register File.]

Page 13

Flow Diagram

[Figure: flow diagram, reconstructed from its labels]

• Input: assembly code, II
• FU Alloc: # clusters, # expensive FUs
• Partition: # cheap FUs, FUs assigned to clusters
• Modulo Schedule: shift register depth, width, porting; inter-cluster bandwidth
• Output: loop accelerator

Page 14

Sobel Kernel

for (i = 0; i < N1; i++) {
  for (j = 0; j < N2; j++) {
    int t00, t01, t02, t10, t12, t20, t21, t22;
    int e1, e2, e12, e22, e, tmp;

    /* 3x3 neighbourhood; the centre pixel is not used */
    t00 = x[i][j];   t01 = x[i][j+1];   t02 = x[i][j+2];
    t10 = x[i+1][j];                    t12 = x[i+1][j+2];
    t20 = x[i+2][j]; t21 = x[i+2][j+1]; t22 = x[i+2][j+2];

    /* top row minus bottom row, then left column minus right column */
    e1 = ((t00 + t01) + (t01 + t02)) - ((t20 + t21) + (t21 + t22));
    e2 = ((t00 + t10) + (t10 + t20)) - ((t02 + t12) + (t12 + t22));

    /* squared gradient magnitude compared against the threshold */
    e12 = e1*e1;
    e22 = e2*e2;
    e = e12 + e22;
    if (e > threshold) tmp = 1; else tmp = 0;
    edge[i][j] = tmp;
  }
}

Page 15

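FU Allocation

• Determine number of clusters: $\#\text{clusters} = \left\lceil \frac{\#\text{ops}}{4 \times II} \right\rceil$
• Determine number of expensive FUs (MPY, DIV, memory): $\#\text{units} = \left\lceil \frac{\#\text{ops of type}}{II} \right\rceil$
• Sobel with II=4
  – 41 ops → 3 clusters
  – 2 MPY ops → 1 multiplier
  – 9 memory ops → 3 memory units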
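The allocation arithmetic on this slide can be checked with a few lines of C. This is a minimal sketch: the function names are made up for the example, and the constant of 4 FUs per cluster is taken from the slide's "4 × II" capacity rather than from any tool interface.

```c
#include <assert.h>

#define FUS_PER_CLUSTER 4  /* from the slide's 4 * II operations per cluster */

/* Ceiling of a / b for non-negative a and positive b. */
static int ceil_div(int a, int b)
{
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

/* Each cluster can issue FUS_PER_CLUSTER * II operations per iteration. */
int num_clusters(int total_ops, int ii)
{
    return ceil_div(total_ops, FUS_PER_CLUSTER * ii);
}

/* One unit can execute II operations of its type per iteration. */
int num_expensive_units(int ops_of_type, int ii)
{
    return ceil_div(ops_of_type, ii);
}
```

With the Sobel numbers, `num_clusters(41, 4)` gives 3, `num_expensive_units(2, 4)` gives 1 multiplier, and `num_expensive_units(9, 4)` gives 3 memory units, matching the slide.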

Page 16

Partitioning

• Multi-level approach consists of two phases
  – Coarsening
  – Refinement
• Minimize inter-cluster communication
• Load balance
  – Max of 4 × II operations per cluster
• Take FU allocation into account
  – Restricted # of expensive units
  – # of cheap units (ADD, logic) determined from partition
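Two of the partitioning objectives, inter-cluster communication and load balance, can be expressed as simple checks over a candidate assignment. The sketch below is illustrative only: the `Dep` type, the `cluster_of` array, and both function names are assumptions, and the authors' actual cost function is not given in the slides.

```c
#include <assert.h>

/* A data dependence between two operations, identified by index. */
typedef struct { int u, v; } Dep;

/* Number of dependences whose endpoints sit in different clusters;
 * each one implies an inter-cluster transfer. */
int count_cut_edges(const Dep *deps, int n_deps, const int *cluster_of)
{
    assert(n_deps >= 0);
    int cut = 0;
    for (int e = 0; e < n_deps; e++)
        if (cluster_of[deps[e].u] != cluster_of[deps[e].v])
            cut++;
    return cut;
}

/* Returns 1 if no cluster holds more than 4 * II operations, else 0. */
int load_is_balanced(const int *cluster_of, int n_ops, int n_clusters, int ii)
{
    assert(ii > 0 && n_clusters > 0 && n_ops >= 0);
    for (int c = 0; c < n_clusters; c++) {
        int load = 0;
        for (int i = 0; i < n_ops; i++)
            if (cluster_of[i] == c)
                load++;
        if (load > 4 * ii)
            return 0;
    }
    return 1;
}
```

A partitioner would try to lower the first number while keeping the second check true.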

Page 17

Coarsening

• Group highly related operations together
  – Pair operations together at each step
  – Forces partitioner to consider several operations as a single unit
• Coarsening Sobel subgraph into 2 groups:

[Figure: Sobel subgraph of ADD (+) and LOAD (L) operations, shown at successive coarsening steps]
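The slides do not say how pairs are chosen at each coarsening step. One common choice in multilevel partitioners is heavy-edge matching, sketched below as an assumption rather than as the authors' method; the `Edge` type, its `weight` field (a stand-in for how "highly related" two operations are), and `coarsen_once` are all hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Weighted edge between operations u and v. */
typedef struct { int u, v, weight; } Edge;

/* qsort comparator: heaviest edges first. */
static int by_weight_desc(const void *a, const void *b)
{
    const Edge *ea = a, *eb = b;
    return (eb->weight > ea->weight) - (eb->weight < ea->weight);
}

/* One coarsening pass: pair each operation with at most one neighbour,
 * visiting edges from heaviest to lightest. group[i] receives the id of
 * the merged node that operation i belongs to. Returns the number of
 * groups. Sorts the edges array in place. */
int coarsen_once(Edge *edges, int n_edges, int n_ops, int *group)
{
    assert(n_edges >= 0 && n_ops >= 0);
    int next = 0;
    for (int i = 0; i < n_ops; i++)
        group[i] = -1;                       /* -1 means not yet matched */
    qsort(edges, (size_t)n_edges, sizeof *edges, by_weight_desc);
    for (int e = 0; e < n_edges; e++) {
        int u = edges[e].u, v = edges[e].v;
        if (u != v && group[u] < 0 && group[v] < 0)
            group[u] = group[v] = next++;    /* merge the pair */
    }
    for (int i = 0; i < n_ops; i++)
        if (group[i] < 0)
            group[i] = next++;               /* unmatched ops stay alone */
    return next;
}
```

Repeating the pass on the merged graph shrinks it further, which is what lets the partitioner treat several operations as a single unit.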

Page 18

Refinement

• Move operations between clusters
• Good moves:
  – Reduce inter-cluster communication
  – Improve load balance
  – Reduce hardware cost
    • Reduce number of expensive units to meet limit
    • Collect similar bitwidth operations together

[Figure: subgraph of ADD (+) and LOAD (L) operations with a candidate move marked “?”]
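For the first kind of good move, the benefit can be measured as the change in the number of cut dependences. The sketch below computes that gain for a single operation; it covers only the communication term (load balance and hardware cost are left out), and the `Dep` type and `move_gain` name are assumptions for illustration.

```c
#include <assert.h>

/* A data dependence between two operations, identified by index. */
typedef struct { int u, v; } Dep;

/* Reduction in cut edges if operation `op` moves to cluster `target`.
 * Positive means fewer inter-cluster transfers after the move. */
int move_gain(const Dep *deps, int n_deps, const int *cluster_of,
              int op, int target)
{
    assert(n_deps >= 0);
    int from = cluster_of[op];
    int gain = 0;
    if (from == target)
        return 0;
    for (int e = 0; e < n_deps; e++) {
        int other;
        if (deps[e].u == op)
            other = deps[e].v;
        else if (deps[e].v == op)
            other = deps[e].u;
        else
            continue;                 /* edge does not touch op */
        if (other == op)
            continue;                 /* ignore self edges */
        if (cluster_of[other] == from)
            gain--;                   /* internal edge becomes cut */
        else if (cluster_of[other] == target)
            gain++;                   /* cut edge becomes internal */
    }
    return gain;
}
```

A refinement loop would evaluate this for candidate moves and accept those that also respect the per-cluster capacity and the expensive-unit limit.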

Page 19

Partitioning Example

• From sobel, II=4
• Place MPYs together
• Place each tree of ADD-LOAD-ADDs together
• Cuts 6 edges

Page 20

Modulo Scheduling

• Determines shift register width, depth, and number of read ports

• Sobel II=4

[Figure: modulo schedule for cycles 0 to 3 on FU0 to FU3, containing two LD and four ADD operations]

FU   Cycle   Max result lifetime   Req’d depth   Req’d ports
0    2       4                     4             1
1    1       2                     4             2
1    3       4
2    4       1                     1             1
3    0       -                     1             1
3    3       1
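In the slide's numbers, the required depth for each FU equals the longest result lifetime among the operations scheduled on it. The sketch below encodes just that relationship. The `ScheduledOp` type and function name are assumptions, a lifetime of 0 stands in for the "-" entry (a result that is never read), and the rule for the port count is not modelled because the slides do not state it.

```c
#include <assert.h>

/* One scheduled operation: the FU it runs on and how many cycles its
 * result must stay available (0 if the result is never read). */
typedef struct {
    int fu;
    int lifetime;
} ScheduledOp;

/* Shift register depth for `fu`: the longest lifetime of any result
 * produced on that unit. */
int shift_register_depth(const ScheduledOp *ops, int n_ops, int fu)
{
    assert(n_ops >= 0);
    int depth = 0;
    for (int i = 0; i < n_ops; i++)
        if (ops[i].fu == fu && ops[i].lifetime > depth)
            depth = ops[i].lifetime;
    return depth;
}
```

Fed the six (FU, lifetime) pairs from the slide, this returns depths of 4, 4, 1, and 1 for FU0 through FU3.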

Page 21

Test Cases

• Sobel and fsed kernels, II=4 designs
• Each machine has 4 clusters with 4 FUs per cluster

[Figure: FU mix per cluster for the sobel and fsed machines. Unit labels: M, B, +, -, *, &, <<.]

Page 22

Cross Compile Results

• Computation is localized
  – sobel: 1.5 moves/cycle
  – fsed: 1 move/cycle
• Cross compile
  – Can still achieve II=4
  – More inter-cluster communication
  – May require more units
  – sobel on fsed machine: ~2 moves/cycle
  – fsed on sobel machine: ~3 moves/cycle

Page 23

Concluding Remarks

• Programmable loop accelerator design strategy
  – Meta-architecture with stylized interconnect
  – Systematic compiler-directed design flow
• Costs of programmability:
  – Interconnect, inter-cluster communication
  – Control: “micro-instructions” are necessary
• Just scratching the surface of this work
• For more, see the CCCP group webpage
  – http://cccp.eecs.umich.edu