fft compiler

9
FFT Compiler: From Math to Efficient Hardware James C. Hoe Department of ECE Carnegie Mellon University joint work with Peter A. Milder, Franz Franchetti, and Markus Pueschel in the SPIRAL project with support from NSF ACR-0234293, ITR/NGS-0325687, DARPA NBCH-105000,Intel Dataflow to Synthesis Retrospective, CSAIL/MIT, May 2007, Slide 2 CMU/ECE/SPIRAL/Hoe The SPIRAL Project High performance implementations of linear DSP transforms (DFT, DCT, DWT, filters, etc) are an important class of design problems Hand design and tuning is tricky and expensive - needs both math and implementation knowledge - time-consuming and tedious - needs to repeat effort for every new context SPIRAL research goal: A flexible push-button design generation framework that produces SW and HW implementations equal or better than expert hand design Today I focus on FFT for HW

Upload: others

Post on 26-Nov-2021

16 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: FFT Compiler

1

FFT Compiler:From Math to Efficient Hardware

James C. HoeDepartment of ECE

Carnegie Mellon University

joint work with Peter A. Milder, Franz Franchetti, and Markus Pueschel in the SPIRAL project

with support from NSF ACR-0234293, ITR/NGS-0325687, DARPA NBCH-105000,Intel

Dataflow to Synthesis Retrospective, CSAIL/MIT, May 2007, Slide 2CMU/ECE/SPIRAL/Hoe

The SPIRAL Project

♦ High performance implementations of linear DSP transforms (DFT, DCT, DWT, filters, etc) are an important class of design problems

♦ Hand design and tuning is tricky and expensive- needs both math and implementation knowledge- time-consuming and tedious - needs to repeat effort for every new context

♦ SPIRAL research goal: A flexible push-button design generation framework that produces SW and HW implementations equal or better than expert hand design

Today I focus on FFT for HW

Page 2: FFT Compiler

2

Dataflow to Synthesis Retrospective, CSAIL/MIT, May 2007, Slide 3CMU/ECE/SPIRAL/Hoe

Why we can do better than hand design

♦ SPIRAL is only focused on linear DSP transforms

♦ These transforms are highly structured, highly regular and very well understood mathematically

♦ Algorithmic implementations of a transform can be enumerated following a known set of rules

♦ For a given objective function and mapping target, a computer generates a solution at least as good as the best human effortby trying enough implementations

Dataflow to Synthesis Retrospective, CSAIL/MIT, May 2007, Slide 4CMU/ECE/SPIRAL/Hoe

Outline

♦ SPIRAL Formula Framework

♦ SPIRAL for HW FFT cores

♦ Conclusions

Page 3: FFT Compiler

3

Dataflow to Synthesis Retrospective, CSAIL/MIT, May 2007, Slide 5CMU/ECE/SPIRAL/Hoe

Linear Transforms

♦ Linear transform is matrix-vector multiplication- computing by definition takes O(N2) operations- the matrix has structure

♦ E.g. discrete Fourier transform: y = DFTN ⋅ x

y0y1.yj..

yN-1

=

x0x1.xk..

xN-1

.Njkie π2−

k → 0 .. N-1

j →0 .. N

-1

Dataflow to Synthesis Retrospective, CSAIL/MIT, May 2007, Slide 6CMU/ECE/SPIRAL/Hoe

“Fast” Algorithms♦ a “fast” algorithm factors the matrix into a sequence of

structured, sparse matricescheaper sparse multiplies ⇒ O(N log(N)) operations

♦ E.g. Cooley-Tukey Factorization of DFT4

♦ Matrix formula representation

⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅

−⋅⋅⋅⋅

⋅⋅−⋅⋅

⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅

−⋅⋅⋅−⋅

⋅⋅⋅⋅

=

−−−−−−

11

11

1111

1111

11

1

1111

1111

111111

111111

iii

ii

( ) ( ) mnnmn

mnnmnmn LDFTIDIDFTDFT ⋅⋅

⋅ ⊗⊗=

Page 4: FFT Compiler

4

Dataflow to Synthesis Retrospective, CSAIL/MIT, May 2007, Slide 7CMU/ECE/SPIRAL/Hoe

“Fast” Fourier Transform Algorithms♦ Recursively factorize by the Cooley-Tukey rule until only

leaf cases remain (e.g. DFTr for radix-r)

♦ Exponential number of alternatives

♦ Each ruletree corresponds a different algorithm♦ All cost O(N log(N))

( ) ( )( ) ( ) ( )( )( ) 8

24222

42222

8242

8242

82428

LLDFTIDIDFTIDIDFT

LDFTIDIDFTDFT

⊗⊗⊗⊗=

⊗⊗=

8DFT

2DFT 4DFT2DFT 2DFT

8DFT

4DFT 2DFT2DFT 2DFT

Dataflow to Synthesis Retrospective, CSAIL/MIT, May 2007, Slide 8CMU/ECE/SPIRAL/Hoe

Formula to HW (Combinational)

♦ Given where is:- apply , then - is a permutation permute- apply , times in parallel- is a diagonal scale

A

A

B A

×4

×2

×7

×8

Page 5: FFT Compiler

5

Dataflow to Synthesis Retrospective, CSAIL/MIT, May 2007, Slide 9CMU/ECE/SPIRAL/Hoe

How about good HW?

♦ Matrix formulas have a natural mapping to dataflow and hence combinational datapath

♦ However, real hardware designs must fit a given resource constraint ⇒ sequential datapath that reuse available HW- identify repeated kernels- instantiate kernels under resource constraints- schedule computation to reuse instantiated kernels

We want to do the analysis, mapping and scheduling at formula level, with high-level algorithm knowledge

Dataflow to Synthesis Retrospective, CSAIL/MIT, May 2007, Slide 10CMU/ECE/SPIRAL/Hoe

♦ Simple regular structure embodied in Pease FFT

♦ Example:

Regular Structure for HW

Page 6: FFT Compiler

6

Dataflow to Synthesis Retrospective, CSAIL/MIT, May 2007, Slide 11CMU/ECE/SPIRAL/Hoe

Pease DFT Example: DFT8

x

x

x

x

x

x

x

x

x

x

x

x

stage 1 stage 2 stage 3

(formula is applied from right to left)

Dataflow to Synthesis Retrospective, CSAIL/MIT, May 2007, Slide 12CMU/ECE/SPIRAL/Hoe

Horizontal folding

♦ a baseline datapath for DFT (sans bit-reverse)♦ degree of freedom: vertical parallelism

- parameter pp

inputbypass

register

pp

Page 7: FFT Compiler

7

Dataflow to Synthesis Retrospective, CSAIL/MIT, May 2007, Slide 13CMU/ECE/SPIRAL/Hoe

Vertical (V-)folding according to p

latencyFineFine--grained control over cost/latency tradeoffgrained control over cost/latency tradeoff

cost

Dataflow to Synthesis Retrospective, CSAIL/MIT, May 2007, Slide 14CMU/ECE/SPIRAL/Hoe

DFTgen [Milder, et al., DAC’05, FPGA’06]

functionality

controlarea/latencytradeoff

resourcemodel

Page 8: FFT Compiler

8

Dataflow to Synthesis Retrospective, CSAIL/MIT, May 2007, Slide 15CMU/ECE/SPIRAL/Hoe

Cost Performance Tradeoff

DFT64 DFT256 DFT1024

slices slices slices

late

ncy

(use

c)

Dataflow to Synthesis Retrospective, CSAIL/MIT, May 2007, Slide 16CMU/ECE/SPIRAL/Hoe

♦ Simple regular structure embodied in formula

Structures of Interest for Parameterized Cores

♦ Product: Horizontal folding

♦ Tensor Product: parallelism or streaming vertical folding

♦ Stride permutation: rewiring or buffer/delay

Page 9: FFT Compiler

9

Dataflow to Synthesis Retrospective, CSAIL/MIT, May 2007, Slide 17CMU/ECE/SPIRAL/Hoe

Applicability to other transforms?

♦ DFT radix 2

♦ DFT radix 2r

♦ 2-D DFTnxn

♦ WHT

♦ DCT (type II) [ ] Hk

iik k

k

k

k

k

k

PLLLADP2

22

22

1

0

22 11 −−∏

=−

( )[ ]∏=

⊗1

0

2

inn

nn DFTIL

( )[ ]∏−

=−− ⊗

1

0

22222 11

k

ii

k

kkk LDFTITR

( )[ ]∏−

=−− ⊗

1/

0

22222

rk

ii

k

rkrrkk LDFTITR

( )[ ]∏−

=−− ⊗

1/

0

2222

rk

i

k

rkrrk LWHTI

Dataflow to Synthesis Retrospective, CSAIL/MIT, May 2007, Slide 18CMU/ECE/SPIRAL/Hoe

Conclusions

♦ Mathematical approach to DSP transform implementation

♦ Encapsulating domain knowledge in a domain specific tool for 100% automatic design generation and optimization

♦ Generalizable to other linear DSP transforms

♦ Thank you (Arvind)