fast algorithm for digital signal processing - vlsi signal

36
1 VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-1 5012: VLSI Signal Processing 5012: VLSI Signal 5012: VLSI Signal Processing Processing Lecture 6 Fast Algorithms for Digital Lecture 6 Fast Algorithms for Digital Signal Processing Signal Processing VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-2 Algorithm Strength Reduction Motivation The number of strong operations, such as multiplications, is reduced possibly at the expense of an increase in the number of weaker operations, such as additions. Reduce computation complexity Example: Complex multiplication (a+jb)(c+jd)=e+jf, a,b,c,d,e,f R The direct implementation requires 4 multiplications and 2 additions However, the number of multiplication can be reduced to 3 at the expense of 3 extra additions by using the identities = b a c d d c f e ) ( ) ( ) ( ) ( b a d d c b bc ad b a d d c a bd ac + + = + + = 3 multiplications 5 additions

Upload: others

Post on 10-Feb-2022

8 views

Category:

Documents


0 download

TRANSCRIPT

1

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-1

5012: VLSI Signal Processing

5012: VLSI Signal 5012: VLSI Signal ProcessingProcessing

Lecture 6 Fast Algorithms for Digital Lecture 6 Fast Algorithms for Digital Signal ProcessingSignal Processing

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-2

Algorithm Strength Reduction• Motivation

– The number of strong operations, such as multiplications, is reduced possibly at the expense of an increase in the number of weaker operations, such as additions.

• Reduce computation complexity• Example: Complex multiplication

– (a+jb)(c+jd)=e+jf, a,b,c,d,e,f ∈ R– The direct implementation requires 4 multiplications and 2

additions

– However, the number of multiplication can be reduced to 3 at the expense of 3 extra additions by using the identities

⎥⎦

⎤⎢⎣

⎡⎥⎦

⎤⎢⎣

⎡ −=⎥

⎤⎢⎣

⎡ba

cddc

fe

)()()()(

baddcbbcadbaddcabdac

−++=+−+−=− 3 multiplications

5 additions

2

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-3

Review of Digital Signal Processing• Given two sequences:

– Data sequence di, 0 ≤ i≤ N-1, of length N– Filter sequence gi, 0 ≤ i≤ L-1, of length L

• Linear convolution

• Express the convolution in the notation of polynomials∑ ∑−

=

=−− −+===

1

0

1

0

2,,1,0 ,or ,N

k

L

kkikikkii NLidgsdgs K

( ) ( )

( ) ∑

∑∑−+

=

=

=

===

==

2

0

1

0

1

0

where),()()()()(

Then . ,

NL

i

ii

L

i

ii

N

i

ii

xsxsxgxdxdxgxs

xgxgxdxd

NL multiplications

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-4

CooK-Toom Algorithm• An algorithm for linear convolution by using

multiplying polynomials.• Consider the following system

hx sL-point sequence N-point

sequence

(L+N-1)-point sequence

given

To find sL+N-2, …, s1, s0 By solving L+N-1 linear equations

3

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-5

Review of Polynomial Ring• For a field F, there is a polynomial ring F[x] called the ring of

polynomials over F.• Mathematical expression

f(x)=fnxn + fn-1xn-1 + …+ f1x + f0, f0,f1,…,fn∈F• If fn≠ 0, then the degree of polynomial f(x) is n• β is called a zero of polynomial f(x), if β∈F and f(β)=0.• At most n field elements are zeros of a polynomial of degree n,

otherwise, it is a zero polynomial• Lagrange Interpolation

– Let β0,…,βn be a set of distinct elements, and let p(βk), k=0,…,n, be given. There is exactly one polynomial p(x) of degree n or less that has value p(βk), k=0, …, n. p(x) is given by

( )( )( )∏

∏∑

= −

−=

ijji

ijjn

ii

xpxp

ββ

ββ

0)(

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-6

Cook-Toom (CT) Algorithm

4

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-7

Example: 2 by 2 CT Algorithm• 2 by 2 convolution in polynomial multiplication from is

s(p)=h(p)x(p), where h(p)=h0+h1p, x(p)=x0+x1p, and s(p)=s0+s1p+s2p2

• Direct implementation: – require 4 multiplications and 1 addition

• CT algorithm – Step 1: Choose β0=0, β1=1, β2=-1– Step 2 :

⎥⎦

⎤⎢⎣

⎥⎥⎥

⎢⎢⎢

⎡=

⎥⎥⎥

⎢⎢⎢

1

0

1

01

0

2

1

0

0

0

xx

hhh

h

sss

Require no multiplications

( ) ( )( ) ( )( ) ( ) 1021020

1011010

00000

.,1;,,1

;,,0

xxxhhhxxxhhh

xxhh

−=−=−=+=+==

===

βββββββββ

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-8

2 by 2 CT Algorithm– Step 3 : Calculate s(β0), s(β1), s(β2).

– Step 4: Compute s(p) by using Lagrange interpolation theorem

( ) ( ) ( )( ) ( ) ( )( ) ( ) ( )222

111

000

βββββββββ

xhsxhsxhs

===

Require 3 multiplications

5

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-9

2 by 2 CT Linear Convolution

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-10

Remarks• Direct implementation needs 4 mutiplications and

1 addition• If we take sequence hi as filter coefficients and

sequence xi as the signal sequence, then the terms Hi need not be recomputed each time the filter is used. They can be precomputed once off-line and stored.

• 2 by 2 CT algorithm needs 3 multiplications and 5 additions (ignoring the additions in the pre-computation).

• The number of multiplications is reduced by 1 at the expense of 4 extra additions

6

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-11

Remarks• Some additions in the preaddition or postaddition

matrix can be shared. When we count the number of additions, we only count one instead of two or three.

• As can be seen from examples, the CT algorithm can be understood as a matrix decomposition

C H D

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-12

Cook-Toom Algorithm• Generally, the equation can be expresses as s=Tx=CHDx

– C is called the postaddition matrix and D the preadditionmatrix. H is a diagonal matrix with Hi, i=0, 1, …, L+N-2 on the diagonal

• Since T=CHD, it implies that the CT algorithm provides a way to factorize the convolution matrix T into three multiplying matrices and the total number of multiplications is determined by the non-zero elements on the main diagonal of the matrix H (note matrices C and D contain only small integers)

• Although the number of multiplications is reduced, the number of additions has increased. The Cook-Toomalgorithm can be modified in order to further reduce the number of additions

7

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-13

Concluding Remarks• The Cook-Toom algorithm is efficient as

measured by the number of multiplications• As the size of the problem increases, the number

of additions increase rapidly• The choices of βi=0, ±1 are good, while the choices

of ±2, ±4 (or other small integers) result in complicated pre-addition and post-addition matrices.

• For larger problems, CT algorithm becomes cumbersome

• Winograd Algorithm

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-14

Review of Integer Ring (1)• For every integer c and positive integer d, there is a unique

pair of integer Q, called the quotient, and integer s, the remainder, such that c=dQ+s, where 0≤s≤d-1

• Notation: Q=⎣c/d⎦, s=Rd[c]• Euclidean Algorithm: Given two positive integers s and t, t<s,

their GCD can be computed by an iterative application of the division algorithm.

)()1()1(

)()1()()2(

)3()2()3()1(

)2()1()2(

)1()1(

nnn

nnnn

tQtttQt

ttQtttQt

ttQs

−−

−−

=+=

+=+=+=

M

1. the process stops when a remainder of zero is obtained.

2. The last nonzero remainder t(n) is the GCD(s,t)

3. Matrix notation expression

⎥⎦

⎤⎢⎣

⎡⎥⎦

⎤⎢⎣

⎡−

=⎥⎦

⎤⎢⎣

⎡−

)1(

)1(

)()(

)(

110

r

r

rr

r

ts

Qts

8

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-15

Review of Integer Ring (2)• For any integer s and t, there exists integers a and b s.t.

GCD[s,t]= as + bt• It is possible to uniquely determine a nonnegative integer

given its moduli with respect to each of several integers, provided that the integer is known to be smaller than the product of the moduli.

• Example:M1

0

1

2

0

1

M2

0

1

2

3

4

→M1

2

0

1

2

0

M2

0

1

2

3

4

M1

1

2

0

1

2

M2

0

1

2

3

4

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

{m1=3,m2=5} M=3×5=15

Unique representation

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-16

4567891011

0123

M1

012012012012

M2

012301230123

M3

012340123401

1617181920212223

12131415

M3

234012340123

2829303132333435

24252627

M3

401234012340

4041424344454647

36373839

M3

123401234012

5253545556575859

48495051

M3

340123401234

M1

012012012012

M1

012012012012

M1

012012012012

M1

012012012012

M2

012301230123

M2

012301230123

M2

012301230123

M2

012301230123

{ } { }5,4,3,, 321 =mmm 60543 =××=M

Example

9

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-17

Chinese Remainder Theorem (1)• Given a set of integers m0, m1, …, mk that are pair-

wise relatively prime (co-prime), then for each integer c, 0≤c<M= m0m1…mk, there is a one-to-onemap between c and the vector of residues

• Conversely, given a set of co-prime integers m0, m1, …, mk and a set of integers c0, c1, …, ck with ci<mi. Then the system of equations

ci =c (mod mi), i=0,1,…,khas at most one solution for 0≤c<M

( )][,],[],[10

cRcRcRkmmm K

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-18

Chinese Remainder Theorem (2)• Define Mi=M/mi, then GCD[Mi, mi]=1. So there

exists integers Ni and ni with GCD[Mi, mi]= 1 = NiMi + nimi, i=0,1,…,k

• The system of equations ci=c (mod mi), 0≤i≤k, is uniquely solved by

∑=

=k

iiii MMNcc

0) (mod

given

We need to find Ni

10

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-19

GCD Example• GCD(993,186)

993 186 993=5×186+63930

63 186=2×63+60

63=1×60+3

12660

603 60=20×3+0

600

( )

( )

( )186169933

1861186599331861633

63218616360163

3186,993

×−×=×−×−×=

×−×=×−×−=

×−==GCD

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-20

Remark1.

2.

) (mod ) (mod

) (mod ) (mod 0

ii

iiii

k

iiiiii

mcmMNc

mMNcmc

==

=∑=

( )iii

iiiiii

mMNmMGCDmnMN

mod 1have then we,1),(

===+

11

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-21

Example• m0=3, m1=4, m2=5. Then by Euclidean theorem, we have

• The integer c can be calculated as

• Example

;15)5(12)2(,12,5;14)4(15)1(,15,4;13)7(20)1(,20,3

02

11

00

=+−===+−===+−==

MmMmMm

Ni

ni

( )

( ) ( )60 mod 241520

mod

210

0

ccc

MMNcck

iiii

−−−=

=∑=

( )17)60 (mod )224115220( ,Conversely

)2,1,2(,, i.e.,17 210

=×−×−×−===

ccccc

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-22

Remarks• By taking residues, large integers are broken

down into small pieces (that may be easy to add and multiply)

• Examples:7→(1, 3, 2 )

+3→(0, 3, 3 )10→(1 mod 3, 6 mod 4, 5 mod 5) = (1,2,0)

7→(1, 3, 2 )×3→(0, 3, 3 )21→(0 mod 3, 9 mod 4, 6 mod 5) = (0,1,1)

12

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-23

CRT for Polynomials (1)• Given a set of polynomials m(0)(x), m(1)(x), …, m(k)(x), that are

pair-wise relatively prime (co-prime), then for each polynomial c(x), deg(c(x))<deg(M(x)), M(x)= m(0)(x)m(1)(x)…m(k)(x), there is a one-to-one map between c(x) and the vector of residues

• Conversely, given a set of co-prime polynomials m(0)(x), m(1)(x), …, m(k)(x) and a set of polynomials c(0)(x), c(1)(x), …, c(k)(x) with deg(c(i)(x))< deg(m(i)(x)). Then the system of equations

c(i)(x) =c(x) (mod m(i)(x)), i=0,1,…,khas at most one solution for deg(c(x)) < deg(M(x))

)])([,)],([)],([()()()( )()1()0( xcRxcRxcR

xmxmxm kK

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-24

Chinese Remainder Theorem (2)• Define M(i)(x)=M(x)/ m(i)(x), then GCD[M(i)(x),m(i)(x)]=1. So

there exists polynomials N(i)(x) and n(i)(x) with GCD[M(i)(x),m(i)(x)]=1= N(i)(x)M(i)(x) + n(i)(x)m(i)(x), i=0,1,…,k

• The system of equations c(i)(x) =c(x) (mod m(i)(x) ), 0≤i≤k, is uniquely solved by

( ) ( ) ( ) ( ) ( )∑=

=k

i

(i)(i)(i) xM xMxNxcxc0

)( mod

13

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-25

Remarks• The remainder of a polynomial with regard to

modulus xi+f(x), where deg(f(x))<i, can be evaluated by substituting xi by –f(x) in the polynomial

• Example

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-26

Winograd Algorithm• Recall that we wish to compute s(p)=h(p)x(p) for

linear convolution• Consider the following system:

s(p)=h(p)x(p) mod m(p)• As long as deg(s)<deg(m), then the system can be

used for solving linear convolution problem• If m(p)= m(0)(p)m(1)(p)…m(k)(p). Efficient

implementation for linear convolution can be constructed using the CRT by choosing and factoring the polynomial m(p) appropriately.

14

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-27

Winograd Algorithm

This step requires multiplications

m(p)

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-28

Example: 2×3 WinogradAlgorithm

• The linear convolution h(p)x(p) has degree 3

Require no multiplication

15

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-29

Example: 2×3 WinogradAlgorithm

Require multiplication

m(p)

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-30

Example: 2×3 WinogradAlgorithm

16

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-31

Example: 2×3 WinogradAlgorithm

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-32

Remarks• It requires 5 multiplications and 11 additions, compared with 6

multiplications and 2 additions.• In above example, the order in which the additions are done is

unspecified. • One can experiment with the order of the additions to minimize

their number, however, there is no theory developed to aid in doing this.

• The number of multiplications is highly dependent on the degree of m(p)

• The degree of m(p) should be as small as possible. By CRT, the extreme case is deg(m(p))=deg(s(p))+1

• Let s’(p)=s(p)- hN-1xL-1 m(p). Note that s’(p) (mod m(p)) = s(p) (mod m(p))

• Modified Winograd Algorithm:– Choose m(p) with a degree equal to that of s(p)– Apply CRT to s’(p)– s(p)=s’(p)+ hN-1xL-1 m(p).

17

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-33

Iterated Convolution• To make use of efficient short-length convolution

algorithms iteratively, one can build long convolutions

• These algorithms do not achieve minimal multiplication complexity, but achieve a good balance between multiplications and addition complexity

• Iterated Convolution algorithm– Decompose the long convolution algorithm for short

convolutions– Construct fast convolution algorithm for short

convolutions– Use the short convolution algorithms to iteratively (or

hierarchically) implement the long convolution

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-34

Example• 4×4 linear convolution algorithm

Polyphase decomposition !!

18

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-35

Remarks• The 4×4 convolution is decomposed into two levels of nested

2×2 short convolutions• The top-level, which is expressed in terms of variable q, can

be using by 2×2 convolution algorithms

• The polynomial multiplications, for computing s’0,s’1,s’2, are again 2×2 convolutions, i.e. the second level 2×2 short convolutions

( )( )( )

( )( ) ( )

( )

( )( )⎥⎦

⎤⎢⎣

⎡′′

⎥⎥⎥

⎢⎢⎢

⎡−

⎥⎥⎥

⎢⎢⎢

′′−′

⎥⎥⎥

⎢⎢⎢

⎡−=

⎥⎥⎥

⎢⎢⎢

′′′

pxpx

phphph

ph

pspsps

1

0

1

10

0

2

1

0

1011

01

000000

100111001

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-36

Linear Convolution

Linear Shift

Linear Shift

19

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-37

Circular Shift

Conventional shift(linear shift)

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-38

Circular Convolution• Given two sequences xi and hi, 0≤i≤n-1, of block length n• Notation: ((n-k)) ≡ n-k (mod n)• Cyclic (or circular) convolution s’i, 0≤i≤n-1, is given by

( )( )∑−

=−=′

1

0

n

kkkii xhs

( ) ( ) ( ) ( )1mod 1mod )()( :product polynomialby n convolutio cyclic theexpresscan We

−=−=′ nn ppxphppsps

Coefficients with indices larger than n-1 are folded back into termswith indices small than n

20

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-39

Direct Implementation• Consider the following system

Ohx sn-point sequence n-point

sequence

n-point sequence

4×4 cyclic convolution

16 multiplications12 additions

Circular Convolution

Circular shift

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-40

Remarks• The cyclic convolution can be computed as a linear

convolution reduced by modulo pn-1• There are 2n-1 outputs of linear convolution, while

there are n outputs of cyclic convolution• The cyclic convolution can be computed by using

CRT with m(p)=pn-1

21

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-41

Example: 4×4 Cyclic Convolution

• 4×4 cyclic convolution by using m(p)=p4-1

p=1p=-1

p2=-1

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-42

Example

22

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-43

Examplem(p)

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-44

Example

23

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-45

5 multiplications15 additions

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-46

Discrete Fourier Transform• Discrete Fourier transform (DFT) pairs

knN

jknN

N

k

knN

N

n

knN

eW

NnWkXN

nx

NkWnxkX

π2

1

0

1

0

where

,1,,1,0 ,][1][

1,,1,0 ,][][

−−

=

=

=

−==

−==

K

K

• DFT/IDFT can be implemented by using the same hardware• It requires N2 complex multiplications and N(N-1) complex additions

N complex multiplicationsN-1 complex additions

2π/N

24

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-47

Decimation in Time

N+2(N/2)2 complex multiplications vs. N2 complex multiplication

twiddle factor

n2ℓ

2ℓ+1

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-48

25

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-49

Flow Graph of the DIT FFT

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-50

26

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-51

8-point DIT DFT

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-52

Remarks• It requires v=log2N stages. Each stage has N/2 butterfly

operation (radix-2 DIT FFT), which requires 2 complex multiplications and 2 complex additions

• Each stage has N complex multiplications and N complex additions

• The number of complex multiplications (as well as additions) is equal to N log2N

• By symmetry property, we have (butterfly operation)222 N

Njr

NN

Nr

NNr

N WeWWWW −=== −+ π

2 complex multiplications2 complex additions

1 complex multiplications2 complex additions

27

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-53

8-point FFT

Normal orderBit-Reversed order

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-54

In-Place Computation

Stage 1

X0[000]

X0[001]

X0[010]

X0[011]

X0[100]

X0[101]

X0[110]

X0[111]

X1[000]

X1[001]

X1[010]

X1[011]

X1[100]

X1[101]

X1[110]

X1[111]

X2[000]

X2[001]

X2[010]

X2[011]

X2[100]

X2[101]

X2[110]

X2[111]

Stage 3Stage 2

X3[000]

X3[001]

X3[010]

X3[011]

X3[100]

X3[101]

X3[110]

X3[111]

The same register array can be used in each stage

28

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-55

8-point FFT

Normal order Bit-reversed order

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-56

Normal-Order Sorting v.s. Bit-Reversed Sorting

Normal Order Bit-reversed Order

even

odd

top

bottom

29

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-57

DFT v.s. Radix-2 FFT• DFT: N2 complex multiplications and N(N-1)

complex additions• Recall that each butterfly operation requires one

complex multiplication and two complex additions• FFT: (N/2) log2N multiplications and N log2N

complex additions

• In-place computations: the input and the output nodes for each butterfly operation are horizontally adjacent only one storage arrays will be required

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-58

Decimation in Frequency (DIF)• Recall that the DFT is

• DIT FFT algorithm is based on the decomposition of the DFT computations by forming small subsequences in time domain index “n”: n=2ℓ or n=2ℓ+1

• One can consider dividing the output sequence X[k], in frequency domain, into smaller subsequences: k=2r or k=2r+1:

[ ] 10 ,][1

0−≤≤= ∑

=

NkWnxkXN

n

nkN

Substitution of variables

30

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-59

DIF FFT Algorithm (1)

is just N/2-point DFT. Similarly,

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-60

DIF FFT Algorithm (2)

v=log2N stages, each stage has N/2 butterfly operation.

(N/2)log2N complex multiplications, N complex additions

31

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-61

Remarks• The basic butterfly operations for DIT FFT and DIF FFT

respectively are transposed-form pair.

• The I/O values of DIT FFT and DIF FFT are the same• Applying the transpose transform to each DIT FFT

algorithm, one obtains DIF FFT algorithm

DIF BF unitDIT BF unit

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-62

Fast Convolution with the FFT• Given two sequences x1 and x2 of length N1 and N2

respectively– Direct implementation requires N1N2 complex

multiplications• Consider using FFT to convolve two sequences:

– Pick N, a power of 2, such that N≥N1+N2-1– Zero-pad x1 and x2 to length N– Compute N-point FFTs of zero-padded x1 and x2, then we

obtain X1 and X2– Multiply X1 and X2– Apply the IFFT to obtain the convolution sum of x1 and

x2– Computation complexity: 2(N/2) log2N + N + (N/2)log2N

32

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-63

Implementation Issues• Radix-2, Radix-4, Radix-8, Split-Radix,Radix-22, …, • I/O Indexing• In-place computation

– Bit-reversed sorting is necessary– Efficient use of memory– Random access (not sequential) of memory. An address

generator unit is required.– Good for cascade form: FFT followed by IFFT (or vice

versa)• E.g. fast convolution algorithm

• Twiddle factors– Look up table– CORDIC rotator

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-64

Algorithm Strength Reduction• Motivation

– The number of strong operations, such as multiplications, is reduced possibly at the expense of an increase in the number of weaker operations, such as additions.

• Reduce computation complexity• Example: Complex multiplication

– (a+jb)(c+jd)=e+jf, a,b,c,d,e,f ∈ R– The direct implementation requires 4 multiplications and 2

additions

– However, the number of multiplication can be reduced to 3 at the expense of 3 extra additions by using the identities

⎥⎦

⎤⎢⎣

⎡⎥⎦

⎤⎢⎣

⎡ −=⎥

⎤⎢⎣

⎡ba

cddc

fe

)()()()(

baddcbbcadbaddcabdac

−++=+−+−=− 3 multiplications

5 additions

33

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-65

Complex Multiplication

Reduce the number of strong operation (less switched capacitance), however, increase the critical path

Speed?, Area?, Power? ….

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-66

FIR Filters

⎥⎥⎥

⎢⎢⎢

⎥⎥⎥⎥

⎢⎢⎢⎢

=

⎥⎥⎥⎥

⎢⎢⎢⎢

2

1

0

1

01

01

0

3

2

1

0

000

000

xxx

hhh

hhh

yyyy Transform-domainTime-domain

34

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-67

Example: Linear Phase FIRLinear phase FIR filter: with approximately constant frequency-response magnitude and linear phase (constant group delay) in pass-band

N-tap

N multipliersN-1 adders

(N+1)/2 multipliersN-1 adders, if odd N

N/2 multipliersN-1 adders, if even N

By exploiting substructure sharing to reduce area

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-68

An Efficient Decomposition• Example: 2-fold decomposition

• Example 3-fold decomposition

• General case (N-fold decomposition)

4444 34444 21444444 3444444 21)(

421

)(

642

654321

21

20

)]5[]3[]1[()]6[]4[]2[]0[( ]6[]5[]4[]3[]2[]1[]0[)(

zHzH

zhzhhzzhzhzhhzhzhzhzhzhzhhzH

−−−−−−

−−−−−−

++++++=

++++++=

44 344 2144 344 214444 34444 21)(

32

)(

31

)(

63

654321

32

31

30

)]5[]2[()]4[]1[()]6[]3[]0[( ]6[]5[]4[]3[]2[]1[]0[)(

zHzHzH

zhhzzhhzzhzhhzhzhzhzhzhzhhzH

−−−−−−

−−−−−−

++++++=

++++++=

∑∑∑∞

−∞=

−−

=

−∞

−∞=

− +===k

kl

N

l

Nl

l

k

k zlNkhzHzHzzkhzH ][)( where,)(][)(1

0

35

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-69

Traditional Parallel Architecture• 2-fold parallel architecture

4(N/2) multiplicationsN/2-tap 4(N/2-1)+2 additions

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-70

Traditional Parallel FIR

L-parallel FIR filter of length N/L requires 1. L2 (N/L) multiplications, i.e. LN multiplications2. L2 (N/L -1) +L(L-1) additions, i.e. L(N-1) additions

~ LN multiply-add operations

36

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-71

VSP Lecture6 - Fast Algorithms for DSP ([email protected]) 3-72

Fast FIR Algorithm (FFA)• First by applying L-fold polyphase

decomposition for H(z)– There are L filters of length N/L

• By applying Winograd algorithm– 2 polynomials of degree L-1 can be implemented

by using 2L-1 product terms.– Each product terms are equivalent to filtering

operations in the block formulation– Consequently, it can be realized using

approximately (2L-1) FIR filters of length N/LIt requires 2N-N/L multiplications