
2009 International Conference on Electrical Engineering and Informatics 5-7 August 2009, Selangor, Malaysia

978-1-4244-4913-2/09/$25.00 ©2009 IEEE

EE-27

64-point Fast Efficient FFT Architecture Using Radix-2^3 Single Path Delay Feedback

Trio Adiono, Muh Syafiq Irsyadi, Yan Syafri Hidayat, Ade Irawan

Electrical Engineering and Informatics School, Bandung Institute of Technology

Jl. Ganesha 10, Bandung 40132, Indonesia

Abstract: We present a new design of a 64-point Fast Fourier Transform (FFT) circuit. The design is derived from the radix-2^3 algorithm and implemented using a Single Path Delay Feedback (SDF) architecture. This approach ensures high memory and multiplier utilization. The 64-point FFT is realized by decomposing it into a two-dimensional structure of 8-point FFTs. Each of these 8-point FFTs is further decomposed into 4-point and 2-point FFTs. This decomposition leaves just one non-trivial twiddle factor, so the design needs only one complex multiplier. The complex multiplier is realized using the modified Booth (radix-4) encoding algorithm to achieve faster computation. The validity and efficiency of the proposed circuit have been verified by functional simulation, timing simulation, and FPGA implementation. The proposed design has been successfully synthesized using Synopsys with the TSMC 0.18 μm technology library. The core area is 0.47 mm^2, the power consumption is 29.7 mW, and the time delay is 6 ns. The circuit computes one serial-to-serial 64-point transform in 116 clock cycles. Thus our design has three advantages: small area, low power consumption, and fast computation.

Keywords: FFT, R2^3SDF, radix-2^3.

I. INTRODUCTION

FFTs have been used in innumerable signal processing applications and are often an important building block in such systems. Many of these applications require real-time operation in order to be useful. While Digital Signal Processors (DSPs) are available that can perform an FFT fast enough to keep up with many real-time applications, some systems require additional computation or have speed requirements that exceed the capabilities of a DSP alone. It is in these situations that dedicated logic for computing an FFT proves useful. A pipeline FFT processor is a specific class of processor for DFT computation that uses fast algorithms. It is characterized by real-time, non-stop processing as the data sequence passes through the processor.

II. THEORY

This algorithm is based on the fact that a radix-8 FFT can be decomposed into radix-4 and radix-2 FFTs in order to reduce computational complexity. Recall the DFT,

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad 0 \le k < N$$

Break the DFT into a three-dimensional index map.

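$$n = \frac{N}{4} n_1 + \frac{N}{8} n_2 + n_3, \qquad k = k_1 + 4k_2 + 8k_3$$

$$0 \le n_1 \le 3, \quad 0 \le n_2 \le 1, \quad 0 \le n_3 \le \frac{N}{8} - 1$$

$$0 \le k_1 \le 3, \quad 0 \le k_2 \le 1, \quad 0 \le k_3 \le \frac{N}{8} - 1$$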
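As a quick sanity check on the index map, the following minimal Python sketch enumerates every index triple for N = 64 and confirms that each input index n and each output index k is produced exactly once. The helper names `input_index` and `output_index` are illustrative only and do not come from the paper.

```python
# Sketch: verify that the three-dimensional index map is a one-to-one
# mapping onto 0..N-1 for N = 64. Function names are hypothetical.
N = 64

def input_index(n1, n2, n3):
    # n = (N/4) n1 + (N/8) n2 + n3
    return (N // 4) * n1 + (N // 8) * n2 + n3

def output_index(k1, k2, k3):
    # k = k1 + 4 k2 + 8 k3
    return k1 + 4 * k2 + 8 * k3

n_values = sorted(input_index(n1, n2, n3)
                  for n1 in range(4) for n2 in range(2) for n3 in range(N // 8))
k_values = sorted(output_index(k1, k2, k3)
                  for k1 in range(4) for k2 in range(2) for k3 in range(N // 8))

print(n_values == list(range(N)), k_values == list(range(N)))
```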

Substitute the new index into the DFT algorithm.

$$
\begin{aligned}
X(k) &= \sum_{n_3=0}^{\frac{N}{8}-1} \sum_{n_2=0}^{1} \sum_{n_1=0}^{3} x\!\left(\frac{N}{4} n_1 + \frac{N}{8} n_2 + n_3\right) W_N^{\left(\frac{N}{4} n_1 + \frac{N}{8} n_2 + n_3\right)(k_1 + 4k_2 + 8k_3)} \\
&= \sum_{n_3=0}^{\frac{N}{8}-1} \sum_{n_2=0}^{1} \left[ BF_4\!\left(\frac{N}{8} n_2 + n_3,\; k_1\right) \right] W_N^{\left(\frac{N}{8} n_2 + n_3\right)(k_1 + 4k_2 + 8k_3)}
\end{aligned}
$$

Decompose the twiddle factor.

$$W_N^{\left(\frac{N}{8} n_2 + n_3\right)(k_1 + 4k_2 + 8k_3)} = W_N^{\frac{N}{8} n_2 (k_1 + 4k_2)}\; W_N^{n_3 (k_1 + 4k_2)}\; W_{N/8}^{n_3 k_3}$$
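The twiddle factor decomposition can be checked numerically. The sketch below assumes the usual definition $W_M^e = e^{-j 2\pi e / M}$ (the helper `W` is illustrative, not from the paper) and compares both sides of the identity for every index combination with N = 64.

```python
import cmath

def W(M, e):
    """Twiddle factor W_M^e = exp(-j*2*pi*e/M) (assumed standard definition)."""
    return cmath.exp(-2j * cmath.pi * e / M)

N = 64
max_err = 0.0
for n2 in range(2):
    for n3 in range(N // 8):
        for k1 in range(4):
            for k2 in range(2):
                for k3 in range(N // 8):
                    # Left-hand side: the undecomposed twiddle factor.
                    lhs = W(N, ((N // 8) * n2 + n3) * (k1 + 4 * k2 + 8 * k3))
                    # Right-hand side: product of the three decomposed factors.
                    rhs = (W(N, (N // 8) * n2 * (k1 + 4 * k2))
                           * W(N, n3 * (k1 + 4 * k2))
                           * W(N // 8, n3 * k3))
                    max_err = max(max_err, abs(lhs - rhs))

print(max_err < 1e-12)
```

The identity holds because the cross terms dropped on the right-hand side have exponents that are multiples of N, so they equal 1.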

Substitute the decomposed twiddle factor.

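$$
\begin{aligned}
X(k) &= \sum_{n_3=0}^{\frac{N}{8}-1} \sum_{n_2=0}^{1} \left[ BF_4\!\left(\frac{N}{8} n_2 + n_3,\; k_1\right) \right] W_N^{\frac{N}{8} n_2 (k_1 + 4k_2)}\; W_N^{n_3 (k_1 + 4k_2)}\; W_{N/8}^{n_3 k_3} \\
&= \sum_{n_3=0}^{\frac{N}{8}-1} \left[ H(n_3, k_1, k_2)\; W_N^{n_3 (k_1 + 4k_2)} \right] W_{N/8}^{n_3 k_3}
\end{aligned}
$$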

Now recall the 64-point radix-8 FFT algorithm [2]. Comparing it with the result above, we can clearly see that

$$\sum_{m=0}^{7} W_{64}^{sl}\, x(l + 8m)\, W_8^{sm} = \sum_{n_3=0}^{\frac{N}{8}-1} \left[ H(n_3, k_1, k_2) \right] W_N^{n_3 (k_1 + 4k_2)}$$

where N = 64 and

$$H(n_3, k_1, k_2) = BF_4(n_3, k_1) + BF_4\!\left(\frac{N}{8} + n_3,\; k_1\right) W_8^{(k_1 + 4k_2)}$$

From the last equation, the first stage of the 64-point radix-8 FFT can be decomposed into radix-4 and radix-2 FFTs. The second stage of the radix-8 FFT can be decomposed into radix-4 and radix-2 FFTs using the same method.
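The decomposition can be exercised end to end in software. The following is a minimal floating-point sketch, not the hardware design: it assumes $BF_4(m, k_1)$ is the 4-point DFT $\sum_{n_1=0}^{3} x(\frac{N}{4} n_1 + m)\, W_4^{n_1 k_1}$ (the text does not spell this out), and all function names are illustrative. It evaluates X(k) through $BF_4$ and H and compares the result with a direct DFT.

```python
import cmath

N = 64

def W(M, e):
    """Twiddle factor W_M^e = exp(-j*2*pi*e/M)."""
    return cmath.exp(-2j * cmath.pi * e / M)

def dft(x):
    """Direct O(N^2) DFT used as the reference."""
    size = len(x)
    return [sum(x[n] * W(size, n * k) for n in range(size)) for k in range(size)]

def bf4(x, m, k1):
    # Assumed radix-4 butterfly: its factors W_4^(n1*k1) are only 1, -j, -1, j.
    return sum(x[(N // 4) * n1 + m] * W(4, n1 * k1) for n1 in range(4))

def h(x, n3, k1, k2):
    # Radix-2 step with the trivial twiddle factor W_8^(k1 + 4*k2).
    return bf4(x, n3, k1) + bf4(x, N // 8 + n3, k1) * W(8, k1 + 4 * k2)

def fft_decomposed(x):
    X = [0j] * N
    for k1 in range(4):
        for k2 in range(2):
            for k3 in range(N // 8):
                X[k1 + 4 * k2 + 8 * k3] = sum(
                    h(x, n3, k1, k2) * W(N, n3 * (k1 + 4 * k2)) * W(N // 8, n3 * k3)
                    for n3 in range(N // 8))
    return X

# Arbitrary deterministic test input.
x = [complex(n % 7, -(n % 5)) for n in range(N)]
max_error = max(abs(a - b) for a, b in zip(fft_decomposed(x), dft(x)))
print(max_error < 1e-9)
```

In this sketch the only multiplication that is not by a power of $W_8$ is the factor `W(N, n3 * (k1 + 4 * k2))`. The remaining sum over `n3` with `W(N // 8, n3 * k3)` is the 8-point second stage, which the text says is decomposed in the same way.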

The real advantage of this method is that $W_8^{sm}$ and $W_8^{lt}$ are trivial twiddle factors. Multiplying by them is actually an addition/subtraction operation followed by a multiplication by $1/\sqrt{2}$, which can be realized using only a hardwired shift-and-add operation [2]. The only non-trivial twiddle factor is $W_{64}^{sl}$. A detailed derivation of the radix-8 and radix-2^3 FFT algorithms can be found in [2] and [3].

III. DESIGN ARCHITECTURE

The block diagram of the 64-point FFT processor derived from Section II is depicted in Figure 1. It consists of four stages of butterfly feedback structure and one reorder stage. The architecture is based on the Single Path Delay Feedback architecture, because the delay-feedback approach is always more efficient than the corresponding delay-commutator approach in terms of memory utilization, since the stored butterfly output can be used directly by the multiplier [2]. The unusual mixed-radix structure, which consists of a radix-4 butterfly followed by a radix-2 butterfly, then another radix-4 butterfly and another radix-2 butterfly, is intended to retain the advantage of the radix-8 FFT: only one non-trivial twiddle factor is needed. At the same time, the new approach has a simpler butterfly structure and higher butterfly utilization than radix-8.

Controller: This design does not implement a master controller. Each butterfly has its own controller that is independent of the others. This approach leads to a modular and general butterfly structure. Each controller is activated by the head signal from the previous stage. The controller itself is a ($\log_2 N$)-bit binary counter. In each butterfly, the count is divided into four groups for a radix-4 butterfly or two groups for a radix-2 butterfly. Each group of counts is called a phase (ph). The phases control the memory modules, the butterfly operation, and the twiddle multiplication. Another control signal called stage (st) is needed by the twiddle stage to choose the multiplicative operation.

Radix-4 butterfly (stage 1 and stage 3): Stage 1 and stage 3 are radix-4 butterfly modules. Four phases control the butterfly operation. In the first three phases, the input data is inserted directly into the shift register while the previously stored data is taken to the output. The butterfly computation happens only in the last phase.

Radix-2 butterfly (stage 2 and stage 4): Stage 2 and stage 4 are radix-2 butterfly modules. As with stage 1 and stage 3, the only difference between stage 2 and stage 4 is the shift register length. Two phases control the butterfly operation.
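To illustrate the two-phase operation, here is a behavioural Python sketch of a single radix-2 delay-feedback stage. It follows the commonly described single-path delay-feedback behaviour, by analogy with the radix-4 description above: in phase 0 the input is stored and the previously stored values leave, and in phase 1 the butterfly sum is output while the difference is fed back. This is an assumption, not the authors' HDL, and the name `r2sdf_stage` and the `delay` parameter are hypothetical.

```python
from collections import deque

def r2sdf_stage(samples, delay):
    """Behavioural model of one radix-2 delay-feedback stage.

    `delay` is the shift register length; twiddle multiplication is omitted.
    """
    register = deque([0] * delay)
    outputs = []
    for count, sample in enumerate(samples):
        phase = (count // delay) % 2      # two phases of `delay` cycles each
        oldest = register.popleft()
        if phase == 0:
            outputs.append(oldest)            # stored difference leaves the stage
            register.append(sample)           # input goes into the shift register
        else:
            outputs.append(oldest + sample)   # butterfly sum goes to the output
            register.append(oldest - sample)  # butterfly difference is fed back
    return outputs

# Four samples followed by four zeros to flush the stored differences.
result = r2sdf_stage([1, 2, 3, 4, 0, 0, 0, 0], delay=2)
print(result)
```

With `delay=2`, the sums x0 + x2 and x1 + x3 appear in the second phase of the first block, and the differences x0 - x2 and x1 - x3 appear in the first phase of the next block.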

Trivial twiddle factor: In this design there are four cases of trivial twiddle factor, one for each phase. From the algorithm in Section II, we can conclude that only the second half of the data in each phase needs to be multiplied by the trivial twiddle factor; the first half remains unchanged. That is why we need another control signal, which changes every eight clock cycles, to tell the twiddle factor mechanism whether the data needs to be multiplied or not.

TABLE 1 TRIVIAL TWIDDLE FACTOR CONSTANT

| Phase | Twiddle constant multiplier |
|-------|-----------------------------|
| 00    | 1                           |
| 01    | (1 - j)/√2                  |
| 10    | -j                          |
| 11    | -(1 + j)/√2                 |
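The four constants in Table 1 are the powers $W_8^0$ to $W_8^3$. The short sketch below checks this numerically, assuming $W_8 = e^{-j 2\pi / 8}$; the dictionary name `table_1` is illustrative.

```python
import cmath
import math

# Constants from Table 1, keyed by the 2-bit phase value.
table_1 = {
    0b00: 1 + 0j,
    0b01: (1 - 1j) / math.sqrt(2),
    0b10: -1j,
    0b11: -(1 + 1j) / math.sqrt(2),
}

errors = {}
for phase, constant in table_1.items():
    w8_power = cmath.exp(-2j * cmath.pi * phase / 8)  # W_8^phase
    errors[phase] = abs(w8_power - constant)

print(all(err < 1e-12 for err in errors.values()))
```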

As Table 1 shows, in phase 0 and phase 2 the multiplication either leaves the data unchanged or just swaps and inverts the real and imaginary parts. In phase 1 and phase 3 it involves an addition/subtraction and a multiplication by the constant $1/\sqrt{2}$. As noted in [2], the constant to be multiplied is known a priori. It can therefore be decomposed into a sum or difference of powers of 2, which in essence results in a shift-and-add architecture. The constant $1/\sqrt{2}$ can be decomposed in terms of powers of 2 into ($2^{-1} + 2^{-3} + 2^{-4} + 2^{-8}$). With this representation, the multiplication of the input data by this constant turns into an addition of right-shifted values of the input data.

Figure 1. Proposed R2^3SDF pipeline FFT architecture
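The shift-and-add idea can be sketched in a few lines of Python. The function name `scale_by_inv_sqrt2` and the `SHIFTS` tuple are illustrative, and Python integers stand in for the fixed-point samples. The shift amounts are exactly the four terms quoted above.

```python
import math

SHIFTS = (1, 3, 4, 8)  # the terms 2^-1 + 2^-3 + 2^-4 + 2^-8 quoted in the text

def scale_by_inv_sqrt2(value, shifts=SHIFTS):
    """Approximate value / sqrt(2) using only right shifts and additions.

    `value` is an integer sample; `>>` on a Python int is an arithmetic shift.
    """
    return sum(value >> s for s in shifts)

coefficient = sum(2.0 ** -s for s in SHIFTS)      # value the shifts represent
approximation_error = abs(coefficient - 1 / math.sqrt(2))

print(scale_by_inv_sqrt2(1024), coefficient, approximation_error)
```

The four quoted terms sum to 0.69140625, compared with $1/\sqrt{2} \approx 0.70711$. As a purely arithmetic observation that is not part of the text above, adding a $2^{-6}$ term would give 0.70703125; the `shifts` parameter makes it easy to experiment with such variations.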

Non-trivial twiddle factor: This operation uses ROMs to store the twiddle factors and one complex multiplier to perform the multiplication. The ROMs are very simple: we implement two arrays of constants, with the real parts of the twiddle factors stored in the first array and the imaginary parts in the second. We implement a custom-built multiplier based on the radix-4 recoding technique (modified Booth recoding). In our comparison, this approach is the most efficient multiplier in terms of A·T (area × time delay) when compared with the Synopsys standard multiplier (using the "*" operator) and with the standard multiplier plus a shuffle network (intended to reduce the twiddle factor constants). The complete comparison is presented in Table 2.

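TABLE 2 MULTIPLIER COMPARISONS

| Multiplier design          | Area (μm^2) | Time delay (ns) | A·T      |
|----------------------------|-------------|-----------------|----------|
| Standard ("*")             | 59918.4     | 6.77            | 405647.8 |
| Standard + shuffle network | 59901.8     | 8.46            | 506769.3 |
| Radix-4 recoding           | 87577.4     | 3.98            | 348558.2 |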

From Table 2 it can be clearly seen that radix-4 recoding is the best choice in terms of speed and A·T. Another advantage of using a custom multiplier is that the synthesized circuit is independent of the synthesis tool.
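The A·T figures can be reproduced directly from the area and delay columns. The sketch below uses the values from Table 2; the variable names are illustrative.

```python
# (area in um^2, time delay in ns) taken from Table 2
designs = {
    "Standard (*)": (59918.4, 6.77),
    "Standard + shuffle network": (59901.8, 8.46),
    "Radix-4 recoding": (87577.4, 3.98),
}

# Area-time product for each design; lower is better.
area_time = {name: area * delay for name, (area, delay) in designs.items()}
best = min(area_time, key=area_time.get)

for name, product in area_time.items():
    print(f"{name}: {product:.1f}")
print("lowest A*T:", best)
```

The radix-4 recoding multiplier has the largest area of the three, but its shorter delay more than compensates in the product.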

The radix-4 recoding multiplier uses a recoding process intended to reduce the number of partial products. This is achieved by recoding the multiplier from two's-complement format to a signed-digit representation from the set {0, ±1, ±2} [5]. The radix-4 recoding starts by appending a zero to the right of $x_0$ (the multiplier LSB). Triplets are taken beginning at position $x_{-1}$ and continuing to the MSB, with one bit overlapping between adjacent triplets. If the number of bits in X (excluding $x_{-1}$) is odd, the sign (MSB) is extended by one position to ensure that the last triplet contains 3 bits. Each step yields a signed digit that multiplies the multiplicand to generate a partial product. The recoding table is presented in Table 3.

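TABLE 3 RADIX-4 RECODING

| x_{i+2} | x_{i+1} | x_i | Partial product |
|---------|---------|-----|-----------------|
| 0       | 0       | 0   | 0 Y             |
| 0       | 0       | 1   | +1 Y            |
| 0       | 1       | 0   | +1 Y            |
| 0       | 1       | 1   | +2 Y            |
| 1       | 0       | 0   | -2 Y            |
| 1       | 0       | 1   | -1 Y            |
| 1       | 1       | 0   | -1 Y            |
| 1       | 1       | 1   | 0 Y             |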
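The recoding procedure and Table 3 translate into a short software model. The following Python sketch is illustrative only (the names `booth_radix4_digits` and `booth_multiply` are not from the paper): it appends the zero bit, walks the overlapping triplets, looks up each signed digit in Table 3, and sums the shifted partial products.

```python
# Signed digit for each triplet (x_{i+2}, x_{i+1}, x_i), as in Table 3.
RECODING_TABLE = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
                  0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}

def booth_radix4_digits(x, bits):
    """Recode a `bits`-wide two's-complement integer into digits from {0, ±1, ±2}."""
    if bits % 2:
        bits += 1  # extend the sign by one position so the last triplet is complete
    # Masking a Python int yields its two's-complement pattern; the shift
    # appends the extra zero bit x_{-1} to the right of the LSB.
    pattern = (x & ((1 << bits) - 1)) << 1
    return [RECODING_TABLE[(pattern >> i) & 0b111] for i in range(0, bits, 2)]

def booth_multiply(x, y, bits=16):
    """Multiply x (multiplier) by y (multiplicand) via radix-4 partial products."""
    digits = booth_radix4_digits(x, bits)
    # Digit i has weight 4^i, so its partial product is shifted left by 2*i bits.
    return sum((digit * y) << (2 * i) for i, digit in enumerate(digits))

print(booth_radix4_digits(7, 4), booth_multiply(-1234, 567))
```

A 16-bit multiplier yields 8 signed digits, so only 8 partial products are summed instead of 16.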

In a straightforward implementation, a complex multiplication needs four real multipliers and two adders, so we would need four Booth recoders to implement the multiplication using radix-4 recoding. However, if we examine the multiplication formula closely,

$$(a + jb)(c + jd) = (ac - bd) + j(bc + ad)$$

and always keep one pair (a and b, for example) as the multipliers and the other pair (c and d) as the multiplicands, then we need only two radix-4 recoders instead of four [7]. The circuit block diagram is presented in the figure below. There are four inputs. Inputs a and b are recoded to choose the appropriate partial products. Once the radix-4 recoded partial products have been generated, they are shifted and added. To produce the real part, the sum of the second group of partial products (bd) is subtracted from the sum of the first group (ac). The imaginary part is the sum of the other two groups of partial products (bc and ad). The micro-architecture for radix-4 recoding is presented in Figure 2.
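The recoder-sharing argument can be illustrated with a small integer model. In this sketch, which is an assumption-laden software analogue rather than the circuit, `recode` and `complex_multiply` are hypothetical names. Each of a and b is recoded once, and the resulting digit sequence is reused against both multiplicands c and d.

```python
RECODING_TABLE = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
                  0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}

def recode(x, bits=16):
    """Radix-4 Booth digits of a two's-complement integer (bits assumed even)."""
    pattern = (x & ((1 << bits) - 1)) << 1   # append the zero bit x_{-1}
    return [RECODING_TABLE[(pattern >> i) & 0b111] for i in range(0, bits, 2)]

def accumulate(digits, multiplicand):
    """Shift and add the partial products selected by the recoded digits."""
    return sum((digit * multiplicand) << (2 * i) for i, digit in enumerate(digits))

def complex_multiply(a, b, c, d, bits=16):
    """Return (real, imag) of (a + jb)(c + jd) using only two recodings."""
    digits_a = recode(a, bits)   # recoder 1, shared by a*c and a*d
    digits_b = recode(b, bits)   # recoder 2, shared by b*c and b*d
    ac, ad = accumulate(digits_a, c), accumulate(digits_a, d)
    bc, bd = accumulate(digits_b, c), accumulate(digits_b, d)
    return ac - bd, bc + ad

print(complex_multiply(3, -4, 5, 6))
```

Four partial-product accumulations are still needed, but the recoding logic is instantiated only twice, which is the saving described above.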


Figure 2 Architecture for a complex multiplier circuit with twiddle factor ROM

Reorder: The reorder stage is an integral part of the design; it provides ordered serial-to-serial data input and output. We implement the reorder stage using only shift registers and multiplexers. The shift registers store the data temporarily before it is taken out as the output. We need 98 blocks of shift registers for the design. As the selector, we implement a 64-to-1 mapping using multiplexers.

IV. VERIFICATION AND IMPLEMENTATION

The verification process includes functional simulation, waveform simulation, and signal tap in the FPGA. Functional simulation was done to check that the HDL design matches the model. After the functional simulation was complete, the architecture was synthesized for the TSMC 0.18 μm library using Synopsys. The synthesis result is presented in Table 4.

The FPGA implementation is used to check whether the designed circuit functions correctly in the real world. We use an Altera Cyclone II EP2C35F672C6 board for this design implementation. We load the test vector and the expected result into ROM and compare the results. The output signal is captured using the Signal Tap II function of the Altera Quartus software.

Figure 3. FFT output from the FPGA captured using Signal Tap II

TABLE 4 PERFORMANCE COMPARISON OF THE PROPOSED FFT CIRCUIT WITH THE REFERENCE DESIGN AND WITH AVAILABLE CHIPSETS

| FFT circuit                | Word length | Technology (μm) | Cycles required | Area (mm^2) | Area (norm.) | Time delay (ns) | Time delay (norm.) | Power (mW) |
|----------------------------|-------------|-----------------|-----------------|-------------|--------------|-----------------|--------------------|------------|
| Proposed (radix-2^3 SDF)   | 16          | 0.18            | 116             | 0.47        | 17780.6      | 6.03            | 24.26              | 29.7       |
| Koushik [2] (radix-8)      | 16          | 0.25            | 23 (64)         | 6.8         | -            | -               | -                  | 41         |
| T. Chen, L. Zhu [2]        | 16          | 2               | 208             | 282         | -            | -               | -                  | -          |
| T. Chen, Sunanda [2]       | 16          | 0.75            | 222             | 156         | -            | -               | -                  | -          |
| McCanny, D. Trainor [2]    | 24          | 0.35            | 130             | Core        | -            | -               | -                  | 1300       |

In Figure 3, we use a 15-cycle complex sinusoid as our test vector. The test vector is fed continuously into the designed circuit. We use the 50 MHz internal clock to produce the clock signal. To capture the signals we use a push button as the trigger. The push button serves only as a trigger and has no connection to the design. The head signal is generated automatically at the beginning of the first data sample using a counter.
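A test vector of this kind is easy to model in software. The sketch below is a floating-point illustration, not the 16-bit ROM contents used on the board. It assumes a unit-amplitude, positive-frequency complex sinusoid with 15 cycles over 64 samples, and it checks that a direct DFT concentrates all energy in bin 15.

```python
import cmath

N = 64
CYCLES = 15  # number of sinusoid periods across the 64-sample frame

# Assumed form of the test vector: exp(+j*2*pi*15*n/64), unit amplitude.
test_vector = [cmath.exp(2j * cmath.pi * CYCLES * n / N) for n in range(N)]

# Direct DFT used as the expected result.
spectrum = [sum(test_vector[n] * cmath.exp(-2j * cmath.pi * n * k / N)
                for n in range(N))
            for k in range(N)]

peak_bin = max(range(N), key=lambda k: abs(spectrum[k]))
print(peak_bin, round(abs(spectrum[peak_bin]), 6))
```

Because the sinusoid completes a whole number of cycles in the frame, every bin other than the peak is zero up to rounding error, which makes a mismatch in the hardware output easy to spot.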

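TABLE 5 AREA AND TIME DELAY SYNTHESIS RESULT

| Design  | Area (μm^2)  | Area (normalized) | Time delay (ns) | Time delay (normalized) |
|---------|--------------|-------------------|-----------------|-------------------------|
| R2^3SDF | 473163.78125 | 17780.62478       | 6.03            | 24.25862069             |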


V. CONCLUSIONS

A 64-point FFT architecture for high-speed WLAN systems based on OFDM transmission has been presented. This architecture is based on a decomposition of the 64-point FFT into four stages of 4-point and 2-point FFTs. The algorithm offers simple FFT computations, so the resulting algorithm-to-architecture mapping is well suited for hardware implementation. The design exhibits several attractive features from a VLSI point of view, including regularity, modularity, and high throughput.

The validity and efficiency of the proposed circuit have been verified by functional simulation, timing simulation, and FPGA implementation. The proposed design has been successfully synthesized using Synopsys with the TSMC 0.18 μm technology library. The core area is 0.47 mm^2, the power consumption is 29.7 mW, and the time delay is 6 ns. The circuit computes one serial-to-serial 64-point transform in 116 clock cycles. Thus our design has three advantages: small area, low power consumption, and fast computation in terms of speed and clock latency. These advantages indicate that the design is well suited for high-performance WLAN systems.

REFERENCES

[1] Shousheng He, Mats Torkelson. A New Approach to Pipeline FFT Processor. Department of Applied Electronics, Lund University.

[2] Koushik Maharatna, Eckhard Grass, Ulrich Jagdhold. A 64-Point Fourier Transform Chip for High-Speed Wireless LAN Application Using OFDM. IEEE Journal of Solid-State Circuits, Vol. 39, No. 3, March 2004.

[3] 吳安宇. Modified radix-2^3 FFT. Graduate Institute of Electronics Engineering, NTU.

[4] Wada Tomohisa. 64 Point Fast Fourier Transform Circuit (Version 1.0). Available: http://bw-www.ie.uryukyu.ac.jp/~wada/design07/spec_e.html

[5] J. A. Hidalgo. A Radix-8 Multiplier Unit Design For Specific Purpose. Dept. de Electronics, E.T.S.I Industriales.

[6] Joel J. Fúster, Karl S. Gugel. Pipelined 64-Point Fast Fourier Transform For Programmable Logic Devices. Dept. of Electrical and Computer Engineering, University of Florida.

[7] Geoff Knagge. ASIC Design for Signal Processing. Available: http://www.geoffknagge.com/

[8] Lo'ai A. Tawalbeh, Alexandre F. Tenca, C. K. Ko. A Radix-4 Design of a Scalable Modular Multiplier With Recoding Techniques. School of Electrical Engineering & Computer Science, Oregon State University.
