[ieee 2012 international conference on devices, circuits and systems (icdcs 2012) - coimbatore...

Architectural Design of a Highly Programmable Radix-2 FFT Processor with Efficient Addressing Logic

Saikat Kumar Shome1, Abhinav Ahesh2, Durgesh Kr Gupta3 and SRK Vadali4

1,4CSIR-Central Mechanical Engineering Research Institute, Durgapur. 2,3National Institute of Technology, Durgapur.

{saikatkumarshome', ashesh,nitd2, dkgnitd2}@gmail.com, 4srk _ [email protected]

Abstract - A large number of efficient fixed-geometry Fast Fourier Transform (FFT) VLSI designs have been developed to date. We propose a novel architectural design for a highly programmable Radix-2 Decimation-In-Frequency (DIF) FFT processor using relatively simple memory addressing logic. The 5-level programmability of the design allows computation of a 64, 128, 256, 512 or 1024 point FFT of the input signal, depending on the application. In addition, the architecture provides the flexibility of computing an N point FFT for M length data (N > M), i.e. with enhanced resolution. A complete system flow of the entire FFT architecture, along with twiddle factor multiplication, bit reversal and a detailed, efficient Address Generation Block (AGB), is also presented. The address generation methodology adopted for the proposed design is based on counters and multiplexers, which significantly reduces both the hardware requirement and the latency it introduces.

I. INTRODUCTION

The Discrete Fourier Transform (DFT) has proved itself to be a key component in several research areas, including signal processing, communication and condition monitoring. Direct implementation of the DFT has $O(N^2)$ complexity, which reduces to $O(N \log_2 N)$ through the FFT. The Cooley-Tukey algorithm, usually called the Fast Fourier Transform, is a collection of algorithms for quicker calculation of the DFT. The FFT transforms a waveform into a series of sines and cosines at each frequency present in the original signal. Typically, a waveform is "sampled" many times over its period, and the number of samples affects the frequency resolution. It is a bi-directional transform: a time waveform can be recovered from the Inverse Fast Fourier Transform (IFFT) of the spectrum if both the amplitude and the phase angle are retained. For a series representing a time sequence of length N, Fig. 1 illustrates the values in the frequency domain.

Figure 1. Symmetry property of the DFT

The FFT is a computationally intensive digital signal processing function widely used in applications such as imaging, software-defined radio, wireless communication, instrumentation and machine inspection. Historically, it has been a relatively difficult function to implement optimally in hardware, leading many designers to use digital signal processors in software implementations. Unfortunately, because of the function's computationally intensive nature, such an approach typically requires multiple digital signal processors within the system to support the processing requirements.

FPGA co-processors have become an extremely cost-effective means of off-loading computationally intensive algorithms to improve overall system performance. A major advantage of advanced-technology FPGA nodes is that they can achieve higher performance or throughput while offering more flexibility, faster design time and lower cost. For this reason, FPGAs are becoming increasingly attractive for computationally intensive, FFT-based complex processing applications and are considered as the target platform in the present work.

Extensive study has been carried out on the design of FFT architectures for VLSI platforms. Most of the designs in the literature are based on a fixed-geometry scheme [1], with emphasis on higher throughput [2] or on area and power minimization [3,4]. The fact that relatively little research work is aimed at exploiting the programmable aspect of the FFT on an FPGA encouraged us to focus on this issue in the present work. The proposed architecture allows for 2-tier flexibility. First, the user can determine, through select inputs, the number of data samples on which the FFT is to be computed. Second, the 5-level programmability allows for the computation of a 64, 128, 256, 512 or 1024 point FFT on those data samples. The flexibility of choosing N, the length of the FFT, at run time adds a new dimension to effective spectral computation, which is a common requirement. Depending upon the requirement of a given application, one can selectively control the number of input sample points and the length of the FFT for higher-resolution spectral computation. The present work thus focuses on programmability, a little-explored facet of state-of-the-art FFT architectures for FPGA implementation.

The proposed design also gives an address generation scheme for twiddle factor multiplication with relatively simple hardware logic. Conventional pipelined FFT addressing schemes utilize dedicated address generators in order to enable parallel access to the memory units and in-place calculation. Memory segmentation and simplified control logic have been studied in [5], [6], [7]. In particular, Ma [8], [9], [10] proposed a method to realize parallel addressing of the radix-2 FFT by using two banks of two-port memories. The drawback of the method given in [9] is the requirement of barrel shifters and the extra buffers used for writing two simultaneous outputs to memory. These address generator units require substantial hardware logic, including counters and rotational shifters to generate the addresses and registers to buffer the outputs of each butterfly operation.

The rest of the paper is organized as follows: Section II presents the theoretical background of the FFT algorithm. Section III describes the architecture of the proposed flexibly programmable FFT processor, while Section IV presents the address generation methodology for the butterfly operation. Section V ends the paper with a few concluding remarks on the work.

II. FAST FOURIER TRANSFORM

The FFT is an efficient method of implementing the Discrete Fourier Transform (DFT). The DFT of N complex data points x(k) is defined as

$$X(n) = \sum_{k=0}^{N-1} x(k)\, e^{-j 2\pi n k / N}, \qquad n = 0, 1, 2, \ldots, N-1 \qquad (1)$$

$$X(n) = \sum_{k=0}^{N-1} x(k)\, W_N^{nk} \qquad (2)$$

where $W_N^{nk} = e^{-j 2\pi n k / N}$ is called the twiddle factor.
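To make the cost of the definition concrete, equations (1) and (2) can be evaluated literally. The following is a minimal sketch only; the function names `twiddle` and `direct_dft` are illustrative assumptions and do not come from the paper, and floating-point arithmetic stands in for the 16-bit data path of the hardware.

```python
import cmath

def twiddle(N, m):
    """Twiddle factor W_N^m = exp(-j*2*pi*m/N) from equation (2)."""
    return cmath.exp(-2j * cmath.pi * m / N)

def direct_dft(x):
    """Direct DFT: X(n) = sum over k of x(k) * W_N^(n*k), for n = 0..N-1.

    The two nested loops perform N*N complex multiplications, which is the
    O(N^2) cost that the FFT reduces to O(N log2 N).
    """
    N = len(x)
    return [sum(x[k] * twiddle(N, n * k) for k in range(N)) for n in range(N)]

# A constant input concentrates all of its energy in bin 0.
spectrum = direct_dft([1.0, 1.0, 1.0, 1.0])
```

Such a direct evaluation is mainly useful as a reference against which a fast implementation can be checked.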

The direct implementation of the DFT has a complexity of $O(N^2)$. Using the FFT, the complexity can be reduced to $O(N \log_2 N)$. The FFT is also more suitable for hardware implementation due to the physical regularity of the algorithm [11]. Two approaches exist for reducing the DFT computation complexity using Fast Fourier Transform algorithms: one is to perform Decimation In Frequency (DIF) and the other is to perform Decimation In Time (DIT). Both approaches require the same number of complex multiplications and additions. The key difference between the two is that the Cooley-Tukey FFT algorithm first rearranges the input elements in bit-reversed order and then builds the output transform in normal order (decimation in time), as in (4). The Sande-Tukey algorithm, on the other hand, first transforms using normal-order inputs and then generates bit-reversed outputs (decimation in frequency), as shown in Figure 2 and (3). The manipulation of inputs and outputs is carried out by the so-called butterfly stages.

$$FFT_N = \sum_{k=0}^{N/2-1} x(k)\, e^{-j 2\pi n k / N} + \sum_{k=N/2}^{N-1} x(k)\, e^{-j 2\pi n k / N} \qquad (3)$$

$$FFT_N = \sum_{k=0}^{N/2-1} x(2k)\, e^{-j 2\pi n (2k) / N} + \sum_{k=0}^{N/2-1} x(2k+1)\, e^{-j 2\pi n (2k+1) / N} \qquad (4)$$

The radix-2 decimation-in-frequency algorithm rearranges the discrete Fourier transform equation into two parts: computation of the even-numbered discrete frequency indices X(k) for k = [0, 2, 4, ..., N-2], or X(2r), and computation of the odd-numbered indices k = [1, 3, 5, ..., N-1], or X(2r+1), as seen in Fig. 3.

$$X(2r) = DFT_{N/2}\left[ x(n) + x\left(n + \tfrac{N}{2}\right) \right] \qquad (5)$$

$$X(2r+1) = DFT_{N/2}\left\{ \left[ x(n) - x\left(n + \tfrac{N}{2}\right) \right] W_N^{n} \right\} \qquad (6)$$

In (5) and (6), n = 0, 1, ..., N/2 - 1 indexes the input samples.

Figure 2. Radix-2 DIF Butterfly Unit

Mathematical simplification reveals that the even-indexed and odd-indexed frequency outputs X(k) can each be computed by a length N/2 DFT. The inputs to these DFTs are sums or differences of the first and second halves of the input signal, respectively, where the input to the short DFT producing the odd-indexed frequencies is multiplied by a so-called twiddle factor term $W_N^n = e^{-j 2\pi n / N}$. This is called decimation in frequency because the frequency samples are computed separately in alternating groups, and a radix-2 algorithm because there are two groups. The conversion of the full DFT into a series of shorter DFTs with a simple processing step gives the decimation-in-frequency FFT its computational savings.
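The split into two half-length DFTs can be checked numerically. The sketch below is an illustration under stated assumptions: `dft` is a plain reference DFT and `dif_split` is a hypothetical helper name, neither taken from the paper. It performs one level of the decomposition and interleaves the results as X(2r) and X(2r+1).

```python
import cmath

def dft(x):
    """Reference DFT used for the two half-length transforms."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def dif_split(x):
    """One level of radix-2 decimation in frequency.

    Sums of the two input halves feed the DFT producing the even bins;
    twiddled differences feed the DFT producing the odd bins.
    """
    N = len(x)
    half = N // 2
    sums = [x[n] + x[n + half] for n in range(half)]
    diffs = [(x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / N)
             for n in range(half)]
    X = [0j] * N
    X[0::2] = dft(sums)    # X(2r)
    X[1::2] = dft(diffs)   # X(2r+1)
    return X

signal = [1, 2, 3, 4, 5, 6, 7, 8]
split_result = dif_split(signal)
full_result = dft(signal)
```

Applying the same split recursively to each half-length DFT yields the complete radix-2 DIF flow graph.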

Figure 3. Signal flow graph of radix-2 decimation in frequency (DIF) for N = 8

The full radix-2 decimation-in-frequency decomposition requires $M = \log_2 N$ stages, with each stage containing N/2 butterflies. Hence $(N/2)\log_2 N$ butterflies must be computed. Since each butterfly involves one complex multiplication and two complex additions, an N-point FFT requires $(N/2)\log_2 N$ complex multiplications and $N \log_2 N$ complex additions.
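As a worked instance of these counts, taking the largest transform length supported by the design, N = 1024:

$$\log_2 1024 = 10 \text{ stages}, \qquad \frac{N}{2} = 512 \text{ butterflies per stage}, \qquad \frac{N}{2}\log_2 N = 5120 \text{ butterflies in total}$$

This corresponds to 5120 complex multiplications and $N \log_2 N = 10240$ complex additions, against roughly $N^2 \approx 1.05 \times 10^6$ complex multiplications for the direct DFT.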

III. ARCHITECTURAL DESCRIPTION OF THE FFT PROCESSOR

The proposed FFT processor is designed to perform a variable-length FFT on the input data using the radix-2 decimation-in-frequency (DIF) algorithm. Although radix-4 is by far the most efficient in terms of computation time, radix-2 outperforms the other algorithms in terms of area and system frequency, which are very important for FPGA-based VLSI realizations. The proposed design allows for two layers of flexibility. In the first, the user can choose the number of data samples (M), out of a maximum of 1024, on which the FFT is to be performed. Next, using the programmable nature of the architecture, a 64, 128, 256, 512 or 1024 point FFT can be computed, based on the select inputs.

Table 1. Control Select Inputs for M vs N

S5 S4 S3 S2 S1 S0    M       N
0  0  0  0  0  0     64      64
0  0  0  0  0  1     64      128
0  0  0  0  1  0     64      256
0  0  0  0  1  1     64      512
0  0  0  1  1  1     64      1024
0  0  1  0  0  1     128     128
0  0  1  0  1  0     128     256
0  0  1  0  1  1     128     512
0  0  1  1  1  1     128     1024
0  1  0  0  1  0     256     256
0  1  0  0  1  1     256     512
0  1  0  1  1  1     256     1024
0  1  1  0  1  1     512     512
0  1  1  1  1  1     512     1024
1  1  1  1  1  1     1024    1024
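The select-input encoding of Table 1 can be summarized in a few lines of software. This is a sketch for illustration only: the names `SIZE_CODE` and `decode_select` are assumptions, and the rejection of combinations with N < M reflects the fact that Table 1 lists no such rows rather than any behavior stated for the hardware.

```python
# 3-bit size code shared by S5:S3 (data length M) and S2:S0 (FFT length N).
SIZE_CODE = {0b000: 64, 0b001: 128, 0b010: 256, 0b011: 512, 0b111: 1024}

def decode_select(select_word):
    """Decode a 6-bit select word S5..S0 into the pair (M, N)."""
    m = SIZE_CODE.get((select_word >> 3) & 0b111)   # S5 S4 S3
    n = SIZE_CODE.get(select_word & 0b111)          # S2 S1 S0
    if m is None or n is None:
        raise ValueError("select code not listed in Table 1")
    if n < m:
        raise ValueError("Table 1 lists no entry with N smaller than M")
    return m, n

example = decode_select(0b001011)   # S5..S0 = 001011
```

For instance, the word 001011 selects M = 128 input samples and a 512 point FFT.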

As a first step, 1024 data points from the incoming data stream are sliced out at a time for FFT computation and stored in RAM-I; this is equivalent to multiplication by a rectangular window of length 1024. The sampling rate of the input signal is selected in such a way that one complete cycle of the signal is accommodated within the window, so that the spectral leakage normally associated with the Fourier transform is minimized. Applying the FFT algorithm to an N point window of data taken from a signal results in a vector of N numbers corresponding to the energy at N frequencies spanning the range from 0 Hz to the sampling frequency.

A total of five 1024x16 bit dual-port RAMs are used in the design, as shown in Figure 6. The flow starts upon an active-high enable signal, EN'. This also starts a 10 bit counter, CTR-1, which not only drives the write-enable of RAM-I but also provides the write address for the real-time data being acquired. CTR-1 thus keeps track of the number of data values acquired in RAM-I and, via a two-input multiplexer arrangement, disables the write-enable of RAM-I once 1024 data points have been acquired.

Now, depending on select inputs S3-S5 (Table 1), M out of the 1024 data available in RAM-I are serially transferred to RAM-II, on which the FFT is to be computed. These select input values are fed to a second 10 bit up-counter, 0_ IN CTR, through some predefined logic, as seen in Figure 6. This counter counts up to M and, along with a comparator-multiplexer arrangement, disables the write-enable of RAM-II once M data have been acquired. Simultaneously, another enable pin, EN, is made high to initiate the actual FFT computation. This activates the control signals of the Control Block as well as the Address Generation Block, which in turn provide the RAMs with the proper addresses for the complex butterfly operation.

Table 2. Step Counter Bits as per Select Inputs

S2 S1 S0    D8  D7  D6  D5
0  0  0     0   0   0   0
0  0  1     0   0   0   X
0  1  0     0   0   X   X
0  1  1     0   X   X   X
1  1  1     X   X   X   X

Once the real data on which the FFT is to be computed are available in RAM-II, the challenge is the butterfly computation for each step of each stage of the FFT. The inputs to the butterfly unit are A and B, and the outputs are (A+B) and $(A-B)e^{-j 2\pi n / N}$, as shown in Fig. 2. The input data in the first stage and the final output data of the FFT processor are purely real in nature; however, multiplication of the (A-B) component by the twiddle factor generates an imaginary part in the intermediate stages. Initially, the imaginary part applied to the second input is zero. After the first stage of butterfly operation, however, the imaginary parts of the generated data are stored in another RAM, RAM-III, and used as the imaginary input during the succeeding butterfly operations, in the same way as the real part of the data is fetched from RAM-II. RAM-III is also a 1024x16 bit dual-port memory.
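The way purely real inputs acquire an imaginary part can be seen from a butterfly written on separate real and imaginary values, mirroring the split between RAM-II and RAM-III. The sketch below is an assumption-laden illustration: `dif_butterfly` is a hypothetical name, and floating-point arithmetic is used in place of the 16-bit words of the actual data path.

```python
import math

def dif_butterfly(a_re, a_im, b_re, b_im, n, N):
    """Radix-2 DIF butterfly on separate real and imaginary parts.

    Returns (sum_re, sum_im, diff_re, diff_im), where the sum is A + B and
    the difference is (A - B) * W_N^n with W_N^n = cos(t) - j*sin(t),
    t = 2*pi*n/N.
    """
    w_re = math.cos(2 * math.pi * n / N)
    w_im = -math.sin(2 * math.pi * n / N)
    sum_re, sum_im = a_re + b_re, a_im + b_im
    d_re, d_im = a_re - b_re, a_im - b_im
    # Complex multiplication (d_re + j*d_im) * (w_re + j*w_im).
    diff_re = d_re * w_re - d_im * w_im
    diff_im = d_re * w_im + d_im * w_re
    return sum_re, sum_im, diff_re, diff_im

# Real inputs A = 3, B = 1 with twiddle index n = 1 of an 8 point FFT.
outputs = dif_butterfly(3.0, 0.0, 1.0, 0.0, 1, 8)
```

In this example the sum output stays real (4 + j0), while the difference output becomes complex, which is why a second memory for the imaginary parts is needed after the first stage.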

Table 3. Variation of Initial Values of Stage Counter as per FFT Length

S2S1S0    FFT Length    Initial Value    Final Value    Count Value (log2 N)
000       64            4                9              6
001       128           3                9              7
010       256           2                9              8
011       512           1                9              9
111       1024          0                9              10

Once one butterfly operation has been completed, and before the next set of data is fetched, the intermediate data need to be stored for use in succeeding stages until the final result is available. These data are written back to the locations from which they were fetched, i.e. in-place computation (IPC), which uses the memory efficiently: the intermediate computed data overwrite the input data, thereby reducing the overall memory requirement. To perform all these operations concurrently, dual-port memories have been employed, which allow simultaneous read and write to any memory location. The real part of the data is written back through a multiplexer arrangement, so that the butterfly output is selected for writing to RAM locations that are the same as the read addresses, delayed by the time period of the butterfly computation. Similarly, the imaginary part of the generated data is written back to and read from RAM-III in the same way as the real part of the data in RAM-II.
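In-place computation can be sketched in software with a single buffer that each butterfly reads from and immediately overwrites. The following is an illustrative model, not the paper's hardware: `fft_dif_in_place` is a hypothetical name, one list of Python complex numbers stands in for the RAM-II/RAM-III pair, and the write-back delay of the pipeline is not modeled.

```python
import cmath

def fft_dif_in_place(data):
    """Radix-2 DIF FFT that overwrites its input buffer.

    The length must be a power of two. Results are left in bit-reversed
    order, as expected for the DIF algorithm.
    """
    N = len(data)
    half = N // 2
    while half >= 1:                        # one pass per stage
        for start in range(0, N, 2 * half):
            for n in range(half):
                i, j = start + n, start + n + half
                a, b = data[i], data[j]     # read both operands
                w = cmath.exp(-2j * cmath.pi * n / (2 * half))
                data[i] = a + b             # write back to the same addresses
                data[j] = (a - b) * w
        half //= 2
    return data

buffer = fft_dif_in_place([0, 1, 2, 3, 4, 5, 6, 7])
```

No second buffer is allocated at any stage, so the memory footprint stays at N complex words regardless of the number of stages.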

Figure 4. Logic for Stage Counter

Once the FFT computation of the final stage is over, the resulting data (both real and imaginary) appear in bit-reversed order, as dictated by the DIF algorithm. They therefore need to be arranged in proper sequence before the magnitude of the resulting frequency components is computed from the real and imaginary data. For this, the correct address corresponding to each bit-reversed address is generated by the bit-reversed address generator, and the output of the FFT unit is stored in two separate RAMs, RAM-IV and RAM-V, holding the real and imaginary data respectively.
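The reordering step amounts to reversing the bits of each address. A minimal software sketch follows; the names `bit_reverse` and `reorder` are illustrative assumptions and the hardware generator described in the paper is not reproduced here.

```python
def bit_reverse(address, bits):
    """Reverse the lowest `bits` bits of `address`."""
    reversed_address = 0
    for _ in range(bits):
        reversed_address = (reversed_address << 1) | (address & 1)
        address >>= 1
    return reversed_address

def reorder(data):
    """Copy bit-reversed-order results into natural order.

    The length of `data` must be a power of two.
    """
    bits = len(data).bit_length() - 1
    ordered = [None] * len(data)
    for address, value in enumerate(data):
        ordered[bit_reverse(address, bits)] = value
    return ordered

natural = reorder(list(range(8)))
```

For a 1024 point FFT the address width is 10 bits, so address 1 maps to address 512.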

Table 4. Variation of Final Values of Step Counter as per FFT Length

S2S1S0    FFT Length    Initial Value    Final Value    Count Value (N/2)
000       64            0                31             32
001       128           0                63             64
010       256           0                127            128
011       512           0                255            256
111       1024          0                511            512

Figure 5. Logic for Step Counter

Another set of select inputs, S0-S2, determines the value of N for which the FFT is to be computed on the M data available in RAM-II. FFT computation on N-length data requires $\log_2 N$ stages with N/2 butterfly steps in each stage. For this logic, two counters are employed: the Stage Counter and the Step Counter. The Stage Counter counts the number of stages of the FFT operation. It is a Mod-10, presettable counter whose initial value is preset to one of 0, 1, 2, 3 and 4, depending upon the select inputs, as shown in Table 3; it then counts from the preset value up to 9. Similarly, the Step Counter is a Mod-512, resettable counter whose four MSBs (D8:D5) may be held at 0 depending on the select inputs, as shown in Table 2. Thus, the Step Counter counts from 0 to one of 31, 63, 127, 255 and 511 (Table 4). The hardware logic blocks used for modification of the stage and step counters are shown in Figs. 4 and 5 respectively. The logic for the step and stage counters described in Tables 2-4 is realized as in Figs. 4 and 5 by solving the Karnaugh maps.
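The counter ranges of Tables 3 and 4 follow directly from N. The sketch below models them in software as an illustration; `counter_schedule` and `run_counters` are hypothetical names, and the model ignores the clocking and reset logic of the actual counters.

```python
def counter_schedule(n_points):
    """Return (stage_initial, stage_final, step_final) for an N point FFT.

    The Stage Counter ends at 9 for every length, so it is preset to
    10 - log2(N); the Step Counter runs from 0 to N/2 - 1.
    """
    stages = n_points.bit_length() - 1      # log2(N) for a power of two
    return 10 - stages, 9, n_points // 2 - 1

def run_counters(n_points):
    """List every (stage, step) pair visited during one FFT."""
    stage_initial, stage_final, step_final = counter_schedule(n_points)
    return [(stage, step)
            for stage in range(stage_initial, stage_final + 1)
            for step in range(step_final + 1)]

visited = run_counters(64)
```

For N = 64 this gives a Stage Counter range of 4 to 9 and a Step Counter range of 0 to 31, i.e. 6 x 32 = 192 butterflies, in agreement with the $(N/2)\log_2 N$ count.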

IV. ADDRESS GENERATION UNIT

The address generation unit of an FFT processor provides the data RAM with the proper addresses for the butterfly operations of a specific stage. A 512 point FFT is taken up for consideration. For a 512 point FFT there are 9 stages, each having 256 butterfly operations. To generate the read addresses of the RAM from which the data are to be read, two counters, step and stage, are used. The step counter, which counts from 0 to 255, tracks the butterfly operations within a stage. On completion of one cycle of the step count, the stage counter is incremented by 1; it proceeds from 0 to 8 for the 9 stages required to compute a 512 point FFT.

Table 5. Partial Address Generation Table for 1024 point FFT

Stage 0        Stage 1        Stage 2        ...    Stage 9
0 : 512        0 : 256        0 : 128        ...    0 : 1
1 : 513        1 : 257        1 : 129        ...    2 : 3
2 : 514        2 : 258        2 : 130        ...    4 : 5
3 : 515        3 : 259        3 : 131        ...    6 : 7
...            ...            ...            ...    ...
510 : 1022     766 : 1022     894 : 1022     ...    1020 : 1021
511 : 1023     767 : 1023     895 : 1023     ...    1022 : 1023

Table 6. Binary Equivalent of Addresses of Table 5 (Partial)

Stage 0                     Stage 1                     Stage 2
Index 1       Index 2       Index 1       Index 2       Index 1       Index 2
0000000000    1000000000    0000000000    0100000000    0000000000    0010000000
0000000001    1000000001    0000000001    0100000001    0000000001    0010000001
0000000010    1000000010    0000000010    0100000010    0000000010    0010000010
0000000011    1000000011    0000000011    0100000011    0000000011    0010000011
0000000100    1000000100    0000000100    0100000100    0000000100    0010000100
...           ...           ...           ...           ...           ...
0111111111    1111111111    0011111111    0111111111    0001111111    0011111111

As per the DIF algorithm, the addresses for the different stages and their binary equivalents are given in Tables 5 and 6 (tabulated there for the 1024 point case). Close observation of the memory indexes reveals that, for a 512 point FFT, the lower 8 bits ($\log_2 N - 1$) of the address index in the first stage are the output of an 8-bit up counter, with the MSB fixed at 0 and 1 for the first and second data of the butterfly respectively. Similarly, for the other stages, the bit that is fixed at 0 or 1 shifts towards the right, depending on the stage of operation, as we proceed from the first stage to the last. All addresses can therefore be generated from an 8-bit up counter, with a specified bit associated with the stage number alternating between 0 and 1 for two successive memory addresses. This is implemented for a 512 point FFT by using nine 3:1 multiplexers, as in Fig. 7. The multiplexers are fed with the output of the 8-bit up counter along with the LSB of the butterfly counter, which alternates between 0 and 1. The proper bit sequence is selected by generating the MUX select signals from the stage counter output, as per the truth table in Table 7 and the logical expressions for each select signal tabulated in Table 8.
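The bit-insertion rule described above can be modeled in a few lines. This sketch is a software analogue only, with shifts and masks standing in for the multiplexer network of Fig. 7; `butterfly_addresses` is a hypothetical name, and the stage argument is counted from 0 for the first stage of the selected FFT length.

```python
def butterfly_addresses(count, stage, n_points=1024):
    """Return the two operand addresses of one butterfly.

    `count` is the up-counter value (0 .. N/2 - 1). A single bit is inserted
    into it at a position that moves one place to the right per stage; the
    bit is 0 for the first operand and 1 for the second.
    """
    bits = n_points.bit_length() - 1        # log2(N) address bits
    position = bits - 1 - stage             # MSB at stage 0, LSB at last stage
    upper = count >> position               # counter bits left of the insert
    lower = count & ((1 << position) - 1)   # counter bits right of the insert
    first = (upper << (position + 1)) | lower
    second = first | (1 << position)
    return first, second

pairs_stage_0 = [butterfly_addresses(c, 0) for c in range(4)]
```

With the default of 1024 points, the pairs produced for stages 0, 1, 2 and 9 match the entries of Table 5, for example (3, 515) at stage 0 and (6, 7) at stage 9.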

Table 8. Logical expressions for generating MUX select signals

A, B, C and D are the stage counter bits (A is the MSB); X' denotes the complement of X.

S0    = A + B + C + D
S1(1) = A + B + C                      S1(0) = A'B'C'D
S2(1) = A'B'CD + A'B + AB'C'D'         S2(0) = A'B'CD'
S3(1) = A + B                          S3(0) = A'B'CD
S4(1) = A + BD + BC                    S4(0) = A'BC'D'
S5(1) = A'BC + AB'C'D'                 S5(0) = A'BC'D
S6(1) = A'BCD + AB'C'D'                S6(0) = A'BCD'
S7(1) = A
S8    = A

V. CONCLUSIONS

The proposed architecture provides a highly programmable FFT engine with the flexibility to choose between a computationally fast, small-length FFT and a high-resolution (large-sized) FFT, depending on the application, without any architectural modification. The proposed design uses an effective address generation method enabling simultaneous access to different memory banks, and utilizes in-place computation (IPC) to reduce memory consumption. The design can be used in various fast and low-cost instrumentation applications, using highly reconfigurable FPGAs for reliable, online spectral analysis. It can be used in a wide range of applications, including communication systems, radar signal processing, image processing and condition monitoring of industrial machinery.

REFERENCES

[1] J. H. Takala, T. S. Jarvinen, and H. T. Sorokin, "Conflict-Free Parallel Memory Access Scheme For FFT Processors," Proc. of the International Symposium on Circuits and Systems, ISCAS '03, vol. 4, pp. 524-527, May 2003.

[2] Y. K. Xie and B. Fu, "Design And Implementation Of High Throughput FFT Processor," Journal of Computer Research and Development, vol. 41, no. 6, pp. 1022-1029, 2004.

[3] J. A. Hidalgo, J. Lopez, E. Argüello, and E. L. Zapata, "Area Efficient Architecture For Fast Fourier Transform," IEEE Trans. Circuits Syst. II, vol. 46, no. 2, pp. 187-193, Feb. 1999.

[4] T. Pitkanen, T. Partanen, and J. Takala, "Low-Power Twiddle Factor Unit For FFT Computation," SAMOS 2007.

[5] D. Cohen, "Simplified Control Of FFT Hardware," IEEE Trans. Acoust., Speech, Signal Processing, vol. 24, pp. 577-579, December 1976.

[6] B. P. Sinha, J. Dattagupta, and A. Sen, "A Cost Effective FFT Processor Using Memory Segmentation," in Proceedings of IEEE ISCAS, vol. 1, pp. 20-23, March 1983.

[7] L. G. Johnson, "Conflict Free Memory Addressing For Dedicated FFT Hardware," IEEE Trans. Circuits Syst. II, vol. 39, pp. 312-316, May 1992.

[8] Yutai Ma, "A VLSI-Oriented Parallel FFT Algorithm," IEEE Transactions on Signal Processing, vol. 44, no. 2, pp. 445-448, June 1996.

[9] Yutai Ma, "An Effective Memory Addressing Scheme For FFT Processors," IEEE Transactions on Signal Processing, vol. 47, no. 3, pp. 907-911, March 1999.

[10] Yutai Ma and L. Wanhammar, "A Hardware Efficient Control Of Memory Addressing For High-Performance FFT Processors," IEEE Transactions on Signal Processing, vol. 48, no. 3, pp. 917-921, March 2000.

[11] John G. Proakis and Dimitris G. Manolakis, "Digital Signal Processing: Principles, Algorithms and Applications," Pearson Prentice Hall, 2007.

Figure 6. Overall Architecture (block labels recoverable from the diagram: select inputs, reset control, delayed feedback address, magnitude computation block)

Table 7. Truth table for generating MUX select signals for 512 point FFT

Stage      S0    S1        S2        S3        S4        S5        S6        S7        S8
counter          (1) (0)   (1) (0)   (1) (0)   (1) (0)   (1) (0)   (1) (0)   (1) (0)
0000       0     0   0     0   0     0   0     0   0     0   0     0   0     0   0     0
0001       1     0   1     0   0     0   0     0   0     0   0     0   0     0   0     0
0010       1     1   0     0   1     0   0     0   0     0   0     0   0     0   0     0
0011       1     1   0     1   0     0   1     0   0     0   0     0   0     0   0     0
0100       1     1   0     1   0     1   0     0   1     0   0     0   0     0   0     0
0101       1     1   0     1   0     1   0     1   0     0   1     0   0     0   0     0
0110       1     1   0     1   0     1   0     1   0     1   0     0   1     0   0     0
0111       1     1   0     1   0     1   0     1   0     1   0     1   0     0   1     0
1000       1     1   0     1   0     1   0     1   0     1   0     1   0     1   0     1
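The rows of Table 7 follow a regular pattern: each select signal depends only on how the stage count compares with the multiplexer index. The sketch below expresses that reading in software. It is an interpretation inferred from the table rows rather than something stated in the text, and the name `mux_selects` is a hypothetical helper.

```python
def mux_selects(stage, n_stages=9):
    """Build one truth-table row for a given stage count.

    The row is ordered as in Table 7:
    [S0, S1(1), S1(0), ..., S7(1), S7(0), S8].
    For a middle multiplexer k, the (1) bit is set once the stage count has
    passed k, and the (0) bit is set only while the stage count equals k.
    """
    row = [1 if stage >= 1 else 0]                      # S0: single select bit
    for k in range(1, n_stages - 1):                    # S1 .. S7
        row += [1 if stage > k else 0, 1 if stage == k else 0]
    row.append(1 if stage == n_stages - 1 else 0)       # S8: single select bit
    return row

row_for_stage_3 = mux_selects(3)
```

Evaluating the function for stage counts 0 to 8 gives the nine rows of the truth table, which is one way to cross-check hand-derived Karnaugh-map expressions against it.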

Figure 7. Address Generation Block
