Fast Fourier Transform


Abstract: The discrete Fourier transform (DFT) is an important operation in digital communication systems. However, computing the DFT directly is expensive, and the fast Fourier transform (FFT) is an algorithm that computes it efficiently. The FFT can be implemented as either decimation in time or decimation in frequency, and, depending on the decomposition, it can be radix-2 or radix-4. We propose a split-radix FFT design in which a radix-2 mapping is applied to the even-indexed terms and a radix-4 mapping to the odd-indexed terms. The design is to be implemented on a Xilinx Spartan FPGA. We present speed, area, and power results comparing the split-radix implementation with a single-radix implementation.

Keywords: FFT, fast Fourier transform, discrete Fourier transform, FPGA, decimation in time, decimation in frequency

Introduction

The fast Fourier transform is a fast implementation of the DFT. It is based on a divide-and-conquer approach in which the DFT computation is divided into smaller, simpler problems and the final DFT is rebuilt from the simpler DFTs. Another application of this divide-and-conquer approach is the computation of very large FFTs, in which the time data and their DFT are too large to be stored in main memory. In such cases the FFT is done in parts and the results are pieced together to form the overall FFT, which is saved in secondary storage such as a hard disk.

In the simplest Cooley-Tukey version of the FFT, the dimension of the DFT is successively divided in half until it becomes unity. This requires the initial dimension N to be a power of two:

$$N = 2^B \quad\Longleftrightarrow\quad B = \log_2(N) \tag{1}$$

The problem of computing the N-point DFT is replaced by the simpler problems of computing two (N/2)-point DFTs. Each of these is replaced by two (N/4)-point DFTs, and so on. We will see shortly that an N-point DFT can be rebuilt from two (N/2)-point DFTs at an additional cost of N/2 complex multiplications. This basic merging step is shown in Fig. 1. Thus, if we compute the two (N/2)-DFTs directly, at a cost of $(N/2)^2$ multiplications each, the total cost of rebuilding the full N-DFT will be:

$$2\left(\frac{N}{2}\right)^2 + \frac{N}{2} = \frac{N^2}{2} + \frac{N}{2} \simeq \frac{N^2}{2}$$

where for large N the quadratic term dominates. This amounts to a 50 percent saving over computing the N-point DFT directly at a cost of $N^2$.

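Fig. 1. Merging two (N/2)-DFTs into an N-DFT and its repeated application.

Similarly, if the two (N/2)-DFTs were computed indirectly by rebuilding each of them from two (N/4)-DFTs, the total cost for rebuilding an N-DFT would be:

$$4\left(\frac{N}{4}\right)^2 + 2\,\frac{N}{4} + \frac{N}{2} = \frac{N^2}{4} + 2\,\frac{N}{2} \simeq \frac{N^2}{4}$$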

Thus, we gain another factor of two, or a factor of four in efficiency over the direct N-point DFT. In the above equation, there are 4 direct (N/4)-DFTs at a cost of $(N/4)^2$ each. Merging them pairwise into two (N/2)-DFTs costs an additional N/4 per merge, and the final merge requires another N/2. Proceeding in a similar fashion, we can show that if we start with $(N/2^m)$-point DFTs and perform m successive merging steps, the total cost to rebuild the final N-DFT will be:

$$\frac{N^2}{2^m} + \frac{N}{2}\,m \tag{2}$$

The first term, $N^2/2^m$, corresponds to performing the initial $(N/2^m)$-point DFTs directly. Because there are $2^m$ of them, they will require a total cost of $2^m (N/2^m)^2 = N^2/2^m$. However, if the subdivision process is continued for m = B stages, as shown in Fig. 1, the final dimension will be $N/2^m = N/2^B = 1$, which requires no computation at all because the 1-point DFT of a 1-point signal is itself. In this case, the first term in Eq. (2) will be absent, and the total cost will arise from the second term. Thus, carrying out the subdivision/merging process to its logical extreme of $m = B = \log_2(N)$ stages allows the computation to be done in:

$$\frac{1}{2}\,N B = \frac{1}{2}\,N \log_2(N) \tag{3}$$

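As a quick numerical check of this cost count, the following Python sketch evaluates $N^2/2^m + (N/2)\,m$ for a chosen number of merging stages m, dropping the first term when $m = \log_2(N)$ because one-point DFTs need no multiplications. The function name `merge_cost` is a hypothetical helper introduced here for illustration only.

```python
def merge_cost(N: int, m: int) -> int:
    """Complex multiplications needed to build an N-point DFT from direct
    (N / 2**m)-point DFTs followed by m merging stages."""
    if N < 1 or N & (N - 1):
        raise ValueError("N must be a power of two")
    B = N.bit_length() - 1          # B = log2(N)
    if not 0 <= m <= B:
        raise ValueError("m must satisfy 0 <= m <= log2(N)")
    # 2**m direct DFTs of size N / 2**m cost N**2 / 2**m in total;
    # one-point DFTs (m == B) need no multiplications at all.
    direct = 0 if m == B else (N * N) // (2 ** m)
    merging = (N // 2) * m          # N/2 multiplications per merging stage
    return direct + merging


if __name__ == "__main__":
    for m in range(4):
        print(m, merge_cost(8, m))  # costs for N = 8: 64, 36, 24, 12
```

For N = 1024, the fully subdivided case `merge_cost(1024, 10)` gives 5120 multiplications, compared with 1,048,576 for the direct computation.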
It can be seen in Fig. 1 that the total number of multiplications needed to perform all the mergings in each stage is N/2, and B is the number of stages. Thus, we may interpret Eq. (3) as

(total multiplications) = (multiplications per stage) × (no. of stages) = (N/2) B

For the N = 8 example shown in Fig. 1, we have $B = \log_2(8) = 3$ stages and N/2 = 8/2 = 4 multiplications per stage. Therefore, the total cost is BN/2 = 3 × 4 = 12 multiplications.

Next, we discuss the so-called decimation-in-time radix-2 FFT algorithm. There is also a decimation-in-frequency version, which is very similar. The term radix-2 refers to the choice of N as a power of 2 in Eq. (1). Given a length-N sequence x(n), n = 0, 1, ..., N-1, its N-point DFT X(k) can be written in component form:

$$X(k) = \sum_{n=0}^{N-1} W_N^{kn}\, x(n), \qquad k = 0, 1, \dots, N-1, \qquad W_N = e^{-2\pi j/N} \tag{4}$$

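For reference, the DFT definition of Eq. (4) can be evaluated directly. The sketch below is a plain $O(N^2)$ Python implementation, assuming the twiddle factor $W_N = e^{-2\pi j/N}$; the function name `dft` is chosen here for illustration and is not taken from any library.

```python
import cmath


def dft(x):
    """Direct N-point DFT: X(k) = sum over n of W_N^(k*n) * x(n),
    with W_N = exp(-2*pi*j/N). Costs N*N complex multiplications."""
    N = len(x)
    X = []
    for k in range(N):
        acc = 0j
        for n in range(N):
            acc += cmath.exp(-2j * cmath.pi * k * n / N) * x[n]
        X.append(acc)
    return X
```

A unit impulse `[1, 0, 0, 0]` transforms to four ones, and the constant signal `[1, 1, 1, 1]` transforms to `[4, 0, 0, 0]` up to rounding error.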
The summation index n ranges over both even and odd values in the range $0 \le n \le N-1$. By grouping the even-indexed and odd-indexed terms, we may rewrite Eq. (4) as

$$X(k) = \sum_{n} W_N^{k(2n)}\, x(2n) + \sum_{n} W_N^{k(2n+1)}\, x(2n+1)$$

To determine the proper range of summations over n, we consider the two terms separately. For the even-indexed terms, the index 2n must be within the range $0 \le 2n \le N-1$. But, because N is even (a power of two), the upper limit N-1 will be odd. Therefore, the highest even index will be N-2. This gives the range:

$$0 \le 2n \le N-2 \quad\Rightarrow\quad 0 \le n \le \frac{N}{2}-1$$

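Similarly, for the odd-indexed terms, we must have $0 \le 2n+1 \le N-1$. Now the upper limit can be realized, but the lower one cannot; the smallest odd index is unity. Thus, we have:

$$1 \le 2n+1 \le N-1 \quad\Rightarrow\quad 0 \le 2n \le N-2 \quad\Rightarrow\quad 0 \le n \le \frac{N}{2}-1$$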

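Therefore, the summation limits are the same for both terms:

$$X(k) = \sum_{n=0}^{N/2-1} W_N^{k(2n)}\, x(2n) + \sum_{n=0}^{N/2-1} W_N^{k(2n+1)}\, x(2n+1) \tag{5}$$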

This expression leads us to define the two length-(N/2) subsequences:

$$g(n) = x(2n), \qquad h(n) = x(2n+1), \qquad n = 0, 1, \dots, \frac{N}{2}-1 \tag{6}$$

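and their (N/2)-point DFTs:

$$G(k) = \sum_{n=0}^{N/2-1} W_{N/2}^{kn}\, g(n), \qquad H(k) = \sum_{n=0}^{N/2-1} W_{N/2}^{kn}\, h(n), \qquad k = 0, 1, \dots, \frac{N}{2}-1 \tag{7}$$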

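Then, the two terms of Eq. (5) can be expressed in terms of G(k) and H(k). We note that the twiddle factors $W_N$ and $W_{N/2}$ of orders N and N/2 are related as follows:

$$W_{N/2} = e^{-2\pi j/(N/2)} = e^{-4\pi j/N} = W_N^2$$

or

$$W_N^{k(2n)} = W_N^{2kn} = W_{N/2}^{kn}$$

and

$$W_N^{k(2n+1)} = W_N^{k}\, W_N^{2kn} = W_N^{k}\, W_{N/2}^{kn}$$

So,

$$\sum_{n=0}^{N/2-1} W_N^{k(2n)}\, x(2n) = \sum_{n=0}^{N/2-1} W_{N/2}^{kn}\, g(n) = G(k), \qquad \sum_{n=0}^{N/2-1} W_N^{k(2n+1)}\, x(2n+1) = W_N^{k} \sum_{n=0}^{N/2-1} W_{N/2}^{kn}\, h(n) = W_N^{k} H(k)$$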

and

$$X(k) = G(k) + W_N^{k} H(k), \qquad k = 0, 1, \dots, N-1 \tag{8}$$

This is the basic merging result. It states that X(k) can be rebuilt out of the two (N/2)-point DFTs G(k) and H(k). There are N additional multiplications, $W_N^{k} H(k)$. Using the periodicity of G(k) and H(k), the additional multiplications may be reduced by half to N/2. To see this, we split the full index range $0 \le k \le N-1$ into two half-ranges parametrized by the two indices k and k + N/2:

$$0 \le k \le \frac{N}{2}-1 \quad\Rightarrow\quad \frac{N}{2} \le k + \frac{N}{2} \le N-1$$

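Therefore, we may write the N equations (8) as two groups of N/2 equations:

$$\begin{aligned} X(k) &= G(k) + W_N^{k} H(k) \\ X(k + N/2) &= G(k + N/2) + W_N^{k + N/2} H(k + N/2) \end{aligned} \qquad k = 0, 1, \dots, \frac{N}{2}-1$$

Using the periodicity property that any DFT is periodic in k with period equal to its length, we have G(k + N/2) = G(k) and H(k + N/2) = H(k). We also have the twiddle factor property:

$$W_N^{N/2} = \left(e^{-2\pi j/N}\right)^{N/2} = e^{-j\pi} = -1$$

so that $W_N^{k+N/2} = -W_N^{k}$. Then, the DFT merging equations become:

$$\begin{aligned} X(k) &= G(k) + W_N^{k} H(k) \\ X(k + N/2) &= G(k) - W_N^{k} H(k) \end{aligned} \qquad k = 0, 1, \dots, \frac{N}{2}-1 \tag{9}$$

They are known as the butterfly merging equations. The upper group generates the upper half of the N-dimensional DFT vector X, and the lower group generates the lower half. The N/2 multiplications $W_N^{k} H(k)$ may be used in both the upper and the lower equations, thus reducing the total extra merging cost to N/2. Vectorially, we may write them in the form:

$$\begin{bmatrix} X_0 \\ X_1 \\ \vdots \\ X_{N/2-1} \end{bmatrix} = \begin{bmatrix} G_0 \\ G_1 \\ \vdots \\ G_{N/2-1} \end{bmatrix} + \begin{bmatrix} H_0 \\ H_1 \\ \vdots \\ H_{N/2-1} \end{bmatrix} \times \begin{bmatrix} W_N^{0} \\ W_N^{1} \\ \vdots \\ W_N^{N/2-1} \end{bmatrix}, \qquad \begin{bmatrix} X_{N/2} \\ X_{N/2+1} \\ \vdots \\ X_{N-1} \end{bmatrix} = \begin{bmatrix} G_0 \\ G_1 \\ \vdots \\ G_{N/2-1} \end{bmatrix} - \begin{bmatrix} H_0 \\ H_1 \\ \vdots \\ H_{N/2-1} \end{bmatrix} \times \begin{bmatrix} W_N^{0} \\ W_N^{1} \\ \vdots \\ W_N^{N/2-1} \end{bmatrix} \tag{10}$$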

where the indicated multiplication is meant to be component-wise. Together, the two equations generate the full DFT vector X. The operations are shown in Fig. 2.
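The butterfly merging step lends itself to a short recursive implementation. The following Python sketch is a minimal decimation-in-time radix-2 FFT under the assumptions of this section (N a power of two, $W_N = e^{-2\pi j/N}$); it is a software illustration only, not the proposed FPGA design, and the function name `fft` is chosen here for convenience.

```python
import cmath


def fft(x):
    """Recursive radix-2 decimation-in-time FFT of a sequence whose
    length is a power of two."""
    N = len(x)
    if N < 1 or N & (N - 1):
        raise ValueError("length must be a power of two")
    if N == 1:
        return [complex(x[0])]      # a one-point DFT is the sample itself
    G = fft(x[0::2])                # (N/2)-point DFT of even-indexed samples g(n)
    H = fft(x[1::2])                # (N/2)-point DFT of odd-indexed samples h(n)
    X = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * cmath.pi * k / N) * H[k]   # W_N^k H(k), computed once
        X[k] = G[k] + t             # upper half of the DFT vector
        X[k + N // 2] = G[k] - t    # lower half reuses the same product
    return X
```

Each call performs N/2 twiddle multiplications in its merging loop, and the recursion is $\log_2(N)$ levels deep, which matches the $(N/2)\log_2(N)$ count derived earlier. For example, `fft([1, 2, 3, 4])` returns `[10, -2+2j, -2, -2-2j]` up to rounding error.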

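Fig. 2. Butterfly merging builds upper and lower halves of length-N DFT.

As an example, consider the case N = 2. The twiddle factor is now $W_2 = -1$, but only its zeroth power appears, $W_2^0 = 1$. Thus, we get two 1-dimensional vectors, making up the final 2-dimensional DFT:

$$[X_0] = [G_0] + [H_0 W_2^0], \qquad [X_1] = [G_0] - [H_0 W_2^0]$$

For N = 4

$$\begin{bmatrix} X_0 \\ X_1 \end{bmatrix} = \begin{bmatrix} G_0 \\ G_1 \end{bmatrix} + \begin{bmatrix} H_0 W_4^0 \\ H_1 W_4^1 \end{bmatrix}, \qquad \begin{bmatrix} X_2 \\ X_3 \end{bmatrix} = \begin{bmatrix} G_0 \\ G_1 \end{bmatrix} - \begin{bmatrix} H_0 W_4^0 \\ H_1 W_4^1 \end{bmatrix}$$

For N = 8

$$\begin{bmatrix} X_0 \\ X_1 \\ X_2 \\ X_3 \end{bmatrix} = \begin{bmatrix} G_0 \\ G_1 \\ G_2 \\ G_3 \end{bmatrix} + \begin{bmatrix} H_0 W_8^0 \\ H_1 W_8^1 \\ H_2 W_8^2 \\ H_3 W_8^3 \end{bmatrix}, \qquad \begin{bmatrix} X_4 \\ X_5 \\ X_6 \\ X_7 \end{bmatrix} = \begin{bmatrix} G_0 \\ G_1 \\ G_2 \\ G_3 \end{bmatrix} - \begin{bmatrix} H_0 W_8^0 \\ H_1 W_8^1 \\ H_2 W_8^2 \\ H_3 W_8^3 \end{bmatrix}$$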

To begin the merging process shown in Fig. 1, we need to know the starting one-dimensional DFTs. Once these are known, they may be merged into DFTs of dimension 2, 4, 8, and so on. The starting one-point DFTs are obtained by the so-called shuffling or bit reversal of the input time sequence. Thus, the typical FFT algorithm consists of three conceptual parts:

1. Shuffling the N-dimensional input into N one-dimensional signals.
2. Performing N one-point DFTs.
3. Merging the N one-point DFTs into one N-point DFT.

Performing the one-dimensional DFTs is only a conceptual part that lets us pass from the time to the frequency domain. Computationally, it is trivial because the one-point DFT $X = [X_0]$ of a 1-point signal $x = [x_0]$ is itself, that is, $X_0 = x_0$, as follows by setting N = 1 in Eq. (4).

The shuffling process is shown in Fig. 3 for N = 8. It has $B = \log_2(N)$ stages. During the first stage, the given length-N signal block x is divided into two length-(N/2) blocks g and h by putting every other sample into g and the remaining samples into h. During the second stage, the same subdivision is applied to g, resulting in the length-(N/4) blocks {a, b}, and to h, resulting in the blocks {c, d}, and so on. Eventually, the signal x is time-decimated down to N length-1 subsequences. These subsequences form the starting point of the DFT merging process, which is depicted in Fig. 4 for N = 8. The butterfly merging operations are applied to each pair of DFTs to generate the next DFT of doubled dimension.

FFT Implementations

The fast Fourier transform (FFT), an efficient algorithm for computing the discrete Fourier transform (DFT), is one of the most important operations in modern digital signal processing and communication systems. The pipeline FFT is a special class of FFT algorithms that computes the FFT sequentially; it achieves real-time behavior with nonstop processing when data is continually fed through the processor.
Pipeline FFT architectures have been studied since the 1970s, when real-time large-scale signal processing requirements became prevalent. Several different architectures have been proposed, based on different decomposition methods, such as the Radix-2 Multipath Delay Commutator (R2MDC), Radix-2 Single-Path Delay Feedback (R2SDF), Radix-4 Single-Path Delay Commutator (R4SDC), and Radix-$2^2$ Single-Path Delay Feedback (R$2^2$SDF). More recently, Radix-$2^2$ to Radix-$2^4$ SDF FFTs were studied, and R$2^3$SDF was implemented and shown to be area efficient for 2 or 3 multipath channels. Each of these architectures can be classified as multipath or single-path. Multipath approaches can process multiple data inputs simultaneously, though they have limitations on the number of parallel data paths, FFT points, and radix. This paper focuses on single-path architectures.

From the hardware perspective, Field Programmable Gate Array (FPGA) devices are increasingly being used for hardware implementations in communications applications. FPGAs at advanced technology nodes can achieve high performance while offering more flexibility, faster design time, and lower cost. As such, FPGAs are becoming more attractive for FFT processing applications and are the target platform of this paper.

The primary goal of this research is to optimize pipeline FFT processors to achieve better performance and lower cost than prior-art implementations.

The FFT processor is an essential component of OFDM systems. Structured pipeline architectures are adopted to meet the energy and signal-processing requirements of mobile environments. He et al. (1998) proposed FFT architectures based on a decomposition method and the radix-$2^i$ algorithm. The algorithm was exploited in implementing the dominant elements to reduce the operation count and the storage capacity.
The storage capacity was optimized by adjusting the word length progressively. The area and power efficiency of the design was improved by using a multiplier based on distributed arithmetic. The specifications of the design were obtained using a 1024-point FFT processor.

Song-Nien Tang et al. (2012) proposed an FFT processor for various types of wireless networks, such as wireless LAN and wireless MAN. By adopting the Flexible-Radix Configuration Multiple Delay Feedback (FRCMDF) commutator, variable-length FFTs could be computed efficiently at high performance. To improve area and energy efficiency, an optimized multiplication method was also suggested. In addition, the architecture supported power scaling across the FFT modes. The chip was realized with an area of 3.2 mm², a signal-to-noise ratio of 40 dB, and a power consumption of 507 mW at 300 MHz. This 512-point FFT processor gave higher performance and lower power consumption than other designs.

Taesang Cho (2013) presented a 512-point radix-$2^5$ FFT processor for wireless personal area network applications. A modified version of the radix-$2^5$ FFT algorithm was used to reduce the hardware required. This technique reduced both the operation count and the memory capacity required. A complex multiplier was employed in place of a Booth multiplier. The architecture obtained an SNR of 35 dB with a word length of 12 bits at 1.2 V. The design was implemented in 90-nm technology with a gate count of 290,000 and a throughput of 2.5 Gbit/s at 310 MHz.

Kyung Heo et al. (2003) proposed an FFT processor using a mixed-radix algorithm and a new in-place technique.
This processor employed only two N-word memories for the FFT implementation, whereas existing FFT processors employ four. The architecture also met the area and signal-processing requirements. For a 512-point FFT processor, the number of clock cycles was 640 and the gate count was 37,000. Hence this design could reduce memory size and gate count compared with other FFT processors.

Shousheng He et al. (1998) discussed a 1024-point FFT processor using a pipeline architecture. This architecture exploited the regularity of the radix-$2^2$ FFT algorithm. The FFT processor was implemented with only four complex multipliers and a data memory of 1024 points. The chip was realized in 0.5 µm CMOS technology with an area of 40 mm² at a frequency of 30 MHz.

Han Ying et al. (2003) introduced an FFT processor based on a Xilinx FPGA. To reduce logic complexity, a serial mode was adopted in which the data passed through three operations: windowing by multiplication, FFT computation, and modulus-square computation. The clock frequency was increased to obtain better performance by combining serial and parallel architectures, thereby avoiding the bottleneck. It was shown that the processor obtained high performance and was suitable for digital signal processing applications.

Chang et al. (2003) presented an architecture based on the radix-4 algorithm. By integrating the feedforward and feedback commutator styles, this architecture obtained better hardware and memory utilization than other FFT processors. Several ROMs were needed to store the twiddle factors; the ROM size was reduced by a factor of 2 by exploiting redundancy.
Single-frequency networks are constructed on a large scale using the Coded Orthogonal Frequency Division Multiplexing (COFDM) system, with appropriate guard intervals to suppress the effect of echoes. Longer FFTs have to be implemented to demodulate every symbol in order to limit the loss in spectral efficiency to about 20%.

Petrovsky et al. (2006) proposed a technique for synthesizing split-radix FFT processors using a pipeline architecture. The technique was applied to FPGA hardware design, taking into account practical constraints such as frequency, transform length, and performance. The illustrated example architectures showed the ability of the technique to optimize the hardware. The transition from variable to fixed arithmetic, along with the associated accuracy issues, was also discussed.

Expected Results and Conclusion

By using various optimization techniques such as pipelining, parallel architecture, and split-radix implementation, we propose an FFT architecture that is optimized in terms of speed, power, and area. We will also show the interplay of these parameters and the tradeoffs involved in efficient design.

References

Jaishri Katekhaye, Amit Lamba, and Vipin, "Review on FFT processor for OFDM system," IJAICT, vol. 1, Nov. 2014.
Anwar Bhasha Pattan and Madhavi Latha, "Fast Fourier transform architectures: A survey," IJAECT.
Weidong Li and Lars Wanhammar, "Low-power FFT processors," IJAECT.
Weidong Li and Lars Wanhammar, "VLSI based FFT processor with improvement in computation speed and area reduction," IJECSE, 2013.
H. Sorensen, D. Jones, M. Heideman, and C. Burrus, "Real-valued fast Fourier transform algorithms," IEEE Trans. Acoust., Speech, Signal Process., vol. 35, no. 6, pp. 849–863, Jun. 1987.
S. He and M. Torkelson, "Design and implementation of a 1024-point pipeline FFT processor," in Proc. IEEE Custom Integr. Circuits Conf.
J. Lee, H. Lee, S. I. Cho, and S. S. Choi, "A high-speed two parallel radix-$2^4$ FFT/IFFT processor for MB-OFDM UWB systems," in Proc. IEEE Int. Symp. Circuits Syst., May 2006, pp. 4719–4722.
M. Ayinala, M. Brown, and K. K. Parhi, "Pipelined parallel FFT architectures via folding transformation," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 6, pp. 1068–1081, Jun. 2012.
R. Radhouane, P. Liu, and C. Modin, "Minimizing the memory requirement for continuous flow FFT implementation: Continuous flow mixed mode FFT (CFMM-FFT)," in Proc. IEEE Int. Symp. Circuits Syst., May 2000, pp. 116–119.
