

Page 1: ENERGY EFFICIENT PARAMETERIZED FFT ...halcyon.usc.edu/~pk/prasannawebsite/papers/2013/fpl13.pdfENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE Ren Chen, Hoang Le, and Viktor K. Prasanna

ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE∗

Ren Chen, Hoang Le, and Viktor K. Prasanna

Ming Hsieh Department of Electrical Engineering
University of Southern California, Los Angeles, USA 90089

Email: {renchen, hoangle, prasanna}@usc.edu

ABSTRACT

Recently, there has been growing interest within the research community in improving energy efficiency. In this paper, we revisit the classic Fast Fourier Transform (FFT) for energy efficient designs on FPGAs. A parameterized FFT architecture is proposed to identify design trade-offs in achieving energy efficiency. We first perform design space exploration by varying the algorithm mapping parameters, such as the degree of vertical and horizontal parallelism, that characterize decomposition-based FFT algorithms. After empirical selection of the values of the algorithm mapping parameters, an energy-performance-area trade-off design for energy efficiency is identified by varying the architecture parameters, including the type of memory elements, the type of interconnection network, and the number of pipeline stages. The trade-offs between energy, area, and time are analyzed using two performance metrics: the Energy×Area×Time (EAT) composite metric and the energy efficiency (defined as the number of operations per Joule). From the experimental results, a design space is generated to demonstrate the effect of these parameters on the various performance metrics. For N-point FFT (16 ≤ N ≤ 1024), our designs achieve up to 28% and 38% improvement in energy efficiency and EAT, respectively, compared with a state-of-the-art design.

1. INTRODUCTION

FPGA is a promising implementation technology for computationally intensive applications such as signal, image, and network processing tasks [1, 2]. State-of-the-art FPGAs offer high operating frequency, unprecedented logic density, and a host of other features. As FPGAs are programmed specifically for the problem to be solved, they can achieve higher performance with lower power consumption than general purpose processors.

Fast Fourier Transform (FFT) is one of the most frequently used kernels for Discrete Fourier Transform (DFT) in a wide variety of image and signal processing applications. Various derivative FFT algorithms have been proposed and developed. The Radix-x Cooley-Tukey algorithm is one of the most popular algorithms for hardware implementation [3, 4, 5, 6]. Most hardware solutions for Radix-x FFT fall into the following categories: delay feedback or delay commutator architectures [4], such as Radix-2^2 single-path delay feedback FFT [4], Radix-4 single-path delay commutator FFT [5], etc. By focusing on circuit level optimizations, these solutions achieved improvement in throughput, area, or power.

∗This work has been funded by DARPA under grant number HR0011-12-2-0023.

Power is a key metric in computing today. To obtain an energy efficient design for FFT, we analyze the trade-offs between energy, area, and time for fixed-point FFT on a parameterized architecture, using the Cooley-Tukey algorithm. Energy efficiency can be obtained both at the algorithm mapping level and the architecture level [7, 8]. Optimizing at these two levels allows power to be effectively traded off with other performance parameters. For example, a design consuming 2× power but achieving 3× system throughput is actually 50% more energy efficient than the original design. We present the design space for the chosen architecture with respect to energy efficiency at the algorithm mapping level. An energy-performance-area trade-off design is achieved at the architecture level by empirical selection of the proposed architecture parameters. In this paper, we make the following contributions:

1. A parameterized architecture of the Radix-4 Cooley-Tukey algorithm for FFT (Section 3.1).

2. A design space that demonstrates the effect of the parameters on the EAT and the energy efficiency metric (Section 4.3.2).

3. A demonstration of improved energy efficiency of the proposed trade-off design, obtained by identifying the energy hot-spots and varying the proposed architecture parameters (Section 4.3.2).

4. Optimized designs achieving significant improvement in energy efficiency compared with a state-of-the-art design (Section 4.4).

The rest of the paper is organized as follows. Section 2 covers the background and related work. Section 3 describes the proposed parameterized architecture and its implementation on FPGA. Section 4 presents experimental results and analysis. Section 5 concludes the paper.


2. BACKGROUND AND RELATED WORK

2.1. Background

Given N complex numbers x_0, ..., x_{N−1}, the DFT is computed as $X_k = \sum_{n=0}^{N-1} x_n e^{-i 2\pi k n / N}$, $k = 0, \ldots, N-1$. Radix-x Cooley-Tukey FFT is a well-known decomposition-based algorithm for the N-point DFT. Radix-4 FFT is employed in this paper; its description is presented in Algorithm 1. In terms of the number of real operations, the computational complexity of N-point Radix-4 FFT is O(N log4 N). The algorithm performs an N-point FFT in N/m (m < N) cycles using m Input/Output ports (I/Os) and log4 N radix blocks, which are used for butterfly computations. The algorithm iteratively decomposes the entire problem into four subproblems. This feature enables us to map the algorithm by folding the FFT architecture vertically or horizontally, thus providing much freedom to implement various designs on FPGAs. We will propose our parameterized architecture in Section 3.2 based on this characteristic.

Algorithm 1 Radix-4 FFT Algorithm
  q = N/4; d = N/4;
  for p := 0 to log4 N − 1 do
    for k := 0 to 4^p − 1 do
      l = 4kq/4^p; r = l + q/4^p − 1;
      tw1 = w[k]; tw2 = w[2k]; tw3 = w[3k];
      for i := l to r do
        t0 = i; t1 = i + d/4^p; t2 = i + 2d/4^p; t3 = i + 3d/4^p;
        do parallel
          f_{p+1}[t0] = f_p[t0] + f_p[t1] + f_p[t2] + f_p[t3];
          f_{p+1}[t1] = f_p[t0] − j f_p[t1] − f_p[t2] + j f_p[t3];
          f_{p+1}[t2] = f_p[t0] − f_p[t1] + f_p[t2] − f_p[t3];
          f_{p+1}[t3] = f_p[t0] + j f_p[t1] − f_p[t2] − j f_p[t3];
        end parallel
        do parallel
          f_{p+1}[t0] = f_{p+1}[t0];
          f_{p+1}[t1] = tw1 × f_{p+1}[t1];
          f_{p+1}[t2] = tw2 × f_{p+1}[t2];
          f_{p+1}[t3] = tw3 × f_{p+1}[t3];
        end parallel
      end for
    end for
  end for
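The four-way decomposition behind Algorithm 1 can be cross-checked in software. The following is a minimal Python sketch, not the authors' hardware mapping: it uses a recursive decimation-in-frequency formulation (an assumption chosen for brevity) with the same butterfly-then-twiddle order, and compares the result against a direct evaluation of the DFT definition. The function names `naive_dft` and `radix4_fft` are illustrative.

```python
import cmath


def naive_dft(x):
    """Direct O(N^2) evaluation of X_k = sum_n x_n * exp(-i*2*pi*k*n/N)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def radix4_fft(x):
    """Recursive radix-4 decimation-in-frequency FFT (length must be a power of 4)."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 4 != 0:
        raise ValueError("length must be a power of 4")
    q = n // 4
    branches = [[0j] * q for _ in range(4)]
    for i in range(q):
        a, b, c, d = x[i], x[i + q], x[i + 2 * q], x[i + 3 * q]
        # Twiddle factors w^i, w^(2i), w^(3i) with w = exp(-i*2*pi/N).
        w1 = cmath.exp(-2j * cmath.pi * i / n)
        w2 = w1 * w1
        w3 = w2 * w1
        # Radix-4 butterfly followed by the twiddle multiplication.
        branches[0][i] = a + b + c + d
        branches[1][i] = w1 * (a - 1j * b - c + 1j * d)
        branches[2][i] = w2 * (a - b + c - d)
        branches[3][i] = w3 * (a + 1j * b - c - 1j * d)
    out = [0j] * n
    for r in range(4):
        # Branch r yields the outputs X[4k + r].
        out[r::4] = radix4_fft(branches[r])
    return out
```

Each recursion level corresponds to one radix-4 stage, so a 16-point transform uses two stages and a 1024-point transform uses five.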

2.2. Related Work

To the best of our knowledge, there has been no previous work targeted at exploring the design space for energy efficiency of FFT at both the algorithm mapping level and the architecture level on FPGAs. Existing work has mainly focused on optimizing the performance, power, and area of the design at the circuit level.

In [9], the authors designed an energy-efficient 1024-point FFT processor. A cache-based FFT algorithm was proposed for achieving low power and high performance. An energy-time performance metric was evaluated at different processor operation points. In [10], a high-speed and low-power FFT architecture was presented. The authors presented a delay-balanced pipeline architecture based on the split-radix algorithm. Algorithm trade-offs for reducing computation complexity were explored, and the architecture was evaluated in terms of area, power, and timing performance.

Based on Radix-x FFT, various pipeline FFT architectures have been proposed, such as Radix-2 single-path delay feedback FFT [3], Radix-4 single-path delay commutator FFT [5], Radix-2 multi-path delay commutator FFT [6], and Radix-2^2 single-path delay feedback FFT [4]. These architectures can achieve high throughput per unit area with single-path or multi-path pipelines, but energy efficiency has not been explored or evaluated in these works.

In [11], a mathematical model for generating a DFT soft core was developed. This model can automatically produce an optimized design given user inputs on performance and resource constraints. The resource usage was estimated with the available parameters. However, power and performance estimation were not presented in that work. In [7], the authors presented a parameterized FFT architecture for energy efficiency, in which the optimized design was achieved by varying the chosen architecture parameters. Some energy efficient design techniques, such as clock gating and memory binding, were also employed in their work.

Beyond FPGAs, techniques for energy efficient FFT have also been presented for other platforms [12, 13]. However, it is not clear how to apply these techniques on FPGAs. In this work, we extend the work of [7] by performing design space exploration for energy efficiency at different levels. The design space exploration is performed on current state-of-the-art FPGAs. By exploring the energy efficiency at two levels, we obtained an energy-performance-area trade-off design for FFT.

3. ARCHITECTURE AND IMPLEMENTATIONS

3.1. Architecture building blocks

The proposed N-point FFT architecture is based on the Radix-4 Cooley-Tukey FFT algorithm. Note that the choice of the radix affects the energy efficiency of the design. Compared with the Radix-2 algorithm, Radix-4 uses fewer multiplication operations.

The basic architecture consists of five building blocks (see Fig. 1): Radix-4 block (R4), Data buffer, Data path permutation (PER), Parallel-to-serial/serial-to-parallel (PS/SP) multiplexer, and twiddle factor computation (TWC). A complete design for a given N-point FFT can be obtained from combinations of the basic blocks.

A. Radix-4 block

In this module, 16 signed adder/subtractors are used to complete the butterfly computations. It takes four inputs and generates four outputs in parallel. Each input data contains real and imaginary components. The data outputs of R4 are used by the twiddle factor computation block except in the last stage (see Fig. 1a).
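The count of 16 adder/subtractors follows from a two-level butterfly in which multiplication by ±j is only a swap of real and imaginary parts. The sketch below is a hedged Python illustration on integer (real, imaginary) pairs; the two-level grouping and the function name `radix4_butterfly` are assumptions for illustration, not taken from the authors' RTL.

```python
def radix4_butterfly(a, b, c, d):
    """Radix-4 butterfly on (re, im) integer pairs using 16 real add/subtract ops."""
    # Level 1: four complex add/subtracts = 8 real adder/subtractors.
    s0 = (a[0] + c[0], a[1] + c[1])  # a + c
    s1 = (a[0] - c[0], a[1] - c[1])  # a - c
    s2 = (b[0] + d[0], b[1] + d[1])  # b + d
    s3 = (b[0] - d[0], b[1] - d[1])  # b - d
    # Level 2: four more complex add/subtracts = 8 real adder/subtractors.
    y0 = (s0[0] + s2[0], s0[1] + s2[1])  # a + b + c + d
    y2 = (s0[0] - s2[0], s0[1] - s2[1])  # a - b + c - d
    # -j*(x + jy) = y - jx, so multiplying s3 by -j or +j needs no multiplier.
    y1 = (s1[0] + s3[1], s1[1] - s3[0])  # a - jb - c + jd
    y3 = (s1[0] - s3[1], s1[1] + s3[0])  # a + jb - c - jd
    return y0, y1, y2, y3
```

For example, `radix4_butterfly((1, 2), (3, 4), (5, 6), (7, 8))` returns `((16, 20), (-8, 0), (-4, -4), (0, -8))`.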

Fig. 1: (a) Radix block, (b) Data buffer, (c) Data path permutation (PER), (d) Parallel-to-serial/serial-to-parallel MUX (PS/SP), (e) Twiddle factor computation (TWC)

Memory entry:    0     1     2     3
Data buffer 3:   X3    X6    X9    X12
Data buffer 2:   X2    X5    X8    X15
Data buffer 1:   X1    X4    X11   X14
Data buffer 0:   X0    X7    X10   X13

Fig. 2: Data permutation in the data buffers for 16-point FFT (one data item per buffer is output in parallel)

B. Data buffer

The data buffer consists of a dual-port RAM having N/m entries (m equals the number of I/Os). Data is written into one port and read from the other port simultaneously. The data buffers are shown in Fig. 2 for N = 16. In four cycles, 16 permuted data inputs are fed into the data buffers. In each subsequent cycle, four data outputs are read in parallel, each from a different entry of its buffer. For different architectural parameters, the read and write addresses are generated with different strides. For example, in Fig. 2, the four data inputs X0, X4, X8, and X12 are written in cycle 0, cycle 1, cycle 2, and cycle 3, respectively. They are then output simultaneously in cycle 4.

C. Data permutation block

Parallel input data must be permuted before being processed by the subsequent modules. Fig. 2 shows the data permutation for 16-point FFT. In the first cycle, four data inputs (X0, X1, X2, X3) are fed into the first entry of each data buffer without permutation. In the second cycle, another four data inputs are written into the second entry of each data buffer, rotated by one location. After four cycles, the data to be output in parallel (X_i, X_{i+4}, X_{i+8}, X_{i+12}, i = 0, 1, 2, 3) are stored in different RAMs. These permutations are repeated every four cycles.

D. PS/SP module

This module multiplexes serial input data to parallel output data, and parallel input data to serial output data. As shown in Fig. 3a, the number of I/Os is limited to one, but the radix-4 block still operates on four data inputs in parallel; thus, the PS/SP module is employed to match the data rate both before and after the radix-4 block.
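The buffer layout of Fig. 2 can be reproduced with a simple rotation rule. The Python sketch below is an illustration under an assumption inferred from the figure, namely that in write cycle e the input on port j is stored in buffer (j + e) mod 4; the function names are hypothetical and the sketch covers only the 16-point, four-buffer case.

```python
def buffer_layout(ports=4):
    """Return buffers[b][e]: index of the sample stored in buffer b, entry e."""
    buffers = [[None] * ports for _ in range(ports)]
    for e in range(ports):          # write cycle e fills memory entry e
        for j in range(ports):      # port j carries sample X[ports*e + j]
            # Assumed rule: rotate the port-to-buffer mapping by e locations.
            buffers[(j + e) % ports][e] = ports * e + j
    return buffers


def parallel_reads(buffers):
    """Return the sample groups read in parallel, one sample per buffer."""
    ports = len(buffers)
    groups = []
    for i in range(ports):
        # Buffer b is read at entry (b - i) mod ports, so no two reads collide.
        groups.append(sorted(buffers[b][(b - i) % ports] for b in range(ports)))
    return groups
```

With this rule, buffer 0 holds X0, X7, X10, X13 as in Fig. 2, and read group i contains X_i, X_{i+4}, X_{i+8}, X_{i+12}, each coming from a different RAM.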

Fig. 3: Parameterized architectures for 16-point FFT: (a) Hp = 1, Vp = 1; (b) Hp = 2, Vp = 4

E. Twiddle factor computation

This module consists of two blocks: the twiddle factor generation block and the complex number multiplier block. The twiddle factor generation block includes several lookup tables for storing twiddle factor coefficients, where the data read addresses are updated by the control signals. The size of the lookup tables increases with the problem size. The complex number multiplier block consists of three multipliers and three adder/subtractors.
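A complex multiplication with three multipliers and three adder/subtractors is possible when sums of the twiddle components are precomputed and stored in the lookup table. The Python sketch below shows one well-known variant of this trick; whether the authors use exactly this variant is an assumption, and the names `twiddle_entry` and `cmul3` are illustrative.

```python
import math


def twiddle_entry(k, n):
    """Lookup-table entry for w = exp(-i*2*pi*k/n) = c + j*s.

    Stores (c, s - c, c + s) so the multiplier needs no extra adders
    for the twiddle operands (assumed table format).
    """
    c = math.cos(2 * math.pi * k / n)
    s = -math.sin(2 * math.pi * k / n)
    return (c, s - c, c + s)


def cmul3(a, b, entry):
    """Multiply (a + j*b) by the twiddle: 3 multiplications, 3 add/subtracts."""
    c, s_minus_c, c_plus_s = entry
    k1 = c * (a + b)          # multiplier 1, adder 1
    k2 = a * s_minus_c        # multiplier 2
    k3 = b * c_plus_s         # multiplier 3
    return (k1 - k3, k1 + k2)  # adders 2 and 3: (a*c - b*s, a*s + b*c)
```

Compared with the direct form (four multiplications and two additions), this trades one multiplier for one adder, which is usually attractive when multipliers map to DSP slices.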

3.2. Parameterized FFT Architecture

3.2.1. Algorithm Mapping Parameters

Decomposition-based Radix-4 FFT offers much flexibility for mapping to various architectures. By folding the FFT architecture horizontally or vertically, the radix-4 blocks can be reused iteratively, connected in a pipeline, or replicated to process input data in parallel. Hence, we use two algorithm mapping parameters that characterize the decomposition-based N-point FFT algorithm in our design:

1. Horizontal Parallelism (Hp): determines the number of radix blocks used in one pipeline (1 ≤ Hp ≤ log4 N).

2. Vertical Parallelism (Vp): determines the number of inputs being computed in parallel (1 ≤ Vp ≤ N). Vp varies with the number of data channels per pipeline (Nc) and the number of parallel pipelines (Np), with Vp = Nc × Np.

These two algorithm mapping parameters are chosen to create a design space. Two different architectures are presented in Fig. 3. In Fig. 3a, Vp = Nc = Np = 1, Hp = 1, and N = 16: one radix-4 block is employed and reused iteratively by the two stages, and one input data is processed per cycle. This architecture achieves higher resource efficiency and consumes less I/O power, at the expense of throughput.

Fig. 4: (a) Crossbar network, (b) Complete binary tree, (c) Dynamic network

In Fig. 3b, Vp = 4, Hp = 2, and N = 16: two radix-4 blocks are utilized. There is only one pipeline, with Nc = 4 and Np = 1. All the stages are fully pipelined, and four inputs can be processed in parallel per cycle. Note that there is no feedback path. The architecture achieves high throughput by using more basic blocks and I/Os, at the cost of higher power consumption.

We can also increase Vp by replicating the basic pipeline. This replication allows several pipelines to work in parallel to significantly increase the throughput, at the cost of more complex interconnections.
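The algorithm mapping parameters above span a small, enumerable design space. The following Python sketch lists the candidate mappings for a given N using the stated constraints 1 ≤ Hp ≤ log4 N and Vp = Nc × Np ≤ N; the candidate values {1, 4} for Nc and Np are an assumption for illustration, and the names `Mapping`, `log4`, and `design_space` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    n: int        # problem size N
    hp: int       # horizontal parallelism: radix blocks per pipeline
    nc: int       # data channels per pipeline
    n_pipes: int  # number of parallel pipelines (Np)

    @property
    def vp(self):
        """Vertical parallelism Vp = Nc * Np."""
        return self.nc * self.n_pipes


def log4(n):
    """Integer log base 4; raises ValueError if n is not a power of 4."""
    stages = 0
    while n > 1:
        if n % 4:
            raise ValueError("N must be a power of 4")
        n //= 4
        stages += 1
    return stages


def design_space(n, nc_choices=(1, 4), np_choices=(1, 4)):
    """Enumerate mappings with 1 <= Hp <= log4(N) and Vp <= N."""
    return [
        Mapping(n, hp, nc, n_pipes)
        for hp in range(1, log4(n) + 1)
        for nc in nc_choices
        for n_pipes in np_choices
        if nc * n_pipes <= n
    ]
```

For N = 16 this yields eight candidate mappings, including the two architectures of Fig. 3 (Hp = 1, Vp = 1 and Hp = 2, Vp = 4).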

3.2.2. Architecture Parameters

Three architecture parameters that significantly affect energy efficiency are employed in our design and applied to different components:

1. Type of memory element: BRAM or distributed RAM (dist. RAM) can be used as memories. In our design, both data buffers and twiddle factor lookup tables can be implemented using different memory elements.

2. Type of interconnection: three different types of interconnection (see Fig. 4) are used to implement the data permutation blocks: crossbar network, complete binary tree, and dynamic network.

3. Pipeline depth: both adder/subtractors and DSP slices in the FPGA can be deeply pipelined by inserting registers, so we parameterize the arithmetic units and multipliers by pipeline depth to balance performance and resource usage.

According to [14], when used for large memories, BRAM consumes less power than dist. RAM. Hence, this characteristic can be utilized to make a trade-off between power and performance for various problem sizes.

As there are 2 × m × (Hp + 1) permutation modules (m = 1 when Vp = 4, otherwise m = 0), using different interconnection networks can significantly affect the energy efficiency of the designs. The physical layout of the complete binary tree is similar to that of the crossbar network, but more pipeline registers can be inserted between the layers of the tree. The dynamic network can be implemented using shift registers. Among the three, the dynamic network can deliver high performance at the cost of more power consumption; the crossbar network consumes the least resources and power but incurs long wire delays; and the complete binary tree can be used to relieve the routing burden and improve performance at the expense of more area.

4. EXPERIMENTAL RESULTS AND ANALYSIS

4.1. Experimental Setup

In this section, we present a detailed analysis of several implementation experiments obtained by varying the parameters. All the designs were implemented in Verilog on a Virtex-7 FPGA (XC7VX690T, speed grade -2L) using Xilinx ISE 14.4. Inputs are 16-bit fixed-point complex numbers. The designs were verified by post place-and-route simulation. The reported results are post place-and-route results. We used the SAIF (Switching Activity Interchange Format) file as input to Xilinx XPower Analyzer to produce accurate power dissipation estimates [14].

4.2. Performance Metrics

Two metrics for performance evaluation are considered in this paper:

1. Energy efficiency is defined as the number of operations per unit energy consumed (Energy efficiency = number of operations / energy consumed by the design). For N-point FFT, energy efficiency is given by (2N log2 N + (9/4) N log2 N) / energy consumed by the design, where energy consumed by the design = time taken by the design × average power dissipation of the design. Equivalently, the energy efficiency of the design equals its power efficiency (Power efficiency = number of operations per second / Watt).

2. Energy × Area × Time (EAT) is measured as the product of three important metrics: energy, area, and time. Energy is the energy in Joules consumed by the design for one transformation of N points. Area is the area usage of the design, taken as the maximum of the number of LUTs and the number of flip-flops occupied by the entire design. The area of a design using BRAMs is taken to be equal to the area usage of the same design when using only dist. RAMs. Time is the latency of the N-point FFT.
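The two metrics can be written down directly from their definitions. The Python sketch below is illustrative only: the function names are hypothetical, and any time, power, and area values passed in are placeholders rather than measurements from this paper.

```python
import math


def num_operations(n):
    """Operation count used for N-point FFT: 2N*log2(N) + (9/4)*N*log2(N)."""
    return 2 * n * math.log2(n) + 9 / 4 * n * math.log2(n)


def energy_joules(time_s, power_w):
    """Energy = time taken by the design * average power dissipation."""
    return time_s * power_w


def energy_efficiency(n, time_s, power_w):
    """Operations per Joule for one N-point transform."""
    return num_operations(n) / energy_joules(time_s, power_w)


def eat(time_s, power_w, area):
    """Energy x Area x Time composite metric (lower is better)."""
    return energy_joules(time_s, power_w) * area * time_s
```

As a sanity check on the definitions, a design that draws 2× the power but finishes in 1/3 of the time has 1.5× the energy efficiency, which matches the 50% figure quoted in the introduction.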

4.3. Design space exploration

In this section, we first present the design space exploration obtained by varying the algorithm mapping parameters. Both the dist. RAM based design and the BRAM based design are used in this experiment. The effect of the algorithm mapping parameters on energy efficiency is demonstrated using the proposed performance metrics. Next, we explore the energy-performance-area trade-off design (denoted trade-off design) by varying the architecture parameters, based on the conclusions of the design space exploration in this section.

Fig. 5: Energy efficiency (Giga operations/Joule) for various Hp with varying problem size N (16, 64, 256, 1024) for the dist. RAM based design

Fig. 6: Energy efficiency (Giga operations/Joule) for various Hp with varying problem size N (16, 64, 256, 1024) for the BRAM based design

4.3.1. Algorithm mapping level exploration

A. Horizontal Parallelism

In this experiment, we explore the energy efficiency for various degrees of horizontal parallelism, with Vp = 4, Nc = 4, and Np = 1. The range of Hp is [1, log4 N]. The energy efficiency for various Hp is shown in Fig. 5 for the dist. RAM based design and in Fig. 6 for the BRAM based design. Based on the experimental results, we make the following observations:

• For all the considered problem sizes, increasing horizontal parallelism can significantly improve energy efficiency for both the dist. RAM and BRAM based designs.

• As the problem size N increases, the energy efficiency of the dist. RAM based design declines, whereas the energy efficiency of the BRAM based design increases.

• The improvement in energy efficiency brought by increasing Hp for the dist. RAM based design is sensitive to N. For example, when N = 1024, halving Hp leads to only a slight decline in energy efficiency. Reducing Hp to save area is therefore a feasible alternative for larger problem sizes.

• The improvement in energy efficiency brought by increasing Hp for BRAM based designs is not sensitive to N. Reducing Hp to save area is not a feasible approach, since it leads to a significant decline in energy efficiency.

Fig. 7: Energy efficiency (Giga operations/Joule) for various Vp with varying problem size N (16, 64, 256, 1024) for the dist. RAM based design

Fig. 8: Energy efficiency (Giga operations/Joule) for various Vp with varying problem size N (16, 64, 256, 1024) for the BRAM based design

B. Vertical Parallelism

Vertical parallelism is determined by three different values: the radix value (fixed at 4), Nc, and Np. Hp was set to log4 N, while Nc and Np were varied for evaluation. Both dist. RAM and BRAM based designs were evaluated. The energy efficiency for various Vp is shown in Fig. 7 and Fig. 8. Based on the results, we draw the following conclusions:

• Reducing Nc leads to a decline in energy efficiency.

• The BRAM based design is more scalable than the dist. RAM based design with respect to energy efficiency. When N ≥ 64, the energy efficiency starts to decline for dist. RAM based designs due to the high power consumption per access of dist. RAMs with a large number of memory entries.

• Increasing Nc instead of Np is a more feasible approach to improving energy efficiency. Increasing Nc requires few extra resources, whereas increasing Np requires replicating the pipeline.

• Although increasing Hp leads to higher power and resource consumption, it can improve energy efficiency due to the higher throughput.

Table 1: Architecture parameters of designs for comparison

                                       Interconnection network                    Pipeline stages
Design             Memory type         Type                   Components          Multiplier   Adder
Design A           Dist. RAM           Dynamic network        Registers           5            2
Trade-off design   Dist. RAM or BRAM   Crossbar network       LUTs                3            2
Design C           BRAM                Complete binary tree   LUTs + Registers    2            1

Fig. 9: Energy efficiency (Giga operations/Joule) of the trade-off design and the baseline designs (Design A, Design C) for problem sizes N = 16, 64, 256, 1024

4.3.2. Architecture level exploration

In this section, the trade-off design is explored at the architecture level. In this experiment, we choose Vp = 4 and Hp = log4 N based on the previous experimental conclusions.

A. Energy hot spots

As shown in Fig. 10a, the dominant portion of the total power is consumed by the data buffers for the 1024-point FFT in the dist. RAM based design. This indicates that BRAM can be utilized to improve energy efficiency for large values of N. It also suggests that the I/Os consume a major share of the power for small values of N. Fig. 10b shows that the core power consumption (that is, total power excluding I/O power and static power) is dominant for BRAM based designs. We also observe that pipeline registers are the energy hot-spots among the architecture components.

B. Trade-off design

By varying the architecture parameters, a set of implementations has been evaluated in this experiment. The effects of the architecture parameters on power, performance, and area are analyzed below:

• Energy: Reducing the number of registers can significantly reduce signal power, which is dominant in the dynamic power. The crossbar network can be used to increase energy efficiency.

• Performance: Using BRAM can lead to a decline in peak operating frequency. For large values of N, when using BRAMs, extra pipeline stages can be used to address this performance degradation.

• Area: The area usage of pipeline registers is dominant in the entire design area. Pipeline registers can be balanced to obtain a trade-off between area and performance.

Component       (a) Dist. RAM based design   (b) BRAM based design
PER             3%                           8%
TWC             4%                           12%
Data buffer     72%                          28%
Radix block     4%                           13%
Static power    9%                           13%
I/O power       8%                           26%

Fig. 10: Percentage of power consumed by the components for the 1024-point FFT architecture

The analysis above has been applied to achieve the trade-off design in our experiment and serves as a guide for design space exploration. We use two baseline designs for comparison with our proposed trade-off design; the architecture parameters of the three designs are shown in Table 1. The comparison of the designs in terms of energy efficiency is shown in Fig. 9. It shows that our proposed trade-off design improves energy efficiency by up to 27% compared with the two baseline designs.

4.4. Performance comparison

Finally, we compare our proposed trade-off design with the SPIRAL FFT IP core. The SPIRAL DFT/FFT IP Generator can automatically generate customized DFT soft IP cores in synthesizable RTL Verilog from user inputs [11]. The available parameters of the DFT core generator include transform size, data precision, etc. In this comparison, we use the dist. RAM based design for N ≤ 64 and the BRAM based design for N > 64. For the design from SPIRAL, the code for the N-point (16-bit fixed-point) FFT is automatically generated by the SPIRAL core generator. The architecture is fully streaming and the data are presented in their natural ordering. As shown in Fig. 11, our proposed design improves energy efficiency by 8% to 28% and EAT by 23% to 38%, compared with the SPIRAL FFT IP cores.

Fig. 11: Comparison between the proposed trade-off design and the SPIRAL FFT IP cores for EAT and energy efficiency (energy efficiency in Giga operations/Joule for both designs, and the EAT ratio (EAT of SPIRAL IP core) / (EAT of our design), for problem sizes N = 16, 64, 256, 1024)

5. CONCLUSION

In this work, we presented a parameterized architecture for energy efficiency using the Radix-4 Cooley-Tukey FFT algorithm. The effect of the multi-level parameters on energy efficiency was demonstrated through design space exploration. We studied the power consumption of the components for various problem sizes, and proposed our trade-off design by empirical selection of the architecture parameters. Compared with the state-of-the-art design, our optimized architectures achieve up to 28% and 38% improvement in energy efficiency and EAT, respectively. In the future, we plan to work on an accurate high-level performance model for energy efficiency estimation, which can be used to accelerate design space exploration to obtain an energy efficient design.

6. REFERENCES

[1] N. Shirazi, P. M. Athanas, and A. L. Abbott, “Implementation of a 2-D Fast Fourier Transform on an FPGA-Based Custom Computing Machine,” in Field-Programmable Logic and Applications, 1995, pp. 282–292.

[2] D. Chen, G. Yao, C. Koc, and R. Cheung, “Low complexity and hardware-friendly spectral modular multiplication,” in International Conference on Field-Programmable Technology (FPT), 2012, pp. 368–375.

[3] E. H. Wold and A. M. Despain, “Pipeline and parallel-pipeline FFT processors for VLSI implementations,” IEEE Transactions on Computers, vol. 100, no. 5, pp. 414–426, 1984.

[4] S. He and M. Torkelson, “A new approach to pipeline FFT processor,” in Proceedings of IPPS’96, pp. 766–770.

[5] G. Bi and E. Jones, “A pipelined FFT processor for word-sequential data,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37, no. 12, pp. 1982–1985, 1989.

[6] L. R. Rabiner and B. Gold, “Theory and application of digital signal processing,” Englewood Cliffs, NJ, Prentice-Hall, Inc., 1975. 777 p., vol. 1.

[7] S. Choi, R. Scrofano, V. K. Prasanna, and J.-W. Jang, “Energy-efficient signal processing using FPGAs,” in Proceedings of the 2003 FPGA, pp. 225–234.

[8] D. Aravind and A. Sudarsanam, “High level - Application Analysis Techniques Architectures - To Explore Design possibilities for Reduced Reconfiguration Area Overheads in FPGAs executing Compute Intensive Applications,” in Proc. of IPDPS, 2005, pp. 158a–158a.

[9] B. Baas, “A low-power, high-performance, 1024-point FFT processor,” IEEE Journal of Solid-State Circuits, vol. 34, no. 3, pp. 380–387, 1999.

[10] W.-C. Yeh and C.-W. Jen, “High-speed and low-power split-radix FFT,” IEEE Transactions on Signal Processing, vol. 51, no. 3, pp. 864–874, 2003.

[11] P. A. Milder, M. Ahmad, J. C. Hoe, and M. Puschel, “Fast and accurate resource estimation of automatically generated custom DFT IP cores,” in Proceedings of the 2006 FPGA, pp. 211–220.

[12] T. Sugimura, H. Yamasaki, H. Noda, O. Yamamoto, Y. Okuno, and K. Arimoto, “A high-performance and energy-efficient FFT implementation on super parallel processor (MX) for mobile multimedia applications,” in International Symposium on Intelligent Signal Processing and Communications Systems, 2009, pp. 1–4.

[13] H. Kimura, H. Nakamura, S. Kimura, and N. Yoshimoto, “Numerical analysis of dynamic SNR management by controlling DSP calculation precision for energy-efficient OFDM-PON,” IEEE Photonics Technology Letters, vol. 24, no. 23, pp. 2132–2135, 2012.

[14] “XST User Guide for Virtex-6, Spartan-6, and 7 Series Devices,” http://www.xilinx.com/support/documentation.
