
Master of Science Thesis in Electrical Engineering
Department of Electrical Engineering, Linköping University, 2018

Implementation of High-Speed 512-Tap FIR Filters for Chromatic Dispersion Compensation

Cheolyong Bae and Madhur Gokhale


LiTH-ISY-EX--18/5179--SE

Supervisor: Oscar Gustafsson, ISY, Linköpings universitet

Examiner: Oscar Gustafsson, ISY, Linköpings universitet

Division of Computer Engineering
Department of Electrical Engineering

Linköping University
SE-581 83 Linköping, Sweden

Copyright © 2018 Cheolyong Bae and Madhur Gokhale

Abstract

A digital filter is a system or a device that modifies a signal, which is an essential feature in digital communication. Using optical fibers for communication has various advantages over copper wires, such as higher bandwidth and longer reach. However, at high transmission rates, chromatic dispersion arises as a problem that must be mitigated in an optical communication system. Therefore, it is necessary to have a filter that compensates for chromatic dispersion. In this thesis, we introduce the implementation of a new filter architecture and compare it with a previously proposed architecture.


Acknowledgments

We would like to express our gratitude to our supervisor and examiner Oscar Gustafsson for his guidance in this thesis. Since the beginning of this thesis, we have developed our knowledge in the field of computer engineering. We would also like to thank our opponents, Aurélien Moine and Viswanaath Sundaram, for their valuable feedback on the results of this thesis.

Linköping, December 2018
Cheolyong Bae and Madhur Gokhale


Contents

Notation

1 Introduction
  1.1 Goal
  1.2 Previous Research at LiU
  1.3 Limitation
  1.4 Outline of the Thesis

2 Theory
  2.1 Optical Networks and Chromatic Dispersion
    2.1.1 System Chain
    2.1.2 Analog-Digital Converter for Optical Transmission
  2.2 Finite-length Impulse Response Filter
    2.2.1 FIR Filtering in Frequency Domain
  2.3 Fast Fourier Transform
    2.3.1 Various Radix of FFT
    2.3.2 FFT Architecture
  2.4 Overlap-save
  2.5 Pipelining
  2.6 Complex Multiplication

3 Method
  3.1 Programming Language
  3.2 ModelSim
  3.3 Synopsys Design Compiler
  3.4 Power Estimation
  3.5 Approach

4 Implementation
  4.1 1024-point FFT Based FIR Filter Architecture
    4.1.1 Top Level Estimation
    4.1.2 First Commutator
    4.1.3 Second Commutator
    4.1.4 Twiddle Factor Multiplication
    4.1.5 Filter Coefficient Multiplier
    4.1.6 Coefficient Selector
    4.1.7 Multiplexer
    4.1.8 Global Counter and Control Signals
  4.2 256-point FFT Based FIR Filter Architecture
    4.2.1 Top Level Estimation
    4.2.2 Fast Fourier Transform with Overlap-save Method
    4.2.3 4-tap FIR Filters
    4.2.4 Multiplexer
    4.2.5 Inverse Fast Fourier Transform

5 Result
  5.1 256-point Fast Fourier Transform
    5.1.1 Radix-2 vs Radix-4 vs Radix-16
    5.1.2 Pipelining
    5.1.3 FFT with Overlap-save Block and Inverse FFT
  5.2 Complex Multiplier
    5.2.1 Power Estimation with Random Coefficients
  5.3 4-tap FIR Filters
    5.3.1 Pipelining
    5.3.2 Gauss Complex Multiplication Algorithm Based FIR Filters
    5.3.3 Standard Complex Multiplication Algorithm Based FIR Filters
    5.3.4 Summary of Results
    5.3.5 Tap Configuration Results
  5.4 Different Procedures of Commutator and Twiddle Factor Multiplication
  5.5 Different Cases in the First Commutator
  5.6 Comparison between Two Architectures
  5.7 Comparison with Previous Work

6 Conclusion
  6.1 Discussion
  6.2 Future Work

Bibliography

Notation

Abbreviations

Abbreviation   Description

ADC            Analog Digital Converter
AWGN           Additive White Gaussian Noise
BER            Bit Error Rate
CD             Chromatic Dispersion
DAC            Digital Analog Converter
DFT            Discrete Fourier Transform
DIF            Decimation in Frequency
DIT            Decimation in Time
DSP            Digital Signal Processing
FFT            Fast Fourier Transform
FIR            Finite-length Impulse Response
FPGA           Field Programmable Gate Array
HDL            Hardware Description Language
IC             Integrated Circuit
ICI            Intercarrier Interference
IDFT           Inverse Discrete Fourier Transform
IFFT           Inverse Fast Fourier Transform
IIR            Infinite-length Impulse Response
ISI            Intersymbol Interference
LSB            Least Significant Bit
MSB            Most Significant Bit
OLS            Overlap-save method
SAIF           Switching Activity Information File
SDF            Standard Delay Format
VCD            Value Change Dump
VHDL           Very High-speed Integrated Circuit Hardware Description Language

1 Introduction

Annual global IP traffic is growing and is predicted to reach 3.3 ZB (zettabytes) by 2021, up from 1.2 ZB in 2016 [1]. Because of the high demands of modern communication, fiber optic communication is widely used, owing to its various advantages over copper-wired communication [19].

Optical communication can handle longer distances and higher bandwidth, and has better reliability. Despite these benefits, fiber optic communication has some imperfections that need to be considered. One of these is chromatic dispersion.

Chromatic dispersion (CD) is a form of dispersion in optical fiber [19]. Because the components of a signal or pulse have different frequencies, each component propagates at a different speed. Thus, the signal or pulse is smeared and delivers wrong information. To recover the correct information signal, filtering is needed to compensate for chromatic dispersion.

In this thesis, we introduce a new filter architecture to compensate for chromatic dispersion and compare it with a previously proposed architecture. Both architectures are written in VHDL and verified against Matlab simulation. The blocks in these two architectures are synthesized with the Design Compiler tool and analyzed in terms of area usage and power consumption. Some of the blocks have several design options that affect performance. The results of those options and the overall comparison between the two architectures are discussed.

1.1 Goal

The goal of the thesis is to design and evaluate a new architecture for a high-speed 512-tap finite impulse response filter for compensation of chromatic dispersion in optical fibers, and to perform a comparative analysis with the previous architecture. The previous research shows that it is possible to achieve an operating speed of 60 GS/s with a maximum frequency of 476 MHz, which corresponds to a clock period of 2.1 ns [10]. This means that about 128 samples should be processed in every clock cycle. The new architecture should achieve the same operating speed while using fewer resources.
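The figure of about 128 samples per clock cycle follows from dividing the sample rate by the clock frequency. The short Python sketch below reproduces that arithmetic with the numbers quoted above. The variable names are illustrative, and rounding up to a power of two is our reading of "about 128", not a step stated in the text.

```python
import math

sample_rate = 60e9        # target throughput: 60 GS/s
clock_frequency = 476e6   # maximum clock frequency: 476 MHz

# Samples that must be consumed per clock cycle to sustain the sample rate.
samples_per_cycle = sample_rate / clock_frequency

# Assumption: round up to the next power of two, which suits FFT-based processing.
parallelism = 2 ** math.ceil(math.log2(samples_per_cycle))

# Clock period in nanoseconds at the quoted clock frequency.
clock_period_ns = 1e9 / clock_frequency
```

The quotient is slightly above 126, so a parallelism of 128 samples per cycle leaves a small margin at 476 MHz.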

1.2 Previous Research at LiU

At the Department of Electrical Engineering, Linköping University, a study on 512-tap complex FIR filter architectures for compensation of chromatic dispersion [10] has been carried out. Other studies on digital filters [4], representations of the FFT [5, 18] and FFT architectures [6, 7] have also been published.

1.3 Limitation

This thesis uses fixed-point numbers. Not all bits produced by the arithmetic operations can be kept, and which bits carry the important information depends on the inputs and the coefficients. Therefore, this thesis assumes that external signals are used to choose the desired bits.

The choice of coefficients is outside the scope of this thesis. Instead of coefficients obtained from a chromatic dispersion filter, randomly generated values are used.

Some blocks have a deep hierarchy, because of which propagation of switching activity to inner nets could not be ensured. Thus, the detailed gate-level power estimation cannot be said to be wholly accurate.

1.4 Outline of the Thesis

Chapter 2 presents theories behind FIR filters and optical networks.
Chapter 3 covers the languages and tools used in this thesis.
Chapter 4 explains how the filters are implemented.
Chapter 5 contains the results of each block and the whole architecture.
Chapter 6 contains a conclusion and discussion as well as expected future work.

2 Theory

2.1 Optical Networks and Chromatic Dispersion

Optical networks have significant advantages over traditional networks based on copper cables. They have much higher bandwidth and a lower Bit Error Rate (BER). Communication systems based on optical fiber are also less susceptible to electromagnetic interference. As a result, they can be used for distances of more than one kilometer at speeds of tens of megabits per second [19].

Optical fibers, which are guided-wave structures, propagate light signals in optical networks. A narrow pulse launched on a fiber spreads, with its width broadening, as it travels along the fiber. Over long distances, the broadened pulses extend into neighboring pulses, causing Intersymbol Interference (ISI). This effect is referred to as fiber dispersion. There are two basic types of dispersive effects in a fiber [20]: intermodal dispersion and chromatic dispersion.

Intermodal Dispersion: This form of dispersion exists in multimode fibers, since different modes have different group velocities. The pulse power differs between modes, so the pulses of the different modes arrive at different times, each carrying different power. This dispersion limits the bit rate-distance product of an optical communication link [19].

Chromatic Dispersion: This dispersion occurs due to the frequency dependence of the group velocity. Chromatic dispersion can be modeled by the frequency response

$$ C(e^{j\omega T}) = e^{-jK(\omega T)^2}, \qquad K = \frac{D\lambda^2 z}{4\pi c T^2}, \tag{2.1} $$

where $D$ is the fiber dispersion parameter, $\lambda$ is the wavelength, $c$ is the speed of light, $T$ is the sampling period and $z$ is the propagation distance [4].

Figure 2.1: The system of an optical network model. [Block diagram from input x(n) to output y(n). Blocks shown: upsampling with a factor L, anti-imaging filter G_TX, chromatic dispersion C, AWGN, filter for compensation of CD H, anti-aliasing filter G_RX, downsampling with a factor L.]

The compensation of the chromatic dispersion is done by designing a filter with the frequency response [4]

$$ H(e^{j\omega T}) = \frac{1}{C(e^{j\omega T})} = e^{jK(\omega T)^2}. \tag{2.2} $$
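As an illustration of Equations (2.1) and (2.2), the following Python sketch evaluates the dispersion response and the compensating response at a few normalized frequencies and multiplies them. The link parameters (dispersion parameter, wavelength, distance) are hypothetical values chosen only for this example and are not taken from the thesis; the function names are likewise illustrative.

```python
import cmath
import math

def dispersion_constant(D, wavelength, z, T, c=299_792_458.0):
    """K = D * lambda^2 * z / (4 * pi * c * T^2), as in Equation (2.1)."""
    return D * wavelength**2 * z / (4 * math.pi * c * T**2)

def cd_response(K, wT):
    """Chromatic dispersion model: C = exp(-j K (wT)^2)."""
    return cmath.exp(-1j * K * wT**2)

def compensation_response(K, wT):
    """Compensation filter: H = 1 / C = exp(+j K (wT)^2)."""
    return cmath.exp(1j * K * wT**2)

# Hypothetical link parameters, for illustration only.
D = 17e-6             # dispersion parameter in s/m^2 (17 ps/(nm km))
wavelength = 1550e-9  # wavelength in m
z = 100e3             # propagation distance in m
T = 1 / 60e9          # sampling period at 60 GS/s

K = dispersion_constant(D, wavelength, z, T)

# The cascade C * H should be 1 at every frequency.
products = [cd_response(K, wT) * compensation_response(K, wT)
            for wT in (0.0, 0.5, 1.0, math.pi)]
```

Both responses have unit magnitude, so the dispersion model and its compensator only change the phase of the signal.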

2.1.1 System Chain

The full system chain is shown in Figure 2.1 [9]. To reduce the effect of ISI and intercarrier interference (ICI), interpolation is added on the transmitter side and decimation on the receiver side. Interpolation and decimation require an anti-imaging filter and an anti-aliasing filter, respectively, which usually perform low-pass filtering [4].

The filter given by Equation (2.1) is added to simulate chromatic dispersion. The data then passes through an Additive White Gaussian Noise (AWGN) channel to simulate the random processes found in nature. On the receiver side, the CD compensation filter reduces the effect of CD.

2.1.2 Analog-Digital Converter for Optical Transmission

Optical transmission relies on digital signal processing (DSP) and on conversion between the analog and digital domains [12]. Several DACs and ADCs are aimed at optical communication. According to the sources given in [12], the resolution of such ADCs ranges from 4 to 8 bits, and the maximum sample rate ranges from 20 to 70 GS/s depending on the technology. In this thesis, we assume an input resolution of 6 bits and a target sample rate of 60 GS/s.

2.2 Finite-length Impulse Response Filter

Digital filters can be divided into two classes: Finite-length Impulse Response (FIR) and Infinite-length Impulse Response (IIR). An FIR filter has an impulse response of finite duration. In contrast, if the impulse response has infinite duration, the filter is called an IIR filter. FIR filters are guaranteed to be stable unless they are used inside a recursive loop [22]. Equation (2.3) describes an FIR filter of order $M$ with input $x(n)$ and output $y(n)$:

$$ y(n) = b_0 x(n) + b_1 x(n-1) + \dots + b_M x(n-M), \tag{2.3} $$

where $b_k$ is a coefficient of the FIR filter for $0 \le k \le M$. Similarly, the transfer function can be expressed as

$$ H(z) = \sum_{k=0}^{M} b_k z^{-k}. \tag{2.4} $$

Also, the unit sample response of the FIR filter is the same as the coefficients $b_k$, that is,

$$ h(k) = \begin{cases} b_k, & 0 \le k \le M \\ 0, & \text{otherwise.} \end{cases} \tag{2.5} $$

The output sequence described by Equation (2.3) can be expressed as the convolution sum of the system,

$$ y(n) = \sum_{k=0}^{M} h(k)\, x(n-k), \tag{2.6} $$

where $M$ is the order of the filter [17].

Generally, an FIR filter is described using the length of the filter rather than the order. The length of the filter is given by $N = M + 1$, where $M$ is the order of the filter. The numbers of multiplications and additions in an FIR filter of length $N$ are $N$ and $N - 1$, respectively [22]. The direct form structure is one of the simplest FIR filter structures and is depicted in Figure 2.2.

Figure 2.2: Generic 4-tap filter. [Direct form structure: the input passes through three delay elements D, the four taps are weighted by h(0), h(1), h(2) and h(3), and the products are summed to form the output.]
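The direct form structure can be sketched in a few lines of Python. This is a behavioral illustration of the convolution sum with a 4-tap delay line, assuming zero initial state; the function name and the example coefficients are ours and do not come from the thesis.

```python
def fir_direct_form(h, x):
    """Direct-form FIR: y(n) = sum over k of h(k) * x(n - k), with x(n) = 0 for n < 0."""
    delay_line = [0] * len(h)       # holds x(n), x(n-1), ..., x(n-M)
    y = []
    for sample in x:
        # Shift the delay line by one position and insert the new sample.
        delay_line = [sample] + delay_line[:-1]
        # N multiplications and N - 1 additions per output sample.
        y.append(sum(hk * xk for hk, xk in zip(h, delay_line)))
    return y

h = [1, 2, 3, 4]                 # example coefficients of a 4-tap filter
x = [1, 0, 0, 0, 0, 1, 1]        # an impulse followed by two more samples
y = fir_direct_form(h, x)
```

Feeding a unit impulse reproduces the coefficients at the output, which is the statement that the unit sample response equals the coefficients.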

2.2.1 FIR Filtering in Frequency Domain

As discussed in the previous section, convolving the time-domain signal with the impulse response gives the output of the filter. This operation can be sped up by taking the Fourier transform of both the input signal and the coefficients and multiplying the results. Taking the inverse Fourier transform of the product gives the same output as convolving the input with the impulse response. This method is much faster than time-domain convolution, due to the simplicity of multiplication and the speed of the Fast Fourier Transform (FFT). The approach is advantageous when filtering long data sequences. The complete filtering process is shown in Figure 2.3 [14].

Figure 2.3: Fast convolution. [The input signal x(n) and the impulse response h(n) are each transformed by an FFT into X(k) and H(k), multiplied to give Y(k) = X(k)H(k), and passed through an inverse FFT to give the output signal y(n) = x(n) * h(n).]
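The equivalence between multiplication in the frequency domain and convolution in the time domain can be checked numerically. The sketch below is a minimal Python illustration, not the thesis implementation. To stay self-contained it uses a naive O(N^2) DFT in place of an FFT, which changes the speed but not the result, and it zero-pads both sequences so that the circular convolution implied by the DFT equals the linear convolution.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(N^2) DFT or inverse DFT, used only to keep the sketch short."""
    N = len(x)
    sign = 1 if inverse else -1
    X = [sum(x[n] * cmath.exp(sign * 2j * cmath.pi * k * n / N) for n in range(N))
         for k in range(N)]
    return [v / N for v in X] if inverse else X

def fast_convolution(x, h):
    """Linear convolution via zero-padded transforms and pointwise multiplication."""
    N = len(x) + len(h) - 1                  # length of the linear convolution
    X = dft(list(x) + [0] * (N - len(x)))
    H = dft(list(h) + [0] * (N - len(h)))
    Y = [Xk * Hk for Xk, Hk in zip(X, H)]    # Y(k) = X(k) H(k)
    return dft(Y, inverse=True)

def direct_convolution(x, h):
    """Time-domain convolution, used as the reference."""
    y = [0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y

x = [1, 2, 3, 4, 5]
h = [1, -1, 2]
y_fast = fast_convolution(x, h)
y_direct = direct_convolution(x, h)
```

The two results agree up to floating-point rounding.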

When filtering long data sequences, the filtering is done on a block-by-block basis. The input stream is divided into segments of samples, and each segment is processed in turn by a Discrete Fourier Transform (DFT) and an inverse DFT. One method for filtering long data sequences is the Overlap-save (OLS) method [17], which is described further in Section 2.4.

2.3 Fast Fourier Transform

The Fourier Transform (FT) is a mathematical way to decompose a function of time (a signal) into a function of frequency. When the FT is applied to discrete samples, it is called the Discrete Fourier Transform (DFT). The Fast Fourier Transform (FFT) is simply an optimized version of the DFT [16].

The DFT is given by Equation (2.7),

$$ X_k = \sum_{n=0}^{N-1} x_n \exp\!\left(-j\frac{2\pi}{N}kn\right), \tag{2.7} $$

where $x_n$ is a sequence of samples, $N$ is the size of the transformation and $X_k$ is the $k$-th frequency of the transform.

With this equation, the computation requires $N^2$ complex multiplications, plus complex additions or subtractions, if trivial computations such as multiplication by 1 are not eliminated. The FFT reduces the number of complex multiplications from $N^2$ to the order of $N \log_2 N$. A well-known FFT method is the Cooley-Tukey algorithm [2]. It uses a divide-and-conquer technique that recursively breaks a DFT down into smaller DFTs. To explain the algorithm, the DFT in Equation (2.7) can be split into two parts [15]:

$$ X_k = \sum_{n=0}^{(N/2)-1} x_{2n} W_{N/2}^{nk} + W_N^{k} \sum_{n=0}^{(N/2)-1} x_{2n+1} W_{N/2}^{nk} \tag{2.8} $$

$$ X_{k+N/2} = \sum_{n=0}^{(N/2)-1} x_{2n} W_{N/2}^{nk} - W_N^{k} \sum_{n=0}^{(N/2)-1} x_{2n+1} W_{N/2}^{nk}, \tag{2.9} $$

for $0 \le k < N/2$, where $W_N = \exp\!\left(-\frac{2j\pi}{N}\right)$ is called the twiddle factor. With Equations (2.8) and (2.9), an $N$-point DFT can be computed from two $N/2$-point DFTs, one for the even-indexed samples and one for the odd-indexed samples. These equations can be broken down further until the size of the DFT equals the radix.

Figure 2.4: An 8-point decimation in frequency FFT algorithm. [Signal flow graph with inputs x(0) to x(7) in natural order, outputs in the order X(0), X(4), X(2), X(6), X(1), X(5), X(3), X(7), and twiddle factors that are powers of W_8.]
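The even/odd split of Equations (2.8) and (2.9) translates directly into a recursive routine. The following Python sketch is a textbook radix-2 recursion written for illustration; it is not the hardware FFT of the thesis. It checks the recursion against a direct evaluation of Equation (2.7). The function names are ours.

```python
import cmath

def dft(x):
    """Direct evaluation of Equation (2.7), used as the reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def fft_radix2(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    N = len(x)
    if N == 1:
        return list(x)
    E = fft_radix2(x[0::2])   # N/2-point DFT of the even-indexed samples
    O = fft_radix2(x[1::2])   # N/2-point DFT of the odd-indexed samples
    X = [0] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * cmath.pi * k / N) * O[k]   # twiddle factor W_N^k times O_k
        X[k] = E[k] + t              # Equation (2.8)
        X[k + N // 2] = E[k] - t     # Equation (2.9)
    return X

x = [1, 2, 3, 4, 0, -1, -2, -3]
max_error = max(abs(a - b) for a, b in zip(fft_radix2(x), dft(x)))
```

Each level of the recursion performs N/2 twiddle multiplications, and there are log2 N levels, which is where the reduction in multiplications comes from.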

Figure 2.4 shows an example of an 8-point FFT, and Table 2.1 compares the DFT and the FFT in terms of the number of complex multiplications.

2.3.1 Various Radix of FFT

As discussed above, the FFT breaks the DFT down into smaller DFTs. The minimum size of these DFTs depends on the radix selected for the FFT. The radix is also the size of the butterfly [15]. In this thesis, a butterfly denotes the basic operation of the FFT. For example, a radix-2 FFT uses one addition and one subtraction for one butterfly:

$$ X(0) = x(0) + x(1) \tag{2.10} $$

$$ X(1) = x(0) - x(1) \tag{2.11} $$

The FFT can, of course, use higher radix. Figure 2.5 shows a radix-2 butterflyand a radix-4 butterfly.
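The two butterflies can be written out explicitly. The Python sketch below is illustrative only. The radix-4 butterfly is written in the common form that needs just additions, subtractions and a trivial multiplication by -j, which is an assumption about how the butterfly in the figure is structured. Both butterflies are compared with a direct DFT of the same size.

```python
import cmath

def butterfly_radix2(x0, x1):
    """Radix-2 butterfly: one addition and one subtraction."""
    return x0 + x1, x0 - x1

def butterfly_radix4(x0, x1, x2, x3):
    """4-point DFT from additions, subtractions and one multiplication by -j."""
    a, b = x0 + x2, x0 - x2
    c, d = x1 + x3, -1j * (x1 - x3)
    return a + c, b + d, a - c, b - d   # X(0), X(1), X(2), X(3)

def dft(x):
    """Direct DFT, used as the reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

inputs2 = [3 + 1j, -2 + 4j]
inputs4 = [1 + 2j, -3 + 0.5j, 2 - 1j, 4 + 4j]
error2 = max(abs(a - b) for a, b in zip(butterfly_radix2(*inputs2), dft(inputs2)))
error4 = max(abs(a - b) for a, b in zip(butterfly_radix4(*inputs4), dft(inputs4)))
```

Multiplication by -j only swaps the real and imaginary parts and negates one of them, so neither butterfly needs a real multiplier.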

Number of    Complex multiplications    Complex multiplications
points       in direct computation      in FFT algorithm

4            16                         4
8            64                         12
16           256                        32
32           1024                       80
64           4096                       192
128          16384                      448
256          65536                      1024
512          262144                     2304
1024         1048576                    5120

Table 2.1: Comparison of the number of complex multiplications in the direct computation of the DFT and the FFT algorithm [17].
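The entries of Table 2.1 are consistent with N^2 multiplications for the direct computation and (N/2) log2 N for the FFT algorithm. The following Python sketch regenerates the table from these two expressions; the function names are illustrative.

```python
def multiplications_direct(N):
    """Complex multiplications in the direct computation of an N-point DFT."""
    return N * N

def multiplications_fft(N):
    """Complex multiplications in a radix-2 FFT: (N/2) * log2(N), N a power of two."""
    log2_N = N.bit_length() - 1      # exact log2 for powers of two
    return (N // 2) * log2_N

table = [(N, multiplications_direct(N), multiplications_fft(N))
         for N in (4, 8, 16, 32, 64, 128, 256, 512, 1024)]
```

For 1024 points the ratio between the two columns is already larger than 200.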

Figure 2.5: Two types of butterfly. (a) A radix-2 butterfly with inputs x(0), x(1) and outputs X(0), X(1). (b) A radix-4 butterfly with inputs x(0) to x(3), outputs X(0) to X(3), and a multiplication by -j.

2.3.2 FFT Architecture

Two FFT architectures are used in this thesis: a pipelined implementation and a direct implementation. In the pipelined structure, the number of input samples is a power of two. The input samples are processed in a continuous flow with data shuffling, which is done using buffers and multiplexers [7].

The direct implementation can be regarded as a parallel pipelined FFT in which the degree of parallelization equals the size of the FFT [7]. It is straightforward: each operation is mapped directly according to the FFT flow graph.

In the direct implementation, all input samples arrive simultaneously, so there is no need to pipeline the data flow. Also, decimation in time (DIT) and decimation in frequency (DIF) have the same architecture; the only difference is in the rotators [9].

2.4 Overlap-save

Figure 2.6 illustrates the Overlap-save (OLS) method. OLS is one of the methods for computing the discrete convolution of a very long input signal with an FIR filter. In the OLS method, the input samples are divided into blocks of M samples. The first block of M samples is preceded by L − 1 zeros, so that the new block has M + L − 1 = N samples in total. Each following block keeps the last L − 1 samples of the previous block and places them in front of the next M samples, so it also has M + L − 1 samples: L − 1 old samples and M new samples. The same applies to all segments.

An N-point FFT is performed on each of these blocks. The FFT result of each block is then multiplied by the filter coefficients in the frequency domain, and an IFFT is performed on each product. Finally, the initial L − 1 samples of each block are discarded and the remaining samples are concatenated. The concatenated result is the final output [17].
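The block processing described above can be sketched as follows. This Python example is a behavioral illustration under stated assumptions, not the thesis implementation. It uses a naive DFT in place of an FFT to stay self-contained, it places the L − 1 zeros and the saved samples in front of the M new samples so that the first L − 1 outputs of each block are the ones discarded, and it zero-pads the last partial block. The function names and test data are ours.

```python
import cmath

def dft(x, inverse=False):
    """Naive DFT or inverse DFT, standing in for the FFT and IFFT."""
    N = len(x)
    sign = 1 if inverse else -1
    X = [sum(x[n] * cmath.exp(sign * 2j * cmath.pi * k * n / N) for n in range(N))
         for k in range(N)]
    return [v / N for v in X] if inverse else X

def overlap_save(x, h, M):
    """Filter x with FIR coefficients h using blocks of M new samples."""
    L = len(h)
    N = M + L - 1                                # transform size
    H = dft(list(h) + [0] * (N - L))             # filter coefficients in the frequency domain
    saved = [0] * (L - 1)                        # the first block starts with L - 1 zeros
    y = []
    for start in range(0, len(x), M):
        new = list(x[start:start + M])
        new += [0] * (M - len(new))              # pad the final partial block
        block = saved + new                      # L - 1 old samples, then M new samples
        Y = [Xk * Hk for Xk, Hk in zip(dft(block), H)]
        y_block = dft(Y, inverse=True)
        y.extend(y_block[L - 1:])                # discard the first L - 1 samples
        saved = block[M:]                        # save the last L - 1 samples
    return y[:len(x)]

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
h = [1, -1, 2]
y = overlap_save(x, h, M=4)

# Reference: direct convolution, truncated to the length of the input.
y_ref = [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
         for n in range(len(x))]
```

The discarded samples are exactly those corrupted by the circular wrap-around of the DFT-based convolution, which is why the concatenated remainder matches the direct result.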

2.5 Pipelining

Pipelining is a method of increasing the throughput of a sequential algorithm. It is achieved by breaking the critical path into shorter paths by inserting delay elements into the original path. Ideally, the critical path is broken into paths of equal length. With P pipeline stages, P computations can run concurrently, which increases the throughput by a factor of P over sequential processing. Pipelining helps to achieve a higher level of parallelism in the structure [22].

In this thesis, pipelining is done for two reasons. The first is to achieve a higher operating speed: without delay elements in the critical path, the system cannot run at certain speeds. The second is to reduce the resources needed when synthesizing the design: pipelining reduces the overall area usage and power consumption when operating at higher frequencies.

Figure 2.6: Overlap-save method used in FIR filtering. [The input is split into blocks of M samples. The first block is preceded by L-1 zeros and each later block overlaps the previous one by L-1 samples. Each block passes through FFT, multiplication and IFFT, and the first L-1 samples of each output block are discarded.]

2.6 Complex Multiplication

In this thesis, we consider two different algorithms for complex multiplication. One is the standard complex multiplication algorithm, and the other is the Gauss complex multiplication algorithm [21].

The standard complex multiplication algorithm uses four real multiplications and two additions, as can be seen in Equation (2.12).

$$ (a + bj)(c + dj) = (ac - bd) + j(bc + ad) \tag{2.12} $$

With the Gauss complex multiplication algorithm, the number of real multiplications can be reduced to three. The algorithm is as follows:

$$ k_1 = c(a + b), \qquad k_2 = a(d - c), \qquad k_3 = b(c + d) \tag{2.13} $$

$$ \begin{cases} ac - bd = k_1 - k_3 \\ bc + ad = k_1 + k_2 \end{cases} \tag{2.14} $$

This algorithm gains in speed if one multiplication is more expensive than three additions or subtractions. However, it has three computation steps, which makes the architecture more complicated. The difference between these two multipliers is discussed in this thesis.
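Both algorithms can be compared on integer operands, which mirrors fixed-point arithmetic. The Python sketch below is illustrative; the function names and test operands are ours.

```python
def multiply_standard(a, b, c, d):
    """(a + bj)(c + dj) with four real multiplications and two additions."""
    return a * c - b * d, b * c + a * d

def multiply_gauss(a, b, c, d):
    """Same product with three real multiplications and five additions."""
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return k1 - k3, k1 + k2          # (real part, imaginary part)

operands = [(3, 4, 5, -2), (-7, 1, 2, 9), (0, 6, -3, 0)]
results = [(multiply_standard(*p), multiply_gauss(*p)) for p in operands]
```

When c + dj is a constant, such as a twiddle factor, the terms d − c and c + d can be computed in advance, which leaves three multiplications and three additions at run time.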

3 Method

This chapter covers methods that are used in this thesis.

3.1 Programming Language

First, Matlab is used to verify the system. Matlab is a mathematical computing application and programming language developed by MathWorks. We used it because it makes complex computations easy to express.

The second language is VHDL, an abbreviation for "Very High-speed Integrated Circuit Hardware Description Language". VHDL is generally used in the electronic design of FPGAs (Field Programmable Gate Arrays) and Integrated Circuits (ICs). We used it because some code from previous studies was available, which reduced the time required to implement both architectures in this thesis.

3.2 ModelSim

ModelSim is a simulation application for hardware description languages such as VHDL and Verilog, and for system-level modeling languages such as SystemC. It is used to verify that the system is functionally correct without using any physical equipment.

3.3 Synopsys Design Compiler

Synopsys Design Compiler is a tool that synthesizes high-level design blocks described in HDL code into physical hardware. It creates netlists consisting of logic-level design blocks. When compiling with Design Compiler, it is possible to specify parameters such as clock period and switching activity. This allows results to be compared under given constraints.

It also has useful commands for obtaining optimized results, such as "compile_ultra", which includes many optimization features such as automatic ungrouping, datapath optimization, and timing analysis.

3.4 Power Estimation

To obtain a more accurate power estimate, it is necessary to set proper switching activity for the ports in the design. An increase in switching activity causes higher dynamic power consumption.

In this thesis, power estimation is done using a Switching Activity Information File (SAIF) in Design Compiler.

There are two ways of generating a SAIF file. One is to write the SAIF file directly; the other is to convert a VCD (Value Change Dump) file from the simulation into a SAIF file using the "vcd2saif" command in Design Compiler. Since the designs in this thesis are too large for writing the file directly, the latter method is used.

The detailed procedure is as follows:

1. Read the design and compile it in Design Compiler.

2. Generate an SDF (Standard Delay Format) file in Design Compiler by using the "write_sdf" command.

3. Read the SDF file and the testbench file of the design in ModelSim and create a VCD file.

4. Convert the VCD file to a SAIF file.

5. Read the SAIF file in Design Compiler and report power.

3.5 Approach

To obtain a filter optimized for power, multiple variants of every block were designed and analyzed for power consumption. Each variant was synthesized over a range of frequencies from 100 MHz to 667 MHz to better understand the behaviour of each block. For each block, the variant with the lowest power at 476 MHz was then selected for the filter.

4 Implementation

This chapter presents the implementation of two different filter architectures. One is the 1024-point FFT based FIR filter architecture, which is the architecture proposed in this thesis. The other is the 256-point FFT based FIR filter architecture, which was proposed previously [9].

In this thesis, wordlengths are determined based on the bit error rate (BER) simulated in the previous work [9]. The input data wordlength is chosen as 12 bits, with 6 bits for the real part and 6 bits for the imaginary part, as mentioned in Section 2.1.2. Inside the architectures, the data wordlength for quantization is chosen as 24 bits, with 12 bits each for the real and imaginary parts. The filter coefficient wordlength is chosen as 16 bits, with 8 bits each for the real and imaginary parts.

4.1 1024-point FFT Based FIR Filter Architecture

Figure 4.1 shows an overview of the architecture. In this architecture, the 1024-point FFT is performed using 4-point FFTs and a 256-point FFT. Commutators shuffle the input samples so that the FFTs operate on the correct data, and after multiplication with the filter coefficients, the inverse FFT is performed. The inverse FFT is done by mirroring the FFT process and interchanging the real and imaginary parts of the samples from the multipliers and of the outputs [3].

4.1.1 Top Level Estimation

Various operators with different wordlengths are used in the architecture. The operators that influence area usage and power consumption in the design are adders (subtractors), general complex multipliers, complex multipliers with constants, complex multipliers with reconfigurable constants, multiplexers and delay elements.

Figure 4.1: 1024-point FFT based FIR filter architecture. [Block diagram with inputs x(128n) to x(128n+127) and outputs y(128n) to y(128n+127), part of the output being discarded. Blocks shown: global counter (clk, rst), first commutator, 4-point FFTs, twiddle factor multiplication, second commutator, 256-point FFT (parallel FFT), coefficient selection, 256-point IFFT (parallel IFFT), third commutator, twiddle factor multiplication, fourth commutator and 4-point FFTs. Control signals: s1, s2, s3, s4, cs, tfs1, tfs2.]

A general complex multiplier is a complex multiplier in which neither input is known in advance. A complex multiplier with constants is a complex multiplier in which one operand is a fixed constant, whereas in a complex multiplier with reconfigurable constants the constant operand changes every clock cycle.

Although the detailed gates are selected by the synthesis tool, it is useful to estimate overall performance by analyzing the number of operators in each block. In the 1024-point FFT based FIR filter architecture, 256 samples are processed every clock cycle, since we use the OLS method.

First, the number of operators in the 256-point FFT block can be computed from the following equations:

$$ \text{Complex adders} = N \log_2 N \tag{4.1} $$

$$ \text{Complex multipliers} = \frac{N}{2}(\log_2 N - 1) \tag{4.2} $$

where $N$ is the number of processed samples. The complex multipliers here use twiddle factors, which are constants.

Second, the number of multipliers in the twiddle factor multiplication block is $\frac{3}{4}N$ or $N$, depending on which of two different procedures is used for the second commutator and the twiddle factor multiplication block. The difference is discussed in Section 5.4. The multipliers use reconfigurable constants. Therefore, three 2 × 1 multiplexers are used to select one constant out of four.

Third, the number of filter coefficient multipliers is $N$. These are general multipliers whose inputs are not constants. Here, multiplexers are used for each multiplier: three 2 × 1 multiplexers select the coefficient and two 2 × 1 multiplexers are used for quantization.

Fourth, the commutators consist of delay blocks and 2 × 1 multiplexers. The number of delay blocks differs between commutators.

The total number for each operator in the architecture based on the 1024-point FFT can be computed by the following equations:

$$ \text{Total complex adders} = 4N + 2N \log_2 N \tag{4.3} $$

$$ \text{Total general complex multipliers} = N \tag{4.4} $$

$$ \text{Total complex mult. w/ const.} = N(\log_2 N - 1) \tag{4.5} $$

$$ \text{Total complex mult. w/ reconfigurable const.} = \tfrac{3}{2}N \ \text{or}\ 2N \tag{4.6} $$

$$ \text{Total } 2 \times 1 \text{ multiplexers} = \left(\tfrac{9}{2}N \ \text{or}\ 6N\right) + 2N + 3N + N + 2N + 2N + \tfrac{N}{2} \tag{4.7} $$

$$ \text{Total delay elements} = \tfrac{5}{2}N + 3N + 3N + N \tag{4.8} $$

The overall results of the architecture can be seen in Table 4.1. Note that the exact numbers may differ depending on the optimization process. More details of each block are discussed below.

Operators                                  Number

Complex adders                             5120
General complex mult.                      256
Complex mult. const.                       1792
Complex mult. w/ reconfigurable const.     384 or 512
2 × 1 Multiplexers                         3840 or 4224
Delay elements                             2432

Table 4.1: Number of operators in the 1024-point FFT based FIR filter architecture.
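The entries of Table 4.1 can be reproduced by evaluating Equations (4.3) to (4.8) for N = 256. The Python sketch below does this; the dictionary keys and function name are illustrative. Where an equation offers two alternatives, both values are returned as a pair.

```python
def operator_totals(N=256):
    """Evaluate Equations (4.3) to (4.8) for N samples per clock cycle."""
    log2_N = N.bit_length() - 1                      # exact log2 for powers of two
    other_muxes = 2 * N + 3 * N + N + 2 * N + 2 * N + N // 2
    return {
        "complex adders": 4 * N + 2 * N * log2_N,                    # (4.3)
        "general complex multipliers": N,                            # (4.4)
        "constant complex multipliers": N * (log2_N - 1),            # (4.5)
        "reconfigurable-constant multipliers": (3 * N // 2, 2 * N),  # (4.6)
        "2x1 multiplexers": (9 * N // 2 + other_muxes,
                             6 * N + other_muxes),                   # (4.7)
        "delay elements": 5 * N // 2 + 3 * N + 3 * N + N,            # (4.8)
    }

totals = operator_totals()
```

The first term of Equation (4.7) is three times the multiplier count of Equation (4.6), which matches the three 2 × 1 multiplexers needed to select one constant out of four.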

4.1.2 First Commutator

The architecture consists of 4-point FFTs and a 256-point FFT. Therefore, the order of the inputs has to be changed as they pass through the FFT blocks. The theory behind this block can be found in [6, 8].

The first commutator is used for the 4-point FFTs. The first stage of the FFT needs to combine samples from the ranges 0 to 255, 256 to 511, 512 to 767, and 768 to 1023. This block also needs to handle the OLS method. Therefore, the input data stream must be permuted. The permutation flow is detailed in Figure 4.2. The important point in the figure is that b9 and b8 are placed on the right-hand side of the matrix, and that the total number of bits on the right-hand side is eight. It is then possible to pick the correct data to perform the 4-point FFTs.

To explain how the data is permuted, it is easier to use bit index numbers, as follows:

b9 b8 b7 | b6 b5 b4 b3 b2 b1 b0 (4.9)

Indexes to the left of the vertical bar in Equation (4.9) represent the vertical dimension of the matrix in Figure 4.2, and indexes to the right of the vertical bar represent the horizontal dimension. Note that the bits on the left-hand side (b7, b8, and b9 in Equation (4.9)) are incremented by the clock signal.

Figure 4.2: Operation of first commutator. The input matrix is indexed by b9 b8 b7 (rows) and b6 b5 b4 b3 b2 b1 b0 (columns); the permuted matrix is indexed by b0 b7 (rows) and b9 b6 b5 b4 b3 b2 b1 b8 (columns).

To handle the OLS method and select the correct bits for the 4-point FFT, the first step is to move b9 to the right-hand side of the index as follows:

b8 b7 | b9 b6 b5 b4 b3 b2 b1 b0 (4.10)

This is done by adding delay blocks of four clock cycles at the input stage. By doing this, the 512 samples are overlapped and saved.

The next step is to move b8 to the right-hand side. To do this, b8 needs to be exchanged with one of the bits b0 to b6. According to [6, 8], this process requires a serial-parallel circuit with a delay size of two. If we select b0, for example, the final order of the indexes is the following:

b0 b7 | b9 b6 b5 b4 b3 b2 b1 b8 (4.11)

Note that the choice of the bit from b0 to b6 results in different power consumption. This will be discussed in Section 5.5. Figure 4.3 shows the whole architecture of the first commutator.
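The effect of the bit order in Equation (4.11) can be illustrated with a short sketch. It assumes that the two bits on the left of the bar form the clock cycle number and the eight bits on the right form the parallel lane number, with the leftmost bit of each group being the most significant one. The names `sample_index` and `cycle_contents` are hypothetical.

```python
# Bit order after the first commutator, Equation (4.11):
#   b0 b7 | b9 b6 b5 b4 b3 b2 b1 b8
TIME_BITS = (0, 7)                    # clock cycle bits, most significant first
LANE_BITS = (9, 6, 5, 4, 3, 2, 1, 8)  # parallel lane bits, most significant first


def sample_index(cycle, lane):
    """Return the index of the sample found on `lane` during clock `cycle`."""
    index = 0
    for pos, bit in enumerate(reversed(TIME_BITS)):
        index |= ((cycle >> pos) & 1) << bit
    for pos, bit in enumerate(reversed(LANE_BITS)):
        index |= ((lane >> pos) & 1) << bit
    return index


def cycle_contents(cycle):
    """All 256 sample indexes that are available in one clock cycle."""
    return [sample_index(cycle, lane) for lane in range(256)]


if __name__ == "__main__":
    # First lanes of the first cycle: 0, 256, 2, 258, 4, 260 as in Figure 4.2.
    print(cycle_contents(0)[:6])
```

Because b9 and b8 are both lane bits, every clock cycle contains the four samples n, n + 256, n + 512, and n + 768 that one 4-point FFT needs.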

4.1.3 Second Commutator

After the 4-point FFTs are performed, the data stream goes to the 256-point FFT. However, the data from the 4-point FFTs cannot be used directly because it has been permuted. To perform the 256-point FFT, the data must be restored to its natural order.

Figure 4.3: Detailed architecture of the first commutator. The first step uses delays of four clock cycles and the second step uses delays of two clock cycles.

Exchanging index numbers on the right-hand side of the expression can be done easily by changing the wiring. Therefore, the important task of the second commutator is to place b8 and b9 on the left-hand side. This can be done similarly to the first commutator. Serial-parallel circuits with one delay and with two delays are used in the second commutator.

After using the serial-parallel circuit with one delay, b7 and b8 are exchanged.

b0 b8 | b9 b6 b5 b4 b3 b2 b1 b7 (4.12)

Before b9 can be exchanged with b0, the wiring must be changed so that b9 takes the place of b7 on the right-hand side. After this rewiring, the expression is:

b0 b8 | b7 b6 b5 b4 b3 b2 b1 b9 (4.13)

Then, we can exchange b9 with b0 by using the circuit with two delays.

b9 b8 | b7 b6 b5 b4 b3 b2 b1 b0 (4.14)
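The sequence from Equation (4.11) to Equation (4.14) can be traced as three exchanges of bit labels. The sketch below only tracks the labels and does not model the delays; the helper `swap` is hypothetical.

```python
def swap(order, x, y):
    """Return a copy of `order` with the bit labels x and y exchanged."""
    out = list(order)
    i, j = out.index(x), out.index(y)
    out[i], out[j] = out[j], out[i]
    return out


# Order after the first commutator, Equation (4.11). The first two
# entries are the bits on the left-hand side of the vertical bar.
start = ["b0", "b7", "b9", "b6", "b5", "b4", "b3", "b2", "b1", "b8"]

step1 = swap(start, "b7", "b8")  # serial-parallel circuit, one delay  -> (4.12)
step2 = swap(step1, "b9", "b7")  # rewiring on the right-hand side     -> (4.13)
step3 = swap(step2, "b9", "b0")  # serial-parallel circuit, two delays -> (4.14)

if __name__ == "__main__":
    for label, order in (("4.12", step1), ("4.13", step2), ("4.14", step3)):
        print(label, " ".join(order[:2]), "|", " ".join(order[2:]))
```

The last line printed is the natural order b9 b8 | b7 ... b0, as required by the 256-point FFT.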

The other commutators (third and fourth) are mirrored versions of the first and second commutators, although they use a different data wordlength and the final commutator discards half of the outputs instead of saving them. More details will be discussed in Chapter 5.

4.1.4 Twiddle Factor Multiplication

The twiddle factor multiplication block is more complicated than those inside a typical FFT block because, in this architecture, it handles 256 of the 1024 samples in each clock cycle. Therefore, this block needs a two-bit control signal and a multiplexer to select the correct twiddle factor for every sample.


4.1.5 Filter Coefficient Multiplier

Three types of complex multipliers are implemented, as shown in Figure 4.4. The Gauss complex multiplication algorithm can be modified to use pre-computed inputs, in which case two adders can be removed, as shown in Figure 4.4(b). With pre-computed inputs, the computation energy is expected to decrease because the two additions are removed.
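The pre-computed variant can be sketched as follows for the product (a + jb)(c + jd), where c + jd is the coefficient. The factorization below is one common form of the Gauss algorithm and is consistent with the inputs c, c + d, and d − c shown in Figure 4.4(b); the exact wiring used in the thesis is an assumption, and the function names are hypothetical.

```python
def precompute(c, d):
    """Done once per coefficient c + jd: returns (c, c + d, d - c)."""
    return c, c + d, d - c


def gauss_multiply_precomputed(a, b, pre):
    """Multiply a + jb by a pre-computed coefficient.

    Uses three real multiplications and three additions at run time.
    """
    c, c_plus_d, d_minus_c = pre
    k1 = c * (a + b)
    k2 = a * d_minus_c
    k3 = b * c_plus_d
    return k1 - k3, k1 + k2  # (real part, imaginary part)


if __name__ == "__main__":
    pre = precompute(5, -2)
    print(gauss_multiply_precomputed(3, 4, pre))  # same as (3 + 4j) * (5 - 2j)
```

The two additions c + d and d − c are moved out of the data path, at the cost of storing three coefficient words instead of two.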

4.1.6 Coefficient Selector

When multiplying samples by filter coefficients, the correct filter coefficient must be selected, similar to the twiddle factor multiplication block. The difference is that the coefficient selector needs registers at every multiplier. Here, we assume that the four coefficients for one multiplier are set externally, so four registers and one multiplexer are added to the complex multiplier, as shown in Figure 4.5.
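A behavioural sketch of one such multiplier is given below: four coefficient registers, a multiplexer driven by a two-bit control signal, and a complex multiplier. The class and method names are hypothetical, and wordlengths and quantization are not modelled.

```python
class CoefficientMultiplier:
    """Behavioural model: four registers, one 4-to-1 selection, one multiplier."""

    def __init__(self):
        self.registers = [0j, 0j, 0j, 0j]

    def load(self, coefficients):
        """Set the four coefficients externally (done once at configuration)."""
        if len(coefficients) != 4:
            raise ValueError("exactly four coefficients are expected")
        self.registers = [complex(c) for c in coefficients]

    def multiply(self, sample, control):
        """Multiply `sample` by the register selected by the 2-bit control."""
        return sample * self.registers[control & 0b11]


if __name__ == "__main__":
    mult = CoefficientMultiplier()
    mult.load([1, 1j, -1, -1j])
    for control in range(4):
        print(control, mult.multiply(2 + 3j, control))
```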

4.1.7 Multiplexer

After the multiplication, the output wordlength of the multiplier is increased by the coefficient wordlength. To restore the original wordlength, a quantization multiplexer selects 12 bits from the filter output. This multiplexer gives the user of the system the flexibility to select bits according to the number of integer and fractional bits in the input data.

The multiplexer used in this architecture gives the user four choices through a two-bit external control signal. The first choice is the 12 MSBs of both the real and imaginary parts, and each subsequent choice moves the selected window down by one bit for both parts.

Figure 4.6 shows the architecture of the multiplexer.
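The bit selection can be sketched for one component (real or imaginary) as follows. This is an illustration under assumptions: the input is treated as a two's-complement word of `in_bits` bits, the window is taken without rounding or saturation, and the function name and parameters are hypothetical.

```python
def select_bits(value, in_bits, choice, out_bits=12):
    """Select an out_bits-wide window from an in_bits-wide signed word.

    choice 0 takes the out_bits most significant bits; each further
    choice (1, 2, 3) moves the window down by one bit.
    """
    if not 0 <= choice <= 3:
        raise ValueError("choice is a two-bit control signal")
    shift = in_bits - out_bits - choice
    if shift < 0:
        raise ValueError("window does not fit into the input word")
    # Python's >> on negative integers is an arithmetic shift.
    window = (value >> shift) & ((1 << out_bits) - 1)
    # Reinterpret the selected bit pattern as a signed number.
    if window >= 1 << (out_bits - 1):
        window -= 1 << out_bits
    return window


if __name__ == "__main__":
    for choice in range(4):
        print(choice, hex(select_bits(0x12345, 20, choice)))
```

Moving the window down drops MSBs, so in this sketch the user is responsible for choosing a setting under which the signal does not overflow.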

4.1.8 Global Counter and Control Signals

Synchronization is needed in this architecture because the correct twiddle factors and filter coefficients must be selected at the correct time when they are multiplied by the samples. In addition, each block has a different delay due to different pipelining. Therefore, we implement a global counter that generates the control signals for each block.

4.2 256-point FFT Based FIR Filter Architecture

Figure 4.7 shows an overview of the architecture. This architecture uses polynomial convolution [11]. It separates the impulse response into polyphase components and performs the FFT on them. Thus, the architecture performs an FFT on the input samples and applies a 4-tap FIR filter to each transformed sample. The inverse FFT is performed by interchanging the real and imaginary parts of the FIR filter outputs, performing an FFT on them, and interchanging the real and imaginary parts of the result, as mentioned in [3].

Figure 4.4: Multipliers used in this thesis. (a) Gauss complex multiplier. (b) Gauss complex multiplier with pre-computation. (c) Standard complex multiplier.

Figure 4.5: Architecture of one filter coefficient multiplier.

Figure 4.6: Multiplexer for quantization.

Figure 4.7: 256-point FFT based FIR filter architecture.

4.2.1 Top Level Estimation

In the 256-point FFT based FIR filter architecture, the complex multipliers in the FFT use constants, and the complex multipliers in the 4-tap FIR filters use configured constants because the filter coefficients are configured once before the system operates. Because the multipliers inside the FFT are constant multipliers, they can be optimized at the hardware level.

In the 256-point FFT block of this architecture, the input has a 12-bit data wordlength, and half of the outputs from the IFFT are discarded.

The number of operators in the FIR filters is computed by the following equations:

Complex adders = N (M − 1) (4.15)

Delay elements = N (M − 1) (4.16)

Complex mult. w/ configured const. = NM (4.17)

where M is the number of taps of each FIR filter and N is the number of FIR filters. Also, after each 4-tap FIR filter, there are two 2 × 1 multiplexers for quantization.

| Operators | Number |
|---|---|
| Complex adders | 4736 |
| Complex mult. w/ const. | 1792 |
| Complex mult. w/ configured const. | 1024 |
| 2 × 1 Multiplexers | 512 |
| Delay elements | 896 |

Table 4.2: Number of operators in the 256-point FFT based FIR filter.

The total number of each operator in the architecture can be computed by the following equations:

$$\text{Total complex adders} = 2N\log_2 N + N(M - 1) - \tfrac{N}{2} \tag{4.18}$$

$$\text{Total complex mult. w/ const.} = N(\log_2 N - 1) \tag{4.19}$$

$$\text{Total complex mult. w/ configured const.} = NM \tag{4.20}$$

$$\text{Total } 2 \times 1 \text{ multiplexers} = 2N \tag{4.21}$$

$$\text{Total delay elements} = \tfrac{N}{2} + N(M - 1) \tag{4.22}$$

The detailed number of operators can be seen in Table 4.2. Note that the exact numbers may differ depending on the optimization process. Each block is discussed in more detail below.
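Equations (4.18) to (4.22) can be checked against Table 4.2 in the same way, here with N = 256 filters and M = 4 taps. The function name is hypothetical.

```python
def operator_counts_256(n=256, m=4):
    """Evaluate Equations (4.18)-(4.22) for n FIR filters with m taps each."""
    log_n = n.bit_length() - 1  # log2(n) for a power of two
    return {
        "complex_adders": 2 * n * log_n + n * (m - 1) - n // 2,  # (4.18)
        "complex_mult_const": n * (log_n - 1),                   # (4.19)
        "complex_mult_configured_const": n * m,                  # (4.20)
        "mux_2x1": 2 * n,                                        # (4.21)
        "delay_elements": n // 2 + n * (m - 1),                  # (4.22)
    }


if __name__ == "__main__":
    for name, count in operator_counts_256().items():
        print(f"{name}: {count}")
```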

4.2.2 Fast Fourier Transform with Overlap-save Method

As discussed in Section 1.1, 128 samples must be processed in every clock cycle. According to [13], the FFT size needs to be twice the number of output samples required. This leads to a 256-point FFT with 128 overlapped samples for generating 128 output samples.

FFT based filtering uses the OLS method. The OLS method is implemented in the first block, as indicated in Figure 2.6. The block receives 128 new samples, each with 6 bits for the real part and 6 bits for the imaginary part. Six zeros are appended at the LSB end of both the real and imaginary parts, so each of these 128 samples becomes 24 bits wide. These 128 samples are then kept for one cycle. When the next 128 samples arrive, zeros are appended in the same way. The 128 new samples and the 128 retained samples together form the 256 samples that are fed into the 256-point FFT block. The purpose of this block is therefore to perform the OLS part of FFT based FIR filtering.
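The overlap-save principle described above can be sketched in floating point as follows. This is an illustration only: it uses a direct DFT instead of a pipelined FFT, it multiplies by a single frequency response instead of the 4-tap filters per bin used in this architecture, and it ignores wordlengths. All function names are hypothetical.

```python
import cmath


def dft(x, inverse=False):
    """Direct O(n^2) DFT, sufficient for a small illustration."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for m in range(n))
        out.append(acc / n if inverse else acc)
    return out


def overlap_save(x, h, new_per_block):
    """Filter x with h using blocks of new_per_block new samples.

    Each block holds the previous new_per_block samples followed by the
    new ones, so the transform size is twice the number of new samples.
    """
    if len(h) > new_per_block + 1:
        raise ValueError("filter too long for this block size")
    size = 2 * new_per_block
    h_freq = dft(list(h) + [0] * (size - len(h)))
    previous = [0] * new_per_block
    y = []
    for start in range(0, len(x), new_per_block):
        new = list(x[start:start + new_per_block])
        new += [0] * (new_per_block - len(new))
        block_freq = dft(previous + new)
        y_block = dft([a * b for a, b in zip(block_freq, h_freq)], inverse=True)
        y.extend(y_block[new_per_block:])  # discard the overlapped half
        previous = new
    return y[:len(x)]


if __name__ == "__main__":
    print(overlap_save([1, 2, 3, 4, 5, 6, 7, 8], [1, -1], 4))
```

With `new_per_block=128` the same function corresponds to the 256-point case, where 128 outputs are kept per block.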

4.2.3 4-tap FIR Filters

A standard 4-tap FIR filter contains three delay elements, four multipliers, and the adders that sum the multiplier outputs. The configuration of the summation is left to the synthesis tool; it could be a tree structure or a sequential structure. A generic 4-tap filter is constructed, which is then used in all other implementations of 4-tap filters. A general 4-tap filter is depicted in Figure 2.2.

4.2.3.1 Direct Form Implementation

The 4-tap filter in direct form consists of three delay elements. The following multipliers are used in separate implementations:

• With Gauss complex multiplier: The complex multiplier in this 4-tap filter uses the Gauss complex multiplication algorithm, as shown in Figure 4.4(a).

• With Gauss complex multiplier with precomputed inputs: The complex multiplier in this 4-tap filter uses the Gauss complex multiplication algorithm with precomputed inputs, as shown in Figure 4.4(b).

• With standard complex multiplier: The complex multiplier uses the standard complex multiplication algorithm, as shown in Figure 4.4(c).

The structure of the adder that sums the four multiplier outputs is left for the synthesis tool to determine. The filter is depicted in Figure 2.2.

4.2.3.2 Parallel Standard Structure

This filter consists of four generic 4-tap filters, each with a coefficient wordlength of 8 bits. This structure implements the standard complex multiplication algorithm. Two generic 4-tap filters take the real parts of the coefficients as filter coefficients, and the other two take the imaginary parts. One filter with real coefficient values and one filter with imaginary coefficient values are fed with the real input values, and the other two filters are fed with the imaginary input values. The outputs of these four generic 4-tap filters are then combined to form the outputs. This filter is depicted in Figure 4.8. The real and imaginary outputs are given by Equations (4.23) and (4.24), respectively, where a and b are the real and imaginary parts of the input.

$$a\,h_{re} - b\,h_{im} \tag{4.23}$$

$$a\,h_{im} + b\,h_{re} \tag{4.24}$$
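Equations (4.23) and (4.24) can be illustrated with four real-valued filters. The sketch assumes zero initial state and ignores wordlengths; `real_fir` and `parallel_standard` are hypothetical names.

```python
def real_fir(x, h):
    """Real-valued FIR filter y[n] = sum_k h[k] * x[n - k], zero initial state."""
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]


def parallel_standard(x_re, x_im, h_re, h_im):
    """Complex FIR filter built from four real filters, Equations (4.23)-(4.24)."""
    out_re = [p - q for p, q in zip(real_fir(x_re, h_re), real_fir(x_im, h_im))]
    out_im = [p + q for p, q in zip(real_fir(x_re, h_im), real_fir(x_im, h_re))]
    return out_re, out_im


if __name__ == "__main__":
    re, im = parallel_standard([1, 2, 3, 4], [0, -1, 2, -3], [1, 0, -2, 3], [2, 1, 0, -1])
    print(re)
    print(im)
```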

4.2.3.3 Parallel Gauss Structure

This filter implementation consists of three generic 4-tap filters. Two of the filters have a coefficient wordlength of 9 bits, and the remaining filter has a coefficient wordlength of 8 bits. This filter uses the Gauss complex multiplication algorithm. The two generic 4-tap filters with 9-bit coefficient wordlength use the sum and the difference of the real and imaginary parts of the coefficients as filter coefficients, and the remaining generic 4-tap filter with 8-bit coefficient wordlength uses the imaginary part of the coefficients. This can be seen in Figure 4.9.
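One way to arrange the three real filters is sketched below. The coefficient sets (sum, difference, and imaginary part) follow the description above, but which input feeds which filter is an assumption, since it cannot be read from the text; other equivalent arrangements exist. The names are hypothetical and wordlengths are ignored.

```python
def real_fir(x, h):
    """Real-valued FIR filter y[n] = sum_k h[k] * x[n - k], zero initial state."""
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]


def parallel_gauss(x_re, x_im, h_re, h_im):
    """Complex FIR filter built from three real filters (assumed wiring)."""
    h_sum = [r + i for r, i in zip(h_re, h_im)]    # 9-bit coefficients
    h_diff = [r - i for r, i in zip(h_re, h_im)]   # 9-bit coefficients
    x_diff = [a - b for a, b in zip(x_re, x_im)]   # adder before the filter
    shared = real_fir(x_diff, h_im)                # (a - b) * h_im
    from_re = real_fir(x_re, h_diff)               # a * (h_re - h_im)
    from_im = real_fir(x_im, h_sum)                # b * (h_re + h_im)
    out_re = [s + r for s, r in zip(shared, from_re)]  # a*h_re - b*h_im
    out_im = [s + i for s, i in zip(shared, from_im)]  # a*h_im + b*h_re
    return out_re, out_im


if __name__ == "__main__":
    re, im = parallel_gauss([1, 2, 3, 4], [0, -1, 2, -3], [1, 0, -2, 3], [2, 1, 0, -1])
    print(re)
    print(im)
```

In this arrangement one adder precedes a filter and one adder follows each output, which gives a path of two adders and one filter.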


Figure 4.8: Parallel standard structure.

Figure 4.9: Parallel Gauss structure.

4.2.4 Multiplexer

Similar to the 1024-point FFT based filter architecture, a quantization multiplexer is implemented after the 4-tap filter. The difference is that the output data wordlength of the 4-tap filter is 22 bits real and 22 bits imaginary, and the input data wordlength of the inverse FFT is 12 bits real and 12 bits imaginary. The architecture is the same as shown in Figure 4.6.

4.2.5 Inverse Fast Fourier Transform

The inverse FFT is performed by interchanging the real and imaginary parts, as discussed in Section 4.2. Also, because the OLS method is used for filtering in this FIR filter architecture, the overlapped outputs are discarded in the inverse FFT stage.
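The real and imaginary interchange can be checked numerically with a small sketch. It uses a direct DFT in floating point and applies the 1/N scaling explicitly, which a hardware implementation may handle differently; the function names are hypothetical.

```python
import cmath


def dft(x):
    """Direct O(n^2) forward DFT."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]


def swap_re_im(x):
    """Interchange the real and imaginary part of every sample."""
    return [complex(v.imag, v.real) for v in x]


def idft_via_swap(spectrum):
    """Inverse DFT computed with the forward DFT and two interchanges."""
    n = len(spectrum)
    return [v / n for v in swap_re_im(dft(swap_re_im(spectrum)))]


if __name__ == "__main__":
    x = [1 + 2j, -3 + 0.5j, 0.25 - 1j, 4j]
    print(idft_via_swap(dft(x)))  # close to x, up to rounding error
```

The same forward FFT hardware can therefore be reused for the inverse transform, since the interchange itself only requires rewiring.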

5 Result

This chapter presents synthesis results for the blocks used in the two architectures, including a comparison of the various options that have been considered. It also shows results at six different frequencies to investigate each block in more detail.

Note that the design is synthesized block by block. The whole design was not synthesized as one because it is too large for the available memory. Therefore, synthesis results of individual blocks are presented in this thesis. The most efficient option for each block in terms of power consumption is selected for each architecture. The comparison of the two architectures is discussed at the end of the chapter.

The synthesis is done in Design Compiler using a 65-nm low-power process and a 1.2 V supply voltage.

5.1 256-point Fast Fourier Transform

5.1.1 Radix-2 vs Radix-4 vs Radix-16

The choice of radix for the FFT can make a difference in the performance of the system. In this thesis, the 256-point FFT is implemented using radix-2, radix-4, and radix-16 butterflies. The radix-16 butterfly is based on the radix-4 butterfly. Each case includes a different number of complex multiplications. The three different algorithms can be represented by binary tree representations [18], as illustrated in Figure 5.1.

The basic number of complex multiplications is $N(\log_2 N - 1)$ for the radix-2 case and $\frac{N}{2}(\log_2 N - 1)$ for radix-4 and radix-16. However, we can exclude the trivial cases where the angle of rotation is 0°, 90°, 180°, or 270°. These rotations are just multiplications by 1, j, −1, and −j, respectively, and can be implemented easily in hardware by interchanging the real and imaginary parts or by changing signs [5].

Figure 5.1: Binary tree representations used for 256-point FFT. (a) Radix-2. (b) Radix-4. (c) Radix-16.

| Algorithm | Non-trivial rotations |
|---|---|
| Radix-2 | 878 |
| Radix-4 | 492 |
| Radix-16 | 480 |

Table 5.1: Number of non-trivial rotations in FFT having various radixes. The radix-16 case is based on a radix-4 butterfly.

Table 5.1 shows the number of non-trivial rotations. The radix-2 case has more non-trivial rotations, whereas the numbers for radix-4 and radix-16 are similar. The synthesis results of those FFTs show that the case with more non-trivial rotations has higher area usage and power consumption. This can be seen in Tables 5.2 and 5.3. Note that the radix-2 case uses two pipeline registers at every second stage, and the radix-4 and radix-16 cases use two pipeline registers at every stage. Details about pipelining are discussed in the next section.
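The trivial rotations mentioned above need no multiplier at all. A minimal sketch, with a hypothetical function name, shows that each of them reduces to an interchange of the real and imaginary parts and at most two sign changes:

```python
def rotate_trivial(re, im, quarter_turns):
    """Multiply re + j*im by j**quarter_turns without any multiplication."""
    turns = quarter_turns % 4
    if turns == 0:      # 0 degrees, multiplication by 1
        return re, im
    if turns == 1:      # 90 degrees, multiplication by j
        return -im, re
    if turns == 2:      # 180 degrees, multiplication by -1
        return -re, -im
    return im, -re      # 270 degrees, multiplication by -j


if __name__ == "__main__":
    for turns in range(4):
        print(turns * 90, rotate_trivial(3, 4, turns))
```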

Overall, the results show that the radix-16 case has the lowest area usage and power consumption at 476 MHz.

5.1.2 Pipelining

Various options for pipelining are available in the FFT. The most important place to introduce pipeline registers is near the twiddle factor multipliers, since complex multiplications take the most computation time of all the operations in the FFT block. In this section, radix-16 based on radix-4 butterflies is used because the previous section concludes that the radix-16 case is the most efficient of the three radixes.

When radix-4 butterflies are used, the FFT block has a total of four stages. Three options are considered in this thesis: two pipeline registers at every stage, one pipeline register at every stage, and one pipeline register only at the second stage. Figure 5.2 shows the three cases.

| Freq | 100 MHz | 200 MHz | 300 MHz | 476 MHz | 500 MHz | 667 MHz |
|---|---|---|---|---|---|---|
| Radix-2 | 209 | 419 | 637 | 910 | 1283 | 2189 |
| Radix-4 | 172 | 291 | 520 | 827 | 898 | 1362 |
| Radix-16 | 168 | 336 | 509 | 816 | 875 | 1372 |

Table 5.2: Synthesis results of power consumption for different radixes. All power numbers are in mW. The radix-16 case is based on a radix-4 butterfly. Two pipeline registers are used at every second stage in radix-2 and every stage in radix-4 and radix-16.

| Freq | 100 MHz | 200 MHz | 300 MHz | 476 MHz | 500 MHz | 667 MHz |
|---|---|---|---|---|---|---|
| Radix-2 | 1.746 | 1.746 | 1.760 | 1.911 | 1.976 | 2.311 |
| Radix-4 | 1.546 | 1.546 | 1.549 | 1.549 | 1.570 | 1.742 |
| Radix-16 | 1.508 | 1.508 | 1.508 | 1.520 | 1.540 | 1.732 |

Table 5.3: Synthesis results of area usage for different radixes. All area numbers are in mm². The radix-16 case is based on a radix-4 butterfly. Two pipeline registers are used at every second stage in radix-2 and every stage in radix-4 and radix-16.

| Freq | 100 MHz | 200 MHz | 300 MHz | 476 MHz | 500 MHz | 667 MHz |
|---|---|---|---|---|---|---|
| Two reg. at every stage | 168 | 336 | 509 | 816 | 875 | 1372 |
| One reg. at every stage | 155 | 308 | 471 | 1037 | 1095 | 2031 |
| One reg. at second stage | 178 | 378 | 930 | - | - | - |

Table 5.4: Synthesis results of power consumption for different pipelining methods. All power numbers are in mW. Basis of FFT is radix-16.

| Freq | 100 MHz | 200 MHz | 300 MHz | 476 MHz | 500 MHz | 667 MHz |
|---|---|---|---|---|---|---|
| Two reg. at every stage | 1.508 | 1.508 | 1.508 | 1.520 | 1.540 | 1.732 |
| One reg. at every stage | 1.344 | 1.343 | 1.351 | 1.736 | 1.756 | 2.066 |
| One reg. at second stage | 1.247 | 1.306 | 1.730 | - | - | - |

Table 5.5: Synthesis results of area usage for different pipelining methods. All area numbers are in mm². Basis of FFT is radix-16.

Tables 5.4 and 5.5 show the synthesis results of the FFT block with different pipelining. The results show that pipelining is useful when the speed requirement is high, but when a lower speed is required, the pipelined structure has higher area usage and power consumption due to the extra registers. Note that a dash in the tables means that the system did not meet the speed constraint.

5.1.3 FFT with Overlap-save Block and Inverse FFT

In the 256-point FFT based architecture, a 256-point FFT is performed first. As expected, the result for this block differs from that of the general FFT block: both power consumption and area usage are reduced compared with the normal FFT. This can be explained by the fact that when zeros are appended as explained in Section 4.2.2, the hardware size, such as adders and multipliers, in the initial stages is reduced. This leads to lower power consumption and area usage.

Figure 5.2: Different options of pipelining, when using radix-4 butterflies. (a) Two registers at every stage. (b) One register at every stage. (c) One register at the second stage. Dotted lines represent places of pipelining.

| Freq | 100 MHz | 200 MHz | 300 MHz | 476 MHz | 500 MHz | 667 MHz |
|---|---|---|---|---|---|---|
| FFT | 168 | 336 | 509 | 816 | 875 | 1372 |
| FFT w/ OLS. | 149 | 298 | 452 | 721 | 772 | 1163 |
| FFT w/ Post. | 164 | 328 | 496 | 795 | 854 | 1332 |

Table 5.6: Synthesis results of power consumption for various FFT blocks. All power numbers are in mW. Radix-16 pipelined FFT, based on a radix-4 butterfly, is used.

| Freq | 100 MHz | 200 MHz | 300 MHz | 476 MHz | 500 MHz | 667 MHz |
|---|---|---|---|---|---|---|
| FFT | 1.508 | 1.508 | 1.508 | 1.520 | 1.540 | 1.732 |
| FFT w/ OLS. | 1.33 | 1.33 | 1.33 | 1.35 | 1.36 | 1.52 |
| FFT w/ Post. | 1.46 | 1.46 | 1.46 | 1.48 | 1.49 | 1.69 |

Table 5.7: Synthesis results of area usage for various FFT blocks. All area numbers are in mm². Radix-16 pipelined FFT, based on a radix-4 butterfly, is used.

Also, after the inverse FFT, 128 samples are rejected. Half of the butterflies in the last stage can therefore be removed, which leads to lower power consumption and lower area usage than a normal FFT block. The results can be found in Tables 5.6 and 5.7.

5.2 Complex Multiplier

In the 1024-point FFT based architecture, complex multipliers are used to multiply transformed input samples by filter coefficients. There are three choices for the complex multipliers:

• Gauss complex multiplier: Complex multiplier using the Gauss complex multiplication algorithm, as shown in Figure 4.4(a).

• Gauss complex multiplier with precomputed inputs: Using the same Gauss complex multiplication algorithm with precomputation, as shown in Figure 4.4(b).

• Standard complex multiplier: The standard way of computation, as shown in Figure 4.4(c).

As discussed in Section 2.6, the Gauss complex multiplication algorithm uses three real multiplications and five additions, while the standard complex multiplication algorithm uses four real multiplications and two additions.
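The two operation counts can be read directly from a plain formulation of both algorithms. The sketch below uses one common form of the Gauss algorithm; the function names are hypothetical.

```python
def standard_multiply(a, b, c, d):
    """(a + jb)(c + jd) with four real multiplications and two additions."""
    real = a * c - b * d   # 2 multiplications, 1 addition
    imag = a * d + b * c   # 2 multiplications, 1 addition
    return real, imag


def gauss_multiply(a, b, c, d):
    """(a + jb)(c + jd) with three real multiplications and five additions."""
    k1 = c * (a + b)       # addition 1, multiplication 1
    k2 = a * (d - c)       # addition 2, multiplication 2
    k3 = b * (c + d)       # addition 3, multiplication 3
    return k1 - k3, k1 + k2  # additions 4 and 5


if __name__ == "__main__":
    print(standard_multiply(3, 4, 5, -2))
    print(gauss_multiply(3, 4, 5, -2))
```

Both functions return the same product; the Gauss form trades one multiplication for three extra additions.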

Table 5.8 shows the synthesis results of area usage for the three multipliers. Even though the pre-computed Gauss complex multiplier removes two additions, it shows higher area usage. The reason is that the pre-computed multiplier uses three coefficient inputs, of which two are 9 bits wide and the remaining one is 8 bits wide, so it requires larger registers and multiplexers. Meanwhile, the standard complex multiplier performs better than the Gauss complex multiplier because of its simplicity. The synthesis results of power consumption are discussed in the next section.

| Freq | 100 MHz | 200 MHz | 300 MHz | 476 MHz | 500 MHz | 667 MHz |
|---|---|---|---|---|---|---|
| Gauss mult. | 3861 | 3861 | 3861 | 4692 | 4748 | 5445 |
| Pre. Gauss mult. | 4125 | 4125 | 4125 | 4739 | 4802 | 5289 |
| Standard mult. | 3756 | 3756 | 3756 | 4352 | 4380 | 5487 |

Table 5.8: Synthesis results of area usage for the three complex multipliers. All area numbers are in µm².

5.2.1 Power Estimation with Random Coefficients

The coefficients of the FIR filter are determined by Equation (2.1). However, this thesis does not handle the specific parameters. Instead, the coefficients are chosen randomly to evaluate the performance of the complex multipliers.

However, a single random set of coefficients is not sufficient, because the choice of the four coefficients affects the power result. If the number of bit transitions between coefficients is small, the power consumption in the gates is lower, and if the number of transitions is high, the power consumption is higher. It is therefore necessary to account for this variation. To do this, 99 different sets of coefficients are simulated for each type of multiplier, and a histogram is made. Figure 5.3 shows the result.
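The dependence on the coefficient set can be illustrated with a simple proxy: the number of bits that toggle when the multiplier cycles through its four coefficients. This is only an illustration of why the sets differ and not the power estimation flow used in the thesis. The packing of a coefficient into a single `bits`-wide word, the seed, and all function names are assumptions.

```python
import random
import statistics


def hamming(a, b):
    """Number of bit positions in which two words differ."""
    return bin(a ^ b).count("1")


def transitions(coeff_set):
    """Bit toggles when cycling once through the set, including wrap-around."""
    count = len(coeff_set)
    return sum(hamming(coeff_set[i], coeff_set[(i + 1) % count]) for i in range(count))


def survey(num_sets=99, bits=16, seed=0):
    """Minimum, median, and maximum toggle count over random coefficient sets."""
    rng = random.Random(seed)
    counts = [
        transitions([rng.getrandbits(bits) for _ in range(4)])
        for _ in range(num_sets)
    ]
    return min(counts), statistics.median(counts), max(counts)


if __name__ == "__main__":
    print(survey())
```

The spread between the minimum and maximum toggle counts mirrors the reason why a distribution over many coefficient sets is reported rather than a single value.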

The results show that the standard complex multiplier is more efficient than the Gauss complex multiplier because of its simplicity. Meanwhile, the pre-computed Gauss complex multiplier is the most efficient of the three multipliers. The detailed results for all the multipliers are shown in Tables 5.9 to 5.11.

5.3 4-tap FIR Filters

5.3.1 Pipelining

All the architectures have been implemented with pipelining. The effect of pipelining can be seen at higher frequencies, where the power consumption is reduced significantly. As expected, the power consumption at lower frequencies increases compared with the non-pipelined structures. The pipelining for the parallel Gauss structure is implemented as indicated in Figure 5.4(a). The critical path in this structure before pipelining is determined to be

$$2T_{adder} + T_{filter} \tag{5.1}$$

After pipelining, the critical path is reduced to

$$T_{adder} + T_{filter} \tag{5.2}$$

Figure 5.3: Histogram of power consumption of three complex multipliers (Pre-Gauss, Gauss, Standard) at 476 MHz. Power numbers are in mW.

| Freq | 100 MHz | 200 MHz | 300 MHz | 476 MHz | 500 MHz | 667 MHz |
|---|---|---|---|---|---|---|
| Maximum | 0.510 | 1.023 | 1.723 | 3.323 | 3.597 | 6.817 |
| Minimum | 0.385 | 0.770 | 1.308 | 2.750 | 2.984 | 5.681 |
| Median | 0.461 | 0.921 | 1.574 | 3.064 | 3.316 | 6.291 |

Table 5.9: Synthesis results of power consumption for the Gauss complex multiplier. All power numbers are in mW.

| Freq | 100 MHz | 200 MHz | 300 MHz | 476 MHz | 500 MHz | 667 MHz |
|---|---|---|---|---|---|---|
| Maximum | 0.306 | 0.613 | 1.014 | 1.756 | 1.939 | 3.407 |
| Minimum | 0.262 | 0.523 | 0.878 | 1.729 | 1.825 | 3.238 |
| Median | 0.290 | 0.580 | 0.965 | 1.744 | 1.894 | 3.340 |

Table 5.10: Synthesis results of power consumption for the precomputed Gauss complex multiplier. All power numbers are in mW.

| Freq | 100 MHz | 200 MHz | 300 MHz | 476 MHz | 500 MHz | 667 MHz |
|---|---|---|---|---|---|---|
| Maximum | 0.502 | 1.005 | 1.696 | 3.010 | 3.230 | 5.241 |
| Minimum | 0.406 | 0.809 | 1.368 | 2.619 | 2.814 | 4.577 |
| Median | 0.455 | 0.909 | 1.560 | 2.847 | 3.055 | 4.951 |

Table 5.11: Synthesis results of power consumption for the standard complex multiplier. All power numbers are in mW.

Figure 5.4: Pipelined structures of 4-tap FIR filters. (a) Parallel Gauss structure with pipelining. (b) Parallel standard structure with pipelining. (c) Direct form structure with pipelining. Dotted lines represent places of pipelining.

| Freq | 100 MHz | 200 MHz | 300 MHz | 476 MHz | 500 MHz | 667 MHz |
|---|---|---|---|---|---|---|
| Area | 15475 | 15477 | 15420 | 16149 | 16217 | 18206 |
| Power | 1.167 | 2.334 | 3.540 | 5.350 | 5.820 | 11.490 |

Table 5.12: Synthesis results of the direct form with standard complex multipliers. All area numbers are in µm² and power numbers are in mW.

| Freq | 100 MHz | 200 MHz | 300 MHz | 476 MHz | 500 MHz | 667 MHz |
|---|---|---|---|---|---|---|
| Area | 17211 | 17211 | 17208 | 18832 | 18895 | 19371 |
| Power | 1.326 | 2.651 | 4.019 | 7.018 | 7.440 | 10.702 |

Table 5.13: Synthesis results of the direct form with standard complex multipliers when pipelining is applied. All area numbers are in µm² and power numbers are in mW.

In the direct form structure, pipelining is implemented as in Figure 5.4(c) for the 4-tap filters with standard complex multipliers and Gauss complex multipliers. As indicated in Tables 5.12 and 5.13, pipelining shows its positive effect at 667 MHz, where the power consumption of the pipelined structure is lower than that of the non-pipelined structure. This result is even more pronounced in the direct form filter with Gauss complex multipliers. In Tables 5.14 and 5.15, we can see that at 476 MHz and above, the power consumption is lower than in the non-pipelined structure, and much lower at 667 MHz (14.139 mW versus 25.062 mW). Therefore, the pipelined structure leads to much higher power savings at higher frequencies.

The direct form structure with standard complex multipliers also shows the same effect at 667 MHz: the power consumption is lower than in the non-pipelined version. This can be seen in Tables 5.12 and 5.13.

Regarding area, pipelining also reduces the area usage when operating at higher frequencies. This effect is pronounced in the direct form structure with Gauss complex multipliers, where the area usage is significantly reduced at 667 MHz. As expected, the area usage is higher at lower frequencies, because the extra registers are added even though the speedup of the circuit is not required.

The parallel standard structure is also implemented with pipelining, as indicated in Figure 5.4(b). The critical path in this structure is determined to be

$$T_{adder} + T_{filter} \tag{5.3}$$

In the parallel standard structure, pipelining shows the same result. At higher frequencies, both power consumption and area usage are reduced, whereas at lower frequencies, the pipelined version has both higher area usage and higher power consumption.

The pipelined parallel Gauss structure also performs better. One reason could be that the shortened critical path makes the circuit more efficient. The area is reduced at 667 MHz. Figure 5.5 shows the effect of pipelining on power for the various structures.

| Freq | 100 MHz | 200 MHz | 300 MHz | 476 MHz | 500 MHz | 667 MHz |
|---|---|---|---|---|---|---|
| Area | 16904 | 16904 | 16415 | 21198 | 21724 | 31737 |
| Power | 1.224 | 2.448 | 3.934 | 8.616 | 9.841 | 25.062 |

Table 5.14: Synthesis results of the direct form filter with Gauss complex multipliers. All area numbers are in µm² and power numbers are in mW.

| Freq | 100 MHz | 200 MHz | 300 MHz | 476 MHz | 500 MHz | 667 MHz |
|---|---|---|---|---|---|---|
| Area | 18118 | 18118 | 18190 | 21278 | 21373 | 22793 |
| Power | 1.424 | 2.848 | 4.382 | 8.335 | 8.888 | 14.139 |

Table 5.15: Synthesis results of the direct form filter with Gauss complex multipliers when pipelining is applied. All area numbers are in µm² and power numbers are in mW.

| Freq | 100 MHz | 200 MHz | 300 MHz | 476 MHz | 500 MHz | 667 MHz |
|---|---|---|---|---|---|---|
| Area | 15446 | 15716 | 15716 | 19575 | 20105 | 28959 |
| Power | 1.155 | 2.309 | 3.953 | 8.518 | 9.353 | 22.457 |

Table 5.16: Synthesis results of the direct form filter with Gauss complex multipliers when precomputation is applied. All area numbers are in µm² and power numbers are in mW.

Figure 5.5: Comparison of various pipelining results. Four panels (parallel standard structure, parallel Gauss structure, direct form standard structure, direct form Gauss structure) plot power (mW) against frequency (MHz) for the non-pipelined and pipelined versions.

| Freq | 100 MHz | 200 MHz | 300 MHz | 476 MHz | 500 MHz | 667 MHz |
|---|---|---|---|---|---|---|
| Area | 14876 | 14876 | 14395 | 15472 | 15611 | 18236 |
| Power | 1.278 | 2.468 | 3.729 | 5.701 | 7.946 | 12.385 |

Table 5.17: Synthesis results of the parallel Gauss structure. All area numbers are in µm² and power numbers are in mW.

| Freq | 100 MHz | 200 MHz | 300 MHz | 476 MHz | 500 MHz | 667 MHz |
|---|---|---|---|---|---|---|
| Area | 15162 | 15162 | 14790 | 16049 | 16094 | 17116 |
| Power | 1.219 | 2.437 | 3.670 | 5.912 | 6.209 | 10.566 |

Table 5.18: Synthesis results of the parallel Gauss structure when pipelining is applied. All area numbers are in µm² and power numbers are in mW.

5.3.2 Gauss Complex Multiplication Algorithm Based FIR Filters

Two filters are implemented based on the Gauss complex multiplication algorithm. The first uses Gauss complex multipliers in the direct form structure, and the other is the parallel Gauss structure. The parallel Gauss structure uses three 4-tap filters: two whose coefficients are the sum and the difference of the real and imaginary coefficients, and one with the imaginary coefficients. The results show that the parallel Gauss structure is the better of the two implementations. The results of both implementations can be seen in Tables 5.14 and 5.17.

This can be explained by the fact that the Gauss complex multipliers in the direct form implementation consist of three stages, and the output depends on all three. Therefore, when the filter is made to operate at a high frequency, the power consumption increases considerably because the computations must pass through three stages at very high speed. The parallel Gauss structure, on the other hand, consists of three real-valued 4-tap filters with real multipliers inside. This means that the parallel Gauss structure, although it uses the same Gauss complex multiplication principle, is a much simpler structure.

Another implementation of the Gauss complex multiplication algorithm is the one with pre-computed inputs, as seen in Figure 4.4(b). Although two adders are removed in this multiplier to increase the performance, the improvement is small. This can be explained by the fact that the coefficients are constant, so the removed adders would operate only once when new coefficients arrive and remain static in the following cycles. Thus, not much improvement is seen with this algorithm.

5.3.3 Standard Complex Multiplication Algorithm Based FIR Filters

There are two filters based on the standard complex multiplication algorithm. The direct form structure with standard complex multipliers performs better than the parallel standard structure.


[Figure: two plots of power (mW) versus frequency (MHz). Left panel, "Gauss algorithm": direct form, direct form with pre-computation, parallel Gauss. Right panel, "Standard algorithm": parallel standard, direct form.]

Figure 5.6: Comparison of power consumption in the Gauss and standard algorithms.

Freq     100 MHz   200 MHz   300 MHz   476 MHz   500 MHz   667 MHz
Area       15766     15764     15519     17083     17574     20200
Power      1.245     2.492     3.721     6.380     7.056    13.520

Table 5.19: Synthesis results of the parallel standard structure. All area numbers are in µm² and power numbers are in mW.

Freq     100 MHz   200 MHz   300 MHz   476 MHz   500 MHz   667 MHz
Area       16435     16435     16435     17218     17412     18725
Power      1.368     2.736     4.138     6.352     6.692    11.053

Table 5.20: Synthesis results of the parallel standard structure when pipelining is applied. All area numbers are in µm² and power numbers are in mW.

The two architectures differ in the way the summation is performed: the order of the adders after the multipliers is different. As a result, one architecture consumes more power than the other. The difference in power consumption is depicted in Figure 5.6.

5.3.4 Summary of Results

The synthesis results of all 4-tap FIR filter implementations lead to the conclusion that the direct form with standard complex multipliers is the best filter to operate at 476 MHz. The overall architecture of this block is much simpler, and the dependency between blocks is minimal compared to the other 4-tap FIR filter structures.

The power estimation is performed multiple times in order to obtain the range of the power consumption. The histogram in Figure 5.7 shows the distribution of a set of 99 power consumption values of the direct form filter with standard complex multipliers. The minimum, maximum and median values of this implementation can be seen in Table 5.21.

[Figure: histogram of power consumption values, ranging from about 5.24 mW to 5.42 mW.]

Figure 5.7: Histogram of power consumption of the direct form filter with standard complex multipliers at 476 MHz. Power numbers are in mW.

Frequency   100 MHz   200 MHz   300 MHz   476 MHz   500 MHz   667 MHz
Median        1.166     2.333     3.537     5.340     5.810    11.475
Minimum       1.150     2.299     3.487     5.253     5.717    11.322
Maximum       1.179     2.361     3.578     5.415     5.891    11.636

Table 5.21: Power consumption range of the direct form filter with standard complex multipliers. All power numbers are in mW.
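As a rough way to read Table 5.21, the spread of the repeated power estimates can be expressed relative to the median. The sketch below does this for the values in the table; the metric itself (maximum minus minimum, divided by the median) is our own illustrative choice and is not defined in the thesis.

```python
# (minimum, median, maximum) power in mW, copied from Table 5.21.
power_range_mw = {
    100: (1.150, 1.166, 1.179),
    200: (2.299, 2.333, 2.361),
    300: (3.487, 3.537, 3.578),
    476: (5.253, 5.340, 5.415),
    500: (5.717, 5.810, 5.891),
    667: (11.322, 11.475, 11.636),
}


def relative_spread(minimum, median, maximum):
    """Width of the observed range as a fraction of the median."""
    return (maximum - minimum) / median


for freq_mhz, values in power_range_mw.items():
    print(f"{freq_mhz} MHz: spread = {100 * relative_spread(*values):.1f} % of median")
```

At 476 MHz the range is about 3% of the median, which gives a feeling for how much of the difference between two synthesis runs may be noise rather than a real improvement.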

5.3.5 Tap Configuration Results

The direct form structure with standard complex multipliers is the most power-efficient of all the filters, as discussed in Section 5.3.4. A 4-tap filter can also be used with only some of its taps switched on. Table 5.22 shows various tap configurations of the 4-tap filter and the related power consumption. When taps are switched off, the power consumption is reduced because the multipliers have lower switching activity. When all taps are switched off, the power consumption is quite low: the coefficients are set to zero, so all multipliers have minimal switching activity and all inputs to the adders are zero. The remaining dynamic power is consumed mostly by the delay elements.


Freq             100 MHz   200 MHz   300 MHz   476 MHz   500 MHz   667 MHz
All Taps On        1.167     2.334     3.540     5.350     5.820    11.490
One Tap Off        0.815     1.630     2.505     4.822     5.256    10.445
Two Taps Off       0.695     1.390     2.136     4.173     4.566     9.140
Three Taps Off     0.577     1.154     1.758     3.465     3.816     7.616
All Taps Off       0.311     0.621     0.986     1.813     1.987     3.572

Table 5.22: Comparison of power consumption of the direct form filter with standard complex multipliers with various tap configurations. All power numbers are in mW.
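The effect of switching taps off is easier to judge as a relative saving. The following sketch computes the saving with respect to the all-taps-on case at 476 MHz, using the numbers from Table 5.22; the variable and function names are illustrative only.

```python
# Power in mW at 476 MHz, copied from Table 5.22.
power_476_mw = {
    "All Taps On": 5.350,
    "One Tap Off": 4.822,
    "Two Taps Off": 4.173,
    "Three Taps Off": 3.465,
    "All Taps Off": 1.813,
}


def saving_percent(power_mw, reference_mw):
    """Power saving relative to the reference configuration, in percent."""
    return 100.0 * (1.0 - power_mw / reference_mw)


reference = power_476_mw["All Taps On"]
savings = {name: round(saving_percent(p, reference), 1)
           for name, p in power_476_mw.items()}

for name, value in savings.items():
    print(f"{name:15s} {value:5.1f} % saved")
```

The saving grows from roughly 10% with one tap off to roughly 66% with all taps off, so a floor of about one third of the power remains even when no multiplier is doing useful work.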

[Figure: block diagrams of the 4-point FFT outputs, the twiddle factor multipliers and the second commutator, annotated with data wordlengths of 12, 16 and 24 bits. (a) Twiddle factor multipliers are first. (b) Commutator is first.]

Figure 5.8: Different procedures of commutator.

5.4 Different Procedures of Commutator and Twiddle Factor Multiplication

The second commutator in the 1024-point FFT based architecture can be located either before or after the twiddle factor multiplication block. Both configurations give the same functional result but differ in efficiency, since the input data wordlength of the system is set to 12 bits while the output data wordlength of the twiddle factor multiplier needs to be 24 bits. Also, as seen in Figure 5.8, the first output of the 4-point FFT always has an angle of 0°, so it is possible to remove the complex multiplication from one of the four outputs.

Therefore, if the twiddle factor multiplication block comes first, the block keeps the same configuration regardless of the clock cycle and sends every first output of the 4-point FFT directly to the next stage. Meanwhile, the commutator has higher area usage and power consumption because its inputs have a 24-bit data wordlength.

On the other hand, if the commutator block comes first, the commutator block has lower area usage and power consumption, while the twiddle factor multiplication block becomes more complicated because it handles different operations depending on the clock cycle.

The results of the two configurations can be seen in Tables 5.23 and 5.24. 'Case1' in the tables refers to the case when the twiddle factor multiplication block comes first, and 'Case2' refers to the case when the commutator block comes first. As expected, whichever block is placed first has lower area usage and power consumption, for both the twiddle factor multiplication block and the commutator block. However, the savings in area usage and power consumption are greater when the twiddle factor multiplication block is placed first.

Freq               100 MHz   200 MHz   300 MHz   476 MHz   500 MHz   667 MHz
T.F. mult. first     0.492     0.492     0.508     0.550     0.551     0.581
T.F. mult. last      0.685     0.685     0.717     0.799     0.799     0.806
Comm. first          0.131     0.131     0.131     0.131     0.131     0.131
Comm. last           0.196     0.196     0.196     0.196     0.196     0.196
Case1 total          0.687     0.687     0.704     0.746     0.747     0.777
Case2 total          0.815     0.815     0.847     0.930     0.930     0.937

Table 5.23: Synthesis results of area usage for different procedures. All area numbers are in mm².

Freq               100 MHz   200 MHz   300 MHz   476 MHz   500 MHz   667 MHz
T.F. mult. first        65       129       201       349       355       514
T.F. mult. last         92       184       291       519       538       733
Comm. first             23        44        67       106       120       159
Comm. last              31        64        97       167       178       238
Case1 total             96       193       298       516       533       752
Case2 total            115       228       358       625       658       892

Table 5.24: Synthesis results of power consumption for different procedures. All power numbers are in mW.
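The case totals in Tables 5.23 and 5.24 follow directly from the block results: Case1 pairs the twiddle factor multiplier placed first with the commutator placed last, and Case2 pairs the commutator placed first with the twiddle factor multiplier placed last. The short sketch below reproduces the totals at 476 MHz and the resulting power saving; the dictionary keys are illustrative names.

```python
# Block results at 476 MHz, copied from Tables 5.23 (area, mm^2) and 5.24 (power, mW).
area_mm2 = {"tf_first": 0.550, "tf_last": 0.799, "comm_first": 0.131, "comm_last": 0.196}
power_mw = {"tf_first": 349, "tf_last": 519, "comm_first": 106, "comm_last": 167}


def case_totals(block):
    """Case1: T.F. multiplier first, commutator last. Case2: commutator first."""
    case1 = block["tf_first"] + block["comm_last"]
    case2 = block["comm_first"] + block["tf_last"]
    return case1, case2


area_case1, area_case2 = (round(v, 3) for v in case_totals(area_mm2))
power_case1, power_case2 = case_totals(power_mw)
power_saving_percent = round(100 * (1 - power_case1 / power_case2), 1)

print(f"Area : Case1 = {area_case1} mm^2, Case2 = {area_case2} mm^2")
print(f"Power: Case1 = {power_case1} mW, Case2 = {power_case2} mW "
      f"({power_saving_percent} % lower in Case1)")
```

At 476 MHz, placing the twiddle factor multiplication first lowers the combined power of the two blocks by roughly 17%.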

For the IFFT side, the same wordlength is used in both blocks. Therefore, there is no saving in area usage or power consumption in the commutator block, wherever it is located. However, a significant saving in area usage and power consumption is observed in the twiddle factor multiplication block, for the same reason as explained earlier. The results of the IFFT side can be seen in Tables 5.25 and 5.26.

Freq               100 MHz   200 MHz   300 MHz   476 MHz   500 MHz   667 MHz
T.F. mult. first     0.685     0.685     0.687     0.768     0.774     0.829
T.F. mult. last      0.935     0.935     0.943     1.032     1.040     1.146

Table 5.25: Synthesis results of area usage for the twiddle factor multiplication block in the IFFT side. All area numbers are in mm².

Freq               100 MHz   200 MHz   300 MHz   476 MHz   500 MHz   667 MHz
T.F. mult. first        86       172       261       465       492       762
T.F. mult. last        113       226       349       645       689      1124

Table 5.26: Synthesis results of power consumption for the twiddle factor multiplication block in the IFFT side. All power numbers are in mW.

5.5 Different Cases in the First Commutator

         Case 1   Case 2   Case 3   Case 4   Case 5   Case 6   Case 7
Area      0.551    0.600    0.697    0.592    0.586    0.578    5.644
Power       349      397      395      387      449      371      362

Table 5.27: Synthesis results of different cases for the twiddle factor multiplication block at 476 MHz frequency. All area numbers are in mm² and power numbers are in mW.

As discussed in Section 4.1.2, to place index bits 8 and 9 on the right-hand side, it is necessary to exchange bit 8 with another bit. The selection of this index bit can influence the efficiency of the twiddle factor multiplication block, because it leads to different transitions of the twiddle factors, while it makes no difference to the efficiency of the first commutator. The possible cases for the first commutator are:

Case 1: b0 b7 | b9 b6 b5 b4 b3 b2 b1 b8 (5.4)

Case 2: b1 b7 | b9 b6 b5 b4 b3 b2 b8 b0 (5.5)

Case 3: b2 b7 | b9 b6 b5 b4 b3 b8 b1 b0 (5.6)

Case 4: b3 b7 | b9 b6 b5 b4 b8 b2 b1 b0 (5.7)

Case 5: b4 b7 | b9 b6 b5 b8 b3 b2 b1 b0 (5.8)

Case 6: b5 b7 | b9 b6 b8 b4 b3 b2 b1 b0 (5.9)

Case 7: b6 b7 | b9 b8 b5 b4 b3 b2 b1 b0 (5.10)

The synthesis results of the above cases for the twiddle factor multiplication block are shown in Table 5.27. In terms of both area usage and power consumption, case 1 performs best because index bit 0 has the least effect on changes of angle in the twiddle factors.
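The seven cases in (5.4) to (5.10) all follow one pattern: index bit 8 is exchanged with one of the bits 0 to 6. The sketch below regenerates the orderings from that pattern and picks the case with the lowest power from Table 5.27. The base ordering `BASE_ORDER` is an assumption inferred from the listed cases, not taken from Section 4.1.2, and the function name is hypothetical.

```python
# Assumed ordering before the exchange, inferred from the pattern of (5.4)-(5.10).
BASE_ORDER = ["b8", "b7", "|", "b9", "b6", "b5", "b4", "b3", "b2", "b1", "b0"]

# Power in mW of the twiddle factor multiplication block, copied from Table 5.27.
POWER_MW = {1: 349, 2: 397, 3: 395, 4: 387, 5: 449, 6: 371, 7: 362}


def first_commutator_case(case):
    """Return the index-bit ordering for case 1..7 (bit 8 exchanged with bit case-1)."""
    partner = f"b{case - 1}"
    swap = {"b8": partner, partner: "b8"}
    return " ".join(swap.get(token, token) for token in BASE_ORDER)


for case in range(1, 8):
    print(f"Case {case}: {first_commutator_case(case)}")

best_case = min(POWER_MW, key=POWER_MW.get)
print(f"Lowest power: case {best_case} ({POWER_MW[best_case]} mW)")
```

The generated strings can be compared one by one with the expressions above, which is a quick consistency check on the listed permutations.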

5.6 Comparison between Two Architectures

Tables 5.28 and 5.29 summarize the 1024-point FFT based FIR filter architecture. Here, we picked the best result of each block, based on the 476 MHz frequency and the least power consumption:

1. 256-point FFT block: Radix-16 based on a radix-4 butterfly is used. Two pipeline registers at every stage are used.

2. Complex multipliers: Pre-computed Gauss complex multipliers are used.

3. Commutator and twiddle factor multiplication: The twiddle factor multiplication block comes first.

4. First commutator: Index bit 0 is exchanged with index bit 8.

Block                 No.   100 MHz   200 MHz   300 MHz   476 MHz   500 MHz   667 MHz
256-point FFT           2       336       672      1018      1632      1750      2746
4-point FFTs           64         5        11        16        25        27        35
4-point IFFTs          64        14        27        43        67        71       117
T.F. mult. (FFT)        1        65       129       201       349       355       514
T.F. mult. (IFFT)       1        86       172       261       465       492       762
Filter coeff. mult.   256        74       148       247       446       485       855
Commutator 1            1        19        37        56        92        97       129
Commutator 2            1        23        44        67       106       120       159
Commutator 3            1        31        64        97       167       178       238
Commutator 4            1        10        20        30        48        54        71
One stage Registers     6        54       108       162       276       288       294
Total                           717      1432      2198      3673      3917      5920

Table 5.28: Synthesis results of power consumption for the 1024-point FFT based FIR filter architecture. The numbers for each block are total power consumption. All power numbers are in mW.

Block                 No.   100 MHz   200 MHz   300 MHz   476 MHz   500 MHz   667 MHz
256-point FFT           2     3.016     3.016     3.016     3.040     3.080     3.464
4-point FFTs           64     0.066     0.066     0.066     0.066     0.066     0.066
4-point IFFTs          64     0.118     0.118     0.118     0.118     0.118     0.137
T.F. mult. (FFT)        1     0.492     0.492     0.508     0.551     0.551     0.581
T.F. mult. (IFFT)       1     0.685     0.685     0.687     0.768     0.774     0.829
Filter coeff. mult.   256     1.056     1.056     1.056     1.213     1.230     1.354
Commutator 1            1     0.109     0.109     0.109     0.109     0.109     0.109
Commutator 2            1     0.131     0.131     0.131     0.131     0.131     0.131
Commutator 3            1     0.196     0.196     0.196     0.196     0.196     0.196
Commutator 4            1     0.062     0.062     0.062     0.062     0.062     0.062
One stage Registers     6     0.048     0.048     0.048     0.048     0.048     0.048
Total                         6.074     6.074     6.074     6.397     6.460     7.072

Table 5.29: Synthesis results of area usage for the 1024-point FFT based filter architecture. The numbers for each block are total area usage. All area numbers are in mm².

Tables 5.30 and 5.31 summarize the 256-point FFT based FIR filter architecture. Here too, we picked the best result of each block, based on the 476 MHz frequency and the least power consumption:

1. 256-point FFT block: Radix-16 based on a radix-4 butterfly is used. The OLS method is considered and the input data wordlength is 12 bits.

2. 4-tap FIR filters: The direct form structure with standard complex multipliers is used.

3. 256-point IFFT block: Radix-16 based on a radix-4 butterfly is used. At the output, 128 samples are rejected.

Block                 No.   100 MHz   200 MHz   300 MHz   476 MHz   500 MHz   667 MHz
256-point FFT           1       149       298       452       721       772      1163
4-Tap Filters         256       298       597       906      1367      1487      2938
256-point IFFT          1       164       328       497       795       855      1333
One stage Registers     2        18        36        54        92        96        98
Total                           630      1259      1908      2976      3210      5532

Table 5.30: Synthesis results of power consumption for the 256-point FFT based FIR filter architecture. The numbers for each block are total power consumption. All power numbers are in mW.

Block                 No.   100 MHz   200 MHz   300 MHz   476 MHz   500 MHz   667 MHz
256-point FFT           1     1.337     1.337     1.337     1.350     1.366     1.521
4-Tap Filters         256     3.962     3.962     3.948     4.134     4.152     4.661
256-point IFFT          1     1.468     1.468     1.468     1.483     1.499     1.690
One stage Registers     2     0.096     0.096     0.096     0.096     0.096     0.096
Total                         6.863     6.863     6.848     7.064     7.112     7.967

Table 5.31: Synthesis results of area usage for the 256-point FFT based FIR filter architecture. The numbers for each block are total area usage. All area numbers are in mm².

The total area usage and power consumption of the 1024-point FFT based FIR filter architecture are calculated according to Equations (5.11) and (5.12), respectively. Registers between two blocks are also considered. Note that small blocks, such as the control block and the multiplexers for bit selection, are not included because they consume less than 0.1% of the total power.

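\[
\begin{aligned}
\text{Total area} ={}& 2A_{\text{256 FFT}} + 64A_{\text{4 FFT}} + 64A_{\text{4 IFFT}} + 256A_{\text{Coeffmult}} + A_{\text{comm1}} + A_{\text{comm2}} \\
&+ A_{\text{comm3}} + A_{\text{comm4}} + A_{\text{TFmult\_FFT}} + A_{\text{TFmult\_IFFT}} + 6A_{\text{reg}}
\end{aligned}
\tag{5.11}
\]

\[
\begin{aligned}
\text{Total power} ={}& 2P_{\text{256 FFT}} + 64P_{\text{4 FFT}} + 64P_{\text{4 IFFT}} + 256P_{\text{Coeffmult}} + P_{\text{comm1}} + P_{\text{comm2}} \\
&+ P_{\text{comm3}} + P_{\text{comm4}} + P_{\text{TFmult\_FFT}} + P_{\text{TFmult\_IFFT}} + 6P_{\text{reg}}
\end{aligned}
\tag{5.12}
\]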

Similarly, the total area usage and power consumption of the 256-point FFT based FIR filter architecture are calculated based on Equations (5.13) and (5.14), respectively.

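\[
\text{Total area} = A_{\text{256 FFT}} + 256A_{\text{4-tap filter}} + A_{\text{256 IFFT}} + 2A_{\text{reg}} \tag{5.13}
\]

\[
\text{Total power} = P_{\text{256 FFT}} + 256P_{\text{4-tap filter}} + P_{\text{256 IFFT}} + 2P_{\text{reg}} \tag{5.14}
\]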
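Because Tables 5.28 and 5.30 already list the total contribution of each block (the instance count is included), the power totals of Equations (5.12) and (5.14) reduce to a plain sum of the table rows. The sketch below checks this at 476 MHz; the dictionary names are illustrative, and the 256-point total may differ from the tabulated value by one unit in the last digit because the rows are rounded.

```python
# Per-block total power in mW at 476 MHz, copied from Table 5.28 (1024-point architecture).
power_1024_mw = {
    "256-point FFT (x2)": 1632,
    "4-point FFTs (x64)": 25,
    "4-point IFFTs (x64)": 67,
    "T.F. mult. (FFT)": 349,
    "T.F. mult. (IFFT)": 465,
    "Filter coeff. mult. (x256)": 446,
    "Commutator 1": 92,
    "Commutator 2": 106,
    "Commutator 3": 167,
    "Commutator 4": 48,
    "One stage registers (x6)": 276,
}

# Per-block total power in mW at 476 MHz, copied from Table 5.30 (256-point architecture).
power_256_mw = {
    "256-point FFT": 721,
    "4-tap filters (x256)": 1367,
    "256-point IFFT": 795,
    "One stage registers (x2)": 92,
}

total_1024_mw = sum(power_1024_mw.values())   # Equation (5.12)
total_256_mw = sum(power_256_mw.values())     # Equation (5.14)

print(f"1024-point FFT based filter: {total_1024_mw} mW (table: 3673 mW)")
print(f"256-point FFT based filter : {total_256_mw} mW (table: 2976 mW)")
```

The same pattern applies to the area equations, with the area rows of Tables 5.29 and 5.31 in place of the power rows.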


Architecture                   100 MHz   200 MHz   300 MHz   476 MHz   500 MHz   667 MHz
1024-point FFT based filter        717      1432      2198      3673      3917      5920
256-point FFT based filter         630      1259      1908      2976      3210      5532

Table 5.32: Comparison of power consumption for both architectures. All power numbers are in mW.

Architecture                   100 MHz   200 MHz   300 MHz   476 MHz   500 MHz   667 MHz
1024-point FFT based filter      6.074     6.074     6.074     6.397     6.460     7.072
256-point FFT based filter       6.863     6.863     6.848     7.064     7.112     7.967

Table 5.33: Comparison of area usage for both architectures. All area numbers are in mm².

Architecture                             Area [mm²]   Power [W]
1024-point FFT based filter                    6.40        3.67
256-point FFT based filter                     7.06        2.98
1024-point FFT based filter from [10]          21.6        7.32
256-point FFT based filter from [10]            7.6        3.12

Table 5.34: Comparison with previous work at 476 MHz frequency.

The results of both architectures can be compared in Tables 5.32 and 5.33. The 256-point FFT based FIR filter architecture has lower power consumption (Table 5.32), but a higher area usage than the 1024-point FFT based FIR filter architecture (Table 5.33).

5.7 Comparison with Previous Work

In addition to the comparison of the two filter architectures in this thesis, a comparison with previous work [10] is made, as shown in Table 5.34. The 1024-point FFT based filter architecture is found to be significantly more efficient than the one in the previous work: the area saving is about 70% and the power saving is about 50%. The 256-point FFT based filter architecture is slightly improved as well, with savings of about 7% in area usage and about 5% in power consumption.
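The quoted savings can be recomputed from Table 5.34 as one minus the ratio of the new result to the previous one. The sketch below does this for both architectures; the function and variable names are illustrative.

```python
# (area in mm^2, power in W) at 476 MHz, copied from Table 5.34.
this_work = {"1024-point": (6.40, 3.67), "256-point": (7.06, 2.98)}
previous_work = {"1024-point": (21.6, 7.32), "256-point": (7.6, 3.12)}


def saving_percent(new, old):
    """Relative saving of `new` with respect to `old`, in percent."""
    return 100.0 * (1.0 - new / old)


savings = {}
for name in this_work:
    area_new, power_new = this_work[name]
    area_old, power_old = previous_work[name]
    savings[name] = (round(saving_percent(area_new, area_old), 1),
                     round(saving_percent(power_new, power_old), 1))
    print(f"{name} FFT based filter: area saving {savings[name][0]} %, "
          f"power saving {savings[name][1]} %")
```

This gives about 70.4% and 49.9% for the 1024-point architecture and about 7.1% and 4.5% for the 256-point architecture, in line with the rounded figures quoted above.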

The savings in the 1024-point FFT based filter are due to the different implementation. In this thesis, a single FFT is used instead of two, even though both implementations use the same algorithm. This gives a considerable saving in both area usage and power consumption. The savings in the 256-point FFT based filter are due to the optimization of each block in the filter.


[Figure: two plots versus frequency (MHz). Left panel, "Area comparison": area (mm²). Right panel, "Power comparison": power (mW). Legend includes the 1024-point FFT based filter.]

Figure 5.9: Comparison of power consumption and area usage in both architectures.

6 Conclusion

6.1 Discussion

The goal of this thesis is to design and evaluate a new high-speed 512-tap FIR filter architecture. Therefore, all the blocks in the two architectures are optimized. As seen in Chapter 5, both architectures work at a frequency of 476 MHz, which is the target frequency needed to achieve a filter throughput of 60 GS/s.
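The relation between the 476 MHz clock and the 60 GS/s throughput can be checked with simple arithmetic. The sketch below assumes a fully parallel design that produces a fixed number of samples per clock cycle; the choice of 128 samples per cycle is our assumption, motivated by the 256-point transform with 128 rejected output samples, and is not stated explicitly in this chapter.

```python
import math

TARGET_THROUGHPUT = 60e9   # samples per second (60 GS/s)
CLOCK_FREQUENCY = 476e6    # Hz

# Minimum number of samples that must be produced per clock cycle.
min_parallelism = math.ceil(TARGET_THROUGHPUT / CLOCK_FREQUENCY)

# Assumed: the parallelism is rounded up to the next power of two.
parallelism = 1 << (min_parallelism - 1).bit_length()

achieved_throughput = parallelism * CLOCK_FREQUENCY

print(f"Minimum samples per cycle : {min_parallelism}")
print(f"Assumed samples per cycle : {parallelism}")
print(f"Resulting throughput      : {achieved_throughput / 1e9:.3f} GS/s")
```

Under this assumption, 128 samples per cycle at 476 MHz give about 60.9 GS/s, slightly above the target.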

The comparison shows that, at 476 MHz, the 1024-point FFT based FIR filter is better in terms of area and the 256-point FFT based FIR filter architecture is better in terms of power. Even though the area usage of the 4-tap FIR filters is higher, their power consumption is comparably low because of their fixed coefficients. In addition, the 1024-point FFT based FIR filter has more operators apart from the complex multipliers, which results in higher power consumption overall.

It is interesting to note that the multiplications in both architectures have a significant impact on the total power and area. In the 1024-point FFT based FIR filter, the three multiplication blocks consume about 34% of the total power. Meanwhile, the 4-tap FIR filter blocks in the 256-point FFT based FIR filter architecture consume about 46% of the total power. In this regard, if it were possible to simulate with the parameters of a chromatic dispersion filter, the results could be more accurate.
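The two shares quoted above follow from the 476 MHz columns of Tables 5.28 and 5.30. The sketch below recomputes them; the variable names are illustrative.

```python
# 1024-point architecture at 476 MHz (Table 5.28), power in mW.
tf_mult_fft_mw = 349
tf_mult_ifft_mw = 465
coeff_mult_mw = 446
total_1024_mw = 3673

# 256-point architecture at 476 MHz (Table 5.30), power in mW.
four_tap_filters_mw = 1367
total_256_mw = 2976

multiplier_share_1024 = round(
    100 * (tf_mult_fft_mw + tf_mult_ifft_mw + coeff_mult_mw) / total_1024_mw, 1)
filter_share_256 = round(100 * four_tap_filters_mw / total_256_mw, 1)

print(f"Multiplication blocks, 1024-point architecture: {multiplier_share_1024} %")
print(f"4-tap FIR filters, 256-point architecture     : {filter_share_256} %")
```

The results, about 34.3% and 45.9%, match the rounded percentages given in the text.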

Another point to mention is that the post-synthesis estimation is not fully accurate, since Design Compiler has some randomness when synthesizing a design. Also, we could not find a way to handle detailed net power estimation in big designs.

On the other hand, the results still provide a good estimate and show in detail the impact of the variations in each block.

In the 4-tap FIR filter, standard multiplication performs better, while in the filter multiplier, pre-computed Gauss multiplication performs better. This indicates that operating conditions have a significant impact on the power consumption of multipliers.


In the 256-point FFT, radix-16 performs the best among the three different radices, as it has a smaller number of non-trivial twiddle factor multiplications. It is expected that other radices with fewer non-trivial cases may give better results. Also, pipelining is a good method to reduce the power consumption at higher frequencies.

6.2 Future Work

In this thesis, a combination of 4-point FFTs and 256-point FFTs is used. However, there are many other FFT combinations, and it could be worthwhile to investigate other FFT configurations. Also, as discussed above, it would be worth trying other post-synthesis estimation methods or finding a way to handle all the detailed nets in a design, which would give more accurate results.

Bibliography

[1] Cisco. Cisco visual networking index: Forecast and methodology, 2016–2021. Cisco public, 2017. URL https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/complete-white-paper-c11-481360.pdf. Cited on page 1.

[2] J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90):297–297, 1965. doi: 10.1090/s0025-5718-1965-0178586-1. Cited on page 6.

[3] P. Duhamel, B. Piron, and J. M. Etcheto. On computing the inverse DFT. IEEE Transactions on Acoustics, Speech, and Signal Processing, 36(2):285–286, 1988. doi: 10.1109/29.1519. Cited on pages 15 and 22.

[4] A. Eghbali, H. Johansson, O. Gustafsson, and S. J. Savory. Optimal least-squares FIR digital filters for compensation of chromatic dispersion in digital coherent optical receivers. Journal of Lightwave Technology, 32(8):1449–1456, 2014. doi: 10.1109/JLT.2014.2307916. Cited on pages 2, 3, and 4.

[5] M. Garrido. A new representation of FFT algorithms using triangular matrices. IEEE Transactions on Circuits and Systems I: Regular Papers, 63(10):1737–1745, 2016. doi: 10.1109/TCSI.2016.2587822. Cited on pages 2 and 28.

[6] M. Garrido, J. Grajal, M. A. Sanchez, and O. Gustafsson. Pipelined radix-2^k feedforward FFT architectures. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 21(1):23–32, 2013. doi: 10.1109/TVLSI.2011.2178275. Cited on pages 2, 17, and 18.

[7] M. Garrido, M. Acevedo, A. Ehliar, and O. Gustafsson. Challenging the limits of FFT performance on FPGAs (invited paper). In International Symposium on Integrated Circuits (ISIC). IEEE, 2014. doi: 10.1109/isicir.2014.7029571. Cited on pages 2 and 9.

[8] M. Garrido, J. Grajal, and O. Gustafsson. Optimum circuits for bit-dimension permutations. In preparation. Cited on pages 17 and 18.


[9] A. Kovalev. Implementation and evaluation of two 512-tap complex FIR filter architectures for compensation of chromatic dispersion in optical networks. Master's thesis, Linköping University, Department of Electrical Engineering, 2017. Cited on pages 4, 9, and 15.

[10] A. Kovalev, O. Gustafsson, and M. Garrido. Implementation approaches for 512-tap 60 GSa/s chromatic dispersion FIR filters. In 51st Asilomar Conference on Signals, Systems, and Computers. IEEE, 2017. doi: 10.1109/acssc.2017.8335667. Cited on pages 2 and 45.

[11] H. Kwan and M. Tsim. High speed 1-D FIR digital filtering architectures using polynomial convolution. In IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 12, pages 1863–1866. Institute of Electrical and Electronics Engineers, 1987. doi: 10.1109/icassp.1987.1169501. Cited on page 20.

[12] C. Laperle and M. O'Sullivan. Advances in high-speed DACs, ADCs, and DSP for optical coherent transceivers. Journal of Lightwave Technology, 32(4):629–643, 2014. doi: 10.1109/JLT.2013.2284134. Cited on page 4.

[13] I.-S. Lin and S. K. Mitra. Fast FIR filtering algorithms based on overlapped block structure. In IEEE International Symposium on Circuits and Systems, page 363. IEEE, 1993. doi: 10.1109/iscas.1993.393733. Cited on page 23.

[14] P. A. Lynn and W. Fuerst. Introductory Digital Signal Processing with Computer Applications. Wiley, second edition, 1998. ISBN 0471976318. Cited on page 6.

[15] R. G. Lyons. Understanding Digital Signal Processing. Addison Wesley PUB CO INC, third edition, 2010. ISBN 0137027419. Cited on page 7.

[16] M. Parker. Digital Signal Processing 101: Everything You Need to Know to Get Started. Newnes, 2010. ISBN 9781856179218. Cited on page 6.

[17] J. G. Proakis and D. K. Manolakis. Digital Signal Processing. Pearson, fourth edition, 2006. ISBN 9780131873742. Cited on pages 5, 6, 8, and 9.

[18] F. Qureshi and O. Gustafsson. Generation of all radix-2 fast Fourier transform algorithms using binary trees. In 20th European Conference on Circuit Theory and Design (ECCTD), pages 677–680. IEEE, 2011. doi: 10.1109/ecctd.2011.6043634. Cited on pages 2 and 27.

[19] R. Ramaswami, K. N. Sivarajan, and G. H. Sasaki. Optical Networks: A Practical Perspective. Morgan Kaufmann Publishers INC, 2011. ISBN 0123740924. Cited on pages 1 and 3.

[20] T. E. Stern and K. Bala. Multiwavelength Optical Networks: A Layered Approach (Professional Computing). Prentice Hall, 1999. ISBN 020130967x. Cited on page 3.


[21] K. R. Stromberg. An Introduction to Classical Real Analysis (AMS Chelsea Publishing). American Mathematical Society, 2015. ISBN 978-1-4704-2544-9. Cited on page 11.

[22] L. Wanhammar. DSP Integrated Circuits. Academic PR INC, 1999. ISBN 0127345302. Cited on pages 4, 5, and 9.
