
Department of Computer Systems

Coarse Grain Reconfigurable Arrays Are Signal Processing Engines!

Digital Design for FPGA, TKT-1426, Lecture # 11

Waqar Hussain, Research Scientist

Tampere University of Technology, Finland


Electronic Products

Multifunction devices are becoming popular, in addition to being reliable and durable

Example: Mobile Phone

• The key selling features of a cell phone are size, weight, longer battery life, audio/video streaming and the ability to run several games

• Adaptability to many communication standards

• Expectations for Real Time performance

• No Limits to Human Desire



Embedded Technology

Embedded technology empowers a mobile phone to carry all these features.

An embedded system is intended for a specific use and consists of hardware capable of performing a set of different tasks with the help of software.

Example: Embedded System = RISC + Accelerator(s)



Why Coarse Grain Reconfigurable Arrays?

Computationally Intensive Kernels (CIKs) need to be accelerated in a Signal Processing System.

Examples of CIKs
1. FIR Filtering
2. Encoding and Decoding
   a) Viterbi
   b) Reed-Solomon
3. Matrix-Vector Multiplication
4. Fast Fourier Transform



Why Coarse Grain Reconfigurable Arrays?

Question: So why CGRAs, and why not traditional accelerators?

It is more desirable to use devices that can accelerate different kernels than traditional accelerators that were designed to accelerate only a single kernel.

Thanks to Reconfigurability!



Why Are CGRAs Powerful Engines?

Answer: Due to their structure!

• CGRAs offer high parallelism and throughput due to their array-based structure.
• Algorithms containing parallelism are the most suitable for mapping onto a CGRA.
• A CGRA can process large streams of data.
• The structural unit of a CGRA is an ALU, called a Processing Element (PE).
• Each PE is connected to other PEs using point-to-point interconnections or a Network on Chip (NoC).



CGRA in an Embedded System

An example of an embedded system is RISC + Accelerator(s)

RISC = COFFEE
Accelerator = BUTTER

Both COFFEE and BUTTER were designed at the Department of Computer Systems, Tampere University of Technology, Finland

BUTTER
A general-purpose Coarse Grain Reconfigurable Array (CGRA), which is a matrix of processing elements (PEs). Each PE is capable of performing a set of different tasks and is connected to the other PEs using point-to-point interconnections. BUTTER is capable of processing many computationally intensive kernels.



Problems with BUTTER!

BUTTER's presence in the system is expensive if it is not used most of the time

BUTTER occupies a large number of hardware resources

A general-purpose CGRA requires a few million FPGA gates



Solution

CREMA
A parameterized general-purpose CGRA used to generate special-purpose accelerators.


Category of Interconnections


Processing Elements in CREMA

Two Operand Registers

Decoder for Operation Selection

Supports Integer and Floating point operations

Blocks with dashed borders are scalable and selectable for instantiation

LUT for logical operations

Processing Element Template


CREMA based System

COFFEE for general-purpose processing

CREMA-generated accelerator for CIKs

Network of switched interconnections for faster data transfer between modules


Applications Mapped on CREMA and BUTTER

• Integer and Floating-point Matrix-Vector Multiplication
  Execution time compared with RISC and DSP
• 2D Low-Pass Image Filtering based on an Averaging Window
• FFT
  Satisfied execution time constraints for SISO and MIMO OFDM applications
  Resource utilization and execution time were compared with other state-of-the-art implementations
• W-CDMA cell search
  Execution time compared with a RISC core

In all of the above applications, CREMA, as a template-based device, required fewer resources for its generated accelerator than BUTTER


Application Mapping



Number Scaling

Very important, so that the signals don't overflow
• before processing >> scale down
• after processing >> scale up

If x[n] and y[n] are the input and output signals, then
scaling down = (x[n] / |max x[n]|) x 2^b
scaling up = (y[n] / 2^b) x |max x[n]|


Example

Consider a set of numbers
S = {-3, -2, -1, 0, 1, 2, 3}

Trying to compute -3 x 3 = -9 in 16-bit binary integer representation

Scaling Down
• S/|max S| = {-1, -0.6667, -0.3333, 0, 0.3333, 0.6667, 1}
• S/|max S| * 2^15 = {-3.2768, -2.1845, -1.0923, 0, 1.0923, 2.1845, 3.2768} * 10^4
• -32768 * 32768 = -1.0737 x 10^9
• After multiplication there is a shift operation: -1.0737 x 10^9 / 2^15 = -32768

Scaling Up
• The answer was -32768
• So (-32768 / 2^15) x 3 x 3 = -9, because each of the two operands was scaled down by |max S| = 3
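The scaling example can be reproduced with a minimal Python sketch. The helper names `scale_down` and `scale_up` are illustrative and not part of the lecture. Python integers never overflow, so the sketch only mirrors the arithmetic; on real 16-bit signed hardware the value +2^15 = 32768 is not representable and would have to saturate to 32767.

```python
B = 15  # number of fractional bits, as in the 2^15 example

def scale_down(x, peak, b=B):
    """Map x from [-peak, peak] to an integer in [-2**b, 2**b]."""
    return round(x / peak * 2**b)

def scale_up(y, peak, b=B):
    """Undo scale_down for a result y that carries one scale factor."""
    return y / 2**b * peak

S = [-3, -2, -1, 0, 1, 2, 3]
peak = max(abs(s) for s in S)                 # |max S| = 3
scaled = [scale_down(s, peak) for s in S]     # [-32768, -21845, ..., 32768]

# Fixed-point multiply: integer product followed by a right shift by B bits.
product = (scale_down(-3, peak) * scale_down(3, peak)) >> B   # -32768

# Both operands were scaled by peak, so the product is scaled up twice.
result = scale_up(product, peak) * peak       # -9.0
print(scaled, product, result)
```

For a linear operation such as a sum of input samples, a single call to `scale_up` is enough; the second factor of `peak` is only needed here because two scaled operands were multiplied.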



First Order Linear Constant Coefficient Difference Equation

y[n] = x[n-1] + x[n], n = 0, 1, 2, 3, …, N-1

[Figure: x[n] and its one-sample delayed copy (Z^-1) are added to give y[n]]
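The difference equation maps directly onto one delay register and one adder. A minimal Python sketch, assuming the initial condition x[-1] = 0 (the slide does not state one) and using the illustrative name `first_order`:

```python
def first_order(x):
    """Compute y[n] = x[n-1] + x[n], assuming x[-1] = 0."""
    prev = 0            # contents of the Z^-1 delay register
    y = []
    for sample in x:
        y.append(prev + sample)
        prev = sample   # the current sample becomes x[n-1] for the next step
    return y

print(first_order([1, 2, 3, 4]))  # [1, 3, 5, 7]
```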


Finite Impulse Response Filtering

Transfer Function of the Filter

There is no feedback so N = 0

FIR Structure

[Figure: FIR structure with input x[n], a chain of Z^-1 delay elements, coefficients b(0), b(1), b(2), …, b(M-1), adders, and output y[n]]
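A direct-form FIR filter with taps b(0) … b(M-1) computes y[n] = b(0) x[n] + b(1) x[n-1] + … + b(M-1) x[n-M+1]. The Python sketch below is one way to write that sum of products. It assumes all samples before n = 0 are zero, and the name `fir_filter` is illustrative.

```python
def fir_filter(x, b):
    """Direct-form FIR: y[n] = sum over k of b[k] * x[n-k]."""
    delay = [0] * len(b)    # delay[k] holds x[n-k]; starts cleared
    y = []
    for sample in x:
        delay = [sample] + delay[:-1]                        # shift the delay line
        y.append(sum(bk * dk for bk, dk in zip(b, delay)))   # multiply-accumulate
    return y

# Feeding a unit impulse returns the coefficients themselves.
print(fir_filter([1, 0, 0, 0], [3, 2, 1]))  # [3, 2, 1, 0]
```

With `b = [1, 1]` the filter reduces to the first-order equation y[n] = x[n-1] + x[n].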


Polynomial Division

Very important and used many times in Signal Processing
Example: Encoding process of Reed-Solomon codes

The best way of doing it is by using a Linear Feedback Shift Register (LFSR)
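To show how a shift register performs polynomial division, here is a minimal Python sketch over GF(2), where addition is XOR. It uses the textbook division circuit with the divisor g(X) = X^3 + X + 1, chosen only for illustration; the function name `lfsr_remainder` is an assumption, not something from the lecture.

```python
def lfsr_remainder(dividend_bits, g_low):
    """Remainder of a GF(2) polynomial divided by a monic g(X).

    dividend_bits: coefficients of the dividend, highest degree first.
    g_low: coefficients g_0 .. g_{r-1} of g(X); the leading 1 is implicit.
    """
    r = len(g_low)
    reg = [0] * r                   # reg[i] holds the coefficient of X^i
    for bit in dividend_bits:
        feedback = reg[-1]          # coefficient about to be shifted out
        reg = [bit ^ (feedback & g_low[0])] + [
            reg[i - 1] ^ (feedback & g_low[i]) for i in range(1, r)
        ]
    return reg

g_low = [1, 1, 0]                              # g(X) = 1 + X + X^3
print(lfsr_remainder([1, 0, 0, 0], g_low))     # X^3 mod g(X) = 1 + X -> [1, 1, 0]
```

Each loop iteration corresponds to one clock cycle: the register shifts by one stage and the feedback bit is subtracted wherever g(X) has a non-zero coefficient.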



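Reed-Solomon Codes: Encoding in Systematic Form, (7, 3) Example

Message symbols: 010 110 111 = \alpha^1 \alpha^3 \alpha^5

m(X) = \alpha^1 + \alpha^3 X + \alpha^5 X^2
X^{n-k} m(X) = q(X) g(X) + p(X)
p(X) = X^{n-k} m(X) modulo g(X)
U(X) = p(X) + X^{n-k} m(X)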


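Encoding in Systematic Form

p(X) = \alpha^0 + \alpha^2 X + \alpha^4 X^2 + \alpha^6 X^3

U(X) = \alpha^0 + \alpha^2 X + \alpha^4 X^2 + \alpha^6 X^3 + \alpha^1 X^4 + \alpha^3 X^5 + \alpha^5 X^6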


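Systematic Encoding with an (n-k)-Stage Shift Register

[Figure: LFSR encoder with stages X^0, X^1, X^2, X^3, X^4, feedback multipliers \alpha^3, \alpha^1, \alpha^0, \alpha^3, and input symbols \alpha^1, \alpha^3, \alpha^5]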



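INPUT QUEUE                  CLOCK CYCLE   REGISTER CONTENTS                     FEEDBACK
\alpha^1 \alpha^3 \alpha^5   0             0 0 0 0                               \alpha^5
\alpha^1 \alpha^3            1             \alpha^1 \alpha^6 \alpha^5 \alpha^1   \alpha^0
\alpha^1                     2             \alpha^3 0 \alpha^2 \alpha^2          \alpha^4
-                            3             \alpha^0 \alpha^2 \alpha^4 \alpha^6   -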


Message arrives and the LFSR is reset

Systematic Encoding with an (n-k)-Stage Shift Register

[Figure: LFSR with stages X^0 to X^4 and feedback multipliers 110 010 100 110]
• Input queue: 010 110 111 (symbol 111 enters first)
• Register contents: 000 000 000 000
• Feedback: 000


1st clock cycle in LFSR

[Figure: LFSR with stages X^0 to X^4 and feedback multipliers 110 010 100 110]
• Input queue: 010 110
• Register contents: 000 000 000 000
• Feedback: 111


2nd clock cycle in LFSR

[Figure: LFSR with stages X^0 to X^4 and feedback multipliers 110 010 100 110]
• Input queue: 010
• Register contents: 010 101 111 010
• Feedback: 100


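3rd clock cycle in LFSR

[Figure: LFSR with stages X^0 to X^4 and feedback multipliers 110 010 100 110]
• Input queue: empty
• Register contents: 110 000 001 001
• Feedback: 011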


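4th clock cycle in LFSR

[Figure: LFSR with stages X^0 to X^4 and feedback multipliers 110 010 100 110]
• Input queue: empty
• Register contents: 100 001 011 101
• Feedback: none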


The parity bits 100 001 011 101 will come out of the LFSR serially

[Figure: LFSR with stages X^0 to X^4, feedback multipliers 110 010 100 110, and register contents 100 001 011 101]



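U(X) = \sum_{n=0}^{6} u_n X^n

U(X) = \alpha^0 + \alpha^2 X + \alpha^4 X^2 + \alpha^6 X^3 + \alpha^1 X^4 + \alpha^3 X^5 + \alpha^5 X^6
     = (100) + (001) X + (011) X^2 + (101) X^3 + (010) X^4 + (110) X^5 + (111) X^6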
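The whole (7, 3) example can be checked with a short Python sketch of the (n-k)-stage LFSR over GF(8). It assumes the field is built from \alpha^3 = 1 + \alpha, which is what the 3-bit symbols imply (\alpha^3 = 110 when the \alpha^0 coefficient is written first). All function names are illustrative.

```python
# GF(8) antilog table: EXP[i] = alpha^i, bit j of the integer = coefficient of alpha^j.
EXP = [1]
for _ in range(6):
    v = EXP[-1] << 1
    EXP.append(v ^ 0b1011 if v & 0b1000 else v)   # reduce with X^3 + X + 1
LOG = {v: i for i, v in enumerate(EXP)}

def gf_mul(a, b):
    """Multiply two GF(8) elements through their logarithms."""
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % 7]

def bits(v):
    """Format an element with the alpha^0 coefficient first."""
    return "".join(str((v >> j) & 1) for j in range(3))

def lfsr_encode(message, g):
    """Return the register contents after each clock cycle."""
    reg = [0] * len(g)
    history = []
    for symbol in message:
        feedback = reg[-1] ^ symbol               # addition in GF(8) is XOR
        reg = [gf_mul(feedback, g[0])] + [
            reg[i - 1] ^ gf_mul(feedback, g[i]) for i in range(1, len(g))
        ]
        history.append(reg)
    return history

g = [EXP[3], EXP[1], EXP[0], EXP[3]]   # feedback multipliers g_0 .. g_3
message = [EXP[5], EXP[3], EXP[1]]     # symbols in the order they enter the LFSR
history = lfsr_encode(message, g)
parity = history[-1]
codeword = parity + message[::-1]      # p(X) coefficients, then m(X) coefficients
print(" ".join(bits(v) for v in codeword))
# 100 001 011 101 010 110 111
```

The three entries of `history` correspond to the register contents after the first, second and third message symbols have entered the register.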



Correlation

The slot timing synchronization in W-CDMA cell search requires several correlation calculations over a window of 256 elements.

The correlation can be defined as the sum of products of complex input samples (R_i) and coefficients (C_i), which can be expressed mathematically as

After each correlation, the window shifts by one input sample, so the second correlation can be defined as

and the n-th as



Correlation

Assuming that R_{Ri}, C_{Ri} are the real parts and R_{Ii}, C_{Ii} are the imaginary parts of R_i and C_i respectively, the first equation can be expanded into its real and imaginary parts as

Using CREMA or BUTTER, a context can be designed for this processing; F_Ri and F_Ii can be loaded in the local memory of BUTTER or CREMA
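A sliding sum-of-products correlation can be sketched in Python as follows. The sketch assumes a plain complex product R_i C_i with no conjugation, since the slides describe the operation only as a sum of products, and it splits each product into real and imaginary parts so that only real multiplications and additions are used. The function name and the tiny window in the usage lines are illustrative; the lecture uses a window of 256 elements.

```python
def correlations(samples, coeffs, count):
    """Return `count` correlations; correlation n uses samples[n : n + len(coeffs)]."""
    window = len(coeffs)
    out = []
    for n in range(count):                 # the window shifts by one sample each time
        re = im = 0.0
        for r, c in zip(samples[n:n + window], coeffs):
            re += r.real * c.real - r.imag * c.imag   # real part of r * c
            im += r.real * c.imag + r.imag * c.real   # imaginary part of r * c
        out.append(complex(re, im))
    return out

samples = [1 + 1j, 2 - 1j, 0 + 3j]
coeffs = [1 - 1j, 1 + 1j]
print(correlations(samples, coeffs, 2))
```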



Fast Fourier Transform



FFT Implementation

[Figure: Radix-2 Butterfly and Radix-4 Butterfly]


FFT Implementation

[Figure: 64-point FFT Radix-2 Structure and 64-point FFT Radix-4 Structure]


Radix-2 vs Radix-4


Radix-2 FFT Implementation

Single Context

Two Radix-2 Butterflies
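For reference, a radix-2 butterfly and a small recursive FFT built from it can be sketched in Python. This is a floating-point, decimation-in-time sketch; the slides do not say which FFT variant or number format the CREMA context uses, so both choices are assumptions, as are the function names.

```python
import cmath

def radix2_butterfly(a, b, w):
    """One radix-2 butterfly: one complex multiply, one add, one subtract."""
    t = w * b
    return a + t, a - t

def fft_radix2(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)     # twiddle factor
        out[k], out[k + n // 2] = radix2_butterfly(even[k], odd[k], w)
    return out

print(fft_radix2([1, 2, 3, 4]))  # approximately [10, -2+2j, -2, -2-2j]
```

A 64-point transform needs 6 stages of 32 such butterflies.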



Three contexts for one Radix-4 Butterfly

The first context performs only additions and subtractions

The second context performs multiplications and the rest of the additions and subtractions

The third context performs the shift operations
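A radix-4 butterfly can be sketched in Python as two rounds of additions and subtractions followed by the twiddle-factor multiplications. The exact split of operations across the three contexts is not given in the text, so the grouping below is only an assumption. The sketch uses floating point, so the shift (scaling) step has no counterpart here, and the function name is illustrative.

```python
def radix4_butterfly(a, b, c, d, w1=1, w2=1, w3=1):
    """Radix-4 butterfly on four complex inputs with twiddle factors w1, w2, w3."""
    # First round: additions and subtractions only.
    s0, s1 = a + c, a - c
    s2, s3 = b + d, b - d
    # Second round: remaining additions and subtractions.
    # Multiplying by j only swaps real and imaginary parts, so no multiplier is needed.
    y0 = s0 + s2
    y1 = s1 - 1j * s3
    y2 = s0 - s2
    y3 = s1 + 1j * s3
    # Twiddle-factor multiplications.
    return [y0, y1 * w1, y2 * w2, y3 * w3]

# With unit twiddles the butterfly is a 4-point DFT.
print(radix4_butterfly(1, 2, 3, 4))
```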


Data Reordering

[Figure: input blocks x(A), x(B), x(C), x(D) and output blocks X(A), X(B), X(C), X(D)]

The data must be split into x(A), x(B), x(C) and x(D)


Data Reordering


Performance Comparison

Radix-2 vs Radix-4 Execution Performance

Almost the Same!


Thank You

*Questions*
