

Department of Computer Systems

Coarse Grain Reconfigurable Arrays, are Signal Processing Engines!

Digital Design for FPGA, TKT-1426, Lecture # 11

Waqar Hussain
Research Scientist

[email protected]
Department of Computer Systems

Tampere University of Technology, Finland


Electronic Products

Multifunction devices are becoming popular, in addition to their reliability and durability

Example: Mobile Phone

• The key selling features of a cell phone are size, weight, longer battery life, audio/video streaming and the several games that run on it

• Adaptability to many communication standards

• Expectations for Real Time performance

• No Limits to Human Desire


Embedded Technology

Embedded technology empowers a mobile phone to carry all these features.

An embedded system is intended for a specific use and consists of hardware capable of performing a set of different tasks with the help of software

Example
Embedded System = RISC + Accelerator(s)


Why Coarse Grain Reconfigurable Arrays ?

Computationally Intensive Kernels (CIKs) need to be accelerated in a Signal Processing System.

Examples of CIKs
1. FIR Filtering
2. Encoding and Decoding
   a) Viterbi
   b) Reed-Solomon
3. Matrix-Vector Multiplication
4. Fast Fourier Transform


Why Coarse Grain Reconfigurable Arrays ?

Question: So why CGRA, why not traditional accelerators?

It is more desirable to use devices that can accelerate different kernels than traditional accelerators that were designed to accelerate only a single kernel.

Thanks to Reconfigurability!


Why CGRAs are Powerful Engines ?

Answer: Due to their structure!

• CGRAs offer high parallelism and throughput due to their array-based structure.
• Algorithms containing parallelism are most suitable to be mapped on a CGRA.
• A CGRA can process large streams of data.
• The structural unit of a CGRA is an ALU, called a Processing Element (PE).
• Each PE is connected to other PEs using point-to-point interconnections or a Network on Chip (NoC).


CGRA in an Embedded System

An example of an Embedded System is
RISC + Accelerator(s)

RISC = COFFEE
Accelerator = BUTTER

Both COFFEE and BUTTER were designed at the Department of Computer Systems, Tampere University of Technology, Finland

BUTTER
A general-purpose Coarse Grain Reconfigurable Array (CGRA), which is a matrix of processing elements (PEs). Each PE is capable of performing a set of different tasks, and the PEs are connected with each other using point-to-point interconnections. BUTTER was capable of processing many computationally intensive kernels.


Problems with BUTTER !

BUTTER’s presence in the system is expensive if it is not used most of the time

BUTTER occupies a large number of hardware resources

A general-purpose CGRA requires a few million gates of an FPGA


Solution

CREMA
A parameterized general-purpose CGRA used to generate special-purpose accelerators.


Category of Interconnections


Processing Elements in CREMA

Two Operand Registers

Decoder for Operation Selection

Supports Integer and Floating point operations

Blocks with a dashed border are scalable and selectable for instantiation

LUT for logical operations

Processing Element Template


CREMA based System

COFFEE for general-purpose processing

CREMA-generated accelerator for CIK

Network of Switched Interconnections for faster data transfer between modules


Applications Mapped on CREMA and BUTTER

• Integer and Floating-point Matrix-Vector Multiplication
  Execution time compared with RISC and DSP

• 2D Low-Pass Image Filtering based on an Averaging Window

• FFT
  Satisfied execution time constraints for SISO and MIMO OFDM applications
  Resource utilization and execution time were compared with the state of the art

• W-CDMA cell search
  Execution time compared with a RISC core

In all of the above applications, CREMA, as a template-based device, required fewer resources for its generated accelerator than BUTTER


Application Mapping


Number Scaling

Very important, so that the signals don’t overflow
• before processing >> scale down
• after processing >> scale up

If x[n] and y[n] are the input and output signals, then
scaling down = (x[n] / max|x[n]|) x 2^b
scaling up = (y[n] / 2^b) x max|x[n]|


Example

Consider a set of numbers
S = {-3, -2, -1, 0, 1, 2, 3}

Trying to compute -3 x 3 = -9 in 16-bit binary integer representation

Scaling Down
• S / max|S| = {-1, -0.6667, -0.3333, 0, 0.3333, 0.6667, 1}
• (S / max|S|) x 2^15 = {-3.2768, -2.1845, -1.0923, 0, 1.0923, 2.1845, 3.2768} x 10^4
• -32768 x 32768 = -1.0737 x 10^9
• After multiplication there is a shift operation: -1.0737 x 10^9 / 2^15 = -32768

Scaling Up
• The answer was -32768
• Both operands were scaled down by max|S| = 3, so the product is scaled up by 3 twice: (-32768 / 2^15) x 3 x 3 = -9
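The scaling steps of this example can be reproduced with a short Python sketch. The function names `scale_down`, `fixed_mul` and `scale_up` are hypothetical and chosen only for illustration; Python integers do not overflow, so the sketch shows the arithmetic rather than real 16-bit hardware. Note that +32768 lies just outside the signed 16-bit range (maximum 32767), so a real implementation would have to saturate or scale by 2^15 - 1.

```python
def scale_down(values, b=15):
    """Map values to integers in [-2^b, 2^b] by dividing by the peak magnitude."""
    peak = max(abs(v) for v in values)
    return [round(v / peak * (1 << b)) for v in values], peak


def fixed_mul(a, c, b=15):
    """Multiply two scaled integers, then shift right by b bits."""
    return (a * c) >> b


def scale_up(y, peaks, b=15):
    """Undo the scaling: divide by 2^b and multiply by the peak of every scaled operand."""
    result = y / (1 << b)
    for p in peaks:
        result *= p
    return result


S = [-3, -2, -1, 0, 1, 2, 3]
scaled, peak = scale_down(S)
product = fixed_mul(scaled[0], scaled[-1])   # scaled version of -3 x 3
restored = scale_up(product, [peak, peak])   # both operands were scaled by the same peak
print(scaled, product, restored)
```

Because both factors come from the scaled set, the peak is applied once per operand when scaling up.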


First Order Linear Constant Coefficient Difference Equation

y[n] = x[n-1] + x[n], n = 0, 1, 2, 3, …, N-1

[Figure: block diagram with input x[n], a unit delay Z^-1, an adder, and output y[n]]
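A minimal Python sketch of the difference equation y[n] = x[n-1] + x[n] follows. It assumes a zero initial condition, x[-1] = 0, which the slide does not state; the function name `first_order` is illustrative only.

```python
def first_order(x):
    """Compute y[n] = x[n-1] + x[n] for n = 0..N-1, assuming x[-1] = 0."""
    previous = 0          # content of the Z^-1 delay element
    y = []
    for sample in x:
        y.append(previous + sample)
        previous = sample  # the delay element stores the current sample
    return y


print(first_order([1, 2, 3, 4]))
```

The variable `previous` plays the role of the Z^-1 delay element in the block diagram.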


Finite Impulse Response Filtering

Transfer Function of the Filter

There is no feedback so N = 0

FIR Structure

[Figure: FIR structure with input x[n], a chain of Z^-1 unit delays, tap coefficients b(0), b(1), b(2), …, b(M-1), adders, and output y[n]]
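The tapped-delay-line structure with coefficients b(0) … b(M-1) corresponds to the sum y[n] = b(0) x[n] + b(1) x[n-1] + … + b(M-1) x[n-M+1]. A minimal Python sketch, assuming samples before n = 0 are zero and using the hypothetical function name `fir_filter`:

```python
def fir_filter(x, b):
    """Direct-form FIR filter: y[n] = sum over k of b[k] * x[n-k], with x[n] = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        acc = 0
        for k, coeff in enumerate(b):
            if n - k >= 0:            # samples before the start are treated as zero
                acc += coeff * x[n - k]
        y.append(acc)
    return y


# Feeding a unit impulse returns the coefficients, i.e. the finite impulse response.
print(fir_filter([1, 0, 0, 0], [2, 3, 4]))
```

With b = [1, 1] the filter reduces to the first-order difference equation y[n] = x[n-1] + x[n].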


Polynomial Division

Very important and used frequently in Signal Processing
Example: Encoding process of Reed-Solomon codes

The best way of doing it is by using a Linear Feedback Shift Register (LFSR)


Reed-Solomon Codes: Encoding in Systematic Form, (7, 3) Example

Message: 010 110 111 = \alpha^1 \alpha^3 \alpha^5

X^{n-k} m(X) = q(X) g(X) + p(X)
p(X) = X^{n-k} m(X) modulo g(X)
U(X) = p(X) + X^{n-k} m(X)


Encoding in Systematic Form

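p(X) = \alpha^0 + \alpha^2 X + \alpha^4 X^2 + \alpha^6 X^3

U(X) = \alpha^0 + \alpha^2 X + \alpha^4 X^2 + \alpha^6 X^3 + \alpha^1 X^4 + \alpha^3 X^5 + \alpha^5 X^6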


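Systematic Encoding with an (n-k)-Stage Shift Register

[Figure: (n-k)-stage LFSR encoder with stages X^0 X^1 X^2 X^3 X^4, feedback multipliers \alpha^3 \alpha^1 \alpha^0 \alpha^3, and input \alpha^1 \alpha^3 \alpha^5]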


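Systematic Encoding with an (n-k)-Stage Shift Register

INPUT QUEUE                  CLOCK CYCLE   REGISTER CONTENTS                     FEEDBACK
\alpha^1 \alpha^3 \alpha^5   0             0 0 0 0                               \alpha^5
\alpha^1 \alpha^3            1             \alpha^1 \alpha^6 \alpha^5 \alpha^1   \alpha^0
\alpha^1                     2             \alpha^3 0 \alpha^2 \alpha^2          \alpha^4
-                            3             \alpha^0 \alpha^2 \alpha^4 \alpha^6   -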


Systematic Encoding with an (n-k)-Stage Shift Register

The message arrives and the LFSR is reset

[Figure: LFSR with stages X^0 X^1 X^2 X^3 X^4. Generator coefficients: 110 010 100 110; register contents: 000 000 000 000; feedback: 000; input queue: 010 110 111 (111 enters first)]


Systematic Encoding with an (n-k)-Stage Shift Register

1st clock cycle in LFSR

[Figure: LFSR with stages X^0 X^1 X^2 X^3 X^4. Generator coefficients: 110 010 100 110; register contents: 000 000 000 000; feedback: 111; input queue: 010 110]


Systematic Encoding with an (n-k)-Stage Shift Register

2nd clock cycle in LFSR

[Figure: LFSR with stages X^0 X^1 X^2 X^3 X^4. Generator coefficients: 110 010 100 110; register contents: 010 101 111 010; feedback: 100; input queue: 010]


Systematic Encoding with an (n-k)-Stage Shift Register

3rd clock cycle in LFSR

[Figure: LFSR with stages X^0 X^1 X^2 X^3 X^4. Generator coefficients: 110 010 100 110; register contents: 110 000 001 001; feedback: 011; input queue: empty]


Systematic Encoding with an (n-k)-Stage Shift Register

4th clock cycle in LFSR

[Figure: LFSR with stages X^0 X^1 X^2 X^3 X^4. Generator coefficients: 110 010 100 110; register contents: 100 001 011 101; feedback: none]


Systematic Encoding with an (n-k)-Stage Shift Register

The parity bits 100 001 011 101 will come out of the LFSR serially

[Figure: LFSR with stages X^0 X^1 X^2 X^3 X^4. Generator coefficients: 110 010 100 110; register contents: 100 001 011 101; feedback: none]


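Systematic Encoding with an (n-k)-Stage Shift Register

U(X) = \sum_{n=0}^{6} u_n X^n

U(X) = \alpha^0 + \alpha^2 X + \alpha^4 X^2 + \alpha^6 X^3 + \alpha^1 X^4 + \alpha^3 X^5 + \alpha^5 X^6
     = (100) + (001) X + (011) X^2 + (101) X^3 + (010) X^4 + (110) X^5 + (111) X^6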
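The (7, 3) systematic encoding walked through on the preceding slides can be sketched in Python as a software model of the (n-k)-stage LFSR. This is an illustrative sketch, not the CREMA or BUTTER mapping. It assumes GF(8) is generated by the primitive polynomial 1 + X + X^3, which is consistent with the 3-bit symbols on the slides (for example 110 = \alpha^3), and it takes the feedback multipliers \alpha^3 \alpha^1 \alpha^0 \alpha^3 as the generator coefficients. The names `gf_mul`, `bits` and `lfsr_parity` are hypothetical.

```python
# GF(8) from the primitive polynomial 1 + X + X^3 (assumed). Each element is an
# int whose bits 0, 1, 2 hold the coefficients of 1, alpha, alpha^2.
EXP = [1, 2, 4, 3, 6, 7, 5]                 # EXP[i] = alpha^i
LOG = {value: i for i, value in enumerate(EXP)}


def gf_mul(a, b):
    """Multiply two GF(8) elements using log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % 7]


def bits(v):
    """Print a symbol as on the slides: coefficients of 1, alpha, alpha^2."""
    return f"{v & 1}{(v >> 1) & 1}{(v >> 2) & 1}"


def lfsr_parity(message_high_first, g):
    """Divide X^(n-k) m(X) by g(X) with an LFSR; the final register is p(X)."""
    reg = [0] * len(g)
    for symbol in message_high_first:
        feedback = reg[-1] ^ symbol          # addition in GF(2^m) is XOR
        reg = [gf_mul(feedback, g[0])] + [
            reg[i - 1] ^ gf_mul(feedback, g[i]) for i in range(1, len(g))
        ]
    return reg


g = [EXP[3], EXP[1], EXP[0], EXP[3]]         # multipliers at stages X^0..X^3
message = [EXP[1], EXP[3], EXP[5]]           # coefficients of X^0, X^1, X^2
parity = lfsr_parity(reversed(message), g)   # alpha^5 enters the LFSR first
codeword = parity + message                  # U(X) = p(X) + X^(n-k) m(X)

print(" ".join(bits(s) for s in parity))     # 100 001 011 101
print(" ".join(bits(s) for s in codeword))   # 100 001 011 101 010 110 111
```

Each loop iteration corresponds to one clock cycle of the shift register, so printing `reg` inside the loop would trace the register contents cycle by cycle.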


Correlation

The slot timing synchronization in W-CDMA cell search requires several correlation calculations over a window of 256 elements.

The correlation can be defined as the sum of products of complex input samples (R_i) and coefficients (C_i). Mathematically, it can be expressed as

After each correlation, the window shifts by one input sample, so the second correlation can be defined as

and the n-th as


Correlation

Assuming that R_{Ri}, C_{Ri} are the real and R_{Ii}, C_{Ii} are the imaginary parts of R_i and C_i respectively, the first equation can be expanded into its real and imaginary parts as

Using CREMA or BUTTER, a context can be designed for its processing; F_Ri and F_Ii can be loaded in the local memory of BUTTER or CREMA
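A sliding-window correlation split into real and imaginary parts can be sketched in Python as below. Several details are assumptions, since the slide equations are not available in this transcript: the product is taken as plain R_i x C_i without conjugation, indices are zero-based, and the window offset `n` counts from 0. The function name `correlate_window` is hypothetical, and a production window would hold 256 coefficients.

```python
def correlate_window(samples, coeffs, n=0):
    """Sum of products of complex samples and coefficients for the window starting at offset n."""
    real = 0.0
    imag = 0.0
    for i, c in enumerate(coeffs):
        r = samples[n + i]
        # (a + jb)(c + jd) = (ac - bd) + j(ad + bc), accumulated separately
        real += r.real * c.real - r.imag * c.imag
        imag += r.real * c.imag + r.imag * c.real
    return complex(real, imag)


samples = [1 + 2j, 3 - 1j, 2 + 0j]
coeffs = [1 + 1j, 0 - 1j]
first = correlate_window(samples, coeffs, 0)    # window over samples[0:2]
second = correlate_window(samples, coeffs, 1)   # window shifted by one sample
print(first, second)
```

Keeping the real and imaginary accumulators separate mirrors how the four real multiplications per term would be spread over processing elements.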


Fast Fourier Transform


FFT Implementation

[Figure panels: Radix-2 Butterfly; Radix-4 Butterfly]
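As a reference point for the radix-2 butterfly, a textbook decimation-in-time FFT can be sketched in Python. This is a generic software formulation, not the CREMA context mapping; it assumes the input length is a power of two and uses the twiddle factor W = exp(-2 pi j k / n). The names `butterfly2` and `fft_radix2` are hypothetical.

```python
import cmath


def butterfly2(a, b, w):
    """Radix-2 butterfly: returns (a + w*b, a - w*b)."""
    t = w * b
    return a + t, a - t


def fft_radix2(x):
    """Recursive decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)   # twiddle factor
        out[k], out[k + n // 2] = butterfly2(even[k], odd[k], w)
    return out


spectrum = fft_radix2([1, 1, 1, 1])   # a constant signal has only a DC component
print(spectrum)
```

A radix-4 butterfly combines four inputs per step instead of two, which halves the number of stages for lengths that are powers of four.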


FFT Implementation

[Figure panels: 64-point FFT Radix-2 Structure; 64-point FFT Radix-4 Structure]


Radix-2 vs Radix-4


Radix-2 FFT Implementation

Single Context

Two Radix-2 Butterflies


Radix-4 FFT Implementation

Three contexts for one Radix-4 Butterfly

The first context performs only additions and subtractions


Radix-4 FFT Implementation

The second context performs multiplications and the rest of the additions and subtractions

The third context performs the shift operations


Data Reordering

[Figure: input blocks x(A), x(B), x(C), x(D) and output blocks X(A), X(B), X(C), X(D)]

Splitting required into x(A), x(B), x(C) and x(D)


Data Reordering


Performance Comparison

Radix-2 vs Radix-4

Execution Performance

Almost the Same!


Thank You

*Questions*