Department of Computer Systems
Coarse Grain Reconfigurable Arrays are Signal Processing Engines!
Digital Design for FPGA, TKT-1426, Lecture # 11
Waqar Hussain, Research Scientist
[email protected]
Department of Computer Systems
Tampere University of Technology, Finland
Electronic Products
Besides being reliable and durable, multifunction devices are becoming popular
Example: Mobile Phone
• The key selling features of a cell phone are size, weight, longer battery life, audio/video streaming and the games that run on it
• Adaptability to many communication standards
• Expectations for Real Time performance
• No Limits to Human Desire
Embedded Technology
Embedded technology empowers a mobile phone to carry all these features.
An embedded system is intended for a specific use and consists of hardware capable of performing a set of different tasks with the help of software.
Example: Embedded System = RISC + Accelerator(s)
Why Coarse Grain Reconfigurable Arrays ?
Computationally Intensive Kernels (CIK) need to be accelerated in a Signal Processing System.
Examples of CIKs
1. FIR Filtering
2. Encoding and Decoding
   a) Viterbi
   b) Reed-Solomon
3. Matrix-Vector Multiplication
4. Fast Fourier Transform
Why Coarse Grain Reconfigurable Arrays ?
Question: So why CGRA, why not traditional accelerators?
It is more desirable to use devices that can accelerate different kernels than traditional accelerators that were designed to accelerate only a single kernel.
Thanks to Reconfigurability!
Why CGRAs are Powerful Engines ?
Answer: Due to their structure!
CGRAs offer high parallelism and throughput due to their array-based structure.
Algorithms containing parallelism are the most suitable to be mapped on a CGRA.
A CGRA can process large streams of data.
The structural unit of a CGRA is an ALU, called a Processing Element (PE).
Each PE is connected to other PEs using point-to-point interconnections or a Network on Chip (NoC).
CGRA in an Embedded System
An example of an embedded system is RISC + Accelerator(s)
RISC = COFFEE
Accelerator = BUTTER
Both COFFEE and BUTTER were designed at the Department of Computer Systems, Tampere University of Technology, Finland
BUTTER
A general purpose Coarse Grain Reconfigurable Array (CGRA), which is a matrix of processing elements (PEs). Each PE is capable of performing a set of different tasks and is connected to the others using point-to-point interconnections. BUTTER is capable of processing many computationally intensive kernels.
Problems with BUTTER !
BUTTER’s presence in the system is expensive if it is not used most of the time
BUTTER occupies a large amount of hardware resources
A general purpose CGRA requires a few million gates of an FPGA
Solution
CREMA
A parameterized general purpose CGRA used to generate special purpose accelerators.
Category of Interconnections
Processing Elements in CREMA
Two Operand Registers
Decoder for Operation Selection
Supports integer and floating-point operations
Blocks with a dashed border are scalable and selectable for instantiation
LUT for logical operations
Processing Element Template
CREMA based System
COFFEE for general purpose processing
CREMA generated accelerator for CIK
Network of Switched Interconnections for faster data transfer between modules
Applications Mapped on CREMA and BUTTER
Integer and Floating-point Matrix-Vector Multiplication
• Execution time compared with RISC and DSP
2D Low-Pass Image Filtering based on Averaging Window
FFT
• Satisfied execution time constraints for SISO and MIMO OFDM applications
• Resource utilization and execution time were compared with the state of the art
W-CDMA cell search
• Execution time compared with a RISC core
In all of the above applications, CREMA, as a template-based device, required fewer resources for its generated accelerator than BUTTER
Application Mapping
Number Scaling
Very important, so that the signals don’t overflow
before processing >> scale down
after processing >> scale up
If x[n] and y[n] are the input and output signals, then
scaling down = (x[n] / max|x[n]|) x 2^b
scaling up = (y[n] / 2^b) x max|x[n]|
Example
Consider a set of numbers S = {-3, -2, -1, 0, 1, 2, 3}
Trying to compute -3 x 3 = -9 in 16-bit binary integer representation
Scaling Down
• S / max|S| = {-1, -0.6667, -0.3333, 0, 0.3333, 0.6667, 1}
• (S / max|S|) x 2^15 = {-3.2768, -2.1845, -1.0923, 0, 1.0923, 2.1845, 3.2768} x 10^4
• -32768 x 32768 = -1.0737 x 10^9
• After multiplication there is a shift operation: -1.0737 x 10^9 / 2^15 = -32768
Scaling Up
• The answer was -32768
• Both operands were scaled by max|S| = 3, so (-32768 / 2^15) x 3 x 3 = -9
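The worked example can be sketched in Python as follows. The function names and the use of plain Python integers are assumptions made for illustration. Like the slide, the sketch lets +1.0 map to 32768; a real signed 16-bit register would have to saturate this value to 32767.

```python
B = 15  # number of fractional bits, as in the example

def scale_down(values, b=B):
    """Map values into the fixed-point range using the largest magnitude."""
    peak = max(abs(v) for v in values)
    return [round(v / peak * 2 ** b) for v in values], peak

def fixed_mul(p, q, b=B):
    """Multiply two scaled integers, then shift right to drop the extra 2^b."""
    return (p * q) >> b

S = [-3, -2, -1, 0, 1, 2, 3]
scaled, peak = scale_down(S)
product = fixed_mul(scaled[0], scaled[-1])   # -32768 x 32768, shifted by 15
result = product / 2 ** B * peak * peak      # scale up once per scaled operand
print(scaled[0], scaled[-1], product, result)
```

The factor `peak * peak` appears because a product of two scaled operands carries the scaling twice; a sum of scaled operands would need only one factor of `peak`.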
First Order Linear Constant Coefficient Difference Equation
y[n] = x[n-1] + x[n], n=0,1,2,3,…,N-1
[Figure: block diagram of the difference equation with input x[n], a unit delay Z^-1, an adder, and output y[n]]
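A minimal Python sketch of this difference equation, assuming the delay element starts empty (x[-1] = 0). The function name is hypothetical.

```python
def first_order(x):
    """y[n] = x[n-1] + x[n], assuming x[-1] = 0 (an empty delay register)."""
    y = []
    delayed = 0            # contents of the Z^-1 element
    for sample in x:
        y.append(delayed + sample)
        delayed = sample   # the current sample becomes x[n-1] for the next step
    return y

print(first_order([1, 2, 3, 4]))  # [1, 3, 5, 7]
```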
Finite Impulse Response Filtering
Transfer Function of the Filter
There is no feedback so N = 0
FIR Structure
[Figure: FIR structure, a tapped delay line of Z^-1 elements with coefficients b(0), b(1), b(2), ..., b(M-1) and adders, input x[n], output y[n]]
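The FIR structure can be sketched in Python as a tapped delay line. The sketch assumes the usual direct form y[n] = b(0) x[n] + b(1) x[n-1] + ... + b(M-1) x[n-M+1] with a zero initial state; this form is inferred from the taps b(0) ... b(M-1) and is not stated explicitly in the slides. The function name is hypothetical.

```python
def fir_filter(b, x):
    """Direct-form FIR: y[n] = sum over k of b[k] * x[n-k], with x[n] = 0 for n < 0."""
    delay_line = [0] * len(b)                     # delay_line[k] holds x[n-k]
    y = []
    for sample in x:
        delay_line = [sample] + delay_line[:-1]   # shift the tapped delay line
        y.append(sum(bk * xk for bk, xk in zip(b, delay_line)))
    return y

print(fir_filter([0.5, 0.5], [2, 4, 6, 8]))  # two-tap moving average
```

With b = [1, 1] the same function reproduces the first-order equation y[n] = x[n-1] + x[n].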
Polynomial Division
Very important and used many times in Signal Processing
Example: Encoding process of Reed-Solomon codes
The best way of doing it is by using a Linear Feedback Shift Register (LFSR)
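Reed-Solomon Codes: Encoding in Systematic Form, (7, 3) Example
X^{n-k} m(X) = q(X) g(X) + p(X)
p(X) = X^{n-k} m(X) modulo g(X)
U(X) = p(X) + X^{n-k} m(X)
Message symbols 111 110 010 = \alpha^5 \alpha^3 \alpha^1, i.e. m(X) = \alpha^1 + \alpha^3 X + \alpha^5 X^2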
Encoding in Systematic Form
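p(X) = \alpha^0 + \alpha^2 X + \alpha^4 X^2 + \alpha^6 X^3
U(X) = \alpha^0 + \alpha^2 X + \alpha^4 X^2 + \alpha^6 X^3 + \alpha^1 X^4 + \alpha^3 X^5 + \alpha^5 X^6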
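Systematic Encoding with an (n-k)-Stage Shift Register
[Figure: (n-k)-stage LFSR encoder with stages X^0 X^1 X^2 X^3 X^4, feedback tap coefficients \alpha^3 \alpha^1 \alpha^0 \alpha^3, and input symbols \alpha^1 \alpha^3 \alpha^5]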
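Systematic Encoding with an (n-k)-Stage Shift Register
INPUT QUEUE | CLOCK CYCLE | REGISTER CONTENTS | FEEDBACK
\alpha^1 \alpha^3 \alpha^5 | 0 | 0 0 0 0 | \alpha^5
\alpha^1 \alpha^3 | 1 | \alpha^1 \alpha^6 \alpha^5 \alpha^1 | \alpha^0
\alpha^1 | 2 | \alpha^3 0 \alpha^2 \alpha^2 | \alpha^4
- | 3 | \alpha^0 \alpha^2 \alpha^4 \alpha^6 | -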
The message arrives and the LFSR is reset
Systematic Encoding with an (n-k)-Stage Shift Register
[Figure: LFSR encoder with stages X^0 X^1 X^2 X^3 X^4 and tap coefficients 110 010 100 110. Register contents: 000 000 000 000. Feedback: 000. Input queue: 111 110 010]
1st clock cycle in LFSR
Systematic Encoding with an (n-k)-Stage Shift Register
[Figure: LFSR encoder with stages X^0 X^1 X^2 X^3 X^4 and tap coefficients 110 010 100 110. Register contents: 000 000 000 000. Feedback: 111. Input queue: 110 010]
2nd clock cycle in LFSR
Systematic Encoding with an (n-k)-Stage Shift Register
[Figure: LFSR encoder with stages X^0 X^1 X^2 X^3 X^4 and tap coefficients 110 010 100 110. Register contents: 010 101 111 010. Feedback: 100. Input queue: 010]
3rd clock cycle in LFSR
Systematic Encoding with an (n-k)-Stage Shift Register
[Figure: LFSR encoder with stages X^0 X^1 X^2 X^3 X^4 and tap coefficients 110 010 100 110. Register contents: 110 000 001 001. Feedback: 011. Input queue: empty]
4th clock cycle in LFSR
Systematic Encoding with an (n-k)-Stage Shift Register
[Figure: LFSR encoder with stages X^0 X^1 X^2 X^3 X^4 and tap coefficients 110 010 100 110. Register contents: 100 001 011 101. Feedback: none]
The parity bits 100 001 011 101 then come out of the LFSR serially
Systematic Encoding with an (n-k)-Stage Shift Register
[Figure: LFSR encoder with stages X^0 X^1 X^2 X^3 X^4 and tap coefficients 110 010 100 110. Register contents: 100 001 011 101. Feedback: none]
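Systematic Encoding with an (n-k)-Stage Shift Register
U(X) = \sum_{n=0}^{6} u_n X^n
U(X) = \alpha^0 + \alpha^2 X + \alpha^4 X^2 + \alpha^6 X^3 + \alpha^1 X^4 + \alpha^3 X^5 + \alpha^5 X^6
     = (100) + (001) X + (011) X^2 + (101) X^3 + (010) X^4 + (110) X^5 + (111) X^6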
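The clock-by-clock encoding can be reproduced with a short Python sketch. It assumes that GF(2^3) is built from the primitive polynomial 1 + X + X^3 (inferred from the symbol mapping, for example \alpha^3 = 110) and that the generator polynomial is g(X) = \alpha^3 + \alpha^1 X + \alpha^0 X^2 + \alpha^3 X^3 + X^4 (read from the tap coefficients 110 010 100 110). All function names are hypothetical.

```python
# GF(2^3) from the assumed primitive polynomial 1 + X + X^3.
# An element is an int whose bit i is the coefficient of alpha^i.
EXP = [1]
for _ in range(6):
    v = EXP[-1] << 1
    if v & 0b1000:
        v ^= 0b1011          # reduce using alpha^3 = alpha + 1
    EXP.append(v)
LOG = {v: i for i, v in enumerate(EXP)}

def gf_mul(a, b):
    """Multiply two field elements through the log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % 7]

def bits(a):
    """Slide notation: the leftmost bit is the coefficient of alpha^0."""
    return "".join(str((a >> i) & 1) for i in range(3))

# Coefficients of X^0 .. X^3 of the assumed g(X); the X^4 term is implicit.
G = [EXP[3], EXP[1], EXP[0], EXP[3]]

def rs_encode(message):
    """Divide X^(n-k) m(X) by g(X) in a 4-stage LFSR; symbols enter highest order first."""
    reg = [0, 0, 0, 0]
    for sym in message:
        fb = sym ^ reg[3]                      # feedback symbol
        reg = [gf_mul(fb, G[0]),
               reg[0] ^ gf_mul(fb, G[1]),
               reg[1] ^ gf_mul(fb, G[2]),
               reg[2] ^ gf_mul(fb, G[3])]
    return reg                                 # parity: coefficients of X^0 .. X^3

parity = rs_encode([EXP[5], EXP[3], EXP[1]])   # input queue 111 110 010
print(" ".join(bits(p) for p in parity))       # 100 001 011 101
```

The register after the three message symbols holds the parity symbols \alpha^0 \alpha^2 \alpha^4 \alpha^6, which matches p(X) in the example.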
Correlation
The slot timing synchronization in W-CDMA cell search requires several correlation calculations over a window of 256 elements.
The correlation can be defined as the sum of products of the complex input samples (R_i) and the coefficients (C_i), which can be expressed mathematically as
After each correlation, the window shifts by one input sample, so the second correlation can be defined as
and the n-th as
Correlation
Assuming that R_{Ri}, C_{Ri} are the real parts and R_{Ii}, C_{Ii} are the imaginary parts of R_i and C_i respectively, the first equation can be expanded into its real and imaginary parts as
Using CREMA or BUTTER, a context can be designed for this processing; F_Ri and F_Ii can be loaded in the local memory of BUTTER or CREMA
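A minimal Python sketch of the sliding correlation, written with separate real and imaginary accumulators. It assumes plain (non-conjugated) products R_i x C_i and a 0-based window offset n; both are assumptions, since the slide equations are not reproduced in this transcript. The source uses a window of 256 elements, while the sketch takes the window length from the coefficient list. The function name is hypothetical.

```python
def correlate(samples, coeffs, n=0):
    """Sum of products over one window that starts at offset n (0-based)."""
    f_re = 0.0
    f_im = 0.0
    for i, c in enumerate(coeffs):
        r = samples[n + i]
        f_re += r.real * c.real - r.imag * c.imag   # real part of R_i * C_i
        f_im += r.real * c.imag + r.imag * c.real   # imaginary part of R_i * C_i
    return complex(f_re, f_im)

samples = [1 + 2j, 3 - 1j, -2 + 0.5j, 4j]
coeffs = [1 - 1j, 2 + 1j]
# Slide the window by one input sample per correlation.
results = [correlate(samples, coeffs, n) for n in range(len(samples) - len(coeffs) + 1)]
print(results)
```

Splitting the sum into real and imaginary accumulators means each window needs only real multiplications, additions and subtractions.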
Fast Fourier Transform
FFT Implementation
[Figures: Radix-2 Butterfly, Radix-4 Butterfly]
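The two butterflies can be sketched in Python and checked against a direct DFT. This is a generic textbook formulation with hypothetical function names, not the mapping used on CREMA or BUTTER; the radix-4 version is shown without input twiddle factors.

```python
import cmath

def radix2_butterfly(a, b, w=1):
    """One radix-2 butterfly; the twiddle factor w is applied to the second input."""
    t = w * b
    return a + t, a - t

def radix4_butterfly(a, b, c, d):
    """4-point DFT using only additions, subtractions and multiplication by -j."""
    s0, s1 = a + c, a - c
    s2, s3 = b + d, -1j * (b - d)
    return s0 + s2, s1 + s3, s0 - s2, s1 - s3

def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * k * m / n) for k in range(n))
            for m in range(n)]

x = [1 + 0j, 2 - 1j, -3j, 4 + 0j]
print(radix4_butterfly(*x))
print(dft(x))
```

Multiplying by -j only swaps the real and imaginary parts and changes one sign, so the radix-4 butterfly itself needs no general multiplier.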
FFT Implementation
[Figures: 64-point FFT Radix-2 Structure, 64-point FFT Radix-4 Structure]
Radix-2 vs Radix-4
Radix-2 FFT Implementation
Single Context
Two Radix-2 Butterflies
Radix-4 FFT Implementation
Three contexts for one Radix-4 Butterfly
The first context performs only additions and subtractions
Radix-4 FFT Implementation
The second context performs the multiplications and the rest of the additions and subtractions
The third context performs the shift operations
Data Reordering
[Figure: inputs x(A), x(B), x(C), x(D) and outputs X(A), X(B), X(C), X(D)]
The data must be split into x(A), x(B), x(C) and x(D)
Data Reordering
Performance Comparison
Radix-2 vs Radix-4 Execution Performance
Almost the Same!
Thank You
*Questions*