dynamically reconfigurable management of energy, performance…llamocca/research/llamocca2012... ·...
TRANSCRIPT
Dynamically Reconfigurable Management Dynamically Reconfigurable Management of Energy, Performance, and Accuracy of Energy, Performance, and Accuracy applied to Digital Signal, Image, and applied to Digital Signal, Image, and
Video Processing Applications Video Processing Applications
Daniel Daniel LlamoccaLlamocca
Electrical and Computer Engineering Department
The University of New Mexico
December 2nd, 2011
Image and Video Processing and Communications LaboratoryImage and Video Processing and Communications Laboratory
OutlineOutline
�� MotivationMotivation
�� Related WorkRelated Work
�� Thesis StatementThesis Statement
�� ContributionsContributions
�� General ApproachGeneral Approach
�� General implementation detailsGeneral implementation details
�� Digital signal, image, and video processing applications:Digital signal, image, and video processing applications:
�� General Implementation detailsGeneral Implementation details
�� Pixel ProcessorPixel Processor
�� 1D FIR Filter1D FIR Filter
�� 2D FIR Filter/2D FIR Filter/FilterbankFilterbank
�� ConclusionsConclusions
MotivationMotivationDigital signal, image, and video processing systems can be characterized by
three properties:
Energy, Performance, and Accuracy (EPA).
The controlling of these variables at run-time is defined as Dynamic Energy-Performance-Accuracy (EPA) management.
Dynamic EPA management will enable us to deliver:
�A dynamically self-adaptive system (by dynamic allocation of computational resources and dynamic frequency control) that satisfies time-varying EPA requirements.
�An Optimal resulting realization: We want to investigate optimal solutions that can meet dynamic EPA requirements . The system should minimize energy consumption, and at the same time maximize performance and accuracy, while satisfying the given EPA requirements.
MotivationMotivationDynamic Energy-Performance-Accuracy management can rely on Dynamic Partial Reconfiguration (DPR) and Dynamic Frequency Control on FPGAs.
Dynamic Partial Reconfiguration
DPR technology enables the adaptation of hardware resources by modifying or switching off portions of the FPGA while the rest remains intact, continuing its operation. To perform DPR, the Partial Reconfiguration Region (PRR) must be defined. The PRR isdynamically reconfigured via the internal c0nfiguration access port (ICAP).
Dynamic Frequency Control
Digital Clock Managers (DCMs) inside FPGAs provide a wide range of clock management features.
The Dynamic Reconfiguration Port (DRP) of the DCM enables dynamic control of the frequency and phase. We can use it to dynamically adjust the frequency without reloading a new bitstream to the FPGA.
FPGA
Module 1
Module 2
Module n
t1
t2
tn
PRR
static region
clkfx
clkin
rst
clkfb clk0
daddr[6:0]
di[15:0]dw e
dclk
den
do[15..0]drdy
DRP
DCM _ADV
DC
R s
lave
inte
rface
processor
DCR MASTER
BUFG
BUFG
reference clock
DC
R B
US
DCR SLAVE
dcr_a_bus
dcr_sl_dbus
dcr_read
dcr_write
sl_dcrack
sl_dcr_dbus
clkinD
Mclkfx ×
=
clkfx
MotivationMotivationThe system can then carry out independent tasks in time:
task 1, task 2, ….
Examples:
� Task 1: A video processing system is asked to deliver real time performance at 30 frames per second (fps) on limited battery life that will also need to operate for at least 10 hours. This is a multi-objective optimization problem. If solutions are found, pick the system realization with the highest precision.
� Task 2: Now, suppose that we are asked to deliver performance at 100 fps at some minimum level of accuracy (60dB). In this case, we can select the hardware realization with the lowest energy requirements while meeting the performance and accuracy constraints.
Related workRelated work (1 of 3)Image processing with DPR:� DPR implementation of mean and median filters [Bhandari09],
[Raikovich10].� Fingerprint image processing algorithms whose stages (segmentation,
normalization, smoothing, etc.) are multiplexed in time via DPR [Fons10].� 3D Haar Wavelet Transform DPR implementation by dynamically
reconfiguring a 1D HWT thrice [Afandi09].� JPEG2000 decoder where the blocks are dynamically swapped
[Bouchoux04]� All these works are DPR implementations that exhibit some resemblance to
our work. However they did not explore the EPA space.
CHREC (NSF Center for High-Performance Reconfigurable Computing):� Acceleration of the Partial Reconfiguration Process (e.g. bitstream
rellocator, high level PR description for fast PR implementation, platformfor rapid deployment of PR embedded systems, using hard macros to reduce FPGA compilation time).
� Adaptive filtering, optical flow static implementations (no DPR)� JTAG encoder/decoder (modules are swapped via DPR)� No exploration of the EPA space via DPR.
Related workRelated work (2 of 3)DPR Application in Dynamic Arithmetic [Vera08]:� The use of DPR provided a low-
energy example where the use of dynamic dual fixed-point (DDFX)arithmetic was shown to performas well as double floating point (FP) in a Linear Algebra example.
� DDFX maintains a performanceadvantage with respect to FP whenreconfiguring once every 10000operations or less (i.e., DDFX canchange precision 250 times per second or switch operations 150 per second.
� Arithmetic cores were measured in terms of their power, performance, and precision.
� A model was formulated that relates power, performance, and precisionof the dynamic arithmetic architecture. It explored the use of DPR to dynamically adjust performance, precision, and power consumption.
� No multi-objective optimization of the Power-Performance-Precision space
QP
: Q
ua
nti
za
tio
n
P
ara
me
ter
Related workRelated work (3 of 3)DPR Application for scalable DCT computation [Huang09]:� Parameterized Discrete Cosine Transform (DCT) systolic modules (no
Distributed Arithmetic approach suitable for FPGAs)� The system dynamically reconfigures among Discrete Cosine Transform
modules of different sizes (e.g., 8x8, 5x5,4x4).� Different DCT configurations were studied in terms of power, performance, power, performance,
and accuracyand accuracy. A configuration manager can adapt DCTs of different sizes based on power, performance, and accuracy power, performance, and accuracy constraints.
� Exploration of power, performance, and precision dependence on the DCT size. No multi-objective optimization of the EPA space.
Thesis StatementThesis StatementThis Dissertation develops a dynamic Energy-Performance-Accuracy
management framework for Digital Signal, Image, and Video processing applications.
This entails: 1. Development and parameterization of efficient and dynamically
reconfigurable architectures for the following signal, image, and video processing applications:
2. Development of a Multi-objective Pareto optimization approach to meet global Energy-Performance-Accuracy (EPA) constraints.
3. Description of the Pareto-optimal realizations extracted from the EPA space: We are interested in how the architecture parameters generate Pareto-optimal solutions from the EPA space.
4. Dynamic EPA management to meet time-varying global EPA constraints. The system receives stimuli in the form of EPA constraints and reconfigures itself via DPR and/or dynamic frequency control to meet the EPA constraints.
DPR 2D FilterbankDPR 2D FIR Filter
DPR 1-D FIR FilterDPR Pixel Processor
ContributionsContributions� Development of fully-parameterized hardware cores for signal, image, and
video processing applications. The architectures are implemented with techniques that minimize the amount of computing resources and take advantage of Dynamic Partial Reconfiguration.
� Characterization of the optimal (in the multi-objective sense) hardware realizations from the EPA/PPA space for the architectures presented.
� A new framework for dynamic energy/power, performance, and accuracy (EPA/PPA) management based on a multi-objective optimization approach that guarantees low energy, high accuracy, and high performance. The framework is applicable to a wide array of signal, image, and video processing architectures.
� Development of hardware systems that support dynamic energy/power, performance, and accuracy management that meet real-time EPA/PPA constraints. On hardware, dynamic EPA/PPA management is based on the run-time control of hardware resources and frequency of operation.
General approach (1/5)General approach (1/5)
Steps:
1) Definition of Objective Functions
2) Development of efficient cores
3) Parameterization of Hardware Cores
4) Multi-objective Pareto Optimization in the EPA Space
5) Dynamic management based on real-time EPA constraints
General approach (2/5)General approach (2/5)1) Definition of objective functions: Energy, performance, and accuracy
are considered the objective functions of system parameters. These properties may have a slightly different definition depending on the application.Energy can be measured as the total energy spent during the system operation, or the energy spent during an operation (e.g., energy per video frame). In some instances, measuring Power is more useful.Performance can be measured by: Megasamples per second, frames per second, Megabytes per second, etc.Accuracy can be measured by: numerical representation, or accuracy with respect to an idealized result (e.g., PSNR).
2) Development of efficient cores: The signal, image, and video processing architectures should use techniques that: i) minimize the amount of computational resources (e.g. LUT-based approaches, Distributed Arithmetic), and ii) make intensive use of DPR. The cores must be implemented in Hardware Description Language (HDL), so that they remain portable across FPGA devices and vendors.
General approach (3/5)General approach (3/5)
PIXEL
PROCESSOR
NC LUT values
(from text f ile)
NI NO F
NI××××NC NO××××NC
3) Parameterization of hardware cores: To achieve a fine control of energy, performance, and accuracy, we require realistic parameterization of the hardware cores (e.g., I/O bit-width, number of parallel cores).
The parameterized HDL code let us create a set of hardware realizations by varying the parameters. Each realization comes with different energy, performance, and accuracy values, which we can control by varying the hardware parameters.
Example: Parameterization of the ‘Pixel processor’ architecture:
NC (number of cores), NI (number of input bits per pixel),
NO (number of output bits per pixel), F (function to be implemented),
LUT values (text file with LUT values)
General approach (4/5)General approach (4/5)4) Multi-objective Pareto Optimization in the EPA Space: The Energy-
Performance-Accuracy (EPA) space is represented by a set of hardware realizations along with their EPA values.An optimal hardware realization is defined as the one that minimizes energy, while maximizing performance and accuracy. We are interested in the set of optimal realizations from the EPA space. We want to find a subset whose EPA values cannot be improved by any other realization for all three (EPA). These realizations are called optimal in the Pareto (multi-objective) sense.
Energ
y
-Accuracy
Pareto front
-Accuracy -Performance
Energ
yPareto front
The Energy-Performance-Accuracy space is shown along with the ParetoPareto--optimaloptimal points. In some cases, we may want to explore a space of just 2 variables, e.g., the Energy-Accuracy space.
General approach (5/5)General approach (5/5)5) Dynamic management based on real-time EPA constraints: Once the
Pareto front has been extracted, we can cast optimization problems based on EPA constraints.Example: We set constraints on all the three variables. The feasible set is represented by the golden points. We prioritize energy consumption, so our selected realization is the one that also minimizes energy consumption.The previous problem could be cast as the following optimization problem:
( ) ( )
( ) fpsRiePerformanc
dBRiAccuracy:tosubject
RiEnergy
Ri
min
30
50
≥
≥
Circled point: Realization from the feasible set that minimizes energy consumption.
If we ignore one variable (say, performance), we have a 2D optimization problem.
-Performance-Accuracy
constr
ain
ts
Energ
y
Energ
y
-Accuracy
constraints
Digital signal, image, and video processing Digital signal, image, and video processing applicationsapplications
The following systems are discussed:
� Pixel Processor and Dynamic EPA Management
� 1D FIR Filter
� 2D Separable FIR Filter/ Filterbank & Dynamic EPA Management for the 2D FIR Filter
General Implementation Details General Implementation Details Embedded FPGA system that supports Dynamic Partial Reconfiguration and Dynamic Frequency Control:Pareto-optimal point: Represented by <<bitstreambitstream, frequency of operation>, frequency of operation>It is a hardware realization that becomes active in the FPGA via Dynamic Partial Reconfiguration (DPR) and/or Dynamic Frequency Control.
System receives an EPA constraint:� It looks for a solution in the Pareto-optimal set: <bitsream*, freq*>� It reconfigures FPGA dynamic region and /or frequency of operation, so as to meet the EPA constraint.
processor
PLB
System
ACEmemory
Ethernet
MAC
ICAP
core
ext. interface
FSL
PLB
ICAP
port
CF
card
freq. ctrlDCR Bus clkfx
EP
A
co
nstr
ain
ts
Energ
y
-Accuracy -Performance
PO 1
Pareto front
'n' bitstreams
in memory
PO 3
PO m
bitstream freq.
Module n
Module 1
Module 3
f1
f2
f1
Module 1 f3
PO 4
PO 2 Module 2 f2
Module
n
Module
2
Module
1
'm ' Pareto points
n ≤ ≤ ≤ ≤ m
DMA
core
M S
Hardw are core
Dynamic
Region
Inte
rface
PRR
Pixel Processor (1/9)Pixel Processor (1/9)� Single-pixel operations (e.g.,
gamma correction, Huffman encoding, histogram equalization, contrast stretching) can be dynamically swapped. Parameter F modifies the function.
� In addition to dynamically modifying the input-output function, we might want to change:
� Input pixel bitwidth (NI)
� Output pixel bitwidth (NO),
� Number of parallel processing elements (NC)
t1
t2
t3
Input Frame
Inv. Gamma correction
gamma = 2
Contrast Stretching
alpha = 2, m = 0.5
Gamma correction
gamma = 0.5
Module 3 Module 2 Module 1
FPGA
PRR
PIXEL
PROCESSOR
NC LUT values
(from text f ile)
NI NO F
NI××××NC NO××××NC
Pixel Processor (2/9)Pixel Processor (2/9)� LUT-based architecture: LUT4 (Virtex-4). LUT6 (Virtex-5, Virtex-6)
Up to LUT8-to-1 can be implemented efficiently with Xilinx primitives.
For LUT inputs > 8, a recursive implementation is employed.
4
M UXF5
8
4 4
M UXF5
4 4
M UXF5
4 4 4 4 4 4 4 4 4 4 4
4 M
SB
s
4 LSBs
LUT5-to-1
LUT6-to-1
LUT7-to-1
LUT8-to-1
LUT
NI-1 to 1
LUT
NI-1 to 1
NI-1NI
LUT NI-to-1
LUT
NI to 1
LUT
NI to 1
LUT
NI to 1
NI
NI
NI
NO
bNO-1
b1
b0
bNO-1b
1b
0
≡
NI
LUT NI-to-NO
NO bits
2N
I word
s o
f N
O b
its
LUT NI-to-NO
LUT
NI-to-NO
NI NO
LUT
NI-to-NO
NI NO
LUT
NI-to-NO
NI NO
NI ×× ××
NC
NO
×× ××N
C
'NC' LUTs NI-to-NO
M UXF5 M UXF5 M UXF5 M UXF5 M UXF5
M UXF6 M UXF6 M UXF6 M UXF6
M UXF7 M UXF7
M UXF8
LU
T4
LU
T4
LU
T4
LU
T4
LU
T4
LU
T4
LU
T4
LU
T4
LU
T4
LU
T4
LU
T4
LU
T4
LU
T4
LU
T4
LU
T4
LU
T4
Pixel Processor (3/9)Pixel Processor (3/9)
� Embedded System: We create a PLB slave burst interface around the pixel processor core. The figure shows a PRR with NC=4, NI=NO=8.
� The system dynamically reconfigures: NC, NI, NO, FUNCTION, under the following constraints: NI×NC≤32, and NO×NC ≤ 32
� Five ‘clkfx’ frequencies allowed: 100.00, 66.66, 50.0, 40.00, and 33.33 MHz.
� FIFOs are required to properly isolate different clock regions (PLB clock= 100 MHz and ‘clkfx’)
ord
en
irden
LUT
8-to-8
LUT
8-to-8
LUT
8-to-8
32 32
upix_ip
LUT
8-to-8
Bus2IP
_D
ata
Slave
Reg 0
oFIFOFWFT mode
DO
rden
IP2B
us_D
ata
32 IP
2B
us_W
rack
IP2B
us_R
dack
Bus2IP
_W
rCE
(0)
Bus2IP
_R
dC
E(1
)
FSM
Bus2IP
_C
lkDI
w ren
PLB Bus
PLB Interface
512x32
Bus2IP
_B
E
rdw r
IP2B
us_A
ddrA
ck
PLB46 Slave Burst IP
Bus2IP
_R
dR
eq
Bus2IP
_W
rReq
full
em
pty
ow
ren
ofull oempty
iFIFOFWFT mode
DO
rden
DI
w ren
full
em
pty
iwre
nifull iempty
FSM
clkfx
iw ren
Slave
Reg 1
iwre
n
512x32
Pow erPC
PLB
System
ACEmemory
Ethernet
MAC
ICAP
core
ICAP
port
CF
card
freq. ctrlDCR Bus clkfx
EP
A
co
nstr
ain
ts
'n' bitstreams
in memory
Module
n
Module
2
Module
1
DMA
core
M S
Hardw are core
Pixel
Processor
Inte
rface
PRRPRR
Implemented on a ML405 Development Board
Device: XC4VFX20-11FF672
Pixel Processor (4/9)Pixel Processor (4/9)
� Experimental Setup: The Pixel Processor is tested under 3 different scenarios. Performance and energy are measured for the IP core.
� Test images: 8-bit ‘lena’, 12-bit ‘oilp’
� Scenario A (implemented on the embedded system): 32-bit I/O constrained cases. 5 frequencies considered. Parameter NC not independent.
� Scenario B: 8/12-bit fixed input pixel cases. 5 frequencies considered.
� Scenario C: Fixed-frequency constrained implementation. Fixed frequency (100 MHz).
Scenario A
Scenario B
Scenario C
Pixel Processor (5/9)Pixel Processor (5/9)� Resource scalability: Use of Virtex-4 XC4VFX60 FPGA device (25280
Slices) to account for the largest pixel processor realizations. The cases listed in Scenario C are considered (frequency does not vary resources).
� Resource consumption (a function of NI, NO, and NC) grows exponentially with NI, linearly with NC and NO. In the figure, the results are clearly clustered for NI and NC.
� For NI>10 the resource requirements become suboptimal.
910
1112
910
1112
1314
1516
2000
4000
8000
12000
16000
20000
5
67
8
56
78
910
1112
200
400
600
800
1000
Num
ber
of S
lices
Number of
output bits (NO)Number of
output bits (NO)Number of
input bits (NI)
Number of
input bits (NI)
15
75
185
37
30
54
270
150 63
103
515
315
136
200
1000
680
306
547
1224
2188
690
1096
4384
27601514
2215
6056
8860
3323
4449
13290
17800
200040008000
120001600020000
Num
ber
of S
lices NC=2
NC=4
NC=6
NC=8
6789101112
200400600800
1000
Number of output bits (NO)
Num
ber
of S
lices
NC=2
NC=4
NC=6
NC=8
NC=10
Pixel Processor (6/9)Pixel Processor (6/9)
� Multi-objective optimization of the EPA/PPA space:
� Gamma correction function (γ=0.5). Power is presented for Scenarios A and B, and energy per frame for Scenario C.
� Scenario A: 12-bit image (NI:12�5). Pareto points cover a wide range of the PPA space (43%) � the approach is effective in generating varied Pareto points.
0
500
1000
1500
200040 60 80 100 120 140
0
50
100
150
200
fps
psnr(dB)
Pow
er(
mW
)
NI = 5
NI = 6
NI = 7
NI = 8
NI = 9
NI = 10
NI = 11
NI = 12
(509 fps , 48.14 dB, 42.37mW)
NI=NO=5, NC=4, 33 MHz
(1526 fps , 48.14 dB , 48.15mW)
NI=NO=5, NC=4, 100 MHz
(245 fps , 128.6 dB , 156.7mW)
NI=12, NO=16, NC=2, 33 MHz
lowest
power
max. accuracy
highest performance
Pixel Processor (7/9)Pixel Processor (7/9)
� Multi-objective optimization of the PPA space:
� Scenario B: 8-bit input image (NI=8 fixed). Pareto points are clustered as a function of NO. A similar trend occurs with NC. (not shown)
� Left side shows how power and performance depend on frequency
0
2000
400075 80 85 90 95 100 105
0
20
40
60
80
100
fps
psnr(dB)
Pow
er(
mW
)
NO = 8
NO = 9
NO = 10
NO = 11
NO = 12
050010001500200025003000350010
20
30
40
50
60
70
80
90
100
fps
Pow
er(
mW
)
100
66.66
50
40
33.33
(217 fps , 83.34 dB, 12.88mW)
NO=9, NC=2, 33 MHz
(3255 fps , 83.34 dB, 70.94mW)
NO=9, NC=10, 100 MHz
highest
performance
lowest
power
max.
accuracy
(217 fps , 101.71 dB, 17mW)
NO=12, NC=2, 33 MHz
Pixel Processor (8/9)Pixel Processor (8/9)
� Multi-objective optimization of the EPA space:
� Scenario C: 12-bit input image (NI:12�9), fixed frequency = 100 MHz.
� Performance clusters are defined in terms of the number of cores (NC)
� Energy clusters are defined in terms of the input pixel bitwidth (NI)
500100015002000250030003500
50
100
150
0
50
100
150
200
250
300
fps
psnr(dB)
Energ
y p
er
fram
e(u
J)
NI = 9
NI = 10
NI = 11
NI = 12
500100015002000250030003500
50
100
150
0
50
100
150
200
250
300
fps
psnr(dB)
NC = 2
NC = 4
NC = 6
NC = 8
NO16
15
1413
12
NO
16..1
1
NO
16..1
0
NO
16..9 NI=10, NO=11
NI=11, NO=12
NI=11, NO=11
Pixel Processor (9/9)Pixel Processor (9/9)� Dynamic EPA management: We show an
example on 2D (ignoring performance) with time-varying constraints:
1. Require accuracy≥80dB and Energy ≤160uJ.
2. Minimize energy subject to Accuracy ≥ 100dB
3. Maximize Accuracy.
4. Minimize Energy consumption.
7080901001101201300
50
100
150
200
250
300
psnr(dB)
Ene
rgy p
er
fra
me(u
J)
1001500
200
NI = 9
NI = 10
NI = 11
NI = 12
70809010011012013020
40
60
80
100
120
140
160
180
200
220
psnr(dB)
Energ
y p
er
fram
e(u
J)
EPP Results -Function 1
Energy per frame
Accuracy (ps nr)
0.16 mJ
80dB
min
100 dB
--
max
min
--
� � � �
�
�
�
�
NI=11, NO=11, NC=6
NI=12, NO=12, NC=8
NI=NO=9
NC=8
NI=12, NO=16,
NC=8
* Submitted to IEEE * Submitted to IEEE Transactions on Circuits and Transactions on Circuits and Systems for Video TechnologySystems for Video Technology
1D FIR Filter (1/3)1D FIR Filter (1/3)
EX_in E E E
y
c[0] c[1] c[2 c[3
EX_in E E E
y
shifts and adds
LUT array
Multiply and add Filter Distributed Arithmethic Filter
TWO TYPES OF IMPLEMENTATIONS
� Efficient implementation of a 1D FIR Filter via DPR: Dynamic Partial Reconfiguration turns the fixed-coefficient DA filter into a variable-coefficient DA filter, at the expense of partial reconfiguration time overhead.
� Parameterization of the VHDL-coded FIR filter core:
Distributed Arithmetic (DA) approach is more efficient since it is a LUT-based approach that turns the multiplications into shifts and adds. But it requires the coefficients to be constant.
FIR_DAX_in B
E
NO
N NH
[NO
NQ
]
OP
SY
MM
ET
RY
COEFFICIENTS
Y
L
sclr
BNumber of taps
Input bit-w idth
Coefficient bit-w idth
Output f ixed point format
LUT input size
Output truncation scheme
1D FIR 1D FIR Filter (2/3)Filter (2/3)
� LUT-based architecture for the Distributed Arithmetic Implementation:
...
...
...
...
...
...
s[0]s[M-2]
s[0]:
s[L-1]:
X_in B
x[N-1] x[N-2] x[0]x[1]
X_in
x[0]
x[N-1] x[M +1]
x[M ]x[M -1]x[1]
x[N-2]B
only when
N is even
B+1 B+1 B+1
E E E E
E
E E E
E E E
E
B B B B
sB
s0
Y
+
Adder
tree
s[M-1]
...sB
s1
s0
...
...
-2B
20 +...
LUT L-to-LO
LUT L-to-LO
Filter B lock 0
s[L]:
s[2L-1]:
sB
s0s
Bs
1s
0...
...
-2B
20 +...
LUT L-to-LO
LUT L-to-LO
Filter B lock 1
s[M -L]:
s[M -1]:
sB
s0s
Bs
1s
0
...
...
-2B
20 +...
LUT L-to-LO
LUT L-to-LO
Filter B lock M /L-1
x[0]:
x[L-1]:
xB-1
x0
Y
+
Adder
tree
xB-1
x1
x0
...
...
-2B-1
20 +...
LUT L-to-LO
LUT L-to-LO
Filter B lock 0
x[L]:
x[2L-1]:
xB-1
x0x
B-1x
1x
0
...
...
-2B-1
20 +...
LUT L-to-LO
LUT L-to-LO
Filter B lock 1
x[N-L]:
x[N-1]:
xB-1
x0x
B-1x
1x
0
...
...
-2B-1
20 +...
LUT L-to-LO
LUT L-to-LO
Filter B lock N/L-1
...
...
...
...
...
LO=NH+log2LLO=NH+log
2L
1D FIR Filter (3/3)1D FIR Filter (3/3)
� Filter Block Implementation ����
� Two implementations:
1) Coefficient-only reconfiguration: Only the coefficient values can be dynamically modified (PRR is made of the set of LUTs)
2) FullFull--filter reconfigurationfilter reconfiguration: It allows the run-time modification of all parameters.
� Performance dependence as the reconfiguration rate increases was shown.
LUT LUT LUT LUT LUT LUT LUT LUT LUT
s8
s7
s6
s5
s4
s3
s2
s1
s0
++++
022022
022022
+
122122
+
+
+
322
LL L L L L L L L
LO LO LO LO LO LO LO LO LO
Filter B lock Implementation (Symmetric, B=8)
222
* This work was published in the 2009 * This work was published in the 2009 International Journal of Reconfigurable International Journal of Reconfigurable ComputingComputing
2D FIR Filter (1/10)2D FIR Filter (1/10)
2D Separable Filter Implementation:
� Separable FIR filters allow for efficient implementations by means of two 1D FIR Filters.
� The reconfiguration rate is constant (twice per frame).
� Cyclic Dynamic reconfiguration of two 1-D filters (usually full-filter reconfiguration):
- Implement row filter
- Replace by column filter
- Implement column filter
- Replace by row filter
…
DPR
tr1 tc1FPGA
ROW COL
COL
tr2 tc2
ROW COL
1 Frame is streamed
Processed row -by-row frame is streamed
2
3
4
Replacing row f ilter by column filter via DPR
Replacing column filter by row filter via DPR
ROW
DPRROWCOL
Processing
frame 1
Processing
frame 2
5 Go to Step 1 to process a new frame
ROW PRR
row f ilter
col f ilter
COL
* A comparison of this 2D FIR Filter and a * A comparison of this 2D FIR Filter and a GPU implementation for different number of GPU implementation for different number of coefficients was published 2011 IEEE Field coefficients was published 2011 IEEE Field Programmable Logic Conference (FPLProgrammable Logic Conference (FPL’’2011)2011)
2D FIR Filter (2/10)2D FIR Filter (2/10)
2D Separable Filterbank Implementation:
Simple modifications to 2-D filter implementation:
� Reconfigure with the next 2-D filter and re-process frame
� When all filters have been applied, move to the next-frame and back to the first 2-D filter
ROW 1
COL 1
ROW 2
COL 2
ROW 3
COL 3
bitstreams
in memory
1st 2D filter
2nd 2D filter
3rd 2D filter
PRR
FPGA
ROW 1t1
t2
t4
t3
COL 1
ROW 2
COL 2
COL 3
ROW 3t5
t6
DPRROW i COL i
COL i
1 Frame is streamed
Processed row-by-row frame is streamed
2
3
4
Replacing row filter by column filter via DPR
Replacing column filter by row filter via DPR
ROW i
DPRCOL i
5 Go to Step 1 to:
- Process a new frame if i = 3. In this case j = 1
- Process the same frame if i = 1,2. In this case j = i+1
row f ilter
col filter
ROW j
EXAMPLE: 3-CHANNEL 2-D FILTERBANK
Pro
ce
ssin
g o
f 1
fra
me
* This work was published in the in the 2010 * This work was published in the in the 2010 IEEE Southwest Symposium on Image Analysis IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI 2010)and Interpretation (SSIAI 2010)
2D FIR Filter (3/10)2D FIR Filter (3/10)� Embedded System: We create a FSL interface around the FIR filter core.
The full-filter reconfiguration core is considered since it let us vary all the filter parameters, thereby allowing the creation of a large EPA space.
� FSL interface: We included this interface inside the PRR, so that we dynamically modify the I/O bitwidth.
� DPR control block: disables the PRR outputs during reconfiguration and resets the flip flops of the PRR after each partial reconfiguration.
� Each 2D filter realization is represented by 2 bitstreams.
PPC
PLB
System
ACEmemory
Ethernet
MAC
ICAP
core
FIR Filter processor
FSL
CF
card
FIR IP
core
DP
R c
trl
PRR
FS
L
inte
rface
ICAP
port
EP
A
co
nstr
ain
ts
'2n' bitstreams in memory
FS
L_S
_R
ead
FS
L_S
_E
xis
ts
FSL Slave
FS
L_S
_D
ata
FIFOw
FSL Master
FIFOrF
SL_S
_C
ontro
l
FS
L_S
_C
lk
32
FS
L_M
_W
rite
FS
L_M
_F
ull
FS
L_M
_D
ata
FS
L_M
_C
ontro
l
FS
L_M
_C
lk
32
... ...
FSM
X_in
8
E
8 Y
Filter core
sclr
FSM
DPR control - in DPR control - out
FS
L_S
_R
st
reset PRR
FSL interfaceModule 1 - row Module 1 - col
Module 2 - row Module 2 - col
Module n - row Module n - col
...
...
Implemented on a ML405 Development
Board. Device: XC4VFX20-11FF672
2D FIR Filter (4/10)2D FIR Filter (4/10)Experimental Setup: The FIR core is fully parameterized. The table
describes the combination of parameters we utilize that creates the Energy-Performance-Precision (EPA) space.
� Performance (fps) and energy are measured for the IP core.
� Test image: 8-bit ‘lena’ (VGA, CIF, QCIF frame sizes).
� Three different Gaussian filters:
8
16
10
12
16
32242016128
640x480 (VGA)
352x288 (CIF)
176x144 (QCIF)
Output bitwidth
(OB)
Coefficient bit-
width (NH)Number of coefficients (N)Frame Size
-pi
0
pi
-pi
0
pi0
0.5
ωx
DoG - σ1=2,σ
2=4
ωy
-pi
0
pi
-pi
0
pi0
0.5
1
1.5
ωx
Gaussian f ilter - σx=σ
y=1.5
ωy
Magn
itude
-pi
0
pi
-pi
0
pi0
0.5
1
ωx
Gaussian f ilter - σx=4,σ
y=2
ωy
N=24 N=32 N=48
2D FIR Filter (5/10)2D FIR Filter (5/10)Multi-objective optimization of the EPA space: Results for the 3 filters and 3 image sizes: The highest accuracy is achieved by increasing the number of coefficients (N), the coefficient bitwidth (NH), and with 16 output bits (OB). Frame size increases energy per frame. Thus, we present Pareto-optimal results independently of the frame size.
HA: highest accuracy realization from the Pareto frontLE: lowest energy realization from the Pareto front.
0
500
100040 60 80 100 120
0
0.2
0.4
0.6
0.8
fps
psnr(dB)
Energ
y p
er
fram
e(m
J)
0
500
100020 40
60 80
0
0.2
0.4
0.6
0.8
fps
psnr(dB)
0
500
100020 40 60 80 100
0
0.5
1
fps
psnr(dB)
640x480 640x480
640x480
352x288352x288
352x288176x144 176x144
176x144
200400
OB = 16
OB = 8200400
OB = 16
OB = 8 200400
OB = 16
OB = 8
686fps, 51dB,
0.105mJ, N=8,NH=12
326fps
67.13dB
0.323mJ
N=24,NH=12
136.4fps
85.5dB
0.865mJ
N=32,NH=16
Low pass Gaussian f ilter, σ=1.5 Low pass Gaussian f ilter, σx=4,σy
=2 Band pass (DoG) f ilter, σ1=2,σ2
=4
HA
LE LE
HA HA
LE
67dB
137fps,
0.6874mJ,
N=24,NH=12
2D FIR Filter (6/10)2D FIR Filter (6/10)Multi-objective optimization of the EPA space:
Results for the 3 filters and CIF frame size: We show the Pareto-optimal realizations as a function of N, NH, and OB.
Low-pass Gaussian Filter, σx=σy=1.5:
325
330
33550 60 70 80 90 100 110
0.2
0.25
0.3
0.35
fps
psnr(dB)
325
330
335
50 60 70 80 90 100 110
0.2
0.25
0.3
0.35
fps
psnr(dB)
Energ
y p
er
fram
e(m
J)
N = 24
N = 20
N = 16
N = 12
N = 8 325330
NH = 16
NH = 12
NH = 10
LE, HP1, LA
HA, HE, LP
OB=8
U1HP2 HP3HP4HP5U2,U3 U4 U5 U6 HA, HE, LP
LE, HP1, LA
2D FIR Filter (7/10)2D FIR Filter (7/10)Multi-objective optimization of the EPA space
324
326
328
330
332
334
2030
4050
6070
0.2
0.25
0.3
0.35
fpspsnr(dB)
Energ
y p
er
fram
e(m
J)
324
326
328
330
332
334
2030
4050
6070
0.2
0.25
0.3
0.35
fpspsnr(dB)
N = 24
N = 20
N = 16
N = 12
N = 8
OB=8332fps, 29.6dB, 0.156mJLE, HP, LA
326fps, 67dB, 0.323mJHA, HE, LP
325330
NH = 16
NH = 12
NH = 10
HP
HP
LP
LP
HA, HE, LP
LE, HP, LA
Low-pass Gaussian Filter, σx=4,σy=2
Difference of Gaussians (DoG) filterσ1=2,σ2=4
320
325
330
335
20
40
60
80
100
0.2
0.25
0.3
0.35
0.4
fpspsnr(dB)
320
325
330
335
2040
6080
100
0.2
0.25
0.3
0.35
0.4
fpspsnr(dB)
Energ
y p
er
fram
e(m
J)
N = 32
N = 24
N = 20
N = 16
N = 12
N = 8
OB=8
LE, HP, LA
HA, HE
(a) (b)
325330
NH = 16
NH = 12
NH = 10
HP
HP
LP
LP
332fps, 23.6dB, 0.155mJ
332fps, 23.6dB, 0.155mJ HA, HE
LE, HP, LA
2D FIR Filter (8/10)2D FIR Filter (8/10)Multi-objective optimization of the Energy-Accuracy space: Performance results are over 100 fps (VGA), and 300 fps (CIF). Overall, for a
fixed frame size, performance does not vary significantly. Thus, it makes sense to restrict our attention to the Energy-Accuracy Space
Results for the 3 filters and CIF frame size: We show the Pareto-optimal realizations as a function of N (number of coefficients).
20406080100
0.2
0.25
0.3
0.35
0.4
psnr(dB)
203040506070
0.2
0.25
0.3
0.35
psnr(dB)
406080100120
0.2
0.25
0.3
0.35
psnr(dB)
Energ
y p
er
fram
e(m
J)
Low pass Gaussian f ilter, σ=1.5 Low pass Gaussian f ilter, σx
=4,σy=2 Band pass (DoG) f ilter, σ1
=2,σ2=4
fps(min) = 325.9
fps(max) 332.2
N = 24
N = 20
N = 16
N = 12
N = 8
Frame size:352x288N = 32
N = 24
N = 20
N = 16
N = 12
N = 8
HA
LE LE
HA
HA
LE
U11U10
U9
U8U7 U6
U5U4U3
U2
U1
fps(min) = 325.9
fps(max) 332.2
fps(min) = 322.6
fps(max) 332.2
2D FIR Filter (9/10)2D FIR Filter (9/10)Dynamic EPA Management (1st example): Applied on the Pareto front of the DoG filter (see table). Video sequence: ‘foreman’.
Time-varying sequence of constraints:1. Require Accuracy ≥ 45dB and Energy ≤ 0.3mJ
2. Minimize Accuracy subject to Energy ≤ 0.3mJ
3. Minimize Energy per frame consumption
4. Minimize Energy subject to Accuracy ≥ 65 dB
5. Maximize Accuracy
0 50 100 150 200 250 30020
30
40
50
60
70
80
90
Frame #
PS
NR
(dB
)
2030405060708090
0.2
0.25
0.3
0.35
0.4
psnr(dB)
Energ
y p
er
fram
e(m
J)
�
�N=20, NH=10, OB=8
N=32, NH=16, OB=16
� N=24, NH=16, OB=16
N=8, NH=12, OB=8
�N=24, NH=10
OB=16
�
�Energy per frame
Accuracy (ps nr)
0.3 mJ
45dB
0.3mJ
max
min
--
min
65dB
� � � �--
max
avg=49.65
std=0.0977
avg=61.14
std=0.1425
avg=23.56
std=0.18
avg=66.18
std=0.5568
avg=88.45
std=0.0475
�
�
�
�
�
frame # 30
frame # 90
frame # 150
frame # 215
frame # 270
2D FIR Filter (10/10)2D FIR Filter (10/10)Dynamic EPA Management (2nd example):Suppose that the video output will be streamed through a communications channel. Here, it makes sense to impose real-time EPA constraints based on the Group Of Pictures (GOP). The GOP describes the prediction relationships between frame types (MPEG-1 recommendation):
* To be * To be submitted to submitted to IEEE IEEE Transactions Transactions on Image on Image ProcessingProcessing
I B B B B P B B I B B B P
The GOP can now be defined for different videos.It makes sense to impose the following accuracy constraints:-I-frames should be of the highest accuracy (all frames depend on it): Point �-P-frames should be of very high accuracy (many frames depend on it): Point �-B-frames can be of low accuracy (no frames depend on it): Point �
20304050607080900.1
0.12
0.14
0.16
0.18
0.2
0.22
0.24
psnr(dB)
Energ
y p
er
fram
e(m
J)
0 50 100 15040
50
60
70
80
90
Frame #
PS
NR
(dB
)
I
P
B
�N=20, NH=10, OB=8
N=32, NH=16, OB=16
� N=24, NH=10, OB=16
I
P
B
'missamerica' video sequence (QCIF)
�
ConclusionsConclusions� A framework was presented for the generation of optimal realizations (in
the multi-0bjective sense) from the Energy-Performance-Accuracy space. The framework allows for dynamic EPA management for digital signal, image, and video processing applications.
� Dynamic EPA management is based on Dynamic Partial Reconfiguration (DPR) and Dynamic Frequency Control to deliver performance with limited hardware resources and relatively low energy consumption.
� The framework was tested on a Pixel Processor architecture and a 2D FIR Filtering system. Dynamic EPA management was demonstrated on twostandard video sequences.
� The results suggest that the general framework can be applied to a variety of digital signal, image, and video processing systems. The framework can be greatly improved by the automatic generation of time-varying constraints (e.g., detection of a scene triggers a requirement for increased accuracy, a scene remaining still triggers a requirement for a decrease in energy consumption)
� Ultimately, this framework will lead to exciting new methods that allow for systems to only switch between architectures that are optimal in the multi-objective sense.
Related publicationsRelated publications[1] D. Llamocca, M. Pattichis, and A. Vera, “A Dynamically Reconfigurable Parallel Pixel A Dynamically Reconfigurable Parallel Pixel
Processing SystemProcessing System”, in Proceedings of 2009 International Conference on Field Programmable Logic and Applications FPL’2009, Prague, Czech Republic, Sep. 2009.
[2] D. Llamocca, M. Pattichis, and A. Vera, “A dynamically reconfigurable platform for fixeddynamically reconfigurable platform for fixed--point FIR filterspoint FIR filters,” in Proceedings of the International Conference on Reconfigurable Computing and FPGAs ReConFig’09, Cancun, Mexico, Dec. 2009.
[3] D. Llamocca, M.S. Pattichis, and G. A. Vera, “A dynamic computing platform for image A dynamic computing platform for image and video processing applicationsand video processing applications,” in Proceedings of the 43rd Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, Nov. 2009.
[4] G. A. Vera, D. Llamocca, M. S. Pattichis, and J. Lyke, “A dynamically reconfigurable A dynamically reconfigurable computing model for video processing applicationscomputing model for video processing applications,” in Proc. of the 43rd AsilomarConference on Signals, Systems and Computers, Pacific Grove, CA, USA, Nov. 2009.
[5] D. Llamocca, M. Pattichis, and G. A. Vera, “Partial Reconfigurable FIR Filtering system Partial Reconfigurable FIR Filtering system using Distributed Arithmeticusing Distributed Arithmetic”, International Journal of Reconfigurable Computing, vol. 2010, Article ID 357978, 14 pages, 2010.
[6] D. Llamocca, M. Pattichis, “RealReal--time dynamically reconfigurable 2time dynamically reconfigurable 2--D D filterbanksfilterbanks”, in Proceedings of 2010 IEEE Southwest Symposium on Image Analysis & Interpretation, Austin, TX, May. 2010.
[7] D. Llamocca, M.S. Pattichis, G. A. Vera, and J. Lyke, “Dynamic Partial Reconfiguration Dynamic Partial Reconfiguration through Ethernet Linkthrough Ethernet Link”, in Proceedings of the 2010 AIAA Infotech Conference at Aerospace, Atlanta, GA, USA, April 2010.
[8]D. Llamocca, C. Carranza, and M. Pattichis, “Separable FIR Filtering in FPGA and GPU Separable FIR Filtering in FPGA and GPU implementations: Energy, Performance, and Accuracy considerationimplementations: Energy, Performance, and Accuracy considerationss”, in Proceedings of 2011 International Conference on Field Programmable Logic and Applications FPL’2011, Chania, Greece, Sep. 2011.