dynamically reconfigurable management of energy, performance…llamocca/research/llamocca2012... ·...

Dynamically Reconfigurable Management Dynamically Reconfigurable Management of Energy, Performance, and Accuracy of Energy, Performance, and Accuracy applied to Digital Signal, Image, and applied to Digital Signal, Image, and

Video Processing Applications Video Processing Applications

Daniel Daniel LlamoccaLlamocca

Electrical and Computer Engineering Department

The University of New Mexico

December 2nd, 2011

Image and Video Processing and Communications LaboratoryImage and Video Processing and Communications Laboratory

OutlineOutline

�� MotivationMotivation

�� Related WorkRelated Work

�� Thesis StatementThesis Statement

�� ContributionsContributions

�� General ApproachGeneral Approach

�� General implementation detailsGeneral implementation details

�� Digital signal, image, and video processing applications:Digital signal, image, and video processing applications:

�� General Implementation detailsGeneral Implementation details

�� Pixel ProcessorPixel Processor

�� 1D FIR Filter1D FIR Filter

�� 2D FIR Filter/2D FIR Filter/FilterbankFilterbank

�� ConclusionsConclusions

MotivationMotivationDigital signal, image, and video processing systems can be characterized by

three properties:

Energy, Performance, and Accuracy (EPA).

The controlling of these variables at run-time is defined as Dynamic Energy-Performance-Accuracy (EPA) management.

Dynamic EPA management will enable us to deliver:

�A dynamically self-adaptive system (by dynamic allocation of computational resources and dynamic frequency control) that satisfies time-varying EPA requirements.

�An Optimal resulting realization: We want to investigate optimal solutions that can meet dynamic EPA requirements . The system should minimize energy consumption, and at the same time maximize performance and accuracy, while satisfying the given EPA requirements.

MotivationMotivationDynamic Energy-Performance-Accuracy management can rely on Dynamic Partial Reconfiguration (DPR) and Dynamic Frequency Control on FPGAs.

Dynamic Partial Reconfiguration

DPR technology enables the adaptation of hardware resources by modifying or switching off portions of the FPGA while the rest remains intact, continuing its operation. To perform DPR, the Partial Reconfiguration Region (PRR) must be defined. The PRR isdynamically reconfigured via the internal c0nfiguration access port (ICAP).

Dynamic Frequency Control

Digital Clock Managers (DCMs) inside FPGAs provide a wide range of clock management features.

The Dynamic Reconfiguration Port (DRP) of the DCM enables dynamic control of the frequency and phase. We can use it to dynamically adjust the frequency without reloading a new bitstream to the FPGA.

FPGA

Module 1

Module 2

Module n

t1

t2

tn

PRR

static region

clkfx

clkin

rst

clkfb clk0

daddr[6:0]

di[15:0]dw e

dclk

den

do[15..0]drdy

DRP

DCM _ADV

DC

R s

lave

inte

rface

processor

DCR MASTER

BUFG

BUFG

reference clock

DC

R B

US

DCR SLAVE

dcr_a_bus

dcr_sl_dbus

dcr_read

dcr_write

sl_dcrack

sl_dcr_dbus

clkinD

Mclkfx ×

=

clkfx

MotivationMotivationThe system can then carry out independent tasks in time:

task 1, task 2, ….

Examples:

� Task 1: A video processing system is asked to deliver real time performance at 30 frames per second (fps) on limited battery life that will also need to operate for at least 10 hours. This is a multi-objective optimization problem. If solutions are found, pick the system realization with the highest precision.

� Task 2: Now, suppose that we are asked to deliver performance at 100 fps at some minimum level of accuracy (60dB). In this case, we can select the hardware realization with the lowest energy requirements while meeting the performance and accuracy constraints.

Related workRelated work (1 of 3)Image processing with DPR:� DPR implementation of mean and median filters [Bhandari09],

[Raikovich10].� Fingerprint image processing algorithms whose stages (segmentation,

normalization, smoothing, etc.) are multiplexed in time via DPR [Fons10].� 3D Haar Wavelet Transform DPR implementation by dynamically

reconfiguring a 1D HWT thrice [Afandi09].� JPEG2000 decoder where the blocks are dynamically swapped

[Bouchoux04]� All these works are DPR implementations that exhibit some resemblance to

our work. However they did not explore the EPA space.

CHREC (NSF Center for High-Performance Reconfigurable Computing):� Acceleration of the Partial Reconfiguration Process (e.g. bitstream

rellocator, high level PR description for fast PR implementation, platformfor rapid deployment of PR embedded systems, using hard macros to reduce FPGA compilation time).

� Adaptive filtering, optical flow static implementations (no DPR)� JTAG encoder/decoder (modules are swapped via DPR)� No exploration of the EPA space via DPR.

Related workRelated work (2 of 3)DPR Application in Dynamic Arithmetic [Vera08]:� The use of DPR provided a low-

energy example where the use of dynamic dual fixed-point (DDFX)arithmetic was shown to performas well as double floating point (FP) in a Linear Algebra example.

� DDFX maintains a performanceadvantage with respect to FP whenreconfiguring once every 10000operations or less (i.e., DDFX canchange precision 250 times per second or switch operations 150 per second.

� Arithmetic cores were measured in terms of their power, performance, and precision.

� A model was formulated that relates power, performance, and precisionof the dynamic arithmetic architecture. It explored the use of DPR to dynamically adjust performance, precision, and power consumption.

� No multi-objective optimization of the Power-Performance-Precision space

QP

: Q

ua

nti

za

tio

n

P

ara

me

ter

Related workRelated work (3 of 3)DPR Application for scalable DCT computation [Huang09]:� Parameterized Discrete Cosine Transform (DCT) systolic modules (no

Distributed Arithmetic approach suitable for FPGAs)� The system dynamically reconfigures among Discrete Cosine Transform

modules of different sizes (e.g., 8x8, 5x5,4x4).� Different DCT configurations were studied in terms of power, performance, power, performance,

and accuracyand accuracy. A configuration manager can adapt DCTs of different sizes based on power, performance, and accuracy power, performance, and accuracy constraints.

� Exploration of power, performance, and precision dependence on the DCT size. No multi-objective optimization of the EPA space.

Thesis StatementThesis StatementThis Dissertation develops a dynamic Energy-Performance-Accuracy

management framework for Digital Signal, Image, and Video processing applications.

This entails: 1. Development and parameterization of efficient and dynamically

reconfigurable architectures for the following signal, image, and video processing applications:

2. Development of a Multi-objective Pareto optimization approach to meet global Energy-Performance-Accuracy (EPA) constraints.

3. Description of the Pareto-optimal realizations extracted from the EPA space: We are interested in how the architecture parameters generate Pareto-optimal solutions from the EPA space.

4. Dynamic EPA management to meet time-varying global EPA constraints. The system receives stimuli in the form of EPA constraints and reconfigures itself via DPR and/or dynamic frequency control to meet the EPA constraints.

DPR 2D FilterbankDPR 2D FIR Filter

DPR 1-D FIR FilterDPR Pixel Processor

ContributionsContributions� Development of fully-parameterized hardware cores for signal, image, and

video processing applications. The architectures are implemented with techniques that minimize the amount of computing resources and take advantage of Dynamic Partial Reconfiguration.

� Characterization of the optimal (in the multi-objective sense) hardware realizations from the EPA/PPA space for the architectures presented.

� A new framework for dynamic energy/power, performance, and accuracy (EPA/PPA) management based on a multi-objective optimization approach that guarantees low energy, high accuracy, and high performance. The framework is applicable to a wide array of signal, image, and video processing architectures.

� Development of hardware systems that support dynamic energy/power, performance, and accuracy management that meet real-time EPA/PPA constraints. On hardware, dynamic EPA/PPA management is based on the run-time control of hardware resources and frequency of operation.

General approach (1/5)General approach (1/5)

Steps:

1) Definition of Objective Functions

2) Development of efficient cores

3) Parameterization of Hardware Cores

4) Multi-objective Pareto Optimization in the EPA Space

5) Dynamic management based on real-time EPA constraints

General approach (2/5)General approach (2/5)1) Definition of objective functions: Energy, performance, and accuracy

are considered the objective functions of system parameters. These properties may have a slightly different definition depending on the application.Energy can be measured as the total energy spent during the system operation, or the energy spent during an operation (e.g., energy per video frame). In some instances, measuring Power is more useful.Performance can be measured by: Megasamples per second, frames per second, Megabytes per second, etc.Accuracy can be measured by: numerical representation, or accuracy with respect to an idealized result (e.g., PSNR).

2) Development of efficient cores: The signal, image, and video processing architectures should use techniques that: i) minimize the amount of computational resources (e.g. LUT-based approaches, Distributed Arithmetic), and ii) make intensive use of DPR. The cores must be implemented in Hardware Description Language (HDL), so that they remain portable across FPGA devices and vendors.

General approach (3/5)General approach (3/5)

PIXEL

PROCESSOR

NC LUT values

(from text f ile)

NI NO F

NI××××NC NO××××NC

3) Parameterization of hardware cores: To achieve a fine control of energy, performance, and accuracy, we require realistic parameterization of the hardware cores (e.g., I/O bit-width, number of parallel cores).

The parameterized HDL code let us create a set of hardware realizations by varying the parameters. Each realization comes with different energy, performance, and accuracy values, which we can control by varying the hardware parameters.

Example: Parameterization of the ‘Pixel processor’ architecture:

NC (number of cores), NI (number of input bits per pixel),

NO (number of output bits per pixel), F (function to be implemented),

LUT values (text file with LUT values)

General approach (4/5)General approach (4/5)4) Multi-objective Pareto Optimization in the EPA Space: The Energy-

Performance-Accuracy (EPA) space is represented by a set of hardware realizations along with their EPA values.An optimal hardware realization is defined as the one that minimizes energy, while maximizing performance and accuracy. We are interested in the set of optimal realizations from the EPA space. We want to find a subset whose EPA values cannot be improved by any other realization for all three (EPA). These realizations are called optimal in the Pareto (multi-objective) sense.

Energ

y

-Accuracy

Pareto front

-Accuracy -Performance

Energ

yPareto front

The Energy-Performance-Accuracy space is shown along with the ParetoPareto--optimaloptimal points. In some cases, we may want to explore a space of just 2 variables, e.g., the Energy-Accuracy space.

General approach (5/5)General approach (5/5)5) Dynamic management based on real-time EPA constraints: Once the

Pareto front has been extracted, we can cast optimization problems based on EPA constraints.Example: We set constraints on all the three variables. The feasible set is represented by the golden points. We prioritize energy consumption, so our selected realization is the one that also minimizes energy consumption.The previous problem could be cast as the following optimization problem:

( ) ( )

( ) fpsRiePerformanc

dBRiAccuracy:tosubject

RiEnergy

Ri

min

30

50

≥

≥

Circled point: Realization from the feasible set that minimizes energy consumption.

If we ignore one variable (say, performance), we have a 2D optimization problem.

-Performance-Accuracy

constr

ain

ts

Energ

y

Energ

y

-Accuracy

constraints

Digital signal, image, and video processing Digital signal, image, and video processing applicationsapplications

The following systems are discussed:

� Pixel Processor and Dynamic EPA Management

� 1D FIR Filter

� 2D Separable FIR Filter/ Filterbank & Dynamic EPA Management for the 2D FIR Filter

General Implementation Details General Implementation Details Embedded FPGA system that supports Dynamic Partial Reconfiguration and Dynamic Frequency Control:Pareto-optimal point: Represented by <<bitstreambitstream, frequency of operation>, frequency of operation>It is a hardware realization that becomes active in the FPGA via Dynamic Partial Reconfiguration (DPR) and/or Dynamic Frequency Control.

System receives an EPA constraint:� It looks for a solution in the Pareto-optimal set: <bitsream*, freq*>� It reconfigures FPGA dynamic region and /or frequency of operation, so as to meet the EPA constraint.

processor

PLB

System

ACEmemory

Ethernet

MAC

ICAP

core

ext. interface

FSL

PLB

ICAP

port

CF

card

freq. ctrlDCR Bus clkfx

EP

A

co

nstr

ain

ts

Energ

y

-Accuracy -Performance

PO 1

Pareto front

'n' bitstreams

in memory

PO 3

PO m

bitstream freq.

Module n

Module 1

Module 3

f1

f2

f1

Module 1 f3

PO 4

PO 2 Module 2 f2

Module

n

Module

2

Module

1

'm ' Pareto points

n ≤ ≤ ≤ ≤ m

DMA

core

M S

Hardw are core

Dynamic

Region

Inte

rface

PRR

Pixel Processor (1/9)Pixel Processor (1/9)� Single-pixel operations (e.g.,

gamma correction, Huffman encoding, histogram equalization, contrast stretching) can be dynamically swapped. Parameter F modifies the function.

� In addition to dynamically modifying the input-output function, we might want to change:

� Input pixel bitwidth (NI)

� Output pixel bitwidth (NO),

� Number of parallel processing elements (NC)

t1

t2

t3

Input Frame

Inv. Gamma correction

gamma = 2

Contrast Stretching

alpha = 2, m = 0.5

Gamma correction

gamma = 0.5

Module 3 Module 2 Module 1

FPGA

PRR

PIXEL

PROCESSOR

NC LUT values

(from text f ile)

NI NO F

NI××××NC NO××××NC

Pixel Processor (2/9)Pixel Processor (2/9)� LUT-based architecture: LUT4 (Virtex-4). LUT6 (Virtex-5, Virtex-6)

Up to LUT8-to-1 can be implemented efficiently with Xilinx primitives.

For LUT inputs > 8, a recursive implementation is employed.

4

M UXF5

8

4 4

M UXF5

4 4

M UXF5

4 4 4 4 4 4 4 4 4 4 4

4 M

SB

s

4 LSBs

LUT5-to-1

LUT6-to-1

LUT7-to-1

LUT8-to-1

LUT

NI-1 to 1

LUT

NI-1 to 1

NI-1NI

LUT NI-to-1

LUT

NI to 1

LUT

NI to 1

LUT

NI to 1

NI

NI

NI

NO

bNO-1

b1

b0

bNO-1b

1b

0

≡

NI

LUT NI-to-NO

NO bits

2N

I word

s o

f N

O b

its

LUT NI-to-NO

LUT

NI-to-NO

NI NO

LUT

NI-to-NO

NI NO

LUT

NI-to-NO

NI NO

NI ×× ××

NC

NO

×× ××N

C

'NC' LUTs NI-to-NO

M UXF5 M UXF5 M UXF5 M UXF5 M UXF5

M UXF6 M UXF6 M UXF6 M UXF6

M UXF7 M UXF7

M UXF8

LU

T4

LU

T4

LU

T4

LU

T4

LU

T4

LU

T4

LU

T4

LU

T4

LU

T4

LU

T4

LU

T4

LU

T4

LU

T4

LU

T4

LU

T4

LU

T4

Pixel Processor (3/9)Pixel Processor (3/9)

� Embedded System: We create a PLB slave burst interface around the pixel processor core. The figure shows a PRR with NC=4, NI=NO=8.

� The system dynamically reconfigures: NC, NI, NO, FUNCTION, under the following constraints: NI×NC≤32, and NO×NC ≤ 32

� Five ‘clkfx’ frequencies allowed: 100.00, 66.66, 50.0, 40.00, and 33.33 MHz.

� FIFOs are required to properly isolate different clock regions (PLB clock= 100 MHz and ‘clkfx’)

ord

en

irden

LUT

8-to-8

LUT

8-to-8

LUT

8-to-8

32 32

upix_ip

LUT

8-to-8

Bus2IP

_D

ata

Slave

Reg 0

oFIFOFWFT mode

DO

rden

IP2B

us_D

ata

32 IP

2B

us_W

rack

IP2B

us_R

dack

Bus2IP

_W

rCE

(0)

Bus2IP

_R

dC

E(1

)

FSM

Bus2IP

_C

lkDI

w ren

PLB Bus

PLB Interface

512x32

Bus2IP

_B

E

rdw r

IP2B

us_A

ddrA

ck

PLB46 Slave Burst IP

Bus2IP

_R

dR

eq

Bus2IP

_W

rReq

full

em

pty

ow

ren

ofull oempty

iFIFOFWFT mode

DO

rden

DI

w ren

full

em

pty

iwre

nifull iempty

FSM

clkfx

iw ren

Slave

Reg 1

iwre

n

512x32

Pow erPC

PLB

System

ACEmemory

Ethernet

MAC

ICAP

core

ICAP

port

CF

card

freq. ctrlDCR Bus clkfx

EP

A

co

nstr

ain

ts

'n' bitstreams

in memory

Module

n

Module

2

Module

1

DMA

core

M S

Hardw are core

Pixel

Processor

Inte

rface

PRRPRR

Implemented on a ML405 Development Board

Device: XC4VFX20-11FF672


� Experimental Setup: The Pixel Processor is tested under 3 different scenarios. Performance and energy are measured for the IP core.

� Test images: 8-bit ‘lena’, 12-bit ‘oilp’

� Scenario A (implemented on the embedded system): 32-bit I/O constrained cases. 5 frequencies considered. Parameter NC not independent.

� Scenario B: 8/12-bit fixed input pixel cases. 5 frequencies considered.

� Scenario C: Fixed-frequency constrained implementation. Fixed frequency (100 MHz).

Scenario A

Scenario B

Scenario C

Pixel Processor (5/9)Pixel Processor (5/9)� Resource scalability: Use of Virtex-4 XC4VFX60 FPGA device (25280

Slices) to account for the largest pixel processor realizations. The cases listed in Scenario C are considered (frequency does not vary resources).

� Resource consumption (a function of NI, NO, and NC) grows exponentially with NI, linearly with NC and NO. In the figure, the results are clearly clustered for NI and NC.

� For NI>10 the resource requirements become suboptimal.

910

1112

910

1112

1314

1516

2000

4000

8000

12000

16000

20000

5

67

8

56

78

910

1112

200

400

600

800

1000

Num

ber

of S

lices

Number of

output bits (NO)Number of

output bits (NO)Number of

input bits (NI)

Number of

input bits (NI)

15

75

185

37

30

54

270

150 63

103

515

315

136

200

1000

680

306

547

1224

2188

690

1096

4384

27601514

2215

6056

8860

3323

4449

13290

17800

200040008000

120001600020000

Num

ber

of S

lices NC=2

NC=4

NC=6

NC=8

6789101112

200400600800

1000

Number of output bits (NO)

Num

ber

of S

lices

NC=2

NC=4

NC=6

NC=8

NC=10


� Multi-objective optimization of the EPA/PPA space:

� Gamma correction function (γ=0.5). Power is presented for Scenarios A and B, and energy per frame for Scenario C.

� Scenario A: 12-bit image (NI:12�5). Pareto points cover a wide range of the PPA space (43%) � the approach is effective in generating varied Pareto points.

0

500

1000

1500

200040 60 80 100 120 140

0

50

100

150

200

fps

psnr(dB)

Pow

er(

mW

)

NI = 5

NI = 6

NI = 7

NI = 8

NI = 9

NI = 10

NI = 11

NI = 12

(509 fps , 48.14 dB, 42.37mW)

NI=NO=5, NC=4, 33 MHz

(1526 fps , 48.14 dB , 48.15mW)

NI=NO=5, NC=4, 100 MHz

(245 fps , 128.6 dB , 156.7mW)

NI=12, NO=16, NC=2, 33 MHz

lowest

power

max. accuracy

highest performance


� Multi-objective optimization of the PPA space:

� Scenario B: 8-bit input image (NI=8 fixed). Pareto points are clustered as a function of NO. A similar trend occurs with NC. (not shown)

� Left side shows how power and performance depend on frequency

0

2000

400075 80 85 90 95 100 105

0

20

40

60

80

100

fps

psnr(dB)

Pow

er(

mW

)

NO = 8

NO = 9

NO = 10

NO = 11

NO = 12

050010001500200025003000350010

20

30

40

50

60

70

80

90

100

fps

Pow

er(

mW

)

100

66.66

50

40

33.33

(217 fps , 83.34 dB, 12.88mW)

NO=9, NC=2, 33 MHz

(3255 fps , 83.34 dB, 70.94mW)

NO=9, NC=10, 100 MHz

highest

performance

lowest

power

max.

accuracy

(217 fps , 101.71 dB, 17mW)

NO=12, NC=2, 33 MHz


� Multi-objective optimization of the EPA space:

� Scenario C: 12-bit input image (NI:12�9), fixed frequency = 100 MHz.

� Performance clusters are defined in terms of the number of cores (NC)

� Energy clusters are defined in terms of the input pixel bitwidth (NI)

500100015002000250030003500

50

100

150

0

50

100

150

200

250

300

fps

psnr(dB)

Energ

y p

er

fram

e(u

J)

NI = 9

NI = 10

NI = 11

NI = 12

500100015002000250030003500

50

100

150

0

50

100

150

200

250

300

fps

psnr(dB)

NC = 2

NC = 4

NC = 6

NC = 8

NO16

15

1413

12

NO

16..1

1

NO

16..1

0

NO

16..9 NI=10, NO=11

NI=11, NO=12

NI=11, NO=11

Pixel Processor (9/9)Pixel Processor (9/9)� Dynamic EPA management: We show an

example on 2D (ignoring performance) with time-varying constraints:

1. Require accuracy≥80dB and Energy ≤160uJ.

2. Minimize energy subject to Accuracy ≥ 100dB

3. Maximize Accuracy.

4. Minimize Energy consumption.

7080901001101201300

50

100

150

200

250

300

psnr(dB)

Ene

rgy p

er

fra

me(u

J)

1001500

200

NI = 9

NI = 10

NI = 11

NI = 12

70809010011012013020

40

60

80

100

120

140

160

180

200

220

psnr(dB)

Energ

y p

er

fram

e(u

J)

EPP Results -Function 1

Energy per frame

Accuracy (ps nr)

0.16 mJ

80dB

min

100 dB

--

max

min

--

� � � �

�

�

�

�

NI=11, NO=11, NC=6

NI=12, NO=12, NC=8

NI=NO=9

NC=8

NI=12, NO=16,

NC=8

* Submitted to IEEE * Submitted to IEEE Transactions on Circuits and Transactions on Circuits and Systems for Video TechnologySystems for Video Technology

1D FIR Filter (1/3)1D FIR Filter (1/3)

EX_in E E E

y

c[0] c[1] c[2 c[3

EX_in E E E

y

shifts and adds

LUT array

Multiply and add Filter Distributed Arithmethic Filter

TWO TYPES OF IMPLEMENTATIONS

� Efficient implementation of a 1D FIR Filter via DPR: Dynamic Partial Reconfiguration turns the fixed-coefficient DA filter into a variable-coefficient DA filter, at the expense of partial reconfiguration time overhead.

� Parameterization of the VHDL-coded FIR filter core:

Distributed Arithmetic (DA) approach is more efficient since it is a LUT-based approach that turns the multiplications into shifts and adds. But it requires the coefficients to be constant.

FIR_DAX_in B

E

NO

N NH

[NO

NQ

]

OP

SY

MM

ET

RY

COEFFICIENTS

Y

L

sclr

BNumber of taps

Input bit-w idth

Coefficient bit-w idth

Output f ixed point format

LUT input size

Output truncation scheme

1D FIR 1D FIR Filter (2/3)Filter (2/3)

� LUT-based architecture for the Distributed Arithmetic Implementation:

...

...

...

...

...

...

s[0]s[M-2]

s[0]:

s[L-1]:

X_in B

x[N-1] x[N-2] x[0]x[1]

X_in

x[0]

x[N-1] x[M +1]

x[M ]x[M -1]x[1]

x[N-2]B

only when

N is even

B+1 B+1 B+1

E E E E

E

E E E

E E E

E

B B B B

sB

s0

Y

+

Adder

tree

s[M-1]

...sB

s1

s0

...

...

-2B

20 +...

LUT L-to-LO

LUT L-to-LO

Filter B lock 0

s[L]:

s[2L-1]:

sB

s0s

Bs

1s

0...

...

-2B

20 +...

LUT L-to-LO

LUT L-to-LO

Filter B lock 1

s[M -L]:

s[M -1]:

sB

s0s

Bs

1s

0

...

...

-2B

20 +...

LUT L-to-LO

LUT L-to-LO

Filter B lock M /L-1

x[0]:

x[L-1]:

xB-1

x0

Y

+

Adder

tree

xB-1

x1

x0

...

...

-2B-1

20 +...

LUT L-to-LO

LUT L-to-LO

Filter B lock 0

x[L]:

x[2L-1]:

xB-1

x0x

B-1x

1x

0

...

...

-2B-1

20 +...

LUT L-to-LO

LUT L-to-LO

Filter B lock 1

x[N-L]:

x[N-1]:

xB-1

x0x

B-1x

1x

0

...

...

-2B-1

20 +...

LUT L-to-LO

LUT L-to-LO

Filter B lock N/L-1

...

...

...

...

...

LO=NH+log2LLO=NH+log

2L


� Filter Block Implementation ��

� Two implementations:

1) Coefficient-only reconfiguration: Only the coefficient values can be dynamically modified (PRR is made of the set of LUTs)

2) FullFull--filter reconfigurationfilter reconfiguration: It allows the run-time modification of all parameters.

� Performance dependence as the reconfiguration rate increases was shown.

LUT LUT LUT LUT LUT LUT LUT LUT LUT

s8

s7

s6

s5

s4

s3

s2

s1

s0

++++

022022

022022

+

122122

+

+

+

322

LL L L L L L L L

LO LO LO LO LO LO LO LO LO

Filter B lock Implementation (Symmetric, B=8)

222

* This work was published in the 2009 * This work was published in the 2009 International Journal of Reconfigurable International Journal of Reconfigurable ComputingComputing


2D Separable Filter Implementation:

� Separable FIR filters allow for efficient implementations by means of two 1D FIR Filters.

� The reconfiguration rate is constant (twice per frame).

� Cyclic Dynamic reconfiguration of two 1-D filters (usually full-filter reconfiguration):

- Implement row filter

- Replace by column filter

- Implement column filter

- Replace by row filter

…

DPR

tr1 tc1FPGA

ROW COL

COL

tr2 tc2

ROW COL

1 Frame is streamed

Processed row -by-row frame is streamed

2

3

4

Replacing row f ilter by column filter via DPR

Replacing column filter by row filter via DPR

ROW

DPRROWCOL

Processing

frame 1

Processing

frame 2

5 Go to Step 1 to process a new frame

ROW PRR

row f ilter

col f ilter

COL

* A comparison of this 2D FIR Filter and a * A comparison of this 2D FIR Filter and a GPU implementation for different number of GPU implementation for different number of coefficients was published 2011 IEEE Field coefficients was published 2011 IEEE Field Programmable Logic Conference (FPLProgrammable Logic Conference (FPL’’2011)2011)


2D Separable Filterbank Implementation:

Simple modifications to 2-D filter implementation:

� Reconfigure with the next 2-D filter and re-process frame

� When all filters have been applied, move to the next-frame and back to the first 2-D filter

ROW 1

COL 1

ROW 2

COL 2

ROW 3

COL 3

bitstreams

in memory

1st 2D filter

2nd 2D filter

3rd 2D filter

PRR

FPGA

ROW 1t1

t2

t4

t3

COL 1

ROW 2

COL 2

COL 3

ROW 3t5

t6

DPRROW i COL i

COL i

1 Frame is streamed

Processed row-by-row frame is streamed

2

3

4

Replacing row filter by column filter via DPR

Replacing column filter by row filter via DPR

ROW i

DPRCOL i

5 Go to Step 1 to:

- Process a new frame if i = 3. In this case j = 1

- Process the same frame if i = 1,2. In this case j = i+1

row f ilter

col filter

ROW j

EXAMPLE: 3-CHANNEL 2-D FILTERBANK

Pro

ce

ssin

g o

f 1

fra

me

* This work was published in the in the 2010 * This work was published in the in the 2010 IEEE Southwest Symposium on Image Analysis IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI 2010)and Interpretation (SSIAI 2010)

2D FIR Filter (3/10)2D FIR Filter (3/10)� Embedded System: We create a FSL interface around the FIR filter core.

The full-filter reconfiguration core is considered since it let us vary all the filter parameters, thereby allowing the creation of a large EPA space.

� FSL interface: We included this interface inside the PRR, so that we dynamically modify the I/O bitwidth.

� DPR control block: disables the PRR outputs during reconfiguration and resets the flip flops of the PRR after each partial reconfiguration.

� Each 2D filter realization is represented by 2 bitstreams.

PPC

PLB

System

ACEmemory

Ethernet

MAC

ICAP

core

FIR Filter processor

FSL

CF

card

FIR IP

core

DP

R c

trl

PRR

FS

L

inte

rface

ICAP

port

EP

A

co

nstr

ain

ts

'2n' bitstreams in memory

FS

L_S

_R

ead

FS

L_S

_E

xis

ts

FSL Slave

FS

L_S

_D

ata

FIFOw

FSL Master

FIFOrF

SL_S

_C

ontro

l

FS

L_S

_C

lk

32

FS

L_M

_W

rite

FS

L_M

_F

ull

FS

L_M

_D

ata

FS

L_M

_C

ontro

l

FS

L_M

_C

lk

32

... ...

FSM

X_in

8

E

8 Y

Filter core

sclr

FSM

DPR control - in DPR control - out

FS

L_S

_R

st

reset PRR

FSL interfaceModule 1 - row Module 1 - col

Module 2 - row Module 2 - col

Module n - row Module n - col

...

...

Implemented on a ML405 Development

Board. Device: XC4VFX20-11FF672

2D FIR Filter (4/10)2D FIR Filter (4/10)Experimental Setup: The FIR core is fully parameterized. The table

describes the combination of parameters we utilize that creates the Energy-Performance-Precision (EPA) space.

� Performance (fps) and energy are measured for the IP core.

� Test image: 8-bit ‘lena’ (VGA, CIF, QCIF frame sizes).

� Three different Gaussian filters:

8

16

10

12

16

32242016128

640x480 (VGA)

352x288 (CIF)

176x144 (QCIF)

Output bitwidth

(OB)

Coefficient bit-

width (NH)Number of coefficients (N)Frame Size

-pi

0

pi

-pi

0

pi0

0.5

ωx

DoG - σ1=2,σ

2=4

ωy

-pi

0

pi

-pi

0

pi0

0.5

1

1.5

ωx

Gaussian f ilter - σx=σ

y=1.5

ωy

Magn

itude

-pi

0

pi

-pi

0

pi0

0.5

1

ωx

Gaussian f ilter - σx=4,σ

y=2

ωy

N=24 N=32 N=48

2D FIR Filter (5/10)2D FIR Filter (5/10)Multi-objective optimization of the EPA space: Results for the 3 filters and 3 image sizes: The highest accuracy is achieved by increasing the number of coefficients (N), the coefficient bitwidth (NH), and with 16 output bits (OB). Frame size increases energy per frame. Thus, we present Pareto-optimal results independently of the frame size.

HA: highest accuracy realization from the Pareto frontLE: lowest energy realization from the Pareto front.

0

500

100040 60 80 100 120

0

0.2

0.4

0.6

0.8

fps

psnr(dB)

Energ

y p

er

fram

e(m

J)

0

500

100020 40

60 80

0

0.2

0.4

0.6

0.8

fps

psnr(dB)

0

500

100020 40 60 80 100

0

0.5

1

fps

psnr(dB)

640x480 640x480

640x480

352x288352x288

352x288176x144 176x144

176x144

200400

OB = 16

OB = 8200400

OB = 16

OB = 8 200400

OB = 16

OB = 8

686fps, 51dB,

0.105mJ, N=8,NH=12

326fps

67.13dB

0.323mJ

N=24,NH=12

136.4fps

85.5dB

0.865mJ

N=32,NH=16

Low pass Gaussian f ilter, σ=1.5 Low pass Gaussian f ilter, σx=4,σy

=2 Band pass (DoG) f ilter, σ1=2,σ2

=4

HA

LE LE

HA HA

LE

67dB

137fps,

0.6874mJ,

N=24,NH=12

2D FIR Filter (6/10)2D FIR Filter (6/10)Multi-objective optimization of the EPA space:

Results for the 3 filters and CIF frame size: We show the Pareto-optimal realizations as a function of N, NH, and OB.

Low-pass Gaussian Filter, σx=σy=1.5:

325

330

33550 60 70 80 90 100 110

0.2

0.25

0.3

0.35

fps

psnr(dB)

325

330

335

50 60 70 80 90 100 110

0.2

0.25

0.3

0.35

fps

psnr(dB)

Energ

y p

er

fram

e(m

J)

N = 24

N = 20

N = 16

N = 12

N = 8 325330

NH = 16

NH = 12

NH = 10

LE, HP1, LA

HA, HE, LP

OB=8

U1HP2 HP3HP4HP5U2,U3 U4 U5 U6 HA, HE, LP

LE, HP1, LA

2D FIR Filter (7/10)2D FIR Filter (7/10)Multi-objective optimization of the EPA space

324

326

328

330

332

334

2030

4050

6070

0.2

0.25

0.3

0.35

fpspsnr(dB)

Energ

y p

er

fram

e(m

J)

324

326

328

330

332

334

2030

4050

6070

0.2

0.25

0.3

0.35

fpspsnr(dB)

N = 24

N = 20

N = 16

N = 12

N = 8

OB=8332fps, 29.6dB, 0.156mJLE, HP, LA

326fps, 67dB, 0.323mJHA, HE, LP

325330

NH = 16

NH = 12

NH = 10

HP

HP

LP

LP

HA, HE, LP

LE, HP, LA

Low-pass Gaussian Filter, σx=4,σy=2

Difference of Gaussians (DoG) filterσ1=2,σ2=4

320

325

330

335

20

40

60

80

100

0.2

0.25

0.3

0.35

0.4

fpspsnr(dB)

320

325

330

335

2040

6080

100

0.2

0.25

0.3

0.35

0.4

fpspsnr(dB)

Energ

y p

er

fram

e(m

J)

N = 32

N = 24

N = 20

N = 16

N = 12

N = 8

OB=8

LE, HP, LA

HA, HE

(a) (b)

325330

NH = 16

NH = 12

NH = 10

HP

HP

LP

LP

332fps, 23.6dB, 0.155mJ

332fps, 23.6dB, 0.155mJ HA, HE

LE, HP, LA

2D FIR Filter (8/10)2D FIR Filter (8/10)Multi-objective optimization of the Energy-Accuracy space: Performance results are over 100 fps (VGA), and 300 fps (CIF). Overall, for a

fixed frame size, performance does not vary significantly. Thus, it makes sense to restrict our attention to the Energy-Accuracy Space

Results for the 3 filters and CIF frame size: We show the Pareto-optimal realizations as a function of N (number of coefficients).

20406080100

0.2

0.25

0.3

0.35

0.4

psnr(dB)

203040506070

0.2

0.25

0.3

0.35

psnr(dB)

406080100120

0.2

0.25

0.3

0.35

psnr(dB)

Energ

y p

er

fram

e(m

J)

Low pass Gaussian f ilter, σ=1.5 Low pass Gaussian f ilter, σx

=4,σy=2 Band pass (DoG) f ilter, σ1

=2,σ2=4

fps(min) = 325.9

fps(max) 332.2

N = 24

N = 20

N = 16

N = 12

N = 8

Frame size:352x288N = 32

N = 24

N = 20

N = 16

N = 12

N = 8

HA

LE LE

HA

HA

LE

U11U10

U9

U8U7 U6

U5U4U3

U2

U1

fps(min) = 325.9

fps(max) 332.2

fps(min) = 322.6

fps(max) 332.2

2D FIR Filter (9/10)2D FIR Filter (9/10)Dynamic EPA Management (1st example): Applied on the Pareto front of the DoG filter (see table). Video sequence: ‘foreman’.

Time-varying sequence of constraints:1. Require Accuracy ≥ 45dB and Energy ≤ 0.3mJ

2. Minimize Accuracy subject to Energy ≤ 0.3mJ

3. Minimize Energy per frame consumption

4. Minimize Energy subject to Accuracy ≥ 65 dB

5. Maximize Accuracy

0 50 100 150 200 250 30020

30

40

50

60

70

80

90

Frame #

PS

NR

(dB

)

2030405060708090

0.2

0.25

0.3

0.35

0.4

psnr(dB)

Energ

y p

er

fram

e(m

J)

�

�N=20, NH=10, OB=8

N=32, NH=16, OB=16

� N=24, NH=16, OB=16

N=8, NH=12, OB=8

�N=24, NH=10

OB=16

�

�Energy per frame

Accuracy (ps nr)

0.3 mJ

45dB

0.3mJ

max

min

--

min

65dB

� � � �--

max

avg=49.65

std=0.0977

avg=61.14

std=0.1425

avg=23.56

std=0.18

avg=66.18

std=0.5568

avg=88.45

std=0.0475

�

�

�

�

�

frame # 30

frame # 90

frame # 150

frame # 215

frame # 270

2D FIR Filter (10/10)2D FIR Filter (10/10)Dynamic EPA Management (2nd example):Suppose that the video output will be streamed through a communications channel. Here, it makes sense to impose real-time EPA constraints based on the Group Of Pictures (GOP). The GOP describes the prediction relationships between frame types (MPEG-1 recommendation):

* To be * To be submitted to submitted to IEEE IEEE Transactions Transactions on Image on Image ProcessingProcessing

I B B B B P B B I B B B P

The GOP can now be defined for different videos.It makes sense to impose the following accuracy constraints:-I-frames should be of the highest accuracy (all frames depend on it): Point �-P-frames should be of very high accuracy (many frames depend on it): Point �-B-frames can be of low accuracy (no frames depend on it): Point �

20304050607080900.1

0.12

0.14

0.16

0.18

0.2

0.22

0.24

psnr(dB)

Energ

y p

er

fram

e(m

J)

0 50 100 15040

50

60

70

80

90

Frame #

PS

NR

(dB

)

I

P

B

�N=20, NH=10, OB=8

N=32, NH=16, OB=16

� N=24, NH=10, OB=16

I

P

B

'missamerica' video sequence (QCIF)

�

ConclusionsConclusions� A framework was presented for the generation of optimal realizations (in

the multi-0bjective sense) from the Energy-Performance-Accuracy space. The framework allows for dynamic EPA management for digital signal, image, and video processing applications.

� Dynamic EPA management is based on Dynamic Partial Reconfiguration (DPR) and Dynamic Frequency Control to deliver performance with limited hardware resources and relatively low energy consumption.

� The framework was tested on a Pixel Processor architecture and a 2D FIR Filtering system. Dynamic EPA management was demonstrated on twostandard video sequences.

� The results suggest that the general framework can be applied to a variety of digital signal, image, and video processing systems. The framework can be greatly improved by the automatic generation of time-varying constraints (e.g., detection of a scene triggers a requirement for increased accuracy, a scene remaining still triggers a requirement for a decrease in energy consumption)

� Ultimately, this framework will lead to exciting new methods that allow for systems to only switch between architectures that are optimal in the multi-objective sense.

Related publicationsRelated publications[1] D. Llamocca, M. Pattichis, and A. Vera, “A Dynamically Reconfigurable Parallel Pixel A Dynamically Reconfigurable Parallel Pixel

Processing SystemProcessing System”, in Proceedings of 2009 International Conference on Field Programmable Logic and Applications FPL’2009, Prague, Czech Republic, Sep. 2009.

[2] D. Llamocca, M. Pattichis, and A. Vera, “A dynamically reconfigurable platform for fixeddynamically reconfigurable platform for fixed--point FIR filterspoint FIR filters,” in Proceedings of the International Conference on Reconfigurable Computing and FPGAs ReConFig’09, Cancun, Mexico, Dec. 2009.

[3] D. Llamocca, M.S. Pattichis, and G. A. Vera, “A dynamic computing platform for image A dynamic computing platform for image and video processing applicationsand video processing applications,” in Proceedings of the 43rd Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, Nov. 2009.

[4] G. A. Vera, D. Llamocca, M. S. Pattichis, and J. Lyke, “A dynamically reconfigurable A dynamically reconfigurable computing model for video processing applicationscomputing model for video processing applications,” in Proc. of the 43rd AsilomarConference on Signals, Systems and Computers, Pacific Grove, CA, USA, Nov. 2009.

[5] D. Llamocca, M. Pattichis, and G. A. Vera, “Partial Reconfigurable FIR Filtering system Partial Reconfigurable FIR Filtering system using Distributed Arithmeticusing Distributed Arithmetic”, International Journal of Reconfigurable Computing, vol. 2010, Article ID 357978, 14 pages, 2010.

[6] D. Llamocca, M. Pattichis, “RealReal--time dynamically reconfigurable 2time dynamically reconfigurable 2--D D filterbanksfilterbanks”, in Proceedings of 2010 IEEE Southwest Symposium on Image Analysis & Interpretation, Austin, TX, May. 2010.

[7] D. Llamocca, M.S. Pattichis, G. A. Vera, and J. Lyke, “Dynamic Partial Reconfiguration Dynamic Partial Reconfiguration through Ethernet Linkthrough Ethernet Link”, in Proceedings of the 2010 AIAA Infotech Conference at Aerospace, Atlanta, GA, USA, April 2010.

[8]D. Llamocca, C. Carranza, and M. Pattichis, “Separable FIR Filtering in FPGA and GPU Separable FIR Filtering in FPGA and GPU implementations: Energy, Performance, and Accuracy considerationimplementations: Energy, Performance, and Accuracy considerationss”, in Proceedings of 2011 International Conference on Field Programmable Logic and Applications FPL’2011, Chania, Greece, Sep. 2011.

dynamically reconfigurable management of energy, performance…llamocca/research/llamocca2012... ·...

Documents