lecture 2 (i)

Lecture 2 (I): Pipelining & Retiming

Hsie-Chia Chang 張錫嘉

E-mail : [email protected]

Fall 2006

Optimized Application-Specific Integrated Systems

Outline

Pipelining of FIR Digital Filters
– Data-Broadcast Structures
– Fine-Grain Pipelining

Parallel Processing

Pipelining and Parallel Processing for Low Power

Retiming
– Definitions and Properties
– Solving Systems of Inequalities
– Retiming Techniques
  • Cutset Retiming & Pipelining
  • Retiming for Clock Period Minimization
  • Retiming for Register Minimization


Introduction

– If some real-time application requires a faster input rate, the critical path can be reduced by either pipelining or parallel processing


Pipelining & Parallel Processing (1/2)

Pipelining
– Reduces the effective critical path by introducing pipelining latches along the critical datapath

– Without any pipelining latches, the critical path can be reduced by

Parallel processing
– Increases the sampling rate by replicating hardware so that inputs can be processed in parallel and outputs can be produced at the same time

These techniques are applied to non-recursive computations

[Figure labels: continue sending; Tsample = TCLK; Tsample ≠ TCLK]


Pipelining & Parallel Processing (2/2)

Example 2:


Pipelining of FIR Digital Filters

Schedule of Events in the Pipelined FIR Filter

TCritical=TM+TA


Cutset Pipelining (1/2)

The speed is limited by the longest path between
– any two latches
– an input & a latch
– a latch & an output
– the input & the output

2-level pipelined structure
– The longest path can be reduced by suitably placing the pipelining latches in the architecture
– In this system, at any time, 2 consecutive outputs are computed in an interleaved manner
– Drawbacks


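Cutset Pipelining (2/2)

Cutset
– A set of edges of a graph such that removing these edges makes the graph disjoint

Feed-forward cutset
– A cutset in which data move in the forward direction on all of its edges
– We can arbitrarily place latches on a feed-forward cutset of any FIR filter structure without affecting the functionality of the algorithm

[Figure: a cutset separating subgraphs G1 and G2, with kD delays added on the cutset edges]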
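As a rough illustration of the feed-forward cutset rule, the sketch below simulates a 3-tap direct-form FIR filter cycle by cycle, once without and once with one extra latch (k = 1) on each edge of a feed-forward cutset. The coefficient values, the input sequence, and the function names are assumptions made for this example. The cutset is assumed to cross the delay line in front of the third tap and the adder chain between the two adders.

```python
def fir_direct(x, a, b, c):
    """Non-pipelined 3-tap FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)."""
    x1 = x2 = 0  # delay-line registers holding x(n-1) and x(n-2)
    y = []
    for xn in x:
        y.append(a * xn + b * x1 + c * x2)
        x1, x2 = xn, x1  # clock edge
    return y


def fir_cutset_pipelined(x, a, b, c):
    """Same filter with one extra latch on each edge of a feed-forward cutset."""
    x1 = x2 = 0
    x2_latch = 0   # extra latch on the delay-line edge feeding tap c
    sum_latch = 0  # extra latch between the first and the second adder
    y = []
    for xn in x:
        # The second adder only reads latched values, so the longest
        # register-to-register path is one multiply plus one add.
        y.append(sum_latch + c * x2_latch)
        partial = a * xn + b * x1  # first multiply-add stage
        # clock edge: all registers update simultaneously
        x1, x2, x2_latch, sum_latch = xn, x1, x2, partial
    return y


# Assumed test data: the pipelined output is the original output delayed by one cycle.
x = [1, 2, 3, 4, 5, 6]
y_ref = fir_direct(x, 2, 3, 5)
y_pip = fir_cutset_pipelined(x, 2, 3, 5)
assert y_pip[1:] == y_ref[:-1]
```

The functionality is unchanged apart from one cycle of added latency, which is the behaviour the feed-forward cutset rule describes.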


Example 3.2.1


Data-Broadcast Structures


Fine-Grain Pipelining


Parallel Processing

Parallel processing is also referred to as block processing
– Block size = number of inputs processed in a clock cycle
– For a 3-tap FIR filter, the duplicated hardware can be shown as:

$$y(n) = a\,x(n) + b\,x(n-1) + c\,x(n-2)$$

In MIMO form,

$$
\begin{aligned}
y(3k)   &= a\,x(3k)   + b\,x(3k-1) + c\,x(3k-2) \\
y(3k+1) &= a\,x(3k+1) + b\,x(3k)   + c\,x(3k-1) \\
y(3k+2) &= a\,x(3k+2) + b\,x(3k+1) + c\,x(3k)
\end{aligned}
$$

[Figure label: block delay]
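The block (MIMO) equations for the 3-tap filter can be checked against the serial filter with a short sketch. The function names, coefficient values, and input sequence below are assumptions chosen for illustration, and samples before n = 0 are assumed to be zero.

```python
def fir_serial(x, a, b, c):
    """Serial filter: y(n) = a*x(n) + b*x(n-1) + c*x(n-2), with x(n) = 0 for n < 0."""
    def at(n):
        return x[n] if n >= 0 else 0
    return [a * at(n) + b * at(n - 1) + c * at(n - 2) for n in range(len(x))]


def fir_3_parallel(x, a, b, c):
    """Block form with block size 3: three outputs are produced per clock cycle k."""
    assert len(x) % 3 == 0, "input length must be a multiple of the block size"
    def at(n):
        return x[n] if n >= 0 else 0
    y = []
    for k in range(len(x) // 3):
        y0 = a * at(3 * k)     + b * at(3 * k - 1) + c * at(3 * k - 2)
        y1 = a * at(3 * k + 1) + b * at(3 * k)     + c * at(3 * k - 1)
        y2 = a * at(3 * k + 2) + b * at(3 * k + 1) + c * at(3 * k)
        y.extend([y0, y1, y2])  # parallel-to-serial conversion
    return y


# Assumed test data: both forms must give identical output sequences.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
assert fir_3_parallel(x, 2, 3, 5) == fir_serial(x, 2, 3, 5)
```

Each loop iteration corresponds to one clock cycle of the parallel system, so nine samples are processed in three clock cycles instead of nine.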


Complete Parallel Processing Systems

– A serial-to-parallel converter
– A parallel-to-serial converter


Why use Parallel Processing?

Communication bounded
– When the critical path is shorter than Tcommunication, the I/O bound dominates and the system is communication bounded.
– Pipelining can be used only to the extent that the critical path is limited by the communication bound.
– Once this bound is reached, pipelining can no longer increase the speed.


Combined Pipelining & Parallel Processing

– After combining M-level pipelining and L-level parallel processing,


CMOS Power Consumption (1/2)

Ptotal = Pdynamic + Pshort-circuit + Pstatic

Short-circuit power
– current spikes

Static power
– leakage current


CMOS Power Consumption (2/2)

Based on simple approximations & 1st-order analysis
– Propagation delay

$$T_{pd} = \frac{C_{charge}\,V_0}{k\,(V_0 - V_t)^2}$$

  C_charge: the capacitance to be charged or discharged in a single clock cycle (along the critical path)
  V_0, V_t: the supply voltage and the threshold voltage
  k: a function of technology parameters

– Power consumption

$$P = C_{total}\,V_0^2\,f$$

  C_total: the total capacitance of the CMOS circuit
  f: clock frequency of the circuit
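The trade-off between the two first-order formulas, T_pd = C_charge·V0 / (k·(V0 − Vt)²) and P = C_total·V0²·f, can be tabulated with a minimal sketch. All numeric values below (normalized capacitances, k = 1, f = 1, Vt = 0.6 V, and the list of supply voltages) are assumptions for illustration and are not taken from the lecture.

```python
def propagation_delay(c_charge, v0, vt, k):
    """First-order delay: T_pd = C_charge * V0 / (k * (V0 - Vt)**2)."""
    if v0 <= vt:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return c_charge * v0 / (k * (v0 - vt) ** 2)


def dynamic_power(c_total, v0, f):
    """First-order dynamic power: P = C_total * V0**2 * f."""
    return c_total * v0 ** 2 * f


# Assumed, normalized values: lowering V0 cuts power quadratically
# but lengthens the propagation delay.
for v0 in (5.0, 3.3, 2.5):
    t = propagation_delay(c_charge=1.0, v0=v0, vt=0.6, k=1.0)
    p = dynamic_power(c_total=1.0, v0=v0, f=1.0)
    print(f"V0 = {v0:.1f} V   T_pd = {t:.3f}   P = {p:.2f}")
```

This is the reason supply-voltage scaling needs pipelining or parallel processing: those techniques create the timing slack that the slower, lower-voltage circuit then consumes.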


Low Power Design

To reduce
– Capacitances
  • Transistor/gate C
  • Load C
  • Interconnects
  • External
– Activity
– Frequency
– Power supply

Other issues
– Off-chip connections have high capacitive load
– System integration


Pipelining for Low Power (1/2)

For an M-level pipelined architecture,
– the critical path is reduced to 1/M, and the capacitance to be charged/discharged in a single cycle (Ccharge) is also reduced to 1/M

If the same clock speed is maintained (f = 1/Tpd),
– only 1/M of the non-pipelined capacitance has to be charged or discharged per cycle, which suggests that the supply voltage can be reduced
– Suppose the voltage can be reduced to $\beta V_0$; the power consumption becomes

$$P_{pipelined} = C_{total}\,\beta^2 V_0^2\,f = \beta^2\,P_{non\text{-}pipelined}$$


Pipelining for Low Power (2/2)

– propagation delay of the original architecture

$$T_{seq} = \frac{C_{charge}\,V_0}{k\,(V_0 - V_t)^2}$$

– propagation delay of the pipelined architecture

$$T_{pip} = \frac{(C_{charge}/M)\,\beta V_0}{k\,(\beta V_0 - V_t)^2}$$

– setting the above two equations equal, the following quadratic equation can be obtained to solve for β

$$M\,(\beta V_0 - V_t)^2 = \beta\,(V_0 - V_t)^2$$
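The quadratic M(βV0 − Vt)² = β(V0 − Vt)² can be solved in closed form. The sketch below expands it to M·V0²·β² − (2·M·V0·Vt + (V0 − Vt)²)·β + M·Vt² = 0 and keeps the root for which the reduced supply βV0 stays above the threshold voltage. The function name and the numbers (M = 2, V0 = 5 V, Vt = 0.6 V) are assumptions for illustration only.

```python
import math


def pipelining_beta(m, v0, vt):
    """Solve M*(beta*V0 - Vt)**2 = beta*(V0 - Vt)**2 for the usable root beta."""
    qa = m * v0 ** 2
    qb = -(2 * m * v0 * vt + (v0 - vt) ** 2)
    qc = m * vt ** 2
    disc = math.sqrt(qb ** 2 - 4 * qa * qc)
    roots = [(-qb + disc) / (2 * qa), (-qb - disc) / (2 * qa)]
    # Only a root with beta*V0 > Vt describes a working circuit.
    valid = [r for r in roots if r * v0 > vt]
    return max(valid)


# Assumed values, not taken from the lecture.
beta = pipelining_beta(m=2, v0=5.0, vt=0.6)
print(f"beta = {beta:.4f}, supply = {beta * 5.0:.2f} V, power ratio = {beta ** 2:.3f}")
```

With these assumed values the solver gives β ≈ 0.60, so the pipelined filter would run at roughly 3 V and consume about 36% of the original power at the same clock rate.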


Example 3.4.1: Reduce Power by Pipelining

Consider the following two FIR filters.
– What is the supply voltage of the pipelined architecture if the clock periods are identical?
– What is the relative power consumption?

[Figure: the original FIR filter and its pipelined version, with input x(n), output y(n), delays D, and blocks labeled m1 and m2]


Solution


Parallel Processing for Low Power (1/2)

For an L-parallel architecture,
– the charging capacitance (Ccharge) remains the same, but the total capacitance (Ctotal) is increased L times

To maintain the same sample rate,
– the clock frequency is reduced to f/L = 1/(L·Tpd), which means that Ccharge is charged or discharged over a period L times longer
– the supply voltage can be reduced to $\beta V_0$, and the power consumption becomes

$$P_{parallel} = (L\,C_{total})\,(\beta V_0)^2\,\frac{f}{L} = \beta^2\,P_{non\text{-}parallel}$$


Parallel Processing for Low Power (2/2)

– propagation delay of the original architecture

$$T_{seq} = \frac{C_{charge}\,V_0}{k\,(V_0 - V_t)^2}$$

– propagation delay of the parallel architecture, whose clock period is L times longer

$$T_{par} = L\,T_{seq} = \frac{C_{charge}\,\beta V_0}{k\,(\beta V_0 - V_t)^2}$$

– combining these two propagation delays, the following quadratic equation can be obtained to solve for β

$$L\,(\beta V_0 - V_t)^2 = \beta\,(V_0 - V_t)^2$$
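As an alternative to solving the quadratic directly, β can be found numerically from the delay condition itself: the parallel circuit at supply βV0 is allowed a propagation delay L times that of the original at V0. The sketch below uses bisection on the first-order delay formula and then does the power bookkeeping (L times the capacitance, 1/L of the clock frequency). The function names and the values L = 2, V0 = 5 V, Vt = 0.6 V are assumptions for illustration.

```python
def delay(c_charge, v, vt, k=1.0):
    """First-order delay: C_charge * V / (k * (V - Vt)**2)."""
    return c_charge * v / (k * (v - vt) ** 2)


def parallel_beta(l, v0, vt, tol=1e-12):
    """Find beta such that delay(beta*V0) == L * delay(V0), by bisection."""
    target = l * delay(1.0, v0, vt)
    lo, hi = vt / v0, 1.0  # delay grows without bound as beta*V0 approaches Vt
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if delay(1.0, mid * v0, vt) > target:
            lo = mid  # too slow at this voltage: raise beta
        else:
            hi = mid  # fast enough: try a lower voltage
    return hi


def parallel_power_ratio(l, v0, vt):
    """Return (beta, P_parallel / P_original) with normalized C_total and f."""
    beta = parallel_beta(l, v0, vt)
    c_total, f = 1.0, 1.0  # normalized, assumed values
    p_seq = c_total * v0 ** 2 * f
    p_par = (l * c_total) * (beta * v0) ** 2 * (f / l)
    return beta, p_par / p_seq


beta, ratio = parallel_power_ratio(l=2, v0=5.0, vt=0.6)
print(f"beta = {beta:.4f}, power ratio = {ratio:.3f}")
```

Bisection works here because the first-order delay decreases monotonically with the supply voltage above Vt. The L-fold capacitance and the 1/L clock frequency cancel, so the power ratio reduces to β².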


Example 3.4.2: Reduce Power by Parallel Processing

Consider the following two FIR filters, with their critical paths shown as dashed lines.
– What is the supply voltage of the parallel architecture?
– What is the relative power consumption?

[Figure: the original FIR filter (input x(n), output y(n)) and its 2-parallel version (inputs x(2k), x(2k+1); outputs y(2k), y(2k+1))]


Solution


Example 3.4.3

Area-efficient architecture


Summary

In pipelining & parallel processing,
– M-level pipelining,
– L-level parallel processing,
– Combining M-level pipelining & L-level parallel processing,

For low power design,
– Pipelining
– Parallel Processing
– Combining Pipelining and Parallel Processing
