
Page 1: Lecture 2 (I)

Lecture 2 (I): Pipelining & Retiming

Hsie-Chia Chang 張錫嘉

E-mail : [email protected]

Fall 2006

Page 2: Lecture 2 (I)

Optimized Application-Specific Integrated Systems

Outline

Pipelining of FIR Digital Filters
– Data-Broadcast Structures
– Fine-Grain Pipelining

Parallel Processing

Pipelining and Parallel Processing for Low Power

Retiming
– Definitions and Properties
– Solving Systems of Inequalities
– Retiming Techniques
  • Cutset Retiming & Pipelining
  • Retiming for Clock Period Minimization
  • Retiming for Register Minimization

Page 3: Lecture 2 (I)

Introduction

– If a real-time application requires a faster input rate, the system can be sped up by either pipelining, which reduces the effective critical path, or parallel processing, which processes several inputs in each clock cycle.

Page 4: Lecture 2 (I)

Pipelining & Parallel Processing (1/2)

Pipelining
– Reduce the effective critical path by introducing pipelining latches along the critical datapath

– Without any pipelining latches, the critical path can be reduced by …

Parallel processing
– Increase the sampling rate by replicating hardware so that several inputs can be processed in parallel and several outputs can be produced at the same time

These techniques are applied to non-recursive computations.

(Figure: pipelining, $T_{sample} = T_{CLK}$; parallel processing, $T_{sample} \neq T_{CLK}$.)

Page 5: Lecture 2 (I)

Pipelining & Parallel Processing (2/2)

Example 2:

Page 6: Lecture 2 (I)

Pipelining of FIR Digital Filters

Schedule of Events in the Pipelined FIR Filter

$T_{critical} = T_M + T_A$ ($T_M$: multiplication time, $T_A$: addition time)
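A minimal Python sketch of why the critical path drops: the 3-tap direct-form filter $y(n) = a\,x(n) + b\,x(n-1) + c\,x(n-2)$ is modeled as a data-flow graph, and the critical path is taken as the longest path through edges that carry no delay. The node names, the delays $T_M = 10$ and $T_A = 2$ (arbitrary units), and the latch placement (one latch on the edge between the two adders and one on the edge leaving the third multiplier) are assumptions for illustration, since the schedule figure is not reproduced here.

```python
# Hypothetical operation delays in arbitrary time units (illustration only).
T_M, T_A = 10, 2

# Delay of each node in the graph of y(n) = a*x(n) + b*x(n-1) + c*x(n-2).
NODE_DELAY = {"in": 0, "ma": T_M, "mb": T_M, "mc": T_M,
              "add1": T_A, "add2": T_A, "out": 0}

# Nodes in topological order (the graph is feed-forward).
TOPO = ["in", "ma", "mb", "mc", "add1", "add2", "out"]


def fir_edges(latches=0):
    """Edges as (src, dst, number_of_delays).

    `latches` pipelining latches sit on the assumed cutset
    {add1 -> add2, mc -> add2}; latches=0 gives the original filter.
    """
    return [
        ("in", "ma", 0), ("in", "mb", 1), ("in", "mc", 2),
        ("ma", "add1", 0), ("mb", "add1", 0),
        ("add1", "add2", latches), ("mc", "add2", latches),
        ("add2", "out", 0),
    ]


def critical_path(edges):
    """Longest purely combinational path, i.e. through zero-delay edges only."""
    arrival = {}
    for v in TOPO:
        preds = [arrival[src] for (src, dst, d) in edges if dst == v and d == 0]
        arrival[v] = NODE_DELAY[v] + max(preds, default=0)
    return max(arrival.values())


print(critical_path(fir_edges(0)))  # original: T_M + 2*T_A
print(critical_path(fir_edges(1)))  # 2-level pipelined: T_M + T_A
```

With these values the original filter gives $T_M + 2T_A = 14$ and the pipelined one $T_M + T_A = 12$; other assumed values of $T_M$ and $T_A$ give the same symbolic result.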

Page 7: Lecture 2 (I)

Cutset Pipelining (1/2)

The speed is limited by the longest path between
– any two latches
– an input & a latch
– a latch & an output
– the input & the output

2-level pipelined structure
– The longest path can be reduced by suitably placing pipelining latches in the architecture

– In this system, at any time, 2 consecutive outputs are computed in an interleaved manner

– Drawbacks
  • Increased number of latches
  • Increased system latency

Page 8: Lecture 2 (I)

Cutset Pipelining (2/2)

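Cutset: a set of edges that, when removed, splits the graph into two disjoint subgraphs

Feed-forward cutset: a cutset in which data move in the forward direction on every edge

– We can place the same number of latches on every edge of a feed-forward cutset of any FIR filter structure without affecting the functionality of the algorithm; the outputs are only delayed by that number of clock cycles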

(Figure: a feed-forward cutset separating subgraphs G1 and G2, with k delays (kD) added to each edge crossing the cutset.)
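The cutset property can be checked with a small cycle-by-cycle simulation, sketched here under assumptions: a hypothetical 3-tap FIR graph for $y(n) = a\,x(n) + b\,x(n-1) + c\,x(n-2)$ with arbitrary coefficients a = 2, b = 3, c = 5, where G1 = {input, multipliers, first adder} and G2 = {second adder, output}. Adding k delays to every edge crossing from G1 to G2 should leave the output sequence unchanged except for a k-sample shift.

```python
# Hypothetical coefficients for y(n) = a*x(n) + b*x(n-1) + c*x(n-2).
GAIN = {"ma": 2, "mb": 3, "mc": 5}
TOPO = ["in", "ma", "mb", "mc", "add1", "add2", "out"]  # topological order


def fir_edges(k=0):
    """Edges (src, dst, delays); k extra delays sit on the cutset from G1 to G2."""
    return [
        ("in", "ma", 0), ("in", "mb", 1), ("in", "mc", 2),
        ("ma", "add1", 0), ("mb", "add1", 0),
        ("add1", "add2", k), ("mc", "add2", k),  # the feed-forward cutset
        ("add2", "out", 0),
    ]


def simulate(edges, x):
    """Evaluate the data-flow graph once per input sample; delays start at 0."""
    hist = {v: [] for v in TOPO}  # hist[v][n] = value of node v at cycle n
    for n, xn in enumerate(x):
        for v in TOPO:
            # An edge with d delays delivers the source value from cycle n - d.
            ins = [hist[src][n - d] if n - d >= 0 else 0
                   for (src, dst, d) in edges if dst == v]
            if v == "in":
                val = xn
            elif v in GAIN:
                val = GAIN[v] * ins[0]
            else:  # adders and the output node
                val = sum(ins)
            hist[v].append(val)
    return hist["out"]


x = [1, 2, 3, 4, 5]
print(simulate(fir_edges(0), x))  # [2, 7, 17, 27, 37]
print(simulate(fir_edges(1), x))  # [0, 2, 7, 17, 27]
```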

Page 9: Lecture 2 (I)

Example 3.2.1

Page 10: Lecture 2 (I)

Data-Broadcast Structures

Page 11: Lecture 2 (I)

Fine-grain Pipelining

Page 12: Lecture 2 (I)

Parallel Processing

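Parallel processing is also referred to as block processing
– Block size = number of inputs processed in a clock cycle

– For a 3-tap FIR filter, the duplicated hardware can be shown as:

The single-input single-output filter

$y(n) = a\,x(n) + b\,x(n-1) + c\,x(n-2)$

becomes, in the MIMO (multiple-input multiple-output) system,

$y(3k) = a\,x(3k) + b\,x(3k-1) + c\,x(3k-2)$

$y(3k+1) = a\,x(3k+1) + b\,x(3k) + c\,x(3k-1)$

$y(3k+2) = a\,x(3k+2) + b\,x(3k+1) + c\,x(3k)$

(Figure: 3-parallel FIR filter; its delay elements are labeled as block delays.)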
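A minimal sketch of the 3-parallel filter, assuming zero initial state and an input length that is a multiple of 3. Each loop iteration plays the role of one clock cycle: it computes $y(3k)$, $y(3k+1)$, $y(3k+2)$ from the block $x(3k), x(3k+1), x(3k+2)$ plus $x(3k-1)$ and $x(3k-2)$ held over from the previous block (the block delays), using $y(3k+i) = a\,x(3k+i) + b\,x(3k+i-1) + c\,x(3k+i-2)$. The coefficient values in the usage line are arbitrary.

```python
def fir_serial(x, a, b, c):
    """Reference filter y(n) = a*x(n) + b*x(n-1) + c*x(n-2), zero initial state."""
    def xs(i):
        return x[i] if i >= 0 else 0
    return [a * xs(n) + b * xs(n - 1) + c * xs(n - 2) for n in range(len(x))]


def fir_3parallel(x, a, b, c):
    """Block size 3: each iteration consumes 3 inputs and produces 3 outputs."""
    if len(x) % 3:
        raise ValueError("input length must be a multiple of the block size 3")
    y = []
    x_m1 = x_m2 = 0  # x(3k-1) and x(3k-2), held in block delays
    for k in range(len(x) // 3):
        x0, x1, x2 = x[3 * k], x[3 * k + 1], x[3 * k + 2]
        y.append(a * x0 + b * x_m1 + c * x_m2)  # y(3k)
        y.append(a * x1 + b * x0 + c * x_m1)    # y(3k+1)
        y.append(a * x2 + b * x1 + c * x0)      # y(3k+2)
        # After one block delay, x(3k+2) is x(3(k+1)-1) and x(3k+1) is x(3(k+1)-2).
        x_m1, x_m2 = x2, x1
    return y


x = list(range(1, 10))
print(fir_3parallel(x, 2, 3, 5) == fir_serial(x, 2, 3, 5))  # True
```

Only one loop iteration runs per three samples, which is why the parallel hardware can be clocked three times slower than the sample rate.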

Page 13: Lecture 2 (I)

Complete Parallel Processing Systems

– A serial-to-parallel converter
– A parallel-to-serial converter
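At the behavioral level, the two converters can be sketched as follows. The function names and the `block_kernel` callback are hypothetical stand-ins; hardware details such as clocking and the delay elements inside real converters are not modeled. Each block corresponds to one clock cycle of the parallel core, so the core runs L times slower than the sample rate.

```python
def serial_to_parallel(stream, L):
    """Group a serial sample stream into blocks of L samples (one block per clock)."""
    if len(stream) % L:
        raise ValueError("stream length must be a multiple of L")
    return [stream[i:i + L] for i in range(0, len(stream), L)]


def parallel_to_serial(blocks):
    """Flatten blocks of outputs back into a single serial stream."""
    return [sample for block in blocks for sample in block]


def run_parallel_system(stream, L, block_kernel):
    """S/P converter -> L-parallel core -> P/S converter.

    `block_kernel` is a hypothetical stand-in for the parallel core: it maps
    one input block of L samples to one output block of L samples.
    """
    return parallel_to_serial(block_kernel(b) for b in serial_to_parallel(stream, L))


print(run_parallel_system(list(range(6)), 2, lambda blk: [2 * v for v in blk]))
```

Here the core simply doubles each sample, so the call returns [0, 2, 4, 6, 8, 10]; a stateful block filter could be plugged in the same way.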

Page 14: Lecture 2 (I)

Why Use Parallel Processing?

Communication bound
– When the critical path is shorter than the communication time $T_{communication}$, the I/O bound dominates and the system is communication bounded.

– Pipelining can be used only until the critical path reaches the communication bound.

– Once this bound is reached, pipelining can no longer increase the speed.
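A first-order sketch of the communication bound, under idealized assumptions that are not from the slides: M-level pipelining divides the critical path evenly by M, the clock period can never go below the communication time, and L-parallel processing delivers L samples per clock. The critical path of 12 and communication time of 4 are hypothetical values in the same time unit.

```python
def sample_period(t_critical, t_comm, M=1, L=1):
    """Idealized sample period for M-level pipelining and L-level parallelism.

    Assumptions: the critical path splits evenly into M stages, the clock
    period is bounded below by the I/O communication time t_comm, and an
    L-parallel system produces L samples per clock cycle.
    """
    t_clk = max(t_critical / M, t_comm)
    return t_clk / L


# Hypothetical numbers: critical path 12, communication bound 4.
for M in (1, 2, 3, 6):
    print("M =", M, "-> T_sample =", sample_period(12, 4, M=M))
print("M = 3, L = 2 -> T_sample =", sample_period(12, 4, M=3, L=2))
```

Going from M = 3 to M = 6 leaves the sample period at the communication bound of 4, while adding L = 2 parallel processing halves it to 2.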

Page 15: Lecture 2 (I)

Combined Pipelining & Parallel Processing

– After combining M-level pipelining and L-level parallel processing,

Page 16: Lecture 2 (I)

CMOS Power Consumption (1/2)

$P_{total} = P_{dynamic} + P_{short\text{-}circuit} + P_{static}$

Short-circuit power – current spikes

Static power – leakage current

Page 17: Lecture 2 (I)

CMOS Power Consumption (2/2)

Based on a simple approximation and first-order analysis:

– Propagation delay

$T_{pd} = \dfrac{C_{charge} \cdot V_0}{k\,(V_0 - V_t)^2}$

$C_{charge}$: the capacitance to be charged or discharged in a single clock cycle (along the critical path)

$V_0$, $V_t$: the supply voltage and the threshold voltage

$k$: a function of technology parameters

– Power consumption

$P = C_{total} \cdot V_0^2 \cdot f$

$C_{total}$: the total capacitance of the CMOS circuit

$f$: the clock frequency of the circuit
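The two first-order formulas, $T_{pd} = C_{charge} V_0 / (k (V_0 - V_t)^2)$ and $P = C_{total} V_0^2 f$, translate directly into code. The sketch below uses normalized, hypothetical values ($C = 1$, $k = 1$, $f = 1$, $V_t = 0.6$) to show the trade-off: halving $V_0$ cuts power to one quarter but lengthens the propagation delay.

```python
def propagation_delay(c_charge, v0, vt, k=1.0):
    """First-order CMOS delay: T_pd = C_charge * V0 / (k * (V0 - Vt)**2)."""
    if v0 <= vt:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return c_charge * v0 / (k * (v0 - vt) ** 2)


def dynamic_power(c_total, v0, f):
    """First-order CMOS power: P = C_total * V0**2 * f."""
    return c_total * v0 ** 2 * f


# Hypothetical normalized values: compare V0 = 5.0 with V0 = 2.5 (Vt = 0.6).
p_ratio = dynamic_power(1.0, 2.5, 1.0) / dynamic_power(1.0, 5.0, 1.0)
slowdown = propagation_delay(1.0, 2.5, 0.6) / propagation_delay(1.0, 5.0, 0.6)
print("power ratio:", p_ratio, "delay ratio:", slowdown)
```

The power ratio is exactly 0.25, while the delay ratio is greater than 1; this slack in delay is what pipelining and parallel processing recover in order to lower the supply voltage.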

Page 18: Lecture 2 (I)

Low Power Design

To reduce
– Capacitances
  • Transistor/gate capacitance
  • Load capacitance
  • Interconnects
  • External
– Activity
– Frequency
– Power supply

Other issues
– Off-chip connections have high capacitive load
– System integration

Page 19: Lecture 2 (I)

Pipelining for Low Power (1/2)

For an M-level pipelined architecture,
– the critical path is reduced to 1/M, and the capacitance to be charged/discharged in a single cycle ($C_{charge}$) is also reduced to 1/M

If the same clock speed is maintained ($f = 1/T_{pd}$),
– only 1/M of the non-pipelined capacitance needs to be charged or discharged within the same clock period, which suggests that the supply voltage can be reduced
– Suppose the voltage can be reduced to $\beta V_0$; the power consumption becomes

$P_{pipelined} = C_{total} \cdot \beta^2 V_0^2 \cdot f = \beta^2 \cdot P_{non\text{-}pipelined}$

Page 20: Lecture 2 (I)

Pipelining for Low Power (2/2)

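– propagation delay of the original architecture: $T_{pd} = \dfrac{C_{charge} \cdot V_0}{k\,(V_0 - V_t)^2}$

– propagation delay of the pipelined architecture: $T_{pd} = \dfrac{(C_{charge}/M) \cdot \beta V_0}{k\,(\beta V_0 - V_t)^2}$

– setting the two propagation delays equal gives the following quadratic equation in β:

$M \cdot (\beta V_0 - V_t)^2 = \beta \cdot (V_0 - V_t)^2$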
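Expanding $M(\beta V_0 - V_t)^2 = \beta(V_0 - V_t)^2$ gives $M V_0^2 \beta^2 - \left(2 M V_0 V_t + (V_0 - V_t)^2\right)\beta + M V_t^2 = 0$, which the sketch below solves in closed form. Of the two roots, only the larger keeps the scaled supply $\beta V_0$ above $V_t$ (for M = 1 the roots are 1 and $V_t^2/V_0^2$). The values M = 2, $V_0 = 5$ V, and $V_t = 0.6$ V are hypothetical inputs chosen for illustration.

```python
import math


def solve_beta(M, v0, vt):
    """Solve M*(beta*v0 - vt)**2 == beta*(v0 - vt)**2 for the usable root.

    Expanded: M*v0**2*beta**2 - (2*M*v0*vt + (v0 - vt)**2)*beta + M*vt**2 = 0.
    Only the larger root keeps beta*v0 above the threshold voltage vt.
    """
    a = M * v0 ** 2
    b = -(2 * M * v0 * vt + (v0 - vt) ** 2)
    c = M * vt ** 2
    disc = b * b - 4 * a * c  # always positive: the roots bracket vt/v0
    return (-b + math.sqrt(disc)) / (2 * a)


# Hypothetical values: 2-level pipelining, V0 = 5 V, Vt = 0.6 V.
beta = solve_beta(2, 5.0, 0.6)
print("beta =", beta, "new supply =", beta * 5.0, "V")
print("P_pipelined / P_non-pipelined =", beta ** 2)
```

The relative power follows from $P_{pipelined} = \beta^2 P_{non\text{-}pipelined}$, so the second printed value is simply $\beta^2$.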

Page 21: Lecture 2 (I)

Example 3.4.1: Reduce Power by Pipelining

Consider the following two FIR filters.

– What is the supply voltage of the pipelined architecture if the clock periods are identical?

– What is the relative power consumption?

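(Figure: the two FIR filters, each with input x(n), output y(n), and delays D; the multipliers are shown split into parts m1 and m2, and the second filter contains additional pipelining delays.)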

Page 22: Lecture 2 (I)

Solution

Page 23: Lecture 2 (I)

Parallel Processing for Low Power (1/2)

For an L-parallel architecture,
– the charge capacitance remains the same, but the total capacitance ($C_{total}$) is increased L times

To maintain the same sample rate,
– the clock speed is reduced to 1/L (clock period $L \cdot T_{pd}$), which means $C_{charge}$ has L times longer to be charged or discharged
– the supply voltage can then be reduced to $\beta V_0$, and the power consumption becomes

$P_{parallel} = (L \cdot C_{total}) \cdot (\beta V_0)^2 \cdot \dfrac{f}{L} = \beta^2 \cdot P_{non\text{-}parallel}$

Page 24: Lecture 2 (I)

Parallel Processing for Low Power (2/2)

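– propagation delay of the original architecture: $T_{pd} = \dfrac{C_{charge} \cdot V_0}{k\,(V_0 - V_t)^2}$

– propagation delay of the parallel architecture, which may be as long as $L \cdot T_{pd}$: $\dfrac{C_{charge} \cdot \beta V_0}{k\,(\beta V_0 - V_t)^2}$

– setting the parallel delay equal to $L \cdot T_{pd}$ gives the following quadratic equation in β:

$L \cdot (\beta V_0 - V_t)^2 = \beta \cdot (V_0 - V_t)^2$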
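Instead of expanding the quadratic, β can also be found numerically from the delay model: the parallel architecture's delay at supply $\beta V_0$ must equal $L$ times the original delay. The sketch below uses bisection, which works because $V/(V - V_t)^2$ decreases monotonically for $V > V_t$. L = 2, $V_0 = 5$ V, and $V_t = 0.6$ V are hypothetical values; $C_{charge}$ and $k$ cancel out of the condition and are set to 1.

```python
def propagation_delay(c_charge, v, vt, k=1.0):
    """First-order CMOS delay T_pd = C_charge * V / (k * (V - Vt)**2)."""
    return c_charge * v / (k * (v - vt) ** 2)


def solve_beta_parallel(L, v0, vt, c_charge=1.0, k=1.0, tol=1e-12):
    """Find beta so that the delay at beta*v0 equals L times the original delay.

    The delay falls monotonically as the supply rises above vt, so the answer
    lies between vt/v0 (delay -> infinity) and 1 (the original delay).
    """
    target = L * propagation_delay(c_charge, v0, vt, k)
    lo, hi = vt / v0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if propagation_delay(c_charge, mid * v0, vt, k) > target:
            lo = mid  # still too slow: the supply must be higher
        else:
            hi = mid  # fast enough: try a lower supply
    return (lo + hi) / 2


# Hypothetical values: 2-parallel architecture, V0 = 5 V, Vt = 0.6 V.
beta = solve_beta_parallel(2, 5.0, 0.6)
print("beta =", beta, "P_parallel / P_non-parallel =", beta ** 2)
```

Because the condition reduces to $L(\beta V_0 - V_t)^2 = \beta(V_0 - V_t)^2$, which has the same form as the pipelining equation, this first-order model predicts the same β for L-parallel processing as for M-level pipelining with M = L.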

Page 25: Lecture 2 (I)

Example 3.4.2: Reduce Power by Parallel Processing

Consider the following two FIR filters, with their critical paths marked by dashed lines.

– What is the supply voltage of the parallel architecture?

– What is the relative power consumption?

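(Figure: the original FIR filter with input x(n) and output y(n), and its 2-parallel version with inputs x(2k), x(2k+1) and outputs y(2k), y(2k+1); D denotes a delay.)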

Page 26: Lecture 2 (I)

Solution

Page 27: Lecture 2 (I)

Example 3.4.3

Area-efficient architecture

Page 28: Lecture 2 (I)

Summary

In pipelining & parallel processing,
– M-level pipelining,

– L-level parallel processing,

– Combining M-level pipelining & L-level parallel processing,

For low power design,
– Pipelining

– Parallel Processing

– Combining Pipelining and Parallel Processing