elec516/10 lecture 9 1 elec 516 vlsi system design and design automation spring 2010 lecture 9 - low...

Post on 20-Dec-2015


ELEC516/10 Lecture 91

ELEC 516 VLSI System Design and Design Automation, Spring 2010

Lecture 9 - Low Power Digital CMOS Design

Reading Assignment: Rabaey: Chapter 4 and 7,11

Note: some of the figures in this slide set are adapted from the slide set of “Digital Integrated Circuits” by Rabaey, Copyright UCB 2002

ELEC516/10 Lecture 92

Why worry about power?-- Heat Dissipation

Besides the low power required for portable applications!

Power is a very big concern in today’s advanced technologies. Power produces heat on the chip, which has to be carried off through the chip socket, leading to expensive packaging solutions.

ELEC516/10 Lecture 93

Motivation for low power design

• Packaging costs
• Power supply rail design
• Chip and system cooling costs
• Noise immunity and system reliability
• Battery life (in portable systems)
• Environmental concerns

– ICT equipment accounted for 10% of total US commercial energy usage in 2010 and may reach 20% by 2020

– Energy Star compliant systems

ELEC516/10 Lecture 94

Why worry about power — Portability

Multimedia Terminals
Laptop Computers
Digital Cellular Telephony

[Figure: nominal battery capacity (Watt-hours/lb, 0 to 50) versus year (1965 to 2000) for Nickel-Cadmium, Ni-Metal Hydride, and rechargeable lithium batteries; annotated "BATTERY (40+ lbs)".]

Expected battery lifetime increase over next 5 years: 30-40%

ELEC516/10 Lecture 95

Why worry about power? -- Standby Power

Drain leakage will increase as VT decreases to maintain noise margins and meet frequency demands, leading to excessive battery draining standby power consumption.

[Figure: standby power as a percentage of total power (0% to 50%) versus year (2000 to 2008); bars annotated 12 W, 88 W, 400 W, 1.7 kW, and 8 kW.]

Source: Borkar, De Intel

Year                  2002  2005  2008  2011  2014
Power supply Vdd (V)  1.5   1.2   0.9   0.7   0.6
Threshold VT (V)      0.4   0.4   0.35  0.3   0.25

…and phones leaky!

ELEC516/10 Lecture 96

Power and Energy Figures of Merit

• Power consumption in Watts
– determines battery life in hours
• Peak power
– determines power and ground wiring designs
– sets packaging limits
– impacts signal noise margin and reliability analysis
• Energy efficiency in Joules
– energy is power integrated over time (power is the rate at which energy is consumed)
• Energy = power * delay
– Joules = Watts * seconds
– lower energy number means less power to perform a computation at the same frequency

ELEC516/10 Lecture 97

Power versus Energy

[Figure: power (Watts) versus time for Approach 1 and Approach 2, shown in two panels.]

Power is height of curve: a lower power design could simply be slower.

Energy is area under curve: the two approaches require the same energy.

ELEC516/10 Lecture 98

PDP and EDP

• Power-delay product (PDP) = $P_{av} \cdot t_p = (C_L V_{DD}^2)/2$
– PDP is the average energy consumed per switching event (Watts * sec = Joule)
– lower power design could simply be a slower design

• Energy-delay product (EDP) = PDP $\cdot\, t_p$ = $P_{av} \cdot t_p^2$
– EDP is the average energy consumed multiplied by the computation time required
– takes into account that one can trade increased delay for lower energy/operation (e.g., via supply voltage scaling that increases delay, but decreases energy consumption)
– allows one to understand tradeoffs better

[Figure: normalized energy, delay, and energy-delay (0 to 15) versus Vdd (0.5 V to 2.5 V).]
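The trade-off behind the energy-delay curve can be sketched numerically. This is a minimal illustration, not the data behind the slide's plot: it assumes the first-order models used elsewhere in this lecture (energy proportional to Vdd squared, delay proportional to Vdd / (Vdd - Vt)^2), and the threshold value Vt = 0.5 V and the function names are hypothetical.

```python
def energy(vdd, c_load=1.0):
    """Switching energy per transition, proportional to C_L * Vdd^2."""
    return c_load * vdd ** 2


def delay(vdd, vt=0.5, c_load=1.0):
    """First-order gate delay C_L * Vdd / I, with I ~ (Vdd - Vt)^2."""
    return c_load * vdd / (vdd - vt) ** 2


def edp(vdd, vt=0.5):
    """Energy-delay product (arbitrary units)."""
    return energy(vdd) * delay(vdd, vt)


def best_vdd(vt=0.5, lo=0.6, hi=2.5, steps=1901):
    """Grid search for the supply voltage that minimizes EDP."""
    candidates = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return min(candidates, key=lambda v: edp(v, vt))


for v in (0.75, 1.0, 1.5, 2.0, 2.5):
    print(f"Vdd={v:.2f} V  energy={energy(v):.2f}  delay={delay(v):.2f}  EDP={edp(v):.2f}")
print(f"EDP minimum near Vdd = {best_vdd():.2f} V")
```

Under these assumed models the EDP is proportional to Vdd^3 / (Vdd - Vt)^2, which has its minimum at Vdd = 3 Vt, so energy keeps falling as Vdd drops while EDP turns back up once the delay penalty dominates.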

ELEC516/10 Lecture 99

Where Does Power Go in CMOS?

• Dynamic power
– Charge-discharge current: dominant source of power ($C V^2$ per transition)
– Short circuit current (both NMOS and PMOS on during a transition): <10% of charge/discharge current if transitions are fast
• Subthreshold leakage (transistors not OFF completely)
– Becoming important: 10-30% of active power in <0.18 um technologies
– Diode leakage from reverse-biased source and drain diodes (negligible)
– Gate leakage (no longer negligible due to very thin gate oxide)

ELEC516/10 Lecture 910

Dynamic Power Consumption

[Figure: CMOS inverter with input Vin, output Vout, supply Vdd, and load capacitance CL.]

Energy/transition = $C_L V_{dd}^2$

Power = Energy/transition * f = $C_L V_{dd}^2 f$

Need to reduce $C_L$, $V_{dd}$, and f to reduce power.

Not a function of transistor sizes!

ELEC516/10 Lecture 911

Dynamic Power Consumption - Revisited

Power = Energy/transition * transition rate
= $C_L V_{dd}^2 f_{0 \to 1}$
= $C_L V_{dd}^2 P_{0 \to 1} f$
= $C_{EFF} V_{dd}^2 f$

Power dissipation is data dependent: it is a function of switching activity.

$C_{EFF}$ = effective capacitance = $C_L P_{0 \to 1}$

ELEC516/10 Lecture 912

Power Consumption is Data Dependent

Example: Static 2 Input NOR Gate

Assume: P(A=1) = 1/2, P(B=1) = 1/2

Then:
P(Out=1) = 1/4
P(0→1) = P(Out=0) · P(Out=1) = 3/4 x 1/4 = 3/16

CEFF = 3/16 * CL
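The same calculation can be carried out for any static gate by enumerating its truth table. The sketch below is illustrative only; it assumes independent inputs and no correlation between successive cycles, and the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def p_out_one(gate, input_probs):
    """P(Out=1) for independent inputs, by enumerating all input patterns."""
    total = Fraction(0)
    for bits in product((0, 1), repeat=len(input_probs)):
        weight = Fraction(1)
        for bit, p1 in zip(bits, input_probs):
            weight *= p1 if bit else 1 - p1
        if gate(*bits):
            total += weight
    return total


def p_0_to_1(gate, input_probs):
    """Static CMOS transition probability: P(Out=0) * P(Out=1)."""
    p1 = p_out_one(gate, input_probs)
    return (1 - p1) * p1


def nor2(a, b):
    return int(not (a or b))


half = Fraction(1, 2)
print(p_0_to_1(nor2, [half, half]))  # 3/16
```

Multiplying the result by CL gives the effective capacitance of the gate output.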

ELEC516/10 Lecture 913

Transition Probabilities for Basic Gates

ELEC516/10 Lecture 914

Transition Probability of 2-input NOR Gate

ELEC516/10 Lecture 916

Inter-signal Correlations

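[Figure: two-gate logic network with inputs A and B (signal probability 0.5 each), internal node X, and output Z; input B fans out and reconverges at the output gate.]

P(Z=1) = P(B=1) · P(A=1 | B=1)

Node X: (1-0.5)(1-0.5) x (1-(1-0.5)(1-0.5)) = 3/16

Node Z (reconvergent): (1 - 3/16 x 0.5) x (3/16 x 0.5) = 0.085

• Determining switching activity is complicated by the fact that signals exhibit correlation in space and time
– reconvergent fan-out

• Have to use conditional probabilities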

ELEC516/10 Lecture 917

Logic Restructuring

Chain implementation has a lower overall switching activity than the tree implementation for random inputs

Ignores glitching effects

Logic restructuring: changing the topology of a logic network to reduce transitions

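[Figure: chain implementation (W = A·B, X = W·C, F = X·D) versus tree implementation (Y = A·B, Z = C·D, F = Y·Z), with all primary inputs at signal probability 0.5.]

Transition probabilities P0→1 per node:
Chain: W = (1-0.25)*0.25 = 3/16, X = 7/64, F = 15/256
Tree: Y = 3/16, Z = 3/16, F = 15/256

AND: P0→1 = P0 x P1 = (1 - PA PB) x PA PB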
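The chain and tree figures can be checked with a few lines of exact arithmetic. This sketch assumes independent, uniformly random inputs and ignores glitching, as the slide does; the node names follow the figure and the function names are hypothetical.

```python
from fractions import Fraction


def and_node(p_a, p_b):
    """Return (signal probability, 0->1 transition probability) of a 2-input AND."""
    p1 = p_a * p_b
    return p1, (1 - p1) * p1


def chain_activity(p):
    """Chain: W = A.B, X = W.C, F = X.D."""
    w, t_w = and_node(p[0], p[1])
    x, t_x = and_node(w, p[2])
    _, t_f = and_node(x, p[3])
    return [t_w, t_x, t_f]


def tree_activity(p):
    """Tree: Y = A.B, Z = C.D, F = Y.Z."""
    y, t_y = and_node(p[0], p[1])
    z, t_z = and_node(p[2], p[3])
    _, t_f = and_node(y, z)
    return [t_y, t_z, t_f]


inputs = [Fraction(1, 2)] * 4
print("chain:", chain_activity(inputs), "total", sum(chain_activity(inputs)))
print("tree: ", tree_activity(inputs), "total", sum(tree_activity(inputs)))
```

Summed over the three nodes, the chain switches 91/256 of the time versus 111/256 for the tree, which is the sense in which the chain has lower overall activity for random inputs.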

ELEC516/10 Lecture 919

Input Ordering

Beneficial to postpone the introduction of signals with a high transition rate (signals with signal probability close to 0.5)

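[Figure: two orderings of a three-input AND built from two-input gates, with P(A=1) = 0.5, P(B=1) = 0.2, P(C=1) = 0.1.]

A and B combined first (X = A·B, F = X·C): P0→1 at X = (1-0.5x0.2)x(0.5x0.2) = 0.09
B and C combined first (X = B·C, F = X·A): P0→1 at X = (1-0.2x0.1)x(0.2x0.1) = 0.0196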

ELEC516/10 Lecture 920

How about Dynamic Circuits?

Power is Only Dissipated when Out=0!

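[Figure: precharged dynamic gate with transistors Mp and Me, pull-down network PDN (inputs In1, In2, In3), supply VDD, and output Out.]

CEFF = P(Out=0) * CL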

ELEC516/10 Lecture 921

2-input NOR Gate

Example: Dynamic 2 Input NOR Gate

Assume: P(A=1) = 1/2, P(B=1) = 1/2

Then: P(Out=0) = 3/4

CEFF = 3/4 * CL

Switching Activity Is Always Higher in Dynamic Circuits

ELEC516/10 Lecture 922

Transition Probabilities for Dynamic Gates

Switching Activity for Precharged Dynamic Gates

P01 = P0

ELEC516/10 Lecture 924

Glitching in Static CMOS Networks

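[Figure: two-gate static CMOS network with inputs A, B, C, internal node X, and output Z; waveforms for the input transition ABC = 101 → 000 under a unit-delay model.]

• Gates have a nonzero propagation delay, resulting in spurious transitions or glitches (dynamic hazards)
– glitch: node exhibits multiple transitions in a single cycle before settling to the correct logic value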

ELEC516/10 Lecture 925

Example : Adder Circuit

[Figure: 16-bit ripple-carry adder chain (Add0 to Add15, carry-in Cin, sum outputs S0 to S15) and simulated sum output voltage (Volts, 0 to 4) versus time (ns, 0 to 10) for Cin and sum bits S1 through S15.]

ELEC516/10 Lecture 926

How to Cope with Glitching?

[Figure: two arrangements of gates F1, F2, F3 annotated with signal arrival times (0, 0, 0, 0, 1, 2 versus 0, 0, 0, 0, 1, 1).]

Equalize Lengths of Timing Paths Through Design

ELEC516/10 Lecture 927

Balanced Delay Paths to Reduce Glitching

So equalize the lengths of timing paths through logic

[Figure: gates F1, F2, F3 annotated with signal arrival times 0, 0, 0, 0, 1, 2 in one arrangement and 0, 0, 0, 0, 1, 1 in the other.]

Glitching is due to a mismatch in the path lengths in the logic network; if all input signals of a gate change simultaneously, no glitching occurs

ELEC516/10 Lecture 928

Glitch Reduction by Pipelining

• Glitches depend on the logic depth of the circuit: gates deeper in the logic network are more prone to glitching
– arrival times of the gate inputs are more spread due to delay imbalances
– usually affected more by primary input switching
• Reduce logic depth by adding pipeline registers
– additional energy used by the clock and pipeline registers

[Figure: five-stage pipeline (Fetch, Decode, Execute, Memory, WriteBack) with PC, instruction cache I$, data cache D$, Instruction, MAR and MDR registers, clk, and pipeline stage isolation registers.]

ELEC516/10 Lecture 929

Short Circuit Power Consumption

Finite slope of the input signal causes a direct current path between VDD and GND for a short period of time during switching when both the NMOS and PMOS transistors are conducting.

[Figure: CMOS inverter with input Vin, output Vout, load capacitance CL, and short-circuit current Isc.]

ELEC516/10 Lecture 930

Short Circuit Current Determinants

$E_{sc} = t_{sc} V_{DD} I_{peak} P_{0 \to 1}$

$P_{sc} = t_{sc} V_{DD} I_{peak} f_{0 \to 1}$

• Duration and slope of the input signal, $t_{sc}$
• $I_{peak}$ determined by
– the saturation current of the P and N transistors, which depends on their sizes, process technology, temperature, etc.
– strong function of the ratio between input and output slopes
• a function of CL

ELEC516/10 Lecture 931

Impact of CL on Psc

[Figure: two inverters with input Vin, output Vout, and load CL; one with Isc ≈ 0, the other with Isc ≈ Imax.]

Large capacitive load (Isc ≈ 0): output fall time significantly larger than input rise time.

Small capacitive load (Isc ≈ Imax): output fall time substantially smaller than the input rise time.

ELEC516/10 Lecture 932

Ipeak as a Function of CL

[Figure: short-circuit current (A, x 10^-4, from -0.5 to 2.5) versus time (sec, x 10^-10, from 0 to 6) for CL = 20 fF, 100 fF, and 500 fF, with a 500 psec input slope.]

Short circuit dissipation is minimized by matching the rise/fall times of the input and output signals - slope engineering.

When load capacitance is small, Ipeak is large.

ELEC516/10 Lecture 933

Static Power Consumption

[Figure: gate with input Vin = 5 V, output Vout, load CL, supply Vdd, and static current Istat.]

Pstat = P(In=1) * Vdd * Istat

• Dominates over dynamic consumption

• Not a function of switching frequency

ELEC516/10 Lecture 934

Leakage (Static) Power Consumption

Sub-threshold current is the dominant factor.

All increase exponentially with temperature!

[Figure: gate showing leakage current Ileakage from VDD to Vout, with components: drain junction leakage, sub-threshold current, and gate leakage.]

ELEC516/10 Lecture 935

Power consideration – leakage current

$I_{sub} = K \, e^{q(V_{gs} - V_{th})/(nkT)} \left(1 - e^{-qV_{ds}/(kT)}\right)$

K: technology constant; q: electronic charge; k: Boltzmann constant; n: nonlinearity constant (between 1 and 2); T: temperature
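To get a feel for the numbers, the subthreshold model can be evaluated directly. The sketch below is an illustration under assumptions: it uses the standard form I_sub = K * exp(q(Vgs - Vth)/(nkT)) * (1 - exp(-qVds/(kT))), and the values n = 1.5, T = 300 K, and K = 1e-6 A are hypothetical, not taken from the slides.

```python
import math

K_BOLTZMANN = 1.380649e-23    # J/K
Q_ELECTRON = 1.602176634e-19  # C


def thermal_voltage(temp_k=300.0):
    """kT/q in volts."""
    return K_BOLTZMANN * temp_k / Q_ELECTRON


def i_sub(vgs, vds, vth, n=1.5, k_tech=1e-6, temp_k=300.0):
    """Subthreshold current; k_tech is an assumed technology constant (A)."""
    vt_thermal = thermal_voltage(temp_k)
    return (k_tech
            * math.exp((vgs - vth) / (n * vt_thermal))
            * (1.0 - math.exp(-vds / vt_thermal)))


def slope_mv_per_decade(n=1.5, temp_k=300.0):
    """Change in (Vgs - Vth) needed to change the current by 10x."""
    return 1000.0 * n * thermal_voltage(temp_k) * math.log(10.0)


print(f"subthreshold slope: {slope_mv_per_decade():.1f} mV/decade")
print(f"leakage at Vth=0.4 V: {i_sub(0.0, 1.0, 0.4):.3e} A")
print(f"leakage at Vth=0.1 V: {i_sub(0.0, 1.0, 0.1):.3e} A")
```

With these assumed values the slope comes out close to 90 mV/decade, so lowering the threshold by roughly that amount multiplies the off-state current by ten.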

ELEC516/10 Lecture 936

Leakage as a Function of VT

[Figure: ID (A, log scale from 10^-12 to 10^-2) versus VGS (V, 0 to 1) for VT = 0.4 V and VT = 0.1 V.]

Continued scaling of supply voltage and the subsequent scaling of threshold voltage will make subthreshold conduction a dominant component of power dissipation.

A 90 mV/decade VT roll-off means each 270 mV increase in VT gives 3 orders of magnitude reduction in leakage (but adversely affects performance).

ELEC516/10 Lecture 937

Sub-Threshold in MOS

[Figure: ID versus VGS for VT = 0.6 and VT = 0.2.]

Lower Bound on Threshold to Prevent Leakage

ELEC516/10 Lecture 938

Low Power Design Space

• The dynamic power consumption equation reveals the three degrees of freedom inherent in the low power design space:
– Voltage
– Physical capacitance
– Data activity

• Optimization for power entails an attempt to reduce one or more of these factors. Interactions among these factors complicate the optimization problem.

• Deep sub-micron design - need to minimize leakage and sub-threshold current

ELEC516/10 Lecture 939

Power Reduction Strategy (I)

• Voltage Reduction

– 5 V -> 3.3 V -> 2.5 V -> 1.8 V -> 1.0 V

– Mixed supplies in the system and/or on chip: use the minimum voltage for different chips or for functions within a chip, together with on-chip voltage converters if required.

• Low-voltage circuit techniques are required to give good performance even with low voltages

• Less noisy structures and better signal integrity handling are required

• A lower Vth process is required to maintain good transistor speed

ELEC516/10 Lecture 940

Reducing Vdd

$P \times t_d = E_t = C_L V_{dd}^2$

$\dfrac{E(V_{dd}=2)}{E(V_{dd}=5)} = \dfrac{C_L \cdot 2^2}{C_L \cdot 5^2}$, so $E(V_{dd}=2) \approx 0.16\, E(V_{dd}=5)$

Strong function of voltage ($V^2$ dependence).

Relatively independent of logic function and style.

[Figure: normalized power-delay product (0.03 to 1.5) versus Vdd (1 to 5 volts) for a 51 stage ring oscillator and an 8-bit adder, following a quadratic dependence.]

Power-delay product improves with lowering $V_{DD}$.

ELEC516/10 Lecture 941

Lower Vdd Increases Delay

$T_d = \dfrac{C_L V_{dd}}{I}$, with $I \sim (V_{dd} - V_t)^2$

$\dfrac{T_d(V_{dd}=2)}{T_d(V_{dd}=5)} = \dfrac{2 \cdot (5 - 0.7)^2}{5 \cdot (2 - 0.7)^2} \approx 4$

Relatively independent of logic function and style.

[Figure: normalized delay (1.00 to 7.50) versus Vdd (2.00 to 6.00 volts) in a 2.0 µm technology for an adder (SPICE), microcoded DSP chip, multiplier, adder, ring oscillator, and clock generator.]
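The two ratios quoted for Vdd = 2 V versus Vdd = 5 V can be reproduced with a short calculation. This is a sketch using only the first-order models on these slides (energy proportional to Vdd^2, delay proportional to Vdd / (Vdd - Vt)^2 with Vt = 0.7 V); the function names are hypothetical.

```python
def energy_ratio(v_low, v_high):
    """E(v_low) / E(v_high) with E = C_L * Vdd^2."""
    return (v_low / v_high) ** 2


def delay_ratio(v_low, v_high, vt=0.7):
    """Td(v_low) / Td(v_high) with Td = C_L * Vdd / I and I ~ (Vdd - Vt)^2."""
    def td(v):
        return v / (v - vt) ** 2
    return td(v_low) / td(v_high)


print(f"energy ratio: {energy_ratio(2.0, 5.0):.2f}")  # 0.16
print(f"delay ratio:  {delay_ratio(2.0, 5.0):.2f}")   # 4.38
```

Dropping the supply from 5 V to 2 V therefore cuts the energy per transition to about a sixth while making the gate a little over four times slower.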

ELEC516/10 Lecture 942

Lowering the Threshold

[Figure: ID versus VGS for Vt = 0.2 and Vt = 0, and delay versus Vdd with 2Vt marked.]

Reduces the Speed Loss, But Increases Leakage

Interesting Design Approach: DESIGN FOR PLeakage == PDynamic

ELEC516/10 Lecture 943

What Threshold Voltage to Use?

• Energy vs. Vt for a fixed throughput

• An “optimal” Vt/Vdd point trades switching and leakage energy

ELEC516/10 Lecture 944

Stack Effect

• Leakage is a function of the circuit topology and the value of the inputs

$V_T = V_{T0} + \gamma\left(\sqrt{|-2\phi_F + V_{SB}|} - \sqrt{|-2\phi_F|}\right)$

where $V_{T0}$ is the threshold voltage at $V_{SB} = 0$; $V_{SB}$ is the source-bulk (substrate) voltage; $\gamma$ is the body-effect coefficient

[Figure: two-input gate with inputs A and B, stacked transistors with intermediate node VX, and output Out.]

A  B  VX           ISUB
0  0  VT ln(1+n)   VGS = VBS = -VX
0  1  0            VGS = VBS = 0
1  0  VDD-VT       VGS = VBS = 0
1  1  0            VSG = VSB = 0

• Leakage is least when A = B = 0
• Leakage reduction due to stacked transistors is called the stack effect

ELEC516/10 Lecture 945

Leakage as a Function of Design Time VT

• Reducing the VT increases the sub-threshold leakage current (exponentially)

– 90mV reduction in VT increases leakage by an order of magnitude

• But, reducing VT decreases gate delay (increases performance)

[Figure: ID (A) versus VGS (V, 0 to 1) for VT = 0.4 V and VT = 0.1 V.]

• Determine the critical path(s) at design time and use low VT devices on the transistors on those paths for speed. Use a high VT on the other logic for leakage control.

– A careful assignment of VT’s can reduce the leakage by as much as 80%

ELEC516/10 Lecture 946

Dual-Thresholds Inside a Logic Block

• Minimum energy consumption is achieved if all logic paths are critical (have the same delay)

• Use lower threshold on timing-critical paths
– Assignment can be done on a per gate or transistor basis; no clustering of the logic is needed
– No level converters are needed

ELEC516/10 Lecture 947

Variable VT (ABB) at Run Time

• $V_T = V_{T0} + \gamma\left(\sqrt{|-2\phi_F + V_{SB}|} - \sqrt{|-2\phi_F|}\right)$

[Figure: VT (V, 0.4 to 0.9) versus substrate bias (V, -2.5 to 0).]

For an n-channel device, the substrate is normally tied to ground (VSB = 0)

A negative bias on the substrate relative to the source (VSB > 0 in the equation above) causes VT to increase

Adjusting the substrate bias at run time is called adaptive body-biasing (ABB)

Requires a dual well fab process
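The body-effect equation is easy to evaluate for a range of bias values. The following is a minimal sketch, not a calibrated device model: the parameter values VT0 = 0.45 V, gamma = 0.4 V^0.5, and |2*phi_F| = 0.6 V are assumed textbook-style numbers for an n-channel device, and the function name is hypothetical.

```python
import math


def threshold_voltage(vsb, vt0=0.45, gamma=0.4, two_phi_f=0.6):
    """VT = VT0 + gamma * (sqrt(|2*phi_F| + VSB) - sqrt(|2*phi_F|)).

    vsb is the source-to-bulk voltage (>= 0 for reverse body bias on NMOS);
    vt0, gamma, and two_phi_f are assumed illustrative values.
    """
    return vt0 + gamma * (math.sqrt(two_phi_f + vsb) - math.sqrt(two_phi_f))


for vsb in (0.0, 0.5, 1.0, 1.5, 2.0, 2.5):
    print(f"VSB = {vsb:.1f} V  ->  VT = {threshold_voltage(vsb):.3f} V")
```

With these assumed parameters, 2.5 V of reverse body bias raises the threshold from 0.45 V to roughly 0.84 V, which is the knob that adaptive body-biasing turns at run time.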

ELEC516/10 Lecture 948

Techniques for Burst Mode Computation

• Multiple VT technology (disable high VT devices during idle periods), e.g. [Sakata-93], [Mutoh-93]

[Figure: low VT logic between high VT SLEEP transistors connected to the supply rails.]

• Substrate bias controlled variable VT devices (increase VT during idle periods), e.g. [Seta-95]

[Figure: inverter (Vin, Vout, Vdd) whose substrate biases switch between on and standby settings, with Vp > 0 and Vn < 0.]

ELEC516/10 Lecture 949

Multiple-Threshold Circuits

• Previous approaches cannot reduce leakage power during active mode.

• Use dual Vt CMOS logic: low Vt for critical paths and high Vt for non-critical paths. Problem: large threshold swing.

• Triple Vt CMOS circuit to reduce the sub-threshold swing [Fujii et al., ISSCC-98]
– For high speed, low power active mode, low and medium Vt are used for critical and non-critical paths, respectively.
– For standby mode, a high Vt MOSFET is inserted between the supply rail and the virtual supply rail.

ELEC516/10 Lecture 950

Multiple-Threshold Circuits

ELEC516/10 Lecture 951

Power considerations – reduce leakage

• Leakage proportional to device width
– Use smallest devices for critical path.
• Leakage drops with stacked devices (drain voltage divider)
– Use stacked transistors for critical path.
• Leakage drops with increasing channel length
– Slightly increase L for critical path
• Use dual VT process providing two threshold voltages
– Use high VT transistors for critical path

ELEC516/10 Lecture 952

Power consideration – reduce leakage

• Switch off critical path transistors when not needed.
• Stand-by mode between supply and virtual supply lines
• Stand-by vectors
– Apply input vector which minimizes leakage.
– Achieved using Mux

ELEC516/10 Lecture 953

Power gating transistor sizing

• The effect of power gating transistor size
– As the size decreases, logic performance also decreases.
– As the size increases, leakage current and chip area also increase.
– Proper sizing is very important.
– Power gating transistor size should be chosen to keep performance degradation within 2%.

Vop = VDD - ΔV

ΔV must be kept small enough, through sizing, to stay within 2% performance degradation.

[Figure: low Vt logic between VDD and GND, with a high Vt switch driven by a Control signal.]

ELEC516/10 Lecture 954

Power Reduction Strategy (II)

• Reducing capacitance
– Process scaling and better integration, with smaller capacitances in more aggressive processes
– Improved devices and interconnect technology
– Efficient clock generation and distribution
– Good memory hierarchy
– In-place optimization using a library containing ranges of gates with different strengths, replacing gates to use the optimum drive in the critical paths and minimum drive elsewhere

ELEC516/10 Lecture 955

Reducing Effective Capacitance

Global bus architecture Local bus architecture

Shared Resources incur Switching Overhead

ELEC516/10 Lecture 956

Power consideration – Reduce Capacitance

• Reduce switched capacitance C
– Careful transistor sizing, transistor ordering, tighter and more compact layout
– Hierarchical architecture; add transmission gates (TG) to isolate buses
– Segmented structures

[Figure: shared bus driven by A or B when sending values to C; a switch is inserted to isolate the bus segment when B is sending to C.]

ELEC516/10 Lecture 957

Power Reduction Strategy (III)

• Reducing activities
– Lowering operating frequency
– Using power management strategies, such as gated clocks and power-down of non-operational units
– Reduce switching activities: Power = Energy/transition * transition rate = $C_L V_{dd}^2 f_{0 \to 1} = C_L V_{dd}^2 P_{0 \to 1} f = C_{eff} V_{dd}^2 f$

• Power dissipation is data dependent and hence is a function of switching activity.

• $P_{0 \to 1}$ is the switching probability

• Effective switching capacitance: $C_{eff} = C_L P_{0 \to 1}$

ELEC516/10 Lecture 958

Factors Affecting Power Consumption - Revisited

• Degree of freedom for low power design space:– Voltage

– Physical capacitance

– Data activity

• Power minimization approaches:– Run at minimum allowable voltage

– Minimize effective switching capacitances

ELEC516/10 Lecture 959

System Level - Power Down Techniques

Operating States:
• Active or Full-On (fastest clock)
• Standby (slow clock)
• Suspend or Sleep (slow clock or shut down)

[Figure: microprocessor with an activity analyzer.]

ELEC516/10 Lecture 960

Dynamic Power as a Function of VDD

• Decreasing the VDD decreases dynamic energy consumption (quadratically)

• But, increases gate delay (decreases performance)

[Figure: normalized tp (1 to 5.5) versus VDD (0.8 V to 2.4 V).]

• Determine the critical path(s) at design time and use high VDD for the transistors on those paths for speed. Use a lower VDD on the other gates, especially those that drive large capacitances (as this yields the largest energy benefits).

ELEC516/10 Lecture 961

Multiple VDD Considerations

• How many VDD?
– Two is becoming common
– Many chips already have two supplies (one for core and one for I/O)
• When combining multiple supplies, level converters are required whenever a module at the lower supply drives a gate at the higher supply (step-up)
– If a gate supplied with VDDL drives a gate at VDDH, the PMOS never turns off
– The cross-coupled PMOS transistors do the level conversion
– The NMOS transistors operate on a reduced supply
– Level converters are not needed for a step-down change in voltage
– Overhead of level converters can be mitigated by doing conversions at register boundaries and embedding the level conversion inside the flipflop

[Figure: level converter with cross-coupled PMOS transistors, supplies VDDH and VDDL, input Vin, and output Vout.]

ELEC516/10 Lecture 962

Dual-Supply Inside a Logic Block

• Minimum energy consumption is achieved if all logic paths are critical (have the same delay)

• Clustered voltage-scaling
– Each path starts with VDDH and switches to VDDL (gray logic gates) when delay slack is available
– Level conversion is done in the flipflops at the end of the paths

ELEC516/10 Lecture 963

Power Conscious Behavioral Design

• Run functional units at the minimum allowed voltage while satisfying the timing constraints

• Parallelize or pipeline data-path, memory and controllers to compensate for throughput loss due to reduced supply voltage

• Power down functional units which are not in use; put in “dynamic power management” capability

• Avoid centralized resources (controllers, functional blocks, global busses, etc.) as much as possible

• Map functions to hardware so that inter-chip communication is minimized

• Schedule and bind operations to functional units so as to reduce the activity of the input operands

• Reorder operands to reduce switching activity; Keep the inputs to an idle unit unchanged

ELEC516/10 Lecture 964

Low Power Datapath Design - reducing the supply voltage

• Reducing supply voltage has quadratic effect on power saving, but a negative effect on performance.

• Performance can be gained back by logical and architectural optimizations, e.g. lookahead adder instead of ripple-carry adder, using parallelism to increase performance

ELEC516/10 Lecture 965

Power Considerations – reduce f and VDD

• Reducing frequency does not save energy, just reduces rate at which it is consumed– Power is lower but system must run longer

• Reducing supply voltage is very effective (scaling the voltage by a factor of 0.5 scales energy/transition by a factor of 0.25).

• Dropping the voltage reduces performance in terms of speed (need to recover the performance using parallelism).

• Trade surplus performance for lower energy by reducing the supply voltage until performance is as required

ELEC516/10 Lecture 966

Power considerations – Dynamic voltage Scaling

Can run at lower voltage and hence quadratically improves power consumption.

– 8-bit adder/comparator (Chandrakasan et al.): consumes Pref at 40 MHz, 5 V, with area = 0.53 mm2
– Two parallel interleaved architecture: consumes 0.36 Pref at 20 MHz, 2.9 V, with area = 1.80 mm2
– One pipelined architecture: consumes 0.39 Pref at 40 MHz, 2.9 V, with area = 0.69 mm2
– Pipelined and parallel: consumes 0.2 Pref at 20 MHz, 2.0 V, with area = 1.96 mm2

ELEC516/10 Lecture 967

Minimizing the power consumption using parallelism

• Reference Design

Critical path = 25 ns, clock frequency = 40 MHz, supply voltage = 5 V

$P_{ref} = C_{ref} V_{ref}^2 f_{ref}$

ELEC516/10 Lecture 968

• Parallel implementation of the reference design

New critical path = 50 ns, $C_{par} \to 2.15\,C_{ref}$, $V_{par} \to 2.9$ V, $f_{par} \to 0.5\,f_{ref}$

$P_{par} = C_{par} V_{par}^2 f_{par} = (2.15\,C_{ref})(0.58\,V_{ref})^2 \left(\dfrac{f_{ref}}{2}\right) \approx 0.36\,P_{ref}$

ELEC516/10 Lecture 969

Pipelined Data Path

• Critical path delay is less: max[Tadder, Tcomparator].

• Keeping clock rate constant ($f_{pipe} = f_{ref}$), the voltage can be dropped to $V_{pipe} = V_{ref}/1.7$ while maintaining the original throughput

• Capacitance slightly higher: $C_{pipe} = 1.15\,C_{ref}$

• $P_{pipe} = (1.15\,C_{ref})(V_{ref}/1.7)^2 f_{ref} \approx 0.39\,P_{ref}$
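The parallel and pipelined power figures both follow from P = C * V^2 * f with each factor expressed relative to the reference design. A small sketch of that bookkeeping, using only the scale factors quoted on these slides (the function name is hypothetical):

```python
def relative_power(c_scale, v_scale, f_scale):
    """P / Pref for P = C * V^2 * f, with each factor given relative to the reference."""
    return c_scale * v_scale ** 2 * f_scale


# Parallel: 2.15x capacitance, 2.9 V instead of 5 V, half the clock rate.
parallel = relative_power(2.15, 2.9 / 5.0, 0.5)
# Pipelined: 1.15x capacitance, Vref / 1.7, same clock rate.
pipelined = relative_power(1.15, 1.0 / 1.7, 1.0)

print(f"parallel:  {parallel:.3f} Pref")
print(f"pipelined: {pipelined:.3f} Pref")
```

The parallel case evaluates to about 0.36 Pref. The pipelined case evaluates to about 0.40 Pref with these rounded factors, slightly above the 0.39 quoted on the slide.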

ELEC516/10 Lecture 970

How low a voltage can be used

• Capacitance overhead starts to dominate at “high” levels of parallelism and results in an optimum voltage.

ELEC516/10 Lecture 971

Voltage as a design variable

Adapting Voltage to Workload yields cubic reduction

ELEC516/10 Lecture 972

Multiple supply voltages – Filter Example

[Figure: filter data-flow graph scheduled over control steps 1 to 10, with multiply (*) and add (+) operations assigned to 5 V, 3 V, and 2.4 V supplies.]

Power(5V) / Power(5V, 3V, 2.4V) = 1.5 [Raje95]

ELEC516/10 Lecture 973

Using Multiple Voltages

[Figure: blocks C1, C2, C3 powered from supplies Vdd1, Vdd2, Vdd3, with the critical path and non-critical paths marked.] [Horowitz-95]

ELEC516/10 Lecture 974

Processor Usage Model

• System Optimizations:
– Maximize peak throughput
– Minimize average energy/operation
– Maximize computation per battery life

[Figure: desired throughput versus time for a single-user system that is not always computing, with labelled regions: compute-intensive and low-latency processes; ceiling set by the top speed of the processor; background and high-latency processes.]

ELEC516/10 Lecture 975

Scale Supply Voltage with fclk

ELEC516/10 Lecture 976

Dynamic Voltage Scaling Implementation

• VCO: ring oscillator which matches the µP critical path
• DC-DC: performs D/A, converts battery voltage to regulated Vdd
• Provides both voltage regulation and clock generation.

[Figure: feedback loop with a frequency detector (inputs M·fref and fout), loop filter, DC-DC converter fed by the battery, VCO, and µP sharing the regulated Vdd.]

ELEC516/10 Lecture 977

Dynamic Voltage Scaling in Practice

Fixed Throughput, Energy/operation:
• Throughput = 8 MIPS, energy/op = 0.24 nJ/inst. (1.92 mW)
• Throughput = 100 MIPS, energy/op = 2.2 nJ/inst. (220 mW)

Occasionally Demand Peak Throughput:
• Peak throughput = 100 MIPS, average energy/op = 0.24 nJ/inst. (~1.8 mW)

DVS only advantageous when a majority of computation is performed at low throughput

ELEC516/10 Lecture 978

Low-voltage Switching Regulator

• Arbitrary Vdd (< Vin) generated using the Buck converter
– Vdd = Vin x duty cycle at node X
• Chief sources of inefficiencies:
– Conduction loss ($I^2 R$)
– Switching loss ($C_x V_{in}^2 f_s$ and $L_s I^2 f_s$)
– Gate-drive loss ($C_g V_{in}^2 f_s$)

[Stratakos-94]

ELEC516/10 Lecture 979

Adaptive Power Supply Voltage

• Exploit Data Dependent Computation Times To vary the Supply [Nielsen94]

[Figure: self-timed processor with REG and FIFO stages at its input and output, a control block, and a power supply delivering Vdd(t).]

ELEC516/10 Lecture 980

Variable Supply Voltage Control Scheme

ELEC516/10 Lecture 981

Voltage Scheduling for variable workload system

• Voltage scheduling under timing constraints
• Example [Ishihara-98]
– Energy consumption of a processor: 10 nJ/cycle at 2.5 V, 25 nJ/cycle at 4 V, 40 nJ/cycle at 5 V
– Maximum clock frequencies: 50 MHz at 5 V, 40 MHz at 4 V, 25 MHz at 2.5 V
– Given: an application needs 1000M cycles to finish and the timing constraint is 25 sec.

ELEC516/10 Lecture 982

Different Voltage Schedules

[Figure: three schedules plotted as energy consumption per cycle (proportional to Vdd^2) versus time (0 to 25 sec, the timing constraint).]

(A) 1000M cycles at 50 MHz and 5.0 V: 40 J
(B) 750M cycles at 50 MHz and 5.0 V, then 250M cycles at 25 MHz and 2.5 V: 32.5 J
(C) 1000M cycles at 40 MHz and 4.0 V: 25 J
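The energy and run time of each schedule follow directly from the per-cycle energies and maximum clock frequencies of the [Ishihara-98] example. A minimal sketch of that arithmetic (the table and function names are hypothetical; the numbers are the ones given on the previous slide):

```python
ENERGY_NJ_PER_CYCLE = {5.0: 40.0, 4.0: 25.0, 2.5: 10.0}
MAX_FREQ_MHZ = {5.0: 50.0, 4.0: 40.0, 2.5: 25.0}


def evaluate(schedule):
    """schedule: list of (vdd, mega_cycles). Returns (energy in J, time in s)."""
    energy_j = sum(mc * 1e6 * ENERGY_NJ_PER_CYCLE[v] * 1e-9 for v, mc in schedule)
    time_s = sum(mc / MAX_FREQ_MHZ[v] for v, mc in schedule)
    return energy_j, time_s


schedules = {
    "A": [(5.0, 1000)],
    "B": [(5.0, 750), (2.5, 250)],
    "C": [(4.0, 1000)],
}

for name, sched in schedules.items():
    energy_j, time_s = evaluate(sched)
    print(f"({name}) energy = {energy_j:.1f} J, time = {time_s:.1f} s")
```

Schedule (A) finishes in 20 s but spends 40 J, while (B) and (C) both use the full 25 s; running at a single intermediate voltage, as in (C), gives the lowest energy of the three.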

ELEC516/10 Lecture 983

Reduce Power Further by Buffering

Two samples processed every two sample periods-> increased latency

ELEC516/10 Lecture 984

Example of Buffering

[Figure: without buffering, Blocks 1 to 4 are processed one per Tsample at Vdd = 5, 2.5, 5, 2.5; with buffering, Blocks 1,2 and Blocks 3,4 are processed in pairs at Vdd = 3.75.]

Without buffering: $E_{sample} = \dfrac{C(5)^2 + \frac{1}{2}C(2.5)^2}{2} \approx 14C$

With buffering: $E_{sample} = \dfrac{2(0.75)\,C\,(3.75)^2}{2} \approx 10.5C$

ELEC516/10 Lecture 985

Voltage Island Concept

• Trade off power for delay by running functional blocks at different voltages
• Can use mix of low and high Vt to balance performance and leakage
• Switch off inactive blocks to reduce leakage power
• Requires IP standards for power management, clock gating, etc.

[Figure: delay (ps, 0 to 30) versus voltage (Vdd, 0.7 to 1.3) for standard Vt and low Vt devices.]

E.g.: Telecom ASIC with 1.0/1.2 V islands saved 16% active power and 50% standby power

[Figure: power management unit controlling switches from supply Vddo to islands Vdd1 and Vdd2, which power blocks IP1 and IP2 (logic and low VT logic).]

Source from Bergamaschi

*Slide from Prof Kyung of KAIST

ELEC516/10 Lecture 986

Power Management

[Figure: chip floorplan with voltage islands: memory arrays on Vdd 4 (high Vt device arrays, optimized for low active power), memory arrays on Vdd 3 (low Vt device arrays, optimized for low active power), microcontroller and DSP on Vdd 2, ROM on Vdd 1, monitor logic on Vdd 4, analog on Vdd 5, blocks RLM 1 to RLM 3, and a ring of I/O’s, VReg, and Gnd.]

• Independently controlled domain power switches
• Multiple on-chip voltage islands
• On-chip voltage regulators

*Slide from Prof Kyung of KAIST

ELEC516/10 Lecture 987

Controlling VDD and VTH for low power

               Active         Stand-by
Multiple VTH   Dual-VTH       MTCMOS
Variable VTH   VTH hopping    VTCMOS
Multiple VDD   Dual-VDD       Boosted gate MOS
Variable VDD   VDD hopping

Software-hardware cooperation; technology-circuit cooperation

MTCMOS: Multi-Threshold CMOS; VTCMOS: Variable Threshold CMOS

Multiple: spatial assignment; Variable: temporal assignment

*Slide from Prof Kyung of KAIST

ELEC516/10 Lecture 988

RTL-level optimization-Reducing effective switching activity

• General Principle: Avoid Waste.
– Application-specific processing

– Resource sharing/Locality of reference

– Data representations

– Preservation of Data correlations

– architectural restructuring

– Distributed processing

– Demand-driven/data-driven computation

ELEC516/10 Lecture 989

Application Specific Processing

Application Specific Processing Reduces “Implementation Overhead”

ELEC516/10 Lecture 990

Eliminating Redundant Computation

• Dynamically vary the number of operations per sample.

• Trade power consumption and filter quality [Ludwig-96]

[Figure: chain of filter stages clocked at fs feeding Out, with a Power Down control.]

ELEC516/10 Lecture 991

Eliminating Redundant Computation

Switched capacitance reduction ≈ (peak number of operations) / (average number of operations) ≈ 2

Strong function of signal statistics

ELEC516/10 Lecture 992

Reducing switching activity

• Multiplexing multiple operations on a single hardware unit can have detrimental effect on the power consumption, because the switching activity may be increased, e.g. shared bus

ELEC516/10 Lecture 993

Bus Multiplexing

• Share long data buses with time multiplexing (S1 uses even cycles, S2 odd)

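[Figure: sources S1 and S2 driving destinations D1 and D2 over dedicated buses versus over a single shared, time-multiplexed bus.]

• Buses are a significant source of power dissipation due to high switching activities and large capacitive loading
– 15% of total power in Alpha 21064
– 30% of total power in Intel 80386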

• But what if data samples are correlated (e.g., sign bits)?

ELEC516/10 Lecture 994

Reducing the Effective Capacitance

• Circuit and Logic Style - select a circuit style with low capacitance and/or switching activity, e.g. 8-bit adder

ELEC516/10 Lecture 995

Reducing Glitching Activity

• Some circuit structures can be the cause of spurious transients, e.g. a 16-bit ripple carry adder

• Glitches can be reduced by selecting structures that have balanced signal paths, e.g. a tree.

• The Brent-Kung lookahead adder and the Wallace tree multiplier both have this property, and are thus more attractive for low power.

ELEC516/10 Lecture 996

Data Representation

Two’s complement Sign Magnitude

• Sign-extension activity significantly reduced using sign magnitude representation.

• An accumulator example: sign magnitude datapath switches 30% less capacitance for uniformly distributed inputs

ELEC516/10 Lecture 997

Data Representation – Accumulator Example

Sign magnitude datapath switches 30% less capacitance for uniformly distributed inputs

ELEC516/10 Lecture 998

Two’s Complement vs. Sign-Magnitude

Two’s complement datapath has a significantly higher switching activity

ELEC516/10 Lecture 999

Bus encoding to reduce switching activity

• Minimizing temporal bit transition activity by data representation

• Bit encoding:
– Active high encoding: high-level voltage for 1, low-level voltage for 0
– Transition-based encoding: a voltage change identifies logic 1, no voltage change identifies logic 0

• Word encoding
– assign patterns of 1’s and 0’s to each word of information
– Non-redundant codes vs. redundant codes

• Examples of low power coding
– Limited-weight code
– Gray code
– One-hot code
– Bus-invert code

ELEC516/10 Lecture 9100

Example of Bit Encoding

• Reducing the average number of transitions on a data bus
• Transition-based encoding may limit the number of transitions for non-equiprobable input lines
• Let p(0) > p(1) and assume no temporal correlation exists. If active-high encoding is used, the average number of transitions is

$p(0)\,p(1) + p(1)\,p(0) = 2\,p(1)\,(1 - p(1))$

• For transition-based encoding, it is simply p(1). If p(1) << p(0), transition-based is better than active-high encoding.

• To reduce transitions, the input patterns are transformed so that the p(0) and p(1) probabilities of each input line become as different as possible, and a transition-based encoding is then applied before the data is transmitted.
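The comparison between the two bit encodings reduces to two one-line formulas. The sketch below simply evaluates them for a few values of p(1); it assumes, as the slide does, no temporal correlation on the line, and the function names are hypothetical.

```python
def active_high_transitions(p1):
    """Average transitions per cycle: p(0)p(1) + p(1)p(0) = 2 p(1) (1 - p(1))."""
    return 2.0 * p1 * (1.0 - p1)


def transition_based_transitions(p1):
    """Average transitions per cycle: a transition is sent only for a logic 1."""
    return p1


for p1 in (0.05, 0.1, 0.25, 0.5):
    print(f"p(1)={p1:.2f}  active-high={active_high_transitions(p1):.3f}  "
          f"transition-based={transition_based_transitions(p1):.3f}")
```

Transition-based encoding wins whenever p(1) < 2 p(1)(1 - p(1)), that is, whenever p(1) < 0.5; at p(1) = 0.5 the two encodings are equal.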

ELEC516/10 Lecture 9101

Word encoding

• One-hot coding

• Gray coding: good for sequential data, e.g. addressing for microprocessor [Su-94]

• Disadvantage of Gray code: only good for address bus, not for data bus; additional conversion circuitry is needed
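Why Gray coding suits sequential addresses can be seen by counting bit toggles on a bus that steps through consecutive values. This is an illustrative sketch only; the 4-bit address range and the helper names are assumptions.

```python
def to_gray(value):
    """Binary-reflected Gray code of a non-negative integer."""
    return value ^ (value >> 1)


def bit_transitions(sequence):
    """Total number of bit toggles between consecutive words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(sequence, sequence[1:]))


addresses = list(range(16))  # a 4-bit address bus stepping 0, 1, ..., 15
binary_toggles = bit_transitions(addresses)
gray_toggles = bit_transitions([to_gray(a) for a in addresses])

print("binary:", binary_toggles)  # 26
print("gray:  ", gray_toggles)    # 15
```

Consecutive Gray codewords differ in exactly one bit, so a purely sequential address stream toggles one line per access. For random data that property is lost, which is why the benefit is limited to address buses.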

ELEC516/10 Lecture 9102

Word encoding (cont.)

• Bus-invert encoding: use redundancy to save power.
• Given an n-bit data bus with $2^n$ patterns to be represented, assume all the patterns are equiprobable with no temporal correlation, so p(0) = p(1) = 0.5. The average number of transitions is n/2 per cycle, while the worst case is n.
• Bus-invert coding: add an extra line I to the bus, and compare consecutive patterns before transmission. Two cases:
– If the Hamming distance between the two patterns is <= n/2, the current pattern is transmitted as it is and I is set to 0.
– If the Hamming distance between the two patterns is > n/2, the current pattern is first inverted and then transmitted, and I is set to 1.
• Maximum transitions are limited to n/2, and average transitions are reduced by up to 25% (the saving shrinks as the bus gets wider)
• Drawback: extra line I to indicate whether a pattern has been inverted.
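The two cases above translate into a very small encoder. The following is a sketch under assumptions, not a reference implementation: it compares each new word with the value currently on the data lines, assumes the bus starts at all zeros, and uses hypothetical function names.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def bus_invert_encode(words, n):
    """Return a list of (data_lines, invert_bit) for an n-bit bus."""
    mask = (1 << n) - 1
    bus = 0  # assumed initial state of the data lines
    encoded = []
    for word in words:
        if hamming(bus, word) > n / 2:
            bus, invert = (~word) & mask, 1  # send the complement, set I = 1
        else:
            bus, invert = word & mask, 0     # send as is, set I = 0
        encoded.append((bus, invert))
    return encoded


def bus_invert_decode(encoded, n):
    """Recover the original words from (data_lines, invert_bit) pairs."""
    mask = (1 << n) - 1
    return [(~data) & mask if invert else data for data, invert in encoded]


def data_line_transitions(encoded):
    """Per-cycle toggles on the data lines, starting from an all-zero bus."""
    previous, counts = 0, []
    for data, _ in encoded:
        counts.append(hamming(previous, data))
        previous = data
    return counts


words = [0x00, 0xFF, 0x0F, 0xF0, 0xAA, 0x55]
encoded = bus_invert_encode(words, 8)
print(data_line_transitions(encoded))
```

Because inverting a word with Hamming distance d from the current bus value yields distance n - d, no cycle ever toggles more than n/2 data lines; the invert line I adds at most one further transition per cycle.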

ELEC516/10 Lecture 9103

Conditional Inversion Coding

ELEC516/10 Lecture 9104

Bus encoding for cross coupling cap.

• Cross-coupling capacitance dominates in very deep sub-micron technology, e.g. < 70 nm
• Bus model
– Stand-alone capacitance Cs
– Cross-coupling capacitance Cc
• Stand-alone switching
– Applies to a single bit line
– 0-1 transition
• Cross-coupling switching
– Occurs on adjacent wires
– Four types of coupling transition

[Figure: four types of transition on a pair of adjacent wires (Type 1 to Type 4, labelled 0, 1, 2, 0), and a three-wire bus model (bit 0, bit 1, bit 2) with stand-alone capacitance Cs on each wire and cross-coupling capacitance Cc between neighbours.]

ELEC516/10 Lecture 9105

Permutation Based Address code

• Rearrange the physical order of the bit lines

• Works efficiently on address bus (40%), but not on instruction bus (4%)
– Correlation is lower
– Permutation is fixed

Sender Receiver

ELEC516/10 Lecture 9106

Dynamic Reconfigurable Bus Encoding Scheme for Instruction Bus

ELEC516/10 Lecture 9107

Overview of the Scheme

• Encoding (on the computer): bit lines are reordered during compilation for the target processor, and the decoding information is stored as a header
• Decoding (on the target processor):
– Decoding information is loaded to the LUT first
– Instruction is called
– Decoding information stored in the LUT controls the Cross_bar
– Instructions enter the Cross_bar to be decoded

[Figure: MEM connected over Mem_bus (lines B1 to Bn) to the processing element, which contains the CACHE, the Decoder (Cross_bar) driven by a Look-Up Table and Address_bus, Cache_bus (lines B1 to Bm), and the CPU core.]

ELEC516/10 Lecture 9108

Demand/Data-driven operation

• Clocking strategy
– Gated clocking
• System Power Down
• Computing Paradigm

ELEC516/10 Lecture 9109

Logic Level Power Down Technique

• Activity driven: precomputation-based sequential logic optimization
– Selectively precompute the output logic values of the circuit one clock cycle before they are required, and use the precomputed values to reduce internal switching activity in the succeeding cycle

It is required that:
g1 = 1 -> f = 1
g2 = 1 -> f = 0

[Figure: original circuit (registers R1 and R2 around logic block A producing f) and modified circuit (adds predictor functions g1 and g2, flip-flops FF, and a load enable LE on R1).]

ELEC516/10 Lecture 9110

An example: n-bit comparator

• This circuit compares two n-bit numbers A and B and computes the function A > B

• In general, precomputation works best when there are a small number of complex functions corresponding to the logic block A
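One way to see how precomputation applies to the comparator is to pick predictor functions that look only at the most significant bits. The sketch below is an assumption-laden illustration, not the circuit from the slide: it treats A and B as unsigned, chooses g1 = A[n-1] AND NOT B[n-1] and g2 = NOT A[n-1] AND B[n-1], and checks by exhaustive enumeration that these satisfy the required implications.

```python
def precompute(a, b, n):
    """Predictors from the MSBs only: g1 = 1 implies f = 1, g2 = 1 implies f = 0."""
    a_msb = (a >> (n - 1)) & 1
    b_msb = (b >> (n - 1)) & 1
    g1 = a_msb & (1 - b_msb)
    g2 = (1 - a_msb) & b_msb
    return g1, g2


def load_disabled_fraction(n):
    """Fraction of input pairs for which the lower n-1 bits need not be loaded."""
    total = disabled = 0
    for a in range(2 ** n):
        for b in range(2 ** n):
            g1, g2 = precompute(a, b, n)
            f = int(a > b)
            if g1:
                assert f == 1  # prediction must agree with the comparator
            if g2:
                assert f == 0
            disabled += g1 | g2
            total += 1
    return disabled / total


print(load_disabled_fraction(4))  # 0.5
```

For uniformly distributed inputs the most significant bits differ half the time, so in this sketch the registers holding the remaining n-1 bit pairs could be held (load disabled) on about half of the cycles.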
