
ELEC516/10 Lecture 9

ELEC 516 VLSI System Design and Design Automation, Spring 2010

Lecture 9 - Low Power Digital CMOS Design

Reading Assignment: Rabaey: Chapter 4 and 7,11

Note: some of the figures in this slide set are adapted from the slide set of “Digital Integrated Circuits” by Rabaey, Copyright UCB 2002

Why worry about power?-- Heat Dissipation

Besides low power required for portable applications !!!!

Power is a very big concern in today’s advanced technologies. Power produces heat on the chip, which has to be carried off through the chip socket and calls for expensive packaging solutions.

Motivation for low power design

• Packaging costs
• Power supply rail design
• Chip and system cooling costs
• Noise immunity and system reliability
• Battery life (in portable systems)
• Environmental concerns

– ICT equipment accounted for 10% of total US commercial energy usage in 2010 and may reach 20% by 2020

– Energy Star compliant systems

Why worry about power — Portability

Multimedia Terminals

Laptop Computers

Digital Cellular Telephony

[Figure: battery technology trends by year (axis from 65): Nickel-Cadmium, Ni-Metal Hydride, rechargeable Lithium; annotation: BATTERY (40+ lbs)]

Expected battery lifetime increase over next 5 years: 30-40%

Why worry about power? -- Standby Power

Drain leakage will increase as VT decreases to maintain noise margins and meet frequency demands, leading to excessive battery draining standby power consumption.

[Figure: power trend for the years 2000 to 2008, with labels 88W and 12W. Source: Borkar, De, Intel]

Year                  2002  2005  2008  2011  2014
Power supply Vdd (V)  1.5   1.2   0.9   0.7   0.6
Threshold VT (V)      0.4   0.4   0.35  0.3   0.25

…and phones leaky!

Power and Energy Figures of Merit

• Power consumption in Watts
– determines battery life in hours

• Peak power
– determines power and ground wiring designs
– sets packaging limits
– impacts signal noise margin and reliability analysis

• Energy efficiency in Joules
– energy is power integrated over time

• Energy = power * delay
– Joules = Watts * seconds
– lower energy number means less power to perform a computation at the same frequency

Power versus Energy

[Figure: power curves for Approach 1 and Approach 2]

• Power is the height of the curve
– Lower power design could simply be slower

• Energy is the area under the curve
– Two approaches require the same energy

PDP and EDP

• Power-delay product (PDP) = Pav * tp = (CL * VDD^2)/2
– PDP is the average energy consumed per switching event (Watts * sec = Joule)
– lower power design could simply be a slower design

• Energy-delay product (EDP) = PDP * tp = Pav * tp^2
– EDP is the average energy consumed multiplied by the computation time required
– takes into account that one can trade increased delay for lower energy/operation (e.g., via supply voltage scaling that increases delay, but decreases energy consumption)
– allows one to understand tradeoffs better

[Figure: energy, delay, and energy-delay versus Vdd (V), for Vdd from 0.5 to 2.5]

Where Does Power Go in CMOS?

• Dynamic power
– Charge-discharge current: dominant source of power (CV^2 per transition)
– Short circuit current (both NMOS and PMOS on during transition): less than 10% of charge-discharge current if transitions are fast

• Subthreshold leakage (transistors not completely OFF)
– Becoming important: 10-30% of active power in sub-0.18um technologies
– Diode leakage from reverse-biased source and drain diodes (negligible)
– Gate leakage (no longer negligible due to very thin gate oxide)

Dynamic Power Consumption

[Figure: circuit with input Vin and output Vout]

Energy/transition = CL * Vdd^2

Power = Energy/transition * f = CL * Vdd^2 * f

Need to reduce CL, Vdd, and f to reduce power.

Not a function of transistor sizes!

Dynamic Power Consumption - Revisited

Power = Energy/transition * transition rate

= CL * Vdd^2 * f0->1

= CL * Vdd^2 * P0->1 * f

= CEFF * Vdd^2 * f

Power dissipation is data dependent: it is a function of switching activity.

CEFF = Effective Capacitance = CL * P0->1
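The relation above can be sketched numerically. This is a minimal illustration, not part of the original slides; the function name and the component values (50 fF, 2.5 V, 100 MHz) are assumptions chosen only for the example.

```python
def dynamic_power(c_load, vdd, freq, p01=1.0):
    """Average dynamic power in watts: C_EFF * Vdd^2 * f, with C_EFF = C_L * P(0->1)."""
    c_eff = c_load * p01
    return c_eff * vdd ** 2 * freq

# Hypothetical node: 50 fF load, 2.5 V supply, 100 MHz clock.
full_rate = dynamic_power(50e-15, 2.5, 100e6)               # node charges every cycle
quarter_rate = dynamic_power(50e-15, 2.5, 100e6, p01=0.25)  # node charges in 1 of 4 cycles

print(f"{full_rate * 1e6:.2f} uW at P0->1 = 1")
print(f"{quarter_rate * 1e6:.2f} uW at P0->1 = 0.25")
```

Because power is linear in P0->1 and quadratic in Vdd, halving the supply saves more than halving the switching activity.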

Power Consumption is Data Dependent

Example: Static 2 Input NOR Gate

Assume:
P(A=1) = 1/2
P(B=1) = 1/2

Then:
P(Out=1) = 1/4
P(0->1) = P(Out=0) * P(Out=1) = 3/4 * 1/4 = 3/16

CEFF = 3/16 * CL
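The same calculation can be done for any static gate by enumerating its truth table. The sketch below assumes independent inputs, as the example does; `static_p01` and `nor2` are illustrative names, not from the slides.

```python
from fractions import Fraction
from itertools import product

def static_p01(gate, input_probs):
    """P(0->1) = P(Out=0) * P(Out=1) for a static gate with independent inputs."""
    p_one = Fraction(0)
    for bits in product((0, 1), repeat=len(input_probs)):
        # Probability of this input combination.
        weight = Fraction(1)
        for bit, p in zip(bits, input_probs):
            weight *= p if bit else 1 - p
        if gate(*bits):
            p_one += weight
    return (1 - p_one) * p_one

def nor2(a, b):
    return int(not (a or b))

half = Fraction(1, 2)
print(static_p01(nor2, [half, half]))  # 3/16
```

Using exact fractions keeps the result directly comparable with the hand calculation.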

Transition Probabilities for Basic Gates

Transition Probability of 2-input NOR Gate

Inter-signal Correlations

P(Z=1) = P(B=1) * P(A=1 | B=1)

(1-0.5)(1-0.5) x (1-(1-0.5)(1-0.5)) = 3/16

(1 - 3/16 x 0.5) x (3/16 x 0.5) = 0.085

[Figure label: Reconvergent]

• Determining switching activity is complicated by the fact that signals exhibit correlation in space and time
– reconvergent fan-out

• Have to use conditional probabilities

Logic Restructuring

Chain implementation has a lower overall switching activity than the tree implementation for random inputs

Ignores glitching effects

Logic restructuring: changing the topology of a logic network to reduce transitions

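[Figure: chain and tree implementations with input probabilities of 0.5; annotated transition probabilities (1-0.25)*0.25 = 3/16 and 15/256]

AND: P0->1 = P0 x P1 = (1 - PA*PB) x PA*PB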
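The chain-versus-tree comparison can be checked with the AND formula above. This sketch assumes a 4-input AND built from 2-input AND gates with independent inputs of probability 0.5, which matches the values that survive from the figure; glitching is ignored, as on the slide.

```python
from fractions import Fraction

def and_out(pa, pb):
    """Signal probability of a 2-input AND output for independent inputs."""
    return pa * pb

def p01(p_one):
    """Static-gate transition probability: P0 * P1."""
    return (1 - p_one) * p_one

half = Fraction(1, 2)

# Chain: ((A.B).C).D
w = and_out(half, half)
x = and_out(w, half)
f_chain = and_out(x, half)
chain_activity = p01(w) + p01(x) + p01(f_chain)

# Tree: (A.B).(C.D)
y = and_out(half, half)
z = and_out(half, half)
f_tree = and_out(y, z)
tree_activity = p01(y) + p01(z) + p01(f_tree)

print(chain_activity, tree_activity)  # 91/256 111/256
```

The output node has the same activity (15/256) in both cases; the difference comes from the internal nodes.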

Input Ordering

Beneficial to postpone the introduction of signals with a high transition rate (signals with signal probability close to 0.5)

[Figure: two input orderings, with signal probabilities 0.5, 0.2, and 0.1]

(1-0.5x0.2) x (0.5x0.2) = 0.09
(1-0.2x0.1) x (0.2x0.1) = 0.0196

How about Dynamic Circuits?

Power is Only Dissipated when Out=0!

[Figure: dynamic gate with inputs In1, In2, In3]

CEFF = P(Out=0) * CL

2-input NOR Gate

Example: Dynamic 2 Input NOR Gate

Assume:
P(A=1) = 1/2
P(B=1) = 1/2

P(Out=0) = 3/4

CEFF = 3/4 * CL

Switching Activity Is Always Higher in Dynamic Circuits

Transition Probabilities for Dynamic Gates

Switching Activity for Precharged Dynamic Gates

P01 = P0
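A short sketch comparing the two activity formulas, static (P0 * P1) and precharged dynamic (P0). The function names are illustrative; the only assumption is that the dynamic node is precharged high every cycle.

```python
def static_activity(p_out_one):
    """Static gate: a 0->1 transition needs the output to be 0, then 1."""
    return (1 - p_out_one) * p_out_one

def dynamic_activity(p_out_one):
    """Precharged gate: the node is recharged after every evaluation that pulled it to 0."""
    return 1 - p_out_one

# 2-input NOR with equiprobable inputs: P(Out=1) = 1/4.
print(static_activity(0.25), dynamic_activity(0.25))

# P0 >= P0 * P1 for any probability, so the dynamic gate never switches less.
for p1 in (0.0, 0.1, 0.25, 0.5, 0.9, 1.0):
    assert dynamic_activity(p1) >= static_activity(p1)
```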

Glitching in Static CMOS Networks

[Figure: logic network under a unit delay model; labels 101 and 000]

• Gates have a nonzero propagation delay resulting in spurious transitions or glitches (dynamic hazards)
– glitch: node exhibits multiple transitions in a single cycle before settling to the correct logic value

Example : Adder Circuit

[Figure: adder built from stages Add0, Add1, Add2, ..., Add14, Add15 with sum outputs S0, S1, S2, ..., S14, S15; waveforms versus time (ns), 0 to 10]

How to Cope with Glitching?

Equalize Lengths of Timing Paths Through Design

Balanced Delay Paths to Reduce Glitching

So equalize the lengths of timing paths through logic

Glitching is due to a mismatch in the path lengths in the logic network; if all input signals of a gate change simultaneously, no glitching occurs

Glitch Reduction by Pipelining

• Glitches depend on the logic depth of the circuit - gates deeper in the logic network are more prone to glitching
– arrival times of the gate inputs are more spread due to delay imbalances
– usually affected more by primary input switching

• Reduce logic depth by adding pipeline registers
– additional energy used by the clock and pipeline registers

[Figure: pipeline stages Fetch, Decode, Execute, Memory, WriteBack; labels: pipeline stage, isolation register]

Short Circuit Power Consumption

Finite slope of the input signal causes a direct current path between VDD and GND for a short period of time during switching when both the NMOS and PMOS transistors are conducting.

[Figure: circuit with input Vin and output Vout]

Short Circuit Current Determinants

• Duration and slope of the input signal, tsc

• Ipeak determined by
– the saturation current of the P and N transistors, which depends on their sizes, process technology, temperature, etc.
– strong function of the ratio between input and output slopes
• a function of CL

Esc = tsc * VDD * Ipeak * P0->1

Psc = tsc * VDD * Ipeak * f0->1

Impact of CL on Psc

[Figure: two circuits with input Vin and output Vout; labels Isc and Imax]

Large capacitive load: output fall time significantly larger than input rise time.

Small capacitive load: output fall time substantially smaller than the input rise time.

Ipeak as a Function of CL

[Figure: Ipeak (x 10^-4) versus time (sec, x 10^-10) for CL = 20 fF, CL = 100 fF, and CL = 500 fF, with a 500 psec input slope]

Short circuit dissipation is minimized by matching the rise/fall times of the input and output signals - slope engineering.

When load capacitance is small, Ipeak is large.

Static Power Consumption

[Figure: circuit with Vin = 5V]

Pstat = P(In=1) * Vdd * Istat

• Dominates over dynamic consumption

• Not a function of switching frequency

Leakage (Static) Power Consumption

Sub-threshold current is the dominant factor.

All increase exponentially with temperature!

[Figure: leakage current Ileakage drawn from VDD; components: drain junction leakage, sub-threshold current, gate leakage]

Power consideration – leakage current

$$I_{sub} = K \, e^{q (V_{gs} - V_{th}) / (n k T)} \left(1 - e^{-q V_{ds} / (k T)}\right)$$

K: technology constant; q: electronic charge; k: Boltzmann constant; n: nonlinearity constant (between 1 and 2); T: temperature
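The sub-threshold current expression can be evaluated directly. In this sketch the technology constant `k_tech`, the value n = 1.5, and T = 300 K are assumptions picked only for illustration; the physical constants are the standard SI values.

```python
import math

Q = 1.602176634e-19   # electron charge, C
K_B = 1.380649e-23    # Boltzmann constant, J/K

def subthreshold_current(vgs, vth, vds, n=1.5, temp=300.0, k_tech=1e-6):
    """I_sub = K * exp(q(Vgs - Vth)/(n k T)) * (1 - exp(-q Vds/(k T)))."""
    v_thermal = K_B * temp / Q  # kT/q, about 26 mV at 300 K
    return (k_tech
            * math.exp((vgs - vth) / (n * v_thermal))
            * (1 - math.exp(-vds / v_thermal)))

def slope_mv_per_decade(n=1.5, temp=300.0):
    """Change in Vth (mV) that alters I_sub by a factor of 10."""
    return 1e3 * n * (K_B * temp / Q) * math.log(10)

print(f"slope: {slope_mv_per_decade():.1f} mV/decade")
off_low_vt = subthreshold_current(0.0, 0.1, 1.0)
off_high_vt = subthreshold_current(0.0, 0.4, 1.0)
print(f"leakage ratio, VT = 0.1 V vs 0.4 V: {off_low_vt / off_high_vt:.0f}x")
```

With these assumed parameters the slope comes out close to 90 mV/decade, so a few hundred millivolts of threshold change moves the off-current by orders of magnitude.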

Leakage as a Function of VT

[Figure: drain current versus VGS (V), from 0 to 1, for VT = 0.4V and VT = 0.1V]

Continued scaling of supply voltage and the subsequent scaling of threshold voltage will make subthreshold conduction a dominant component of power dissipation.

A 90mV/decade VT roll-off means each 270mV increase in VT gives 3 orders of magnitude reduction in leakage (but adversely affects performance)

Sub-Threshold in MOS

[Figure: sub-threshold curves for VT = 0.6 and VT = 0.2]

Lower Bound on Threshold to Prevent Leakage

Low Power Design Space

• The dynamic power consumption equation reveals the three degrees of freedom inherent in the low power design space:
– Voltage
– Physical capacitance
– Data activity

• Optimization for power entails an attempt to reduce one or more of these factors. Interactions among these factors complicate the optimization problem.

• Deep sub-micron design - need to minimize leakage and sub-threshold current

Power Reduction Strategy (I)

• Voltage Reduction

– 5V -> 3.3V -> 2.5V -> 1.8V -> 1.0V

– Mixed supplies in a system and/or on chip: use the minimum voltage for each chip, or for each function within a chip, together with on-chip voltage converters if required.

• Low-voltage circuit techniques are required to give good performances even with low voltages

• Less noisy structures and better signal integrity handling is required

• Lower Vth process is required to maintain good transistor speed performances

Reducing Vdd

P x td = Et = CL * Vdd^2

E(Vdd=2) / E(Vdd=5) = (CL * 2^2) / (CL * 5^2)

E(Vdd=2) ≈ 0.16 E(Vdd=5)

Strong function of voltage (V^2 dependence).

Relatively independent of logic function and style.

[Figure: power-delay product versus Vdd (volts) for a 51 stage ring oscillator and an 8-bit adder, showing quadratic dependence]

Power Delay Product improves with lowering VDD.

Lower Vdd Increases Delay

Td = CL * Vdd / I

I ~ (Vdd - Vt)^2

Td(Vdd=2) / Td(Vdd=5) = (2 * (5 - 0.7)^2) / (5 * (2 - 0.7)^2) ≈ 4.4
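The energy and delay ratios for the 5 V to 2 V example can be reproduced with a few lines. This sketch uses only the first-order models from the slides (energy proportional to Vdd^2, current proportional to (Vdd - Vt)^2) with Vt = 0.7 V; the function names are illustrative.

```python
def relative_delay(vdd, vt=0.7):
    """Delay up to a constant factor: Td ~ CL * Vdd / I, with I ~ (Vdd - Vt)^2."""
    return vdd / (vdd - vt) ** 2

def relative_energy(vdd):
    """Energy per transition up to a constant factor: E ~ CL * Vdd^2."""
    return vdd ** 2

delay_ratio = relative_delay(2.0) / relative_delay(5.0)
energy_ratio = relative_energy(2.0) / relative_energy(5.0)
print(f"delay grows {delay_ratio:.2f}x, energy shrinks to {energy_ratio:.2f}x")
```

Under this model, dropping from 5 V to 2 V costs a bit more than a 4x slowdown in exchange for roughly a 6x energy saving per transition.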

Relatively independent of logic function and style.

[Figure: delay versus Vdd (volts), 2.00 to 6.00, for an adder (SPICE), a microcoded DSP chip, a multiplier, a ring oscillator, and a clock generator; 2.0 µm technology]

Lowering the Threshold

[Figure: curves for Vt = 0.2 and Vt = 0]

Reduces the Speed Loss, But Increases Leakage

Interesting Design Approach: DESIGN FOR PLeakage == PDynamic

What Threshold Voltage to Use?

• Energy vs. Vt for a fixed throughput

• An “optimal” Vt/Vdd point trades switching and leakage energy

Stack Effect

• Leakage is a function of the circuit topology and the value of the inputs

$$V_T = V_{T0} + \gamma \left( \sqrt{|-2\phi_F + V_{SB}|} - \sqrt{|-2\phi_F|} \right)$$

where VT0 is the threshold voltage at VSB = 0; VSB is the source-bulk (substrate) voltage; γ is the body-effect coefficient

A  B  VX            ISUB
0  0  VT ln(1+n)    VGS = VBS = -VX
0  1  0             VGS = VBS = 0
1  0  VDD - VT      VGS = VBS = 0
1  1  0             VSG = VSB = 0

• Leakage is least when A = B = 0
• Leakage reduction due to stacked transistors is called the stack effect

Leakage as a Function of Design Time VT

• Reducing the VT increases the sub-threshold leakage current (exponentially)

– 90mV reduction in VT increases leakage by an order of magnitude

• But, reducing VT decreases gate delay (increases performance)

[Figure: ID versus VGS (V), from 0 to 1, for VT = 0.4V and VT = 0.1V]

• Determine the critical path(s) at design time and use low VT devices on the transistors on those paths for speed. Use a high VT on the other logic for leakage control.

– A careful assignment of VT’s can reduce the leakage by as much as 80%

Dual-Thresholds Inside a Logic Block

• Minimum energy consumption is achieved if all logic paths are critical (have the same delay)

• Use lower threshold on timing-critical paths
– Assignment can be done on a per gate or transistor basis; no clustering of the logic is needed
– No level converters are needed

Variable VT (ABB) at Run Time

• $V_T = V_{T0} + \gamma \left( \sqrt{|-2\phi_F + V_{SB}|} - \sqrt{|-2\phi_F|} \right)$

[Figure: VT versus VSB (V), for VSB from -2.5 to 0]

A negative bias on VSB causes VT to increase

Adjusting the substrate bias at run time is called adaptive body-biasing (ABB)

Requires a dual well fab process

For an n-channel device, the substrate is normally tied to ground (VSB = 0)

Techniques for Burst Mode Computation

[Figure: two burst-mode schemes. Labels: High VT, Low VT, on/standby control signals, Vin, Vout, Vn < 0]

• Multiple VT technology (disable high VT devices during idle periods), e.g. [Sakata-93], [Mutoh-93]

• Substrate bias controlled variable VT devices (increase VT during idle periods), e.g. [Seta-95]

Multiple-Threshold Circuits

• Previous approaches cannot reduce leakage power during active mode.

• Use dual Vt CMOS logic, with low Vt for critical paths and high Vt for non-critical paths. Problem: large threshold swing

• Triple Vt CMOS circuit to reduce the sub-threshold swing [Fujii et al., ISSCC-98]
– For high speed, low power active mode, low and medium Vt are used for critical and non-critical paths, respectively.

– For standby mode, high Vt MOSFET is inserted between the supply rail and the virtual supply rail.

Multiple-Threshold Circuits

Power considerations – reduce leakage

• Leakage is proportional to device width
– Use the smallest devices on non-critical paths.

• Leakage drops with stacked devices (drain voltage divider)
– Use stacked transistors on non-critical paths.

• Leakage drops with increasing channel length
– Slightly increase L on non-critical paths.

• Use a dual VT process providing two threshold voltages
– Use high VT transistors on non-critical paths.

Power consideration – reduce leakage

• Switch off critical path transistors when not needed.

• Stand-by mode between supply and virtual supply lines

• Stand-by vectors
– Apply the input vector which minimizes leakage.
– Achieved using a mux

Power Gating Transistor Sizing

• The effect of power gating transistor size
– As the size decreases, logic performance also decreases.
– As the size increases, leakage current and chip area also increase.
– Proper sizing is very important.
– Power gating transistor size should be chosen to keep performance degradation within 2%.

Vop = VDD - ΔV

The transistor must be sized so that ΔV keeps performance degradation within 2%.

[Figure: low Vt logic with a high Vt switch driven by a control signal]

Power Reduction Strategy (II)

• Reducing capacitance
– Process scaling and better integration, with smaller capacitances in more aggressive processes
– Improved devices and interconnect technology
– Efficient clock generation and distribution
– Good memory hierarchy
– In-place optimization using a library containing ranges of gates with different strengths, through replacement of gates to use the optimum drive in the critical paths and minimum drive elsewhere

Reducing Effective Capacitance

Global bus architecture Local bus architecture

Shared Resources incur Switching Overhead

Power consideration – Reduce Capacitance

• Reduce switched capacitance C
– Careful transistor sizing, transistor ordering, tighter and more compact layout

– Hierarchical architecture and add TG to isolate buses

– Segmented structures

Shared bus driven by A or B when sending values to C

Insert switch to isolate bus segment when B is sending to C

Power Reduction Strategy (III)

• Reducing activities
– Lowering operating frequency
– Using power management strategies, such as gated clocks and power-down of non-operational units
– Reduce switching activities

Power = Energy/transition * transition rate

$$P = C_L V_{dd}^2 f_{0 \to 1} = C_L V_{dd}^2 P_{0 \to 1} f = C_{eff} V_{dd}^2 f$$

• Power dissipation is data dependent and hence is a function of switching activity.

• P0->1 is the switching probability

• Effective switching capacitance = Ceff = CL * P0->1

Factors Affecting Power Consumption - Revisited

• Degrees of freedom for low power design space:
– Voltage
– Physical capacitance
– Data activity

• Power minimization approaches:
– Run at minimum allowable voltage
– Minimize effective switching capacitances

System Level - Power Down Techniques

Operating States:

• Active or Full-On (fastest clock)
• Standby (slow clock)
• Suspend or Sleep (slow clock or shut down)

[Figure: processor with activity analyzer]

Dynamic Power as a Function of VDD

• Decreasing the VDD decreases dynamic energy consumption (quadratically)

• But, increases gate delay (decreases performance)

[Figure: tp versus VDD (V), for VDD from 0.8 to 2.4]

• Determine the critical path(s) at design time and use high VDD for the transistors on those paths for speed. Use a lower VDD on the other gates, especially those that drive large capacitances (as this yields the largest energy benefits).

Multiple VDD Considerations

• How many VDD? Two is becoming common
– Many chips already have two supplies (one for core and one for I/O)

• When combining multiple supplies, level converters are required whenever a module at the lower supply drives a gate at the higher supply (step-up)
– If a gate supplied with VDDL drives a gate at VDDH, the PMOS never turns off
• The cross-coupled PMOS transistors do the level conversion
• The NMOS transistors operate on a reduced supply
– Level converters are not needed for a step-down change in voltage
– Overhead of level converters can be mitigated by doing conversions at register boundaries and embedding the level conversion inside the flipflop

[Figure: level converter circuit, with labels Vout and VDDL]

Dual-Supply Inside a Logic Block

• Minimum energy consumption is achieved if all logic paths are critical (have the same delay)

• Clustered voltage-scaling
– Each path starts with VDDH and switches to VDDL (gray logic gates) when delay slack is available
– Level conversion is done in the flipflops at the end of the paths

Power Conscious Behavioral Design

• Run functional units at the minimum allowed voltage while satisfying the timing constraints

• Parallelize or pipeline data-path, memory and controllers to compensate for throughput loss due to reduced supply voltage

• Power down functional units which are not in use; put in “dynamic power management” capability

• Avoid centralized resources (controllers, functional blocks, global busses, etc.) as much as possible

• Map functions to hardware so that inter-chip communication is minimized

• Schedule and bind operations to functional units so as to reduce the activity of the input operands

• Reorder operands to reduce switching activity; Keep the inputs to an idle unit unchanged

Low Power Datapath Design - reducing the supply voltage

• Reducing supply voltage has quadratic effect on power saving, but a negative effect on performance.

• Performance can be gained back by logical and architectural optimizations, e.g. lookahead adder instead of ripple-carry adder, using parallelism to increase performance

Power Considerations – reduce f and VDD

• Reducing frequency does not save energy, just reduces rate at which it is consumed– Power is lower but system must run longer

• Reducing supply voltage is very effective (scaling the voltage by a factor of 0.5 scales energy/transition by a factor of 0.25).

• Dropping the voltage will result in reduced performance in terms of speed (need to recover the performance using parallelism).

• Trade surplus performance for lower energy by reducing the supply voltage until performance is just as required

Power considerations – Dynamic voltage Scaling

Can run at a lower voltage and hence reduce power consumption quadratically.

– 8-bit adder/comparator (Chandrakasan et al.)
• Consumes Pref at 40MHz, 5V with Area = 0.53mm2

– Two parallel interleaved architecture
• Consumes 0.36Pref at 20MHz, 2.9V with Area = 1.80mm2

– One pipelined architecture
• Consumes 0.39Pref at 40MHz, 2.9V with Area = 0.69mm2

– Pipelined and parallel
• Consumes 0.2Pref at 20MHz, 2.0V with Area = 1.96mm2

Minimizing the power consumption using parallelism

• Reference Design

Critical path = 25ns, clock frequency = 40MHz, supply voltage = 5V

$$P_{ref} = C_{ref} V_{ref}^2 f_{ref}$$

• Parallel implementation of the reference design

New critical path = 50nsec, Cpar -> 2.15 Cref, Vpar -> 2.9V, fpar -> 0.5 fref

$$P_{par} = C_{par} V_{par}^2 f_{par} = (2.15\, C_{ref}) (0.58\, V_{ref})^2 \left(\frac{f_{ref}}{2}\right) \approx 0.36\, P_{ref}$$

Pipelined Data Path

• Critical path delay is less => max[Tadder, Tcomparator].

• Keeping clock rate constant: fpipe = fref. Voltage can be dropped to Vpipe = Vref/1.7 while maintaining the original throughput

• Capacitance slightly higher: Cpipe = 1.15Cref

• Ppipe = (1.15Cref)(Vref/1.7)^2 fref ≈ 0.39Pref
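The parallel and pipelined results can be recomputed from the scaling factors quoted on the slides. This is a sketch with an illustrative helper name; every ratio is relative to the reference design (Cref, Vref = 5 V, fref = 40 MHz).

```python
def relative_power(c_ratio, v_ratio, f_ratio):
    """P / Pref for P = C * V^2 * f, given each factor relative to the reference."""
    return c_ratio * v_ratio ** 2 * f_ratio

# Parallel: 2.15x capacitance, 2.9 V instead of 5 V, half the clock rate.
parallel = relative_power(2.15, 2.9 / 5.0, 0.5)

# Pipelined: 1.15x capacitance, Vref / 1.7, same clock rate.
pipelined = relative_power(1.15, 1 / 1.7, 1.0)

print(f"parallel:  {parallel:.3f} Pref")
print(f"pipelined: {pipelined:.3f} Pref")
```

Both results land within rounding of the quoted 0.36 Pref and 0.39 Pref. The saving comes almost entirely from the squared voltage term, which outweighs the capacitance overhead.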

How low a voltage can be used

• Capacitance overhead starts to dominate at “high” levels of parallelism and results in an optimum voltage.

Voltage as a design variable

Adapting Voltage to Workload yields cubic reduction

Multiple supply voltages – Filter Example

[Figure: filter data-flow graph of multiplications (*) and additions (+)]

Power(5V) / Power(5V, 3V, 2.4V) = 1.5 [Raje95]

Using Multiple Voltages

[Figure: blocks C1, C2, C3 with supplies Vdd1, Vdd2, Vdd3; critical path and non-critical logic with supplies Vdd1 and Vdd2] [Horowitz-95]

Processor Usage Model

• System Optimizations:
– Maximize peak throughput
– Minimize average energy/operation
– Maximize computation per battery life

[Figure: desired throughput of a single-user system that is not always computing. Labels: compute-intensive and low-latency processes; ceiling set by top speed of the processor; background and high-latency processes]

Scale supply voltage with fclk

Dynamic Voltage Scaling Implementation

• VCO: ring oscillator which matches the µP critical path
• DC-DC: performs D/A, converts battery to regulated Vdd
• Provides both voltage regulation and clock generation.

[Figure: control loop. Labels: frequency detector, M*fref, loop filter, DC-DC, VCO, battery]

Dynamic Voltage Scaling in Practice

Fixed throughput:
– Throughput = 8MIPS, Energy/op = 0.24nJ/inst. (1.92mW)
– Throughput = 100MIPS, Energy/op = 2.2nJ/inst. (220mW)

Occasionally demand peak throughput:
– Peak Throughput = 100MIPS, Average Energy/op = 0.24nJ/inst. (~1.8mW)

DVS is only advantageous when a majority of computation is performed at low throughput
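The power figures quoted for the two fixed-throughput operating points follow from power = throughput * energy per instruction. A minimal check, with an illustrative function name:

```python
def average_power_mw(throughput_mips, energy_nj_per_inst):
    """Average power in mW: (instructions/s) * (J/instruction), converted to mW."""
    instructions_per_s = throughput_mips * 1e6
    joules_per_inst = energy_nj_per_inst * 1e-9
    return instructions_per_s * joules_per_inst * 1e3

low = average_power_mw(8, 0.24)      # low-voltage operating point
high = average_power_mw(100, 2.2)    # full-speed operating point
print(f"{low:.2f} mW at 8 MIPS, {high:.0f} mW at 100 MIPS")
print(f"energy per instruction ratio: {2.2 / 0.24:.1f}x")
```

The roughly 9x gap in energy per instruction is what dynamic voltage scaling recovers when most instructions can run at the low-throughput point.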

Low-voltage Switching Regulator

• Arbitrary Vdd (< Vin) generated using the Buck converter
– Vdd = Vin * (duty cycle at node X)

• Chief sources of inefficiencies:
– Conduction loss (I^2 R)
– Switching loss (Cx Vin^2 fs and Ls I^2 fs)
– Gate-drive loss (Cg Vin^2 fs)

[Stratakos-94]

Adaptive Power Supply Voltage

• Exploit Data Dependent Computation Times To vary the Supply [Nielsen94]

[Figure labels: Control, Self-timed Processor, Power Supply, Vdd(t)]

Variable Supply Voltage Control Scheme

Voltage Scheduling for variable workload system

• Voltage scheduling under timing constraints
• Example [Ishihara-98]
– Energy consumption of a processor:
• 10nJ/cycle at 2.5V
• 25nJ/cycle at 4V
• 40nJ/cycle at 5V
– Maximum clock frequencies:
• 50MHz at 5V, 40MHz at 4V, 25MHz at 2.5V
– Given: an application needs 1000M cycles to finish and the timing constraint is 25sec.

Different Voltage Schedules

[Figure: three voltage schedules plotted against Time (sec), 0 to 25, with axis label 5.0^2 and the timing constraint marked]

• 1000M cycles at 50MHz
• 750M cycles at 50MHz, then 250M cycles at 25MHz
• (C) 1000M cycles at 40MHz: 25J
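The three schedules can be evaluated from the per-cycle energies and maximum clock frequencies given in the example. This sketch assumes each segment runs at the maximum frequency for its voltage; `MODES` and `evaluate` are illustrative names.

```python
# Supply voltage -> (max clock in MHz, energy in nJ per cycle), from the example.
MODES = {5.0: (50, 40), 4.0: (40, 25), 2.5: (25, 10)}

def evaluate(schedule):
    """schedule: list of (vdd, mega_cycles). Returns (seconds, joules)."""
    time_s = 0.0
    energy_j = 0.0
    for vdd, mega_cycles in schedule:
        f_mhz, nj_per_cycle = MODES[vdd]
        time_s += mega_cycles / f_mhz
        energy_j += mega_cycles * 1e6 * nj_per_cycle * 1e-9
    return time_s, energy_j

schedules = {
    "all at 5 V": [(5.0, 1000)],
    "5 V then 2.5 V": [(5.0, 750), (2.5, 250)],
    "all at 4 V": [(4.0, 1000)],
}
for name, schedule in schedules.items():
    seconds, joules = evaluate(schedule)
    print(f"{name}: {seconds:.0f} s, {joules:.1f} J")
```

All three schedules meet the 25 sec constraint. Running everything at the lowest voltage that just meets the deadline uses the least energy, which agrees with the 25J label that survives from the figure.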

Reduce Power Further by Buffering

Two samples processed every two sample periods-> increased latency

Example of Buffering

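[Figure: without buffering, Block 1, Block 2, Block 3, and Block 4 are each processed within one sample period Tsample, at Vdd = 5 or Vdd = 2.5; with buffering, Blocks 1,2 and Blocks 3,4 are processed together at Vdd = 3.75]

[Equations: energy per sample without buffering, in terms of C, (5)^2, and (2.5)^2; energy per sample with buffering, in terms of C, 0.75, and (3.75)^2]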

Voltage Island Concept

• Trade off power for delay by running functional blocks at different voltages
• Can use mix of Low and High Vt to balance performance and leakage
• Switch off inactive blocks to reduce leakage power
• Requires IP standards for power management, clock gating, etc.

[Figure: Delay vs. Voltage, for Vdd from 0.7 to 1.3, with curves for Std. Vt and Low Vt]

E.g.: Telecom ASIC with 1.0/1.2 V islands saved:
– 16% active power
– 50% standby power

[Figure labels: Power Management Unit; SWITCH, SWITCH; Logic, Low VT; Vdd1, Vdd2; IP1, IP2]

Source from Bergamaschi

*Slide from Prof Kyung of KAIST

Power Management

[Figure: chip floorplan with voltage islands: I/O’s, VReg, Gnd; memory arrays on Vdd 4 (high Vt device arrays, optimized for low active); memory arrays on Vdd 3 (low Vt device arrays, optimized for low active); microcontroller and DSP on Vdd 2; ROM on Vdd 1; monitor logic on Vdd 4; analog on Vdd 5]

• Independently controlled domain power switches
• Multiple on-chip voltage islands
• On-chip voltage regulators

Controlling VDD and VTH for low power

               Active        Stand-by
Multiple VTH   Dual-VTH      MTCMOS
Variable VTH   VTH hopping   VTCMOS
Multiple VDD   Dual-VDD      Boosted gate MOS
Variable VDD   VDD hopping

Annotations: software-hardware cooperation; technology-circuit cooperation

MTCMOS: Multi-Threshold CMOS
VTCMOS: Variable Threshold CMOS

Multiple: spatial assignment
Variable: temporal assignment

RTL-level optimization-Reducing effective switching activity

• General Principle: Avoid Waste.

– Application-specific processing

– Resource sharing/Locality of reference

– Data representations

– Preservation of Data correlations

– architectural restructuring

– Distributed processing

– Demand-driven/data-driven computation

Application Specific Processing

Application Specific Processing Reduces “Implementation Overhead”

Eliminating Redundant Computation

• Dynamically vary the number of operations per sample.

• Trade power consumption and filter quality [Ludwig-96]

[Figure labels: fs at each of five stages; Power Down]

Eliminating Redundant Computation

Switched Capacitance Reduction ~= (Peak Number of Operations) / (Average Number of Operations) ~= 2

Strong function of signal statistics

Reducing switching activity

• Multiplexing multiple operations on a single hardware unit can have detrimental effect on the power consumption, because the switching activity may be increased, e.g. shared bus

Bus Multiplexing

• Share long data buses with time multiplexing (S1 uses even cycles, S2 odd)

• Buses are a significant source of power dissipation due to high switching activities and large capacitive loading
– 15% of total power in Alpha 21064
– 30% of total power in Intel 80386

• But what if data samples are correlated (e.g., sign bits)?

Reducing the Effective Capacitance

• Circuit and Logic Style - select a circuit style with low capacitance and/or switching activity, e.g. 8-bit adder

Reducing Glitching Activity

• Some circuit structures can be the cause of spurious transients, e.g. a 16-bit ripple carry adder

• Glitches can be reduced by selecting structures that have balanced signal paths, e.g. tree.

• The Brent-Kung lookahead adder and Wallace tree multiplier both have this property, which makes them more attractive for low power.

Data Representation

Two’s complement Sign Magnitude

• Sign-extension activity significantly reduced using sign magnitude representation.

• An accumulator example: sign magnitude datapath switches 30% less capacitance for uniformly distributed inputs

Data Representation – Accumulator Example

Sign magnitude datapath switches 30% less capacitance for uniformly distributed inputs

Two’s Complement vs. Sign-Magnitude

Two’s complement datapath has a significantly higher switching activity

Bus encoding to reduce switching activity

• Minimizing temporal bit transition activity by data representation

• Bit encoding:
– Active high encoding
• high-level voltage for 1, low-level voltage for 0
– Transition-based encoding
• voltage change identifies logic 1, no voltage change identifies logic 0

• Word encoding
– assign patterns of 1’s and 0’s to each word of information
– Non-redundant codes vs. redundant codes

• Example of low power coding
– Limited-weight code
– Gray code
– One-hot code
– Bus-invert code

Example of Bit Encoding

• Reducing average no. of switching on a data bus

• Transition-based encoding may limit the no. of transitions for non-equiprobable input lines

• Let p(0) > p(1) and no temporal correlation exist. If active-high encoding is used, the average no. of transitions is

$$p(0)\,p(1) + p(1)\,p(0) = 2\,(1 - p(1))\,p(1)$$

• For transition-based encoding, it is simply p(1). If p(1) << p(0), transition-based is better than active-high encoding.

• To reduce transitions, the input patterns are transformed so that the p(0) and p(1) probabilities of each input line become as different as possible, and a transition-based encoding is then applied before the data is transmitted.
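The two per-line transition rates can be compared directly. This sketch assumes temporally uncorrelated bits, as stated above; the function names are illustrative.

```python
def active_high_transitions(p1):
    """Average transitions per cycle with active-high encoding: 2 * p(1) * (1 - p(1))."""
    return 2 * p1 * (1 - p1)

def transition_based_transitions(p1):
    """Average transitions per cycle with transition-based encoding: p(1)."""
    return p1

for p1 in (0.05, 0.1, 0.25, 0.5):
    print(p1, active_high_transitions(p1), transition_based_transitions(p1))
```

Setting p(1) < 2 p(1) (1 - p(1)) shows that transition-based encoding wins whenever p(1) < 0.5, and the saving approaches a factor of two as p(1) goes to zero.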

Word encoding

• One-hot coding

• Gray coding - good for sequential data, e.g. addressing for microprocessor [Su-94]

• Disadvantage of Gray code - only good for address bus, not for data bus; additional conversion circuitry is needed
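The benefit of Gray coding for sequential addresses can be seen by counting bit toggles on a bus. The sketch below uses the standard binary-reflected Gray code; the 4-bit address sequence 0 to 15 is an assumption chosen for illustration.

```python
def to_gray(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)

def bus_transitions(words):
    """Total number of bit toggles between consecutive words on a bus."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

addresses = list(range(16))  # sequential 4-bit addresses
binary_toggles = bus_transitions(addresses)
gray_toggles = bus_transitions([to_gray(a) for a in addresses])
print(binary_toggles, gray_toggles)  # 26 15
```

Gray code toggles exactly one line per sequential step. For random (data-like) sequences this advantage disappears, which is why the technique is described as suited to address buses.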

Word encoding (cont.)

• Bus-invert encoding - use redundancy to save power.

• Given an n-bit data bus with 2^n patterns to be represented, assume all the patterns are equiprobable with no temporal correlation, so p(0) = p(1) = 0.5. The average no. of transitions is n/2 per cycle, while the worst case is n.

• Bus-invert coding - add an extra line, I, to the bus, and compare consecutive patterns before transmission. Two cases:
– If the Hamming distance between the two patterns <= n/2, the current pattern is transmitted as it is and I is set to 0.
– If the Hamming distance between the two patterns > n/2, the current pattern is first inverted and then transmitted, and I is set to 1.

• Max. transition is limited to n/2 and average transition is reduced by 25%

• Drawback - extra line I to indicate whether a pattern has been inverted.
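The two cases above translate into a short encoder and decoder. This is a sketch under stated assumptions: the bus is taken to start at all zeros, the Hamming distance is measured against the previously transmitted data lines, and the function names are illustrative.

```python
def bus_invert_encode(words, n_bits):
    """Return a list of (bus_value, invert_line) pairs for an n-bit bus."""
    mask = (1 << n_bits) - 1
    prev = 0  # assumed initial state of the data lines
    encoded = []
    for word in words:
        distance = bin((word ^ prev) & mask).count("1")
        if distance > n_bits / 2:
            word, invert = ~word & mask, 1  # send the complement, raise I
        else:
            invert = 0
        encoded.append((word, invert))
        prev = word
    return encoded

def bus_invert_decode(encoded, n_bits):
    """Recover the original words from (bus_value, invert_line) pairs."""
    mask = (1 << n_bits) - 1
    return [(~value & mask) if invert else value for value, invert in encoded]

data = [0x00, 0xFF, 0x0F, 0xF0]
coded = bus_invert_encode(data, 8)
print(coded)
print(bus_invert_decode(coded, 8) == data)
```

In this example the plain bus would toggle 8 + 4 + 8 data lines, while the encoded data lines toggle at most n/2 = 4 per transfer, at the cost of the extra line I.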

Conditional Inversion Coding

Bus encoding for cross coupling cap.

• Cross-coupling cap dominates for very deep sub-micron technology, e.g. < 70nm

• Bus model
– Stand alone cap. Cs
– Cross coupling cap. Cc

• Stand alone switching
– Applies to a single bit line
– 0-1 transition

• Cross coupling switching
– Occurs on adjacent wires
– Four types of coupling transition

[Figure: Type 1, Type 2, Type 3, and Type 4 coupling transitions between two adjacent wires (combinations of H --> L, L --> H, and L --> L), with values 0, 1, 2, 0]

Permutation Based Address code

• Rearrange the physical order of the bit lines

• Works efficiently on address bus (40%), but not on instruction bus (4%)
– Correlation is lower
– Permutation is fixed

[Figure: sender and receiver]

Dynamic Reconfigurable Bus Encoding Scheme for Instruction Bus

Overview of the Scheme

• Decoding information is stored as a header
• Decoding information is loaded to the LUT first
• Instruction is called
• Decoding information stored in the LUT controls the Cross_bar
• Instructions enter the Cross_bar to be decoded

[Figure labels: Encoding (computer, encoding during compilation, bit lines reordering); Decoding (target processor): Mem_bus, Decoder (Cross_bar), Look-Up Table, Address_bus, Cache_bus, Processing Element, CPU Core]

Demand/Data-driven operation

• Clocking strategy
– Gated clocking
• System Power Down
• Computing Paradigm

Logic Level Power Down Technique

• Activity driven - precomputation-based sequential logic optimization
– Selectively precompute the output logic values of the circuit one clock cycle before they are required, and use the precomputed values to reduce internal switching activity in the succeeding cycle

It is required that:
g1 = 1 -> f = 1
g2 = 1 -> f = 0

[Figure: original circuit and modified circuit]

An example: n-bit comparator

• This circuit compares two n-bit numbers A and B and computes the function A > B

• In general, precomputation works best when there are a small number of complex functions corresponding to the logic block A
