SOCC, Sept. 25, 2006
Curing the Ailments of Nanometer CMOS through Self-Healing and Resiliency
Jan M. Rabaey
Director, Gigascale Silicon Research Center
Co-Director, Berkeley Wireless Research Center
University of California at Berkeley
The Silicon Age Still on a Roll, But …
High Volume Manufacturing: 2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018
Technology Node (nm): 90, 65, 45, 32, 22, 16, 11, 8
Integration Capacity (BT): 2, 4, 8, 16, 32, 64, 128, 256
Delay = CV/I scaling: 0.7, ~0.7, >0.7, then delay scaling will slow down
Energy/Logic Op scaling: >0.35, >0.5, >0.5, then energy scaling will slow down
Bulk Planar CMOS: High Probability → Low Probability
Alternate, 3G etc: Low Probability → High Probability
Variability: Medium → High → Very High
ILD (K): ~3, <3, then reduce slowly towards 2-2.5
RC Delay: 1, 1, 1, 1, 1, 1, 1, 1
Metal Layers: 6-7, 7-8, 8-9, then 0.5 to 1 layer per generation
Some Major Hurdles on The Way!
2003 ITRS Roadmap
The Challenges of the Next Decade(s)
•The Physics and Manufacturing Challenges
– A whole slew of static and dynamic variations and error mechanisms
•The Design Introduction Challenge
– Complexity, risk, time, cost
•The n-furcation of the Market
Variations Becoming Pronounced
[Figure: lithography wavelength (365nm, 248nm, 193nm, 13nm EUV) versus technology generation (180nm, 130nm, 90nm, 65nm, 45nm, 32nm), 1980 to 2020, with the gap between the two marked]
Design becoming “statistical”
• makes verification substantially harder
• challenging synchronization strategies
• “error-free” design untenable
Courtesy: Shekhar Borkar, Intel
[Figure: on-die temperature map, 40 to 110 °C]
[Figure: normalized frequency (0.9 to 1.4) versus normalized leakage Isb (1 to 5) at 130nm, annotated 30% and 5X]
Just One Example of Where We are Going
[Figure: VT Variation – Long/Wide]
[Figure: VT Variation – Short/Narrow]
Courtesy: Colin McAndrew, Freescale
Variations Come in Many Different Flavors
Also, local versus global, correlated versus random, temporal versus spatial
Different sources lead to different solutions
Variations Become Indistinguishable from Failure
Source: K. Nowka, IBM
Failures Becoming More Prominent
• Electromigration (weak, defective interconnects)
• Manufacturing defects that escape testing (inefficient burn-in testing)
• Time-Dependent Dielectric Breakdown (TDDB) (ultra-thin gate oxides)
• Transient faults due to cosmic rays and alpha particles (increase exponentially with number of devices on chip)
• Thermal runaway: increased heating, higher transistor leakage, higher power dissipation
+ just more complexity
[Figure: transistor reliability versus transistor lifetime (years), now and future]
Courtesy: T. Austin
Failures Becoming More Prominent
Erratic bit failures in memories caused by temporary trapped charges
Dealing with variations and faults
[Figure: complexity versus time (2000, 2005, 2010, beyond, the far beyond), with successive approaches: fully structured and regular fabrics, self-healing, error-resiliency, embracing randomness]
Curing the Nanometer Ailments
• Regularity and Structure
• Self-Healing
• Error-Resiliency
• Embracing Randomness
Regular implementation fabrics
Absolutely required for manufacturability
Driven by photo-lithography and eventually self-assembly constraints
Also for variability, reliability, and time-to-market
Regular Fabrics – A Plethora of Choices
FPGA
VPGA (CMU)
River PLA (Berkeley)
Structured ASIC (e.g. LSI RapidChip)
Trade-off between area, performance, power and time-to-market (factors 5 to 10)
Regular Fabrics - Example
CMU Regular Logic Bricks
Standard-cell library with fewer (~10), coarser, configurable (w/ vias), micro-regular brick layouts…
…that exhibit macro-regularity when assembled at chip-level
[Figure: 2-D FFT plots of poly-Si patterns, comparing ASIC “spatial” regularity with brick “spatial” regularity]
[Courtesy: Larry Pileggi, Andrzej Strojwas, CMU – C2S2]
CMU Regular Logic Bricks
[Courtesy: Larry Pileggi, Andrzej Strojwas, CMU – C2S2]
Curing the Nanometer Ailments
• Regularity and Structure
• Self-Healing
• Error-Resiliency
• Embracing Randomness
Self-Healing Architectures
• On-chip test and diagnostics used to correct for variations and stress
• Static and dynamic
Self-Healing
• Introduce sensors that monitor key aspects of system
– Manufacturing and environmental conditions
Process variations, temperature, voltage, activity, etc
– Key properties that accelerate failure mechanisms
• Employ system-level intelligent control to reduce stress
– Temperature control via resource assignment
– Active management of voltage-reliability trade-offs
• Utilize tuning and healing to alleviate reliability threats
– NBTI reversal
– In-field clock tuning
Courtesy: T. Austin
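The "temperature control via resource assignment" idea on this slide can be sketched as a greedy placement loop. This is only an illustration under stated assumptions: the core ids, the sensed temperatures, the per-task heating model, and the helper names `pick_coolest_core` and `assign_tasks` are all hypothetical and do not come from the talk.

```python
def pick_coolest_core(temps_c, limit_c=90.0):
    """Return the id of the coolest core below limit_c, or None if all are too hot.

    temps_c maps a (hypothetical) core id to its sensed temperature in Celsius.
    """
    eligible = {core: t for core, t in temps_c.items() if t < limit_c}
    if not eligible:
        return None  # every core is at or above the limit: defer the task
    return min(eligible, key=eligible.get)


def assign_tasks(tasks, temps_c, heat_per_task_c=5.0, limit_c=90.0):
    """Greedily place each task on the coolest eligible core.

    heat_per_task_c is an assumed, crude model of the temperature rise per task.
    """
    temps = dict(temps_c)  # work on a copy of the sensor readings
    placement = {}
    for task in tasks:
        core = pick_coolest_core(temps, limit_c)
        placement[task] = core
        if core is not None:
            temps[core] += heat_per_task_c
    return placement
```

A real controller would replace the fixed heating increment with fresh sensor readings after each scheduling interval.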
Test Moving On-Line
• On-chip resources used to minimize test cost
• Also available for dynamic re-evaluation and adaptation
[Figure: low-cost tester connected through a bus interface and master wrapper (VCI) to the on-chip bus, CPU, and on-chip memory holding the diagnostic test program and response map; output is a logic failure map]
[Figure: on-chip noise samplers]
[Figure: on-chip leakage sensor, 90 nm Itanium]
Adaptive Biasing Using On-Line Test
[Figure: energy-performance trade-off, switching energy (fJ) versus path delay (ps), comparing “Worst Case, w/o Vth tuning” with “Nominal, w/ Vth tuning” (annotation: 10x)]
[Figure: adaptive tuning loop in which a test module applies test inputs to the module, collects responses, and adjusts Vdd, Vbb and Tclock]
Dynamically adjust supply and threshold design parameters to center the design in the presence of process variations!
Easier again in regular fabrics
Courtesy: K. Cao, Berkeley
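The tuning loop on this slide (test the module, then adjust the threshold until timing is met) can be sketched as below. This is a toy illustration, not the circuit from the talk: the alpha-power-law delay model, its constants, and the names `path_delay_ps` and `tune_vth` are assumptions.

```python
def path_delay_ps(vdd, vth, k=100.0, alpha=1.3):
    """Toy alpha-power-law delay model (assumed): delay ~ k * Vdd / (Vdd - Vth)^alpha."""
    return k * vdd / (vdd - vth) ** alpha


def tune_vth(vdd, vth_start, t_clock_ps, step=0.01, vth_min=0.1):
    """Lower Vth in small steps (forward body bias) until the test path meets t_clock_ps.

    Returns the tuned Vth, or None if the target cannot be met above vth_min.
    """
    vth = vth_start
    while vth >= vth_min:
        if path_delay_ps(vdd, vth) <= t_clock_ps:
            return vth
        vth -= step  # a lower threshold speeds the path up but raises leakage
    return None
```

On silicon the delay would come from an on-line test measurement rather than a formula, and the loop could adjust Vdd and Tclock as well.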
Adaptive (Body) Biasing Impact
Courtesy: P. Gelsinger and S. Borkar, Intel (DAC04)
[Figure: 4.5 mm × 5.3 mm test chip with multiple subsites]
Dynamic Resource Allocation
In the Multi-Processor Space
Compiler combines load assignment with DVS (mdl group at PSU)
[Figure: normalized energy versus number of processors (2, 4, 8, 16, 32) for benchmarks 3D, DFE, LU, SPLAT, MGRID, WAVE5]
More savings with more processors!
In the Interconnect Space
Use routing throttling to perform thermal management
ThermalHerd (L.S. Peh, Princeton)
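Why more processors can save more energy under DVS can be shown with a small model: spreading a fixed workload over more processors lets each one run at a lower frequency and voltage while still meeting the deadline. The operating points, the even load split, and the function names below are assumptions, and parallelization overhead is ignored.

```python
# Hypothetical (frequency in MHz, voltage in V) operating points, slowest first.
LEVELS = [(200, 0.8), (400, 1.0), (600, 1.2), (800, 1.4)]


def lowest_level(cycles, deadline_us):
    """Pick the slowest operating point that finishes `cycles` within the deadline."""
    for f_mhz, v in LEVELS:
        if cycles / f_mhz <= deadline_us:  # cycles / MHz gives microseconds
            return f_mhz, v
    return None  # not even the fastest level meets the deadline


def energy(total_cycles, n_procs, deadline_us):
    """Relative dynamic energy (cycles * V^2) with the load split evenly over n_procs."""
    per_proc = total_cycles / n_procs
    level = lowest_level(per_proc, deadline_us)
    if level is None:
        return None
    _, v = level
    return total_cycles * v * v
```

With these assumed numbers, doubling the processor count lets every processor drop one voltage level, so the relative energy falls even though the total cycle count is unchanged.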
Rejuvenation
Negative Bias Temperature Instability
Source: D. Blaauw, UMich
Curing the Nanometer Ailments
• Regularity and Structure
• Self-Healing
• Error-Resiliency
• Embracing Randomness
Redundancy Galore
The only way to provide true error-resiliency!
With billions of transistors, overhead factors of 2 to 3 are reasonable if leading to 100% yield, supreme performance, or new applications.
Error-Resilient Systems
Incorporate facilities to push through system faults
• Error detection technologies
– Systems checkers, online testing, continuous functional verification
• Fault diagnosis
– Fine-grained testing, online testing
• System state recovery
– Microarchitectural checkpointing, algorithmic tolerance
• Physical repair
– Sparing, TMR
Courtesy: T. Austin
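Of the repair options listed, TMR (triple modular redundancy) is the simplest to illustrate: run three copies and take the majority. The sketch below is a software analogy only; the function names and the bit-flip fault injection are hypothetical.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the majority value of three redundant results.

    Raises ValueError when all three disagree, since no majority exists.
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority among redundant results")
    return value


def run_tmr(func, x, faulty_copy=None):
    """Run func three times; optionally corrupt one copy to emulate a fault."""
    results = [func(x) for _ in range(3)]
    if faulty_copy is not None:
        results[faulty_copy] ^= 1  # flip the low bit of one integer result
    return tmr_vote(*results)
```

A single corrupted copy is outvoted; two simultaneous faults that agree with each other would defeat the voter, which is why fault independence matters.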
A Gradual Introduction Process
A “pseudo-synchronous” approach to address process variations and power minimization with minimal overhead by combining circuit and architectural techniques
Example: Aggressive Deployment using “Razor”
Courtesy: T. Austin, D. Blaauw, Michigan
[Figure: “razored pipeline” with stages IF, ID, EX, MEM (read-only) and WB (reg/mem) separated by Razor flip-flops; error, bubble, recover and flushID signals feed a flush control block, with a stabilizer flip-flop before write-back]
[Figure: Razor flip-flop: main flip-flop (D, Q, clk) with a shadow latch clocked by clk_del and a comparator producing Error_L]
[Figure: energy versus supply voltage, showing processor energy, recovery energy and total energy, with the optimal voltage at the minimum of the total]
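The energy-versus-voltage trade-off behind Razor can be sketched numerically: lowering the supply reduces processor energy but raises the timing-error rate, and each error costs recovery energy, so the total has a minimum. Everything in the model below is an assumption for illustration (the exponential error-rate curve, its constants, and the recovery cost); it is not data from the Razor work.

```python
import math


def error_rate(v, v_crit=1.0, slope=40.0):
    """Assumed timing-error rate per cycle: negligible above v_crit, growing
    exponentially as the supply drops below it (capped at 1)."""
    return min(1.0, 1e-6 * math.exp(slope * (v_crit - v)))


def total_energy(v, recovery_cycles=10):
    """Relative energy per instruction: V^2 for normal work plus replay overhead."""
    e_proc = v * v
    e_recovery = error_rate(v) * recovery_cycles * e_proc
    return e_proc + e_recovery


def optimal_voltage(v_min=0.6, v_max=1.2, steps=601):
    """Grid search for the supply voltage that minimizes total energy."""
    grid = [v_min + i * (v_max - v_min) / (steps - 1) for i in range(steps)]
    return min(grid, key=total_energy)
```

With these assumed constants the minimum falls well below the error-free voltage, which is the qualitative point of the plot: tolerating a small error rate is cheaper than guard-banding against it.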
The Memory Data-Retention Voltage (DRV)
Example 2: Minimizing standby leakage in SRAMs
When Vdd scales down to DRV, the Voltage Transfer Curves (VTC) of the internal inverters degrade to such a level that the Static Noise Margin (SNM) of the SRAM cell reduces to zero.
DRV Condition:
$\left.\dfrac{\partial V_1}{\partial V_2}\right|_{\text{Left inverter}} = \left.\dfrac{\partial V_1}{\partial V_2}\right|_{\text{Right inverter}}$, when $V_{DD} = \mathrm{DRV}$
[Figure: SRAM cell schematic with transistors M1 to M6, internal nodes V1 and V2, supply VDD, and leakage currents]
[Figure: VTC of SRAM cell inverters, V2 (V) versus V1 (V) from 0 to 0.4 V, showing VTC1 and VTC2 at VDD = 0.18 V and VDD = 0.4 V]
Source: Huifang Qin, ISQED 2004
The Impact of Process Variations
DRV Spatial Distribution (256*128 Cells)
130 nm CMOS
[Figure: histogram of 32K SRAM cells versus DRV (mV), horizontal axis 100 to 400 mV, vertical axis 0 to 6000 cells]
Supply based tradeoff
[Figure: data in (t = 0) passes through an error control code into the SRAM held at supply vS; data out at t = Tst]
Goal: Minimize power/bit
Power tradeoff with ECC
ECC saves standby power
• Hamming [31, 26, 3] achieves 33% power saving
• Reed-Muller [256, 219, 8] achieves 35% power saving
At the expense of time and area overhead
Minimum standby time to achieve power savings
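To make the Hamming [31, 26, 3] option concrete, here is a minimal single-error-correcting encoder and decoder: 26 data bits plus 5 parity bits, with the parity bits placed at power-of-two positions. The bit layout and function names are a textbook convention chosen for illustration, not necessarily the layout used in the design described here.

```python
PARITY_POS = (1, 2, 4, 8, 16)
DATA_POS = [p for p in range(1, 32) if p not in PARITY_POS]  # 26 positions


def encode(data_bits):
    """Encode 26 data bits (list of 0/1) into a 31-bit Hamming codeword."""
    assert len(data_bits) == 26
    word = [0] * 32  # index 0 unused so that indices match bit positions 1..31
    for pos, bit in zip(DATA_POS, data_bits):
        word[pos] = bit
    for p in PARITY_POS:
        # Parity bit p covers every position whose index has bit p set.
        word[p] = sum(word[i] for i in range(1, 32) if i & p) % 2
    return word[1:]


def decode(codeword):
    """Correct up to one flipped bit and return the 26 data bits."""
    word = [0] + list(codeword)
    syndrome = 0
    for p in PARITY_POS:
        if sum(word[i] for i in range(1, 32) if i & p) % 2:
            syndrome |= p
    if syndrome:  # the syndrome is the position of the single flipped bit
        word[syndrome] ^= 1
    return [word[pos] for pos in DATA_POS]
```

In the standby scenario, a cell that loses its value at the reduced supply shows up as one flipped bit per word, which `decode` repairs on wake-up; two flipped bits in the same word would be miscorrected by this code.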
Prototype Design
• Error tolerant SRAM optimized for ultra-low voltage standby
• Selected implementation Hamming [31, 26, 3]
• 50% cell design overhead
• 19% parity overhead
• Tapeout: May 2006
[Figure: 1.1mm × 1.1mm die with the original 1024x26 memory, the customized 1024x31 memory, encoder (enc) and decoder (dec)]
“Aggressive” Deployment At the Algorithm Level
[Figure: input x[n] feeds the Main Block (output ya[n]) and the Estimator (output ye[n]); a | · | > Th comparison selects the final output ŷ[n]]
[Figure: power versus voltage (both normalized to 1.0), showing Pmain, PEC and PTOT and the resulting energy savings]
Voltage overscale Main Block.
Correct errors using Estimator.
Power savings ≥ 3X!
Courtesy: N. Shanbhag, Illinois
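The correction step on this slide can be sketched as a per-sample decision. The rule below assumes the comparison is between the main block output and the estimator output, which is one reading of the block diagram; the function names and sample values are hypothetical.

```python
def ant_output(y_main, y_est, threshold):
    """Algorithmic-noise-tolerance style selection.

    Use the main block's output unless it deviates from the low-cost
    estimator by more than `threshold`, in which case fall back to the estimate.
    """
    return y_est if abs(y_main - y_est) > threshold else y_main


def correct_stream(main_samples, est_samples, threshold):
    """Apply ant_output sample by sample to two equally long sequences."""
    return [ant_output(m, e, threshold) for m, e in zip(main_samples, est_samples)]
```

Timing errors from voltage overscaling tend to hit the most significant bits, so they appear as large deviations that the threshold test catches, while the small estimator error is accepted the rest of the time.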
Leveraging resiliency to increase value
[Figure: image panels labeled error-free, with errors, and error-corrected]
Low power motion estimation architecture using Algorithmic Noise Tolerance (Shanbhag, UIUC)
Up to 71% energy reduction demonstrated
Moving the Verification on the Chip
Self-checking processor
• Core function validated by checker
• Checker relaxes burden of correctness on core processor
• Core does the heavy lifting, removes hazards that could slow the simple checker
[Figure: core (performance) pipeline with IF, ID, REN, REG, SCHEDULER and EX/MEM passes speculative instructions, in-order with PC, inst, inputs and addr, to the checker (correctness) stages CHK and CT]
[Figure: die areas: Alpha 21264 core, 205 mm2; REMORA checker, 12 mm2]
Courtesy: Todd Austin, Univ. of Michigan
“On-Line X”
(X = Verification, Test, Tuning, Reliability, Resource, Power and Leakage Management)
From Design time to Run Time Yield Improvement!
“Turning lemons into lemonade”
T. Austin
Coordinated Forward Error Recovery
Runtime Validation of Multithreaded Processors
[Figure: bar chart (vertical axis 0.99 to 1.06) over benchmarks FFT, LU, CHOLESKY, BARNES, FMM, WATER-NSQUARED and WATER-SPATIAL, for the runtime validation configuration at fault rates of 1/1K and 1/1M]
[Figure: SMT processor with register file and memory dispatching per-thread retired instructions to runtime monitoring hardware: context status register, hardware synchronization unit, and DIVA checker processors]
Correctness Properties of Multithreaded Execution
• Inter-thread Communication
• Inter-thread Synchronization
• Intra-thread Data Flow
• Intra-thread Control Flow
Courtesy: S. Malik, Princeton
BulletProof Silicon – The Next Generation
Goal: Single-defect tolerance for 5% area overhead
Key ideas:
• No expensive computation checking
• Protect computation and test Hw
• Repair by disabling redundant parts
Approach:
1. Execute and protect state
2. Test concurrently when Hw idle
3. If test fails → roll back state → disable component → restart
[Figure: µprocessor pipeline (IF, ID, EX, MEM, WB) with checkers + BIST. Circuit envelope: logic-level testing and reconfiguration. Architectural envelope: check-pointing and epoch restore, with speculative and non-speculative state separated at epoch boundaries]
Courtesy: Austin, Bertacco, U. Mich
BulletProof Router
• Exploit the properties of the CMP switch design to provide end-to-end error detection and recovery
– Enhance switch output channels with CRC checkers
– Split flits into two parts and route them independently using different resources
– Add a Recovery Pointer to each input buffer
– On Error Detection:
  - All CRC checkers drop outgoing packets
  - Switch pipeline is flushed
  - Head pointers are set to recovery pointers
  - Restart execution
[Figure: interconnect switch with input buffers, routing logic and VC state, switch arbiter, cross-bar and cross-bar controller; CRC checkers on the routed flits and a buffer checker raise an error detection signal to the recovery logic and system diagnosis]
[Figure: input buffer holding flits a to e with Tail, Head and Recovery Head pointers. a: correctly routed flit; b, c: in the switch pipeline; d: next flit to be routed; e: last flit buffered]
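The recovery-pointer mechanism described on this slide can be sketched as a small buffer model: flits that have entered the switch stay in the buffer until confirmed, and an error rewinds the head pointer to the recovery pointer. The class and method names are hypothetical, and the model ignores buffer capacity and flit splitting.

```python
class InputBuffer:
    """FIFO of flits with a head pointer and a recovery pointer (a sketch).

    Flits between `recovery` and `head` have left the buffer but are not yet
    confirmed as correctly routed, so they are kept for possible replay.
    """

    def __init__(self):
        self.flits = []
        self.head = 0      # index of the next flit to send into the switch
        self.recovery = 0  # index of the oldest unconfirmed flit

    def push(self, flit):
        self.flits.append(flit)

    def send(self):
        """Send the next flit into the switch pipeline."""
        flit = self.flits[self.head]
        self.head += 1
        return flit

    def confirm(self):
        """The oldest in-flight flit passed its CRC check; it no longer needs replay."""
        if self.recovery < self.head:
            self.recovery += 1

    def on_error(self):
        """Error detected: rewind the head so unconfirmed flits are sent again."""
        self.head = self.recovery
```

Using the flit labels from the figure: after a is confirmed and b, c are in the switch pipeline, an error resets the head to b, so b and c are replayed while a is not.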
Towards malleable, resilient architectures
The Quest: Scaleable (hard and soft) architectures that provide flexible redundancy to accommodate systematic and random, static and dynamic errors while avoiding brittleness!
Curing the Nanometer Ailments
• Regularity and Structure
• Self-Healing
• Error-Resiliency
• Embracing Randomness
Maintaining a purely deterministic Boolean abstraction ultimately becomes untenable! Maintaining our abstractions == Slowly abandon them !!
The Search for (New) Scaleable and Stackable Abstractions
Allow devices to make errors and use models-of-computation that tolerate them (signal processing, communication, coding, information theory)
An Interesting Case Study: The “Neural Network” MOC
Properties:
• Works well on noisy signals
• Uses “soft” decisions
• Operates in the presence of failures of components and interconnections
Challenge: Limited scope. Works mostly for classification problems
[Figure: artificial neuron]
Exploring the Yellow Brick Road
Humans
• 10-15% of terrestrial animal biomass
• 10^9 Neurons/“node”
• Since 10^5 years ago
Ants
• 10-15% of terrestrial animal biomass
• 10^5 Neurons/“node”
• Since 10^8 years ago
Easier to make ants than humans
“Small, simple, swarm”
Courtesy: D. Petrovic, UCB
Inspired by the Sensor Network Paradigm
Artificial Skin
Communication Backplanes Real-time Health Monitoring
Smart Surfaces
Example: Collaborative Networks
• Large number of states/nodes
• Bi-directional, non-linear, non-deterministic links
• Local coupling with globally emergent behavior
• Inherently redundant and resilient to failure
Sensor Network-on-a-chip
Source: N. Shanbhag, D. Jones
SN-on-a-chip – A simple example
A simple study: 2 different adders with voltage over-scaling
Estimators need to be independent for this scheme to be effective
Source: N. Shanbhag, UIUC
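Why the estimators need to be independent can be illustrated with a simple fusion rule. The sketch below uses a median over several redundant estimates; the median is an assumption chosen for illustration, since the slide does not state the fusion rule, and the sample values are made up.

```python
import statistics


def fuse(estimates):
    """Fuse redundant estimates with the median, which ignores a minority of outliers."""
    return statistics.median(estimates)


def fuse_streams(streams):
    """Fuse several equally long estimate streams sample by sample."""
    return [fuse(samples) for samples in zip(*streams)]
```

If the estimators fail on different samples, each large error is outvoted. If two of three estimators fail on the same sample (correlated errors, for example identical adders hitting the same critical path), the median itself is wrong, which is the point of using different adder designs.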
Distributed Collaborative Systems on a Chip
Example: A configurable radio architecture based on collaborative autonomous entities
Source: J. Roychowdhury, J. Rabaey
Array of locally-coupled, cheap, low-power oscillator-based units
• Known to exhibit complex, spontaneous pattern formation
• Operation mode selected through choice of coupling factors and operational nodes
[Figure: emerging pattern as a function of coupling factor]
The Mechanical Radio
The Ultimate ULP Tunable Wireless Transceiver?
[Figure: wine-glass disk resonator (R = 32 μm) with support beams, anchor, coupling beam, input electrode and output electrode]
Source: C. Nguyen, UC Michigan
9 wine-glass disc oscillator-based GSM-compliant oscillator
Transitioning to the Post-Silicon Age
Implementation platforms that work under very low SNR, are non-deterministic, unpredictable and unreliable…
Molecular
Organic
NanoOptics
Nanotube
Some Concluding Remarks
Formidable challenges over the next decades to dramatically alter design paradigms
Variability and reliability to lead to novel micro-architectures and computational models
Regularity and redundancy central tenets
The opportunities:
Use the abundance of transistors to move the burden from pre- or post-manufacturing evaluation to on-line activities
Gradual incorporation of error-resilient computational models
The GSRC System-Design Roadmap
[Figure: complexity versus time (now to the 2020’s), with Core, Concurrent, Resilient and Alternative system-design tracks]
GSRC: The best answer to formidable challenges is a critical mass response
The GSRC Agenda
Themes: Concurrent Systems, Resilient Systems, Alternative Computational Systems, System Design, Core Framework, Design Driver
Names shown: W.M. Hwu, T. Austin, N. Shanbhag, ASV, J. Wawrzynek, J. Rabaey, S. Malik, K. Lutz
Structured along the line of big challenges rather than technologies
Provokes multi-disciplinary out-of-the-box thinking
41 Faculty, 17 Institutions
Thank you!
“Creativity is the ability to introduce order into the randomness of nature”
― Eric Hoffer
The contributions of all the GSRC faculty to this presentation are greatly appreciated, as is the funding by the MARCO member companies and the US Government.