enpeople.virginia.edu/~xg2dt/papers/xinfei guo_uvers_poster...xinfei guo advisor: mircea r. stan...
10th Annual University of Virginia Engineering Research Symposium (UVERS 2014)
[Figure: multicore processor with Core 1 to Core 8 and a shared L3 cache, annotated with "Zzzzzz..." (sleeping) and "Heat" labels; logic gate and latch symbols annotated "Timing Error!", "High Power!", "Failure!" and "Slow!"]
Transistor Aging (Wearout)
Deterioration of circuit/system performance over time
Forces a larger design margin
More significant with aggressive technology scaling
Has both a reversible and a permanent part
Bias Temperature Instability (BTI) is the dominant reversible aging mechanism
Previous work
Accept the variations, track and
monitor them
Dynamically adapt to the variations
Reduce actual variations during operation
Limitations of previous work
The worst case becomes even worse with technology scaling
Power, performance and area (PPA) overhead
This Work
Our Goal
Reduce aging induced variations directly without
introducing overhead
Relax the design margin
Deeply rejuvenate the chip
Improve PPA metrics
Features
Explore the idea of periodic sleep for electronic systems, not
unlike that of biological systems
Postulate that future electronic systems will use sleep time as
an active recovery period essential for their overall
performance
Deeply rejuvenate the chip during “Sleep Time”
Demonstrate the techniques with both experiments and
models
Contributions
Three Accelerated Self-Healing techniques
Control sleep conditions explicitly (e.g. higher temperatures,
negative voltages)
Proactive Accelerated Rejuvenation (control the ratio of
sleep vs. active)
A first-order circuit model
Consider both wearout and accelerated recovery periods
Based on the latest device-level NBTI models
Validated using hardware (FPGA) experiments
Exploring On-Chip Solutions
Negative voltage generator
“On-Chip Heater” in other electronic system architectures,
such as multicore
Aging mechanisms: N/PBTI, HCI, TDDB, EM, …
Motivation
Inspired by Biology: Sleep vs. Inactivity
Biological View:
During sleep, there are still several
active processes that are essential for
the recovery of their full capabilities
Conventional view in circuit community:
Sleep for electronic systems means a period of inactivity or
idleness. (Power gating/Clock gating, etc.)
Our Hypothesis:
Sleep should be used as an active recovery period for future
electronics. Electronic systems will benefit from such sleep
periods with active rejuvenation during which some of the
effects of wearout (like BTI) can be reversed, thus leading to
effective self-healing.
Proactive Accelerated Rejuvenation
Schedule explicit accelerated recovery periods ahead of any
sign of stress
Less overhead (no tracking, adaptation circuitry needed)
Easy to implement
Predictable and controllable
Better cumulative metrics
Extends lifetime effectively
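A minimal sketch of such a proactive schedule, assuming a fixed active-to-sleep ratio. The function name `proactive_schedule` and its parameters are hypothetical, and the 48-hour active period with a ratio of 4 simply mirrors the example case reported elsewhere on this poster:

```python
def proactive_schedule(active_hours, active_to_sleep_ratio, cycles):
    """Return a fixed list of (phase, start_hour, end_hour) tuples.

    The schedule is decided ahead of time: no aging sensor or tracking
    circuitry is consulted, which is the point of the proactive approach.
    """
    if active_hours <= 0 or active_to_sleep_ratio <= 0 or cycles < 0:
        raise ValueError("active_hours and ratio must be positive, cycles non-negative")
    sleep_hours = active_hours / active_to_sleep_ratio
    schedule = []
    now = 0.0
    for _ in range(cycles):
        schedule.append(("active", now, now + active_hours))
        now += active_hours
        schedule.append(("sleep", now, now + sleep_hours))
        now += sleep_hours
    return schedule


if __name__ == "__main__":
    # Two cycles of 48 h active followed by 12 h of accelerated recovery.
    for phase, start, end in proactive_schedule(48, 4, 2):
        print(f"{phase:6s} {start:6.1f} h -> {end:6.1f} h")
```

Because the sleep windows are known in advance, the same list could in principle be handed to whatever controller applies the recovery conditions (temperature, negative voltage) during each sleep phase.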
Cross-Layer Model
Wearout Model for FPGAs
Based on Trapping/Detrapping (TD) model
AC vs. DC Stress
Recovery is slower than degradation
The unrecovered part accumulates from phase to phase
t1: stress time; t2: recovery time
The total threshold voltage shift: Equation (1), with the prefactor given by Equation (2)
The total delay shift: Equation (3)
Accelerated Recovery Model
Delay shift depends strongly on voltage,
temperature and the sleep/active ratio
Delay change in one cycle (T): Equation (4)
Fitting parameters are extracted from measurements
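The stress and recovery phases described above can be sketched numerically. This is a first-order illustration only: it assumes a log-time trapping/detrapping form, with a shift of φ(A + log(1 + C·t1)) under stress and a recovered fraction of (k + log(1 + C·t2)) / (k + log(1 + C·(t1 + t2))) during sleep. The parameter values are placeholders, not the fitted values from the measurements, and the function names are hypothetical.

```python
import math


def stress_shift(t1, phi=1.0, A=0.0, C=1.0):
    """Threshold-voltage shift after t1 seconds of stress (arbitrary units)."""
    return phi * (A + math.log(1.0 + C * t1))


def shift_after_recovery(dv_t1, t1, t2, k=1.0, C=1.0, phi2=0.0, A=0.0):
    """Shift remaining after t1 of stress followed by t2 of recovery.

    phi2 models any residual stress during the sleep phase; it is set to
    zero here as a simplifying assumption.
    """
    recovered = (k + math.log(1.0 + C * t2)) / (k + math.log(1.0 + C * (t1 + t2)))
    return phi2 * (A + math.log(1.0 + C * t2)) + dv_t1 * (1.0 - recovered)


if __name__ == "__main__":
    t1 = 48 * 3600.0  # 48 hours of wearout, in seconds
    dv = stress_shift(t1)
    for ratio in (1, 2, 4, 8):  # active-to-sleep ratio
        t2 = t1 / ratio
        left = shift_after_recovery(dv, t1, t2)
        print(f"ratio {ratio}: {100 * left / dv:.1f}% of the shift remains")
```

In this sketch the voltage and temperature dependence would enter only through `phi` and `phi2`; sweeping the ratio shows how a shorter sleep phase leaves more of the shift unrecovered.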
Accelerated Self-Healing
Stress and Recovery “Knobs”
Voltage, duration, temperature, switching activity (AC/DC)
and the ratio of active (wearout) to sleep (rejuvenation) time
Test Configuration
Commercial FPGA chips
Accelerated testing methodology
Test Results
Effect of Switching Activity on Wearout
AC stress degrades performance more slowly than DC stress
Recovery is much slower
Effect of Temperature on Wearout
Negative Voltage
Experimental Setup
High Temperature
Ratio of active vs. sleep time
Summary
Future Work
On-chip Negative Voltages
Combine with on-chip power regulation techniques
Breakdown voltage limitation
Gate-induced drain leakage current (GIDL)
On-chip Heater
Combine the accelerated
techniques with existing
core scheduling solutions
Utilize “Dark Silicon”
Conclusions
Propose three accelerated self-healing techniques
Demonstrate several cases that bring stressed chips to within
90% of their original design margin
On-chip solutions are discussed
Limitations: First-order model, other aging mechanisms
(EM, TDDB, etc.), chip-to-chip variations
Explore the extra flexibility offered by circadian
rhythms to improve the power, performance and area (PPA)
metrics
Acknowledgements
This work was supported in part by NSF under grant No.
CCF-1255907, and by SRC through the Global Research
Collaboration (GRC) program under task ID 2410.001. We
would also like to thank Dr. Wayne Burleson from AMD
Research and Mr. Alec Roelke from UVA for discussions.
*Source: http://gladstoneinstitutes.org/node/11312
High Performance Low Power (HPLP) Lab, Computer Engineering Program, University of Virginia
Xinfei Guo Advisor: Mircea R. Stan
Exploring Accelerated Self-Healing Techniques for Electronic Chips and Systems
Biological Clock*
[Figure axes: ∆Vth versus time; the shift rises to ∆Vth(t1) during stress (0 to t1) and falls to ∆Vth(t1+t2) during recovery (t1 to t1+t2)]
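$$\Delta V_{th}(t_1 + t_2) = \phi_2 \left( A + \log(1 + C t_2) \right) + \Delta V_{th}(t_1) \left( 1 - \frac{k + \log(1 + C t_2)}{k + \log\left(1 + C (t_1 + t_2)\right)} \right) \tag{1}$$
$$\phi_2 \sim K_2 \exp\left( -\frac{E_0}{kT} \right) \exp\left( \frac{B V_{ddr}}{kT \, t_{ox}} \right) \tag{2}$$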
Stress and Recovery behavior
[Figure: Pass-transistor based LUT structure. Labels: C0 to C3, In0, In1, routing blocks, LUT, path of interest, transistors M1 to M10]
$$\Delta T_d(t_1) \sim Y \exp\left( -\frac{E_0}{kT} \right) \exp\left( \frac{B V_{dds}}{kT \, t_{ox}} \right) \frac{A + \log(1 + C t_1)}{V_{dds}} \tag{3}$$
[Equation (4): delay change ∆Td(T) over one cycle T, expressed with V_dds, A, C, k and log(1 + C t) terms; the full expression is not recoverable from this transcript]
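[Figure: test circuit. Circuit Under Test (CUT) of 75 LUTs with enable (En), feeding the input of a 16-bit counter with reset (rst), reference clock f_ref and output C_out; an accompanying formula relates T_d to f_osc, C_out and f_ref]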
[Figure: Test configuration. FPGA board and mother test board, showing the FPGA chip and the connections to the FPGA programmer, to the mother board programmer and to the PC; thermal chamber with temperature control (the chip is inside); logic analyzer]
Test Conditions
[Figure: AC/DC stress test results. Frequency degradation (%), axis 0 to 2.5, at 0, 3, 6, 12 and 24 hours; series: AC stress, DC stress]
[Figure: Accelerated wearout at 110 °C and 100 °C for 1 day. Delay change Td (s), axis 0 to 1 x 10^-9, versus time (s), axis 0 to 9 x 10^4; series: 110 °C measurement, 100 °C measurement, 100 °C model, 110 °C model]
[Figure: Negative voltage-accelerated recovery at 20 °C and 110 °C. Two panels (20 °C, 110 °C); recovered delay (ns) versus recovery time (0, 0.3, 1, 2, 4 and 6 hours); series: 0 V, -0.3 V and the corresponding model curves]
[Figure: High temperature-accelerated recovery under 0 V and -0.3 V. Two panels (0 V, -0.3 V); recovered delay (ns) versus recovery time (0, 0.3, 1, 2, 4 and 6 hours); series: 20 °C, 110 °C and the corresponding model curves]
[Figures: "Illustration of wearout vs. recovery", "Design Margin Relaxed Parameter (%) for a ratio of active to sleep time of 4" and "Design Margin Relaxed Parameter** (%) for all cases". Recoverable axis and annotations: frequency (MHz) from 24.5 to 26.1; "Wearout for 48 hours"; "Accelerated Recovery for 12 hours"]
[Figure: Illustration of Multicore System Self-Healing; label: "Sleep but recovery"]
Note: AS – accelerated stress, AR – accelerated recovery
**Design Margin Relaxed Parameter: Percentage the chip recovered from the original margin.
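One way to read this definition is as the share of the wearout-induced frequency loss that the recovery phase wins back. The sketch below works that out under this assumed interpretation; the function name and the frequency values are hypothetical and are not measurements from the poster.

```python
def margin_recovered_percent(f_fresh_mhz, f_stressed_mhz, f_recovered_mhz):
    """Percentage of the frequency lost to wearout that was regained."""
    lost = f_fresh_mhz - f_stressed_mhz
    if lost <= 0:
        raise ValueError("stressed frequency must be below the fresh frequency")
    return 100.0 * (f_recovered_mhz - f_stressed_mhz) / lost


if __name__ == "__main__":
    # Hypothetical frequencies in MHz: fresh, after wearout, after recovery.
    print(round(margin_recovered_percent(26.0, 25.0, 25.9), 1))
```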
[Figure labels: "Sleeping Cores", "24 hours"]