Methodology to Compute Architectural Vulnerability Factors
Chris Weaver (1, 2)
Shubhendu S. Mukherjee (1)
Joel Emer (1)
Steven K. Reinhardt (1, 2)
Todd Austin (2)
(1) Fault Aware Computing Technology (FACT), VSSAD, Intel
(2) University of Michigan
Overview
- Background
- Previous reliability estimation methodology
- Proposed methodology for early reliability estimates
- Sample analysis
- Conclusion
Strike Changes State
[Figure: a strike changes the state of a stored bit (values 0 and 1)]
Failure Rate Definitions
- Interval-based: MTBF = Mean Time Between Failures
- Rate-based: FIT = Failure in Time = 1 failure in a billion hours
- 1 year MTBF = 10^9 / (24 * 365) FIT = 114,155 FIT
- Additive: Cache: 0 FIT + IQ: 114K FIT + FU: 114K FIT = Total of 228K FIT
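The conversion between MTBF and FIT, and the additive property, can be sketched in a few lines of Python. The function names are illustrative only and do not come from the slides; the 24 * 365 hours-per-year convention is the one the slide uses.

```python
HOURS_PER_YEAR = 24 * 365  # the slide's convention (leap days ignored)
BILLION_HOURS = 1e9        # 1 FIT = 1 failure per 10^9 hours


def mtbf_years_to_fit(mtbf_years: float) -> float:
    """Convert a mean time between failures (in years) to a FIT rate."""
    return BILLION_HOURS / (mtbf_years * HOURS_PER_YEAR)


def fit_to_mtbf_years(fit: float) -> float:
    """Convert a FIT rate back to an MTBF in years."""
    return BILLION_HOURS / (fit * HOURS_PER_YEAR)


def total_fit(component_fits: dict) -> float:
    """FIT rates are additive: the total is the sum over components."""
    return sum(component_fits.values())


print(round(mtbf_years_to_fit(1)))  # 114155

# Component values from the slide (114K FIT each for IQ and FU).
chip = {"Cache": 0, "IQ": 114_000, "FU": 114_000}
print(total_fit(chip))  # 228000
```

MTBF values cannot be added directly, which is why the per-structure accounting later in the deck is done in FIT.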
Motivation
[Figure: Data Corruption FIT (log scale, 1 to 100,000) by year, 2003 to 2012. Curves: 1000 MTBF Goal; FIT if all flips manifest as errors; FIT if 10% of flips manifest as errors]
Results of precise & early analysis
- If we meet the goal, we are done
- If we don’t meet the goal, add error protection schemes
Objectives
- Determine which bits matter
- Compute FIT rate
Strike on state bit (flowchart)
- Bit read?
  - no: benign fault, no error
  - yes: Bit has error protection?
    - yes, error is only detected (e.g., parity + no recovery): Detected, but unrecoverable error (DUE)
    - yes, error can be corrected (e.g., ECC): no error
    - no: Does bit matter?
      - yes: Silent Data Corruption (SDC)
      - no: benign fault, no error
* We only focus on SDC FIT
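One reading of the flowchart above can be written as a small decision function. This is a sketch under assumptions: the function name, the three `protection` labels, and the exact branch order are illustrative and reflect one interpretation of the slide, not a definitive statement of the authors' diagram.

```python
def classify_strike(bit_read: bool, protection: str, bit_matters: bool) -> str:
    """Classify the outcome of a strike on a state bit.

    protection is one of (hypothetical labels):
      "none"    - the bit has no error protection
      "detect"  - the error is only detected (e.g., parity + no recovery)
      "correct" - the error can be corrected (e.g., ECC)
    """
    if not bit_read:
        # A flipped bit that is never read cannot cause an error.
        return "benign fault, no error"
    if protection == "correct":
        return "no error"
    if protection == "detect":
        return "DUE"
    if protection != "none":
        raise ValueError(f"unknown protection kind: {protection!r}")
    # Unprotected bit that was read: the outcome depends on whether it matters.
    return "SDC" if bit_matters else "benign fault, no error"


print(classify_strike(True, "none", True))     # SDC
print(classify_strike(True, "detect", True))   # DUE
print(classify_strike(False, "none", True))    # benign fault, no error
```

Only the unprotected path ends in SDC, which is the case the rest of the deck focuses on.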
Architectural Vulnerability Factor (AVF)
$\mathrm{AVF}_{bit} = \text{Probability Bit Matters} = \dfrac{\text{\# of Visible Errors}}{\text{\# of Bit Flips from Particle Strikes}}$
$\mathrm{FIT}_{bit} = \text{intrinsic } \mathrm{FIT}_{bit} \times \mathrm{AVF}_{bit}$
Previous AVF Methodology
Statistical Fault Injection with RTL
[Figure: simulate a strike on a latch feeding a logic block, then check whether the fault propagates from the output to architectural state]
Characteristics of SFI with RTL
- Naturally characterizes all logical structures
- RTL is not available until late in the design cycle
- Numerous experiments are needed to flip all bits
- Generally done at the chip level
- Limited structural insight
Objectives
- Determine which bits matter
  - Earlier in the design cycle
  - With fewer experiments
  - At the structural level
- Compute FIT rate
  - Intrinsic FIT per bit
  - Architectural Vulnerability Factor
Our Analysis: Which bits matter?
- Branch Predictor: doesn’t matter at all (AVF = 0%)
- Program Counter: almost always matters (AVF ~ 100%)
Architecturally Correct Execution (ACE)
- An ACE path requires only a subset of values to flow correctly through the program’s data flow graph (and the machine)
- Anything else (un-ACE path) can be derated away
[Figure: data flow graph from Program Input to Program Outputs]
Example of un-ACE instruction: Dynamically Dead Instruction
Most bits of an un-ACE instruction do not affect program output
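To make the idea of a dynamically dead instruction concrete, the sketch below scans a toy register-only trace backwards and flags instructions whose results are never needed by the program's live-out values. The trace format, register names, and function name are assumptions made for illustration; the authors' actual analysis runs inside a performance model and also has to handle memory, which this sketch ignores.

```python
from typing import List, Set, Tuple

# (destination register, source registers); a hypothetical trace format.
Instr = Tuple[str, Tuple[str, ...]]


def find_dynamically_dead(trace: List[Instr], live_out: Set[str]) -> List[int]:
    """Return indices of instructions whose results never reach live_out."""
    live = set(live_out)  # registers whose current value is still needed
    dead = []
    for index in range(len(trace) - 1, -1, -1):
        dest, srcs = trace[index]
        if dest in live:
            # The result is consumed later, so its inputs become needed too.
            live.discard(dest)
            live.update(srcs)
        else:
            # Overwritten or never read before the end: dynamically dead.
            dead.append(index)
    return sorted(dead)


trace = [
    ("r1", ("r0",)),  # 0: r1 = r0 + 1   (needed by instruction 3)
    ("r2", ("r0",)),  # 1: r2 = r0 * 2   (only feeds dead instruction 2)
    ("r3", ("r2",)),  # 2: r3 = r2 + 5   (overwritten by 3 before any read)
    ("r3", ("r1",)),  # 3: r3 = r1 + 7   (program output)
]
print(find_dynamically_dead(trace, live_out={"r3"}))  # [1, 2]
```

Instruction 2 is dead because its result is overwritten unread, and instruction 1 is dead transitively because its only consumer is dead.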
Dynamic Instruction Breakdown
Average across all of Spec2K slices
- ACE: 46%
- NOP: 26%
- Dynamically dead: 20%
- Predicated false: 7%
- Performance inst: 1%
Mapping ACE & un-ACE Instructions to the Instruction Queue
[Figure: instruction queue entries labeled Wrong-Path Inst, Idle, NOP, Prefetch, ACE Inst (two entries), and Ex-ACE Inst; legend: Architectural un-ACE, Micro-architectural un-ACE]
Vulnerability of a structure
AVF = fraction of cycles a bit contains ACE state
- T = 1: ACE% = 2/4
- T = 2: ACE% = 1/4
- T = 3: ACE% = 0/4
- T = 4: ACE% = 3/4
$\mathrm{AVF} = \dfrac{\text{Average number of ACE bits in a cycle}}{\text{Total number of bits in the structure}} = \dfrac{(2 + 1 + 0 + 3)/4}{4}$
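The four-cycle example above can be checked with a short calculation. The function name is illustrative; the inputs are the per-cycle ACE counts (2, 1, 0, 3) and the structure size (4) from the slide.

```python
from fractions import Fraction


def structure_avf(ace_bits_per_cycle, total_bits):
    """AVF = (average number of ACE bits per cycle) / (total bits)."""
    average_ace = Fraction(sum(ace_bits_per_cycle), len(ace_bits_per_cycle))
    return average_ace / total_bits


# ACE counts at T = 1, 2, 3, 4 for a four-entry structure.
avf = structure_avf([2, 1, 0, 3], total_bits=4)
print(avf, float(avf))  # 3/8 0.375
```

Exact fractions are used so that the result matches the hand calculation without rounding.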
Little’s Law for ACEs
$N_{ace} = T_{ace} \times L_{ace}$
$\mathrm{AVF} = \dfrac{N_{ace}}{N_{total}}$
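A minimal sketch of how Little's Law yields an AVF, reading N_ace as the average number of ACE bits resident in the structure, T_ace as the average throughput of ACE bits into it, and L_ace as their average residence time. All numbers below (residency times, 100 bits per entry, an 800-bit structure, a 10-cycle window) are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from fractions import Fraction


def littles_law_avf(residencies, bits_per_entry, total_bits, cycles):
    """Return (N_ace, AVF) computed as N_ace = T_ace * L_ace.

    residencies: cycles each ACE entry spent in the structure,
                 all within an observation window of `cycles` cycles.
    """
    count = len(residencies)
    throughput = Fraction(count * bits_per_entry, cycles)  # T_ace, bits/cycle
    latency = Fraction(sum(residencies), count)            # L_ace, cycles
    n_ace = throughput * latency                           # Little's Law
    return n_ace, n_ace / total_bits


n_ace, avf = littles_law_avf(
    residencies=[4, 6, 2, 8],  # hypothetical ACE residence times
    bits_per_entry=100,
    total_bits=800,
    cycles=10,
)
print(n_ace, avf)  # 200 1/4
```

The same N_ace falls out of summing the resident ACE bit-cycles directly (20 entry-cycles * 100 bits / 10 cycles), which is the consistency Little's Law guarantees.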
Computing AVF
- Our approach is conservative
  - We assume every bit is ACE unless proven otherwise
- Data Analysis
  - Try to prove that data held in a structure is un-ACE
- Timing Analysis
  - Tracks the time this data spent in the structure
Computing FIT rate of a Chip
$\text{Total FIT} = \sum_i (\text{FIT per bit}_i \times \text{\# of bits}_i \times \mathrm{AVF}_i)$

Structure           FIT per bit   # of bits   AVF   Total FIT
Branch Predictor    .001*         1K          0     0
Program Counter     .001*         64          1     0.064
Instruction Queue   .001*         6400        ?     ?
Functional Units    .001*         4000        ?     ?
…                   …

Total FIT of whole chip = sum of the Total FIT column
* Intrinsic FIT per bit from externally published data
Results: Experimental Setup
- Used ASIM modeling infrastructure
- Model of an Itanium®2-like processor
- Ran all Spec2K benchmarks
  - Compiled with highest level of optimization with the Intel electron compiler
  - Simulated under a full OS
  - Simulation points chosen using SimPoint (Sherwood et al.)
Instruction Queue
ACE percentage = AVF = 29%
- Idle: 31%
- ACE: 29%
- NOP: 15%
- Ex-ACE: 10%
- Dynamically dead: 8%
- Wrong path: 3%
- Predicated false: 3%
- Performance inst: 1%
Functional Units
ACE percentage = AVF = 9%
- Unit idle: 77%
- ACE: 9%
- NOP: 6%
- Dynamically dead: 4%
- Speculative issue: 1%
- Predicated false: 1%
- Wrong path: 1%
- Datapath idle: 1%
- Performance inst: 0%
- Logical masking: 0%
Computing FIT rate of Chip

Structure           FIT per bit   # of bits   AVF   Total FIT
Branch Predictor    .001*         1K          0     0
Program Counter     .001*         64          1     0.064
Instruction Queue   .001*         6400        .29   1.856
Functional Units    .001*         4000        .09   0.360
…                   …

Total FIT of whole chip = sum of the Total FIT column
* Intrinsic FIT per bit from externally published data
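The per-structure products in the table can be reproduced with a short script. This is a sketch: the tuple layout and function name are illustrative, "1K" is assumed to mean 1024 bits (it does not affect the result because the AVF is 0), and the sum covers only the four structures listed, not the rows elided by "…".

```python
# (structure, intrinsic FIT per bit, number of bits, AVF), from the table.
structures = [
    ("Branch Predictor", 0.001, 1024, 0.0),  # "1K" assumed to be 1024 bits
    ("Program Counter", 0.001, 64, 1.0),
    ("Instruction Queue", 0.001, 6400, 0.29),
    ("Functional Units", 0.001, 4000, 0.09),
]


def structure_fit(fit_per_bit: float, bits: int, avf: float) -> float:
    """FIT contribution of one structure: FIT per bit * # of bits * AVF."""
    return fit_per_bit * bits * avf


partial_total = 0.0
for name, fit_per_bit, bits, avf in structures:
    fit = structure_fit(fit_per_bit, bits, avf)
    partial_total += fit
    print(f"{name:<18} {fit:.3f}")

# Sum over the listed structures only; the whole-chip total needs every row.
print(f"{'Listed structures':<18} {partial_total:.3f}")  # 2.280
```

The instruction queue dominates the listed rows because it combines a large bit count with a comparatively high AVF.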
Summary
- Determine which bits matter
  - ACE (Architecturally Correct Execution)
- Compute FIT rate
  - Intrinsic FIT per bit
  - AVF (Architectural Vulnerability Factor)
Questions?
Statistical Fault Injection (SFI) Algorithm
- Find a statistically significant set of bits
- Randomly select a bit
- Flip the bit
- Run two simulations: one with the bit flip and one without
- Run for a pre-defined # of cycles
- Compare the architectural state of the two simulations (e.g., register file)
- If mismatch, declare an error
- Repeat the algorithm with a different bit flip
- AVF = # mismatches observed / total # experiments
Widely used, and has provided useful AVF numbers to date
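The loop described above can be sketched against a toy stand-in for a simulator. Everything in the model is an assumption made for illustration: the 8-bit state, the rule that only the low four bits reach architectural state, and the function names. A real SFI campaign runs an RTL model for a fixed number of cycles instead of calling `run_simulation`.

```python
import random

STATE_BITS = 8    # hypothetical size of the structure under test
ARCH_MASK = 0x0F  # assumption: only the low 4 bits reach architectural state


def run_simulation(state: int) -> int:
    """Toy stand-in for a simulator run; returns the architectural state."""
    return state & ARCH_MASK


def estimate_avf(experiments: int, seed: int = 0) -> float:
    """Estimate AVF as mismatches observed / total experiments."""
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(experiments):
        state = rng.getrandbits(STATE_BITS)          # random starting state
        bit = rng.randrange(STATE_BITS)              # randomly select a bit
        golden = run_simulation(state)               # run without the flip
        faulty = run_simulation(state ^ (1 << bit))  # run with the flip
        if golden != faulty:                         # compare arch. state
            mismatches += 1
    return mismatches / experiments


print(estimate_avf(2000))
```

In this toy model exactly half of the bits matter, so the estimate should approach 0.5 as the number of experiments grows, which illustrates why SFI needs many runs for a statistically meaningful answer.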
SFI vs. ACE analysis
- Accuracy of microarchitectural un-ACE
  - SFI: Better than ACE analysis
  - ACE: Conservative
- Accuracy of architectural un-ACE
  - SFI: Conservative
  - ACE: Better than SFI (e.g., covers dynamically dead instructions)
- Insight
  - SFI: Per-structure insights harder
  - ACE: Little’s Law & per-structure breakdown easier
- # of experiments
  - SFI: Large # required to be statistically significant
  - ACE: Small # of experiments can give good accuracy