Methodology to Compute Architectural Vulnerability Factors

Chris Weaver (1, 2)

Shubhendu S. Mukherjee (1)

Joel Emer (1)

Steven K. Reinhardt (1, 2)

Todd Austin (2)

(1) Fault Aware Computing Technology (FACT), VSSAD, Intel

(2) University of Michigan

Overview

Background
Previous reliability estimation methodology
Proposed methodology for early reliability estimates
Sample analysis
Conclusion

Strike Changes State

[Figure: a particle strike changes the state of a stored bit; bit values 0 and 1 shown]

Failure Rate Definitions

Interval-based
  MTBF = Mean Time Between Failures

Rate-based
  FIT = Failure in Time; 1 FIT = 1 failure in a billion hours
  1 year MTBF = 10^9 / (24 * 365) FIT = 114,155 FIT
  Additive: Cache: 0 FIT + IQ: 114K FIT + FU: 114K FIT = Total of 228K FIT
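The conversion between MTBF and FIT, and the additivity of FIT rates, can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative and not from the slides.

```python
HOURS_PER_YEAR = 24 * 365   # same year length as used on the slide
BILLION_HOURS = 10**9       # 1 FIT = 1 failure per billion hours


def mtbf_years_to_fit(mtbf_years):
    """Convert a mean time between failures (in years) to a FIT rate."""
    return BILLION_HOURS / (mtbf_years * HOURS_PER_YEAR)


def fit_to_mtbf_years(fit):
    """Convert a FIT rate back to a mean time between failures (in years)."""
    return BILLION_HOURS / (fit * HOURS_PER_YEAR)


one_year_fit = mtbf_years_to_fit(1)                # roughly 114,155 FIT
total_fit = 0 + one_year_fit + one_year_fit        # cache + IQ + FU, FIT rates add
print(round(one_year_fit), round(total_fit))
print(fit_to_mtbf_years(total_fit))                # MTBF of the combined system, in years
```

Because FIT rates add, two structures that each fail once a year on average give a combined MTBF of half a year.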

Motivation

[Figure: Data Corruption FIT (log scale, 1 to 100000) by year, 2003 to 2012, built up over three slides. Series: 1000 MTBF Goal; FIT if all flips manifest as errors; FIT if 10% of flips manifest as errors]

Results of precise & early analysis

If we meet the goal, we are done

If we don’t meet the goal, add error protection schemes

Objectives

Determine which bits matter
Compute FIT rate

Strike on state bit

Bit read?
  no: benign fault, no error
  yes: Bit has error protection?
    yes, error is only detected (e.g., parity + no recovery): Detected, but unrecoverable error (DUE)
    yes, error can be corrected (e.g., ECC): no error
    no: Does bit matter?
      yes: Silent Data Corruption (SDC)
      no: benign fault, no error

* We only focus on SDC FIT
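One reading of the flowchart above can be written as a small decision function. This is a sketch under assumptions: the function name, the argument names, and the string labels for the protection level are hypothetical and chosen only for illustration.

```python
def classify_strike(bit_is_read, protection, bit_matters):
    """Classify the outcome of a particle strike on a state bit.

    protection is None (unprotected), "detect" (e.g. parity, no recovery)
    or "correct" (e.g. ECC). These labels are assumptions of this sketch.
    """
    if not bit_is_read:
        return "benign fault, no error"
    if protection == "correct":
        return "no error"
    if protection == "detect":
        return "DUE"
    if protection is None:
        return "SDC" if bit_matters else "benign fault, no error"
    raise ValueError(f"unknown protection level: {protection!r}")


print(classify_strike(True, None, True))       # unprotected bit that matters
print(classify_strike(True, "detect", True))   # parity-protected bit
print(classify_strike(False, None, True))      # bit never read
```

Only the unprotected, read, and relevant case ends in silent data corruption, which is the case the rest of the talk quantifies.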

Architectural Vulnerability Factor (AVF)

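$$\mathrm{AVF}_{bit} = \text{Probability Bit Matters} = \frac{\text{\# of Visible Errors}}{\text{\# of Bit Flips from Particle Strikes}}$$

$$\mathrm{FIT}_{bit} = \text{intrinsic } \mathrm{FIT}_{bit} \times \mathrm{AVF}_{bit}$$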
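As a worked instance of the two definitions above, the sketch below derates an intrinsic per-bit FIT rate by an AVF. The counts (10 visible errors out of 100 bit flips) and the intrinsic rate of 0.001 FIT are hypothetical values picked only to make the arithmetic easy to follow.

```python
def avf_bit(visible_errors, bit_flips):
    """AVF of a bit: fraction of particle-induced flips that become visible errors."""
    if bit_flips <= 0:
        raise ValueError("need at least one bit flip")
    return visible_errors / bit_flips


def fit_bit(intrinsic_fit, avf):
    """Effective FIT of a bit: intrinsic FIT derated by its AVF."""
    return intrinsic_fit * avf


avf = avf_bit(visible_errors=10, bit_flips=100)   # hypothetical counts
print(avf)
print(fit_bit(intrinsic_fit=0.001, avf=avf))      # hypothetical intrinsic rate
```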

Previous AVF Methodology

Statistical Fault Injection with RTL

[Figure: simulate a strike on a latch feeding a block of logic, then check whether the fault propagates to architectural state]

Characteristics of SFI with RTL

Naturally characterizes all logical structures

RTL is not available until late in the design cycle
Numerous experiments are needed to flip all bits
Generally done at the chip level
  Limited structural insight

Objectives

Determine which bits matter
  Earlier in the design cycle
  With fewer experiments
  At the structural level

Compute FIT rate
  Intrinsic FIT per bit
  Architectural Vulnerability Factor

Our Analysis: Which bits matter?

Branch Predictor: doesn’t matter at all (AVF = 0%)

Program Counter: almost always matters (AVF ~ 100%)

Architecturally Correct Execution (ACE)

ACE path requires only a subset of values to flow correctly through the program’s data flow graph (and the machine)

Anything else (un-ACE path) can be derated away

[Figure: data flow graph from Program Input to Program Outputs]

Example of un-ACE instruction: Dynamically Dead Instruction

Most bits of an un-ACE instruction do not affect program output
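To make the idea concrete, the sketch below finds dynamically dead instructions in a short register-level trace with a backward liveness walk. It is an illustration under assumptions, not the analysis used in the talk: the `Inst` record, the `is_output` flag, and the rule that every register is treated as live at the end of the trace are all hypothetical choices, the last one made to stay conservative.

```python
from typing import List, NamedTuple, Optional, Set, Tuple


class Inst(NamedTuple):
    dest: Optional[str]        # register written, or None
    srcs: Tuple[str, ...]      # registers read
    is_output: bool = False    # assumed to affect program output (e.g. a store)


def find_dynamically_dead(trace: List[Inst]) -> List[int]:
    """Return indices of instructions whose results never reach a needed instruction."""
    # Conservative start: assume every register is read after the trace ends.
    live: Set[str] = {r for inst in trace for r in (inst.dest, *inst.srcs) if r is not None}
    dead: List[int] = []
    for index in range(len(trace) - 1, -1, -1):
        inst = trace[index]
        needed = inst.is_output or inst.dest is None or inst.dest in live
        if needed:
            if inst.dest is not None:
                live.discard(inst.dest)   # earlier writes to dest are overwritten here
            live.update(inst.srcs)        # a needed instruction keeps its sources live
        else:
            dead.append(index)            # result is overwritten before any needed read
    return sorted(dead)


trace = [
    Inst("r1", ()),                       # 0: read only by instruction 1
    Inst("r2", ("r1",)),                  # 1: overwritten by instruction 2 before use
    Inst("r2", ()),                       # 2
    Inst("r1", ()),                       # 3
    Inst(None, ("r1", "r2"), True),       # 4: output instruction
]
print(find_dynamically_dead(trace))
```

Instruction 1 is dead because its result is overwritten before being read, and instruction 0 is dead transitively because its only reader is itself dead.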

Dynamic Instruction Breakdown

Average across all of Spec2K slices

ACE                 46%
NOP                 26%
DYNAMICALLY DEAD    20%
PREDICATED FALSE     7%
PERFORMANCE INST     1%

Mapping ACE & un-ACE Instructions to the Instruction Queue

[Figure: instruction queue entries. Architectural un-ACE: NOP, Prefetch. Micro-architectural un-ACE: Idle, Wrong-Path Inst, Ex-ACE Inst. Remaining entries: ACE Inst]

Vulnerability of a structure

AVF = fraction of cycles a bit contains ACE state

T = 1: ACE% = 2/4
T = 2: ACE% = 1/4
T = 3: ACE% = 0/4
T = 4: ACE% = 3/4

$$\mathrm{AVF} = \frac{\text{Average number of ACE bits in a cycle}}{\text{Total number of bits in the structure}} = \frac{(2 + 1 + 0 + 3)/4}{4}$$
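The per-cycle example above can be reproduced directly: average the number of ACE entries over the observed cycles and divide by the size of the structure. This is a minimal sketch; the function name is illustrative.

```python
def structure_avf(ace_per_cycle, total_size):
    """AVF = average ACE occupancy per cycle / total size of the structure."""
    if not ace_per_cycle or total_size <= 0:
        raise ValueError("need at least one cycle and a positive size")
    average_ace = sum(ace_per_cycle) / len(ace_per_cycle)
    return average_ace / total_size


# ACE counts at T = 1, 2, 3, 4 from the slide, for a structure of size 4.
print(structure_avf([2, 1, 0, 3], total_size=4))
```

The result equals 6 ACE entry-cycles out of 16 entry-cycles observed.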

Little’s Law for ACEs

$$N_{ace} = T_{ace} \times L_{ace}$$

$$\mathrm{AVF} = \frac{N_{ace}}{N_{total}}$$
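Reading the two relations above as N_ace = T_ace × L_ace and AVF = N_ace / N_total, with T_ace taken as the average throughput of ACE state into the structure and L_ace as its average residency (the usual Little's Law interpretation), a short sketch follows. The throughput, latency, and size values are hypothetical.

```python
def little_n_ace(throughput_ace, latency_ace):
    """Little's Law: average ACE occupancy = ACE throughput x ACE residency."""
    return throughput_ace * latency_ace


def avf_from_little(throughput_ace, latency_ace, n_total):
    """AVF = average ACE occupancy / total capacity of the structure."""
    return little_n_ace(throughput_ace, latency_ace) / n_total


# Hypothetical values: 0.25 ACE entries arrive per cycle, each stays 6 cycles,
# in a structure with 4 entries.
print(little_n_ace(0.25, 6))
print(avf_from_little(0.25, 6, n_total=4))
```

The appeal of this form is that throughput and residency are quantities a performance model already tracks, so AVF falls out of existing statistics.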

Computing AVF

Our approach is conservative
  We assume every bit is ACE unless proven otherwise

Data Analysis
  Try to prove that data held in a structure is un-ACE

Timing Analysis
  Tracks the time this data spent in the structure

Computing FIT rate of a Chip

$$\text{Total FIT} = \sum_i \left( \text{FIT per bit}_i \times \text{\# of bits}_i \times \mathrm{AVF}_i \right)$$

Structure            FIT per bit   # of bits   AVF   Total FIT
Branch Predictor     .001*         1K          0     0
Program Counter      .001*         64          1     0.064
Instruction Queue    .001*         6400        ?     ?
Functional Units     .001*         4000        ?     ?
…                                                    …

Total FIT of whole chip = sum of the Total FIT column

* Intrinsic FIT per bit from externally published data

Results: Experimental Setup

Used ASIM modeling infrastructure
Model of an Itanium®2-like processor
Ran all Spec2K benchmarks
  Compiled with highest level of optimization with the Intel electron compiler
Simulated under a full OS
Simulation points chosen using SimPoint (Sherwood et al.)

Instruction Queue

ACE percentage = AVF = 29%

IDLE                31%
ACE                 29%
NOP                 15%
Ex-ACE              10%
DYNAMICALLY DEAD     8%
WRONG PATH           3%
PREDICATED FALSE     3%
PERFORMANCE INST     1%

Functional Units

ACE percentage = AVF = 9%

UNIT IDLE           77%
ACE                  9%
NOP                  6%
DYNAMICALLY DEAD     4%
SPECULATIVE ISSUE    1%
PREDICATED FALSE     1%
WRONG PATH           1%
DATAPATH IDLE        1%
PERFORMANCE INST     0%
LOGICAL MASKING      0%

Computing FIT rate of Chip

Structure            FIT per bit   # of bits   AVF   Total FIT
Branch Predictor     .001*         1K          0     0
Program Counter      .001*         64          1     0.064
Instruction Queue    .001*         6400        .29   1.856
Functional Units     .001*         4000        .09   0.360
…                                                    …

Total FIT of whole chip = sum of the Total FIT column

* Intrinsic FIT per bit from externally published data
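The table arithmetic can be reproduced as a sum of per-structure products. This sketch covers only the four structures listed on the slide, so its total is a partial sum and not the FIT of the whole chip (the table elides the remaining rows). Treating "1K" as 1024 bits is an assumption; it does not affect the result because that row has an AVF of 0.

```python
# (structure, intrinsic FIT per bit, number of bits, AVF) as listed on the slide
structures = [
    ("Branch Predictor", 0.001, 1024, 0.0),   # "1K" read as 1024 bits (assumption)
    ("Program Counter", 0.001, 64, 1.0),
    ("Instruction Queue", 0.001, 6400, 0.29),
    ("Functional Units", 0.001, 4000, 0.09),
]


def structure_fit(fit_per_bit, num_bits, avf):
    """FIT contribution of one structure."""
    return fit_per_bit * num_bits * avf


def total_fit(rows):
    """Sum of the per-structure FIT contributions."""
    return sum(structure_fit(fit, bits, avf) for _, fit, bits, avf in rows)


for name, fit, bits, avf in structures:
    print(f"{name:<18} {structure_fit(fit, bits, avf):.3f}")
print(f"{'Listed structures':<18} {total_fit(structures):.3f}")
```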

Summary

Determine which bits matter
  ACE (Architecturally Correct Execution)

Compute FIT rate
  Intrinsic FIT per bit
  AVF (Architectural Vulnerability Factor)

Questions?

Statistical Fault Injection (SFI) Algorithm

Find a statistically significant set of bits
Randomly select a bit
Flip the bit
Run two simulations: one with bit flip and one without bit flip
Run for pre-defined # cycles
Compare architectural state of two simulations (e.g., register file)
If mismatch, declare an error
Repeat algorithm with different bit flip
AVF = # mismatches observed / total # experiments

Used widely and has provided useful AVF numbers to date
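The steps above can be sketched on a toy model. Everything here is a hypothetical stand-in for a real RTL simulation: the four 8-bit registers, the `simulate` function, and the choice of a single output value as the architectural state are assumptions made so the example runs on its own. In this toy only the low 4 bits of register 0 reach the output, so the exact AVF is 4/32.

```python
import random

NUM_REGS = 4
BITS_PER_REG = 8


def simulate(flip=None, cycles=10):
    """Toy stand-in for a simulation run; returns the architectural state."""
    regs = [0x5A, 0x3C, 0xFF, 0x00]
    if flip is not None:
        reg, bit = flip
        regs[reg] ^= 1 << bit                      # inject the bit flip
    out = 0
    for _ in range(cycles):                        # run for a pre-defined number of cycles
        out = (out + (regs[0] & 0x0F)) & 0xFF      # only the low nibble of reg 0 matters
    return out


def estimate_avf(num_experiments, seed=0):
    """AVF estimate = mismatches observed / total experiments."""
    rng = random.Random(seed)
    golden = simulate()                            # run without a bit flip
    mismatches = 0
    for _ in range(num_experiments):
        flip = (rng.randrange(NUM_REGS), rng.randrange(BITS_PER_REG))
        if simulate(flip) != golden:               # compare architectural state
            mismatches += 1
    return mismatches / num_experiments


print(estimate_avf(1000))
```

With enough experiments the estimate approaches 0.125, which illustrates why a large number of runs is needed for statistical significance.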

SFI vs. ACE analysis

Accuracy of Microarchitectural un-ACE
  SFI: Better than ACE analysis
  ACE: Conservative

Accuracy of Architectural un-ACE
  SFI: Conservative
  ACE: Better than SFI (e.g., covers dynamically dead instructions)

Insight
  SFI: Per-structure insights harder
  ACE: Little’s Law & per-structure breakdown easier

# of experiments
  SFI: Large # required to be statistically significant
  ACE: Small # of experiments can give good accuracy
