Page 1: Understanding the Propagation of Hard Errors to Software and Implications for Resilient System Design

M. Li, P. Ramachandra, S.K. Sahoo, S.V. Adve, V.S. Adve, Y. Zhou (UIUC), ASPLOS’08

Shimin Chen
LBA Reading Group Presentation

Page 2: Introduction

Hardware reliability
• Aging/wear-out
• Infant mortality (insufficient burn-in)
• Soft errors (radiation)
• Design defects

Willing to pay 10% area overhead for reliability
• Industry panel discussion in SELSE II
• Conventional dual modular redundancy too costly

How?

Page 3: Two Observations

• Only need to handle observable device faults
    • Faults that propagate through higher levels of the system and are observable by software
• Fault-free operation is the common case
    • Must be optimized
    • Willing to accept increased overhead after a fault is detected

Page 4: Proposals: Cooperative HW-SW

• Detect high-level anomalous SW behavior (symptoms of faults)
• Checkpoint/replay + diagnosis components

(For mission-critical systems, may incorporate previous backup detection techniques)

Page 5: Potential Advantages

• Generality: oblivious to numerous failure mechanisms and microarchitectures
• Ignoring masked faults
• Optimizing for the common case
• Customizability: which action to take upon a fault?
• Amortizing overhead across other system functions
    • Reuse online SW bug detection support

Page 6: Investigation in This Paper

Questions to answer:
• Coverage: What HW faults produce detectable anomalous SW behavior with high probability?
• Latency: What is the fault detection latency?
• Impact on OS:
    • How frequently is OS state corrupted by HW faults?
    • Detection coverage and latency for such faults?

Focus on permanent faults (increasingly important)

Methodology: Fault-injection study using simulations

Page 7: Major Results

• Detection coverage: most permanent faults that propagate to SW are easily detectable
• Detection latency: <= 100K instructions in 86% of cases
• Impact on OS: faults often corrupt OS state

Page 8: Outline

• SWAT System Assumptions
• Methodology
• Results
• Implications for Resilient System Design

Page 9: SWAT (SoftWare Anomaly Treatment)

The investigation assumes the following context:
• Always-on SW symptom-based detection
• A multicore system with at least one fault-free core
• Checkpoint/replay mechanism
    • Replay when a fault is detected
    • If the anomalous behavior is deterministic, it is a HW fault; recover using a fault-free core
    • Otherwise ignore it (transient)
• HW has the ability to repair or reconfigure around permanent faults
• Firmware-controlled diagnosis and recovery keep HW errors from becoming externally visible
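
The replay rule above can be sketched in a few lines. This is only an illustration of the decision as the slide states it, not the SWAT firmware: `replay_from_checkpoint` is a hypothetical callback that re-executes from the last checkpoint and reports whether the symptom recurs.

```python
from typing import Callable


def diagnose(replay_from_checkpoint: Callable[[], bool]) -> str:
    """Classify a detected symptom by replaying from the last checkpoint.

    replay_from_checkpoint is a hypothetical callback (an assumption of this
    sketch): it re-executes from the checkpoint and returns True if the same
    anomalous behavior shows up again.
    """
    if replay_from_checkpoint():
        # Deterministic symptom: treat as a HW fault and
        # recover on a fault-free core.
        return "permanent"
    # Symptom did not recur: treat as transient and ignore it.
    return "transient"
```

For example, `diagnose(lambda: True)` classifies the symptom as permanent, while `diagnose(lambda: False)` classifies it as transient.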

Page 10: Outline

• SWAT System Assumptions
• Methodology
• Results
• Implications for Resilient System Design

Page 11: Simulation Environment

• Virtutech Simics + Wisconsin GEMS microarchitectural and memory timing simulators
• SPARC V9 ISA; 6 SpecInt2000 and 4 SpecFP2000 benchmarks
• OS activity < 1% for fault-free runs

Page 12: Fault Injection

Timing-first approach in GEMS
• Cycle-accurate GEMS timing simulator
• Simics functional simulator
• Compare and set GEMS state based on Simics state (so GEMS can skip support for some rare instructions)

Fault injection
• Inject the fault into the GEMS timing simulator
• If the mismatched state is due to fault injection, corrupt the Simics state

Activated fault vs. architecturally masked fault:
• Does the GEMS state mismatch the Simics state?

OS or user mode?
• Check the privilege mode

Page 13: Fault Model: Permanent Faults

• Stuck-at-0: a bit is always 0
• Stuck-at-1: a bit is always 1
• Dominant-0: acts like a logical AND between adjacent faulty bits
• Dominant-1: acts like a logical OR between adjacent faulty bits
• Dominant-x is a.k.a. a bridging fault
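
As a rough illustration, the four fault models can be written as bit operations on an integer word. The function names are invented here, and the bridging variants assume that both bits of the adjacent pair (`bit` and `bit + 1`) take the combined value; the slide does not say which bits are driven.

```python
def stuck_at_0(value: int, bit: int) -> int:
    """Force the given bit to 0."""
    return value & ~(1 << bit)


def stuck_at_1(value: int, bit: int) -> int:
    """Force the given bit to 1."""
    return value | (1 << bit)


def _bridge(value: int, bit: int, use_or: bool) -> int:
    """Bridge bits `bit` and `bit + 1` (assumption: both take the result)."""
    a = (value >> bit) & 1
    b = (value >> (bit + 1)) & 1
    merged = (a | b) if use_or else (a & b)
    cleared = value & ~(0b11 << bit)
    return cleared | (merged << bit) | (merged << (bit + 1))


def dominant_0(value: int, bit: int) -> int:
    """Bridging fault acting like a logical AND of two adjacent bits."""
    return _bridge(value, bit, use_or=False)


def dominant_1(value: int, bit: int) -> int:
    """Bridging fault acting like a logical OR of two adjacent bits."""
    return _bridge(value, bit, use_or=True)
```

With `0b0110` bridged at bit 0, `dominant_0` yields `0b0100` and `dominant_1` yields `0b0111`. When the two adjacent bits already agree, either bridging fault leaves the word unchanged.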

Page 15: Number of Injected Faults

• 10 benchmarks
• 40 random points per benchmark after initialization
• 4 fault models
• 8 microarchitectural structures

Total = 10 x 40 x 4 x 8 = 12,800
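
The total is the cross product of the four dimensions. A small sketch of enumerating such a campaign follows; the benchmark and structure names are placeholders, since this slide does not list them.

```python
from itertools import product

# Placeholder names (assumptions): the slide gives only the counts.
BENCHMARKS = [f"benchmark_{i}" for i in range(10)]
INJECTION_POINTS = range(40)
FAULT_MODELS = ["stuck-at-0", "stuck-at-1", "dominant-0", "dominant-1"]
STRUCTURES = [f"structure_{i}" for i in range(8)]


def build_campaign():
    """Return one (benchmark, point, model, structure) tuple per injection."""
    return list(product(BENCHMARKS, INJECTION_POINTS, FAULT_MODELS, STRUCTURES))


campaign = build_campaign()
print(len(campaign))  # 10 * 40 * 4 * 8 = 12800
```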

Page 16: Fault Detection

• Run 10 million instructions with detailed simulation
• If no SW symptom is detected, run fast simulation to finish the benchmark and check for corruption

Page 18: Fatal HW Trap

Typically not thrown during a correct execution

SPARC:
• Data access exception
• Division by zero
• Illegal instruction
• Memory misaligned
• RED (Reset, Error, and Debug) state (too many nested traps)
• Watchdog reset (no instruction retires in the last 65536 cycles)
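
The watchdog condition amounts to a counter of consecutive cycles without a retired instruction. The class below is a minimal software model of that rule, not the hardware mechanism; the class and method names are invented for illustration.

```python
WATCHDOG_CYCLES = 65536  # limit quoted on the slide


class Watchdog:
    """Counts consecutive cycles in which no instruction retires."""

    def __init__(self, limit: int = WATCHDOG_CYCLES):
        self.limit = limit
        self.idle_cycles = 0

    def tick(self, retired: int) -> bool:
        """Advance one cycle; return True when the watchdog should fire.

        `retired` is the number of instructions retired in this cycle.
        """
        if retired > 0:
            self.idle_cycles = 0
        else:
            self.idle_cycles += 1
        return self.idle_cycles >= self.limit
```

A simulator loop would call `tick` once per cycle and raise the watchdog reset trap the first time it returns True.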

Page 19: Abnormal Application Exit

• Application may have a seg fault or assertion failure
• OS knows the exit status
• In simulation, look for the OS idle loop as an indication of such an exit

Page 20: Hangs

• During the 10 million instructions:
    • Keep a counter per observed branch PC
    • Increment the counter each time that branch executes
    • If any counter exceeds 100,000 (or 1% of the total instructions), flag a hang
• Profile the fault-free executions and mask out the handful of branches that exceed this threshold even without a fault
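
The per-branch counting heuristic can be sketched as follows. This is a simplified model under stated assumptions: the names are invented, and `masked_pcs` stands for the branches excluded after profiling fault-free runs.

```python
from collections import Counter

HANG_THRESHOLD = 100_000  # 1% of the 10M-instruction window


class HangDetector:
    """Flags a hang when one branch PC executes more than `threshold` times."""

    def __init__(self, threshold: int = HANG_THRESHOLD, masked_pcs=frozenset()):
        self.threshold = threshold
        self.masked_pcs = frozenset(masked_pcs)  # hot branches from profiling
        self.counts = Counter()

    def on_branch(self, pc: int) -> bool:
        """Record one executed branch; return True if a hang is flagged."""
        if pc in self.masked_pcs:
            return False
        self.counts[pc] += 1
        return self.counts[pc] > self.threshold
```

A branch stuck in an infinite loop crosses the threshold quickly, while masking the known hot branches keeps legitimate long-running loops from raising false alarms.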

Page 21: High OS Activity

• Amount of time the execution remains in the OS
• Typically control returns to user mode within a few tens of instructions, except for:
    • A timer interrupt after a quantum expires (< 10,000 instructions)
    • System calls (could be 100K to 1 million instructions)
• Detection threshold:
    • Over 30,000 contiguous OS instructions
    • But not in a system call
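
The High-OS threshold can be modeled as a run-length counter over retired instructions. The sketch below assumes the simulator can tell, per instruction, whether it ran in privileged mode and whether it belongs to a system call; both flags and all names are assumptions of this example.

```python
HIGH_OS_THRESHOLD = 30_000  # contiguous OS instructions, from the slide


class HighOSDetector:
    """Flags a symptom after too many contiguous non-syscall OS instructions."""

    def __init__(self, threshold: int = HIGH_OS_THRESHOLD):
        self.threshold = threshold
        self.run_length = 0

    def on_retire(self, privileged: bool, in_syscall: bool) -> bool:
        """Record one retired instruction; return True if High-OS is flagged."""
        if not privileged or in_syscall:
            # User-mode code and system calls are exempt and reset the run.
            self.run_length = 0
            return False
        self.run_length += 1
        return self.run_length > self.threshold
```

Resetting the counter during system calls is one simple reading of "but not in a system call"; a real implementation might instead pause the counter.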

Page 22: Metrics

• Coverage:
    • Masked faults: architecture + application
• Detection latency:
    • Total number of instructions retired from the first architectural state corruption until the fault is detected, within the 10M-instruction window

Page 23: Outline

• SWAT System Assumptions
• Methodology
• Results
• Implications for Resilient System Design

Page 24: How do faults manifest in SW?

FPU excluded

Page 26: What are masked faults?

Stuck-at faults:
• Register file: unused physical register
• RAT: unused logical register
• FPU: integer benchmarks

Bridging faults:
• Upper 32 bits in 64-bit operations
• Often sign extensions: all 1s or all 0s
• In SW, small data sizes
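
A short check makes the bridging case concrete: when a 32-bit value is sign-extended to 64 bits, every upper bit is identical, so ANDing or ORing adjacent upper bits changes nothing. The helper names are invented, and the bridge model assumes both bits of the pair take the combined value.

```python
MASK64 = (1 << 64) - 1


def sign_extend_32_to_64(x: int) -> int:
    """Sign-extend the low 32 bits of x to an unsigned 64-bit pattern."""
    x &= 0xFFFFFFFF
    if x & 0x80000000:
        return (x | ~0xFFFFFFFF) & MASK64
    return x


def bridge(value: int, bit: int, use_or: bool) -> int:
    """Bridge bits `bit` and `bit + 1` with OR (dominant-1) or AND (dominant-0)."""
    a = (value >> bit) & 1
    b = (value >> (bit + 1)) & 1
    merged = (a | b) if use_or else (a & b)
    cleared = value & ~(0b11 << bit)
    return cleared | (merged << bit) | (merged << (bit + 1))


def masked_in_upper_half(x: int) -> bool:
    """True if no bridging fault in bits 32..63 changes the extended word."""
    word = sign_extend_32_to_64(x)
    return all(
        bridge(word, bit, use_or) == word
        for bit in range(32, 63)
        for use_or in (False, True)
    )
```

Both `masked_in_upper_half(5)` and `masked_in_upper_half(0xFFFFFFFB)` (the 32-bit pattern for -5) hold, whereas a genuine 64-bit value such as `1 << 32` is corrupted by a dominant-0 bridge at bit 32.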

Page 27: Large number of detections in OS

Although OS activity is low, over 65% of detected faults are detected through symptoms from the OS

Why?
• A fault in user mode often results in a memory access to a cold address, causing a TLB miss
• SPARC TLB misses are software managed
• The OS trap handler runs on the same faulty HW
• The OS is more control and memory intensive
• Often results in corrupted OS state

Page 28: Fatal HW Traps

• Illegal instruction traps: opcode bit changes result in an illegal opcode (decoder)
• Watchdog timer reset: no instructions retired for over 65536 cycles
    • ROB or RAT errors: register source and destination dependences are corrupted, resulting in an indefinite wait
• Misaligned accesses: memory addresses are wrong
• Red state exception: over 4 nested traps

Page 29: High-OS

• OS trap handling: TLB miss
    • A permanent HW fault corrupts the TLB handler, so the code never returns to user mode
• Significant overlap with fatal traps and hangs
    • High-OS detects 30% of the faults
    • Removing it reduces coverage by 15%
    • Many cases eventually lead to fatal traps or hangs
    • But detecting High-OS reduces latency

Page 30: Others

• Application aborts: 1% coverage
• Hangs: 3% coverage
    • Mostly in the application
    • Because OS hangs are often detected first as High-OS
    • E.g., a loop index variable is wrong, so the loop never terminates

Page 31: Undetected Faults

• All structures except the FPU: 0.8% of injected faults result in silent data corruption
• FPU: 10% of faults result in silent data corruption
    • Why? FPU results rarely affect memory addresses or program control

Page 32: Which SW components are corrupted?

• Need to checkpoint OS

• "None" case: watchdog reset trap; the first instruction in the ROB is blocked

Page 33: Detection latency

• Application state corruption
• OS state corruption

Page 34: Latency from App State Corruption

• Some combination of SW and HW checkpointing schemes is needed

Page 35: Latency from OS State Corruption


• HW checkpointing schemes may be sufficient

Page 36: Transient Faults Have Different Characteristics

• 94% are architecturally masked within the 10M-instruction window
• 3.4% are detected in the 10M-instruction window
• 1.2% are masked by applications
• 1.3% eventually result in detectable symptoms
• Only 0.1% of the total injections result in silent data corruption

Page 37: Outline

• SWAT System Assumptions
• Methodology
• Results
• Implications for Resilient System Design

Page 38: Detection

• A majority of permanent faults that propagate to SW are detectable through low-cost monitoring of simple symptoms
• Preliminary experiments show that the use of value-based invariants can significantly improve latency and coverage
• FPU: use more HW mechanisms

Page 39: Recovery

• OS recovery is necessary
    • HW recovery mechanisms (e.g. ReVive, SafetyNet) may be sufficient
• Application recovery requires SW checkpoints