SWAT: Designing Resilient Hardware by Treating Software Anomalies

Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup K. Sahoo, Siva Kumar Sastry Hari, Rahmet Ulya Karpuzcu, Sarita Adve, Vikram Adve, Yuanyuan Zhou

Department of Computer Science, University of Illinois at Urbana-Champaign

[email protected]



Motivation

• Hardware failures will happen in the field
– Aging, soft errors, inadequate burn-in, design defects, …
  Need in-field detection, diagnosis, recovery, repair
• Reliability problem pervasive across many markets
– Traditional redundancy (e.g., nMR) too expensive
– Piecemeal solutions for specific fault model too expensive
– Must incur low area, performance, power overhead

Today: low-cost solution for multiple failure sources

Observations

• Need to handle only hardware faults that propagate to software
• Fault-free case remains common, must be optimized

Watch for software anomalies (symptoms)
– Hardware fault detection ~ Software bug detection
– Zero to low overhead “always-on” monitors

Diagnose cause after symptom detected
– May incur high overhead, but rarely invoked

SWAT: SoftWare Anomaly Treatment

SWAT Framework Components

• Detection: Symptoms of S/W misbehavior, minimal backup H/W
• Recovery: Hardware/Software checkpoint and rollback
• Diagnosis: Rollback/replay on multicore
• Repair/reconfiguration: Redundant, reconfigurable hardware
• Flexible control through firmware

[Figure: timeline between two checkpoints showing Fault, Error, and Symptom detected, followed by Recovery, Diagnosis, and Repair]

SWAT

1. Detectors w/ Hardware support [ASPLOS ‘08]
2. Detectors w/ Software support [Sahoo et al., DSN ‘08]
3. Trace Based Fault Diagnosis [Li et al., DSN ‘08]
4. Accurate Fault Modeling

[Figure: the same checkpoint timeline (Fault, Error, Symptom detected, Recovery, Diagnosis, Repair), annotated with the four items above]

Hardware-Only Symptom-based detection

• Observe anomalous symptoms for fault detection
– Incur low overheads for “always-on” detectors
– Minimal support from hardware
• Fatal traps generated by hardware
– Division by Zero, RED State, etc.
• Hangs detected using simple hardware hang detector
• High OS activity detected with performance counter
– Typical OS invocations take 10s or 100s of instructions
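The high-OS-activity symptom can be pictured as a counter of contiguous OS instructions compared against a threshold. The following is a minimal sketch only: the threshold value, the function name, and the per-instruction mode stream are assumptions made for illustration and do not come from the slides.

```python
from typing import Iterable, Optional

# Hypothetical threshold: the slides only say that typical OS invocations
# take tens or hundreds of instructions, so a much larger run is suspicious.
HIGH_OS_THRESHOLD = 10_000


def detect_high_os(modes: Iterable[str],
                   threshold: int = HIGH_OS_THRESHOLD) -> Optional[int]:
    """Return the index of the instruction where the symptom fires, or None.

    `modes` yields "os" or "app" per retired instruction, standing in for a
    performance counter that tracks contiguous privileged instructions.
    """
    contiguous_os = 0
    for index, mode in enumerate(modes):
        if mode == "os":
            contiguous_os += 1
            if contiguous_os > threshold:
                return index
        else:
            contiguous_os = 0  # back in the application: reset the counter
    return None
```

A short system call (a few hundred OS instructions) never trips the detector, while a fault that traps the processor in the OS eventually does.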

Experimental Methodology

• Microarchitecture-level fault injection
– GEMS timing models + Simics full-system simulation
– SPEC workloads on Solaris-9 OS
• Permanent fault models
– Stuck-at, bridging faults in latches of 8 µarch structures
– 12,800 faults, <0.3% error @ 95% confidence
• Simulate impact of fault in detail for 10M instructions

[Figure: after the fault is injected, timing simulation runs for 10M instr; if no symptom appears in 10M instr, functional simulation runs to completion; the outcome is then app masked, symptom > 10M, or silent data corruption (SDC)]
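The slides do not say how the "<0.3% error @ 95% confidence" figure was derived. One plausible reading, assumed here, is the normal-approximation margin of error for a proportion measured over 12,800 injections. The function name and the choice of proportions below are illustrative.

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation margin of error for a proportion p over n trials.

    z = 1.96 corresponds to a 95% confidence level.
    """
    return z * math.sqrt(p * (1.0 - p) / n)


n = 12_800
# Near the ~98% coverage reported later in the talk (assumed operating point).
print(f"p = 0.98: +/- {margin_of_error(0.98, n):.2%}")
# Worst case for any proportion is p = 0.5.
print(f"p = 0.50: +/- {margin_of_error(0.50, n):.2%}")
```

Under this assumption the bound is below 0.3% for proportions near 98%, but not for a worst-case proportion of 50%, so the quoted figure presumably refers to the measured coverage-like rates.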

Efficacy of Hardware-only Detectors

• Coverage: Percentage of unmasked faults detected
– 98% faults detected, 0.4% give SDC (w/o FPU)
  Additional support required for FPU-like units
– 66% of detected faults corrupt OS state, need recovery
  Despite low OS activity in fault-free execution
• Latency: Number of instr between activation and detection
– HW recovery for up to 100k instr, SW for longer latencies
– App in 87% of detections recoverable using HW
– OS recoverable in virtually all detections using HW
  OS recovery using SW hard

Improving SWAT Detection Coverage

Can we improve coverage, SDC rate further?

• SDC faults primarily corrupt data values
– Illegal control/address values caught by other symptoms
– Need detectors to capture “semantic” information
• Software-level invariants capture program semantics
– Use when higher coverage desired
– Sound program invariants need expensive static analysis
– We use likely program invariants

Likely Program Invariants

• Likely program invariants
– Hold on all observed inputs, expected to hold on others
– But suffer from false positives
– Use SWAT diagnosis to detect false positives on-line
• iSWAT - Compiler-assisted symptom detectors
– Range-based value invariants [Sahoo et al. DSN ‘08]
– Check MIN ≤ value ≤ MAX on data values
– Disable invariant when diagnosis finds a false positive
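A range-based likely invariant can be sketched as a small object that learns MIN and MAX during training and checks them afterwards. This is an illustrative sketch, not the iSWAT implementation: the class and method names are hypothetical, and the real system inserts the monitoring and checking code with a compiler pass.

```python
class RangeInvariant:
    """Likely invariant of the form MIN <= value <= MAX at one program point."""

    def __init__(self):
        self.lo = None
        self.hi = None
        self.enabled = True

    def train(self, value):
        """Training phase: widen the range to cover every observed value."""
        self.lo = value if self.lo is None else min(self.lo, value)
        self.hi = value if self.hi is None else max(self.hi, value)

    def violated(self, value):
        """Detection phase: True if an enabled invariant does not hold."""
        if not self.enabled or self.lo is None:
            return False
        return not (self.lo <= value <= self.hi)

    def disable(self):
        """Called when diagnosis decides the violation was a false positive."""
        self.enabled = False


inv = RangeInvariant()
for observed in (3, 7, 5):       # values seen on training inputs
    inv.train(observed)

print(inv.violated(6))           # inside [3, 7]: no symptom
print(inv.violated(4096))        # outside the trained range: symptom raised
inv.disable()                    # diagnosis found a false positive
print(inv.violated(4096))        # disabled invariants no longer fire
```

Because the range only reflects the training inputs, a legitimate value outside it raises a false positive, which is why the slides pair the check with diagnosis and invariant disabling.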

iSWAT implementation

[Figure: iSWAT flow in two phases. Training phase: a compiler pass in LLVM adds invariant monitoring code to the application; runs on test, train, and external inputs produce ranges for inputs #1 … #n, yielding the invariant ranges. Fault detection phase: a second compiler pass in LLVM adds invariant checking code; the application runs on the ref input under full-system simulation with injected faults; an invariant violation invokes SWAT diagnosis, which reports either a false positive (disable invariant) or a fault detection.]

iSWAT Results

• Explored SWAT with 5 apps on previous methodology
• Undetected faults reduce by 30%
• Invariants reduce SDCs by 73% (33 to 9)
• Overheads: 5% on x86, 14% on UltraSparc IIIi
– Reasonably low overheads on some machines
– Un-optimized invariants used, can be further reduced
• Exploring more sophistication for coverage, overheads

Fault Diagnosis

• Symptom-based detection is cheap but
– High latency from fault activation to detection
– Difficult to diagnose root cause of fault
– How to diagnose SW bug vs. transient vs. permanent fault?
• For permanent fault within core
– Disable entire core? Wasteful!
– Disable/reconfigure µarch-level unit?
– How to diagnose faults to µarch unit granularity?
• Key ideas
– Single core fault model on a multicore, so a fault-free core is available
– Checkpoint/replay already exists for recovery, so replay on good core and compare
– Synthesizing DMR, but only for diagnosis

SW Bug vs. Transient vs. Permanent

• Rollback/replay on same/different core
• Watch if symptom reappears

[Figure: decision tree]
Symptom detected: rollback on faulty core
– No symptom: transient or non-deterministic s/w bug; continue execution
– Symptom: deterministic s/w or permanent h/w bug; rollback/replay on good core
   – No symptom: permanent h/w fault, needs repair!
   – Symptom: false positive (iSWAT) or deterministic s/w bug, send to s/w layer
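The rollback/replay decision can be sketched as a small function. It assumes, as the single-core fault model implies, that a symptom recurring on the fault-free core cannot be caused by the original core's hardware. The function name, the callables, and the returned strings are hypothetical.

```python
from typing import Callable


def classify_symptom(replay_on_faulty_core: Callable[[], bool],
                     replay_on_good_core: Callable[[], bool]) -> str:
    """Classify a detected symptom by replaying from the last checkpoint.

    Each callable rolls back, replays, and returns True if the symptom
    reappears. The good-core replay only runs when it is needed.
    """
    if not replay_on_faulty_core():
        # The symptom vanished on the same core: nothing persistent.
        return "transient fault or non-deterministic s/w bug: continue execution"
    if replay_on_good_core():
        # Fault-free hardware shows the same symptom: software is responsible.
        return "deterministic s/w bug or iSWAT false positive: send to s/w layer"
    # Recurs only on the original core: the hardware there is broken.
    return "permanent h/w fault: needs repair"


print(classify_symptom(lambda: True, lambda: False))
```

Passing callables rather than booleans mirrors the ordering in the slide: the second replay is only paid for when the first one reproduces the symptom.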

Diagnosis Framework

[Figure: Symptom detected leads to Diagnosis, which classifies the cause as software bug, transient fault, or permanent fault; a permanent fault leads to Microarchitecture-Level Diagnosis, which reports "Unit X is faulty"]

Trace-Based Fault Diagnosis (TBFD)

[Figure: Permanent fault detected, invoke TBFD. Faulty core: rollback faulty core to checkpoint, replay execution, collect info. Fault-free core: load checkpoint on fault-free core, fault-free instruction exec. The two executions are compared (=?) and the result feeds the diagnosis algorithm.]

• What info to collect?
• What info to compare?
• What to do on divergence?

Can a Divergent Instruction Lead to Diagnosis?

Simpler case: ALU fault

Faulty          Fault-free      HW used (dec, alu, dst preg)   results
add r1,r3,r5    add r1,r3,r5    0, 1, 12                       5 vs. 3 (mismatch)
sub r6,r1,r2    sub r6,r1,r2    2, 1, 7                        2 vs. 9 (mismatch)

Both divergent instructions used same ALU, so ALU1 is faulty
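The reasoning in the ALU example amounts to intersecting the hardware used by the divergent instructions. The sketch below assumes one reading of the slide's table (decoder, ALU, and destination physical register per instruction); the function name and data layout are hypothetical.

```python
def common_hardware(divergent_instructions):
    """Return the (unit, instance) pairs shared by every divergent instruction.

    Each element maps a unit type to the instance that instruction used.
    """
    if not divergent_instructions:
        return set()
    shared = set(divergent_instructions[0].items())
    for hw_used in divergent_instructions[1:]:
        shared &= set(hw_used.items())
    return shared


# Hardware used by the two divergent instructions, as read from the slide.
add_hw = {"dec": 0, "alu": 1, "dst_preg": 12}   # add r1,r3,r5
sub_hw = {"dec": 2, "alu": 1, "dst_preg": 7}    # sub r6,r1,r2

print(common_hardware([add_hw, sub_hw]))        # only ALU 1 is shared
```

With more divergent instructions the intersection shrinks further, which is what lets the diagnosis narrow the fault down to a single unit.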

Can a Divergent Instruction Lead to Diagnosis?

• Complex example: Fault in register alias table (RAT) entry
• Divergent instructions do not directly lead to faulty unit
• Instead, look backward/forward in instruction stream
– Need to collect and analyze instruction trace

[Figure: the RAT maps logical to physical registers (r1 to p4, r2 to p20, r3 to p13, r5 to p24) and the register file holds p20 = 4 and p24 = 3. IA: r3 ← r2 + r2 should rename r3 to p55, but the faulty RAT entry records p24 (error!), so p24 is overwritten with 8. IB: r1 ← r5 * r2 then reads p24 = 8 and writes p4 = 32, while fault-free r1 = 12. Diverged! But IB does not use faulty HW…]
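The RAT example can be replayed in a few lines to see why the divergent instruction is not the one that used the faulty hardware. This is a toy rename model under one reading of the figure (register numbers and values taken from the slide); the function name and structure are assumptions.

```python
def run_example(rat_entry_faulty: bool) -> int:
    """Execute IA and IB through a tiny rename stage and return r1."""
    rat = {"r1": "p4", "r2": "p20", "r3": "p13", "r5": "p24"}  # logical -> physical
    regfile = {"p20": 4, "p24": 3}                              # physical -> value

    # IA: r3 <- r2 + r2. Sources are read through the current mappings.
    result_a = regfile[rat["r2"]] + regfile[rat["r2"]]
    # Rename the destination: p55 is the intended free register, but the
    # faulty RAT entry records p24, which still holds r5's live value.
    rat["r3"] = "p24" if rat_entry_faulty else "p55"
    regfile[rat["r3"]] = result_a

    # IB: r1 <- r5 * r2. It uses no faulty hardware itself.
    regfile[rat["r1"]] = regfile[rat["r5"]] * regfile[rat["r2"]]
    return regfile[rat["r1"]]


print("fault-free r1 =", run_example(False))
print("faulty     r1 =", run_example(True))
```

IA produces the same value in both runs, so the first visible divergence is at IB, which is why the diagnosis has to look backward in the instruction trace.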

Diagnosing Permanent Fault to µarch Granularity

• Trace-based fault diagnosis (TBFD)
– Compare instruction trace of faulty vs. good execution
– Divergence points to the faulty hardware used, giving diagnosis clues
• Diagnose faults to µarch units of processor
– Check µarch-level invariants in several parts of processor
– Front end, Meta-datapath, datapath faults
– Diagnosis in out-of-order logic (meta-datapath) complex
• Results
– 98% of the faults detected by SWAT successfully diagnosed
– TBFD flexible for other detectors/granularity of repair
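The trace comparison step can be sketched as a walk over two instruction traces that records where results differ. This is a simplified illustration, not the TBFD algorithm from the talk: the trace format (instruction text plus result value) and the function name are assumptions.

```python
def divergent_indices(faulty_trace, good_trace):
    """Return the indices where the same instruction produced different results.

    Each trace is a list of (instruction, result) pairs. The walk stops at
    the first index where the instructions themselves differ, because after
    a control-flow divergence the two traces are no longer comparable.
    """
    divergent = []
    for index, (bad, good) in enumerate(zip(faulty_trace, good_trace)):
        bad_instr, bad_result = bad
        good_instr, good_result = good
        if bad_instr != good_instr:
            break
        if bad_result != good_result:
            divergent.append(index)
    return divergent


# Toy traces; the instructions and values are made up for illustration.
faulty = [("add r1,r3,r5", 5), ("ld r4,[r7]", 40), ("sub r6,r1,r2", 2)]
good = [("add r1,r3,r5", 3), ("ld r4,[r7]", 40), ("sub r6,r1,r2", 9)]

print(divergent_indices(faulty, good))
```

The returned indices identify the instructions whose hardware usage is then inspected for diagnosis clues.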


SWAT-Sim: Fast and Accurate Fault Models

• Need accurate µarch-level fault models
– Gate level injections accurate but too slow
– µarch (latch) level injections fast but inaccurate
• Can we achieve µarch-level speed at gate-level accuracy?
• Mixed-mode (hierarchical) simulation
– µarch-level + Gate-level simulation
– Simulate only faulty component at gate-level, on-demand
– Invoke gate-level sim online for permanent faults
  Simulating fault effect with real-world vectors

SWAT-Sim: Gate-level Accuracy at µarch Speeds

[Figure: µarch simulation of r3 ← r1 op r2. Faulty unit used? No: continue µarch simulation. Yes: the µarch-level simulation passes the inputs as stimuli to the gate-level fault simulation, which returns the response as the output r3 (fault propagated to output); µarch simulation then continues.]
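The on-demand hand-off between the two simulation levels can be sketched as a dispatcher that only calls the detailed model when an instruction uses the faulty unit. This is a toy illustration: the function names and fault description are hypothetical, and the "gate-level" model here simply forces one output bit, whereas a real gate-level simulation would place the fault on an internal net of the netlist.

```python
MASK = 0xFFFFFFFF  # 32-bit datapath, assumed for the example


def uarch_add(a: int, b: int) -> int:
    """Fast functional model used for fault-free units."""
    return (a + b) & MASK


def gate_level_add(a: int, b: int, bit: int, stuck_value: int) -> int:
    """Stand-in for gate-level fault simulation of the faulty adder.

    Simplification: the stuck-at fault is applied to one output bit.
    """
    result = (a + b) & MASK
    if stuck_value:
        return result | (1 << bit)
    return result & ~(1 << bit) & MASK


def execute_add(unit: str, a: int, b: int, faulty_unit: str, fault: dict) -> int:
    """Mixed-mode step: use the detailed model only when the faulty unit is used."""
    if unit != faulty_unit:
        return uarch_add(a, b)
    return gate_level_add(a, b, fault["bit"], fault["stuck_value"])


fault = {"bit": 3, "stuck_value": 1}
print(execute_add("alu0", 1, 2, "alu1", fault))  # fault-free unit, fast path
print(execute_add("alu1", 1, 2, "alu1", fault))  # faulty unit, detailed path
print(execute_add("alu1", 8, 0, "alu1", fault))  # fault present but masked
```

The third call shows why activation depends on the actual operands, which is the point of driving the detailed model with real-world vectors.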

Results from SWAT-Sim

• SWAT-Sim implemented within full-system simulation
– NCVerilog + VPI for gate-level sim of ALU/AGEN modules
• SWAT-Sim: High accuracy at low overheads
– 100,000x faster than gate-level, same modeling fidelity
– 2x slowdown over µarch-level, at higher accuracy
• Accuracy of µarch models using SWAT coverage/latency
– µarch stuck-at models generally inaccurate
– Differences in activation rate, multi-bit flips
• Complex manifestations, so hard to derive better models
– Need SWAT-Sim, at least for now

SWAT Summary

• SWAT: SoftWare Anomaly Treatment
– Handle all and only faults that matter
– Low, amortized overheads
– Holistic systems view enables novel solutions
– Customizable and flexible
• Prior results:
– Low-cost h/w detectors gave high coverage, low SDC rate
• This talk:
– iSWAT: Higher coverage w/ software-assisted detectors
– TBFD: µarch level fault diagnosis by synthesizing DMR
– SWAT-Sim: Gate-level fault accuracy at µarch level speed

Future Work

• Recovery: hybrid, application-specific
• Aggressive use of software reliability techniques
– Leverage diagnosis mechanism
• Multithreaded software
• Off-core faults
• Post-silicon debug and test
– Use faulty trace as fault-model oblivious test vector
• Validation on FPGA (w/ Michigan)
• Hardware assertions to complement software symptoms


BACKUP SLIDES

Breakup of Detections by SW symptoms

[Figure: stacked bars of total injections (0% to 100%) for Decoder, Int ALU, Reg Dbus, Int reg, ROB, RAT, AGEN, FP ALU, and Avg no FP, broken down into SDC, Symp>10M, High-OS, Hang-App, Hang-OS, FatalTrap-App, FatalTrap-OS, App-Mask, and Arch-Mask. Bar labels as extracted: 100%, 98%, 98%, 96%, 100%, 100%, 95%, 98%, 27%.]

• 98% unmasked faults detected within 10M instr (w/o FPU)
– Need HW support or SW monitoring for FPU

SW Components Corrupted

• 66% of faults corrupt system state before detection
– Need to recover system state

[Figure: stacked bars of percentage of injections (0% to 100%) for Decoder, INT ALU, Reg Dbus, Int reg, ROB, RAT, AGEN, and FP ALU, broken down into None, OS and maybe app, and App only]

Latency from Application mismatch

[Figure: stacked bars (0% to 100%) for Decoder, INT ALU, Reg Dbus, Int reg, ROB, RAT, AGEN, and FP ALU, broken down by detection latency in instructions: 10000000, 1000000, 100000, 10000, 1000, 100, 10, 1]

• 86% of faults detected under 100k
– 42% detected under 10k

Latency from OS mismatch

[Figure: stacked bars (0% to 100%) for Decoder, INT ALU, Reg Dbus, Int reg, ROB, RAT, AGEN, and FP ALU, broken down by detection latency in instructions: 10000000, 1000000, 100000, 10000, 1000, 100, 10, 1]

• 99% of faults detected under 100k


Trace-Based Fault Diagnosis (TBFD)

[Figure: Permanent fault detected, invoke diagnosis. Faulty core: rollback faulty core to checkpoint, replay execution, collect µarch info, producing the faulty trace. Fault-free core: load checkpoint on fault-free core, fault-free instruction exec, producing the test trace. The faulty trace and test trace are compared (=?) by TBFD, which handles faults in the front-end, meta-datapath faults, and datapath faults.]

Fault Diagnosability

[Figure: stacked bars of percentage of detected faults (0% to 100%) for Decoder, INT ALU, Reg Dbus, Int Reg, ROB, RAT, AGEN, and Overall, broken down into Incorrect, NoMismatch, D-Other, and D-Unique]

• 98% of detected faults are diagnosed
– 89% diagnosed to unique unit/array entry
– Meta-datapath faults in out-of-order exec mislead TBFD

Accuracy of existing Fault Models

• SWAT-Sim implemented within full-system simulator
– NCVerilog + VPI to simulate gate-level ALU and AGEN

[Figure: two stacked-bar charts, AGEN and Integer ALU, showing percentage of injections (0% to 100%) for the uarch s@1, uarch s@0, Gate s@1, Gate s@0, and Gate Delay fault models, broken down into Uarch-Mask, Arch-Mask, App-Mask, Detected, Detected>10M, and SDC. Bar labels as extracted: 97.1%, 94.0%, 95.3%, 95.5%, 96.0% (before the Integer ALU panel title) and 100%, 98.8%, 94.4%, 89.4%, 93.9% (after it).]

• Existing µarch-level fault models inaccurate
– Differences in activation rate, multi-bit flips
• Accurate models hard to derive, so need SWAT-Sim!

Summary: SWAT Advantages

• Handles all faults that matter
– Oblivious to low-level failure modes & masked faults
• Low, amortized overheads
– Optimize for common case, exploit s/w reliability solutions
• Holistic systems view enables novel solutions
– Invariant detectors use diagnosis mechanisms
– Diagnosis uses recovery mechanisms
• Customizable and flexible
– Firmware based control affords hybrid, app-specific recovery (TBD)
• Beyond hardware reliability
– SWAT treats hardware faults as software bugs

Long-term goal: unified system (hw + sw) reliability at lowest cost
– Potential applications to post-silicon test and debug

Transients Results

• 6400 transient faults injected across 8 structures
• 83% unmasked faults detected within 10M instr
• Only 0.4% of injected faults result in SDCs