A Practitioner's Guide to Software-based Soft-Error Mitigation Using AN-Codes

Peter Ulbrich
System Software Group, http://www4.cs.fau.de
15th IEEE International Symposium on High Assurance Systems Engineering, January 09, 2013




Soft Errors – A Growing Problem

■ Soft errors (transient hardware faults)
  ■ Caused by (cosmic) radiation
  ■ Performance (technology) vs. reliability
■ Software-based fault tolerance
  ■ Selective and resource-efficient (costs!)
  ■ Vital component: arithmetic error coding (AN codes)


The Combined Redundancy Approach (CoRed) [1]

[Figure labels: Combined Redundancy Approach (CoRed), Triple Modular Redundancy, Arithmetic Error Coding. Legend: encoded operation, sphere of redundancy (SOR), isolation domain.]

→ Key element: CoRed Dependable Voter

[1] Ulbrich, Peter; Hoffmann, Martin; Kapitza, Rüdiger; Lohmann, Daniel; Schmid, Reiner; Schröder-Preikschat, Wolfgang: "Eliminating Single Points of Failure in Software-Based Redundancy", EDCC 2012.


Problem Statement

Goals:
■ Full 1-bit fault coverage
■ Get what you're paid for

Implementation:
■ UAV flight control
■ DanceOS – safety RTOS
■ KESO embedded JVM

Problems:
■ Experiments showed discrepancies (in line with [3])
■ Implications for the error probability?

→ Practitioners cannot blindly rely on coding theory!

Agenda

■ Introduction
■ Background
  ■ The CoRed Dependable Voter
  ■ Arithmetic Error Coding
■ Think Binary
  ■ Choosing Appropriate Keys
  ■ Pitfall 1: Mapping Code to Binary
■ Know Your Compiler & Architecture
  ■ Pitfall 2: Inter-Instruction State
  ■ Pitfall 3: Undefined Execution Environment
  ■ Multi-Bit Faults – A Glimpse
■ Conclusions & Lessons Learned

The CoRed Dependable Voter – Basics

[Figure: data flow from the TMR provider through the AN code protection domain to the TMR consumer. Replicas 1 to 3 produce X, Y and Z, which are encoded to X_C, Y_C and Z_C; the encoded voter outputs {B_E, W_C}; each consumer-side decode yields the winner W.]

■ Complex encoded comparison operation
■ Data-flow integrity
  ■ Input: variants (X_C, Y_C, Z_C)
  ■ Output: constant signature (B_E) and encoded winner (W_C)
  ■ Validation: subsequent check (decode)
■ Control-flow integrity
  ■ Static signature (expected value): compile time → used as return value E
  ■ Dynamic signature (actual value): runtime → applied to winner W_C
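The interplay of static and dynamic signatures can be illustrated with a small sketch. This is not the CoRed implementation: the signature values `B_X`, `B_Y`, `B_Z`, the way the expected signatures `E_XY`, `E_YZ`, `E_XZ` are derived, and all function names are assumptions made for illustration. Only the key `A = 58659` is taken from the key examples in the talk. The timestamp D is omitted for brevity.

```python
A = 58659                      # constant key (from the talk's key examples)
B_X, B_Y, B_Z = 100, 70, 30    # per-replica signatures (hypothetical values)

# Static signatures: the signature the winner is expected to carry after
# each branch. They are constants and could be computed at compile time.
E_XY = B_X + (B_X - B_Y)
E_YZ = B_Y + (B_Y - B_Z)
E_XZ = B_X + (B_X - B_Z)


def encode(v, b):
    """AN encoding with signature b (timestamp omitted)."""
    return A * v + b


def vote(xc, yc, zc):
    """Majority vote on encoded values without decoding them.

    Two variants are equal iff their difference equals the difference of
    their signatures. That difference is the dynamic signature and is
    added to the winner. Returns (static signature, encoded winner).
    """
    if xc - yc == B_X - B_Y:
        return E_XY, xc + (xc - yc)
    if yc - zc == B_Y - B_Z:
        return E_YZ, yc + (yc - zc)
    if xc - zc == B_X - B_Z:
        return E_XZ, xc + (xc - zc)
    raise RuntimeError("no majority")


def decode(wc, e):
    """Consumer-side check: the winner must carry the expected signature."""
    if (wc - e) % A != 0:
        raise ValueError("voter result corrupted")
    return (wc - e) // A


# Replica X is faulty (7 instead of 5); Y and Z agree.
e, wc = vote(encode(7, B_X), encode(5, B_Y), encode(5, B_Z))
print(decode(wc, e))
```

If the winner and the returned static signature do not belong to the same branch, the modulo check in `decode` fails, which is the kind of control-flow validation the slide describes.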


Arithmetic Error Coding – Basics

■ General coding theory
  ■ Data word + redundant information = code word
  ■ Fault detection → distance between code words
■ Arithmetic error codes
  ■ Can cope with computational flaws
  ■ Arithmetic operators (+, -, ×, =, …)

$v_c = A \cdot v + B_v + D$

where $v_c$ is the encoded value, $A$ the constant (key), $v$ the value, $B_v$ the signature and $D$ the timestamp.
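The encoding rule can be tried out in a few lines. The following is a minimal sketch, not the implementation used in the talk: the signatures `B_X` and `B_Y`, the timestamp `D` and all function names are assumptions chosen for illustration, and Python's unbounded integers hide the machine-word effects discussed later in the talk.

```python
A = 58659            # constant key (from the talk's key examples)
B_X, B_Y = 100, 70   # per-variable signatures (hypothetical values)
D = 1                # timestamp (hypothetical value)


def encode(v, b, d=D):
    """Return the code word v_c = A * v + B_v + D."""
    return A * v + b + d


def check(vc, b, d=D):
    """Valid iff removing signature and timestamp leaves a multiple of A."""
    return (vc - b - d) % A == 0


def decode(vc, b, d=D):
    if not check(vc, b, d):
        raise ValueError("code word corrupted")
    return (vc - b - d) // A


def add(xc, yc):
    """Encoded addition: operates on code words without decoding them.

    The result carries the sum of both signatures and both timestamps.
    """
    return xc + yc


xc, yc = encode(20, B_X), encode(22, B_Y)
print(decode(add(xc, yc), B_X + B_Y, 2 * D))   # sum computed in the encoded domain
print(check(xc ^ 1, B_X))                      # a single bit flip breaks the code
```

The point of the arithmetic code is visible in `add`: the sum is computed on encoded operands, so a fault during the computation itself also surfaces in the final check.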

What to Expect? – Residual Error Probability

■ Silent Data Corruption (SDC)
  ■ Undetectable code-to-code word mutation
■ Residual error probability
  ■ Chance for an SDC
  ■ Fundamental property for safety assessment

$p_{sdc} = \frac{\text{valid code words}}{\text{possible code words}} \approx \frac{1}{A}$

[Figure: predicted residual error probability $p_{pred} \approx 1/A$; x-axis: values of A (16-bit constant key); y-axis: $p_{sdc}$ (residual error probability).]

→ The bigger the key A, the better?
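The 1/A estimate can be reproduced by counting. The sketch below assumes a 32-bit code word, ignores signature and timestamp, and treats every multiple of A as a valid code word; the function name and the word size are assumptions, not taken from the talk.

```python
def predicted_psdc(a, word_bits=32):
    """Fraction of all word_bits-wide bit patterns that are multiples of a."""
    possible = 2 ** word_bits
    valid = (possible - 1) // a + 1   # multiples of a in [0, possible)
    return valid / possible


for a in (2, 8192, 58659):
    print(a, predicted_psdc(a), 1 / a)
```

Under these assumptions the ratio differs from 1/A by at most one part in 2^32, which is where the intuition "the bigger the key, the better" comes from.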


Think Binary – Choosing Appropriate Keys?

■ Theory: prime numbers [4]
  ■ Intuitively plausible
  ■ Non-primes suitable as well? [3]
■ Practitioner's approach: min. Hamming distance
  ■ Distance (d) between code words (# unequal bits)
  ■ d-1 bit error detection capabilities
  ■ Example: x = 1010, y = 1100 → d = 2
■ Brute force
  ■ 1.4×10^14 experiments for all 16-bit As

  A         d_min   #errors detectable
  58,368    2       1
  58,831    3       2
  58,659    6       5

→ "The bigger the better" is misleading!
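The brute-force idea can be shown at toy scale. This sketch computes the minimum Hamming distance of a pure AN code (no signature, no timestamp) for very small value widths; the function names and the 2-bit value width are assumptions for illustration, and the search is far too naive for the 16-bit key space examined in the talk.

```python
from itertools import combinations


def hamming(x, y):
    """Number of bit positions in which x and y differ."""
    return bin(x ^ y).count("1")


def min_hamming_distance(a, value_bits):
    """Brute-force d_min over all code words a * v, 0 <= v < 2**value_bits."""
    words = [a * v for v in range(2 ** value_bits)]
    return min(hamming(x, y) for x, y in combinations(words, 2))


print(hamming(0b1010, 0b1100))   # the x/y example from the slide
for a in (2, 3, 4):
    print(a, min_hamming_distance(a, value_bits=2))
```

Even at this scale the effect from the slide shows up: for 2-bit values, A = 3 yields d_min = 2, while the larger A = 4 yields only d_min = 1.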


Double Check – Implementation in the Spotlight

■ Fault simulation → entire fault space
  ■ Each and every A, v and fault pattern
  ■ 6.5×10^16 experiments for 16-bit As and 1-8 bit soft errors

→ Excess of predicted residual error probability
→ Mismatch with Hamming distance experiments

[Figure: $p_{brd}$ (borderline bit errors) compared with the prediction $p_{pred} \approx 1/A$; x-axis: values of A (16-bit constant key); y-axis: $p_{sdc}$ (residual error probability).]
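An exhaustive fault simulation of this kind can be sketched at toy scale. The code below injects every n-bit fault pattern into every code word of a pure AN code and counts the faulty words that still pass the check. The function name, the parameters and the tiny sizes are assumptions for illustration; the optional range check anticipates the over- and underflow discussion on the next slide.

```python
from itertools import combinations


def count_sdc(a, value_bits, word_bits, n_flips, range_check=True):
    """Count undetected faults over the entire fault space.

    Every combination of n_flips bit positions is flipped in every code
    word a * v. A faulty word counts as silent data corruption if it is
    still a multiple of a (and, with range_check, decodes to a value that
    fits into value_bits).
    """
    max_value = 2 ** value_bits - 1
    undetected = 0
    for v in range(max_value + 1):
        word = a * v
        for bits in combinations(range(word_bits), n_flips):
            mask = sum(1 << b for b in bits)
            faulty = word ^ mask
            if faulty % a != 0:
                continue                  # detected by the code check
            if range_check and faulty // a > max_value:
                continue                  # detected by the range check
            undetected += 1
    return undetected


for flips in (1, 2):
    print(flips,
          count_sdc(3, 2, 4, flips, range_check=False),
          count_sdc(3, 2, 4, flips))
```

For A = 3 with 2-bit values in a 4-bit word, all 1-bit faults are detected, as d_min = 2 predicts. Of the 24 possible 2-bit faults, 16 go undetected without the range check and 10 with it.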


Pitfall 1: Mapping Code to Binary

■ Pitfall 1: Binary representation of code words
  ■ Coding theory is unaware of machine word sizes
  → Dangerous over- and underflow conditions
■ EAN patch: decode(vC, A, B, D)
  ■ Additional range checks → prevent code space violation

[Figure: EAN decode step producing W.]
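What such range checks might look like is sketched below. The parameter order follows the slide's `decode(vC, A, B, D)`, but the body, the 16-bit value width and the 32-bit word size are assumptions for illustration, not the patch from the talk.

```python
A = 58659          # constant key (from the talk's key examples)
VALUE_BITS = 16    # assumed width of unencoded values
WORD_BITS = 32     # assumed machine word size


def decode(vc, a=A, b=0, d=0):
    """Decode a code word and reject words outside the valid code space."""
    if not 0 <= vc < 2 ** WORD_BITS:
        raise ValueError("not a machine word")
    payload = vc - b - d
    if payload < 0:
        raise ValueError("underflow: word lies below the code space")
    if payload % a != 0:
        raise ValueError("not a code word")
    v = payload // a
    if v >= 2 ** VALUE_BITS:
        raise ValueError("overflow: value lies above the code space")
    return v


print(decode(A * 42 + 7, b=7))
```

The word `A * 70000` fits into 32 bits and is a multiple of A, so a plain modulo check would accept it, although no 16-bit value encodes to it. The extra range check rejects it.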


Analysing the Assembly

■ Fault injection with FAIL* [5]
  ■ Based on Bochs simulator
  ■ Each and every register, flag, instruction and execution path
  ■ Fault-space pruning → feasibility
■ Experimental setup
  ■ Implementation: C++
  ■ Compiler: GCC 4.7.2-5 (IA32), -O2
  ■ Footprint:

                     CoRed Voter   Simple Voter
    Instructions     92            38
    Memory (Bytes)   301           112

  ■ RTOS: spatial and temporal isolation

→ Violation of predicted fault-detection capabilities!


Know your Compiler and Architecture

■ Pitfall 2: Architecture specifics
  ■ Example: absence of compound test-and-branch
  ■ Control-flow information is stored in a single bit
  → Redundancy is lost

    /* if (a == b) */
    cmp eax, ebx
    je Lequal

■ EAN patch: apply(vC, sigDYN)
  ■ Malicious control flow → signature overflow → additional check
■ Pitfall 3: Undefined execution environment
  ■ Compiler laziness leaves encoded values in registers
  ■ Zombie values → leaking from caller to voter function
  → Isolation assumptions violated
■ EAN patch: vote(xC, yC, zC)
  ■ Cleaning the local storage restores isolation

[Figure: encoded voter with inputs X_C, Y_C, Z_C and output {B_E, W_C}.]
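One way to picture the signature overflow is the following sketch. It assumes that the dynamic signature is the difference of two encoded variants and that a flipped flag bit makes the "equal" branch execute although the values differ. The signature values, the `checked` switch and `is_valid` are hypothetical and only serve the illustration; the talk names `apply(vC, sigDYN)` but does not show its body.

```python
A = 58659              # constant key (from the talk's key examples)
B_X, B_Y = 100, 70     # signatures (hypothetical values)
SIG_XY = B_X - B_Y     # dynamic signature expected when x == y


def apply(vc, sig_dyn, checked=True):
    """Add the dynamic signature to the winner.

    The additional check rejects signatures outside [0, A), which cannot
    stem from a legitimate comparison of equal values.
    """
    if checked and not 0 <= sig_dyn < A:
        raise ValueError("signature overflow")
    return vc + sig_dyn


def is_valid(wc, expected_sig):
    return (wc - expected_sig) % A == 0


# x = 7 and y = 5 differ, but a faulty branch treats them as equal.
xc, yc = A * 7 + B_X, A * 5 + B_Y
sig_dyn = xc - yc                      # 2 * A + SIG_XY instead of SIG_XY

wc = apply(xc, sig_dyn, checked=False)
print(is_valid(wc, B_X + SIG_XY))      # the multiple of A hides the fault
```

Without the check, the surplus `2 * A` in the signature is absorbed by the modulo test and the winner silently decodes to a wrong value. With `checked=True`, `apply` raises instead.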


Fault-Injection Campaigns – Final Results

■ 3 fault-injection campaigns:
  ■ Instructions
  ■ General purpose registers and CPU flags
  ■ Program counter

                         Instructions      Registers and Flags   Program Counter
                         Simple   CoRed    Simple   CoRed        Simple   CoRed
  OK                     784      2772     1040     3204         127      267
  Detected (Code)        -        995      -        1435         -        420
  Detected (Trap)        93       246      8        41           21       241
  Detected (Isolation)   825      1834     1825     3736         2804     6240
  Detected (Timeout)     0        1        0        0            0        0
  Undetected (SDC)       450      0        807      0            152      0

→ CoRed dependable voter performs as EXPECTED!

Multi-Bit Faults – The Good, the Bad and the …

■ 2-bit fault-injection experiments
  ■ Full fault space coverage
  ■ Triple check fault-detection capabilities
■ Distances: d_good = 6, d_bad = 2

                         Good A = 58,659   Bad A = 58,368
  OK                     38639             38639
  Detected (Code)        21596             21519
  Detected (Trap)        47                47
  Detected (Isolation)   60438             60438
  Detected (Timeout)     0                 0
  Undetected             0                 77

Multi-Bit Faults – Tighten the Rules

                         3-bit faults   4-bit faults   5-bit faults
  OK                     33.742%        33.605%        33.544%
  Detected (Code)        18.209%        18.356%        18.431%
  Detected (Trap)        0.001%         <0.001%        0%
  Detected (Isolation)   47.993%        48.030%        48.023%
  Detected (Timeout)     0%             0%             0%
  Undetected             0              0              0
  Fault Space            3.59 × 10^6    1.03 × 10^8    2.90 × 10^9
  Coverage               16.13%         0.59%          0.04%



Know your Compiler and Architecture (slide 15)

■  Pitfall 2: Architecture specifics
■  Example: Absence of compound test-and-branch
■  Control-flow information is stored in single bit
→  Redundancy is lost
■  EAN Patch: apply(vC, sigDYN)
■  Malicious control-flow → Signature overflow → Additional check

■  Pitfall 3: Undefined Execution Environment
■  Compiler laziness leaves encoded values in registers
■  Zombie values leaking from caller to voter function
→  Isolation assumptions violated
■  EAN Patch: vote(xC, yC, zC)
■  Cleaning the local storage restores isolation

[Figure: Encoded Voter; labels XC, YC, ZC, {BE, WC}]

→  Non-obvious sources of vulnerabilities
→  Tight feedback loop with fault injection (FI) required
→  Isolation and OS-support mandatory
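The idea behind the additional check in apply(vC, sigDYN) can be sketched as follows. This is one reading of the slide, not the original implementation: a branch-dependent signature is added to a code word so that a wrongly taken branch shows up as an invalid code word later, and the addition itself is guarded against overflowing the machine word. The 32-bit word size, the SignatureError type, and the carries_signature helper are assumptions made for illustration.

```python
WORD_MAX = 0xFFFFFFFF   # assumed 32-bit machine word


class SignatureError(Exception):
    """Raised when adding a signature would leave the machine word (hypothetical)."""


def apply(vC, sig_dyn):
    """Add a dynamic, branch-dependent signature to a code word.

    The extra check rejects results that no longer fit the machine word,
    which in fixed-width arithmetic would silently wrap around.
    """
    result = vC + sig_dyn
    if result > WORD_MAX:
        raise SignatureError("signature overflow")
    return result


def carries_signature(vC, A, expected_sig):
    """Check that vC is a multiple of A plus the expected signature."""
    return vC >= expected_sig and (vC - expected_sig) % A == 0
```

A consumer that expects the signature of the correct branch accepts apply(vC, sig) only for that signature; a code word tagged by the other branch fails carries_signature, so the control-flow decision no longer hinges on a single bit alone.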

Page 48: A Practitioner‘s Guide to Software-based Soft-Error ... · Soft-Errors (Transient hardware faults)! Caused by (cosmic) radiation! Performance (technology) vs. reliability Soft Errors


Multi-Bit Faults – Tighten the Rules (slide 18)

                        3-bit faults   4-bit faults   5-bit faults
OK                      33.742%        33.605%        33.544%
Detected (Code)         18.209%        18.356%        18.431%
Detected (Trap)         0.001%         <0.001%        0%
Detected (Isolation)    47.993%        48.030%        48.023%
Detected (Timeout)      0%             0%             0%
Undetected              0              0              0
Fault Space             3.59 × 10^6    1.03 × 10^8    2.90 × 10^9
Coverage                16.13%         0.59%          0.04%

→  Tooling speed is crucial
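Why tooling speed matters can be made concrete with a short calculation on the table's figures. The sketch assumes that "Coverage" denotes the share of the fault space that was actually injected; the slides do not define it, and the resulting run counts are derived here rather than reported in the source.

```python
# Fault-space size and coverage (in percent) as listed in the table.
campaigns = {
    "3-bit": (3.59e6, 16.13),
    "4-bit": (1.03e8, 0.59),
    "5-bit": (2.90e9, 0.04),
}


def injected_runs(fault_space, coverage_percent):
    """Approximate number of injections, assuming coverage = injected / fault space."""
    return fault_space * coverage_percent / 100.0


def growth_factor(smaller_space, larger_space):
    """How much the fault space grows from one fault width to the next."""
    return larger_space / smaller_space


for name, (space, coverage) in campaigns.items():
    print(f"{name}: about {injected_runs(space, coverage):,.0f} injections")
```

Under this assumption the campaigns amount to roughly 0.58, 0.61, and 1.16 million injections, while the fault space grows by a factor of about 28 per additional faulty bit. The number of runs stays in the same order of magnitude, so the achievable coverage drops quickly unless each injection becomes faster.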

Page 49: A Practitioner‘s Guide to Software-based Soft-Error ... · Soft-Errors (Transient hardware faults)! Caused by (cosmic) radiation! Performance (technology) vs. reliability Soft Errors


The Combined Redundancy Approach (CoRed) [1] (slide 3)

[Figure: Combined Redundancy Approach (CoRed); labels: Triple Modular Redundancy, Arithmetic Error Coding, Encoded operation, Sphere of redundancy (SOR), Isolation domain]

→  Key element: CoRed Dependable Voter

[1] Ulbrich, Peter; Hoffmann, Martin; Kapitza, Rüdiger; Lohmann, Daniel; Schmid, Reiner; Schröder-Preikschat, Wolfgang: "Eliminating Single Points of Failure in Software-Based Redundancy", EDCC 2012.
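How triple modular redundancy and arithmetic error coding can meet in a voter is sketched below. This is a simplified illustration, not the CoRed Dependable Voter itself: it assumes each replica delivers vC = A·v + B with its own static signature B, compares replicas through their signature differences without decoding them, and validates the winner against the code. The VoteError type, the helper names, and all constants are hypothetical.

```python
class VoteError(Exception):
    """Raised when no two replicas agree (hypothetical type)."""


def is_valid(vC, A, B):
    """A code word is valid if it is a multiple of A plus its signature."""
    return vC >= B and (vC - B) % A == 0


def vote(xC, yC, zC, A, Bx, By, Bz):
    """Majority vote on encoded replicas.

    Two replicas encode the same value exactly when their difference equals
    the difference of their signatures, so no replica has to be decoded.
    Returns the name and the code word of the winning replica.
    """
    if xC - yC == Bx - By and is_valid(xC, A, Bx):
        return "x", xC
    if xC - zC == Bx - Bz and is_valid(xC, A, Bx):
        return "x", xC
    if yC - zC == By - Bz and is_valid(yC, A, By):
        return "y", yC
    raise VoteError("no majority among the replicas")
```

With three consistent replicas the sketch returns the first one; if one replica is hit by a bit flip, the remaining pair still agrees and wins; if two replicas are corrupted differently, the vote fails with VoteError instead of returning a wrong value.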

Page 50: A Practitioner‘s Guide to Software-based Soft-Error ... · Soft-Errors (Transient hardware faults)! Caused by (cosmic) radiation! Performance (technology) vs. reliability Soft Errors

Thank you!

Implementation and further experimental results: http://www4.cs.fau.de/Research/CoRed