error detection with an encoding - cfaed | tu dresden · an encoding–cont’d q make a program...

Error detection with AN encoding

Norman A. RinkTechnische Universität [email protected]

6 March 2017

Outline

1. Introduction – hardware faults and software-based fault tolerance

2. AN encoding

3. Memory error detection

4. Summary

2

Outline


2. AN encoding


4. Summary

3

Hardware faults and soft errors

4

q Faults are a long-standing and recurring issue.q More recently: large-scale studies (co-)authored by, e.g., Google

(SIGMETRICS ‘09), Microsoft (EuroSys ’11).

q What causes hardware faults?q cosmic radiationq “dim silicon” (near-threshold computing to save energy)q temperature variations, process variations

à Typically lead to transient hard-ware faults, aka soft errors.

q Non-negligible fault rates in emerging computing paradigms: q silicon-/carbon-based nano-technologies, grapheneq chemical information processing, bio-inspired/quantum computing

taken from S. Borkar, “Designing reliable systems from unreliable components: …,” IEEE Micro, vol. 25, no. 6, 2005.

Hardware faults and soft errors – cont’d

5

cost

(run

time,

ene

rgy)

error probability/output degradation

approximate computing applications,e.g. image processing, machine learning

safety-/security-critical applications,e.g. automotive, operating system (kernels)

non-critical user applications, processes that can be restarted

q The trade-off of computing with faulty hardware:

à Hardware-implemented error handling (aka fault tolerance) is not flexible.

Error detection through redundant code

q Typical approach to detecting/correcting hardware faults:q replicate dataflow (duplication is sufficient for error detection)q insert checks

6

%3 = add i64 %0, %1%4 = mul i64 %3, %2

%3 = add i64 %0, %1%r3 = add i64 %r0, %r1%4 = mul i64 %3, %2%r4 = mul i64 %r3, %r2

%f0 = icmp eq i64 %4, %r4br i1 %f0, label continue,

label recover

IR transformation

original code

fault-tolerant code

duplicated dataflow

error check

runtime overhead of duplicated dataflow typically < 2x

This cannot detect errors in memory!

Outline


2. AN encoding


4. Summary

7

AN encoding

q Correctness condition: Integer values are multiples of a fixed constant A.à Check for faults like this:

if (n % A != 0) { exit(AN_ERROR_CODE); }

q Advantages over replication:q Data in memory is encoded (automatically protected): no need to replicate loads/stores.q Suitable for multithreaded and shared memory applications.

q Disadvantages: Large runtime overheads (up to and over several 10x).q Checking is expensive (modulo operation)q Decoding is expensive (division by A)q Decoding often required, e.g., for address operands.

8

Trade frequencies of these operations for runtime.

AN encoding – cont’d

q Make a program fault-tolerant by transforming it into an AN-encoded program.

1. Value encoding:

2. Operation encoding, e.g.:

3. Check insertion:q Where non-encoded values enter/leave the scope of AN encoding.à Often referred to as synchronization points.

9

%3 = mul i64 %0, %1 %t3 = mul i64 %0, %1%3 = div i64 %t3, A

multiplication

%1 = load i64* %0 %t0 = ptrtoint i64* %0 to i64%t1 = div i64 %t0, A%t3 = inttoptr i64 %t1 to i64*%1 = load i64* %t3

load

integerconstantc integerconstantc*A

Results: runtime overheads

10

label test case

A-C bubblesort

D CRC (cyclic redundancy checker)

E DES encryption

F Dijkstra

G-I lex, parse, eval

J Fibonacci

K integer matrix multiply

L memcopy

M-O quicksort

test programs runtime overheads

à Small but visible reduction of overheads for profiles 2 and 3.

Results: silent data corruption (sdc)

11

A B C D E F G H I J K L M N O

0.00

0.02

0.04

0.06

0.08

0.10

0.12

0.141 2 3 4

Relative frequencies of sdc responses

Outline


2. AN encoding


4. Summary

12

Protecting memory

q Encode integer values before storing to memory:

q Decode and check after loading from memory:

13

%0 = mul i64 %0, Astore i64 %0, i64* %p

%1 = load i64* %p%2 = srem i64 %1, A...%3 = sdiv i64 %2, A

This leads to an overhead of about 1.5x (on the same test programs as before).

q Have implemented AN encoding at the level of LLVM IR.q Leads to unprotected memory accesses:

q Compiler backend inserts memory accesses (below the LLVM IR level):q register spills, function calls etc.

Protecting memory – cont’d

14

q Example: register spills

q Analogously for (on x86)q return addressq function argumentsq callee-saved registers, frame pointerq (jump tables)

DMR in the compiler backend

15

mov rcx, -0x30(rbp)...mov -0x30(rbp), rcxadd rcx, rsi

mov rcx, -0x38(rbp)mov rcx, -0x30(rbp)...mov -0x30(rbp), rcxcmp -0x38(rbp), rcxjne <exit>add rcx, rsi

Results: AN encoding plus DMR in the backend

16

q Single bit flips in memory (i.e. in the results of load operations).

Outline


2. AN encoding


4. Summary

17

Summary

18

q AN encoding leads to large overheads (10x-100x) …q ... when used to protect operations in the CPU.

q Much lower overheads when AN encoding protects memory only (1.5x).q This is also more in the original spirit of encoding (for fault-tolerant communication).

q Making software fault-tolerant at higher abstraction levels (source code/IR) is likely to miss vulnerabilities.q And it misses bad vulnerabilities (return address, frame pointer).q Hence, one must address lower abstraction levels (i.e. machine instructions).

What can be done about this?

Error detection with AN encoding

Norman A. RinkTechnische Universität [email protected]

6 March 2017 Back up

References

1) N. A. Rink, D. Kuvaiskii, J. Castrillon, C. Fetzer. Compiling for Resilience: the Performance Gap. ERPP ’15 (co-located with ParCo ‘15).

2) N. A. Rink, J. Castrillon. Improving Code Generation for Software-based Error Detection. REES ‘15 (co-located with ESWeek ’15).

3) M. Hoffmann, P. Ulbrich, C. Dietrich, H. Schirmeier, D. Lohmann, W. Schröder-Preikschat. A practitioner’s guide to software-based soft-error mitigation using AN-codes. HASE ‘14.

4) M. Shafique, S. Garg, J. Henkel, D. Marculescu. The EDA challenges in the dark silicon era: Temperature, reliability, and variability perspectives. DAC ’14.

5) E. B. Nightingale, J. R. Douceur, V. Orgovan. Cycles, cells and platters: An empirical analysis of hardware failures on a million consumer PCs. EuroSys ’11.

6) S. Feng, S. Gupta, A. Ansari, S. Mahlke. Shoestring: Probabilistic soft error reliability on the cheap. ASPLOS ’10.

7) B. Schroeder, E. Pinheiro, W.-D. Weber. DRAM errors in the wild: A large-scale field study. SIGMETRICS ’09.

8) J. Yu, M. J. Garzarán, M. Snir. ESoftCheck: Removal of non-vital checks for fault tolerance. CGO ’09.

9) G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, D. I. August. SWIFT: Software implemented fault tolerance. CGO ’05.

10) S. Borkar. Designing reliable systems from unreliable components: the challenges of transistor variability and degradation. IEEE Micro, 25(6):10–16, 2005.

11) R. Baumann. Soft errors in advanced computer systems. IEEE Design & Test of Computers, 22(3):258–266, 2005. 20

Worked example of AN encoding

21

%1 = load i64* %0%2 = add i64 %1, 42

%3 = call i64 @foo(i64 %2)store i64 %3, i64* %0

%1 = load i64* %0%2 = add i64 %1, 126

%3 = call i64 @foo(i64 %2)store i64 %3, i64* %0

value encoding

*p = foo( (*p) + 42 );

%t0 = ptrtoint i64* %0 to i64%t1 = div i64 %t0, 3%t2 = inttoptr i64 %t1 to i64*%1 = load i64* %t2

%2 = add i64 %1, 126

%t3 = div i64 %2, 3%t4 = call i64 @foo(i64 %t3)%3 = mul i64 %t4, 3

%t5 = ptrtoint i64* %0 to i64%t6 = div i64 %t5, 3%t7 = inttoptr i64 %t6 to i64*

store i64 %3, i64* %t7

operation encoding

call void @check(i64 %0)%t0 = ptrtoint i64* %0 to i64%t1 = div i64 %t0, 3%t2 = inttoptr i64 %t1 to i64*%1 = load i64* %t2call void @check(i64 %1)

%2 = add i64 %1, 126

call void @check(i64 %2)%t3 = div i64 %2, 3%t4 = call i64 foo(i64 %t3)%3 = mul i64 %t4, 3

call void @check(i64 %0)%t5 = ptrtoint i64* %0 to i64%t6 = div i64 %t5, 3%t7 = inttoptr i64 %t6 to i64*

call void @check(i64 %3)store i64 %3, i64* %t7

check insertion

(A = 3)

Results: effectiveness of AN encoding

22

A A1 A2 A3 A4 B B1 B2 B3 B4 C C1

C2

C3

C4 D D1

D2

D3

D4 E E1 E2 E3 E4 F F1 F2 F3 F4 G G1

G2

G3

G4 H H1

H2

H3

H4 I I1 I2 I3 I4 J J1 J2 J3 J4 K K1 K2 K3 K4 L L1 L2 L3 L4 M M1

M2

M3

M4 N N1

N2

N3

N4 O O1

O2

O3

O4

rela

tive

frequ

enci

es

0.0

0.2

0.4

0.6

0.8

1.0

correct detected hang crash sdc

test program responses to faults

error detection with an encoding - cfaed | tu dresden · an encoding–cont’d q make a program...

Documents