error detection with an encoding - cfaed | tu dresden · an encoding–cont’d q make a program...
TRANSCRIPT
Error detection with AN encoding
Norman A. RinkTechnische Universität [email protected]
6 March 2017
Outline
1. Introduction – hardware faults and software-based fault tolerance
2. AN encoding
3. Memory error detection
4. Summary
2
Outline
1. Introduction – hardware faults and software-based fault tolerance
2. AN encoding
3. Memory error detection
4. Summary
3
Hardware faults and soft errors
4
q Faults are a long-standing and recurring issue.q More recently: large-scale studies (co-)authored by, e.g., Google
(SIGMETRICS ‘09), Microsoft (EuroSys ’11).
q What causes hardware faults?q cosmic radiationq “dim silicon” (near-threshold computing to save energy)q temperature variations, process variations
à Typically lead to transient hard-ware faults, aka soft errors.
q Non-negligible fault rates in emerging computing paradigms: q silicon-/carbon-based nano-technologies, grapheneq chemical information processing, bio-inspired/quantum computing
taken from S. Borkar, “Designing reliable systems from unreliable components: …,” IEEE Micro, vol. 25, no. 6, 2005.
Hardware faults and soft errors – cont’d
5
cost
(run
time,
ene
rgy)
error probability/output degradation
approximate computing applications,e.g. image processing, machine learning
safety-/security-critical applications,e.g. automotive, operating system (kernels)
non-critical user applications, processes that can be restarted
q The trade-off of computing with faulty hardware:
à Hardware-implemented error handling (aka fault tolerance) is not flexible.
Error detection through redundant code
q Typical approach to detecting/correcting hardware faults:q replicate dataflow (duplication is sufficient for error detection)q insert checks
6
%3 = add i64 %0, %1%4 = mul i64 %3, %2
%3 = add i64 %0, %1%r3 = add i64 %r0, %r1%4 = mul i64 %3, %2%r4 = mul i64 %r3, %r2
%f0 = icmp eq i64 %4, %r4br i1 %f0, label continue,
label recover
IR transformation
original code
fault-tolerant code
duplicated dataflow
error check
runtime overhead of duplicated dataflow typically < 2x
This cannot detect errors in memory!
Outline
1. Introduction – hardware faults and software-based fault tolerance
2. AN encoding
3. Memory error detection
4. Summary
7
AN encoding
q Correctness condition: Integer values are multiples of a fixed constant A.à Check for faults like this:
if (n % A != 0) { exit(AN_ERROR_CODE); }
q Advantages over replication:q Data in memory is encoded (automatically protected): no need to replicate loads/stores.q Suitable for multithreaded and shared memory applications.
q Disadvantages: Large runtime overheads (up to and over several 10x).q Checking is expensive (modulo operation)q Decoding is expensive (division by A)q Decoding often required, e.g., for address operands.
8
Trade frequencies of these operations for runtime.
AN encoding – cont’d
q Make a program fault-tolerant by transforming it into an AN-encoded program.
1. Value encoding:
2. Operation encoding, e.g.:
3. Check insertion:q Where non-encoded values enter/leave the scope of AN encoding.à Often referred to as synchronization points.
9
%3 = mul i64 %0, %1 %t3 = mul i64 %0, %1%3 = div i64 %t3, A
multiplication
%1 = load i64* %0 %t0 = ptrtoint i64* %0 to i64%t1 = div i64 %t0, A%t3 = inttoptr i64 %t1 to i64*%1 = load i64* %t3
load
integerconstantc integerconstantc*A
Results: runtime overheads
10
label test case
A-C bubblesort
D CRC (cyclic redundancy checker)
E DES encryption
F Dijkstra
G-I lex, parse, eval
J Fibonacci
K integer matrix multiply
L memcopy
M-O quicksort
test programs runtime overheads
à Small but visible reduction of overheads for profiles 2 and 3.
Results: silent data corruption (sdc)
11
A B C D E F G H I J K L M N O
0.00
0.02
0.04
0.06
0.08
0.10
0.12
0.141 2 3 4
Relative frequencies of sdc responses
Outline
1. Introduction – hardware faults and software-based fault tolerance
2. AN encoding
3. Memory error detection
4. Summary
12
Protecting memory
q Encode integer values before storing to memory:
q Decode and check after loading from memory:
13
%0 = mul i64 %0, Astore i64 %0, i64* %p
%1 = load i64* %p%2 = srem i64 %1, A...%3 = sdiv i64 %2, A
This leads to an overhead of about 1.5x (on the same test programs as before).
q Have implemented AN encoding at the level of LLVM IR.q Leads to unprotected memory accesses:
q Compiler backend inserts memory accesses (below the LLVM IR level):q register spills, function calls etc.
Protecting memory – cont’d
14
q Example: register spills
q Analogously for (on x86)q return addressq function argumentsq callee-saved registers, frame pointerq (jump tables)
DMR in the compiler backend
15
mov rcx, -0x30(rbp)...mov -0x30(rbp), rcxadd rcx, rsi
mov rcx, -0x38(rbp)mov rcx, -0x30(rbp)...mov -0x30(rbp), rcxcmp -0x38(rbp), rcxjne <exit>add rcx, rsi
Results: AN encoding plus DMR in the backend
16
q Single bit flips in memory (i.e. in the results of load operations).
Outline
1. Introduction – hardware faults and software-based fault tolerance
2. AN encoding
3. Memory error detection
4. Summary
17
Summary
18
q AN encoding leads to large overheads (10x-100x) …q ... when used to protect operations in the CPU.
q Much lower overheads when AN encoding protects memory only (1.5x).q This is also more in the original spirit of encoding (for fault-tolerant communication).
q Making software fault-tolerant at higher abstraction levels (source code/IR) is likely to miss vulnerabilities.q And it misses bad vulnerabilities (return address, frame pointer).q Hence, one must address lower abstraction levels (i.e. machine instructions).
What can be done about this?
Error detection with AN encoding
Norman A. RinkTechnische Universität [email protected]
6 March 2017 Back up
References
1) N. A. Rink, D. Kuvaiskii, J. Castrillon, C. Fetzer. Compiling for Resilience: the Performance Gap. ERPP ’15 (co-located with ParCo ‘15).
2) N. A. Rink, J. Castrillon. Improving Code Generation for Software-based Error Detection. REES ‘15 (co-located with ESWeek ’15).
3) M. Hoffmann, P. Ulbrich, C. Dietrich, H. Schirmeier, D. Lohmann, W. Schröder-Preikschat. A practitioner’s guide to software-based soft-error mitigation using AN-codes. HASE ‘14.
4) M. Shafique, S. Garg, J. Henkel, D. Marculescu. The EDA challenges in the dark silicon era: Temperature, reliability, and variability perspectives. DAC ’14.
5) E. B. Nightingale, J. R. Douceur, V. Orgovan. Cycles, cells and platters: An empirical analysis of hardware failures on a million consumer PCs. EuroSys ’11.
6) S. Feng, S. Gupta, A. Ansari, S. Mahlke. Shoestring: Probabilistic soft error reliability on the cheap. ASPLOS ’10.
7) B. Schroeder, E. Pinheiro, W.-D. Weber. DRAM errors in the wild: A large-scale field study. SIGMETRICS ’09.
8) J. Yu, M. J. Garzarán, M. Snir. ESoftCheck: Removal of non-vital checks for fault tolerance. CGO ’09.
9) G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, D. I. August. SWIFT: Software implemented fault tolerance. CGO ’05.
10) S. Borkar. Designing reliable systems from unreliable components: the challenges of transistor variability and degradation. IEEE Micro, 25(6):10–16, 2005.
11) R. Baumann. Soft errors in advanced computer systems. IEEE Design & Test of Computers, 22(3):258–266, 2005. 20
Worked example of AN encoding
21
%1 = load i64* %0%2 = add i64 %1, 42
%3 = call i64 @foo(i64 %2)store i64 %3, i64* %0
%1 = load i64* %0%2 = add i64 %1, 126
%3 = call i64 @foo(i64 %2)store i64 %3, i64* %0
value encoding
*p = foo( (*p) + 42 );
%t0 = ptrtoint i64* %0 to i64%t1 = div i64 %t0, 3%t2 = inttoptr i64 %t1 to i64*%1 = load i64* %t2
%2 = add i64 %1, 126
%t3 = div i64 %2, 3%t4 = call i64 @foo(i64 %t3)%3 = mul i64 %t4, 3
%t5 = ptrtoint i64* %0 to i64%t6 = div i64 %t5, 3%t7 = inttoptr i64 %t6 to i64*
store i64 %3, i64* %t7
operation encoding
call void @check(i64 %0)%t0 = ptrtoint i64* %0 to i64%t1 = div i64 %t0, 3%t2 = inttoptr i64 %t1 to i64*%1 = load i64* %t2call void @check(i64 %1)
%2 = add i64 %1, 126
call void @check(i64 %2)%t3 = div i64 %2, 3%t4 = call i64 foo(i64 %t3)%3 = mul i64 %t4, 3
call void @check(i64 %0)%t5 = ptrtoint i64* %0 to i64%t6 = div i64 %t5, 3%t7 = inttoptr i64 %t6 to i64*
call void @check(i64 %3)store i64 %3, i64* %t7
check insertion
(A = 3)
Results: effectiveness of AN encoding
22
A A1 A2 A3 A4 B B1 B2 B3 B4 C C1
C2
C3
C4 D D1
D2
D3
D4 E E1 E2 E3 E4 F F1 F2 F3 F4 G G1
G2
G3
G4 H H1
H2
H3
H4 I I1 I2 I3 I4 J J1 J2 J3 J4 K K1 K2 K3 K4 L L1 L2 L3 L4 M M1
M2
M3
M4 N N1
N2
N3
N4 O O1
O2
O3
O4
rela
tive
frequ
enci
es
0.0
0.2
0.4
0.6
0.8
1.0
correct detected hang crash sdc
test program responses to faults