Assuring Application-level Correctness Against Soft Errors
Jason Cong and Karthik Gururaj


Motivation

- Soft errors are an issue for the correct operation of CMOS circuits
- The problem is becoming more severe (ITRS 2009)
  - Smaller device sizes
  - Low supply voltages
- Effect of soft errors on circuits: Karnik 2004, Nguyen 2003
- Effect of soft errors on software and processors: Li et al. 2005, Wang et al. 2004

Motivation

- Traditional notion of correctness
  - Every last bit of every variable in a program should be correct
  - Referred to as numerical correctness
- Application-level correctness
  - Several applications can tolerate a degree of error
  - Image viewer, video decoding, etc.
- However, there exist critical instructions even in such applications
  - Example: state machine in a video decoder

Motivation

- Goal: detect all "critical" instructions in the program
- Protect "critical" instructions in the program against soft errors
  - Using duplication

Outline

- Motivation
- Definition of critical instructions
- Program representation
- Static analysis to detect critical instructions
- Profiling and runtime monitoring
- Results


Defining critical instructions

- Elastic outputs: program outputs which can tolerate a certain amount of error
  - Media applications: image, video, etc.
  - Heuristics: support vector machine
- Characterizing quality of elastic outputs: fidelity metric
  - Example: PSNR (peak signal-to-noise ratio) for JPEG, bit error rate

Defining critical instructions

Given application A:
- I is the input to the application
- A set of outputs O_c for which numerical correctness is required
- A set of elastic outputs O
- Fidelity metric F(I, O) for elastic outputs
- T: threshold for acceptable output

An execution of A is said to satisfy application-level correctness if:
- All outputs ∈ O_c are numerically correct
- F(I, O) ≥ T for elastic outputs

N_min: the minimum number of elements of O that need to be erroneous for F(I, O) to fall below T
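As an illustration only, the definition above can be read as a predicate over one execution. The sketch below assumes outputs are plain Python lists and that the fidelity metric is passed in as a function; the names `satisfies_application_correctness` and `toy_fidelity` are hypothetical and do not come from the slides.

```python
def satisfies_application_correctness(critical_out, critical_ref,
                                      inputs, elastic_out,
                                      fidelity, threshold):
    """Return True if every O_c output is exact and F(I, O) >= T."""
    all_critical_exact = list(critical_out) == list(critical_ref)
    return all_critical_exact and fidelity(inputs, elastic_out) >= threshold


def toy_fidelity(inputs, elastic_out):
    """Assumed stand-in for F(I, O): the fraction of elastic elements that
    match the reference values carried in `inputs`."""
    matches = sum(1 for ref, out in zip(inputs, elastic_out) if ref == out)
    return matches / len(inputs)


reference = [10, 20, 30, 40]
# One wrong elastic element out of four gives fidelity 0.75, above T = 0.7.
ok = satisfies_application_correctness([1, 2], [1, 2], reference,
                                       [10, 20, 30, 41], toy_fidelity, 0.7)
# A single wrong O_c element fails the check regardless of fidelity.
bad = satisfies_application_correctness([1, 3], [1, 2], reference,
                                        reference, toy_fidelity, 0.7)
```

A real fidelity metric such as PSNR would replace `toy_fidelity`; the structure of the check stays the same.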

Example: JPEG decoder

- PSNR of 35 dB is assumed to be good quality

$$\mathrm{PSNR} = 20 \log_{10}\left(\frac{\mathrm{MAX}}{\sqrt{\mathrm{MSE}}}\right)$$

  - MSE = 20.56
- Using 8-bit pixel values (MAX = 255), max error = 255
- For a 1024x768 pixel image, N_min ~ 251
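The arithmetic on this slide can be retraced with a short sketch. It assumes the worst case in which every erroneous pixel is off by the maximum error of 255; the function names `mse_for_psnr` and `n_min` are hypothetical.

```python
import math


def mse_for_psnr(psnr_db, max_value=255):
    """Invert PSNR = 20*log10(MAX / sqrt(MSE)) to get the MSE at a given PSNR."""
    return max_value ** 2 / 10 ** (psnr_db / 10)


def n_min(psnr_db, num_pixels, max_value=255, max_error=255):
    """Smallest number of maximally wrong pixels whose squared error,
    averaged over the whole image, reaches the MSE limit."""
    limit = mse_for_psnr(psnr_db, max_value)
    return math.ceil(limit * num_pixels / max_error ** 2)


mse = mse_for_psnr(35)                 # about 20.56, matching the slide
pixels_needed = n_min(35, 1024 * 768)  # 249 under these assumptions
```

Under these assumptions the sketch gives 249, in line with the slide's N_min ~ 251; the slides do not state how their figure was rounded.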

Defining critical instructions

An instruction X is said to be critical if
- X affects one of the outputs in O_c (numerical correctness required), OR
- X affects at least N_min elements of the elastic outputs O


Program representation

- LLVM compiler infrastructure
  - LLVM intermediate representation
- Weighted program dependence graph (PDG): G

Example

Source code:

    X = sqrt(Y);
    for (i = 1; i < N; ++i)
    {
        C[i] = C[i-1] + i;
        output[i] = C[Z] + X;
    }

LLVM IR (3-address code):

    X = sqrt(Y);
    bb:
      i = phi([1, entry], [i_next, bb]);
      c_i_1 = load &C[i-1]
      tmp = add c_i_1, i
      store tmp, &C[i]
      c_z = load &C[Z]
      out_i = add X, c_z
      store out_i, &output[i]

Example

PDG based on the LLVM IR

[Figure: weighted PDG for the example loop. Node labels: X=sqrt(Y) (X), c_i_1 = load C[i-1] (c_i_1), add c_i, i, store C[i] (sc), load C[Z] (c_z), out_i, store output[i] (so). The edge X → out_i is labeled N; the other edges are labeled 1.]

- Node X: computes X
- Node out_i: computes C[Z] + X
- Node so: stores C[Z] + X into the array output
- Edge X → out_i: represents the dependence between X and out_i
- Edge out_i → so: represents the dependence between out_i and so

Assigning edge weights

- Edge weight u → v: how many instances of node v are affected by 1 instance of u?
- Example:
  - X is outside the loop, out_i is inside the loop: edge weight N
  - Nodes out_i and so are in the same basic block: edge weight 1

[Figure: the weighted PDG from the example, with edge X → out_i labeled N and the other edges labeled 1.]
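One way to read the two cases above is that an edge picks up the trip count of every loop that encloses v but not u. The sketch below encodes that reading; the loop-nest representation, the name `edge_weight`, and the generalization to nested loops are assumptions made for illustration, not something the slides spell out.

```python
def edge_weight(loops_u, loops_v, trip_count):
    """Estimate w(u -> v) as the product of the trip counts of loops
    that enclose v but do not enclose u (assumed rule)."""
    weight = 1
    for loop in loops_v:
        if loop not in loops_u:
            weight *= trip_count[loop]
    return weight


N = 100                        # the slide uses N as the iteration count
trip_count = {"i_loop": N}     # hypothetical name for the example loop

# X is computed outside the loop, out_i inside it.
w_X_out_i = edge_weight([], ["i_loop"], trip_count)
# out_i and so sit in the same basic block inside the loop.
w_out_i_so = edge_weight(["i_loop"], ["i_loop"], trip_count)
```

With these inputs the first edge gets weight N and the second gets weight 1, matching the example.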


Static analysis for detecting critical instructions

- Find how many instances of output O are affected by node x
- propagate(x → v) is the number of instances of v that are affected by an instance of x

Example

- propagate(u → v) is initialized to the edge weight for all edges (u → v)
- propagate(X → out_i) = N
- w(out_i → so) = 1
- propagate(X → so) = propagate(X → out_i) * w(out_i → so)
- More formally:

$$\mathrm{propagate}(x \to v) = \max_{u \in \mathrm{predecessors}(v)} \big( \mathrm{propagate}(x \to u) \cdot w(u \to v) \big)$$
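The recurrence can be sketched as a memoized recursion over predecessors. This is a minimal illustration that assumes the graph is acyclic (the slides do not say how loop-carried cycles are handled) and uses propagate(x → x) = 1 as the base case, so that a direct edge yields its own weight. The helper name `make_propagate` and the three-edge graph are hypothetical.

```python
from functools import lru_cache


def make_propagate(edges):
    """edges maps (u, v) -> w(u -> v). The graph is assumed to be acyclic."""
    preds = {}
    for (u, v), w in edges.items():
        preds.setdefault(v, []).append((u, w))

    @lru_cache(maxsize=None)
    def propagate(x, v):
        if x == v:
            return 1   # assumed base case: one instance affects itself
        best = 0       # 0 means x does not reach v
        for u, w in preds.get(v, []):
            best = max(best, propagate(x, u) * w)
        return best

    return propagate


N = 100
edges = {("X", "out_i"): N, ("c_z", "out_i"): 1, ("out_i", "so"): 1}
propagate = make_propagate(edges)

reach_X_so = propagate("X", "so")      # N * 1
reach_cz_so = propagate("c_z", "so")   # 1 * 1
reach_so_X = propagate("so", "X")      # no path, so 0
```

Taking the maximum over predecessors keeps the estimate conservative: the path that spreads an error to the most instances determines the result.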


Profiling and runtime monitoring

- Static analysis is conservative in nature
  - May produce overly pessimistic results
  - Main reason: edge weights are initialized too high
- Profiling with test inputs to estimate edge weights

Example

- Assume static analysis overestimates the edge weight between sc and c_z
  - Profiling gives a value of 1
  - Node sc is likely non-critical (LNC)
- Contrast this with node X, which is static critical

[Figure: source loop and weighted PDG from the earlier example.]

Profiling and runtime monitoring

- Likely critical instructions: duplicated and checked in software
  - Using the SWIFT method proposed by Reis et al. 2005
- Likely non-critical instructions: monitored using a lightweight runtime monitoring technique
- Static non-critical instructions: no error checking
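The three classes above could be assigned by comparing two reach estimates against N_min: one from static edge weights and one from profiled edge weights. The decision rule below is an assumption pieced together from the slides, not a stated algorithm, and it ignores the separate O_c condition; `classify` and `PROTECTION` are hypothetical names.

```python
# Protection applied to each class, as listed on the slide.
PROTECTION = {
    "likely critical": "duplicate and check in software (SWIFT)",
    "likely non-critical": "lightweight runtime monitoring",
    "static non-critical": "no error checking",
}


def classify(static_reach, profiled_reach, n_min):
    """Assumed rule: compare the number of elastic output elements an
    instruction can affect, under static and profiled weights, with N_min."""
    if static_reach < n_min:
        return "static non-critical"   # even the conservative bound is small
    if profiled_reach < n_min:
        return "likely non-critical"   # only the static bound is large
    return "likely critical"


n_min = 251
label_sc = classify(static_reach=1000, profiled_reach=1, n_min=n_min)
label_x = classify(static_reach=1000, profiled_reach=1000, n_min=n_min)
label_small = classify(static_reach=1, profiled_reach=1, n_min=n_min)
```

The middle class is the one that needs runtime monitoring, because profiling with test inputs cannot rule out a larger reach on other inputs.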


Results

- Benchmarks from MediaBench, SPEC, MiBench
- Simics/GEMS simulation infrastructure

Static instruction classification

- A significant number of instructions are non-critical
- Profiling helps to determine likely non-critical instructions

Comparison with previous work

- Significant savings over the approach proposed by Thaker et al.
  - Protects all instructions which compute memory addresses and control flow

Conclusion

- Static + dynamic technique for detecting critical instructions
- Detects several non-critical instructions
- Reduces overall energy by 25%