![Page 1: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/1.jpg)
Tolerating Hardware Faults in Commodity Software: Problems, Solutions, and a Roadmap
Karthik Pattabiraman (http://blogs.ubc.ca/karthik)
Electrical and Computer Engineering, University of British Columbia (UBC)
![Page 2: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/2.jpg)
Motivation: Hardware Errors
• Errors are becoming more common in processors
– Soft errors and device variations (timing errors)
– Processors experience wear-out and thermal hotspots
Source: Shekar Borkar (Intel), Stanford talk in 2005
![Page 3: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/3.jpg)
Hardware Errors: Traditional Solutions
• Guard-banding
– Guard-banding wastes power as the gap between average and worst-case widens due to variations
• Duplication
– Hardware duplication (DMR) can result in 2X slowdown and/or energy consumption
[Figure: guard-band between the average and worst-case]
![Page 4: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/4.jpg)
An alternative approach
[Figure: system stack with User, Software (Application, Operating System), and Hardware (Architecture, Devices/Circuits); the user interacts with the application]
Allow errors across the hardware-software boundary, but make sure the user does not perceive them
![Page 5: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/5.jpg)
Why do software techniques work?
[Figure: Device/Circuit Level, Architectural Level, Operating System Level, Application Level; Impactful Errors]
![Page 6: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/6.jpg)
Software Techniques
Leverage the properties of the application to provide targeted protection, only for the errors that matter to it
[Figure: Application Properties and Targeted protection mechanisms, shown against the Device/Circuit, Architectural, Operating System, and Application levels]
![Page 7: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/7.jpg)
Outline
• Motivation
• Techniques developed by my group [DSN’13][CASES’14]
• A brief history of software techniques
• Adoption in Industry
• Research opportunities and roadmap
![Page 8: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/8.jpg)
EDCs: Soft Computing Applications
• Applications in machine learning, multimedia processing
• Expected to dominate future workloads [Dubey’07]
Original image (left) versus faulty image: JPEG decoder
![Page 9: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/9.jpg)
EDCs: Egregious Data Corruptions
• Large or unacceptable deviation in output
EDC image (PSNR 11.37) vs. non-EDC image (PSNR 44.79)
![Page 10: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/10.jpg)
EDCs: Goal
• Selectively detect EDC-causing faults, but not others
[Figure: Application Execution and Detector, with fault outcomes Benign, Non-EDC, and EDC]
![Page 11: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/11.jpg)
EDCs: Fault Model
• Transient hardware faults
– Caused by particle strikes, supply noise
• Our fault model
– Assume one fault per application execution
– Processor registers and execution units
– Memory and cache protected with ECC
– Control logic protected with other methods
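The single-fault assumption above can be illustrated with a small bit-flip helper. This is a minimal sketch, not the injector used in the talk: the function name `flip_bit` and the choice of a 32-bit value standing in for a register are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Model one transient fault in a 32-bit register value by flipping a
 * single bit. The name and the 32-bit width are illustrative assumptions. */
uint32_t flip_bit(uint32_t value, unsigned bit)
{
    assert(bit < 32u);                 /* only bits 0..31 exist */
    return value ^ (UINT32_C(1) << bit);
}
```

A fault injection run would call such a helper exactly once per execution, on a randomly chosen value and bit, and then compare the program output against a fault-free run.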
![Page 12: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/12.jpg)
EDCs: Main Idea
[Figure: Application Data containing Critical Data, corrupted by hardware faults]
Our prior work: EDCs are caused by corruption of a small fraction of program data [Flikker-ASPLOS’11]
This work: Critical data can be identified using static and dynamic analysis, without any programmer annotations
Initial Study | Heuristic | Algorithm
![Page 13: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/13.jpg)
EDCs: Initial Study
Monitor Control/Pointer Data
• Instrument code
• Fault injection
• Correlation between program data use & fault outcome
Performed using the LLFI fault injector [DSN’14], at the LLVM IR code level
![Page 14: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/14.jpg)
EDCs: Initial Study
[Figure: chart with segments of 6%, 43%, 23%, and 28%]
![Page 15: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/15.jpg)
EDCs: Example Heuristic
Faults affecting branches with a large amount of data within their bodies have a higher likelihood of resulting in EDC outcomes

    void conv422to444(char *src, char *dst, int height, int width, int offset) {
      for (j = 0; j < height; j++) {
        for (i = 0; i < width; i++) {
          im1 = (i < 1) ? 0 : i - 1;
          ……
        }
        if (j + 1 < offset) {
          src += w;
          dst += width;
        }
      }
    }

Figure annotations: High EDC Likelihood; Low EDC Likelihood
• Fault in offset
• Branch flip
![Page 16: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/16.jpg)
EDCs: Algorithm
[Figure: workflow with the following elements: Application Source Code, Compiler, IR, Representative inputs, EDC Ranking Algorithm, Performance Overhead, Selection Algorithm, Data Variables or Locations to Protect, Backward slice replication]
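Backward slice replication, named on this slide, can be pictured as recomputing the instructions that produce a protected value and comparing the two results. The following is a minimal hand-written sketch of that idea, not the compiler transformation from the talk; `row_offset`, `protected_row_offset`, `report_mismatch`, and `detector_fired` are hypothetical names.

```c
#include <assert.h>

/* Hypothetical detector state: set when the two copies disagree.
 * A real system might log the event, abort, or trigger recovery. */
int detector_fired = 0;

static void report_mismatch(void) { detector_fired = 1; }

/* Original computation of a value chosen for protection. */
static int row_offset(int row, int width) { return row * width; }

/* Same computation with its backward slice replicated and checked. */
int protected_row_offset(int row, int width)
{
    int primary = row_offset(row, width);

    /* Volatile copies keep the compiler from folding the duplicate
     * computation into the original one. */
    volatile int row_dup = row;
    volatile int width_dup = width;
    int shadow = row_dup * width_dup;

    if (primary != shadow)
        report_mismatch();   /* a fault corrupted one of the two copies */
    return primary;
}
```

Only values selected by the ranking step would be wrapped this way, which is what keeps the overhead below that of duplicating every instruction.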
![Page 17: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/17.jpg)
EDCs: Detection Coverage
Average EDC coverage of 82% at 10% performance overhead
Higher is better
$$\text{EDC Coverage} = \frac{\text{Number of Detected EDCs}}{\text{Total Number of EDCs}}$$
![Page 18: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/18.jpg)
EDCs: Selectivity
Average Benign and Non-EDC coverage of 10 to 15% for overheads from 10 to 25%
Lower is better
![Page 19: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/19.jpg)
SDCTune: Silent Data Corruptions (SDCs)
[Figure: fault outcome diagram with the following elements: Fault occurs, Error activated, Error masked (Benign), Crash/Hang, SDC, Program finished, Correct output, SDC output, Results lost]
![Page 20: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/20.jpg)
SDCTune: Goals
• Protecting critical data in soft-computing applications from EDCs
– Can we extend this to Silent Data Corruptions (SDCs) in general-purpose applications?
• Challenge:
– Not feasible to identify SDCs based on the amount of data affected by the fault, as was the case with EDCs
– Need for a comprehensive model for predicting SDCs based on static and dynamic program features
![Page 21: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/21.jpg)
SDCTune: Main Idea
• Start from Store and Cmp instructions and go backward through the program’s data dependencies
• Use machine learning (CART) to predict the SDC proneness of Store and Cmp instructions
– Extract the related features by static/dynamic analysis
– Quantify the effects by classification and regression
– Estimate SDC rates of different Store and Cmp instructions
![Page 22: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/22.jpg)
SDCTune: Example Model
[Figure: example model; visible labels: "Not used in masking operations", "Linear regression for SDC-proneness"]
![Page 23: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/23.jpg)
SDCTune: Benchmarks

(a) Data dependency of detector-free code
(b) Basic detector instrumented
(c) Concatenate duplicated instructions
Figure 5: The shaded portion of (a) shows the instructions that need protection. (b) shows the duplicated instructions (the shaded nodes) and the detector inserted at the end of the two dependency chains. (c) shows one added instruction to protect (node e’) that concatenates the two dependency chains and saves one checker.

[In the listings below, … marks identifiers missing from the extracted text.]

(a) Detector-free code

    for (i = 0; ; i++) {
      // loop body
      flag = … < … ? 1 : 0;
      if (flag == 1)
        break;
      // decompose exit predication to simulate instruction-level behaviour
    }

(b) Basic detector instrumented

    i = 0;
    // duplication of i
    … = 0;
    for (;;) {
      // loop body
      flag = … < … ? 1 : 0;
      dup_flag = … < … ? 1 : 0;
      if (flag != dup_flag)
        Assert();  // inconsistent
      if (flag == 1)
        break;
    }

(c) Lazy checking applied

    i = 0;
    // duplication of i
    … = 0;
    for (;;) {
      // loop body
      flag = … < … ? 1 : 0;
      dup_flag = … < … ? 1 : 0;
      if (flag == 1)
        break;
    }
    if (flag != dup_flag)
      Assert();  // inconsistent

Figure 6: (b) shows how the loop index i in the original code (a) is protected, with the bold code as the check. (c) shows how we move the check out of the loop body.

5. Experimental Setup
In this section, we empirically evaluate SDCTune for configurable SDC protection through fault injection experiments. All the experiments and evaluations are conducted on an Intel i7 4-core machine with 8GB memory running Debian Linux. Section 5.1 presents the details of benchmarks and Section 5.2 presents our evaluation metrics. Section 5.3 presents our methodology and workflow for performing the experiments.

5.1 Benchmarks
We choose a total of 12 applications from a wide variety of domains for training and testing. They are from the SPEC benchmark suite [12], SPLASH2 benchmark suite [30], NAS parallel benchmark suite [1], PARSEC benchmark suite [2] and Parboil benchmark suite [27]. We divide the 12 applications into two groups of 6 applications each, one for training and the other for testing. The four benchmarks studied in Section 2.3 are incorporated in the training group. The details of these training and testing benchmarks are shown in Table 5 and Table 6 respectively. All the applications are compiled and linked into native executables with -O2 optimization flags and run in a single-threaded mode.

Table 5: Training programs

| Program | Description | Benchmark suite | Input | Stores | Comparisons |
|---|---|---|---|---|---|
| IS | Integer sorting | NAS | default | 21 | 20 |
| LU | Linear algebra | SPLASH2 | test | 41 | 110 |
| Bzip2 | Compression | SPEC | test | 681 | 646 |
| Swaptions | Price portfolio of swaptions | PARSEC | Sim-large | 36 | 101 |
| Water | Molecular dynamics | SPLASH2 | test | 187 | 224 |
| CG | Conjugate gradient method | NAS | default | 32 | 97 |

Table 6: Testing programs

| Program | Description | Benchmark suite | Input | Stores | Comparisons |
|---|---|---|---|---|---|
| Lbm | Fluid dynamics | Parboil | short | 71 | 34 |
| Gzip | Compression | SPEC | test | 251 | 399 |
| Ocean | Large-scale ocean movements | SPLASH | test | 322 | 813 |
| Bfs | Breadth-first search | Parboil | 1M | 36 | 57 |
| Mcf | Combinatorial optimization | SPEC | test | 87 | 158 |
| Libquantum | Quantum computing | SPEC | test | 39 | 136 |

5.2 Evaluation Metrics
To gauge the accuracy of SDCTune, we use it for estimating the overall SDC rate of an application, as well as the SDC coverage for different performance overhead bounds. The former is used for comparing the resilience of different applications, while the latter is used to insert detectors for configurable protection.

Estimation of overall SDC rates: We perform a random fault injection experiment to determine the overall SDC rate of the application. We then compare the SDC rate obtained with SDCTune with that obtained from the fault injection experiment. We also consider the relative SDC rate compared to other applications (i.e., its rank). We use the same experimental setup for fault injection as described in Section 2.3.

SDC coverages for different performance overhead bounds: We use SDCTune to predict the SDC coverage for different instructions to satisfy the performance overhead bounds provided by the user. We start with the most SDC-prone instructions and iteratively expand the set of instructions until the performance overhead bounds are met. We perform fault injection experiments on the program instrumented with our detectors for these instructions, and measure the percentages of SDCs detected. We then compare our results with those of full duplication, i.e., when every instruction is duplicated in the program, and hot-path duplication, i.e., when the top 10% most executed instructions are duplicated in the program.

SDC detection efficiency: Similar to the efficiency defined in prior work [25], we define the SDC detection efficiency as the ratio between SDC coverage and performance overhead for a detection technique. We calculate the efficiency of each benchmark under a given performance overhead bound, and compare it with the efficiencies of full duplication and hot-path duplication. The SDC coverage of full duplication is assumed to be a hundred percent [23].

5.3 Work Flow and Implementation
Figure 7 shows the workflow for estimating the overall SDC rates and providing configurable protection using SDCTune. The …
Training programs and testing programs: see Table 5 and Table 6 above.
![Page 24: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/24.jpg)
SDCTune: Evaluation Method
Training phase
• Features extracted based on heuristic knowledge from training programs
• SDC rate for each instruction, P(SDC|I), from training programs
• Training (CART method)
• P(SDC|I) predictor
Testing and using phase (usage phase and evaluation phase)
• Features extracted from testing programs
• Estimate the SDC proneness of different program instructions
• Find the set of instructions for an overhead bound (∑P(I))
• Random fault injection results from testing programs
• Actual SDC coverage for testing programs
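The step of finding a set of instructions for an overhead bound can be sketched as a greedy selection: rank instructions by predicted SDC proneness and add them while the budget allows. This is an illustrative sketch under assumptions, not SDCTune's implementation: the `Candidate` struct, its field names, the integer overhead units, and the choice to skip an instruction that does not fit and keep scanning are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One candidate instruction (all fields are illustrative assumptions). */
typedef struct {
    int id;            /* instruction identifier */
    double sdc_prob;   /* predicted P(SDC | I) */
    int overhead_pct;  /* estimated overhead of protecting it, in percent */
    int selected;      /* set to 1 if chosen for protection */
} Candidate;

/* Sort helper: highest predicted SDC proneness first. */
static int by_sdc_desc(const void *a, const void *b)
{
    const Candidate *x = a;
    const Candidate *y = b;
    return (x->sdc_prob < y->sdc_prob) - (x->sdc_prob > y->sdc_prob);
}

/* Greedily select instructions until the overhead bound is reached.
 * Returns the total overhead of the selected set, in percent. */
int select_instructions(Candidate *c, size_t n, int bound_pct)
{
    int used = 0;
    qsort(c, n, sizeof *c, by_sdc_desc);
    for (size_t i = 0; i < n; i++) {
        c[i].selected = 0;
        if (used + c[i].overhead_pct <= bound_pct) {
            c[i].selected = 1;
            used += c[i].overhead_pct;
        }
    }
    return used;
}
```

With a 10% bound and candidates costing 6, 5, 2, and 3 percent in decreasing order of proneness, this sketch picks the first and third and spends 8%.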
![Page 25: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/25.jpg)
SDCTune: Model Validation

| | Training programs | Testing programs |
|---|---|---|
| Rank correlation* | 0.9714 | 0.8286 |
| P-value** | 0.00694 | 0.0125 |

[Figure: scatter plot of the rank of overall SDC rates by estimation (y-axis) against the rank of overall SDC rates by fault injection experiment (x-axis), for training programs and testing programs]
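The slide's footnote defining the rank correlation did not survive extraction. Assuming it is Spearman's coefficient, a common choice for comparing two rankings, it can be computed as below for rankings without ties. The function name and the example ranks in the note that follows are hypothetical and are not the ranks from the experiment.

```c
#include <assert.h>
#include <stddef.h>

/* Spearman rank correlation for two rankings of n items without ties:
 * rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)),
 * where d_i is the difference between the two ranks of item i. */
double spearman_rho(const int *rank_a, const int *rank_b, size_t n)
{
    assert(n > 1);
    double sum_d2 = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = (double)(rank_a[i] - rank_b[i]);
        sum_d2 += d * d;
    }
    double nn = (double)n;
    return 1.0 - 6.0 * sum_d2 / (nn * (nn * nn - 1.0));
}
```

For six programs, swapping each adjacent pair of ranks (1,2,3,4,5,6 against 2,1,4,3,6,5) gives a squared-difference sum of 6 and a coefficient of about 0.83, so values near 1 mean the estimated ordering closely follows the measured one.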
![Page 26: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/26.jpg)
SDCTune: SDC Coverage

| Overhead | Coverage (training programs) | Coverage (testing programs) |
|---|---|---|
| 10% | 44.8% | 39% |
| 20% | 78.6% | 63.7% |
| 30% | 86.8% | 74.9% |
![Page 27: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/27.jpg)
SDCTune: Full Duplication and Hot-Path Duplication Overheads
Full duplication overhead: 53.7% to 73.6%
Hot-path duplication overhead: 43.5% to 57.6%

Normalized Detection Efficiency

| | 10% overhead | 20% overhead | 30% overhead |
|---|---|---|---|
| Training programs | 2.38 | 2.09 | 1.54 |
| Testing programs | 2.87 | 2.34 | 1.84 |
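Detection efficiency is the ratio of SDC coverage to performance overhead. The sketch below computes that ratio and, as an assumption about what "normalized" means on this slide, divides it by the efficiency of a baseline such as full duplication. The function names are hypothetical, and because the reported figures are averaged per benchmark, feeding in the averaged coverage numbers will not reproduce the table exactly.

```c
#include <assert.h>

/* SDC detection efficiency: coverage divided by performance overhead.
 * Both arguments are percentages. */
double detection_efficiency(double coverage_pct, double overhead_pct)
{
    assert(overhead_pct > 0.0);
    return coverage_pct / overhead_pct;
}

/* Efficiency relative to a baseline technique (assumed normalization). */
double normalized_efficiency(double coverage_pct, double overhead_pct,
                             double base_coverage_pct, double base_overhead_pct)
{
    return detection_efficiency(coverage_pct, overhead_pct)
         / detection_efficiency(base_coverage_pct, base_overhead_pct);
}
```

For example, a detector with 50% coverage at 10% overhead, compared against a baseline with 100% coverage at 50% overhead, has a normalized efficiency of 2.5.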
![Page 28: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/28.jpg)
EDCs and SDCTune: Summary
• Software-level techniques for tunable and selective protection from EDCs and SDCs [DSN’13][DSN’14][CASES’14][TECS1][TECS2]
• Completely automated – no programmer intervention or annotations are needed
• Significant efficiency gain over full duplication
![Page 29: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/29.jpg)
Outline
• Motivation
• Techniques developed by my group [DSN’13][CASES’14]
• A brief history of software techniques
• Adoption in Industry
• Research opportunities and roadmap
![Page 30: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/30.jpg)
History of S/W techniques: Pre-2000
• Long history of software techniques for high-reliability systems going back to IBM MVS, Tandem Guardian
– Relied on architectural support from the hardware
– Assumed software was written in transactional style
• Algorithm-Based Fault Tolerance, 1984 [Huang and Abraham]: specialized applications in linear algebra
• Many control-flow checking techniques from the 1980s
– Only protected the program’s control-flow instructions
![Page 31: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/31.jpg)
History of S/W techniques: 2000-2005
• Soft error problem [Sun Server – Baumann 2000]
• ARGOS project from Stanford (McCluskey, 2001)
– EDDI – software-based instruction duplication
– CFCSS – lightweight control-flow checking
• Reliability and Security Engine (RSE) from UIUC (2004)
– Targeted checking of application properties at runtime
• SWIFT from Princeton (2005)
– Low-overhead checking through compiler optimizations
• First SELSE workshop launched (2005)
– Focus on the entire system spanning software and hardware
![Page 32: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/32.jpg)
SELSE Papers (2009-2017)
Data unavailable for years 2005 to 2008. Based on title and abstracts only.
[Figure: stacked bar chart "Papers at SELSE 2009-2017" (source: SELSE website); x-axis: year, 2009 to 2017; y-axis: share of papers, 0% to 100%; series: Software, Hardware, Both]
![Page 33: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/33.jpg)
History of S/W techniques: 2010-today
• Cross-Layer Resilience becomes a buzzword: many groups working on this problem, including ours
• Multiple domains: HPC systems, embedded systems
• Calls from different funding agencies (DoE, NSF, etc.) for Cross-Layer Resilience techniques – white papers
• Conjoined twin: Approximate Computing takes off
– Papers at top PL/architecture conferences
![Page 34: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/34.jpg)
Outline
• Motivation
• Techniques developed by my group [DSN’13][CASES’14]
• A brief history of software techniques
• Adoption in Industry
• Research opportunities and roadmap
![Page 35: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/35.jpg)
What about Software Researchers?
• Papers in the top software engineering/testing/reliability conferences about hardware faults and errors over the last 10 years (2006 onwards)
– ICSE: 5 papers (IEEE DL)
– FSE: 6 papers (ACM DL)
– ASE: 7 papers (ACM DL)
– ISSTA: 3 papers (ACM DL)
– ICST: 2 papers (IEEE DL)
– ISSRE: 10 papers (IEEE DL)
– Total: 33 out of over 3000 papers (about 1%)
![Page 36: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/36.jpg)
Example conversations with Software Developers in Industry
• Developer 1 (large s/w company you’ve heard of)
• Me: How do you handle hardware faults?
• D1: Do these even occur in the real world?
• Me: Showing him data gathered by his own company on h/w faults
• D1: Hmmm… sounds like a problem for QA folks. We don’t deal with faults.
• Tester 1 (large s/w-h/w company you’ve heard of)
• Me: How do you handle hardware faults?
• T1: Our hardware folks put in various mechanisms such as ECC memory to mask these
• H/w guy: Not really, we don’t handle everything
• T1: Oh, well – that’s not part of our requirements doc. Maybe if we meet our bug targets...
![Page 37: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/37.jpg)
Software Developers
• Most software developers (and testers) ignore hardware faults, or assume faults will be handled by hardware (e.g., ECC memory)
• Even if they recognize the importance of the problem, many think it’s not their problem
– QA or testing people should take care of it
– Not part of the requirements/specification document
![Page 38: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/38.jpg)
Should we care about developers?
Ultimately, developers are the ones who drive adoption and assimilation within the broader software ecosystem
![Page 39: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/39.jpg)
Barriers to Adoption: Possible Reasons
• Reason 1: Software developers don’t care about anything to do with hardware
• Reason 2: Too much time and effort – many other priorities in software development
• Reason 3: Lack of high-level abstractions
• Reason 4: No easy-to-use tools that integrate with the software development workflow
![Page 40: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/40.jpg)
Barriers to Adoption: Reason 1
• Software engineers don’t care about anything to do with hardware
• Not true. Many counter-examples:
– Parallelism, both coarse and fine-grained
– Cache-conscious data structures and algorithms
– Energy efficiency and energy-aware programming
– Determinism, memory models, etc.
![Page 41: Tolerang Hardware Faults In Commodity So8ware: Problems ...blogs.ubc.ca/karthik/files/2017/03/SELSE17-keynote.pdf · Tolerang Hardware Faults In Commodity So8ware: Problems, Solu’ons,](https://reader035.vdocument.in/reader035/viewer/2022062920/5f021f667e708231d402aee5/html5/thumbnails/41.jpg)
Barriers to Adoption: Reason 2
• Too time- and effort-consuming: many other priorities in software development
• Partially true, but not always
– Many other time-consuming activities are already in use, e.g., continuous testing, static analysis
– Techniques for tolerating hardware faults don’t need to be time-consuming or effort-intensive
Barriers to Adoption: Reason 3
• Lack of high-level abstractions
• My experience: Mostly true
– Developers want to reason with quantities they’re familiar with (e.g., runtime, defect rates, etc.), not necessarily things like FIT rates, or even coverage
– Need to be able to reason about cost-benefit tradeoffs of different techniques at the abstract level without going into details
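One way to picture the gap between FIT rates and developer-familiar quantities is a small conversion. The sketch below is illustrative only and is not from the talk: it relies on the standard definition of 1 FIT as one failure per 10^9 device-hours, and the function names and the fleet numbers are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years
FIT_HOURS = 1e9            # 1 FIT = 1 failure per 10^9 device-hours


def expected_failures(fit_rate: float, devices: int, hours: float) -> float:
    """Expected number of failures across a fleet over the given operating hours."""
    return fit_rate * devices * hours / FIT_HOURS


def mtbf_hours(fit_rate: float, devices: int = 1) -> float:
    """Mean time between failures, in hours, seen across the whole fleet."""
    return FIT_HOURS / (fit_rate * devices)


if __name__ == "__main__":
    # Hypothetical numbers, chosen only for illustration.
    fit, fleet = 1000.0, 1000
    print("expected failures per year:", expected_failures(fit, fleet, HOURS_PER_YEAR))
    print("fleet-wide MTBF in hours:", mtbf_hours(fit, fleet))
```

Reporting "failures per year across our deployment" rather than a raw FIT number is closer to the defect-rate style of reasoning mentioned above, although it is still only one possible abstraction.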
Barriers to Adoption: Reason 4
• No easy-to-use tools that integrate with the software development workflow
• My experience: One of the main reasons
– Need to understand software developers’ workflow and integrate with it, with no deviation
– Must be able to handle legacy code, weird setups, and multiple languages and libraries
– S/W maintenance accounts for 60% of costs
What we got right (in my opinion)
• Automated workflow, so little to no effort on the part of the programmer was needed
• Abstraction in terms that ordinary programmers can understand (e.g., performance, coverage)
– Can do better on this front though
• Use of popular open-source tool (LLVM), which is easy to integrate with workflow (in theory)
What we got wrong (in my opinion)
• Our abstraction was still too low level
– Many programmers didn’t understand coverage
• Legacy code: LLVM can’t compile old code, inline assembly, customized build systems
• Did not have an easy path to QA testing or software maintenance in our long-term plan
Outline
• Motivation
• Techniques developed by my group [DSN’13][CASES’14]
• A brief history of software techniques
• Adoption in Industry
• Research opportunities and roadmap
Open Challenges
• Need to build software-based techniques that ordinary programmers can reason about
• Need software techniques that can integrate seamlessly with overall software workflow
– Should not impede QA and testing process
– Software maintenance should be considered
– Legacy code and build systems (if relevant)
The Opportunity: Everett Rogers’ Model for Disruptive Innovation
We are still in the innovator/early adopter stage - need to cross the chasm and move to “early majority”
Potential Research Roadmap
Understand the issues facing software developers and testers in adopting software techniques for tolerating hardware faults
Build techniques to address these issues in a common research framework for the community to avoid duplication of effort
Have developers use the framework in actual practice and identify the issues that come up during the use of the framework
Conclusions
• Software techniques for tolerating hardware faults in commodity systems have mushroomed
– Can be tuned based on the needs of the application
– Can offer significant efficiency over full duplication
• Unfortunately, the techniques have seen limited adoption in industry & by software community
– We, in the SELSE community, all need to address this
– Emphasis should be on complete software life-cycle
– Take our message to the software engineering venues
Acknowledgements
My graduate students (7 current, 10 graduated): Anna Thomas, Qining Lu, Layali Rashid, Majid Dadashi, Bo Fang, Jiesheng Wei, Frolin Ocariza, Kartik Bajaj, Shabnam Mirshokraie, Xin Chen, Farid Tabrizi, Saba Alimadadi, Sheldon Sequira, Nithya Murthy, Abraham Chan, Maryam Raiyat, Justin Li
Collaborators and funding agencies (selected ones below)
http://blogs.ubc.ca/karthik/