eee499 -real-time embedded system design
TRANSCRIPT
EEE499- Real-TimeEmbeddedSystemDesign
SoftwareFaultTolerance
Failures
• whenthebehaviorofasystemdeviatesfromthatofitsspecification.
• normallyassociatedwiththenotionthatsomethingwhichoncefunctionedproperly,nolongerdoesso.
• inthisclassical sense,softwaredoesnottrulyfail;ifthesoftwaredoesthewrongthing,itwillalwaysdothewrongthingunderidenticalcircumstances
Failures&Specifications
• traditionally(software)requirementsareviewedasapositive specificationofasystem
• anattempttodefinethefailuresofasystemcanbeviewedasanegative specificationofasystem
• whilebotharelikelyessential,thelatterisrarelyformallyidentifiedinprojects– systemengineerswilloftenneedthefailuredefinitions
Faults
• anunsatisfactoryconditionorstate– actions,timing,sequenceoramount
• randomfaults - inthecontextofabove,failuresarerandomfaultswhichoccurwhenacomponentbreaksinthefield.Randomfaultsonlyoccurinphysicalentities.
• systematicfaults - areintrinsicinthedesignorimplementationofacomponent.Errorsorsoftwaredefects(bugs)aresystematicfaults.
Faults(cont’d)
• hardware- faultscanberandomorsystematic– traditionalbath-tubcurvephenomenon
• software- faultsarealwayssystematic– softwareneitherbreaksnorwearsout– allsoftwarecontainslatentdefects(bugs)– estimates:• 20,000defectsperMillionLOC• 90%foundintesting• 200foundinearlylife• 1800latentdefects(0.18%defectrate)
FailureModes
• theclassificationofasystem’sfailuresaccordingtotheimpacttheyhaveontheservicesitdelivers.
• softwarefailuremodes:– valuedomain– timingdomain- early,omission,late– arbitrary(combinationofvalueandtiming)
FailureTypes
• singlepointfailures - thefailureofasystemduetoasinglecomponentfailure.– h/w: carbrakemastercylinder– s/w:?
• commoncausefailures - thefailureofmorethanonecomponentduetoacommoneventorcause.– h/w: powersupplyfailure– s/w:?
FaultPrevention
• faultavoidance - limittheintroductionoffaultycomponentsduringconstruction– rigorous(formal)specifications– provendesignmethodologies– properselectionofprogramminglanguage– engineering/developmentenvironments(IDEs)
• faultremoval - findingandremovingcausesoferror– reviews,inspections,testing,verification,validation
• butwhatiffaultscannotbeprevented?
FaultTolerance
• theabilityofasystemtocontinuefunctioningeveninthepresenceoffaults.– fullfaulttolerance - systemcontinuestooperatewithnosignificantlossoffunctionality(foralimitedtime)
– gracefuldegradation - systemcontinuestooperatebutinadegradedmode
– failsafe - systemtransitionstoasafestatepriortoshut-downorasaresultofaparticularfault
Redundancy
• inordertoachievefaulttoleranceinsoftwaresomedegreeofredundancy isrequired.– {Inthissense,thedefinitionofredundancyextendstoincludeadditionalcodethatwouldnotberequiredfornormaloperation}
• goal- minimizeredundancy,maximizeR(t)• paradox- toomuchmaydecreaseR(t)
StaticRedundancy
• redundantcomponentsareusedtomaskerrors• h/w:NModularRedundancy(NMR)
• s/w:N-versionprogramming– Nredundantversionsofthesamemoduleunderacommoncontrollerandavotingscheme
StaticRedundancy
• assumptions– complete,consistent,unambiguousspecification– independentfailuremodes• separatedevelopmentteams,processors,languages,fault-tolerantcommunicationlines,…
• issues(problems)– specificationsareamajorcontributortodefects– independenceisoftennotareasonableassumption
– $$$
DynamicRedundancy
• redundantcomponentsusedtodetecterrors• h/w:dataparitybitchecks
• s/w:recoveryblocks• phases– errordetection– damageassessment/control– errorrecovery– faulttreatment/continue
RP1
RP2
RP3
recovery points
Process P1
DynamicRedundancy
• assumptions– alternatemodulesonlyexecutebaseduponerrordetection
– stilllargelydependentuponthespecification– knownandunknownfailuremodesmay behandled
• issues(problems)– backwarderrorrecoverycan’talwaysundodamage– $$
AFewFaultTolerancePatterns
• HomogeneousRedundancyPattern– protectsagainsthardware(random)failuresonly
• DiverseRedundancyPattern– protectsagainstsystematic&randomfaults
• MonitorActuatorPattern– specializationofdiverseredundancy
• WatchdogPattern
HomogeneousRedundancyPattern
HomogeneousRedundancyPattern
DiverseRedundancyPattern
Monitor-ActuatorPattern
Monitor-ActuatorPattern
WatchdogPattern
WatchdogPattern
Can be hardware or software
Reliability
• reliability,R(t)- theprobabilitythat,whenoperatingunderstatedenvironmentalconditions,asystemwillperformitsintendedfunctionadequatelyforaspecifiedintervaloftime.
• ameasureofthesuccesswithwhichasystemconformstosomeauthoritativespecificationofitsbehavior
• mostfrequenthardwaremetric- MTBF– failurerateismoreuniversalinsoftware
Reliability&Time• asstatedinitsdefinition,reliabilityisevaluatedonatimeinterval– R(t=6hours)=95%
• intermsofsoftwarereliability,wemustdistinguishthreeformsoftime:– calendartime- standardelapsedtime– clocktime- thetimeaprocessorisactuallyrunningduringanycalendartime
– executiontime- thetimeanysoftwareprogramisexecutingonthegivenprocessor
Reliability&Environment
• again,asstatedinitsdefinition,reliabilityisafunctionoftheoperationalenvironment– morecommonlyusedisasystem’soperationalprofile
• anoperationalprofileofapieceofsoftwareisarepresentationofthevariousinputstatesandtheirrespectiveprobabilities– oftentheseprofilescanbefurtherdividedalongmodesofoperation
– forexampleanapplicationhasadifferentprofileduringstart-upmodethaninfulloperation
• testcases/testdataisselectedaccordingly
ReliabilityDefinition
• reliability,R(t)- theprobabilitythat,whenoperatingunderstatedenvironmentalconditions,asystemwillperformitsintendedfunctionforaspecifiedintervaloftime.
• reliabilityisevaluatedonatimeinterval– R(t=6hours)=95%
• thus,reliabilityisafunctionofsomefailurerate(intensity)andadesiredtimeinterval
UsesofReliability
• ameasureofdevelopmentstatus– givenareliabilitygoal(asetfailurerate),youmaydetermineremainingschedule,costs,tests,etcpriortoreleaseofthesoftware
• tomonitoroperationalperformance– isthereliabilitysuchthatmajormaintenanceisjustified
• providesinsightintothesoftwareproduct(quality)andthedevelopmentprocess
QuantifyingComponentReliability
• GivencomponentsfromaHomogeneousPoissonProcesswithafailureintensity(λ),thenreliabilitycanbeexpressedas:
R(T)=e- λTR
elia
bilit
y
Time
1.0reliability decreases exponentially with “mission” time
ComponentReliability- Example1
• Assumethatacapacitorhasbeentestedanddeterminedtohaveamean-time-between-failure(MTBF)of200hours,– whatistheprobabilitythatthecapacitorwillnotfailduring8hoursof(normal)useinthelab?
Solution:
λ = 1/ MTBF = 0.005 failures per hour
R(T=8 hours) = e -(0.005)(8)
= 0.961 or 96.1%
ComponentReliability- Example2
• Givenanunmannedspaceprobewitharequirementtooperatefailurefreefora25yearmissionwithaprobabilityof95%,– whatistherequiredsystemMTBF?
Solution:
λ = - ln (R(T)) / T
= -ln(0.95) / [(25)(365)(24)]
= 0.0000002 failures per hour
MTBF = 5,000,000 hours!
SoftwareFailureIntensity
• Asmostsoftwarereliabilityvaluesarestatedintermsofexecutiontime,weneedtobeabletoconvertthesevaluestoclocktime:– ifwedefine:• executiontimeasλτ• clocktimeasλt,
– then, λt=ρcλτ
– where, ρc representsaverageutilization
SoftwareFailureIntensity- Example
• Givenaperiodictaskwithanestimatedonehour(executiontime)reliabilityof99.5%.Inaddition,weknowthetaskrunsevery200msecwithanaveragecomputationtimeof2000µsec.– determinetheclock-time-basedfailureintensityλτ = - ln(R(τ=1))/τ = - ln(0.995)/1 = 0.005 f/hr
λt = ρc λτ = (0.01) (0.005) = 0.00005 f/hr
SystemReliability- SerialSystems
• Rsys =Π Ri forallicomponents
• Example– GivenRA =90%,RB =97.5%,RC =99.25%– Rsys =(.9)(.975)(.9925)=87.1%
*Assumes that all components are reliability-wise independent.
A B C
SerialSystems- Example
• Givenaspaceprobewith10,000identicalcomponentsanda25year95%reliabilityrequirement:– whatistherequiredcomponentfailurerate?
c1 c2 c10000•••
Solution:
RC = (Rsys)1/10000 = 0.999995
λc = -ln(.999995)/[(25)(365)(24)] = 2.28 E-11
• DefinetheprobabilityoffailureasQi=1- Ri,then
• Qsys =ΠQi
– itfollowsthatRsys =1– Qsys,or
• Rsys =1-Π (1- Ri)
SystemReliability- ParallelSystems
A
B
C
§ Example§ Given RA = 90%, RB = 97.5%, RC = 99.25%§ Rsys = 1 - (.1)(.025)(.0075) = 99.998%
• aspecialcaseoftheparallelsystemconfigurationwhereinitisrequiredthatonlykoutofnidenticalcomponentsareneededforsuccess
• Rsys =Σ [ ]Rci (1- Rc)n-i
SystemReliability- koutofn
A
A
Anii=k
n
§ where “n choose k”, [ ] = n! / (k!(n-k)!) nk
• Example– assumethatthereliabilityofc1 is90%andyouneedatleasttwoforthesystemtofunction
SystemReliability- koutofn
c1
c1
c1
solution:
Rsys = Σ [ ] Rci (1 - Rc)3-i
= [ ] Rc2 (1 - Rc) + [ ] Rc
3 (1 - Rc)0
= (3)(.81)(.1) + (1)(.729)(1)
= 97.2%
3ii=2
3
32
133
Exercise- SystemComponents• ConsideranEWsystemmadeupofthefollowing
components:– 2touchdisplays,onlyoneofwhichneedfunctionforthesystemto
function– eachtouchdisplaycontainsidentical software(meaningthatyou
shouldtreatthes/wasastand-alonemodule)– onesystemprocessorandthesystemintegrationsoftware– anECMwithembeddedjammersoftware
• Assumethatallhardwareutilizationis100%– inotherwordsallhardwarefunctionsfortheentiremissionduration
Exercise- ComponentData
Component Failure Rate(execution hours)
Utilization R(t=4)
TD - - .95
TD s/w 0.002 0.65 ?
SP - - .98
SP s/w .010 .95 ?
ECM - - .925
ECM s/w .012 .35 ?
Exercise- Questions
• a)Determinetheoverallsystemreliabilityfora4hourmission?
• b)Whatistheprobabilityofhavingatleastonefunctionaldisplayfora2hourmission?
• c)Whatistheweakestlinkinthesystem?Whatcouldbedonetoimprovetheoverallsystemreliability(assumingthatyoucannotsignificantlychangethecomponentreliabilitieswithoutseriousredesign).DrawthenewreliabilityblockdiagramandfindRsys
Exercise- Questions(continued)
• d)Assumethatwenowhave4equivalentECMsubsystems(combinedhardwareandsoftware)andthatforthesystemtobefunctionalany3ofthe4mustbefunctional.
• Drawthenewreliabilityblockdiagramandcalculatethesystemreliability.
References
[1]BurnsandWellings,“Real-TimeSystemsandProgrammingLanguages”,Chap5.
[2]Douglass,“DoingHardTime:DevelopingReal-TimeSystemswithUML,Objects,Frameworks,andPatterns”,Chap3.
[3]Musa,J.D.,Iannino,A.,Okumoto,K.“SoftwareReliability- Measurement,Prediction,Application”,Chapter4,McGraw-Hill1987.