7/25/2019 RTFT15 Unit 3
UNIT 3
Fault Tolerance Techniques
Introduction, failure causes, fault types, fault detection, fault and error containment, redundancy, data diversity, reversal checks, malicious or Byzantine failures
Coding techniques
Software fault tolerance
Network fault tolerance
Fault Tolerance
Definition
Fault tolerance refers to a system's ability to deal with malfunctions.
Fault-tolerant systems are, ideally, systems capable of executing their tasks correctly regardless of either hardware failures or software errors.
Real Time and Fault Tolerance
Failure Causes
There are three causes of failure:
• Errors in the specification or design
• Defects in the components
• Environmental effects
Fault Types
Faults are classified according to their behavior, categorized into:
1. Temporal behavior
2. Output behavior
A fault is said to be active when it is physically capable of generating errors and benign when it is not.
Temporal Behavior
Transient faults:
These occur once and then disappear.
Intermittent faults:
Intermittent faults are characterized by a fault occurring, then vanishing again, then reoccurring, then vanishing.
Permanent faults:
This type of failure is persistent: it continues to exist until the faulty component is repaired or replaced.
Fault Detection
Definition
There are two ways to determine that a processor is malfunctioning:
1. Online
2. Offline
• Online detection goes on in parallel with normal system operation.
• One way of doing this is to check for any behavior that is inconsistent with correct operation.
A monitor (called a watchdog processor) is associated with each processor, looking for signs that the processor is faulty.
The watchdog processor watches the data and address lines, as shown in the figure.
A second approach is to have multiple processors, which are supposed to put out the same result, and compare the results.
A discrepancy indicates the existence of a fault.
Figure: Online Detection using Watchdog Processor
The following actions are indicative of a faulty processor:
• Branching to an invalid destination
• Fetching an opcode from a location containing data
• Writing into a portion of memory to which the process has no write access
• Fetching an illegal opcode
• Inactivity for more than a prescribed period
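The last symptom in the list, inactivity beyond a prescribed period, is commonly checked with a timeout. The following is a minimal sketch only: the `Watchdog` class, its method names, and the use of explicit timestamps are assumptions made for illustration and do not come from the slides.

```python
class Watchdog:
    """Flags a monitored processor as suspect if it stays silent too long.

    Hypothetical helper: times are passed in explicitly so the example
    is deterministic.
    """

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout      # prescribed maximum period of inactivity
        self.last_kick = now        # time of the most recent sign of life

    def kick(self, now):
        """Called by the monitored processor to signal that it is active."""
        self.last_kick = now

    def expired(self, now):
        """True if the processor has been inactive for more than the timeout."""
        return (now - self.last_kick) > self.timeout


wd = Watchdog(timeout=5.0)
wd.kick(now=2.0)
print(wd.expired(now=6.0))   # False: only 4 time units since the last kick
print(wd.expired(now=8.0))   # True: 6 time units of inactivity
```

A real watchdog processor would also check the other symptoms (invalid branches, illegal opcodes, access violations), which need hardware visibility that this sketch does not model.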
Offline detection consists of running diagnostic tests.
Not runnable: when a processor is running such a test, it obviously cannot be executing the application software.
Diagnostic tests can be scheduled just like ordinary tasks.
The greater the failure rate, the greater must be the frequency with which these tests are run.
Fault and Error Containment
When a fault or error occurs in one part of the system, it can, if unchecked, spread through the system like an infectious disease.
A fault in one part of the system might, for example, cause large voltage swings in another; a fault-free processor can put out erroneous results as a result of using erroneous input from a faulty unit.
Faults and errors must therefore be prevented from spreading through the system. This is called fault and error containment.
The system is divided into:
Fault Containment Zones (FCZ):
An FCZ is a subset of the system that operates correctly despite arbitrary logical or electrical faults outside the subset.
That is, the failure of some part of the computer outside an FCZ cannot cause any element inside that FCZ to fail.
Error Containment Zones (ECZ):
The function of an ECZ is to prevent errors from propagating across zone boundaries. This is typically achieved by voting on redundant outputs.
Redundancy
Two types: static (or masking) and dynamic redundancy.
Static: redundant components are used inside a system to hide the effects of faults; e.g., Triple Modular Redundancy (TMR): 3 identical subcomponents and majority voting circuits; the outputs are compared and if one differs from the other two, that output is masked out.
Dynamic: redundancy supplied inside a component which indicates that the output is in error; provides an error detection facility; recovery must be provided by another component.
E.g., communications checksums and memory parity bits.
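The masking behavior of TMR can be illustrated with a bitwise 2-out-of-3 majority function. This is a sketch under the assumption that each subcomponent output is an integer word; the function name `tmr_vote` is made up for the example.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of three redundant output words."""
    return (a & b) | (b & c) | (a & c)


good = 0b1011_0010
faulty = good ^ 0b0000_1000        # one module has a flipped bit
print(tmr_vote(good, faulty, good) == good)   # True: the faulty output is masked
```

Each output bit takes the value that at least two of the three modules agree on, so a single faulty module cannot change the voted result.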
N-Modular Redundancy
N-modular redundancy (NMR) is a scheme for forward error recovery.
It works by using N processors instead of one, and voting on their output. N is usually odd.
The figure illustrates this scheme for N = 3.
One of two approaches is possible.
In design (a), there are N voters and the entire cluster produces N outputs. In design (b), there is just one voter.
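A single voter, as in design (b), might be sketched as follows. The function name `nmr_vote` and the choice to raise an error when no strict majority exists are assumptions for illustration.

```python
from collections import Counter


def nmr_vote(outputs):
    """Return the value produced by a strict majority of the N modules.

    Raises ValueError when no value has a strict majority.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among module outputs")
    return value


print(nmr_vote([42, 42, 17]))       # 42 (N = 3, one faulty module)
print(nmr_vote([7, 7, 7, 9, 8]))    # 7  (N = 5, two faulty modules)
```

With N odd, a strict majority of (N + 1) / 2 correct modules is enough to mask the remaining faulty ones, which is one reason an odd N is the usual choice.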
Figure: N-Modular Redundancy
Software Redundancy
The system is provided with different software versions of a task, written independently by different teams of programmers.
If one version of a task fails under certain inputs, another version can be used.
Software Redundancy
• N-Version Programming
• Recovery Block Approach
N-Version Programming
The N-version software concept attempts to parallel the traditional hardware fault tolerance concept of N-way redundant hardware.
In an N-version software system, each module is made with up to N different implementations. Each variant accomplishes the same task, but hopefully in a different way.
Each version then submits its answer to a voter or decider, which determines the correct answer.
This system can hopefully overcome the design faults present in most software by relying upon the design diversity concept.
An important distinction in N-version software is the fact that the system could include multiple types of hardware using multiple versions of software.
The goal is to increase the diversity in order to avoid common mode failures.
Using N-version software, it is encouraged that each different version be implemented in as diverse a manner as possible, including different tool sets, different programming languages, and possibly different environments.
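As a rough illustration of N-version programming with a decider, the sketch below runs three versions of an integer square root and accepts the majority answer. All function names are hypothetical, and the third version contains a deliberately planted design fault. In practice the versions would be developed by separate teams rather than written side by side.

```python
import math
from collections import Counter


def isqrt_library(n):        # version 1: standard library routine
    return math.isqrt(n)


def isqrt_newton(n):         # version 2: Newton's iteration on integers
    if n < 2:
        return n
    x = n
    y = (x + 1) // 2
    while y < x:
        x = y
        y = (x + n // x) // 2
    return x


def isqrt_faulty(n):         # version 3: deliberate design fault at n == 16
    return int(n ** 0.5) + (1 if n == 16 else 0)


def decide(answers):
    """Decider: accept the answer given by a strict majority of versions."""
    value, count = Counter(answers).most_common(1)[0]
    if count * 2 <= len(answers):
        raise RuntimeError("versions disagree; no majority answer")
    return value


def n_version_isqrt(n):
    versions = (isqrt_library, isqrt_newton, isqrt_faulty)
    return decide([v(n) for v in versions])


print(n_version_isqrt(16))   # 4: the faulty version's answer (5) is outvoted
```

The scheme only helps if the versions fail on different inputs. If two versions share the same design fault, the decider will accept the wrong answer, which is the common mode failure the slide warns about.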
Recovery Block Approach
The recovery block operates with an adjudicator, which confirms the results of various implementations of the same algorithm.
In a system with recovery blocks, the system view is broken down into fault-recoverable blocks.
The entire system is constructed of these fault-tolerant blocks. Each block contains at least a primary, secondary, and exceptional case code along with an adjudicator.
The adjudicator is the component which determines the correctness of the various blocks to try.
Upon first entering a unit, the adjudicator first executes the primary alternate.
If the adjudicator determines that the primary block failed, it then tries to roll back the state of the system and tries the secondary alternate.
If the adjudicator does not accept the results of any of the alternates, it then invokes the exception handler, which then indicates the fact that the software could not perform the requested operation.
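The control flow described above might look roughly like this. It is a sketch, not a reference implementation: `recovery_block`, the dictionary-based state, and the sorting example are all assumptions chosen to keep the code short.

```python
import copy


def recovery_block(state, alternates, acceptance_test):
    """Try each alternate in order; roll back state after a rejected attempt.

    `state` is a dict, `alternates` a list of functions that modify a state
    dict in place, and `acceptance_test` plays the adjudicator's role.
    """
    checkpoint = copy.deepcopy(state)              # recovery point on entry
    for alternate in alternates:
        try:
            alternate(state)
            if acceptance_test(state):
                return state                       # result accepted
        except Exception:
            pass                                   # a crash counts as a failure
        state.clear()
        state.update(copy.deepcopy(checkpoint))    # roll back before next try
    raise RuntimeError("all alternates rejected")  # exception handler path


def primary(s):      # faulty primary: produces an unsorted result
    s["data"] = list(reversed(s["data"]))


def secondary(s):    # simpler secondary alternate
    s["data"] = sorted(s["data"])


def is_sorted(s):
    d = s["data"]
    return all(d[i] <= d[i + 1] for i in range(len(d) - 1))


result = recovery_block({"data": [3, 1, 2]}, [primary, secondary], is_sorted)
print(result["data"])   # [1, 2, 3]
```

Unlike N-version programming, the alternates run one after another and only when needed, so the cost in the fault-free case is one execution plus the acceptance test.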
Figure: Software Redundancy Structures
Time Redundancy
Achieves fault tolerance by performing an operation several times.
Timeouts and retransmissions in reliable point-to-point and group communication are examples of time redundancy.
This form of redundancy is useful in the presence of transient or intermittent faults. It is of no use with permanent faults.
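A retry loop is the simplest form of time redundancy. In the sketch below, the `retry` helper and the `FlakyLink` class are hypothetical; the link simulates a transient fault by failing a fixed number of times before succeeding.

```python
def retry(operation, attempts=3):
    """Repeat `operation` until it succeeds or `attempts` is exhausted."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except IOError as err:      # treat I/O errors as possibly transient
            last_error = err
    raise last_error                # persistent failure: retries did not help


class FlakyLink:
    """Hypothetical channel that fails a fixed number of times, then works."""

    def __init__(self, failures):
        self.failures = failures

    def send(self):
        if self.failures > 0:
            self.failures -= 1
            raise IOError("timeout")
        return "ack"


print(retry(FlakyLink(failures=2).send))   # ack (succeeds on the third attempt)
```

If the link never recovers, every attempt fails and the error is re-raised, which matches the point that repetition is of no use against permanent faults.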
Time Redundancy
1. Recovery Points
2. Backward Error Recovery
Information Redundancy
The basic idea of information redundancy is to provide more information than is strictly necessary and to use that extra information to check for errors.
We use coding all the time ourselves, while correcting for typographical errors.
For example, if we encounter the word "startegic," we will most likely unconsciously correct it to "strategic." This was possible because (a) there is no such word as "startegic," and (b) "strategic" is the closest word that we can think of to "startegic."
The conditions (a) and (b) are at the basis of all coding theory.
All computer words are strings of 0s and 1s. Coding ensures that not all strings of 0s and 1s are legal (i.e., are valid).
When assessing a coding scheme, we want to know how many extra bits it adds to the words, and how many bit errors it can detect or correct.
We are interested in how much work it takes to encode.
Information Redundancy Structures
• Repetition codes
• Parity coding
• Checksum codes
• Cyclic redundancy check
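Of the schemes listed, parity coding is the easiest to show in a few lines. The sketch below assumes even parity over a list of bits; the function names are made up for the example.

```python
def add_even_parity(bits):
    """Append one parity bit so the total number of 1s is even."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """A received word is legal only if it contains an even number of 1s."""
    return sum(word) % 2 == 0


word = add_even_parity([1, 0, 1, 1])    # -> [1, 0, 1, 1, 1]
print(parity_ok(word))                  # True

corrupted = word.copy()
corrupted[2] ^= 1                       # single-bit error
print(parity_ok(corrupted))             # False: the error is detected

corrupted[0] ^= 1                       # a second error in the same word
print(parity_ok(corrupted))             # True: double errors go undetected
```

This shows the trade-off in miniature: one extra bit per word detects any single-bit error, cannot detect an even number of errors, and cannot correct anything.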
Data diversity
Data diversity is an approach that can be used in association with any of the redundancy techniques considered above.
Sometimes, hardware or software may fail for certain inputs, but not for other inputs that are very close to them.
So, instead of applying the same input data to the redundant processors, we apply slightly different input data to them.
Thus we have in some cases another line of defense against failure.
This approach will only work if the sensitivity of the output to small changes in the input is low.
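The idea can be sketched by running the same routine on slightly perturbed copies of the input and combining the results. Everything here is assumed for illustration: the `fragile` routine with its input-specific fault, the perturbation size `eps`, and the use of the median to combine results.

```python
import statistics


def fragile(x):
    """Hypothetical routine with an input-specific fault at exactly x == 2.0."""
    if x == 2.0:
        raise ZeroDivisionError("fault triggered by this exact input")
    return 3.0 * x + 1.0


def with_data_diversity(f, x, eps=1e-6):
    """Run f on x and on two slightly perturbed inputs; return the median
    of the results that were produced without an exception."""
    results = []
    for xi in (x - eps, x, x + eps):
        try:
            results.append(f(xi))
        except Exception:
            continue                  # this copy failed; rely on the others
    if not results:
        raise RuntimeError("all diverse inputs failed")
    return statistics.median(results)


y = with_data_diversity(fragile, 2.0)
print(abs(y - 7.0) < 1e-4)            # True: close to the fault-free answer 7.0
```

The result is only approximately equal to the exact answer, which is acceptable here because the output changes very little for a small change in the input.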
Reversal Checks
If there is a simple relationship between the inputs and outputs of a system, it may then be possible to calculate the inputs given the outputs.
This can then be compared with the actual inputs as a check.
For example, consider a task that finds the square root of a number.
To see if the process is correct, we can square the output and check it against the original input.
Or let the task consist of writing a block onto disk.
The reverse operation consists of reading this block from the disk after writing and comparing it to the input to make sure that the two are the same.
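The square-root example translates directly into code. The sketch below is illustrative only: the function names and tolerance values are assumptions, and a tolerance is needed because floating-point squaring does not reproduce the input exactly.

```python
import math


def checked_sqrt(x, rel_tol=1e-9):
    """Compute sqrt(x), then square the result and compare with the input."""
    y = math.sqrt(x)
    if not math.isclose(y * y, x, rel_tol=rel_tol, abs_tol=1e-12):
        raise ArithmeticError("reversal check failed")
    return y


def reversal_check(output, original_input, inverse, rel_tol=1e-9):
    """Generic form: apply the inverse to the output, compare to the input."""
    return math.isclose(inverse(output), original_input, rel_tol=rel_tol)


print(checked_sqrt(2.0))                            # 1.4142135623730951
print(reversal_check(3.0, 9.0, lambda y: y * y))    # True
print(reversal_check(3.1, 9.0, lambda y: y * y))    # False: 9.61 != 9.0
```

The check is worthwhile when the inverse is much cheaper than the forward computation, as squaring is compared with taking a square root.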
MALICIOUS OR BYZANTINE FAILURE
Whenever a failure can cause a unit to behave arbitrarily, a malicious or Byzantine failure is said to happen.
For correct operation, it is often the case that copies of the same data as seen by various processors must be consistent (i.e., the same).
When communication is limited to two-party messages, the faulty units must be fewer than a third of the total number of units if consistency is to be guaranteed.
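The "fewer than a third" condition means that with n units and m faulty ones, consistency requires 3m < n, or equivalently n >= 3m + 1. A small sketch of that arithmetic follows; the function names are chosen for the example.

```python
def max_tolerable_faults(n: int) -> int:
    """Largest m with 3m < n, i.e. faulty units fewer than a third of n."""
    return max(0, (n - 1) // 3)


def consistency_possible(n: int, m: int) -> bool:
    """True if m faulty units out of n is fewer than a third of the total."""
    return 3 * m < n


for n in (3, 4, 7):
    print(n, max_tolerable_faults(n))
# prints:
# 3 0
# 4 1
# 7 2
```

So three units cannot tolerate even one Byzantine fault, and at least four units are needed to tolerate one.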
Integrated failure handling
When an error is detected, the system must respond swiftly to deal with it.
In the short term, the error might be masked by voting.
In the long term, the system will have to locate the failure that gave rise to the error and decide what to do with the failed unit.
Three options are usually available:
1. Retry
2. Disconnect
3. Replace
Network fault tolerance
Includes:
• Reliable communication protocols
• Agreement protocols
• Database commit protocols
Agreement in faulty systems
Two Army Problem:
We'll first examine the case of good processors but faulty communication lines.
This is known as the two army problem.
Byzantine agreement:
The source processor broadcasts its initial value to all other processors.
Agreement: All nonfaulty processors agree on the same value.
Validity: If the source processor is nonfaulty, the value agreed upon by all nonfaulty processors should be the initial value of the source.
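The agreement and validity conditions can be written as a check over the outcome of a protocol run. This sketch does not implement an agreement protocol. It only tests the two properties on given decisions, and the function name, processor labels, and data layout are assumptions.

```python
def check_byzantine_agreement(decisions, faulty, source, source_value):
    """Check the two conditions on a finished run.

    decisions: dict mapping processor id -> value it decided on
    faulty: set of processor ids treated as faulty for this check
    Returns the pair (agreement, validity).
    """
    nonfaulty_values = {v for p, v in decisions.items() if p not in faulty}
    agreement = len(nonfaulty_values) <= 1
    if source in faulty:
        validity = True     # validity only constrains a nonfaulty source
    else:
        validity = nonfaulty_values <= {source_value}
    return agreement, validity


# Four processors; source P0 sends 1; P3 is faulty and decides arbitrarily.
decisions = {"P0": 1, "P1": 1, "P2": 1, "P3": 0}
print(check_byzantine_agreement(decisions, {"P3"}, "P0", 1))   # (True, True)
```

Note that the decision of a faulty processor is ignored by both conditions: the definitions only constrain what the nonfaulty processors decide.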
Checkpointing / Recovery
Checkpoint-Recovery is a common technique for imbuing a program or system with fault-tolerant qualities, and grew from the ideas used in systems which employ transaction processing.
It allows systems to recover after some fault interrupts the system and causes the task to fail or be aborted in some way.
While many systems employ the technique to minimize lost processing time, it can be used more broadly to tolerate and recover from faults in a critical application or task.
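A toy version of checkpoint and recovery is sketched below. The `CheckpointedTask` class, the in-memory `saved` copy standing in for stable storage, and the simulated fault are all assumptions made for illustration.

```python
import copy


class CheckpointedTask:
    """Hypothetical task that sums a list and checkpoints its progress."""

    def __init__(self, items, every=2):
        self.items = items
        self.every = every                       # checkpoint interval, in items
        self.state = {"index": 0, "total": 0}
        self.saved = copy.deepcopy(self.state)   # stands in for stable storage

    def checkpoint(self):
        self.saved = copy.deepcopy(self.state)

    def recover(self):
        self.state = copy.deepcopy(self.saved)   # roll back to last checkpoint

    def run(self, fail_at=None):
        while self.state["index"] < len(self.items):
            i = self.state["index"]
            if i == fail_at:
                raise RuntimeError("simulated fault")
            self.state["total"] += self.items[i]
            self.state["index"] = i + 1
            if self.state["index"] % self.every == 0:
                self.checkpoint()
        return self.state["total"]


task = CheckpointedTask([1, 2, 3, 4, 5])
try:
    task.run(fail_at=3)          # fault after items 0..2 were processed
except RuntimeError:
    task.recover()               # resume from the checkpoint taken at index 2
print(task.state["index"])       # 2
print(task.run())                # 15
```

Only the work done since the last checkpoint (item 2 here) is repeated, which is the sense in which checkpointing minimizes lost processing time.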
Micro checkpointing
A single checkpoint buffer is maintained per multithreaded ARMOR process.
The element state is checkpointed after each operation.
Checkpoints are committed to stable storage after processing a message.
There is no need to do process-wide checkpoints of stacks, heap, etc.
The existing locking policy of element data prevents the need to suspend all threads.
IRIX checkpointing
A facility for saving running processes and, at some other time, restarting the saved processes from the point already reached, without starting all over again.
A checkpoint image is saved in a set of disk files and can comprise:
• A set of processes
• All processes in the process group (a set of processes that constitute a logical job)
• All processes in a process session (a set of processes started from the same physical or logical terminal)
THANK YOU