rtft15 unit 3

7/25/2019 RTFT15 Unit 3

UNIT 3

Fault Tolerance Techniques

Introduction, failure causes, fault types, fault detection, fault and error containment, redundancy, data diversity, reversal checks, malicious or Byzantine failures, coding technique, software fault tolerance, network fault tolerance.

Roll No: #$

Fault Tolerance

Definition

Fault tolerance refers to a system's ability to deal with malfunctions.

Fault-tolerant systems are, ideally, systems capable of executing their tasks correctly regardless of either hardware failures or software errors.

Real Time and Fault Tolerance

Failure Causes

There are three causes of failure:

1. Errors in the specification or design
2. Defects in the components
3. Environmental effects

Fault Types

Faults are classified according to their behavior and categorized into:

1. Temporal behavior
2. Output behavior

A fault is said to be active when it is physically capable of generating errors and to be benign when it is not.

Temporal Behavior

Transient faults: These occur once and then disappear.

Intermittent faults: These are characterized by a fault occurring, then vanishing again, then reoccurring, then vanishing.

Permanent faults: This type of failure is persistent: it continues to exist until the faulty component is repaired or replaced.

Fault Detection

There are two ways to determine that a processor is malfunctioning:

1. online
2. offline

Definition

- Online detection goes on in parallel with normal system operation.
- One way of doing this is to check for any behavior that is inconsistent with correct operation.


A monitor (called a watchdog processor) is associated with each processor, looking for signs that the processor is faulty.

The watchdog processor watches the data and address lines, as shown in the figure.

A second approach is to have multiple processors, which are supposed to put out the same result, and compare the results.

A discrepancy indicates the existence of a fault.


[Figure: Online detection using a watchdog processor]

The following actions are indicative of a faulty processor:

- Branching to an invalid destination
- Fetching an opcode from a location containing data
- Writing into a portion of memory to which the process has no write access
- Fetching an illegal opcode
- Remaining inactive for more than a prescribed period
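The last symptom in the list, inactivity beyond a prescribed period, is the one a simple watchdog timer can check. The following is only a minimal sketch: the `Watchdog` class, its method names, and the injected `now` values are assumptions made for illustration, and it does not model a hardware watchdog processor observing the data and address lines.

```python
class Watchdog:
    """Flags a monitored unit as faulty if it stays silent too long (hypothetical helper)."""

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout      # prescribed maximum period of inactivity
        self.last_seen = now        # time of the most recent heartbeat

    def heartbeat(self, now):
        """Called by the monitored processor to signal that it is alive."""
        self.last_seen = now

    def is_faulty(self, now):
        """True if the processor has been inactive longer than the timeout."""
        return (now - self.last_seen) > self.timeout


wd = Watchdog(timeout=5.0, now=0.0)
wd.heartbeat(now=3.0)
ok_at_7 = wd.is_faulty(now=7.0)    # 4 time units of silence: still healthy
bad_at_9 = wd.is_faulty(now=9.0)   # 6 time units of silence: flagged as faulty
```

Passing the time in explicitly keeps the check deterministic; a real monitor would read a clock instead.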


Offline detection consists of running diagnostic tests.

These tests are not runnable alongside normal work: when a processor is running such a test, it obviously cannot be executing the applications software.

Diagnostic tests can be scheduled just like ordinary tasks.

The greater the failure rate, the greater must be the frequency with which these tests are run.


Fault and Error containment

When a fault or error occurs in one part of the system, it can, if unchecked, spread through the system like an infectious disease.

A fault in one part of the system might, for example, cause large voltage swings in another; a fault-free processor can put out erroneous results as a result of using erroneous input from a faulty unit.

Faults and errors must therefore be prevented from spreading through the system. This is called fault and error containment.


The system is divided into:

Fault Containment Zones (FCZ):

An FCZ is a subset of the system that operates correctly despite arbitrary logical or electrical faults outside the subset.

That is, the failure of some part of the computer outside an FCZ cannot cause any element inside that FCZ to fail.

Error Containment Zones (ECZ):

The function of an ECZ is to prevent errors from propagating across zone boundaries. This is typically achieved by voting redundant outputs.


Redundancy

Two types: static (or masking) and dynamic redundancy.

Static: redundant components are used inside a system to hide the effects of faults, e.g. Triple Modular Redundancy (TMR): 3 identical subcomponents and majority voting circuits; the outputs are compared and if one differs from the other two, that output is masked out.

Dynamic: redundancy supplied inside a component which indicates that the output is in error; it provides an error detection facility, and recovery must be provided by another component. E.g. communications checksums and memory parity bits.
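As an illustration of dynamic redundancy, a single even-parity bit signals that a stored word is in error but cannot repair it. This is a minimal sketch; the function names are chosen here for illustration and are not from the slides.

```python
def add_even_parity(bits):
    """Append one parity bit so the total number of 1s is even."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """Detect (but not locate or correct) an odd number of flipped bits."""
    return sum(word) % 2 == 0


stored = add_even_parity([1, 0, 1, 1])   # three 1s, so the parity bit is 1
corrupted = stored.copy()
corrupted[2] ^= 1                        # simulated single-bit memory fault
```

`parity_ok(corrupted)` reports the error; recovery (for example, re-reading from another copy) has to come from another component.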

N-Modular Redundancy

N-modular redundancy (NMR) is a scheme for forward error recovery.

It works by using N processors instead of one, and voting on their output. N is usually odd.

The figure illustrates this scheme for N = 3.

One of two approaches is possible.

In design (a), there are N voters and the entire cluster produces N outputs. In design (b), there is just one voter.
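The voting step can be sketched as a strict-majority voter over the N module outputs. This is an illustrative sketch of the single-voter arrangement of design (b); the name `majority_vote` is an assumption, and design (a) would simply run N such voters.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the N modules.

    Raises ValueError when no value has a strict majority, in which case
    the fault cannot be masked by voting.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among module outputs")
    return value


# N = 3: one faulty module is outvoted by the two correct ones.
masked = majority_vote([42, 42, 17])
```

With N odd, a two-valued disagreement can never end in a tie, which is one reason N is usually chosen odd.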


[Figure: N-Modular Redundancy]


Software Redundancy

The system is provided with different software versions of a task, written independently by different teams of programmers.

If one version of the task fails under a certain input, another version can be used.

Software redundancy approaches:

1. N-Version Programming
2. Recovery Block Approach

N-Version Programming

The N-version software concept attempts to parallel the traditional hardware fault tolerance concept of N-way redundant hardware.

In an N-version software system, each module is made with up to N different implementations. Each variant accomplishes the same task, but hopefully in a different way.

Each version then submits its answer to a voter or decider, which determines the correct answer.

This system can hopefully overcome the design faults present in most software by relying upon the design diversity concept.

An important distinction in N-version software is the fact that the system could include multiple types of hardware using multiple versions of software.

The goal is to increase the diversity in order to avoid common mode failures.

Using N-version software, it is encouraged that each different version be implemented in as diverse a manner as possible, including different tool sets, different programming languages, and possibly different environments.
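A small sketch of the idea: three differently designed integer square root routines feed a majority decider. All names here are hypothetical, and the off-by-one fault in the third version is seeded deliberately to show the decider masking a design fault; real N-version systems would use independently developed code, not variants written side by side.

```python
import math
from collections import Counter


def isqrt_library(n):
    """Version A: standard-library routine."""
    return math.isqrt(n)


def isqrt_binary_search(n):
    """Version B: binary search for the largest r with r*r <= n."""
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def isqrt_linear(n):
    """Version C: linear search with a seeded fault on perfect squares."""
    r = 0
    while (r + 1) * (r + 1) < n:   # seeded fault: the test should be <=
        r += 1
    return r


def n_version_isqrt(n, versions):
    """Decider: run every version and return the majority answer."""
    answers = [version(n) for version in versions]
    value, count = Counter(answers).most_common(1)[0]
    if count * 2 <= len(answers):
        raise RuntimeError("decider found no majority")
    return value


VERSIONS = [isqrt_library, isqrt_binary_search, isqrt_linear]
answer = n_version_isqrt(16, VERSIONS)   # version C answers 3 and is outvoted
```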


Recovery Block Approach

The recovery block operates with an adjudicator which confirms the results of various implementations of the same algorithm.

In a system with recovery blocks, the system view is broken down into fault recoverable blocks.

The entire system is constructed of these fault tolerant blocks. Each block contains at least a primary, secondary, and exceptional case code along with an adjudicator.

The adjudicator is the component which determines the correctness of the various blocks to try.

Upon first entering a unit, the adjudicator first executes the primary alternate.

If the adjudicator determines that the primary block failed, it then tries to roll back the state of the system and tries the secondary alternate.

If the adjudicator does not accept the results of any of the alternates, it then invokes the exception handler, which then indicates the fact that the software could not perform the requested operation.
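The control flow described above can be sketched as follows. This is an assumption-laden illustration: `recovery_block`, the dictionary-based state, and the sorting alternates are all hypothetical, the acceptance test plays the adjudicator's checking role, and the final `RuntimeError` stands in for the exception handler.

```python
import copy


def recovery_block(state, alternates, acceptance_test):
    """Try each alternate in order; roll back state after a rejected attempt."""
    for alternate in alternates:
        checkpoint = copy.deepcopy(state)      # recovery point
        try:
            result = alternate(state)
            if acceptance_test(state, result):
                return result
        except Exception:
            pass                               # a crash counts as a failed attempt
        state.clear()
        state.update(checkpoint)               # roll back before the next try
    raise RuntimeError("all alternates rejected: operation could not be performed")


def primary(state):
    state["calls"].append("primary")
    return sorted(set(state["data"]))          # seeded fault: drops duplicates


def secondary(state):
    state["calls"].append("secondary")
    return sorted(state["data"])


def accept(state, result):
    """Accept only an ordered result of the same length as the input."""
    same_length = len(result) == len(state["data"])
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return same_length and ordered


state = {"data": [3, 1, 3, 2], "calls": []}
result = recovery_block(state, [primary, secondary], accept)
```

After the call, `state["calls"]` holds only `"secondary"`, because the side effect of the rejected primary was rolled back.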


[Figure: Software Redundancy Structures]

Time Redundancy

Achieves fault tolerance by performing an operation several times.

Timeouts and retransmissions in reliable point-to-point and group communication are examples of time redundancy.

This form of redundancy is useful in the presence of transient or intermittent faults. It is of no use with permanent faults.
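A retry loop is the simplest form of time redundancy. The sketch below assumes a hypothetical `make_flaky` channel that times out a fixed number of times before delivering; it shows a transient fault being ridden out, while a permanent fault exhausts the attempt budget.

```python
def call_with_retries(operation, max_attempts=3):
    """Repeat an operation until it succeeds or the attempt budget runs out."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), attempt
        except TimeoutError as err:       # treated here as a transient fault
            last_error = err
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


def make_flaky(failures):
    """Hypothetical channel that times out `failures` times, then delivers."""
    remaining = [failures]

    def send():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TimeoutError("no acknowledgement")
        return "ack"

    return send


reply, attempts = call_with_retries(make_flaky(2))   # succeeds on the third try
```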

Time Redundancy

1. Recovery Points
2. Backward Error Recovery

Information Redundancy

The basic idea of information redundancy is to provide more information than is strictly necessary and to use that extra information to check for errors.

We use coding all the time ourselves, while correcting for typographical errors.

For example, if we encounter the word "startegic," we will most likely unconsciously correct it to "strategic." This is possible because (a) there is no such word as "startegic," and (b) "strategic" is the closest word that we can think of to "startegic."

The conditions (a) and (b) are at the basis of all coding theory.

All computer words are strings of 0s and 1s. Coding ensures that not all strings of 0s and 1s are legal (i.e., are valid).

When assessing a coding scheme, we want to know how many extra bits it adds to the words, and how many bit errors it can detect or correct.

We are interested in how much work it takes to encode.

Information Redundancy structures

- Repetition codes
- Parity coding
- Checksum codes
- Cyclic redundancy check
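Two of the schemes listed above can be illustrated briefly: a repetition code, which corrects errors at the cost of many extra bits, and a cyclic redundancy check, which only detects them. This is a sketch under stated assumptions: the helper names are hypothetical, and `zlib.crc32` from the Python standard library is used as one concrete CRC.

```python
import zlib


def repetition_encode(bits, r=3):
    """Repeat every bit r times (r odd), adding (r - 1) extra bits per data bit."""
    return [b for b in bits for _ in range(r)]


def repetition_decode(coded, r=3):
    """Majority-vote each group of r copies; corrects up to (r - 1) // 2 flips per group."""
    return [int(sum(coded[i:i + r]) * 2 > r) for i in range(0, len(coded), r)]


coded = repetition_encode([1, 0, 1])      # 9 bits carry 3 bits of data
coded[4] ^= 1                             # one bit flipped in transit
decoded = repetition_decode(coded)        # the flip is corrected

# CRC: detection only. The receiver recomputes the check value and compares.
message = b"fault tolerance"
crc = zlib.crc32(message)
damaged = b"fault tolerancf"
crc_detects = zlib.crc32(damaged) != crc
```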

Data diversity

Data diversity is an approach that can be used in association with any of the redundancy techniques considered above.

Sometimes, hardware or software may fail for certain inputs, but not for other inputs that are very close to them.

So, instead of applying the same input data to the redundant processors, we apply slightly different input data to them.

Thus we have in some cases another line of defense against failure.

This approach will only work if the sensitivity of the ...
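A toy illustration of the idea, assuming a routine that fails for exactly one input and a small set of hypothetical input offsets. It is only meaningful when a small change in the input produces an acceptably small change in the output, as is the case for this function near zero.

```python
import math


def sinc(x):
    """Routine with a latent fault: it fails for the single input x == 0."""
    return math.sin(x) / x


def with_data_diversity(func, x, offsets=(0.0, 1e-9, -1e-9)):
    """Apply slightly different inputs in turn and return the first result obtained."""
    for dx in offsets:
        try:
            return func(x + dx)
        except ZeroDivisionError:
            continue                      # this input hit the fault; try a neighbour
    raise RuntimeError("every re-expressed input failed")


value = with_data_diversity(sinc, 0.0)    # the exact input fails, a nearby one works
```

Here the attempts run one after another; with redundant processors the perturbed inputs would be applied in parallel and the results voted on.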

[Figure: Data diversity]

Reversal Checks

If there is a simple relationship between the inputs and outputs of a system, it may then be possible to calculate the inputs given the outputs.

This can then be compared with the actual inputs as a check.

For example, consider a task that finds the square root of a number.

To see if the process is correct, we can square the output and check it against the original input. Or let the task consist of writing a block onto disk.

The reverse operation consists of reading this block from the disk after writing and comparing it to the input to make sure that the two are the same.
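The square root example of a reversal check can be written in a few lines. This is a minimal sketch; the function name and the relative tolerance are assumptions, needed because floating-point squaring does not reproduce the input exactly.

```python
import math


def sqrt_passes_reversal_check(x, root, rel_tol=1e-9):
    """Reverse the operation: square the output and compare it with the input."""
    return math.isclose(root * root, x, rel_tol=rel_tol)


good = sqrt_passes_reversal_check(2.0, math.sqrt(2.0))
bad = sqrt_passes_reversal_check(2.0, 1.5)     # simulated faulty output
```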

MALICIOUS OR BYZANTINE FAILURE

Whenever a failure can cause a unit to behave arbitrarily, malicious or Byzantine failure is said to happen.

For correct operation, it is often the case that copies of the same data as seen by various processors must be consistent (i.e., the same).

When communication is limited to two-party messages, the faulty units must be fewer than a third of the total number of units if consistency is to be guaranteed.
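The one-third bound translates directly into arithmetic: with n units and m faulty ones, consistency requires 3m < n, so at least 3m + 1 units are needed to tolerate m faults. A small sketch, with function names chosen here for illustration and n assumed to be at least 1:

```python
def max_tolerable_faults(n):
    """Largest m with 3 * m < n, i.e. fewer than a third of the n units faulty."""
    return (n - 1) // 3


def min_units_needed(m):
    """Smallest n for which m faulty units are still fewer than a third."""
    return 3 * m + 1


three_units = max_tolerable_faults(3)   # three units cannot tolerate any Byzantine fault
four_units = max_tolerable_faults(4)    # four units tolerate one
```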

Integrated failure handling

When an error is detected, the system must respond swiftly to deal with it.

In the short term, the error might be masked by voting.

In the long term, the system will have to locate the failure that gave rise to the error and decide what to do with the failed unit.

Three options are usually available:

1. retry
2. disconnect
3. replace

Network fault tolerance:

Includes:

- Reliable communication protocols
- Agreement protocols
- Database commit protocols (application)

Agreement in faulty systems

Two Army Problem:

We'll first examine the case of good processors but faulty communication lines.

This is known as the two army problem.

Byzantine agreement:

The source processor broadcasts its initial value to all other processors.

Agreement: All nonfaulty processors agree on the same value.

Validity: If the source processor is nonfaulty, the common agreed upon value by all nonfaulty processors should be the initial value of the source.
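The two conditions can be stated as predicates over the values the processors finally decide on. This sketch only checks an outcome against the conditions; it is not an agreement protocol, and the processor names, the `decisions` mapping, and the function names are hypothetical.

```python
def agreement_holds(decisions, faulty):
    """Agreement: all nonfaulty processors decided on the same value."""
    values = {v for p, v in decisions.items() if p not in faulty}
    return len(values) <= 1


def validity_holds(decisions, faulty, source, initial_value):
    """Validity: if the source is nonfaulty, every nonfaulty processor
    decided on the source's initial value."""
    if source in faulty:
        return True      # the condition only constrains runs with a nonfaulty source
    return all(v == initial_value for p, v in decisions.items() if p not in faulty)


# P3 is faulty and reports a different value; the nonfaulty ones agree on 1.
decisions = {"P0": 1, "P1": 1, "P2": 1, "P3": 0}
agreed = agreement_holds(decisions, faulty={"P3"})
valid = validity_holds(decisions, faulty={"P3"}, source="P0", initial_value=1)
```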

Checkpointing / Recovery

Checkpoint-recovery is a common technique for imbuing a program or system with fault tolerant qualities, and grew from the ideas used in systems which employ transaction processing.

It allows systems to recover after some fault interrupts the system and causes the task to fail or be aborted in some way.

While many systems employ the technique to minimize lost processing time, it can be used more broadly to tolerate and recover from faults in a critical application or task.
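A minimal sketch of checkpoint-recovery for a long-running task. Everything here is an illustrative assumption: the `CheckpointedTask` class, the in-memory copy standing in for stable storage, and the simulated fault. It shows the task resuming from the last checkpoint instead of starting over.

```python
import copy


class CheckpointedTask:
    """Sums a list of items, saving its progress every few steps (hypothetical example)."""

    def __init__(self, items):
        self.items = items
        self.state = {"next": 0, "total": 0}
        self.saved = copy.deepcopy(self.state)   # stand-in for stable storage

    def checkpoint(self):
        self.saved = copy.deepcopy(self.state)

    def recover(self):
        self.state = copy.deepcopy(self.saved)

    def run(self, fail_at=None, checkpoint_every=2):
        while self.state["next"] < len(self.items):
            i = self.state["next"]
            if i == fail_at:
                raise RuntimeError("simulated fault")
            self.state["total"] += self.items[i]
            self.state["next"] = i + 1
            if self.state["next"] % checkpoint_every == 0:
                self.checkpoint()
        return self.state["total"]


task = CheckpointedTask([1, 2, 3, 4, 5])
try:
    task.run(fail_at=3)          # fault strikes before item 3 is processed
except RuntimeError:
    task.recover()               # fall back to the last checkpoint
resumed_from = task.state["next"]
total = task.run()               # only the work since the checkpoint is redone
```

The checkpoint interval trades checkpointing overhead against the amount of processing lost when a fault occurs.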

Micro checkpointing

A single checkpoint buffer is maintained per multithreaded ARMOR process.

The element state is checkpointed after each operation.

Checkpoints are committed to stable storage after processing a message.

There is no need to do process-wide checkpoints of stacks, heap, etc.

The existing locking policy of element data prevents the need to suspend all threads.


IRIX checkpointing

Facility for saving running processes and, at some other time, restarting the saved processes from the point already reached, without starting all over again.

A checkpoint image is saved in a set of disk files and can comprise:

- A set of processes
- All processes in the process group (a set of processes that constitute a logical job)
- All processes in a process session (a set of processes started from the same physical or logical terminal)


THANK YOU
