RTFT15 Unit 3

Upload: luis-anderson

Post on 25-Feb-2018

  • 7/25/2019 RTFT15 Unit 3

    UNIT 3

    Fault Tolerance Techniques

    Introduction, failure causes, fault types, fault detection, fault and error containment, redundancy, data diversity, reversal checks, malicious or Byzantine failures, coding technique, software fault tolerance, network fault tolerance.

    Roll No: #$

    Fault Tolerance

    Definition

    Fault tolerance refers to a system's ability to deal with malfunctions.

    Fault-tolerant systems are, ideally, systems capable of executing their tasks correctly regardless of either hardware failures or software errors.

    Real Time and Fault Tolerance

    Failure Causes

    There are three causes of failure:

    - Errors in the specification or design
    - Defects in the components
    - Environmental effects

    Fault Types

    Faults are classified according to their behavior and are categorized into:

    1. Temporal behavior
    2. Output behavior

    A fault is said to be active when it is physically capable of generating errors and to be benign when it is not.

    Temporal Behavior

    Transient faults: These occur once and then disappear.

    Intermittent faults: These are characterized by a fault occurring, then vanishing again, then reoccurring, then vanishing.

    Permanent faults: This type of failure is persistent; it continues to exist until the faulty component is repaired or replaced.

    Fault Detection

    There are two ways to determine that a processor is malfunctioning:

    1. Online
    2. Offline

    Online detection goes on in parallel with normal system operation.

    One way of doing this is to check for any behavior that is inconsistent with correct operation.

    A monitor (called a watchdog processor) is associated with each processor, looking for signs that the processor is faulty.

    The watchdog processor watches the data and address lines, as shown in the figure.

    A second approach is to have multiple processors, which are supposed to put out the same result, and compare the results.

    A discrepancy indicates the existence of a fault.

    Figure: Online Detection using Watchdog Processor

    The following actions are indicative of a faulty processor:

    - Branching to an invalid destination
    - Fetching an opcode from a location containing data
    - Writing into a portion of memory to which the process has no write access
    - Fetching an illegal opcode
    - Inactivity for more than a prescribed period
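
    The last symptom in the list, inactivity beyond a prescribed period, can be sketched as a software watchdog timer. This is only a minimal illustration, not the hardware watchdog processor described in the slides; the class name `Watchdog`, its methods, and the injectable `clock` parameter are assumptions made for the example.

```python
import time


class Watchdog:
    """Flags a monitored unit as faulty if it stays silent for too long."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s      # the "prescribed period"
        self.clock = clock              # injectable so the timer can be tested
        self.last_kick = clock()

    def kick(self):
        # Called by the monitored processor to show it is still active.
        self.last_kick = self.clock()

    def is_faulty(self):
        # Silence for longer than the prescribed period suggests a fault.
        return self.clock() - self.last_kick > self.timeout_s
```

    The monitored task would call `kick()` regularly, and a supervisor would poll `is_faulty()`. The other symptoms in the list need access to the address and data lines and are not modelled here.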

    Offline detection consists of running diagnostic tests.

    Not runnable: when a processor is running such a test, it obviously cannot be executing the applications software.

    Diagnostic tests can be scheduled just like ordinary tasks.

    The greater the failure rate, the greater must be the frequency with which these tests are run.

    Fault and Error Containment

    When a fault or error occurs in one part of the system, it can, if unchecked, spread through the system like an infectious disease.

    A fault in one part of the system might, for example, cause large voltage swings in another; a fault-free processor can put out erroneous results as a result of using erroneous input from a faulty unit.

    Faults and errors must therefore be prevented from spreading through the system. This is called containment.

    The system is divided into:

    Fault Containment Zones (FCZ):

    An FCZ is a subset of the system that operates correctly despite arbitrary logical or electrical faults outside the subset.

    That is, the failure of some part of the computer outside an FCZ cannot cause any element inside that FCZ to fail.

    Error Containment Zones (ECZ):

    The function of an ECZ is to prevent errors from propagating across zone boundaries. This is typically achieved by voting on redundant outputs.

    Redundancy

    Two types: static (or masking) and dynamic redundancy.

    Static: redundant components are used inside a system to hide the effects of faults, e.g. Triple Modular Redundancy (TMR): 3 identical subcomponents and majority voting circuits; the outputs are compared and, if one differs from the other two, that output is masked out.

    Dynamic: redundancy supplied inside a component which indicates that the output is in error; it provides an error detection facility, and recovery must be provided by another component. E.g. communications checksums and memory parity bits.

    N-Modular Redundancy

    N-modular redundancy (NMR) is a scheme for forward error recovery.

    It works by using N processors instead of one, and voting on their output. N is usually odd.

    The figure illustrates this scheme for N = 3.

    One of two approaches is possible.

    In design (a), there are N voters and the entire cluster produces N outputs. In design (b), there is just one voter.

    Figure: N-Modular Redundancy
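
    The voting step of NMR can be sketched in software as a simple majority voter. This is an illustrative sketch only: real voters are usually hardware circuits, and the function name `majority_vote` and the exact-equality comparison of outputs are assumptions made for the example.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the N modules."""
    if not outputs:
        raise ValueError("no outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        # No value was produced by more than half of the modules.
        raise ValueError("no majority: too many modules disagree")
    return value
```

    With N = 3, a call such as `majority_vote([7, 7, 9])` masks the single differing output, which is the TMR case. An odd N avoids ties between two equally sized groups.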

    Software Redundancy

    The system is provided with different software versions of a task, written independently by different teams of programmers.

    If one version of a task fails under a certain input, another version can be used.

    Software Redundancy

    - N-Version Programming
    - Recovery Block Approach

    N-Version Programming

    The N-version software concept attempts to parallel the traditional hardware fault tolerance concept of N-way redundant hardware.

    In an N-version software system, each module is made with up to N different implementations. Each variant accomplishes the same task, but hopefully in a different way.

    Each version then submits its answer to a voter or decider, which determines the correct answer.

    This system can hopefully overcome the design faults present in most software by relying upon the design diversity concept.

    An important distinction in N-version software is the fact that the system could include multiple types of hardware using multiple versions of software.

    The goal is to increase the diversity in order to avoid common mode failures.

    Using N-version software, it is encouraged that each different version be implemented in as diverse a manner as possible, including different tool sets, different programming languages, and possibly different environments.

    Recovery Block Approach

    The recovery block operates with an adjudicator which confirms the results of various implementations of the same algorithm.

    In a system with recovery blocks, the system view is broken down into fault recoverable blocks.

    The entire system is constructed of these fault tolerant blocks. Each block contains at least a primary, secondary, and exceptional case code along with an adjudicator.

    The adjudicator is the component which determines the correctness of the various blocks to try.

    Upon first entering a unit, the adjudicator first executes the primary alternate.

    If the adjudicator determines that the primary block failed, it then tries to roll back the state of the system and tries the secondary alternate.

    If the adjudicator does not accept the results of any of the alternates, it then invokes the exception handler, which then indicates the fact that the software could not perform the requested operation.
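
    The control flow of a recovery block (try the primary, roll back, try the secondary, then give up) can be sketched as follows. This is a minimal sketch under stated assumptions: the state is a plain dictionary, the adjudicator is modelled as a caller-supplied `acceptance_test` function, and the name `recovery_block` is hypothetical.

```python
import copy


def recovery_block(state, alternates, acceptance_test):
    """Try each alternate in order, rolling the state back after a rejected attempt."""
    for alternate in alternates:
        checkpoint = copy.deepcopy(state)   # recovery point taken before the attempt
        try:
            result = alternate(state)
            if acceptance_test(result):
                return result               # the adjudicator accepted this alternate
        except Exception:
            pass                            # a crash counts as a failed alternate
        state.clear()
        state.update(checkpoint)            # roll back before the next alternate
    # Exceptional case: no alternate produced an acceptable result.
    raise RuntimeError("could not perform the requested operation")
```

    The first entry of `alternates` plays the role of the primary and the second the secondary. The final `RuntimeError` stands in for the exception handler described in the slides.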

    Software Redundancy Structures

    Time Redundancy

    Time redundancy achieves fault tolerance by performing an operation several times.

    Timeouts and retransmissions in reliable point-to-point and group communication are examples of time redundancy.

    This form of redundancy is useful in the presence of transient or intermittent faults. It is of no use with permanent faults.
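
    Performing an operation several times can be sketched as a small retry helper. This is an illustration only; the function name `retry` and its parameters `attempts` and `delay_s` are assumptions, and every exception is treated as a possibly transient fault.

```python
import time


def retry(operation, attempts=3, delay_s=0.0):
    """Repeat an operation until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as exc:    # any exception is treated as a possibly transient fault
            last_error = exc
            time.sleep(delay_s)     # wait before trying again
    # Every attempt failed, so the fault is probably permanent.
    raise last_error
```

    A transient fault disappears on a later attempt and the call succeeds. A permanent fault makes every attempt fail, so retrying only delays the error, which matches the limitation stated in the slide.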

    Time Redundancy

    1. Recovery Points
    2. Backward Error Recovery

    Information Redundancy

    The basic idea of information redundancy is to provide more information than is strictly necessary and to use that extra information to check for errors.

    We use coding all the time ourselves, while correcting for typographical errors.

    For example, if we encounter the word "startegic," we will most likely unconsciously correct it to "strategic." This was possible because (a) there is no such word as "startegic," and (b) "strategic" is the closest word that we can think of to "startegic."

    The conditions (a) and (b) are at the basis of all coding theory.

    All computer words are strings of 0s and 1s. Coding ensures that not all strings of 0s and 1s are legal (i.e., are valid).

    When assessing a coding scheme, we want to know how many extra bits it adds to the words, and how many bit errors it can detect or correct.

    We are interested in how much work it takes to encode.
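
    Conditions (a) and (b) can be illustrated with a toy code in which only some bit strings are legal and a received word is corrected to the closest legal one. The 3-fold repetition code, the `CODEWORDS` table, and the function names below are assumptions chosen for the example.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


# Assumed toy code: one data bit repeated three times. Only two of the
# eight possible 3-bit strings are legal codewords (condition (a)).
CODEWORDS = {"000": 0, "111": 1}


def decode(word):
    """Return (data_bit, error_detected), choosing the closest codeword (condition (b))."""
    closest = min(CODEWORDS, key=lambda c: hamming_distance(c, word))
    return CODEWORDS[closest], word not in CODEWORDS
```

    This code adds two extra bits per data bit. It can detect up to two bit errors and correct one, because the two codewords differ in three positions.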

    Information Redundancy Structures

    - Repetition codes
    - Parity coding
    - Checksum codes
    - Cyclic redundancy check
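
    Three of the listed structures can be sketched in a few lines. The helper names are assumptions made for the example, and the 8-bit additive checksum is just one simple checksum variant; `zlib.crc32` is the CRC-32 routine from Python's standard library.

```python
import zlib


def even_parity_bit(data: bytes) -> int:
    """Parity bit that makes the total number of 1 bits even."""
    ones = sum(bin(byte).count("1") for byte in data)
    return ones % 2


def checksum8(data: bytes) -> int:
    """Simple 8-bit additive checksum: sum of all bytes modulo 256."""
    return sum(data) % 256


def crc32(data: bytes) -> int:
    """CRC-32 of the data, as computed by the standard zlib module."""
    return zlib.crc32(data)
```

    The sender stores or transmits the check value with the data, and the receiver recomputes it and compares. A single parity bit misses any even number of flipped bits, and the additive checksum misses reordered bytes, which is one reason a CRC is often preferred for communication links.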

    Data Diversity

    Data diversity is an approach that can be used in association with any of the redundancy techniques considered above.

    Sometimes, hardware or software may fail for certain inputs, but not for other inputs that are very close to them.

    So, instead of applying the same input data to the redundant processors, we apply slightly different input data to them.

    Thus we have in some cases another line of defense against failure.

    This approach will only work if the sensitivity of the
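
    One way to picture data diversity is to feed re-expressed versions of the same input to a routine and combine the results. The sketch below is hypothetical throughout: `buggy_sin` is a deliberately faulty stand-in, and the re-expressions rely on the identities sin(x) = sin(pi - x) = sin(x + 2*pi), so the three results should agree up to rounding.

```python
import math
from statistics import median


def buggy_sin(x):
    """Hypothetical faulty routine: wrong for exactly one input value."""
    if x == 0.5:
        return 0.0              # injected fault for the example
    return math.sin(x)


def diverse_sin(x, f=buggy_sin):
    """Evaluate f on three equivalent re-expressions of x and take the median."""
    results = [f(x), f(math.pi - x), f(x + 2 * math.pi)]
    return median(results)      # one bad result out of three is outvoted
```

    Here the routine fails for the input 0.5 but not for the nearby re-expressed inputs, so the median still gives the right answer.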

    Reversal Checks

    If there is a simple relationship between the inputs and outputs of a system, it may then be possible to calculate the inputs given the outputs.

    This can then be compared with the actual inputs as a check.

    For example, consider a task that finds the square root of a number.

    To see if the process is correct, we can square the output and check it against the original input. Or let the task consist of writing a block onto disk.

    The reverse operation consists of reading this block from the disk after writing and comparing it to the input to make sure that the two are the same.
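
    The square-root example can be written as a short reversal check. This is a sketch under assumptions: the function name `checked_sqrt`, the injectable `sqrt` parameter, and the tolerance value are illustrative choices, and a tolerance is needed because floating-point squaring is not exact.

```python
import math


def checked_sqrt(x, sqrt=math.sqrt, rel_tol=1e-9):
    """Compute a square root and verify it by squaring the result."""
    y = sqrt(x)
    # Reversal check: recompute the input from the output and compare.
    if not math.isclose(y * y, x, rel_tol=rel_tol):
        raise ArithmeticError("reversal check failed")
    return y
```

    Passing a faulty routine, for example `checked_sqrt(16.0, sqrt=lambda v: 3.0)`, makes the check fail because 3.0 squared is not 16.0.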

    Malicious or Byzantine Failure

    Whenever a failure can cause a unit to behave arbitrarily, malicious or Byzantine failure is said to happen.

    For correct operation, it is often the case that copies of the same data as seen by various processors must be consistent (i.e., the same).

    When communication is limited to two-party messages, the faulty units must be fewer than a third of the total number of units if consistency is to be guaranteed.
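
    The one-third bound can be stated as 3m < n for n units of which m are faulty. The helper names below are assumptions made for illustration; they only evaluate the bound and do not implement an agreement protocol.

```python
def consistency_guaranteed(total_units: int, faulty_units: int) -> bool:
    """True if the faulty units are fewer than a third of all units."""
    return 3 * faulty_units < total_units


def max_tolerable_faults(total_units: int) -> int:
    """Largest number of faulty units that still satisfies the bound."""
    return max(0, (total_units - 1) // 3)
```

    For example, four units can tolerate one Byzantine unit, while three units cannot tolerate any.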

    Integrated Failure Handling

    When an error is detected, the system must respond swiftly to deal with it.

    In the short term, the error might be masked by voting.

    In the long term, the system will have to locate the failure that gave rise to the error and decide what to do with the failed unit.

    Three options are usually available:

    1. Retry
    2. Disconnect
    3. Replace

    Network Fault Tolerance

    Includes:

    - Reliable communication protocols
    - Agreement protocols
    - Database commit protocols

    Agreement in Faulty Systems

    Two Army Problem:

    We'll first examine the case of good processors but faulty communication lines. This is known as the two army problem.

    Byzantine agreement:

    The source processor broadcasts its initial value to all other processors.

    Agreement: All nonfaulty processors agree on the same value.

    Validity: If the source processor is nonfaulty, the common agreed upon value by all nonfaulty processors should be the initial value of the source.

    Checkpointing / Recovery

    Checkpoint-recovery is a common technique for imbuing a program or system with fault tolerant qualities, and grew from the ideas used in systems which employ transaction processing.

    It allows systems to recover after some fault interrupts the system, and causes the task to fail, or be aborted in some way.

    While many systems employ the technique to minimize lost processing time, it can be used more broadly to tolerate and recover from faults in a critical application or task.
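
    The idea of saving state and resuming from it after a failure can be sketched with a file-based checkpoint. This is a minimal sketch, not any particular system's mechanism: the function names, the pickle file format, and the running-sum task are assumptions made for the example.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write the state to a temporary file, then atomically replace the checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())        # push the data to stable storage
    os.replace(tmp_path, path)      # the old checkpoint stays valid until this point


def load_checkpoint(path, default):
    """Return the last saved state, or the default if no checkpoint exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def sum_with_checkpoints(values, path):
    """Sum the values, checkpointing after each item so a restart can resume."""
    state = load_checkpoint(path, {"index": 0, "total": 0})
    for i in range(state["index"], len(values)):
        state = {"index": i + 1, "total": state["total"] + values[i]}
        save_checkpoint(state, path)
    return state["total"]
```

    If the process is aborted partway through, calling `sum_with_checkpoints` again with the same path continues from the last committed item, so only the work since the last checkpoint is lost.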

    Micro Checkpointing

    A single checkpoint buffer is maintained per multithreaded ARMOR process.

    The element state is checkpointed after each operation.

    Checkpoints are committed to stable storage after processing a message.

    There is no need to do process-wide checkpoints of stacks, heap, etc.

    The existing locking policy of element data prevents the need to suspend all threads.

    IRIX Checkpointing

    A facility for saving running processes and, at some other time, restarting the saved processes from the point already reached, without starting all over again.

    A checkpoint image is saved in a set of disk files and can comprise:

    - A set of processes
    - All processes in the process group (a set of processes that constitute a logical job)
    - All processes in a process session (a set of processes started from the same physical or logical terminal)

    Thank You