fault tolerance. 2 fault tolerance terminology “dependability” - extent to which reliance can...
TRANSCRIPT
![Page 1: Fault Tolerance. 2 Fault tolerance terminology “dependability” - extent to which reliance can justifiably be placed on service. General concept “reliability”](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6f5503460f94a504f4/html5/thumbnails/1.jpg)
Fault Tolerance
Fault tolerance terminology
“dependability” - extent to which reliance can justifiably be placed on service.
  General concept
“reliability” - continuity of service
  metric: mean time between failures (MTBF)
“availability” - readiness for usage
“safety” - avoidance of catastrophic effects on environment
“security” - resistance to unauthorized access.
Faults, errors, failures
“fault” - component malfunction
“error” - system state is wrong
“failure” - system departs from specification
fault → error → failure
System
[Figure: the system and its components within the environment; labels: fault (at the components), failure (at the system boundary).]
Coping with faults
Reduce/eliminate faults in components.
Fault tolerance
  Prevent faults from becoming failures
  usually through redundancy.
Types of faults (fault models)
Fault tolerance algorithms depend on the fault model.
“Crash fault” or “stop fault” - faulty component stops responding. No incorrect state changes in component.
“Timing fault” - response is too early or late.
“Byzantine fault” - arbitrary behavior. Can be considered adversarial (imagine worst case).
The agreement problem
Processors may fail
… so, use multiple processors
… but then, processors may disagree, causing failures.
Need a principled approach to distributed agreement
Example: AFTI 16 (from J. Rushby)
“Advanced Fighter Technology Integration” F16
Triple-redundant digital flight-control system (DFCS) with analog backup
DFCS design was “asynchronous”
  processors ran independently
  sample sensor, evaluate control law, send command to actuator
  actuator averages or selects from commands
General Dynamics felt synchronization would introduce a single point of failure.
AFTI 16 problems
Processors can get widely varying sensor readings because of timing differences
Reconfiguration can cause sudden changes in control (“thumps”).
  Need to allow wide range of “plausible values” before declaring a processor “bad”
  Bad sensor reading drags average down
  Sensor finally crosses threshold and is called “bad”
  Average suddenly snaps back when sensor is excluded.
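The “thump” can be illustrated with a small numeric sketch. This is not the AFTI 16 voting logic; `voted_output`, the fixed `center`, and the `plausible_range` threshold are all hypothetical names and values chosen only to show the drag-then-snap effect described above.

```python
def voted_output(readings, center, plausible_range):
    """Average only the readings within plausible_range of center.

    center and plausible_range are illustrative assumptions, not
    parameters of the real flight-control system.
    """
    good = [r for r in readings if abs(r - center) <= plausible_range]
    return sum(good) / len(good)

# Two healthy sensors read 10.0; a third one drifts downward step by step.
healthy = [10.0, 10.0]
outputs = []
for drifting in [10.0, 8.0, 6.0, 4.0, 2.0]:
    outputs.append(
        voted_output(healthy + [drifting], center=10.0, plausible_range=7.0)
    )

# The average is dragged down while the bad reading is still "plausible",
# then jumps back once the reading crosses the threshold and is excluded.
print(outputs)
```

A wide plausible range delays the exclusion, so the bad reading pulls the average further before the output snaps back.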
AFTI 16 problems (cont)
Processor states can diverge rapidly
  especially when different processors go into different control modes.
Design complexity
  70% of application code was for redundancy management
  Control laws had to be modified to ramp changes in and out smoothly
AFTI 16 flight test, Flight 36
“Departure” from control laws for 3 seconds
acceleration exceeded -4g, then +7g
Angle of attack went to -10 degrees, then +20 degrees
Aircraft rolled 360 degrees
Cause: side air probe cut out at high angle of attack
Analysis showed this would cause complete failure of DFCS for several areas of flight envelope
AFTI 16 flight 44
Each channel declared the others failed
  asynchronous operation, timing skew, sensor noise
Analog backup not selected
  simultaneous failure of two channels not anticipated
Aircraft flown home on a single digital channel (not designed for this)
There were no hardware failures.
AFTI 16 Analysis (NASA)
Nearly all failure indications were design oversights related to asynchronous operation
Failures due to lack of understanding of interactions among
  air data system
  redundancy management software
  flight control laws (decision points, thumps, ramp-in/out)
Moral of the story: Reliability through redundancy is a lot harder than it looks.
Distributed consensus
Goal: multiple processors agree on something in the presence of various kinds of faults and errors
Intellectually difficult
  Algorithms are tricky
  Proofs are subtle
  Sensitive to assumptions
    Synchronous vs. asynchronous
    Communication mechanism
    Fault models
Many papers written
Synchronous vs. asynchronous
Synchronous: Processors run in lock-step
  Hard to implement - model may be unrealistic
  Requires clock synchronization.
  Consensus is easier
Asynchronous: Processors run at arbitrary speed
  Easier to implement - model is conservative
  In most models, consensus problem is provably unsolvable.
Synchronous vs. asynchronous
Semi-synchronous
  Bounds on how far out-of-sync processors can get
  Model is fairly realistic
  Consensus is almost as easy as synchronous
Fault models
Goal: Make claims such as: “the system will continue to function if any single processor stops.”
More conservative fault models:
  Fault tolerance is harder
  But, if successful, stronger claims can be made
  Fewer assumptions = simpler FMEA, easier “certification”
A lot of models have been proposed.
Process fault models
“Stopping fault” - process stops sending messages
  does not restart
  does not send wrong messages
  liberal (easy) model
“Byzantine fault” - process behaves arbitrarily
  Name comes from cute “Byzantine generals” metaphor
  May send arbitrary messages, enter arbitrary states
  Equivalent to “evil” behavior, for our purposes
Synchronous agreement with stopping faults
Multiple processes want to “agree” on a value
Applications
  sensor readings among redundant processors
  decide what time it is
  decide which of a group of processors are broken and should be removed from system.
Synchronous agreement - properties
Each process starts with an initial value and ends with a decision value.
Agreement: all good processes decide on the same value.
Validity: if all processes start with the same value, that value is the final decision value.
Termination: All good processes eventually decide.
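The first two properties can be stated as checks over the outcome of a run. The sketch below is only an illustration: the function names and the dict-based representation of a run are assumptions, and termination is not checked because it is a property of the whole execution rather than of a final snapshot.

```python
def agreement(decisions):
    """True if all good processes decided on the same value.

    decisions: dict mapping each good process to its decision value.
    """
    return len(set(decisions.values())) <= 1


def validity(initial, decisions):
    """True if the validity property holds for this run.

    initial: dict mapping every process to its initial value.
    If the initial values differ, validity imposes no constraint.
    """
    starts = set(initial.values())
    if len(starts) != 1:
        return True
    (common,) = starts
    return all(d == common for d in decisions.values())
```

For example, `agreement({"P1": 0, "P2": 0})` holds, while `validity({"P1": "A", "P2": "A"}, {"P1": "B", "P2": "B"})` fails even though the processes agree.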
Flood set algorithm
Assumption: There is a dedicated link between each pair of processes
No more than f processes can stop
Each process has an initial value v
Each process accumulates a set W of all the values it has ever seen.
  On each round, every process sends its W set to every other process
  Every process sets W to the union of the old value and all the new values coming in from others.
Flood set
After f+1 rounds, every process looks at W.
  If W has only one value, choose that value.
  Else, choose 0 (a predetermined default).
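The flood set rounds can be simulated in a few lines. This is a minimal sketch under stated assumptions: it runs f+1 rounds (the number the correctness argument relies on), and the `crashes` schedule, which names the round a process stops in and the recipients it still reaches in that round, is a hypothetical way to model a stopping fault.

```python
def flood_set(initial, f, crashes=None, default=0):
    """Simulate flood set for f+1 synchronous rounds.

    initial: dict process -> initial value
    crashes: dict process -> (round, recipients reached before stopping)
    Returns the decisions of the processes that never stopped.
    """
    crashes = crashes or {}
    W = {p: {v} for p, v in initial.items()}
    alive = set(initial)
    for rnd in range(1, f + 2):
        inbox = {p: [] for p in alive}
        stopping = set()
        for sender in alive:
            targets = alive - {sender}
            if sender in crashes and crashes[sender][0] == rnd:
                # A stopping process reaches only some recipients this round.
                targets &= set(crashes[sender][1])
                stopping.add(sender)
            for t in targets:
                inbox[t].append(set(W[sender]))
        alive -= stopping
        for p in alive:
            for msg in inbox[p]:
                W[p] |= msg  # union of old W and everything received
    return {
        p: (next(iter(W[p])) if len(W[p]) == 1 else default) for p in alive
    }


# Three processes, one fault: P3 stops in round 1 after reaching only P1.
result = flood_set(
    {"P1": "A", "P2": "A", "P3": "B"}, f=1, crashes={"P3": (1, {"P1"})}
)
print(result)
```

In this run P1 learns B in round 1 and passes it to P2 in round 2, so both surviving processes hold {A, B} and choose the default.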
Flood set correctness
In f+1 rounds, there must be at least one round in which no processes stop
  At most f processes can stop, and processes cannot stop more than once.
If no process stops in round r, W will be the same in all good processes in subsequent rounds.
  All good processes successfully send all values in W to all other good processes, so all processes will have same W after the round.
After this, nothing can get added to any W sets, so it doesn’t matter whether more processes stop.
Flood set correctness
So, after f+1 rounds, all non-stopped processes have same W sets
  If W has only one value, all processes pick this value.
  Else all processes pick 0 (the default).
Flood set example
3 processes, 1 fault, default value = 0

| | P1 | P2 | P3 |
|---|---|---|---|
| V0 | A | A | B |
| W in round 0 | {A} | {A} | {B} |
| W in round 1 | {A,B} | {A} | - |
| W in round 2 | {A,B} | {A,B} | - |
| final | 0 | 0 | - |

P3 dies after sending W to P1 but not P2.
W sets for P1, P2 are same after round 2.
Choose default because |W|>1.
Flood set efficiency
O((f+1) n²) messages
  f+1 rounds
  n processes send n messages per round
O((f+1) n³) values are sent (each message may have a set of up to n values)
Optimized flood set
Note: If W has more than one element, process doesn’t need to know what is in it.
Idea: Every process sends only first two distinct values.
  Every process sends its initial value on first round
  If process receives a different value, it sends it out on next round
Correctness proof: run Flood and OptFlood in parallel
  same initial values, stopping pattern
  W sets have more than one value iff OptFlood process gets two values.
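A sketch of the optimized variant, under the same modelling assumptions as a plain flood set simulation: f+1 rounds, and a hypothetical `crashes` schedule giving the round a process stops in and the recipients it still reaches. Each process broadcasts at most two values in total: its initial value, then the first different value it sees.

```python
def opt_flood(initial, f, crashes=None, default=0):
    """Simulate OptFlood; returns (decisions, number of messages sent)."""
    crashes = crashes or {}
    seen = {p: [v] for p, v in initial.items()}     # at most two distinct values
    to_send = {p: [v] for p, v in initial.items()}  # queued for the next round
    alive = set(initial)
    messages = 0
    for rnd in range(1, f + 2):
        inbox = {p: [] for p in alive}
        stopping = set()
        for sender in alive:
            targets = alive - {sender}
            if sender in crashes and crashes[sender][0] == rnd:
                targets &= set(crashes[sender][1])
                stopping.add(sender)
            for value in to_send[sender]:
                for t in targets:
                    inbox[t].append(value)
                    messages += 1
            to_send[sender] = []
        alive -= stopping
        for p in alive:
            for value in inbox[p]:
                # Only the first value that differs from the initial one
                # is remembered and forwarded.
                if value not in seen[p] and len(seen[p]) < 2:
                    seen[p].append(value)
                    to_send[p].append(value)
    decisions = {
        p: (seen[p][0] if len(seen[p]) == 1 else default) for p in alive
    }
    return decisions, messages


decisions, messages = opt_flood(
    {"P1": "A", "P2": "A", "P3": "B"}, f=1, crashes={"P3": (1, {"P1"})}
)
print(decisions, messages)
```

A process decides the default exactly when it has recorded two values, which mirrors the "more than one value in W" condition of the unoptimized algorithm.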
OptFlood efficiency
2n² messages
  n processes send at most two messages to n other processes.
O(n²) values are sent
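To get a feel for the difference, the bounds quoted on the efficiency slides can be evaluated for concrete n and f. The helper names are hypothetical, and the numbers are the slides' upper bounds, not exact message counts for any particular run.

```python
def flood_set_messages(n, f):
    """Upper bound from the slides: f+1 rounds, about n*n messages per round."""
    return (f + 1) * n * n


def flood_set_values(n, f):
    """Upper bound on values sent: each message may carry up to n values."""
    return (f + 1) * n ** 3


def opt_flood_messages(n):
    """Upper bound from the slides: each process sends at most two values."""
    return 2 * n * n


n, f = 10, 3
print(flood_set_messages(n, f), flood_set_values(n, f), opt_flood_messages(n))
```

The OptFlood bound does not depend on f, so the gap widens as more faults must be tolerated.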
Byzantine agreement
Goal: non-faulty processes should agree on a value.
  e.g., message received
  e.g., sensor value
Faults may cause arbitrary behavior
  arbitrary values communicated
  different values communicated to different receivers
Advantage: reduces fault analysis
Disadvantage: hard or impossible to do.
Byzantine agreement properties
Agreement: All good processes agree on a value
Validity: If source of value was non-faulty, agreed-upon value is the value the source sent.
Asynchronous agreement
Asynchronous model:
  Message transmission takes arbitrary time.
  Processes run at arbitrary speeds.
Theorem: There is no algorithm that reaches agreement in an asynchronous model with even one Byzantine failure
  Fine print: Details of conditions, communication
This is one of the most important results about distributed systems.
Synchronous agreement
Synchronous model: Processes can communicate in a sequence of rounds. All processes complete a round before next round begins.
The agreement problem is solvable in this model.
Theorem: Tolerating k Byzantine faults requires > 3k processes.
So “Triple modular redundancy” can’t handle Byzantine faults.
Practical case: 1 Byzantine fault, 4 processes.
Assumes full connectivity (connections between each pair of processors).
Synchronous agreement with one fault
Single transmitter communicates value to all processes.
Round 0: Transmitter sends value to n-1 receivers.
  Values are sent correctly if transmitter is not faulty.
Round 1: Each receiver sends value to n-2 other receivers.
  Receivers record all values separately.
  Intuition: receivers compare notes on what transmitter told them.
Each receiver chooses majority value of all values it received. If no majority, use pre-arranged default value.
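The two rounds and the majority vote can be sketched as follows. The function names are hypothetical, and the optional `relay` callback is just one assumed way to model a faulty receiver that passes on bogus values; with the default callback every receiver relays honestly.

```python
from collections import Counter


def majority(values, default=0):
    """Return the strict-majority value, or default if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else default


def one_fault_agreement(round0, relay=None, default=0):
    """round0: dict receiver -> value the transmitter sent it in round 0.

    relay(sender, receiver, value) returns what sender actually tells
    receiver in round 1; the default relays the round-0 value unchanged.
    """
    relay = relay or (lambda sender, receiver, value: value)
    decisions = {}
    for r in round0:
        heard = [round0[r]]  # the receiver's own round-0 value
        heard += [relay(s, r, round0[s]) for s in round0 if s != r]
        decisions[r] = majority(heard, default)
    return decisions


# Faulty transmitter tells P3 a different value: majority still agrees on 1.
print(one_fault_agreement({"P1": 1, "P2": 1, "P3": 2}))
# Faulty transmitter sends three different values: no majority, use default.
print(one_fault_agreement({"P1": 1, "P2": 2, "P3": 3}))
# Good transmitter, faulty receiver P1 relays a bogus 5: P2 and P3 still get 1.
print(one_fault_agreement(
    {"P1": 1, "P2": 1, "P3": 1},
    relay=lambda s, r, v: 5 if s == "P1" else v,
))
```

With a faulty transmitter all receivers see the same multiset of values, so they reach the same decision even when that decision is the default.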
Example 1 - faulty transmitter
Round 0: faulty xmtr sends varying results to rcvrs. Round 0 values: P1 = 1, P2 = 1, P3 = 2.
Round 1: rcvrs exchange values (reliably).
Finally, receivers take majority of all answers.

| Rcvr | P1 | P2 | P3 |
|---|---|---|---|
| P1's value | 1 | 1 | 1 |
| P2's value | 1 | 1 | 1 |
| P3's value | 2 | 2 | 2 |
| consensus | 1 | 1 | 1 |
Example 2 - faulty transmitter
Round 0: faulty xmtr sends varying results to rcvrs. Round 0 values: P1 = 1, P2 = 2, P3 = 3.
Round 1: rcvrs exchange values (reliably).
There is no majority, so rcvrs use default.

| Rcvr | P1 | P2 | P3 |
|---|---|---|---|
| P1's value | 1 | 1 | 1 |
| P2's value | 2 | 2 | 2 |
| P3's value | 3 | 3 | 3 |
| consensus | 0 | 0 | 0 |
Example 3 - faulty receiver
Round 0: xmtr sends the same value to all rcvrs. Round 0 values: P1 = 1, P2 = 1, P3 = 1.
Round 1: Process 1 sends bogus values (the figure shows 2, 3 and 5 among the recorded values).
Majority computes correct values for processes 2, 3.
Process 1 is broken, so its result is not required to be correct.
General case
Previous algorithm can be generalized to handle more Byzantine faults.
General results: k faults require k+1 rounds (k, if the first transmission is not counted) and at least 3k+1 processors
Number of messages grows exponentially with number of rounds
Intuition: “Pn said that Pn-1 said that ... p1 said that p0 said that the value was x”
  There are exponentially many chains pn ... p0.
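The growth in the number of chains can be counted directly. The sketch assumes that every received claim is relayed to each process not already in the chain, which is one natural reading of the generalization; the slides do not spell out the exact relay rule, and `relay_chains` is a hypothetical helper.

```python
def relay_chains(n, k):
    """Messages sent in each of rounds 0..k with n processes in total.

    Round 0: the transmitter sends to n-1 receivers.
    Round i: every chain from the previous round is extended to each
    process not yet in the chain (an assumed relay rule).
    """
    counts = []
    chains = 1
    for rnd in range(k + 1):
        chains *= n - 1 - rnd
        counts.append(chains)
    return counts


# One fault, four processes: 3 messages in round 0, then 3 * 2 in round 1.
print(relay_chains(4, 1))
# Two faults, seven processes.
print(relay_chains(7, 2))
```

Each extra round multiplies the count by roughly n, which is the exponential growth the intuition above points at.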
Hybrid Byzantine agreement
Idea: Free bonus reliability with the purchase of Byzantine agreement.
Handles Byzantine faults, plus some simpler faults
  Symmetric fault: process sends same wrong value to everyone.
  Nonmalicious fault: process sends a recognizable error value.
Advantages:
  If processors have these faults, we can tolerate more faulty processors
  These faults are more probable than true Byzantine faults - so this increases reliability
Hybrid Byzantine agreement
Modify previous algorithm by adding special error value “E”.
  Nonmalicious faults send E value (other faults may send E, also).
  Majority algorithm first removes E values.
Theorem: Algorithm reaches agreement if n > 2a + 2s + b + r
  a = Byzantine, s = symmetric, b = nonmalicious, r = number of rounds (excluding first transmission).
  Previous case: a=1, s=0, b=0, r=1, so n > 3
  With 6 processors, can deal with 1 Byzantine + 2 nonmalicious faults,
  or 1 Byzantine and 1 symmetric ... but just 1 Byzantine in previous algorithm
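Both pieces of the hybrid scheme are easy to sketch: a majority vote that first discards E, and a check of the inequality n > 2a + 2s + b + r. The names `E`, `hybrid_majority` and `tolerates` are hypothetical, and `tolerates` evaluates only the inequality as written on the slide; any further side conditions of the theorem are not modelled.

```python
from collections import Counter

E = "E"  # stand-in for the special, recognizable error value


def hybrid_majority(values, default=0):
    """Majority vote after removing E values; default if no strict majority."""
    kept = [v for v in values if v != E]
    if not kept:
        return default
    value, count = Counter(kept).most_common(1)[0]
    return value if count > len(kept) / 2 else default


def tolerates(n, a, s, b, r):
    """Check the slide's bound n > 2a + 2s + b + r.

    a = Byzantine, s = symmetric, b = nonmalicious faults,
    r = rounds excluding the first transmission.
    """
    return n > 2 * a + 2 * s + b + r


print(hybrid_majority([1, E, 1, 2]))    # E is dropped before voting
print(tolerates(4, a=1, s=0, b=0, r=1))  # the previous one-fault case
print(tolerates(6, a=1, s=0, b=2, r=1))  # 1 Byzantine + 2 nonmalicious
print(tolerates(6, a=1, s=1, b=0, r=1))  # 1 Byzantine + 1 symmetric
```

Removing E before voting means a nonmalicious fault costs only one extra process instead of two, which is where the "free bonus" comes from.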
Variations
Synchronous communication is difficult
  Compromise between synchronous and asynchronous: real-time constraints.
“Authentication” - agreement can be made less costly by using digital signatures
  transmitter digitally signs messages
  processes can’t lie about who said what.
  can handle any number of faults (in synchronous model).
May assume different network connectivity
  Some links in network missing
Summary
Fault tolerance is tricky. Redundancy does not necessarily buy reliability.
Byzantine models can account for unforeseen fault types.
Byzantine agreement is impossible in some models.
There exist practical algorithms for Byzantine agreement if synchronous communication is available.
There are deep theoretical results in this area.