dependability and real-time
Post on 09-Feb-2016
54 Views
Preview:
DESCRIPTION
TRANSCRIPT
Undergraduate course on Real-time SystemsLinköping University
TDDD07 Real-time SystemsLecture 7: Dependability &
Fault tolerance
Simin Nadjm-Tehrani
Real-time Systems Laboratory Department of Computer and Information Science
Linköping university
56 pagesAutumn 2009
Undergraduate course on Real-time SystemsLinköping University
2 of 56Autumn 2009
Dependability and real-time
• If a system is to produce results within time constraints, it needs to produce results at all!
• How to justify and measure how well computer systems do their jobs?
Undergraduate course on Real-time SystemsLinköping University
3 of 56Autumn 2009
Dependable systems
• How do things go wrong and why?• What can we do about it?
– This lecture: Basic notions of dependability and replication in fault-tolerant systems
– Next lecture: Designing Dependable Real-time systems
Undergraduate course on Real-time SystemsLinköping University
4 of 56Autumn 2009
Early computer systems
• 1944: Real-time computer system in the Whirlwind project at MIT, used in a military air traffic control system 1951
• Short life of vaccum tubes gave mean time to failure of 20 minutes
Undergraduate course on Real-time SystemsLinköping University
5 of 56Autumn 2009
Engineers: Fool me once, shame on you – fool me twice, shame on me
Undergraduate course on Real-time SystemsLinköping University
6 of 56Autumn 2009
Software developers: Fool me N times, who cares, this is complex and anyway no one expects software to work...
Undergraduate course on Real-time SystemsLinköping University
7 of 56Autumn 2009
FT - June 16, 2004
• "If you have a problem with your Volkswagen the likelihood that it was a software problem is very high. Software technology is not something that we as car manufacturers feel comfortable with.”
Bernd Pischetsrieder, chief executive of Volkswagen
Undergraduate course on Real-time SystemsLinköping University
8 of 56Autumn 2009
October 2005
• “Automaker Toyota announced a recall of 160,000 of its Prius hybrid vehicles following reports of vehicle warning lights illuminating for no reason, and cars' gasoline engines stalling unexpectedly.”Wired 05-11-08
• The problem was found to be an embedded software bug
Undergraduate course on Real-time SystemsLinköping University
10 of 56Autumn 2009
Driver support: Volvo cars
1984 ABS Anti-lock Braking System
2004 Blind Spot Information system (BLIS)
1998 Dynamic Stability and Traction Control (DSTC)
2006 Active Bi-Xenon lights
2002 Roll Stability Control (RSC)
2006 Adaptive Cruise Control (ACC)
2003 Intelligent Driver Information System (IDIS)
2006 Collision warning system with brake support
Undergraduate course on Real-time SystemsLinköping University
11 of 56Autumn 2009
Early space and avionics
• During 1955, 18 air carrier accidents in the USA (when only 20% of the public was willing to fly!)
• Today’s complexity many times higher
Undergraduate course on Real-time SystemsLinköping University
12 of 56Autumn 2009
Airbus 380
• Integrated modular avionics (IMA), with safety-critical digital
components, e.g.– Power-by-wire:
complementing the hydraulic powered flight control surfaces
– Cabin pressure control (implemented with a TTP operated bus)
Undergraduate course on Real-time SystemsLinköping University
13 of 56Autumn 2009
Dependability
• How do things go wrong?• Why?• What can we do about it?
• Basic notions in dependable systems and replication for fault tolerance
(sv. Pålitliga datorsystem)
Undergraduate course on Real-time SystemsLinköping University
14 of 56Autumn 2009
What is dependability?
Property of a computing system which allows reliance to be justifiably placed on the service it delivers.
[Avizienis et al.]
The ability to avoid service failures that are more frequent or more severe than is acceptable.
(sv. Pålitliga datorsystem)
Undergraduate course on Real-time SystemsLinköping University
15 of 56Autumn 2009
Attributes of dependability
IFIP WG 10.4 definitions:• Safety: absence of harm to people and
environment• Availability: the readiness for correct
service• Integrity: absence of improper system
alterations• Reliability: continuity of correct service• Maintainability: ability to undergo
modifications and repairs
Undergraduate course on Real-time SystemsLinköping University
16 of 56Autumn 2009
[Sv. Tillförlitlighet]Means that the system (functionally) behaves as specified, and does it continually over measured intervals of time.Typical measure in aerospace: 10-9
Another way of putting it: MTTF - One failure in 109 flight hours.
Reliability
Undergraduate course on Real-time SystemsLinköping University
17 of 56Autumn 2009
Faults, Errors & Failures
• Fault: a defect within the system or a situation that can lead to failure
• Error: manifestation (symptom) of the fault - an unexpected behaviour
• Failure: system not performing its intended function
Undergraduate course on Real-time SystemsLinköping University
18 of 56Autumn 2009
Examples
• Year 2000 bug• Bit flips in hardware due to cosmic
radiation in space• Loose wire• Air craft retracting its landing gear while
on ground
Effects in time: Permanent/ transient/ intermittent
Undergraduate course on Real-time SystemsLinköping University
19 of 56Autumn 2009
Fault Error Failure
• Goal of system verification and validation is to eliminate faults
• Goal of hazard/risk analysis is to focus on important faults
• Goal of fault tolerance is to reduce effects of errors if they appear - eliminate or delay failures
Some will remain…
Undergraduate course on Real-time SystemsLinköping University
20 of 56Autumn 2009
More on dependability
Four approaches [IFIP 10.4]:
1. Fault avoidance2. Fault removal3. Fault tolerance4. Fault forecasting
Next lecture
Undergraduate course on Real-time SystemsLinköping University
21 of 56Autumn 2009
Google’s 100 min outage
September 2, 2009: A small fraction of Gmail’s servers were
taken offline to perform routine upgrades.
“We had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers .”
Fault forecasting?
Undergraduate course on Real-time SystemsLinköping University
22 of 56Autumn 2009
More on dependability
Four approaches [IFIP 10.4]:
1. Fault avoidance2. Fault removal3. Fault tolerance4. Fault forecasting
Undergraduate course on Real-time SystemsLinköping University
23 of 56Autumn 2009
Fault tolerance
• Means that a system provides a degraded (but acceptable) function – Even in presence of faults– During a period defined by certain
model assumptions
• Foreseen or unforeseen?– Fault model describes the foreseen
faults
Undergraduate course on Real-time SystemsLinköping University
24 of 56Autumn 2009
External factors
Undergraduate course on Real-time SystemsLinköping University
25 of 56Autumn 2009
Fault models
• Leading to Node failures– Crash – Omission – Timing– Byzantine
• Leading to Channel failures– Crash (and potential partitions)– Message loss– Message delay– Erroneous/arbitrary messages
Undergraduate course on Real-time SystemsLinköping University
26 of 56Autumn 2009
On-line fault-management
• Fault detection– By program or its environment
• Fault tolerance using redundancy– software– hardware– Data
• Fault containment by architectural choices
Undergraduate course on Real-time SystemsLinköping University
27 of 56Autumn 2009
From D. Lardner: Edinburgh Review, year 1824:
”The most certain and effectual check upon errors which arise in the process of computation is to cause the same computations to be made byseparate and independent computers*; and this check is rendered still more decisive if theircomputations are carried out by different methods.”
* people who compute
Redundancy
Undergraduate course on Real-time SystemsLinköping University
28 of 56Autumn 2009
Static Redundancy
Used all the time (whether an error has appeared or not), just in case…
– SW: N-version programming– HW: Voting systems– Data: parity bits, checksums
Undergraduate course on Real-time SystemsLinköping University
29 of 56Autumn 2009
Dynamic Redundancy
Used when error appears and specifically aids the treatment
– SW: Recovery methods, Exceptions– HW: Switching to back-up module– Data: Self-correcting codes– Time: Re-computing a result
Undergraduate course on Real-time SystemsLinköping University
30 of 56Autumn 2009
Server replication models
• Active replication
X
• Passive replication
: Denotes a replica group
Undergraduate course on Real-time SystemsLinköping University
31 of 56Autumn 2009
• Active replication– Group membership: who is up, who is down?
• Passive replication– Primary – backup: bring the secondary server up to
date
• What do we need to implement to support transition to operational mode?– Message ordering– Agreement among replicas
Increasing availability
Undergraduate course on Real-time SystemsLinköping University
32 of 56Autumn 2009
Relating states
Needs a notion of time (precedence) in distributed systems
Recall from lecture 4:• If clocks are synchronised
– The rate of processing at each node can be related
– Timeouts can be used to detect faults as opposed to message delays
• Asynchronous systems: – Abstract time via order of events
Undergraduate course on Real-time SystemsLinköping University
33 of 56Autumn 2009
• Vector clocks help to synchronise at event level– Consistent snapshots
• But reasoning about response times and fault tolerance needs quantitative bounds
Distributed snapshot
Undergraduate course on Real-time SystemsLinköping University
34 of 56Autumn 2009
”Chicken and egg” problem
• Replication is useful in presence of failures if there is a consistent common state among replicas
• To get consistency, processes need to communicate their state via broadcast
• But broadcast algorithms are distributed algorithms that run on every node
• and can be affected by failures...
Undergraduate course on Real-time SystemsLinköping University
35 of 56Autumn 2009
Desirable broadcast properties
• Reliable broadcast– all non-crashed processes agree on
messages delivered (agreement)– no spurious messages (integrity) – all messages broadcast by non-crashed processes are delivered
(validity)
All or none!
Undergraduate course on Real-time SystemsLinköping University
36 of 56Autumn 2009
How to implement?
• The first step is to separate the underlying network (transport) and the broadcast mechanism
• Distinguish between receipt and delivery of a message
Undergraduate course on Real-time SystemsLinköping University
37 of 56Autumn 2009
Application layer
Broadcast protocol
Transport
BroadcastSend
Send
Deliver
Receive
Undergraduate course on Real-time SystemsLinköping University
38 of 56Autumn 2009
Desirable broadcast properties
• Reliable broadcast– all non-crashed processes agree on
messages delivered (agreement)– no spurious messages (integrity) – all messages broadcast by non-crashed processes delivered
(validity)
All or none!
Undergraduate course on Real-time SystemsLinköping University
39 of 56Autumn 2009
Example implementation
At every node n• Execute broadcast(m) by:
– adding sender(m) and a unique ID as a header to the message m (building m)
– send(m) to all neighbours including itself• When receive(m):
– if previously not executed deliver(m) then • if sender(m) /= n then send(m) to all
neighbours• deliver(m)
Undergraduate course on Real-time SystemsLinköping University
40 of 56Autumn 2009
Directly after a receipt?
While relaying?
After sending to some but not all neighbours?
This is where failure models come in...
What if n fails?
Undergraduate course on Real-time SystemsLinköping University
41 of 56Autumn 2009
Algorithms for broadcast
• Correctness: prove validity, integrity, agreement, order
Typical assumptions:– no link failures leading to partition– send does not duplicate or change
messages– receive does not ”invent” messages
In presence of a given failure model!
Undergraduate course on Real-time SystemsLinköping University
42 of 56Autumn 2009
The consensus problem
• Processes p1,…,pn take part in a decision• Each pi proposes a value vi
– e.g. application state info• All correct processes decide on a
common value v that is equal to one of the proposed values
Assume that we have a reliable broadcast
Undergraduate course on Real-time SystemsLinköping University
43 of 56Autumn 2009
Desired properties
• Every non-faulty process eventually decides (Termination)
• No two non-faulty processes decide differently (Agreement)
• If a process decides v then the value v was proposed by some process (Validity)
Algorithms for consensus have to be proven to
have these properties
In presence of given failure model
Undergraduate course on Real-time SystemsLinköping University
44 of 56Autumn 2009
Basic impossibility result
[Fischer, Lynch and Paterson 1985]
There is no deterministic algorithm solving the consensus problem in an asynchronous distributed system with a single crash failure
Undergraduate course on Real-time SystemsLinköping University
45 of 56Autumn 2009
A way around it
Assume Synchrony:
• Distributed computations proceed in rounds initiated by pulses
• Pulses implemented using local physical clocks, synchronised assuming bounded message delays
Undergraduate course on Real-time SystemsLinköping University
46 of 56Autumn 2009
Can I prove any protocol correct?
• NO!
• It depends on the fault model:– Assumptions on
how can the nodes/channels fail
Undergraduate course on Real-time SystemsLinköping University
47 of 56Autumn 2009
Problem:• Nodes need to agree on a decision but
some nodes are faulty, act in an arbitrary way (can be malicious)
Byzantine agreement
Undergraduate course on Real-time SystemsLinköping University
48 of 56Autumn 2009
Byzantine agreement protocol
• Proposed in 1980 by Pease, Shostak and Lamport
How many faulty nodes can the
algorithm tolerate?
Undergraduate course on Real-time SystemsLinköping University
49 of 56Autumn 2009
Result from 1980
• Theorem: There is an upper bound t for the number of byzantine node failures compared to the size of the network N, N 3t+1
• Gives a t+1 round algorithm for solving consensus in a synchronous network
• Here:– We just demonstrate that t nodes
would not be enough!
Undergraduate course on Real-time SystemsLinköping University
50 of 56Autumn 2009
Scenario 1
• G and L1 are correct, L2 is faulty
G
L1 L2
1
1
1
0
1
0L2
said
0L2
said
0
L1 said 1L1 said 1
G said 1
G said 0L1 L2
G
Undergraduate course on Real-time SystemsLinköping University
51 of 56Autumn 2009
Scenario 2
• G and L2 are correct, L1 is faulty
G
L1 L2
1
0
0
0
1
0L2
said
0L2
said
0
L1 said 1L1 said 1
G said 1
G said 0L1 L2
G
Undergraduate course on Real-time SystemsLinköping University
52 of 56Autumn 2009
Scenario 3
• The general is faulty!G
L1 L2
1
1
0
0
1
0L2
said
0L2
said
0
L1 said 1L1 said 1
G said 1
G said 0L1 L2
G
Undergraduate course on Real-time SystemsLinköping University
53 of 56Autumn 2009
2-round algorithm
… does not work with t=1, N=3!
• Seen from L1, scenario 1 and 3 are identical, so if L1 decides 1 in scenario 1 it will decide 1 in scenario 3
• Similarly for L2, if it decides 0 in scenario 2 it decides 0 in scenario 3
• L1 and L2 do not agree in scenario 3 !
Undergraduate course on Real-time SystemsLinköping University
54 of 56Autumn 2009
Summary
• Dependable systems need to justify why services can be relied upon
• To tolerate faults we normally deploy replication
• Replication needs (some kind of) agreement on state
• Agreement problems are hard to prove correct in distributed systems
• Require assumptions on faults, and (some kind of) synchrony
Undergraduate course on Real-time SystemsLinköping University
55 of 56Autumn 2009
• Remember in a TTP bus: – Nodes have synchronised clocks– The communication infrastructure (collection
of CCs and the TTP bus communication) guarantee a bounded time for a new data appearing comm. interface of a node (its CNI)
– Byzantine faults in nodes can be detected by other nodes!
• Exercise: Which faults are detectable in a system connected with a CAN bus?
Real-time communication
Undergraduate course on Real-time SystemsLinköping University
56 of 56Autumn 2009
Timing and fault tolerance
• Implementing fault tolerance often relies support for timing guarantees
• Exercise: How do faults affect support for timing?– How do faults affect response times in
nodes?– How do fault tolerance mechanisms
affect timing?
top related