1. Introduction
2
Computing Everywhere
Computing everywhere:
• Desktops, laptops, cars, cell phones
Input devices everywhere:
• Sensors, cameras, microphones
Connectivity everywhere:
• Rapid growth of bandwidth in the interior of the net
• Internet at home and office
Increased reliance on computers is inevitable.
Computer systems will become invisible only when they are reliable.
3
Fault-Tolerance
Why?
• Computers are increasingly being used in critical applications where system failures may have severe consequences.
How?
• By introducing redundancy (extra resources) in the computer system, e.g., hardware redundancy and software redundancy.
4
Need for Fault Tolerance: Universal
Natural objects:
• Fat deposits in the body: survival in starvation
• Duplication of eyes: graceful degradation upon failure
Man-made objects:
• Redundancy in ordinary text
• Asking for a password twice during initial set-up
• Duplicate tires on trucks
5
Mission Specific Approaches
High availability systems:
• Telephone
• Transaction processing: banks/airlines
Long life missions:
• Unscheduled maintenance too costly
• Manned and unmanned space-borne systems
Critical applications:
• Real-time industrial control
• Aircraft control systems
• Life support systems
General purpose systems:
• CDs: encoding
• Internet: packet retransmission
6
Example of Failures - eBay Crash
eBay: giant internet auction house
• A top 10 internet business
• Market value of $22 billion
• 3.8 million users as of March 1999
June 6, 1999:
• eBay system is unavailable for 22 hours, with problems ongoing for several days
• Stock drops by 6.5%, $3-5 billion lost revenues
• Problems blamed on Sun server software
7
Example - Ariane 5 Rocket Crash
Ariane 5 and its payload destroyed about 40 seconds after liftoff
Error due to software bug:
• Conversion of floating point to 16-bit int
• Out-of-range error generated but not handled
Testing of full system under actual conditions not done due to budget limits
Estimated cost: $120 million
8
Example - The Therac-25 Failure
Therac-25 is a linear accelerator used for radiation therapy
More dependent on software for safety than predecessors (Therac-20, Therac-6)
The machine reliably treated thousands of patients, but occasionally there were serious accidents involving major injuries and at least one death.
Software problems:
• No locks on shared variables (race conditions)
• Timing sensitivity in user interface
9
Example - Tele Denmark
Tele Denmark Internet, ISP
August 31, 1999:
• Internet service down for 3 hours
• Truck drove into the power supply cabinet at Tele Denmark
• Where were the UPSs?
  • Old ones had been disconnected for upgrade
  • New ones were on the truck!
10
Dependable
Webster Dictionary
Dependable: capable of being depended on: RELIABLE
Reliable: suitable or fit to be relied on: DEPENDABLE
Rely:
• 1) to be dependent <the system on which we rely for water>
• 2) to have confidence based on experience <someone you can rely on>
11
Dependability
Dependability is that property of a computer system such that reliance can justifiably be placed on the service it delivers.
Attributes of dependability:
• Reliability
• Availability
• Safety
• Confidentiality
• Integrity
• Maintainability
Fault tolerance is not a system requirement. It is one of the mechanisms that can be used to provide dependability.
12
Motivation
Extreme fault tolerance has always been around:
• NASA’s deep space probes
• Medical computing devices (e.g., pacemakers)
But now fault tolerance is becoming more important:
• More reliance on computers
Extreme fault tolerance:
• Car controllers (e.g., anti-lock brakes), etc.
High fault tolerance:
• Commercial servers (databases, web servers), file servers, etc.
Some fault tolerance:
• Desktops, laptops (really!), etc.
13
Reliability R(t) - Unreliability
R(t) is the probability that the system performs as specified without interruption over the entire interval [0,t]. R(t) is conditioned on the system being operational at time t=0.
The time t can be very long, e.g., years in the case of space applications.
Unreliability F(t) is the probability that the system fails at any time in the interval [0,t].
F(t) = 1 - R(t)
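To make the relation between R(t) and F(t) concrete, here is a minimal sketch that assumes a constant failure rate, which gives the exponential model R(t) = exp(-rate * t). The constant-rate assumption, the function names, and the example numbers are illustrative and are not taken from the slides.

```python
import math

def reliability(failure_rate: float, t: float) -> float:
    """R(t) under an ASSUMED constant failure rate (exponential model)."""
    if failure_rate < 0 or t < 0:
        raise ValueError("failure_rate and t must be non-negative")
    return math.exp(-failure_rate * t)

def unreliability(failure_rate: float, t: float) -> float:
    """F(t) = 1 - R(t): probability of at least one failure in [0, t]."""
    return 1.0 - reliability(failure_rate, t)

# Hypothetical example: on average one failure per 10 years, 5-year mission.
rate_per_year = 0.1
mission_years = 5.0
print(f"R({mission_years:g}) = {reliability(rate_per_year, mission_years):.4f}")
print(f"F({mission_years:g}) = {unreliability(rate_per_year, mission_years):.4f}")
```

Whatever failure model is chosen, the two values always sum to 1, because the system either survives the whole interval or fails somewhere in it.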
14
Availability A(t)
A(t) is the probability that the system is up and running correctly at time t
This is different from reliability:
• Reliability considers the interval [0,t]
• Availability considers an instant of time
Examples: transaction processing systems, e.g., reservation systems
15
Reliability vs. Availability
Example: A system that fails, on average, once per hour but restarts automatically in ten milliseconds is not very reliable but is highly available.
Availability = Uptime/(Uptime+Downtime) = (3600000-10)/3600000 = 0.9999972 (times in milliseconds)
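The same arithmetic can be checked with a few lines of code. This is only a sketch of the uptime/downtime ratio from the example; the function name is illustrative.

```python
def availability(uptime: float, downtime: float) -> float:
    """Availability as uptime / (uptime + downtime); both in the same unit."""
    total = uptime + downtime
    if total <= 0:
        raise ValueError("total observation time must be positive")
    return uptime / total

# The example above, in milliseconds: one failure per hour, 10 ms restart.
hour_ms = 3_600_000
restart_ms = 10
a = availability(hour_ms - restart_ms, restart_ms)
print(f"{a:.7f}")  # prints 0.9999972
```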
16
Nines of Availability
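The "nines" shorthand maps an availability such as 0.999 (three nines) to a yearly downtime budget. A minimal sketch of that arithmetic follows, assuming a 365-day year; the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year

def availability_for_nines(n: int) -> float:
    """Availability with n leading nines, e.g. n=3 -> 0.999."""
    return 1.0 - 10.0 ** (-n)

def downtime_minutes_per_year(n: int) -> float:
    """Allowed downtime per year, in minutes, for n nines."""
    return MINUTES_PER_YEAR * 10.0 ** (-n)

for n in range(1, 7):
    print(f"{n} nines  A = {availability_for_nines(n):.6f}  "
          f"downtime = {downtime_minutes_per_year(n):10.2f} min/year")
```

Under this assumption, five nines leaves a budget of roughly 5.3 minutes of downtime per year.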
17
Question ?
Which has higher availability?
• (1) two 4-hour outages per year
• (2) one 1-minute outage per day
• A(1) = (365*24-2*4)/(365*24) = 0.9991
• A(2) = (24*60-1)/(24*60) = 0.9993
For an Internet-based company such as eBay or AOL, which would be more desirable? Why?
For a hacker?
Need to specify details of acceptable outages
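The two outage patterns have nearly the same availability but very different shapes, which is why the acceptable outages need to be specified in detail. The sketch below compares them on availability and on the length of a single outage; the class and field names are illustrative, and a 365-day year is assumed.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year

@dataclass
class OutagePattern:
    name: str
    outages_per_year: int
    minutes_per_outage: float

    def downtime_minutes(self) -> float:
        """Total downtime accumulated over one year."""
        return self.outages_per_year * self.minutes_per_outage

    def availability(self) -> float:
        """Fraction of the year during which the system is up."""
        return 1.0 - self.downtime_minutes() / MINUTES_PER_YEAR

few_long = OutagePattern("two 4-hour outages per year", 2, 240.0)
many_short = OutagePattern("one 1-minute outage per day", 365, 1.0)

for p in (few_long, many_short):
    print(f"{p.name}: A = {p.availability():.4f}, "
          f"single outage = {p.minutes_per_outage:g} min, "
          f"outages/year = {p.outages_per_year}")
```

The availability figure alone hides the difference between one long interruption and many brief ones, so a requirement usually has to bound outage duration and frequency as well.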
18
Safety S(t)
S(t) is the probability that the system does not fail in the interval [0,t] in such a manner as to cause unacceptable damage or other terrible effects.
Safety is an attribute of a system that either operates correctly or fails in a safe manner.
Safety is a measure of the fail-safe capability of the system:
• A system can be unreliable, yet safe
• Bias towards safe failure
19
Safety
Safety is a property of a system that it will not endanger human life or the environment
A safety-related system is one by which the safety of equipment or plant is assured
The term safety-critical system is normally used as a synonym for a safety-related system, although it may suggest a system of high criticality
20
Confidentiality
Absence of unauthorized disclosure of information
• Microsoft source code vs. Linux source code
• Web browsing
• Operating systems security model
• Files
• Medical records
• Credit card transaction records
• School grades
21
Integrity
Absence of improper system state alterations
• Operating systems: files, memory, network packets
• Linux kernel backdoor attempt
• Database records
• Your bank account
• File transfer
• Did I really get the right version of software X?
22
Maintainability M(t)
M(t) is the probability that a failed system will be restored within a specified period of time t
Restoration process:
• Locating the problem, e.g., via diagnostics
• Physically repairing the system
• Bringing the system back to its operational condition
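One simple way to put a number on M(t) is to estimate it from observed repair times: the fraction of past repairs that finished within t. The sketch below does exactly that. The repair log is hypothetical, and the empirical-fraction estimator is an assumption made for illustration, not a method stated in the slides.

```python
from typing import Sequence

def empirical_maintainability(repair_hours: Sequence[float], t: float) -> float:
    """Fraction of observed repairs completed within t hours (estimate of M(t))."""
    if not repair_hours:
        raise ValueError("need at least one observed repair time")
    within = sum(1 for r in repair_hours if r <= t)
    return within / len(repair_hours)

# HYPOTHETICAL repair log: hours to locate, repair, and restore per incident.
repairs = [0.5, 1.0, 1.5, 2.0, 4.0, 8.0]
for t in (1.0, 2.0, 8.0):
    print(f"M({t:g} h) ~= {empirical_maintainability(repairs, t):.2f}")
```

Each recorded time covers the whole restoration process, from diagnosis to return to service, so the estimate reflects all three steps together.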
23
Ex. Dependability Requirements
Telecommunications:
• Availability, maintainability
Transportation:
• Reliability, availability, safety
Weapons:
• Safety
Nuclear systems:
• Safety
24
Dependability of Pacemaker
What matters for this system?
• Correct computation?
• Correct logic?
• Usability?
No: the safety of the patient.
25
Dependability of Pacemaker
General characteristics:
Eight-bit processors, moving to 32-bit
Software:
• Approximately 30K lines, mostly “C”
• Vastly more software in external programmer
Patient data storage example:
• 200 samples/sec
Long battery life necessary:
• Device “sleeps” between heart beats
26
Dependability of Pacemaker
Is reliability the goal?
• Typical battery life is five years, but persistent storage is needed
Is availability the goal? How about an availability of 0.99999?
• This corresponds to an average of five minutes per year of downtime
• Death would result if this occurred all at once
Is safety the goal? It’s safe when it’s off - or is it?
• Leaving the system off might result in death very quickly
27
Security
Security is a combination of attributes:
• Integrity
• Confidentiality
• Availability
Under different situations, these attributes are more or less important:
• Denial of service is an availability issue
• Disclosure of information is a confidentiality issue
28
Performability P(L,t)
P(L,t) is the probability that the system performance will be at or above some level L at time t
Measure of the likelihood that some subset of the function is performed correctly
This differs from reliability, which dictates that all functions are performed correctly
29
Graceful Degradation
The ability of a system to automatically decrease its level of performance to compensate for hardware failures and software errors
30
Testability
Testability: ease of detecting presence of a fault