fundamental concepts neeraj suri

65
© Suri/DEEDS SWFT Course 2007-08 1 Fundamental Concepts Neeraj Suri Software Fault Tolerance (SWFT)

Upload: marged

Post on 11-Jan-2016

46 views

Category:

Documents


4 download

DESCRIPTION

Software Fault Tolerance (SWFT). Fundamental Concepts Neeraj Suri. Overview. Motivation for Fault Tolerance Terminology Faults, Errors and Failures Dependability Recovery Backward and forward Redundancy Error Confinement. Examples of Software Failures. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 1

Fundamental Concepts

Neeraj Suri

Software Fault Tolerance (SWFT)

Page 2: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 2

Overview

Motivation for Fault Tolerance Terminology

Faults, Errors and Failures

Dependability Recovery

Backward and forward

Redundancy Error Confinement

Page 3: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 3

Examples of Software Failures Denver airport: Failure in luggage management system opening

delayed for several months Failure of a space probe sent to Mars due to inhomogeneity of

measuring units (inch and cm) Problems when space shuttle Endeavor met with Intelstat 6 due to

rounding of near-zero values Flaw in Apollo 11 software made moons gravity repulsive rather

than attractive Ariane V guidance failure AT&T system suffered a 9 hour US-wide blockade

Switch experienced abnormal behavior due to flaws in recovery Switch experienced abnormal behavior due to flaws in recovery software and propagated to all switchessoftware and propagated to all switches

Software problem caused radiation safety door of a nuclear power processing plant in the UK to open accidentally

Several patients killed through radiation overdoses due to software flaws in Therac-25 (cancer treatment system)

... (many, many, many more examples – WinXP, Vista !!!)

Page 4: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 4

Software Failures

33%

62%

Page 5: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 5

Has this trend continued?

[2] D. Oppenheimer et al. “Why do Internet services fail, and what can be done about it”, USENIX 2003.

(50%) (94%) (76%) (79%)(tolerance)

45%

2,5%

35%

17,5%

Page 6: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 6

Microsoft EULA

EXCLUSION OF INCIDENTAL, CONSEQUENTIAL AND CERTAIN OTHER DAMAGES. TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, IN NO EVENT SHALL MICROSOFT OR ITS SUPPLIERS BE LIABLE FOR ANY SPECIAL, INCIDENTAL, INDIRECT, OR CONSEQUENTIAL DAMAGES WHATSOEVER (INCLUDING, BUT NOT LIMITED TO, DAMAGES FOR LOSS OF PROFITS OR CONFIDENTIAL OR OTHER INFORMATION, FOR BUSINESS INTERRUPTION, FOR PERSONAL INJURY, FOR LOSS OF PRIVACY, FOR FAILURE TO MEET ANY DUTY INCLUDING OF GOOD FAITH OR OF REASONABLE CARE, FOR NEGLIGENCE, AND FOR ANY OTHER PECUNIARY OR OTHER LOSS WHATSOEVER) ARISING OUT OF OR IN ANY WAY RELATED TO THE USE OF OR INABILITY TO USE THE SOFTWARE PRODUCT, THE PROVISION OF OR FAILURE TO PROVIDE SUPPORT SERVICES, OR OTHERWISE UNDER OR IN CONNECTION WITH ANY PROVISION OF THIS EULA, EVEN IN THE EVENT OF THE FAULT, TORT (INCLUDING NEGLIGENCE), STRICT LIABILITY, BREACH OF CONTRACT OR BREACH OF WARRANTY OF MICROSOFT OR ANY SUPPLIER, AND EVEN IF MICROSOFT OR ANY SUPPLIER HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

Page 7: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 7

“Mistakes in software development will continue to be made, no matter how carefully the software is built, and failures will continue to occur: that is the way things are in engineering.”

John Knight, 2004

Page 8: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 8

Reasons for Bad Software

Economic / Engineering Reasons need to sell software to pay for development diminishing return for fixing rare bugs

Legal Reasons no legal requirements for good software

Legacy Reasons need to be (bug) compatible

Technical Reasons software is complex!

Page 9: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 9

Economical Reasons

Budget time, persons, money “too many bugs, too little time”

Time To Market expenses vs income

Reliability vs Features know thy customer!

Cost vs Benefit Tradeoff

Page 10: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 10

Engineering Issues

Undetected bugs trigger conditions to complex to find bug

No / benign effects effect of bug is benign, or activation of bug is unlikely

Not easy to fix design is just flawed

Might introduce more bugs developers do not know code sufficiently well

Page 11: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 11

Technical Reasons

Incomplete understanding of systems Too high complexity makes it impossible to make sure

that software is correct

Complexity For correctness, complexity should be kept minimal

Page 12: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 12

Software Bugs

Professional programmer: about 100-150 “bugs” per 1000 lines of code exact value depends on many factors…

Example: Windows XP is about 45 million LOC would imply about 4.5 to 6.75 million bugs

Page 13: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 13

Software Testing

Rule of thumb: In each round of testing one finds at most half of the

bugs

Multiple rounds of testing: find new bugs with new tests

Example: (million bugs) 4.5 2.251.120.560.280.14…

“Program testing can be a very effective way to

show the presence of bugs, but it is hopeless inadequate for showing their absence!”

Dijkstra

Page 14: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 14

Software Size

A. Chou et al. “An Empirical Study of Operating System Errors”, SOSP 2001

Page 15: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 15

Challenge: Software Continues to Grow...

Figure: Windows Generations

Page 16: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 16

Projected Software Bugs

A. Chou et al. “An Empirical Study of Operating System Errors”, SOSP 2001

Page 17: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 17

Software Bugs are a Fact

“If a problem has no solution, it may not be a problem, but a fact, not to be solved, but to be coped with over time”

Shimon Peres

Page 18: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 18

Thesis

Using software fault tolerance mechanisms one could provide better software, at a lower cost, and shorter time-to-market

Example: Apache is very successful project Uses many SWFT mechanisms to increase its robustness

Page 19: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 19

Defense in Depth

Strategy: Avoid faults (e.g., reduce stress of programmers, better

languages) Remove faults (e.g., use appropriate tools) Cope with faults (i.e., tolerate faults) Predict faults

Page 20: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 20

Motivation

Progress in software engineering Design Testing Formal methods ...

Experience still shows that assuming software to be bug free is … (& can be hazardous to your health )

Page 21: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 21

Dependability Threats

Failure (“Ausfall”) Observable deviation from the specification

Error (“Fehler”) Part of the system state that may lead to a failure

Fault (“Störung”) “Defect” or “Flaw” of a system An active fault produces an error. Otherwise, it is

dormant.

Page 22: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 22

Chain of Dependability Threats

Fault Error Failure

A fault activates an error A error propagates to become a failure

activation propagation

Page 23: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 23

Example

0 int size = 0 ; long* p = 0;1 add_element (long value) 2 size += 1;3 p = (int*) realloc(p,sizeof(long)*size);4 p[size-1] = value;5 }

Activation: add_element(...) and no memory;

Error: p=0 and size > 0 (* before line 4 *);

Propagation: execution of line 4

Failure: segmentation violation

Page 24: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 24

Error Propagation

Error Error

Failure of one component can activate an error in another component

One components fault is another components failure

ComponentFault

Failure

Component

Page 25: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 25

Goal of Fault Tolerance/Robustness

The Goal of Fault Tolerance/Robustness is to

Avoid/Mitigate Failures in the Presence of Faults

Page 26: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 26

Fault Tolerance

component

Error component

failure

fault

System Interface

fault

error free

Page 27: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 27

Basic Fault Tolerant Approach

Define set of likely faults and likely bounds on fault frequency (failure model)

Components must be able to tolerate specified faults

Faults outside failure model might or might not be handled

Page 28: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 28

Origin of Faults

ProcessManageme

nt

Specification

Operation

Maintenance

HumanInteraction

Design

Implementation

Environment

Component

DefectsReuse

Requirements

Engineering

Documentation

Page 29: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 29

Fault Classification

Persistence Transient fault Intermittent fault (periodic fault) Permanent fault

Creation time Design fault Operational fault

Intention Accidental fault Intentional fault

Page 30: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 30

Classical Failure Model

Crash failure Fail-silent and Fail-stop

Omission failure Timing failure

System fails to respond within a specified time slice Both late and early responses might be “bad” Late timing failure = performance failure

Arbitrary failure System behaves arbitrarily

Page 31: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 31

Failure Hierarchy

The algorithms used forachieving any kind of

fault tolerance depend onthe system and failure model

Arbitrary

Timing

Omission

Crash

Fail-stop

Page 32: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 32

Achieving Dependability

Fault tolerance should not be applied in isolation

Fault AvoidanceFault

Forecasting

Fault RemovalFault

Tolerance

Page 33: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 33

Fault Avoidance

Reduce the number of faults during software construction Rigorous Software Development Process

• Requirements Specification & Analysis• Structured Design• Well-defined Mapping to Programming Languages• Clear Documentation

Formal Methods Software Reuse

Page 34: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 34

Rigorous Software Development (1)

Requirements elicitation Discover what features each stakeholder expects the

system to provide Imperfect process

• Technical and non-technical people have to collaborate• Use-cases

Computer scientists can’t be experts in all application areas

Current software development processes are rigid• No support for “iterative” requirements engineering

Page 35: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 35

Rigorous Software Development (2)

Requirements Analysis / Specification Specify in a clear and precise way what functionality your

system must provide Complete, but not too complex Consistent Determine (or even better: generate) test cases Should not be rigid

Page 36: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 36

Drink Distributor Example (1)

Provides hot drinks: coffee, tea and chocolate User interface Cycle treatment

1. Insert money2. Choose drink3. Take change4. Take drink

Or press cancel and coins are given back

ChangeDrink

Coins Cancel

CoffeeTeaChocolate

Drink Distributor

Page 37: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 37

Drink Distributor Example (2)

Incomplete specification No deadline for cancellation specified What if user inserts new coins before the end of a

cycle? What if the user changes his selection? What should be done when resources (change,

cups, spoons, sugar, coffee, tea, chocolate, water) run out?

Provide partial service?(e.g. only tea and coffee / require exact change)

If manufacturer and user make divergent interpretations, operation time failure will occur

Page 38: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 38

Drink Distributor Example (3)

Augment specification Cancellation not possible once drink has been

chosen Add green / red light to indicate cycle start Only the first selected beverage is taken into

account Add lights to show availability of drinks

Each omission of constraint in the specification can lead to a failure in the service delivered to the user

Dissatisfaction Loss of money

Page 39: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 39

Rigorous Software Development (3)

Structured design For instance in Object-Orientation:

Apply O-O principles, e.g. abstraction, information hiding, modularity, classification, to reduce complexity of the solution

Provide easy-to-read documentation• UML

Programming Methodology Good programming discipline

• Pair-programming Well-defined mapping of design models to programming

constructs Standards or coding conventions

Page 40: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 40

Formal Methods (1)

Specifications are developed using mathematically tractable languages and tools Petri Nets, Algebraic Specifications

Permits to prove desired properties Generation of test cases Generation of code!

Page 41: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 41

Formal Methods (2)

Mathematical specifications of software tend to be equal in size as the program itself just as error-prone

Tools (model-checkers) still face algorithmic challenges when attempting to prove properties of huge models

However, have been successfully applied for safety-critical components

Page 42: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 42

Software Reuse

Well exercised software is less likely to fail Save development cost Undiscovered faults may appear when the

component is used in a new environment

Page 43: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 43

Fault Removal

Detect and remove existing faults using proofs Assertion checking in simulator

Testing Exhaustive testing not feasible Can’t show the absence of faults Quality measures

Formal Inspection

Page 44: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 44

Fault Forecasting

Also known asSoftware reliability measurement [Lyu96]

Estimation Gather failure data during operation or testing Apply statistical inference techniques

Prediction Gather software metrics during development

Fault forecasting can indicate the need for additional testing or for applying fault tolerance

Page 45: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 45

Seriousness Classes (1)

DO-178B (standard for SW certification), civil aeronautics Without effects Minor / benign

• Upset passengers, small increase in workload for the Upset passengers, small increase in workload for the crewcrew

Major / significant• Injuries of the passengers / crew and reducing the Injuries of the passengers / crew and reducing the

efficiency of the crewefficiency of the crew Dangerous / serious

• Small number of casualties / serious injuries, or Small number of casualties / serious injuries, or preventing the crew from achieving its task in a preventing the crew from achieving its task in a precise and complete mannerprecise and complete manner

Catastrophic / disastrous• Leading to human lives lostLeading to human lives lost

Page 46: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 46

Seriousness Classes (2)

DO-178B, civil aeronautics Without effects Minor / benign

• Probable: p > 10Probable: p > 10-5-5

Major / significant• Rare: 10Rare: 10-7-7 < p < 10 < p < 10-5-5

Dangerous / serious• Extremely rare: 10Extremely rare: 10-9 -9 < p < 10< p < 10-7-7

Catastrophic / disastrous• Extremely improbable: p < 10Extremely improbable: p < 10-9 -9

Page 47: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 47

Software Fault Tolerance

Tolerate faults that remain in the system after development, preventing system failure Remove errors and their effects from the computational

state before a failure occurs

Page 48: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 48

Classification

Single Version Software Monitoring techniques, atomicity of actions, decision

verification, exception handling

Multi-version Software Functionally independent, yet equivalent software Recovery blocks, N-version programming, …

Multiple Data Representation Retry blocks, N-copy programming, …

Page 49: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 49

Recovery

Error detection Identify erroneous state

Error diagnosis Assess the damage

Error containment Prevent further damage / error propagation

Error recovery Substitute the erroneous state with an error-free one

Page 50: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 50

Backward Error Recovery (1)

System state is saved at predetermined recovery points Called checkpointing Incremental checkpointing, log

State should be checkpointed on stable storage, not affected by failures

Recover error-free state by rolling back to a previously saved (error-free) state

Page 51: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 51

Backward Error Recovery (2)

Checkpoint

Checkpoint

Faultdetected

Fault manifests

Assumption: faulty behaviorAfter last checkpoint

Page 52: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 52

Backward Error Recovery (2)

Checkpoint

Checkpoint

Rollback

Depending on the technique used:• Try again• Try a different alternate• Do nothing (wait for the next request)

Page 53: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 53

Backward Error Recovery (2)

Checkpoint

Checkpoint

Checkpoint

Checkpoint

Page 54: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 54

Advantages of Backward Recovery

+ Requires no knowledge of the errors in the system state

+ Can handle arbitrary / unpredictable faults (as long as they do not affect the recovery mechanism)

+ Can be applied regardless of the sustained damage (the saved state must be error-free, though)

+ General scheme / application independent+ Particularly suitable for recovering from transient

faults

Page 55: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 55

Disadvantages of Backward Recovery

―Requires significant resources (e.g. time, computation, stable storage) for checkpointing and recovery

―Checkpointing requires To identify consistent states The system to be halted / slowed down temporarily

―Care must be taken in concurrent systems to avoid the domino effect

Page 56: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 56

Forward Error Recovery

Detect the error Damage assessment Build a new error-free state from which the system

can continue execution “Safe stop” Degraded mode Error compensation

• E.g., switching to a different component, etc…

Page 57: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 57

Forward Error Recovery (2)

Faultdetecte

d

Fault manifests

State Reconstruction

Damage Assessment

Page 58: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 58

Advantages of Forward Recovery

+ Efficient (time / memory) If the characteristics of the fault are well understood,

forward recovery is the most efficient solution

+ Well suited for real-time applications Missed deadlines can be addressed

+ Anticipated faults can be dealt with in a timely way using redundancy

Page 59: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 59

Disadvantages of Forward Recovery

—Application-specific—Can only remove predictable errors from the

system state—Requires knowledge of the actual error—Depends on the accuracy of error detection,

potential damage prediction, and actual damage assessment

—Not usable if the system state is damaged beyond recoverability

Page 60: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 60

Redundancy

Key concept of fault tolerance Hardware redundancy

• Most common use of redundancyMost common use of redundancy• We’re not going to address itWe’re not going to address it

Software redundancy• Additional applications, modules, objects used in the Additional applications, modules, objects used in the

system to support fault tolerancesystem to support fault tolerance Information redundancy

• Error-detecting or error-correcting codesError-detecting or error-correcting codes• Diverse dataDiverse data• Data produced for fault toleranceData produced for fault tolerance

Time redundancy• Use additional time for fault toleranceUse additional time for fault tolerance

Page 61: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 61

Architectural Structure

Systems, especially concurrent ones, are increasingly complex

Consist of several components / subcomponents Fault tolerance must account for that

Different fault tolerance approaches for each component Failure of a subcomponent can be perceived as a fault in

the parent component Clear structuring reduces complexity

Page 62: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 62

Error Confinement

System partitioned into regions, beyond which effects of faults should not propagate

Components should only be accessible through a well-defined (and preferably narrow [Kop97]) interface

Different confinement regions may employ different fault tolerance techniques depending on failure semantics of the environment and subcomponents

Page 63: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 63

Normal Processing

Idealized Fault-Tolerant Component (1)

Error Processing

Local Exception

Return to normal

Service Request

Service Request Reply

Reply

Interface Exception

Failure Exception

Failure Exception

Interface Exception

Page 64: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 64

Idealized Fault-Tolerant Component (2)

Receives requests for service Produces responses 3 kinds of exceptions

Interface exception: An invalid service request has been made

Local exception: An internal error is detected Failure exception: Component is unable to provide the

requested service Recursive structure

Page 65: Fundamental Concepts Neeraj Suri

© Suri/DEEDSSWFT Course 2007-08 65

Recommended Reading

[Lyu96] Lyu, M. R. (ed.): Handbook of Software Reliability Engineering, New York, IEEE Computer Society Press, McGraw-Hill, 1996.

[Kop97] Kopetz, H.: Real–Time Systems — Design Principles for Distributed Embedded Applications. Kluwer Academic Publishers, 1997.

[RX95] Randell, B.; Xu, J.: The Evolution of the Recovery Block Concept, chapter 1, pp. 1 – 21, in Lyu, M. R. (Ed.): Software Fault Tolerance, John Wiley & Sons, 1995.

[1] M. Sullivan and R. Chillarege, “A comparison of Software Defects in Database Management Systems and Operating Systems”, FTCS, p475-484, July 1992.

[2] D. Oppenheimer, A. Ganapathi, and D. A. Patterson, “Why do Internet services fail, and what can be done about it”, in Proc. USENIX Symposium on Internet Technologies and Systems, 2003.

[3] J. Gray, “A Census of Tandem System Availability Between 1985 and 1990”, Tandem Technical Report 90.1