practical reports on dependability manifestation of system failure site unavailability system...


Upload: hollie-hoover

Post on 03-Jan-2016



Practical Reports on Dependability


Manifestation of System Failure

• Site unavailability

• System exception/access violation

• Incorrect result

• Data loss/corruption

• Slow down


PAGE UNAVAILABLE



System Exception


Performance Slowdown


DOWNTIME

• Unplanned: 20%

• Planned: 80%

15% contribution



DOWNTIME

• Unplanned: 20%
– Software/human: 80%
– Other: 20%

• Planned: 80%


UNPLANNED DOWNTIME

• Software/human: 80%

• Other: 20%


UNPLANNED DOWNTIME

• Software/human: 80%
– Software: 40%
– Operator: 40%

• Other: 20%


UNPLANNED DOWNTIME

• Software: 40%

• Operator: 40%

• Other: 20%
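
As a rough worked example, the two breakdowns can be combined to express each cause as a share of total downtime. This sketch assumes the percentages nest the way the slides suggest (unplanned downtime is 20% of all downtime, and software, operator, and other are shares of that unplanned part). The function and variable names are illustrative and do not come from the slides.

```python
# Percentages as shown on the slides; how they nest is an assumption.
UNPLANNED_SHARE_OF_DOWNTIME = 20  # percent of all downtime

# Breakdown of unplanned downtime, in percent of unplanned downtime.
UNPLANNED_BREAKDOWN = {"software": 40, "operator": 40, "other": 20}


def share_of_total_downtime(breakdown, parent_share):
    """Convert shares of a sub-category into shares of the whole (in percent)."""
    return {cause: parent_share * pct / 100 for cause, pct in breakdown.items()}


shares = share_of_total_downtime(UNPLANNED_BREAKDOWN, UNPLANNED_SHARE_OF_DOWNTIME)
for cause, pct in shares.items():
    print(f"{cause}: {pct:.0f}% of total downtime")
```

Under these assumptions, software and operator errors each account for 8% of total downtime and other causes for 4%.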


Software Errors

Triggers

• Resource exhaustion

• Logical errors

• System Overload

• Recovery code

• Failed upgrade


Logical Error


SYSTEM OVERLOAD


Operator Errors

Triggers

• Configurational
– Incorrect parameter setting

• Procedural
– Omitted/incorrect maintenance action

• Miscellaneous


FAILURE

DURATION

• Short (minutes)

• Long (weeks)
– Implies large fault chains

FREQUENCY

• Permanent (down until problem fixed)

• Transient (resolves without intervention)

• Intermittent (transient + occasional)

SCOPE

• Entire system

• Parts of the system
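
One way to make the three dimensions above concrete is to record them as fields on a failure report. The sketch below is only an illustration: the class names, the `needs_intervention` helper, and the sample report are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from enum import Enum


class Duration(Enum):
    SHORT = "minutes"
    LONG = "weeks"


class Frequency(Enum):
    PERMANENT = "down until problem fixed"
    TRANSIENT = "resolves without intervention"
    INTERMITTENT = "transient + occasional"


class Scope(Enum):
    ENTIRE_SYSTEM = "entire system"
    PARTS_OF_SYSTEM = "parts of the system"


@dataclass(frozen=True)
class FailureReport:
    description: str
    duration: Duration
    frequency: Frequency
    scope: Scope

    def needs_intervention(self) -> bool:
        # Interpretation: only permanent failures stay down until fixed.
        return self.frequency is Frequency.PERMANENT


# Hypothetical report, invented for illustration.
report = FailureReport(
    description="checkout page returns errors",
    duration=Duration.SHORT,
    frequency=Frequency.TRANSIENT,
    scope=Scope.PARTS_OF_SYSTEM,
)
print(report.needs_intervention())  # False
```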


Fault Chains

• “the series of component failures that led up to a user-visible failure”

• Uncoupled
– Independent failures

• Tightly coupled
– Cascading/correlated failure
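
The distinction between uncoupled and tightly coupled chains can be sketched as a small data structure in which each component failure optionally names the earlier failure that triggered it. This is a minimal illustration under that assumption. The `caused_by` field, the function name, and the component names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ComponentFailure:
    component: str
    caused_by: Optional[str] = None  # earlier failed component, if any


def is_tightly_coupled(chain: List[ComponentFailure]) -> bool:
    """True if any failure in the chain was triggered by another failure in it."""
    failed = {failure.component for failure in chain}
    return any(failure.caused_by in failed for failure in chain)


# Hypothetical chains, invented for illustration.
independent = [ComponentFailure("disk"), ComponentFailure("network switch")]
cascading = [
    ComponentFailure("power supply"),
    ComponentFailure("disk", caused_by="power supply"),
]

print(is_tightly_coupled(independent))  # False
print(is_tightly_coupled(cascading))    # True
```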


Non-Malicious Software Failure

• Most common causes
– Routine maintenance
– Software upgrade
– System integration

• Other causes
– System overload
– Resource exhaustion
– Complex fault-tolerant routines


“ROUTINE” MAINTENANCE

• Danske Bank 2003
– March 11: routine operation to replace a defective electrical unit in IBM DB2 disk system
– System failure: disks become inaccessible
– 6 hours later: system restarted
– March 12: batch systems running incorrectly
– Three more errors discovered:
1. Recovery process on several tables won’t start
2. Recovery jobs won’t run simultaneously
3. Recovery jobs can’t re-establish data in tables
– March 14: all data recovered and system functional
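
To put a number on the duration of this incident, the dated entries above can be laid out as a timeline and the elapsed time computed. This is a small sketch that uses only the dates given on the slide. The slide does not date the three recovery errors, so they are left out, and the `elapsed_days` helper is a hypothetical name.

```python
from datetime import date

# Dates as given on the slide; the year 2003 comes from the slide heading.
# The three recovery errors are not dated on the slide, so they are omitted.
timeline = [
    (date(2003, 3, 11), "Routine replacement of defective electrical unit; disks become inaccessible"),
    (date(2003, 3, 12), "Batch systems running incorrectly"),
    (date(2003, 3, 14), "All data recovered and system functional"),
]


def elapsed_days(events):
    """Days between the first and last event of a chronologically ordered list."""
    return (events[-1][0] - events[0][0]).days


print(elapsed_days(timeline))  # 3
```

By this count, a maintenance step described as routine led to roughly three days before the system was fully functional again.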