Post on 03-Jan-2016

Practical Reports on Dependability

Manifestation of System Failure

• Site unavailability

• System exception/access violation

• Incorrect result

• Data loss/corruption

• Slow down

PAGE UNAVAILABLE

System Exception

Performance Slowdown

DOWNTIME

• Planned: 80 %

• Unplanned: 20 %

15% contribution

UNPLANNED DOWNTIME

• Software/human: 80 % (software 40 %, operator 40 %)

• Other: 20 %
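The two breakdowns above can be combined to estimate each cause's share of total downtime. The following is a minimal sketch of that arithmetic, assuming the 40/40/20 split applies uniformly to the unplanned 20 %. The variable and function names are illustrative and do not come from the slides.

```python
# Shares taken from the slides; all names here are illustrative assumptions.
UNPLANNED_SHARE = 0.20  # unplanned portion of all downtime
CAUSE_SHARES = {"software": 0.40, "operator": 0.40, "other": 0.20}


def share_of_total_downtime(unplanned_share, cause_shares):
    """Return each cause's share of total (planned + unplanned) downtime."""
    return {cause: unplanned_share * share for cause, share in cause_shares.items()}


totals = share_of_total_downtime(UNPLANNED_SHARE, CAUSE_SHARES)
for cause, share in totals.items():
    print(f"{cause}: {share:.0%} of total downtime")
```

Under these assumptions, software and operator errors would each account for roughly 8 % of all downtime, and other causes for roughly 4 %.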

Software Errors

Triggers

• Resource exhaustion

• Logical errors

• System overload

• Recovery code

• Failed upgrade

Logical Error

SYSTEM OVERLOAD

Operator Errors

Triggers

• Configurational
– Incorrect parameter setting

• Procedural
– Omitted or incorrect maintenance action

• Miscellaneous

FAILURE

DURATION

• Short (minutes)

• Long (weeks)
– Implies large fault chains

FREQUENCY

• Permanent (down until problem fixed)

• Transient (resolves without intervention)

• Intermittent (transient + occasional)

SCOPE

• Entire system

• Parts of the system
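The three dimensions above (duration, frequency, scope) can be treated as a small classification scheme. The following is a minimal sketch of how a failure report might record them. The `FailureReport` type, its fields, and the sample report are hypothetical and are not part of the slides.

```python
from dataclasses import dataclass
from enum import Enum


class Duration(Enum):
    SHORT = "minutes"
    LONG = "weeks"  # per the slides, implies large fault chains


class Frequency(Enum):
    PERMANENT = "down until problem fixed"
    TRANSIENT = "resolves without intervention"
    INTERMITTENT = "transient + occasional"


class Scope(Enum):
    ENTIRE_SYSTEM = "entire system"
    PARTS_OF_SYSTEM = "parts of the system"


@dataclass(frozen=True)
class FailureReport:
    """Hypothetical record type, used only for illustration."""
    description: str
    duration: Duration
    frequency: Frequency
    scope: Scope

    def stays_down_until_fixed(self) -> bool:
        # Only permanent failures remain down until the problem is fixed.
        return self.frequency is Frequency.PERMANENT


# Made-up sample report, not data from the slides.
report = FailureReport(
    description="Page unavailable",
    duration=Duration.SHORT,
    frequency=Frequency.TRANSIENT,
    scope=Scope.PARTS_OF_SYSTEM,
)
print(report.stays_down_until_fixed())
```

Using enums keeps each report restricted to the categories the slides name, so a misspelled or unknown category fails early instead of silently creating a new class of failure.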

Fault Chains

• “The series of component failures that led up to a user-visible failure”

• Uncoupled
– Independent failures

• Tightly coupled
– Cascading/correlated failure
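One way to make the uncoupled/tightly coupled distinction concrete is to record, for each component failure, whether another failure in the same chain triggered it. The following is a rough sketch under that assumption. The `ComponentFailure` type, the `caused_by` field, and the sample chains are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ComponentFailure:
    """Hypothetical record for one link in a fault chain."""
    component: str
    caused_by: Optional[str] = None  # component whose failure triggered this one


def is_tightly_coupled(chain: List[ComponentFailure]) -> bool:
    """Return True if any failure was triggered by another failure in the chain."""
    failed = {f.component for f in chain}
    return any(f.caused_by in failed for f in chain)


# Made-up chains for illustration only.
uncoupled = [ComponentFailure("disk"), ComponentFailure("network")]
cascading = [ComponentFailure("disk"), ComponentFailure("database", caused_by="disk")]

print(is_tightly_coupled(uncoupled))
print(is_tightly_coupled(cascading))
```

Here the first chain counts as uncoupled because its failures are independent, while the second counts as tightly coupled because the database failure cascades from the disk failure.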

Non-Malicious Software Failure

• Most common causes
– Routine maintenance
– Software upgrade
– System integration

• Other causes
– System overload
– Resource exhaustion
– Complex fault-tolerant routines

“ROUTINE” MAINTENANCE

• Danske Bank 2003
– March 11: Routine operation to replace a defective electrical unit in IBM DB2 disk system
– System failure: Disks become inaccessible
– 6 hours later: System restarted
– March 12: Batch systems running incorrectly
– Three more errors discovered:
1. Recovery process on several tables won’t start
2. Recovery jobs won’t run simultaneously
3. Recovery jobs can’t reestablish data in tables
– March 14: All data recovered and system functional
