availability and reliability

Upload: sommerville-videos

Post on 11-Nov-2014

DESCRIPTION

Accompanies video on my YouTube channel on system availability and reliability

TRANSCRIPT

Page 1: Availability and reliability

Availability and reliability, 2013 Slide 1

Availability and Reliability

Page 2: Availability and reliability

Principal dependability properties

Page 3: Availability and reliability

• Reliability

– The probability of failure-free system operation over a specified time in a given environment for a given purpose

Page 4: Availability and reliability

• Availability

– The probability that a system, at a point in time, will be operational and able to deliver the requested services

Page 5: Availability and reliability

Availability specification

• Both reliability and availability attributes can be expressed as numbers:

– Availability of 0.999 means that the system is up and running for 99.9% of the time.
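
An availability figure is easier to judge when it is turned into expected downtime. The following is a minimal Python sketch of that arithmetic; the function name and the choice of a 365-day year of continuous operation are assumptions made for illustration, not part of the slides.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, system expected to run continuously


def expected_downtime_hours(availability: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime implied by an availability figure over a period."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_hours


if __name__ == "__main__":
    for a in (0.99, 0.999, 0.9999):
        hours = expected_downtime_hours(a)
        print(f"availability {a}: about {hours:.2f} hours of downtime per year")
```

Under these assumptions, an availability of 0.999 corresponds to roughly 8.76 hours of downtime per year.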

Page 6: Availability and reliability

Reliability specification

• Probability of failure on demand (POFOD) of 0.0001 means that, on average, 1 in 10,000 demands for service from a system will fail in some way.
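
POFOD can be estimated from observed demands as the fraction that failed. The sketch below assumes that demand and failure counts are available from testing or operation; the function names are hypothetical.

```python
def pofod(failed_demands: int, total_demands: int) -> float:
    """Estimate probability of failure on demand as failed / total demands."""
    if total_demands <= 0:
        raise ValueError("total_demands must be positive")
    if not 0 <= failed_demands <= total_demands:
        raise ValueError("failed_demands must be between 0 and total_demands")
    return failed_demands / total_demands


def expected_failures(pofod_value: float, demands: int) -> float:
    """Average number of failed demands expected for a given POFOD."""
    return pofod_value * demands


if __name__ == "__main__":
    p = pofod(failed_demands=1, total_demands=10_000)
    print(f"POFOD = {p}")
    print(f"expected failures in 50,000 demands = {expected_failures(p, 50_000)}")
```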

Page 7: Availability and reliability

Availability and reliability

• Availability and reliability are closely related

– Obviously, if a system is unavailable it is not delivering the specified system services.

Page 8: Availability and reliability

• However, it is possible to have systems with low reliability that must be available.

– So long as system failures can be repaired quickly and do not damage data, some system failures may not be a problem.

Page 9: Availability and reliability

• Availability is therefore best considered as a separate attribute reflecting whether or not the system can deliver its services.

• Availability takes repair time into account, if the system has to be taken out of service to repair faults.
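
The effect of repair time can be sketched with the standard steady-state relation availability = MTTF / (MTTF + MTTR). The terms MTTF (mean time to failure) and MTTR (mean time to repair) do not appear on the slides and are introduced here as assumptions; the numbers are illustrative only.

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability as the fraction of time spent operating rather than under repair."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mttf_hours must be positive and mttr_hours non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    # Fails often (every 10 hours on average) but recovers in about 36 seconds.
    fails_often_fast_repair = steady_state_availability(mttf_hours=10.0, mttr_hours=0.01)
    # Fails rarely (every 1000 hours on average) but takes 10 hours to repair.
    fails_rarely_slow_repair = steady_state_availability(mttf_hours=1000.0, mttr_hours=10.0)
    print(f"frequent failures, quick repair: {fails_often_fast_repair:.4f}")
    print(f"rare failures, slow repair:      {fails_rarely_slow_repair:.4f}")
```

With these figures the system that fails more often is still the more available one, because its repairs are short.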

Page 10: Availability and reliability

Availability perception

• Availability is usually expressed as a percentage of the time that the system is available to deliver services e.g. 99.9%.

Page 11: Availability and reliability

Page 12: Availability and reliability

Subjective availability

• The number of users affected by the service outage.

– Loss of service in the middle of the night is less important for many systems than loss of service during peak usage periods.

Page 13: Availability and reliability

• The length of the outage.

– The longer the outage, the more the disruption. Several short outages are less likely to be disruptive than one long outage. Long repair times are a particular problem.
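
One way to see why a single percentage hides these factors is to weight each outage by the number of users it affected. The "user-minutes lost" measure below is a hypothetical illustration, not a metric defined in the slides, and the outage figures are made up.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    minutes: float        # how long the service was unavailable
    users_affected: int   # users trying to use the service at the time


def total_downtime_minutes(outages: list[Outage]) -> float:
    """Plain downtime, which is all a percentage availability figure reflects."""
    return sum(o.minutes for o in outages)


def user_minutes_lost(outages: list[Outage]) -> float:
    """Downtime weighted by the number of users affected (hypothetical measure)."""
    return sum(o.minutes * o.users_affected for o in outages)


if __name__ == "__main__":
    night_outage = [Outage(minutes=60, users_affected=20)]
    peak_outage = [Outage(minutes=60, users_affected=5000)]
    for name, log in (("night", night_outage), ("peak", peak_outage)):
        print(name, total_downtime_minutes(log), user_minutes_lost(log))
```

Both logs give the same measured availability, yet the peak-time outage costs far more user-minutes.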

Page 14: Availability and reliability

Reliability metrics

• Probability of failure on demand (POFOD)

– Probability that a system will not deliver a service correctly when requested

– Used for systems where demands are infrequent and intermittent

Page 15: Availability and reliability

• Rate of occurrence of failure (ROCOF)

– Number of system failures in a given time period

– Used for transaction processing systems with frequent and regular transactions
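
ROCOF can be computed from a failure count and the observation period. The sketch below also takes the reciprocal as a mean time to failure; that step assumes failures occur at a roughly constant rate and is not stated on the slides. Function names are hypothetical.

```python
def rocof(failures: int, operating_hours: float) -> float:
    """Rate of occurrence of failure: failures per hour of operation."""
    if operating_hours <= 0:
        raise ValueError("operating_hours must be positive")
    if failures < 0:
        raise ValueError("failures must be non-negative")
    return failures / operating_hours


def mean_time_to_failure(rocof_value: float) -> float:
    """Reciprocal of ROCOF (assumes a roughly constant failure rate)."""
    if rocof_value <= 0:
        raise ValueError("rocof_value must be positive")
    return 1.0 / rocof_value


if __name__ == "__main__":
    rate = rocof(failures=2, operating_hours=1000.0)
    print(f"ROCOF = {rate} failures per hour")
    print(f"MTTF  = {mean_time_to_failure(rate):.0f} hours")
```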

Page 16: Availability and reliability

• Fault

– A characteristic of a software system that can lead to a system error.

• Error

– An erroneous system state that can lead to system behavior that is unexpected by system users.

• Failure

– An event that occurs at some point in time when the system does not deliver a service as expected by its users.

Page 17: Availability and reliability

Faults-errors-failures

Fault → Error → Failure

Page 18: Availability and reliability

Faults and failures

• Failures are usually a result of system errors.

• The incorrect state causes undesirable system behaviour

• Incorrect state is a consequence of executing faulty code

Page 19: Availability and reliability

• However, faults do not necessarily result in system errors

– The erroneous system state resulting from the fault may be transient and ‘corrected’ before an error arises.

– The faulty code may never be executed.
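
A small Python sketch can make the distinction concrete. The function below is a made-up example with a deliberately injected fault in its median branch: the fault is present in the code at all times, but it only produces a wrong result when that branch runs on unsorted input.

```python
def summarise(values: list[float], mode: str = "mean") -> float:
    """Return the mean or the median of a non-empty list (contains a deliberate fault)."""
    if mode == "mean":
        return sum(values) / len(values)
    # Injected fault: the list is never sorted before taking the middle element,
    # so this branch is only correct when the input happens to be sorted.
    return values[len(values) // 2]


if __name__ == "__main__":
    data = [3, 1, 2]
    # The faulty branch is never executed, so no error arises.
    print(summarise(data))
    # The faulty branch runs, but sorted input masks the fault.
    print(summarise(sorted(data), mode="median"))
    # The faulty branch runs on unsorted input and the result is wrong (true median is 2).
    print(summarise(data, mode="median"))
```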

Page 20: Availability and reliability

• Errors do not necessarily lead to system failures

– The error can be corrected by built-in error detection and recovery

– The failure can be protected against by built-in protection facilities. These may, for example, protect system resources from system errors.

Page 21: Availability and reliability

Reliability achievement

• Fault avoidance

– Development techniques are used that either minimise the possibility of mistakes or trap mistakes before they result in the introduction of system faults.

Page 22: Availability and reliability

• Fault detection and removal

– Verification and validation techniques that increase the probability of detecting and correcting errors before the system goes into service are used.

Page 23: Availability and reliability

• Fault tolerance

– Run-time techniques are used to ensure that system faults do not result in system errors and/or that system errors do not lead to system failures.
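
One possible run-time technique is to check each result with an acceptance test and fall back to an alternative implementation when the check fails. The slides do not name a specific technique; the sketch below is an illustration only, and `fast_sqrt` is a hypothetical routine with an injected fault.

```python
import math


def fast_sqrt(x: float) -> float:
    """Hypothetical primary routine with a deliberately injected fault for x > 100."""
    if x > 100:
        return -1.0  # injected fault: wrong result for large inputs
    return math.sqrt(x)


def acceptable(x: float, result: float, tol: float = 1e-9) -> bool:
    """Acceptance test: the result must be non-negative and square back to x."""
    return result >= 0 and abs(result * result - x) <= tol * max(1.0, x)


def tolerant_sqrt(x: float) -> float:
    """Try each implementation in turn and return the first result that passes the test."""
    for implementation in (fast_sqrt, math.sqrt):
        try:
            result = implementation(x)
        except Exception:
            continue  # treat an exception as a detected error and try the alternative
        if acceptable(x, result):
            return result
    raise RuntimeError(f"no implementation produced an acceptable result for {x}")


if __name__ == "__main__":
    print(tolerant_sqrt(9.0))    # primary routine passes the acceptance test
    print(tolerant_sqrt(144.0))  # primary routine is rejected, fallback is used
```

The fault in the primary routine still exists, but the acceptance test stops the resulting error from reaching the caller as a failure.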

Page 24: Availability and reliability

Summary

• Availability is the probability that a system will be available when a service request is made

• Reliability is the probability that a system will deliver a service as expected by users

Page 25: Availability and reliability

Summary

• Software faults lead to state errors, which lead to operational failures

• Fault avoidance, detection and tolerance are strategies for achieving reliability