CSE3308/CSC3080 - Software Engineering: Analysis and Design Lecture 7B.1

Software Engineering: Analysis and Design - CSE3308

Reliability

CSE3308/CSC3080/DMS/2000/17

Monash University - School of Computer Science and Software Engineering


Lecture Outline

What is reliability?

Failures and Faults

Why is reliability desirable?

Good Enough Software

Measuring reliability

Specifying reliability

Achieving a reliable system


What is reliability?

A formal definition - The probability of failure-free operation of a computer program in a specified environment for a specified time

An informal definition - How well the system users think it provides the required services

For a system to be reliable both the informal and the formal definitions must be satisfactorily met

e.g. an aeroplane navigation system may have a very low probability of failure, but even one failure may make it unreliable in the view of the pilot and the passengers


Aspects of reliability

Reliability cannot be defined in an absolute manner

Reliability can only be defined in relationship to a particular operational context

The relationship between the faults in a software product and the reliability of such a product is very complex

To properly consider the reliability of a piece of software, the impact of a fault must be assessed


Faults and Failures

A fault is a static software characteristic which causes a failure to occur

A failure corresponds to unexpected run-time behaviour observed by the user of the system

Faults don’t necessarily cause failures

If a user doesn’t notice a failure, is it a failure?


Faults and Failures (2)

Reliability is related to the probability that a fault will cause a failure while in operational use

One study found that removing 60% of the faults in a product increased reliability by only 3%

Many faults will only cause failures after hundreds or thousands of months of use

This is not necessarily something which can be safely ignored though

It was feared that Y2K faults might cause catastrophic failures after many years of reliable operation


Types of failures

Transient - Occurs only with certain inputs

Permanent - Occurs with all inputs

Recoverable - System can recover without operator intervention

Unrecoverable - Operator intervention needed to recover from failure

Non-Corrupting - Failure does not corrupt system state or data

Corrupting - Failure corrupts system state or data


Why is reliability desirable?

Reliability is only one of many desirable system characteristics

Ensuring reliability can be very expensive

Example - Bell Laboratories reported that it took 8 years to move software availability on one system from 99.9% to 99.98%

Reliability often conflicts with other system characteristics such as efficiency
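To put the Bell Laboratories availability figures in perspective, the short sketch below converts an availability percentage into expected downtime per year. It assumes a system that is meant to run continuously for a 365-day year; the function name is illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def annual_downtime_hours(availability_percent: float) -> float:
    """Expected hours per year during which the system is unavailable."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


before = annual_downtime_hours(99.9)   # availability before the 8-year effort
after = annual_downtime_hours(99.98)   # availability after the 8-year effort

print(f"99.9%  available -> {before:.2f} hours down per year")
print(f"99.98% available -> {after:.2f} hours down per year")
```

Under these assumptions the improvement is from roughly 8.8 hours of downtime per year to roughly 1.8 hours, which shows how much effort the last fraction of a percent can cost.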


The penalties of reliability

Increases costs by:

redundant hardware

additional design

additional implementation work

validation overheads

decreased efficiency of the product due to the need for redundant code to handle exceptions


The prize of reliability

Unreliable software isn’t used

Unreliable systems are hard to improve

System failure costs may be very high (e.g. the Westpac disaster)

Costs of loss of data may be very high

Inefficiency is predictable and can be worked around


Good enough software

A very old concept, recently promulgated in the software industry

The reliability and quality of software should be as low as possible without stopping your customers from purchasing the software

First mover benefits overpower any advantage from increased reliability

Many business software organisations utilise the idea

Not an idea one wants to see move into the safety and mission critical systems field


Measuring reliability

Most of the techniques are derived from hardware reliability metrics

The problem is that hardware is far more likely to fail due to wear than due to design and implementation defects

Software doesn’t wear out; its failures come from design and implementation defects

Still worthwhile to consider the techniques derived from hardware reliability


Reliability Acronyms

MTBF - Mean Time Between Failures

MTTF - Mean Time To Failure

MTTR - Mean Time To Repair

MTBF = MTTF + MTTR

Many people consider MTBF to be far more useful than measuring fault rate per LOC

Availability = MTTF / (MTTF + MTTR) x 100%

Very important in any continuously running system
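A minimal sketch of the two relationships on this slide, MTBF = MTTF + MTTR and Availability = MTTF / (MTTF + MTTR) x 100%. The figures used (998 hours to failure, 2 hours to repair) are made up for illustration and are not from any real system.

```python
def mtbf(mttf: float, mttr: float) -> float:
    """Mean Time Between Failures = Mean Time To Failure + Mean Time To Repair."""
    return mttf + mttr


def availability_percent(mttf: float, mttr: float) -> float:
    """Percentage of time the system is up: MTTF / (MTTF + MTTR) x 100."""
    if mttf < 0 or mttr < 0 or mttf + mttr == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf / (mttf + mttr) * 100


# Hypothetical system: runs 998 hours on average, then takes 2 hours to repair.
example_mtbf = mtbf(998, 2)
example_availability = availability_percent(998, 2)
print(f"MTBF = {example_mtbf} hours, availability = {example_availability:.1f}%")
```

Both measures need the same time unit for MTTF and MTTR; hours are used here only as an example.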


Other reliability metrics

MTBF, while better than fault rates, still has problems

Many software failures are transient and recoverable and therefore MTBF is not really a good measure of the reliability

Need measures which handle whether a software system will be available to meet a demand

We may need to use different measures for different parts of the system; often there is no one best measure of reliability


Other reliability metrics (2)

POFOD - Probability Of Failure On Demand

Measure of the likelihood that the system will fail when a service request is made

A POFOD of 0.001 means that, on average, 1 out of every 1000 service requests will fail

ROCOF - Rate of Occurrence Of Failure

Measure of the frequency with which unexpected behaviour is likely to occur

A ROCOF of 2/100 means 2 failures are likely to occur in every 100 operational time units

Also called failure intensity
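Both metrics are simple ratios over observed behaviour. The sketch below computes them from counts and reproduces the two figures quoted on the slide; the function names are illustrative, not a standard API.

```python
def pofod(failed_requests: int, total_requests: int) -> float:
    """Probability Of Failure On Demand: failed requests / all service requests."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return failed_requests / total_requests


def rocof(failures: int, operational_time_units: float) -> float:
    """Rate of Occurrence Of Failure: failures per operational time unit."""
    if operational_time_units <= 0:
        raise ValueError("operational_time_units must be positive")
    return failures / operational_time_units


print(pofod(1, 1000))  # 1 failed request in 1000 demands
print(rocof(2, 100))   # 2 failures in 100 operational time units
```

POFOD suits systems where services are requested occasionally, while ROCOF suits systems that process requests continuously; the time unit for ROCOF has to be stated alongside the number.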


Reliability measurements

Number of system failures for a given number of inputs

Time between system failures

Number of transactions between failures

Time to restart after failure

Time may be measured as raw execution time, calendar time, or number of transactions


Reliability Specification

Need to be able to express reliability requirements in a quantifiable and verifiable manner

Specifications such as the following are irrelevant:

The software shall be as reliable as possible

The software shall exhibit no more than N faults per 1000 lines

Reliability is dynamic and therefore can’t be expressed in terms of source code

We can never know if all the faults have been removed from source code


Establishing a reliability specification

For each identified sub-system:

identify the different types of system failure

analyse the consequences of the failure

Partition the failures into different classes

For each failure class identified:

define a reliability metric which is appropriate

it is not necessary to use the same metric for different classes of failure

Realise that some reliability specifications are unable to be validated, e.g. a reliability specification which says that over the lifetime of the system an event will never occur


Examples of a reliability specification for an ATM

Failure class: Permanent, non-corrupting
Example: System fails to operate for any card input. System must be restarted to fix
Reliability metric: ROCOF, 1 occurrence/1000 days

Failure class: Transient, non-corrupting
Example: The magnetic stripe data cannot be read on an undamaged card which is input
Reliability metric: POFOD, 1 in 1000 transactions

Failure class: Transient, corrupting
Example: A pattern of transactions across the network causes database corruption
Reliability metric: Unquantifiable! Should never happen in the lifetime of the system


Statistical Testing

A software testing process used to test the reliability of software rather than discover the faults

Determine the operational profile of the system, i.e. the probable pattern of usage of the system

Select or generate a set of test data corresponding to the operational profile

Apply the test cases to the program, recording the amount of execution time between failures, using appropriate time units

After a statistically significant number of failures have been observed, the software reliability can be computed
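The steps above can be sketched in a few lines. Everything specific here is an assumption made for illustration: the operational profile and its probabilities are invented, `system_under_test` is a stand-in for the real program with one seeded fault, and time is measured in number of transactions rather than execution time.

```python
import random

# Hypothetical operational profile: input class -> probability of occurrence in use.
OPERATIONAL_PROFILE = {"withdraw": 0.70, "balance": 0.25, "transfer": 0.05}


def system_under_test(request: str) -> bool:
    """Stand-in for the real program; returns True when the request succeeds.

    Assumed for illustration: a fault that only "transfer" requests trigger.
    """
    return request != "transfer"


def generate_tests(profile, n, rng):
    """Draw n test inputs with the same relative frequencies as the profile."""
    return rng.choices(list(profile), weights=list(profile.values()), k=n)


def transactions_between_failures(tests, run):
    """Run every test and record how many transactions passed between failures."""
    gaps, since_last_failure = [], 0
    for request in tests:
        since_last_failure += 1
        if not run(request):
            gaps.append(since_last_failure)
            since_last_failure = 0
    return gaps


rng = random.Random(42)  # fixed seed so the run is repeatable
tests = generate_tests(OPERATIONAL_PROFILE, 10_000, rng)
gaps = transactions_between_failures(tests, system_under_test)
mean_gap = sum(gaps) / len(gaps)
print(f"{len(gaps)} failures, mean transactions between failures = {mean_gap:.1f}")
```

Because the seeded fault is hit by about 5% of requests in this made-up profile, the mean gap comes out near 20 transactions. A different profile would give a different figure for the same program, which is why reliability is only defined relative to an operational context.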


Difficulties of Statistical Testing

Operational Profile uncertainty

High costs of generating the operational profile

Statistical uncertainty when high reliability is specified

Very hard to generate a valid operational profile for new systems which don’t correspond to an existing system

Reliability measurements are unreliable

Still a very valuable tool in specifying and measuring the reliability of a system
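The statistical uncertainty at high reliability can be made concrete. Assuming tests are independent and drawn from the operational profile, and that no failures are observed, a system whose true POFOD is p passes n tests in a row with probability (1 - p)^n. To claim "POFOD is at most p" with confidence C, n must be large enough that (1 - p)^n <= 1 - C. This is a standard textbook argument rather than something stated on the slides, and the function name is illustrative.

```python
import math


def failure_free_tests_needed(target_pofod: float, confidence: float) -> int:
    """Smallest n with (1 - target_pofod) ** n <= 1 - confidence.

    Assumes independent tests drawn from the operational profile
    and zero observed failures.
    """
    if not (0 < target_pofod < 1 and 0 < confidence < 1):
        raise ValueError("target_pofod and confidence must lie strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - target_pofod))


for p in (1e-3, 1e-6, 1e-9):
    n = failure_free_tests_needed(p, 0.95)
    print(f"POFOD <= {p:g} at 95% confidence needs about {n:.3g} failure-free tests")
```

The required number of tests grows roughly as 3/p at 95% confidence, so each extra factor of ten in specified reliability costs about ten times as much testing.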


Achieving a reliable system

Three basic strategies to achieve reliability

Fault Avoidance - Build fault-free systems from the start

Fault Tolerance - Build facilities into the system to let the system continue when faults cause system failures

Fault Detection - Use software validation techniques to discover faults prior to the system being put into operation

For most systems, fault avoidance and fault detection suffice to provide the required level of reliability


Implementing Fault Avoidance

Availability of a formal and unambiguous system specification

Adoption of a quality philosophy by developers. Developers should be expected to write bug-free programs

Adoption of information hiding and encapsulation

Production of readable programs/specifications

Use of a strongly-typed language


Implementing Fault Avoidance

Restrictions on use of error-prone constructs, e.g.

pointers

floating point numbers

dynamic memory allocation

recursion

parallelism

interrupts


Implementing Fault Tolerance

Even if somehow we build a fault-free system, we still need fault-tolerance in critical systems

Fault-free does not mean failure-free

Fault-free means that the system correctly meets its specifications

Specifications may be incomplete or faulty or unaware of a requirement of the environment

Can never conclusively prove that a system is fault-free


Aspects of Fault Tolerance

Failure Detection - System must be able to detect that the current state of the system has caused a failure or will cause a failure

Damage Assessment - System must detect what damage the system failure has caused

Fault Recovery - System must change the state of the system to a known “safe” state

Can correct the damaged state (forward error recovery - harder)

Can restore to a previous known “safe” state (backwards error recovery - easier)

Fault Repair - Modifying the system so that the failure does not recur

Many software failures are transient and need no repair, and normal processing can resume after fault recovery
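A minimal sketch of failure detection followed by backwards error recovery: the state is checkpointed before an operation, an invariant check detects the failure, and the checkpoint is restored. The `Account` class, its fields and the invariant (balance never negative) are invented for illustration.

```python
import copy


class Account:
    """Toy component whose state must satisfy the invariant balance >= 0."""

    def __init__(self, balance):
        self.state = {"balance": balance, "log": []}

    def checkpoint(self):
        """Save a copy of the current, known safe state."""
        return copy.deepcopy(self.state)

    def restore(self, snapshot):
        """Backwards error recovery: return to a previously saved safe state."""
        self.state = copy.deepcopy(snapshot)


def withdraw(account, amount):
    """Attempt a withdrawal; roll back and return False if a failure is detected."""
    snapshot = account.checkpoint()
    account.state["balance"] -= amount
    account.state["log"].append(("withdraw", amount))
    if account.state["balance"] < 0:   # failure detection via a state invariant
        account.restore(snapshot)      # discard the damaged state
        return False
    return True


account = Account(100)
print(withdraw(account, 30), account.state["balance"])   # succeeds
print(withdraw(account, 500), account.state["balance"])  # fails and is rolled back
```

Restoring the whole snapshot avoids damage assessment altogether, which is one reason backwards recovery is the easier option; forward recovery would have to work out which parts of the state were damaged and correct only those.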


Implementing Fault Tolerance

Hardware - Triple-Modular Redundancy (TMR)

Hardware unit is replicated three (or more) times

Output is compared from three units

If one unit fails, its output is ignored

Space Shuttle is a classic example

[Figure: Machine 1, Machine 2 and Machine 3 each feed their output into an Output Comparator]
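The output comparator in TMR is essentially a majority voter. The sketch below shows the voting logic in software form, assuming each unit's output is a value that can be compared for equality; a real TMR comparator is a hardware component, so this only illustrates the idea.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of units.

    Minority outputs are ignored. Raises RuntimeError when no value
    has a strict majority, since the comparator then cannot decide.
    """
    if not outputs:
        raise ValueError("at least one output is required")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: units disagree")
    return value


# Machine 3 has failed and produces a wrong value; it is outvoted.
print(majority_vote([42, 42, 41]))  # prints 42
```

With three units the voter masks any single failure, but two units failing in the same way would outvote the correct one.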


Implementing Fault Tolerance (2)

Using Software

N-Version programming

Have multiple teams build different versions of the software and then execute them in parallel

Assumes teams are unlikely to make the same mistakes

Not necessarily a valid assumption, if teams all work from the same specification

Recovery Blocks

Each program component includes a test to check if the component has executed successfully

Has alternative code to back-up and repeat the operation if it fails

Similar to assertions and exceptions

Both assume that the specification is correct


N-Version Programming

[Figure: Version 1, Version 2 and Version 3 each feed their output into an Output Comparator]


Recovery Blocks

[Figure: Try Algorithm 1 and test for success with the Acceptance Test. If the acceptance test fails, retry with Algorithm 2, then Algorithm 3, retesting each time. Continue execution if the acceptance test succeeds. Signal an exception if all algorithms fail.]
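A minimal sketch of the recovery block scheme: alternates are tried in order, each result is checked by the acceptance test, and an exception is signalled if every alternate fails. The square-root example, the seeded fault in `faulty_sqrt` and all function names are assumptions made for illustration.

```python
import math


def recovery_block(alternates, acceptance_test, x):
    """Try each alternate in turn; return the first result that passes the test."""
    for algorithm in alternates:
        try:
            result = algorithm(x)
        except Exception:
            continue  # a crash counts as a failed attempt; retry with the next alternate
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def faulty_sqrt(x):
    """Hypothetical primary algorithm with a seeded fault (only right for 0 and 4)."""
    return x / 2


def newton_sqrt(x):
    """Alternate algorithm: Newton's method, assuming x is non-negative."""
    guess = x or 1.0
    for _ in range(50):
        guess = (guess + x / guess) / 2
    return guess


def acceptable(x, result):
    """Acceptance test: squaring the result must give back the input."""
    return math.isclose(result * result, x, rel_tol=1e-9, abs_tol=1e-12)


result = recovery_block([faulty_sqrt, newton_sqrt, math.sqrt], acceptable, 9.0)
print(result)
```

The alternates here are pure functions, so there is no state to restore before a retry; a component that modifies state would need to back up to a saved state before trying the next alternate. The scheme is also only as good as its acceptance test, which is derived from the specification.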
