CSE3308/CSC3080 - Software Engineering: Analysis and Design Lecture 7B.1
Software Engineering: Analysis and Design - CSE3308
Reliability
CSE3308/CSC3080/DMS/2000/17
Monash University - School of Computer Science and Software Engineering
Lecture Outline
What is reliability?
Failures and Faults
Why is reliability desirable?
Good Enough Software
Measuring reliability
Specifying reliability
Achieving a reliable system
What is reliability?
A formal definition: the probability of failure-free operation of a computer program in a specified environment for a specified time
An informal definition: how well the system users think it provides the required services
For a system to be reliable both the informal and the formal definitions must be satisfactorily met
e.g. an aeroplane navigation system may have a very low probability of failure, but even one failure may make it unreliable in the view of the pilot and the passenger
Aspects of reliability
Reliability cannot be defined in an absolute manner
Reliability can only be defined in relationship to a particular operational context
The relationship between the faults in a software product and the reliability of such a product is very complex
To properly consider the reliability of a piece of software, the impact of a fault must be assessed
Faults and Failures
A fault is a static software characteristic which causes a failure to occur
A failure corresponds to unexpected run-time behaviour observed by the user of the system
Faults don’t necessarily cause failures
If a user doesn’t notice a failure, is it a failure?
Faults and Failures (2)
Reliability is related to the probability that a fault will cause a failure while in operational use
One study found that removing 60% of the faults in a product increased reliability by only 3%
Many faults will only cause failures after hundreds or thousands of months of use
This is not necessarily something which can be safely ignored, though
It was feared that Y2K faults might cause catastrophic failures after many years of reliable operation
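One way to see why removing most faults can barely move reliability is to weight each fault by how often operational use actually triggers it. The sketch below is illustrative only: the per-fault trigger probabilities are invented (they are not data from the study mentioned above), and faults are assumed to be triggered independently.

```python
from math import prod

def failure_free_probability(trigger_probs):
    """Probability that a single run triggers none of the faults.

    Assumes each fault is triggered independently of the others.
    """
    return prod(1.0 - p for p in trigger_probs)

# Hypothetical fault population: two faults on common paths, eight on rare paths.
common_faults = [0.01, 0.005]   # often triggered in operational use
rare_faults = [1e-6] * 8        # almost never triggered

before = failure_free_probability(common_faults + rare_faults)

# Remove 6 of the 10 faults (60%), all of them rarely triggered ones.
after = failure_free_probability(common_faults + rare_faults[:2])

print(f"before: {before:.6f}  after: {after:.6f}")
```

Under these assumptions the two commonly triggered faults dominate, so removing six of the ten faults changes the failure-free probability by only a few parts per million.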
Types of failures
Transient - Occurs only with certain inputs
Permanent - Occurs with all inputs
Recoverable - System can recover without operator intervention
Unrecoverable - Operator intervention needed to recover from failure
Non-Corrupting - Failure does not corrupt system state or data
Corrupting - Failure corrupts system state or data
Why is reliability desirable?
Reliability is only one of many desirable system characteristics
Ensuring reliability can be very expensive
Example - Bell Laboratories reported that it took 8 years to move software availability on one system from 99.9% to 99.98%
Reliability often conflicts with other system characteristics such as efficiency
The penalties of reliability
Increases costs by:
redundant hardware
additional design
additional implementation work
validation overheads
decreased efficiency of the product due to the need for redundant code to handle exceptions
The prize of reliability
Unreliable software isn’t used
Unreliable systems are hard to improve
System failure costs may be very high (e.g. the Westpac disaster)
Costs of loss of data may be very high
Inefficiency is predictable and can be worked around
Good enough software
A very old concept, recently promulgated in the software industry
The reliability and quality of software should be as low as possible without stopping your customers from purchasing the software
First mover benefits overpower any advantage from increased reliability
Many business software organisations utilise the idea
Not an idea one wants to see move into the safety and mission critical systems field
Measuring reliability
Most of the techniques are derived from hardware reliability metrics
Problem is that hardware is far more likely to fail due to wear than design and implementation defects
Software doesn’t wear out; its failures come from design and implementation defects
Still worthwhile to consider the techniques derived from hardware reliability
Reliability Acronyms
MTBF - Mean Time Between Failures
MTTF - Mean Time To Failure
MTTR - Mean Time To Repair
$\mathrm{MTBF} = \mathrm{MTTF} + \mathrm{MTTR}$
Many people consider MTBF to be far more useful than measuring fault rate per line of code (LOC)
$\mathrm{Availability} = \dfrac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}} \times 100\%$
Very important in any continuously running system
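The two formulas above can be worked through with a short sketch. The MTTF and MTTR figures are hypothetical, and the downtime helper assumes a 365-day year; it is included only to show what an availability percentage means in hours.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year

def mtbf(mttf_hours, mttr_hours):
    """MTBF = MTTF + MTTR."""
    return mttf_hours + mttr_hours

def availability_percent(mttf_hours, mttr_hours):
    """Availability = MTTF / (MTTF + MTTR) x 100%."""
    return mttf_hours / (mttf_hours + mttr_hours) * 100.0

def downtime_hours_per_year(availability):
    """Expected unavailable hours per year for a given availability (in percent)."""
    return (100.0 - availability) / 100.0 * HOURS_PER_YEAR

# Hypothetical system: fails on average every 999 hours, takes 1 hour to repair.
print(f"MTBF: {mtbf(999.0, 1.0):.1f} h")
print(f"Availability: {availability_percent(999.0, 1.0):.2f}%")

# The two availability levels quoted earlier for the Bell Laboratories example.
for level in (99.9, 99.98):
    print(f"{level}% available -> {downtime_hours_per_year(level):.2f} h down per year")
```

Seen this way, the step from 99.9% to 99.98% reduces expected downtime from roughly 8.8 hours to roughly 1.8 hours per year.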
Other reliability metrics
MTBF, while better than fault rates, still has problems
Many software failures are transient and recoverable and therefore MTBF is not really a good measure of the reliability
Need measures which handle whether a software system will be available to meet a demand
We may need to use different measures for different parts of the system; there is often no single best measure of reliability
Other reliability metrics (2)
POFOD - Probability Of Failure On Demand
Measure of the likelihood that the system will fail when a service request is made
A POFOD of 0.001 means that 1 out of every 1000 service requests will fail
ROCOF - Rate of Occurrence Of Failure
Measure of the frequency with which unexpected behaviour is likely to occur
A ROCOF of 2/100 means 2 failures are likely to occur in each 100 operational time units
Also called failure intensity
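Both metrics are simple ratios over observed counts. The sketch below shows the arithmetic; the function names and the observed counts are made up for illustration.

```python
def pofod(failed_requests, total_requests):
    """Probability Of Failure On Demand: failures per service request."""
    return failed_requests / total_requests

def rocof(failures, operational_time_units):
    """Rate of Occurrence Of Failure: failures per operational time unit."""
    return failures / operational_time_units

# Hypothetical observations.
print(f"POFOD: {pofod(3, 3000):.3f}")   # 3 failed requests out of 3000
print(f"ROCOF: {rocof(2, 100):.2f}")    # 2 failures in 100 time units
```

The time unit for ROCOF has to be chosen to suit the system, for example raw execution time, calendar time or number of transactions.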
Reliability measurements
Number of system failures for a given number of inputs
Time between system failures
Number of transactions between failures
Time to restart after failure
Time may be measured as raw execution time, calendar time or number of transactions
Reliability Specification
Need to be able to express reliability requirements in a quantifiable and verifiable manner
Specifications such as the following are irrelevant:
The software shall be as reliable as possible
The software shall exhibit no more than N faults per 1000 lines
Reliability is dynamic and therefore can’t be expressed in terms of source code
We can never know if all the faults have been removed from source code
Establishing a reliability specification
For each identified sub-system:
identify the different types of system failure
analyse the consequences of the failure
Partition the failures into different classes
For each failure class identified:
define a reliability metric which is appropriate
it is not necessary to use the same metric for different classes of failure
Realise that some reliability metrics are unable to be validated, e.g. a reliability specification which says that over the lifetime of the system an event will never occur
Examples of a reliability specification for an ATM
Failure Class | Example | Reliability Metric
Permanent, non-corrupting | System fails to operate for any card input. System must be restarted to fix | ROCOF: 1 occurrence/1000 days
Transient, non-corrupting | The magnetic stripe data cannot be read on an undamaged card which is input | POFOD: 1 in 1000 transactions
Transient, corrupting | A pattern of transactions across the network causes database corruption | Unquantifiable! Should never happen in the lifetime of the system
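A specification like the ATM table can be held as data and compared against measurements. The sketch below is one possible encoding, not a standard format: the class name, the field names and the measured values are all hypothetical. The unquantifiable requirement is represented by a missing limit, and checking it returns None to signal that it cannot be validated.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ReliabilityRequirement:
    failure_class: str
    metric: str              # "ROCOF", "POFOD", or "none"
    limit: Optional[float]   # None marks a requirement that cannot be quantified

# The three rows of the ATM example.
ATM_SPEC = [
    ReliabilityRequirement("permanent, non-corrupting", "ROCOF", 1 / 1000),  # per day
    ReliabilityRequirement("transient, non-corrupting", "POFOD", 1 / 1000),  # per transaction
    ReliabilityRequirement("transient, corrupting", "none", None),
]

def check(requirement, measured):
    """Return True or False if the requirement can be checked, None if it cannot."""
    if requirement.limit is None:
        return None
    return measured <= requirement.limit

# Hypothetical measurements, in the same units as each requirement.
measurements = [0.0005, 0.002, 0.0]
for requirement, measured in zip(ATM_SPEC, measurements):
    print(requirement.failure_class, "->", check(requirement, measured))
```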
Statistical Testing
A software testing process used to test the reliability of software rather than discover the faults
Determine the operational profile of the system, i.e. the probable pattern of usage of the system
Select or generate a set of test data corresponding to the operational profile
Apply the test cases to the program, recording the amount of execution time between failures, using appropriate time units
After a statistically significant number of failures have been observed, the software reliability can be computed
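The steps above can be sketched as a small simulation. Everything specific here is an assumption made for illustration: the operational profile, the stand-in system with its invented failure rates, and the choice of "one test case" as the time unit.

```python
import random

# Hypothetical operational profile: operation -> probability of use.
OPERATIONAL_PROFILE = {"withdraw": 0.6, "balance": 0.3, "transfer": 0.1}

def system_under_test(operation, rng):
    """Stand-in for the real system; returns True when the operation succeeds.

    The per-operation failure rates are invented for this sketch.
    """
    failure_rate = {"withdraw": 0.001, "balance": 0.0, "transfer": 0.02}[operation]
    return rng.random() >= failure_rate

def run_statistical_test(profile, runs, seed=0):
    """Draw test cases according to the profile and record the gaps between failures."""
    rng = random.Random(seed)
    operations = list(profile)
    weights = [profile[op] for op in operations]
    gaps = []          # number of test cases executed between consecutive failures
    since_last = 0
    for _ in range(runs):
        operation = rng.choices(operations, weights=weights, k=1)[0]
        since_last += 1
        if not system_under_test(operation, rng):
            gaps.append(since_last)
            since_last = 0
    return gaps

RUNS = 100_000
gaps = run_statistical_test(OPERATIONAL_PROFILE, RUNS)
failures = len(gaps)
estimated_pofod = failures / RUNS
mean_gap = sum(gaps) / failures if failures else float("inf")
print(f"failures: {failures}, POFOD estimate: {estimated_pofod:.4f}, "
      f"mean test cases between failures: {mean_gap:.0f}")
```

With these invented rates the expected failure probability per test case is 0.6 x 0.001 + 0.1 x 0.02 = 0.0026, so the estimate should land near that value. The rarely used "transfer" operation contributes most of the failures, which shows how strongly the result depends on the operational profile.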
Difficulties of Statistical Testing
Operational profile uncertainty
High costs of generating the operational profile
Statistical uncertainty when high reliability is specified
Very hard to generate a valid operational profile for new systems which don’t correspond to an existing system
Reliability measurements are unreliable
Still a very valuable tool in specifying and measuring the reliability of a system
Achieving a reliable system
Three basic strategies to achieve reliability
Fault Avoidance - Build fault-free systems from the start
Fault Tolerance - Build facilities into the system to let the system continue when faults cause system failures
Fault Detection - Use software validation techniques to discover faults prior to the system being put into operation
For most systems, fault avoidance and fault detection suffice to provide the required level of reliability
Implementing Fault Avoidance
Availability of a formal and unambiguous system specification
Adoption of a quality philosophy by developers. Developers should be expected to write bug-free programs
Adoption of information hiding and encapsulation
Production of readable programs/specifications
Use of a strongly-typed language
Implementing Fault Avoidance
Restrictions on the use of error-prone constructs, e.g. pointers, floating point numbers, dynamic memory allocation, recursion, parallelism, interrupts
Implementing Fault Tolerance
Even if somehow we build a fault-free system, we still need fault-tolerance in critical systems
Fault-free does not mean failure-free
Fault-free means that the system correctly meets its specifications
Specifications may be incomplete or faulty or unaware of a requirement of the environment
Can never conclusively prove that a system is fault-free
Aspects of Fault Tolerance
Failure Detection - System must be able to detect that the current state of the system has caused a failure or will cause a failure
Damage Assessment - System must detect what damage the system failure has caused
Fault Recovery - System must change the state of the system to a known “safe” state
Can correct the damaged state (forward error recovery - harder)
Can restore to a previous known “safe” state (backwards error recovery - easier)
Fault Repair - Modifying the system so that the failure does not recur
Many software failures are transient and need no repair, and normal processing can resume after fault recovery
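Backwards error recovery can be sketched as "checkpoint, try, check, roll back". The example below is a minimal illustration under stated assumptions: the account dictionary, the rule that a negative balance is an invalid state, and all function names are hypothetical.

```python
import copy

class InvalidState(Exception):
    """Raised when failure detection finds the state to be invalid."""

def check_state(account):
    # Failure detection: in this sketch a negative balance is the invalid state.
    if account["balance"] < 0:
        raise InvalidState("balance is negative")

def apply_with_rollback(account, operation):
    """Apply operation to account; restore the checkpoint if the state becomes invalid."""
    checkpoint = copy.deepcopy(account)   # save a known "safe" state
    try:
        operation(account)
        check_state(account)
        return True
    except InvalidState:
        # Backwards error recovery: discard the damaged state, restore the checkpoint.
        account.clear()
        account.update(checkpoint)
        return False

def withdraw(amount):
    def operation(account):
        account["balance"] -= amount
        account["history"].append(-amount)
    return operation

account = {"balance": 100, "history": []}
print(apply_with_rollback(account, withdraw(150)), account)
print(apply_with_rollback(account, withdraw(30)), account)
```

Forward error recovery would instead try to repair the damaged state in place, which needs knowledge of exactly what was damaged; that is why it is labelled the harder option.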
Implementing Fault Tolerance
Hardware - Triple-Modular Redundancy (TMR)
Hardware unit is replicated three (or more) times
Output is compared from three units
If one unit fails, its output is ignored
Space Shuttle is a classic example
[Figure: Machine 1, Machine 2 and Machine 3 each feed their output to an Output Comparator]
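The output comparator can be modelled in software as a majority voter. This is a sketch of the voting logic only, with a hypothetical function name; a real TMR comparator is a hardware component.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of units, else raise."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: the units disagree")
    return value

# Two units agree and the third has failed, so its output is ignored.
print(majority_vote([42, 42, 41]))  # 42
```

With three units the voter masks a single failed unit; if all three disagree there is no majority and the failure has to be signalled instead.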
Implementing Fault Tolerance (2)
Using Software
N-Version programming
Have multiple teams build different versions of the software and then execute them in parallel
Assumes teams are unlikely to make the same mistakes
Not necessarily a valid assumption, if teams all work from the same specification
Recovery Blocks
Each program component includes a test to check if the component has executed successfully
Has alternative code to back-up and repeat the operation if it fails
Similar to assertions and exceptions
Both assume that the specification is correct
N-Version Programming
[Figure: Version 1, Version 2 and Version 3 each feed their output to an Output Comparator]
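The weakness of N-version programming, that independent teams can still make the same mistake, can be shown with a small sketch. The three leap-year functions are hypothetical "team" implementations invented for this example, and they are called one after another here rather than executed in parallel.

```python
def leap_v1(year):
    # Team 1: implements the full Gregorian rule.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def leap_v2(year):
    # Team 2: overlooks the exception for years divisible by 400.
    return year % 4 == 0 and year % 100 != 0

def leap_v3(year):
    # Team 3: makes the same oversight, written differently.
    if year % 100 == 0:
        return False
    return year % 4 == 0

def n_version(year, versions=(leap_v1, leap_v2, leap_v3)):
    """Run every version and return the answer that most versions agree on."""
    votes = [version(year) for version in versions]
    return max(set(votes), key=votes.count)

print(n_version(2024))  # True: all three versions agree
print(n_version(2000))  # False: the two faulty versions outvote the correct one
```

For the year 2000 the comparator confidently returns the wrong answer, because two of the three versions share the same fault.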
Recovery Blocks
[Figure: Algorithm 1, Algorithm 2 and Algorithm 3 feed an Acceptance Test. Try Algorithm 1 and test for success. If the acceptance test fails, retry with the next algorithm and retest. Continue execution if the acceptance test succeeds. Signal an exception if all algorithms fail]
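The recovery block scheme can be sketched as a loop over alternative algorithms guarded by an acceptance test. The function names and the deliberately faulty primary algorithm are hypothetical, chosen only to exercise the retry path.

```python
from collections import Counter

def recovery_block(alternatives, acceptance_test, data):
    """Try each alternative in turn until one passes the acceptance test."""
    for algorithm in alternatives:
        candidate_input = list(data)   # every attempt starts from the original input
        try:
            result = algorithm(candidate_input)
        except Exception:
            continue                   # a crash counts as a failed attempt in this sketch
        if acceptance_test(data, result):
            return result              # acceptance test succeeded: continue execution
    raise RuntimeError("all alternatives failed the acceptance test")

def faulty_sort(items):
    # Hypothetical primary algorithm containing a fault: it drops the last element.
    return sorted(items[:-1])

def backup_sort(items):
    # Hypothetical alternative algorithm.
    return sorted(items)

def is_sorted_permutation(original, result):
    """Acceptance test: result is in order and holds exactly the original items."""
    in_order = all(a <= b for a, b in zip(result, result[1:]))
    return in_order and Counter(original) == Counter(result)

print(recovery_block([faulty_sort, backup_sort], is_sorted_permutation, [3, 1, 2]))
```

The quality of the scheme rests on the acceptance test: a weak test would let the faulty result through, and a test derived from an incorrect specification would accept incorrect behaviour.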