03 - reliability software

Reliability Glen Dobson [email protected] http://www.comp.lancs.ac.uk/~dobsong/teaching/dependability

Upload: aparkenthon

Post on 03-Jun-2018

Page 1: 03 - Reliability Software

8/12/2019 03 - Reliability Software

http://slidepdf.com/reader/full/03-reliability-software 1/56

Reliability

Glen Dobson

[email protected]

http://www.comp.lancs.ac.uk/~dobsong/teaching/dependability

Recapping

• Overview of dependability

“the property of a system such that we can justifiably place our reliance on the service it delivers”

• Key dependability attributes

Reliability, Availability, Safety, Security

• Relationship between attributes

Effect of primary attributes on each other, as well as the effect of auxiliary attributes

• Criticality and conflict

Different attributes are critical to different systems

Improving one attribute may be detrimental to another

• Dependability requirements

Use measurable criteria

High dependability => Hard (& expensive) to test

• Availability

“Readiness for correct service”

Problems – inherent vs. operational availability, what availability does not tell you

Overview

• Definition of reliability

• Reliability metrics

• System failure

• Preventing failure

• Testing for failures

• Group Discussion

Definition

• Laprie

Reliability is the continuity of correct service

(a service is correct when it implements the system function)

• More pragmatically

In a given time period, for a given usage pattern, how likely is a system to fail?

Failure = deviating from the system specification

• For some systems

Failure may mean deviating from expectations

Assessment qualifiers

• Assessment of reliability depends on:

Intended system usage

Intended operational profile

Context and environment of use

Time and period of use

Load and intensity of use

• Reliability is a function of these factors

• If any change, we must reassess

Reliability measures

• POFOD - Probability of failure on demand

• ROCOF - Rate of occurrence of failure

• MTTF - Mean time to failure

• Each is suitable for different systems

• Suitable time units should be chosen

Physical or logical
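
As a rough illustration of the three measures, the sketch below computes each one from a small observation record. The `FailureLog` structure and its field names are assumptions made for this example, and MTTF is taken here as the mean gap between successive failures with repair time ignored; a real assessment would fix these definitions and the time unit for the system in question.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailureLog:
    """Hypothetical observation record for one system (not from the slides)."""
    demands: int                # number of service requests made
    failed_demands: int         # requests that ended in failure
    operating_time: float       # total observed time, in the chosen time unit
    failure_times: List[float]  # times at which failures were observed


def pofod(log: FailureLog) -> float:
    """Probability of failure on demand: failed demands / all demands."""
    return log.failed_demands / log.demands


def rocof(log: FailureLog) -> float:
    """Rate of occurrence of failure: failures per unit of operating time."""
    return len(log.failure_times) / log.operating_time


def mttf(log: FailureLog) -> float:
    """Mean time to failure: average gap between failures, starting at time 0.

    Assumption: repair time is ignored, so each gap runs from the previous
    failure straight to the next one.
    """
    previous = 0.0
    gaps = []
    for t in sorted(log.failure_times):
        gaps.append(t - previous)
        previous = t
    return sum(gaps) / len(gaps)
```

POFOD suits systems that are used on request (the time unit is a demand), whereas ROCOF and MTTF suit systems in continuous use.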

Suitable time units???

• Determine suitable time units for:

ATM cash withdrawal - Transaction

Editing with a word processor - Minute

Web server providing pages - Request

General practitioner diagnosis - Patient illness

Nuclear reactor core shutdown - Alert

Bath tub curve

[Figure: bath tub curve, plotting Failures against Time]

Effects:

• Hardware (degradation)

• Software (evolution)

• People (mental faculties)

Software & the bath tub…

• The bath tub is a sketch of hardware reliability

• Tends not to apply so well to software

Burn in period is similar

Upgrades cause a sudden decrease in reliability

Usually followed by another burn in period

Ideally the upgrade/burn in effect decreases over time

Once the software is no longer upgraded then the reliability becomes constant

• Our bath tub will certainly have a bumpy bottom

Software Reliability Curve

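[Figure: software reliability curve, plotting Failures against Time, with releases v1.0, v2.0 and v3.0 marked between initial development and the point where the software is no longer actively maintained]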

But these are only sketches

• Systems are made up of software, hardware…

• …and what about people?

What does their failure curve look like?

Burn in period = training/familiarisation

May forget/pick up bad habits over time

Organisational changes will have an effect

• e.g. high workload/stress will affect mental capacity

Personnel changes will have an effect

• So the bath tub is likely to have bumps all over the place.

Failure manifestation

Fault → Error → Failure

• Fault

The adjudged or hypothesised cause of an error. Typically a mistake or lack in the preparation of a component.

• Error

The initial deviation in system state which eventually leads to failure. This is usually unintended/unexpected behaviour.

• Failure

A deviation from correct service (i.e. from the specification or system function)

Examples

• Fault

Programming mistake

Poor training

Joined pins on chip

• Error

Incorrect floating point calculation

Misfiling of documents

No actuator signal

• Failure

Mis-navigation

Treatment not given to patient

Burglar alarm not sounding

Fault classification

• Can classify along many axes:

Phase of creation/occurrence

Development Faults/Operation Faults

System Boundaries

Internal Faults/External Faults

Phenomenological Causes

Natural Faults/Human-made Faults

Dimension

Hardware Faults/Software Faults

Objective

Malicious Faults/Non-Malicious Faults

Fault classification (2)

• Can classify along many axes:

Intent

Deliberate Faults/Non-deliberate Faults

Capability

Accidental Faults/Incompetence Faults

Persistence

Permanent Faults/Transient Faults

Fault-Error-Failure

• A common source of confusion results from perspective on “system”, e.g.

mental human fault ⇒ programming error ⇒ software fault ⇒ software error ⇒ system failure

The general case

[Figure: chained Fault → Error → Failure sequences, in which a Failure at one stage is followed by a Fault at the next]

Latencies

[Figure: Fault → Error → Failure, with the Fault latency and Failure latency intervals marked]

• Faults may go undetected for a long time

• Faults may be dormant (i.e. never lead to an error) or active

• Internal errors may never reach the system’s external state
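
A small sketch of these ideas, using made-up functions that are not from the lecture: `average` contains a fault that stays dormant until an empty list arrives, and `report` shows an internal error being contained so that it never reaches the external state.

```python
def average(values):
    # Fault: there is no guard against an empty list. The fault is dormant
    # for every non-empty input and only becomes active when values is empty.
    return sum(values) / len(values)


def report(readings):
    """Return the text shown to the user (the system's external state)."""
    try:
        return f"mean={average(readings):.1f}"
    except ZeroDivisionError:
        # The error is detected and contained here, so it does not
        # propagate to the user as a failure.
        return "mean=n/a"
```

Calling `average([])` directly would turn the error into a failure (an unhandled exception); going through `report` does not.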

Automated statistical testing

• Capture of operational profiles

•  Auto generation of minimal test sets

• Used to assess the reliability of system

• Still unrealistic for VHR systems
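
A minimal sketch of statistical testing, assuming an invented operational profile and a stand-in system with a seeded fault (none of these names or figures come from the lecture). Test inputs are drawn in proportion to expected real use, and the observed failure fraction is used as an estimate of POFOD.

```python
import random

# Hypothetical operational profile: operation -> share of real-world use.
PROFILE = {"withdraw": 0.70, "balance": 0.25, "transfer": 0.05}


def system_under_test(operation, amount):
    """Stand-in system with a seeded fault: large transfers fail."""
    if operation == "transfer" and amount > 900:
        return False  # failure
    return True       # correct service


def estimate_pofod(runs, seed=0):
    """Estimate probability of failure on demand from profile-driven tests."""
    rng = random.Random(seed)  # fixed seed keeps the test set repeatable
    operations = list(PROFILE)
    weights = list(PROFILE.values())
    failures = 0
    for _ in range(runs):
        operation = rng.choices(operations, weights=weights)[0]
        amount = rng.randint(1, 1000)
        if not system_under_test(operation, amount):
            failures += 1
    return failures / runs
```

The estimate is only as good as the profile and the number of runs. For very high reliability targets the required number of failure-free runs becomes impractically large, which is one reason the approach is described as unrealistic for VHR systems.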

Failure prevention

[Figure: Fault → Error → Failure chain, annotated with Fault avoidance, Fault removal and Fault tolerance]

It is better to avoid faults than tolerate them (prevention is better than the cure)

We might not regain our balance!

Fault avoidance - Use tarmac instead

Fault removal - Council fix slab, walk around slab

Fault tolerance - Trip, but regain balance

Fault avoidance

• Prevent inclusion of new faults

Managed development

Formal methods

Quality culture

Managed development

• Mature lifecycle (Certified Process?)

• Management and control of:

Requirements and designs

Evolution

Testing

Configuration

Documentation

• Traceability

• Accountability

• Audit and review

Formal methods

• Specify system using formal language

• Precise vocabulary, syntax and semantics

• Based on maths, set theory, logic etc.

• Spec. can then be processed formally

Benefits of formal methods

• Reduce ambiguity & misunderstanding

• Automatically analyse for:

Consistency

Correctness

Completeness

• Specs can be emulated or simulated

• Verify produced system using proofs

• Prove various properties of system

• Transformation to construct system

Predicate calculus

• Extends propositional calculus

• Much more powerful

• Allows the use of variables

• “for all” (universal) quantifier

• “there exists” (existential) quantifier
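
As an illustration of the two quantifiers, the statements below use invented predicates (`Account`, `balance`) that are not part of the lecture material. The first reads "every account has a non-negative balance"; the second reads "there is at least one account whose balance is zero".

```latex
\forall x \,\bigl(\mathit{Account}(x) \Rightarrow \mathit{balance}(x) \geq 0\bigr)

\exists x \,\bigl(\mathit{Account}(x) \land \mathit{balance}(x) = 0\bigr)
```

Neither statement can be written in propositional calculus alone, because both depend on the variable x ranging over individual accounts.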

Formal method problems

• Time consuming and expensive

• Hard to understand (not “fun”)

• Domain experts stand little chance

• Problems concealed by formality

• Transformation slow and difficult

• Tool support is patchy

• How do we know specification is right?

• No single language suitable for everything

Applying formal methods

• Useful for specific sub-systems

• Useful for specific sub-problems (e.g. safety)

• Cost effective if used appropriately

• Limited use in industry

• Has yet to deliver on a large scale

Fault removal

• Detect and remove existing faults

Testing and debugging

Reviews/inspections

Static analysis

Testing

•  Alpha/Beta - Acceptance/real operational use

• Black/White box - opaque/transparent components

• Functional/Structural - (as above)

• Defect/Statistical - explicit search/normal usage

• Unit/integration - component/whole system

• Regression - repeat test set after each repair 

• Stress - push upper bounds, try to break system

Reviews and inspections

• Focus on artifacts produced

• No operational system required

• Expert judgement - cross discipline

• Examine & critique produced artifacts

• Requires knowledge of artifacts and domain

• Often cheaper than testing

• Not all problems are identified

• Used to assess non-testable attributes

Fault tolerance

[Figure: Fault → Error → Failure chain, annotated with Fault avoidance, Fault removal and Fault tolerance]

Assertions

• Run time checks

• Performed periodically

• Ensures system in a safe state

•  Are we safe before we continue?

• Manually coded or auto-generated
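
A hand-coded example of such a check, using a made-up `Tank` controller (the class, its capacity and its method names are assumptions for illustration only). The safe-state condition is tested before and after each state change.

```python
class Tank:
    """Hypothetical controller state for a tank with a fixed capacity."""

    CAPACITY = 100.0

    def __init__(self):
        self.level = 0.0

    def check_safe(self):
        # Run-time check of the safe-state condition. It raises explicitly
        # rather than using the `assert` statement, because Python removes
        # `assert` statements when run with the -O option.
        if not (0.0 <= self.level <= self.CAPACITY):
            raise AssertionError(f"unsafe level: {self.level}")

    def fill(self, amount):
        self.check_safe()    # are we safe before we continue?
        self.level += amount
        self.check_safe()    # did this step leave us in a safe state?
```

A real system would usually do more than raise an error on a failed check, for example by moving to a known safe state.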

Failure likelihood

•  All modules failing together unlikely

• Provided modules are independent !!!

• Scaled up to N-version systems

•  Automatic module repair/replacement
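
The arithmetic behind this claim can be sketched as follows, assuming every module fails with the same probability p and that failures are fully independent, which is exactly the proviso above. The majority-vote function is an added assumption about how module outputs might be combined; the slides do not specify a voting scheme.

```python
from math import comb


def all_fail(p, n):
    """Probability that all n independent modules fail together."""
    return p ** n


def majority_fail(p, n):
    """Probability that more than half of n independent modules fail.

    Assumes a majority voter: the system fails once a majority of
    modules have failed (binomial distribution).
    """
    needed = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(needed, n + 1))
```

With p = 0.01 and three modules, `all_fail` gives 0.000001 and `majority_fail` gives 0.000298. If the modules share a fault, the failures are correlated and these figures are optimistic.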

Recovery Blocks

• Redundant components used in series

•  Acceptance test used to assess results

• If one component fails, try the next

• Roll-back state before retry

• Try until success or no more left
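
The steps above can be sketched roughly as below. The function names, the dictionary used for state and the sorting example are all assumptions chosen for illustration, not the lecture's own design.

```python
import copy


def recovery_block(state, alternates, acceptance_test):
    """Try each alternate in turn; return the first result that is accepted."""
    for alternate in alternates:
        checkpoint = copy.deepcopy(state)  # saved so that we can roll back
        try:
            result = alternate(state)
            if acceptance_test(state, result):
                return result
        except Exception:
            pass  # a crash counts as a failed alternate
        state.clear()
        state.update(checkpoint)           # roll back before the next try
    raise RuntimeError("all alternates failed the acceptance test")


def faulty_sort(state):
    """Primary component with a seeded fault: it drops one element."""
    state["calls"] = state.get("calls", 0) + 1  # side effect to be rolled back
    return sorted(state["data"])[:-1]


def simple_sort(state):
    """Secondary component, used only if the primary is rejected."""
    return sorted(state["data"])


def accept(state, result):
    """Acceptance test: same length as the input and in ascending order."""
    same_length = len(result) == len(state["data"])
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return same_length and ordered


state = {"data": [3, 1, 2]}
result = recovery_block(state, [faulty_sort, simple_sort], accept)
print(result)  # [1, 2, 3]
```

Note that the acceptance test here checks only length and ordering, so it would still accept a result containing the wrong values. Writing a test that is both cheap and thorough is the hard part of this technique.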

Comparing approaches

• Modular redundancy less efficient

• Since all modules MUST be executed

• Recovery blocks good (with no failure)

• But how to write the acceptance test ?

Component diversity

• Diversity essential for these methods

• Both in design and implementation

• Each component should use different:

System specifications

Design paradigms

Programming languages

Development environments

 Algorithms

Backgrounds and cultures

Problems with redundancy

• Duplicate faults can still exist !!!

• People still make the same mistakes

• Hard to think of different ways to work

• Added complexity can hide faults

• Can’t do acceptance test for everything

• What happens if components don’t agree?

• Big efficiency hit (problem for RT systems)

• Can be very expensive (three times the cost)

? Reliability of this course ?

Fault avoidance (almost formal methods ;o)

Fault avoidance

Redundancy, Diversity

Fault removal

Fault removal

Testing

Redundancy

Diversity

Fault removal - review

Fault tolerance

Trying to avoid the failure of this course:

• Used Ian’s book

• Also used alternative sources

• Used spell checker to make slides

• 5th year we have done this course

• Both Mark and Glen are lecturing

• Checked each other’s slides

• Mark (social sci.) Glen (comp sci.)

• Your comments in lectures

Fault injection

•  Assess system or sub-component

• Test harness to assess fault tolerance

•  Artificially create and introduce faults

• Can be performed on:

Simulation of component

Actual component under test load

Actual component in actual use

• Each has its own pros and cons

Fault injection

[Figure: fault injection environment, comprising a Controller, a Fault Injector (with Fault Library), a Workload Generator (with Workload Library), a Data Collector (with Collected Data) and the Target System]

Types of injection

• Compile time injection

• Run time injection

• Interactive injection

• Addition of new

• Alteration of existing

• Removal of old
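
A minimal sketch of run time injection, in which existing values are altered or removed on their way to a component. The sensor, the fault library entries and the tolerance wrapper are all invented for this example; the campaign loop plays the role of a simple controller and data collector.

```python
import random


def read_sensor():
    """Stand-in for the real component (hypothetical): a steady reading."""
    return 20.0


# Hypothetical fault library: each entry alters or removes a reading.
FAULT_LIBRARY = {
    "stuck_at_zero": lambda value: 0.0,            # alteration of existing
    "out_of_range": lambda value: value + 1000.0,  # alteration of existing
    "no_signal": lambda value: None,               # removal
}


def tolerant_read(source):
    """Fault tolerance mechanism being assessed: rejects implausible values."""
    value = source()
    if value is None or not (-50.0 <= value <= 150.0):
        return 20.0  # fall back to a default reading
    return value


def run_campaign(trials, seed=0):
    """Inject randomly chosen faults and count those that were not tolerated."""
    rng = random.Random(seed)
    not_tolerated = {name: 0 for name in FAULT_LIBRARY}
    for _ in range(trials):
        name = rng.choice(list(FAULT_LIBRARY))
        fault = FAULT_LIBRARY[name]
        if tolerant_read(lambda: fault(read_sensor())) != 20.0:
            not_tolerated[name] += 1
    return not_tolerated
```

In this sketch the range check tolerates the out-of-range and missing readings but not the stuck-at-zero fault, since zero is a plausible value. Exposing that kind of gap is what a fault injection campaign is for.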

Problems with fault injection

• Unrealistic operational profiles

• Time consuming for complex systems

• Impractical for VHR systems

• Instruments interfere with operation

• Limited to S/W and H/W components

Or is it ?

Summary

Group discussion

Dave works in an office. He uses a desktop PC with an off-the-shelf OS. He does most of his work using a standard office suite (word processor, spreadsheet, etc.). Dave browses the web and e-mails his friends when he is bored. His dog is called Caruthers.

Dave finds using his computer very unreliable and regularly loses work. What reasons could there be for this unreliability? What could be done to improve reliability?

Consider both social and technical perspectives. Do not limit your thinking to only those topics covered in the lecture material.

Thoughts - solutions

• Hardware checking and repair 

• Install latest s/w updates

• Proactive behaviour by Dave (regular saving)

• Backing up procedures

•  Avoiding unreliable features (e.g. tables)

• Redundancy - Davina replicates work

• Better testing, reviews, walkthroughs etc

• Formal modelling of Dave’s office (NOT)

• Ethno study of Dave’s office (insight ?)

Not cost effective to provide high reliability!

Further questions

• What effect would the publishing of portions of the OS on the web have?

• What effect would a new competitor for the office suite have?

• What effect would the installation of a new version of the OS have?