Software Reliability Engineering
Mark Turner

Posted: 06-May-2015

TRANSCRIPT

Page 1: Software reliability engineering

Software Reliability Engineering
Mark Turner

Page 2: Software reliability engineering

Topics Covered in this Presentation

What software reliability engineering is and why it is needed.

Defining software reliability targets.

Operational profiles.

Reliability risk management.

Code inspection.

Software testing.

Reliable system design.

Reliability modeling.

Reliability demonstration.

Page 3: Software reliability engineering

INTRODUCTION

What Software Reliability Engineering is and why it is needed

Page 4: Software reliability engineering

Different Views of Reliability

Mechanical Reliability + Electronic Reliability + Software Reliability → System Reliability

Product development teams view reliability at the sub-domain level, addressing mechanical, electronic and software issues. Customers view reliability at the system level, with minimal consideration placed on the sub-domain distinction.

To develop a reliable product, engineering teams must consider both views (system and sub-domain).

Although this presentation focuses on software reliability engineering, it should be viewed as a component part of an overall Design for Reliability process, not as a disparate activity, as hardware-software interactions may otherwise be missed.

This presentation does not make any distinction between software and firmware; the same techniques apply equally to both.

The primary measure of reliability is defined by the customer.

Page 5: Software reliability engineering

System-Level Reliability Modeling (1 of 2)

Computer Server: R = 0.9665

Software: R = 0.99

A “traditional” reliability program may include modeling, evaluation and testing to prove that the hardware meets the reliability target, but software should not be forgotten as it is a system component.

Individually the hardware and software may meet the reliability target…but they also have to when they are combined.

A system is made up of components/sub-systems; each has its own inherent reliability.

System reliability = H/W reliability × S/W reliability. I.e., H/W = 0.9665, S/W = 0.99, System = 0.9665 × 0.99 = 0.9568.

System Reliability = 95.68%
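The series calculation above can be sketched in a few lines of Python, using the slide's component values:

```python
# Series model: the system works only if every component works, so the
# system reliability is the product of the component reliabilities.
def series_reliability(component_reliabilities):
    product = 1.0
    for r in component_reliabilities:
        product *= r
    return product

# Values from the slide: hardware R = 0.9665, software R = 0.99.
system_r = series_reliability([0.9665, 0.99])
print(f"{system_r:.2%}")  # 95.68%
```

The same function extends to any number of components in series, which is why adding sub-systems without improving their individual reliability always lowers the system figure.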

Page 6: Software reliability engineering

System-Level Reliability Modeling (2 of 2)

Therefore the software reliability should also be accounted for in the system-level reliability model.

Software may consist of both the operating system (OS) and configurable (turnkey) software. It may not be possible to influence the OS design, but turnkey software can be focused on.

This may consist of re-used software such as library functions and newly developed software.

If the reliability of the library functions is already understood then library function re-use simplifies the software reliability engineering process.

Page 7: Software reliability engineering

What is Software Reliability Engineering (SRE)?

The quantitative study of the operational behavior of software-based

systems with respect to user requirements concerning reliability.

This presentation will provide an introduction to software reliability engineering…

SRE has been adopted either as standard or as best practice by more than 50 organizations in their software projects, including AT&T, Lucent, IBM, NASA and Microsoft, plus many others worldwide.

Page 8: Software reliability engineering

Why is SRE Important?

There are several key reasons a reliability engineering program should be implemented:

So that it can be determined how satisfactorily products are functioning.

To avoid over-designing – products could cost more than necessary and lower profit.

If more features are added to meet customer demand then reliability should be monitored to ensure that defects are not designed in, which could impact reliability.

If a customer’s product is not designed well, with reliability and quality in mind, then they may well turn to a COMPETITOR!

Having a software reliability engineering process can make organizations more competitive, as customers will always expect reliable software that is better and cheaper.

Page 9: Software reliability engineering

Why is SRE Beneficial?

Managing customer demands: products can be developed that are delivered to the customer at the right time, at an acceptable cost, and with satisfactory reliability.

For engineers: makes engineers more successful in meeting customer demands. In turn this avoids conflicts – risk, pressure, schedule, functionality, cost etc.

For the organization: enables software to be produced that is more reliable, built faster and cheaper. Improves competitiveness. Reduces development costs. Provides customers with quantitative reliability metrics. Places less emphasis on tools and a greater emphasis on “designing in reliability.”

Page 10: Software reliability engineering

Common SRE Challenges

Data is collected during test phases, so if problems are discovered it is too late for fundamental design changes to be made.

Failure data collected during in-house testing may be limited, and may not represent failures that would be uncovered in the product’s actual operational environment.

Reliability metrics obtained from restricted testing data may be inaccurate.

There are many possible models that can be used to predict the reliability of the software, which can be very confusing.

Even if the correct model is selected there may be no way of validating it due to having insufficient field data.

Page 11: Software reliability engineering

Fault Lifecycle Techniques

Prevent faults from being inserted: avoid faults being designed into the software when it is being constructed.

Remove faults that have been inserted: detect and eliminate inserted faults through inspection and test.

Design the software so that it is fault tolerant: provide redundant services so that the software continues to work even though faults have occurred or are occurring.

Forecast faults and/or failures: evaluate the code and estimate how many faults are present, and the occurrences and consequences of software failures.

Page 12: Software reliability engineering

Preventing Faults From Being Inserted

Initial approach for reliable software. This requires:

A formal requirement specification always being available that has been thoroughly reviewed and agreed to.

Formal inspection and test methods being implemented and used.

Early interaction with end-users (field trials) and requirement refinement if necessary.

The correct analysis tools and disciplined tool use.

Formal programming principles and environments that are enforced.

Systematic techniques for software reuse.

Formal software engineering processes and tools, if applied successfully, can be very effective in preventing faults (but are no guarantee!). However, software reuse without proper verification can result in disappointment.

A fault that is never created does not cost anything to fix. This should be the ultimate objective of software engineering.

Page 13: Software reliability engineering

Removing Faults

When faults are injected into the software, the next method that can be used is fault removal.

Approaches:

Software inspection.

Software testing.

Both have become standard industry practices. This presentation will focus closely on these.

Page 14: Software reliability engineering

Fault Tolerance

This is a survival attribute – the software has to continue to work even though a failure has occurred.

Fault tolerance techniques enable a system to:

Prevent dormant software faults from becoming active (i.e., defensive programming to check for input and output conditions and forbid illegal operations).

Contain software errors within a confined boundary to prevent them from propagating further (i.e., exception handling routines to treat unsuccessful operations).

Recover software operations from erroneous conditions by using techniques such as check pointing and rollback.
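The three measures above can be sketched together in Python; the function and state names below are invented for illustration, not taken from the presentation. A defensive check forbids the illegal operation, the exception handler contains the error, and a checkpoint provides rollback recovery.

```python
# Illustrative sketch (names invented): defensive checks, error
# containment, and checkpoint/rollback recovery.

def safe_divide(numerator, denominator):
    # Defensive programming: forbid the illegal operation up front,
    # turning a dormant fault into a well-defined error.
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return numerator / denominator

def run_operation(state, update):
    checkpoint = dict(state)  # checkpoint before the risky operation
    try:
        update(state)         # containment: any error is caught here
        return state
    except Exception:
        return checkpoint     # recovery: roll back to the checkpoint

def bad_update(s):
    s["total"] = safe_divide(s["total"], 0)  # trips the defensive check

print(run_operation({"total": 10}, bad_update))  # {'total': 10}
```

The error never propagates past `run_operation`, and the caller sees the last known-good state rather than a crash.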

Page 15: Software reliability engineering

Fault/Failure Forecasting

If software failures are likely to occur, it is critical to estimate the number of failures and predict when each is likely to occur.

Fault/failure forecasting requires:

Defining a fault/failure relationship – why the failure occurs and its effect.

Establishing a software reliability model.

Developing procedures for measuring software reliability.

Analyzing and evaluating the measurement results.

Measuring software reliability provides:

Useful metrics that can be used to plan further testing and debug efforts, to calculate warranty costs and to plan further software releases.

A determination of when testing can be terminated.

This will help concentrate effort on failures that have the greatest probability of occurring, provide reliability improvement opportunities and improve customer satisfaction.
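As a simple illustration of measuring software reliability, the sketch below assumes a constant failure rate (the simplest possible model; the failure counts and hours are made up). The reliability models discussed later refine this assumption.

```python
import math

# Constant-failure-rate sketch: estimate failure intensity from test
# data, then use it to judge reliability over a mission time.
def failure_intensity(num_failures, test_hours):
    return num_failures / test_hours

def reliability(failure_rate, mission_hours):
    # Probability of no failure during the mission (exponential model).
    return math.exp(-failure_rate * mission_hours)

lam = failure_intensity(4, 2000)        # e.g. 4 failures in 2000 test hours
print(lam)                              # 0.002 failures/hr
print(round(reliability(lam, 100), 3))  # 0.819
```

Even this crude estimate supports the planning uses listed above: it can be compared against a failure intensity objective to decide whether further testing and debugging are needed.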

Page 16: Software reliability engineering

SRE Process Overview

This slide shows a general SRE process flow that has six major components:

Determine the reliability objective (the reliability target).

Define a software operational profile.

Conduct code inspection.

Perform software testing and collect failure data.

Select an appropriate software reliability model and use it to calculate the current reliability, continuing to test and improve until the reliability objective is met – at which point the software release is acceptable from a reliability perspective.

Validate the field reliability.

Page 17: Software reliability engineering

SRE Terms

Reliability objective: The product’s reliability goal from the customer’s viewpoint.

Operational profile: A set of system operational scenarios with their associated probability of occurrence.

This encourages testers to select test cases according to the system’s likely operational usage.

Reliability modeling: This is an essential element of SRE that determines whether the product meets its reliability objective. One or more models can be used to calculate, from failure data collected during system testing, various estimates of a product’s reliability as a function of test time, including:

Product reliability at the end of various test phases.

Amount of additional test time required to reach the product’s reliability objective.

The reliability growth that is still required (ratio of initial to target reliability).

Prediction of field reliability.

Field Reliability Validation: Determination of whether the actual field reliability meets the customer’s target.

Page 18: Software reliability engineering

OBJECTIVES

Defining software reliability targets

Page 19: Software reliability engineering

Software Reliability Objectives

Reliability target(s) should be defined and used to:

Manage customer expectations.

Determine how reliability growth can and will be tracked throughout the program.

Determine availability targets. Software reliability is commonly expressed as an availability metric rather than as a probabilistic reliability metric. This is defined as:

Availability = Software uptime / (Software uptime + downtime)

A data collection and analysis methodology also has to be defined:

How inspections will be conducted.

How failure data will be collected.

How the data will be analyzed, i.e., what model will be used?

This helps project managers track metrics and plan resources.

Page 20: Software reliability engineering

Managing the Software Reliability Objective

Defects are most often detected and addressed at a later date than the original design effort.

This is usually related to the intensity of the effort, i.e. the number of engineers working on the program, the project schedule and the various design decisions that are made etc.

If test efforts are relied on to discover most defects, this lag can have a negative impact on the program.

This can be mitigated by using code inspection, but some testing will still be necessary. Code inspections should be conducted to IEEE 1028.

There is still a lag though between defect insertion and correction, which can have a negative impact on the program.

Defects are often inserted from the beginning of the project.

The eventual defect rate represents the reliability target, and as defects are discovered and addressed the software reliability is increased, or grown – this is termed “reliability growth management”.

Page 21: Software reliability engineering

Initial Reliability Growth Model - The Rayleigh Curve (1 of 3)

The eventual goal should be to forecast the discovery rate of defects as a function of time throughout the software development program.

f(t) = K × (t / t_Peak²) × e^(−t² / (2 × t_Peak²))

where K is the total expected number of defects and t_Peak is the time at which the defect-discovery rate peaks.

This cannot be achieved until data from prior similar projects becomes available. This may take time but the effort provides value as it enables accurate forecasts to be achieved from the beginning of the project.

Industry data is also available.

This helps to manage customer expectations as it demonstrates a strategy for improving software reliability.

To produce this curve, reliability data from prior software developments has to be available. Therefore this is a goal; it is not a technique that can be used immediately. To get to this stage, metrics need to be collected by using the methods discussed in this presentation.

Page 22: Software reliability engineering

The Rayleigh Curve (2 of 3)

The model's cumulative distribution function (CDF) describes the total-to-date effort expended or defects found at each interval, returning the software reliability at various points in time.

F(t) = K × (1 − e^(−t² / (2 × t_Peak²)))
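The two Rayleigh formulas can be sketched as follows, assuming the forms f(t) = K·(t/t_Peak²)·e^(−t²/(2·t_Peak²)) and F(t) = K·(1 − e^(−t²/(2·t_Peak²))). The values of K, t_Peak and the month numbers are illustrative assumptions, not the presentation's data.

```python
import math

# Rayleigh defect-discovery model in the form used on these slides:
# K = total expected defects, t_peak = time of peak discovery rate.
def rayleigh_pdf(t, K, t_peak):
    return K * (t / t_peak**2) * math.exp(-t**2 / (2 * t_peak**2))

def rayleigh_cdf(t, K, t_peak):
    return K * (1 - math.exp(-t**2 / (2 * t_peak**2)))

# Illustrative: 500 expected defects, discovery peaking in month 4.
K, t_peak = 500, 4.0
by_month_12 = rayleigh_cdf(12, K, t_peak)   # defects contained by month 12
by_month_9 = rayleigh_cdf(9, K, t_peak)     # effect of pulling delivery in
print(round(by_month_12), round(by_month_9 / by_month_12, 2))  # 494 0.93
```

Comparing F(t) at two candidate delivery dates, as in the last two lines, is the kind of tradeoff calculation behind schedule-pull discussions: shipping earlier means a smaller fraction of the expected defects has been contained.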

Page 23: Software reliability engineering

The Rayleigh Curve (3 of 3)

Example: A software project has a 12-month delivery schedule.

Prior data is available to generate a reliability forecast.

The customer wants to know what the effect is of pulling the delivery in to 9 months. What is the answer?

It reduces the total containment effectiveness (TCE), otherwise expressed as reliability, from 89.6% to 61%.

Tradeoff: This allows expectations to be managed by explaining that to achieve early delivery there will be a tradeoff in reliability, which may require a later release. This type of management helps to avoid possible customer dissatisfaction.

Page 24: Software reliability engineering

Further Information

Software reliability growth using the Rayleigh curve is discussed in greater depth in Appendix A of How Reliable Is Your Product?: 50 Ways to Improve Product Reliability, by Mike Silverman. The text of Appendix A was provided by the author of this presentation.

This book is highly recommended for anybody who is interested in improving product reliability; it is available from Amazon or directly from Ops A La Carte.

Page 25: Software reliability engineering

Software Availability and Failure Intensity (1 of 2)

As mentioned earlier, instead of a reliability metric being provided, customers may ask for a certain ‘availability’.

It depends on:

The probability of software failure.

The length of downtime when a failure occurs.

It essentially describes the expected fraction of the operating time during which a software component or system is functioning acceptably.

This is the average (over time) probability that a system, or a capability of a system, is currently functional in a specified environment.

If the software is not being modified (if further development or further releases are not planned) then the failure rate will be constant and therefore the availability will be constant.

Page 26: Software reliability engineering

Software Availability and Failure Intensity (2 of 2)

From earlier, availability is defined as:

Availability = Software uptime / (Software uptime + downtime)

Downtime can be expressed as:

Downtime = t_m × λ

where t_m = downtime per failure and λ = failure intensity.

Therefore:

Availability = 1 / (1 + t_m × λ)

For software, the downtime per failure is the time to recover from the failure, not the time required to find and remove the fault.

If an availability specification for the software is specified, then the downtime per failure will determine a failure intensity objective:

λ = (1 − Availability) / (Availability × t_m)

Either an availability or a failure intensity objective has to be defined.

Page 27: Software reliability engineering

Example

A product must be available 99% of the time. Required downtime per failure = 6 minutes (0.1 hr).

The downtime per failure can be used to determine the failure intensity objective:

λ = (1 − A) / (A × t_m) = (1 − 0.99) / (0.99 × 0.1) ≈ 0.1 failures/hr

or 100 failures/kHr.

Page 28: Software reliability engineering

Availability, Failure Intensity, Reliability and MTBF

This presentation will discuss reliability in terms of availability, probability and MTBF. The example below shows the relationships between these three metrics.

A customer specifies an availability target of 0.99999 and a maximum software downtime of 5 minutes (0.083 hours) per failure. The failure intensity is determined from:

λ = (1 − Availability) / (Availability × Downtime) = (1 − 0.99999) / (0.99999 × 0.083) ≈ 1.2 × 10⁻⁴ failures/hr

What is the Mean Time Between Failures (MTBF)?

MTBF = 1/λ ≈ 0.083 / (1 − 0.99999) = 0.083 × 10⁵ = 8.3 × 10³ hours

What is the reliability probability for a period of 2 years (17,520 hours)?

R(t) = e^(−λt) = e^(−1.2 × 10⁻⁴ × 17,520) ≈ 0.12
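The availability, failure intensity, MTBF and reliability relationships can be sketched directly in Python; the computed values follow from the formulas on the earlier slides, using a 0.99999 availability target and 0.083 hours downtime per failure.

```python
import math

# Relationships between availability, failure intensity, MTBF and
# reliability (A = availability, downtime per failure in hours).
def failure_intensity_from_availability(availability, downtime_per_failure):
    return (1 - availability) / (availability * downtime_per_failure)

def mtbf(failure_rate):
    return 1 / failure_rate

def reliability(failure_rate, hours):
    return math.exp(-failure_rate * hours)

lam = failure_intensity_from_availability(0.99999, 0.083)
print(round(lam, 6))                          # ~1.2e-4 failures/hr
print(round(mtbf(lam)))                       # 8300 hours
print(round(reliability(lam, 2 * 8760), 2))   # 0.12 over two years
```

Note how a very high availability can still coexist with a low probability of surviving a long period failure-free: availability only says the outages are short, not that they are rare over the mission time.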

Page 29: Software reliability engineering

THE OPERATIONAL PROFILE

Defining a structured approach to inspection and test

Page 30: Software reliability engineering

Defining an Operational Profile

Why is it useful?

It provides information on how users will employ the product.

It enables the most critical operations to be focused on during testing.

This allows the efficiency of the reliability test effort to be improved.

It allows more realistic test cases to be designed.

To do this, the individual software operations have to be identified. These are:

Major system logical tasks that return control to the system when complete.

Major = a task that is related to a functional requirement or feature rather than a subtask.

The operation can be initiated by a user, another part of the system, or by the system’s own controller.

For more information on operational profiles refer to Software Reliability Engineering: More Reliable Software Faster and Cheaper – John D. Musa.

An operational profile is a quantitative characterization of how a system will be used in the field by customers.

Page 31: Software reliability engineering

Developing an Operational Profile (1 of 5)

Five steps are needed to develop an operational profile:

1. Identify operation initiators (users, other sub-systems, external systems, the product’s own controller etc.).

2. Create an operations list – this is a list of operations that each initiator can execute. If all initiators can execute every operation then the initiators can be omitted, and instead just focus on producing a thorough operations list.

Page 32: Software reliability engineering

Developing an Operational Profile (2 of 5)

A good way to generate an operations list for a menu-driven product is to produce a “walk tree” rather than use an initiators list. An example of a menu-driven system is provided below.

This is based on a medical enteral pump, used for feeding patients.

Page 33: Software reliability engineering

Developing an Operational Profile (3 of 5)

Step 3. Once the operational profile is complete it should be reviewed to ensure:

All operations are of short duration in execution time (seconds at most).

Each operation must have substantially different processing from the others.

All operations must be well-formed, i.e., sending messages and displaying data are parts of the operation and not operations in themselves.

The final list is complete with high probability.

The total number of operations is reasonable, taking the test budget into account. This is because each operation will be focused on individually using a test case, so if the list is too long it may result in the project test phase being very lengthy.

Page 34: Software reliability engineering

Developing an Operational Profile (4 of 5)

Step 4. Determine occurrence rates for each operation – this may need to be estimated to begin with, but can be revised later.

Occurrence Rate = Number of operation occurrences / Time the total set of operations is running

Page 35: Software reliability engineering

Developing an Operational Profile (5 of 5)

Step 5. Determine the occurrence probabilities.

Occurrence Probability = Occurrence rate of each operation / Total operation occurrence rate

This table has been rearranged by sorting the operations in order of descending probabilities. This presents the operational profile in a form that is more convenient to use.
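Steps 4 and 5 can be sketched together in a few lines; the operation names and counts below are invented for illustration (loosely echoing the enteral-pump example), not the presentation's data.

```python
# Steps 4 and 5: occurrence rates from counts, then probabilities,
# sorted into descending order for convenient use.
def operational_profile(occurrence_counts, total_hours):
    rates = {op: n / total_hours for op, n in occurrence_counts.items()}
    total_rate = sum(rates.values())
    probabilities = {op: r / total_rate for op, r in rates.items()}
    return sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)

# Invented counts observed over 100 hours of operation.
counts = {"start feed": 120, "set rate": 60, "alarm silence": 15, "prime pump": 5}
for op, p in operational_profile(counts, total_hours=100.0):
    print(f"{op}: {p:.3f}")
```

The sorted output is the operational profile table itself: the most probable operations come first, which is the order in which test effort should be allocated.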

Page 36: Software reliability engineering

Establish Failure Definitions

What is critical to the customer? How does the customer define a failure?

A failure is any departure of system behavior in execution from the user needs.

A Fault is a defect that causes the failure (i.e., missing code).

Answer – by developing an operational profile. This enables resources to be focused on addressing issues in the operations that have the highest probability of failure, resulting in the remaining failures having a low failure intensity.

Failure modes should be defined early in the project – this provides a specification for what the system should NOT be doing!

Failure severity classes can be defined as shown below. The failures that have the highest severity should be focused on first.

Faults have to be detected – how can this be done?

A fault may not result in failure…but a failure can only occur if a fault exists.

Page 37: Software reliability engineering

SOFTWARE FMEA

Software reliability risk management

Page 38: Software reliability engineering

Software FMEA and Risk Analysis

A software Failure Mode and Effects Analysis (SFMEA) is a systematic method that:

Recognizes, evaluates, and prioritizes potential failures and their effects.

Identifies and prioritizes actions that could eliminate or reduce the likelihood of potential failures occurring.

Material or process input (process step) → Cause → Failure Mode (Defect) → Software Failure → Effect

An FMEA aids in anticipating failure modes in order to determine and assess the risk to the customer or product.

Risks then have to be reduced to acceptable levels.

Page 39: Software reliability engineering

Software FMEA and Risk Analysis (1 of 2)

Sensor → Controller → Actuator

Potential failure mode - unintended system function.

Results in undesirable system behavior - could include potential controller or sensor failures.

The first step is to produce a fault tree.

Fault trees provide a graphical and logical framework within which system failure modes can be analyzed. These can then be used to assess the overall impact of software failures on a system, or to prove that certain failure modes cannot occur.

SYSTEM BLOCK DIAGRAM

Here is a simple example of how to use a fault tree to perform a software FMEA. It is far better to begin an FMEA using a fault tree; filling in a spreadsheet immediately can easily result in confusion and is rarely successful!

Page 40: Software reliability engineering

Software FMEA and Risk Analysis (2 of 2)


Page 41: Software reliability engineering

CODE INSPECTION

A reliability improvement and risk management technique

Page 42: Software reliability engineering

Why Inspect Code?

Case study performed by the Data Analysis Center for Software (DACS):

Formal inspections should be carried out on the:

Requirements.

Design.

Code.

Test plans.

Approximately 18 man hours plus rework are required per 300-400 lines of code.

“…formal design and code inspections rank as the most effective methods of defect removal yet discovered…(defect removal) can top 85%, about twice those of any form of testing.”

– Capers Jones, Applied Software Measurement, 3rd Ed., McGraw Hill, 2008

85% Defect Containment: cost = $1,000,000, Duration = 12 months

95% Defect Containment: cost = $750,000, Duration = 10.8 months

Page 43: Software reliability engineering

Formal “Fagan Style” Inspections

This is a defined process that is quantitatively managed.

The objective is to do the thing right. There is no discussion of options: the code is either right or wrong, or it requires investigation.

Ideally 4 inspectors participate (it can be 3-5, but not less than 3). Participants have roles – Leader, Reader, Author and Tester.

The review rate target is 150-200 lines of code per hour. What is found depends on how closely the inspectors look at the code.

This is a 6 step process that is defined in IEEE 1028.

Data is stored in a repository for future reference.

The outcome should be that defects are found and fixed, and that data is collected and analyzed.

Page 44: Software reliability engineering

Relationship Between Inspection and Reliability (1 of 2)

Adapted from a similar approach in: Capers Jones, Applied Software Measurement, 3rd Ed., McGraw Hill, 2008.

For a four-phase test process the reliability is likely to vary between 74% and 92% (based on industry data).

Note that not all fixes address problems completely. Some fixes may not be totally effective, while others may also introduce further problems. This is where inspection can be of value.

Page 45: Software reliability engineering

Relationship Between Inspection and Reliability (2 of 2)

Adapted from: Capers Jones, Applied Software Measurement, 3rd Ed., McGraw Hill, 2008.

Introducing inspection can increase the reliability to 93–99% (based on industry data).

Inspection alone can enable the software to surpass the reliability that is obtained from a test-only process!

This also increases the scope for reducing the emphasis on testing.

Page 46: Software reliability engineering

SOFTWARE TESTING

Further defect detection and elimination

Page 47: Software reliability engineering

Static Analysis (1 of 2)

This should be performed after the code is developed.

It is pattern based – it scans the code to check for patterns that are known to cause defects.

The benefits of static analysis are:

It can examine more execution paths than conventional testing.

It can be applied early in the software design, providing significant time and cost savings.

This type of analysis uses coding standard rules and enforces internal coding guidelines.

This is a simple task, easily automated, that reduces future debugging effort.

It is data-flow based, in that it statically simulates execution paths, so it is able to automatically detect potential runtime errors such as:

Resource leaks.

Null pointer exceptions.

SQL injections.

Security vulnerabilities.
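As a toy illustration of pattern-based checking (not a real analysis tool), the sketch below uses Python's ast module to flag a literal division by zero without executing the code:

```python
import ast

# Toy pattern-based check: flag a literal division by zero without
# executing the code (illustrative only, not a real analysis tool).
def find_div_by_zero(source):
    warnings = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.BinOp) and isinstance(node.op, ast.Div)
                and isinstance(node.right, ast.Constant) and node.right.value == 0):
            warnings.append(f"line {node.lineno}: division by zero")
    return warnings

print(find_div_by_zero("x = 10\ny = x / 0\n"))  # ['line 2: division by zero']
```

Commercial tools generalize this idea enormously (data-flow tracking, inter-procedural analysis), but the principle is the same: match known-bad patterns in the code's structure rather than run it.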

Page 48: Software reliability engineering

Static Analysis (2 of 2)

Examples of warning classes that can be obtained from static analysis are:

Buffer overrun

Buffer underrun

Cast alters value

Ignored return value

Division by zero

Missing return statement

Null pointer dereference

Redundant condition

Shift amount exceeds bit width

Type overrun

Type underrun

Uninitialized variable

Unreachable code

Unused value

Useless assignment

Page 49: Software reliability engineering

Buffer Overflow Example

Consider the code segment below:

char arr[32];
for (int i = 0; i < 64; i++) {
    arr[i] = (char)i;   /* writes beyond the 32-byte buffer once i >= 32 */
}

Here, memory that is beyond the range of the stack-based variable “arr” is being explicitly addressed. This results in memory being overwritten, which could include the stack frame information that is required for the function to successfully return to its caller, etc.

This coding pattern is typical of security vulnerabilities that exist in software. The specifics of the vulnerability may change from one instance to another, but the underlying problem remains the same, performing array copy operations that are incorrectly or insufficiently guarded against exploit.

Static analysis can assist in detecting such coding patterns.

Page 50: Software reliability engineering

Types of Tests

Functional tests: single execution of operations, with interactions between the various operations minimized. The focus is on whether the operation executes correctly.

Load tests: these attempt to represent field use and the environment as accurately as possible, with operations executing simultaneously and interacting. Interactions can occur directly, through the data, or as a result of resource conflicts. This testing should use the operational profile.

Regression tests: functional tests that can be conducted after every build involving significant change. The focus during these tests is to reveal faults that may have been created during the change process.

Endurance tests: ad-hoc testing similar to load testing in that it should represent the field use and environment as accurately as possible. This will focus on how the product is to be used…and may be misused.
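Load-test case selection driven by an operational profile can be sketched as weighted random sampling; the operations and probabilities below are illustrative assumptions, not the presentation's data.

```python
import random

# Draw load-test operations with frequencies matching the operational
# profile (operations and probabilities are illustrative).
profile = {"start feed": 0.6, "set rate": 0.3, "alarm silence": 0.075, "prime pump": 0.025}

def draw_operations(profile, n, seed=42):
    rng = random.Random(seed)  # fixed seed for a reproducible test run
    operations = list(profile)
    weights = [profile[op] for op in operations]
    return rng.choices(operations, weights=weights, k=n)

sequence = draw_operations(profile, 1000)
print(sequence.count("start feed") / 1000)  # close to 0.6
```

Executing operations in proportion to their field probabilities is what makes failure data from load testing usable as input to the reliability models discussed earlier.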

Page 51: Software reliability engineering

RELIABLE SYSTEM DESIGN

A look at fault tolerance, an essential aspect of system design

Page 52: Software reliability engineering

Reliable System Design (1 of 7)

To achieve reliable system design, software should be designed such that it is fault tolerant.

Typical responses to system or software faults during operation include a sequence of stages:

Fault confinement,

Fault detection,

Diagnosis,

Reconfiguration,

Recovery,

Restart,

Repair,

Reintegration.

Page 53: Software reliability engineering

Reliable System Design (2 of 7)

Fault Confinement.

Limits the spread of fault effects to one area of the system – prevents contamination of other areas.

Achieved through the use of:

- self-checking acceptance tests,

- exception handling routines,

- consistency checking mechanisms,

- multiple requests/confirmations.

Without such mechanisms, erroneous system behaviors due to software faults are typically undetectable. Reduction of dependencies can also help.

Page 54: Software reliability engineering

Reliable System Design (3 of 7)

Fault Detection.

This stage recognizes that something unexpected has occurred in the system.

Fault latency – period of time between fault occurrence and detection.

The shorter the fault latency is, the better the system can recover. Two technique classes are off-line and on-line fault diagnosis:

- Off-line techniques are diagnostic programs.

System cannot perform useful work under test.

- On-line techniques provide real-time detection capability.

System can still perform useful work.

Watchdog monitors and redundancy schemes.

Page 55: Software reliability engineering

Reliable System Design (4 of 7)

Diagnosis.

This is necessary if the fault detection technique does not provide information about the failure location and/or properties.

This is often an off-line technique that may require a system reset.

On-line techniques can also be used, e.g., when a diagnosis indicates unhealthy system conditions (such as low available resources), low-priority resources can be released automatically in order to achieve in-time transient failure prevention.

Reconfiguration.

This occurs when a fault is detected and a permanent failure is located.

The system may reconfigure its components either to replace the failed component or to isolate it from the rest of the system (i.e., redundant memory, error checking of memory in case of partial corruption etc).

Successful reconfiguration requires robust and flexible software architecture and reconfiguration schemes.

Page 56: Software reliability engineering

Reliable System Design (5 of 7)

Recovery.

Uses techniques to eliminate the effects of faults.

There are two approaches:

- fault masking,

- retry and rollback.

Fault masking hides the effects of failures by allowing redundant, correct information to outweigh the incorrect information.

Retry attempts an operation a second time, since many faults are transient in nature.

Rollback makes use of operations backed up (checkpointed) at some point in the processing prior to fault detection, and operation recommences from this point.

Fault latency is very important because the rollback must go back far enough to avoid the effects of undetected errors that occurred before the detected error.
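A retry-and-rollback loop can be sketched as below: state is checkpointed before each attempt, and a detected fault triggers a rollback before the retry. The transient-fault simulation is contrived purely for illustration.

```python
import copy

attempts = {"n": 0}

def risky_update(state):
    attempts["n"] += 1
    if attempts["n"] == 1:         # simulate a transient fault on the first try
        state["balance"] -= 999    # partial, corrupting update
        raise RuntimeError("transient fault")
    state["balance"] += 10
    return state

def run_with_rollback(state, step, max_retries=3):
    for _ in range(max_retries):
        checkpoint = copy.deepcopy(state)  # checkpoint before the risky step
        try:
            return step(state)
        except RuntimeError:
            state.clear()
            state.update(checkpoint)       # rollback: discard the fault's effects
    raise RuntimeError("fault appears permanent")

state = {"balance": 100}
run_with_rollback(state, risky_update)
print(state)  # {'balance': 110}
```

The corrupting first attempt is rolled back completely, so the retry starts from clean state; a checkpoint taken after an undetected error would defeat this, which is why fault latency matters here.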

Page 57: Software reliability engineering

Reliable System Design (6 of 7)

Restart.

This occurs after the recovery of undamaged information.

There are three approaches:

- hot restart,

- warm restart,

- cold restart.

Hot restart – resumption of all operations from the point of fault detection (this is only possible if no damage has occurred).

Warm restart – only some of the processes can be resumed without loss.

Cold restart – complete reload of the system is performed with no processes surviving.

Page 58: Software reliability engineering

Reliable System Design (7 of 7)

Repair.

Replacement of failed component – on or off-line.

Off-line – system brought down to perform repair. System availability depends on how fast a fault can be located and removed.

On-line – the component is replaced immediately with a backup spare (similar to reconfiguration), or perhaps operation can continue without using the faulty component (e.g., masking redundancy or graceful degradation).

On-line repair prevents system operation interruption.

Reintegration.

Repaired module must be reintegrated into the system.

For on-line repair, reintegration must be performed without interrupting system operation.

Non-redundant systems are fault intolerant and, to achieve reliability, fault avoidance is often the best approach. Redundant systems should use fault detection, masking redundancy (i.e., disabling 1 out of N units) and dynamic redundancy (i.e., temporarily disabling certain operations) to automate one or more stages of fault handling.

Page 59: Software reliability engineering

RELIABILITY MODELING

Determining what reliability has actually been achieved

Page 60: Software reliability engineering

Reliability Modeling (1 of 4)

This is used to calculate what the current reliability is and, if the reliability target is not yet being achieved, to determine how much testing and debug needs to be completed in order to achieve the reliability target.

The questions that reliability modeling aims to answer are:

How many failures are we likely to experience during a fixed time period?

What is the probability of experiencing a failure in the next time period?

What is the availability of the software system?

Is the system ready for release (from a reliability perspective)?

[Figure: timeline of software failures at cumulative times T1 to T8, from T=0 to the end of the observation period TE.]

Ti is the Cumulative Time To Failure; ti is the inter-arrival time = Ti – Ti-1.
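The Ti / ti relationship can be computed directly; the failure times below are illustrative values, not data from the presentation.

```python
cumulative = [152, 319, 478, 610]  # Ti: cumulative times to failure (hrs)
inter_arrival = [t - prev for prev, t in zip([0] + cumulative, cumulative)]
print(inter_arrival)  # [152, 167, 159, 132]
```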

Page 61: Software reliability engineering

Reliability Modeling (2 of 4)

In reliability engineering it is usual to identify a failure distribution, especially when modeling non-repairable products*. This approach can be used because it is assumed that hardware faults are statistically independent and identically distributed.

Where software is concerned, events (failures) are not necessarily independent due to interactions with other system elements, so in most cases failures are not identically distributed.

* Although it can be argued that a software system can be repaired by fixing the fault, in reliability terms it is still a non-repairable product because it is not wearing out. For instance, a car is a repairable device as parts can be changed when they wear out, but this does not necessarily make it as good as new. If a software fault is repaired the software is actually as good as new again, and in fact the improvement may make it better than new.

When a failure occurs in a software system the next failure may depend on the current operational time of the unit, and therefore each failure event in the system may be DEPENDENT.

Page 62: Software reliability engineering

Reliability Modeling (3 of 4)

Therefore what is needed is to model the Rate of Occurrence of Failures and the Number of Failures within a given time.

As an example, with reference to the figure below, a model is needed that will report the fact that 8 failures are expected by time TE and that the Rate of Occurrence of Failures is increasing with time.

[Figure: timeline of software failures at cumulative times T1 to T8, from T=0 to TE.]

Page 63: Software reliability engineering

Reliability Modeling (4 of 4)

If a Distribution Analysis is performed on the Time-Between-Failures, then this is equivalent to saying that there are 9 different systems, where System 1 failed after t1 hours of operation, System 2 after t2, ..., etc.

[Figure: Systems 1 to 9 plotted from T=0, failing after t1, t2, t3, ..., with System 9 shown as a suspension*.]

This is the same as assuming that the system is failure free if the fault is addressed, which may not necessarily be true as further failures may occur. Example: changing the brake pads on a car does not mean that the car is now failure free! I.e., it may fail at some point in the future.

* A suspension is a unit that continues to work at the end of the analysis period or is removed from a test in working condition.

Page 64: Software reliability engineering

An Example of an Incorrect Approach (1 of 4)

This example has been included because it is a common approach to hardware reliability modeling but it CANNOT be used for modeling software reliability. This method is normally used to model a non-repairable hardware product. Unfortunately, when used in analyzing software reliability it returns incorrect results... but it is an easy trap for a reliability engineer to fall into!!!

A total of 6 different firmware and 4 different hardware failure modes are identified.

Both firmware and hardware failure data is collected from three systems:

Page 65: Software reliability engineering

An Example of an Incorrect Approach (2 of 4)

The conventional reliability engineering approach is to take the Time-Between-Failures for each system and then fit a distribution (e.g., an interval computed as 319 − 152 hrs).

Notice that hardware failures have been removed.

The time between the last failure and the current age is a Suspension.

Page 66: Software reliability engineering

An Example of an Incorrect Approach (3 of 4)

A Weibull (life data) Analysis is conducted, but with software this is not appropriate!

This analysis assumes a sample of 20 systems, where one system failed after 152 hrs, another after 319 hrs, etc.

Page 67: Software reliability engineering

An Example of an Incorrect Approach (4 of 4)

This system will be used for a total of 250 hours. What will the software reliability be?

97.63% – GREAT RESULT... BUT COMPLETELY WRONG!!!

Distribution analysis is okay for non-repairable products containing only hardware, but not for anything containing software (nor for repairable hardware-only products).

However, it is correct to fit a distribution on the First-Time-to-Failure of each system.

In products that contain software, events are dependent, and therefore alternative analysis methods should be used.

Page 68: Software reliability engineering

An Example of a Correct Approach

Reliability = 68.36%. This is the probability that the unit will NOT fail in the first 250 hours.

Notice that the confidence interval is very wide.

Page 69: Software reliability engineering

Three Possible SRE approaches…

Are multiple systems being tested?

- Yes: use the NHPP model (this is the best option).

- No: can testing be stopped after each phase to fix failure modes?

  - Yes: use the Crow-Extended model.

  - No: use the 3-Parameter Crow-Extended model.

This is the current state of the art in software reliability modeling, and is suitable for most projects. However, this approach is not suitable for testing a single unit (e.g., a large, expensive system), or where not all faults are going to be fixed between compiles. A better model is needed for this type of application.

It is hypothesized by the author that these models may be more suitable for developments where the NHPP model cannot be well applied. This essentially represents a future state of software reliability testing. However, before being readily accepted they should be validated, i.e., by comparing their predicted reliability with actual field data. Use of these models has been included in this presentation for completeness and possible future application.

Page 70: Software reliability engineering

A Better SRE Analysis Approach (1 of 4)

A model is needed that takes into account the fact that when a failure occurs the system has a "Current Age" – in other words, the likelihood of the next failure depends on how long the system has already been operating.

For example, in System 1, the system has an age of 152 hours after the first firmware failure mode has been detected.

In other words, all other operations that can result in a failure also have an age of 152 hours and the next failure event is based on this fact.

Page 71: Software reliability engineering

A Better SRE Analysis Approach (2 of 4)

The NHPP (Non-Homogeneous Poisson Process) with a Power Law failure intensity is such a model:

Pr[N(T) = n] = [λT^β]^n · e^(−λT^β) / n!

Where:

Pr[N(T)=n] is the probability that n failures will be observed by time T.

Λ(T) is the Failure Intensity Function (Rate of Occurrence of Failures): Λ(T) = λβT^(β−1).

Just because a model is used for hardware does not mean that it cannot be suitable for software as well, as models simply describe times-to-failure. Therefore a hardware model can also be used for software, providing that it is a dependent model (failures are dependent on the operational time, rather than being independent).
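The power-law NHPP quantities above can be sketched directly; lam and beta below are illustrative values chosen for the example, not parameters fitted to the presentation's data.

```python
from math import exp, factorial

def expected_failures(lam, beta, T):
    return lam * T ** beta  # E[N(T)] = lambda * T^beta

def failure_intensity(lam, beta, T):
    return lam * beta * T ** (beta - 1)  # rate of occurrence of failures

def prob_n_failures(lam, beta, T, n):
    m = expected_failures(lam, beta, T)  # N(T) is Poisson with mean m
    return m ** n * exp(-m) / factorial(n)

lam, beta = 0.5, 0.8  # assumed parameters for illustration
print(round(expected_failures(lam, beta, 1000.0), 2))  # expected failures by T=1000
print(round(failure_intensity(lam, beta, 1000.0), 4))  # failures/hr at T=1000
```

With beta < 1 the intensity falls with time (reliability growth); beta > 1 would indicate a deteriorating system.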

Page 72: Software reliability engineering

A Better SRE Analysis Approach (3 of 4)

NHPP model parameters:

Here the failure events of System 1 are analyzed between the period of 0 and 1380 hours. This folio also contains the failure events for Systems 2 and 3 (not shown).

Of interest is the fact that Beta >1, which indicates that the inter-arrival times between unique failures are decreasing, so there may be little opportunity for reliability improvement.

Page 73: Software reliability engineering

A Better SRE Analysis Approach (4 of 4)

NHPP model results:

The cumulative number of failures is 0.1352, or 13.52 failures per 25000 operational hours.

The plot shows the cumulative number of failures vs. time, from which conclusions and further predictions can be obtained. The Weibull plot intersects the X-axis, so out-of-box failures should not be present. If it had intersected the Y-axis then this would indicate potential for out-of-box failures.

Page 74: Software reliability engineering

An Example Using the NHPP Model (1 of 8)

Software is under development – the reliability requirement is to have no more than 1 fault in every 8 hours of software operation.

Three Test Engineers provide a total of 24 hours of testing each day.

One new compile is available for testing each week, when fixes are implemented.

The failure rate goal is:

FR = 1/8 = 0.125 failures per hour

In a testing day, the failure intensity goal is:

FI = 0.125 × 24 = 3 faults per day

Page 75: Software reliability engineering

An Example Using the NHPP Model (2 of 8)

Failure data is obtained:

NHPP model parameters

The data is grouped by the number of days until a new compile is available, i.e., the first 45 failures are contained in one group and are fixed in compile #1.

Page 76: Software reliability engineering

An Example Using the NHPP Model (3 of 8)

If testing is continued with the same growth rate, when will the goal of no more than 3 faults/day be achieved?

The instantaneous failure intensity after 28 days of testing is 4.4947 faults/day.

The answer is: after an additional 149 − 28 = 121 days of testing and development (test-analyze-fix).
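The 121-day figure follows from inverting the power-law intensity. The parameter values below are hypothetical, chosen only to be roughly consistent with the example's numbers (about 4.49 faults/day at day 28); small differences in the fitted parameters shift the answer by a day or two.

```python
def intensity(lam, beta, t):
    return lam * beta * t ** (beta - 1)  # instantaneous failure intensity

def time_to_reach(lam, beta, goal):
    # Invert goal = lam*beta*t^(beta-1); meaningful for beta < 1 (improving system)
    return (goal / (lam * beta)) ** (1.0 / (beta - 1))

lam, beta = 13.27, 0.758  # assumed fitted parameters (illustrative)
print(round(intensity(lam, beta, 28), 2))  # faults/day after 28 days of testing
t_goal = time_to_reach(lam, beta, goal=3.0)
print(round(t_goal), round(t_goal) - 28)   # total days, additional days needed
```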

Page 77: Software reliability engineering

An Example Using the NHPP Model (4 of 8)

An extra 121 days is longer than anticipated. Let's take a closer look by generating a Failure Intensity vs. Time plot...

It can be seen that there was a jump in the failure intensity between 20 and 23 days. This is why it is estimated that more development time is required.

Each of these lines indicates the failure intensity over a given interval (which in this case is 5 days).

The next step is to analyze the data set for the period up to 20 days of testing, before the failure intensity increased...

Page 78: Software reliability engineering

An Example Using the NHPP Model (5 of 8)

The NHPP model data is limited to the first 20 days of testing and another Failure Intensity vs. Time plot is generated, but this time for the first 20 days only.

This plot shows the decrease in the failure intensity rate over the first 20 days of testing, confirming that the failure intensity continuously reduced during this period.

Page 79: Software reliability engineering

An Example Using the NHPP Model (6 of 8)

Based on the first 20 days of data the additional test and development duration can be recalculated, which results in there being an additional 55-28=27 days to achieve the goal of having no more than 3 faults/day, rather than 121!

This generates questions:

Why is there such a big difference in the test duration still required?

What happened when the failure intensity jumped on the 23rd day of testing and development?

Answer – new functionality was added. The jump in required test time is typical when new features are introduced, and applies to software and hardware alike.

Because new functionality has been added it would be wise to reset the clock and track the reliability growth from the 20th day forward…

Page 80: Software reliability engineering

An Example Using the NHPP Model (7 of 8)

Now the NHPP model parameters need to be obtained and plotted for the last 8 days of testing (8 days is an arbitrary number; enough data needs to be available to have confidence in any conclusions that are drawn).

This provides better resolution. By taking a "macro" view it can be seen that the failure intensity is starting to increase, so the minimum failure intensity point has been determined. For improved accuracy, calculations should be based on this point.

Page 81: Software reliability engineering

An Example Using the NHPP Model (8 of 8)

Based on this data set 51 − 8 = 43 more days of developmental testing are required.

It may be too early to make any predictions based on only 8 days of testing, but the result can be used to obtain a general idea of the remaining development time required and produce a test plan.

To pull in the schedule, 3 more Test Engineers could be added and the code recompiled every 2 days, which would complete the project within 1 month.

There are also situations where some issues are fixed immediately, others are addressed later and more minor issues may not be addressed at all. In this type of situation the Crow-Extended model can be useful...

Page 82: Software reliability engineering

Crow-Extended Model Introduction (1 of 2)

This is not a common SRE model, but it does have the benefit of supporting decision making by providing metrics such as:

Failure intensity vs. time.

Demonstrated Mean Time Between Failures (MTBF*).

MTBF growth that can be achieved through implementation of corrective actions.

Maximum potential MTBF that can be achieved through implementation of corrective actions.

Maximum potential MTBF that can likely be achieved for the software, and estimates regarding latent failure modes that have not yet been uncovered through testing.

This model utilizes A, BC and BD failure mode classifications to analyze growth data:

A = a failure mode that will not be fixed.

BC = a failure mode that will be fixed while the test is in progress.

BD = a failure mode that will be corrected at the end of the test.

* This model uses MTBF rather than failure intensity or reliability metrics. A conversion between these various metrics is provided in slide 28.

Page 83: Software reliability engineering

Crow-Extended Model Introduction (2 of 2)

There is no reliability growth for A modes.

The effectiveness of the corrective actions for BC modes is assumed to be demonstrated during the test.

BD modes require a factor to be assigned that estimates the effectiveness of the correction that will be implemented after the test.

Analysis using the Crow-Extended model allows different management strategies to be considered by reviewing whether the reliability goal will be achieved.

There is one constraint to this approach – the testing must be stopped at the end of the test phase and all BD modes must be fixed. The Crow-Extended model will return misleading conclusions if it is used across multiple test phases. For those situations use the 3-Parameter Crow-Extended model (discussed next).

DO NOT APPLY THIS MODEL TO A MULTIPLE SYSTEM TEST, USE THE NHPP MODEL INSTEAD.

Page 84: Software reliability engineering

Crow-Extended Model Example (1 of 8)

A product underwent development testing, during which failure modes were observed. Some modes were corrected during the test (BC modes), some modes were corrected after the end of the test (delayed fixes, BD modes) and some modes were left in the system (A modes). The test was terminated after 400 hours; the times-to-failure are provided below:

Page 85: Software reliability engineering

Crow-Extended Model Example (2 of 8)

An effectiveness factor has been assigned for each BD failure mode (delayed fixes). The effectiveness factor is based on engineering assessment and represents the fractional decrease in the failure intensity of a failure mode after the implementation of a corrective action.

The effectiveness factors for the BD failure modes are provided below:

This metric enables an assessment to be made of whether or not the corrective actions have been effective and, if they have, how effective they were. This is often a subjective judgment.
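The role of the effectiveness factor can be shown with a simplified projection: each BD mode's observed failure intensity is reduced by its EF once the delayed fix is in place. The mode counts and EFs below are invented, and this sketch omits the Crow-Extended term for still-undiscovered BD modes, so it understates the projected failure intensity.

```python
T = 400.0  # total test time in hours, as in the example

# Invented first-occurrence counts and effectiveness factors per BD mode.
bd_modes = {"BD1": {"count": 3, "ef": 0.8},
            "BD2": {"count": 2, "ef": 0.6},
            "BD3": {"count": 1, "ef": 0.7}}
a_failures = 4  # A-mode failures: no corrective action, so no improvement

def projected_intensity(bd_modes, a_failures, T):
    lam = a_failures / T  # A modes contribute unchanged
    for mode in bd_modes.values():
        # a fix with effectiveness ef removes that fraction of the mode's intensity
        lam += (1 - mode["ef"]) * mode["count"] / T
    return lam

before = (a_failures + sum(m["count"] for m in bd_modes.values())) / T
after = projected_intensity(bd_modes, a_failures, T)
print(round(1 / before, 1), round(1 / after, 1))  # MTBF before vs projected
```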

Page 86: Software reliability engineering

Crow-Extended Model Example (3 of 8)

The times-to-failure data and effectiveness factors are entered:

Note that this data sheet only displays 29 rows of data, but all data is entered even though it has not been shown.

Effectiveness factor is expressed as 0-1 (0-100% of the failure intensity being removed by the corrective action).

Page 87: Software reliability engineering

Crow-Extended Model Example (4 of 8)

Model parameter calculation:

Here the failure events are analyzed between the period of 0 and 400 hours.

Page 88: Software reliability engineering

Crow-Extended Model Example (5 of 8)

Growth potential MTBF plot:

- Demonstrated MTBF (MTBF at end of test without corrective actions).

- Instantaneous MTBF (demonstrated MTBF with time).

- Projected MTBF (estimated MTBF after delayed corrective actions have been implemented).

- Growth potential MTBF (maximum achievable MTBF based on current strategy).

The demonstrated MTBF (the result of fixing BC modes during the test) is about 7.76 hours.

The projected MTBF (the result of fixing BD modes after the test) is about 11.13 hours.

The growth potential MTBF (if testing continues with the current strategy, i.e. modes corrected vs. modes not corrected, and with the current effectiveness of each corrective action) is estimated to be about 14.7 hours. This is the maximum attainable MTBF.

Page 89: Software reliability engineering

Crow-Extended Model Example (6 of 8)

An Average Failure Mode Strategy plot is a pie chart that breaks down the average failure intensity of the software into the following categories:

A modes – 9.546%.

BC modes addressed – 14.211%.

BC modes still undetected – 30.655%.

BD modes removed – 8.846%.

BD modes remaining (because corrective actions were <100% effective) – 3.355%.

BD modes still undetected – 33.386%.

Page 90: Software reliability engineering

Crow-Extended Model Example (7 of 8)

Individual Mode MTBF plot, which shows the MTBF of each individual failure mode. This enables the failure modes with the lowest MTBF to be identified.

Blue = failure mode MTBF before corrective action.

Green = failure mode MTBF after corrective action.

The modes with the lowest MTBF cause the majority of software failures, and should be addressed as the highest priority when reliability improvement activities are implemented.

Page 91: Software reliability engineering

Crow-Extended Model Example (8 of 8)

Failure Intensity vs. Time plot:

This can be analyzed in exactly the same way as in the NHPP example.

Page 92: Software reliability engineering

3-Parameter Crow-Extended Model Introduction (1 of 2)

This is not a common SRE model either, but it has the same benefits as the single-parameter Crow-Extended model, with the addition that multiple test phases can also be taken into account.

This model is ideal in situations where software is to be tested over multiple phases but where all bug fixes cannot be introduced as faults are discovered, i.e., all bugs will be addressed on an ad-hoc basis over an extended time period.

The model provides the flexibility of not having to specify when the test will end, so it can be continuously updated with new test data. Therefore this model is optimized for continuous evaluation rather than fixed test periods.

It can only be applied to an individual system, so it lends itself ideally to situations where an individual complex system is being tested. DO NOT APPLY ANY CROW MODEL TO A MULTIPLE SYSTEM TEST, USE THE NHPP MODEL INSTEAD.

Page 93: Software reliability engineering

3-Parameter Crow-Extended Model Introduction (2 of 2)

This model uses several event codes:

F – Failure time.

I – Time at which a certain BD failure mode has been corrected. BD modes that have not received a corrective action by time T will not have an associated I event in the data set.

Q – A failure that was due to a quality issue, such as a build problem rather than a design problem. The reliability engineer can decide whether or not to include quality issues in the analysis.

P – A failure that was due to a performance issue, such as an incorrect component being installed in a device where the embedded code is being tested. The reliability engineer can decide whether or not to include performance issues in the analysis.

AP – An analysis point, used to track overall project progress, which can be compared to planned growth phases.

PH – The end of a test phase. Test phases can be used to track overall project progress, which can be compared to planned growth phases.

X – A data point that is to be excluded from the analysis.

Page 94: Software reliability engineering

3-Parameter Crow-Extended Model Example (1 of 11)

Software is under development. Testing is to be conducted in 3 phases:

Phase 1 – 6 weeks of manual testing, run 45 hours per week; total 270 hours.

Phase 2 – 4 weeks of automated testing, run 24/7; total 672 hours.

Phase 3 – 8 weeks of field manual testing, run 40 hours per week; total 320 hours.

One hour of continuous testing equates to 7 hours of customer usage, so the testing includes a usage acceleration factor of 7.

The average fix delays for the three phases are 90 hours, 90 hours and 180 hours respectively (fix time = the delay between discovering a failure mode and the corrective action being incorporated into the design).

Taking usage acceleration into account, the cumulative test times for the three phases are 1890 hours, 6594 hours and 8834 hours respectively.
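The cumulative equivalent test times quoted above can be reproduced directly:

```python
phase_hours = [6 * 45, 4 * 7 * 24, 8 * 40]  # 270, 672 and 320 test hours
accel = 7                                    # usage acceleration factor

cumulative, total = [], 0
for hours in phase_hours:
    total += hours * accel  # convert test hours to equivalent customer hours
    cumulative.append(total)
print(cumulative)  # [1890, 6594, 8834]
```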

Page 95: Software reliability engineering

3-Parameter Crow-Extended Model Example (2 of 11)

Customer reliability target = 2 failures per year.

Usage duty cycle = 0.1428.

Therefore for continuous usage, the reliability target is 2 failures every 1251 hrs (0.1428 × 8760 hrs ≈ 1251 hrs).

Equivalent test time = 8834 hrs.

Failure intensity: 2 / 1251 = 0.0016 failures/hr.

Required MTBF: 1 / 0.0016 = 625 hrs.
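The conversion above can be checked in a few lines (assuming the duty cycle is the fraction of the 8760 calendar hours in a year that the product is in use):

```python
HOURS_PER_YEAR = 8760
duty_cycle = 0.1428
failures_per_year = 2  # customer reliability target

usage_hours = HOURS_PER_YEAR * duty_cycle        # operating hours per year
failure_intensity = failures_per_year / usage_hours
mtbf = 1 / failure_intensity

print(round(usage_hours), round(failure_intensity, 4), round(mtbf))  # 1251 0.0016 625
```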

Page 96: Software reliability engineering

3-Parameter Crow-Extended Model Example (3 of 11)

The growth potential design margin = 1.3. This is the amount by which the MTBF target should exceed the requirement, to provide margin; the higher the GP margin, the lower the risk to the program. This is an initial estimate based on prior experience.

Average effectiveness factor = 0.5 (1.0 = a perfect fix, 0 = an inadequate fix). This is also an initial estimate based on prior experience.

Management strategy – address at least 90% of all unique failure modes prior to formal release.

Discovery beta parameter = 0.7. This is the rate at which new, distinct failure modes are discovered during testing; again, an initial estimate based on prior experience. A discovery beta of less than 1 indicates that the inter-arrival times between unique B modes are increasing. This is desirable, as it is assumed that most failures will be identified early on, and their inter-arrival times will become larger as the test progresses. The actual discovery beta can be obtained from the final results, allowing this parameter estimation to be refined with testing experience.

Page 97: Software reliability engineering

3-Parameter Crow-Extended Model Example (4 of 11)

Based on the assumptions on the previous slide, an overall growth planning model can be created that shows the nominal and actual idealized growth curves and the planned growth MTBF for each phase.

A growth planning folio is created and 1890, 6594 and 8834 are entered for the Cumulative Phase Times, with 630, 630 and 1260 for the Average Phase Delays.

Note that the inter-phase average fix delays have been multiplied by 7 to take the usage acceleration factor into account.

Page 98: Software reliability engineering

3-Parameter Crow-Extended Model Example (5 of 11)

The project parameters are input into the Planning Calculations window.

Given the MTBF target and design margin that have been specified, along with the other inputs required to describe the planned reliability growth management strategy, the final MTBF that can be achieved is calculated, along with other useful results. Here it is verified that 625 hours is achievable (if it were not achievable, a figure of less than 625 hours would be calculated).

Page 99: Software reliability engineering

3-Parameter Crow-Extended Model Example (6 of 11)

A growth planning plot can then be obtained. This plot displays the MTBF vs. Time values for the three phases that have been planned for the test, marking the planned MTBF at the end of each of phases 1, 2 and 3.

Effectiveness Factors for all BD modes are specified, together with when they are to be implemented.

Page 100: Software reliability engineering

3-Parameter Crow-Extended Model Example (7 of 11)

Test failure data is collected during the three phases. From this data the actual discovery beta is obtained (the original estimate was 0.7).

Page 101: Software reliability engineering

3-Parameter Crow-Extended Model Example (8 of 11)

The growth potential MTBF plot can now be obtained. It shows:

- Demonstrated MTBF (MTBF at end of test without corrective actions).

- Instantaneous MTBF.

- Projected MTBF (estimated MTBF after delayed corrective actions have been implemented).

- Growth potential MTBF (maximum achievable MTBF based on the current strategy).

If the MTBF goal is higher than the Growth Potential line then the current design cannot achieve the desired goal and a redesign or change of goals may be required. For this example, the goal MTBF of 650 hours is well within the growth potential and is expected to be achieved after the implementation of the delayed BD fixes.

Page 102: Software reliability engineering

3-Parameter Crow-Extended Model Example (9 of 11)

An Average Failure Mode Strategy plot breaks down the average failure intensity of the software into categories:

A modes – 13.432%.

BC modes addressed – 19.281%.

BC modes still undetected – 13.76%.

BD modes removed – 25.893%.

BD modes remaining (because corrective actions were <100% effective) – 5.813%.

BD modes still undetected – 21.882%.

Page 103: Software reliability engineering

3-Parameter Crow-Extended Model Example (10 of 11)

Individual Mode MTBF plot, showing the MTBF of each individual failure mode and thus enabling the failure modes with the lowest MTBF to be identified.

Blue = Failure mode MTBF before corrective action.

Green = Failure mode MTBF after corrective action.

Page 104: Software reliability engineering

3-Parameter Crow-Extended Model Example (11 of 11)

The RGA Quick Calculation Pad indicates that the discovery rate of new, unseen BD modes at 630 hours is 0.0006 per hour.

The beta bounds are less than 1, indicating that there is still growth in the system (think of this as the leading-edge slope of the bathtub curve; when beta = 1 there is no more growth potential).

Page 105: Software reliability engineering

RELIABILITY DEMONSTRATION

Demonstration that a minimum software reliability has been achieved

Page 106: Software reliability engineering

Reliability Demonstration Testing (1 of 2)

There can be occasions when the actual software reliability has to be measured through a practical demonstration rather than estimated through a reliability growth program. However, this is more applicable where all known faults have been removed and the software is considered to be stable.

This can be achieved through sequential sampling theory.

If the reliability has already been established by conducting a reliability growth program then there may be little value in conducting this test; it is more suitable for situations where a reliability growth program has not been conducted.

Page 107: Software reliability engineering

Reliability Demonstration Testing (2 of 2)

A project-specific chart depends on:

Discrimination Ratio – the error in the failure intensity estimation that is considered to be acceptable.

Consumer Risk Level – the probability of falsely claiming the failure intensity objective has been met when it has not.

Supplier Risk Level – the probability of falsely claiming the failure intensity objective has not been met when it has.

Common values are:

Discrimination Ratio: 2.

Consumer Risk Level: 0.1 (10%).

Supplier Risk Level: 0.1 (10%).

Page 108: Software reliability engineering

Example

Requirement: 4 failures/million operations.

To obtain normalized units, multiply the failure times by the requirement target:

Failure | Million operations | Normalized units
1       | 0.4                | 1.6
2       | 0.625              | 2.5
3       | 1.2                | 4.8

The software can be accepted after failure 3, with 90% confidence that it is within the reliability target and a 10% risk that it is not. The boundary has to be crossed, though.

Page 109: Software reliability engineering

Reliability Demonstration Test Chart Design (1 of 2)

What if the software is still in the Continue region at the end of the test?

Assume that the end of the test is reached just after failure 2.

Option 1 – Calculate the Failure Intensity Objective that has actually been demonstrated:

Factor = F_CURRENT / F_PREVIOUS = 3.6 / 2.5 = 1.44

∴ FIO = 1.44 × 4 = 5.76 failures/million operations

Grouped data CANNOT be used; the data has to be obtained from individual units.

Option 2 – Extend the test time by ≥ the factor (i.e., ×1.44).

Page 110: Software reliability engineering

Reliability Demonstration Test Chart Design (2 of 2)

The following formulae are used to design RDT charts:

Accept-Continue Boundary: TN = (A − n·ln γ) / (1 − γ)

Reject-Continue Boundary: TN = (B − n·ln γ) / (1 − γ)

Where:

TN: Normalized measure of when failures occur (horizontal coordinate).

n: Failure number.

γ: Discrimination ratio (ratio of the maximum acceptable failure intensity to the failure intensity objective).

A and B are defined from:

A = ln(β / (1 − α))    B = ln((1 − β) / α)

Where:

α: Supplier risk (probability of falsely claiming the objective is not met when it is).

β: Consumer risk (probability of falsely claiming the objective is met when it is not).
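These boundary constants can be exercised against the earlier example (requirement 4 failures/million operations, failures at normalized units 1.6, 2.5 and 4.8) with the common values α = β = 0.1 and γ = 2. The sequential decision logic below is an illustrative sketch.

```python
from math import log

alpha, beta_risk, gamma = 0.1, 0.1, 2.0
A = log(beta_risk / (1 - alpha))  # accept-boundary constant (negative)
B = log((1 - beta_risk) / alpha)  # reject-boundary constant (positive)

def decide(n, t_norm):
    """Sequential decision after failure n at normalized time t_norm."""
    accept_at = (A - n * log(gamma)) / (1 - gamma)
    reject_at = (B - n * log(gamma)) / (1 - gamma)
    if t_norm >= accept_at:
        return "accept"
    if t_norm <= reject_at:
        return "reject"
    return "continue"

for n, t in enumerate([1.6, 2.5, 4.8], start=1):
    print(n, decide(n, t))  # continue, continue, accept
```

As in the example chart, the test continues through failures 1 and 2 and the software is accepted after failure 3, where the normalized time 4.8 crosses the accept boundary.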

Page 111: Software reliability engineering

Reliability Demonstration Test Chart Design Example

In this example n = 16 (the chart is drawn up to failure number 16).

The boundary intersections with the x and y axes can be calculated using the following formulae:

Accept boundary: intersects the x-axis at (A/(1 − γ), 0) and ends at ((A − n·ln γ)/(1 − γ), n).

Reject boundary: intersects the y-axis at (0, B/ln γ) and ends at ((B − n·ln γ)/(1 − γ), n).

Page 112: Software reliability engineering

SRE Review

Enables defect discovery rates to be forecast and monitored – this helps all staff and enables customer expectations to be managed.

Software FMEA enables failure modes and risks to be identified.

Enables reliability targets to be established and monitored.

Establishes formal and thorough test and analysis methodologies.

Provides a method for modeling and demonstrating software reliability.

Defines code inspection processes.

Guarantees customer satisfaction!

Page 113: Software reliability engineering

References

Adamantios Mettas, "Repairable Systems: Data Analysis and Modeling," Applied Reliability Symposium, 2008.

Michael R. Lyu, "Software Reliability Engineering: A Roadmap."

Dr. Larry Crow, "An Extended Reliability Growth Model for Managing and Assessing Corrective Actions," Reliability and Maintainability Symposium, 2004.

John D. Musa, "Software Reliability Engineering: More Reliable Software Faster and Cheaper," AuthorHouse, 2004.

ReliaSoft RGA 7 Training Guide.

Capers Jones, "Applied Software Measurement," 3rd Edition, McGraw-Hill, 2008.