software reliability for engineers - j.k.orr 2015-09-23

Copyright 2015 By James K. Orr

Software Reliability For Engineers

James K. Orr, Independent Consultant

[email protected]

9/23/2015


Introduction

• This presentation describes a very simplified approach to computing software reliability: an engineering approach as opposed to a complex statistical approach.

• This approach evolved from analysis of the Space Shuttle Primary Avionics Software System (PASS), the software that controlled the Space Shuttle from pre-launch through ascent, on-orbit operations, and entry to landing.

• The approach may be limited to similar systems (large-scale critical software with relatively few users).

• If you would like assistance in applying this method, please contact me at [email protected].


Contents

• Introduction

• Contents

• Evolution of Space Shuttle PASS Alternate Reliability Model in 1989

• Generalized Approach For Software Reliability With Examples, Equations, and Simulation

• Sample Results From Space Shuttle PASS Reliability Analysis

• References


Evolution of Space Shuttle PASS Alternate Reliability Model in 1989


Requirement Reliability Prediction

• Following the loss of the Space Shuttle Challenger and crew in 1986, IBM Federal Systems Division – Houston, as the developer of the Space Shuttle Primary Avionics Software System, was assigned a “return to flight” action to model the software reliability with respect to “loss of vehicle and crew” latent errors (defects).

• This was a two-step process. First, compute software reliability (time to next failure). Second, model the probability that a failure occurring during flight would be caused by a “loss of vehicle and crew” latent error (defect).

• Discussion in this paper focuses on the first activity: computing software reliability (time to next failure).


“Professional” Approach

• IBM Federal Systems Division – Houston contacted multiple experts in software reliability. Ultimately, N. F. Schneidewind and his “SMERFS” software reliability estimation tool were selected to model the Space Shuttle PASS reliability.

• See Reference 1 for one paper that documents the results of this work. The link given with Reference 1 also leads to a full list of papers and other publications by Dr. Schneidewind.


Motivating An “Engineering” Approach

• During this time (1986 – 1989), I was working as senior technical staff at IBM Federal Systems Division – Houston. Roles included:

– Project Coordination and Technical Leadership, 1984-1988. Led initiatives to support a high flight rate in the period leading up to the loss of Space Shuttle Challenger in January 1986. Oversaw initiatives to implement mandatory changes to the on-board Shuttle software (PASS) prior to return to flight in September 1988. Earned IBM's highest technical achievement award, an outstanding achievement award, for shuttle software engineering, development, and verification technical leadership.

– Member of the IBM/NASA Shuttle Flight Software (PASS) Discrepancy Review Board, 1981-1992. Maintained the rigor of the Discrepancy Review Board process, ensuring identification and correction of process escapes and identification and correction of similar errors due to prior process deficiencies found by audits.

• In these roles, I reviewed the results produced by IBM and the “SMERFS” software reliability estimation tool. A key part of the process was to separate software into “layers” based on the development cycle for each release of PASS flight software. In comparing the failures (Flight Software Discrepancies) by development cycle to data being processed by the IBM/NASA Shuttle Flight Software (PASS) Discrepancy Review Board, I observed significant differences in time to failure by release.


“Engineering” Approach

• FROM NOTES DATED 04/16/1990 (WITH ADDED HISTORICAL INSIGHT)

– Analysis was done by “eyeballing” the time between failures for recent releases. “Engineering judgment” was used to make a rough estimate of the time to next failure by release. This was compared to the values produced by “SMERFS” as well as by the prototype of the Alternate Reliability Model.

• Data on the next page has been updated with the actual time to next failure as of 03/14/2007.


Evaluation Of “SMERFS”

Operational    Engineering Judgment    “SMERFS” 09/30/89    Alternate Reliability    Next Failure After 12/89
Increment      Time to Next Failure    MTTF (Days)          Model 12/01/89           Actual MTTF (Days)
               (Days)                                       MTTF (Days)              Added 03/14/2007

4              1500                    167                  970                      2262
5              700                     164                  729                      2746
6              600                     146                  539                      203
7              800                     291                  864                      327
7C             1000                    466                  1458                     4484
8A             700                     455                  1461                     2958
8B             350                     256                  351                      6393
8C             180                     420                  143                      185
Composite      63                      30                   60                       67
(Combine All
Above)

See Reference 2, page 16, to identify the time frame for Operational Increments. See pages 43 – 53 for significant process improvements applied for Operational Increments OI-8A, OI-8B, and OI-8C.
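
The presentation does not state how the “Composite” row is formed. One reading that reproduces all four tabulated composites to the nearest day is to add the failure rates (1 / MTTF) of the individual Operational Increments and invert the sum. The sketch below assumes that reading; the function name `composite_mttf` is illustrative and not part of the original tool.

```python
def composite_mttf(mttf_days):
    """Combine per-increment MTTFs by summing failure rates (1 / MTTF).

    Assumption: increments fail independently, so their rates add.
    """
    total_rate = sum(1.0 / m for m in mttf_days)
    return 1.0 / total_rate


# Per-increment values in days, in table order: OI-4, 5, 6, 7, 7C, 8A, 8B, 8C.
engineering = [1500, 700, 600, 800, 1000, 700, 350, 180]
smerfs = [167, 164, 146, 291, 466, 455, 256, 420]
alternate = [970, 729, 539, 864, 1458, 1461, 351, 143]
actual = [2262, 2746, 203, 327, 4484, 2958, 6393, 185]

for name, column in [("Engineering judgment", engineering),
                     ("SMERFS", smerfs),
                     ("Alternate Reliability Model", alternate),
                     ("Actual", actual)]:
    print(f"{name}: {composite_mttf(column):.1f} days")
```

Rounded to whole days, these agree with the composite row (63, 30, 60, and 67 days).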



Evaluation Of “SMERFS”

• The alternate model was developed to better match the engineering judgment values. In hindsight, looking back after 17-plus years (in 2007), the engineering judgment was the most accurate (6 % conservative), followed by the Alternate Reliability Model (10 % conservative). SMERFS was also conservative, but in error by 55 % of the actual results.

• RATIONALE FOR “ALTERNATE RELIABILITY MODEL”, from notes dated 04/16/1990

– First, subtle differences existed between the predicted time between failures using “SMERFS” (Statistical Modeling and Estimation of Reliability Functions for Software) and the actual data. The key difference that was unacceptable was the skew in the probability of the next error occurring on older OIs (for example, OI-4) rather than on recent OIs (for example, OI-8C). Actual data showed the opposite trend.

– Second, “SMERFS” required significant historical data before producing accurate results, making it inappropriate for predicting in advance the reliability of unreleased systems.


Effects Of Process Improvement

• Candidate reasons for the mismatch between “SMERFS” and reality: see Reference 3, page 9, which shows a very large spike in product error rate for Operational Increments OI-1 and OI-2. See page 16 for tabular data.

• Continual process improvements through OI-8C may have accounted for the error in the “SMERFS” predictions.

[Figure: chart with releases labeled STS-1, STS-2, STS-5, OI-1 through OI-7, OI-8B, and OI-8C, annotated “OI-25 Process Issues During Transition From IBM To Loral”.]


Summary of the method

• The Space Shuttle “Alternate Reliability Model” program was developed for the Space Shuttle Primary Avionics Software System, which is a human-flight-rated system of approximately 450,000 source lines of code (excluding comment lines). Operational Increment release development over 15 years has demonstrated that the reliability characteristics per unit of changed code for each release are very consistent, with variations explainable by process deviations or other special causes of variation.


Summary of the method

• The Space Shuttle “Alternate Reliability Model” program computes software reliability estimates in complex software systems even as the reliability characteristics change over time.

• The method and tool work in two independent modes.

– First, when failure data is available, the tool estimates two model coefficients for each grouping of software being analyzed. These two model coefficients can then be used to calculate the software reliability characteristics of each grouping of software, and also the total software reliability for all groupings combined.

– Second, the two model coefficients are also normalized based on relative size. In appropriate circumstances (e.g., the software is produced with essentially the same equivalent quality process), estimates of software reliability can be made prior to any failures occurring, based on the relative size of the software.

• Once the two model coefficients are determined, reliability and failure information over a user-defined time interval can be computed.


Required Inputs Mode # 1

Mode # 1 (use actual failure data to compute reliability)

• Define software as “uniform layers.” These layers represent whatever characteristic is to be modeled. In the Space Shuttle Primary Avionics Software System context, each layer represents all new/changed software delivered by each release. In the Constellation context, layers could be broken down by function and criticality of the software.

• Relative size measure of each layer. In the Space Shuttle Primary Avionics Software System context, relative size is defined by new/changed source lines of code (SLOCs). In the Constellation context, relative size could be based on number of requirements, function points, or any other measure desired. Correlation between the Space Shuttle Primary Avionics Software System and Constellation could be performed by comparing the relative functional size of a software function, expressed both in PASS SLOCs and in the Constellation size parameter of choice.

• Data on each failure:

– Date of failure

– “Layer” of software that was the source of the failure


Required Inputs Mode # 2

Mode # 2 (use relative size and historical data to compute reliability)

• Define software as “uniform layers.” These layers could represent whatever characteristic is to be modeled.

• Relative size measure of each layer.

• Expected relative quality level compared to historical data (could be subjective).

All (produce reliability calculations)

• Date or date range. Typically, this would correspond to (a) the date of flight at which you want a Mean Time To Failure, or (b) any range of dates over which you want to determine the expected number of failures (expressed as a scalar, which in some contexts would represent the likelihood of a failure in that interval).
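
The inputs listed for the two modes can be captured in a small record per layer. The following is only a sketch of one possible layout: the class name `Layer`, its field names, and the example dates are hypothetical and do not come from the original tool.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Layer:
    """One "uniform layer" of software (hypothetical field names)."""
    name: str
    release_date: date              # e.g. Configuration Inspection date
    relative_size: float            # e.g. new/changed KSLOC
    failure_dates: List[date] = field(default_factory=list)  # Mode 1 input
    relative_quality: float = 1.0   # Mode 2 input, may be subjective

    def failure_times_days(self) -> List[int]:
        """Failure times as days after release, sorted ascending."""
        return sorted((d - self.release_date).days for d in self.failure_dates)


# Made-up layer with two failures, for illustration only.
example = Layer("example-layer", date(2000, 1, 1), relative_size=10.0,
                failure_dates=[date(2001, 1, 1), date(2000, 6, 1)])
print(example.failure_times_days())
```

Converting calendar dates to days after release up front keeps the later calculations in the same time variable t that the model equations use.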



MATHEMATICAL BASIS “ALTERNATE RELIABILITY MODEL”

The expected number of failures at any time:

• X = K_layer ln(t) - K_layer ln(tref) for t > tref

• Where X = number of software failures

• K_layer = a single constant that characterizes the grouping of software

• tref = a reference time in days shortly after release (Configuration Inspection date), typically on the order of 90 days. 90 days was selected as the time normally needed to reconfigure a system for flight and begin its use.

– The 90 days makes operational sense in the Space Shuttle Program Primary Avionics Software System (PASS) environment.

– The 90 days avoids a lot of mathematical issues as t takes on small values, ultimately approaching 0.

• t = time in days after release (Configuration Inspection date). t varies from approximately 90 days up to approximately 10,000 days for Space Shuttle Primary Avionics Software System (PASS) data.

• For every pair of successive failures, a value of K can be computed. Values computed for each pair of successive failures may vary by a factor on the order of 100.

– K failures N to N + 1 = 1 / ( ln( t at failure N + 1 ) - ln( t at failure N ) )

– In the Space Shuttle Primary Avionics Software System data, failures are sometimes reported on the same day. Mathematically, the above equation does not work for this situation. The approach adopted was to treat all failures occurring within 12 days (a value that evolved through multiple iterations) in one K calculation.

– If two failures occur within one 12 day interval:

• K failures N to N + 2 = 2 / ( ln( t at failure N + 2 ) - ln( t at failure N ) )

– If three failures occur within one 12 day interval:

• K failures N to N + 3 = 3 / ( ln( t at failure N + 3 ) - ln( t at failure N ) )

– Etc.

• The above calculations give a series of K terms, each associated with a time interval. A single value of K_layer for each “layer” (set of released changes) is calculated by weighting each K term by its associated delta time interval. Note the equation below is simplified to the case where all failures occur more than 12 days apart. Note also that the method assumes a failure at the current date for each layer as the calculations are performed, to ensure a conservative estimate is produced.
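
The steps above can be sketched in code. This is an illustration under stated assumptions, not the original program: the function names are hypothetical, the first interval is assumed to start at tref, failures at or before tref are ignored, a group is taken to be all failures within 12 days of the first failure in that group, and K_layer is taken to be the delta-time-weighted mean of the interval K values.

```python
import math


def group_failures(times, window=12.0):
    """Merge failures within `window` days of the first failure in a group.

    Returns (time of last failure in group, number of failures) pairs.
    """
    groups = []
    group_start = None
    for t in sorted(times):
        if groups and t - group_start <= window:
            groups[-1] = (t, groups[-1][1] + 1)
        else:
            groups.append((t, 1))
            group_start = t
    return groups


def estimate_k_layer(failure_times, t_now, t_ref=90.0, window=12.0):
    """Delta-time-weighted average of interval K values for one layer."""
    if t_now <= t_ref:
        raise ValueError("t_now must be later than t_ref")
    points = group_failures([t for t in failure_times if t > t_ref], window)
    points.append((t_now, 1))  # conservative: assume a failure at the analysis date
    prev = t_ref               # assumption: first interval starts at t_ref
    weighted_sum = 0.0
    total_time = 0.0
    for t, count in points:
        if t <= prev:
            continue  # zero-length interval, e.g. analysis on a failure date
        k_interval = count / (math.log(t) - math.log(prev))
        weighted_sum += k_interval * (t - prev)
        total_time += t - prev
        prev = t
    return weighted_sum / total_time


# Synthetic "ideal" failure times generated from the model with K = 2.
ideal = [90.0 * math.exp(i / 2.0) for i in range(1, 5)]
print(estimate_k_layer(ideal, t_now=ideal[-1]))
```

For the synthetic ideal data every interval yields the same K, so the weighted average comes out very close to 2. With real data the interval values scatter widely, which is why the time weighting matters.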



MATHEMATICAL BASIS “ALTERNATE RELIABILITY MODEL”

• Standard deviation is computed directly from all of the computed F_factor values.

• Normalized standard deviation (SD_factor) is computed by dividing the standard deviation by the composite final F_factor.


Illustrations – Ideal and Noisy Data

Sample Calculations – Ideal Data

Sample Calculations – Noise 1 Data

Sample Calculations – Noise 2 Data


Summary Noise Calculations

• Ideal Data uses integer dates chosen as close as possible to those that produce exactly even integer failure counts from the model equations.

• Noise 1 Data uses random variance in dates to produce a Standard Deviation in F_factor of about 17 %.

• Noise 2 Data uses random variance in dates to produce a Standard Deviation in F_factor of about 28 %.

• The above are simply samples; they have no other significance.


Generalized Approach For Software Reliability With Examples, Equations, and Simulation


Key Issues

• Must separate failures during development from failures in post-release operations.

• Ideally, separate failures from post-release operations by the completion date of each release's content.

• Selection of tref is critical in that it must not be near 0. Zero time is normally when verification testing is completed.

– Based on Space Shuttle PASS experience, a value of 90 days is recommended. This was the time from when verification of an Operational Increment was completed until a flight-specific reconfigured release was available to field users (crew training, Shuttle Avionics Integration Laboratory testing).

– Alternately, the time at which the first failure occurs could also be used (if significantly greater than 0).

• The method does not work when failures that occur on the same day, or very, very close together, are treated as single failures.

– The Space Shuttle PASS engineering solution was to group all failures occurring within 12 days into a single calculation with N failures between the two time points.
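
To see why tref must not be near 0, the model equation X = K_layer ln(t) - K_layer ln(tref) can be evaluated for shrinking values of tref. The sketch below uses a hypothetical coefficient of 1.0 and an arbitrary evaluation time of 1000 days; neither value comes from PASS data.

```python
import math


def expected_failures(k_layer, t, t_ref=90.0):
    """X = K_layer * (ln(t) - ln(t_ref)), defined for t > t_ref > 0."""
    if t_ref <= 0 or t <= t_ref:
        raise ValueError("model requires t > t_ref > 0")
    return k_layer * (math.log(t) - math.log(t_ref))


K_EXAMPLE = 1.0   # hypothetical coefficient, for illustration only
T_DAYS = 1000.0   # arbitrary evaluation time in days after release

for t_ref in (90.0, 10.0, 1.0, 0.01):
    x = expected_failures(K_EXAMPLE, T_DAYS, t_ref)
    print(f"t_ref = {t_ref:>6} days -> expected failures = {x:.2f}")
```

Because ln(tref) decreases without bound as tref approaches 0, the predicted failure count grows without limit for the same K_layer, so a tref near zero makes the estimate meaningless.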



Test If This Approach Is Valid

• This model may or may not work for any specific system and set of failure data.

• The most direct test is to plot failures versus time from release verification completion, with time on a logarithmic scale.

– If this plot is approximately linear, then this approach (PASS “Alternate Reliability Model”) is valid.

– If there are failures at delta times very near 0, these should be ignored for modeling purposes.
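
The plot test can also be approximated numerically by fitting a straight line to cumulative failure count versus ln(t) and inspecting the coefficient of determination. This is a swapped-in technique (ordinary least squares), not something the presentation prescribes, and the function name `log_linearity` is hypothetical. What counts as “approximately linear” is left to the analyst.

```python
import math


def log_linearity(failure_times):
    """Least-squares fit of cumulative failure count against ln(t).

    Returns (slope, r_squared). Needs at least two distinct failure times.
    Under the model, the slope is an estimate of K_layer.
    """
    xs = [math.log(t) for t in sorted(failure_times)]
    ys = list(range(1, len(xs) + 1))  # cumulative failure count
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, r_squared


# Synthetic failure times generated from the model with K = 2, tref = 90.
synthetic = [90.0 * math.exp(i / 2.0) for i in range(1, 7)]
slope, r_squared = log_linearity(synthetic)
print(f"slope = {slope:.3f}, R^2 = {r_squared:.3f}")
```

An R^2 close to 1 corresponds to the nearly straight line described above; a visibly curved plot shows up as a lower value.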



Effect Of Not Isolating Each Release



Extreme Samples Of Model Equations

• Sample 1 has random variation in dates for failures, plus assumes two failures at second failure point to demonstrate multiple failures within a short time period (typically within 12 days). Uses same dates as Noise 2 Sample for first four failures only.

• Sample 2 has random variation in dates for failures. Uses same dates as Noise 2 Sample for first four failures only.

• Sample 3 has random variation in dates for failures. Uses same dates as Noise 1 Sample for first four failures only.



Sample 1 Compute K_factor and SD_factor

Compute K_factor and SD_factor

Sample 3 Compute Failures Versus Time

Sample Results From Space Shuttle PASS Reliability Analysis


Estimate K_factor

• The following four charts illustrate how K_factor can be predicted from other software metrics such as Product Error Rate (Product Errors per 1000 new/changed source lines of code).

• Data is shown from OI-20 (released in 1990) to OI-30 (released in 2003). These were large releases with 7 to 20 years of service life. A relatively stable software development and verification process was used, except for OI-25 (see Reference 2 for more information).

• Page 40 shows Product Error Rate data from Reference 3. Page 41 shows PASS K_factor per 1000 new/changed source lines of code from my personal notes.

• Page 42 tabulates key information. Page 43 plots the relationship between K_factor per KSLOC and Product Error Rate (Product Errors per KSLOC).

• This relationship could be used to estimate the reliability of a future system if an estimate of Product Error Rate is known based on prior process performance.

– K_factor = (K_factor per KSLOC as a function of Product Error Rate) * KSLOC of system
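
The scaling rule above can be written as a one-line function once a relationship between K_factor per KSLOC and Product Error Rate is available. The sketch below takes that relationship as a caller-supplied function. The linear form and its slope of 0.5 are placeholders invented purely for illustration; they are not taken from the PASS charts and should be replaced with a relationship fitted to real data.

```python
def predict_k_factor(product_error_rate, ksloc, k_per_ksloc_fn):
    """K_factor = (K_factor per KSLOC at this Product Error Rate) * KSLOC."""
    return k_per_ksloc_fn(product_error_rate) * ksloc


def placeholder_k_per_ksloc(product_error_rate):
    """Hypothetical linear relationship; the 0.5 slope is NOT from PASS data."""
    return 0.5 * product_error_rate


# Made-up inputs: 0.2 product errors per KSLOC, 10 KSLOC of new/changed code.
print(predict_k_factor(0.2, 10.0, placeholder_k_per_ksloc))
```

Passing the relationship in as a function keeps the size scaling separate from whatever curve is eventually fitted to the Product Error Rate data.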



Reference 3, Page 16

Focus On AP-101S (upgraded General Purpose Computer) Major Releases



PASS “Alternate Reliability Model” K_factor

[Figure: K_factor per 1000 new/changed source lines of code for releases OI-20, OI-21, OI-22, OI-23, OI-24, OI-25, OI-26, OI-26B, OI-27, OI-28, OI-29, and OI-30.]


Prediction of K_factor



Discussion Of Results For PASS

• Alternate Reliability Model coefficients were computed for OI-3 and later systems using the post flight failures. For OI-30, OI-32, OI-33, and OI-34, the calculated values were adjusted because the assumption of an additional failure on the day of the analysis gave unrealistically high values. The coefficients were adjusted to a value per unit of size (1000 uncommented new/changed source lines of HAL/S code, or KSLOC) that was consistent with other similar recent OIs.

• For Release 16 (STS-1) to OI-2, failure data exists only for the combined releases, not separately. Alternate Reliability Model coefficients were computed by comparing failures per year for the combined releases to Alternate Reliability Model output for assumed values of the coefficients. The coefficients were derived based on a constant value per unit of size (KSLOC). Additional unique analysis produced the Alternate Reliability Model coefficient showing the variability of the predicted failures per year.

• Analysis focused on flown systems. Data from Operational Increments that were not flown was combined with the later flown increment. As an example, failures and KSLOCs from OI-7C and OI-8A are included in the calculation of Alternate Reliability Model coefficients for OI-8B. For simplicity, data from OI-8F was combined under OI-20 even though OI-8F supported flights, due to the small size of OI-8F and its unique nature: OI-8F made operating system changes to support the AP-101S General Purpose Computer upgrade.


Modeled Reliability Compared to Actual Flight Data, STS-1 to OI-2

Modeled Reliability Compared to Actual Flight Data, OI-3 to OI-34


References

1. Schneidewind, N. F. and Keller, T. W. 1992. "Application of Reliability Models to the Space Shuttle," IEEE Software, July 1992, pp. 28-33.

– See list of papers by N. F. Schneidewind at

• http://faculty.nps.edu/vitae/cgi-bin/vita.cgi?p=display_more&id=1023567911&field=pubs

2. James K. Orr, Daryl Peltier, Space Shuttle Program Primary Avionics Software System (PASS) Success Legacy - Major Accomplishments and Lessons Learned Detail Historical Timeline Analysis, August 24, 2010, NASA JSC-CN-21350, presented at NASA-Contractors Chief Engineers Council 3-day meeting August 24-26, 2010, in Montreal, Canada.

– Free at

• http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/20100028293.pdf

3. James K. Orr, Daryl Peltier, Space Shuttle Program Primary Avionics Software System (PASS) Success Legacy – Quality & Reliability Data, August 24, 2010, NASA JSC-CN-21317, presented at NASA-Contractors Chief Engineers Council 3-day meeting August 24-26, 2010, in Montreal, Canada.

– Free at

• http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/20100029536.pdf
