some successful approaches to software reliability modeling in … · 2018-12-21 · lucent...

1

Lucent TechnologiesBell Labs Innovations

Some Successful Approaches to Software Reliability Modeling in Industry

Daniel R. Jeske and Xuemei Zhang Bell Laboratories, Lucent Technologies

Holmdel, NJ

1. Introduction and Context2. Software Reliability Growth Models3. Architecture-Based Software Reliability Models4. Case Studies5. Summary

ASA Conference on Quality and ProductivityMay 21, 2003

2


When are Software Reliability Models Typically Applied?

Software Reliability Growth Models......reliability is quantified and influences the release decision

Unit Testing

SoftwareDevelopment

IntegrationTesting

SystemTesting

GeneralAvailability

RequirementsFormulation

ArchitectureDesign

Architecture-Based Reliability Models.....decisions are being made as to how to design reliability into the system

FirstOffice

Application

Software Reliability Growth Models......reliability predictions are verified......parameters useful for modeling reliability of next release are estimated

3


Software Reliability Growth Models (SRGMs)

4


SRGMs: Questions to Answer

•What would the user-perceived failure rate of the software be ifif the software was to be released now?

•How much more testing is needed to achieve the reliability targets?

•How many faults exist in the code at the end of system test?

5


Goel-Okumoto Model

Cumulative Test Time (days)

CumulativeFailures

T

( ) (1 )d tM t c e−= −( ) ( )

dT

T M T

cde

λ−

′==

Failure Rate Estimate

x

x

xx

A Well-Known SRGM

= initial number of faults = average per-fault failure rate

= cumulative test time, aggregated across all installations

= current value of

cdt

T t

6


The expected number of faults detected in a time interval is proportional to the number of faults remaining in the software at time t.

Thus,

or equivalently,

implying,

Goel and Okumoto further assume that the number of faults observed in (0 , t) is Poisson{ M(t) }.

),( ttt ∆+

( ) ( ) [ ( )]M t t M t d c M t t+ ∆ − = − ∆

( ) (1 )dtM t c e −= −

( ) ( )M t dM t cd′ + =

Motivation for Goel-Okumoto Model

7


( )1 d t

dd teβ −=

+

Cumulative Test Time (days)

CumulativeFailures

T

x

x x

x

1( ) (1 )(1 )

dtdtM t c e

eβ−

−= × −+

S-shaped Model

8


Challenges With Applications

Traditional SRGMs are applied to system test data with the hope of obtaining an estimate of software failure rate that will be observed in the field. The following issues have to be taken into consideration:

• The usage profile in the field is typically very different than the testing profile

• Fault removal in field environments is not instantaneous

• Quality of software reliability data

9


G-O Model

( ) [1 ]

( )

dt

dt

M t c e

t cdeλ

−

−

= −=

where: c initial number of faults at time of field deploymentd average per fault failure rate in a field environmentp probability that a detected fault is successfully removed

( ) [1 ]

( )

dpt

dpt

cM t ep

t cdeλ

−

−

= −

=

G-O Model with Imperfect Debugging

Non-Instantaneous Fault Removal TimesIn Field Environments

10


1 1 n dp

µ= +

•Perfect debugging assumption is an acceptable assumption (based on thorough regression tests)

•We relate non-instantaneous fault removal to imperfect debugging

Under the imperfect debugging model:

E(# of occurrences of each fault)= 1/p

Under the situation of non-instantaneous fault removals, if µ denotes the mean time to remove a fault and there are n systemsin the field

E(# of occurrences of each fault)= 1+ nµd


11



1

1

( ) (1 )[1 ]

( )

d tn d

d tn d

M t c n d e

t cde

µ

µ

µ

λ

− ×+

− ×+

= + −

=

where: c initial number of faults at time of field deployment d average per fault failure rate in the field environmentµ average time to remove a fault (µ = 0 gives back the G-O model)n number of systems in the fieldt cumulative exposure time, aggregated across all installations

12


Field Failure Rate Prediction if Testing Profile Matches the Field Usage Profile

The number of initial faults at time of field deployment time is the same as the number of residual faults after testing has completed

The average per fault failure rate in the field environment is identical to the average per fault failure rate in the test environment

For the field failure rate model, we can use the G-O model replacing with and adjusting for non-instantaneous removal times

1( ) ( )d t

dT n dt ce de µλ− ×

− +=

dTce−c

13


Field Failure Rate Prediction When Testing Profile Does Not Match the Field Usage Profile

1) Testing and field usage profiles drive the average per fault failure rate. Usually thefailure rate of faults is smaller in field environments than in test environments

2) Define , where is the average per fault failure rate in the field environment

3) K is the “per fault failure rate” calibration factor

4) Estimate K from previous releases of the software, or from related projects

*K /d d=

/1 /( ) ( ) ( / )

d Kt

dT n d Kadj t ce d K e µλ

− ×− +=

*d

14


BCF - Base Station Controller FrameBTS - Base Transceiver Station MSC - Mobile Switching CenterPSTN - Public Switched Telephone Network

BCF

MSC

PSTN

BTSs

BCF Software•setup calls•tear down calls•negotiate cell handoffs•dynamically control transmit

power levels•alarm handling•overload control•equipment provisioning

Case Study: GSM Wireless System

15

Lucent TechnologiesBell Labs InnovationsObjective

Estimate the field failure rate of R3 BCF software which is currently

in system test

1) R3 Test Data in a non-operational profile environment

2) Field failure data for R1 and R2

Goals

Data Available

16


Field Failure Data for R1

System Days FailuresMonth Days Cumulative Month Cumulative

1 1,249 1,249 4 42 3,472 4,721 6 103 4,065 8,786 4 144 4,883 13,669 3 175 5,425 19,094 6 236 5,656 24,750 1 247 7,549 32,299 2 268 8,295 40,594 4 309 8,882 49,476 1 3110 6,120 55,596 0 3111 2,465 58,061 1 3212 527 58,588 1 3313 45 58,633 0 33

17


0

30

60

90

120

0 35,000 70,000 105,000 140,000 175,000

Cumulative System Days

Cum

ulat

ive

Failu

res

0

8

16

24

32

0 10,000 20,000 30,000 40,000 50,000 60,000

Cumulative System Days

Cum

ulat

ive

Failu

res 1

1

1

-5

58,633 system-days (13 months , 167 systems)45 days

ˆ 22.3 faultsˆ 0.0000746 failures/day/fault

(StdError: 3.07 10 )

T

c

d

µ====

×

R1 Field

R2 Field

Field Failure Data Analysis

2

2

2

-6

167,900 system-days (13 months , 370 systems)45 days

ˆ 114.01 faultsˆ 0.0000128 failures/day/fault

(StdError: 3.07 10 )

T

c

d

µ====

×

18

Lucent TechnologiesBell Labs InnovationsR3 Test Data

Number of Days of Each SoftwareInstallationCalendar

Week1 2 3 4 5 6 7 8 9 10 11

CumulativeSystem-Days

of TestingCumulative

Failures

1 5 5 52 4 9 63 4 13 134 5 18 135 5 5 28 226 5 33 247 5 5 43 298 5 5 5 5 63 349 5 5 5 5 5 88 4010 5 5 5 5 5 5 5 123 46

33 5 5 5 6 955 17034 5 5 6 6 977 17635 5 5 6 6 999 18036 2 1001 181

As many as 11 frames were being used in parallel during system test.

‘Failure’ is defined as a severity 1, 2 or 3 MR.

19


G-O Model Fit to the R3 Test Data

0

50

100

150

200

250

0 500 1000 1500 2000 2500

Cumulative System Days of Testing

Cum

ulat

ive

Failu

res

239-181=58

ˆˆ( ) (1 )dtcM t e −= −

1 2

2

2

ˆ 238.3 faults ˆ 0.00142 fails/day/fault

ˆ 31ˆ ˆ( ) / 2

ˆ 27.02 0.00552ˆˆ 0.00552 0.00027

c

d

dK

d d

cVar

d

=

= =+

−= −

20


0

0.5

1

1.5

2

2.5

0 0.1 0.2 0.3 0.4 0.5

Calendar Time (Years)

R3

Failu

re R

ate

R3 Field Failure Rate Prediction

0.00142 / 31365 345

1 345 45 0.00142 / 31ˆ ( ) 58 (365 0.00142 / 31)

(fails/year, after calendar-years of field exposure)

t

adj t e

t

λ− × ×

+ × ×= × × ×

95% Prediction Limits

Smoothed NonparametricEstimate of R3 Failure Rate

21


Architecture-Based Software Reliability Models

22


Questions to Answer

•What type of software redundancy, if any, is needed in order to achieve the reliability target?

•How fast do fault recovery times need to be?

•Should software processes try to restart before a failover is attempted?

•How thorough do system fault diagnostics need to be?

•What is the required software failure rate?

23


BCF - Base Station Controller FrameBTS - Base Transceiver Station MSC - Mobile Switching CenterPSTN - Public Switched Telephone Network

BCF

MSC

PSTN

BTSs

BTS Controller Software•allocate radio resources•negotiate hand-offs•manage connections between BTS and BCF

•provision and maintenance

Case Study - Continued

Alternative Architectures•Simplex, no process restart capability•Simplex, with restart capability•Active-Standby, with fail-over capability

24


0

1

2

34

6

7

812

13

15

16

179

18

10

20

19

14

5

11

Detectedrestartable SW failure

Restart Success

Restart Fails System Detects

AutoFailoverFails, System Detects

Parallel Reboot

Auto Failover SucceedsAutoFailover

Fails, System Does not Detect

Detected non-restartable SW failure

UndetectedSW failure

ManualDetection

Manual Failover Fails

ParallelReboot

Manual FailoverSuccess

ManualDetection

Restartable orNon-restartable SW failure

UndetectedMonSW failure

DetectedMonSWfailure

Restart Success

Restart Fails No Escalation

Restart Fails, System Does not Detect

ManualDetection

Restart Fails and Escalationto AutoFailover

Undetected Stdby MonSWfailure

Restartable failure on Active side

Restartable orNon-restartable

Stdby SW failureManualDetection

Restartable Failure on Active side

Non-restartableSW failure on Active side

Restart Success

ManualDetection

Restart Fails

Restart Success

Restart Fails Undetected

Restartableor Non-restartableStdby SW failure

RebootManualDetection

Restartable failure on Active side

Non-restartableSW failure on Active side

Restart Success

Parallel Reboot

Non-restartable SW failure on Active side

Reboot

Restart Fails

AutoFailoverFails

1cλ

2cλ

3(1 )c λ−

1 2(1 )( )s sc λ λ− +

3(1 ) sc λ−

m mr µ

(1 ) r sr q µ−

fpµ

Rµ

(1 ) f fp q µ−(1 )(1 )f fp q µ− −

m mfp µ(1 )m mfp µ−

Rµ

3dµ λ+

3d sµ λ+Rµ

1 2s sλ λ+

2λ

dµ

3cλ

srµ

(1 )m mr wµ−

srµ

(1 ) sr µ−

fµ

Rµ

srµ

(1 ) sr µ−

srµ

(1 ) sr µ−

2λ

dµ

(1 )(1 )r sr q µ− −

2λ

3dµ λ+

3dµ λ+

21 λλ +

))(1( 21 λλ +− c

(1 )(1 )m mr w µ− −

Rµ

21

1λMonSW failure on Active side3λ

Restart Success

m mr µ

Restart Fails

1λ

22

MonSW failure on Active side

3λRestart Success

Restart Fails

m mr µ

1λ

23

m mr µ

MonSW failure on Active

side

3λ

mmr µ)1( −

mmr µ)1( −

mmr µ)1( −General Active-Standby

Software Availability Model

25


Special Case: Cold Standby

0

2

34

6

7

5

Detectedrestartable SW failure

Restart Success

Restart Fails System Detects

AutoFailoverFails, System Detects

Parallel Reboot

Auto Failover SucceedsAutoFailover

Fails, System Does not Detect

Detected non-restartable SW failure

UndetectedSW failure

ManualDetection

Manual Failover Fails

ParallelReboot

Manual FailoverSuccess

ManualDetection

Restart Fails, System Does not Detect

ManualDetection

Reboot

11

8

1

1cλ

2cλ

(1 ) r sr q µ−

fpµ

Rµ

(1 ) f fp q µ− (1 )(1 )f fp q µ− −

m mfp µ

(1 )m mfp µ−

Rµ

3dµ λ+

srµ

(1 )(1 )r sr q µ− −

3dµ λ+

3dµ λ+

))(1( 21 λλ +− cRµ

26


Simplex Systems as Special Cases

1

1 2 3

0 (no restartable software)0 (no standby side)

0 and (no restart attempts)0 and (no automatic failover attempts)

0 and (no manual failover attempts)

s s s

s

f

m mf

m f

rp

p

q q

λλ λ λ

µµµ

== = =

= = ∞= = ∞= = ∞= 1 (no restart or failover attempts)

1 and manual recovery duration ofm mr µ=

= =

1 2 3 0 (no standby)0 and (no automatic failover attempts)

0 and (no manual failover attempts)

1 (no automatic failover attempts)

s s s

f

m mf

f

p

p

q

λ λ λµµ

= = == = ∞= = ∞=

Simplex System With No Restart Capability

Simplex System With Restart Capability

27


0

0.2

0.4

0.6

0.8

1

1 2 3 4 5 6 7 8 9 10 11 1213 14 1516 1718 19 2021 2223 24

Hour of Day

% o

f Bus

y H

our

Tra

ffic

Network Traffic Profile

28


0

20

40

60

80

100

0 20 40 60 80 100

%Capacity Remaining

%T

raff

ic S

erve

d

% Capacity Remaining After the Failure Capacity Required for Traffic Volume Offered in Hour-i = % Offered Traffic Served if Failure Occurs in Hour-i

= Min ( , ) /

Expected %

i

i

i i

CCS

C C C

S

==

= Traffic Served, Given a Failure That Leaves Capacity at C

Reward Model

29


0.1

1

10

0.5 0.6 0.7 0.8 0.9 1

Coverage Factor

Failu

res/

year

No Restart

Restart

Active-Standby

0.5

4

Feasibility Regions for Critical Parameters

Level Curves for 99.999% Availability(all other parameters are fixed)

1 2λ λ+

c

30


Identifying Influential Parameters

•Availability formula is straight-forward to obtain, but is complicated in that it ishighly non-linear and it depends on numerous (23) parameters

•Analysis of derivatives, while also straight-forward is not particularly insightful, since they themselves depend on the same set of parameters.

•“Tornado Graphs” depends too much on the fixed values for the other parameters.

•Identify likely ranges for each parameter to define the domain of the availability function

•Uniformly sample N times from the domain to obtain A1, A2, . . . , AN

• Use an efficient second-order polynomial to approximate the availability formula

•Use the t-statistics as the associated measure of influence

Challenges

Approach

31


Case Study – Continued(Likely Ranges for Parameter Values in Active-Standby Architecture)

Detections/yr

26,280Reboots/yr

3,153,600Failovers/yr

31,536,000Restarts/yr

Fails/yr

Fails/yr

0Fails/yr

Fails/yr

17,520Recoveries/yrFails/yr

26,280Failovers/yrFails/yr1λ

2λ

3λ

1sλ

2sλ

3sλ

sµ

fµ

rµdµ

mfµ

mµr

mr

rq

fq

p

mp

c

w

[ 0.1 , 1.0 ]

[ 0.1 , 1.0 ]

[ 0.1 , 0.5 ]

[ 0.1 , 0.5 ]

[ 0.1 , 0.5 ]

[ 1095 , 2190 ]

[ 0.9 , 0.99 ]

[ 0.9 , 0.99 ]

[ 0.9 , 0.99 ]

[ 0.9 , 0.99 ]

[ 0.9 , 0.99 ]

[ 0.9 , 0.99 ]

[ 0.9 , 0.99 ]

[ 0.9 , 0.99 ]

32


Case Study – Continued(Efficient Second-Order Regression Model)

1 2 1 2

1 3 2 3

3

(min/yr) 497 379 378 358 356 284( /1000) 268( /1000) 46112.2 ( /1000) 319 12 ( /1000) 29822 11.6 ( /1000) 11.7 9.6 9 5.8 5 2.1

d d

d d

d f r m m

D c c c cc

p r q q r w p

λ λ λ λ µ µλ µ λ λ µ λ

λ µ

= + + − − + − −− + − −− − − − − − − −

0

20

40

60

80

100

0 20 40 60 80 100

Predicted

Obs

erve

d

Influential Parameters•Software failure rates on active side•Silent failure detection time•Coverage factor•Cross-product of coverage factor with

software failure rates and silent failuredetection time

Weakly Influential Parameters•All of the success probabilities

InterpretationAdequacy of Fit

33


Needed Research

•SRGM fitting tools

•Small sample inference procedures

•Failure rate template

•Improved methods for linking SRGMsto architecture-based models

some successful approaches to software reliability modeling in … · 2018-12-21 · lucent...

Documents