1
Lucent Technologies, Bell Labs Innovations
Some Successful Approaches to Software Reliability Modeling in Industry
Daniel R. Jeske and Xuemei Zhang
Bell Laboratories, Lucent Technologies
Holmdel, NJ
1. Introduction and Context
2. Software Reliability Growth Models
3. Architecture-Based Software Reliability Models
4. Case Studies
5. Summary
ASA Conference on Quality and Productivity
May 21, 2003
2
When are Software Reliability Models Typically Applied?
[Timeline diagram of lifecycle phases: Requirements Formulation, Architecture Design, Software Development, Unit Testing, Integration Testing, System Testing, First Office Application, General Availability]
• Architecture-Based Reliability Models: decisions are being made as to how to design reliability into the system
• Software Reliability Growth Models: reliability is quantified and influences the release decision
• Software Reliability Growth Models: reliability predictions are verified, and parameters useful for modeling reliability of the next release are estimated
3
Software Reliability Growth Models (SRGMs)
4
SRGMs: Questions to Answer
•What would the user-perceived failure rate of the software be if the software were to be released now?
•How much more testing is needed to achieve the reliability targets?
•How many faults exist in the code at the end of system test?
5
Goel-Okumoto Model: A Well-Known SRGM
$M(t) = c\,(1 - e^{-dt})$
Failure Rate Estimate: $\lambda(T) = M'(T) = c\,d\,e^{-dT}$
$c$ = initial number of faults
$d$ = average per-fault failure rate
$t$ = cumulative test time, aggregated across all installations
$T$ = current value of $t$
[Figure: Cumulative Failures vs. Cumulative Test Time (days), with data points (x) and T marked on the time axis]
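The two Goel-Okumoto quantities can be evaluated directly. The following is a minimal sketch; the function names and the numeric inputs are hypothetical and chosen only for illustration, not taken from the slides.

```python
import math

def go_mean_failures(t, c, d):
    """Expected cumulative failures M(t) = c * (1 - exp(-d * t))."""
    return c * (1.0 - math.exp(-d * t))

def go_failure_rate(t, c, d):
    """Failure rate lambda(t) = M'(t) = c * d * exp(-d * t)."""
    return c * d * math.exp(-d * t)

# Hypothetical inputs: 100 initial faults, per-fault rate 0.01/day, 50 days of test.
c, d, T = 100.0, 0.01, 50.0
expected_found = go_mean_failures(T, c, d)   # faults expected to have surfaced by T
remaining = c - expected_found               # faults expected to remain at T
rate_now = go_failure_rate(T, c, d)          # failure rate if released at T

print(f"found={expected_found:.2f} remaining={remaining:.2f} rate={rate_now:.4f}/day")
```

Under this model the failure rate at T is simply d times the expected number of remaining faults, which is why the release-time failure rate and the residual fault count answer closely related questions.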
6
Motivation for Goel-Okumoto Model
The expected number of faults detected in a time interval $(t, t + \Delta t)$ is proportional to the number of faults remaining in the software at time t.
Thus,
$M(t + \Delta t) - M(t) = d\,[c - M(t)]\,\Delta t$
or equivalently,
$M'(t) + d\,M(t) = c\,d$
implying,
$M(t) = c\,(1 - e^{-dt})$
Goel and Okumoto further assume that the number of faults observed in (0 , t) is Poisson{ M(t) }.
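As a numerical check on this derivation, the difference equation can be stepped forward with a small time increment and compared with the closed form. This is a sketch with hypothetical values for c and d; the function name is an assumption made for illustration.

```python
import math

def euler_cumulative_faults(c, d, t_end, dt):
    """Step M(t + dt) = M(t) + d * (c - M(t)) * dt forward from M(0) = 0."""
    m = 0.0
    steps = int(round(t_end / dt))
    for _ in range(steps):
        m += d * (c - m) * dt
    return m

# Hypothetical values: 100 initial faults, per-fault rate 0.01/day, 100 days.
c, d, t_end = 100.0, 0.01, 100.0
numeric = euler_cumulative_faults(c, d, t_end, dt=0.01)
closed_form = c * (1.0 - math.exp(-d * t_end))

print(f"difference equation: {numeric:.4f}, closed form: {closed_form:.4f}")
```

The two values agree more closely as the step dt shrinks, which is the limit that turns the difference equation into the differential equation.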
7
S-shaped Model
$d(t) = \dfrac{d}{1 + \beta e^{-dt}}$
$M(t) = c \times \dfrac{1}{1 + \beta e^{-dt}} \times (1 - e^{-dt})$
[Figure: Cumulative Failures vs. Cumulative Test Time (days), with data points (x) and T marked on the time axis]
8
Challenges With Applications
Traditional SRGMs are applied to system test data in the hope of estimating the software failure rate that will be observed in the field. The following issues have to be taken into consideration:
• The usage profile in the field is typically very different from the testing profile
• Fault removal in field environments is not instantaneous
• Quality of software reliability data
9
Non-Instantaneous Fault Removal Times In Field Environments
G-O Model
$M(t) = c\,[1 - e^{-dt}]$
$\lambda(t) = c\,d\,e^{-dt}$
G-O Model with Imperfect Debugging
$M(t) = \dfrac{c}{p}\,[1 - e^{-dpt}]$
$\lambda(t) = c\,d\,e^{-dpt}$
where:
$c$ = initial number of faults at time of field deployment
$d$ = average per-fault failure rate in a field environment
$p$ = probability that a detected fault is successfully removed
10
Non-Instantaneous Fault Removal Times In Field Environments
•The perfect debugging assumption is acceptable (based on thorough regression tests)
•We relate non-instantaneous fault removal to imperfect debugging
Under the imperfect debugging model:
E(# of occurrences of each fault) = $1/p$
Under non-instantaneous fault removal, if $\mu$ denotes the mean time to remove a fault and there are $n$ systems in the field:
E(# of occurrences of each fault) = $1 + n\mu d$
Equating the two expressions gives
$\dfrac{1}{p} = 1 + n\mu d$
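The relation between removal delay and the equivalent debugging probability is a one-line computation. The sketch below uses the R1 field values quoted elsewhere in this deck (167 systems, 45 days, 0.0000746 failures/day/fault); the function names are hypothetical.

```python
def expected_occurrences_per_fault(n_systems, mean_removal_days, d):
    """E(# of occurrences of each fault) = 1 + n * mu * d."""
    return 1.0 + n_systems * mean_removal_days * d

def equivalent_debug_probability(n_systems, mean_removal_days, d):
    """Probability p of the imperfect debugging model with the same expectation, 1/p."""
    return 1.0 / expected_occurrences_per_fault(n_systems, mean_removal_days, d)

occurrences = expected_occurrences_per_fault(167, 45.0, 0.0000746)
p = equivalent_debug_probability(167, 45.0, 0.0000746)
print(f"occurrences per fault = {occurrences:.3f}, equivalent p = {p:.3f}")
```

With these inputs each fault is expected to be seen about 1.56 times before its fix is deployed, which corresponds to p of roughly 0.64.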
11
Non-Instantaneous Fault Removal Times In Field Environments
$M(t) = c\,(1 + n\mu d)\left[1 - e^{-\frac{d\,t}{1 + n\mu d}}\right]$
$\lambda(t) = c\,d\,e^{-\frac{d\,t}{1 + n\mu d}}$
where:
$c$ = initial number of faults at time of field deployment
$d$ = average per-fault failure rate in the field environment
$\mu$ = average time to remove a fault ($\mu = 0$ gives back the G-O model)
$n$ = number of systems in the field
$t$ = cumulative exposure time, aggregated across all installations
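A small sketch of the field model with non-instantaneous fault removal follows. The function names are hypothetical; the numeric inputs are the R1 field estimates reported elsewhere in this deck, used here only to exercise the formulas.

```python
import math

def field_mean_failures(t, c, d, mu, n):
    """M(t) = c * (1 + n*mu*d) * (1 - exp(-d*t / (1 + n*mu*d)))."""
    a = 1.0 + n * mu * d
    return c * a * (1.0 - math.exp(-d * t / a))

def field_failure_rate(t, c, d, mu, n):
    """lambda(t) = c * d * exp(-d*t / (1 + n*mu*d))."""
    a = 1.0 + n * mu * d
    return c * d * math.exp(-d * t / a)

# R1 field estimates quoted in this deck: c = 22.3, d = 0.0000746, mu = 45 days, n = 167.
m_r1 = field_mean_failures(58633.0, 22.3, 0.0000746, 45.0, 167)
print(f"expected R1 failures over 58,633 system-days: {m_r1:.1f}")

# With mu = 0 the model collapses to the plain G-O mean value function.
go_value = 22.3 * (1.0 - math.exp(-0.0000746 * 58633.0))
assert abs(field_mean_failures(58633.0, 22.3, 0.0000746, 0.0, 167) - go_value) < 1e-12
```

Note that the expected failure count can exceed c, because a fault may recur on other systems while its fix is pending.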
12
Field Failure Rate Prediction if Testing Profile Matches the Field Usage Profile
The number of initial faults at the time of field deployment is the same as the number of residual faults after testing has completed
The average per-fault failure rate in the field environment is identical to the average per-fault failure rate in the test environment
For the field failure rate model, we can use the G-O model, replacing $c$ with $c\,e^{-dT}$ and adjusting for non-instantaneous removal times
$\lambda(t) = (c\,e^{-dT})\,d\,e^{-\frac{d\,t}{1 + n\mu d}}$
13
Field Failure Rate Prediction When Testing Profile Does Not Match the Field Usage Profile
1) Testing and field usage profiles drive the average per-fault failure rate. Usually the failure rate of faults is smaller in field environments than in test environments
2) Define $K = d/d^{*}$, where $d^{*}$ is the average per-fault failure rate in the field environment
3) K is the “per fault failure rate” calibration factor
4) Estimate K from previous releases of the software, or from related projects
$\lambda_{adj}(t) = (c\,e^{-dT})\,(d/K)\,e^{-\frac{(d/K)\,t}{1 + n\mu d/K}}$
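The calibrated prediction can be written as a single function. This is a sketch only: the function name and the numeric inputs are hypothetical, and t is cumulative field exposure in the same time unit as d.

```python
import math

def adjusted_field_failure_rate(t, c, d, T, K, mu, n):
    """lambda_adj(t) = (c e^{-dT}) (d/K) exp(-(d/K) t / (1 + n mu d/K))."""
    d_field = d / K                      # per-fault failure rate in the field, d* = d/K
    residual = c * math.exp(-d * T)      # faults expected to remain after T of testing
    slowdown = 1.0 + n * mu * d_field    # effect of non-instantaneous fault removal
    return residual * d_field * math.exp(-d_field * t / slowdown)

# Hypothetical inputs: 100 faults, d = 0.01/day, 100 test days, K = 10,
# 30-day mean removal time, 50 fielded systems.
rate_at_release = adjusted_field_failure_rate(0.0, 100.0, 0.01, 100.0, 10.0, 30.0, 50)
rate_later = adjusted_field_failure_rate(5000.0, 100.0, 0.01, 100.0, 10.0, 30.0, 50)
print(f"rate at release: {rate_at_release:.5f}/day, after 5000 system-days: {rate_later:.5f}/day")
```

Setting K = 1 recovers the prediction for the case where the testing profile matches the field usage profile.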
14
Case Study: GSM Wireless System
BCF - Base Station Controller Frame
BTS - Base Transceiver Station
MSC - Mobile Switching Center
PSTN - Public Switched Telephone Network
[Diagram: network elements BTSs, BCF, MSC and PSTN]
BCF Software
•setup calls
•tear down calls
•negotiate cell handoffs
•dynamically control transmit power levels
•alarm handling
•overload control
•equipment provisioning
15
Objective
Goals
Estimate the field failure rate of R3 BCF software, which is currently in system test
Data Available
1) R3 test data in a non-operational profile environment
2) Field failure data for R1 and R2
16
Field Failure Data for R1
         System Days               Failures
Month    Per Month   Cumulative    Per Month   Cumulative
1        1,249       1,249         4           4
2        3,472       4,721         6           10
3        4,065       8,786         4           14
4        4,883       13,669        3           17
5        5,425       19,094        6           23
6        5,656       24,750        1           24
7        7,549       32,299        2           26
8        8,295       40,594        4           30
9        8,882       49,476        1           31
10       6,120       55,596        0           31
11       2,465       58,061        1           32
12       527         58,588        1           33
13       45          58,633        0           33
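One way such monthly counts could be turned into estimates of c and d is grouped-data maximum likelihood for the field model with non-instantaneous fault removal. The sketch below is an assumption about method, not the authors' documented procedure: it treats monthly counts as independent Poisson variables, takes n = 167 and mu = 45 days as quoted for R1 in this deck, and uses a plain grid search, so its output need not reproduce the published estimates exactly.

```python
import math

# R1 field data: cumulative system-days and failures per month.
cum_days = [1249, 4721, 8786, 13669, 19094, 24750, 32299,
            40594, 49476, 55596, 58061, 58588, 58633]
failures = [4, 6, 4, 3, 6, 1, 2, 4, 1, 0, 1, 1, 0]
N_SYSTEMS, MU_DAYS = 167, 45.0   # values quoted for R1 in this deck

def profile_fit(d):
    """For a trial d, return (Poisson log-likelihood up to a constant, c at its conditional MLE)."""
    a = 1.0 + N_SYSTEMS * MU_DAYS * d
    surv = [1.0] + [math.exp(-d * t / a) for t in cum_days]
    total = sum(failures)
    c = total / (a * (1.0 - surv[-1]))   # makes expected total equal observed total
    loglik = -float(total)               # minus the expected total at this c
    for y, s0, s1 in zip(failures, surv, surv[1:]):
        if y > 0:
            loglik += y * math.log(c * a * (s0 - s1))   # expected count in the month
    return loglik, c

# Log-spaced grid for d between 1e-7 and 1e-2 (an arbitrary search range).
grid = [10 ** (-7 + 5 * k / 2000) for k in range(2001)]
d_hat = max(grid, key=lambda d: profile_fit(d)[0])
c_hat = profile_fit(d_hat)[1]
print(f"c_hat = {c_hat:.1f} faults, d_hat = {d_hat:.3g} failures/day/fault")
```

Profiling out c reduces the problem to a one-dimensional search over d, which keeps the sketch dependency-free; a production tool would more likely use a numerical optimizer and report standard errors from the information matrix.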
17
Field Failure Data Analysis
R1 Field
$T_1$ = 58,633 system-days (13 months, 167 systems)
$\mu$ = 45 days
$\hat{c}_1$ = 22.3 faults
$\hat{d}_1$ = 0.0000746 failures/day/fault (StdError: $3.07 \times 10^{-5}$)
[Figure: Cumulative Failures vs. Cumulative System Days, R1 field data]
R2 Field
$T_2$ = 167,900 system-days (13 months, 370 systems)
$\mu$ = 45 days
$\hat{c}_2$ = 114.01 faults
$\hat{d}_2$ = 0.0000128 failures/day/fault (StdError: $3.07 \times 10^{-6}$)
[Figure: Cumulative Failures vs. Cumulative System Days, R2 field data]
18
R3 Test Data
Calendar   Number of Days of Each     Cumulative System-Days   Cumulative
Week       Software Installation      of Testing               Failures
1          5                          5                        5
2          4                          9                        6
3          4                          13                       13
4          5                          18                       13
5          5 5                        28                       22
6          5                          33                       24
7          5 5                        43                       29
8          5 5 5 5                    63                       34
9          5 5 5 5 5                  88                       40
10         5 5 5 5 5 5 5              123                      46
...
33         5 5 5 6                    955                      170
34         5 5 6 6                    977                      176
35         5 5 6 6                    999                      180
36         2                          1001                     181
As many as 11 frames were being used in parallel during system test.
‘Failure’ is defined as a severity 1, 2 or 3 MR.
19
G-O Model Fit to the R3 Test Data
[Figure: Cumulative Failures vs. Cumulative System Days of Testing, with fitted curve $\hat{M}(t) = \hat{c}\,(1 - e^{-\hat{d}t})$ and the annotation 239 - 181 = 58]
$\hat{c}$ = 238.3 faults
$\hat{d}$ = 0.00142 fails/day/fault
$\hat{K} = \dfrac{\hat{d}}{(\hat{d}_1 + \hat{d}_2)/2} = 31$
$\mathrm{Var}\begin{pmatrix} \hat{c} \\ \hat{d} \end{pmatrix} = \begin{pmatrix} 27.02^2 & -0.00552 \\ -0.00552 & 0.00027^2 \end{pmatrix}$
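The residual fault count on this slide follows from the fitted parameters and the 1001 system-days of testing. The sketch below reproduces it and, as an added illustration that is not claimed to be the authors' method, attaches a delta-method standard error using the variance and covariance values as read from this slide.

```python
import math

c_hat, d_hat = 238.3, 0.00142        # G-O estimates from the R3 test data
T = 1001.0                           # cumulative system-days of testing
var_c = 27.02 ** 2                   # variance of c_hat (as read from the slide)
var_d = 0.00027 ** 2                 # variance of d_hat (as read from the slide)
cov_cd = -0.00552                    # covariance of c_hat and d_hat (as read from the slide)

decay = math.exp(-d_hat * T)
found = c_hat * (1.0 - decay)        # expected failures observed by T
residual = c_hat * decay             # expected faults remaining at T

# Delta method: gradient of c * exp(-d * T) with respect to (c, d).
grad_c = decay
grad_d = -c_hat * T * decay
var_residual = (grad_c ** 2 * var_c
                + 2.0 * grad_c * grad_d * cov_cd
                + grad_d ** 2 * var_d)
se_residual = math.sqrt(var_residual)

print(f"found = {found:.1f}, residual = {residual:.1f}, approx. SE = {se_residual:.1f}")
```

The point estimates round to the 181 observed failures and the 58 residual faults shown on the slide; the standard error indicates how much uncertainty the residual count carries into any field prediction built on it.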
20
R3 Field Failure Rate Prediction
$\hat{\lambda}_{adj}(t) = 58 \times (365 \times 0.00142/31) \times e^{-\frac{(0.00142/31) \times 365 \times 345 \times t}{1 + 345 \times 45 \times 0.00142/31}}$
(fails/year, after $t$ calendar-years of field exposure)
[Figure: R3 Failure Rate vs. Calendar Time (Years), showing 95% Prediction Limits and a Smoothed Nonparametric Estimate of R3 Failure Rate]
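The prediction formula can be evaluated at a few calendar times. In this sketch the constants are the ones appearing in the formula (58 residual faults, d = 0.00142, K = 31, 345 systems, 45-day removal time, 365 days per year); the constant and function names are hypothetical.

```python
import math

RESIDUAL_FAULTS = 58      # faults estimated to remain at the end of system test
D_TEST = 0.00142          # per-fault failure rate in test (fails/day/fault)
K = 31                    # calibration factor from earlier releases
N_SYSTEMS = 345           # systems in the formula's exposure term
MU_DAYS = 45              # mean fault removal time in days
DAYS_PER_YEAR = 365

d_field = D_TEST / K      # per-fault failure rate in the field (fails/day/fault)

def r3_field_failure_rate(t_years):
    """Predicted failure rate in fails/year after t_years of calendar time."""
    exposure_days = DAYS_PER_YEAR * N_SYSTEMS * t_years   # cumulative system-days
    slowdown = 1.0 + N_SYSTEMS * MU_DAYS * d_field
    return (RESIDUAL_FAULTS * DAYS_PER_YEAR * d_field
            * math.exp(-d_field * exposure_days / slowdown))

for t in (0.0, 0.1, 0.25, 0.5):
    print(f"t = {t:4.2f} years: {r3_field_failure_rate(t):.3f} fails/year")
```

At t = 0 this gives about 0.97 fails/year, decaying to about 0.18 fails/year after half a year of field exposure.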
21
Architecture-Based Software Reliability Models
22
Questions to Answer
•What type of software redundancy, if any, is needed in order to achieve the reliability target?
•How fast do fault recovery times need to be?
•Should software processes try to restart before a failover is attempted?
•How thorough do system fault diagnostics need to be?
•What is the required software failure rate?
23
Case Study - Continued
[Diagram: network elements BTSs, BCF, MSC and PSTN]
BTS Controller Software
•allocate radio resources
•negotiate hand-offs
•manage connections between BTS and BCF
•provision and maintenance
Alternative Architectures
•Simplex, no process restart capability
•Simplex, with restart capability
•Active-Standby, with fail-over capability
24
General Active-Standby Software Availability Model
[Figure: Markov state-transition diagram with states 0 through 23]
Transition events shown: detected restartable SW failure; detected non-restartable SW failure; undetected SW failure; detected or undetected MonSW failure; undetected standby MonSW failure; MonSW failure on active side; restartable or non-restartable standby SW failure; restartable failure on active side; non-restartable SW failure on active side; restart success; restart fails (system detects, system does not detect, undetected, no escalation, or escalation to auto failover); auto failover succeeds; auto failover fails (system detects or system does not detect); manual detection; manual failover success; manual failover fails; reboot; parallel reboot.
Transition rates shown: $c\lambda_1$, $c\lambda_2$, $c\lambda_3$, $(1-c)\lambda_3$, $(1-c)(\lambda_1+\lambda_2)$, $(1-c)(\lambda_{1s}+\lambda_{2s})$, $(1-c)\lambda_{3s}$, $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_1+\lambda_2$, $\lambda_{1s}+\lambda_{2s}$, $r\mu_s$, $(1-r)\mu_s$, $(1-r)q_r\mu_s$, $(1-r)(1-q_r)\mu_s$, $r_m\mu_m$, $(1-r_m)\mu_m$, $(1-r_m)w\mu_m$, $(1-r_m)(1-w)\mu_m$, $p\mu_f$, $(1-p)q_f\mu_f$, $(1-p)(1-q_f)\mu_f$, $\mu_f$, $p_m\mu_{mf}$, $(1-p_m)\mu_{mf}$, $\mu_R$, $\mu_d$, $\mu_d+\lambda_3$, $\mu_d+\lambda_{3s}$.
25
Special Case: Cold Standby
[Figure: Markov state-transition diagram with states 0 through 8 and 11]
Transition events shown: detected restartable SW failure; detected non-restartable SW failure; undetected SW failure; restart success; restart fails (system detects or system does not detect); auto failover succeeds; auto failover fails (system detects or system does not detect); manual detection; manual failover success; manual failover fails; reboot; parallel reboot.
Transition rates shown: $c\lambda_1$, $c\lambda_2$, $(1-c)(\lambda_1+\lambda_2)$, $r\mu_s$, $(1-r)q_r\mu_s$, $(1-r)(1-q_r)\mu_s$, $p\mu_f$, $(1-p)q_f\mu_f$, $(1-p)(1-q_f)\mu_f$, $p_m\mu_{mf}$, $(1-p_m)\mu_{mf}$, $\mu_R$, $\mu_d+\lambda_3$.
26
Simplex Systems as Special Cases
Simplex System With No Restart Capability
$\lambda_1 = 0$ (no restartable software)
$\lambda_{1s} = \lambda_{2s} = \lambda_{3s} = 0$ (no standby side)
$r = 0$ and $\mu_s = \infty$ (no restart attempts)
$p = 0$ and $\mu_f = \infty$ (no automatic failover attempts)
$p_m = 0$ and $\mu_{mf} = \infty$ (no manual failover attempts)
$q_m = q_f = 1$ (no restart or failover attempts)
$r_m = 1$, with the manual recovery duration set through $\mu_m$
Simplex System With Restart Capability
$\lambda_{1s} = \lambda_{2s} = \lambda_{3s} = 0$ (no standby)
$p = 0$ and $\mu_f = \infty$ (no automatic failover attempts)
$p_m = 0$ and $\mu_{mf} = \infty$ (no manual failover attempts)
$q_f = 1$ (no automatic failover attempts)
27
Network Traffic Profile
[Figure: % of Busy Hour Traffic (0 to 1) vs. Hour of Day (1 to 24)]
28
Reward Model
$C$ = % Capacity Remaining After the Failure
$C_i$ = Capacity Required for Traffic Volume Offered in Hour-i
$S_i$ = % Offered Traffic Served if Failure Occurs in Hour-i = $\min(C, C_i)/C_i$
$S$ = Expected % Traffic Served, Given a Failure That Leaves Capacity at $C$
[Figure: %Traffic Served vs. %Capacity Remaining, both axes 0 to 100]
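The reward calculation can be sketched in a few lines. Two things here are assumptions rather than content from the slides: the four-hour traffic profile is a made-up stand-in for the 24-hour network traffic profile, and the expectation is taken with the failure equally likely to occur in any hour.

```python
def traffic_served_fraction(capacity_remaining, capacity_required):
    """S_i = min(C, C_i) / C_i for a failure occurring in hour i."""
    return min(capacity_remaining, capacity_required) / capacity_required

def expected_traffic_served(capacity_remaining, hourly_required):
    """S: average of S_i over hours, assuming the failure hour is uniformly distributed."""
    shares = [traffic_served_fraction(capacity_remaining, c_i) for c_i in hourly_required]
    return sum(shares) / len(shares)

# Hypothetical profile: % capacity required in each of four representative hours.
hourly_required = [25, 50, 100, 100]

for capacity in (25, 50, 75, 100):
    s = expected_traffic_served(capacity, hourly_required)
    print(f"capacity remaining {capacity:3d}% -> expected traffic served {100 * s:.1f}%")
```

Because off-peak hours need less than full capacity, a failure that removes half the capacity costs less than half the traffic; with this profile, 50% remaining capacity still serves 75% of offered traffic.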
29
Feasibility Regions for Critical Parameters
Level Curves for 99.999% Availability (all other parameters are fixed)
[Figure: Failures/year ($\lambda_1 + \lambda_2$, log scale from 0.1 to 10) vs. Coverage Factor ($c$, from 0.5 to 1), with curves for No Restart, Restart and Active-Standby; values 0.5 and 4 are marked on the plot]
30
Identifying Influential Parameters
Challenges
•The availability formula is straightforward to obtain, but it is complicated: it is highly non-linear and depends on numerous (23) parameters
•Analysis of derivatives, while also straightforward, is not particularly insightful, since the derivatives themselves depend on the same set of parameters
•“Tornado Graphs” depend too much on the fixed values chosen for the other parameters
Approach
•Identify likely ranges for each parameter to define the domain of the availability function
•Uniformly sample N times from the domain to obtain A1, A2, . . . , AN
•Use an efficient second-order polynomial to approximate the availability formula
•Use the t-statistics as the associated measure of influence
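The approach can be illustrated end to end on a toy problem. Everything specific in this sketch is an assumption: the three-parameter downtime function is a hypothetical stand-in for the 23-parameter availability formula, the fast recovery rate of 26,280 per year is an arbitrary choice, and a full (rather than "efficient") second-order polynomial is fitted by ordinary least squares.

```python
import math
import random

def downtime_minutes(lam, c, mu_d):
    """Hypothetical stand-in for the availability formula (not the model in the slides)."""
    mean_outage_years = (1.0 - c) / mu_d + 1.0 / 26280.0
    u = lam * mean_outage_years
    return 525600.0 * u / (1.0 + u)

RANGES = {"lam": (0.1, 1.0), "c": (0.9, 0.99), "mu_d": (1095.0, 2190.0)}
NAMES = list(RANGES)

def design_row(z):
    """Second-order terms: intercept, linear terms, cross-products, squares."""
    k = len(z)
    row = [1.0] + list(z)
    row += [z[i] * z[j] for i in range(k) for j in range(i + 1, k)]
    row += [v * v for v in z]
    return row

def term_names():
    k = len(NAMES)
    names = ["1"] + NAMES[:]
    names += [f"{NAMES[i]}*{NAMES[j]}" for i in range(k) for j in range(i + 1, k)]
    names += [f"{n}^2" for n in NAMES]
    return names

def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

random.seed(0)
N = 200
X, y = [], []
for _ in range(N):
    point = {name: random.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}
    # Scale each parameter to [-1, 1] so coefficients are comparable and X'X is well conditioned.
    z = [2.0 * (point[n] - RANGES[n][0]) / (RANGES[n][1] - RANGES[n][0]) - 1.0 for n in NAMES]
    X.append(design_row(z))
    y.append(downtime_minutes(**point))

p = len(X[0])
xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
xtx_inv = invert(xtx)
beta = [sum(xtx_inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(X, y))
sigma2 = rss / (N - p)
t_stats = [b / math.sqrt(sigma2 * xtx_inv[i][i]) for i, b in enumerate(beta)]

# Rank the polynomial terms by the magnitude of their t-statistics.
for name, t in sorted(zip(term_names(), t_stats), key=lambda nt: -abs(nt[1])):
    print(f"{name:>10s}  t = {t:9.1f}")
```

Ranking by |t| summarizes influence over the whole sampled domain, rather than at one fixed setting of the other parameters, which is the weakness noted for derivatives and tornado graphs.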
31
Case Study – Continued
(Likely Ranges for Parameter Values in Active-Standby Architecture)
Failure rates $\lambda_1, \lambda_2, \lambda_3, \lambda_{1s}, \lambda_{2s}, \lambda_{3s}$ (Fails/yr): values listed are [ 0.1 , 1.0 ] (twice), [ 0.1 , 0.5 ] (three times) and 0 (once)
$\mu_d$: [ 1095 , 2190 ] Detections/yr
$\mu_R$: 26,280 Reboots/yr
$\mu_f$: 3,153,600 Failovers/yr
$\mu_s$: 31,536,000 Restarts/yr
$\mu_m$: 17,520 Recoveries/yr
$\mu_{mf}$: 26,280 Failovers/yr
Probabilities $r, r_m, q_r, q_f, p, p_m, c, w$: [ 0.9 , 0.99 ] each
32
Case Study – Continued
(Efficient Second-Order Regression Model)
$D\,(\text{min/yr}) = 497 + 379\lambda_1 + 378\lambda_2 - 358c\lambda_1 - 356c\lambda_2 + 284c(\mu_d/1000) - 268(\mu_d/1000) - 461c$
$\quad - 12.2\lambda_1(\mu_d/1000) + 319\lambda_3 - 12\lambda_2(\mu_d/1000) - 298c\lambda_3$
$\quad - 22p - 11.6\lambda_3(\mu_d/1000) - 11.7r - 9.6q_f - 9q_r - 5.8r_m - 5w - 2.1p_m$
Adequacy of Fit
[Figure: Observed vs. Predicted, both axes 0 to 100]
Interpretation
Influential Parameters
•Software failure rates on active side
•Silent failure detection time
•Coverage factor
•Cross-product of coverage factor with software failure rates and silent failure detection time
Weakly Influential Parameters
•All of the success probabilities
33
Needed Research
•SRGM fitting tools
•Small sample inference procedures
•Failure rate template
•Improved methods for linking SRGMs to architecture-based models