Building Models from Your Software Data
Brad Clark, Ph.D.
Software Metrics, Inc.
16th International Forum on COCOMO
and Software Cost Modeling
Los Angeles, CA October 23-26, 2001
Tutorial: Building Models 2
Agenda
– 1:00 - 2:30 PM: Tutorial
– 2:30 - 3:00 PM: Break
– 3:00 - 4:30 PM: Tutorial conclusion
Miscellaneous
– Bathrooms
– Telephones
– Tutorial Format: Collaborative participation
• One person talks at a time
• Keep discussions to the point
• No attribution
• End-of-course evaluation
Tutorial: Building Models 3
Directions
[Campus map: Gerontology Auditorium, Parking Structure A, Electrical Engineering Building (Hughes), Computer Science Building (Salvatori), CSE 3rd Floor, School of Engineering, McClintock Ave., West 37th Place, Gate #1]
Tutorial: Building Models 4
Tutorial Outline
• Purpose
• A software engineering modeling example
• Model building steps
• Mean-based model exercise
• Regression-based model exercise
• Summary
Tutorial: Building Models 5
The need for models
• Models are useful for forecasting, performance analysis, and decision-making
– WBS is narrowly addressed with current estimation models
– Strength of cause and effect relationships
– Impact of decision making: personnel turnover
• Establish data requirements (model parameters)
• Explain assignable causes of variation and their degree of influence
• Used to validate data
– Poor data definitions and collection consistency
– Poor processes that produce the data
Tutorial: Building Models 6
WBS Help
3.1 Program Management
3.1.1 Planning & Mgt
3.1.2 Program Control
3.1.3 Contract Management
3.1.4 Contractor Laboratory
3.2 System Engineering
3.2.1 Sys Req'ts
3.2.2 Design and Integration
3.2.4 Sup. & Maintainability Eng.
3.2.5 QA
3.2.6 CM
3.2.7 Human Factors
3.2.8 Security
3.3 HW/SW Design, Development and Production
3.3.1 HW Design & Dev
3.3.2 SW Design & Dev
3.3.3 HW/SW Integration & Checkout
3.5 Test and Evaluation
3.5.1 Sys T&E
3.5.4 Site Accep
3.6 Documentation
3.7 Support
Software Cost Estimation Models
How is the effort estimated for the rest of these?
Tutorial: Building Models 7
Decision Impact Analysis
• Do we give the team an incentive to stay or do we look for new hires?
Estimated PM = 2.94 * KSLOC * PCON
Turnover   PCON  Effect on PM
3% / yr    0.81    0.0%
6% / yr    0.90  +11.0%
12% / yr   1.00  +23.5%
24% / yr   1.12  +38.0%
48% / yr   1.29  +59.0%
• Estimated Person Months for a 100 KSLOC project with 3% PCON = 238 PM
• Same project with 12% PCON = 294 PM (23.5% increase)
• If the burdened labor rate is $10,000/PM, the cost increase is about $560,000
Why not give everyone a financial incentive to stay?
PM: Person Months
PCON: Personnel Continuity
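The slide's arithmetic can be sketched in a few lines of Python (Python rather than Excel, purely for illustration; the `estimated_pm` helper name is ours, while the 2.94 coefficient and the PCON multipliers come from the slide):

```python
# Sketch of the slide's decision-impact arithmetic: Estimated PM = 2.94 * KSLOC * PCON.

def estimated_pm(ksloc, pcon):
    """Estimated Person Months for a project of the given size and PCON multiplier."""
    return 2.94 * ksloc * pcon

pm_low_turnover = estimated_pm(100, 0.81)  # 3%/yr turnover -> ~238 PM
pm_nominal = estimated_pm(100, 1.00)       # 12%/yr turnover -> 294 PM

# Cost increase at a $10,000/PM burdened labor rate (~$560,000)
cost_increase = (pm_nominal - pm_low_turnover) * 10_000
```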
Tutorial: Building Models 8
Data validation -1
• Check for internal consistency
• Be suspicious of “perfect” data
• Understand the reason for outliers
• Check data relationships
– Effort and size
– Effort and schedule
– Size and defects
– Effort and defects
Tutorial: Building Models 9
Data validation -2
[Scatter plot: Budget ($0 to $140,000) vs. Actual Hours (0 to 2,500)]
What looks suspicious here?
Tutorial: Building Models 10
Tutorial Objectives
• Share data analysis experiences with real data
– (COCOMO as a thinking aid)
• Show how models created from data are based on the average (or mean) of the data and its spread or variation
• Show how model performance improves with the removal of assignable causes of variation
• Raise awareness on the many sources of variation in software engineering data
Tutorial: Building Models 11
What Will We Do?
• Using supplied data, we will build simple models
– Mean or Median
– One-variable regression models
– Stratifying data
• Two sets of data
– The first set will be used to learn a technique
– The second set will be used to practice the technique
• Intent is to show how to create small models by example
Tutorial: Building Models 12
What You Will Walk Away With
• A new skill: using Excel to look at data
– Data summaries
– Graphing data
– Simple regression models
• An understanding of what is behind the numbers produced by models, a.k.a. understanding variation
– An intelligent consumer of data (which you can practice during this conference’s presentations)
– A responsible data reporter
• Understanding model parameters and their impact on explaining variation
Tutorial: Building Models 13
About the Instructor and SMI
• Brad Clark
– Former Navy pilot
– Worked in civil service for 10 years
– Attended USC Graduate School: 1992 - 1997
• Development of the COCOMO II model
• Process Maturity Effects on Effort
– Started consulting in 1998 in using measurement to manage software projects
• Software Metrics, Inc. (SMI)
– Very small, private consulting company located in Haymarket, Va.
– Started in 1983 by John and Betsy Bailey
– Focus: Using software measurement to manage software projects: estimation, feasibility analysis, performance
Tutorial: Building Models 14
About You
• What is your name?
• Where do you work?
• Do you have any experience with statistics or empirical modeling?
Tutorial: Building Models 15
Tutorial Outline
• Purpose• A software engineering modeling example• Model building steps• Mean-based model exercise• Regression based model exercise• Summary
Tutorial: Building Models 16
What is a model?
• A model is a representation of the essential structure of some object or event in the real world.
– Physical (airplane, building, bridge)
– Symbolic (language, computer program, mathematical equation)
• Two major characteristics of models
– Models are necessarily incomplete
– Models may be changed or manipulated with relative ease
• No model includes every aspect of the real world
– Building models necessarily involves simplifying assumptions
– It is critical that the assumptions made when constructing models be understood and be reasonable.
Source: Introductory Statistics: Concepts, Models, and Applications by David Stockburger
Tutorial: Building Models 17
Using Data to Estimate
Effort Consumption = 11.9 Person Hours / Function Point

Estimated Effort = Function Points * Consumption:
  880.6 = 74 * 11.9   (Actual Effort: 165)
3,665.2 = 308 * 11.9  (Actual Effort: 14,080)
5,057.5 = 425 * 11.9  (Actual Effort: 3,602)

What does this mean? Yikes!
Tutorial: Building Models 18
First Model: Sample Mean (est. X)
[Figure: a Normal distribution for the population spread with mean X, where ±1 SD covers 68%, ±2 SD covers 95%, and ±3 SD covers 99.7%; below it, a t-distribution for the sample spread with estimated mean est. X and a 90% confidence interval (CI) of 3.6 to 20.3 around the sample mean of 11.9]
Tutorial: Building Models 19
Necessary and Sufficient Information
• What additional information do we want to know about the stated relationship to make it more accurate?
?
?
?
Effort Consumption = 11.9 Person Hours / Function Point
Tutorial: Building Models 20
Data Analysis: PHr/FP

PHr/FP
Mean:                      11.92
Standard Error:            4.48
Median:                    7.50
Standard Deviation:        13.44
Range:                     43.48
Minimum:                   2.23
Maximum:                   45.71
Confidence Level (90.0%):  8.33

PN   FP    PHrs    PHr/FP
1     40     300    7.50
2    931   6,400    6.87
3    425   3,602    8.48
4    181   1,550    8.56
5    308  14,080   45.71
6    163   1,090    6.69
7     74     165    2.23
8    333   1,070    3.21
9    241   4,350   18.05

[Histogram: frequency of PHr/FP values in bins from 5 to 50]
Confidence Intervals (CI) around the mean of 11.9:
80% CI: 5.7 to 18.2
90% CI: 3.6 to 20.3
95% CI: 1.6 to 22.3
Confidence Interval can be “tightened” by removing assignable causes of variation.
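The mean-based model and its confidence interval can be reproduced with the Python standard library (a sketch, not the deck's Excel steps; the t critical value for 8 degrees of freedom is hard-coded from a t-table):

```python
# The "first model": the sample mean of PHr/FP with a t-based 90% confidence interval.
# Data are the nine PHr/FP values from the table above.
import math
import statistics

phr_per_fp = [7.50, 6.87, 8.48, 8.56, 45.71, 6.69, 2.23, 3.21, 18.05]

mean = statistics.mean(phr_per_fp)         # ~11.92
sd = statistics.stdev(phr_per_fp)          # ~13.44 (sample standard deviation)
std_err = sd / math.sqrt(len(phr_per_fp))  # ~4.48

t_90 = 1.860                               # two-sided 90% critical value, df = 8
half_width = t_90 * std_err                # ~8.33, Excel's "Confidence Level (90.0%)"
ci = (mean - half_width, mean + half_width)  # roughly (3.6, 20.3)
```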
Tutorial: Building Models 21
Reducing the Confidence Interval
• Some assignable causes of variation among project data points
– Noisy data (size and effort)
– Complexity of the software (effort)
– Amount of required testing (effort)
– Building components for reuse (effort)
– Changes in requirements (size)
– Required reliability and safety features (size)
– Interoperability (effort and size)
– Development / maintenance team experience (effort)
– Turnover of key people (effort)
Tutorial: Building Models 22
Measurement Specifications
• Staff Turnover Specification Example
– Typical Data Items
• Number of personnel
• Number of personnel gained (per period)
• Number of personnel lost (per period)
– Typical Attributes
• Experience factor
• Organization
– Typical Aggregation Structure
• Activity
– Typically Collected for Each
• Project
– Count Actuals Based On
• Financial reporting criteria
• Organization restructuring or new organizational chart
Source: Practical Software Measurement: Objective Information for Decision Makers by McGarry et al.
Tutorial: Building Models 23
Models Depend on Solid Data
• Models are created from data; models are only as good as the data used to create them
– life-cycle phase
– overtime to get work done
– experience
– tools
– complexity
– reuse
• Data used to create models must be well specified
Tutorial: Building Models 24
Accounting for Requirements Volatility
Assignable cause of variation: adjust the size for the effects of requirements volatility (REVL)
Adj_FP = FP * (1 + REVL%)

PHr/Adj_FP
Mean:                      8.01
Standard Error:            2.07
Median:                    6.62
Standard Deviation:        6.21
Range:                     20.81
Minimum:                   2.05
Maximum:                   22.86
Confidence Level (90.0%):  3.85

PN   FP   REVL%   Adj_FP     PHrs   PHr/Adj_FP
1     40    0       40.00      300    7.50
2    931   50     1396.50    6,400    4.58
3    425   30      552.50    3,602    6.52
4    181   10      199.10    1,550    7.79
5    308  100      616.00   14,080   22.86
6    163    1      164.63    1,090    6.62
7     74    9       80.66      165    2.05
8    333   10      366.30    1,070    2.92
9    241   60      385.60    4,350   11.28

[Histogram: frequency of PHr/Adj_FP values in bins from 5 to 50]
90% CI: 4.2 to 11.9 around the mean of 8.0
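The REVL size adjustment can be sketched in Python (the FP, REVL, and PHrs values come from the table above; the variable names are ours):

```python
# REVL size adjustment from the slide: Adj_FP = FP * (1 + REVL%).
import statistics

projects = [  # (FP, REVL %, PHrs)
    (40, 0, 300), (931, 50, 6400), (425, 30, 3602), (181, 10, 1550),
    (308, 100, 14080), (163, 1, 1090), (74, 9, 165), (333, 10, 1070),
    (241, 60, 4350),
]

adj_fp = [fp * (1 + revl / 100) for fp, revl, _ in projects]
productivity = [phrs / afp for (_, _, phrs), afp in zip(projects, adj_fp)]

# Removing this assignable cause of variation tightens the spread:
# the mean drops from ~11.92 to ~8.01 person hours per adjusted FP.
print(round(statistics.mean(productivity), 2))  # -> 8.01
```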
Tutorial: Building Models 25
Variation: Staff Turnover
Impact of Personnel Continuity on Effort
This factor captures the turmoil caused by the project losing key, lead personnel. The loss of key personnel leads to extra effort as new people come to work on the project and have to spend time coming up to speed on what has to be done. The rating scale is in terms of the project’s personnel turnover normalized to a year.

Descriptor (turnover):  48% / year  24% / year  12% / year  6% / year  3% / year
Rating Level:           Very Low    Low         Nominal     High       Very High
Effort Multiplier:      1.29        1.12        1.00        0.90       0.81
Effect on Effort (step between adjacent ratings): +15%, +12%, -11%, -11%

Source: Software Cost Estimation with COCOMO II by Barry Boehm et al.
Tutorial: Building Models 26
Accounting for Staff Turnover
Assignable cause of variation: adjust the effort with the effects of Personnel Continuity (PCON)
Adj_PHr = PHr / PCON

Adj_PHr/Adj_FP
Mean:                      7.87
Standard Error:            1.46
Median:                    8.05
Standard Deviation:        4.39
Range:                     15.19
Minimum:                   2.53
Maximum:                   17.72
Confidence Level (90.0%):  2.72

[Histogram: frequency of Adj_PHr/Adj_FP values in bins from 5 to 50]
90% CI: 5.2 to 10.6 around the mean of 7.9

PN   Adj_FP     PHrs   PCON  Adj_PHrs   Adj_PHr/Adj_FP
1      40.00     300   0.90    333.33    8.33
2    1396.50   6,400   0.81   7901.23    5.66
3     552.50   3,602   0.81   4446.91    8.05
4     199.10   1,550   0.81   1913.58    9.61
5     616.00  14,080   1.29  10914.73   17.72
6     164.63   1,090   1.00   1090.00    6.62
7      80.66     165   0.81    203.70    2.53
8     366.30   1,070   0.81   1320.99    3.61
9     385.60   4,350   1.29   3372.09    8.75
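The same pattern applies to the effort adjustment (a sketch using the Adj_FP, PHrs, and PCON values from the table above):

```python
# PCON effort adjustment from the slide: Adj_PHr = PHr / PCON.
import statistics

rows = [  # (Adj_FP, PHrs, PCON)
    (40.00, 300, 0.90), (1396.50, 6400, 0.81), (552.50, 3602, 0.81),
    (199.10, 1550, 0.81), (616.00, 14080, 1.29), (164.63, 1090, 1.00),
    (80.66, 165, 0.81), (366.30, 1070, 0.81), (385.60, 4350, 1.29),
]

adj_phr = [phrs / pcon for _, phrs, pcon in rows]
productivity = [ap / afp for (afp, _, _), ap in zip(rows, adj_phr)]

# Mean drops to ~7.87, and the spread shrinks again once turnover is accounted for.
```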
Tutorial: Building Models 27
COCOMO Suite
• Attempts to identify and quantify assignable causes of variation (drivers)

Model      Purpose
COCOMO     Custom cost and schedule estimation
COCOTS     COTS Based Systems cost estimation
COQUALMO   Defect introduction and removal
CORADMO    Rapid application development cost and schedule estimation
COPSEMO    Staged schedule & effort model
COPROMO    Productivity improvement model
COSYSMO    System engineering cost and schedule est.
Tutorial: Building Models 28
Second Model: Linear Regression Analysis
My favorite!
• Statistical regression fits a line through points, minimizing the least-square error between the points and the line
• The regression analysis yields a line with a slope, M, and intercept, A:
Y = A + MX + e
• The goodness of fit is given by a statistic called R². The closer to 1.0, the better the fit.
[Scatter plot: PHr (0 to 15,000) vs. Adj_FP (0 to 1,500) with trend line]
est_PHr = 1061.3 + 6.0645 Adj_FP, R² = 0.3232
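A one-variable least-squares fit needs only the standard library (a sketch, not Excel's Regression tool; data are the nine (Adj_FP, PHrs) pairs from the earlier tables, and the result should land near the slide's est_PHr = 1061.3 + 6.0645 Adj_FP):

```python
# Least-squares fit of PHrs against Adj_FP for the nine projects above.
x = [40.00, 1396.50, 552.50, 199.10, 616.00, 164.63, 80.66, 366.30, 385.60]  # Adj_FP
y = [300, 6400, 3602, 1550, 14080, 1090, 165, 1070, 4350]                     # PHrs

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
syy = sum((yi - my) ** 2 for yi in y)

slope = sxy / sxx                   # M: ~6.06 person hours per adjusted FP
intercept = my - slope * mx         # A: ~1061
r_squared = sxy ** 2 / (sxx * syy)  # ~0.32: the fit explains about a third of the variation
```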
Tutorial: Building Models 29
Starting Point: Scatter Plot
Source: Albrecht and Gaffney, "Software Function, Source Lines of Code, and Development Effort Prediction: A Software Science Validation," IEEE Transactions on Software Engineering, Vol SE-9, No 6, Nov 1983.
[Scatter plot: KSLOC (0.0 to 140.0) vs. Unadjusted Function Points (0 to 2,000)]
Y = 11.3 + 0.0651 X + e
(e is the random error term; there is variation with each model coefficient)
Model Boundaries
Tutorial: Building Models 30
Regression Analysis Example
[Scatter plot: PHr vs. Model Estimate of PHr for the linear model]
Linear Model: est_PHr = -12127 + 6.16 Adj_FP + 13879 PCON (Adj. R² = 0.64)
[Scatter plot: PHr vs. Model Estimate of PHr for the multiplicative model]
Multiplicative Model: est_PHr = 4.84 * Adj_FP^1.08 * PCON^2.72 (Adj. R² = 0.88)
Compare Models
Tutorial: Building Models 31
Model Accuracy
PRED(L) = X
• Means that the model estimates within L% of the actual values X% of the time
• In other words, how often does the model predict within the desired circle?
Models are necessarily incomplete and are not 100% accurate
Example: PRED(30) = 70
The model predicts within 30% of the actuals 70% of the time.
[Diagram: a target circle spanning within 30% of the actual value]
Tutorial: Building Models 32
Model Evaluation
PN   PHr/FP Model  Adj_PHr/Adj_FP Model  Linear Model  Multiplicative Model  Actual PHrs
1       476.80         314.80               612.26         200.76               300
2    11,097.52       7,326.97             7,703.17       7,026.10             6,400
3     5,066.00       3,344.75             2,512.57       2,570.26             3,602
4     2,157.52       1,424.47               339.16         849.69             1,550
5     3,671.36       2,423.96             9,565.01      10,274.27            14,080
6     1,942.96       1,282.81             2,764.17       1,227.46             1,090
7       882.08         582.38              -389.25         318.92               165
8     3,969.36       2,620.71             1,367.44       1,645.88             1,070
9     2,872.72       1,896.67             8,148.05       6,181.82             4,350
PRED(30)  0.0          0.55                 0.33           0.44
Which model would you choose?
Tutorial: Building Models 33
Summary -1: Two Models
[Figure: two side-by-side sketches. Left, the mean model: data points spread about the mean X with -3 SD to +3 SD bands. Right, the 1-variable regression model: a line fitted through scattered (X, Y) points, Y = A + MX + e]
Tutorial: Building Models 34
Summary -2
• Definition of a model
• Data specifications
• Normal versus T distribution
• Model characteristics
– Model usage
– Model boundaries
– Confidence interval
– Model accuracy
– Assignable causes of variation
• Large model examples
Tutorial: Building Models 35
Tutorial Outline
• Purpose
• A software engineering modeling example
• Model building steps
• Mean-based model exercise
• Regression-based model exercise
• Summary
Tutorial: Building Models 36
Modeling Steps: 1. Decide what relationship you would like to investigate
• What do you want to know?
– Estimation of Requirements Volatility
– Establishing thresholds for performance monitoring
– Working overtime’s effect on personnel turnover
– Estimation of the number of defects to be found before Final Acceptance Test
[Concept diagram of candidate measures: People, Code Size, Requirements, Defects, Build Duration, Rework, Documentation, Test Cases, Design Units, Cost, Req’ts Evolution, Process Maturity, Change Requests, Function Points]
Tutorial: Building Models 37
2. Identify assignable causes of variation (drivers)
• Use your experience and intuition
• Possible sources of variation:
– Customer participation
– Development team experience
– Application domain experience
– Complexity of application
– Development flexibility
– Design constraints
– Requirements volatility
– Adaptation of existing code
– Programming language experience
– Use of modern methodologies
– Compression of schedule
– Use of software tools
– Code inspections
– Management capability
– Team size
– Application size
– Personnel turnover
– Architecture & Risk resolution
Tutorial: Building Models 38
3. Collect data
• Specify data to be collected based on:
– assignable causes of variation
– what is available
• Select 10 projects to go back and collect extra data
– Based on project applicability
– Use measurement specifications as a checklist for each data item
• How much project data is enough?
Tutorial: Building Models 39
4. Normalize data and check for consistency
• In some cases it may be appropriate to normalize:
– normalize data to a “per unit” measure
• size
• defects
• calendar days, weeks, months
• effort hours, effort days, effort months
– normalize about the mean of the data to get percentage increase or decrease from the mean
• Plot data
– Check that known relationships exist
– Detect outliers and investigate
– Scatter plots are very useful
Tutorial: Building Models 40
5. Build model and evaluate
• Models should be:
– Simple
– Explainable
– Analyzable
– Most important: they should make sense!
• Models make data relationships explicit
– Show strength and direction of relationships
– While a relationship exists, it may not be valid or make sense
– The relationships you want to use in modeling are ones that show valid “cause and effect”
Tutorial: Building Models 41
6. Add or remove drivers
• Drivers are data attributes that explain (or drive) variation.
• The more drivers used in a model, the more data that must be collected.
• While a driver may make sense to use in explaining variation, the data may not support this conclusion
– Collect more data; the current dataset may be biased and not represent a true sample
– There may be drivers that are correlated; this could mask the effects of the weak-performing driver
• Warning: Correlation effects between drivers
Tutorial: Building Models 42
7. Repeat steps 3 to 7
• If the model does not have an acceptable accuracy, then:
– collect more data
– analyze it for its influence on variation
– add and remove cost drivers
– evaluate the model
Tutorial: Building Models 43
8. Pilot model
• Document the model and create a tool for its use
• The model must be piloted to test its reasonableness, understandability, and accuracy.
– Collect actual values of model inputs (including assignable causes of variation)
• The model should be used with its confidence interval
• Feedback should be incorporated into the model and tool
Tutorial: Building Models 45
Tutorial Outline
• Purpose
• A software engineering modeling example
• Model building steps
• Mean-based model exercise
• Regression-based model exercise
• Summary
Tutorial: Building Models 46
Exercise 1: Growth Model
• Modeling step #1: What do we want to investigate?
– We are going to develop a growth model based on real data in a report from the NASA Software Engineering Laboratory*
– A growth model increases size based on “other information”
– It will be used in estimating cost and schedule for future software projects
• When will the model be used?
– What information will be available at the time?
• What will be the scope of the model?
– What will be included or excluded in the estimate?
* Cost and Schedule Estimation Study Report, SEL-93-001
Tutorial: Building Models 47
Exercise 1: Assignable Causes of Variation
• Modeling step #2: What are possible causes (that can be controlled) of growth?
?
?
?
?
Tutorial: Building Models 48
Exercise 1: Survey the Data
• Modeling step #3: Collect data
• Using Microsoft Excel, open the file with the NASA SEL data.
• Select the data definitions worksheet
– Project type
– Programming Language
– Duration
– Effort for management and technical
– Estimated SLOC size
– Actual SLOC size
– New SLOC size
– Growth (derived)
– Reuse (derived)
Tutorial: Building Models 49
Exercise 1: Plot the Data
• Modeling step #4: Normalize data and check for consistency– Copy Data worksheet and name it “Scatter Plots”
• Create a scatter plot of the following data elements against Growth%– Project Type– Programming Language– Duration– Effort– Estimated SLOC– New SLOC– Reuse%
Tutorial: Building Models 50
Exercise 1: Check for Correlation
• Check for correlation of data elements to Growth%
– Excel: Tools -> Data Analysis -> Correlation (New Worksheet Ply: Correlation; this will create a new worksheet)
• Compare correlation numbers to scatter plots
– What can you conclude?

                  TypeN   LangN   Duration  Effort  SLOC_Est  SLOC_Act  SLOC_New  Growth
TypeN             1.000
LangN            -0.507   1.000
Duration (Weeks)  0.694  -0.274    1.000
Effort (Hours)    0.307  -0.392    0.681    1.000
SLOC_Est          0.248  -0.348    0.346    0.626    1.000
SLOC_Act          0.337  -0.412    0.524    0.785    0.919    1.000
SLOC_New          0.344  -0.363    0.724    0.972    0.566    0.758     1.000
Growth            0.041  -0.360    0.085   -0.001   -0.359   -0.095     0.138    1.000
Reuse%           -0.331   0.379   -0.623   -0.561    0.011   -0.164    -0.673   -0.472
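Each cell in that matrix is a Pearson correlation coefficient; what Excel's Correlation tool computes can be sketched in a few lines of Python (the `pearson` helper name and the toy data are ours):

```python
# Pearson correlation: covariance divided by the product of the standard deviations.
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A perfectly linear relationship gives ~1.0; values near 0 (like Growth vs. Effort
# in the matrix above, -0.001) indicate no linear relationship.
r = pearson([1, 2, 3, 4], [2, 4, 6, 8])
```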
Tutorial: Building Models 51
Exercise 1: Create Mean-Based Models
• Modeling step #5: Build model and evaluate
– Copy the Data worksheet and name it “Project-Models”
• Which relationships shown in the scatter plots looked most promising?
• Based on the intended model’s purpose, what data would be realistically available?
Tutorial: Building Models 52
Exercise 1: Create Project Type Mean-Based Models
• Build 3 Mean-Models based on “Project Type”
– Sort data by project type
• 1 - TS
• 2 - AGSS
• 3 - DS
– Excel: Tools -> Data Analysis -> Descriptive Statistics
– Input Range: TypeN (for one set of values)
– Output Range: swipe two empty columns
– Check Summary Statistics
– Check Confidence Interval: 80%
– Describe each model
• Mean, Standard Deviation, Min - Max values, Number of data points, 80% confidence intervals
Tutorial: Building Models 53
Exercise 1: TS Mean-Model
Annotations on the Excel output:
– Mean: sum of productivities / no. of projects
– Standard Error: std deviation / SQRT(no. of projects)
– Median: middle Growth% value
– Standard Deviation: spread; SQRT(Variance)
– Minimum: smallest TS project Growth%
– Maximum: largest TS project Growth%
– Confidence Level: confidence interval within which lies the real population mean

TS Projects Growth
Mean:                      0.21
Standard Error:            0.05
Median:                    0.20
Mode:                      0.20
Standard Deviation:        0.11
Sample Variance:           0.01
Kurtosis:                  2.82
Skewness:                  1.49
Range:                     0.30
Minimum:                   0.10
Maximum:                   0.40
Sum:                       1.05
Count:                     5.00
Confidence Level (80.0%):  0.08
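Most of Excel's Descriptive Statistics output maps directly onto the Python `statistics` module. The five Growth% values below are hypothetical, chosen only to be consistent with the summary on the slide; the real values come from the SEL dataset:

```python
# Reproducing the core of Excel's Descriptive Statistics output with the stdlib.
import math
import statistics

growth = [0.10, 0.15, 0.20, 0.20, 0.40]  # HYPOTHETICAL TS-project Growth% values

n = len(growth)
summary = {
    "Mean": statistics.mean(growth),                 # 0.21
    "Median": statistics.median(growth),             # 0.20
    "Mode": statistics.mode(growth),                 # 0.20
    "Standard Deviation": statistics.stdev(growth),  # ~0.11
    "Minimum": min(growth),
    "Maximum": max(growth),
    "Count": n,
    # 80% confidence half-width; t critical value for df = 4 hard-coded from a t-table
    "Confidence Level (80.0%)": 1.533 * statistics.stdev(growth) / math.sqrt(n),
}
```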
Tutorial: Building Models 54
Exercise 2: Create Reuse% Mean-Based Model
• Build a Mean-Based Model for Reuse%
– Looking at the scatter plot, how can this data be stratified?
– Copy the Data worksheet and name it “Reuse-Model”
– Sort data by Reuse%
– Use Descriptive Statistics to build two Reuse% models based on stratified data
• Excel: Tools -> Data Analysis -> Descriptive Statistics (80% Confidence Interval)
– Describe each model
• Mean, Standard Deviation, Min - Max values, Number of data points, 80% confidence intervals
Tutorial: Building Models 55
Mean-Based Modeling Conclusions
• Correlation analysis versus scatter plots
• Variation in the data
– When to use the mean versus the median
– Stratifying or categorizing data
– Determines (in part) the confidence interval
• The number of data points is important
• Minimum and maximum values set model boundaries
• The mean is a model that describes data “on average”
• The standard deviation is a model that describes distances “in general”
Tutorial: Building Models 56
Tutorial Outline
• Purpose
• A software engineering modeling example
• Model building steps
• Mean-based model exercise
• Regression-based model exercise
• Summary
Tutorial: Building Models 57
Linear Regression Models
• Statistical regression fits a line through points, minimizing the least-square error between the points and the line
• The regression analysis yields a line with a slope, M, and intercept, A:
Y = A + MX
• The goodness of fit is given by a statistic called R². The closer to 1.0, the better the fit.
[Scatter plot: Unadjusted Function Points (0 to 2,500) vs. COBOL KSLOC (0.0 to 350.0) with trend line]
y = 6.1365x + 206.12, R² = 0.7286
Tutorial: Building Models 58
Two Types of Regression Models
• Additive (linear)
Y = A + MX + e
• Multiplicative
Y = A • X^M • e
• The regression technique requires a linear form
– Works for the first model form
– Does not work for the second model form
• Non-linear models must be transformed into a linear form
Tutorial: Building Models 59
Transforming Non-Linear Models
• Log-log transformation
Y = A • X^M
ln(Y) = ln(A) + M • ln(X)
• Reversing the log-log transformation (the log-space regression returns an intercept a = ln(A) and slope M)
e^ln(Y) = e^[a + M • ln(X)]
Y = e^a • X^M
A = e^a
Y = A • X^M
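A quick numeric check of the transform, with hypothetical coefficients (A = 3, M = 2 are ours, chosen for illustration): fitting ln(Y) = ln(A) + M·ln(X) in log-space and exponentiating the intercept recovers the multiplicative model.

```python
# Verify that reversing the log-log transform reproduces Y = A * X^M.
import math

A, M = 3.0, 2.0
X = 4.0
Y = A * X ** M                     # the multiplicative model

# In log-space the same relationship is linear; a regression on logged
# data would estimate these two numbers:
intercept, slope = math.log(A), M

# Back-transform: exponentiate the linear form
Y_back = math.exp(intercept + slope * math.log(X))
assert math.isclose(Y, Y_back)
```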
Tutorial: Building Models 60
Exercise 3: Create Additive Duration Regression Model
• Modeling step #5: Build model and evaluate
– Copy the Data worksheet and name it “Duration-Model”
• Examine the Duration Relationship to Growth% in the scatter plots
• Based on the intended model’s purpose, would this data be realistically available?
Tutorial: Building Models 61
Exercise 3: Create Additive Model
• Build a Regression Model based on Duration
– Select Tools -> Data Analysis -> Regression
– Y Input range: Growth% data
– X Input range: Duration data
– Select labels
– Confidence interval set to 80%
– Output range: select 7 columns for the output
– Select residuals
– Select: OK
– Describe the model
• Intercept (A), Slope (M), R² of the model, Min - Max values, Number of data points, 80% confidence intervals for A and M
– Create a scatter plot of the data with trend line
Tutorial: Building Models 62
Exercise 3: Additive Duration Model
SUMMARY OUTPUT

Regression Statistics
Multiple R:         0.5720
R Square:           0.3272
Adjusted R Square:  0.2661
Standard Error:     29.2818
Observations:       13

ANOVA
            df   SS          MS        F       Significance F
Regression   1   4587.543    4587.543  5.3504  0.0411
Residual    11   9431.688     857.426
Total       12  14019.231

           Coefficients  Standard Error  t Stat   P-value  Lower 80.0%  Upper 80.0%
Intercept    -40.5278       34.2931     -1.1818   0.2622    -87.2840      6.2284
Duration       0.7177        0.3103      2.3131   0.0411      0.2946      1.1407

Growth% = -40.5 + 0.72 Duration
Growth% = -87.3 + 0.30 Duration (lower 80%)
Growth% = 6.23 + 1.14 Duration (upper 80%)
Tutorial: Building Models 63
Additive Model Scatter Plot
[Scatter plot: Growth (-20 to 140) vs. Duration (0 to 200) with trend line]
y = 0.7177x - 40.528, R² = 0.3272
Tutorial: Building Models 64
Exercise 4: Create Multiplicative Duration Regression Model
• Modeling step #5: Build model and evaluate
– Copy the Data worksheet and name it “Ln-Duration-Model”
• Transform the Growth% and Duration data into log-space by taking the logarithms of each column
– Insert a new column next to Growth% and Duration
– Label them Ln-Growth% and Ln-Duration
– In each new column take the logarithm of the column next to it
• e.g. in cell H2 type =LN(G2); copy this formula into the remaining cells
Tutorial: Building Models 65
Exercise 4: Create Multiplicative Model
• Build a Regression Model based on Ln-Duration
– Use the same procedures as last time
• Transform the results back into normal-space (the regression returns an intercept a = ln(A) and slope b)
e^ln(y) = e^[a + b • ln(x)]
y = e^a • x^b
A = e^a <- all we have to do is raise e to the intercept: =EXP(intercept)
• Describe the model
– Intercept (A), Slope (M), R² of the model, Min - Max values, Number of data points, 80% confidence intervals for A and M
• Create a scatter plot of the data with trend line
Tutorial: Building Models 66
Exercise 4: Multiplicative Duration Model
SUMMARY OUTPUT

Regression Statistics
Multiple R:         0.5593
R Square:           0.3128
Adjusted R Square:  0.2503
Standard Error:     0.7350
Observations:       13

ANOVA
            df   SS       MS       F       Significance F
Regression   1   2.7053   2.7053   5.0074  0.0469
Residual    11   5.9429   0.5403
Total       12   8.6482

                Coefficients  Standard Error  t Stat   P-value  Lower 80.0%  Upper 80.0%
Intercept         -4.2214       3.3529      -1.2590   0.2341    -8.7929      0.3501
ln-Duration        1.6138       0.7212       2.2377   0.0469     0.6305      2.5970
exp(Intercept)     0.014678     (0.000152 and 1.419201 are the lower and upper 80% intercept bounds transformed back)

Growth% = 0.015 * Duration^1.6
Growth% = 0.0002 * Duration^0.63 (lower 80%)
Growth% = 1.42 * Duration^2.6 (upper 80%)
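The back-transform step can be sketched directly from the regression output above (the `est_growth` helper name is ours; the intercept and slope come from the SUMMARY OUTPUT):

```python
# Back-transforming the log-space regression: the intercept lives in log-space,
# so A = exp(intercept); the slope is used directly as the exponent.
import math

intercept = -4.221394794  # from the SUMMARY OUTPUT above
slope = 1.613752301

A = math.exp(intercept)   # ~0.0147, matching the exp(Intercept) row

def est_growth(duration):
    """Multiplicative model: Growth% = A * Duration^slope."""
    return A * duration ** slope
```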
Tutorial: Building Models 67
Multiplicative Model Scatter Plot
[Scatter plot: Growth% (0 to 140) vs. Weeks (0 to 200) with power trend line]
y = 0.0147x^1.6138, R² = 0.3128
Tutorial: Building Models 68
Multiplicative Model Scatter Plot (Log-Log scale)
[Scatter plot: Growth% vs. Weeks, both on log scales from 0 to 6]
y = 1.6138x - 4.2214, R² = 0.3128
Tutorial: Building Models 69
Model Comparison -1
• Using the two models, estimate Growth%
– In each Duration worksheet, create a new column next to Growth% and Ln-Growth%
– Label it est. Growth%
– Using the models created in Exercises 3 and 4, compute the estimated Growth%
• One model is additive: Growth% = -40.5 + 0.72 Duration
• One model is multiplicative: Growth% = 0.015 * Duration^1.6
Tutorial: Building Models 70
Model Comparison -2
• Compute the Magnitude of Relative Error (MRE) for each Growth estimate:
– Create a new column next to the est. Growth% column
– Label it MRE
– Compute MRE: |Actual Growth% - Estimated Growth%| / Actual Growth%
• Count the errors that are less than or equal to 30% and divide by the number of data points. This is PRED(30).
Tutorial: Building Models 71
Model Comparison Results
PN   Growth%  Additive Model  Multiplicative Model
2       5         0.43            30.15
3      50         0.66            45.00
4      30         0.45            31.40
5      30        41.80            29.73
6      40        30.28            23.39
7      80        64.84            44.04
8     130        51.16            35.29
9      20        26.68            21.53
10     20        32.44            24.54
11     15        18.76            17.64
12     10        -6.44             7.35
13     25        20.20            18.33
14     20        38.92            28.09
PRED(30)           31              62
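The MRE and PRED(30) calculation above can be reproduced in Python (the actuals and estimates are the columns of the table; the `pred` helper name is ours):

```python
# Computing MRE and PRED(30) for the model comparison above.
actual = [5, 50, 30, 30, 40, 80, 130, 20, 20, 15, 10, 25, 20]
additive = [0.43, 0.66, 0.45, 41.80, 30.28, 64.84, 51.16, 26.68, 32.44, 18.76, -6.44, 20.20, 38.92]
multiplicative = [30.15, 45.00, 31.40, 29.73, 23.39, 44.04, 35.29, 21.53, 24.54, 17.64, 7.35, 18.33, 28.09]

def pred(actuals, estimates, level=0.30):
    """Fraction of estimates whose magnitude of relative error is within `level`."""
    mre = [abs(a - e) / a for a, e in zip(actuals, estimates)]
    return sum(m <= level for m in mre) / len(mre)

print(round(100 * pred(actual, additive)))        # -> 31
print(round(100 * pred(actual, multiplicative)))  # -> 62
```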
Tutorial: Building Models 72
Tutorial Outline
• Purpose• A software engineering modeling example• Model building steps• Mean-based model exercise• Regression based model exercise• Summary
Tutorial: Building Models 73
Summary -1
• Modeling steps
1. Decide what relationship you would like to investigate
2. Identify assignable causes of variation (drivers)
3. Collect data
4. Normalize data and check for consistency
5. Build model and evaluate
6. Add or remove drivers
7. Repeat steps 3 to 7
8. Pilot model
• Mean-Based Models
– Scatter plot versus correlation analysis
– Stratify data to identify different relationships
– Mean, Standard Deviation, Min - Max values, Number of data points, 80% confidence intervals
Tutorial: Building Models 74
Summary -2
• Regression-Based Models
– Additive: Y = A + MX
– Multiplicative: Y = A * X^M
– Use of logarithms to transform multiplicative into additive
– Analysis in log-space versus linear-space
– Use of PRED(L) as a measure of model performance
• Data defines the model!
– Data quality
– Scope of coverage: life-cycle phases
– Depth of coverage: included / excluded in the count
– Correlation effects among assignable causes of variation
– Min and Max inputs (based on low and high data points)
Tutorial: Building Models 75
Further Information
• Cost and Schedule Estimation Study, NASA Software Engineering Laboratory, SEL-93-002, Nov 1993
• Introductory Statistics: Concepts, Models, and Applications by David Stockburger, www.atomicdogpublishing.com, 2nd ed., 2001
• Practical Software Measurement: Objective Information for Decision Makers by John McGarry, David Card, Cheryl Jones, Beth Layman, Elizabeth Clark, Joseph Dean, and Fred Hall, Addison-Wesley, 2001
• Software Cost Estimation with COCOMO II by Barry Boehm, Chris Abts, Winsor Brown, Sunita Chulani, Brad Clark, Ellis Horowitz, Ray Madachy, Donald Reifer, and Bert Steece, Prentice Hall PTR, 2000
• Statistics, Data Analysis, and Decision Making by James Evans and David Olson, Prentice-Hall, 1999
• Statistical Analysis Simplified by Glen Hoffherr and Robert Reid, McGraw-Hill, 1997
Tutorial: Building Models 76
Contact Information
Brad Clark
Software Metrics, Inc.
Washington, D.C. area
(703) 754-0115
http://www.software-metrics.com