Elaine Weyuker August 2014

Page 1: Elaine Weyuker August 2014.  To determine which files of a large software system with multiple releases are likely to contain the largest numbers of

Elaine Weyuker, August 2014

To determine which files of a large software system with multiple releases are likely to contain the largest numbers of bugs in the next release.

Help testers prioritize testing efforts.

Help developers decide when to do design and code reviews and what to re-implement.

Help managers allocate resources.

Verified that bugs were non-uniformly distributed among files.

Identified properties that correlated most closely with fault-proneness, and then built a statistical model and ultimately a tool to make predictions.

Size of file (KLOCs).

Number of changes to the file in the previous 2 releases.

Number of bugs in the file in the last release.

Age of file (Number of releases in the system)

Language the file is written in.
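The five predictors listed above could be carried around as one record per file per release. This is a minimal sketch only: the class name, field names, and the example file are illustrative assumptions, not the tool's actual data layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilePredictors:
    """Per-file inputs for one release (all names here are hypothetical)."""
    path: str
    kloc: float            # size of the file in thousands of lines of code
    recent_changes: int    # changes to the file in the previous 2 releases
    prior_bugs: int        # bugs in the file in the last release
    age_in_releases: int   # number of releases the file has been in the system
    language: str          # language the file is written in


# Hypothetical example record; the values are made up for illustration.
example = FilePredictors(
    path="src/billing/invoice.c",
    kloc=1.8,
    recent_changes=5,
    prior_bugs=2,
    age_in_releases=7,
    language="C",
)
```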

All of the systems we’ve studied to date used a configuration management system which integrates version control and change management functionality, including bug history.

Initially, we manually read Modification Requests (MRs) and decided which were bugs and what sorts of changes, if any, were made to files.

Once we had a tool, data was automatically extracted and passed to the prediction engine.
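Once each MR has been classified, per-file bug counts for a release follow from a simple tally. The sketch below assumes a hypothetical MR record with an `is_bug` flag and a list of changed files; the real MR schema is not shown in the slides.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Tuple


class ModificationRequest(NamedTuple):
    """Hypothetical MR record; field names are assumptions."""
    mr_id: str
    is_bug: bool                  # outcome of the bug / non-bug decision
    release: str
    files_changed: Tuple[str, ...]


def bugs_per_file(mrs: Iterable[ModificationRequest], release: str) -> Counter:
    """Count, for one release, how many bug MRs touched each file."""
    counts: Counter = Counter()
    for mr in mrs:
        if mr.is_bug and mr.release == release:
            # A file is counted once per MR, even if listed more than once.
            counts.update(set(mr.files_changed))
    return counts


# Made-up MRs for illustration only.
mrs = [
    ModificationRequest("MR-1", True, "R2", ("a.c", "b.c")),
    ModificationRequest("MR-2", False, "R2", ("a.c",)),
    ModificationRequest("MR-3", True, "R2", ("a.c",)),
    ModificationRequest("MR-4", True, "R1", ("b.c",)),
]
r2_counts = bugs_per_file(mrs, "R2")
```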

Used Negative Binomial Regression

Also considered machine learning algorithms including:
◦ Recursive Partitioning
◦ Random Forests
◦ BART (Bayesian Additive Regression Trees)
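For reference, negative binomial regression in its common (NB2) form models the bug count of file i as shown below. The symbols are generic: x_ij stands for the j-th predictor of file i, the β_j are fitted coefficients, and α is the dispersion parameter. The slides do not give the fitted model or any variable transformations, so none are assumed here.

```latex
\begin{aligned}
Y_i &\sim \mathrm{NegBin}(\mu_i, \alpha), \\
\log \mu_i &= \beta_0 + \sum_{j} \beta_j x_{ij}, \\
\operatorname{Var}(Y_i) &= \mu_i + \alpha \mu_i^{2}.
\end{aligned}
```

The extra term α μ_i² lets the variance exceed the mean, which suits bug counts that are concentrated in a small share of files.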

Consists of two parts.

The back end extracts data needed to make the predictions.

The front end makes the predictions and displays them.

Extracts necessary data from the repository.

Predicts how many bugs will be in each file in the next release of the system.

Sorts the files in decreasing order of the number of predicted bugs.

Displays results to user.
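The predict, sort, and display steps above can be sketched as follows. The `toy_predict` function is a made-up stand-in for the fitted statistical model, and the file names and feature values are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def rank_files(
    features: Dict[str, dict],
    predict: Callable[[dict], float],
) -> List[Tuple[str, float]]:
    """Predict a bug count per file, then sort in decreasing order."""
    predictions = {path: predict(x) for path, x in features.items()}
    return sorted(predictions.items(), key=lambda kv: kv[1], reverse=True)


def toy_predict(x: dict) -> float:
    # Stand-in only: NOT the tool's model, just something to rank by.
    return 0.5 * x["prior_bugs"] + 0.1 * x["recent_changes"]


# Hypothetical extracted data for three files.
features = {
    "a.c": {"prior_bugs": 2, "recent_changes": 5},
    "b.c": {"prior_bugs": 0, "recent_changes": 1},
    "c.c": {"prior_bugs": 4, "recent_changes": 0},
}

ranking = rank_files(features, toy_predict)
for path, predicted in ranking:
    print(f"{path:10s} {predicted:6.2f}")
```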

Percentage of actual bugs that occurred in the N% of the files predicted to have the largest number of bugs. (N=20)

Considered other measures less sensitive to the specific value of N.
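The N% measure can be computed directly from predicted and actual per-file counts. A minimal sketch, assuming the cutoff is rounded up to a whole number of files and that ties are broken by sort order; the function name and the data are illustrative.

```python
import math
from typing import Dict


def pct_faults_in_top_files(
    predicted: Dict[str, float],
    actual: Dict[str, int],
    n_percent: float = 20.0,
) -> float:
    """Percentage of actual bugs found in the top n_percent of files,
    where files are ranked by predicted bug count (largest first)."""
    ranked = sorted(predicted, key=predicted.get, reverse=True)
    k = math.ceil(len(ranked) * n_percent / 100.0)  # assumption: round up
    total = sum(actual.values())
    if total == 0:
        return 0.0
    found = sum(actual.get(path, 0) for path in ranked[:k])
    return 100.0 * found / total


# Made-up data: 10 files, predictions decreasing from f0 to f9.
predicted = {f"f{i}": float(10 - i) for i in range(10)}
actual = {"f0": 6, "f1": 2, "f5": 2}

score = pct_faults_in_top_files(predicted, actual, n_percent=20.0)
```

With these made-up numbers the top 20% is two files (f0 and f1), which hold 8 of the 10 actual bugs.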

System  Years Followed  Releases  LOC     % Faults in Top 20% of Files
NP      4               17        538K    83%
WN      2               9         438K    83%
VT      2.25            9         329K    75%
TS      9+              35        442K    81%
TW      9+              35        384K    93%
TE      7               27        327K    76%
IC      4               18        1520K   91%
AR      4               18        281K    87%
IN      4               18        2116K   93%

What Are We Missing?

Fault Prediction Tool Overview

[Diagram labels: Version Mgmt / Fault Database (previous releases); Release to be predicted; User-supplied parameters; Prediction Engine (Statistical Analysis); Fault-proneness predictions]

User enters system name.

User specifies that all problems reported in System Test phase are faults.

User asks for fault predictions for release “Bluestone2008.1”.

Available releases are found in the version mgmt database. User chooses the releases to analyze.

User selects 4 file types.

User confirms configuration.

User enters filename to save the configuration.

User clicks Save & Run button, to start the prediction process.
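The choices made in these screens amount to a small configuration record that is saved to a file before the run. The sketch below is an assumption about how such a record might look: the JSON format, the class and field names, the system name, the history release names, and three of the four file types are hypothetical. Only the target release “Bluestone2008.1” and the eC file type appear in the slides.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class PredictionConfig:
    """Hypothetical saved configuration for one prediction run."""
    system: str
    fault_phases: List[str]      # phases whose problem reports count as faults
    target_release: str
    history_releases: List[str]  # releases chosen for analysis
    file_types: List[str]


def save_config(cfg: PredictionConfig, path: str) -> None:
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(asdict(cfg), fh, indent=2)


def load_config(path: str) -> PredictionConfig:
    with open(path, encoding="utf-8") as fh:
        return PredictionConfig(**json.load(fh))


cfg = PredictionConfig(
    system="Bluestone",                       # hypothetical system name
    fault_phases=["System Test"],
    target_release="Bluestone2008.1",
    history_releases=["Bluestone2007.3", "Bluestone2007.4"],  # hypothetical
    file_types=["eC", "C", "java", "sql"],    # all but eC are hypothetical
)

# Round trip through a temporary file.
config_path = os.path.join(tempfile.mkdtemp(), "bluestone.json")
save_config(cfg, config_path)
restored = load_config(config_path)
```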

Initial prediction view for Bluestone2008.1

All files are listed in decreasing order of predicted faults

Listing is restricted to eC files

Listing is restricted to 10% of eC files
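The two restrictions shown in these views (by file type, then to a percentage of the listing) can be sketched as a filter over the ranked list. This assumes the 10% means the top 10% by predicted faults, rounded up to a whole number of files; the row layout and file names are hypothetical.

```python
import math
from typing import List, Optional, Tuple

Row = Tuple[str, str, float]  # (path, file type, predicted faults)


def restrict_listing(
    ranked: List[Row],
    file_type: Optional[str] = None,
    top_percent: float = 100.0,
) -> List[Row]:
    """Filter an already-ranked listing by file type, then keep the
    top `top_percent` of what remains."""
    rows = [r for r in ranked if file_type is None or r[1] == file_type]
    keep = math.ceil(len(rows) * top_percent / 100.0)  # assumption: round up
    return rows[:keep]


# Made-up listing, already in decreasing order of predicted faults.
ranked = [("x.c", "C", 12.0)] + [
    (f"m{i}.ec", "eC", 10.0 - i) for i in range(10)
]

ec_only = restrict_listing(ranked, file_type="eC")
ec_top10 = restrict_listing(ranked, file_type="eC", top_percent=10.0)
```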

Prediction tool is fully operational:
◦ 750 lines Python
◦ 2150 lines C, 75K bytes compiled

The current version’s back end is specific to the internal AT&T configuration management system, but it can be adapted to other configuration management systems. All that is needed is a source of the data required by the prediction model.
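One way to picture that adaptation is a narrow interface that any configuration management system could implement. This is a sketch of the idea, not the tool's design: the `FaultDataSource` protocol, its method names, and the in-memory stand-in are all hypothetical.

```python
from typing import Dict, Iterable, List, Protocol


class FaultDataSource(Protocol):
    """Hypothetical interface a back end would need to provide."""

    def releases(self) -> List[str]:
        """Release names, oldest first."""
        ...

    def file_records(self, release: str) -> Iterable[Dict[str, object]]:
        """Per-file data (size, changes, bugs, ...) for one release."""
        ...


class InMemorySource:
    """Toy stand-in for a real configuration management system."""

    def __init__(self, data: Dict[str, List[Dict[str, object]]]) -> None:
        self._data = data

    def releases(self) -> List[str]:
        return list(self._data)

    def file_records(self, release: str) -> Iterable[Dict[str, object]]:
        return self._data[release]


def collect_history(source: FaultDataSource, target: str) -> Dict[str, list]:
    """Gather records for every release that precedes `target`."""
    history: Dict[str, list] = {}
    for release in source.releases():
        if release == target:
            break
        history[release] = list(source.file_records(release))
    return history


# Made-up data for three releases.
source = InMemorySource({
    "R1": [{"path": "a.c", "bugs": 1}],
    "R2": [{"path": "a.c", "bugs": 0}, {"path": "b.c", "bugs": 3}],
    "R3": [],
})
history = collect_history(source, target="R3")
```

The prediction side depends only on the two methods, so swapping the back end means writing one new class.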

Developers
◦ Counts
◦ Individuals

Calling Structure

Amount of Code Change