Identifying Reasons for Software Changes Using Historic Databases
The CISC 864 Analysis
By Lionel Marks
Purpose of the Paper
Using the textual description of a change, try to understand why that change was performed (Adaptive, Corrective, or Perfective)
Observe how difficulty, size, and interval vary across the different types of changes
Three Different Types of Changes
Traditionally, the three types of changes are (taken from the ELEC 876 slides): adaptive is restructuring code to accommodate future changes, corrective is fixing faults, and perfective is adding new features wanted by the customer
Three Types of Changes in This Paper
Adaptive: Adding new features wanted by the customer (switched with perfective)
Corrective: Fixing faults
Perfective: Restructuring code to accommodate future changes (switched with adaptive)
They did not say why they changed these definitions
The Case Study Company
This paper did not divulge the company it used for its case study
It is an actual business
Developer names and actions were kept anonymous in the study
This allowed them to study a real system that has lasted for many years and has a large (and old) version control system
Structure of the ECMS
The company's source code control system is the ECMS (Extended Change Management System)
MRs vs. Deltas
Each MR (Modification Request) could have multiple deltas of changes to one file
Delta: each time a file was "touched"
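The ECMS data model is only described in prose here; as a rough illustration, a minimal Python sketch of the MR/delta relationship, with made-up field names (the actual ECMS schema is not given):

```python
# Minimal sketch of the MR/delta relationship: an MR carries a free-text
# abstract and one delta per "touch" of a file, possibly several per file.
# Field names are hypothetical illustrations, not the ECMS schema.
from dataclasses import dataclass, field

@dataclass
class Delta:
    file: str            # the file that was "touched"
    lines_added: int
    lines_deleted: int

@dataclass
class MR:
    abstract: str                          # description written by the developer
    deltas: list[Delta] = field(default_factory=list)

mr = MR(abstract="fix null pointer in report module",
        deltas=[Delta("report.c", 12, 3), Delta("report.h", 1, 0)])
print(len(mr.deltas))   # one MR, two deltas
```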
The Test System
Called "System A" for anonymity purposes
Has: 2M lines of source code, 3000 files, 100 modules
Over the last 10 years: 33,171 MRs, with an average of 4 deltas each
How they Classified Maintenance Activities (Adaptive, Corrective, Perfective)
If you were given this project, you have:
The CVS repository, and access to the descriptions along with commits
The goal of labelling each commit as “Adaptive”, “Corrective”, or “Perfective”.
What would you intuitively study in the descriptions?
How they Classified Maintenance Activities (Adaptive, Corrective, Perfective)
They had a 5-step process:
1. Cleanup and normalization
2. Word Frequency Analysis
3. Keyword Clustering and Classification
4. MR abstract classification
5. Repeat analysis from step 2 on unclassified MR abstracts
Step 1: Cleanup and Normalization
Their approach used WordNet
Software that eliminates prefixes and suffixes to get back to the root word, e.g. "fixing" and "fixes" both reduce to the root word "fix"
WordNet also has a synonym feature, but it was not used
Synonyms would be hard to correlate properly to the context of software maintenance, and could be misinterpreted
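The slides do not show the tooling; a minimal sketch of this kind of root-word normalization, using NLTK's WordNet lemmatizer as a stand-in (an assumption; the paper only says WordNet was used):

```python
# Sketch of Step 1 (cleanup and normalization): lowercase, strip
# punctuation, and reduce each word to its root so that "fixing" and
# "fixes" both become "fix". Requires nltk.download("wordnet").
import re
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def normalize(description: str) -> list[str]:
    """Return the list of root words in one MR abstract."""
    words = re.findall(r"[a-z]+", description.lower())
    # Lemmatize as verbs so common maintenance words collapse to one root.
    return [lemmatizer.lemmatize(w, pos="v") for w in words]

print(normalize("Fixing the crash; also fixes the build"))
# words like "fixing" and "fixes" both normalize to "fix"
```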
Step 2: Word Frequency Analysis
Determine the frequency of a set of words in the descriptions (a histogram for each description)
What words in the English language would be "neutral" to these classifications and be noise in this experiment?
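A possible sketch of the frequency count over the normalized abstracts; the sample abstracts and the use of Python's collections.Counter are illustrative assumptions, not the paper's implementation:

```python
# Sketch of Step 2: count how often each word appears across the
# normalized MR abstracts (the output of Step 1). The abstracts below
# are made-up examples.
from collections import Counter

abstracts = [
    ["fix", "null", "pointer", "error"],
    ["add", "new", "report", "feature"],
    ["cleanup", "of", "old", "include", "file"],
]

frequencies = Counter(word for abstract in abstracts for word in abstract)
print(frequencies.most_common(5))
```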
Step 3: Keyword Clustering
Classification was done by a human reading the descriptions of 20 randomly selected changes for each selected term in their set, such as "cleanup" meaning perfective maintenance
If a word matched its candidate classification in fewer than 75% of the sampled cases, it was deemed "neutral"
They found that "rework" was used a lot during "code inspection" (a new classification)
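One way the 75% threshold could be applied, as a sketch; the data layout and the helper name keyword_status are hypothetical, and the human labels stand in for the manual reading step:

```python
# Sketch of the Step 3 rule: keep a keyword for a class only if it agrees
# with the human label in at least 75% of a random sample of MRs that
# contain it; otherwise mark the keyword "neutral".
import random

def keyword_status(keyword, candidate_class, labelled_mrs, sample_size=20):
    """labelled_mrs: list of (abstract_words, human_class) pairs."""
    containing = [(words, cls) for words, cls in labelled_mrs if keyword in words]
    if not containing:
        return "neutral"
    sample = random.sample(containing, min(sample_size, len(containing)))
    agree = sum(1 for _, cls in sample if cls == candidate_class)
    return candidate_class if agree / len(sample) >= 0.75 else "neutral"
```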
Step 4: MR Classification Rules
Like the "hard-coded" answer when the learning algorithm fails
If an inspection word is found, the MR is deemed an inspection classification
If fix, bug, error, fixup, or fail are present, the change is corrective
If more than one type of keyword is present, the dominating frequency wins
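A sketch of these precedence rules; only the corrective keywords (fix, bug, error, fixup, fail) and the inspection override come from the slides, while the adaptive and perfective keyword lists are placeholders:

```python
# Sketch of the Step 4 rules: inspection keywords override everything,
# otherwise the keyword type with the dominating frequency wins.
from collections import Counter

KEYWORDS = {
    "inspection": {"inspection", "rework"},
    "corrective": {"fix", "bug", "error", "fixup", "fail"},   # from the slides
    "adaptive":   {"add", "new", "feature"},                   # assumed examples
    "perfective": {"cleanup", "restructure"},                  # assumed examples
}

def classify_mr(words):
    """words: normalized abstract of one MR (output of Steps 1-2)."""
    counts = Counter(words)
    if any(counts[w] for w in KEYWORDS["inspection"]):
        return "inspection"
    totals = {cls: sum(counts[w] for w in kws)
              for cls, kws in KEYWORDS.items() if cls != "inspection"}
    if not any(totals.values()):
        return "unclassified"
    return max(totals, key=totals.get)

print(classify_mr(["fix", "null", "pointer", "error"]))   # -> corrective
```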
Step 5: Cycle Back to Step 2
As in Step 2, you cannot cover the frequency of every word in the documents all at once, so take some more words now
Perform more "learning" and see if new frequent terms fit
Use static rules to resolve unclassified descriptions
When all else failed, the change was considered to be corrective
Case Study: Compare Against Human Classification
20 candidates, 150 MRs
More than 61% of the time, the tool and the human classifiers came to the same classification
Kappa and ANOVA were used to show significance in the results
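For reference, a sketch of how raw agreement and Cohen's kappa can be computed; the label lists are made-up examples, and scikit-learn is an assumption (the slides do not name the statistics tooling):

```python
# Sketch of the agreement check: raw agreement and Cohen's kappa between
# the tool's labels and one human rater's labels on the same MRs.
from sklearn.metrics import cohen_kappa_score

tool_labels  = ["corrective", "adaptive", "corrective", "perfective", "inspection"]
human_labels = ["corrective", "adaptive", "perfective", "perfective", "inspection"]

raw_agreement = sum(t == h for t, h in zip(tool_labels, human_labels)) / len(tool_labels)
kappa = cohen_kappa_score(tool_labels, human_labels)
print(f"raw agreement = {raw_agreement:.2f}, kappa = {kappa:.2f}")
```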
How Purposes Affect Size and Interval
Corrective and adaptive changes had the lowest change intervals
New code development and inspection changes added the most lines
Inspection changes deleted the most lines
Differences in the distribution functions are significant at the 0.01 level
ANOVA showed significance as well, but is inappropriate due to the skewed distributions
Change Difficulty
20 candidates, 150 MRs
Goal: to model the difficulty of each MR. Is the classification significant?
Modeling Difficulty
Modeling of size: deltas (number of files touched)
Difficulty changed with the number of deltas, except in corrective and perfective (changes in SW/HW) changes
Length of time was modeled in difficulty as well
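One way such a difficulty model could be set up, as a sketch; the toy data, column names, and the use of an ordinary least squares fit from statsmodels are assumptions, not the paper's actual model:

```python
# Sketch: regress reported difficulty on MR size (deltas), open interval,
# and change type, in the spirit of the slide. All data here is made up.
import pandas as pd
import statsmodels.formula.api as smf

mrs = pd.DataFrame({
    "difficulty": [1, 2, 3, 2, 4, 3, 1, 4],      # rating given by the developer
    "deltas":     [1, 3, 8, 2, 10, 6, 1, 9],      # number of deltas in the MR
    "interval":   [2, 5, 20, 4, 30, 15, 3, 25],   # days the MR stayed open
    "type": ["corrective", "adaptive", "new", "corrective",
             "new", "adaptive", "corrective", "new"],
})

model = smf.ols("difficulty ~ deltas + interval + C(type)", data=mrs).fit()
print(model.params)   # inspect the fitted coefficients per predictor
```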
Likes and Dislikes of this Paper
Likes:
The algorithm used to make classifications was a good way to break down the problem
The accumulation graphs were interesting
Their use of a real company is also a breath of fresh air: real data!
Dislikes:
Asking developers months after the work how hard the changes were; there is no better way at the moment, but results can be skewed with time
Because a real company was used, the anonymity made the product comparison in the paper less interesting