
  • 8/6/2019 PAPER a New Method to Predict Software Defect Based on Rough Sets

    1/6

    A New Method to Predict Software Defect Based on Rough Sets

VIVEKSUBRAMANIAN M (10MSC0019), B. K. TRIPATHY, RAJMOHAN K (10MSC0024)

Objectives

High-quality software assurance requires extensive assessment, yet most software organizations do not allocate sufficient resources to software quality. We study defect detectors with a focus on the data sets used for software defect prediction. In this paper, a rough set model is presented to deal with the attributes of software defect prediction data sets. The model is applied to the most famous public domain data set, created by NASA's Metrics Data Program, where it shows good and extraordinary performance.

Literature Review

High-quality software should have as few defects as possible, and therefore requires extensive and expensive assessment; yet most software organizations frequently do not allocate enough resources for quality. A variety of software defect prediction techniques have therefore been proposed, such as statistical methods, machine learning techniques [1] and mixed techniques [2]. Most of the wide range of prediction models use size and complexity metrics of software to predict defects. Others are based on the 'quality' of the development process [3], or take a multivariate approach [4].

Data sets used in these studies fall into three categories: original, open source, and public domain. First, original data sets are usually used in empirical studies in industry, for example data from software development at the Philips Software Centre (PSC). Second, as for open source software data, studies such as [6] collected and used it for the evaluation of their software defect prediction approaches. Finally, among public domain data sets, one of the most famous is NASA's Metrics Data Program (MDP) [7]; for example, studies such as [8, 9] used the NASA MDP. By using such public domain data sets, a new approach can easily be compared with other approaches.

3. Detailed Problem Definition

Rough set theory

Rough set theory, introduced by Z. Pawlak in the early 1980s [10], is a mathematical tool for dealing with vagueness and uncertainty. It is a methodology for processing an approximation space, and works by classification under an indiscernibility relation. The area of uncertainty is the boundary of the rough set; the boundary region is defined as the difference between the upper approximation and the lower approximation. Rough sets have some main advantages, such as a sound mathematical definition, outstanding robustness and little computation. The information processing of rough sets does not need any preliminary or additional information about the data, such as the basic probability assignment in Dempster-Shafer theory or the values of possibility in fuzzy set theory. Knowledge representation in the rough set model is done via a decision table, which is a tabular form of an Object-Attribute-Value relationship. An object is a record in the decision table.

Let U be a set of objects and R a family of equivalence relations on U. Each relation in R partitions U, using the attribute set, into classes {X1, X2, …, Xn}. We call the pair (U, R) an approximation space. For ∅ ≠ P ⊆ R, the intersection ∩P of all equivalence relations in P is also an equivalence relation; we call ∩P the indiscernibility relation over P, written Ind(P). For X ⊆ U,

R₋(X) = ∪{Yi ∈ U/Ind(R) | Yi ⊆ X}

is the lower approximation of X, and

R⁻(X) = ∪{Yi ∈ U/Ind(R) | Yi ∩ X ≠ ∅}

is the upper approximation of X. R⁻(X) − R₋(X) is the uncertain boundary region of X. In other words, the boundary region comprises exactly the objects that cannot be classified with certainty as inside X or outside X using the attribute set.
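As an illustration (not from the paper; the toy table and attribute values are invented), the lower and upper approximations can be computed directly from the partition that an attribute set induces on the universe:

```python
from itertools import groupby

def partition(universe, attrs, value):
    """Equivalence classes of Ind(attrs): two objects are
    equivalent iff they agree on every attribute in attrs."""
    key = lambda o: tuple(value(o, a) for a in attrs)
    objs = sorted(universe, key=key)
    return [set(g) for _, g in groupby(objs, key=key)]

def lower_upper(classes, X):
    """Lower approximation: union of classes fully inside X.
    Upper approximation: union of classes that intersect X."""
    lower, upper = set(), set()
    for c in classes:
        if c <= X:
            lower |= c
        if c & X:
            upper |= c
    return lower, upper

# Toy decision table: object id -> attribute values (invented data).
table = {1: {'a': 0}, 2: {'a': 0}, 3: {'a': 1}, 4: {'a': 1}, 5: {'a': 2}}
classes = partition(table, ['a'], lambda o, a: table[o][a])  # {1,2},{3,4},{5}
X = {1, 2, 3}
lo, up = lower_upper(classes, X)
# lo == {1, 2}; up == {1, 2, 3, 4}; boundary region == {3, 4}
```

The boundary {3, 4} is exactly the class that overlaps X without being contained in it, matching the definition above.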

Rough set pretreatment translates the set of objects into the classes of Ind(R) using the attribute set. A discernibility matrix is a matrix in which each element is a set of condition attributes: the entry for a given row and column contains the condition attributes that can be used to discern between the corresponding classes.

Let S = (U, Q, V, f) be called an information system, where

(a) U is a finite set of objects;
(b) Q is a finite set of attributes;
(c) Vq is a set of attribute values;
(d) for each q ∈ Q, there is a mapping function fq: U → Vq.

We will sometimes write Oq instead of fq(O) to denote the value of object O with respect to attribute q. The discernibility matrix is M(S) = (mi,j), 1 ≤ i, j ≤ n, n = |U/Ind(Q)|, where mi,j = {q | Oi q ≠ Oj q}. The discernibility matrix construction translates the rough set partition under Ind(Q) into the matrix M(S).
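A minimal sketch of this construction (the records below are hypothetical, not from the paper's data) stores, for every pair of objects, the set of attributes on which they differ:

```python
def discernibility_matrix(objects, attrs):
    """m[i][j] = set of attributes whose values discern objects i and j.
    `objects` is a list of dicts mapping attribute -> value."""
    n = len(objects)
    m = [[set() for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            m[i][j] = {q for q in attrs if objects[i][q] != objects[j][q]}
    return m

# Hypothetical records with two condition attributes a and b.
objs = [{'a': 0, 'b': 1}, {'a': 0, 'b': 0}, {'a': 1, 'b': 1}]
M = discernibility_matrix(objs, ['a', 'b'])
# M[0][1] == {'b'}, M[0][2] == {'a'}, M[1][2] == {'a', 'b'}
```

The diagonal entries are empty, since no attribute discerns an object from itself.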

Suppose S = (U, Q, V, f) is an information system; then T = (U, C ∪ D, V, f) is a decision knowledge system (for short, decision system; sometimes called a knowledge system), where the subset C contains the condition attributes of T and the subset D contains the decision attributes of T.

We see that an attribute corresponds to an equivalence relation on U. A decision table can therefore be regarded as a group of equivalence relations, so data mining can be simplified to shrinking the attribute set.

On a decision system T, the discernibility matrix is M(C, D) = (mi,j), 1 ≤ i, j ≤ n, n = |U/Ind(C)|, where mi,j = {q ∈ C | Oi q ≠ Oj q and Oi d ≠ Oj d for some d ∈ D}.

Let Os be a set of objects, Ds a set of attributes belonging to Os, and Ms a set of methods belonging to Os. For O in Os and M in Ms, we denote by O.M the result of applying method M to object O. For Os, we are interested in generating all the objects that can be mapped into the model just described, as follows: the set Os of objects consists of labeled data tuples comprising the training rough set for the classifier. The label identifies the group to which the object belongs; the other tuple attributes specify the properties of the object. The goal of the classification problem is to discover new rules for each of the groups in the training rough set.

Attributes of an object are of different importance, which is expressed as a weight. To discover the importance of a set of attributes, we first delete an attribute from the set, and then examine the change in the equivalence relation: if the classification changes, the attribute is of greater importance; otherwise, of less. Suppose F = {X1, X2, …, Xn} is a family of classification sets on U based on the attribute set R. The inherent metrics of an approximation space are the approximation precision αR(F) and the approximation quality γR(F).


αR(F) = Σⁿᵢ₌₁ |R₋(Xi)| / Σⁿᵢ₌₁ |R⁻(Xi)|.

γR(F) = Σⁿᵢ₌₁ |R₋(Xi)| / |U|.
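These two formulas can be sketched directly from a partition and a family of target sets (the partition and family below are invented toy values, not the paper's data):

```python
def approximation_metrics(classes, family):
    """Approximation precision alpha and quality gamma of a family of
    target sets, given the equivalence classes of the attribute set.
    classes: the partition U/Ind(R) as a list of sets
    family:  the classification X1..Xn to approximate"""
    universe = set().union(*classes)
    lower_total = upper_total = 0
    for X in family:
        lower_total += sum(len(c) for c in classes if c <= X)  # |R_(Xi)|
        upper_total += sum(len(c) for c in classes if c & X)   # |R^(Xi)|
    precision = lower_total / upper_total  # alpha_R(F)
    quality = lower_total / len(universe)  # gamma_R(F)
    return precision, quality

# Toy partition and classification (invented for illustration).
classes = [{1, 2}, {3, 4}, {5}]
F = [{1, 2, 3}, {4, 5}]
alpha, gamma = approximation_metrics(classes, F)
# lower sums: 2 + 1 = 3; upper sums: 4 + 3 = 7
# alpha == 3/7, gamma == 3/5
```

Precision compares lower to upper approximations, while quality compares the certain region to the whole universe.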

Let T = (U, C ∪ D, V, f) be a decision system. Ind(C, D) = {(x, y) | (x, y) ∈ U × U, ∀r ∈ C, r(x) = r(y)} is called the indiscernibility relation of T. A reduction of the condition attribute set C is a nonempty subset C′ ⊆ C such that:

Ind(C′, D) = Ind(C, D);
no C″ ⊂ C′ makes Ind(C″, D) = Ind(C, D).

The set of reductions of C is written Red(C). If C′ ⊆ C, r ∈ C′, and r cannot be cut out relative to an indiscernibility class X, we call C′ a reduction relative to the indiscernibility class X. The reduction set of each indiscernibility class Xi (i = 1, 2, …, n) is written Red(C, Xi), and Core(C, Xi) = ∩Red(C, Xi) is called the core relative to Xi. In this way, we can get the simplest decision table using relative reduction.
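The definition above can be checked by brute force on a small table. The sketch below finds absolute reducts only, i.e. minimal attribute subsets preserving Ind(C); the paper's relative reduction additionally takes the decision attributes D into account. The records are hypothetical:

```python
from itertools import combinations

def ind(objects, attrs):
    """Partition of object indices under Ind(attrs)."""
    groups = {}
    for i, o in enumerate(objects):
        groups.setdefault(tuple(o[a] for a in attrs), set()).add(i)
    return frozenset(frozenset(g) for g in groups.values())

def reducts(objects, C):
    """All minimal subsets C' of C with Ind(C') = Ind(C), by brute force."""
    full = ind(objects, C)
    found = []
    for size in range(1, len(C) + 1):
        for sub in combinations(C, size):
            if ind(objects, list(sub)) == full:
                # keep only minimal subsets: skip supersets of known reducts
                if not any(set(r) <= set(sub) for r in found):
                    found.append(sub)
    return found

# Hypothetical records: attribute 'c' duplicates 'a', so it is redundant.
objs = [{'a': 0, 'b': 0, 'c': 0},
        {'a': 0, 'b': 1, 'c': 0},
        {'a': 1, 'b': 1, 'c': 1}]
r = reducts(objs, ['a', 'b', 'c'])  # -> [('a', 'b'), ('b', 'c')]
```

Since 'c' carries the same information as 'a', either {a, b} or {b, c} preserves the full indiscernibility relation, and their intersection {b} is the core.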

Theorem 1:

Let (U, R) be a knowledge base. Then the scaled many-valued context sc(U, R) := ((U, R, W, I), (S_R | R ∈ R)) is defined by W := {[x]_R | x ∈ U, R ∈ R} and (x, R, w) ∈ I :⇔ w = [x]_R, with the nominal scale S_R := (U/R, U/R, =) for each many-valued attribute R ∈ R. Then the indiscernibility classes of (U, R) are exactly the contingents of the derived context K of sc(U, R).

Theorem 2:

Let SC := ((G, M, W, I), (S_m | m ∈ M)) be a scaled many-valued context, and K := (G, {(m, n) | m ∈ M, n ∈ M_m}, J) its derived context. Then the knowledge base kb(SC) is defined by kb(SC) := (G, R), where R := {R_m | m ∈ M} and, for m ∈ M, R_m := {(g, h) ∈ G × G | γ_m(g) = γ_m(h)}, where γ_m is the object-concept mapping of the m-part of K; clearly, the m-part of K is the formal context (G, {(m, n) | n ∈ M_m}, J_m), where J_m := {(g, (m, n)) ∈ J | n ∈ M_m}. Then the indiscernibility classes of kb(SC) are exactly the contingents of the derived context K of SC.

Theorem 3:

For any knowledge base (U, R): kb(sc(U, R)) = (U, R).

The intersection of all reductions of C is called the core of C: Core(C) = ∩Red(C). For a decision system T = (U, C ∪ D, V, f), U is partitioned into indiscernibility classes X1, X2, …, Xn by Ind(C, D). Let D(Xi) = {v = fq(x) | x ∈ Xi, q ∈ D} be the set of values taken by the decision attribute set D. If D(X) is unchanged under Ind(C − {r}, D), then the condition attribute r ∈ C can be cut out relative to X.

4. Solution Methodology

The data sets used in this study are CM1, JM1, KC1, KC2 and PC1. They are shown in Table 1 and are freely available to other researchers via the web interface to NASA's Metrics Data Program (MDP).

Table 1 Data sets used in this study

Project   Number of instances   False (no defect)   True (defect)
CM1       498                   449                 49
JM1       10885                 8779                2106
KC1       2109                  1783                326
KC2       522                   415                 107
PC1       1109                  1032                77

All data comes from McCabe and Halstead feature extractors of source code. These features were defined in the 1970s in an attempt to objectively characterize code features that are associated with software quality. The McCabe and Halstead measures are "module"-based, where a "module" is the smallest unit of functionality. McCabe argued that code with complicated pathways is more error-prone; his metrics therefore reflect the pathways within a code module. Halstead argued that code that is hard to read is more likely to be fault-prone; he estimates reading complexity by counting the number of concepts in a module. The McCabe metrics are a collection of four software metrics: essential complexity, cyclomatic complexity, design complexity and LOC (lines of code). The Halstead measures fall into three groups: the base measures, the derived measures, and lines-of-code measures. The total number of software metrics is 22, including 5 different lines-of-code measures, 3 McCabe metrics, 4 base Halstead measures, 8 derived Halstead measures, a branch count, and 1 goal field, listed as follows:

loc: McCabe's line count of code
v(g): McCabe "cyclomatic complexity"
ev(g): McCabe "essential complexity"
iv(g): McCabe "design complexity"
n: Halstead total operators + operands
v: Halstead "volume"
l: Halstead "program length"
d: Halstead "difficulty"
i: Halstead "intelligence"
e: Halstead "effort"
b: Halstead
t: Halstead's time estimator
lOCode: Halstead's line count
lOComment: Halstead's count of lines of comments
lOBlank: Halstead's count of blank lines
lOCodeAndComment: numeric
uniq_Op: unique operators
uniq_Opnd: unique operands
total_Op: total operators
total_Opnd: total operands
branchCount: of the flow graph
defects: module has/has not one or more reported defects

5. Details of Experimentation, Analysis and Methodology

In preprocessing of the data set, we cluster each software metric into three classes: High, Medium and Low. Many methods of computing distance in clustering can be used, such as the Euclidean distance. We change CM1 into a rough set information table. After applying the data reduction algorithm, rule discovery algorithm and rule reduction algorithm given by us, some available rules are discovered.
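The paper clusters each metric into three classes but does not fix the method. One plausible sketch (equal-frequency tertile binning is an assumption here, not the paper's stated procedure) maps each numeric metric value to 0 (Low), 1 (Medium) or 2 (High):

```python
def discretize_three(values):
    """Map each numeric value to 0 (Low), 1 (Medium) or 2 (High)
    using tertile cut points over the observed values.
    Equal-frequency binning is an assumption for illustration."""
    ordered = sorted(values)
    n = len(ordered)
    t1, t2 = ordered[n // 3], ordered[2 * n // 3]
    return [0 if v < t1 else 1 if v < t2 else 2 for v in values]

# Hypothetical loc values for a handful of modules.
loc = [12, 40, 7, 95, 33, 150]
codes = discretize_three(loc)  # -> [0, 1, 0, 2, 1, 2]
```

Applying such a mapping to every metric column yields the 0/1/2 cells of the information table in Table 2.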

Part of the information table of software defect prediction, coded as T = (U, C ∪ D, V, f), is depicted in Table 2. We have merged completely identical samples into a single record; n denotes the number of records. The condition attributes are {a, b, c, …, u}: 'a' is loc, 'b' is v(g), 'c' is ev(g), …, 'u' is branchCount. The values 0, 1 and 2 in the cells stand for Low, Medium and High. The decision attribute z is defects, and its values 0 and 1 mean False and True.

Table 2 Information table

case   a   b   c   ..  u   Decision(z)
R1     0   0   1   ..  0   0
R2     0   0   0   ..  0   1
R3     0   0   0   ..  0   0
..     ..  ..  ..  ..  ..  ..
R498   0   0   1   ..  0   1

The discernibility matrix M(C, D) is calculated from the information table [13]. Then the reduced discernibility matrix, the RULE information table and the reduced RULE table can be derived.

Table 3 The result table of CM1

case   a   b   c   d   e   g   j   Decision(z)
T1     0   0   0   0   0   0   0   0
T2     0   0   0   0   0   0   1   0
T3     0   0   0   0   0   1   0   0
..     ..  ..  ..  ..  ..  ..  ..  ..
T37    2   2   2   2   2   0   2   1

6. Result and Discussion

Using the data reduction algorithm, we get the reduced information table. The condition attributes are now only a, b, c, d, e, g and j, which correspond to the metrics loc, v(g), ev(g), iv(g), n, l and e. These attributes exhibit high correlation. Applying the rule discovery algorithm, we get the RULE table; using the rule reduction algorithm, we get a new RULE table, shown as Table 3.

So the data set CM1, with 22 attributes and 498 instances, is reduced to a new 8×37 matrix. The theory of rough sets and further research assure that the new matrix preserves the characteristics of CM1. Software defect prediction can thus be researched on the new table instead of CM1: the ratio of attributes to records falls from 22:498 to 8:37.

7. Conclusions

We propose a new method for software defect prediction based on rough set theory, and apply it to the most famous public domain data set, created by NASA's Metrics Data Program (MDP). The results show its splendid performance. Rough set theory derives an approximation space from an equivalence relation; its upper approximation and lower approximation concepts give the area of certainty and the area of uncertainty. Using these concepts, we can process software defect prediction data sets effectively. We have presented a knowledge representation model propitious to rough sets, and have given a data reduction algorithm, a rule discovery algorithm and a rule reduction algorithm. In predicting software defects using the given model, the metrics loc, v(g), ev(g), iv(g), n, l and e exhibit high correlation, and we reduce the attributes from 22 to 8. This shows that our method and algorithms are feasible.

References

[1] K. Kaminsky and G. D. Boetticher, "Better Software Defect Prediction using Equalized Learning with Machine Learners," Knowledge Sharing and Collaborative Engineering, 2004.

[2] V. U. B. Challagulla, F. B. Bastani, I.-L. Yen, and R. A. Paul, "Empirical assessment of machine learning based software defect prediction techniques," The 10th IEEE WORDS'05, 2005.

[3] N. E. Fenton, M. Neil, W. Marsh, P. Hearty, L. Radlinski, and P. Krause, "Predicting software defects in varying development lifecycles using Bayesian nets," Information and Software Technology, 2007, 49:32-43.

[4] N. E. Fenton, P. Krause, and M. Neil, "A probabilistic model for software defect prediction," for submission to IEEE Transactions on Software Engineering, 2001.

[5] G. Denaro and M. Pezze, "An empirical evaluation of fault-proneness models," The 24th International Conference on Software Engineering, 2002.

[6] T. Menzies, J. Greenwald, and A. Frank, "Data mining static code attributes to learn defect predictors," IEEE Transactions on Software Engineering, vol. 33, 2007.

[7] Z. Pawlak et al., "Rough set approach to multi-attribute decision analysis," Warsaw University of Technology, 1993.

[8] W. Pedrycz, Clustering: From Data to Information Granules, John Wiley & Sons, Inc., 2005.

[9] L. Li, W. Yang, X. Li, and Y. Xu, "Research on data mining model based on rough sets," The First International Symposium on Pervasive Computing and Applications (SPCA06), 2006.