8/6/2019 PAPER a New Method to Predict Software Defect Based on Rough Sets
A New Method to Predict Software Defect Based on Rough Sets
VIVEKSUBRAMANIAN M (10MSC0019), RAJMOHAN K (10MSC0024), B.K. TRIPATHY
Objectives
High-quality software assurance requires extensive assessment, yet most software organizations do not allocate resources for software quality. We research defect detectors, focusing on data sets for software defect prediction. In this paper, a rough set model is presented to deal with the attributes of software defect prediction data sets. We apply it to the most famous public domain data set, created by NASA's Metrics Data Program, where we observe good performance.
Literature Review
High quality software should have as few defects as possible; it therefore requires extensive and expensive assessment. Most software organizations frequently do not allocate enough resources for quality. Hence, a variety of software defect prediction techniques have been proposed, such as statistical methods, machine learning techniques [1] and mixed techniques [2]. Most prediction models use size and complexity metrics of software to predict defects. Others are based on the 'quality' of the development process [3], or take a multivariate approach [4].
Data sets used in these studies are of three kinds: original, open source, and public domain. First, original data sets are usually used in empirical studies in industry, for example software developed at the Philips Software Centre (PSC). Second, as for open source software data, studies such as [6] collected and used it for the evaluation of their software defect prediction approaches. Finally, among public domain data sets, one of the most famous is NASA's Metrics Data Program (MDP) [7]; for example, studies such as [8, 9] used the NASA MDP. By using such public domain data sets, a new approach can easily be compared with other approaches.
3. Detailed Problem Definition
Rough set theory
Rough set theory, introduced by Z. Pawlak in the early 1980s [10], is a mathematical tool to deal with vagueness and uncertainty. It is a methodology for processing an approximation space, based on classification by an indiscernibility relation. The area of uncertainty is the boundary of a rough set; the boundary region is defined as the difference between the upper approximation and the lower approximation. Rough sets have some notable advantages: they rest on sound mathematical definitions, are robust, and require little computation. Information processing with rough sets does not need any preliminary or additional information about the data, such as the basic probability assignment in Dempster-Shafer theory or the possibility values in fuzzy set theory. Knowledge representation in the rough set model is done via a decision table, which is a tabular form of an
Object-Attribute-Value relationship. An object is a record in the decision table.
There is a set U of objects and an equivalence relation R on U. R induces a partition of U using the attribute set: U/R = {X1, X2, ..., Xn}. We call the pair (U, R) an approximation space. For P ⊆ R and P ≠ ∅, ∩P (the intersection of all equivalence relations in P) is also an equivalence relation. We call ∩P the indiscernibility relation on P, written Ind(P). For X ⊆ U, we say that
R_(X) = ∪{Yi ∈ U/Ind(R) : Yi ⊆ X} is the lower approximation of X, and
R¯(X) = ∪{Yi ∈ U/Ind(R) : Yi ∩ X ≠ ∅} is the upper approximation of X.
R¯(X) − R_(X) is the uncertain boundary region of X. In other words, the boundary region comprises exactly those objects that cannot be classified with certainty as inside X or outside X using the attribute set.
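As a concrete illustration, the lower and upper approximations can be computed by grouping objects into equivalence classes and testing containment and overlap. The following Python sketch, including its toy data, is our own illustration, not code from the paper.

```python
from collections import defaultdict

def partition(objects, attrs):
    """Equivalence classes of Ind(attrs): objects agreeing on every attribute in attrs."""
    classes = defaultdict(set)
    for name, values in objects.items():
        classes[tuple(values[a] for a in attrs)].add(name)
    return list(classes.values())

def lower_approx(objects, attrs, X):
    """R_(X): union of the equivalence classes entirely contained in X."""
    return set().union(*(Y for Y in partition(objects, attrs) if Y <= X))

def upper_approx(objects, attrs, X):
    """R^-(X): union of the equivalence classes that intersect X."""
    return set().union(*(Y for Y in partition(objects, attrs) if Y & X))

# Toy data: o1 and o2 are indiscernible, o3 is distinguishable.
objects = {
    "o1": {"a": 0, "b": 0},
    "o2": {"a": 0, "b": 0},
    "o3": {"a": 1, "b": 0},
}
X = {"o1", "o3"}
print(sorted(lower_approx(objects, ["a", "b"], X)))  # ['o3']
print(sorted(upper_approx(objects, ["a", "b"], X)))  # ['o1', 'o2', 'o3']
```

The boundary region here is {o1, o2}: those objects cannot be classified with certainty relative to X.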
Rough set pretreatment translates the set of objects into Ind(R) using the attribute set. A discernibility matrix is a matrix in which each element is a set of condition attributes: the condition attributes which can be used to discern between the classes in the corresponding row and column are inserted.
Let S = (U, Q, V, f) be called an information system, where
(a) U is a finite set of objects;
(b) Q is a finite set of attributes;
(c) Vq is a set of attribute values;
(d) for each q ∈ Q, fq: U → Vq is a mapping function.
We will also sometimes write Oq instead of fq(O) to denote the value of object O with respect to attribute q. The discernibility matrix is
M(S) = (mi,j), 1 ≤ i, j ≤ n, n = |U/Ind(Q)|, where mi,j = {q ∈ Q | Oi q ≠ Oj q}.
The discernibility matrix construction translates the rough set Ind(S) into the discernibility matrix M(S).
Suppose S = (U, Q, V, f) is an information system with Q = C ∪ D; then S is a decision knowledge system (for short, decision system, sometimes called a knowledge system). The subset C contains the condition attributes of S, and the subset D the decision attributes.
An attribute in Q corresponds to an equivalence relation on U, so a decision table can be regarded as a group of equivalence relations. Data mining can therefore be simplified to shrinking the set of attributes.
On a decision system S, the discernibility matrix is M(C, D) = (mi,j), 1 ≤ i, j ≤ n, n = |U/Ind(C)|, where
mi,j = {q ∈ C | Oi q ≠ Oj q, and Oi d ≠ Oj d for some d ∈ D}.
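The decision-relative discernibility matrix M(C, D) defined above can be sketched in Python as follows; the row encoding and sample data are our own illustrative choices, not the paper's code.

```python
def discernibility_matrix(rows, cond_attrs, dec_attr):
    """M(C, D): m[i][j] holds the condition attributes on which rows i and j
    differ, recorded only when their decision values also differ."""
    n = len(rows)
    m = [[set() for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if rows[i][dec_attr] != rows[j][dec_attr]:
                m[i][j] = {q for q in cond_attrs if rows[i][q] != rows[j][q]}
    return m

rows = [
    {"a": 0, "b": 0, "z": 0},
    {"a": 0, "b": 1, "z": 1},
    {"a": 1, "b": 1, "z": 0},
]
m = discernibility_matrix(rows, ["a", "b"], "z")
print(m[0][1])  # {'b'}: only b discerns rows 0 and 1, whose decisions differ
print(m[0][2])  # set(): decisions agree, so no entry is recorded
```

Entries of the matrix are exactly the attribute sets a reduction must keep in order to preserve the decision classification.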
Let Os be a set of objects, Ds a set of attributes belonging to Os, and Ms a set of methods belonging to Os. For O in Os and M in Ms, we denote by O.M the result of applying method M to object O. For Os, we are interested in generating all the objects that can be mapped into the model just described, as follows: the set Os of objects consists of labeled data tuples comprising the training rough set for the classifier. The label identifies the group to which the object belongs; the other tuple attributes specify the properties of the object. The goal of the classification problem is to discover new rules for each of the groups in the training rough set.
Attributes of an object are of different importance, expressed as a weight. To discover the importance of a set of attributes, we first delete an attribute from the set, and then examine how the equivalence relation changes: if the classification changes, the attribute is of greater importance; otherwise, of less. Suppose F = {X1, X2, ..., Xn} is a family of classification sets on U based on the attribute set R. The inherent metrics of an approximation space are the approximation precision αR(F) and the approximation quality γR(F).
αR(F) = Σ(i=1..n) |R_(Xi)| / Σ(i=1..n) |R¯(Xi)|.
γR(F) = Σ(i=1..n) |R_(Xi)| / |U|.
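The precision and quality can be computed directly from the sizes of the lower and upper approximations of each class in F. This Python sketch, with our own toy partition, illustrates the two formulas.

```python
from collections import defaultdict

def equivalence_classes(values):
    """Partition object indices by identical attribute tuples (Ind(R))."""
    groups = defaultdict(set)
    for i, v in enumerate(values):
        groups[v].add(i)
    return list(groups.values())

def precision_and_quality(values, F):
    """alpha_R(F) = sum|R_(Xi)| / sum|R^-(Xi)|; gamma_R(F) = sum|R_(Xi)| / |U|."""
    classes = equivalence_classes(values)
    lower = upper = 0
    for X in F:
        for Y in classes:
            if Y <= X:
                lower += len(Y)   # class certainly inside X
            if Y & X:
                upper += len(Y)   # class possibly inside X
    return lower / upper, lower / len(values)

# Three objects; the first two are indiscernible; two decision groups.
alpha, gamma = precision_and_quality([("x",), ("x",), ("y",)], [{0, 2}, {1}])
print(alpha)  # 0.2  (lower total 1, upper total 5)
```

Here gamma comes out to 1/3: only one of the three objects lies in a lower approximation.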
Let S = (U, C ∪ D, V, f) be a decision system. Ind(C, D) = {(x, y) | (x, y) ∈ U × U, ∀r ∈ C, r(x) = r(y)} is called the indiscernibility relation of S.
A reduction of the condition attribute set C is a nonempty subset C' ⊆ C such that:
Ind(C', D) = Ind(C, D);
no C'' ⊂ C' makes Ind(C'', D) = Ind(C, D).
The set of all reductions of C is written Red(C), and the intersection of all reductions of C is called the core: Core(C) = ∩Red(C). If C' ⊆ C, r ∈ C' and r cannot be removed relative to an indiscernibility class X, we call C' a reduction relative to X. The set of all reductions relative to the indiscernibility class Xi (i = 1, 2, ..., n) is written Red(C, Xi), and Core(C, Xi) = ∩Red(C, Xi) is called the core relative to Xi. In this way, we can obtain the simplest decision table using relative reduction.
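To make Red(C) and Core(C) concrete, reducts can be found by brute force over attribute subsets; this is exponential in |C| and suitable only for small attribute sets. The sketch and its data are our own assumptions, not the paper's algorithm.

```python
from itertools import combinations

def ind(rows, attrs):
    """Indiscernibility partition under attrs, canonicalized for comparison."""
    groups = {}
    for i, row in enumerate(rows):
        groups.setdefault(tuple(row[a] for a in attrs), set()).add(i)
    return frozenset(frozenset(g) for g in groups.values())

def reducts(rows, cond_attrs):
    """Minimal subsets of cond_attrs inducing the same partition as cond_attrs."""
    full = ind(rows, cond_attrs)
    found = []
    for k in range(1, len(cond_attrs) + 1):
        for subset in combinations(cond_attrs, k):
            # increasing k ensures supersets of found reducts are skipped
            if ind(rows, subset) == full and not any(
                set(r) <= set(subset) for r in found
            ):
                found.append(subset)
    return found

# Attribute "c" duplicates "a", so either may be dropped alongside "b".
rows = [{"a": 0, "b": 0, "c": 0}, {"a": 1, "b": 0, "c": 1}, {"a": 1, "b": 1, "c": 1}]
reds = reducts(rows, ["a", "b", "c"])
core = set.intersection(*(set(r) for r in reds))
print(reds)  # [('a', 'b'), ('b', 'c')]
print(core)  # {'b'}
```

The core {'b'} is exactly the attribute no reduction can do without.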
Theorem 1:
Let (U, R) be a knowledge base. Then the scaled many-valued context sc(U, R) := ((U, R, W, I), (S_R | R ∈ R)) is defined by W := {[x]_R | x ∈ U, R ∈ R} and (x, R, w) ∈ I :⇔ w = [x]_R, with the nominal scale S_R := (U/R, U/R, =) for each many-valued attribute R ∈ R. Then the indiscernibility classes of (U, R) are exactly the contingents of the derived context K of sc(U, R).
Theorem 2:
Let SC := ((G, M, W, I), (S_m | m ∈ M)) be a scaled many-valued context, and K := (G, {(m, n) | m ∈ M, n ∈ M_m}, J) its derived context. Then the knowledge base kb(SC) is defined by kb(SC) := (G, R), where R := {R_m | m ∈ M} and, for m ∈ M, R_m := {(g, h) ∈ G × G | m(g) = m(h)}; here m is the object-concept mapping of the m-part of K. Clearly, the m-part of K is the formal context (G, {(m, n) | n ∈ M_m}, J_m), where J_m := {(g, (m, n)) ∈ J | n ∈ M_m}. Then the indiscernibility classes of kb(SC) are exactly the contingents of the derived context K of SC.
Theorem 3:
For any knowledge base (U, R): kb(sc(U, R)) = (U, R).
4. Solution Methodology
The data sets used in this study are shown in Table 1 and are freely available to other researchers via the web interface to NASA's Metrics Data Program (MDP).
Recall that for a decision system S = (U, C ∪ D, V, f), U is partitioned into indiscernibility classes X1, X2, ..., Xn by Ind(C, D). Let D(Xi) = {v = fq(x) | x ∈ Xi, q ∈ D} be the set of values taken by the decision attributes D on Xi. If D(X under Ind(C − {r}, D)) = D(X), then the condition attribute r ∈ C can be removed relative to X.
The data sets are CM1, JM1, KC1, KC2 and PC1.

Table 1 Data sets used in this study
Project  Number of instances  False  True
CM1      498     449    49
JM1      10885   8779   2106
KC1      2109    1783   326
KC2      522     415    107
PC1      1109    1032   77
All data comes from McCabe and Halstead feature extractors applied to source code. These features were defined in the 1970s in an attempt to objectively characterize code features that are associated with software quality. The McCabe and Halstead measures are "module"-based, where a "module" is the smallest unit of functionality. McCabe argued that code with complicated pathways is more error-prone; his metrics therefore reflect the pathways within a code module. Halstead argued that code that is hard to read is more likely to be fault-prone; he estimates reading complexity by counting the number of concepts in a module. The McCabe metrics are a collection of four software metrics: essential complexity, cyclomatic complexity, design complexity and LOC (lines of code). The Halstead measures fall into three groups: the base measures, the derived measures, and the lines-of-code measures. The total number of software metrics is 22, including 5 different lines-of-code measures, 3 McCabe metrics, 4 base Halstead measures, 8 derived Halstead measures, a branch count, and 1 goal field. They are listed as follows:
loc: McCabe's line count of code
v(g): McCabe "cyclomatic complexity"
ev(g): McCabe "essential complexity"
iv(g): McCabe "design complexity"
n: Halstead total operators + operands
v: Halstead "volume"
l: Halstead "program length"
d: Halstead "difficulty"
i: Halstead "intelligence"
e: Halstead "effort"
b: Halstead
t: Halstead's time estimator
lOCode: Halstead's line count
lOComment: Halstead's count of lines of comments
lOBlank: Halstead's count of blank lines
lOCodeAndComment: numeric
uniq_Op: unique operators
uniq_Opnd: unique operands
total_Op: total operators
total_Opnd: total operands
branchCount: branch count of the flow graph
defects: module has/has not one or more
reported defects.
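For reference, the derived Halstead measures follow from the four base counts by Halstead's standard formulas (vocabulary n1 + n2, length N1 + N2, volume V = length · log2(vocabulary), difficulty D = (n1/2)(N2/n2), effort E = D · V). The sketch below uses made-up counts and the textbook definitions, which may differ in detail from the MDP extractor.

```python
import math

def halstead(n1, n2, N1, N2):
    """Derived Halstead measures from the base counts:
    n1/n2 = unique operators/operands, N1/N2 = total operators/operands."""
    vocabulary = n1 + n2
    length = N1 + N2
    volume = length * math.log2(vocabulary)  # v
    difficulty = (n1 / 2) * (N2 / n2)        # d
    effort = difficulty * volume             # e
    return {
        "v": volume,
        "d": difficulty,
        "e": effort,
        "t": effort / 18,    # t: time estimate in seconds (Stroud number 18)
        "b": volume / 3000,  # b: one common delivered-bug estimate
    }

m = halstead(n1=10, n2=20, N1=50, N2=60)
print(round(m["d"], 1))  # 15.0
```

McCabe's cyclomatic complexity v(g), by contrast, is computed from the control-flow graph (edges − nodes + 2), not from token counts.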
5. Details of Experimentation, Analysis and Methodology
In preprocessing the data set, we cluster each software metric into three classes: High, Medium and Low. Many distance measures can be used for the clustering, such as the Euclidean distance.
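The paper does not spell out the clustering procedure; as one simple stand-in (our assumption, not the authors' method), each metric can be discretized into Low/Medium/High by terciles of its sample values:

```python
def tercile_bins(values):
    """Map each value to 0 (Low), 1 (Medium) or 2 (High) according to
    which third of the sorted sample it falls into."""
    s = sorted(values)
    n = len(s)
    t1, t2 = s[n // 3], s[(2 * n) // 3]  # tercile cut points
    return [0 if v < t1 else (1 if v < t2 else 2) for v in values]

print(tercile_bins([1, 2, 3, 4, 5, 6]))  # [0, 0, 1, 1, 2, 2]
```

Any three-class discretization (e.g. k-means with k = 3 per metric) would fit the same 0/1/2 coding used in Table 2.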
We transform CM1 into a rough set information table. After applying our data reduction algorithm, rule discovery algorithm and rule reduction algorithm, some available rules are discovered.
Part of the information table for software defect prediction is depicted in Table 2. We have merged completely identical samples into a single record; n denotes the number of records. The condition attributes are {a, b, ..., u}: 'a' is loc numeric, 'b' is v(g) numeric, 'c' is ev(g) numeric, ..., 'u' is branchCount. The values 0, 1 and 2 in the cells stand for Low, Medium and High. The decision attribute is defects (column Decision(z)), whose values 0 and 1 mean False and True.

Table 2 Information table
case  a  b  c  ..  u  Decision(z)
R1    0  0  1  ..  0  0
R2    0  0  0  ..  0  1
R3    0  0  0  ..  0  0
..    .. .. .. ..  .. ..
R498  0  0  1  ..  0  1
The discernibility matrix M(C, D) is calculated from the information table [13]. Then the reduced discernibility matrix, the RULE information table and the reduced RULE table can be derived.
Table 3 The result table of CM1
case  a  b  c  d  e  g  j  Decision(z)
T1    0  0  0  0  0  0  0  0
T2    0  0  0  0  0  0  1  0
T3    0  0  0  0  0  1  0  0
..    .. .. .. .. .. .. .. ..
T37   2  2  2  2  2  0  2  1
6. Result and Discussion
Using the data reduction algorithm, we get a reduced information table. The condition attributes are now only a, b, c, d, e, g and j, which correspond to loc numeric, v(g) numeric, ev(g) numeric, iv(g) numeric, n numeric, l numeric and e numeric. These attributes exhibit high correlation. Applying the rule discovery algorithm, we get a RULE table; using the rule reduction algorithm, we get a new RULE table, shown as Table 3.
So the data set CM1, with 22 attributes and 498 instances, is reduced to a new 8 × 37 matrix. The theory of rough sets and further research assure that the new matrix preserves the characteristics of CM1. By researching software defect prediction on the new table instead of CM1, the ratio of attributes to records falls from 22:498 to 8:37.
7. Conclusions
We propose a new method for software defect prediction based on rough set theory, and apply it to the most famous public domain data set, created by NASA's Metrics Data Program (MDP). The results show good performance. Rough set theory builds an approximation space from an equivalence relation; its upper approximation and lower approximation concepts give the area of certainty and the area of uncertainty. Using these concepts, we can process software defect prediction data sets effectively. We have presented a knowledge representation model suited to rough sets, and have given a data reduction algorithm, a rule discovery algorithm and a rule reduction algorithm. Predicting software defects using the given model, the metrics loc numeric, v(g) numeric, ev(g) numeric, iv(g) numeric, n numeric, l numeric and e numeric exhibit high correlation, and we reduce the attributes from 22 to 8. This shows that our method and algorithms are feasible.
References
[1] K. Kaminsky and G. D. Boetticher, "Better Software Defect Prediction using Equalized Learning with Machine Learners," Knowledge Sharing and Collaborative Engineering, 2004.
[2] V. U. B. Challagulla, F. B. Bastani, I.-L. Yen, and R. A. Paul, "Empirical assessment of machine learning based software defect prediction techniques," The 10th IEEE WORDS'05, 2005.
[3] N. E. Fenton, M. Neil, W. Marsh, P. Hearty, L. Radlinski, and P. Krause, "Predicting software defects in varying development lifecycles using Bayesian nets," Information and Software Technology, 2007, 49:32-43.
[4] N. E. Fenton, P. Krause, and M. Neil, "A probabilistic model for software defect prediction," for submission to IEEE Transactions on Software Engineering, 2001.
[5] G. Denaro and M. Pezze, "An empirical evaluation of fault-proneness models," The 24th International Conference on Software Engineering, 2002.
[6] T. Menzies, J. Greenwald, and A. Frank, "Data mining static code attributes to learn defect predictors," IEEE Transactions on Software Engineering, vol. 33, 2007.
[7] Z. Pawlak et al., "Rough set approach to multi-attribute decision analysis," Warsaw University of Technology, 1993.
[8] W. Pedrycz, Knowledge-Based Clustering: From Data to Information Granules, John Wiley & Sons, Inc., 2005.
[9] L. Li, W. Yang, X. Li, and Y. Xu, "Research on data mining model based on rough sets," The First International Symposium on Pervasive Computing and Applications (SPCA'06), 2006.