
  • 8/6/2019 PAPER a New Method to Predict Software Defect Based on Rough Sets

    1/6

    A New Method to Predict Software Defect Based on Rough Sets

VIVEKSUBRAMANIAN M (10MSC0019), B. K. TRIPATHY, RAJMOHAN K (10MSC0024)

Objectives

High-quality software assurance requires extensive assessment, yet most software organizations do not allocate sufficient resources to software quality. We study defect detectors with a focus on the data sets used for software defect prediction. In this paper, a rough set model is presented to deal with the attributes of software defect prediction data sets. The model is applied to the most famous public domain data set, created by NASA's Metrics Data Program, where it shows good and extraordinary performance.

Literature Review

High-quality software should have as few defects as possible, and therefore requires extensive and expensive assessment; yet most software organizations frequently do not allocate enough resources for quality. A variety of software defect prediction techniques have therefore been proposed, such as statistical methods, machine learning techniques [1] and mixed techniques [2]. Most of the wide range of prediction models use size and complexity metrics of software to predict defects. Others are based on the 'quality' of the development process [3], or take a multivariate approach [4].

Data sets used in these studies fall into three categories: original, open source, and public domain. First, original data sets are usually used in empirical studies in industry, for example data from software development at the Philips Software Centre (PSC). Second, as for open source software data, studies such as [6] collected and used it for the evaluation of their software defect prediction approaches. Finally, among public domain data sets, one of the most famous is NASA's Metrics Data Program (MDP) [7]; for example, studies such as [8, 9] used the NASA MDP. By using such public domain data sets, a new approach can easily be compared with other approaches.

3. Detailed Problem Definition

Rough set theory

Rough set theory, introduced by Z. Pawlak in the early 1980s [10], is a mathematical tool for dealing with vagueness and uncertainty. It is a methodology for processing an approximation space, and works by classification under an indiscernibility relation. The area of uncertainty is the boundary of the rough set; the boundary region is defined as the difference between the upper approximation and the lower approximation. Rough sets have some main advantages, such as a sound mathematical definition, outstanding robustness and little computation. The information processing of rough sets does not need any preliminary or additional information about the data, such as the basic probability assignment in Dempster-Shafer theory or the values of possibility in fuzzy set theory. Knowledge representation in the rough set model is done via a decision table, which is a tabular form of an Object-Attribute-Value relationship. An object is a record in the decision table.

Let U be a set of objects and R a family of equivalence relations on U. Each relation in R partitions U, using the attribute set, into classes {X1, X2, …, Xn}. We call the pair (U, R) an approximation space. For ∅ ≠ P ⊆ R, the intersection ∩P of all equivalence relations in P is also an equivalence relation; we call ∩P the indiscernibility relation over P, written Ind(P). For X ⊆ U,

R₋(X) = ∪{Yi ∈ U/Ind(R) | Yi ⊆ X}

is the lower approximation of X, and

R⁻(X) = ∪{Yi ∈ U/Ind(R) | Yi ∩ X ≠ ∅}

is the upper approximation of X. R⁻(X) − R₋(X) is the uncertain boundary region of X. In other words, the boundary region comprises exactly the objects that cannot be classified with certainty as inside X or outside X using the attribute set.
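As an illustration (not from the paper; the toy table and attribute values are invented), the lower and upper approximations can be computed directly from the partition that an attribute set induces on the universe:

```python
from itertools import groupby

def partition(universe, attrs, value):
    """Equivalence classes of Ind(attrs): two objects are
    equivalent iff they agree on every attribute in attrs."""
    key = lambda o: tuple(value(o, a) for a in attrs)
    objs = sorted(universe, key=key)
    return [set(g) for _, g in groupby(objs, key=key)]

def lower_upper(classes, X):
    """Lower approximation: union of classes fully inside X.
    Upper approximation: union of classes that intersect X."""
    lower, upper = set(), set()
    for c in classes:
        if c <= X:
            lower |= c
        if c & X:
            upper |= c
    return lower, upper

# Toy decision table: object id -> attribute values (invented data).
table = {1: {'a': 0}, 2: {'a': 0}, 3: {'a': 1}, 4: {'a': 1}, 5: {'a': 2}}
classes = partition(table, ['a'], lambda o, a: table[o][a])  # {1,2},{3,4},{5}
X = {1, 2, 3}
lo, up = lower_upper(classes, X)
# lo == {1, 2}; up == {1, 2, 3, 4}; boundary region == {3, 4}
```

The boundary {3, 4} is exactly the class that overlaps X without being contained in it, matching the definition above.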

Rough set pretreatment translates the set of objects into the classes of Ind(R) using the attribute set. A discernibility matrix is a matrix in which each element is a set of condition attributes: the entry for a given row and column contains the condition attributes that can be used to discern between the corresponding classes.

Let S = (U, Q, V, f) be called an information system, where

(a) U is a finite set of objects;
(b) Q is a finite set of attributes;
(c) Vq is a set of attribute values;
(d) for each q ∈ Q, there is a mapping function fq: U → Vq.

We will sometimes write Oq instead of fq(O) to denote the value of object O with respect to attribute q. The discernibility matrix is M(S) = (mi,j), 1 ≤ i, j ≤ n, n = |U/Ind(Q)|, where mi,j = {q | Oi q ≠ Oj q}. The discernibility matrix construction translates the rough set partition under Ind(Q) into the matrix M(S).
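A minimal sketch of this construction (the records below are hypothetical, not from the paper's data) stores, for every pair of objects, the set of attributes on which they differ:

```python
def discernibility_matrix(objects, attrs):
    """m[i][j] = set of attributes whose values discern objects i and j.
    `objects` is a list of dicts mapping attribute -> value."""
    n = len(objects)
    m = [[set() for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            m[i][j] = {q for q in attrs if objects[i][q] != objects[j][q]}
    return m

# Hypothetical records with two condition attributes a and b.
objs = [{'a': 0, 'b': 1}, {'a': 0, 'b': 0}, {'a': 1, 'b': 1}]
M = discernibility_matrix(objs, ['a', 'b'])
# M[0][1] == {'b'}, M[0][2] == {'a'}, M[1][2] == {'a', 'b'}
```

The diagonal entries are empty, since no attribute discerns an object from itself.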

Suppose S = (U, Q, V, f) is an information system; then T = (U, C ∪ D, V, f) is a decision knowledge system (for short, decision system; sometimes called a knowledge system), where the subset C contains the condition attributes of T and the subset D contains the decision attributes of T.

We see that an attribute corresponds to an equivalence relation on U. A decision table can therefore be regarded as a group of equivalence relations, so data mining can be simplified to shrinking the attribute set.

On a decision system T, the discernibility matrix is M(C, D) = (mi,j), 1 ≤ i, j ≤ n, n = |U/Ind(C)|, where mi,j = {q ∈ C | Oi q ≠ Oj q and Oi d ≠ Oj d for some d ∈ D}.

Let Os be a set of objects, Ds a set of attributes belonging to Os, and Ms a set of methods belonging to Os. For O in Os and M in Ms, we denote by O.M the result of applying method M to object O. For Os, we are interested in generating all the objects that can be mapped into the model just described, as follows: the set Os of objects consists of labeled data tuples comprising the training rough set for the classifier. The label identifies the group to which the object belongs; the other tuple attributes specify the properties of the object. The goal of the classification problem is to discover new rules for each of the groups in the training rough set.

Attributes of an object are of different importance, which is expressed as a weight. To discover the importance of a set of attributes, we first delete an attribute from the set, and then examine the change in the equivalence relation: if the classification changes, the attribute is of greater importance; otherwise, of less. Suppose F = {X1, X2, …, Xn} is a family of classification sets on U based on the attribute set R. The inherent metrics of an approximation space are the approximation precision αR(F) and the approximation quality γR(F).


αR(F) = Σⁿᵢ₌₁ |R₋(Xi)| / Σⁿᵢ₌₁ |R⁻(Xi)|.

γR(F) = Σⁿᵢ₌₁ |R₋(Xi)| / |U|.
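These two formulas can be sketched directly from a partition and a family of target sets (the partition and family below are invented toy values, not the paper's data):

```python
def approximation_metrics(classes, family):
    """Approximation precision alpha and quality gamma of a family of
    target sets, given the equivalence classes of the attribute set.
    classes: the partition U/Ind(R) as a list of sets
    family:  the classification X1..Xn to approximate"""
    universe = set().union(*classes)
    lower_total = upper_total = 0
    for X in family:
        lower_total += sum(len(c) for c in classes if c <= X)  # |R_(Xi)|
        upper_total += sum(len(c) for c in classes if c & X)   # |R^(Xi)|
    precision = lower_total / upper_total  # alpha_R(F)
    quality = lower_total / len(universe)  # gamma_R(F)
    return precision, quality

# Toy partition and classification (invented for illustration).
classes = [{1, 2}, {3, 4}, {5}]
F = [{1, 2, 3}, {4, 5}]
alpha, gamma = approximation_metrics(classes, F)
# lower sums: 2 + 1 = 3; upper sums: 4 + 3 = 7
# alpha == 3/7, gamma == 3/5
```

Precision compares lower to upper approximations, while quality compares the certain region to the whole universe.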

Let T = (U, C ∪ D, V, f) be a decision system. Ind(C, D) = {(x, y) | (x, y) ∈ U × U, ∀r ∈ C, r(x) = r(y)} is called the indiscernibility relation of T. A reduction of the condition attribute set C is a nonempty subset C′ ⊆ C such that:

Ind(C′, D) = Ind(C, D);
no C″ ⊂ C′ makes Ind(C″, D) = Ind(C, D).

The set of reductions of C is written Red(C). If C′ ⊆ C, r ∈ C′, and r cannot be cut out relative to an indiscernibility class X, we call C′ a reduction relative to the indiscernibility class X. The reduction set of each indiscernibility class Xi (i = 1, 2, …, n) is written Red(C, Xi), and Core(C, Xi) = ∩Red(C, Xi) is called the core relative to Xi. In this way, we can get the simplest decision table using relative reduction.
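The definition above can be checked by brute force on a small table. The sketch below finds absolute reducts only, i.e. minimal attribute subsets preserving Ind(C); the paper's relative reduction additionally takes the decision attributes D into account. The records are hypothetical:

```python
from itertools import combinations

def ind(objects, attrs):
    """Partition of object indices under Ind(attrs)."""
    groups = {}
    for i, o in enumerate(objects):
        groups.setdefault(tuple(o[a] for a in attrs), set()).add(i)
    return frozenset(frozenset(g) for g in groups.values())

def reducts(objects, C):
    """All minimal subsets C' of C with Ind(C') = Ind(C), by brute force."""
    full = ind(objects, C)
    found = []
    for size in range(1, len(C) + 1):
        for sub in combinations(C, size):
            if ind(objects, list(sub)) == full:
                # keep only minimal subsets: skip supersets of known reducts
                if not any(set(r) <= set(sub) for r in found):
                    found.append(sub)
    return found

# Hypothetical records: attribute 'c' duplicates 'a', so it is redundant.
objs = [{'a': 0, 'b': 0, 'c': 0},
        {'a': 0, 'b': 1, 'c': 0},
        {'a': 1, 'b': 1, 'c': 1}]
r = reducts(objs, ['a', 'b', 'c'])  # -> [('a', 'b'), ('b', 'c')]
```

Since 'c' carries the same information as 'a', either {a, b} or {b, c} preserves the full indiscernibility relation, and their intersection {b} is the core.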

Theorem 1:

Let (U, R) be a knowledge base. Then the scaled many-valued context sc(U, R) := ((U, R, W, I), (S_R | R ∈ R)) is defined by W := {[x]_R | x ∈ U, R ∈ R} and (x, R, w) ∈ I :⇔ w = [x]_R, with the nominal scale S_R := (U/R, U/R, =) for each many-valued attribute R ∈ R. Then the indiscernibility classes of (U, R) are exactly the contingents of the derived context K of sc(U, R).

Theorem 2:

Let SC := ((G, M, W, I), (S_m | m ∈ M)) be a scaled many-valued context, and K := (G, {(m, n) | m ∈ M, n ∈ M_m}, J) its derived context. Then the knowledge base kb(SC) is defined by kb(SC) := (G, R), where R := {R_m | m ∈ M} and, for m ∈ M, R_m := {(g, h) ∈ G × G | γ_m(g) = γ_m(h)}, where γ_m is the object-concept mapping of the m-part of K; clearly, the m-part of K is the formal context (G, {(m, n) | n ∈ M_m}, J_m), where J_m := {(g, (m, n)) ∈ J | n ∈ M_m}. Then the indiscernibility classes of kb(SC) are exactly the contingents of the derived context K of SC.

Theorem 3:

For any knowledge base (U, R): kb(sc(U, R)) = (U, R).

The intersection of all reductions of C is called the core of C: Core(C) = ∩Red(C). For a decision system T = (U, C ∪ D, V, f), U is partitioned into indiscernibility classes X1, X2, …, Xn by Ind(C, D). Let D(Xi) = {v = fq(x) | x ∈ Xi, q ∈ D} be the set of values taken by the decision attribute set D. If D(X) is unchanged under Ind(C − {r}, D), then the condition attribute r ∈ C can be cut out relative to X.

4. Solution Methodology

The data sets used in this study are CM1, JM1, KC1, KC2 and PC1. They are shown in Table 1 and are freely available to other researchers via the web interface to NASA's Metrics Data Program (MDP).

Table 1 Data sets used in this study

Project   Number of instances   False (no defect)   True (defect)
CM1       498                   449                 49
JM1       10885                 8779                2106
KC1       2109                  1783                326
KC2       522                   415                 107
PC1       1109                  1032                77

All data comes from McCabe and Halstead feature extractors of source code. These features were defined in the 1970s in an attempt to objectively characterize code features that are associated with software quality. The McCabe and Halstead measures are "module"-based, where a "module" is the smallest unit of functionality. McCabe argued that code with complicated pathways is more error-prone; his metrics therefore reflect the pathways within a code module. Halstead argued that code that is hard to read is more likely to be fault-prone; he estimates reading complexity by counting the number of concepts in a module. The McCabe metrics are a collection of four software metrics: essential complexity, cyclomatic complexity, design complexity and LOC (lines of code). The Halstead measures fall into three groups: the base measures, the derived measures, and lines-of-code measures. The total number of software metrics is 22, including 5 different lines-of-code measures, 3 McCabe metrics, 4 base Halstead measures, 8 derived Halstead measures, a branch count, and 1 goal field, listed as follows:

loc: McCabe's line count of code
v(g): McCabe "cyclomatic complexity"
ev(g): McCabe "essential complexity"
iv(g): McCabe "design complexity"
n: Halstead total operators + operands
v: Halstead "volume"
l: Halstead "program length"
d: Halstead "difficulty"
i: Halstead "intelligence"
e: Halstead "effort"
b: Halstead
t: Halstead's time estimator
lOCode: Halstead's line count
lOComment: Halstead's count of lines of comments
lOBlank: Halstead's count of blank lines
lOCodeAndComment: numeric
uniq_Op: unique operators
uniq_Opnd: unique operands
total_Op: total operators
total_Opnd: total operands
branchCount: of the flow graph
defects: module has/has not one or more reported defects

5. Details of Experimentation, Analysis and Methodology

In preprocessing of the data set, we cluster each software metric into three classes: High, Medium and Low. Many methods of computing distance in clustering can be used, such as the Euclidean distance. We change CM1 into a rough set information table. After applying the data reduction algorithm, rule discovery algorithm and rule reduction algorithm given by us, some available rules are discovered.
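The paper clusters each metric into three classes but does not fix the method. One plausible sketch (equal-frequency tertile binning is an assumption here, not the paper's stated procedure) maps each numeric metric value to 0 (Low), 1 (Medium) or 2 (High):

```python
def discretize_three(values):
    """Map each numeric value to 0 (Low), 1 (Medium) or 2 (High)
    using tertile cut points over the observed values.
    Equal-frequency binning is an assumption for illustration."""
    ordered = sorted(values)
    n = len(ordered)
    t1, t2 = ordered[n // 3], ordered[2 * n // 3]
    return [0 if v < t1 else 1 if v < t2 else 2 for v in values]

# Hypothetical loc values for a handful of modules.
loc = [12, 40, 7, 95, 33, 150]
codes = discretize_three(loc)  # -> [0, 1, 0, 2, 1, 2]
```

Applying such a mapping to every metric column yields the 0/1/2 cells of the information table in Table 2.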

Part of the information table of software defect prediction, coded as T = (U, C ∪ D, V, f), is depicted in Table 2. We have merged completely identical samples into a single record; n denotes the number of records. The condition attributes are {a, b, c, …, u}: 'a' is loc, 'b' is v(g), 'c' is ev(g), …, 'u' is branchCount. The values 0, 1 and 2 in the cells stand for Low, Medium and High. The decision attribute z is defects, and its values 0 and 1 mean False and True.

Table 2 Information table

case   a   b   c   ..  u   Decision(z)
R1     0   0   1   ..  0   0
R2     0   0   0   ..  0   1
R3     0   0   0   ..  0   0
..     ..  ..  ..  ..  ..  ..
R498   0   0   1   ..  0   1

The discernibility matrix M(C, D) is calculated from the information table [13]. Then the reduced discernibility matrix, the RULE information table and the reduced RULE table can be derived.

Table 3 The result table of CM1

case   a   b   c   d   e   g   j   Decision(z)
T1     0   0   0   0   0   0   0   0
T2     0   0   0   0   0   0   1   0
T3     0   0   0   0   0   1   0   0
..     ..  ..  ..  ..  ..  ..  ..  ..
T37    2   2   2   2   2   0   2   1

6. Result and Discussion

Using the data reduction algorithm, we get the reduced information table. The condition attributes are now only a, b, c, d, e, g and j, which correspond to the metrics loc, v(g), ev(g), iv(g), n, l and e. These attributes exhibit high correlation. Applying the rule discovery algorithm, we get the RULE table; using the rule reduction algorithm, we get a new RULE table, shown as Table 3.

So the data set CM1, with 22 attributes and 498 instances, is reduced to a new 8×37 matrix. The theory of rough sets and further research assure that the new matrix preserves the characteristics of CM1. Software defect prediction can thus be researched on the new table instead of CM1: the ratio of attributes to records falls from 22:498 to 8:37.

7. Conclusions

We propose a new method for software defect prediction based on rough set theory, and apply it to the most famous public domain data set, created by NASA's Metrics Data Program (MDP). The results show its splendid performance. Rough set theory derives an approximation space from an equivalence relation; its upper approximation and lower approximation concepts give the area of certainty and the area of uncertainty. Using these concepts, we can process software defect prediction data sets effectively. We have presented a knowledge representation model propitious to rough sets, and have given a data reduction algorithm, a rule discovery algorithm and a rule reduction algorithm. In predicting software defects using the given model, the metrics loc, v(g), ev(g), iv(g), n, l and e exhibit high correlation, and we reduce the attributes from 22 to 8. This shows that our method and algorithms are feasible.

References

[1] K. Kaminsky and G. D. Boetticher, "Better Software Defect Prediction using Equalized Learning with Machine Learners," Knowledge Sharing and Collaborative Engineering, 2004.

[2] V. U. B. Challagulla, F. B. Bastani, I.-L. Yen, and R. A. Paul, "Empirical assessment of machine learning based software defect prediction techniques," The 10th IEEE WORDS'05, 2005.

[3] N. E. Fenton, M. Neil, W. Marsh, P. Hearty, L. Radlinski, and P. Krause, "Predicting software defects in varying development lifecycles using Bayesian nets," Information and Software Technology, 2007, 49:32-43.

[4] N. E. Fenton, P. Krause, and M. Neil, "A probabilistic model for software defect prediction," for submission to IEEE Transactions on Software Engineering, 2001.

[5] G. Denaro and M. Pezze, "An empirical evaluation of fault-proneness models," The 24th International Conference on Software Engineering, 2002.

[6] T. Menzies, J. Greenwald, and A. Frank, "Data mining static code attributes to learn defect predictors," IEEE Transactions on Software Engineering, vol. 33, 2007.

[7] Z. Pawlak et al., "Rough set approach to multi-attribute decision analysis," Warsaw University of Technology, 1993.

[8] W. Pedrycz, Clustering: From Data to Information Granules, John Wiley & Sons, Inc., 2005.

[9] L. Li, W. Yang, X. Li, and Y. Xu, "Research on data mining model based on rough sets," The First International Symposium on Pervasive Computing and Applications (SPCA06), 2006.