
Static code metrics vs. process metrics for software fault prediction using Bayesian network learners

Mälardalen University
School of Innovation, Design and Technology

Author: Biljana Stanić
Thesis for the Degree of Master of Science in Software Engineering (30.0 credits)
Date: 28th October, 2015
Supervisor: Wasif Afzal
Examiner: Antonio Cicchetti

       


       

 

               

"The real question is not whether machines think but whether men do. The mystery which surrounds a thinking machine already surrounds a thinking man."

Burrhus Frederic Skinner


Acknowledgments

I would like to express my deep appreciation to my supervisor, Dr. Wasif Afzal, for his constructive, useful suggestions and guidance throughout my research work.

I would also like to thank:

o the EUROWEB Project¹, funded by the Erasmus Mundus Action II programme of the European Commission;
o Lech Madeyski and Marian Jureczko for the use of their metrics repository.

Finally, I wish to thank my dear ones for their support and for believing in me all this time.

¹ http://www.mrtc.mdh.se/euroweb/


Abstract    

Software fault prediction (SFP) plays an important role in improving software product quality by identifying fault-prone modules. Constructing quality models involves the use of metrics, numbers or attributes that describe real-world entities. Examining the nature of machine learning (ML), researchers have proposed its algorithms as suitable for fault prediction; the information contained in software metrics serves as the statistical data needed to build models for a given ML algorithm. One of the most widely used ML algorithms is the Bayesian network (BN), which is represented as a graph with a set of variables and the relations between them.

This thesis focuses on the use of process and static code metrics with BN learners for SFP. First, we provide an informal review of non-static code metrics. We then create models containing different combinations of process and static code metrics and use them to conduct an experiment. The results of the experiment are statistically analyzed using a non-parametric test, the Kruskal-Wallis test.

The informal review reports that non-static code metrics are beneficial for the prediction process and that their use is highly recommended for industrial projects. The experimental results, however, do not show which process metric gives a statistically significant improvement; further investigation is therefore needed.

 

 

 

 

                   


Contents

Abstract
List of Figures
List of Tables
Abbreviations
1. Introduction
   1.1 Motivation
   1.2 Research questions
2. Background
   2.1 Software fault prediction
   2.2 Software metrics
       2.2.1 Static code metrics
       2.2.2 Process metrics
   2.3 Machine learning
       2.3.1 Bayesian network
             2.3.1.1 The Naive Bayes Classifier
             2.3.1.2 Augmented Naive Bayes Classifier
3. Method
   3.1 Methodology for RQ1
   3.2 Methodology for RQ2
4. Informal review
   4.1 Process metrics
       4.1.1 Code churn metrics
       4.1.2 Developer metrics
       4.1.3 Other process metrics
5. Design
   5.1 Projects
   5.2 Extracted metrics
   5.3 Evaluation of results
   5.4 Experiment
6. Results
   6.1 Statistical analysis of results
   6.2.1 Results of the experiment for NB classifier
       6.2.1.1 Models with combined, static code and process metric
       6.2.1.2 Models with static code and 1 process metrics
       6.2.1.3 Models with a combination of 2 process and static code metrics
       6.2.1.4 Models with a combination of 3 process and static code metrics
   6.2.2 Results of the experiment for TAN classifier
       6.2.2.1 Models with combined, static code and process metric
       6.2.2.2 Models with static code and 1 process metrics
       6.2.2.3 Models with a combination of 2 process and static code metrics
       6.2.2.4 Models with a combination of 3 process and static code metrics
7. Result discussion
   7.1 Related work
8. Validity threats
   8.1 Internal validity
   8.2 External validity
   8.3 Statistical conclusion validity
9. Conclusion
   9.1 Future works
       9.1.1 Model investigation
       9.1.2 Industrial projects
       9.1.3 Data extraction
Reference
A Graphs for NB classifier
B Graphs for TAN classifier


List of Figures

Figure 1. Example of NB [6]
Figure 2. Example of STAN [6]
Figure 3. Methodology for the RQ1
Figure 4. Methodology for the RQ2
Figure 5. Weka Explorer
Figure 6. Set values for NB classifier
Figure 7. Classifier output with results
Figure 8. Graphs with comparison results for NB classifier
Figure 9. Graphs with comparison results for TAN classifier
Figure 10. Graphical comparison of combined, static code and process models using NB
Figure 11. Graphical comparison of models containing 1 process and static code metrics using NB
Figure 12. Graphical comparison of models containing the combination of 2 process and static code metrics using NB
Figure 13. Graphical comparison of models containing the combination of 3 process and static code metrics using NB
Figure 14. Graphical comparison of combined, static code and process models using TAN
Figure 15. Graphical comparison of models containing 1 process and static code metrics using TAN
Figure 16. Graphical comparison of models containing the combination of 2 process and static code metrics using TAN
Figure 17. Graphical comparison of models containing the combination of 3 process and static code metrics using TAN

     


List of Tables

Table 1. Selected projects for the experiment
Table 2. Results for combined, static code and process metric using NB classifier
Table 3. Comparison results for combined, static code and process models for NB classifier
Table 4. Results for models containing 1 process and static code metric using NB classifier
Table 5. Comparison results for models containing static code and 1 process metrics using NB classifier
Table 6. Results for models containing combination of 2 process and static code metrics using NB classifier
Table 7. Comparison results for models containing combination of 2 process and static code metrics using NB classifier
Table 8. Results for models containing combination of 3 process and static code metrics using NB classifier
Table 9. Comparison results for models containing combination of 3 process and static code metrics using NB classifier
Table 10. Results for combined, static code and process models using TAN classifier
Table 11. Comparison results for combined, static code and process models using TAN classifier
Table 12. Results for models containing 1 process and static code metrics using TAN classifier
Table 13. Comparison results for models containing static code and 1 process metrics using TAN classifier
Table 14. Results for models containing combination of 2 process and static code metrics using TAN classifier
Table 15. Comparison results for models containing combination of 2 process and static code metrics using TAN classifier
Table 16. Results for models containing combination of 3 process and static code metrics using TAN classifier
Table 17. Comparison results for models containing combination of 3 process and static code metrics using TAN classifier

     

 

     


Abbreviations      

SFP   Software  Fault  Prediction  

ML   Machine  Learning  

BN   Bayesian  Network  

DAG   Directed  Acyclic  Graph  

NB   Naive  Bayes  

ANB   Augmented  Naive  Bayes  

TAN   Tree  Augmented  Naive  Bayes  

FAN   Forest  Augmented  Naive  Bayes  

STAN   Selective Tree Augmented Naive Bayes

STAND   Selective Tree Augmented Naive Bayes with Discarding

SFAN   Selective Forest Augmented Naive Bayes

SFAND   Selective Forest Augmented Naive Bayes with Discarding

ANOVA   Analysis  of  Variance  

WMC   Weighted  Methods  per  Class  

DIT   Depth  of  Inheritance  Tree  

NOC   Number  Of  Children  

CBO   Coupling  Between  Object  class  

RFC   Response  For  a  Class  

LCOM   Lack  of  Cohesion  in  Methods  

LCOM3   Lack  of  Cohesion  in  Methods  (normalized  version  of  LCOM)  

Ca   Afferent  Coupling  

Ce   Efferent  Coupling  

LOC   Lines  Of  Code  

NPM   Number  of  Public  Methods  


DAM   Data  Access  Metric  

MOA   Measure  Of  Aggregation    

CAM   Cohesion  Among  Methods  of  class    

IC   Inheritance  Coupling    

CBM   Coupling  Between  Methods  

AMC   Average  Method  Complexity  

CC   McCabe’s  Cyclomatic  complexity    

NR   Number  of  Revisions    

NDC   Number  of  Distinct  Committers    

NML   Number  of  Modified  Lines    

NDPV   Number  of  Defects  in  the  Past  Version    

ROC   Receiver  Operating  Characteristic  

AUC   Area  Under  the  Curve  

RQ   Research  Question  

RQ1   Research  Question  1  

RQ2   Research  Question  2                                    


1. Introduction

Introducing the term 'software engineering measurement' and its basic characteristics creates a logical step towards defining another term: software metrics. Software metrics are numbers or attributes, formed by definite rules, that describe real-world entities. Moreover, certain software quality assurance activities make use of such metrics; one of them is the construction of quality models based on metrics. In that manner, quality metrics are beneficial for determining whether the software system delivers the intended functionality [1].

A software fault prediction (SFP) model is one type of quality model that has attracted a lot of research in the past few decades [2]. The complexity of a system introduces mistakes in the code, labeled as faults² [16], and discovering them as the system grows is not an easy task for developers [3]. The process of detecting faults can be long, even endless, since it is hard to claim that a system is 100% fault-free. It also requires additional resources during testing, which increases the cost of software development. One solution is the creation of a prediction model that guides the development team towards quicker and more efficient fault detection [3]. With the help of metrics, it is possible to define a model that is responsible for predicting faults. Hall et al. [4] claim that there are many complex models that deal with the problem of fault prediction, but that, despite this, there is a visible lack of information about the actual state of this area.

Song et al. [2] list the three most researched problems in SFP:

o Determining the number of faults that were not identified during testing;
o Finding connections between the remaining faults; and
o Classification of components that are fault-prone.

To solve the above-mentioned problems, much attention has been paid to applying machine learning techniques. Machine learning (ML) is about programming machines in order to get optimized results using statistical data or previous experience. ML uses statistical rules to build different mathematical models necessary for drawing conclusions from a sample [15]. The nature of ML, and of some of its algorithms, is suitable for creating fault prediction models, and has shown good results. ML algorithms use patterns to identify and classify features, e.g., whether a software component contains a fault or not. One of the most widely used ML algorithms is the Bayesian network (BN) [2]. A BN is represented as a directed acyclic graph (DAG) containing a set of variables, a structure defined by the relations between those variables, and a set of probability distributions. For cases such as software products, a BN can depict the relationship between software features and possible faults. The network can then be used to compute the probabilities of the presence of different faults for specified features; that way, we know which faults may occur and how to control and isolate them [5].

² A software fault causes malfunctioning of a software product.


1.1 Motivation

The main arguments for this thesis are found in recent publications [6], [7], where the use of BNs has been found to perform "surprisingly well" [6].

Dejaeger et al. [6] conducted experiments in which different BN learners were compared using static code metrics. They made this choice of metrics because such metrics are easy to gather and widely used in practice. They recorded better predictions in cases where a smaller set of highly predictive features was used. As future work, they stated:

"Recently, several researchers turned their attention to another topic of interest, i.e., the inclusion of information other than static code features into fault prediction models such as information on intermodule relations and requirement metrics. The relation to the more commonly used static code features remains however unclear. Using, e.g., Bayesian network learners, important insights into these different information sources could be gained which is left as a topic for future research." [6]

This conclusion represents the starting point and the main motivation for the definition of this thesis. Dejaeger et al. [6] indicated that one interesting direction for investigating BN learners for SFP is to include non-static code metrics. They propose information on intermodule relations and requirement metrics; these two are just examples of the many other non-static code metrics that can be used for SFP.

Another paper that deals with BNs for SFP, and that provides further motivation, is by Okutan et al. [7]. In their experiment, they reported the performance of several static code metrics: metrics carrying information about the number of lines of code (LOC), low quality of coding style and class response were the most effective. Moreover, they presented other conclusions of their experiment and possible steps for extending their research:

"As a future direction, we plan to refine our research to include other software and process metrics in our model to reveal the relationships among them and to determine the most useful ones in defect prediction. We believe that rather than dealing with a large set of software metrics, focusing on the most effective ones will improve the success rate in defect prediction studies." [7]

Following the statement in [6], Okutan et al. propose the use of process metrics.

Radjenović et al. [8] have conducted a systematic literature review on SFP metrics and collected results that support the statement in [7] about process metrics and their future use for fault prediction. They came up with the following conclusions:

o Process metrics have shown better results than static code metrics in detecting faults at the post-release level;
o Compared to static code metrics, process metrics contain more descriptive information related to fault distribution;
o Better prediction of faulty classes was achieved using process metrics;
o Unlike object-oriented metrics, which mostly rely on small datasets, process metrics are applicable to larger datasets, which is an advantage when it comes to the validity and maturity of research results [4];
o Process metrics can be very beneficial, but they are mostly used in the industrial domain. It is therefore a challenge for researchers to use process metrics in their future studies.

 

1.2 Research questions

Taking into account the collected statements about SFP, BNs and the use of non-static code metrics, the structure of the thesis is defined by the following research questions:

o RQ1: What is the current state-of-the-art with respect to the use of non-static code metrics for SFP?

o RQ2: What is the impact of process metrics combined with static code metrics, in terms of performance, when using BN learners for SFP?

For RQ1, we will discuss and explain in detail the current state-of-the-art of non-static code metrics, with a focus on process metrics; the research will be driven by the conclusions made by Radjenović et al. [8]. For RQ2, we will conduct experiments in which several process metrics are combined with static code metrics and used with BN learners. The results of this thesis can give new insight into software metrics that can be beneficial for SFP.

The content of the thesis is organized as follows: Section 2 contains background information about SFP, software metrics, ML and BNs. Section 3 explains the method for collecting resources for the informal review and the experiment. In Section 4, the informal review of non-static code metrics is presented. Section 5 contains information about the design of the experiment, the datasets used and the response variable. In Section 6, the results of the experiment and their statistical analysis are presented. Section 7 provides a discussion comparing the experimental outcomes. Section 8 explains possible validity threats. Finally, Section 9 offers conclusions and lists areas of future work.


2. Background

In this section, we describe SFP, the purpose of software metrics, and how ML can be used for fault prediction (with a focus on BN learners).

2.1 Software fault prediction

SFP, as a quality model, is recognized as an important tool for improving a software product by identifying possible faults that can occur in the system. Faults directly jeopardize the software, which decreases its performance and, furthermore, its quality [16].

Software systems are designed to handle complex activities in different domains. Because of their critical nature, every system is supposed to provide quality of service to its users. As a system grows, the number of potential faults grows with it. Dejaeger et al. [6] found several cases where software faults were examined in terms of reliability and bug localization. In order to test reliability, we need to create a stochastic model that outputs the probability of a fault existing once the component is executed; combining the remaining components, we are able to estimate the reliability of the whole system. Bug localization is based on the use of certain patterns that are associated with faulty components. Using this approach, we can discover faults that were not previously detected [6].

We have seen that it is crucial to classify and identify fault-prone modules of the software in time. Different studies in this area have shown that, in the majority of cases, faults occur in just a few modules, and that this causes malfunctioning of other parts of the system, especially those that are in direct relation with the faulty ones. This means that the costs of production and maintenance increase due to the effect of faults in these modules [9]. Therefore, much research has been focused on making fault predictions for a system under development. Such predictions focus on locating modules that can be shown to be fault-prone [10]. Thus, creating accurate predictions is useful for increasing the quality of the system [9].

To be able to achieve good results, software engineers have to find a suitable prediction technique that will help them in their intention of detecting faults and conducting experimental validation [10]. Since direct measurement for fault prediction is impossible, we need to use metrics for the necessary estimation. Reliable prediction results depend on the selection of software metrics. One needs to select appropriate metrics, because there are a number of datasets that can make prediction harder and might lead to unsatisfactory and misleading results.

Once the metrics are chosen, it is necessary to decide which technique will be used for making predictions. ML is a non-parametric³ technique that is frequently used for SFP. Sometimes datasets can be heavily skewed, which results in inaccurate predictions. ML techniques have a mechanism to overcome such problems, because they have the ability to "learn from imbalanced datasets" [11].

In Sections 2.2 and 2.3, we describe software quality metrics and ML as key elements in the SFP process.

³ Distribution-free.


2.2 Software metrics

Software quality metrics deserve particular attention in SFP since they can be used for measuring the quality of the system. They are further divided into in-process and end-process metrics. The first group is aimed at improving the development process, whereas end-process metrics focus on assessing the characteristics of the final product.

Based on which parts of the system are measured, there are two types of software quality metrics:

o Static metrics;
o Dynamic metrics.

Static code metrics are suitable for checking attributes of the code, such as the complexity of the software and the length of the code. Dynamic code metrics, to a great extent, examine the behavior of the system, expressed as usability, reliability, maintainability and an evaluation of the efficiency of the program [6].

In Sections 2.2.1 and 2.2.2, we briefly introduce the metrics, namely static code and process metrics, that are relevant for the remainder of this thesis.

2.2.1 Static code metrics

Static code metrics are a type of quality metric. Their usefulness is most noticeable when it comes to measuring:

o Size (through lines of code (LOC) counts);
o Complexity (using linearly independent path counts);
o Readability (through counts of operators and distinct operands).

The calculation of static code metrics is based on parsing the source code; the metrics collection process can therefore be automated. Because of this, it is feasible to measure metrics for the whole system, regardless of its size. Moreover, it is possible to make predictions about the entire system based on the metrics: developers can easily find faulty modules since they have a clear picture of the system's vulnerabilities [12]. Static code metrics are easy to collect and widely used in practice; they therefore represent a safe choice for predicting faulty software [6].
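As an aside, the following is a minimal sketch of how such automated collection might look for a single Python source file. It is purely illustrative and not the metrics tooling used in this thesis; the file name is hypothetical and the "complexity" value is only a rough approximation of a linearly independent path count.

```python
import ast

def static_code_metrics(path):
    """Compute a few simple static code metrics for one Python file."""
    with open(path, encoding="utf-8") as f:
        source = f.read()

    # Size: non-blank lines of code.
    loc = sum(1 for line in source.splitlines() if line.strip())

    tree = ast.parse(source)

    # Rough cyclomatic complexity: 1 + number of decision points.
    decisions = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp)
    complexity = 1 + sum(isinstance(node, decisions) for node in ast.walk(tree))

    # Readability proxy: distinct operators and operands (variable names).
    operators = {type(n.op).__name__ for n in ast.walk(tree)
                 if isinstance(n, (ast.BinOp, ast.UnaryOp, ast.BoolOp))}
    operands = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}

    return {"loc": loc, "complexity": complexity,
            "distinct_operators": len(operators), "distinct_operands": len(operands)}

print(static_code_metrics("example_module.py"))  # hypothetical file
```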

2.2.2 Process metrics

Process metrics are also used for measuring the quality of the system [8]. They can be derived from various sources, such as:

o The developer's experience [10];
o The software change history [8], etc.

Developer experience metrics concentrate on activities that capture how a certain part of the code (or the whole system) was developed [10].

The metrics determined from the software change history split into two groups:

o Delta metrics;
o Code churn metrics.


Delta metrics are defined as the difference between versions of the software [8]; the result is the change in a metric's value from one version to another. An illustrative example: when we add new lines of code and save those changes, the delta value will differ. However, when we add and at the same time remove the same number of lines of code, the delta value between those two versions stays the same. To be able to track such changes, code churn metrics report every activity on the code.

The advantage of using process metrics is that they contain more descriptive information about the faulty parts of the code. They are also good at making fault predictions at the post-release level. Since process metrics are used mostly for industrial purposes, they are able to handle large datasets, which leads to better validity of the predictions [8].
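A minimal sketch of the delta vs. code churn distinction described above, assuming two revisions of a file are available as lists of lines (the revisions themselves are invented): the delta of the LOC metric can cancel out, while churn records every added and deleted line.

```python
import difflib

def delta_and_churn(old_lines, new_lines):
    """Delta of the LOC metric vs. code churn between two revisions."""
    delta_loc = len(new_lines) - len(old_lines)            # can cancel out
    diff = list(difflib.ndiff(old_lines, new_lines))
    added = sum(1 for line in diff if line.startswith("+ "))
    deleted = sum(1 for line in diff if line.startswith("- "))
    churn = added + deleted                                 # records every change
    return delta_loc, churn

old = ["int a;", "int b;", "return a + b;"]
new = ["int a;", "int b2;", "return a + b;"]                # one line replaced
print(delta_and_churn(old, new))  # -> (0, 2): delta hides the change, churn does not
```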

2.3 Machine learning

ML is about programming machines in order to get optimized results using statistical data or previous experience. ML uses statistical rules to build different mathematical models necessary for drawing conclusions from a sample [15]. Such models can predict future steps, form a description based on knowledge from different data, or combine both in particular cases [15]. The first step in building a model is to train it on data using certain algorithms in order to optimize the specified problem. The learned model has to be efficient in terms of time and space complexity.

ML has several applications in:

o Learning Associations;
o Supervised Learning;
o Unsupervised Learning;
o Regression;
o Reinforcement Learning.

We will use real-life examples to explain these types of ML.

Learning Associations is suitable for "learning a conditional probability". The probability is presented in equation 1:

$P(Y \mid X)$   (1)

where Y is a variable that is conditioned on X. Moreover, X can be a single variable or a set of variables of the same type as Y. Take the example of a bookstore: Y can be a book that we condition on X. Based on the customer's behavior, we know that when (s)he buys book X, there is a high probability that Y will be bought as well.

In supervised learning, input features are related to corresponding outputs, and the machine has to learn the rules of the mapping between these two parameters. Supervised learning is applicable to prediction tasks, where it is necessary to identify connections between different measures. For unsupervised learning, on the other hand, output data are not required; instead, it examines the structure within some input dataset. Unsupervised learning is not suitable for predicting the existence of software faults in a system because, by its nature, it creates results without previously specified output data.

Classification and regression are two cases of supervised learning problems. In classification, we make predictions based on a rule learned from past data. Assuming that behavior is similar in the past and in the future, we can easily produce predictions for every future case. The input contains the data we need to analyze, whereas the output is represented as classes that have a descriptive value. Regression has the same approach to selecting input data as the classification problem, but its output is a numerical value. In classification and regression problems, we create the model shown in equation 2:

$y = g(x \mid \theta)$   (2)

where y is the class, in classification, or the numerical value, in regression. The model is represented as $g(\cdot)$ and the model's parameters as $\theta$. The task of ML is to optimize the values of $\theta$ in order to minimize the approximation error.

Reinforcement Learning takes a set of actions as its input. In such a system, it is important that all actions are part of a "good policy" [15], which will lead to obtaining correct results. In this particular problem, we do not observe a single action; the task of the program is to learn which characteristics form the good policy based on a past set of actions [15].

Finally, ML has a mechanism to overcome problems with datasets that are heavily skewed and that can cause inaccurate results [11]. This basically means that it is possible to create a prediction even if some data are missing, or if there is a large number of variables, etc. [17].

The BN is a type of supervised learning paradigm and one type of algorithm suitable for SFP (see Section 2.3.1).
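Before turning to BNs, here is a small, hedged illustration of the classification setting in equation 2. The sketch fits a parametric model g(x | θ) with scikit-learn on synthetic data and reports its training error; the features and labels are invented for the example and this is not the experimental setup of the thesis, which is run in Weka (see Section 5).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic training data: x holds illustrative module metrics (e.g., size and
# churn), y marks whether the module turned out to be faulty.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# g(x | theta): logistic regression, theta are the fitted coefficients.
model = LogisticRegression().fit(X, y)
theta = (model.coef_, model.intercept_)

# The learner chooses theta so that the approximation error on the data is small.
error = 1.0 - model.score(X, y)
print("theta:", theta)
print("training error:", error)
```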

2.3.1 Bayesian network

The structure of a BN is presented as a directed acyclic graph (DAG) consisting of a set of variables, a structure that is defined by the relations between the variables, and a set of probability distributions [5]. Considering the nature of SFP, a BN can be used for calculating probabilities regarding the presence of faults given the software features.

The BN, graphically presented as a graph, contains the following elements:

o Variables, which are presented as vertices or nodes;
o Conditional dependencies, which are edges or arcs.

This graph cannot contain cycles and all edges inside it have to be directed. BNs provide a theoretical framework that, combined with statistical data, can give good fault predictions [5]. The model for SFP can be defined as (see equation 3):

$D_{trn} = \{(x_n, y_n)\}_{n=1}^{N}$   (3)

 


where D is a set with N observations, in which $x_n \in \mathbb{R}^m$ holds all static and non-static code features and $y_n \in \{0, 1\}$ indicates the presence of a fault. Using Bayes' theorem in the BN to calculate the probability of fault presence, we obtain equation 4:

$P(y_n = 1 \mid x_n) = \dfrac{P(x_n \mid y_n = 1)\, P(y_n = 1)}{P(x_n)}$   (4)

The concept of the BN is based on probability distributions over stochastic variables that can be both continuous and discrete. The individual variables $x^{(i)}$ are used to construct the graph, and dependencies between those variables are presented with directed arcs. In the same way, independence between different nodes $x^{(i)}$ and $x^{(j)}$ is shown by the absence of an arc [6].

Defining and building a BN consists of three steps [14]:

o "Set" and "field" variables have to be defined;
o The network topology has to be constructed;
o The probability distribution at the local level has to be identified.

Set and field variables have to be defined
The "set" variables capture possible factors that cause software faults, while the "field" variable maps the range of all variables to a degree of software faults. The value of a "set" variable can vary depending on the system, organization and environment where it will be used. The "field" variable can be stated as "high", "middle" or "low" (or more precisely if required).

The network topology has to be constructed
The network topology is defined using event relations. The construction of a particular topology depends on studies or literature, and it can be changed if the experiments demand it.

The probability distribution at the local level has to be identified
In order to identify the probability distribution at the local level, it is required to find the marginal and conditional probabilities. The probability distribution can be used to emphasize the "affect degree of causality" [14]. (A small illustration of these three steps is given after the list below.)

BN classifiers are used for problems that require classification. Other attractive features of these classifiers are presented in the following list:

o Models can be created even if the knowledge is uncertain;
o The probabilistic model can be used for cost-sensitive problems;
o The nature of BN classifiers can deal with issues related to missing data;
o Classifiers can solve complex classification problems;
o Future work on BNs includes presenting models in a hierarchical manner based on their complexity;
o It is possible to use BN classifiers with algorithms that have linear time complexity;
o Models that use a BN have shown very good performance [4].
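As the forward reference above indicates, here is a minimal sketch of the three steps for a tiny, invented network (two fault-cause variables and one fault node; all probability values are made up for illustration and are not the models built in this thesis). Inference is done by brute-force enumeration rather than with a real BN library.

```python
from itertools import product

# Step 1: define the variables. The "set" variables are possible fault causes,
# the "field" variable is the degree of fault-proneness (here simply 0/1).
variables = {"size": (0, 1), "churn": (0, 1), "fault": (0, 1)}

# Step 2: construct the topology: arcs size -> fault and churn -> fault.
parents = {"size": (), "churn": (), "fault": ("size", "churn")}

# Step 3: identify the local probability distributions (values are invented).
cpt = {
    "size":  {(): [0.7, 0.3]},                                   # P(size)
    "churn": {(): [0.8, 0.2]},                                   # P(churn)
    "fault": {(0, 0): [0.95, 0.05], (0, 1): [0.6, 0.4],
              (1, 0): [0.7, 0.3],   (1, 1): [0.2, 0.8]},         # P(fault | size, churn)
}

def joint(assignment):
    """P(full assignment) as the product of the local distributions."""
    p = 1.0
    for var, pa in parents.items():
        key = tuple(assignment[q] for q in pa)
        p *= cpt[var][key][assignment[var]]
    return p

def fault_posterior(evidence):
    """P(fault = 1 | evidence) by enumerating the remaining variables."""
    totals = [0.0, 0.0]
    free = [v for v in variables if v not in evidence and v != "fault"]
    for fault_value in (0, 1):
        for combo in product(*(variables[v] for v in free)):
            assignment = dict(zip(free, combo), fault=fault_value, **evidence)
            totals[fault_value] += joint(assignment)
    return totals[1] / sum(totals)

print(fault_posterior({"size": 1}))   # fault probability for a large module
```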


In Sections 2.3.1.1 and 2.3.1.2, we present the two types of BN learners (classifiers) that will be used in the experiment.

2.3.1.1 The Naive Bayes Classifier

The Naive Bayes (NB) classifier is based on "conditional independence between attributes given the class label" [6], which is represented with nodes in a directed acyclic graph where one node is the parent and the rest are children. A node corresponds to a certain value in the dataset. Results from different studies have shown that NB classifiers give good results in fault prediction.

In order to calculate the fault probability for the classes, each of them will have a vector of input variables for every new code segment. The resulting probabilities are obtained using "frequency counts for the discrete variables and a normal or kernel density-based method for continuous variables" [6]. Because of their simple nature, NB classifiers can be constructed very easily and are computationally efficient. The structure of an NB is shown in Figure 1 [6].

 Figure  1.  Example  of  NB  [6]  

As shown in Figure 1, an NB consists of one unobserved parent variable (node y) and a number of observed children variables (the x nodes).
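To make the conditional-independence assumption concrete, the following is a small hand-rolled sketch of an NB posterior computed from frequency counts, as described above. The tiny training set, the attribute values and the Laplace smoothing are illustrative assumptions, not data or details from the thesis.

```python
from collections import Counter, defaultdict

# Invented toy data: (attribute vector, class label), 1 = faulty, 0 = fault-free.
train = [(("high", "many"), 1), (("high", "few"), 1),
         (("low", "few"), 0), (("low", "many"), 0), (("low", "few"), 0)]

prior = Counter(label for _, label in train)
counts = defaultdict(Counter)                      # counts[(attribute index, label)][value]
for x, label in train:
    for i, value in enumerate(x):
        counts[(i, label)][value] += 1

def score(x, label, alpha=1.0):
    """Unnormalised P(label) * prod_i P(x_i | label), with Laplace smoothing alpha."""
    p = prior[label] / len(train)
    for i, value in enumerate(x):
        c = counts[(i, label)]
        p *= (c[value] + alpha) / (sum(c.values()) + alpha * 2)   # 2 distinct values per attribute here
    return p

scores = {label: score(("high", "few"), label) for label in (0, 1)}
print({label: s / sum(scores.values()) for label, s in scores.items()})   # posterior fault probability
```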

2.3.1.2 Augmented Naive Bayes Classifier
Augmented Naive Bayes (ANB) classifiers were created as a modification of the basic conditional independence assumption of the NB. We will show the changes that are made in the graph by adding new arcs and removing unnecessary variables.


One example of an ANB classifier is the Tree Augmented Naive Bayes (TAN), in which each variable has one additional parent. The Semi-Naive Bayesian classifier, on the other hand, “partitions the variables into pairwise disjoint groups”. Finally, the Selective Naive Bayes omits some variables in order to cope with correlation between attributes.

There are several ANB classifiers:

o Tree Augmented Naive Bayes (TAN);
o Forest Augmented Naive Bayes (FAN);
o Selective Tree Augmented Naive Bayes (STAN);
o Selective Tree Augmented Naive Bayes with Discarding (STAND);
o Selective Forest Augmented Naive Bayes (SFAN);
o Selective Forest Augmented Naive Bayes with Discarding (SFAND).

A graphical representation of the STAN is shown in Figure 2.

 Figure  2.  Example  of  STAN  [6]  

Figure 2 shows that each child (x) node is allowed to have an additional parent besides the class node. We will use the NB as a baseline against which the other BN classifiers are compared, giving a clear picture of all dependencies between attributes [6].

 

 

 


3. Method
This thesis consists of 2 parts, corresponding to the RQs. We answer RQ1 by performing an informal review of the field, and for RQ2 we conduct an experiment.

3.1 Methodology for RQ1
In order to answer RQ1, we have to collect material related to:

o SFP;
o Metrics;
o ML; and
o BN.

Our goal is not to conduct a systematic literature review on the topics of SFP, software metrics or ML and its algorithms, since this has already been done in several studies. Instead, we will use the collected sources to present the state of the art for non-static code metrics in the process of SFP.

Furthermore, in this section we describe the procedure that was followed when selecting papers relevant for the thesis topic.

Step 1: Define keywords relevant for RQ1:

o SFP;
o Software defect prediction;
o Code metrics;
o ML;
o BN.

Step 2: Define filters in terms of publication years and type of paper
The selected papers are published between 2010 and 2015, and each of them has to be a journal and/or conference paper. The 5-year range was set in order to review only recently published results, which also include some recent literature reviews on the topic.

Step 3: Use the keywords and filters in several databases:

o IEEExplore⁴;
o ACM Digital Library⁵;
o Scopus⁶ (used for validation of papers that have been found in the first two databases).

Step 4: Collect and select results
To obtain suitable results from the databases, the search strings often contained combinations of terms using the Boolean AND and OR operators, such as:

(TITLE-­‐ABS-­‐KEY  AND  TITLE-­‐ABS-­‐KEY)  AND  DOCTYPE(ar  OR  re)  AND  (  LIMIT-­‐TO(SUBJAREA  )  AND  RECENT()7.  

4 http://ieeexplore.ieee.org/Xplore/home.jsp
5 http://dl.acm.org/
6 http://www.scopus.com/
7 The example of the search string is taken from the Scopus database.


Papers with the SFP and process metrics keywords found in the Scopus database were used for snowball sampling, collecting other papers relevant for this topic within the earlier mentioned 5-year range. During the selection process, papers that appeared in the search results but were irrelevant were rejected, and other new papers were taken into consideration. The majority of the papers were found in the IEEExplore database.

Step 5: Present the informal review
Present all relevant information regarding non-static code metrics.

In Figure 3, we illustrate the activities that are part of the methodology for answering RQ1.

 Figure  3.  Methodology  for  the  RQ1  

 

 

 

 


3.2 Methodology for RQ2
We identified the following activities, suggested in [15], for completing RQ2:

o Find suitable projects with datasets containing extracted static and process metrics;
o Decide upon a response variable;
o Choose the design of the experiment;
o Conduct the experiment for the projects, using the previously selected BN classifiers;
o Compare the results using a statistical analysis;
o Publish conclusions about the results of the statistical analysis.

Step 1: Find suitable projects with datasets containing extracted static and process metrics
We will analyze the datasets available in the Metric Repository⁸, collected from different open-source projects. The datasets were used in the study by Madeyski et al. [18].

Step 2: Decide upon a response variable
As the quality measure we will use the area under the curve (AUC), proposed in [6]. AUC is determined by the average performance over all thresholds.

Step 3: Choose the design of the experiment
Our experimental design is based on replication [15], meaning that the experiment is run a defined number of times in order to obtain an average value of the chosen quality measure. More specifically, we will use cross-validation.

Step 4: Conduct the experiment for the projects, using the previously selected BN classifiers
As presented in the Background section, we will use the NB and TAN classifiers. Combining these classifiers with the datasets, we create a learner; by training multiple times and testing on the validation sets, we obtain results for the desired measure [15]. Implementations of the classifiers are available through Weka⁹, as suggested in [6], which speeds up the collection of prediction results for the chosen datasets.

Step 5: Compare the results using a statistical analysis
The statistical analysis is conducted to provide objective and unbiased results when comparing the different datasets. Since we operate with 17 datasets and 2 classifiers, we will decide between a one-way analysis of variance (ANOVA) and the Kruskal-Wallis test once the experimental results are collected.

Step 6: Publish conclusions about the results of the statistical analysis
The statistical analysis answers whether the samples of the datasets differ significantly. If there are differences, we will discuss which datasets are superior to others; otherwise, we will give an explanation and possible improvements for future experiments.

                                                                                                               8  http://purl.org/MarianJureczko/MetricsRepo  9  http://www.cs.waikato.ac.nz/ml/weka/documentation.html  
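As an illustration of Steps 2–4, the following minimal sketch computes a cross-validated AUC for one dataset. It is only a stand-in: the thesis experiment is run in Weka with the NB and TAN classifiers, whereas the sketch uses scikit-learn's GaussianNB, and the file name "ant-1.7.csv" and the "bug" column are assumed names, not the repository's layout.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Assumed file layout: one CSV per project version with metric columns and a "bug" count.
data = pd.read_csv("ant-1.7.csv")
y = (data["bug"] > 0).astype(int)                         # response: faulty vs. fault-free class
X = data.drop(columns=["bug"]).select_dtypes("number")    # keep the numeric metric columns

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
auc_per_fold = cross_val_score(GaussianNB(), X, y, cv=cv, scoring="roc_auc")
print(f"mean AUC over 10 folds: {auc_per_fold.mean():.3f}")
```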


In Figure 4, we illustrate the activities that are part of the methodology for answering RQ2.

 Figure  4.  Methodology  for  the  RQ2  

   We  will  describe  the  experiment  with  detailed  requirements  in  Section  5:  Design.      


4. Informal review
In order to answer RQ1, we collected recent papers that deal with SFP using non-static code metrics. The best sources of information were a systematic literature review and studies whose objectives were to investigate the performance of process metrics for various open-source and industrial projects and software systems. We classified the relevant findings into several groups based on their occurrence in the papers. The groups are the following:

o Process metrics in general;
o Code churn metrics;
o Developer metrics;
o Other process metrics.

 

4.1 Process metrics
In their systematic literature review, Radjenović et al. [8] analyzed 106 papers and concluded that process metrics make up only 24% of the metrics used for the purposes of fault prediction. Moreover, they identified that process metrics, such as the number of different developers that worked on the same file, the number of changes made to a file, the age of a module, etc., are better at detecting faults in the post-release phase of software development than some static code metrics. Those metrics are mainly produced by extracting the source code and its history from the repository. Results from different studies have pointed out several advantages of process metrics over static code metrics:

o Process  metrics  provide  a  better  description  regarding  distribution  of  faults  in  the  software;  

o Used  for  Java-­‐based  applications,  process  metrics  can  provide  better  models  in  terms  of  cost-­‐effectiveness;  

o Process metrics performed the best when predicting faulty classes using the ROC area.

However, some studies have shown that process metrics did not perform well because they were used in a pre-release phase of software development. The results of the conducted experiments have confirmed that process metrics perform better and have to be used in the post-release phase, which explains the previously mentioned issues in prediction [8]. In [4], a combination of static code and process metrics is suggested for better prediction.

Xia et al. [20] analyzed the performance of code and process metrics for TT&C software. They emphasize the benefits of process metrics during different development phases, such as requirements analysis, design and coding, and they list 16 process metrics, each suitable for a specific phase. In the early stages of software development, the analysis of requirements and their maturity are metrics related to the detection of faulty software modules. It is therefore important to eliminate errors that occur in the requirements phase, which could otherwise be carried over to other stages of development, i.e. the design and coding phases. Another important feature of process metrics is the tracking of historical


changes within a file or a version of the software, which could give insight into possible faults. They concluded that process metrics combined with code metrics increased the accuracy level, while the error rate decreased.

4.1.1 Code churn metrics
A reported advantage of process metrics is that they can be assessed using large datasets, which is an important feature when predictions have to be created for industrial projects. However, studies have shown that process metrics are used for academic rather than industrial purposes. Code churn metrics are the ones that performed the best in the industrial case. These metrics are used to calculate the change between different versions of the software. Taking into account the benefits of code churn metrics, it is suggested that researchers should turn their interest to industrial projects, combining and testing them using those metrics [8]. Moreover, according to [8],[19], code churn metrics performed better than cyclomatic complexity in the case of fault density prediction.

4.1.2 Developer metrics
Opinions about developer data are divided. While some studies show that developer metrics improved prediction [4],[23], others claim that those metrics were not useful or had only a minor impact [4],[8],[22].

Matsumoto et al. [21] conducted an experiment in order to determine whether developer metrics can be beneficial for software reliability. They defined two types of metrics, describing developers' activities and the modules that were inspected by developers. After performing the experiment, in order to prove a hypothesis regarding the metrics, they offered several conclusions:

o The probability that a developer will introduce a fault in a newer version is higher if (s)he previously worked in the same manner;

o More   faults   can  occur   in   subsystems   that  were  modified  by  a   larger  number  of  developers;  

o Usage  of  developer  metrics  is  beneficial  for  fault  prediction.    

4.1.3 Other process metrics
Along with code churn, some of the metrics that provided the best results in fault prediction are the age of a module, the number of changes made to a file or module, and the changed set size. Their usage is recommended in large industrial systems because of their reliability and efficiency in predicting fault-prone modules [8].


5. Design
In this section, we define the selected projects, the collected datasets and the design of the experiment.

5.1 Projects
We investigated 5 projects that contain static code and process metrics. The projects are Java-based, which provides consistency in terms of language domain. In [18], the authors emphasize that they were unable to extract all process metrics for every project. Moreover, during dataset selection, datasets that did not contain most of the data about all process metrics were excluded. Table 1 describes the projects; the process metrics are explained in Section 5.2.

Table 1. Selected projects for the experiment

| Project | Version(s) | Description | Extracted process metrics |
| --- | --- | --- | --- |
| Ant¹⁰ | 1.4; 1.5; 1.6; 1.7 | Java-based build tool | NR, NDC, NML, NDPV |
| jEdit¹¹ | 4.0; 4.1 | Java-based cross-platform text editor | NR, NDC, NML, NDPV |
| Synapse¹² | 1.1; 1.2 | Enterprise Service Bus that supports different protocols and ways of exchanging information | NR, NDC, NML, NDPV |
| Xalan¹³ | 2.5.0; 2.6.0; 2.7.0 | XSLT processor used for transforming XML into different file formats | NR, NDC, NML, NDPV |
| Xerces¹⁴ | 1.2.0; 1.3.0; 1.4.4 | Parser that supports XML 1.0 and provides advanced parsing functionality | NR, NDC, NML, NDPV |

 

5.2 Extracted metrics
Madeyski et al. [18] used 2 types of tools to collect metrics from the project repositories. The static code metrics were calculated using the ckjm¹⁵ program, while the BugInfo¹⁶ tool was used to detect bugs from the log history. In order to identify bugs from commits, regular expressions were formulated for each project, and bugs were counted by matching commit comments against these expressions.

10 http://ant.apache.org/
11 http://www.jedit.org/
12 http://synapse.apache.org/
13 http://xml.apache.org/xalan-j/
14 http://xerces.apache.org/xerces-j/
15 http://www.spinellis.gr/sw/ckjm/
16 https://kenai.com/projects/buginfo

Static code metrics relevant for the experiment are:


o Weighted  methods  per  class  (WMC)  metric  returns  a  number  of  methods  of  the  specific  class;  

o Depth  of  inheritance  tree  (DIT)  gives  the  number  of  inheritance  level  starting  from  the  Object  class;  

o Number  of  children  (NOC)  calculates  the  number  of  descendants  of  the  specific  class;  

o Coupling  between  Object  class  (CBO)  indicates  the  number  of  classes  that  are  linked   to   the   specific   class   in   a   form   of   method   calls   and/or   arguments,   field  declarations,  inheritance,  exceptions,  etc.;  

o Response  for  a  class  (RFC)  represents  the  total  number  of  called  class  methods,  as  well  as  methods  that  are  contained  in  bodies  of  the  class  methods;  

o Lack of cohesion in methods (LCOM) returns the number of methods that do not share any class fields. For LCOM3, the result lies in the range 0 to 2. LCOM3 is presented in equation (5):

$LCOM3 = \dfrac{\frac{1}{a}\sum_{j=1}^{a}\mu(A_j) - m}{1 - m}$   (5)

where the variable a is defined as the number of attributes of a class, m is the total number of methods used in the class, and $\mu(A_j)$ is the number of methods that access the attribute $A_j$.

o Afferent   couplings   (Ca)   returns   the   number   of   classes   that   are   dependent   on  some  observed  class;  

o Efferent couplings (Ce) indicates the number of classes that the observed class depends on;

o Lines  of  code  (LOC)  returns  the  total  number  of  class  fields,  methods  and  code  written  in  the  body  of  methods;  

o Number  of  public  methods  (NPM)  represents  the  number  of  all  methods  whose  access  modifier  has  a  value  defined  as  public;  

o Data   access   metric   (DAM)   is   the   proportion   of   public/protected   attributes  compared  to  the  total  number  of  all  class  attributes;  

o Measure  of  aggregation  (MOA)  gives  a  number  of  classes  that  are  defined  by  a  user;  

o Cohesion  among  methods  of  class  (CAM)  returns  the  number  of  methods  that  are  related  to  each  other  within  the  class  based  on  the  method’s  parameter  list;  

o Inheritance   coupling   (IC)   returns   the   number   of   base   classes   to   which   an  examined  class  is  connected.  

o Coupling  between  methods  (CBM)  gives  the  number  of  inherited  methods  that  are  coupled  to  modified  or  new  methods  of  the  class;  

o Average  method  complexity  (AMC)  metric  returns  the  number  of  average  size  of  all  methods  within  some  class;  

o McCabe's cyclomatic complexity (CC) defines the number of different paths within a method plus one, calculated over all edges, nodes and connected components of the graph. CC is defined in equation (6):

$CC = E - N + P$   (6)


where the variable E denotes the number of edges, N the number of nodes and P the number of connected components of the graph [18]. A small computational sketch of equations (5) and (6) is given directly after this list.
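To make equations (5) and (6) concrete, the following is a minimal Python sketch; the helper functions and the numbers in the example calls are illustrative only and are not part of the thesis or of the ckjm tool.

```python
def lcom3(methods_accessing_attribute, num_methods):
    """Equation (5): methods_accessing_attribute[j] is mu(A_j), the number of methods
    that access attribute A_j; the result lies between 0 and 2."""
    a = len(methods_accessing_attribute)
    mean_access = sum(methods_accessing_attribute) / a
    return (mean_access - num_methods) / (1 - num_methods)

def cyclomatic_complexity(edges, nodes, connected_components):
    """Equation (6): CC = E - N + P."""
    return edges - nodes + connected_components

# Invented example: a class with 3 attributes accessed by 2, 1 and 1 of its 4 methods,
# and a method whose control-flow graph has 9 edges, 8 nodes and 1 connected component.
print(round(lcom3([2, 1, 1], num_methods=4), 3))                         # 0.889
print(cyclomatic_complexity(edges=9, nodes=8, connected_components=1))   # 2
```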

 Process  metrics  that  were  extracted  (Table  1)  and  used  for  the  experiment  are:    

o Number of Revisions (NR) counts the committed revisions of a given Java class during the development process;
o Number of Distinct Committers (NDC) counts the different developers that committed changes to a given Java class;
o Number of Modified Lines (NML) counts the lines of code added to or removed from a given Java class;
o Number of Defects in the Past Version (NDPV) counts the faults repaired in the past version of the system [18].
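As an illustration of what these process metrics count, the sketch below derives NR, NDC and NML per Java file from a Git log. This is an assumed approximation only: the repository in [18] provides the metrics directly (collected with BugInfo and related tooling), so the command and parsing shown here are not the procedure used in the thesis.

```python
import subprocess
from collections import defaultdict

# One possible way to derive NR, NDC and NML for each Java file from a Git history.
log = subprocess.run(
    ["git", "log", "--numstat", "--pretty=format:commit\t%H\t%ae"],
    capture_output=True, text=True, check=True,
).stdout

revisions = defaultdict(int)        # NR: number of revisions per file
committers = defaultdict(set)       # NDC: distinct committers per file
modified_lines = defaultdict(int)   # NML: added + removed lines per file

author = None
for line in log.splitlines():
    if line.startswith("commit\t"):
        _, _, author = line.split("\t")
    elif line.strip():
        added, removed, path = line.split("\t")
        if path.endswith(".java"):
            revisions[path] += 1
            committers[path].add(author)
            if added != "-":        # "-" marks binary files in --numstat output
                modified_lines[path] += int(added) + int(removed)
```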

 

5.3 Evaluation of results
The value of the output model can be presented using various performance measurements. According to [2],[6], one of the most accurate measures is the receiver operating characteristic (ROC). The ROC is a curve that displays the relation between the true and false positive rates over all thresholds, whose values lie in the range 0 to 1. For the ROC we have equation (7):

$TPR = \dfrac{TP}{TP + FN}$   (7)

where TP is the number of true positives and FN the number of false negatives. Although the ROC gives excellent results when comparing the performance of classifiers, practice has shown that experimenters prefer a single numeric value that clearly summarizes the end result. For that purpose, the area under the curve (AUC) was proposed as one solution [6]. AUC is determined by the average performance over all thresholds. The model that reports the best prediction is the one whose ROC and AUC are close to 1. AUC also makes it possible to compare our output model with random predictions without considering the ratio of defective files [6].
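As a small, self-contained illustration of equation (7) and of the AUC value used as the response variable, the following sketch uses scikit-learn; the labels and scores are invented placeholders, not results from the thesis.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Invented example: 1 = faulty class, 0 = fault-free class, with predicted fault probabilities.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.55, 0.90]

# roc_curve returns one (FPR, TPR) point per threshold; TPR is TP / (TP + FN) as in equation (7).
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(f"AUC = {roc_auc_score(y_true, y_score):.3f}")   # 0.875 for this toy example
```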

5.4 Experiment
We performed the experiment in Weka using the NB and TAN classifiers for the given static code and process metrics. Weka provides the Explorer interface, shown in Figure 5, in which it is possible to test the projects' performance using different classifiers.


 Figure  5.  Weka  Explorer  

In order to prevent errors that can occur in datasets containing String values, such as the project and the name of the examined class, we used meta.FilteredClassifier, which enables us to choose the NaiveBayes and BayesNet classifiers with additional features. By selecting BayesNet with the TAN search algorithm, we tell Weka that we want to test the datasets using the TAN classifier. Furthermore, filtering with the unsupervised attribute filter StringToWordVector helps us handle the String values. Figure 6 shows the example of the set values when using the NaiveBayes classifier.


 Figure  6.  Set  values  for  NB  classifier  
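The same filter-then-classify composition can be sketched outside Weka. The snippet below is only a rough scikit-learn analogue of the set-up above: it drops the string-valued columns instead of converting them to word vectors, and the file name "ant-1.7.csv" and the column names "project", "name" and "bug" are assumptions, not the actual dataset layout.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

# A filter step runs before the Bayesian learner, mirroring meta.FilteredClassifier.
data = pd.read_csv("ant-1.7.csv")
y = (data["bug"] > 0).astype(int)
numeric_columns = [c for c in data.columns if c not in ("project", "name", "bug")]

model = Pipeline([
    ("filter", ColumnTransformer([("metrics", "passthrough", numeric_columns)])),
    ("nb", GaussianNB()),
])
model.fit(data, y)
```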

We performed the experiment using 10-fold cross-validation. This experimental design is suitable for small datasets: the experiment is repeated a number of times and in each iteration the data are split differently. The idea is to generate training and test sets for the learners while ensuring the lowest possible overlap between the created sets. The final model has a value that represents the averaged result of the values obtained in each iteration [15].

The classifier output, presented in Figure 7, contains several values obtained as a result of the prediction.


 Figure  7.  Classifier  output  with  results  

As can be seen in Figure 7, the output offers several values for the defined classes, such as correctly classified instances, incorrectly classified instances, true positive and false positive rate, precision, etc. In our evaluation of the results, we will use the values of the ROC Area.

We defined the values 0 and 1 for the binary classifiers: in every file, each class is either 0, indicating that the class is fault-free, or 1, in case a fault exists. This is determined by checking the value of bug fixes in the datasets.

We will have several models in our experiment:

o 1 combined model created using all process and static code metrics;
o 1 model containing only static code metrics;
o 1 model containing only process metrics;
o 4 models built by adding 1 process metric to the static code metrics, e.g. adding the NR metric to a dataset that otherwise contains only static code metrics;
o 6 models containing a combination of 2 process metrics and all static code metrics;
o 4 models consisting of a combination of 3 process metrics and all static code metrics.

 


6. Results
In our experiment we had 18 static code metrics, 4 process metrics and the metric that contains information about bug fixes. Furthermore, we used AUC as the quality measure and the response variable for each project, where the final value is the average over 10-fold cross-validation. We first present the statistical analysis used to compare the results of the experiment; the prediction results, along with the statistical analysis, are presented in Sections 6.2.1 and 6.2.2.

6.1 Statistical analysis of results
Every ML experiment, regardless of the type of experimentation, has to undergo certain steps related to data analysis. Before we start to discuss the results of the experiment, we need to carefully choose an appropriate statistical test to compare the collected results, ensuring objective conclusions [15]. We will use the term samples to refer to the results of the experiment [24].

There are two types of statistical analysis that deal with a comparison of two or more sample means and/or classifiers, which is what our case requires. One test is the analysis of variance (ANOVA) [15],[26],[27]. ANOVA is a parametric test whose basic idea is to compare means by comparing the variances of the given samples [27]. The assumptions of normally distributed samples and equal variances have to be met, which sometimes makes ANOVA unsuitable for ML studies [24].

The other type of analysis is the Kruskal-Wallis test [15],[25], which is a non-parametric test. In this case, we do not have to prove any assumption regarding normal distribution of the data or equal variances [25]. According to [15], the non-parametric test is more applicable when more than 2 datasets have to be compared, because one classifier can behave differently on different datasets; in that situation we cannot claim that the error values for the tested datasets are normally distributed [15]. Considering all of the listed arguments, we decided to use the Kruskal-Wallis test and, therefore, we defined 2 hypotheses:

o H0: There is no significant difference in performance between datasets.
o H1: There is a significant difference in performance for at least 1 dataset.

Once we obtain the comparison results, we will consider the p-value, representing probability and statistical significance, as the indicator of whether the null hypothesis should be rejected with regard to the significance criterion α. The value of α is 0.05; if the p-value is smaller than α, we conclude that the null hypothesis is rejected and that there is a difference in performance for at least 1 dataset [26].
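A minimal sketch of this comparison, assuming scipy is available; the three AUC samples below are invented placeholders standing in for per-dataset results, not values from the experiment.

```python
from scipy.stats import kruskal

# Invented placeholder samples standing in for the per-dataset AUC values of three models.
combined = [0.81, 0.82, 0.85, 0.87, 0.90]
static_code = [0.74, 0.81, 0.84, 0.85, 0.86]
process = [0.83, 0.82, 0.82, 0.86, 0.92]

statistic, p_value = kruskal(combined, static_code, process)
alpha = 0.05
print(f"p = {p_value:.4f}; reject H0: {p_value < alpha}")
```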

6.2.1 Results of the experiment for the NB classifier
In this section, we present the experimental results obtained with the NB classifier, along with the statistical analysis. The models are grouped into 4 tables based on the number of process metrics in the dataset; below each table we report the comparison results.


6.2.1.1 Models with combined, static code and process metrics
In Table 2, we present the AUC values of the combined, static code and process metrics models for each version of the selected projects, using the NB classifier.

Table 2. Results for the combined, static code and process metrics models using the NB classifier

| Project | Version | Combined | SC¹⁷ | Process |
| --- | --- | --- | --- | --- |
| Ant | 1.4 | 0.805 | 0.736 | 0.827 |
| Ant | 1.5 | 0.824 | 0.813 | 0.816 |
| Ant | 1.6 | 0.847 | 0.842 | 0.816 |
| Ant | 1.7 | 0.871 | 0.852 | 0.856 |
| jEdit | 4.0 | 0.896 | 0.855 | 0.915 |
| jEdit | 4.1 | 0.934 | 0.887 | 0.919 |
| jEdit | 4.3 | 0.78 | 0.795 | 0.922 |
| Synapse | 1.1 | 0.753 | 0.765 | 0.746 |
| Synapse | 1.2 | 0.785 | 0.774 | 0.793 |
| Xalan | 2.5.0 | 0.807 | 0.788 | 0.822 |
| Xalan | 2.6.0 | 0.843 | 0.846 | 0.764 |
| Xalan | 2.7.0 | 0.965 | 0.947 | 0.906 |
| Xerces | 1.2.0 | 0.878 | 0.861 | 0.936 |
| Xerces | 1.3.0 | 0.845 | 0.836 | 0.811 |
| Xerces | 1.4.4 | 0.857 | 0.856 | 0.876 |
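The percentages reported below can be obtained by counting, for each project version, which model achieves the highest AUC. A small sketch of that counting, assuming the table has been exported to a CSV file (the file name "table2.csv" is hypothetical):

```python
import pandas as pd

# Hypothetical CSV export of Table 2 with columns: Project, Version, Combined, SC, Process.
table = pd.read_csv("table2.csv")
models = ["Combined", "SC", "Process"]

# For every row, find the model with the highest AUC, then report the share of rows it wins.
best = table[models].idxmax(axis=1)
print((best.value_counts(normalize=True) * 100).round(2))
```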

Observing the raw data in Table 2, we can conclude that in 46.66% of the cases the process model had the better AUC value, followed by the combined and static code models with 40% and 13.33%, respectively.

In Table 3, we present the statistical results for the combined, static code and process models.

Table 3. Comparison results for the combined, static code and process models using the NB classifier

| Method | P-Value |
| --- | --- |
| Not adjusted for ties | 0.7184 |
| Adjusted for ties | 0.7184 |

We have 2 p-values for the analyzed models: one not adjusted and one adjusted for ties. Ties occur when the same value appears in 2 or more of the tested samples. We observe the p-value for the method that is not adjusted for ties, since it reports the greater value. The Kruskal-Wallis test reported a statistical significance of 0.7184 (i.e., p = 0.7184) for the method that is not adjusted for ties (see Table 3).

17 The abbreviation “SC” (referring to static code metrics) is used only in tables, for better utilization of the column space.


This is above 0.05 and, therefore, we can conclude that there is no basis to reject the null hypothesis.

6.2.1.2 Models with static code and 1 process metric
In Table 4, we present the AUC values for the 4 models containing all static code metrics and only 1 process metric (NR, NDC, NML or NDPV), using the NB classifier.

Table 4. Results for models containing static code metrics and 1 process metric using the NB classifier

| Project | Version | SC + NR | SC + NDC | SC + NML | SC + NDPV |
| --- | --- | --- | --- | --- | --- |
| Ant | 1.4 | 0.753 | 0.76 | 0.772 | 0.732 |
| Ant | 1.5 | 0.813 | 0.824 | 0.811 | 0.812 |
| Ant | 1.6 | 0.841 | 0.85 | 0.841 | 0.842 |
| Ant | 1.7 | 0.852 | 0.871 | 0.848 | 0.856 |
| jEdit | 4.0 | 0.844 | 0.907 | 0.842 | 0.86 |
| jEdit | 4.1 | 0.885 | 0.927 | 0.882 | 0.908 |
| jEdit | 4.3 | 0.769 | 0.828 | 0.776 | 0.788 |
| Synapse | 1.1 | 0.761 | 0.764 | 0.761 | 0.763 |
| Synapse | 1.2 | 0.78 | 0.782 | 0.773 | 0.778 |
| Xalan | 2.5.0 | 0.792 | 0.802 | 0.786 | 0.793 |
| Xalan | 2.6.0 | 0.844 | 0.853 | 0.832 | 0.852 |
| Xalan | 2.7.0 | 0.95 | 0.961 | 0.962 | 0.954 |
| Xerces | 1.2.0 | 0.871 | 0.859 | 0.855 | 0.873 |
| Xerces | 1.3.0 | 0.834 | 0.85 | 0.827 | 0.842 |
| Xerces | 1.4.4 | 0.855 | 0.867 | 0.849 | 0.855 |

In 80% of the cases, the model with static code and NDC metrics had the better AUC value, followed by the models with the NML and NDPV metrics with 13.33% and 6.66%, respectively.

Table 5. Comparison results for models containing static code metrics and 1 process metric using the NB classifier

| Method | P-Value |
| --- | --- |
| Not adjusted for ties | 0.6460 |
| Adjusted for ties | 0.6459 |

The statistical test for models built on static code metrics and 1 process metric reported a statistical significance of 0.6460 (i.e., p = 0.6460), shown in Table 5. This is above 0.05 and, therefore, we can conclude that there is no basis to reject the null hypothesis.


6.2.1.3 Models with a combination of 2 process and static code metrics
In Table 6, we present the AUC values for the 6 models containing a combination of 2 process metrics and all static code metrics, using the NB classifier.

Table 6. Results for models containing a combination of 2 process metrics and static code metrics using the NB classifier

| Project | Version | SC + NR + NDC | SC + NR + NML | SC + NR + NDPV | SC + NDC + NML | SC + NDC + NDPV | SC + NML + NDPV |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ant | 1.4 | 0.775 | 0.787 | 0.749 | 0.794 | 0.756 | 0.769 |
| Ant | 1.5 | 0.825 | 0.812 | 0.812 | 0.823 | 0.822 | 0.81 |
| Ant | 1.6 | 0.849 | 0.839 | 0.841 | 0.85 | 0.85 | 0.841 |
| Ant | 1.7 | 0.872 | 0.846 | 0.855 | 0.869 | 0.873 | 0.852 |
| jEdit | 4.0 | 0.903 | 0.83 | 0.849 | 0.903 | 0.907 | 0.847 |
| jEdit | 4.1 | 0.929 | 0.877 | 0.901 | 0.925 | 0.938 | 0.903 |
| jEdit | 4.3 | 0.804 | 0.752 | 0.763 | 0.812 | 0.819 | 0.77 |
| Synapse | 1.1 | 0.759 | 0.757 | 0.759 | 0.759 | 0.762 | 0.759 |
| Synapse | 1.2 | 0.786 | 0.78 | 0.782 | 0.78 | 0.784 | 0.777 |
| Xalan | 2.5.0 | 0.804 | 0.79 | 0.796 | 0.799 | 0.806 | 0.792 |
| Xalan | 2.6.0 | 0.851 | 0.83 | 0.85 | 0.841 | 0.858 | 0.84 |
| Xalan | 2.7.0 | 0.959 | 0.962 | 0.954 | 0.966 | 0.964 | 0.965 |
| Xerces | 1.2.0 | 0.869 | 0.861 | 0.879 | 0.855 | 0.875 | 0.868 |
| Xerces | 1.3.0 | 0.847 | 0.824 | 0.839 | 0.843 | 0.854 | 0.833 |
| Xerces | 1.4.4 | 0.864 | 0.849 | 0.853 | 0.863 | 0.864 | 0.849 |

In 60% of the cases, the model with static code, NDC and NDPV metrics had the better AUC value, followed by the models with static code, NDC and NML metrics and with static code, NR and NDC metrics, each with 16.66%. The model with static code, NR and NDPV metrics had the better AUC value in 6.66% of the cases.

Table 7. Comparison results for models containing a combination of 2 process metrics and static code metrics using the NB classifier

| Method | P-Value |
| --- | --- |
| Not adjusted for ties | 0.6847 |
| Adjusted for ties | 0.6845 |

 In  Table  7,   the  Kruskal-­‐Wallis  test  has  shown  the  statistical  significance  of  0.6847  (i.e.,  p=0.6847).  This   is  above  0.05  and,  therefore,  we  can  conclude  that  there  is  no  basis  to  reject  the  null  hypothesis.        


6.2.1.4 Models with a combination of 3 process and static code metrics
In Table 8, we present the AUC values for the 4 models containing a combination of 3 process metrics and all static code metrics, using the NB classifier.

Table 8. Results for models containing a combination of 3 process metrics and static code metrics using the NB classifier

| Project | Version | SC + NR + NDC + NML | SC + NR + NDC + NDPV | SC + NR + NML + NDPV | SC + NDC + NML + NDPV |
| --- | --- | --- | --- | --- | --- |
| Ant | 1.4 | 0.807 | 0.772 | 0.784 | 0.791 |
| Ant | 1.5 | 0.824 | 0.824 | 0.812 | 0.822 |
| Ant | 1.6 | 0.848 | 0.849 | 0.838 | 0.849 |
| Ant | 1.7 | 0.869 | 0.874 | 0.849 | 0.871 |
| jEdit | 4.0 | 0.895 | 0.903 | 0.835 | 0.903 |
| jEdit | 4.1 | 0.926 | 0.936 | 0.894 | 0.936 |
| jEdit | 4.3 | 0.786 | 0.797 | 0.745 | 0.803 |
| Synapse | 1.1 | 0.755 | 0.757 | 0.755 | 0.757 |
| Synapse | 1.2 | 0.786 | 0.786 | 0.781 | 0.784 |
| Xalan | 2.5.0 | 0.803 | 0.807 | 0.795 | 0.805 |
| Xalan | 2.6.0 | 0.838 | 0.855 | 0.837 | 0.847 |
| Xalan | 2.7.0 | 0.964 | 0.962 | 0.963 | 0.968 |
| Xerces | 1.2.0 | 0.862 | 0.882 | 0.871 | 0.873 |
| Xerces | 1.3.0 | 0.84 | 0.852 | 0.829 | 0.848 |
| Xerces | 1.4.4 | 0.86 | 0.862 | 0.848 | 0.86 |

In 60% of the cases, the model with static code, NR, NDC and NDPV metrics had the better AUC value, followed by the models with static code, NDC, NML and NDPV metrics and with static code, NR, NDC and NML metrics, with 26.66% and 10%, respectively.

Table 9. Comparison results for models containing a combination of 3 process metrics and static code metrics using the NB classifier

| Method | P-Value |
| --- | --- |
| Not adjusted for ties | 0.6326 |
| Adjusted for ties | 0.6323 |

 The   Kruskal-­‐Wallis   test   reported   the   statistical   significance   of   0.6326   (i.e.,   p=0.6326),  shown   in  Table  9.  This   is   above  0.05   and,   therefore,  we   can   conclude   that   there   is  no  basis  to  reject  the  null  hypothesis.      


6.2.2 Results of the experiment for the TAN classifier
In this section, we present the experimental results for the previously defined test cases using the TAN classifier. As mentioned before, the tables contain the different models along with the results of the statistical analysis.

6.2.2.1 Models with combined, static code and process metrics
In Table 10, we present the AUC values of the combined, static code and process metrics models using the TAN classifier.

Table 10. Results for the combined, static code and process models using the TAN classifier

| Project | Version | Combined | SC | Process |
| --- | --- | --- | --- | --- |
| Ant | 1.4 | 0.828 | 0.783 | 0.811 |
| Ant | 1.5 | 0.819 | 0.795 | 0.82 |
| Ant | 1.6 | 0.895 | 0.861 | 0.875 |
| Ant | 1.7 | 0.893 | 0.859 | 0.883 |
| jEdit | 4.0 | 0.94 | 0.882 | 0.934 |
| jEdit | 4.1 | 0.935 | 0.886 | 0.927 |
| jEdit | 4.3 | 0.851 | 0.425 | 0.864 |
| Synapse | 1.1 | 0.802 | 0.802 | 0.606 |
| Synapse | 1.2 | 0.809 | 0.817 | 0.69 |
| Xalan | 2.5.0 | 0.846 | 0.812 | 0.842 |
| Xalan | 2.6.0 | 0.879 | 0.856 | 0.825 |
| Xalan | 2.7.0 | 0.982 | 0.971 | 0.965 |
| Xerces | 1.2.0 | 0.894 | 0.886 | 0.879 |
| Xerces | 1.3.0 | 0.871 | 0.838 | 0.8 |
| Xerces | 1.4.4 | 0.978 | 0.926 | 0.961 |

We can see that in 76.66% of the cases the combined model had the better AUC value, followed by the process and static code models with 13.33% and 10%, respectively.

Table 11. Comparison results for the combined, static code and process models using the TAN classifier

| Method | P-Value |
| --- | --- |
| Not adjusted for ties | 0.2833 |
| Adjusted for ties | 0.2833 |

 After  performing  the  Kruskal-­‐Wallis  test  (see  Table  11),  we  are  reporting  the  statistical  significance  of  0.2833  (i.e.,  p=0.2833).  This  is  above  0.05  and,  therefore,  we  can  conclude  that  there  is  no  basis  to  reject  the  null  hypothesis.      


6.2.2.2 Models with static code and 1 process metric
In Table 12, we present the AUC values for the 4 models containing all static code metrics and only 1 process metric (NR, NDC, NML or NDPV), using the TAN classifier.

Table 12. Results for models containing static code metrics and 1 process metric using the TAN classifier

| Project | Version | SC + NR | SC + NDC | SC + NML | SC + NDPV |
| --- | --- | --- | --- | --- | --- |
| Ant | 1.4 | 0.775 | 0.832 | 0.782 | 0.783 |
| Ant | 1.5 | 0.795 | 0.807 | 0.794 | 0.795 |
| Ant | 1.6 | 0.862 | 0.886 | 0.863 | 0.86 |
| Ant | 1.7 | 0.887 | 0.884 | 0.872 | 0.87 |
| jEdit | 4.0 | 0.904 | 0.939 | 0.892 | 0.892 |
| jEdit | 4.1 | 0.914 | 0.926 | 0.903 | 0.905 |
| jEdit | 4.3 | 0.426 | 0.64 | 0.837 | 0.425 |
| Synapse | 1.1 | 0.802 | 0.802 | 0.802 | 0.802 |
| Synapse | 1.2 | 0.812 | 0.822 | 0.814 | 0.817 |
| Xalan | 2.5.0 | 0.826 | 0.843 | 0.812 | 0.819 |
| Xalan | 2.6.0 | 0.86 | 0.87 | 0.856 | 0.868 |
| Xalan | 2.7.0 | 0.971 | 0.98 | 0.971 | 0.975 |
| Xerces | 1.2.0 | 0.881 | 0.899 | 0.885 | 0.88 |
| Xerces | 1.3.0 | 0.865 | 0.855 | 0.844 | 0.838 |
| Xerces | 1.4.4 | 0.93 | 0.969 | 0.926 | 0.927 |

The static code and NDC model had the better AUC value in 75% of the cases, followed by the models with the NR and NML metrics with 15% and 8.33%, respectively.

Table 13. Comparison results for models containing static code metrics and 1 process metric using the TAN classifier

| Method | P-Value |
| --- | --- |
| Not adjusted for ties | 0.8512 |
| Adjusted for ties | 0.8512 |

In Table 13, the Kruskal-Wallis test has shown a statistical significance of 0.8512 (i.e., p = 0.8512). This is above 0.05 and, therefore, we can conclude that there is no basis to reject the null hypothesis.


6.2.2.3 Models with a combination of 2 process and static code metrics
In Table 14, we present the AUC values for each version of the selected projects for the 6 models containing a combination of 2 process metrics and all static code metrics, using the TAN classifier.

Table 14. Results for models containing a combination of 2 process metrics and static code metrics using the TAN classifier

| Project | Version | SC + NR + NDC | SC + NR + NML | SC + NR + NDPV | SC + NDC + NML | SC + NDC + NDPV | SC + NML + NDPV |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ant | 1.4 | 0.835 | 0.775 | 0.775 | 0.825 | 0.832 | 0.782 |
| Ant | 1.5 | 0.82 | 0.794 | 0.795 | 0.809 | 0.807 | 0.794 |
| Ant | 1.6 | 0.889 | 0.871 | 0.861 | 0.889 | 0.884 | 0.863 |
| Ant | 1.7 | 0.898 | 0.88 | 0.891 | 0.885 | 0.891 | 0.878 |
| jEdit | 4.0 | 0.944 | 0.906 | 0.904 | 0.942 | 0.936 | 0.897 |
| jEdit | 4.1 | 0.933 | 0.916 | 0.916 | 0.933 | 0.935 | 0.918 |
| jEdit | 4.3 | 0.639 | 0.836 | 0.426 | 0.851 | 0.64 | 0.837 |
| Synapse | 1.1 | 0.802 | 0.802 | 0.802 | 0.802 | 0.802 | 0.802 |
| Synapse | 1.2 | 0.814 | 0.807 | 0.812 | 0.817 | 0.822 | 0.814 |
| Xalan | 2.5.0 | 0.841 | 0.826 | 0.831 | 0.843 | 0.849 | 0.819 |
| Xalan | 2.6.0 | 0.879 | 0.86 | 0.867 | 0.867 | 0.873 | 0.867 |
| Xalan | 2.7.0 | 0.982 | 0.971 | 0.976 | 0.98 | 0.98 | 0.975 |
| Xerces | 1.2.0 | 0.898 | 0.882 | 0.878 | 0.898 | 0.897 | 0.879 |
| Xerces | 1.3.0 | 0.866 | 0.87 | 0.865 | 0.861 | 0.855 | 0.844 |
| Xerces | 1.4.4 | 0.978 | 0.93 | 0.931 | 0.969 | 0.969 | 0.927 |

In 54.4% of the cases, the model with static code, NR and NDC metrics had the better AUC value, followed by the models with static code, NDC and NDPV metrics and with static code, NDC and NML metrics, with 21.13% and 14.4%, respectively. The model with static code, NR and NML metrics had the better AUC value in 7.8% of the cases.

Table 15. Comparison results for models containing a combination of 2 process metrics and static code metrics using the TAN classifier

| Method | P-Value |
| --- | --- |
| Not adjusted for ties | 0.8685 |
| Adjusted for ties | 0.8683 |

 The   Kruskal-­‐Wallis   test   reported   the   statistical   significance   of   0.8685   (i.e.,   p=0.8685),  shown  in  Table  15.  This   is  above  0.05  and,  therefore,  we  can  conclude  that  there  is  no  basis  to  reject  the  null  hypothesis.        


6.2.2.4 Models with a combination of 3 process and static code metrics
In Table 16, we present the AUC values for each version of the selected projects for the 4 models containing a combination of 3 process metrics and all static code metrics, using the TAN classifier.

Table 16. Results for models containing a combination of 3 process metrics and static code metrics using the TAN classifier

| Project | Version | SC + NR + NDC + NML | SC + NR + NDC + NDPV | SC + NR + NML + NDPV | SC + NDC + NML + NDPV |
| --- | --- | --- | --- | --- | --- |
| Ant | 1.4 | 0.828 | 0.835 | 0.775 | 0.825 |
| Ant | 1.5 | 0.819 | 0.82 | 0.794 | 0.809 |
| Ant | 1.6 | 0.895 | 0.889 | 0.872 | 0.888 |
| Ant | 1.7 | 0.891 | 0.901 | 0.884 | 0.892 |
| jEdit | 4.0 | 0.942 | 0.941 | 0.904 | 0.938 |
| jEdit | 4.1 | 0.934 | 0.935 | 0.921 | 0.938 |
| jEdit | 4.3 | 0.851 | 0.639 | 0.836 | 0.851 |
| Synapse | 1.1 | 0.802 | 0.802 | 0.802 | 0.802 |
| Synapse | 1.2 | 0.809 | 0.814 | 0.807 | 0.817 |
| Xalan | 2.5.0 | 0.841 | 0.846 | 0.831 | 0.849 |
| Xalan | 2.6.0 | 0.875 | 0.882 | 0.866 | 0.87 |
| Xalan | 2.7.0 | 0.982 | 0.982 | 0.976 | 0.98 |
| Xerces | 1.2.0 | 0.898 | 0.895 | 0.878 | 0.895 |
| Xerces | 1.3.0 | 0.871 | 0.866 | 0.87 | 0.861 |
| Xerces | 1.4.4 | 0.978 | 0.978 | 0.931 | 0.969 |

The model with static code, NR, NDC and NML metrics is better in 38.33% of the cases, followed by the models with static code, NR, NDC and NDPV metrics and with static code, NDC, NML and NDPV metrics, with 35% and 25%, respectively.

Table 17. Comparison results for models containing a combination of 3 process metrics and static code metrics using the TAN classifier

| Method | P-Value |
| --- | --- |
| Not adjusted for ties | 0.8100 |
| Adjusted for ties | 0.8099 |

 The  Kruskal-­‐Wallis  test  has  shown,  in  Table  17,  the  statistical  significance  of  0.8100  (i.e.,  p=0.8100).  This   is  above  0.05  and,  therefore,  we  can  conclude  that  there  is  no  basis  to  reject  the  null  hypothesis.            


7. Result discussion
We investigated 17 models in order to determine whether, and which, process metric(s) can provide better performance in the detection of software faults. Graphical representations of the comparison results are shown in Figures 8 and 9 for the NB and TAN classifiers, respectively. The models are divided into 4 logical groups for easier presentation of the performance results.

Considering the models built with the NB classifier, the Kruskal-Wallis test showed that we do not have a basis to reject the null hypothesis in any case; therefore, we can conclude that there is no statistical difference between the models. This is illustrated in Figure 8, where the samples of the models overlap.

 Figure  8.  Graphs  with  comparison  results  for  NB  classifier  

However, observing the graphs, we can see that the models containing only the 4 process metrics slightly improved the experimental results compared to the combined and static code models. Although the difference was not substantial enough to reject the null hypothesis, we can recommend process metrics for further investigation. Similarly, the NDC metric, on its own and combined with the NML and/or NDPV metrics, improved the models to a small degree and should be investigated further. The NR metric did not provide any notable improvement of the models.


The models built using the TAN classifier also showed no statistically significant difference after testing, i.e., we cannot claim that any model has superior SFP performance. This is supported by the graphical representation (see Figure 9), where the samples of the models overlap.

Nevertheless, based on the experimental results and the statistical comparison graphs, we propose that the combined models and the models containing the NDC and NR metrics be investigated further.

Figure 9. Graphs with comparison results for TAN classifier

The objective of our experiment, and the main purpose of RQ2, was to establish whether process metrics can improve SFP. After the analysis, we cannot make a solid assertion or recommend the use of a specific model, since no evident statistical difference between the samples was found.

We propose additional investigation of the models containing certain combinations of the previously mentioned metrics. Furthermore, as stated in Section 5, we used projects for which some process metrics data were missing. Providing complete metrics data would benefit further research and could open new discussions regarding process metrics in SFP.


7.1 Related work

As related work, we considered publications that examine the process metrics available in the repository used for our research, and whose classifiers differ from the ones we analyzed.

The publication that fulfills these requirements is the one by Madeyski and Jureczko [28]. They conducted an empirical study on which process metrics can be beneficial for defect prediction. Their research included open-source and industrial projects, which led to some statistically significant results. They examined models that contained all static code metrics and a single process metric. Building the models with a stepwise linear regression method, they concluded that process metrics, because of the nature of the information they contain, can improve defect prediction and make a notable contribution. In particular, the metric indicating the number of distinct committers (NDC) that changed the same class can considerably improve defect prediction, and the NML metric can also be taken into consideration for prediction purposes. Moreover, statistical significance for at least one process metric was reported mainly for the industrial projects, which was not the case for the open-source projects.
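As a rough, much-simplified sketch of that kind of analysis (not the authors' actual stepwise procedure), one could extend a base model of static code metrics with one process metric at a time and inspect the added metric's significance. The file and column names below are hypothetical.

```python
# Much-simplified sketch: add a single process metric to a base regression model of
# static code metrics and report its significance. Column names are assumptions.
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("project-metrics.csv")           # hypothetical metrics file

static_code = ["wmc", "cbo", "rfc", "lcom", "loc"]  # assumed static code metric columns
process_metrics = ["nr", "ndc", "nml", "ndpv"]
y = data["bug"]                                     # number of defects per class

for metric in process_metrics:
    X = sm.add_constant(data[static_code + [metric]])
    model = sm.OLS(y, X).fit()
    print(f"{metric}: coefficient p-value = {model.pvalues[metric]:.4f}, "
          f"AIC = {model.aic:.1f}")
```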


8. Validity threats

The purpose of any experimental study is to estimate the usability and success rate of the proposed technique, algorithm or, in our case, model. Our task is to investigate to what extent the obtained results remain valid for similar experimental conditions and systems, and it is crucial to follow steps appropriate for this type of study [15]. Despite our detailed study, some validity threats should be taken into consideration.

8.1 Internal validity

Internal validity concerns wrong interpretations of the study results [28]. In this thesis, we defined 2 RQs with different methodologies. For RQ1, the potential threat relates to the selection of papers dealing with the usage of process metrics for SFP: a different set of papers would give a different perspective for the review. For this reason, we strictly defined the steps, filters and techniques used, which mitigates this problem. Furthermore, the parameters of RQ2, such as the selected projects and datasets, the response variable and the experimental design, can differ between experimenters, which could produce different results.

8.2 External validity

External validity defines how the obtained results can be applied to other populations or systems [28]. We investigated only open-source Java projects. Madeyski and Jureczko [28] reported that process metrics showed statistically significant results mainly for industrial projects. This is a basis for external threats, since the differences between projects of different domains can be considerable.

8.3 Statistical conclusion validity

Statistical conclusion validity refers to problems that can affect the results of the statistical analysis [28]. In order to ensure correct results, we used two tools for the model comparison, Minitab18 and SPSS19, and we applied the non-parametric Kruskal-Wallis test, which is suitable for comparisons over multiple datasets, as suggested in [15].

18 https://www.minitab.com/en-us/
19 http://www.spss.co.in/products.php?p=statistics


9. Conclusion

In this thesis, we investigated the importance of process metrics for SFP. We defined 2 RQs in order to cover the state of the art regarding the usage of different non-static code metrics and to conduct an experimental study.

The informal review for RQ1 offered some interesting conclusions. We identified that process metrics such as code churn, developer metrics and metrics carrying information about changes in a file or module are among the most frequently used in research papers. It is also reported that process metrics are suitable for the post-release phase. Furthermore, the combination of static code and process metrics can be beneficial for fault prediction in different phases of software development, such as requirements definition, system design and coding. Finally, the usage of process metrics is mostly recommended for industrial projects, because of their reliability in detecting faulty modules of the system.

The experimental study for RQ2 involved 17 models (created from static code and process metrics) built with 2 BN classifiers. The experimental results and statistical analysis indicated that none of the compared models showed a statistically significant difference with respect to improving SFP.

However, observing the results and the boxplots of the comparison analysis, we can conclude that:

o The combined models, the models with only process metrics, and the models containing the NDC metric can slightly, but not statistically significantly, improve the experimental results;

o The NDC metric combined with the NML, NDPV or NR metric, depending on the classifier used, can provide a better model;

o It would be useful to have complete data for all 4 process metrics. This could give new insight into the impact of those metrics on the prediction process.

 

9.1 Future work

In this subsection, we propose possible directions for future work, based on our results and the conclusions of the discussion.

9.1.1 Model investigation

As mentioned before, the combined models, the process-metric models and the models with the NDC metric should be investigated further, by selecting other representative projects that could provide statistically significant results.

9.1.2 Industrial projects

Our research could be extended to industrial projects, so that we could compare differences between projects of different domains and thereby address the potential threats to external validity discussed in Section 8.


9.1.3 Data extraction

It would be valuable to replicate the results once the repository offers complete extracted data about the process metrics for the projects examined in our study. We could then conclude how the additional data affect the results and whether the missing data caused our inability to determine which process metric can be recommended as useful for SFP.


References

[1] N. E. Fenton and S. L. Pfleeger, "Software metrics: A rigorous and practical approach", Course Technology, Boston, MA, USA, 2nd edition, 1998.

[2] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu, "A general software defect-proneness prediction framework", IEEE Transactions on Software Engineering, Volume 37, Issue 3, pp 356-370, May-June 2011.

[3] R. Shatnawi, "Empirical study of fault prediction for open-source systems using the Chidamber and Kemerer metrics", IET Software, Volume 8, Issue 3, pp 113-119, June 2014.

[4] T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell, "A systematic literature review on fault prediction performance in software engineering", IEEE Transactions on Software Engineering, Volume 38, Issue 6, pp 1276-1304, Nov.-Dec. 2012.

[5] E. Harahap, W. Sakamoto and H. Nishi, "Failure prediction method for network management system by using Bayesian network and shared database", Information and Telecommunication Technologies (APSITT), IEEE, pp 1-6, 15-18 June 2010.

[6] K. Dejaeger, T. Verbraken and B. Baesens, "Toward comprehensible software fault prediction models using Bayesian network classifiers", IEEE Transactions on Software Engineering, Volume 39, Issue 2, pp 237-257, Feb. 2013.

[7] A. Okutan and O. T. Yıldız, "Software defect prediction using Bayesian networks", Empirical Software Engineering, Volume 19, Issue 1, pp 154-181, 2014.

[8] D. Radjenović, M. Heričko, R. Torkar and A. Živkovič, "Software fault prediction metrics: A systematic literature review", Information and Software Technology, Volume 55, Issue 8, pp 1397-1418, Aug. 2013.

[9] C. Jin, S.-W. Jin and J.-M. Ye, "Artificial neural network-based metric selection for software fault-prone prediction model", IET Software, Volume 6, Issue 6, pp 479-487, Dec. 2012.

[10] M. Shepperd, D. Bowes and T. Hall, "Researcher bias: The use of machine learning in software defect prediction", IEEE Transactions on Software Engineering, Volume 40, Issue 6, pp 603-616, June 2014.

[11] L. Pelayo and S. Dick, "Evaluating stratification alternatives to improve software defect prediction", IEEE Transactions on Reliability, Volume 61, Issue 2, pp 516-525, June 2012.

[12] D. Gray, D. Bowes, N. Davey, Y. Sun and B. Christianson, "Software defect prediction using static code metrics underestimates defect-proneness", The 2010 International Joint Conference on Neural Networks (IJCNN), pp 1-7, July 2010.

[13] A. J. Stimpson and M. L. Cummings, "Assessing intervention timing in computer-based education using machine learning algorithms", IEEE Access, Volume 2, pp 78-87, 2014.


[14] C. Zheng, F. Peng, J. Wu and Z. Wu, "Software life cycle-based defects prediction and diagnosis technique research", 2010 International Conference on Computer Application and System Modeling (ICCASM), Volume 8, pp V8-192 - V8-195, Oct. 2010.

[15] E. Alpaydin, "Introduction to machine learning", Massachusetts Institute of Technology, 2010.

[16] N. E. Fenton and N. Ohlsson, "Quantitative analysis of faults and failures in a complex software system", IEEE Transactions on Software Engineering, Volume 26, Issue 8, pp 797-814, 2000.

[17] A. Gray and S. MacDonell, "A comparison of techniques for developing predictive models of software metrics", Information and Software Technology, Volume 39, Issue 6, pp 425-437, 1997.

[18] L. Madeyski and M. Jureczko, "Significance of different software metrics in defect prediction", Software Engineering: An International Journal, Volume 1, Number 1, pp 86-95, 2011.

[19] Y. Kamei, H. Sato, A. Monden, S. Kawaguchi, H. Uwano, M. Nagura, K.-i. Matsumoto and N. Ubayashi, "An empirical study of fault prediction with code clone metrics", Software Measurement, pp 55-61, 2011.

[20] Y. Xia, G. Yan and H. Zhang, "Analyzing the significance of process metrics for TT&C software defect prediction", Software Engineering and Service Science (ICSESS), pp 77-81, 2014.

[21] S. Matsumoto, Y. Kamei, A. Monden, K.-i. Matsumoto and M. Nakamura, "An analysis of developer metrics for fault prediction", Proceedings of the 6th International Conference on Predictive Models in Software Engineering, 2010.

[22] T. J. Ostrand, E. J. Weyuker and R. M. Bell, "Programmer-based fault prediction", Proceedings of the 6th International Conference on Predictive Models in Software Engineering, 2010.

[23] Y. Shin, A. Meneely, L. Williams and J. A. Osborne, "Evaluating complexity, code churn and developer activity metrics as indicators of software vulnerability", IEEE Transactions on Software Engineering, Volume 37, Issue 6, pp 772-782, 2011.

[24] J. Demšar, "Statistical comparison of classifiers over multiple data sets", Journal of Machine Learning Research, Volume 7, pp 1-30, 2006.

[25] T. Illes-Seifert and B. Paech, "Exploring the relationship of a file's history and its fault-proneness: An empirical method and its application to open source programs", Information and Software Technology, Volume 52, Issue 5, pp 539-558, May 2010.


[26] H. Lu, B. Cukic and M. Culp, "A semi-supervised approach to software defect prediction", Computer Software and Applications Conference (COMPSAC), pp 416-425, July 2014.

[27] M. J. Crawley, "The R book, second edition", Imperial College London at Silwood Park, UK, 2013.

[28] L. Madeyski and M. Jureczko, "Which process metrics can significantly improve defect prediction models? An empirical study", Software Quality Journal, Volume 23, Issue 3, pp 393-422, September 2015.

                                                                     


A Graphs for NB classifier

Figure 10. Graphical comparison of combined, static code and process models using NB

Figure 11. Graphical comparison of models containing 1 process and static code metrics using NB


Figure 12. Graphical comparison of models containing the combination of 2 process and static code metrics using NB

Figure 13. Graphical comparison of models containing the combination of 3 process and static code metrics using NB

     


B Graphs for TAN classifier

Figure 14. Graphical comparison of combined, static code and process models using TAN

Figure 15. Graphical comparison of models containing 1 process and static code metrics using TAN


Figure 16. Graphical comparison of models containing the combination of 2 process and static code metrics using TAN

Figure 17. Graphical comparison of models containing the combination of 3 process and static code metrics using TAN