Physical and Conceptual Identifier Dispersion: Measures and Relation to Fault Proneness
Physical and Conceptual Identifier Dispersion
Venera Arnaoudova, Laleh Eshkevari, Rocco Oliveto, Yann-Gael Gueheneuc, Giuliano Antoniol
Introduction
Our study
Dispersion measures
Our study - refined
Case study
RQ1 – Metric Relevance
RQ2 – Relation to Faults
Conclusions and future work
Physical and Conceptual Identifier Dispersion: Measures and Relation to Fault Proneness
Venera Arnaoudova, Laleh Eshkevari, Rocco Oliveto, Yann-Gael Gueheneuc, Giuliano Antoniol
SOCCER Lab. – DGIGL, Ecole Polytechnique de Montreal, Qc, Canada
SE@SA Lab – DMI, University of Salerno, Salerno, Italy
Ptidej Team – DGIGL, Ecole Polytechnique de Montreal, Qc, Canada
September 15, 2010
SOCCER: SOftware Cost-effective Change and Evolution Research Lab
SE@SA: Software Engineering @ SAlerno
Ptidej: Pattern Trace Identification, Detection, and Enhancement in Java
Outline
Introduction
Our study
Dispersion measures
Our study - refined
Case study
  RQ1 – Metric Relevance
  RQ2 – Relation to Faults
Conclusions and future work
2 / 16
Introduction
- Fault identification
  - size (e.g., [Gyimothy et al., 2005])
  - cohesion (e.g., [Liu et al., 2009])
  - coupling (e.g., [Marcus et al., 2008])
  - number of changes (e.g., [Zimmermann et al., 2007])
- Importance of linguistic information
  - program comprehension (e.g., [Takang et al., 1996, Deissenboeck and Pizka, 2006, Haiduc and Marcus, 2008, Binkley et al., 2009])
  - code quality (e.g., [Marcus et al., 2008, Poshyvanyk and Marcus, 2006, Butler et al., 2009])
3 / 16
Our study
Term dispersion
- We are interested in studying the relation between term dispersion and the quality of the source code.
  term: basic component of identifiers
  dispersion: the way terms are scattered among different entities (attributes and methods)
  quality: absence of faults
- Example: What is the impact of using getRelativePath, returnAbsolutePath, and setPath as method names on the fault proneness of those methods?
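To make the notion of a term concrete, here is a minimal sketch that splits the three method names from the example into terms and finds the term they share. The splitting rule (camelCase and underscore boundaries) and the helper names `split_terms` and `term_index` are assumptions for illustration; the slides do not say how identifiers were split.

```python
import re
from collections import defaultdict


def split_terms(identifier):
    """Split a camelCase or under_score identifier into lowercase terms."""
    # Acronym runs (e.g. "XML"), capitalized or lowercase words, and digit runs.
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [part.lower() for part in parts]


def term_index(identifiers):
    """Map each term to the set of identifiers (entities) that contain it."""
    index = defaultdict(set)
    for name in identifiers:
        for term in split_terms(name):
            index[term].add(name)
    return index


methods = ["getRelativePath", "returnAbsolutePath", "setPath"]
index = term_index(methods)

# Terms that occur in every one of the three method names.
shared = sorted(t for t, names in index.items() if len(names) == len(methods))
print(shared)
```

Here the only term common to all three methods is `path`, which is the kind of term whose scattering across entities the study looks at.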
4 / 16
Dispersion measures (1/3)
Physical dispersion - Entropy
[Figure: terms (fee, foo, bar) plotted against entities (E1 to E5), with the entropy of each term.]
A circle indicates the occurrences of a term in an entity. The larger the circle, the higher the number of occurrences.
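The slide does not give the formula for term entropy. As a rough sketch, assuming it is the Shannon entropy of how a term's occurrences are distributed over entities, it could be computed as follows. The function name `term_entropy` and the occurrence counts are hypothetical.

```python
import math


def term_entropy(occurrences):
    """Shannon entropy (in bits) of a term's occurrences across entities.

    occurrences maps an entity name to how often the term occurs in it.
    """
    total = sum(occurrences.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in occurrences.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical counts: one term confined to a single entity,
# another spread evenly over four entities.
concentrated = {"E1": 8}
scattered = {"E1": 2, "E2": 2, "E3": 2, "E4": 2}

print(term_entropy(concentrated))
print(term_entropy(scattered))
```

Under this assumption, a term used in a single entity has zero entropy, and the entropy grows as the occurrences spread more evenly over more entities.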
5 / 16
Dispersion measures (2/3)
Conceptual dispersion - Context Coverage
[Figure: entities E1 to E5 grouped into entity contexts C1 to C4.]
Entity contexts are identified taking into account the terms contained in the entities.
[Figure: terms (fee, foo, bar) plotted against contexts (C1 to C4), with the context coverage of each term.]
A star indicates that the term appears in the particular context.
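The slides do not define context coverage numerically. A deliberately simplified sketch, assuming coverage is the fraction of contexts in which a term appears at least once and that the contexts are already given, might look like this. The grouping of entities into contexts and the terms of each entity below are made up for illustration.

```python
def context_coverage(term, contexts, entity_terms):
    """Fraction of contexts with at least one entity that uses the term."""
    if not contexts:
        return 0.0
    covered = sum(
        1
        for entities in contexts.values()
        if any(term in entity_terms[entity] for entity in entities)
    )
    return covered / len(contexts)


# Hypothetical grouping of entities into contexts.
contexts = {"C1": {"E1"}, "C2": {"E2", "E5"}, "C3": {"E3"}, "C4": {"E4"}}

# Hypothetical terms contained in each entity.
entity_terms = {
    "E1": {"fee", "foo"},
    "E2": {"foo"},
    "E3": {"foo", "bar"},
    "E4": {"foo"},
    "E5": {"foo"},
}

for term in ("fee", "foo", "bar"):
    print(term, context_coverage(term, contexts, entity_terms))
```

With this toy data, `foo` appears in every context while `fee` and `bar` each appear in only one, so `foo` is the term used in different contexts.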
6 / 16
Dispersion measures (3/3)
Aggregated metric - numHEHCC
[Figure: terms plotted by entropy (H) and context coverage (CC); the thresholds th_H and th_CC split the plot into four regions.]
- Low H, low CC: used in few identifiers, in similar contexts
- High H, low CC: used in many identifiers, in similar contexts
- Low H, high CC: used in few identifiers, in different contexts
- High H, high CC: used in many identifiers, in different contexts (!)
- For each entity, numHEHCC counts the number of such terms (high entropy and high context coverage)
7 / 16
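The counting step can be sketched as below, assuming per-term entropy and context coverage values are already available. The function name `num_hehcc`, the strict `>` comparison against the thresholds, and all numeric values are assumptions for illustration; the slides do not state how th_H and th_CC were chosen.

```python
def num_hehcc(entity_terms, entropy, coverage, th_h, th_cc):
    """Count, per entity, the terms with high entropy and high context coverage."""
    # Terms whose entropy and context coverage both exceed their thresholds.
    high_terms = {
        term
        for term in entropy
        if entropy[term] > th_h and coverage.get(term, 0.0) > th_cc
    }
    return {
        entity: len(set(terms) & high_terms)
        for entity, terms in entity_terms.items()
    }


# Hypothetical per-term measures and thresholds.
entropy = {"fee": 0.0, "foo": 2.1, "bar": 1.5}
coverage = {"fee": 0.25, "foo": 1.0, "bar": 0.25}
entity_terms = {"E1": ["fee", "foo"], "E2": ["bar"]}

counts = num_hehcc(entity_terms, entropy, coverage, th_h=1.0, th_cc=0.5)
print(counts)
```

In this toy setting only `foo` exceeds both thresholds, so E1 gets a count of 1 and E2 a count of 0.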
Our study - refined (1/2)
Research question 1
- RQ1 – Metric Relevance: Does numHEHCC capture characteristics different from size?
- Our belief: Yes, it does, although we expect some overlap.
- To this end, we verify the following:
  1. To what extent do numHEHCC and size vary together?
  2. Can size explain numHEHCC?
  3. Does numHEHCC bring additional information to size for fault explanation?
8 / 16
Our study - refined (2/2)
Research question 2
- RQ2 – Relation to Faults: Do term entropy and context coverage help to explain the presence of faults in an entity?
- Our belief: Yes, they do!
- How?
  1. Estimate the risk of being faulty when entities contain terms with high entropy and high context coverage.
9 / 16
Objects
- ArgoUML v0.16 – a UML modeling CASE tool.
- Rhino v1.4R3 – a JavaScript/ECMAScript interpreter and compiler.

Program   LOC      # Entities   # Terms
ArgoUML   97,946   12,423       2,517
Rhino     18,163   1,624        949

We consider as entities both methods and attributes.
10 / 16
Case study
RQ1 – Metric Relevance (1/3)
Results for RQ1 – Metric Relevance
- To what extent do numHEHCC and size vary together?
  Correlation between numHEHCC and LOC:
    ArgoUML: 40%
    Rhino: 43%
11 / 16
Case study
RQ1 – Metric Relevance (2/3)
Results for RQ1 – Metric Relevance
- Can size explain numHEHCC?
    ArgoUML: 17%
    Rhino: 19%
[Figure: Composition of numHEHCC.]
12 / 16
Case study
RQ1 – Metric Relevance (3/3)
Results for RQ1 – Metric Relevance (cont'd)
- Does numHEHCC bring additional information to size for fault explanation?

Variables       Coefficients   p-values
M_ArgoUML
Intercept       -1.688e+00     < 2e-16
LOC             7.703e-03      8.34e-10
numHEHCC        7.490e-02      1.42e-05
LOC:numHEHCC    -2.819e-04     0.000211
M_Rhino
Intercept       -4.9625130     < 2e-16
LOC             0.0041486      0.17100
numHEHCC        0.2446853      0.00310
LOC:numHEHCC    -0.0004976     0.29788
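The slide does not name the model behind this table. Assuming it is a logistic regression of fault presence on LOC, numHEHCC, and their interaction (an assumption suggested by the LOC:numHEHCC row), the coefficients can be read as in the sketch below. The function name `fault_probability` and the entity with 50 LOC are hypothetical, and the sketch uses all coefficients regardless of their p-values.

```python
import math

# Coefficients copied from the table above.
MODELS = {
    "ArgoUML": {
        "intercept": -1.688,
        "loc": 7.703e-03,
        "num_hehcc": 7.490e-02,
        "interaction": -2.819e-04,
    },
    "Rhino": {
        "intercept": -4.9625130,
        "loc": 0.0041486,
        "num_hehcc": 0.2446853,
        "interaction": -0.0004976,
    },
}


def fault_probability(model, loc, num_hehcc):
    """Predicted fault probability under the assumed logistic model."""
    c = MODELS[model]
    z = (
        c["intercept"]
        + c["loc"] * loc
        + c["num_hehcc"] * num_hehcc
        + c["interaction"] * loc * num_hehcc
    )
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical entity: 50 lines of code, with and without flagged terms.
without_terms = fault_probability("ArgoUML", loc=50, num_hehcc=0)
with_terms = fault_probability("ArgoUML", loc=50, num_hehcc=5)
print(without_terms, with_terms)
```

For this hypothetical ArgoUML entity, the predicted probability rises when numHEHCC goes from 0 to 5, which matches the positive numHEHCC coefficient outweighing the small negative interaction term at this size.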
13 / 16
Case study
Results for RQ2 – Relation to Faults (1/1)
- The risk of being faulty when entities contain terms with high entropy and high context coverage.
[Figure: "All entities", with a subset labeled "numHEHCC: 10% of the entities".]
Risk of being faulty?
  ArgoUML: 2 x higher
  Rhino: 6 x higher
14 / 16
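The slide does not say how the "x higher" figures were obtained. One common way to express such a comparison is a relative risk or an odds ratio computed from a 2x2 table of flagged versus other entities; the sketch below assumes that setting. The counts are invented purely to show the arithmetic and are not the study's data.

```python
def relative_risk(faulty_flagged, total_flagged, faulty_other, total_other):
    """Ratio of the fault rate among flagged entities to the rate among the rest."""
    risk_flagged = faulty_flagged / total_flagged
    risk_other = faulty_other / total_other
    return risk_flagged / risk_other


def odds_ratio(faulty_flagged, total_flagged, faulty_other, total_other):
    """Ratio of the odds of being faulty: flagged entities versus the rest."""
    clean_flagged = total_flagged - faulty_flagged
    clean_other = total_other - faulty_other
    return (faulty_flagged * clean_other) / (clean_flagged * faulty_other)


# Hypothetical counts: 100 flagged entities (30 faulty), 900 others (135 faulty).
rr = relative_risk(30, 100, 135, 900)
oratio = odds_ratio(30, 100, 135, 900)
print(round(rr, 2), round(oratio, 2))
```

The two measures answer slightly different questions and diverge as faults become more frequent, so which one a reported "2 x higher" refers to matters when comparing systems.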
Conclusions and future work
Conclusions
- Entropy and context coverage, together, capture characteristics different from size!
- Entropy and context coverage, together, help to explain the presence of faults in entities!
Future directions
- Replicate the study on other systems.
- Use entropy and context coverage to suggest refactoring.
- Study the impact of lexicon evolution on entropy and context coverage.
15 / 16
Thank you!
Questions?
16 / 16
Binkley, D., Davis, M., Lawrie, D., and Morrell, C. (2009). To CamelCase or Under_score. In Proceedings of 17th IEEE International Conference on Program Comprehension. IEEE CS Press.
Butler, S., Wermelinger, M., Yu, Y., and Sharp, H. (2009). Relating identifier naming flaws and code quality: An empirical study. In Proceedings of the 16th Working Conference on Reverse Engineering, pages 31–35. IEEE CS Press.
Deissenboeck, F. and Pizka, M. (2006). Concise and consistent naming. Software Quality Journal, 14(3):261–282.
Gyimothy, T., Ferenc, R., and Siket, I. (2005). Empirical validation of object-oriented metrics on open source software for fault prediction. IEEE Transactions on Software Engineering, 31(10):897–910.
Haiduc, S. and Marcus, A. (2008). On the use of domain terms in source code. In Proceedings of 16th IEEE International Conference on Program Comprehension, pages 113–122. IEEE CS Press.
Liu, Y., Poshyvanyk, D., Ferenc, R., Gyimothy, T., and Chrisochoides, N. (2009). Modelling class cohesion as mixtures of latent topics. In Proceedings of 25th IEEE International Conference on Software Maintenance, pages 233–242, Edmonton, Canada. IEEE CS Press.
Marcus, A., Poshyvanyk, D., and Ferenc, R. (2008). Using the conceptual cohesion of classes for fault prediction in object-oriented systems. IEEE Transactions on Software Engineering, 34(2):287–300.
Poshyvanyk, D. and Marcus, A. (2006). The conceptual coupling metrics for object-oriented systems. In Proceedings of 22nd IEEE International Conference on Software Maintenance, pages 469–478. IEEE CS Press.
Takang, A., Grubb, P., and Macredie, R. (1996). The effects of comments and identifier names on program comprehensibility: an experiential study. Journal of Program Languages, 4(3):143–167.
Zimmermann, T., Premraj, R., and Zeller, A. (2007). Predicting defects for eclipse. In Proceedings of the Third International Workshop on Predictor Models in Software Engineering.