Physical and Conceptual Identifier Dispersion: Measures and Relation to Fault Proneness



Physical and Conceptual Identifier Dispersion: Measures and Relation to Fault Proneness

Venera Arnaoudova, Laleh Eshkevari, Rocco Oliveto, Yann-Gael Gueheneuc, Giuliano Antoniol

SOCCER Lab. – DGIGL, Ecole Polytechnique de Montreal, Qc, Canada
SE@SA Lab – DMI, University of Salerno, Salerno, Italy
Ptidej Team – DGIGL, Ecole Polytechnique de Montreal, Qc, Canada

September 15, 2010

SOCCER: SOftware Cost-effective Change and Evolution Research Lab
SE@SA: Software Engineering @ SAlerno
Ptidej: Pattern Trace Identification, Detection, and Enhancement in Java


Outline

Introduction

Our study

Dispersion measures

Our study - refined

Case study
  - RQ1 – Metric Relevance
  - RQ2 – Relation to Faults

Conclusions and future work


Introduction

- Fault identification
  - size (e.g., [Gyimothy et al., 2005])
  - cohesion (e.g., [Liu et al., 2009])
  - coupling (e.g., [Marcus et al., 2008])
  - number of changes (e.g., [Zimmermann et al., 2007])

- Importance of linguistic information
  - program comprehension (e.g., [Takang et al., 1996, Deissenboeck and Pizka, 2006, Haiduc and Marcus, 2008, Binkley et al., 2009])
  - code quality (e.g., [Marcus et al., 2008, Poshyvanyk and Marcus, 2006, Butler et al., 2009])


Our study

Term dispersion

- We are interested in studying the relation between term dispersion and the quality of the source code.
  - term: basic component of identifiers
  - dispersion: the way terms are scattered among different entities (attributes and methods)
  - quality: absence of faults

- Example: What is the impact of using getRelativePath, returnAbsolutePath, and setPath as method names on the fault proneness of those methods?
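To make "term" concrete, here is a minimal Python sketch of how the example identifiers could be split into terms. The slides do not describe the splitting procedure actually used, so the function `split_identifier` and its camelCase/underscore rules are assumptions made for illustration only.

```python
import re

def split_identifier(identifier: str) -> list[str]:
    """Split a camelCase or under_score identifier into lowercase terms.

    Hypothetical helper: the splitting rules are an assumption, not the
    procedure used in the study.
    """
    terms = []
    # First cut on underscores and any other non-alphanumeric characters.
    for chunk in re.split(r"[_\W]+", identifier):
        # Then cut on case boundaries: acronym runs, capitalized words, digits.
        terms += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk)
    return [t.lower() for t in terms]

for name in ("getRelativePath", "returnAbsolutePath", "setPath"):
    print(name, "->", split_identifier(name))
```

With this splitting, the term "path" occurs in all three method names, which is the kind of scattering the dispersion measures are meant to quantify.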


Dispersion measures (1/3)

Physical dispersion - Entropy

[Figure: Terms (fee, foo, bar) against Entities (E1, E2, E3, E4, E5), with the Entropy of each term.]

A circle indicates the occurrences of a term in an entity. The larger the circle, the higher the number of occurrences.
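As an illustration, the entropy of a term can be sketched as the Shannon entropy of its occurrence distribution over entities, H(t) = -sum over e of p(e|t) log2 p(e|t), where p(e|t) is the share of the term's occurrences that fall in entity e. The slides do not give the exact formula (for example, whether it is normalized), so this definition and the occurrence counts below are assumptions.

```python
import math

def term_entropy(occurrences: list[int]) -> float:
    """Shannon entropy (in bits) of a term's occurrences across entities.

    Assumed definition: p(e|t) = occurrences in e / total occurrences.
    """
    total = sum(occurrences)
    if total == 0:
        return 0.0
    probs = [n / total for n in occurrences if n > 0]
    # sum of p * log2(1/p) is the same as -sum of p * log2(p)
    return sum(p * math.log2(1 / p) for p in probs)

# Hypothetical occurrence counts of each term in entities E1..E5.
counts = {
    "fee": [4, 0, 0, 0, 0],  # concentrated in a single entity
    "foo": [2, 2, 0, 0, 0],  # split evenly over two entities
    "bar": [1, 1, 1, 1, 1],  # spread evenly over all five entities
}
for term, occ in counts.items():
    print(term, round(term_entropy(occ), 3))
```

A term confined to one entity gets entropy 0, while a term spread evenly over all entities gets the maximum value, log2 of the number of entities.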


Dispersion measures (2/3)

Conceptual dispersion - Context Coverage

[Figure: Entity Contexts. Entities E1 to E5 grouped into contexts C1 to C4.]

Entity contexts are identified taking into account the terms contained in the entities.

[Figure: Terms (fee, foo, bar) against Contexts (C1, C2, C3, C4), with the Context coverage of each term.]

A star indicates that the term appears in the particular context.
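One way to turn this idea into a number is sketched below in Python. It assumes that each context is summarized by a term-frequency vector and that the context coverage of a term is one minus the average pairwise cosine similarity of the contexts in which the term appears. Both choices, and the treatment of terms found in fewer than two contexts, are assumptions for illustration; the slides do not give the formula, and the contexts and the extra term "baz" are made up.

```python
import math
from itertools import combinations

def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def context_coverage(term: str, contexts: dict[str, dict[str, float]]) -> float:
    """Assumed definition: 1 - mean pairwise similarity of the contexts using the term."""
    used_in = [vec for vec in contexts.values() if vec.get(term, 0) > 0]
    if len(used_in) < 2:
        return 0.0  # assumption: a term seen in a single context covers nothing more
    sims = [cosine(a, b) for a, b in combinations(used_in, 2)]
    return 1.0 - sum(sims) / len(sims)

# Hypothetical term-frequency vectors for contexts C1..C4.
contexts = {
    "C1": {"fee": 3, "foo": 1},
    "C2": {"foo": 2, "bar": 2},
    "C3": {"bar": 1, "baz": 4},
    "C4": {"fee": 1, "foo": 3},
}
for term in ("fee", "foo", "bar"):
    print(term, round(context_coverage(term, contexts), 3))
```

Under this sketch a term used only in similar contexts scores low, and a term used in dissimilar contexts scores high.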


Dispersion measures (3/3)

Aggregated metric - numHEHCC

[Figure: terms placed by Entropy (H) and Context Coverage (CC); the thresholds th_H and th_CC split the plane into four regions.]

- Low H, low CC: used in few identifiers, in similar contexts
- High H, low CC: used in many identifiers, in similar contexts
- Low H, high CC: used in few identifiers, in different contexts
- High H, high CC: used in many identifiers, in different contexts (the region marked "!" on the slide)

- For each entity, numHEHCC counts the number of such terms, i.e., terms with high entropy and high context coverage
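The counting step can be sketched in a few lines of Python. The threshold values, the per-term scores, and the decision to count each distinct term once with a strict comparison are all assumptions made for the example; the slides only state that numHEHCC counts, per entity, the terms above both thresholds.

```python
def num_hehcc(entity_terms, entropy, coverage, th_h, th_cc):
    """Count the distinct terms of an entity above both thresholds.

    Assumptions: distinct terms are counted once, and 'high' means
    strictly greater than the threshold.
    """
    return sum(
        1
        for term in set(entity_terms)
        if entropy.get(term, 0.0) > th_h and coverage.get(term, 0.0) > th_cc
    )

# Hypothetical per-term scores and thresholds.
entropy = {"get": 2.1, "path": 1.8, "relative": 0.3, "set": 2.0}
coverage = {"get": 0.9, "path": 0.7, "relative": 0.1, "set": 0.2}
TH_H, TH_CC = 1.5, 0.5

print(num_hehcc(["get", "relative", "path"], entropy, coverage, TH_H, TH_CC))
print(num_hehcc(["set", "path"], entropy, coverage, TH_H, TH_CC))
```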


Our study - refined (1/2)

Research question 1

- RQ1 – Metric Relevance: Does numHEHCC capture characteristics different from size?

- Our belief: Yes, it does, although we expect some overlap.

- To this end, we verify the following:
  1. To what extent do numHEHCC and size vary together?
  2. Can size explain numHEHCC?
  3. Does numHEHCC bring additional information to size for fault explanation?


Our study - refined (2/2)

Research question 2

- RQ2 – Relation to Faults: Do term entropy and context coverage help to explain the presence of faults in an entity?

- Our belief: Yes, they do!

- How?
  1. Estimate the risk of being faulty when entities contain terms with high entropy and high context coverage.


Objects

- ArgoUML v0.16 – a UML modeling CASE tool.
- Rhino v1.4R3 – a JavaScript/ECMAScript interpreter and compiler.

Program   LOC      # Entities   # Terms
ArgoUML   97,946   12,423       2,517
Rhino     18,163   1,624        949

We consider as entities both methods and attributes.


Case study
RQ1 – Metric Relevance (1/3)

Results for RQ1 – Metric Relevance

- To what extent do numHEHCC and size vary together?

Correlation between numHEHCC and LOC:
  ArgoUML: 40%
  Rhino: 43%

[Figure labels: numHEHCC, LOC.]


Case study
RQ1 – Metric Relevance (2/3)

Results for RQ1 – Metric Relevance

- Can size explain numHEHCC?

  ArgoUML: 17%
  Rhino: 19%

[Figure: Composition of numHEHCC.]
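The first two RQ1 questions ask how numHEHCC and size vary together and how much of numHEHCC size can explain. The slides do not say which statistics were used, so the Python sketch below assumes a Pearson correlation for the first and the R² of a simple linear regression of numHEHCC on LOC for the second. The data points are invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def r_squared(xs, ys):
    """Share of the variance of ys explained by the least-squares line ys ~ a + b * xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Hypothetical per-entity measurements.
loc = [5, 12, 20, 33, 41, 58]
hehcc = [0, 2, 1, 3, 2, 5]

r = pearson_r(loc, hehcc)
print("correlation:", round(r, 3))
print("explained variance:", round(r_squared(loc, hehcc), 3))
```

Under these assumptions the two quantities are linked: for a single predictor, R² equals the square of the Pearson correlation.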


Case study
RQ1 – Metric Relevance (3/3)

Results for RQ1 – Metric Relevance (cont’d)

- Does numHEHCC bring additional information to size for fault explanation?

Model       Variable       Coefficient   p-value
M_ArgoUML   Intercept      -1.688e+00    < 2e-16
            LOC            7.703e-03     8.34e-10
            numHEHCC       7.490e-02     1.42e-05
            LOC:numHEHCC   -2.819e-04    0.000211
M_Rhino     Intercept      -4.9625130    < 2e-16
            LOC            0.0041486     0.17100
            numHEHCC       0.2446853     0.00310
            LOC:numHEHCC   -0.0004976    0.29788
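To see what such coefficients imply, the sketch below plugs them into a logistic model with an interaction term. That these are logistic regression models of fault presence is an assumption (the slide shows only coefficients and p-values), and the function name and input values are hypothetical.

```python
import math

# Coefficients copied from the table above; the logistic form is an assumption.
MODELS = {
    "ArgoUML": {"intercept": -1.688, "loc": 7.703e-03,
                "num_hehcc": 7.490e-02, "interaction": -2.819e-04},
    "Rhino": {"intercept": -4.9625130, "loc": 0.0041486,
              "num_hehcc": 0.2446853, "interaction": -0.0004976},
}

def fault_probability(system: str, loc: float, num_hehcc: float) -> float:
    """Predicted probability that an entity is faulty under the assumed logistic model."""
    c = MODELS[system]
    z = (c["intercept"]
         + c["loc"] * loc
         + c["num_hehcc"] * num_hehcc
         + c["interaction"] * loc * num_hehcc)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical entity: 10 lines of code, with 0 or 5 high-entropy, high-coverage terms.
for system in MODELS:
    low = fault_probability(system, loc=10, num_hehcc=0)
    high = fault_probability(system, loc=10, num_hehcc=5)
    print(system, round(low, 4), round(high, 4))
```

The negative LOC:numHEHCC coefficient means that, in this form, the effect of numHEHCC shrinks as entities get longer.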


Case study
Results for RQ2 – Relation to Faults (1/1)

- The risk of being faulty when entities contain terms with high entropy and high context coverage.

[Figure labels: All entities; numHEHCC; 10% of the entities.]

Risk of being faulty?
  ArgoUML: 2 x higher
  Rhino: 6 x higher
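A statement such as "N x higher" risk is usually derived from a 2x2 table of flagged versus other entities against faulty versus clean. The slides do not say whether an odds ratio or a relative risk was reported, so the Python sketch below shows both, with invented counts.

```python
def odds_ratio(faulty_flagged, clean_flagged, faulty_other, clean_other):
    """Odds of being faulty among flagged entities divided by the odds among the others."""
    return (faulty_flagged / clean_flagged) / (faulty_other / clean_other)

def relative_risk(faulty_flagged, clean_flagged, faulty_other, clean_other):
    """Fault rate among flagged entities divided by the fault rate among the others."""
    rate_flagged = faulty_flagged / (faulty_flagged + clean_flagged)
    rate_other = faulty_other / (faulty_other + clean_other)
    return rate_flagged / rate_other

# Hypothetical counts: 100 flagged entities (30 faulty), 900 others (90 faulty).
table = (30, 70, 90, 810)
print("odds ratio:", round(odds_ratio(*table), 2))
print("relative risk:", round(relative_risk(*table), 2))
```

The two measures are close when faults are rare and diverge as the fault rate grows, so it matters which one a reported multiplier refers to.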


Conclusions and future work

Conclusions

- Entropy and context coverage, together, capture characteristics different from size!
- Entropy and context coverage, together, help to explain the presence of faults in entities!

Future directions

- Replicate the study on other systems.
- Use entropy and context coverage to suggest refactoring.
- Study the impact of lexicon evolution on entropy and context coverage.


Thank you!

Questions?


References

Binkley, D., Davis, M., Lawrie, D., and Morrell, C. (2009). To CamelCase or Under_score. In Proceedings of 17th IEEE International Conference on Program Comprehension. IEEE CS Press.

Butler, S., Wermelinger, M., Yu, Y., and Sharp, H. (2009). Relating identifier naming flaws and code quality: An empirical study. In Proceedings of the 16th Working Conference on Reverse Engineering, pages 31–35. IEEE CS Press.

Deissenboeck, F. and Pizka, M. (2006). Concise and consistent naming. Software Quality Journal, 14(3):261–282.

Gyimothy, T., Ferenc, R., and Siket, I. (2005). Empirical validation of object-oriented metrics on open source software for fault prediction. IEEE Transactions on Software Engineering, 31(10):897–910.

Haiduc, S. and Marcus, A. (2008). On the use of domain terms in source code. In Proceedings of 16th IEEE International Conference on Program Comprehension, pages 113–122. IEEE CS Press.

Liu, Y., Poshyvanyk, D., Ferenc, R., Gyimothy, T., and Chrisochoides, N. (2009). Modelling class cohesion as mixtures of latent topics. In Proceedings of 25th IEEE International Conference on Software Maintenance, pages 233–242, Edmonton, Canada. IEEE CS Press.

Marcus, A., Poshyvanyk, D., and Ferenc, R. (2008). Using the conceptual cohesion of classes for fault prediction in object-oriented systems. IEEE Transactions on Software Engineering, 34(2):287–300.

Poshyvanyk, D. and Marcus, A. (2006). The conceptual coupling metrics for object-oriented systems. In Proceedings of 22nd IEEE International Conference on Software Maintenance, pages 469–478. IEEE CS Press.

Takang, A., Grubb, P., and Macredie, R. (1996). The effects of comments and identifier names on program comprehensibility: an experiential study. Journal of Program Languages, 4(3):143–167.

Zimmermann, T., Premraj, R., and Zeller, A. (2007). Predicting defects for Eclipse. In Proceedings of the Third International Workshop on Predictor Models in Software Engineering.