can better identifier splitting thi h lf l i ? techniques h elp

64
Can Better Identifier Splitting T h i H l F L i ? B d Di L if G j D P h k Gi li A il Techniques Help Feature Location? Bogdan Dit, Latifa Guerrouj, Denys Poshyvanyk, Giuliano Antoniol SEMERU 19 th IEEE International Conference on Program Comprehension (ICPC’11) – Kingston, Ontario, Canada

Upload: others

Post on 03-Feb-2022

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Can Better Identifier Splitting T h i H l F L i ?

B d Di L if G j D P h k Gi li A i l

Techniques Help Feature Location?Bogdan Dit, Latifa Guerrouj, Denys Poshyvanyk, Giuliano Antoniol

SEMERU

19th IEEE International Conference on Program Comprehension (ICPC’11) – Kingston, Ontario, Canada

Page 2: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

2

Page 3: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Textual information embeds d i k l ddomain knowledge

3

Page 4: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Textual information embeds d i k l ddomain knowledge

About 70% of source codeAbout 70% of source code consists of identifiers*

* Deissenboeck, F. and Pizka , M., "Concise and Consistent Naming", Software Quality Journal, vol. 14, no. 3, 2006, pp. 261-282

4

Page 5: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Textual information embeds d i k l ddomain knowledge

About 70% of source codeAbout 70% of source code consists of identifiers*

Identifiers are important source of Identifiers are important source of information for maintenance tasks:information for maintenance tasks:

• traceability link recovery• feature location

* Deissenboeck, F. and Pizka , M., "Concise and Consistent Naming", Software Quality Journal, vol. 14, no. 3, 2006, pp. 261-282

feature location5

Page 6: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

# of Cumulative Feature Location Papers based on Textual InformationPapers based on Textual Information

6

Page 7: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Related Work on IdentifiersRelated Work on Identifiers

• Takang et al (JPL’96) • Takang et al. (JPL 96) – programs with full-word identifiers are more

understandable than those with abbreviated understandable than those with abbreviated ones

• Lawrie et al (ICPC’06) • Lawrie et al. (ICPC 06) – full words and recognizable abbreviations

lead to better comprehensionlead to better comprehension

• Binkley et al. (ICPC’09) CamelCase style is easier to recognize than – CamelCase style is easier to recognize than underscore 7

Page 8: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Related Work on IdentifiersRelated Work on Identifiers

• Enslen et al (MSR’09)• Enslen et al. (MSR 09)– Samurai: algorithm for splitting identifiers

(using tables of identifier frequencies)(using tables of identifier frequencies)

• Guerrouj et al. (JSME’11)TIDIER: algorithm for splitting identifiers – TIDIER: algorithm for splitting identifiers (using contextual information)

Other related work• Other related work– Deissenboeck and Pizka (SQJ’06), Antoniol et

al (ICSM’07) Haiduc and Marcus (ICPC’08) al. (ICSM’07), Haiduc and Marcus (ICPC’08), etc. 8

Page 9: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Splitting Identifiers Correctly is h llChallenging

9

Page 10: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Identifier Splitting AlgorithmsIdentifier Splitting AlgorithmsOriginal IdentifieruserIdsetGIDprint file2deviceprint_file2deviceSSLCertificateMINstringUSERIDcurrentsizereadadapterobjectp jtolocaleimitatingDEFMASKBitDEFMASKBit

10

Page 11: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Identifier Splitting AlgorithmsIdentifier Splitting AlgorithmsOriginal Identifier Camel CaseuserId user IdsetGID set GIDprint file2device print file 2 deviceprint_file2device print file 2 deviceSSLCertificate SSL CertificateMINstring MI NstringUSERID USERIDcurrentsize currentsizereadadapterobject readadapterobjectp j p jtolocale tolocaleimitating imitatingDEFMASKBit DEFMASK BitDEFMASKBit DEFMASK Bit

11

Page 12: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Identifier Splitting AlgorithmsIdentifier Splitting AlgorithmsOriginal Identifier Camel Case

Handles underscore and

userId user IdsetGID set GIDprint file2device print file 2 device

underscore and digits

print_file2device print file 2 deviceSSLCertificate SSL CertificateMINstring MI NstringUSERID USERIDcurrentsize currentsizereadadapterobject readadapterobject Fails at mixed casesp j p jtolocale tolocaleimitating imitatingDEFMASKBit DEFMASK BitDEFMASKBit DEFMASK Bit

12Fails at same case

identifiers

Page 13: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Identifier Splitting AlgorithmsIdentifier Splitting AlgorithmsOriginal Identifier Camel CaseuserId user IdsetGID set GIDprint file2device print file 2 deviceprint_file2device print file 2 deviceSSLCertificate SSL CertificateMINstring MI NstringUSERID USERIDcurrentsize currentsizereadadapterobject readadapterobjectp j p jtolocale tolocaleimitating imitatingDEFMASKBit DEFMASK BitDEFMASKBit DEFMASK Bit

13

Page 14: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Identifier Splitting AlgorithmsIdentifier Splitting AlgorithmsOriginal Identifier Camel Case SamuraiuserId user Id user IdsetGID set GID set GIDprint file2device print file 2 device print file 2 deviceprint_file2device print file 2 device print file 2 deviceSSLCertificate SSL Certificate SSL CertificateMINstring MI Nstring MIN stringUSERID USERID USER IDcurrentsize currentsize current sizereadadapterobject readadapterobject read adapter objectp j p j p jtolocale tolocale tol ocal eimitating imitating imi ta tingDEFMASKBit DEFMASK Bit DEF MASK BitDEFMASKBit DEFMASK Bit DEF MASK Bit

14

Page 15: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Identifier Splitting AlgorithmsIdentifier Splitting AlgorithmsOriginal Identifier Camel Case SamuraiuserId user Id user IdsetGID set GID set GIDprint file2device print file 2 device print file 2 device

Splits some cases where CamelCase

print_file2device print file 2 device print file 2 deviceSSLCertificate SSL Certificate SSL CertificateMINstring MI Nstring MIN string

cannot

USERID USERID USER IDcurrentsize currentsize current sizereadadapterobject readadapterobject read adapter objectp j p j p jtolocale tolocale tol ocal eimitating imitating imi ta tingDEFMASKBit DEFMASK Bit DEF MASK BitDEFMASKBit DEFMASK Bit DEF MASK Bit

15Oversplits

Page 16: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

# of Cumulative Feature Location Papers based on Textual InformationPapers based on Textual Information

16

Page 17: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

# of Cumulative Feature Location Papers based on Textual InformationPapers based on Textual Information

Existing feature location techniques use Camel Case splittinguse Camel Case splitting

17

Page 18: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Information Retrieval FLTInformation Retrieval FLT• Generate corpus synchronized void print(TestResult result,

l Ti ) h IOE i {• Preprocessing– Remove non-literals– Remove stop words

long runTime) throws IOException{printHeader(runTime);printErrors(result);printFailures(result);Remove stop words

– Split identifiers– Stemming

I d i

p ( );printFooter(result);

}

synchronized void print TestResult result• Indexing

– Term-by-document matrix– Singular Value

synchronized void print TestResult result long runTime throws IOExceptionprintHeader runTime printErrors result printFailures result printFooter result g

Decomposition

• User formulate queryG t lt

print TestResult result runTimeIOException printHeader runTime• Generate results

• Ranked list 18

O cept o p t eade u eprintErrors result printFailures result printFooter result

Page 19: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Information Retrieval FLTInformation Retrieval FLT• Generate corpus print Test Result result run Time IO

E ti i t H d Ti i t• Preprocessing– Remove non-literals– Remove stop words

Exception print Header run Time print Errors result print Failures result print Footer result

Remove stop words– Split identifiers– Stemming

I d i

print test result result run time ioexception print head run time print error

• Indexing– Term-by-document matrix– Singular Value

result print fail result print foot result

gDecomposition

• User formulate queryG t lt

print test result ...

m1 5 1 3 ...• Generate results• Ranked list 19

m2 ... ... ... ...

Page 20: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Information Retrieval FLTInformation Retrieval FLT• Generate corpus print test result ...

• Preprocessing– Remove non-literals– Remove stop words

m1 5 1 3 ...

m2 ... ... ... ...

Remove stop words– Split identifiers– Stemming

I d i• Indexing– Term-by-document matrix– Singular Value g

Decomposition

• User formulate queryG t lt• Generate results

• Ranked list 20

Page 21: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

IR and Dynamic Information FLTIR and Dynamic Information FLT• Generate corpus• Preprocessing

– Remove non-literals– Remove stop wordsRemove stop words– Split identifiers– Stemming

I d i• Indexing– Term-by-document matrix– Singular Value Decompositiong p

• User formulate query

Collect execution trace

• Generate results• Ranked list of executed methods 21

Page 22: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

R h G lResearch GoalEvaluate how advanced splitting techniques impact

th f f f t l ti t h ithe performance of feature location techniques

22

Page 23: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Information Retrieval FLTInformation Retrieval FLT• Generate corpus• Preprocessing

– Remove non-literals– Remove stop words

Replace Camel Case with :•SamuraiRemove stop words

– Split identifiers– Stemming

I d ig ( )

•“Perfect” Splitting algorithm (Oracle)

• Indexing– Term-by-document matrix– Singular Value Bg

Decomposition

• User formulate queryG t lt

etter W

• Generate results• Ranked list 23

Worst

Page 24: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Extract Identifiers

AllAll Identifiers

B ildi Building the Oracle 24

Page 25: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Extract Identifiers

AllSame split?All

Identifiersp

(CamelCaseSamurai TIDIER)

B ildi Concordant

YES

Building the Oracle

ConcordantSplit

Identifiers 25

Page 26: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Extract Identifiers

AllSame split?All

Identifiersp

(CamelCaseSamurai TIDIER)

• Assume they are correct

B ildi Concordant

YES• Manually verified a sample

Building the Oracle

ConcordantSplit

Identifiers 26

sample

• Threat to validity

Page 27: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Manually Split

IdentifiersIdentifiers

Extract Identifiers Manual Split

AllDiscordant

SplitSame split? NOAll

IdentifiersSplit

Identifiersp

(CamelCaseSamurai TIDIER)

B ildi Concordant

YES

Building the Oracle

ConcordantSplit

Identifiers 27

Page 28: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Manually Split

IdentifiersIdentifiers

Extract Identifiers Manual SplitConsensus

between authors

AllDiscordant

SplitSame split? NO

Checked source codeAll

IdentifiersSplit

Identifiersp

(CamelCaseSamurai TIDIER)

Concordant

YES

B ildi ConcordantSplit

IdentifiersBuilding

the Oracle 28

Page 29: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Manually Split

Identifiers

Identifiers that could

not be split• Examples: DT, i3,

Identifiersnot be splitP754, zzz, etc.

• Left unchangedExtract Identifiers Manual Split

Left unchanged

AllDiscordant

SplitSame split? NOAll

IdentifiersSplit

Identifiersp

(CamelCaseSamurai TIDIER)

Concordant

YES

B ildi ConcordantSplit

IdentifiersBuilding

the Oracle 29

Page 30: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Design of the Case StudyDesign of the Case Study

30

Page 31: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Design of the Case StudyDesign of the Case Study

• RQ: Does a FLT with an advanced • RQ: Does a FLT with an advanced splitting algorithm produce better results than the same FLT using the CamelCase than the same FLT using the CamelCase splitting algorithm?

31

Page 32: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

How to Compare two FLTs?How to Compare two FLTs?

32

Page 33: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

How to Compare two FLTs?

• Effectiveness measure for each feature

How to Compare two FLTs?

• Effectiveness measure for each feature

Method LSIIR

G ld t th d

Method LSI score

M121 0.92M64 0.89

Gold set methodM15 0.86M39 0.80M7 0.74M 0 65M152 0.65M234 0.56M12 0.54M 0 52

Effectiveness = 5

M78 0.52

33

Page 34: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

How to Compare two FLTs?

• Effectiveness measure for each feature

How to Compare two FLTs?

• Effectiveness measure for each feature

Method LSIIR IRDynMethod LSI

scoreM121 0.92M64 0.89

Method LSI score

M15 0.86Gold set method

y

M15 0.86M39 0.80M7 0.74M 0 65

M7 0.74M234 0.56M12 0.54Effectiveness = 2M152 0.65

M234 0.56M12 0.54M 0 52

ect e ess

M78 0.52

34MethodExecuted method (from trace)

Page 35: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Which FLTs are we Comparing?Which FLTs are we Comparing?

35

Page 36: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Software SystemsSoftware Systems

• Rhino 1 6R5• Rhino 1.6R5• 138 classes, 1,870 methods, 32K LOC

E dd l ’ d *• Eaddy et al.’s data*• 2 datasets

Dataset Size Queries Gold Sets Execution Information

RhinoFeatures 241 Sections of ECMAScriptdocumentation

Eaddy et al.* Full Execution Traces (from unit tests)

Rhi 143 B titl d E dd t l * N/A

36* http://www.cs.columbia.edu/~eaddy/concerntagger/

RhinoBugs 143 Bug title and description

Eaddy et al.*(CVS)

N/A

Page 37: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Software SystemsSoftware Systems

• jEdit 4 3• jEdit 4.3• 483 classes, 6.4K methods, 109K LOC

2 d t t• 2 datasets

Dataset Size Queries Gold Sets Execution Information

jEditFeatures 64 Feature (or Patch) title and description

SVN Marked Execution Traces

jEditBugs 86 Bug title and description

SVN Marked Execution Traces

37

Datasets available at:http://www.cs.wm.edu/semeru/data/icpc11-identifier-splitting/

Page 38: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Generating the jEdit Datasets

SVN Commits between v4.2-v4.3

38

Page 39: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Generating the jEdit Datasets

SVN commit message

Title

Description

+

Query

=Query

Page 40: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Generating the jEdit Datasets

Changed files

Previous V i

CurrentV i

Compare using

Eclipse AST

Version Version

c pse S

Modified methods(gold set) 40

Page 41: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Presenting the Results

41

Page 42: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Presenting the Results

Box plot of all effectiveness measure in datasets

(e.g., 241 datapoints for RhinoFeatures)

Average

Median

42

Page 43: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

IR FLTs

RhinoFeaturesRhinoFeatures

Page 44: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

IR FLTs

S

RhinoFeatures

Similar median and average

RhinoFeatures

Page 45: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

IR FLTs

S

RhinoFeatures RhinoBugs

Similar median and average

RhinoFeatures RhinoBugs

45jEditFeatures jEditBugs

Page 46: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

IR FLTs

S

RhinoFeatures RhinoBugs

Similar median and average

RhinoFeatures RhinoBugs

Datasets with features have better results than datasets

with bugs

46jEditFeatures jEditBugs

Page 47: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

IRDyn FLTs

S

N/A

RhinoFeatures RhinoBugs

Similar median and average

RhinoFeatures RhinoBugs

47jEditFeatures jEditBugs

Page 48: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

IRDyn FLTs

S

N/A

RhinoFeatures RhinoBugs

Similar median and average

RhinoFeatures RhinoBugs

Datasets with atasets tfeatures have better results than datasetsthan datasets

with bugs

48jEditFeatures jEditBugs

Page 49: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Compare FLTs by PercentagesCompare FLTs by PercentagesIROracle IRCamelCase

(Baseline)(Baseline)10 1720 1520 1518 185 94 16

19 712 28

4914 15

Page 50: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Compare FLTs by PercentagesCompare FLTs by PercentagesIROracle IRCamelCase

(Baseline) 5/8(Baseline)10 1720 1520 1518 18 2/85 94 16

2/8

19 712 28

5014 15

Page 51: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

IRIR

RhinoFeatures RhinoBugs

51jEditFeatures jEditBugs

Page 52: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

IRIR

Datasets with featuresRhinoFeatures RhinoBugs

Datasets with features vs.

Datasets with bugs

52jEditFeatures jEditBugs

Page 53: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

IRDynN/A

Similar trend

RhinoFeatures RhinoBugs

53jEditFeatures jEditBugs

Page 54: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Statistical ResultsStatistical Results

• Wilcoxon signed-rank test• Wilcoxon signed rank test• Null hypothesis

Th i t ti ti l i ifi diff – There is no statistical significance difference in terms of effectiveness between IR /IR l and IR lIRSamurai/IROracle and IRCamelCase

• Alternative hypothesisIR /IR h t ti ti ll i ifi tl – IRSamurai/IROracle has statistically significantly higher effectiveness than IRCamelCase

l h 0 05• alpha = 0.0554

Page 55: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

IRIR

RhinoFeatures RhinoBugsThe only statistical

significant resultsignificant result(p=0.05)

55jEditFeatures jEditB

Page 56: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Qualitative ResultsQualitative Results• Vocabulary mismatch between queries

and code:and code:– Name of developers (e.g., Slava, Carlos)

Id ifi ifi i i ( – Identifiers specific to communication (e.g., thanks, greetings, annoying)

56

Page 57: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Qualitative ResultsQualitative Results• Features are more “descriptive” than

bugsbugs

57

Page 58: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Qualitative ResultsQualitative Results• Features are more “descriptive” than

bugsbugs

Words “join” and “line” are not

mentioned

58

Page 59: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Threats to ValidityThreats to Validity• External

– 2 Java applications (different domains)– More systems needed

• Construct– Errors may be present in Oracle and gold setsy p g– We used data produced by other researchers

• Internal• Internal– Subjectivity and bias in building the Oracle

Conclusion• Conclusion– Non-parametric test: Wilcoxon signed-rank 59

Page 60: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Research Questions• RQ1 Does IRSamurai outperform IRCamelCase in

terms of effectiveness? NO

• RQ2 Does IRS iDyn outperform IRC lC Dyn• RQ2 Does IRSamuraiDyn outperform IRCamelCaseDynin terms of effectiveness? NO

• RQ3 Does IROracle outperform IRCamelCase in terms f ff ti ? I (Rhi )of effectiveness? In some cases (Rhino)

• RQ4 Does IROracleDyn outperform IRCamelCaseDynin terms of effectiveness? NO

60

Page 61: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Future WorkFuture Work

• More systems and datasets• More systems and datasets• Different maintenance tasks

T bilit li k – Traceability link recovery

• Consider other splitting algorithms

61

Page 62: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

ConclusionsConclusions

• Advanced splitting technique could • Advanced splitting technique could improve FLTs

We found some empirical evidence– We found some empirical evidence

• Splitting has more impact on IR FLTf f l bl • If execution information is available, it is not necessary to use an advance splitting

h itechnique

62

Page 63: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

Thank you! Questions?Thank you! Questions?

SEMERU @ William and MarySEMERU @ William and Maryhttp://www.cs.wm.edu/semeru/

bdi [email protected]

SEMERU

63

Page 64: Can Better Identifier Splitting Thi H lF L i ? Techniques H elp

ReferencesReferences• Takang et al. (1996) Takang, A., Grubb, P., and Macredie, R., "The

Effects of Comments and Identifier Names on Program Comprehensibility: An Experimental Investigation", Journal of Programming Languages, vol. 4, no. 3, 1996, pp. 143-167

• Lawrie et al (2006) Lawrie D Morrell C Feild H and Binkley D • Lawrie et al. (2006) Lawrie, D., Morrell, C., Feild, H., and Binkley, D., "What's in a Name? A Study of Identifiers", in Proc. of IEEE ICPC'06, June 14-16 2006, pp. 3-12

• Binkley et al. (2009) Binkley, D., Davis, M., Lawrie, D., and Morrell, C., y ( ) y, , , , , , , ,"To CamelCase or Under_score", in Proc. of IEEE ICPC'09, May 17-19 2009, pp. 158-167

• Enslen et al. (2009) Enslen, E., Hill, E., Pollock, L., and Vijay-Shanker, K., "Mi i S C d A i ll S li Id ifi f S f "Mining Source Code to Automatically Split Identifiers for Software Analysis", in Proc. of IEEE MSR'09, May 16-17 2009, pp. 71-80

• Guerrouj et al. (2011) Guerrouj, L., Di Penta, M., Antoniol, G., and Guéhéneuc Y -G "TIDIER: An Identifier Splitting Approach using Speech Guéhéneuc, Y. G., TIDIER: An Identifier Splitting Approach using Speech Recognition Techniques", JSME, vol. to appear, 2011

64