ERA - A Comparison of Stemmers on Source Code Identifiers for Software Search


DESCRIPTION

Paper: A Comparison of Stemmers on Source Code Identifiers for Software Search
Authors: Andrew Wiese, Valerie Ho, Emily Hill
Session: ERA1 - Linguistic Analysis of Software Artifacts

TRANSCRIPT

Page 1: ERA - A Comparison of Stemmers on Source Code Identifiers for Software Search

A Comparison of Stemmers on Source Code Identifiers for Software Search

Andrew Wiese, Valerie Ho, Emily Hill
Montclair State University

Thursday, October 6, 2011

Page 2: ERA - A Comparison of Stemmers on Source Code Identifiers for Software Search

Problem: Source Code Search

• Challenge: query words may not exactly match source code words, which can hurt search

• Example: “add item” query should match

• add, adds, adding, added

• item, items

• Stemming is used by Information Retrieval (IR) systems to strip suffixes

• reduce all words to a root form, or stem

• a.k.a. word conflation (see the sketch below)
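A minimal sketch of this conflation, assuming NLTK's PorterStemmer (the slides name no particular library): stem both the query words and the identifier words, then match on stems.

```python
# Hedged sketch: match query words against identifier words by stem.
# Assumes NLTK's PorterStemmer; illustrative, not the authors' tooling.
from nltk.stem import PorterStemmer

stem = PorterStemmer().stem

query_words = ["add", "item"]
identifier_words = ["adds", "adding", "added", "items", "remove"]

query_stems = {stem(w) for w in query_words}
matches = [w for w in identifier_words if stem(w) in query_stems]
print(matches)  # ['adds', 'items'] -- Porter conflates adds->add and
                # items->item, but stems adding/added to 'ad', missing them
```

Note the miss on 'adding' and 'added': even with stemming, the choice of stemmer matters, which is exactly what this talk compares.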


Page 3: ERA - A Comparison of Stemmers on Source Code Identifiers for Software Search

What makes stemming source code different from traditional IR?

• Word choice more restrictive in naming identifiers than in natural language (NL) documents

• NL: stem, stems, stemmer, stemming, stemmed

• Code: stem, stemmer

• Classes that encapsulate actions have names with nominalized verbs:

• play → player

• compile → compiler

• Traditional IR prefers light stemmers like Porter’s

• tends not to stem across parts of speech

• E.g., noun ‘player’ will not stem to verb ‘play’ (see the check below)
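A quick check of this claim, assuming NLTK's implementations of the light stemmers (an assumption; the slides name no library):

```python
# Hedged check: light stemmers keep the nominalized noun 'player'
# separate from the verb 'play' (assuming NLTK; not the authors' code).
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

for word in ["play", "player"]:
    print(word, "->", porter.stem(word), snowball.stem(word))
# play   -> play   play
# player -> player player  (the noun is not reduced to the verb)
```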


Page 4: ERA - A Comparison of Stemmers on Source Code Identifiers for Software Search

Stemming Challenges

• Understemming

• stemmer assigns different stems to words in the same concept

• reduces number of relevant results in search (i.e., reduces recall)

• Overstemming

• stemmer assigns the same stem for words with different meanings (e.g., business conflated with busy, university with universe)

• increases number of irrelevant results (i.e., reduces precision)

• Stemmers are categorized by the type of error they tend to make (sketched below)

• Light stemmers: understem

• Heavy stemmers: overstem
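Both error types can be reproduced with a stock stemmer, assuming NLTK's PorterStemmer (business/busy and university/universe are the slide's own examples):

```python
# Hedged sketch of the two stemming error types (assuming NLTK's Porter).
from nltk.stem import PorterStemmer

stem = PorterStemmer().stem

# Overstemming: different meanings share a stem (more irrelevant
# results, i.e., lower precision)
print(stem("business"), stem("busy"))        # both -> 'busi'
print(stem("university"), stem("universe"))  # both -> 'univers'

# Understemming: related words get different stems (fewer relevant
# results, i.e., lower recall)
print(stem("add"), stem("adding"))           # 'add' vs 'ad'
```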


Page 5: ERA - A Comparison of Stemmers on Source Code Identifiers for Software Search

A Brief History of Stemming

• Light Stemmers (tend not to stem across parts of speech)

• Porter (1980): rule-based, simple & efficient

• Most popular stemmer in IR & SE

• Snowball (2001): minor rule improvements

• KStem (1993): morphology-based

• based on a word’s structure & a hand-tuned dictionary

• in experiments, shown to outperform Porter’s

• Heavy Stemmers

• Lovins (1968): rule-based

• Paice (1990): rule-based

• MStem: morphological (PC-Kimmo), specialized for source code using word frequencies


Page 6: ERA - A Comparison of Stemmers on Source Code Identifiers for Software Search

Our Contribution

• Compare performance of 5 stemmers on source code identifiers

• Evaluation 1: compare conflated word classes (sketched below)

• started from 100 most frequently occurring words in 9,000 open source Java programs

• analyzed by 2 human Java programmers in terms of accuracy & completeness

• Evaluation 2: compare effect of using 5 stemmers vs not stemming on 8 search tasks
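To make "conflated word classes" concrete: a stemmer partitions a vocabulary into classes of words that share a stem. A minimal sketch, assuming NLTK's Porter stemmer (illustrative, not the evaluation's actual tooling):

```python
# Build word classes: every word that maps to the same stem lands in
# one class. Hedged sketch assuming NLTK's PorterStemmer.
from collections import defaultdict
from nltk.stem import PorterStemmer

def word_classes(vocabulary, stem):
    classes = defaultdict(set)
    for word in vocabulary:
        classes[stem(word)].add(word)
    return classes

vocab = ["element", "elements", "elemental", "import", "imports", "important"]
for root, cls in sorted(word_classes(vocab, PorterStemmer().stem).items()):
    print(root, sorted(cls))
# element ['element', 'elemental', 'elements']
# import  ['import', 'important', 'imports']
```

The annotators then judged each class for accuracy (no unrelated words) and completeness (no missing related words), as defined on the next slide.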


Page 7: ERA - A Comparison of Stemmers on Source Code Identifiers for Software Search

Stemmer Word Classes Comparison

• accurate: word class contains no unrelated words

• complete: word class not missing related words (relies on the greediness & diversity of the stemmers)

• context sensitive (CS): multiple senses or disagreement

[Figure: table and bar chart of the number of word classes judged accurate & complete ("No. Accurate & Complete") per stemmer, across None, CS (context sensitive), Porter, Paice, Snowball, KStem, and MStem; visible bar labels: 58%, 53%, 37%, 32%, 29%.]


Page 8: ERA - A Comparison of Stemmers on Source Code Identifiers for Software Search

Word Classes Example

• Stemmer comparison for 2 examples

• Underlined words are in all stemmer classes

Table I
STEMMER WORD CLASS COMPARISONS FOR 4 EXAMPLES (UNDERLINED WORDS ARE IN THE WORD CLASSES FOR ALL STEMMERS)

Word (A & C stemmer)   Stemmer    Word Class

element (MStem)        Porter     element, elemental, elemente, elements
                       Snowball   element, elemental, elemente, elements
                       KStem      element
                       MStem      element, elemental, elements
                       Paice      el, ela, ele, element, elemental, elementary, elemente, elementen, elements, elen, eles, eli, elif, elise, elist, ell, elle, ellen, eller, els, else, elseif, elses, elsif

import (KStem)         Porter     import, importable, importance, important, imported, importer, importers, importing, imports
                       Snowball   import, importable, importance, important, importantly, imported, importer, importers, importing, imports
                       KStem      import, importable, imported, importer, importers, importing, imports
                       MStem      import, importable, importance, important, importantly, imported, importer, importers, importing, imports
                       Paice      import, importable, importance, important, importantly, importar, imported, importer, importers, importing, imports

add (CS)               Porter     add, adde, addes, adds
                       Snowball   add, adde, addes, adds
                       KStem      add, addable, added, addes, adding, adds
                       MStem      add, addable, added, adder, adding, addition, additional, additionally, additions, additive, additivity, adds
                       Paice      ad, ada, add, addable, adde, added, adder, addes, adding, adds, ade, ads

name (None)            Porter     name, named, namely, names, naming
                       Snowball   name, named, namely, names, naming
                       KStem      name, nameable, named, namer, names, naming
                       MStem      name, named, nameless, namely, namer, names, naming, surname
                       Paice      name, nameable, namely, names

In contrast, the sense of ‘add’ being used to join something to a list is not typically related to ‘addition’. The word classes for ‘add’, as well as 3 other examples, are shown in Table I.

Overall, the annotators found the morphological parsers MStem and KStem to be the most accurate. The results of these two subjects indicate that morphology may be more important than the degree of under- or overstemming, since MStem is a heavy stemmer and KStem light. MStem was the only accurate and complete stemmer for 12 of the words, whereas KStem was accurate and complete for 11. In contrast, the rule-based stemmers Porter and Snowball were uniquely accurate and complete for 2 words, and Paice for 6. Of the rule-based stemmers, light Snowball has a clear advantage over light Porter and heavy Paice overall.

As expected with heavy stemmers, MStem and Paice both tend to overstem, although for different reasons. MStem frequently stems across different parts of speech, which generally leads to increased completeness. However, occasionally this tendency conflates words that do not represent the same concept, as in conflating the adjective ‘true’ with the adverb ‘truly’ and the noun ‘truth’. In contrast, Paice frequently conflates unrelated words, such as ‘element’ with ‘else’ and ‘static’ with ‘state’, ‘statement’, ‘station’, ‘stationary’, ‘statistic’, and ‘status’.

The annotators observed a difference between the morphological stemmers (MStem and KStem) and the rule-based stemmers (Porter, Paice, and Snowball), which frequently and inaccurately associated non-words or foreign-language words. For example, all 3 rule-based stemmers conflated ‘method’ with French ‘methode’ and ‘methodes’; ‘panel’ with Spanish ‘paneles’; and ‘any’ with the non-word ‘anys’ and, in the case of Porter and Snowball, ‘ani’. MStem and KStem were less prone to these errors because MStem uses word frequencies to eliminate unlikely stems, and KStem uses an English dictionary.

C. Threats to Validity

Because the words were selected exclusively from Java programs, these results may not generalize to all programming languages. MStem was trained on the same set of 9,000+ Java programs that were used to create the 100 most frequent word set annotated by the human evaluators. Due to the large size of the entire word set (over 700,000 words), it is unlikely that MStem was over-trained on the subset of 100 words. Since completeness is based on the union of word classes created by the stemmers, the observations may not generalize to all morphological and rule-based stemmers. Because determining accuracy and completeness can be ambiguous, we limited this threat by separating out the ‘context sensitive’ examples in our analysis.

III. EFFECT OF STEMMING ON SOURCE CODE SEARCH

In this section, we compare the effect of using Porter, Snowball, KStem, Paice, and MStem with no stemming (None) on searching source code.

A. Study Design

To compare the effect of stemming on software search, we use the common tf-idf scoring function [9] to score a method’s relevance to the query. Tf-idf multiplies two component scores together: term frequency (tf) and inverse document frequency (idf). The intuition behind tf is that the more frequently a word occurs in a method, the more relevant the method is to the query. In contrast, idf dampens the tf by how frequently the word occurs in the code base. Because we recalculate idf values for each program and stemmer combination, the tf-idf scores can vary widely between heavy and light stemmers.
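A minimal sketch of this scoring, assuming stemmed bags of words per method (names and structure here are illustrative, not the authors' implementation):

```python
# Hedged tf-idf sketch: a method's relevance to a query is the sum over
# query stems of term frequency (tf) times inverse document frequency (idf).
import math
from collections import Counter

def tfidf_scores(query_stems, methods):
    """methods: dict mapping method name -> list of stemmed words."""
    n = len(methods)
    # document frequency: how many methods contain each stem
    df = Counter(s for words in methods.values() for s in set(words))
    # idf dampens stems that occur throughout the code base
    idf = {s: math.log(n / df[s]) for s in df}
    scores = {}
    for name, words in methods.items():
        tf = Counter(words)  # term frequency within this method
        scores[name] = sum(tf[s] * idf.get(s, 0.0) for s in query_stems)
    return scores
```

Recomputing idf per stemmer, as described above, is what lets heavy and light stemmers produce widely different scores for the same query.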

We use 8 of 9 concerns and queries from a previous source code search study of 18 developers [10]. For one concern, no subject was able to formulate a query returning any relevant results, leaving us with 8 concerns. For each concern, 6 developers formulated queries, totaling 48 queries, 29 of which are unique. The concerns are mapped at the method level.



Page 9: ERA - A Comparison of Stemmers on Source Code Identifiers for Software Search

Stemming and Source Code Search

• search technique: tf-idf (usage sketch after this list)

• search tasks: 8, with 48 queries from a prior study [Shepherd et al. ’07]

• Paice: overstemming & understemming mistakes improved results for 2 tasks (e.g., textfield report element)
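To tie the earlier sketches together, one can score a toy code base with and without Porter stemming (purely illustrative data, reusing the hypothetical tfidf_scores function from the study-design sketch):

```python
# Usage sketch: the same query ranked with and without stemming
# (assumes NLTK; paste in the tfidf_scores definition sketched earlier).
from nltk.stem import PorterStemmer

stem = PorterStemmer().stem

methods_raw = {
    "Cart.addItem":  ["add", "item"],
    "Cart.addItems": ["added", "items"],
    "Cart.clear":    ["clear", "cart"],
}
query = ["add", "item"]

unstemmed = tfidf_scores(query, methods_raw)
stemmed = tfidf_scores(
    [stem(w) for w in query],
    {m: [stem(w) for w in ws] for m, ws in methods_raw.items()},
)
print(unstemmed)  # only Cart.addItem matches the query exactly
print(stemmed)    # Cart.addItems now also scores > 0 via items->item,
                  # though added->'ad' still misses the query stem 'add'
```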

[Figure: "Area Under the Curve" for the search tasks, plotted per stemmer (NoStem, Porter, Snowbl, KStem, MStem, Paice); y-axis from 0.5 to 1.0.]


Page 10: ERA - A Comparison of Stemmers on Source Code Identifiers for Software Search

Conclusion

• Morphological stemmers appear to be more accurate & complete than rule-based stemmers

• In search, stemming more consistently produces relevant results than not stemming

• Heavy stemmers like MStem & Paice appear to be more effective in searching source code than light stemmers like Porter

• Future work: more examples (less frequent & more domain-specific), more human judgements, more search tasks, other SE tasks beyond search
