
Page 1: Is Text Search an Effective Approach for Fault Localization: A Practitioner's Perspective

Vibha Singhal Sinha, Senthil Mani and Debdoot Mukherjee, IBM Research - India

23rd October 2012, SPLASH-Wavefront, Tucson, AZ, USA

Page 2:

Page 3:

Can Text Search help in Debugging?

Page 4:

1. Search within past bug reports
• Find similar bug reports and identify the patches linked to them

2. Search within source code
• Search comments, method names, variable names, etc. to identify code regions with high text overlap

Page 5:

No dependence on program size, programming language, type of fault, or the presence of passing and failing test inputs, unlike existing program-analysis-based approaches:
• Program slicing
• Statistical debugging / spectra-based techniques
• Delta debugging / mutation-based approaches

Can be readily applied to jumpstart debugging. Possible tactic: identify a small set of files with text search and feed that as input to a program-analysis-based technique to localize to a set of lines.

Page 6:

• IR systems have been proposed in different areas of software maintenance to recommend relevant artifacts in the context of developer tasks: Hipikat, Lassie, DebugAdvisor.

• The efficacy of different language models has been evaluated for fault localization (Rao et al., Marcus et al., Cleary et al.): Vector Space Model, Latent Semantic Indexing, Latent Dirichlet Allocation, Cluster-Based Decision Making. Rao and Kak suggest that IR-based bug localization is at least as effective as static and dynamic analysis techniques.

• Enslen proposed identifier splitting to increase vocabulary overlap between bug reports and the code base. E.g., the code word TextFieldTool is split into three words: text, field, tool (a minimal splitter is sketched below).
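
A minimal sketch of such a splitter. It handles only camel-case and separator boundaries; the mining-based scoring of the original technique is omitted:

    import re

    def split_identifier(identifier):
        """Split a code identifier into lowercase words, e.g.
        TextFieldTool -> ['text', 'field', 'tool']."""
        words = []
        for part in re.split(r'[_\W]+', identifier):  # underscores, punctuation
            # Break camel case, keeping acronym runs together
            # (e.g. parseXMLFile -> parse, XML, File).
            words += re.findall(r'[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+', part)
        return [w.lower() for w in words]

    print(split_identifier("TextFieldTool"))  # ['text', 'field', 'tool']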

Page 7:

[Architecture diagram]

• Index Creator builds three indices: a Bug Index (BI) from the repository of past resolved bugs, a Code Index (CI) from the code repository, and a Meta Index (MI) from the code repository processed through identifier splitting. These are the indexing strategies.
• Query Creator turns an incoming bug into a search query. Querying strategies: collate title and description (A); boost the weight of title words; boost the weight of code words.
• Search Module fires the query on the created indices; Results Collator returns the search results as a ranked list of files.
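
A toy sketch of this pipeline, using scikit-learn's TF-IDF vectorizer in place of a production search engine; the corpus and bug text are made-up examples:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Toy "Code Index": one document per source file, built from its
    # comments, method names, variable names, etc.
    corpus = {
        "ui/TextFieldTool.java": "text field tool render caret blink",
        "io/FileLoader.java":    "load file stream buffer close",
        "ui/Button.java":        "button click listener label paint",
    }

    def rank_files(bug_title, bug_description, top_k=3):
        """Collate title and description into one query (strategy A)
        and return files ranked by cosine similarity to the query."""
        vectorizer = TfidfVectorizer()
        doc_matrix = vectorizer.fit_transform(corpus.values())
        query_vec = vectorizer.transform([bug_title + " " + bug_description])
        scores = cosine_similarity(query_vec, doc_matrix)[0]
        return sorted(zip(corpus, scores), key=lambda p: -p[1])[:top_k]

    print(rank_files("caret does not blink",
                     "text field caret stops blinking on focus"))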

Page 8:

RQ1: How do the following search approaches compare in terms of efficacy? Are they any better than chance?
• Search on past bug reports – Bug Index (BI)
• Search on the code repository – Code Index (CI)
• Search on the processed code repository – Meta Index (MI)

RQ2: Can we combine them to increase efficacy?

RQ3: How do different features of the source code and the bugs available in a project impact the effectiveness of search?

Page 9:

• 4 open-source subjects: BIRT and Datatools (Eclipse); Derby and Hadoop (Apache).

• Linking bug reports to change-sets: mined from references to bug IDs in commit comments, and traced through JIRA links (a sketch of the mining step follows).

• The test set contains the bug reports with at least one source file associated with them: 1,177 bugs in total, 35% of the bugs in the chosen releases and 3–4% of the bug repositories.
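
A minimal sketch of the mining step; the ID patterns (Bugzilla-style "bug 12345" and JIRA keys such as DERBY-1234) are illustrative assumptions:

    import re

    BUG_ID = re.compile(r'\bbug\s*#?\s*(\d+)\b|\b([A-Z]+-\d+)\b')

    def link_commits_to_bugs(commits):
        """commits: iterable of (sha, message, changed_files) tuples.
        Returns a dict mapping bug id -> set of files changed for it."""
        links = {}
        for sha, message, files in commits:
            for match in BUG_ID.finditer(message):
                bug_id = match.group(1) or match.group(2)
                links.setdefault(bug_id, set()).update(files)
        return links

    commits = [("a1b2c3", "Fix caret repaint, see DERBY-1234",
                ["ui/TextFieldTool.java"])]
    print(link_commits_to_bugs(commits))
    # {'DERBY-1234': {'ui/TextFieldTool.java'}}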

Page 10:

• Average precision, recall and F1-score: each bug in the test set is taken as a query; we calculate precision, recall and F1-score for it and then average across the test set.

• Bug coverage: the percentage of bugs in the test set for which at least one file in the search's recommendation set matches the ground truth.

A sketch of both metrics follows.
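
    def precision_recall_f1(recommended, ground_truth):
        """recommended, ground_truth: sets of file paths for one bug."""
        hits = len(recommended & ground_truth)
        precision = hits / len(recommended) if recommended else 0.0
        recall = hits / len(ground_truth) if ground_truth else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    def bug_coverage(results):
        """results: list of (recommended, ground_truth) pairs, one per bug.
        Fraction of bugs with at least one correct recommendation."""
        return sum(1 for rec, truth in results if rec & truth) / len(results)

    rec = {"A.java", "B.java", "C.java"}     # hypothetical top-3 result set
    truth = {"B.java", "D.java"}             # files actually fixed
    print(precision_recall_f1(rec, truth))   # (0.333..., 0.5, 0.4)
    print(bug_coverage([(rec, truth)]))      # 1.0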

Page 11:

RQ1 results: How do the search approaches – BI, CI and MI – compare in terms of efficacy? Are they any better than chance?

Page 12:

[Charts: average precision, recall and F1-score vs. result-set size for CI:A, MI:A and BI:A]

• Recall increases much more slowly than precision drops, so the F1-score dips beyond a result-set size of 3.
• This suggests that search techniques may NOT help in identifying ALL the files that need to be fixed.

Page 13:

[Charts: bug coverage vs. result-set size for BIRT, Datatools, Derby and Hadoop]

• Bug coverage increases with result-set size.
• None of the techniques emerges as a clear winner.
• MI isn't any better than CI; sometimes it performs worse.
• Hadoop gives much better results than the other three subjects.

Page 14:

• Compare with the efficacy of a user who randomly selects source files from the code repository as the files to fix to resolve a bug.

• Think of the code repository as a bin of black and white balls, where the files that need a fix for the bug's resolution are the white balls and the rest are black balls. The hypergeometric distribution gives the probability of choosing white balls without replacement.

• The probability p of getting at least x files that require a fix by choosing k files at random from a repository of N files, W of which require a fix, is:

  p = \sum_{i=x}^{\min(k, W)} \binom{W}{i} \binom{N-W}{k-i} / \binom{N}{k}

• If p < 0.05, reject the null hypothesis that the search technique is no better than chance. Apply an FDR correction for multiple hypothesis testing.
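
A sketch of the significance test with SciPy; the Benjamini-Hochberg procedure is assumed for the FDR correction, which the slide does not name:

    from scipy.stats import hypergeom

    def p_at_least(x, N, W, k):
        """P(X >= x) for X ~ Hypergeometric(N files, W needing a fix, k drawn)."""
        return hypergeom.sf(x - 1, N, W, k)  # survival function P(X > x - 1)

    def benjamini_hochberg(pvalues, alpha=0.05):
        """Return booleans marking which null hypotheses are rejected."""
        m = len(pvalues)
        order = sorted(range(m), key=lambda i: pvalues[i])
        # Largest rank r with p_(r) <= r * alpha / m; reject ranks 1..r.
        max_r = 0
        for r, i in enumerate(order, start=1):
            if pvalues[i] <= r * alpha / m:
                max_r = r
        reject = [False] * m
        for r, i in enumerate(order, start=1):
            reject[i] = r <= max_r
        return reject

    # One correct file among 5 recommendations, from a repository of
    # 1000 files of which 3 need fixing:
    print(p_at_least(1, N=1000, W=3, k=5))  # ~0.0149, better than chance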

Page 15:

• Even if only one correct result is returned for a bug, the result is usually significant.

• Datatools has many queries failing the FDR test: certain queries have a very large number of fixed files (e.g., 491 files across 2 bugs).

• We record the average result-set size at which the techniques break even with chance (p >= 0.05): it ranges from 66 files in Derby (MI:A) to 158 files in Datatools (CI:A).
Page 16:

RQ2 results: Can we combine the techniques (BI, CI, MI) to increase efficacy?

Page 17:

• Fleiss' kappa analysis measures the degree of agreement among the three techniques: each technique "rates" a bug Yes if it covers the bug, and No otherwise.

• The code-based techniques (CI and MI) agree closely with each other but differ considerably from the bug-report-based technique (BI).

• Can we combine bug-based and code-based search to get better results? (A kappa sketch follows.)
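
The agreement table below is a made-up example; the real one records, for each bug, how many of BI, CI and MI cover it:

    def fleiss_kappa(table):
        """table[i][j]: number of raters putting subject i in category j.
        Subjects are bugs, raters are the 3 techniques,
        categories are (covered, not covered)."""
        N = len(table)                 # number of bugs
        n = sum(table[0])              # raters per bug (3 techniques)
        k = len(table[0])              # number of categories (2)
        p_j = [sum(row[j] for row in table) / (N * n) for j in range(k)]
        P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in table]
        P_bar = sum(P_i) / N           # observed agreement
        P_e = sum(p * p for p in p_j)  # agreement expected by chance
        return (P_bar - P_e) / (1 - P_e)

    # 4 bugs: [2, 1] means two techniques cover the bug, one does not.
    print(round(fleiss_kappa([[3, 0], [2, 1], [0, 3], [1, 2]]), 3))  # 0.333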

Page 18:

Fire the same query on the 3 different indices and choose the top X search results using one of the following ranking schemes (sketched below):
• RankScore: rank using the absolute similarity scores returned by the search engine.
• NormScore: rank using a normalized similarity score – the fraction of the maximum score returned for the query.
• AggregateScore: rank on the basis of the sum of scores from the different techniques.
• Sample: pick the top 2*(X/5) search results from the results of BI:A and CI:A, and the remaining X/5 results from MI:A.
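
A sketch of the RankScore and NormScore schemes over made-up per-technique result lists; merging duplicate files by keeping their best score is an assumption the slide does not spell out:

    def rank_score(results_by_technique, top_x):
        """RankScore: pool results and rank by the absolute similarity
        score, keeping each file's best score across techniques."""
        pooled = {}
        for results in results_by_technique.values():
            for f, score in results.items():
                pooled[f] = max(pooled.get(f, 0.0), score)
        return sorted(pooled, key=pooled.get, reverse=True)[:top_x]

    def norm_score(results_by_technique, top_x):
        """NormScore: divide each technique's scores by its maximum
        for the query before pooling, to even out score scales."""
        normalized = {
            name: {f: s / max(results.values()) for f, s in results.items()}
            for name, results in results_by_technique.items()
        }
        return rank_score(normalized, top_x)

    results = {
        "BI:A": {"A.java": 8.1, "B.java": 2.0},   # note: different
        "CI:A": {"B.java": 0.9, "C.java": 0.7},   # score scales per
        "MI:A": {"C.java": 0.8, "D.java": 0.4},   # technique
    }
    print(rank_score(results, 3))  # BI's raw scores dominate
    print(norm_score(results, 3))  # scales evened out before ranking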

Page 19:

RankScore works better than the best of the individual techniques across all subjects. The improvement in bug coverage ranges from 1% to 46%.

Page 20:

RQ3 results: How do different features of the source code and the bugs available in a project impact the effectiveness of search?

Page 21:

• Since query sizes can become very large, there may be a need to artificially boost important words: TitleWords and CodeWords.

• TitleBoost helps improve bug coverage, except in Hadoop, where the fraction of title words that turn out to be significant is already high even without the boost.

[Charts: effect of TitleBoost on BI, CI and MI for each subject]

A sketch of title-word boosting follows.
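
Repeating title terms in a bag-of-words query is one simple realization; the boost factor of 3 is an arbitrary assumption (a search engine such as Lucene would use per-term boosts instead):

    import re

    def build_boosted_query(title, description, title_boost=3):
        """Collate title and description (strategy A), repeating the
        title words to raise their weight in the query."""
        tokenize = lambda s: re.findall(r"[a-z0-9]+", s.lower())
        return tokenize(title) * title_boost + tokenize(description)

    q = build_boosted_query("caret does not blink",
                            "the text field caret stops blinking on focus")
    print(q.count("caret"))  # 4: three boosted title copies + 1 from description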

Page 22:

• We compared the efficacy of techniques that directly search the code repository with those that search over past bug reports.
• No clear winner is observed; bug coverage ranges from 20% to 60% across the 4 subjects.
• The techniques are better than chance.
• Identifier splitting does not yield much benefit.
• The techniques are complementary: bug coverage improves by 1%–46% by combining them.
• Favoring title words helps in most cases.