Potential Biases in Bug Localization: Do They Matter?
Pavneet Singh Kochhar, Yuan Tian, David Lo
Singapore Management University
{kochharps.2012, yuan.tian.2012,davidlo}@smu.edu.sg
Issue Tracking
• Projects use issue tracking systems like JIRA
• Well-known projects receive a large number of issue reports
• The large number of bug reports can overwhelm developers
• Mozilla developer - “Everyday, almost 300 bugs appear that need triaging. This is far too much for only the Mozilla programmers to handle” *
What have researchers proposed to overcome this issue?
* J. Anvik, L. Hiew, and G. C. Murphy, “Coping with an open bug repository,” in ETX, pp. 35–39, 2005
2/25
Bug Localization
GOAL: Given a bug report, find the buggy files among thousands of source code files
3/25
How Bug Localization Works
• Uses fixed/closed bug reports
• Uses standard information retrieval (IR) techniques such as Vector space model (VSM)
• Computes similarity between bug reports & source code
• Returns a ranked list of potentially buggy source code files
• The returned list is compared with the actual buggy files to compute accuracy
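A minimal sketch of the VSM step just described, assuming source-file contents are already loaded into a dict; the function name is illustrative, and scikit-learn's TF-IDF vectorizer stands in for whatever preprocessing the surveyed techniques actually use:

```python
# Sketch: rank source files by textual similarity to a bug report using a
# TF-IDF vector space model (VSM) and cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_files(bug_report_text, source_files):
    """source_files: dict mapping file path -> file content (assumed given)."""
    paths = list(source_files)
    vectorizer = TfidfVectorizer(lowercase=True, token_pattern=r"[A-Za-z_]\w+")
    # Build the vector space from the source corpus, then project the report into it.
    file_vectors = vectorizer.fit_transform(source_files[p] for p in paths)
    report_vector = vectorizer.transform([bug_report_text])
    scores = cosine_similarity(report_vector, file_vectors)[0]
    # Most similar (most likely buggy) files first.
    return sorted(zip(paths, scores), key=lambda pair: pair[1], reverse=True)
```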
4/25
Issues in Bug Localization
HOWEVER
What if bug localization results are biased?
• A past study* shows:
  • Up to 80% of the bug reports can be localized by inspecting 5 source code files
  • Results are promising
* Improving bug localization using structured information retrieval, R. K. Saha, M. Lease, S. Khurshid, and D. E. Perry, ASE 2013
5/25
Our Study
Potential Biases in Bug Localization
1. Wrongly Classified Reports – Herzig et al.*: 1/3 of reports marked as bugs are not bugs
2. Already Localized Reports
3. Incorrect Ground Truth Files – Kawrykow et al.+: many changes are non-essential

* It’s not a Bug, it’s a Feature: How Misclassification Impacts Bug Prediction, K. Herzig, S. Just, A. Zeller, ICSE 2013
+ Non-essential changes in version histories, D. Kawrykow and M. P. Robillard, ICSE 2011
6/25
Dataset
Project | Organization | Tracker | Number of Issue Reports
HTTPClient | Apache | JIRA | 746
Jackrabbit | Apache | JIRA | 2402
Lucene-Java | Apache | JIRA | 2443
Total = 5591 issue reports *
* It’s not a Bug, it’s a Feature: How Misclassification Impacts Bug Prediction, K. Herzig, S. Just, A. Zeller, ICSE 2013
8/25
Evaluation Metric
Average Precision (AP) – for a single bug report, the average of the precision values computed at each position in the ranked list where a buggy file appears
Mean Average Precision (MAP) – mean of the average precisions over all ranked lists (one per bug report)
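A standard formulation of these metrics (the notation below is assumed, not taken from the slides):

```latex
AP(q) = \frac{\sum_{k=1}^{M} P(k) \cdot pos(k)}{\text{number of buggy files for } q}
\qquad
MAP = \frac{1}{|Q|} \sum_{q \in Q} AP(q)
```

where M is the number of files in the ranked list for bug report q, P(k) is the precision at cut-off k, pos(k) is 1 if the file at rank k is actually buggy and 0 otherwise, and Q is the set of bug reports.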
9/25
BIAS 1– Report Misclassification
Mean Average Precision (MAP) Scores

Project | Reported | Actual | Difference | Cohen's d
HTTPClient | 0.429 | 0.419 | -2.33% | 0.13
Jackrabbit | 0.302 | 0.339 | 12.25%* | 0.06
Lucene-Java | 0.301 | 0.322 | 6.98% | 0.04

Difference of -2.33% to 12.25% between MAP scores
* Statistically significant difference (Mann-Whitney-Wilcoxon test)
Effect sizes are trivial (d < 0.2)
10/25
BIAS 1– Report Misclassification
Mean Average Precision (MAP) Scores

Actual to Reported | HC | JB | LJ | Overall
None | 0.429 | 0.302 | 0.301 | 0.312
RFE to BUG | 0.427 | 0.303 | 0.304 | 0.313
DOCUMENTATION to BUG | 0.430 | 0.304 | 0.305 | 0.315
IMPROVEMENT to BUG | 0.416 | 0.299 | 0.295 | 0.307
REFACTORING to BUG | 0.428 | 0.301 | 0.301 | 0.311
BACKPORT to BUG | 0.430 | 0.303 | 0.300 | 0.313
CLEANUP to BUG | 0.429 | 0.303 | 0.303 | 0.314
SPEC to BUG | 0.435 | 0.302 | 0.301 | 0.312
TASK to BUG | 0.432 | 0.302 | 0.301 | 0.312
TEST to BUG | 0.429 | 0.328 | 0.313 | 0.334
BUILD_SYSTEM to BUG | 0.429 | 0.306 | 0.303 | 0.315
DESIGN_DEFECT to BUG | 0.424 | 0.301 | 0.301 | 0.311
OTHERS to BUG | 0.439 | 0.303 | 0.301 | 0.313

* HC – HTTPClient, JB – Jackrabbit, LJ – Lucene-Java
11/25
BIAS 1– Report Misclassification
Results:
• Significantly impacts bug localization results for 1 out of 3 projects
• However, effect sizes are negligible, i.e., d < 0.2
12/25
BIAS 2– Localized Bug Reports
Categories

Category | Description
Fully | All the buggy files are mentioned in the bug report
Partially | Some of the buggy files are specified in the bug report
Not | Bug reports do not specify any buggy files

Fully Localized Report (Example)
Summary: DecompressingEntity not calling close on InputStream retrieved by getContent
Description: The method DecompressingEntity.writeTo(OutputStream outstream) does not close the InputStream retrieved by getContent().
Buggy Files: DecompressingEntity.java
13/25
BIAS 2– Localized Bug Reports
Manually Identifying Localized Reports
5591 issue reports → 1191 bug reports (Herzig et al.*) → randomly selected 350 → inspected files changed and the summary & description → classified bug reports
* It’s not a Bug, it’s a Feature: How Misclassification Impacts Bug Prediction, K. Herzig, S. Just, A. Zeller, ICSE 2013
14/25
BIAS 2– Localized Bug Reports
Automatically Identifying Localized Reports

Based on the manual investigation, build an algorithm to automatically classify bug reports:
• Input – summary/description of a bug report & files changed to fix the bug
• Output – the bug report classified into one of the three categories
(a minimal sketch of such a classifier follows below)
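A minimal, hypothetical sketch of such a classifier, assuming each bug report's summary, description, and the files changed by its fix are available; the function name and the file-name matching heuristic are illustrative, not the paper's exact algorithm:

```python
import re

def localization_category(summary, description, changed_files):
    """Classify a bug report as 'fully', 'partially', or 'not' localized,
    based on which of its fixed files are named in the report text."""
    text = f"{summary} {description}".lower()
    mentioned = set()
    for path in changed_files:
        name = path.rsplit("/", 1)[-1]          # e.g. DecompressingEntity.java
        stem = name.rsplit(".", 1)[0].lower()   # e.g. decompressingentity
        if name.lower() in text or re.search(rf"\b{re.escape(stem)}\b", text):
            mentioned.add(path)
    if not mentioned:
        return "not"
    return "fully" if mentioned == set(changed_files) else "partially"

# Example: the fully localized HTTPClient report from the previous slide.
print(localization_category(
    "DecompressingEntity not calling close on InputStream retrieved by getContent",
    "The method DecompressingEntity.writeTo(OutputStream outstream) does not "
    "close the InputStream retrieved by getContent().",
    ["DecompressingEntity.java"],
))  # -> fully
```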
15/25
BIAS 2– Localized Bug Reports
Number / Proportion

Project | Category | Number | Proportion
HTTPClient | Fully | 36 | 3.02%
HTTPClient | Partially | 28 | 2.35%
HTTPClient | Not | 35 | 2.93%
Jackrabbit | Fully | 299 | 25.10%
Jackrabbit | Partially | 132 | 11.08%
Jackrabbit | Not | 402 | 33.75%
Lucene-Java | Fully | 63 | 5.28%
Lucene-Java | Partially | 87 | 7.30%
Lucene-Java | Not | 109 | 9.15%

Overall, 33.41% are fully localized
More than 50% are fully or partially localized
16/25
BIAS 2– Localized Bug Reports
Mean Average Precision (MAP) Scores

Project | Fully | Partially | Not
HTTPClient | 0.615 | 0.349 | 0.250
Jackrabbit | 0.560 | 0.373 | 0.187
Lucene-Java | 0.527 | 0.338 | 0.197

Difference between Fully & Not: HTTPClient – 84.39%, Jackrabbit – 99.86%, Lucene-Java – 91.16%
17/25
BIAS 2– Localized Bug Reports
Comparison – Fully vs. Partially vs. Not

Project | Fully vs. Partially (p, d, effect size) | Partially vs. Not (p, d, effect size) | Fully vs. Not (p, d, effect size)
HTTPClient | * 0.94 Large | * 0.53 Medium | * 1.27 Large
Jackrabbit | * 0.56 Medium | * 0.55 Medium | * 1.14 Large
Lucene-Java | * 0.53 Medium | * 0.41 Small | * 1.04 Large

* Significant differences (p-value < 0.05)
Effect sizes between Fully & Not are LARGE
18/25
BIAS 2– Localized Bug Reports
Best & Worst Bug Reports

Project | Group | Fully | Partially | Not | p-value
HTTPClient | Upper | 16 | 5 | 4 | 0.0041*
HTTPClient | Lower | 6 | 4 | 15 |
Jackrabbit | Upper | 35 | 9 | 6 | 2.807e-13*
Jackrabbit | Lower | 7 | 1 | 42 |
Lucene-Java | Upper | 22 | 18 | 10 | 8.724e-05*
Lucene-Java | Lower | 5 | 18 | 27 |

* Significant differences (p-value < 0.05)
19/25
BIAS 2– Localized Bug Reports
Results:
• More than 50% of bugs are either fully or partially localized
• MAP scores for fully & partially localized reports are much higher than for not localized reports
• Effect sizes between fully & not localized are LARGE
20/25
BIAS 3– Non-Buggy Files
Manual Investigation
Randomly selected 100 not localized bug reports → files changed to fix these bugs → diff between original & modified file → non-buggy = cosmetic changes, refactorings, etc. → clean GROUND TRUTH files
(a minimal sketch of this filtering idea follows below)
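A crude sketch of the filtering idea referenced above, assuming the before/after contents of each ground-truth file are available as lists of lines; it only catches whitespace- and //-comment-only edits, which is far simpler than Kawrykow and Robillard's tool:

```python
import re

def _normalize(line):
    """Strip //-style comments and all whitespace so cosmetic edits disappear."""
    line = re.sub(r"//.*$", "", line)
    return re.sub(r"\s+", "", line)

def is_non_buggy_change(original_lines, modified_lines):
    """True if both versions are identical after normalization, i.e. the fix
    only touched whitespace or comments (a non-essential change)."""
    normalize = lambda lines: [l for l in map(_normalize, lines) if l]
    return normalize(original_lines) == normalize(modified_lines)

# Example: a whitespace-only edit is flagged as non-buggy.
before = ["int x=1;", "return x;"]
after  = ["int x = 1;", "return x;"]
print(is_non_buggy_change(before, after))  # -> True
```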
21/25
BIAS 3– Non-Buggy Files
Example
22/25
BIAS 3– Non-Buggy Files
Mean Average Precision (MAP) Scores

Project | Dirty | Clean | Difference | d
HTTPClient | 0.207 | 0.171 | 0.036 | 0.08
Jackrabbit | 0.115 | 0.115 | 0.000 | 0.08
Lucene-Java | 0.271 | 0.239 | 0.032 | 0.17

Differences are not significant
Effect sizes are trivial (d < 0.2)
23/25
BIAS 3– Non-Buggy Files
Results:
• 28.11% of the files in the ground truth are non-buggy
• Differences between MAP scores are not significant
• Effect sizes are negligible, i.e., d < 0.2
24/25
Conclusion
BIAS 1 – Wrongly classified issue reports: NOT statistically significant, NO substantial impact
BIAS 2 – Localized bug reports: statistically significant, substantial impact
BIAS 3 – Non-buggy files: NOT statistically significant, NO substantial impact
25/25
Thank You!
Email: [email protected]
Other Evaluation Metrics
HIT@N : Percentage of bug reports with at least one buggy file in top N ranked results
Mean Reciprocal Rank (MRR): the reciprocal rank is the inverse of the rank of the 1st buggy file; MRR is the average of the reciprocal ranks over all bug reports (a small sketch computing both metrics follows below)
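A small sketch computing both metrics, assuming each bug report is represented as a (ranked file list, set of ground-truth buggy files) pair; the helper names are illustrative:

```python
def hit_at_n(results, n):
    """HIT@N: fraction of reports with at least one buggy file in the top n."""
    hits = sum(1 for ranked, buggy in results
               if any(f in buggy for f in ranked[:n]))
    return hits / len(results)

def mean_reciprocal_rank(results):
    """MRR: average over reports of 1 / rank of the first buggy file."""
    total = 0.0
    for ranked, buggy in results:
        for rank, f in enumerate(ranked, start=1):
            if f in buggy:
                total += 1.0 / rank
                break
    return total / len(results)

# Example: one report whose first buggy file sits at rank 2.
results = [(["A.java", "B.java", "C.java"], {"B.java"})]
print(hit_at_n(results, 1), mean_reciprocal_rank(results))  # -> 0.0 0.5
```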
Backup slides (tables omitted): Mean Reciprocal Rank (MRR) scores for BIAS 1 – Report Misclassification, BIAS 2 – Localized Bug Reports, BIAS 3 – Non-Buggy Files, and all three biases combined
Appendix (Statistical Analysis)
• Mann-Whitney-Wilcoxon (MWW) test: given a significance level α = 0.05, if the p-value < α, the test rejects the null hypothesis (a minimal sketch of this analysis follows below)
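A minimal sketch of that comparison, assuming two lists of per-report average-precision scores; SciPy's Mann-Whitney U test is used, and Cohen's d is computed with a pooled standard deviation (function and variable names are illustrative):

```python
# Sketch: compare two groups of average-precision (AP) scores, as in the slides.
import numpy as np
from scipy.stats import mannwhitneyu

def compare_ap_scores(ap_a, ap_b, alpha=0.05):
    ap_a, ap_b = np.asarray(ap_a, float), np.asarray(ap_b, float)
    # Mann-Whitney-Wilcoxon test: reject the null hypothesis if p-value < alpha.
    _, p_value = mannwhitneyu(ap_a, ap_b, alternative="two-sided")
    # Cohen's d with a pooled standard deviation (trivial effect if |d| < 0.2).
    pooled_sd = np.sqrt(((len(ap_a) - 1) * ap_a.var(ddof=1) +
                         (len(ap_b) - 1) * ap_b.var(ddof=1)) /
                        (len(ap_a) + len(ap_b) - 2))
    d = (ap_a.mean() - ap_b.mean()) / pooled_sd
    return p_value < alpha, p_value, d

# Example with made-up scores: returns (significant?, p-value, Cohen's d).
print(compare_ap_scores([0.42, 0.38, 0.51, 0.44], [0.30, 0.27, 0.35, 0.31]))
```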
Appendix (BIAS-1 Results)

Actual to Reported | HC | JB | LJ | Overall
None | 0.429 | 0.302 | 0.301 | 0.312
RFE to BUG | 0.427 | 0.303 | 0.304 | 0.313
DOCUMENTATION to BUG | 0.430 | 0.304 | 0.305 | 0.315
IMPROVEMENT to BUG | 0.416 | 0.299 | 0.295 | 0.307
REFACTORING to BUG | 0.428 | 0.301 | 0.301 | 0.311
BACKPORT to BUG | 0.430 | 0.303 | 0.300 | 0.313
CLEANUP to BUG | 0.429 | 0.303 | 0.303 | 0.314
SPEC to BUG | 0.435 | 0.302 | 0.301 | 0.312
TASK to BUG | 0.432 | 0.302 | 0.301 | 0.312
TEST to BUG | 0.429 | 0.328 | 0.313 | 0.334
BUILD_SYSTEM to BUG | 0.429 | 0.306 | 0.303 | 0.315
DESIGN_DEFECT to BUG | 0.424 | 0.301 | 0.301 | 0.311
OTHERS to BUG | 0.439 | 0.303 | 0.301 | 0.313