Automatic Fine-Grained Issue Report Reclassification
Pavneet Singh Kochhar, Ferdian Thung, David Lo
Singapore Management University
{kochharps.2012, ferdiant.2013, davidlo}@smu.edu.sg
2/24
Misclassification of Issue Reports
Herzig et al.*
• 40% of issue reports are misclassified.
• 1/3 of issue reports are wrongly classified as bugs.

* It’s not a Bug, it’s a Feature: How Misclassification Impacts Bug Prediction, K. Herzig, S. Just, A. Zeller, ICSE 2013

(Example categories: BUG, DOCUMENTATION, IMPROVEMENT, REFACTORING, BACKPORT, CLEANUP, DESIGN DEFECT, TASK, TEST)
Impact of Misclassification
• Well-known projects receive a large number of issue reports.
• A large number of bug reports can overwhelm developers.
• Mozilla developer: “Everyday, almost 300 bugs appear that need triaging.” *
• Triaging is a manual process.
• Misclassified reports take more time to fix.+

* J. Anvik, L. Hiew, and G. C. Murphy, “Coping with an open bug repository,” in ETX, pp. 35–39, 2005.
+ X. Xia, D. Lo, M. Wen, E. Shihab, and B. Zhou, “An empirical study of bug report field reassignment,” in CSMR-WCRE, pp. 174–183, 2014.
3/24
Related Work
• Herzig et al. [1]
  • Manually classify over 7,000 issue reports
  • 14 different categories
  → We use the same dataset; we use 13 categories (merging UNKNOWN & OTHERS)
• Antoniol et al. [2]
  • Classify issue reports as either “bug” or “enhancement”
  → We consider the “reclassification” problem; we use 13 different categories

[1] It’s not a Bug, it’s a Feature: How Misclassification Impacts Bug Prediction, K. Herzig, S. Just, A. Zeller, ICSE 2013
[2] G. Antoniol, K. Ayari, M. D. Penta, F. Khomh, and Y.-G. Gueheneuc, “Is it a bug or an enhancement? a text-based approach to classify change requests,” in CASCON, pp. 23:304–23:318, 2008.
4/24
Our Study
Fine-Grained Issue Report Reclassification
13 Categories*
BUG, RFE (adaptive maintenance), IMPROVEMENT (perfective maintenance), DOCUMENTATION, TASK, BUILD, REFACTORING (e.g., removing duplicate methods), DESIGN DEFECT, TEST, CLEANUP (e.g., deallocating memory), BACKPORT, SPECIFICATION, OTHERS

* It’s not a Bug, it’s a Feature: How Misclassification Impacts Bug Prediction, K. Herzig, S. Just, A. Zeller, ICSE 2013

5/24
Overall Framework
Training Phase: Training Issue Reports + Ground Truth Categories* → Feature Extraction → Model Building → Model
Deployment Phase: New Issue Reports → Feature Extraction → Model → Predicted Reclassified Categories

* Ground truth categories from Herzig et al.
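The two phases above can be sketched end-to-end. This is a toy word-vote "model" standing in for the paper's SVM; `extract_features`, `train`, and `predict` are hypothetical names, and the real features include TF-IDF, reported category, exception traces, and the reporter:

```python
from collections import Counter, defaultdict

def extract_features(report):
    # Toy features: one indicator per word in the summary.
    return {f"word:{w}" for w in report["summary"].lower().split()}

def train(reports, labels):
    # Training phase: count, per feature, how often each label co-occurs.
    votes = defaultdict(Counter)
    for report, label in zip(reports, labels):
        for feat in extract_features(report):
            votes[feat][label] += 1
    return votes

def predict(model, report):
    # Deployment phase: tally label votes from the new report's features.
    tally = Counter()
    for feat in extract_features(report):
        tally.update(model.get(feat, {}))
    return tally.most_common(1)[0][0] if tally else "BUG"

model = train([{"summary": "NullPointerException on connect"},
               {"summary": "add proxy support"}],
              ["BUG", "RFE"])
print(predict(model, {"summary": "NullPointerException in pool"}))  # BUG
```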
6/24
Pre-Processing
• Text Pre-Processing
  • Summary & Description fields
• Stop-word removal
  • e.g., “is”, “are”, “if”
• Stemming (reducing words to their root form)
  • e.g., “reads” and “reading” → “read”
  • Use Porter Stemmer*
*http://tartarus.org/martin/PorterStemmer/
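The pre-processing steps can be sketched as below. The stop-word list is a deliberately tiny illustration, and `crude_stem` is a simplified suffix stripper, not the full Porter algorithm (which applies staged, condition-guarded rules):

```python
import re

# Tiny illustrative stop-word list (real pipelines use a fuller list).
STOP_WORDS = {"is", "are", "if", "the", "a", "an", "and", "of", "to", "in"}

def crude_stem(word):
    # Simplified suffix stripping -- NOT the full Porter stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Tokenize, drop stop words, then stem each remaining token.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The parser is reading and reads config files"))
```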
7/24
Feature Extraction
1. TF-IDF (TF: Term Frequency, IDF: Inverse Document Frequency)
2. Reported Category (C1–C13): Cn = 1 if the report was filed under category n, 0 otherwise (n = 1 to 13)
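The TF-IDF weight of a term in a report can be computed as below. This uses the plain tf × ln(N/df) formulation; the paper's exact weighting variant is an assumption:

```python
import math
from collections import Counter

# Pre-processed (stemmed, stop-word-free) issue summaries.
docs = [
    ["npe", "when", "pars", "header"],
    ["add", "support", "for", "proxy"],
    ["npe", "in", "connect", "pool"],
]

def tf_idf(term, doc, docs):
    # Term frequency: occurrences of term, normalized by document length.
    tf = Counter(doc)[term] / len(doc)
    # Inverse document frequency: rarer terms across docs score higher.
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df)
    return tf * idf

print(tf_idf("npe", docs[0], docs))  # "npe" appears in 2 of the 3 docs
```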
8/24
Feature Extraction
3. Exception Trace (S)
   a) Phrase: “Exception in thread”
   b) Regex: [A-Za-z0-9$.]+Exception
      e.g., java.lang.NullPointerException
   c) Regex: [A-Za-z0-9$.]+\([A-Za-z0-9]+\.java:[0-9]+\)
      e.g., oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:447)
4. Issue Reporter (R1–RM), where M is the total number of reporters
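The exception-trace patterns can be exercised as below; the regex escaping is reconstructed from the slide (the original rendering lost the backslashes), so treat the exact patterns as an assumption:

```python
import re

# Exception class name, e.g. java.lang.NullPointerException
EXC_NAME = re.compile(r"[A-Za-z0-9$.]+Exception")
# Stack-trace frame, e.g. pkg.Class.method(File.java:447)
STACK_FRAME = re.compile(r"[A-Za-z0-9$.]+\([A-Za-z0-9]+\.java:[0-9]+\)")

text = ("Request fails with java.lang.NullPointerException at "
        "oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:447)")

print(EXC_NAME.search(text).group())
print(STACK_FRAME.search(text).group())
```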
9/24
Model Building
• LibSVM (Support Vector Machine)*
  • Multi-class classification
• Inputs
  • L, a learner (training algorithm)
  • X, a set of training data, i.e., issue reports
  • y, where y ∈ {1, …, K}, the labels, i.e., the 13 categories
• Output
  • A list of classifiers f_k, for k ∈ {1, …, K}
• The classifiers are applied to unseen data to predict a label k
*http://www.csie.ntu.edu.tw/~cjlin/libsvm/

10/24
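The "list of classifiers, one per label" structure can be sketched as below. A toy nearest-centroid scorer stands in for the real SVM learner (LibSVM trains actual max-margin classifiers and handles the multi-class decomposition internally); the point is only the shape: train one scorer per label, predict with the best-scoring label.

```python
def train_per_label(X, y, labels):
    # One "classifier" (here: a class centroid) per label k.
    classifiers = {}
    for k in labels:
        pos = [x for x, lab in zip(X, y) if lab == k]
        classifiers[k] = [sum(col) / len(pos) for col in zip(*pos)]
    return classifiers

def predict(classifiers, x):
    # Score each label's classifier on x; return the best-scoring label.
    def score(centroid):
        return -sum((a - b) ** 2 for a, b in zip(centroid, x))
    return max(classifiers, key=lambda k: score(classifiers[k]))

X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # toy feature vectors
y = ["BUG", "RFE"][:1] * 2 + ["RFE"] * 2               # labels: BUG, BUG, RFE, RFE
clf = train_per_label(X, ["BUG", "BUG", "RFE", "RFE"], ["BUG", "RFE"])
print(predict(clf, [0.95, 0.05]))  # BUG
```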
Dataset
Projects     Organization  Tracker   Number of Issue Reports
HTTPClient   Apache        JIRA       746
Jackrabbit   Apache        JIRA      2402
Lucene-Java  Apache        JIRA      2443
Rhino        Mozilla       BugZilla  1226
Tomcat5      Apache        BugZilla   584

Total = 7401 Issue Reports*
* It’s not a Bug, it’s a Feature: How Misclassification Impacts Bug Prediction, K. Herzig, S. Just, A. Zeller, ICSE 2013
11/24
Evaluation Metrics
Precision (per category) = TP / (TP + FP)
Recall (per category) = TP / (TP + FN)
F-Measure = (2 × Precision × Recall) / (Precision + Recall)
Weighted F-Measure = Σ over categories c of (proportion of reports in c) × F-Measure(c)

We use Weighted Precision, Recall & F-Measure
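The weighted metrics can be computed as below: per-category F-measure, averaged with each category weighted by its share of the reports (a sketch; category names are illustrative):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    counts = Counter(y_true)
    total = len(y_true)
    wf1 = 0.0
    for c in counts:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        # Weight each category's F1 by its proportion of the ground truth.
        wf1 += (counts[c] / total) * f1
    return wf1

y_true = ["BUG", "BUG", "BUG", "RFE", "RFE", "DOC"]
y_pred = ["BUG", "BUG", "RFE", "RFE", "BUG", "DOC"]
print(round(weighted_f1(y_true, y_pred), 3))  # 0.667
```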
12/24
Baselines
• Baseline-1: predicts the reclassified category to be the same as the originally assigned category
• Baseline-2: predicts the reclassified category as “BUG” (the majority of issues are bugs)
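Both baselines are one-liners; sketched here with a hypothetical report dictionary:

```python
def baseline_1(report):
    # Predict the same category the reporter originally assigned.
    return report["assigned_category"]

def baseline_2(report):
    # Always predict the majority category.
    return "BUG"

report = {"assigned_category": "IMPROVEMENT"}
print(baseline_1(report), baseline_2(report))  # IMPROVEMENT BUG
```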
13/24
Research Questions
RQ1: Effectiveness of Our Approach
RQ2: Varying the Amount of Training Data
RQ3: Most Discriminative Features
RQ4: Analysis of Correctly & Wrongly Classified Issue Reports
RQ5: Comparison to Other Classification Algorithms
14/24
RQ1: Effectiveness of Our Approach

                  HTTPClient            Jackrabbit            Lucene-Java
                  Prec  Rec   WF1       Prec  Rec   WF1       Prec  Rec   WF1
Ours              0.61  0.63  0.60      0.71  0.72  0.71      0.63  0.62  0.63
Baseline-1        0.54  0.52  0.43      0.61  0.62  0.54      0.50  0.50  0.43
Baseline-2        0.16  0.40  0.23      0.15  0.39  0.21      0.08  0.28  0.12
Improvement-1 (%) 12.96 21.15 39.53     16.39 16.12 31.48     24.00 26.00 44.18
Improvement-2 (%) 281.2 57.4  160.8     373.3 84.6  238.0     675.0 125.0 416.6

                  Rhino                 Tomcat5
                  Prec  Rec   WF1       Prec  Rec   WF1
Ours              0.58  0.61  0.57      0.58  0.62  0.58
Baseline-1        0.35  0.57  0.43      0.36  0.58  0.45
Baseline-2        0.26  0.51  0.35      0.30  0.54  0.38
Improvement-1 (%) 65.71  7.01 32.55     61.11  6.89 28.88
Improvement-2 (%) 123.0 19.6  62.85     93.3  14.8  52.63
15/24
RQ2: Varying Training Data
% of Issue   HTTPClient         Jackrabbit         Lucene-Java
Reports      Prec Rec  WF1      Prec Rec  WF1      Prec Rec  WF1
10           0.49 0.56 0.47     0.63 0.65 0.60     0.55 0.57 0.53
20           0.54 0.55 0.46     0.64 0.66 0.61     0.57 0.57 0.54
30           0.58 0.60 0.54     0.68 0.70 0.67     0.59 0.60 0.58
40           0.54 0.53 0.48     0.69 0.71 0.68     0.59 0.58 0.56
50           0.58 0.61 0.57     0.69 0.71 0.69     0.62 0.63 0.61
60           0.59 0.62 0.58     0.64 0.65 0.62     0.61 0.62 0.61
70           0.60 0.62 0.58     0.70 0.72 0.70     0.62 0.63 0.62
80           0.62 0.68 0.61     0.70 0.72 0.70     0.63 0.64 0.63
90           0.61 0.64 0.60     0.71 0.73 0.71     0.62 0.63 0.62
16/24
RQ2: Varying Training Data
% of Issue   Rhino              Tomcat5
Reports      Prec Rec  WF1      Prec Rec  WF1
10           0.45 0.52 0.40     0.47 0.54 0.43
20           0.46 0.50 0.39     0.50 0.55 0.45
30           0.46 0.50 0.40     0.54 0.60 0.53
40           0.47 0.48 0.40     0.56 0.62 0.56
50           0.52 0.58 0.50     0.56 0.61 0.56
60           0.55 0.59 0.53     0.50 0.48 0.42
70           0.56 0.60 0.54     0.49 0.44 0.38
80           0.58 0.61 0.56     0.57 0.62 0.58
90           0.59 0.61 0.56     0.54 0.59 0.55
17/24
RQ3: Most Discriminative Features
HTTPClient
Feature                    Fisher Score
Stemmed word “test”        1.73
Reported Category (TASK)   0.58
Stemmed word “privat”      0.56
Reported Category (BUG)    0.54
Stemmed word “cleanup”     0.50

Jackrabbit
Feature                    Fisher Score
Reported Category (BUG)    0.72
Stemmed word “test”        0.55
Stemmed word “maven”       0.51
Stemmed word “backport”    0.46
Reported Category (IMPR)   0.43
18/24
RQ3: Most Discriminative Features

Lucene-Java
Feature                    Fisher Score
Stemmed word “test”        0.94
Reported Category (BUG)    0.61
Reported Category (TEST)   0.50
Stemmed word “backport”    0.45
Stemmed word “remov”       0.38

Rhino
Feature                    Fisher Score
Stemmed word “test”        3.84
Stemmed word “suit”        0.43
Stemmed word “patch”       0.32
Stemmed word “driver”      0.29
Stemmed word “regress”     0.27

Tomcat5
Feature                    Fisher Score
Stemmed word “longer”      1.15
Issue Reporter “starksm”   0.71
Stemmed word “class”       0.64
Stemmed word “ant”         0.62
Reported Category (BUG)    0.56
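A Fisher score for one feature can be computed as below, using a common definition (between-class scatter over within-class scatter); whether the paper uses exactly this variant is an assumption:

```python
def fisher_score(values_by_class):
    # values_by_class: one list of feature values per category.
    all_vals = [v for vals in values_by_class for v in vals]
    mu = sum(all_vals) / len(all_vals)  # global mean of the feature
    num = den = 0.0
    for vals in values_by_class:
        n = len(vals)
        mu_k = sum(vals) / n
        var_k = sum((v - mu_k) ** 2 for v in vals) / n
        num += n * (mu_k - mu) ** 2   # between-class scatter
        den += n * var_k              # within-class scatter
    return num / den

# Binary feature present (1) mostly in class A, mostly absent (0) in class B.
print(fisher_score([[1, 1, 1, 0], [0, 0, 1, 0]]))
```

A higher score means the feature's mean differs more across categories relative to its spread within each category, i.e., it discriminates better.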
19/24
RQ4: Correctly & Wrongly Classified Reports
Ground Truth \ Predicted   BUG   RFE  IMPR  TEST  DOC  BUILD  CLEANUP  REFAC
BUG                       2631    48   119    26   23      8        8      1
RFE                        139   765   223     6   13      7       13     31
IMPR                       320   214   658     8   12     13       16     19
TEST                        84    12    15   220    1      8        4      3
DOC                         95    39    37     0  209     13       17      2
BUILD                       29    17    19    11   10    127        5      1
CLEANUP                     58    30    42     6   11      5      104     12
REFAC                       20    51    61     1    2      0       16     91

Table shows 8 of the 13 categories.
BUG – 2631/2914 (90.3%)
TEST – 220/349 (63%)
RFE – 765/1221 (62.7%)
20/24
RQ5: Comparison with Other Algorithms
Approach             HTTPClient        Jackrabbit        Lucene-Java
                     Prec Rec  WF1     Prec Rec  WF1     Prec Rec  WF1
Ours (LibSVM)        0.61 0.63 0.60    0.71 0.72 0.71    0.62 0.63 0.62
Naïve Bayes          0.49 0.47 0.48    0.51 0.39 0.43    0.46 0.37 0.40
NB Multinomial       0.53 0.60 0.54    0.64 0.66 0.61    0.60 0.59 0.56
K-Nearest Neighbors  0.47 0.29 0.34    0.60 0.58 0.59    0.46 0.40 0.42
Random Forest        0.45 0.56 0.46    0.54 0.58 0.53    0.45 0.48 0.43
RBF Network          0.37 0.39 0.37    0.39 0.41 0.40    0.31 0.31 0.30
22/24
RQ5: Comparison with Other Algorithms
Approach             Rhino             Tomcat5
                     Prec Rec  WF1     Prec Rec  WF1
Ours (LibSVM)        0.58 0.61 0.57    0.58 0.62 0.58
Naïve Bayes          0.51 0.51 0.51    0.48 0.40 0.42
NB Multinomial       0.52 0.58 0.49    0.51 0.58 0.47
K-Nearest Neighbors  0.50 0.43 0.43    0.43 0.43 0.42
Random Forest        0.51 0.56 0.47    0.45 0.56 0.46
RBF Network          0.40 0.43 0.41    0.33 0.54 0.39
23/24
Conclusion & Future Work
• Automated approach to reclassify issue reports
• Evaluated on over 7,000 issue reports
• Extract features such as TF-IDF, reported category, exception trace, and issue reporter
• Perform multi-class classification (13 categories)
• F-measure scores of 0.57–0.71
• Improvement of 28.88%–414.66% over the baselines

Future Work:
• Analyse more issue reports
• Design an advanced multi-class solution
24/24
Thank You!
Email: [email protected]