Slide 1: Automated Support for Classifying Software Failure Reports
Andy Podgurski, David Leon, Patrick Francis, Wes Masri, Melinda Minch, Jiayang Sun, Bin Wang
Case Western Reserve University
Presented by: Hamid Haidarian Shahri
Slide 2: Automated failure reporting
- Recent software products automatically detect and report crashes/exceptions to developer
  - Netscape Navigator
  - Microsoft products
- Report includes call stack, register values, other debug info
Slide 4: User-initiated reporting
- Other products permit user to report a failure at any time
- User describes problem
- Application state info may also be included in report
Slide 5: Mixed blessing
- Good news:
  - More failures reported
  - More precise diagnostic information
- Bad news:
  - Dramatic increase in failure reports
  - Too many to review manually
Slide 6: Our approach
- Help developers group reported failures with same cause, before cause is known
- Provide "semi-automatic" support for:
  - Execution profiling
  - Supervised and unsupervised pattern classification
  - Multivariate visualization
- Initial classification is checked, refined by developer
Slide 8: How classification helps (benefits)
- Aids prioritization and debugging:
  - Suggests number of underlying defects
  - Reflects how often each defect causes failures
  - Assembles evidence relevant to prioritizing, diagnosing each defect
Slide 9: Formal view of problem
- Let F = {f1, f2, ..., fm} be the set of reported failures
- True failure classification: partition of F into subsets F1, F2, ..., Fk such that all failures in each Fi have the same cause
- Our approach produces an approximate failure classification G1, G2, ..., Gp
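Stated formally (a restatement of the slide's definition; the cause(.) notation is shorthand introduced here, not the paper's):

  $F = \bigcup_{i=1}^{k} F_i, \qquad F_i \cap F_j = \emptyset \ \text{for } i \neq j, \qquad f, f' \in F_i \implies \mathrm{cause}(f) = \mathrm{cause}(f')$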
Slide 10: Classification strategy (***)
1. Software instrumented to collect profiles or captured executions and upload them to developer
2. Profiles of reported failures combined with those of apparently successful executions (reducing bias)
3. Subset of relevant features selected
4. Failure profiles analyzed using cluster analysis and multivariate visualization
5. Initial classification of failures examined, refined
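A minimal Python sketch of the shape of steps 2-4, assuming profiles are rows of NumPy arrays; KMeans and coefficient-based feature selection stand in here for the paper's clara clustering and wrapper method, so this illustrates the pipeline, not the authors' implementation:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans

    def classify_failures(failure_profiles, success_profiles, n_groups=5):
        """Toy version of the strategy (steps 1 and 5 happen outside code)."""
        # Step 2: pool failing and passing profiles into one labeled data set
        X = np.vstack([failure_profiles, success_profiles])
        y = np.array([1] * len(failure_profiles) + [0] * len(success_profiles))

        # Step 3: keep features the failure-vs-success classifier weights heavily
        lr = LogisticRegression(max_iter=1000).fit(X, y)
        keep = np.argsort(-np.abs(lr.coef_[0]))[:50]  # top-50 is an arbitrary cutoff

        # Step 4: cluster only the failures, in the reduced feature space
        labels = KMeans(n_clusters=n_groups, n_init=10).fit_predict(
            failure_profiles[:, keep])
        return labels  # step 5: developer inspects and refines these groups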
Slide 11: Execution profiling
- Our approach not limited to classifying crashes and exceptions
  - User may report failure well after critical events leading to failure
- Profiles should characterize entire execution
- Profiles should characterize events potentially relevant to failure, e.g., control flow, data flow, variable values, event sequences, state transitions
- Full execution capture/replay permits arbitrary profiling
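Concretely, the experiments below use function call counts as profiles, so each execution reduces to one numeric vector; a toy illustration (function names invented):

    # One execution profile: call count per instrumented function.
    profile = {
        "parse_expr": 412,
        "fold_const":  38,
        "emit_insn":  955,
        "gc_collect":   2,
    }
    # For analysis, profiles become rows of a numeric matrix in a fixed
    # function order, so executions can be compared feature-by-feature.
    vector = [profile[f] for f in sorted(profile)]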
Slide 12: Feature selection
1. Generate candidate feature sets
2. Use each one to train classifier to distinguish failures from successful executions
3. Select features of the classifier that performs best overall
4. Use those features to group (cluster) related failures
Slide 13: Probabilistic wrapper method
- Used to select features in our experiments; due to Liu and Setiono
- Random feature sets generated
- Each used with one part of profile data to train classifier
- Misclassification rate of each classifier estimated using another part of data (testing)
- Features of classifier with smallest estimated misclassification rate used for grouping failures
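A minimal sketch of a probabilistic wrapper of this kind, assuming a profile matrix X and labels y (1 = failure, 0 = success); the trial count and subset size are arbitrary placeholders, not Liu and Setiono's settings:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def wrapper_select(X, y, n_trials=400, subset_size=500, seed=0):
        rng = np.random.default_rng(seed)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.25, random_state=seed)
        best_feats, best_err = None, np.inf
        for _ in range(n_trials):
            # Generate a random candidate feature set
            feats = rng.choice(X.shape[1], size=min(subset_size, X.shape[1]),
                               replace=False)
            clf = LogisticRegression(max_iter=1000).fit(X_tr[:, feats], y_tr)
            # Estimate misclassification rate on held-out data
            err = np.mean(clf.predict(X_te[:, feats]) != y_te)
            if err < best_err:
                best_feats, best_err = feats, err
        return best_feats  # features used to group the failures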
Slide 14: Logistic regression (skip)
- Simple, widely-used classifier
- Binary dependent variable Y
- Expected value E(Y | x) of Y given predictor x = (x1, x2, ..., xp) is

  $\pi(x) = P(Y = 1 \mid x) = \frac{e^{g(x)}}{1 + e^{g(x)}}$
Slide 15: Logistic regression cont. (skip)
- Log odds ratio (logit) g(x) defined by

  $g(x) = \ln \frac{\pi(x)}{1 - \pi(x)} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$

- Coefficients estimated from sample of x and Y values
- Estimate of Y given x is 1 iff estimate of g(x) is positive
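A tiny worked instance of the decision rule, with coefficients invented for illustration:

    import math

    # Invented fitted coefficients: beta0 = -2.0, beta1 = 0.5
    beta0, beta1 = -2.0, 0.5
    x1 = 6.0

    g = beta0 + beta1 * x1                 # logit: -2 + 0.5*6 = 1.0
    pi = math.exp(g) / (1 + math.exp(g))   # P(Y=1 | x) ~= 0.73
    prediction = 1 if g > 0 else 0         # predict "failure" iff g(x) > 0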
Slide 16: Grouping related failures
- Alternatives:
  1) Automatic cluster analysis: can be fully automated
  2) Multivariate visualization: user must identify groups in display
- Weaknesses of each approach offset by combining them
Slide 17: 1) Automatic cluster analysis
- Identifies clusters among objects based on similarity of feature values
- Employs dissimilarity metric, e.g., Euclidean or Manhattan distance
- Must estimate number of clusters
  - Difficult problem
  - Several "reasonable" ways to cluster a population may exist
Slide 18: Estimating number of clusters
- Widely-used metric of quality of clustering due to Calinski and Harabasz:

  $CH(c) = \frac{B(c)/(c-1)}{W(c)/(n-c)}$

  - B is total between-cluster sum of squared distances
  - W is total within-cluster sum of squared distances from cluster centroids
  - n is number of objects in population
- Local maxima represent alternative estimates
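scikit-learn exposes this index directly, so the "scan c and inspect local maxima" step can be sketched as follows (KMeans stands in for the k-medoids algorithm used in the experiments):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import calinski_harabasz_score

    def ch_curve(X, c_max=50):
        """Calinski-Harabasz index for c = 2..c_max; local maxima are
        candidate estimates of the number of clusters."""
        scores = {}
        for c in range(2, min(c_max, len(X) - 1) + 1):
            labels = KMeans(n_clusters=c, n_init=10).fit_predict(X)
            scores[c] = calinski_harabasz_score(X, labels)
        return scores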
Slide 19: 2) Multidimensional scaling (MDS)
- Represents dissimilarities between objects by 2D scatter plot
- Distances between points in display approximate dissimilarities
- Small dissimilarities poorly represented with high-dimensional profiles
- Our solution: hierarchical MDS (HMDS)
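HMDS is the authors' own algorithm; plain metric MDS, which it builds on, is available in scikit-learn, and a 2D projection of failure profiles can be sketched as:

    from sklearn.manifold import MDS

    def project_2d(failure_profiles):
        """Embed profiles in 2D so inter-point distances approximate
        the original dissimilarities (metric MDS, Euclidean input)."""
        mds = MDS(n_components=2, dissimilarity="euclidean", random_state=0)
        return mds.fit_transform(failure_profiles)  # (n_failures, 2) coordinates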
Slide 20: Confirming or refining the initial classification
- Select 2+ failures from each group
  - Choose ones with maximally dissimilar profiles
- Debug to determine if they are related
  - If not, split group
- Examine neighboring groups to see if they should be combined
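Choosing the maximally dissimilar pair within a group is a small computation; a sketch using SciPy (Euclidean distance assumed):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def most_dissimilar_pair(group_profiles):
        """Indices of the two profiles in a group that are farthest apart;
        debugging these first is the quickest check that the group is
        really homogeneous."""
        d = squareform(pdist(group_profiles))        # full distance matrix
        i, j = np.unravel_index(np.argmax(d), d.shape)
        return i, j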
Slide 21: Limitations
- Classification unlikely to be exact:
  - Sampling error
  - Modeling error
  - Representation error
  - Spurious correlations
  - Form of profiling
  - Human judgment
Slide 22: Experimental validation
- Implemented classification strategy with three large subject programs
  - GCC, Jikes, javac compilers
- Failures clustered automatically (what failure?)
- Resulting clusters examined manually
  - Most or all failures in each cluster examined
Slide 23: Subject programs
- GCC 2.95.2 C compiler
  - Written in C
  - Used subset of regression test suite (self-validating execution tests)
  - 3333 tests run, 136 failures
  - Profiled with GNU Gcov (2214 function call counts)
- Jikes 1.15 Java compiler
  - Written in C++
  - Used Jacks test suite (self-validating)
  - 3149 tests run, 225 failures
  - Profiled with Gcov (3644 function call counts)
Slide 24: Subject programs cont.
- javac 1.3.1_02-b02 Java compiler
  - Written in Java
  - Used Jacks test suite
  - 3140 tests run, 233 failures
  - Profiled with function-call profiler written using JVMPI (1554 call counts)
Slide 25: Experimental methodology (skip)
- 400-500 candidate logistic regression (LR) models generated per data set
  - 500 randomly selected features per model
- Model with lowest estimated misclassification rate chosen
- Data partitioned into three subsets:
  - Train (50%): used to train candidate models
  - TestA (25%): used to pick best model
  - TestB (25%): used for final estimate of misclassification rate
Slide 26: Experimental methodology cont. (skip)
- Measure used to pick best model; gives extra weight to misclassification of failures:

  2 x (% misclassified failures) + (% misclassified successes)

- Final LR models correctly classified 72% of failures and 91% of successes
- Linearly dependent features omitted from fitted LR models
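Assuming the reconstruction above is right, the selection measure is a one-liner (lower is better):

    def model_score(pct_failures_missed, pct_successes_missed):
        # Hypothetical reading of the slide's measure: misclassified
        # failures count double, per "extra weight to misclassification
        # of failures".
        return 2 * pct_failures_missed + pct_successes_missed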
Slide 27: Experimental methodology cont. (skip)
- Cluster analysis
  - S-Plus clustering algorithm clara
  - Based on k-medoids criterion
  - Calinski-Harabasz index plotted for 2 <= c <= 50; local maxima examined
- Visualization
  - Hierarchical MDS (HMDS) algorithm used
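A CLARA-style sketch using KMedoids from scikit-learn-extra in place of S-Plus clara (sample counts and sizes are invented parameters):

    import numpy as np
    from sklearn_extra.cluster import KMedoids  # from scikit-learn-extra

    def clara_like(X, n_clusters, n_samples=5, sample_size=200, seed=0):
        """CLARA idea: run k-medoids (PAM) on several random samples and
        keep the medoid set with the lowest cost over the full data set."""
        rng = np.random.default_rng(seed)
        best_medoids, best_cost = None, np.inf
        for _ in range(n_samples):
            idx = rng.choice(len(X), size=min(sample_size, len(X)),
                             replace=False)
            km = KMedoids(n_clusters=n_clusters, method="pam").fit(X[idx])
            medoids = X[idx][km.medoid_indices_]
            # Cost of this medoid set when ALL points join nearest medoid
            dists = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=2)
            cost = dists.min(axis=1).sum()
            if cost < best_cost:
                best_medoids, best_cost = medoids, cost
        return np.linalg.norm(
            X[:, None, :] - best_medoids[None, :, :], axis=2).argmin(axis=1)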
Slide 28: Manual examination of failures (skip)
- Several GCC tests often have same source file, different optimization levels
  - Such tests often fail or succeed together
- Hence, GCC failures were grouped manually based on:
  - Source file
  - Information about bug fixes
  - Date of first version to pass test
Slide 29: Manual examination cont. (skip)
- Jikes, javac failures grouped in two stages:
  1. Automatically formed clusters checked
  2. Overlapping clusters in HMDS display checked
- Activities:
  - Debugging
  - Comparing versions
  - Examining error codes
  - Inspecting source files
  - Checking correspondence between tests and JLS sections
Slide 30: GCC results

  Number of clusters | % size of largest same-cause group in cluster | Total failures (136)
  21 | 100 | 77 (57%)
  1  | 83  | 6 (4%)
  3  | 75, 75, 71 | 23 (17%)
  1  | 60  | 5 (4%)
  1  | 24  | 25 (18%)
Slide 31: GCC results cont.
[Figure: HMDS display of GCC failure profiles after feature selection; convex hulls indicate results of automatic clustering into 27 clusters.]
[Figure: HMDS display of GCC failure profiles after feature selection; convex hulls indicate failures involving the same defect, identified using HMDS (more accurate).]
Slide 32: GCC results cont.
[Figure: HMDS display of GCC failure profiles before feature selection; convex hulls indicate failures involving the same defect. So feature selection helps in grouping.]
Slide 33: javac results

  Number of clusters | % size of largest same-cause group in cluster | Total failures (232)
  9 | 100 | 70 (30%)
  5 | 88, 85, 85, 85, 83 | 64 (28%)
  4 | 75, 67, 67, 57 | 49 (21%)
  2 | 50, 50 | 20 (9%)
  1 | 17 | 23 (10%)
Slide 34: javac results cont.
[Figure: HMDS display of javac failures; convex hulls indicate results of manual classification with HMDS.]
Slide 35: Jikes results

  Number of clusters | % size of largest same-cause group in cluster | Total failures (225)
  12 | 100 | 64 (29%)
  5 | 85, 83, 80, 75, 75 | 41 (18%)
  4 | 70, 67, 67, 56 | 25 (11%)
  8 | 50, 50, 50, 43, 41, 33, 33, 25 | 76 (34%)
Slide 36: Jikes results cont.
[Figure: HMDS display of Jikes failures; convex hulls indicate results of manual classification with HMDS.]
Slide 37: Summary of results
- In most automatically-created clusters, majority of failures had same cause
- A few large, non-homogeneous clusters were created
  - Sub-clusters evident in HMDS displays
- Automatic clustering sometimes splits groups of failures with same cause
  - HMDS displays didn't have this problem
- Overall, failures with same cause formed fairly cohesive clusters
Slide 38: Threats to validity
- One type of program used in experiments
- Hand-crafted test inputs used for profiling
- Think of Microsoft...
Slide 39: Related work
- Slice [Agrawal et al.]
- Path spectra [Reps et al.]
- Tarantula [Jones et al.]
- Delta debugging [Hildebrandt & Zeller]
- Cluster filtering [Dickinson et al.]
- Clustering IDS alarms [Julisch & Dacier]
Slide 40: Conclusions
- Demonstrated that our classification strategy is potentially useful with compilers
- Further evaluation needed with different types of software and with failure reports from the field

Note: The input space is huge. More accurate reporting (severity, location) could facilitate better grouping and overcome these problems.

Note: Limited labeled data is available, and error causes/types constantly change (errors get debugged), so the effectiveness of learning is somewhat questionable (like following your shadow).