
1

Automated Support for Classifying Software Failure Reports

Andy Podgurski, David Leon, Patrick Francis, Wes Masri, Melinda Minch, Jiayang Sun, Bin Wang

Case Western Reserve University

Presented by: Hamid Haidarian Shahri

2

Automated failure reporting
Recent software products automatically detect and report crashes/exceptions to the developer
  Examples: Netscape Navigator, Microsoft products
Report includes call stack, register values, other debug info

3

Example

4

User-initiated reporting
Other products permit the user to report a failure at any time
User describes the problem
Application state info may also be included in the report

5

Mixed blessing
Good news:
  More failures reported
  More precise diagnostic information
Bad news:
  Dramatic increase in failure reports
  Too many to review manually

6

Our approach
Help developers group reported failures with the same cause – before the cause is known
Provide “semi-automatic” support for:
  Execution profiling
  Supervised and unsupervised pattern classification
  Multivariate visualization
Initial classification is checked and refined by the developer

7

Example classification

8

How classification helps (Benefits)

Aids prioritization and debugging:
  Suggests the number of underlying defects
  Reflects how often each defect causes failures
  Assembles evidence relevant to prioritizing and diagnosing each defect

9

Formal view of problem
Let F = { f1, f2, ..., fm } be the set of reported failures
True failure classification: partition of F into subsets F1, F2, ..., Fk such that all failures in each Fi have the same cause
Our approach produces an approximate failure classification G1, G2, ..., Gp

10

Classification strategy (***)

1. Software instrumented to collect and upload profiles or captured executions for the developer
2. Profiles of reported failures combined with those of apparently successful executions (reducing bias)
3. Subset of relevant features selected
4. Failure profiles analyzed using cluster analysis and multivariate visualization
5. Initial classification of failures examined and refined
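As a concrete illustration of step 2, here is a minimal Python sketch (not from the paper) of how pooled profile data could be laid out; the profiles and call counts below are invented.

```python
import numpy as np

# Invented example data: each profile is a vector of function call
# counts, like those collected with gcov / JVMPI later in the experiments.
failure_profiles = np.array([[3, 0, 7, 1],
                             [2, 0, 9, 1]])
success_profiles = np.array([[5, 4, 0, 2],
                             [6, 3, 1, 2]])

# Step 2 of the strategy: pool failing and apparently successful
# executions into one data set, labeled by outcome (1 = failure).
profiles = np.vstack([failure_profiles, success_profiles])
labels = np.array([1] * len(failure_profiles) + [0] * len(success_profiles))
print(profiles.shape, labels)
```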

11

Execution profiling
Our approach is not limited to classifying crashes and exceptions
User may report a failure well after the critical events leading to it
Profiles should characterize the entire execution
Profiles should characterize events potentially relevant to the failure, e.g., control flow, data flow, variable values, event sequences, state transitions
Full execution capture/replay permits arbitrary profiling
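For intuition only, here is a rough Python analogue of whole-execution call-count profiling; the paper instrumented the subject programs with gcov and JVMPI, not with this mechanism.

```python
import sys
from collections import Counter

call_counts = Counter()

def _profiler(frame, event, arg):
    # Count every Python-level function call during the monitored run.
    if event == "call":
        call_counts[frame.f_code.co_name] += 1

def run_profiled(fn, *args):
    # Collect a whole-execution function-call-count profile, loosely
    # analogous to the call-count profiles used in the experiments.
    sys.setprofile(_profiler)
    try:
        return fn(*args)
    finally:
        sys.setprofile(None)

def helper(x):
    return x * 2

def workload(n):
    total = 0
    for i in range(n):
        total += helper(i)
    return total

run_profiled(workload, 3)
print(dict(call_counts))   # e.g. {'workload': 1, 'helper': 3}
```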

12

Feature selection

1. Generate candidate feature sets
2. Use each one to train a classifier to distinguish failures from successful executions
3. Select the features of the classifier that performs best overall
4. Use those features to group (cluster) related failures

13

Probabilistic wrapper method

Used to select features in our experiments
Due to Liu and Setiono
Random feature sets generated
Each used with one part of the profile data to train a classifier
Misclassification rate of each classifier estimated using another part of the data (testing)
Features of the classifier with the smallest estimated misclassification rate used for grouping failures (see the sketch below)
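A rough Python sketch of this wrapper loop, using scikit-learn's LogisticRegression on made-up toy data; the sizes here (100 candidate sets, 10 features each) are illustrative, not the paper's settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-in data: rows are execution profiles (e.g. call counts),
# y marks failures (1) vs. apparent successes (0). Feature 7 is made
# informative so the candidate classifiers have something to find.
X = rng.poisson(3.0, size=(200, 50)).astype(float)
y = (X[:, 7] + rng.normal(0, 1, 200) > 3.0).astype(int)

# One part of the data trains each candidate model; another part tests it.
train, test = np.arange(0, 140), np.arange(140, 200)

best_feats, best_err = None, np.inf
for _ in range(100):                        # randomly generated feature sets
    feats = rng.choice(X.shape[1], size=10, replace=False)
    clf = LogisticRegression(max_iter=1000).fit(X[train][:, feats], y[train])
    err = 1.0 - clf.score(X[test][:, feats], y[test])   # misclassification rate
    if err < best_err:
        best_feats, best_err = feats, err

# The features of the best classifier are then used to group failures.
print(sorted(best_feats.tolist()), round(best_err, 3))
```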

14

Logistic regression (skip)

Simple, widely used classifier
Binary dependent variable Y
Expected value E(Y | x) of Y given predictor x = (x1, x2, ..., xp) is π(x) = P(Y = 1 | x)
π(x) = e^{g(x)} / (1 + e^{g(x)})

15

Logistic regression cont. (skip)

Log odds ratio (logit) g(x) defined by
g(x) = ln( π(x) / (1 − π(x)) ) = β0 + β1 x1 + ... + βp xp
Coefficients estimated from a sample of x and Y values
Estimate of Y given x is 1 iff the estimate of g(x) is positive
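A tiny sketch of how a fitted model is applied, assuming the coefficients have already been estimated (the values below are invented).

```python
import math

def logit(x, beta0, beta):
    # g(x) = beta0 + beta1*x1 + ... + betap*xp
    return beta0 + sum(b * xi for b, xi in zip(beta, x))

def prob_failure(x, beta0, beta):
    # pi(x) = e^g(x) / (1 + e^g(x))
    g = logit(x, beta0, beta)
    return math.exp(g) / (1.0 + math.exp(g))

def predict(x, beta0, beta):
    # Estimate of Y is 1 iff the estimated logit is positive
    # (equivalently, pi(x) > 0.5).
    return 1 if logit(x, beta0, beta) > 0 else 0

# Invented coefficients, for illustration only.
x = [2.0, 0.0]
print(prob_failure(x, beta0=-1.0, beta=[0.8, 0.3]), predict(x, -1.0, [0.8, 0.3]))
```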

16

Grouping related failures

Alternatives:
  1) Automatic cluster analysis: can be fully automated
  2) Multivariate visualization: user must identify groups in the display
Weaknesses of each approach are offset by combining them

17

1) Automatic cluster analysis
Identifies clusters among objects based on similarity of feature values
Employs a dissimilarity metric, e.g., Euclidean or Manhattan distance
Must estimate the number of clusters
  A difficult problem
  Several “reasonable” ways to cluster a population may exist
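For example, the two dissimilarity metrics named above can be computed with SciPy on a few toy profiles.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy failure profiles (rows = failures, columns = selected features).
profiles = np.array([[1.0, 4.0, 0.0],
                     [2.0, 3.0, 0.0],
                     [9.0, 0.0, 5.0]])

euclid = squareform(pdist(profiles, metric="euclidean"))     # Euclidean distance
manhattan = squareform(pdist(profiles, metric="cityblock"))  # Manhattan distance
print(euclid.round(2))
print(manhattan)
```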

18

Estimating number of clusters
Widely-used metric of clustering quality due to Calinski and Harabasz:
CH(c) = ( B(c) / (c − 1) ) / ( W(c) / (n − c) )
  B is the total between-cluster sum of squared distances
  W is the total within-cluster sum of squared distances from cluster centroids
  n is the number of objects in the population
Local maxima of CH(c) represent alternative estimates of the number of clusters (see the sketch below)
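A small, self-contained implementation of this index on toy data (my own rendering of the formula above).

```python
import numpy as np

def calinski_harabasz(X, labels):
    # CH(c) = (B(c) / (c - 1)) / (W(c) / (n - c)), where B is the total
    # between-cluster and W the total within-cluster sum of squared distances.
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n, clusters = len(X), np.unique(labels)
    c = len(clusters)
    overall_mean = X.mean(axis=0)
    B = W = 0.0
    for k in clusters:
        members = X[labels == k]
        centroid = members.mean(axis=0)
        B += len(members) * np.sum((centroid - overall_mean) ** 2)
        W += np.sum((members - centroid) ** 2)
    return (B / (c - 1)) / (W / (n - c))

# Two well-separated toy clusters give a large index value.
print(calinski_harabasz([[0, 0], [0, 1], [5, 5], [5, 6]], [0, 0, 1, 1]))  # 100.0
```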

19

2) Multidimensional scaling (MDS)
Represents dissimilarities between objects by a 2D scatter plot
Distances between points in the display approximate the dissimilarities
Small dissimilarities are poorly represented with high-dimensional profiles
Our solution: hierarchical MDS (HMDS)
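HMDS is the authors' own variant; as a stand-in, ordinary metric MDS from scikit-learn shows the basic idea of embedding a dissimilarity matrix in 2D (the data here are invented).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

# Invented failure profiles; in the paper these are call-count profiles
# restricted to the selected features.
profiles = np.random.default_rng(1).poisson(3.0, size=(30, 20)).astype(float)
diss = squareform(pdist(profiles, metric="euclidean"))

# Embed the dissimilarities in 2D so that plotted distances
# approximate the original dissimilarities.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(diss)
print(coords[:3])   # 2D coordinates ready for a scatter plot
```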

20

Confirming or refining the initial classification
Select 2+ failures from each group
  Choose ones with maximally dissimilar profiles (see the sketch below)
Debug to determine if they are related
  If not, split the group
Examine neighboring groups to see if they should be combined
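Picking the maximally dissimilar pair inside a group is straightforward; a small sketch on toy profiles follows.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def most_dissimilar_pair(group_profiles):
    # Return the indices of the two failures in a group whose profiles
    # are farthest apart; debugging these first gives the best chance of
    # exposing a group that actually mixes two different causes.
    d = squareform(pdist(np.asarray(group_profiles, dtype=float)))
    i, j = np.unravel_index(np.argmax(d), d.shape)
    return int(i), int(j)

group = [[1, 2, 0], [1, 3, 0], [8, 0, 7]]   # invented profiles of one group
print(most_dissimilar_pair(group))          # -> (1, 2)
```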

21

Limitations
Classification unlikely to be exact, due to:
  Sampling error
  Modeling error
  Representation error
  Spurious correlations
  Form of profiling
  Human judgment

22

Experimental validation
Implemented the classification strategy with three large subject programs
  GCC, Jikes, and javac compilers
Failures clustered automatically (what failure?)
Resulting clusters examined manually
  Most or all failures in each cluster examined

23

Subject programs
GCC 2.95.2 C compiler
  Written in C
  Used subset of regression test suite (self-validating execution tests)
  3333 tests run, 136 failures
  Profiled with GNU gcov (2214 function call counts)
Jikes 1.15 Java compiler
  Written in C++
  Used Jacks test suite (self-validating)
  3149 tests run, 225 failures
  Profiled with gcov (3644 function call counts)

24

Subject programs cont.
javac 1.3.1_02-b02 Java compiler
  Written in Java
  Used Jacks test suite
  3140 tests run, 233 failures
  Profiled with a function-call profiler written using JVMPI (1554 call counts)

25

Experimental methodology (skip)

400-500 candidate logistic regression (LR) models generated per data set
  500 randomly selected features per model
  Model with the lowest estimated misclassification rate chosen
Data partitioned into three subsets:
  Train (50%): used to train candidate models
  TestA (25%): used to pick the best model
  TestB (25%): used for final estimate of misclassification rate

26

Experimental methodology cont. (skip)
Measure used to pick the best model:
(% misclassified failures)² + (% misclassified successes)
  Gives extra weight to misclassification of failures
Final LR models correctly classified 72% of failures and 91% of successes
Linearly dependent features omitted from fitted LR models
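Assuming the measure is (% misclassified failures)² + (% misclassified successes), as reconstructed above, a tiny helper makes the weighting visible.

```python
def model_score(pct_misclassified_failures, pct_misclassified_successes):
    # Squaring the failure percentage (on a 0-100 scale) penalizes
    # misclassified failures far more heavily than misclassified successes.
    return pct_misclassified_failures ** 2 + pct_misclassified_successes

# Lower is better: missing 10% of failures and 20% of successes scores 120,
# while missing 20% of failures and 10% of successes scores 410.
print(model_score(10, 20), model_score(20, 10))
```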

27

Experimental methodology cont. (skip)

Cluster analysis
  S-Plus clustering algorithm clara
  Based on the k-medoids criterion
  Calinski-Harabasz index plotted for 2 ≤ c ≤ 50; local maxima examined
Visualization
  Hierarchical MDS (HMDS) algorithm used
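A rough Python analogue of the clustering step; scikit-learn's KMeans stands in for the S-Plus k-medoids algorithm clara, and the profiles are invented.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

# Invented failure profiles standing in for the real call-count data.
X = np.random.default_rng(2).poisson(3.0, size=(120, 15)).astype(float)

scores = {}
for c in range(2, 51):                      # 2 <= c <= 50, as in the experiments
    labels = KMeans(n_clusters=c, n_init=10, random_state=0).fit_predict(X)
    scores[c] = calinski_harabasz_score(X, labels)

# Local maxima of the index are the candidate numbers of clusters to examine.
candidates = [c for c in range(3, 50)
              if scores[c] > scores[c - 1] and scores[c] > scores[c + 1]]
print(candidates)
```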

28

Manual examination of failures (skip)

Several GCC tests often have the same source file but different optimization levels
  Such tests often fail or succeed together
Hence, GCC failures were grouped manually based on:
  Source file
  Information about bug fixes
  Date of first version to pass the test

29

Manual examination cont. (skip)

Jikes and javac failures grouped in two stages:
  1. Automatically formed clusters checked
  2. Overlapping clusters in HMDS display checked
Activities:
  Debugging
  Comparing versions
  Examining error codes
  Inspecting source files
  Checking correspondence between tests and JLS sections

30

GCC results

Number of clusters | Largest same-cause group (% of cluster size) | Total failures (of 136)
21 | 100 | 77 (57%)
1 | 83 | 6 (4%)
3 | 75, 75, 71 | 23 (17%)
1 | 60 | 5 (4%)
1 | 24 | 25 (18%)

31

GCC results cont.

HMDS display of GCC failure profiles after feature selection. Convex hulls indicate the results of automatic clustering into 27 clusters.
HMDS display of GCC failure profiles after feature selection. Convex hulls indicate failures involving the same defect, identified using HMDS (more accurate).

32

GCC results cont.

HMDS display of GCC failure profiles before feature selection. Convex hulls indicate failures involving the same defect. So feature selection helps in grouping.

33

javac results

Number of clusters | Largest same-cause group (% of cluster size) | Total failures (of 232)
9 | 100 | 70 (30%)
5 | 88, 85, 85, 85, 83 | 64 (28%)
4 | 75, 67, 67, 57 | 49 (21%)
2 | 50, 50 | 20 (9%)
1 | 17 | 23 (10%)

34

javac results cont.

HMDS display of javac failures. Convex hulls indicate the results of manual classification with HMDS.

35

Jikes results

Number of clusters | Largest same-cause group (% of cluster size) | Total failures (of 225)
12 | 100 | 64 (29%)
5 | 85, 83, 80, 75, 75 | 41 (18%)
4 | 70, 67, 67, 56 | 25 (11%)
8 | 50, 50, 50, 43, 41, 33, 33, 25 | 76 (34%)

36

Jikes results cont.

HMDS display of Jikes failures. Convex hulls indicate the results of manual classification with HMDS.

37

Summary of results
In most automatically created clusters, the majority of failures had the same cause
A few large, non-homogeneous clusters were created
  Sub-clusters evident in HMDS displays
Automatic clustering sometimes splits groups of failures with the same cause
  HMDS displays didn’t have this problem
Overall, failures with the same cause formed fairly cohesive clusters

38

Threats to validity
Only one type of program used in the experiments
Hand-crafted test inputs used for profiling
Think of Microsoft...

39

Related work
Slice [Agrawal et al.]
Path spectra [Reps et al.]
Tarantula [Jones et al.]
Delta debugging [Hildebrandt & Zeller]
Cluster filtering [Dickinson et al.]
Clustering IDS alarms [Julisch & Dacier]

40

Conclusions
Demonstrated that our classification strategy is potentially useful with compilers
Further evaluation needed with different types of software and with failure reports from the field

Note: The input space is huge. More accurate reporting (severity, location) could facilitate better grouping and overcome these problems.
Note: Limited labeled data is available, and error causes/types are constantly changing (errors get debugged), so the effectiveness of learning is somewhat questionable (like following your shadow).

41

Future work
Further experimental evaluation
Use more powerful classification and clustering techniques
Use different profiling techniques
Extract additional diagnostic information
Use techniques for classifying intrusions reported by anomaly detection systems