CS590Z Statistical Debugging
Xiangyu Zhang
(part of the slides are from Chao Liu)
A Very Important Principle
Traditional debugging techniques deal with a single execution (or very few). Given a large set of executions, including both passing and failing ones, statistical debugging is often highly effective.
Sources of executions: failure reporting, in-house testing
Tarantula (ASE 2005, ISSTA 2007)
Scalable Remote Bug Isolation (PLDI 2004, 2005)
Look at predicates:
Branches
Function returns (<0, <=0, >0, >=0, ==0, !=0)
Scalar pairs: for each assignment x = ..., find all in-scope variables y_i and constants c_j, and form each pair x (==, <, <=, ...) y_i or c_j
Sample the predicate evaluations (Bernoulli sampling).
Investigate how the probability of a predicate being true relates to bug manifestation.
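The sampling idea can be sketched as follows. This is an illustrative Python sketch, not the actual CBI instrumentation (which inserts checks at compile time and amortizes the coin flips with a geometric countdown); the `observe` helper and the predicate name are ours.

```python
import random

def observe(counts, pred_id, value, rate=0.1, rng=random):
    """Record one predicate evaluation, kept with probability `rate`
    (Bernoulli sampling).  counts maps pred_id -> [true, false] tallies."""
    if rng.random() < rate:
        tally = counts.setdefault(pred_id, [0, 0])
        tally[0 if value else 1] += 1

# Usage: sample the branch predicate "m >= 0" over many evaluations.
rng = random.Random(0)
counts = {}
for _ in range(10_000):
    m = rng.randint(-5, 5)          # stand-in for the program's value of m
    observe(counts, "m >= 0", m >= 0, rate=0.1, rng=rng)

kept = sum(counts["m >= 0"])        # roughly 10% of 10,000 evaluations
```

Sparse sampling like this keeps the runtime overhead low enough to deploy on end-user machines, which is what makes remote bug isolation feasible.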
Bug Isolation
How much does P being true increase the probability of failure over simply reaching the line where P is sampled?
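Concretely, Liblit's PLDI 2005 ranking scores each predicate by Increase(P) = Failure(P) - Context(P), where Failure(P) is the chance a run fails given P was ever observed true, and Context(P) is the chance a run fails given P's site was ever reached. A minimal sketch with per-run tallies (the function names are ours):

```python
def failure(f_true, s_true):
    """Pr(run fails | P observed true): failing / (failing + successful)."""
    return f_true / (f_true + s_true)

def increase(f_true, s_true, f_obs, s_obs):
    """Increase(P) = Failure(P) - Context(P); a positive value means P being
    true raises the failure probability beyond merely reaching its site."""
    context = f_obs / (f_obs + s_obs)
    return failure(f_true, s_true) - context

# P is true only in failing runs; its site is reached in 100 runs, 10 of which fail:
score = increase(f_true=10, s_true=0, f_obs=10, s_obs=90)  # 1.0 - 0.1 = 0.9
```

A predicate that is true equally often in passing and failing runs gets a score near zero, so it drops out of the ranking.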
An Example
Symptoms:
563 lines of C code
130 out of 5542 test cases fail to give correct outputs
No crashes
The predicate is evaluated to both true and false in one execution
Buggy version (the (lastm != m) check is omitted):
void subline(char *lin, char *pat, char *sub)
{
int i, lastm, m;
lastm = -1;
i = 0;
while((lin[i] != ENDSTR)) {
m = amatch(lin, i, pat, 0);
if (m >= 0){
putsub(lin, i, m, sub);
lastm = m;
}
if ((m == -1) || (m == i)){
fputc(lin[i], stdout);
i = i + 1;
} else
i = m;
}
}
Correct version (with the (lastm != m) check):
void subline(char *lin, char *pat, char *sub)
{
int i, lastm, m;
lastm = -1;
i = 0;
while((lin[i] != ENDSTR)) {
m = amatch(lin, i, pat, 0);
if ((m >= 0) && (lastm != m) ){
putsub(lin, i, m, sub);
lastm = m;
}
if ((m == -1) || (m == i)){
fputc(lin[i], stdout);
i = i + 1;
} else
i = m;
}
}
Not enough: the predicate is evaluated to both true and false within a single execution, in passing and failing runs alike, so whether it is ever true does not distinguish them; how often it is true does.
P_f(A) = ~P(A | A & !B)
P_t(A) = ~P(A | !(A & !B))
Program Predicates
A predicate is a proposition about any program property, e.g., idx < BUFSIZE, a + b == c, foo() > 0, ...
Each predicate can be evaluated multiple times during one execution, and every evaluation gives either true or false.
Therefore, a predicate is simply a Boolean random variable that encodes program executions from a particular aspect.
Evaluation Bias of Predicate P
Evaluation bias, definition: the probability of P being evaluated as true within one execution.
Maximum likelihood estimation: the number of true evaluations over the total number of evaluations in one run.
Each run gives one observation of the evaluation bias of predicate P.
Suppose we have n correct and m incorrect executions. For any predicate P, we end up with:
An observation sequence for correct runs: S_p = (X'_1, X'_2, ..., X'_n)
An observation sequence for incorrect runs: S_f = (X_1, X_2, ..., X_m)
Can we infer whether P is suspicious based on S_p and S_f?
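The estimation step follows directly from the definition; a small Python sketch with hypothetical per-run tallies (the triple format is ours):

```python
def evaluation_bias(n_true, n_total):
    """MLE of the evaluation bias of P in one run: #true / #evaluations."""
    return n_true / n_total

def bias_sequences(runs):
    """runs: iterable of (n_true, n_total, failed) triples, one per execution.
    Returns (S_p, S_f): observed biases for correct and incorrect runs."""
    s_p = [evaluation_bias(t, n) for t, n, failed in runs if not failed and n > 0]
    s_f = [evaluation_bias(t, n) for t, n, failed in runs if failed and n > 0]
    return s_p, s_f

# Three runs of one predicate: two passing, one failing.
s_p, s_f = bias_sequences([(1, 10, False), (2, 10, False), (9, 10, True)])
```

Runs in which the predicate's site is never reached contribute no observation, which is why the sketch skips n_total == 0.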
Underlying Populations
Imagine the underlying distributions of evaluation bias for correct and for incorrect executions; S_p and S_f can be viewed as random samples from the respective underlying populations.
One major heuristic: the larger the divergence between the two distributions, the more relevant the predicate P is to the bug.
[Figure: probability distributions of evaluation bias over [0, 1] for correct runs vs. incorrect runs]
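One simple instantiation of this heuristic is a standardized difference between the mean biases of the two samples. This is a hypothetical sketch for illustration, not SOBER's actual similarity-based statistic (which models S_f against the passing-run distribution):

```python
import math

def divergence_score(s_p, s_f):
    """Hypothetical bug-relevance score: how far apart the mean evaluation
    biases of failing and passing runs are, in standard-error units."""
    mean_p = sum(s_p) / len(s_p)
    mean_f = sum(s_f) / len(s_f)
    var_p = sum((x - mean_p) ** 2 for x in s_p) / len(s_p)
    var_f = sum((x - mean_f) ** 2 for x in s_f) / len(s_f)
    se = math.sqrt(var_p / len(s_p) + var_f / len(s_f))
    return abs(mean_f - mean_p) / se if se else abs(mean_f - mean_p)

# Biases cluster near 0.1 in passing runs but near 0.9 in failing runs:
score = divergence_score([0.1, 0.15, 0.1], [0.9, 0.85])
```

A predicate whose bias distribution is the same in both populations scores zero; predicates are then ranked by decreasing score.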
Major Challenges
We have no knowledge of the closed forms of either distribution.
Usually, we do not have enough incorrect executions to estimate the failing distribution reliably.
Our Approach
Algorithm Outputs
A ranked list of program predicates w.r.t. the bug relevance score s(P).
Higher-ranked predicates are regarded as more relevant to the bug.
What's the use? Top-ranked predicates suggest possible buggy regions, and several predicates may point to the same region.
Outline Program Predicates Predicate Rankings Experimental Results Case Study: bc-1.06 Future Work Conclusions
Experimental Results
Localization quality metric: software bug benchmark, quantitative metric
Related works: Cause Transition (CT) [CZ05], Statistical Debugging [LN+05]
Performance comparisons
Bug Benchmark
The dream benchmark: a large number of known bugs in large-scale programs with an adequate test suite.
Siemens Program Suite: 130 variants of 7 subject programs, each of 100-600 LOC; 130 known bugs in total, mainly logic (or semantic) bugs.
Advantages: the bugs are known, so judgments are objective; the number of bugs is large, so comparative studies are statistically significant.
Disadvantage: small-scale subject programs.
State-of-the-art performance, as claimed in the literature so far: the cause-transition approach [CZ05].
Localization Quality Metric [RR03]
1st Example
[Figure: BFS over a 10-node program dependence graph from the reported nodes]
T-score = 70%
2nd Example
[Figure: BFS over the same 10-node program dependence graph]
T-score = 20%
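The T-score computation can be sketched as a layer-wise breadth-first search over the program dependence graph: starting from the reported (blamed) nodes, expand one BFS layer at a time until a buggy node is covered, then report the percentage of nodes examined. This is our reading of [RR03]; the 10-node chain below is made up for illustration, not the figures' actual PDGs.

```python
from collections import deque

def t_score(graph, reported, buggy, total_nodes):
    """Percentage of PDG nodes examined by layer-wise BFS from the
    reported nodes before a buggy node is reached (lower is better)."""
    examined = set(reported)
    frontier = deque(reported)
    buggy = set(buggy)
    while frontier and not (examined & buggy):
        for _ in range(len(frontier)):      # expand exactly one BFS layer
            node = frontier.popleft()
            for nb in graph.get(node, ()):
                if nb not in examined:
                    examined.add(nb)
                    frontier.append(nb)
    return 100.0 * len(examined) / total_nodes

# A 10-node chain 1-2-...-10; the report blames node 1, the bug is at node 3.
chain = {i: [i - 1, i + 1] for i in range(2, 10)}
chain[1], chain[10] = [2], [9]
score = t_score(chain, reported=[1], buggy=[3], total_nodes=10)  # 30.0
```

Expanding whole layers (rather than node by node) matters: the metric charges the programmer for every node at the same dependence distance, since there is no reason to prefer one over another.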
Related Works
Cause Transition (CT) approach [CZ05]: a variant of delta debugging [Z02]; the previous state-of-the-art performance holder on the Siemens suite; published in ICSE'05, May 15, 2005. Cons: it relies on memory abnormality, so its performance is restricted.
Statistical Debugging (Liblit05) [LN+05]: predicate ranking based on discriminant analysis; published in PLDI'05, June 12, 2005. Cons: it ignores the evaluation patterns of predicates within each execution.
Localized bugs w.r.t. Examined Code
Cumulative Effects w.r.t. Code Examination
Top-k Selection
Regardless of the specific choice of k, both Liblit05 and SOBER outperform CT, the previous state-of-the-art holder.
From k = 2 to 10, SOBER is consistently better than Liblit05.
Outline Evaluation Bias of Predicates Predicate Rankings Experimental Results Case Study: bc-1.06 Future Work Conclusions
Case Study: bc 1.06
bc 1.06: 14288 LOC; an arbitrary-precision calculator shipped with most distributions of Unix/Linux.
Two bugs were localized: one was reported by Liblit in [LN+05]; the other was not reported previously.
This sheds some light on scalability.
Outline Evaluation Bias of Predicates Predicate Rankings Experimental Results Case Study: bc-1.06 Future Work Conclusions
Future Work
Further improve the localization quality.
Robustness to sampling.
Stress-test large-scale programs to confirm scalability to code size.
Conclusions
We devised a principled statistical method for bug localization, with no parameter-setting hassles.
It handles both crashing and non-crashing bugs.
Best localization quality so far.
Discussion
Features: easy implementation; difficult experimentation. More advanced statistical techniques may not be necessary; go wide, not deep.
Predicates are treated as independent random variables.
Can execution indexing help? Can statistical principles be combined with slicing or IWIH?