CS590Z Statistical Debugging
Xiangyu Zhang
(part of the slides are from Chao Liu)
A Very Important Principle
Traditional debugging techniques deal with a single execution (or very few). Given a large set of executions, including both passing and failing ones, statistical debugging is often highly effective.
Sources of executions: failure reporting, in-house testing
Tarantula (ASE 2005, ISSTA 2007)
Scalable Remote Bug Isolation (PLDI 2004, 2005)
Look at predicates:
Branches
Function returns (<0, <=0, >0, >=0, ==0, !=0)
Scalar pairs: for each assignment x = ..., find all in-scope variables y_i and constants c_j, and form each pair x (==, <, <=, ...) y_i or c_j
Sample the predicate evaluations (Bernoulli sampling).
Investigate how the probability of a predicate being true relates to bug manifestation.
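The sampling idea can be sketched as follows. This is an illustrative Python sketch, not the actual CBI instrumentation (which inserts checks at compile time and amortizes the coin flips with a geometric countdown); the `observe` helper and the predicate name are ours.

```python
import random

def observe(counts, pred_id, value, rate=0.1, rng=random):
    """Record one predicate evaluation, kept with probability `rate`
    (Bernoulli sampling).  counts maps pred_id -> [true, false] tallies."""
    if rng.random() < rate:
        tally = counts.setdefault(pred_id, [0, 0])
        tally[0 if value else 1] += 1

# Usage: sample the branch predicate "m >= 0" over many evaluations.
rng = random.Random(0)
counts = {}
for _ in range(10_000):
    m = rng.randint(-5, 5)          # stand-in for the program's value of m
    observe(counts, "m >= 0", m >= 0, rate=0.1, rng=rng)

kept = sum(counts["m >= 0"])        # roughly 10% of 10,000 evaluations
```

Sparse sampling like this keeps the runtime overhead low enough to deploy on end-user machines, which is what makes remote bug isolation feasible.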
Bug Isolation
How much does P being true increase the probability of failure over simply reaching the line where P is sampled?
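Concretely, Liblit's PLDI 2005 ranking scores each predicate by Increase(P) = Failure(P) - Context(P), where Failure(P) is the chance a run fails given P was ever observed true, and Context(P) is the chance a run fails given P's site was ever reached. A minimal sketch with per-run tallies (the function names are ours):

```python
def failure(f_true, s_true):
    """Pr(run fails | P observed true): failing / (failing + successful)."""
    return f_true / (f_true + s_true)

def increase(f_true, s_true, f_obs, s_obs):
    """Increase(P) = Failure(P) - Context(P); a positive value means P being
    true raises the failure probability beyond merely reaching its site."""
    context = f_obs / (f_obs + s_obs)
    return failure(f_true, s_true) - context

# P is true only in failing runs; its site is reached in 100 runs, 10 of which fail:
score = increase(f_true=10, s_true=0, f_obs=10, s_obs=90)  # 1.0 - 0.1 = 0.9
```

A predicate that is true equally often in passing and failing runs gets a score near zero, so it drops out of the ranking.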
An Example
Symptoms:
563 lines of C code
130 out of 5542 test cases fail to give correct outputs
No crashes
The predicate is evaluated to both true and false in one execution
Buggy version (the (lastm != m) check is omitted):
void subline(char *lin, char *pat, char *sub)
{
int i, lastm, m;
lastm = -1;
i = 0;
while((lin[i] != ENDSTR)) {
m = amatch(lin, i, pat, 0);
if (m >= 0){
putsub(lin, i, m, sub);
lastm = m;
}
if ((m == -1) || (m == i)){
fputc(lin[i], stdout);
i = i + 1;
} else
i = m;
}
}
Correct version (with the (lastm != m) check):
void subline(char *lin, char *pat, char *sub)
{
int i, lastm, m;
lastm = -1;
i = 0;
while((lin[i] != ENDSTR)) {
m = amatch(lin, i, pat, 0);
if ((m >= 0) && (lastm != m) ){
putsub(lin, i, m, sub);
lastm = m;
}
if ((m == -1) || (m == i)){
fputc(lin[i], stdout);
i = i + 1;
} else
i = m;
}
}
Not enough: the predicate is evaluated to both true and false within a single execution, in passing and failing runs alike, so whether it is ever true does not distinguish them; how often it is true does.
P_f(A) = ~P(A | A & !B)
P_t(A) = ~P(A | !(A & !B))
Program Predicates
A predicate is a proposition about any program property, e.g., idx < BUFSIZE, a + b == c, foo() > 0, ...
Each predicate can be evaluated multiple times during one execution, and every evaluation gives either true or false.
Therefore, a predicate is simply a Boolean random variable that encodes program executions from a particular aspect.
Evaluation Bias of Predicate P
Evaluation bias, definition: the probability of P being evaluated as true within one execution.
Maximum likelihood estimation: the number of true evaluations over the total number of evaluations in one run.
Each run gives one observation of the evaluation bias of predicate P.
Suppose we have n correct and m incorrect executions. For any predicate P, we end up with:
An observation sequence for correct runs: S_p = (X'_1, X'_2, ..., X'_n)
An observation sequence for incorrect runs: S_f = (X_1, X_2, ..., X_m)
Can we infer whether P is suspicious based on S_p and S_f?
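The estimation step follows directly from the definition; a small Python sketch with hypothetical per-run tallies (the triple format is ours):

```python
def evaluation_bias(n_true, n_total):
    """MLE of the evaluation bias of P in one run: #true / #evaluations."""
    return n_true / n_total

def bias_sequences(runs):
    """runs: iterable of (n_true, n_total, failed) triples, one per execution.
    Returns (S_p, S_f): observed biases for correct and incorrect runs."""
    s_p = [evaluation_bias(t, n) for t, n, failed in runs if not failed and n > 0]
    s_f = [evaluation_bias(t, n) for t, n, failed in runs if failed and n > 0]
    return s_p, s_f

# Three runs of one predicate: two passing, one failing.
s_p, s_f = bias_sequences([(1, 10, False), (2, 10, False), (9, 10, True)])
```

Runs in which the predicate's site is never reached contribute no observation, which is why the sketch skips n_total == 0.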
Underlying Populations
Imagine the underlying distributions of evaluation bias for correct and for incorrect executions; S_p and S_f can be viewed as random samples from the respective underlying populations.
One major heuristic: the larger the divergence between the two distributions, the more relevant the predicate P is to the bug.
[Figure: probability distributions of evaluation bias over [0, 1] for correct runs vs. incorrect runs]
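One simple instantiation of this heuristic is a standardized difference between the mean biases of the two samples. This is a hypothetical sketch for illustration, not SOBER's actual similarity-based statistic (which models S_f against the passing-run distribution):

```python
import math

def divergence_score(s_p, s_f):
    """Hypothetical bug-relevance score: how far apart the mean evaluation
    biases of failing and passing runs are, in standard-error units."""
    mean_p = sum(s_p) / len(s_p)
    mean_f = sum(s_f) / len(s_f)
    var_p = sum((x - mean_p) ** 2 for x in s_p) / len(s_p)
    var_f = sum((x - mean_f) ** 2 for x in s_f) / len(s_f)
    se = math.sqrt(var_p / len(s_p) + var_f / len(s_f))
    return abs(mean_f - mean_p) / se if se else abs(mean_f - mean_p)

# Biases cluster near 0.1 in passing runs but near 0.9 in failing runs:
score = divergence_score([0.1, 0.15, 0.1], [0.9, 0.85])
```

A predicate whose bias distribution is the same in both populations scores zero; predicates are then ranked by decreasing score.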
Major Challenges
We have no knowledge of the closed forms of either distribution.
Usually, we do not have enough incorrect executions to estimate the failing distribution reliably.
Our Approach
Algorithm Outputs
A ranked list of program predicates w.r.t. the bug relevance score s(P).
Higher-ranked predicates are regarded as more relevant to the bug.
What's the use? Top-ranked predicates suggest possible buggy regions, and several predicates may point to the same region.
Outline Program Predicates Predicate Rankings Experimental Results Case Study: bc-1.06 Future Work Conclusions
Experimental Results
Localization quality metric: software bug benchmark, quantitative metric
Related works: Cause Transition (CT) [CZ05], Statistical Debugging [LN+05]
Performance comparisons
Bug Benchmark
The dream benchmark: a large number of known bugs in large-scale programs with an adequate test suite.
Siemens Program Suite: 130 variants of 7 subject programs, each of 100-600 LOC; 130 known bugs in total, mainly logic (or semantic) bugs.
Advantages: the bugs are known, so judgments are objective; the number of bugs is large, so comparative studies are statistically significant.
Disadvantage: small-scale subject programs.
State-of-the-art performance, as claimed in the literature so far: the cause-transition approach [CZ05].
Localization Quality Metric [RR03]
1st Example
[Figure: BFS over a 10-node program dependence graph from the reported nodes]
T-score = 70%
2nd Example
[Figure: BFS over the same 10-node program dependence graph]
T-score = 20%
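The T-score computation can be sketched as a layer-wise breadth-first search over the program dependence graph: starting from the reported (blamed) nodes, expand one BFS layer at a time until a buggy node is covered, then report the percentage of nodes examined. This is our reading of [RR03]; the 10-node chain below is made up for illustration, not the figures' actual PDGs.

```python
from collections import deque

def t_score(graph, reported, buggy, total_nodes):
    """Percentage of PDG nodes examined by layer-wise BFS from the
    reported nodes before a buggy node is reached (lower is better)."""
    examined = set(reported)
    frontier = deque(reported)
    buggy = set(buggy)
    while frontier and not (examined & buggy):
        for _ in range(len(frontier)):      # expand exactly one BFS layer
            node = frontier.popleft()
            for nb in graph.get(node, ()):
                if nb not in examined:
                    examined.add(nb)
                    frontier.append(nb)
    return 100.0 * len(examined) / total_nodes

# A 10-node chain 1-2-...-10; the report blames node 1, the bug is at node 3.
chain = {i: [i - 1, i + 1] for i in range(2, 10)}
chain[1], chain[10] = [2], [9]
score = t_score(chain, reported=[1], buggy=[3], total_nodes=10)  # 30.0
```

Expanding whole layers (rather than node by node) matters: the metric charges the programmer for every node at the same dependence distance, since there is no reason to prefer one over another.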
Related Works
Cause Transition (CT) approach [CZ05]: a variant of delta debugging [Z02]; the previous state-of-the-art performance holder on the Siemens suite; published in ICSE'05, May 15, 2005. Cons: it relies on memory abnormality, so its performance is restricted.
Statistical Debugging (Liblit05) [LN+05]: predicate ranking based on discriminant analysis; published in PLDI'05, June 12, 2005. Cons: it ignores the evaluation patterns of predicates within each execution.
Localized bugs w.r.t. Examined Code
Cumulative Effects w.r.t. Code Examination
Top-k Selection
Regardless of the specific choice of k, both Liblit05 and SOBER outperform CT, the previous state-of-the-art holder.
From k = 2 to 10, SOBER is consistently better than Liblit05.
Outline Evaluation Bias of Predicates Predicate Rankings Experimental Results Case Study: bc-1.06 Future Work Conclusions
Case Study: bc 1.06
bc 1.06: 14288 LOC; an arbitrary-precision calculator shipped with most distributions of Unix/Linux.
Two bugs were localized: one was reported by Liblit in [LN+05]; the other was not reported previously.
This sheds some light on scalability.
Outline Evaluation Bias of Predicates Predicate Rankings Experimental Results Case Study: bc-1.06 Future Work Conclusions
Future Work
Further improve the localization quality.
Robustness to sampling.
Stress-test large-scale programs to confirm scalability to code size.
Conclusions
We devised a principled statistical method for bug localization, with no parameter-setting hassles.
It handles both crashing and non-crashing bugs.
Best localization quality so far.
Discussion
Features: easy implementation; difficult experimentation. More advanced statistical techniques may not be necessary; go wide, not deep.
Predicates are treated as independent random variables.
Can execution indexing help? Can statistical principles be combined with slicing or IWIH?