
Page 1: Using Mutation Analysis for Assessing and Comparing Test Coverage Criteria

Using Mutation Analysis for Assessing and Comparing Test Coverage Criteria

ⓒ KAIST SE LAB 2013

IEEE Transactions on Software Engineering (TSE 2006)

James H. Andrews, Lionel C. Briand, Yvan Labiche, and Akbar Siami Namin

2013-06-11, Yoo Jin Lim

Page 2: Using Mutation Analysis for Assessing and Comparing Test Coverage Criteria

Contents

Introduction
Related Work
Experimental Description
Analysis Results
Conclusion
Discussion


Page 3: Using Mutation Analysis for Assessing and Comparing Test Coverage Criteria

Introduction (1/4)

Software testing research: find effective testing techniques for software.


public void bar() {
    // print report
    if (report == null)   // fault: should be (report != null)
        println(report);
}

[Figure: fault taxonomy. Faults are either real or produced; produced faults are seeded either manually (hand-seeded) or automatically, via mutation analysis.]

Page 4: Using Mutation Analysis for Assessing and Comparing Test Coverage Criteria

Introduction (2/4)

Mutant Generation
• (+) Well-defined, scalable
• (–) Less realistic than hand-seeded faults

Mutation Analysis: the process of analyzing when mutants fail (i.e., are detected by tests)


Source code:
    for (int i = 1; i <= n; i++)
        fact = fact * i;

Mutant (mutation operator: replace <= with <):
    for (int i = 1; i < n; i++)
        fact = fact * i;

(A generation sketch follows.)
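As a toy illustration of this kind of mutant generation (not the tool used in the paper; the operator list and the function relational_mutants are invented for the example), a Python sketch that enumerates relational-operator replacement mutants of one line:

import re

# Hypothetical "relational operator replacement" class: swap each relational
# operator occurrence for every other operator in the class.
OPS = ["<=", ">=", "==", "!=", "<", ">"]   # longest operators first, so the
OP_RE = re.compile("|".join(re.escape(op) for op in OPS))  # regex prefers them

def relational_mutants(line):
    """Yield one mutant per (operator occurrence, replacement) pair."""
    for match in OP_RE.finditer(line):
        for other in OPS:
            if other != match.group():
                yield line[:match.start()] + other + line[match.end():]

src = "for (int i = 1; i <= n; i++) fact = fact * i;"
for mutant in relational_mutants(src):
    print(mutant)   # five mutants, including the "<" mutant shown above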

What is the external validity of mutants in assessing the fault detection effectiveness of testing techniques?

Page 5: Using Mutation Analysis for Assessing and Comparing Test Coverage Criteria

Introduction (3/4)

Test coverage criteria
• Data flow criteria examine how variables are used.
  – C-Use (Computation) coverage and P-Use (Predicate) coverage
• Control flow criteria examine logical expressions.
  – Block coverage and decision coverage

Subsumption Relationship*


[Figure: subsumption hierarchy. All-Use subsumes C-Use and P-Use; P-Use subsumes Decision; Decision subsumes Block.]

* Lori A. Clarke, Andy Podgurski, Debra J. Richardson, Steven J. Zeil, "A Comparison of Data Flow Path Selection Criteria," 8th ICSE, 1985.

Page 6: Using Mutation Analysis for Assessing and Comparing Test Coverage Criteria

Introduction (4/4)

Basis: The previous work (ICSE 2005)


• Generated mutants are similar to the real faults but different from hand-seeded faults, which are harder to detect than the real faults.

• Used randomly selected test suites.

Goal: The current work (TSE 2006)
• Extend to coverage-criteria-selected test suites.
• Investigate their cost-effectiveness using:
  – fault detection
  – test suite size
  – control/data flow coverage

Page 7: Using Mutation Analysis for Assessing and Comparing Test Coverage Criteria

Related Work

• Frankl & Weiss (1991): All-Use and Decision; subject programs of 33–60 LOC; 7 real faults; variables: fault detection, test suite size, and coverage.
• Hutchins et al. (1994): Def-Use and Decision; 141–512 LOC; 130 hand-seeded faults; variables: fault detection, test suite size, and coverage.
• Frankl & Iakounenko (1998): All-Use and Decision; 6,218 LOC; 33 real faults; variables: fault detection and coverage.
• This work (TSE 2006): Block, Decision, P-Use, and C-Use; 6,218 LOC; 34 real faults and 736 mutants; variables: fault detection, test suite size, and coverage.


Page 8: Using Mutation Analysis for Assessing and Comparing Test Coverage Criteria

Experimental Description (1/4)


[Figure: experimental variables (fault detection effectiveness, test suite size as cost, control/data flow coverage, mutation detection, random test suites) connected by numbered relationships corresponding to the research questions below.]

Q1: Are mutation scores good predictors of actual fault detection ratios?

Q2: What is the cost of achieving given levels of the coverage criteria?

Q3: What levels of coverage should be achieved to obtain reasonable levels of fault detection effectiveness?

Q4: What is the cost effectiveness of the control and data flow coverage criteria?

Page 9: Using Mutation Analysis for Assessing and Comparing Test Coverage Criteria

Experimental Description (2/4)


[Figure: the same variable diagram, extended with fault detection difficulty (relationships 5–7).]

Q5: What is the gain of using coverage criteria compared to random testing?

Q6: Do we still find a statistically significant relationship between coverage level and fault detection effectiveness when we account for test suite size?

Q7: How is the cost-benefit analysis of coverage criteria affected by variations in fault detection difficulty?

Page 10: Using Mutation Analysis for Assessing and Comparing Test Coverage Criteria

Experimental Description (3/4)

Subject Program
• A mid-size industrial program: space.c
• Total of 34 faulty versions

Mutant Generation
• Four classes of mutation operators were applied (a sketch follows):
  – Replace an integer constant by another integer.
  – Replace an operator with another operator from the same class.
  – Negate the decision in an if or while statement.
  – Delete a statement.
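A line-based Python sketch of two of these operator classes (decision negation and statement deletion); this is only an illustration of the idea, not the generator that was applied to space.c:

import re

# Toy illustrations of two operator classes: negate the decision in an
# if/while statement, and delete a statement.

def negate_decision(line):
    """Wrap the condition of an if/while in a negation."""
    m = re.match(r"(\s*)(if|while)\s*\((.*)\)(.*)", line)
    if m:
        indent, kw, cond, rest = m.groups()
        return f"{indent}{kw} (!({cond})){rest}"
    return None

def delete_statement(line):
    """Replace a simple statement (one ending in ';') with an empty one."""
    return ";" if line.strip().endswith(";") else None

program = ["if (report == null)", "    println(report);"]
for lineno, line in enumerate(program, start=1):
    for operator in (negate_decision, delete_statement):
        mutated = operator(line)
        if mutated is not None:
            print(f"mutant at line {lineno}: {mutated!r}")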


Page 11: Using Mutation Analysis for Assessing and Comparing Test Coverage Criteria

Experimental Description (4/4)

Test programs
• 11,379 mutant programs were generated.
• Tests were run on every 10th mutant generated (1,137 mutants); 736 of them exhibited faulty behavior.

Overall experimental approach
• Generate test suites for various coverage criteria and levels; find test suites from 50% coverage up to high coverage.
• Randomly select five test suites for each of these intervals:
  – 50.00%–50.99% coverage
  – …
  – 95.00%–95.99% coverage

From the same test pool, generate:
• Coverage suites: 230 test suites
• Random suites: 1,700 test suites
(A selection sketch follows.)
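A sketch of how suites could be binned by coverage interval. Here measure_coverage (returning a percentage), the fixed candidate suite size, and all parameters are stand-ins for the real coverage tooling and suite construction, not the paper's actual procedure:

import random

# Sample candidate suites from the test pool and keep five per 1% coverage
# interval between 50% and 95%.
def select_coverage_suites(pool, measure_coverage, lo=50, hi=96, per_bin=5,
                           suite_size=20, attempts=100_000, seed=0):
    rng = random.Random(seed)
    bins = {b: [] for b in range(lo, hi)}        # one bin per 1% interval
    for _ in range(attempts):
        suite = rng.sample(pool, suite_size)
        b = int(measure_coverage(suite))         # e.g. 73.4% -> bin 73
        if b in bins and len(bins[b]) < per_bin:
            bins[b].append(suite)
            if all(len(s) == per_bin for s in bins.values()):
                break
    return bins

# usage: suites_by_interval = select_coverage_suites(test_pool, block_coverage)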

Page 12: Using Mutation Analysis for Assessing and Comparing Test Coverage Criteria

Analysis Results (1/10)

Q1: Are mutation scores good predictors of actual fault detection ratios?
• Calculate the difference between the scores and the ratios:
  – Am = mutation detection ratio (mutation score)
  – Af = fault detection ratio
• Calculate the statistical significance of the difference: p-value < 0.001.


Am does not significantly and systematically under- or overestimate Af.

[Table: Descriptive Statistics for Af − Am]

Page 13: Using Mutation Analysis for Assessing and Comparing Test Coverage Criteria

Analysis Results (2/10)

Q1: Is Am a good predictor of Af?
• Can we accurately predict Af based on Am?
• MRE (magnitude of relative error) = |Af − Am| / Af (a small example follows)
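A minimal sketch of this definition, with made-up numbers:

# Magnitude of relative error between the fault detection ratio Af and the
# mutation score Am, as defined above.
def mre(af, am):
    return abs(af - am) / af

# e.g. a suite with Am = 0.72 that detects 80% of the real faults:
print(mre(0.80, 0.72))   # ~0.1, i.e. a 10% relative error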


• Average MREs range from 0.083 to 0.098: relative error under 10%.
• MRE decreases as %Coverage increases.

Achieve high %Coverage to predict Af more accurately.

Page 14: Using Mutation Analysis for Assessing and Comparing Test Coverage Criteria

Analysis Results (3/10)

Q1: Is Am a good predictor of Af?
• Perform linear regression between Af and Am.

Am is an unbiased and accurate predictor of Af.


Use Am as a measure of fault detection effectiveness (regression R = 0.908; a sketch follows).
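A sketch of the regression step with illustrative (made-up) values, using Python's statistics module (3.10+):

from statistics import correlation, linear_regression

# Regress the fault detection ratio Af on the mutation score Am across
# test suites. The six (Am, Af) pairs below are invented for illustration.
am = [0.55, 0.63, 0.71, 0.80, 0.88, 0.95]
af = [0.52, 0.66, 0.69, 0.78, 0.90, 0.93]

slope, intercept = linear_regression(am, af)
print(f"Af ~ {slope:.2f} * Am + {intercept:.2f}, R = {correlation(am, af):.3f}")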

Page 15: Using Mutation Analysis for Assessing and Comparing Test Coverage Criteria

Analysis Results (4/10)

Q2: What is the cost of achieving a given %Coverage?
• Use exponential regression.
• Assume the effort associated with a coverage level is proportional to test suite size.


Achieving higher levels of coverage becomes increasingly expensive (R = 0.943 to 0.977).

Cost of achieving a coverage level: Block < C-Use < Decision < P-Use. (A fitting sketch follows.)
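One way to fit such an exponential cost model, size = a * exp(b * coverage), is a linear regression of log(size) on coverage; the data points below are made up, not the paper's:

import math
from statistics import linear_regression

# Fit size = a * exp(b * coverage) by regressing log(size) on coverage.
coverage = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
size = [4, 6, 10, 18, 40, 70]

b, log_a = linear_regression(coverage, [math.log(s) for s in size])
print(f"size ~ {math.exp(log_a):.2f} * exp({b:.2f} * coverage)")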

Page 16: Using Mutation Analysis for Assessing and Comparing Test Coverage Criteria

Analysis Results (5/10)

Q3: What levels of coverage should be achieved to obtain reasonable levels of Am?
• Use exponential regression.


Curves range from nearly linear to mildly exponential.

At a given coverage level, Am increases from Block, to C-Use, to Decision, to P-Use: Block < C-Use < Decision < P-Use.

Page 17: Using Mutation Analysis for Assessing and Comparing Test Coverage Criteria

Analysis Results (6/10)

Q4: What is the cost-effectiveness of the control and data flow coverage criteria?
• Use exponential regression between Am and size: Am = a * size^b (a fitting sketch follows).
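The power-law model can be fitted on log-log axes; again the data points are illustrative, not the paper's:

import math
from statistics import linear_regression

# Fit Am = a * size^b by linear regression on log-log axes (invented data).
size = [5, 10, 20, 40, 80, 160]
am = [0.45, 0.58, 0.68, 0.77, 0.84, 0.90]

b, log_a = linear_regression([math.log(s) for s in size],
                             [math.log(v) for v in am])
print(f"Am ~ {math.exp(log_a):.2f} * size^{b:.2f}")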


None of the four criteria is more cost-effective than the others (R = 0.976).

[Table: Regression Coefficients, R², and Surface Areas for All Am–Size Models]

Page 18: Using Mutation Analysis for Assessing and Comparing Test Coverage Criteria

Analysis Results (7/10)

Q5: What is the gain of using coverage criteria compared to random test suites (the null criterion)?
• Use exponential regression between Am and size.

The gain of using coverage criteria increases as test suite size increases.


Coverage criteria are more cost-effective than the null criterion.

Page 19: Using Mutation Analysis for Assessing and Comparing Test Coverage Criteria

Analysis Results (8/10)

Q6: Is the effect of coverage simply a size effect?
• If so, random test suites would perform as well as coverage-criteria test suites.
• Use bivariate regression considering both test suite size and coverage level to explain variations in Am (R = 0.99; a sketch follows the findings below).


Coverage criteria help develop more effective test suites beyond size effect.

PredAm is the Am predictor obtained using size and coverage as covariates.

Both size and coverage play a complementary role in explaining fault detection despite their strong correlation.
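A sketch of such a bivariate model, Am ~ b0 + b1*size + b2*coverage, fitted by least squares on made-up data (assumes NumPy):

import numpy as np

# Explain Am with both suite size and coverage level in one linear model.
size = np.array([5, 10, 20, 40, 80, 160], dtype=float)
coverage = np.array([0.52, 0.61, 0.70, 0.79, 0.88, 0.95])
am = np.array([0.46, 0.57, 0.67, 0.78, 0.85, 0.91])

X = np.column_stack([np.ones_like(size), size, coverage])
beta, *_ = np.linalg.lstsq(X, am, rcond=None)   # [b0, b1, b2]
pred_am = X @ beta                              # plays the "PredAm" role above

ss_res = np.sum((am - pred_am) ** 2)
ss_tot = np.sum((am - am.mean()) ** 2)
print(f"coefficients: {beta}, R^2 = {1 - ss_res / ss_tot:.3f}")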

Page 20: Using Mutation Analysis for Assessing and Comparing Test Coverage Criteria

Analysis Results (9/10)

Q7: How is the cost-benefit analysis of coverage criteria affected by fault detection difficulty?
• Use "hard" mutants, detected by < 5% of the test cases, and "very hard" mutants, detected by < 1.5% of the test cases (a classification sketch follows).
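The difficulty split can be expressed directly from the thresholds above (the detection counts in the usage lines are hypothetical):

# Classify a mutant's difficulty by the fraction of all test cases that
# detect it, using the thresholds from the slide.
def difficulty(detecting_tests, total_tests):
    ratio = detecting_tests / total_tests
    if ratio < 0.015:
        return "very hard"
    if ratio < 0.05:
        return "hard"
    return "ordinary"

print(difficulty(12, 1000))   # 1.2% of tests detect it -> "very hard"
print(difficulty(30, 1000))   # 3.0% -> "hard"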

Both subsets show a nearly linear relationship.

Similar relationships were reported by Hutchins et al., who had seeded much more difficult faults.


Page 21: Using Mutation Analysis for Assessing and Comparing Test Coverage Criteria

Analysis Results (10/10)

Q7: How is the cost-benefit analysis of coverage criteria affected by fault detection difficulty?

Am sharply increases in the highest 10-20% coverage levels.

Similar relationships were reported by Hutchins et al., who had seeded much more difficult faults.


The relationships between Am and test suite size or coverage are very sensitive to fault detection difficulty.

Page 22: Using Mutation Analysis for Assessing and Comparing Test Coverage Criteria

Conclusion

Contributions
• Investigated the feasibility of using mutation analysis to assess the cost-effectiveness of control/data flow criteria.
  – Mutation analysis can be used in a research context to compare and assess new testing techniques.
• Applied mutation analysis to fundamental questions regarding the relationships between fault detection, test suite size, and control/data flow coverage.
• Suggested a way to tune the mutation analysis process to a specific environment's fault detection pattern.


Page 23: Using Mutation Analysis for Assessing and Comparing Test Coverage Criteria

Discussion

Pros
• Precisely defined and justified analysis procedure
• Based on a large number of mutant programs: a replicable experiment

Cons
• Only one small program with real faults
• Testing cost modeled only by the size of test suites


Page 24: Using Mutation Analysis for Assessing and Comparing Test Coverage Criteria

ⓒ KAIST SE LAB 2013