Using Mutation Analysis for Assessing and Comparing Test Coverage Criteria
ⓒ KAIST SE LAB 2013
IEEE Transactions on Software Engineering (TSE 2006)
James H. Andrews, Lionel C. Briand, Yvan Labiche, and Akbar Siami Namin
2013-06-11  Yoo Jin Lim
Contents
Introduction
Related Work
Experimental Description
Analysis Results
Conclusion
Discussion
Introduction (1/4)
Software testing research: find effective test techniques for software.
public void bar() {
    // print report
    if (report == null) println(report);
}
[Diagram: faults are either real or produced; produced faults are seeded manually or generated automatically via mutation analysis]
Introduction (2/4)
Mutant Generation
(+) Well-defined, scalable (–) Less realistic than hand-seeded faults
Mutation Analysis: the process of analyzing when mutants fail.
Source code:
for (int i = 1; i <= n; i++) fact = fact * i;
Mutant (after applying a mutation operator):
for (int i = 1; i < n; i++) fact = fact * i;
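As a minimal illustration of this idea (a hypothetical Python sketch, not the tooling used in the paper), the original and mutated loop can be written as two functions, and a test suite "kills" the mutant if at least one test tells them apart:

def factorial_original(n):
    fact = 1
    for i in range(1, n + 1):            # corresponds to: for (int i = 1; i <= n; i++)
        fact = fact * i
    return fact

def factorial_mutant(n):
    fact = 1
    for i in range(1, n):                # mutant: for (int i = 1; i < n; i++)
        fact = fact * i
    return fact

tests = [0, 1, 2, 5]                     # hypothetical test inputs
suite_kills_mutant = any(factorial_original(n) != factorial_mutant(n) for n in tests)
print("mutant killed by this suite:", suite_kills_mutant)   # True: inputs 2 and 5 expose it
# Over a set of mutants, the mutation detection ratio (Am, used later in the talk)
# is the fraction of non-equivalent mutants that the suite kills.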
What is the external validity of mutants in assessing the fault detection effectiveness of testing techniques?
Introduction (3/4)
Test coverage criteria: data flow criteria examine how variables are used.
• C-Use (Computation) coverage and P-Use (Predicate) coverage
Control flow criteria examine logical expressions.
• Block coverage and decision coverage
Subsumption Relationship*
[Subsumption hierarchy: All-Use subsumes C-Use and P-Use; P-Use subsumes Decision; Decision subsumes Block]
* Lori A. Clarke, Andy Podgurski, Debra J. Richardson, Steven J. Zeil, “A Comparison of Data Flow Path Selection Criteria,” 8th ICSE, 1985.
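As an illustrative data-structure view of the hierarchy above (a sketch of the relation as drawn on the slide, not code from the paper), the subsumption relation can be written as a mapping from each criterion to the criteria it directly subsumes:

subsumes = {
    "All-Use":  ["C-Use", "P-Use"],
    "P-Use":    ["Decision"],
    "C-Use":    [],
    "Decision": ["Block"],
    "Block":    [],
}

def transitively_subsumed(criterion, relation=subsumes):
    """Return every criterion transitively subsumed by the given one."""
    result = set()
    for weaker in relation.get(criterion, []):
        result.add(weaker)
        result |= transitively_subsumed(weaker, relation)
    return result

print(transitively_subsumed("All-Use"))   # contains C-Use, P-Use, Decision, Block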
Introduction (4/4)
Basis: The previous work (ICSE 2005)
• Generated mutants are similar to the real faults but different from hand-seeded faults, which are harder to detect than the real faults.
Goal: the current work (TSE 2006)
• Extend the study to “coverage criteria selected test suites” (vs. randomly selected test suites).
• Investigate their cost-effectiveness using:
  – fault detection
  – test suite size
  – control/data flow coverage
Related Work
• Frankl & Weiss (1991)
  – Coverage criteria: All-Use and Decision
  – Subject programs: 33–60 LOC
  – Faults: 7 real faults
  – Variables: fault detection, test suite size, and coverage
• Hutchins (1994)
  – Coverage criteria: Def-Use and Decision
  – Subject programs: 141–512 LOC
  – Faults: 130 hand-seeded faults
  – Variables: fault detection, test suite size, and coverage
• Frankl & Iakounenko (1998)
  – Coverage criteria: All-Use and Decision
  – Subject program: 6,218 LOC
  – Faults: 33 real faults
  – Variables: fault detection and coverage
• This work (TSE 2006)
  – Coverage criteria: Block, Decision, P-Use, and C-Use
  – Subject program: 6,218 LOC
  – Faults: 34 real faults and 736 mutants
  – Variables: fault detection, test suite size, and coverage
Experimental Description (1/4)
[Diagram: study variables (mutation detection, fault detection, fault detection effectiveness, test suite size as cost, control/data flow coverage, random test suites) connected by numbered arrows corresponding to research questions Q1–Q7]
Q1: Are mutation scores good predictors of actual fault detection ratios?
Q2: What is the cost of achieving given levels of the coverage criteria?
Q3: What levels of coverage should be achieved to obtain reasonable levels of fault detection effectiveness?
Q4: What is the cost effectiveness of the control and data flow coverage criteria?
Experimental Description (2/4)
[Diagram: same study variables as before, with a fault detection difficulty factor added; numbered arrows correspond to research questions Q1–Q7]
Q5: What is the gain of using coverage criteria compared to random testing?
Q6: Do we still find a statistically significant relationship between coverage level and fault detection effectiveness when we account for test suite size?
Q7: How is the cost-benefit analysis of coverage criteria affected by variations in fault detection difficulty?
Experimental Description (3/4)
Subject program: a mid-sized industrial program, space.c
• Total of 34 faulty versions
Mutant generation: four classes of mutation operators were applied (illustrated in the sketch below).
• Replace an integer constant by another integer.
• Replace an operator with another operator from the same class.
• Negate the decision in an if or while statement.
• Delete a statement.
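The following Python snippet only pictures what each operator class does to a C-like statement; the before/after examples are hypothetical and are not taken from space.c or from the authors' mutant generator.

# Hypothetical before/after examples for the four mutation operator classes.
mutation_examples = {
    "replace integer constant": ("x = x + 1;",             "x = x + 2;"),
    "replace operator":         ("if (a < b) max = b;",    "if (a <= b) max = b;"),
    "negate decision":          ("if (p != NULL) use(p);", "if (!(p != NULL)) use(p);"),
    "delete statement":         ("count = count + 1;",     ";"),
}

for name, (original, mutant) in mutation_examples.items():
    print(f"{name}:\n  original: {original}\n  mutant:   {mutant}")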
Experimental Description (4/4)
Test programs
• 11,379 mutant programs were generated.
• Every 10th mutant generated was run (1,137 mutants); 736 of them exhibited faulty behavior.
Overall experimental approach
• Generate test suites for various coverage criteria and levels, from 50% coverage up to high coverage.
  – Randomly select five test suites for each coverage interval: 50.00%–50.99%, …, 95.00%–95.99%.
• From the same test pool, generate (a sampling sketch follows below):
  – Coverage Suite = 230 test suites
  – Random Suite = 1,700 test suites
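A minimal sketch of this sampling scheme (hypothetical test pool, coverage measure, and suite sizes; not the authors' scripts):

import random

def coverage(suite):
    """Placeholder for a real coverage tool: return the %Coverage of a suite."""
    return random.uniform(40.0, 100.0)           # hypothetical stand-in value

def sample_suites(test_pool, suites_per_interval=5, intervals=range(50, 96)):
    """Randomly draw suites from the pool until every 1% coverage interval
    between 50% and 95% contains the requested number of suites."""
    buckets = {low: [] for low in intervals}
    while any(len(b) < suites_per_interval for b in buckets.values()):
        size = random.randint(1, len(test_pool))
        suite = random.sample(test_pool, k=size)
        low = int(coverage(suite))                # e.g. 57.3% falls in the 57.00-57.99% interval
        if low in buckets and len(buckets[low]) < suites_per_interval:
            buckets[low].append(suite)
    return buckets

pool = list(range(1000))                          # hypothetical test pool
# buckets = sample_suites(pool)                   # 46 intervals x 5 suites = 230 coverage suites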
Analysis Results (1/10)
Q1: Are mutation scores good predictors of actual fault detection ratios? Calculate the difference between the scores and ratios
• Am = mutation detection ratio (mutation score)
• Af = fault detection ratio
Calculate statistical significance of the difference.
• p-value < 0.001
Am will not significantly and systematically under/overestimate Af.
[Table: Descriptive Statistics for Af − Am]
Analysis Results (2/10)
Q1: Is Am a good predictor of Af? MRE: can we accurately predict Af based on Am?
• MRE = Magnitude of Relative Error = |Af – Am| / Af
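A tiny sketch of the two measures and the MRE, using hypothetical detection counts (only the totals of 736 mutants and 34 faults come from the study; the detected counts are made up for illustration):

mutants_killed, mutants_total = 600, 736      # killed count is hypothetical
faults_detected, faults_total = 28, 34        # detected count is hypothetical

Am  = mutants_killed / mutants_total          # mutation detection ratio (mutation score)
Af  = faults_detected / faults_total          # fault detection ratio
mre = abs(Af - Am) / Af                       # Magnitude of Relative Error

print(f"Am = {Am:.3f}, Af = {Af:.3f}, MRE = {mre:.3f}")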
Average MRE ranges from 0.083 to 0.098, i.e., a relative error under 10%.
MRE decreases as %Coverage increases.
Achieving high %Coverage therefore allows Af to be predicted more accurately.
Am is an unbiased and accurate predictor of Af.
Analysis Results (3/10)
Q1: Is Am a good predictor of Af? Perform linear regression between Af and Am.
Use Am as a measure of fault detection effectiveness
[Figure: linear regression of Af on Am; R = 0.908]
Analysis Results (4/10)
Q2: What is the cost of achieving the %Coverage? Use exponential regression.
Assume the effort associated with a coverage level is proportional to test suite size.
Achieving higher levels of coverage becomes increasingly expensive.
[Figure: exponential regressions of test suite size vs. %Coverage; R = 0.943 to 0.977]
Cost of achieving coverage level: Block < C-Use < Decision < P-Use
Analysis Results (5/10)
Q3: What levels of coverage should be achieved to obtain reasonable levels of Am? Use exponential regression.
Curves range from nearly linear to mildly exponential.
Am at a given level of coverage increases in the order Block, C-Use, Decision, P-Use.
Am achieved at a given coverage level: Block < C-Use < Decision < P-Use
Analysis Results (6/10)
Q4: What is the cost effectiveness of the control and data flow coverage criteria? Use exponential regression between Am and size
• Am = a × size^b
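A minimal sketch of fitting this model on synthetic data (the paper's actual fitting procedure may differ); here the fit is done by ordinary least squares in log-log space:

import numpy as np

# Synthetic (test suite size, Am) observations standing in for the measured suites.
size = np.array([10.0, 20.0, 40.0, 80.0, 160.0])
am   = np.array([0.42, 0.55, 0.68, 0.80, 0.90])

# Fit Am = a * size**b via the linear form log(Am) = log(a) + b * log(size).
b, log_a = np.polyfit(np.log(size), np.log(am), 1)
a = np.exp(log_a)
print(f"Am ~= {a:.3f} * size^{b:.3f}")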
None of the four criteria is more cost-effective than the others
[Table: Regression Coefficients, R², and Surface Areas for All Am-Size Models; R = 0.976]
Analysis Results (7/10)
Q5: What is the gain of using coverage criteria compared to random test suites (the null criterion)? Use exponential regression between Am and size.
The gain of using coverage criteria increases as test suite size increases.
Coverage criteria are more cost-effective than the null criterion.
[Figure: Am vs. test suite size for coverage-based and random suites; R = 0.99]
Analysis Results (8/10)
Q6: Is the effect of coverage simply a size effect? If so, random test suites would perform as well as coverage criteria test suites.
Use bivariate regression considering both test suite size and coverage level to explain variations in Am (see the sketch at the end of this slide).
Coverage criteria help develop more effective test suites beyond size effect.
PredAm is the Am predictor obtained by using size and coverage as covariates.
Both size and coverage play a complementary role in explaining fault detection despite their strong correlation.
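A minimal sketch of such a bivariate model on synthetic data (not the paper's data or its exact model form); PredAm below is simply the value fitted from both covariates:

import numpy as np

# Synthetic observations: test suite size, coverage level, and measured Am.
size     = np.array([10.0, 20.0, 40.0, 80.0, 160.0])
coverage = np.array([0.55, 0.65, 0.75, 0.85, 0.95])
am       = np.array([0.42, 0.55, 0.68, 0.80, 0.90])

# Ordinary least squares with both covariates: Am ~ b0 + b1*size + b2*coverage.
X = np.column_stack([np.ones_like(size), size, coverage])
coeffs, *_ = np.linalg.lstsq(X, am, rcond=None)
pred_am = X @ coeffs                      # PredAm: Am predicted from size and coverage together
print("coefficients (intercept, size, coverage):", coeffs)
print("PredAm:", pred_am)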
Analysis Results (9/10)
Q7: How is the cost-benefit analysis of coverage criteria affected by fault detection difficulty?
Use “hard” mutants (detected by < 5% of the test cases) and “very hard” mutants (detected by < 1.5% of the test cases); a classification sketch follows below.
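A small sketch of this classification, using hypothetical per-mutant detection ratios (only the 5% and 1.5% thresholds come from the slide):

# Fraction of all test cases that kill each mutant (hypothetical values).
detection_ratio = {"m1": 0.40, "m2": 0.049, "m3": 0.012, "m4": 0.20, "m5": 0.003}

hard      = [m for m, r in detection_ratio.items() if r < 0.05]    # detected by < 5% of test cases
very_hard = [m for m, r in detection_ratio.items() if r < 0.015]   # detected by < 1.5% of test cases

print("hard mutants:", hard)             # ['m2', 'm3', 'm5']
print("very hard mutants:", very_hard)   # ['m3', 'm5']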
Both subsets show a nearly linear relationship.
Similar relationships were reported by Hutchins et al., who had seeded much more difficult faults.
Analysis Results (10/10)
Q7: How is the cost-benefit analysis of coverage criteria affected by fault detection difficulty?
Am sharply increases in the highest 10-20% coverage levels.
Similar relationships were reported by Hutchins et al., who had seeded much more difficult faults.
The relationships between Am and test suite size or coverage are very sensitive to the fault detection difficulty.
Conclusion
Contributions
• Investigated the feasibility of using mutation analysis to assess the cost-effectiveness of control/data flow criteria.
  – Mutation analysis can be used in a research context to compare and assess new testing techniques.
• Applied mutation analysis to fundamental questions regarding the relationships between fault detection, test suite size, and control/data flow coverage.
• Suggested a way to tune the mutation analysis process to a specific environment’s fault detection pattern.
Discussion
Pros
• Precisely defined and justified analysis procedure
• Based on a large number of mutant programs
• Replicable experiment
Cons
• Only one small program with real faults
• Testing cost modeled by the size of test suites