Test Suite Minimization: An Empirical Investigation
by
Jeffery von Ronne
A PROJECT
submitted to
Oregon State University
University Honors College
in partial fulfillment of the requirements for the degree of
Honors Bachelors of Science in Computer Science (Honors Scholar)
Presented May 28, 1999. Commencement June 1999.
AN ABSTRACT OF THE THESIS OF

Jeffery von Ronne for the degree of Honors Bachelors of Science in Computer Science
presented on May 28, 1999. Title: Test Suite Minimization: An Empirical Investigation.

Abstract approved: Gregg Rothermel
Test suite minimization techniques attempt to reduce the cost of
saving and reusing tests during software maintenance, by eliminating
redundant tests from test suites. A potential drawback of these
techniques is that in minimizing a test suite, they might reduce
the ability of that test suite to reveal faults in the software.
Previous studies have shown that sometimes this reduction is small,
but sometimes this reduction is severe. This work investigates
the minimization process, what factors can affect its performance,
and techniques for reducing this loss.
Test Suite Minimization: An Empirical Investigation

by

Jeffery von Ronne

A PROJECT

submitted to

Oregon State University

University Honors College

in partial fulfillment of the requirements for the degree of

Honors Bachelors of Science in Computer Science (Honors Scholar)

Presented May 28, 1999. Commencement June 1999.

APPROVED:

Honors Bachelors of Science in Computer Science project of Jeffery von Ronne presented on May 28, 1999.

Mentor, representing Computer Science

Committee Member, representing Mathematics

Committee Member and Chair, Department of Computer Science

Dean of University Honors College

I understand that my project will become part of the permanent collection of Oregon State
University Honors College. My signature below authorizes release of my project to any
reader upon request.

Jeffery von Ronne, Author
Acknowledgment
Many thanks are due to Dr. Rothermel, who provided much advice and guidance during
the past year, and who collaborated on the work in this thesis.
My other committee members were Dr. Robby Robson and Dr. Michael Quinn.
Dr. Roland Untch of Middle Tennessee State University provided the mutation data
necessary for the experiments with PSSC minimization. Chengyun Chu prepared the Space
program and assisted in the preparation of the mutation data.
Dr. Mary Jean Harrold and Christie Hong of Ohio State University and Jeffery Ostrin also
collaborated on parts of this work.
The "Siemens" programs were provided by Siemens Corporate Research. The Space
program came from the European Space Agency via Drs. Pasquini and Phyllis.
The NSF funded my work through a Research Experience for Undergraduates grant to Dr.
Rothermel. The equipment and other collaborators were funded in part by grants from
Microsoft and the NSF.
Thanks everyone.
Contributing Co-Authors
The second and third chapters of this thesis are based on an article entitled "Experiments
to Assess the Cost-Benefits of Test Suite Minimization" by Dr. Gregg Rothermel, Dr. Mary
Jean Harrold (Ohio State University), Christie Hong (Ohio State University), and myself,
which is currently in preparation for submission to Transactions on Software Engineering,
and is a revised and expanded version of an earlier paper, entitled "An empirical study of the
effects of minimization on the fault detection capabilities of test suites," which was authored
by Dr. Gregg Rothermel, Dr. Mary Jean Harrold, Christie Hong, and Jeffery Ostrin, and
presented at the November 1998 International Conference on Software Maintenance.
Table of Contents

1. Introduction and Motivation
1.1. Motivation
1.2. Overview of This Thesis
2. Background and Literature Review
2.1. Test suite minimization
2.2. Previous empirical work
2.2.1. The Wong98 study
2.2.2. The Wong97 study
3. Edge-Minimization Experiments
3.1. Research Questions
3.2. Measures and Tools
3.2.1. Measures
3.2.1.1. Measuring savings
3.2.1.2. Measuring costs
3.2.2. Tool infrastructure
3.3. Experiments with smaller C programs
3.3.1. Subject programs, faulty versions, test cases, and test suites
3.3.2. Experiment design
3.3.3. Threats to validity
3.3.4. Minimization of edge-coverage-adequate test suites
3.3.4.1. Test suite size reduction
3.3.4.2. Fault detection effectiveness reduction
3.3.5. Minimization of randomly generated test suites
3.3.5.1. Test suite size reduction
3.3.5.2. Fault detection effectiveness reduction
3.4. Experiment with the Space Program
3.4.1. Subject program, faulty versions, test cases, and test suites
3.4.2. Experiment design
3.4.3. Threats to validity
3.4.4. Data and Analysis
3.4.4.1. Test suite size reduction
3.4.4.2. Fault detection effectiveness reduction
3.5. Comparison to Previous Empirical Results
4. A New Minimization Technique
4.1. Mutation Analysis and Minimization
4.1.1. Mutation Analysis and Sensitivity
4.1.2. Adapting Sensitivity for use as a Coverage Criterion
4.2. An Algorithm to Facilitate Minimization based on PSSC
4.2.1. A Conventional Test Suite Minimization Heuristic
4.2.2. A Multi-Hit Minimization Algorithm
4.2.3. Using the Multi-Hit Reduction Algorithm for PSSC Minimization
4.2.4. Asymptotic Analysis of the Multi-Hit Reduction Algorithm
4.3. An Experiment with PSSC Minimization
4.3.1. Experimental Design
4.3.2. Results
4.3.2.1. Minimized Test Suite Size
4.3.2.2. Minimized Test Suite Performance
5. Conclusion
5.1. Results
5.2. Practical Implications
5.3. Limitations of This Investigation and Future Work
Bibliography
List of Figures

3-1. Percentage of Inputs that Expose Each Fault
3-2. Size Distribution among Unminimized Test Suites for the Siemens Programs
3-3. Size of Minimized vs. Size of Original Test Suites
3-4. Percent Reduction in Test Suite Size vs. Original Test Suite Size
3-5. Minimization: Percentage Effectiveness Reduction vs. Original Size
3-6. Effectiveness in Original and after Minimization vs. Original Size
3-7. Random Reduction: Percentage Effectiveness Reduction vs. Original Suite Size
3-8. Minimization and Random Reduction: Fault Detection vs. Original Size
3-9. Random Reduction: Percent Effectiveness Reduction
3-10. Percentage of Test Cases that Expose each of Space's Faults
3-11. Size of Minimized Test Suites vs. Size of Original Test Suites
3-12. Percent Reduction in Test Suite vs. Original Test Suite Size
3-13. Percent Reduction in Effectiveness vs. Original Size
3-14. Original and Minimized: Faults Detected vs. Original Size
4-1. The Harrold, Gupta, and Soffa Test Suite Minimization Algorithm
4-2. A Multi-Hit Test Suite Reduction Algorithm
4-3. A C Program
4-4. Sizes of Test Suites after PSSC Minimization
4-5. Average Test Suite Size vs. Average Number of Faults Detected
List of Tables

3-1. The Siemens Programs
3-2. Correlation Between Size Reduction and Original Size
3-3. Minimization: Correlation between Effectiveness Reduction and Original Size
3-4. Random Reduction: Correlation between Effectiveness Loss and Original Size
3-5. Comparison of Fault Detection Reduction
3-6. Comparison of Fault Detection Reduction Variance
3-7. The Space Application
3-8. Correlation between Size Reduction and Initial Size
3-9. Correlation between Initial Size and Effectiveness Reduction
3-10. Average Reductions in Fault Detection Effectiveness
3-11. Fault detection abilities of tests used in the Wong98 study
4-1. The Initial Test Suite for the Example Program
4-2. The Coverage Requirements for the Example Program
Chapter 1. Introduction and Motivation
1.1. Motivation
Testing is an important but expensive task necessary for the construction of high quality
software. As such, there is great potential for any practical technique that enables the
detection of more faults with limited software testing funds. One testing strategy is to
orient the testing regimen around concrete, achievable criteria. These include functional
tests, designed to exercise the program's documented features, and structural tests,
designed to exercise each statement in the program. It is thought that a testing regimen
designed around explicit criteria such as those just mentioned is more effective than either
random or ad hoc testing.1 In fact, experimentation, such as that done by researchers at
Siemens, has shown that structural testing based on either control-flow or data-flow coverage
criteria can achieve significantly better fault detection than random testing [Hutchins94].
Coverage criteria are also used as a stopping point to decide when a program is sufficiently
tested. In this case, additional tests are added until the test suite has achieved a specified
coverage level according to a specific adequacy criterion. For example, to achieve statement
coverage adequacy for a program, one would add test cases to the test suite
until each statement in that program is executed by at least one of the test cases.
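The adequacy check and the augmentation loop are mechanical. The following minimal sketch (Python; the data structures and helper names are illustrative assumptions, not from the thesis: each test case is mapped to the set of statements it executes) shows one way they might be realized:

```python
def is_statement_adequate(coverage, all_statements):
    """True iff every statement is executed by at least one test case.
    coverage: dict mapping each test case to the set of statements it executes."""
    covered = set().union(*coverage.values()) if coverage else set()
    return all_statements <= covered

def augment_to_adequacy(suite, candidate_pool, all_statements):
    """Add candidate test cases until the suite is statement-coverage adequate
    (or the pool is exhausted). Both suite arguments are test-to-statements dicts."""
    suite = dict(suite)
    covered = set().union(*suite.values()) if suite else set()
    for test, statements in candidate_pool.items():
        if all_statements <= covered:
            break
        if statements - covered:      # contributes at least one new statement
            suite[test] = statements
            covered |= statements
    return suite
```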
It is often the case that as a program evolves, additional tests are needed to maintain ade-
quate coverage. Sometimes, as the test suite grows, it can become prohibitively expensive
1. Random testing is selecting inputs at random, from some input distribution, and using those as test cases. Ad hoc testing is testing with inputs chosen by the tester with no explicit selection criteria.
to execute on new versions of the program. These test suites will often contain test cases
that are no longer needed to satisfy the coverage criteria, because they are now obsolete or
redundant [Chen96, Harrold93, Horgan92, Offutt95].2 For example, Harrold et al. propose
that a reduced test suite, made up of the smallest subset of the test cases that still exercises
all of the coverage items, could be used in place of the original test suite [Harrold93]. The
reduced subset of the original test suite will be referred to as a minimized test suite, and the
process of obtaining the minimized test suite will be called minimization.
Unfortunately, minimized test suites are not without drawbacks. In addition to the cost
of determining the reduced set, minimization may remove test cases that detect program
faults that are not detected by other test cases that satisfy the same criterion. In the worst
case, a minimized test suite may detect none of the faults that would be detected by the
original test suite. This work begins to quantify this loss over a limited
range of coverage criteria, programs, program faults, and test cases and compares it to the
benefit in reduced test suite size.
1.2. Overview of This Thesis
Some studies have shown that minimization can result in significant savings in test suite
size with little reduction in the ability of the minimized test suite to detect faults [Wong95,
Wong97, Wong98]. This work, however, shows that this is not necessarily the case. For the
combination of programs, faults, and types of test suites we utilized in two empirical stud-
2. Obsolete test cases no longer exercise any coverage items. Redundant test cases are those that exercise only coverage items that are also exercised by other test cases in the test suite.
ies, the loss in fault detection was substantial. While a third study showed a less extreme
loss in fault detection, that loss was still both statistically and practically significant.
These findings motivated the search for alternative coverage criteria that could be used in
place of or in conjunction with structural criteria. This resulted in a new coverage criterion:
Probabilistic Statement Sensitivity Coverage. In the process, a new minimization heuristic
was developed.
The next chapter will discuss coverage criteria, test suite minimization, and previous work.
The third chapter will discuss the experiments we conducted to assess the performance of
a conventional minimization technique. Chapter 4 introduces the PSSC criterion, explains
how it could be used, and compares its performance to conventional techniques. Finally,
the conclusion will recap our experiment results, explain the practical consequences of this
work, and suggest areas for further study.
Chapter 2. Background and Literature Review
2.1. Test suite minimization
The test suite minimization problem may be stated as follows [Harrold93]:

Given: Test suite T; a set of test case requirements r1, r2, ..., rn that must be satisfied to
provide the desired test coverage of the program; and subsets of T, T1, T2, ..., Tn, one
associated with each of the ri, such that any one of the test cases tj belonging to Ti can be
used to test ri.

Problem: Find a representative set of test cases from T that satisfies all of the ri.

The ri in the foregoing statement can represent various test case requirements, such as
source statements, decisions, definition-use associations, or specification items.

A representative set of test cases that satisfies all of the ri must contain at least one test
case from each Ti; such a set is called a hitting set of the group of sets T1, T2, ..., Tn. To
achieve a maximum reduction, it is necessary to find the smallest representative set of test
cases. However, this subset of the test suite is a minimum cardinality hitting set of the
Tis, and the problem of finding such a set is NP-complete [Garey79]. Thus, minimization
techniques resort to heuristics.
Several test suite minimization techniques have been proposed (e.g., [Chen96, Harrold93,
Horgan92, Offutt95]); in this work we utilize the technique of Harrold, Gupta, and Soffa
[Harrold93].
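To make the hitting-set formulation concrete, the sketch below implements the classic greedy heuristic, which repeatedly selects the test case that satisfies the most remaining requirements. This is a simpler stand-in, not the Harrold, Gupta, and Soffa algorithm itself (which additionally orders requirements by the cardinality of their Ti sets); Python and the data layout are illustrative assumptions.

```python
def greedy_minimize(requirements):
    """Greedy approximation of a minimum representative set.
    requirements: dict mapping each requirement r_i to the set T_i of
    test cases that satisfy it. Returns a hitting set of the T_i."""
    unsatisfied = {r: set(t) for r, t in requirements.items() if t}
    selected = set()
    while unsatisfied:
        # Count how many remaining requirements each test case satisfies.
        counts = {}
        for tests in unsatisfied.values():
            for t in tests:
                counts[t] = counts.get(t, 0) + 1
        best = max(counts, key=counts.get)
        selected.add(best)
        # Drop every requirement the chosen test case satisfies.
        unsatisfied = {r: t for r, t in unsatisfied.items() if best not in t}
    return selected
```

For example, greedy_minimize({"r1": {"t1", "t2"}, "r2": {"t2"}, "r3": {"t3"}}) returns {"t2", "t3"}.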
2.2. Previous empirical work
Many empirical studies of software testing have been performed. Some of these studies,
such as those reported in References [Frankl93, Hutchins94, Wong94], provide only indi-
rect data about the effects of test suite minimization through consideration of the effects of
test suite size on costs and benefits of testing. Other studies, such as the study reported in
Reference [Graves98], provide only indirect data about the effects of test suite minimiza-
tion through a comparison of regression test selection techniques that practice or do not
practice minimization.1
Recent studies by Wong, Horgan, London, and Mathur [Wong95, Wong98]2 and Wong,
Horgan, Mathur, and Pasquini [Wong97], however, directly examine the costs and benefits
of test suite minimization. We refer to these studies collectively as the “Wong” studies, and
individually as the “Wong98” and “Wong97” studies. We summarize the results of these
studies here; the references provide further details.
2.2.1. The Wong98 study
The Wong98 study involved ten common C UNIX utility programs, including nine pro-
grams ranging in size from 90 to 289 lines of code, and one program of 842 lines of code.
1. Whereas minimization considers a program and test suite, regression test selection considers a program, test suite, and modified program version, and selects test cases that are appropriate for that version without removing them from the test suite. The problems of regression test selection and test suite minimization are thus related but distinct. For further discussion of regression test selection see Reference [Rothermel96].
2. Reference [Wong98] (1998) extends work reported earlier in Reference [Wong95] (1995); thus, except where otherwise noted, we here focus on the most recent (1998) reference.
For each of these programs, the researchers used a random domain-based test generator to
generate an initial test case pool; the number of test cases in these pools ranged from 156
to 997. No attempt was made, in generating these pools, to achieve complete coverage of
program components (blocks, decisions, or definition-use associations).
The researchers next drew multiple distinct test suites from their test case pools, by ran-
domly selecting test cases. The resulting test suites achieved basic block coverages ranging
from 50% to 95%; overall, 1198 test suites were generated. Reference [Wong98] reports
the sizes of the resulting test suites as averages over groups of test suites that achieved
similar coverage: 270 test suites belonged to groups in which average test suite size ranged
from 9.07 to 33.73 test cases, and 928 test suites belonged to groups in which average test
suite size ranged from only 1 to 4.43 test cases.
The researchers enlisted graduate students to inject simple mutation-like faults into each
of the subject programs. The researchers excluded faults that could not be detected by any
test case. All told, 181 faulty versions of the programs were retained for use in the study.
To assess the difficulty of detecting these faults, the researchers measured the percentages
of test cases, in the associated test pools, that were able to detect the faults. Of the 181
faults, 78 (43%) were “Quartile I” faults detectable by fewer than 25% of the associated
test cases, 42 (23%) were “Quartile II” faults detectable by between 25% and 50% of the
associated test cases, 37 (20%) were “Quartile III” faults detectable by between 50% and
75% of the associated test cases, and 24 (13%) were “Quartile IV” faults detectable by at
least 75% of the associated test cases.
The researchers minimized their test suites using ATACMIN [Horgan92], a minimization
tool based on an implicit enumeration algorithm that found exact minimization solutions
for all of the test suites utilized in the study. Test suites were minimized with respect to
block, decision, and all-uses dataflow coverage. The researchers measured the reduction
in test suite size achieved through minimization, and the reduction in fault-detection effec-
tiveness of the minimized test suites. The researchers also repeated this procedure on the
entire test pools (effectively treating these test pools as if they were test suites). Finally,
they used null hypothesis checking to determine whether the minimized test suites had bet-
ter fault detection capabilities than test suites of the same size generated randomly from
the unminimized test suites.
The researchers drew several overall conclusions from the study, including the following:
• As the coverage achieved by initial test suites increased, minimization produced greater
savings with respect to those test suites, at rates ranging from 0% (for several of the
50-55% coverage suites) to 72.79% (for one of the 90-95% block coverage suites).
• As the coverage achieved by initial test suites increased, minimization produced greater
losses in the fault-detection effectiveness of those suites. However, losses in fault detec-
tion effectiveness were small compared to savings in test suite size: in all but one case,
reductions were less than 7.27 percent, and most reductions were less than 4.99 percent.
• Fault difficulty partially determined whether minimization caused losses in fault-
detection effectiveness: Quartile I and II faults were more easily missed than Quartile
III and IV faults following minimization.
• The null hypothesis testing showed that minimized test suites retain a size/effectiveness
advantage over their random counterparts.
The authors draw the following overall conclusion:
...when the size of a test set is reduced while the coverage is kept constant, there is little or no
reduction in its fault detection effectiveness.... A test set which is minimized to preserve its
coverage is likely to be as effective for detecting faults at a lower execution cost. [Wong98].
2.2.2. The Wong97 study
Whereas the Wong98 study examined test suite minimization on 10 common Unix utili-
ties, the Wong97 study involved a single C application developed for the European Space
Agency to aid in the management of large antenna arrays. At 6,100 executable lines, this
application is several times the size of the largest program used for the Wong98 study.
Unlike the Wong98 study, in which an initial pool of test cases was generated randomly
based solely on program specifications, the Wong97 study used a pool of 1000 test cases
generated based on an operational profile.
In the Wong98 study, test suites were generated and categorized based on block coverage.
For the Wong97 study, two different procedures were followed for generating test suites:
the first to create test suites of fixed size, and the second to create test suites of fixed block-
coverage. For the fixed size test suites, test cases were chosen randomly from the test
pool until the desired number of test cases had been selected. In all, 120 test suites were
generated in this manner: 30 distinct test suites for each of the target sizes of 50, 100,
150, 200. For the fixed coverage test suites, test cases were chosen randomly from the test
pool until the test suite reached the desired coverage. Only test cases that added coverage
were added to the fixed coverage test suites. In all, 180 test suites were generated in this
manner: 30 distinct test suites for each of the target coverages ranging from 50% to 75%
block coverage.
Whereas the faults in the Wong98 study were injected by graduate students, the faults used
in the Wong97 study were obtained from an error log maintained during the creation of
the application. The researchers selected 16 of these faults, of which all but one were
detected by fewer than 7% of the test cases, making them similar in detection difficulty to
the “Quartile I” faults used in the Wong98 study. The exceptional fault was detected by
320 (32%) of the test cases.
As in the Wong98 study, all of the test suites were minimized using ATACMIN. In both
studies, the size of each test suite was reduced, while the coverage was kept constant. In
the Wong97 study, however, minimization with respect to block coverage was the only
minimization attempted. Reduction in test suite size and in fault detection effectiveness
were measured. Finally, null hypothesis testing was used to compare test suites minimized
for coverage to test suites that were randomly minimized.
The researchers drew the following overall conclusions from the study:
• There were substantial reductions in size achieved from minimizing the fixed size test
suites. For the fixed coverage test suites, reductions in size also occurred but were
smaller.
• As in the Wong98 study, the effectiveness reductions of the minimized test suites
were smaller than the size reductions, so that minimized test suites resulted in a
size/effectiveness advantage over the unminimized test suites. The average effective-
ness reduction due to minimization was less than 7.3%, and most reductions were less
than 3.6%.
• The null hypothesis testing again showed that minimized test suites retain a
size/effectiveness advantage over their random counterparts.
Thus, the Wong97 study supports the findings of the Wong98 study, while broadening the
scope of the study in terms of both the programs under scrutiny and the types of initial test
suites utilized.
Chapter 3. Edge-Minimization Experiments
3.1. Research Questions
The Wong studies leave a number of open research questions, primarily concerning the
extent to which the results observed in those studies generalize to other testing situations.
Among the open questions are the following, which motivate the present work.
1. How does minimization fare in terms of costs and benefits when test suites have a
wider range of sizes than the test suites utilized in the Wong studies?
2. How does minimization fare in terms of costs and benefits when test suites are
coverage-adequate?
3. How does minimization fare in terms of costs and benefits when test suites contain
additional coverage-redundant test cases?
The first and third questions are addressed by the Wong97 study in its use of fixed-size
test suites; however, that study examines only one program. Neither of the Wong studies
considers the second question.
Test suites used in practice often contain test cases designed not for code coverage, but
rather, designed to exercise product features, specification items, or exceptional behaviors.
Such test suites may contain larger numbers of test cases, and larger numbers of coverage-
redundant test cases, than the test suites utilized in the Wong98 study, or than the coverage-
based test suites utilized in the Wong97 study.
Similarly, a typical tactic for utilizing coverage-based testing is to begin with a base of
specification-based tests, and add additional tests to achieve complete coverage. Such test
suites may also contain greater coverage-redundancy than the coverage-based test suites
utilized in the Wong studies, but can be expected to distribute coverage more evenly than
the fixed-size test suites constructed by random selection for the Wong97 study.
It is important to understand the cost-benefit tradeoffs involved in minimizing such test
suites. Thus, to investigate these tradeoffs, we performed a family of experiments.
3.2. Measures and Tools
We now discuss the measures and tools utilized in our experiments; subsequent sections
discuss the individual experiments. Let T be a test suite, and let Tmin be the reduced test
suite that results from the application of a minimization technique to T.
3.2.1. Measures
We need to measure the costs and savings of test suite minimization.
3.2.1.1. Measuring savings.
Test suite minimization lets testers spend less time executing test cases, examining test
results, and managing the data associated with testing. These savings in time are dependent
on the extent to which minimization reduces test suite size. Thus, to measure the savings
that can result from test suite minimization, we can follow the methodology used in the
Wong studies and measure the reduction in test suite size achieved by minimization. For
each program, we measure savings in terms of the number and the percentage of tests
eliminated by minimization. (The former measure provides a notion of the magnitude of
the savings; the latter lets us compare and contrast savings across test suites of varying
sizes.) The number of tests eliminated is given by |T| - |Tmin|, and the percentage of tests
eliminated is given by ((|T| - |Tmin|) / |T|) * 100.
This approach makes several assumptions: it assumes that all test cases have uniform costs,
it does not differentiate between components of cost such as CPU time or human time,
and it does not directly measure the compounding of savings that results from using the
minimized test suites over a sequence of subsequent releases. This approach, however,
has the advantage of simplicity, and using it we can draw several conclusions that are
independent of these assumptions and compare our results with those achieved in the Wong
studies.
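In code, the two savings measures are a one-liner apiece. A minimal sketch (Python; treating suites as plain collections is our assumption):

```python
def size_savings(T, T_min):
    """Savings measures of Section 3.2.1.1: number and percentage of
    tests eliminated by minimization."""
    eliminated = len(T) - len(T_min)
    return eliminated, 100.0 * eliminated / len(T)
```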
3.2.1.2. Measuring costs.
There are two costs to consider with respect to test suite minimization. The first cost is
the cost of executing a minimization tool to produce the minimized test suite. However, a
minimization tool can be run following the release of a product, automatically and during
off-peak hours, and in this case the cost of running the tool may be noncritical. Moreover,
having minimized a test suite, the cost of minimization is amortized over the uses of that
suite on subsequent product releases, and thus assumes progressively less significance in
relation to other costs.
The second cost to consider is more significant. Test suite minimization may discard some
test cases that, if executed, would reveal defects in the software. Discarding these test cases
reduces the fault detection effectiveness of the test suite. The cost of this reduced effec-
tiveness may be compounded over uses of the test suite on subsequent product releases,
and the effects of the missed faults may be critical. Thus, in this experiment, we focus on
the costs associated with discarding fault-revealing test cases.
We considered two methods for calculating reductions in fault detection effectiveness.
On a per-test-case basis: One way to measure the cost of minimization in terms of effects
on fault detection, given faulty program P and test suite T, is to identify the test cases in T
that reveal a fault in P but are not in Tmin. This quantity can be normalized by the number
of fault-revealing test cases in T. One problem with this approach is that multiple test cases
may reveal a given fault. In this case some test cases could be discarded without reducing
fault-detection effectiveness; this measure penalizes such a decision.

On a per-test-suite basis: Another approach is to classify the results of test suite min-
imization, relative to a given fault in P, in one of three ways: (1) no test case in T is
fault-revealing, and, thus, no test case in Tmin is fault-revealing; (2) some test case in both
T and Tmin is fault-revealing; or (3) some test case in T is fault-revealing, but no test case in
Tmin is fault-revealing. Case 1 denotes situations in which T is inadequate. Case 2 indicates
a use of minimization that does not reduce fault detection, and Case 3 captures situations
in which minimization compromises fault detection.
The Wong experiments utilized the second approach; we do the same. For each program,
we measure reduced effectiveness in terms of the number and the percentage of faults for
which Tmin contains no fault-revealing test cases, but T does contain fault-revealing test
cases. More precisely, if F denotes the number of faults revealed by T over the faulty
versions of program P, and Fmin denotes the number of faults revealed by Tmin over those
versions, the number of faults lost is given by F - Fmin, and the percentage reduction in
fault-detection effectiveness of minimization is given by ((F - Fmin) / F) * 100.
Note that this method of measuring the cost of minimization calculates cost relative to a
fixed set of faults. This approach also assumes that missed faults have equal costs, an
assumption that typically does not hold in practice.
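The per-test-suite measure reduces to counting, for a fixed fault set, how many faults each suite reveals. A sketch (Python; the reveals predicate is an assumed oracle mapping a test case and a fault to a pass/fail verdict, not something the thesis defines):

```python
def effectiveness_reduction(reveals, T, T_min, faults):
    """Cost measures of Section 3.2.1.2: faults lost and percent reduction.
    reveals(t, f) -> True iff test case t reveals fault f."""
    F = sum(1 for f in faults if any(reveals(t, f) for t in T))
    F_min = sum(1 for f in faults if any(reveals(t, f) for t in T_min))
    lost = F - F_min
    return lost, (100.0 * lost / F if F else 0.0)
```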
3.2.2. Tool infrastructure.
To perform our experiments we required several tools. First, we required a test suite min-
imization tool; to obtain this, we implemented the algorithm of Harrold, Gupta and Soffa
[Harrold93] within the Aristotle program analysis system [Harrold97]. The Aristotle sys-
tem also provided us with code instrumenters for use in determining edge coverage.
3.3. Experiments with smaller C programs
Our first two experiments address our research questions on several small C programs,
similar in size to the C utilities utilized in the Wong98 study. In this section we first
describe details common to these two experiments, and then we report the results of the
experiments in turn.
3.3.1. Subject programs, faulty versions, test cases, and test suites.
We used seven C programs as subjects (see Table 3-1). The programs range in size from
138 to 516 lines of C code and perform a variety of functions. Each program has several
faulty versions, each containing a single fault. Each program also has a large test pool.
The programs, versions, and test pools were assembled by researchers at Siemens Corpo-
rate Research for a study of the fault-detection capabilities of control-flow and data-flow
coverage criteria [Hutchins94]. We refer to these programs collectively as the “Siemens”
programs.
Table 3-1. The Siemens Programs

Program     Lines of Code   No. of Versions   Test Pool Size   Description
totinfo     346             23                1052             information measure
schedule1   299             9                 2650             priority scheduler
schedule2   297             10                2710             priority scheduler
tcas        138             41                1608             altitude separation
printtok1   402             7                 4130             lexical analyzer
printtok2   483             10                4115             lexical analyzer
replace     516             32                5542             pattern replacement
The researchers at Siemens sought to study the fault-detecting effectiveness of coverage
criteria. Therefore, they created faulty versions of the seven base programs by manually
seeding those programs with faults, usually by modifying a single line of code in the pro-
gram. In a few cases they modified between two and five lines of code. Their goal was
to introduce faults that were as realistic as possible, based on their experience with real
programs. Ten people performed the fault seeding, working “mostly without knowledge of
each other’s work” [Hutchins94].
For each of the seven programs, the researchers at Siemens created a large test pool
containing possible test cases for the program. To populate these test pools, they first created
an initial set of black-box test cases "according to good testing practices, based on the
tester's understanding of the program's functionality and knowledge of special values and
boundary points that are easily observable in the code" [Hutchins94], using the category
partition method and the Siemens Test Specification Language tool [Balcer89, Ostrand88].
They then augmented this set with manually-created white-box test cases to ensure that
each executable statement, edge, and definition-use pair in the base program or its control
flow graph was exercised by at least 30 test cases. To obtain meaningful results with the
seeded versions of the programs, the researchers retained only faults that were "neither too
easy nor too hard to detect" [Hutchins94], which they defined as being detectable by at
least three and at most 350 test cases in the test pool associated with each program.1
1. When we execute these faulty versions, we find four faults that are not detected, and three that are detected by only one or two test cases. This difference may be attributable to some factor involving the system on which we are executing our tests; the difference does not impact the results of our study.
Figure 3-1. Percentage of Inputs that Expose Each Fault
[Boxplots, one per subject program (totinfo, schedule1, schedule2, tcas, printtok1, printtok2, replace); y-axis: percentage of tests that reveal faults, ranging from 0 to 20%.]
Figure 3-1 shows the sensitivity to detection of the faults in the Siemens versions relative
to the test pools; the boxplots2 illustrate that the sensitivities of the faults vary within and
between versions, but overall are all lower than 19.77%. Therefore, all of these faults were,
in the terminology of the Wong studies, "Quartile I" faults, detectable by fewer than 25%
of the test pool inputs.
To investigate our research questions we required coverage-adequate test suites that exhibit
redundancy in coverage, and we required these in a range of sizes. To create these test
2. A boxplot is a standard statistical device for representing data sets [Johnson92]. In these plots, each data set's distribution is represented by a box. The box's height spans the central 50% of the data and its upper and lower ends mark the upper and lower quartiles. The middle of the three horizontal lines within the box represents the median. The vertical lines attached to the box indicate the tails of the distribution.
suites we utilized the edge coverage criterion. The edge coverage criterion is similar to
the decision coverage criterion used in the Wong98 study, but is defined on control flow
graphs.3
We used the Siemens program test pools to obtain coverage-adequate test suites for each
subject program. Our test suites consist of a varying number of test cases selected randomly
from the associated test pool, together with any additional test cases required to achieve
100% coverage of coverable edges.4 We did not add any particular test case to any particular
test suite more than once. To ensure that these test suites would possess varying ranges of
coverage redundancy, we randomly varied the number of randomly selected test cases over
sizes ranging from 0 to .5 times the number of lines of code in the program. Altogether,
we generated 1000 test suites for each program.
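A sketch of this generation procedure follows (Python's random module standing in for the C rand-based selection described in footnote 4; the edge-coverage map is an assumed input, and the sketch assumes the pool holds at least 0.5 * LOC tests, which is true of the Siemens pools):

```python
import random

def build_suite(pool_coverage, coverable_edges, loc):
    """Build one coverage-adequate, coverage-redundant test suite.
    pool_coverage: dict mapping each pool test case to the set of edges it covers.
    coverable_edges: the dynamically exercisable edges of the program.
    loc: the program's lines of code (random portion is 0 to 0.5 * loc tests)."""
    pool = list(pool_coverage)
    suite = set(random.sample(pool, random.randint(0, loc // 2)))
    covered = set().union(*(pool_coverage[t] for t in suite)) if suite else set()
    # Add further pool tests until all coverable edges are exercised.
    for t in random.sample(pool, len(pool)):      # pool in random order
        if coverable_edges <= covered:
            break
        if t not in suite and pool_coverage[t] - covered:
            suite.add(t)
            covered |= pool_coverage[t]
    return suite
```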
Figure 3-2 provides views of the range of sizes of test suites created by the process just
described. The boxplots illustrate that for each subject program, our test suite generation
procedure yielded a collection of test suites of sizes that are relatively evenly distributed
across the range of sizes utilized for that program. The all-uses-coverage-adequate suites
3. A test suite T is edge-coverage adequate for program P iff, for each edge e in each control flow graph for some procedure in P, if e is dynamically exercisable, then there exists at least one test case t in T that exercises e. A test case t exercises an edge e = (n1, n2) in control flow graph G iff t causes execution of the statement associated with n1, followed immediately by the statement associated with n2.
4. To randomly select test cases from the test pools, we used the C pseudo-random-number generator "rand", seeded initially with the output of the C "time" system call, to obtain an integer which we treated as an index i into the test pool (modulo the size of that pool).
Figure 3-2. Size Distribution among Unminimized Test Suites for the Siemens Programs
[Boxplots of test suite size (0 to 270 test cases), one per subject program (totinfo, schedule1, schedule2, tcas, printtok1, printtok2, replace).]
are larger on average than the edge-coverage-adequate suites because in general, more tests
are required to achieve all-uses coverage than to achieve edge coverage.
Analysis of the fault-detection effectiveness of these test suites shows that, except for eight
of the edge-coverage-based test suites for schedule2, every test suite revealed at least
one fault in the set of faulty versions of the associated program. Thus, although each fault
individually is difficult to detect relative to the entire test pool for the program, almost all
of the test suites utilized in the study possessed at least some fault-detection effectiveness
relative to the set of faulty programs utilized.
3.3.2. Experiment design.
The experiments were run using a full-factorial design with 1000 size-reduction and 1000
effectiveness-reduction measures per cell.5 The independent variables manipulated were:
• The subject program (the seven programs, each with a variety of faulty versions).
• Test suite size (between 0 and .5 times lines-of-code test cases randomly selected from
the test pool, together with additional test cases as necessary to achieve code coverage).
For each subject program, we applied minimization techniques to each of the sample test
suites for that program. We then computed the size and effectiveness reductions for these
test suites.
3.3.3. Threats to validity.
In this section we discuss potential threats to the validity of our experiments with the
Siemens programs.
Threats to internal validity are influences that can affect the dependent variables without
the researcher’s knowledge, and that thus affect any supposition of a causal relationship
between the phenomena underlying the independent and dependent variables. In these
experiments, our greatest concerns for internal validity involve the fact that we do not
5. The single exception involved schedule2, for which only 992 measures were available with respect to edge-coverage-based test suites, due to exclusion of the eight test suites that did not expose any faults.
control for the structure of the subject programs or the locality of program changes.
Threats to external validity are conditions that limit our ability to generalize our results.
The primary threats to external validity for this study concern the representativeness of the
artifacts utilized. The Siemens programs, though nontrivial, are small, and larger programs
may be subject to different cost-benefit tradeoffs. Also, each faulty version of each Siemens
program contains exactly one seeded fault; in practice, programs have much more complex
error patterns.
Furthermore, the faults in the Siemens programs were deliberately chosen (by the Siemens
researchers) to be faults that were relatively difficult to detect. (However, the fact that the
faults in these programs were not chosen by us does eliminate one potential source of bias.)
Finally, the test suites we utilized represent only two types of test suite that could occur in
practice if a mix of non-coverage-based and coverage-based testing were utilized. These
threats can only be addressed by additional studies utilizing a wider range of artifacts.
Threats to construct validity arise when measurement instruments do not adequately cap-
ture the concepts they are supposed to measure. For example, in this experiment our mea-
sures of cost and effectiveness are very coarse: they treat all faults as equally severe, and
all test cases as equally expensive.
3.3.4. Minimization of edge-coverage-adequate test suites
Our first experiment addresses our research questions by applying minimization to the
Siemens programs and their edge-coverage-adequate test suites. In reporting results we
first consider test suite size reduction, and then we consider fault detection effectiveness
reduction.
3.3.4.1. Test suite size reduction
Figure 3-3 depicts the sizes of the minimized edge-coverage-adequate test suites for the
seven Siemens programs, plotted against original test suite size. The data for each program
P is depicted by a scatterplot containing a point for each of the test suites utilized for
P. As the figure shows, the average sizes of the minimized test suites range from approx-
imately 5 (for tcas) to 12 (for replace). For each program, the minimized test suites
demonstrate little variance in size: tcas exhibits the least variance (between 4 and 5
test cases), and printtok1 shows the greatest variance (between 5 and 14 test cases).
Considered across the range of original test suite sizes, minimized test suite size for each
program is also relatively stable.
Figure 3-4 depicts the percentage reduction in test suite size produced by minimization in
terms of the formula discussed in Section 3.2.1.1, for each of the subject programs. The
data for each program P is represented by a scatterplot containing a point for each of the
test suites utilized for P; each point shows the percentage size reduction achieved for a
test suite versus the size of that test suite prior to minimization. Visual inspection of the
plots indicates a sharp increase in test suite size reduction over the first quartile of test suite
sizes, tapering off as size increases beyond the first quartile. The data gives the impression
of fitting a hyperbolic curve.
To verify the correctness of this impression, we performed least-squares regression to fit
the data depicted in these plots with a hyperbolic curve.
Figure 3-3. Size of Minimized vs. Size of Original Test Suites
[Scatterplots, one per program (totinfo, schedule1, schedule2, tcas, printtok1, printtok2, replace), of minimized test suite size vs. original test suite size, each with a line marking the average.]
Figure 3-4. Percent Reduction in Test Suite Size vs. Original Test Suite Size
[Scatterplots, one per program (totinfo, schedule1, schedule2, tcas, printtok1, printtok2, replace), of percentage reduction in test suite size (0 to 100%) vs. original test suite size.]
Table 3-2 shows the best-fit curve for each of the subject programs, along with its square
of correlation, r2.6 The fits indicate a strong hyperbolic correlation between percentage
reduction in test suite size (the savings of minimization) and original test suite size.
Table 3-2. Correlation Between Size Reduction and Original Size

Program     Regression equation            r^2
totinfo     y = 100 * (1 - (5.20762/x))    0.99
schedule1   y = 100 * (1 - (5.45457/x))    0.96
schedule2   y = 100 * (1 - (5.12267/x))    0.94
tcas        y = 100 * (1 - (4.97019/x))    1.00
printtok1   y = 100 * (1 - (7.49780/x))    0.90
printtok2   y = 100 * (1 - (6.77076/x))    0.93
replace     y = 100 * (1 - (12.1008/x))    0.99
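Because the model y = 100 * (1 - c/x) is linear in its single parameter c, the least-squares optimum has a closed form. The sketch below shows one way such fits and their r2 values could be computed; the thesis does not record the exact fitting tool used, so this is illustrative:

```python
def fit_hyperbola(xs, ys):
    """Least-squares fit of y = 100 * (1 - c/x) to size-reduction data.
    Setting d/dc of sum((y - 100 + 100*c/x)^2) to zero gives
    c = sum((100 - y)/x) / (100 * sum(1/x^2)).  Returns (c, r_squared)."""
    c = (sum((100 - y) / x for x, y in zip(xs, ys))
         / (100 * sum(1 / (x * x) for x in xs)))
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - 100 * (1 - c / x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return c, 1 - ss_res / ss_tot
```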
Our experimental results indicate that test suite minimization can produce savings in test
suite size on coverage-adequate, coverage-redundant test suites. The results also indicate
that as test suite size increases, the savings produced by test suite minimization increase,
a consequence of the relatively stable size of the minimized suites.
3.3.4.2. Fault detection effectiveness reduction
Figure 3-5 depicts the cost (reduction in fault detection effectiveness) incurred by mini-
mization, in terms of the formula discussed in Section 3.2.1.2, for each of the seven subject
programs. The data for each program P is represented by a scatterplot containing a point
for each of the test suites utilized for P; each point shows the percentage reduction in fault
detection effectiveness observed for a test suite versus the size of that test suite prior to
minimization.
6. r2 is a dimensionless index that ranges from zero to 1.0, inclusive, and is "the fraction of variation in the values of y that is explained by the least-squares regression of y on x" [Moore99].
Figure 3-6 illustrates the magnitude of the fault detection effectiveness reduction observed
for the seven subject programs. Again, this figure contains a scatterplot for each program;
however, we find it most revealing to depict faults detected versus original test suite size,
simultaneously for both test suites minimized for edge coverage (black) and original
test suites (grey). The solid lines in the plots denote average numbers of faults detected over
the range of original test suite sizes; the gap between these lines indicates the magnitude
of the fault detection effectiveness reduction for test suites minimized for edge coverage.
The plots show that the fault detection effectiveness of test suites can be severely com-
promised by minimization. For example, on replace, the largest of the programs, mini-
mization reduces fault-detection effectiveness by over 50%, with average fault loss ranging
from 4 to 20 faults across the range of test suite sizes, on more than half of the test suites.
Also, although there are cases in which minimization does not reduce fault-detection ef-
fectiveness (e.g., on printtok1), there are also cases in which minimization reduces the
fault-detection effectiveness of test suites by 100% (e.g., on schedule2).
Visual inspection of the plots suggests that reduction in fault detection effectiveness
increases slightly as test suite size increases. Test suites in the smallest size ranges do
produce effectiveness losses of less than 50% more frequently than they produce losses in
excess of 50%, a situation not true of the larger test suites. Even the smallest test suites,
however, exhibit effectiveness reductions in most cases: for example, on replace, test
suites containing fewer than 50 test cases exhibit an average effectiveness reduction of
nearly 40% (fault detection reduction ranging from 4 to 8 faults), and few such test suites
avoid losing effectiveness.
Figure 3-5. Minimization: Percentage Effectiveness Reduction vs. Original Size
[Scatterplots, one per program (totinfo, schedule1, schedule2, tcas, printtok1, printtok2, replace), of percentage effectiveness reduction (0 to 100%) vs. original test suite size.]
Figure 3-6. Effectiveness in Original and after Minimization vs. Original Size
[Scatterplots, one per program (totinfo, schedule1, schedule2, tcas, printtok1, printtok2, replace), of faults detected vs. original test suite size; each panel shows points for the original suites and for the suites after minimization, with average lines for both.]
Table 3-3. Minimization: Correlation between Effectiveness Reduction and Original Size

Program     Linear regression     r^2    Logarithmic regression    r^2    Quadratic regression             r^2
totinfo     y = 0.13x + 27.79     0.16   y = 9.56 ln(x) - 1.71     0.22   y = -0.002x^2 + 0.44x + 17.74    0.21
schedule1   y = 0.15x + 38.92     0.12   y = 10.03 ln(x) + 9.25    0.15   y = -0.002x^2 + 0.47x + 29.80    0.15
schedule2   y = 0.28x + 34.86     0.16   y = 17.70 ln(x) - 17.12   0.20   y = -0.004x^2 + 0.89x + 17.07    0.21
tcas        y = 0.68x + 34.89     0.38   y = 22.18 ln(x) - 16.28   0.47   y = -0.020x^2 + 2.18x + 13.41    0.46
printtok1   y = 0.16x + 22.48     0.18   y = 14.68 ln(x) - 26.34   0.20   y = -0.001x^2 + 0.44x + 10.94    0.20
printtok2   y = 0.07x + 12.57     0.11   y = 6.82 ln(x) - 10.73    0.13   y = -0.001x^2 + 0.19x + 6.95     0.13
replace     y = 0.11x + 42.67     0.20   y = 13.07 ln(x) - 4.82    0.27   y = -0.001x^2 + 0.41x + 26.79    0.28
In contrast to the plots of size reduction, the plots of fault detection effectiveness
reduction do not give a strong impression of closely fitting any curve or line: the data
is much more scattered than the data for test suite size reduction. Our attempts to fit linear,
logarithmic, and quadratic regression curves to the data validate this impression: the data
in Table 3-3 reveals little linear, logarithmic, or quadratic correlation between reduction in
fault detection effectiveness and original test suite size.
These results indicate that test suite minimization can compromise the fault-detection ef-
fectiveness of coverage-adequate, coverage-redundant test suites. However, the results
only weakly suggest that as test suite size increases, the reduction in the fault-detection
effectiveness of those test suites will increase.
One additional feature of the scatterplots of Figure 3-5 warrants discussion: on several
of the graphs, there are markedly visible "horizontal lines" of points. In the graph for
printtok1, for example, there are particularly strong horizontal lines at 0%, 20%, 25%,
33%, 40%, 50%, 60%, and 67%. Such lines indicate a tendency for minimization to ex-
clude particular percentages of faults for the programs on which they occur.
This tendency is partially explained by our use of a discrete number of faults in each subject
program. Given a test suite that exposes k faults, minimization can exclude test cases that
detect between 0 and k of these faults, yielding discrete percentages of reductions in fault-
detection effectiveness. For printtok1, for example, there are seven faults, of which the
unminimized test suites may reveal between zero and seven. When minimization is applied
to the test suites for printtok1, only 19 distinct percentages of fault detection effectiveness
reduction can occur: 100%, 86%, 83%, 80%, 75%, 71%, 67%, 60%, 57%, 50%, 43%,
40%, 33%, 29%, 25%, 20%, 17%, 14%, and 0%. Each of these percentages except 29%
and 100% is evident in the scatterplot for printtok1. With all points falling on these
17 percentages, the appearance of lines in the graph is unsurprising.
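This count is easy to check mechanically. The following C sketch (illustrative only, not part of our experimental tooling) enumerates the reduction percentages k/n achievable with at most seven faults and confirms that, rounded to whole percentages, exactly 19 distinct values arise:

    #include <stdio.h>

    int main(void)
    {
        int seen[101] = { 0 };
        int count = 0;

        /* a suite revealing n faults (1..7) can lose k of them (0..n),
           a reduction of 100*k/n percent */
        for (int n = 1; n <= 7; n++)
            for (int k = 0; k <= n; k++)
                seen[(int)(100.0 * k / n + 0.5)] = 1;   /* round to whole percent */

        for (int p = 0; p <= 100; p++)
            if (seen[p]) {
                printf("%d%% ", p);
                count++;
            }
        printf("\n%d distinct percentages\n", count);
        return 0;
    }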
It follows that as the number of faults utilized for a program increases, the presence of hor-
izontal lines should decrease; this is easily verified by inspecting the graphs, considering in
turn printtok1 with 7 faults, schedule1 with 9, schedule2 with 10, printtok2
with 10, totinfo with 23, replace with 32, and tcas with 41.
This explanation, however, is only partial: if it were complete, we would expect points to
lie more equally among the various reduction percentages (with allowances for the fact that
there may be multiple ways to achieve particular reduction percentages, such as 100%, 67%,
50%, 33%, and 0% reductions). The fact that the occurrences of reduction percentages are
not thus distributed reflects, we believe, variance in fault locations across the programs,
coupled with variance in test coverage patterns of faulty statements.
3.3.5. Minimization of randomly generated test suites
Our second experiment addresses the question of how edge-coverage-based minimization
compares to random selection as a test suite reduction technique. To facilitate discussion,
we refer to test suites whose size was minimized while keeping coverage constant as
minimized test suites, and we refer to test suites whose size was reduced to a specific level
by random selection as randomly reduced test suites.
To randomly reduce test suites, we used Perl's built-in pseudo-random number generator,
which is automatically seeded with the system time, process ID, and various other system
variables.7
For each of the test suites, the original test pool was set to the test cases in the unminimized
test suite. The random number generator returned a nonnegative integer less than the size of
the test pool. This integer was treated as an index to the test cases in the test pool, and the
indexed test case was placed in the output test suite and removed from the test pool. The
process was repeated until the output test suite reached the size of the minimized test suite.
7. This behavior depends on the version of Perl; it is described in the perlfunc man page for Perl 5.004.
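This procedure amounts to sampling without replacement until the target size is reached. Our scripts were written in Perl; the following C sketch (a hypothetical illustration, not the actual implementation) captures the same logic by swapping randomly chosen tests to the front of the pool array:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Reduces pool[0..pool_size-1] in place; on return, the first
       target_size entries form the randomly reduced test suite. */
    void randomly_reduce(int *pool, int pool_size, int target_size)
    {
        for (int chosen = 0; chosen < target_size; chosen++) {
            /* random index into the not-yet-chosen remainder of the pool */
            int r = chosen + rand() % (pool_size - chosen);
            int tmp = pool[chosen];
            pool[chosen] = pool[r];
            pool[r] = tmp;
        }
    }

    int main(void)
    {
        int suite[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
        srand((unsigned) time(NULL));   /* seed, loosely analogous to Perl's default */
        randomly_reduce(suite, 10, 4);  /* keep 4 of the 10 tests */
        printf("t%d t%d t%d t%d\n", suite[0], suite[1], suite[2], suite[3]);
        return 0;
    }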
The experiment follows a paired-T test design [Johnson92]. In a paired-T test, two
populations are compared by comparing many subjects drawn from the two populations
in matched pairs, in such a way that the pairing controls extraneous variables.
In this case, by pairing our minimized edge-coverage-adequate test suites with randomly
reduced test suites, we were able to control for differences in the unminimized test suites
and differences in minimized test suite sizes. As a result, we were able to compare the
overall fault detection effectiveness of minimized test suites with the overall fault detection
effectiveness of the randomly reduced test suites.
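We do not reproduce the exact statistical computation here, but the standard paired-T summary is the mean of the per-pair differences together with a margin of error of t * s_d / sqrt(n); the following C sketch (our illustration, with hypothetical data) shows this computation:

    #include <math.h>
    #include <stdio.h>

    /* d[0..n-1] are per-pair differences, e.g., the effectiveness reduction
       of a randomly reduced suite minus that of its paired minimized suite;
       t_crit is the critical value for the chosen confidence level. */
    void paired_t_summary(const double *d, int n, double t_crit)
    {
        double sum = 0.0, sumsq = 0.0;
        for (int i = 0; i < n; i++) {
            sum += d[i];
            sumsq += d[i] * d[i];
        }
        double mean = sum / n;
        /* sample standard deviation of the differences */
        double sd = sqrt((sumsq - n * mean * mean) / (n - 1));
        printf("mean difference = %.1f +/- %.1f\n",
               mean, t_crit * sd / sqrt((double) n));
    }

    int main(void)
    {
        double d[] = { 20.0, 15.0, 25.0, 10.0, 30.0 };   /* hypothetical data */
        paired_t_summary(d, 5, 8.610);   /* two-sided 99.9% t for 4 d.f. */
        return 0;
    }

With 1000 pairs per cell, the corresponding two-sided 99.9% critical value is approximately 3.3.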
3.3.5.1. Test suite size reduction
By design, we produced randomly reduced test suites of the same size as those produced by
minimization in our first experiment. Thus, the test suites produce the same size reductions
as those depicted in Figure 3-3 and Figure 3-4.
3.3.5.2. Fault detection effectiveness reduction
Figure 3-7 depicts the cost (reduction in fault detection) incurred by randomly selecting
a subset of the original test suite. These scatterplots look similar to those of Figure 3-5,
which depict the reduction in fault detection incurred by minimization. The only noticeable
difference is that the scatterplot for the randomly reduced test suites is somewhat denser
at high effectiveness-reduction percentages.
Figure 3-8 illustrates the magnitude of the fault detection effectiveness reduction observed
for the seven subject programs for random test suite reduction, compared with the reduction
for edge-coverage-based minimization. Again, this figure contains a scatterplot for each
program, and we depict faults detected versus original test suite size, simultaneously for
both test suites minimized for edge-coverage (black) and randomly reduced test suites
(grey). The solid lines in the plots denote average numbers of faults detected over the range
of original test suite sizes; the gap between these lines indicates the difference between the
two reduction techniques. The plots indicate a noticeable difference between the two
techniques.
As with the minimized test suites, we attempted to fit the data points for fault detection
reduction of randomly reduced test suites (Figure 3-7) to some simple functions. The
results of this attempt (shown in Table 3-4) were similar to those for minimization (Table
3-3), as both show little linear, logarithmic, or quadratic correlation between reduction in
fault detection effectiveness and the size of the original test suite. The randomly reduced
test suites, however, have even lower correlation coefficients, reflecting the more variable
nature of random reduction.
Table 3-4. Random Reduction: Correlation between Effectiveness Loss and Original Size

program     linear fit            r^2    logarithmic fit           r^2    quadratic fit                       r^2
totinfo     y = 0.14x + 45.49     0.10   y = 11.35 Ln(x) + 9.67    0.16   y = -0.002x^2 + 0.55x + 32.53       0.15
schedule1   y = 0.15x + 62.37     0.09   y = 10.63 Ln(x) + 29.92   0.14   y = -0.003x^2 + 0.58x + 50.16       0.13
schedule2   y = 0.19x + 66.50     0.09   y = 13.49 Ln(x) + 25.43   0.14   y = -0.004x^2 + 0.78x + 49.73       0.14
tcas        y = 0.61x + 48.30     0.24   y = 20.61 Ln(x) - 0.24    0.32   y = -0.020x^2 + 2.10x + 27.01       0.30
printtok1   y = 0.15x + 60.11     0.13   y = 14.62 Ln(x) + 10.81   0.16   y = -0.001x^2 + 0.44x + 48.32       0.15
printtok2   y = 0.06x + 55.26     0.02   y = 6.52 Ln(x) + 32.39    0.03   y = -0.001x^2 + 0.20x + 48.81       0.03
replace     y = 0.10x + 55.44     0.15   y = 12.53 Ln(x) + 9.96    0.20   y = -0.001x^2 + 0.36x + 42.27       0.19
Figure 3-7. Random Reduction: Percentage Effectiveness Reduction vs. Original Suite Size
[Seven scatterplot panels (totinfo, schedule1, schedule2, tcas, printtok1, printtok2, replace): percent reduction in fault-detection effectiveness (0-100) versus original test suite size for randomly reduced suites.]
Figure 3-8. Minimization and Random Reduction: Fault Detection vs Original Size
[Seven scatterplot panels (totinfo, schedule 1, schedule 2, tcas, print tokens 1, print tokens 2, replace): faults detected versus original test suite size after minimization and after random reduction, with solid lines showing the average faults detected by each technique.]
Figure 3-9. Random Reduction: Percent Effectiveness Reduction
[Side-by-side boxplots of percent effectiveness reduction (0-100) for minimized and randomly reduced test suites, for each of totinfo, schedule1, schedule2, tcas, printtok1, printtok2, and replace.]
Figure 3-9 shows boxplots representing (comparatively) the span of the fault detection
reductions for the various Siemens subjects. Boxplots for the minimized and randomly
reduced test suites are shown side by side. The boxplots show a consistent pattern of
greater loss in fault detection in the randomly reduced test suites than in their associated
minimized test suites.
Table 3-5. Comparison of Fault Detection Reduction

program     Average Reduction in Fault    Average Reduction in Fault      Average Difference in Fault
            Detection for Minimized       Detection for Randomly          Detection Reduction
            Test Suites                   Reduced Test Suites
totinfo     39.2                          58.2                            19.0 ± 2.5
schedule1   51.1                          74.2                            23.2 ± 2.7
schedule2   56.7                          81.7                            25.0 ± 3.0
tcas        60.9                          71.4                            10.6 ± 2.4
printtok1   40.8                          77.7                            36.9 ± 2.7
printtok2   21.3                          63.1                            41.7 ± 2.8
replace     57.2                          69.4                            12.2 ± 2.3
Table 3-5 shows statistical data for randomly reduced and minimized test suites. The data
confirms conclusions drawn from Figure 3-9: the minimized test suites tended to find more
faults than their randomly reduced counterparts. The fourth column shows the difference
in lost fault detection between minimized and random reduction; the margins of error are
shown for the 99.9% confidence level. The average advantage ranged from 10.6% for
tcas to 41.7% for printtok2. The differences are significant, as all of their confidence
intervals lie entirely above zero.
Table 3-6. Comparison of Fault Detection Reduction Variance

program     Sample Standard Deviation of     Sample Standard Deviation of
            Fault Detection Reduction,       Fault Detection Reduction,
            Minimized Test Suites            Randomly Reduced Test Suites
totinfo     15.7                             22.1
schedule1   18.5                             20.7
schedule2   23.2                             24.0
tcas        20.3                             22.9
printtok1   20.5                             22.7
printtok2   13.2                             25.3
replace     16.7                             18.6
Table 3-6 further quantifies results not directly apparent in Figure 3-9: for all programs,
the minimized test suites detected faults more consistently than their randomly reduced
counterparts. The smallest difference was found for schedule2, where the reduction
in fault detection for the minimized test suites had a standard deviation of 23.2, and the
reduction in fault detection for the randomly reduced test suites was only slightly less
consistent, with a standard deviation of 24.0. The largest difference was for printtok2,
where the reduction in fault detection for minimized test suites had a standard deviation of
only 13.2, while the reduction in fault detection for the randomly reduced test suites had a
standard deviation of 25.3.
3.4. Experiment with the Space Program
Our next experiment addresses our research questions by applying minimization to the
Space program utilized in the Wong97 study.
3.4.1. Subject program, faulty versions, test cases, and test suites.
Space (see Table 3-7), consisting of 9564 lines of C code (6218 executable), functions
as an interpreter for an array definition language (ADL). The program reads a file that
contains several ADL statements, and checks the contents of the file for adherence to the
ADL grammar, and to specific consistency rules. If the ADL file is correct, Space outputs
an array data file containing a list of array elements, positions, and excitations; otherwise
the program outputs error messages.
Table 3-7. The Space Application

Lines of Code (executable)   6218
No. of Versions              35
Test Pool Size               13585
Description                  language interpreter
Space has 33 associated versions, each containing a single fault that had been discovered
during the program’s development. (The Wong97 study utilized only eighteen of these
faulty versions.) Through working with this program, we discovered five additional faults,
and created versions containing just those faults. We also discovered that three of the
"faulty versions" were actually semantically equivalent to the base version. We excluded
these from our study; therefore, we ultimately utilized 35 faulty versions.

Figure 3-10. Percentage of Test Cases that Expose each of Space's Faults
[Bar graph: for each of the 35 faulty versions, the percentage of test cases (0-100) that detect the fault.]
The test pool for Space was constructed in two stages. An initial pool of 10,000 tests was
obtained from Frankl and Vokolos, who had constructed the pool for another study by ran-
domly generating test cases [Vokolos98]. Beginning with this initial pool, we instrumented
the program for edge coverage, measured coverage, and then added additional test cases to
the pool until it contained, for each dynamically executable edge in the control flow graph
for the program, at least 30 test cases that exercised that edge. This process yielded a test
pool of 13,585 test cases.
Figure 3-10 shows the sensitivity to detection of the faults in Space relative to the test
pool; the bar graph illustrates that the sensitivities of the faults vary, but overall fall
between .13% and 99.77%. In all, 74% (26/35) of these faults were, in the terminology of
the Wong studies, "Quartile I" faults, detectable by fewer than 25% of the test pool inputs.
As with the Siemens programs, we used the Space program's test pool to obtain
coverage-adequate test suites for the program; however, due to limitations in our dataflow
analyzer, we were able to create only edge-coverage-adequate suites.
As with the Siemens programs, we utilized test suites consisting of a random number of
randomly selected tests together with additional tests necessary to achieve coverage. In
addition, we also utilized a set of smaller test suites generated for coverage by beginning
with an empty test suite, and then greedily selecting test cases and adding them to the test
suite only if they added coverage, until full coverage was achieved. We call these two
varieties of test suites "extended" and "unextended", respectively. The extended test suites
ranged in size from 159 to 4712 test cases, and the unextended test suites ranged in size
from 141 to 169 tests. Because the unextended test suites were greedily generated, they
do contain coverage-redundant test cases, though far fewer than most of the extended test
suites.
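The following C sketch (with an illustrative coverage matrix, not data from the experiments) shows this greedy construction; note how an early test can later become coverage-redundant, which is why the unextended suites still admit minimization:

    #include <stdio.h>

    #define NUM_TESTS 6
    #define NUM_EDGES 4

    /* covers[t][e] != 0 iff test t exercises edge e (illustrative data) */
    static const int covers[NUM_TESTS][NUM_EDGES] = {
        { 1, 0, 0, 0 },   /* t1 */
        { 1, 1, 0, 0 },   /* t2: makes t1 coverage-redundant after the fact */
        { 1, 0, 0, 0 },   /* t3: adds no new coverage, so it is skipped */
        { 0, 0, 1, 0 },   /* t4 */
        { 0, 0, 1, 0 },   /* t5: adds no new coverage, skipped */
        { 0, 1, 1, 1 },   /* t6 */
    };

    int main(void)
    {
        int covered[NUM_EDGES] = { 0 };
        for (int t = 0; t < NUM_TESTS; t++) {
            int adds = 0;
            for (int e = 0; e < NUM_EDGES; e++)
                if (covers[t][e] && !covered[e])
                    adds = 1;
            if (adds) {   /* keep only tests that add coverage */
                for (int e = 0; e < NUM_EDGES; e++)
                    if (covers[t][e])
                        covered[e] = 1;
                printf("keep t%d\n", t + 1);
            }
        }
        return 0;   /* keeps t1, t2, t4, and t6 */
    }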
3.4.2. Experiment design.
The experiments were run using a full-factorial design with 1000 size-reduction and 1000
effectiveness-reduction measures per cell. The independent variables manipulated were:
• The test suite type (extended or unextended).
• Test suite size (for the extended suites, between 0 and .5 times lines-of-code test cases
randomly selected from the test pool, together with additional test cases as necessary to
achieve code coverage; for the unextended test suites, the range of sizes generated that
achieved coverage.)
For each test suite type, we applied minimization techniques to each of the 1000 sample
test suites of that type. We then computed the size and effectiveness reductions for these
test suites.
Also, as with the Siemens programs, we conducted an additional experimental run utilizing
randomly selected tests, using a paired-T test design. We report both experimental runs
together.
3.4.3. Threats to validity.
This experiment shares, with our experiments on the Siemens programs, the threats to
validity described in Section 3.3.3. In addition, the program’s naturally occurring faults
had been separated into single faulty versions, abstracting out effects that could occur from
interacting faults; however, this abstraction was necessary in order to be able to attribute
failures properly to faults. On the other hand, Space is a real program, with real faults
uncovered in practice, and its size is an order of magnitude greater than that of the Siemens
programs; these factors augment the external validity of the study.
3.4.4. Data and Analysis
We report results for all types of test suites and for random reduction together in this
section.
3.4.4.1. Test suite size reduction
Figure 3-11 depicts the sizes of the minimized edge-coverage-adequate test suites (both
extended and unextended) for Space. These show that the size of the minimized test
suites is relatively stable and does not show a large variance even between the extended
and unextended test suites.
Figure 3-12 shows the percentage reduction in minimized test suite size versus the size
of the original test suite for Space. The scatterplot for minimization from the extended
test suites (at left in the figure) is hyperbolic, and nearly identical to those seen in the
experiments on the Siemens programs, reflecting reductions to a nearly constant size
across a wide range of original test suite sizes. Although not apparent given the scale of
the plot, the points in the plot for the unextended test suites fit into the bottom of the
curve for the extended test suites, amidst the points for the least extended test suites.
Table 3-8. Correlation between Size Reduction and Initial Size

program   regression equation                            r^2
Space     y = -13.919 Ln(x) - 14.832                     0.7949
Space     y = 0.0065327x + 74.436                        0.4702
Space     y = -4.0311e-06x^2 + 0.025842x + 58.449        0.7256
Space     y = 100 - 100*121.012/x                        0.9994
Figure 3-11. Size of Minimized Test Suites vs Size of Original Test Suites
[Two scatterplots for Space: minimized test suite size (0-150) versus original test suite size, one for unextended suites (original sizes up to 180) and one for extended suites (original sizes up to 5000), each with the average shown.]
Figure 3-12. Percent Reduction in Test Suite vs Original Test Suite Size
[Two scatterplots for Space: percent reduction in test suite size (0-100) versus original test suite size, one for suites minimized from extended suites (original sizes up to 5000) and one for suites minimized from unextended suites (original sizes 135-175).]
Table 3-8 shows attempts to fit curves to the points in the plot for extended test suites.
Again, the best fit is the hyperbolic curve.
3.4.4.2. Fault detection effectiveness reduction
Figure 3-13 shows scatterplots for the reduction in fault detection for the minimization
and random reduction ofSpace ’s unextended and extended test suites. For all four of
the treatments, the highest loss in fault detection effectiveness is less than 40%. Table 3-8
shows attempts to fit a curve to the effectiveness reduction resulting from minimization of
the extended test suites. Unlike for size reduction, we found no curves that fit the data well.
Figure 3-14 illustrates the magnitude of the fault detection effectiveness reduction observed
for Space. Similar to the figures presented for the Siemens programs, the figure contains
a scatterplot for the extended test suites (left) and one for the unextended test suites (right).
Each scatterplot depicts faults detected versus original test suite size, simultaneously for
original test suites (black), test suites minimized for all-edges coverage (dark grey), and
randomly reduced test suites (light grey). The solid lines in the plots denote average num-
bers of faults detected over the range of original test suite sizes; the gaps between these
lines indicate differences between the test suites. The plots reveal noticeable differences
between the techniques as test suite size grows; however, on the smaller unextended test
suites, the differences are much smaller.
Table 3-9. Correlation between Initial Size and Effectiveness Reduction

program   regression equation                          r^2
Space     y = 0.0011x + 6.23                           0.1047
Space     y = 2.1771 Ln(x) - 7.4968                    0.1478
Space     y = -6.36e-07x^2 + 0.004169x + 3.708         0.1531
Figure 3-13. Percent Reduction in Effectiveness vs. Original Size
[Four scatterplot panels for Space: percent reduction in fault-detection effectiveness (0-100) versus original test suite size, for suites minimized from extended suites, minimized from unextended suites, randomly reduced from extended suites, and randomly reduced from unextended suites.]
Figure 3-14. Original and Minimized: Faults Detected vs. Original Size
[Two scatterplot panels for Space (unextended suites and extended suites): faults detected (0-50) versus original test suite size for the original suites, after minimization, and after random reduction, with lines showing the average for each.]
Table 3-10 shows average reductions in fault detection resulting from applying different
treatments to Space. Not surprisingly, the larger extended test suites exhibited a greater
average reduction in fault detection due to minimization, 8.9% versus 2.3%, and due to
random reduction, 18.1% versus 3.4%. In general, however, the losses in fault detection
effectiveness are much smaller here than those observed with the Siemens programs. As in
the experiments with the Siemens programs, fault detection reduction due to random
reduction was greater than that due to minimization (9.2 ± 0.7 for the extended test suites),
and the margins of error show that difference to be significant at α = 0.001 (99.9% confidence).
Table 3-10. Average Reductions in Fault Detection Effectiveness

Unminimized    Average Reduction in Fault    Average Reduction in Fault    Average Difference in Fault
Test Suites    Detection for Minimized       Detection for Randomly        Detection Reduction
               Test Suites                   Reduced Test Suites
extended       8.9                           18.1                          9.2 ± 0.7
unextended     2.3                           3.4                           1.2 ± 0.5
3.5. Comparison to Previous Empirical Results
Both the Wong studies and our studies indicate that test suite minimization can produce
savings (in test suite size reduction), and that these savings increase with test suite size.
Both sets of studies also support, to some degree, a claim that reduction in fault-detection
effectiveness increases as test suite size increases.
The two sets of studies differ substantially, however, in their results pertaining to fault-
detection effectiveness reduction. The authors of the Wong studies conclude that, for the
programs and test cases they considered: (1) test suites that do not add coverage are not
likely to detect additional faults, and (2) fault detection effectiveness reduction is insignif-
icant even for test suites that have high block coverage.
Our results on Space are similar in this respect to those of the Wong97 study. That study
shows an average fault detection effectiveness reduction of less than 10% for all test suite
sizes. While Figure 3-13 shows somewhat larger losses in fault detection in some cases,
the average reduction in fault detection capability of 8.9% is not far from that discovered
in Wong97.
However, this conclusion contrasts markedly with our results on the Siemens programs,
where fault-detection effectiveness was severely compromised by minimization. We would
like to know the causes of this difference, and we here discuss several potential causes.
First, the Siemens programs differ from the programs utilized in the Wong98 study; all but
one of the Siemens programs are larger than all but one of the programs used in that study.
Second, the Wong98 study used ATAC for minimization, whereas our study utilized the
algorithm of Reference [Harrold93]. Reference [Wong95] reports that ATAC achieved
minimal test selection on the cases studied; we have not yet determined whether our al-
gorithm was equally successful. However, if our algorithm is less successful than the
algorithm used in the Wong98 study, we would expect this to cause us to underestimate
possible reductions in fault detection effectiveness. A better algorithm, if possible, could
only exacerbate the already large difference in results.
Third, our experiments with the Siemens programs and the Wong98 study both utilized
seeded faults that may accurately be described as “mutation-like”. However, all of the
faults utilized in our study were Quartile I faults, whereas only 41% of the faults used
in the Wong98 study were Quartile I faults. Easily detected faults are less likely to go
unrecognized in minimized test suites than faults that are more difficult to detect; thus, we
would expect our results overall to show greater reductions in fault-detection effectiveness
than the Wong98 study. However, the authors of the Wong98 study did separately report
results for Quartile I faults, and in their study, minimized test suites missed few of these
faults. Our fourth study further considers this factor by utilizing the same faulty versions
utilized in the Wong97 study.
Fourth, a factor more likely to be responsible for differences in results of the studies, in
our opinion, involves the types of test suites utilized. The Wong98 study used test suites
that were not coverage-adequate, and used coverage-based suites that were relatively small
compared to our test suites. Overall, 928 of the 1198 test suites utilized in the Wong98
study belonged to groups of test cases whose average sizes did not exceed 4.5 test cases.
Small test suite size reduces opportunities both for minimization and for reduction in
fault-detection effectiveness. Further differences in test suites stem from the fact that the
test pools used in the Wong98 study as sources for test suites did not necessarily contain
any minimum number of test cases per covered item. These differences may contribute
to reduced redundancy in test coverage within test suites, and reduce the likelihood that
minimization will exclude fault-revealing test cases.
Finally, another plausible factor involves the specific test cases included in the test case
pools. To illustrate, we reproduce, in Table 3-11, data on the Wong98 study presented
in [Wong98]. The table lists the ten programs used in the Wong98 study (column 1), the
total number of faulty versions of those programs utilized (column 2), and the number of
tests in the test pools created for those programs (column 3). The next two columns report
data obtained when the researchers used ATAC to minimize the entire test pools for these
programs: column 4 indicates the size of the minimized test pool and column 5 indicates
the total number of faults missed by the minimized suites. From this data we derive column
6, the number of faults detected by the minimized test pools.
Table 3-11. Fault detection abilities of tests used in the Wong98 study

program   number of   test pool   minimized test   number of       number of
          faults      size        suite size       faults missed   faults detected
cal       20          162         6                1               19
checkeq   20          166         3                1               19
col       30          156         3                1               29
comm      15          754         11               3               12
crypt     15          156         2                0               15
look      15          193         6                2               13
sort      23          997         11               1               22
spline    13          700         5                1               12
tr        12          870         2                4               8
uniq      18          431         5                0               18
Consider columns 2 and 6 for program crypt. Two tests in the test pool for crypt were
able to detect all 15 faults in that program. Similarly, 3 tests in the test pool for col were
able to detect 29 of the 30 faults in that program. These are powerful tests in relation to
the faults; similarly powerful tests appear to exist for the other programs. When such tests
are included in test suites, minimized versions of those suites may well exhibit little loss in
fault detection effectiveness.
This data, and the presence of powerful tests in the test pools, suggests why minimized
test pools retain fault-detection effectiveness in the Wong98 studies. We cannot know,
however, the extent to which such powerful tests in the Wong98 test pools are distributed
among the Wong98 test suites, nor can we know that, distributed among those suites, they
would necessarily have been selected by ATAC. Nevertheless, the data supports a conjecture:
characteristics of the tests in the test pool, and their relation in terms of coverage and fault-
exposing potential to the subject programs, can affect the performance of minimization
techniques.
Chapter 4. A New Minimization Technique
Because minimization sometimes results in test suites that are significantly less effective
than the test suites from which they were minimized, we developed a new coverage cri-
terion that might manifest better behavior when used for minimization. The first section
of this chapter introduces mutation analysis and explains how we can use it as a coverage
criterion. Then, the second section shows how this coverage criterion can be used for min-
imization, and it presents an algorithm to enable it. The final section presents our findings
as to the performance of the algorithm.
4.1. Mutation Analysis and Minimization
4.1.1. Mutation Analysis and Sensitivity
One method of assessing test suite effectiveness is known as mutation analysis [Untch93].
Mutation analysis creates many mutant versions of a program: each statement in the pro-
gram is altered based on a set of rules for creating the mutants. The set of rules used to
create the mutants is called the mutagenic operators. Test cases can be run against the mu-
tant versions of the program to see if their outputs differ from that of the original version
of the program. If the output differs, the mutant is said to be killed by that test case. A test
suite that contains test cases that kill every mutant is considered mutation adequate. The
percentage of mutants killed by at least one test case in a test suite is called that test suite's
Mutation Adequacy Score. The extent to which this is an accurate assessment of test suite
effectiveness depends on the hypothesis that a test suite that detects the minor variations
represented by the mutants will also detect real faults.
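As a concrete illustration (our own, not drawn from the experimental materials), consider a single relational-operator mutant of a small C function; a test case kills the mutant exactly when the two versions produce different output:

    #include <stdio.h>

    /* original predicate:            i < 0
       mutant (relational operator):  i <= 0 */
    static const char *classify_orig(int i) { return (i < 0)  ? "negative" : "non-negative"; }
    static const char *classify_mut(int i)  { return (i <= 0) ? "negative" : "non-negative"; }

    int main(void)
    {
        /* only a test case with i == 0 distinguishes the two versions,
           i.e., only such a test kills this mutant */
        for (int i = -1; i <= 1; i++)
            printf("i=%2d: original=%s, mutant=%s\n",
                   i, classify_orig(i), classify_mut(i));
        return 0;
    }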
In [Voas92], Jeffrey Voas proposes a related technique which he calls propagation, infec-
tion, and execution (PIE) analysis. PIE analysis is a dynamic technique for estimating
three characteristics:

1. the probability that a particular section of a program is executed [execution],
2. the probability that the particular section affects the data state [infection], and
3. the probability that a data state produced by that section has an effect on program output [propagation] [Voas92]
Combined, these provide an estimate of a statement's sensitivity, the likelihood that a single
input is able to reveal a hypothetical fault in that statement.
The more sensitive a statement is, the more likely a test case is to reveal any faults in
that statement. Thus a fault in a sensitive statement may be revealed by a single test case
executing the statement, but if the statement is insensitive (has a low sensitivity), it will
often take many test cases executing that statement to reveal a fault.
Unfortunately, the sensitivity estimate is not unqualified. The "execution" component is
specific to a particular input distribution. The accuracy of the "infection" and "propaga-
tion" estimates depends on the extent to which the statistical behavior of faults simulated
through mutation correlates with that of faults found "in the wild".1

1. The mutants used in [Voas92] are slightly different from those traditionally used in mutation analysis, in that in [Voas92] propagation and infection are treated separately. For infection analysis, the code is mutated as in conventional mutation analysis, but instead of counting kills based on output, [Voas92] compares the data states of the original and mutant version immediately after the altered statement is executed. For propagation analysis, the data states, rather than the executable instructions, are perturbed.
The estimate is accurate to the extent that the test suite matches the input distribution and
the faults in the program behave like those simulated through mutation.
4.1.2. Adapting Sensitivity for use as a Coverage Criterion
Probabilistic Statement Sensitivity Coverage (PSSC) is qualitatively different from control-
flow or data-flow criteria. Instead of looking at program components and making sure
that they are all exercised, PSSC requires that each statement be executed enough times
that there is at least a minimum likelihood that, if a fault exists in the statement, it
will be revealed by one of the test cases. The number of tests required to execute a given
statement is calculated from an estimate of that statement's sensitivity.
Given the "propagation" and "infection" components of sensitivity, it is possible to deter-
mine the number of times a statement should be executed to achieve a certain confidence
level that if there were a fault, it would be revealed by testing, using the equation: T =
ln(1-c)/ln(1-Ol) where T is the number of executions, c is the confidence level, and O
lis
an estimate of the likelihood of each test revealing a fault based on the "propagation" and
"infection" components of the sensitivity of location l[Voas92].
For a test suite to be PSSC adequate at a particular confidence level, each statement must
be executed the requisite number of times. As a coverage criterion, PSSC is similar to
statement coverage, where each statement is a requirement that must be covered, but it has
the added element that a statement may need to be exercised by more than one test case
before it is sufficiently exercised. This problem is not equivalent to the traditional minimization
problem, so a multi-hit minimization algorithm must be used.
4.2. An Algorithm to Facilitate Minimization based on PSSC
PSSC, as defined in the previous section, can be used as the basis for test suite minimiza-
tion. Because of the unique nature of PSSC minimization, we require a new algorithm.
This section describes one conventional minimization heuristic, gives a new, more general
algorithm based on it, shows how this can be used for PSSC minimization, and assesses
the complexity of this algorithm.
4.2.1. A Conventional Test Suite Minimization Heuristic
As previously stated, the goal of test suite minimization is to find the smallest set of test
cases that covers all testing requirements with at least one test case. Unfortunately, finding
the minimum set is NP-complete [Garey79]. Harrold et al. present an algorithm, shown in
Figure 4-1, that finds a reduced test suite that satisfies all of the testing requirements, but is
not necessarily of minimal cardinality [Harrold93]. The algorithm uses a heuristic that
selects tests, one at a time, until all of the requirements are exercised, by choosing the test
case that hits the most requirements that are the hardest to satisfy. (In the case of a tie, it
uses the test case that hits the most requirements that are next hardest to satisfy. If neces-
sary, this is repeated until all of the requirements have been examined, and if there is still a
tie, one of the tests is chosen at random.)
Figure 4-1. The Harrold, Gupta, and Soffa Test Suite Minimization Algorithm

algorithm ReduceTestSuite
input   T1, T2, ..., Tn: associated testing sets for r1, r2, ..., rn respectively,
        containing test cases from t1, t2, ..., tnt
output  RS: a representative set of T1, T2, ..., Tn
declare MAX_CARD, CUR_CARD: 1..nt
        LIST: list of ti's
        NEXT_TEST: one of t1, t2, ..., tnt
        MARKED: array[1..n] of boolean, initially false
        MAY_REDUCE: boolean
        Max(): returns the maximum of a set of numbers
        Card(): returns the cardinality of a set
begin
    /* Step 1: initialization */
    MAX_CARD := Max_i(Card(Ti))               /* get the maximum cardinality of the Ti's */
    RS := union of all Ti with Card(Ti) = 1   /* take union of all single-element Ti's */
    foreach Ti such that Ti ∩ RS ≠ ∅ do MARKED[i] := true   /* mark all Ti containing elements in RS */
    CUR_CARD := 1                             /* consider single-element sets first */
    /* Step 2: compute RS according to the heuristic for sets of higher cardinality */
    loop
        CUR_CARD := CUR_CARD + 1              /* consider all sets with next higher cardinality */
        while there are Ti such that Card(Ti) = CUR_CARD and not MARKED[i] do
            /* process all unmarked sets of current cardinality */
            LIST := all tj ∈ Ti where Card(Ti) = CUR_CARD and not MARKED[i]
            NEXT_TEST := SelectTest(CUR_CARD, LIST)   /* get another tj to include in RS */
            RS := RS ∪ {NEXT_TEST}            /* add the test to RS */
            MAY_REDUCE := false
            foreach Ti where NEXT_TEST ∈ Ti do
                MARKED[i] := true             /* mark Ti containing NEXT_TEST */
                if Card(Ti) = MAX_CARD then MAY_REDUCE := true
            endfor
            if MAY_REDUCE then
                MAX_CARD := Max(Card(Ti)), for all i where MARKED[i] = false
        end while
    until CUR_CARD = MAX_CARD
end ReduceTestSuite

- - - - - - -

function SelectTest(SIZE, LIST)
    /* this function selects the next ti to be included in RS */
    declare COUNT: array[1..nt]
begin
    foreach ti in LIST do
        compute COUNT[ti], the number of unmarked Tj's of cardinality SIZE containing ti
    construct TESTLIST consisting of tests from LIST for which COUNT[ti] is the maximum
    if Card(TESTLIST) = 1 then return (the test case in TESTLIST)
    elseif SIZE = MAX_CARD then return (any test case in TESTLIST)
    else return (SelectTest(SIZE+1, TESTLIST))
end SelectTest
The requirements that are the hardest to satisfy are those whose associated set, the set of
test cases that can hit the requirement, has the smallest cardinality. This algorithm has been
described as "greedy on bottlenecks" [Wong98], because it greedily chooses test cases that
make the most progress towards satisfying the bottlenecks, the requirements that are
hardest to satisfy.
4.2.2. A Multi-Hit Minimization Algorithm
The Harrold, Gupta, and Soffa algorithm is general in that it can be applied to any set
covering problem. This includes minimization based on control-flow, data-flow, functional
coverage, test cases that have found faults in the past, and even combinations of these.
All that is necessary is that the problem be expressed as a set of requirements, each with
a set of test cases (or other objects) that can satisfy those requirements. One thing the
algorithm cannot handle, however, is requirements that need to be hit by more than one
test case before they are satisfied. For example, if we wanted to say that all statements in
the program needed to be executed at least twice, the Harrold, Gupta, and Soffa algorithm
could not meet this requirement.
The structure of this algorithm can be adapted to the more general case where each re-
quirement must be hit an arbitrary number of times rather than just once. What is needed
is a more general measure of the difficulty of satisfying a requirement. Instead of using
the cardinality of the requirement’s associated set, the new algorithm defines a dynamic
measure called the requirement's "hitting-factor": the number of test cases that can still
hit the requirement divided by the number of times the requirement still needs
to be hit.
The hitting-factor of a requirement depends on the cardinality of its associated set (c), the
number of times the requirement needs to be satisfied (n), and the number of test cases that
have already hit the requirement (h). The hitting-factor can be expressed as (c-h)/(n-h),
rounded up to the nearest integer. If h >= n, the requirement is satisfied and the hitting
factor is not needed. If c < n, then the number of hits required is more than the number
of test cases that can hit the requirement, and the requirement cannot be satisfied; the
algorithm presented assumes this is not the case.2 Under normal conditions, the hitting-
factor of an unsatisfied requirement is a number from one to the associated set's cardinality.
If the hitting-factor is one, it means that all of the test cases in the associated set are needed
to satisfy the requirement. Higher hitting-factors indicate that any particular test case in
the associated set is less likely to be needed to satisfy that requirement. As requirements
accumulate hits, their hitting-factors increase, but always remain less than the associated
set’s cardinality as long as the requirement is not completely satisfied. When the number
of required hits is one, the hitting-factor of an unsatisfied requirement is its associated set’s
cardinality. Thus in the degenerate case, where all of the requirements only need one hit,
this new algorithm is equivalent to that presented in [Harrold93].
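A minimal C sketch of the hitting-factor computation (our illustration; the sample values match the worked example in Section 4.2.3):

    #include <stdio.h>

    /* c = cardinality of the requirement's associated set,
       n = number of hits required, h = hits accumulated so far.
       Assumes h < n (unsatisfied) and c >= n (satisfiable). */
    int hitting_factor(int c, int n, int h)
    {
        /* (c - h) / (n - h), rounded up to the nearest integer */
        return ((c - h) + (n - h) - 1) / (n - h);
    }

    int main(void)
    {
        printf("%d\n", hitting_factor(11, 1, 0));  /* degenerate case: prints 11 */
        printf("%d\n", hitting_factor(7, 2, 0));   /* 7 candidates, 2 hits needed: prints 4 */
        printf("%d\n", hitting_factor(7, 2, 1));   /* after one hit: prints 6 */
        return 0;
    }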
The new algorithm is shown in Figure 4-2. Briefly, the algorithm first chooses all of the
tests that satisfy requirements with a hitting factor of one, adds them to the reduced suite
(RS), and updates the requirements' hitting factors accordingly.
2. In cases where the original test suite does not satisfy the requirement completely, there are a couple of options available to the implementer: the tester could be asked to add test cases until the requirement is satisfied; the number of hits required could be reduced to the number of hits on the requirement in the original test suite; or the requirement could simply be dropped as unsatisfiable.
Figure 4-2. A Multi-Hit Test Suite Reduction Algorithm

algorithm ReduceTestSuite
bounds  y: number of requirements (HGS: n)
        z: highest number of test cases (HGS: nt)
input   T1, T2, ..., Ty: associated testing sets for r1, r2, ..., ry respectively,
        containing test cases from t1, t2, ..., tz
        N1, N2, ..., Ny: the "hitting number" (minimum number of hits) for r1, r2, ..., ry respectively,
        each an integer from 1 to Card(T1), Card(T2), ..., Card(Ty) respectively
output  S: a representative set of T1, T2, ..., Ty
declare MAX_HF, CUR_HF: 1..z
        LIST: list of ti's
        NEXT_TEST: one of t1, t2, ..., tz
        H1, H2, ..., Hy: the number of test cases in S that are also in T1, T2, ..., Ty respectively
        Max(): returns the maximum of a set of numbers
        HitFac(ri): returns the hitting factor of requirement ri,
                    HitFac(ri) = roundup((Card(Ti) - Hi) / (Ni - Hi))
        IsSat(ri): returns true if Hi >= Ni, otherwise false
begin
    /* Step 1: initialization */
    MAX_HF := Max(HitFac(ri) for every i)     /* get the maximum hitting factor */
    S := union of all Ti with Card(Ti) = 1    /* take union of all single-element Ti's */
    Hi := Hi + Card(Ti ∩ S), for every i      /* record hits for Ti containing elements of S */
    CUR_HF := 1                               /* consider requirements needing all of their test cases */
    /* Step 2: compute S according to the heuristic for sets with higher hitting factors */
    loop
        CUR_HF := CUR_HF + 1                  /* consider all sets with next higher HF */
        while there are i such that HitFac(ri) = CUR_HF and not IsSat(ri) do
            /* process all unsatisfied requirements of the current HF */
            LIST := all tj ∈ Ti where HitFac(ri) = CUR_HF and not IsSat(ri)
            LIST := LIST - S                  /* but remove those we have already chosen */
            NEXT_TEST := SelectTest(CUR_HF, LIST)   /* get another tj to include in S */
            S := S ∪ {NEXT_TEST}              /* add the test to S */
            foreach Ti where NEXT_TEST ∈ Ti do
                Hi := Hi + 1                  /* record the hit on ri */
            endfor
            MAX_HF := Max({0} ∪ {HitFac(ri), for all i where not IsSat(ri)})
        end while
    until CUR_HF = MAX_HF
end ReduceTestSuite

- - - - - - -

function SelectTest(HF, LIST)
    /* this function selects the next ti to be included in S */
    declare COUNT: array[1..z]
begin
    foreach ti in LIST do
        compute COUNT[ti], the number of unsatisfied rj's of hitting-factor HF containing ti
    construct TESTLIST consisting of tests from LIST for which COUNT[ti] is the maximum
    if Card(TESTLIST) = 1 then return (the test case in TESTLIST)
    elseif HF = MAX_HF then return (any test case in TESTLIST)
    else return (SelectTest(HF+1, TESTLIST))
end SelectTest
Figure 4-3. A C Program

#include <stdio.h>

int main()
{
    char *cp;
    int i, j;

    scanf( "%d", &i );

    if( i < 0 ) {
        cp = "hello world";
    } else {
        cp = "good bye world";
    }

    scanf( "%d", &j );
    if( j >= 0 ) {
        printf( "%s", cp );
    }
}
Then, it picks test cases one at a time, and adds each one to the reduced suite, until all of
the requirements are satisfied. The test case that hits the most requirements of the smallest
hitting-factor (HF) is chosen. If necessary, the number of hit requirements of the next
higher hitting-factor is used as a tie breaker. The result is a reduced suite satisfying
all of the requirements with at least the necessary number of hits.
4.2.3. Using the Multi-Hit Reduction Algorithm for PSSC Minimization
To illustrate how a test suite can be minimized for PSSC using the multi-hit minimization
algorithm, consider the following: suppose we wish to test the program shown in Figure 4-3, and we
have the test cases given in Table 4-1. Table 4-2 shows each executable statement in the
program along with information about its sensitivity requirement. (The sensitivity values
listed in the table are chosen for the purposes of illustration only. They reflect an intuitive
estimate of the probability of an error in the statement propagating to the output; in practice
they would be obtained through a more rigorous methodology, such as mutation analysis.)
The number of hits required (HN) is calculated from the sensitivity value using
the equation HN = ln(1-0.75) / ln(1-sens), so that there is a 75% chance that
testing will reveal a fault, if one exists, on any particular line.
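As a worked instance of this formula: for a statement with sensitivity 0.40, HN = ln(1-0.75) / ln(1-0.40) = ln(0.25) / ln(0.60) ≈ 1.386/0.511 ≈ 2.7, which rounds up to the 3 hits shown for r2 and r6 in Table 4-2; for a statement with sensitivity 1.00, a single execution suffices.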
Table 4-1. The Initial Test Suite for the Example Program

test   inputs (i, j)   output
t1     -1, -1          ""
t2     -1, 0           "hello world"
t3     -1, 1           "hello world"
t4     -1, 5           "hello world"
t5     0, -1           ""
t6     0, 0            "good bye world"
t7     0, 5            "good bye world"
t8     1, -1           ""
t9     1, 0            "good bye world"
t10    1, 5            "good bye world"
t11    2, 3            "good bye world"
The algorithm proceeds in the following manner. There is no requirement with an initial
hitting-factor of one, so as the algorithm enters the main loop, each requirement's
hitting-factor is the number of tests in its associated set divided by the number of hits
required (HN), rounded up, and the state of the algorithm is:
RS: {}
HF: r1: 11  r2: 4  r3: 4  r4: 4  r5: 6  r6: 4  r7: 8
Table 4-2. The Coverage Requirements for the Example Program

statement                   req   sens   HN   tests in associated set
scanf( "%d", &i )           r1    0.75   1    t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11
if( i < 0 )                 r2    0.40   3    t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11
cp = "hello world"          r3    0.75   1    t1, t2, t3, t4
cp = "good bye world"       r4    0.60   2    t5, t6, t7, t8, t9, t10, t11
scanf( "%d", &j )           r5    0.50   2    t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11
if( j >= 0 )                r6    0.40   3    t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11
printf( "%s", cp )          r7    1.00   1    t2, t3, t4, t6, t8, t9, t10, t11
The unsatisfied requirements with the lowest hitting-factor are r2, r3, r4, and r6. In these
requirements test cases t1-t11 each hit three of the requirements. (Each of them hit r2, r6,
and one of r3 or r4.) Since all the test cases tie, they are examined against test requirements
of the next hitting-factor, but since none of the tests score hits against the nonexistent
requirements of HF=5, they all are checked against the next set of requirements: r5 with
a hitting-factor of six. Since r5 contains all of the test cases, they are all checked against
r7, which has a hitting-factor of eight. All of the test cases, except t1, t5, and t8, tie with 1
hit against requirements of HF=8. These test cases also tie on all of the remaining hitting
factors, so one of the remaining 8 tests is chosen at random. For the sake of this example,
t4 is chosen. After marking the hits of t4, r1, r3 and r7 are completely satisfied, and the
algorithm’s new state is
RS: {t4}
HF: r2: 5 r4: 4 r5: 10 r6: 5
The lowest hitting-factor is 4, so hits are first counted against r4, leaving t5-t11 with 1 hit
each. All of these tests score 2 hits against the associated sets for r2 and r6, the require-
ments with HF=5, so t5-t11 are checked against r5, where they all earn 1 hit, so one test
case from the set {t5,t6,t7,t8,t9,t10,t11} is chosen at random. Choosing t9 satisfies r5
completely, and increases the hitting factors of r2, r4, and r6, resulting in
RS: {t4,t9}
HF: r2: 9 r4: 6 r6: 9
Tests {t5,t6,t7,t8,t10,t11} hit r4, the only requirement with the lowest hitting-factor. Those
tests also hit both r2 and r6, so one of them, t11 perhaps, is chosen at random. t11 satis-
fies all of the remaining requirements, so ReduceTestSuite returns the set {t4,t9,t11}, a 72%
reduction in test suite size.
4.2.4. Asymptotic Analysis of the Multi-Hit Reduction Algorithm
This multi-hit reduction algorithm can reduce the size of a test suite, while still maintaining
the required coverage, but this is only useful if it runs in a reasonable amount of time.
The single-hit minimization algorithm on which this algorithm is based has a worst-case
runtime of O(y(y+z)r), where y is the number of requirements, z is the number of test cases,
and r is the maximum cardinality of the requirements' associated sets [Harrold93].
It can be shown that this new algorithm has similar worst-case behavior. Let y be the
number of requirements, h be the sum of the number of hits required over all of the re-
quirements, z be the number of test cases, and r be the maximum cardinality of all of the
requirements’ associated sets. As in Harrold et al.’s original algorithm, the most expensive
parts of the new algorithm occur in the SelectTest subprocedure [Harrold93]. The two im-
portant parts of the SelectTest procedure are (1) counting the number of hits of each test,
and (2) picking the best test(s) based on the count. Counting the number of hits takes at
most O(yr) time, because SelectTest will, in the worst case, examine all of the test cases
for each of the requirements. Picking the best test cases involves going through the test
cases and selecting those with the highest count, which can be done in O(z) time, but this
may have to be repeated O(r) times to resolve ties; this is done by calling SelectTest recursively.
SelectTest itself is called from the main algorithm once for each test case that makes it into
the reduced suite. In the worst case, the reduced suite will have O(h) test cases, with each
test case scoring one hit on one requirement. Thus, the overall worst-case runtime of this
multi-hit reduction algorithm is O(h(y+z)r).
4.3. An Experiment with PSSC Minimization
A final experiment was performed to assess the performance characteristics of PSSC
minimization. The experiment was similar to that in Section 3.3, except that PSSC rather than
edge-coverage was used as the adequacy criterion during minimization.
4.3.1. Experimental Design
We used the same subjects, faulty versions, test cases, and initial test suites as were used
in our first two experiments (Section 3.3): the seven small C programs, along with their
faulty versions and test cases, and extended edge-coverage-adequate test suites.
As in the earlier experiments, we used a full-factorial experimental design, minimizing
1000 test suites of various sizes for each of the cells. PSSC's dependence on a confidence
level introduced a third independent variable for the experiment, in addition to the subject
program and test suite size. A set of 14 discrete confidence levels was used, ranging from
0.05 to 0.995. The combination of the 14 confidence levels with each of the 7 programs
resulted in 98 cells for the experiment.
The threats to validity mentioned in Section 3.3.3 apply also to this experiment. Our mini-
mization program may be incorrectly implemented or fail to find the smallest adequate test
suite. We cannot control the nature or locality of the faults. The programs, faults, and test
suites may not be representative of those "in the wild". In our measurements, we treat all
test cases as equally expensive and all faults as equally severe.
4.3.2. Results
4.3.2.1. Minimized Test Suite Size
Figure 4-4 contains boxplots depicting the magnitude of the test suites after minimization
and the variability in their sizes. The boxplots are shown with one set of axes per program.
The horizontal axis for each subject program shows the PSSC-minimization confidence
levels used to produce the corresponding boxplots. Each program's vertical axis enumerates
the number of test cases.
Figure 4-4. Sizes of Test Suites after PSSC minimization
[Boxplots of minimized test suite size, one set of axes per program (totinfo, schedule1, schedule2, tcas, printtok1, printtok2, replace), at confidence levels 0.05, 0.1, 0.2, 0.25, 0.33, 0.5, 0.67, 0.75, 0.8, 0.9, 0.95, 0.975, 0.99, and 0.995.]
Not surprisingly, the higher confidence levels require more tests. For example, on schedule
the median test suite size is 10 at the 0.8 confidence level, and it increases to 23 test cases
at the 0.975 confidence level.
One difference in these results from the results of our other experiments involved the
variance in minimized test suite sizes. The high variability of test suite sizes for print_tokens,
schedule2, and tcas is quite different from the variability among minimized test suite sizes
we found in our first experiment (Section 3.3.4.1), which were very stable despite a large
range of initial test suite sizes. One explanation may be that many of the initial test suites,
while edge-coverage adequate, did not completely satisfy the PSSC criterion for which they
were being minimized. Thus the variability in test suite size may correspond to the extent
to which each of the original test suites satisfied the PSSC criterion.
4.3.2.2. Minimized Test Suite Performance
For each subject program, Figure 4-5 shows the mean sizes and mean number of faults de-
tected by the original test suites, edge-minimized test suites, randomly reduced test suites,
and PSSC-minimized test suites represented by the circles, diamonds, squares, and curved
lines, respectively. The averages for each PSSC confidence level is represented by an as-
terisk. Each mean size can be taken as a rough indicator of the cost of running each type
of test suite. Each mean count of faults detected can be thought of as a measure of the
benefit of running the tests. Generally, the PSSC curve starts near the origin with the point
for confidence level of 0.05. The number of faults rises rapidly at first and the PSSC curve
falls in between the points representing the minimized and randomly reduced test suites.
Figure 4-5. Average Test Suite Size vs. Average Number of Faults Detected
[Seven panels (totinfo, schedule, schedule2, tcas, print tokens, print tokens2, replace): average faults detected versus average test suite size for the original, edge-minimized, and randomly reduced test suites, and for PSSC minimization at confidence levels from 0.05 to 0.995.]
For some of the subjects, such as schedule2, the line is closer to the point representing
the randomly reduced test suites. For others, such as tcas, it is closer to the point
representing the minimized test suites. This suggests that, at the size of edge-minimized test
suites, PSSC-adequate test suites may be less cost-effective than edge-adequate test suites
but more cost-effective than random testing.
The confidence level on the PSSC curve that lies closest to the minimized test suite size varies. For tot_info, tcas, and replace, a confidence level of 0.2 appears comparable to edge-minimization in terms of both size and faults detected. The schedule, schedule2, and print_tokens subjects require higher confidence levels, and thus larger test suites, to detect a number of faults comparable to edge minimization. The print_tokens2 subject performs comparably to edge minimization at a confidence level of 0.67.
The variability in the confidence level required to perform comparably to edge-coverage minimization is unfortunate: it indicates that a practitioner would need either to set a low confidence level and risk losing even more effectiveness than with edge-coverage, or to set a high confidence level and risk having larger test suites than necessary for the desired level of fault detection.
The performance variation also suggests that some faults are more likely to be detected by edge-minimized test suites and others by PSSC-minimized test suites. To investigate this further, we examined the 41 faults in the tcas subject more closely. For each fault, we counted the number of edge-minimized test suites, and the number of 0.2-confidence-level PSSC-minimized test suites, that detect that fault. These test suites have similar characteristics in terms of average faults detected and average test suite size, but
differ in the criteria by which they were created. We found that 16 of the faults were detected by the PSSC-minimized test suites more than twice as often as by the edge-minimized test suites. Eight of the faults were detected by the edge-minimized suites more than twice as often as by the PSSC-minimized suites. The remaining 17 showed less than a 2:1 difference in detection rates. While not conclusive, this would seem to support the hypothesis that different coverage criteria are useful in detecting different classes of faults.
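The bookkeeping behind this comparison is straightforward; the sketch below reproduces it with made-up detection counts (the edge_hits and pssc_hits lists are illustrative, not the study's data).

    # Classify faults by how often two families of minimized test
    # suites detect them; entry i counts the suites detecting fault i.
    edge_hits = [12, 3, 0, 9, 5]   # detections by edge-minimized suites
    pssc_hits = [5, 10, 4, 8, 1]   # detections by 0.2-level PSSC suites

    pssc_favored, edge_favored, comparable = [], [], []
    for fault, (e, p) in enumerate(zip(edge_hits, pssc_hits)):
        if p > 2 * e:        # PSSC suites detect it >2x as often
            pssc_favored.append(fault)
        elif e > 2 * p:      # edge-minimized suites detect it >2x as often
            edge_favored.append(fault)
        else:                # less than a 2:1 difference either way
            comparable.append(fault)

    print(pssc_favored, edge_favored, comparable)  # [1, 2] [0, 4] [3]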
Chapter 5. Conclusion
5.1. Results
In our experiments, edge-coverage-based minimization typically resulted in substantial size reductions. Each program exhibited a remarkably stable minimized test suite size, so the amount of savings depended almost entirely on the size of the original test suite. The experiments with the unextended test suites for the Space program showed that test suites greedily generated to be edge-coverage adequate may still contain 20% more tests than are needed for edge-coverage adequacy. Test suites designed with more test cases yielded even greater savings.
Unfortunately, these savings come at the cost of missing faults that would otherwise be detected. Our experiments on the Siemens programs indicated that this loss can be drastic and unpredictable. Our experimentation with the Space program, however, showed that this cost can be small and relatively stable.
PSSC offers an alternative minimization technique. Because it can be applied at varying confidence levels, it can be used to achieve a range of results along a cost-benefit curve. Although PSSC is less effective than edge-coverage at the same test suite size, PSSC is capable of scaling to test suite sizes larger than edge-coverage yields but still smaller than the original test suites. There are some indications that PSSC may be complementary to edge-coverage, in the sense that the two are useful in detecting different faults.
The multi-hit minimization algorithm provides a new generalization of the traditional minimization algorithm. In addition to enabling PSSC minimization, this algorithm could be
used to scale other coverage criteria by, for example, requiring that each statement be exercised by at least two test cases in the test suite.
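The algorithm's details are given earlier in the thesis; the sketch below is not that implementation, but a greedy heuristic in the same spirit, in which each coverage requirement carries a required hit count (setting every count to 2 gives the "each statement at least twice" criterion just mentioned). The test and requirement names are hypothetical.

    def multi_hit_minimize(coverage, required):
        """Greedy multi-hit reduction sketch.

        coverage: dict mapping test id -> set of requirements it hits
        required: dict mapping requirement -> times it must be hit
        Returns the chosen tests and any residual (unmet) demand.
        """
        remaining = dict(required)   # hits still needed per requirement
        unused = list(coverage)
        selected = []

        def gain(t):                 # outstanding demand a test reduces
            return sum(1 for r in coverage[t] if remaining.get(r, 0) > 0)

        while any(n > 0 for n in remaining.values()) and unused:
            best = max(unused, key=gain)
            if gain(best) == 0:
                break                # no remaining test helps further
            selected.append(best)
            unused.remove(best)
            for r in coverage[best]:
                if remaining.get(r, 0) > 0:
                    remaining[r] -= 1
        return selected, remaining

    suite = {"t1": {"s1", "s2"}, "t2": {"s2", "s3"}, "t3": {"s1", "s3"}}
    picked, unmet = multi_hit_minimize(suite, {"s1": 2, "s2": 1, "s3": 1})
    print(picked)  # ['t1', 't3']: s1 hit twice, s2 and s3 once each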
5.2. Practical Implications
The savings from minimization are substantial and can be of practical importance. This
is especially true if running or verifying the correctness of each test case is expensive in
terms of human labor or external hardware.
In our experience, the cost of running edge-coverage minimization has not been excessive. Once the instrumentation and minimization system is in place, the edge-coverage minimization itself takes only a few minutes of CPU time and very little human interaction, even for the large test suites of the Space application. In addition, if used in the context of regression testing, this cost can be amortized across the several versions in which the minimized test suite is used in place of the original.
PSSC, on the other hand, though it provides an alternative to traditional minimization, required a large amount of work to determine which mutants were killed by each of the test cases. A static method of sensitivity estimation would substantially reduce the time required to perform PSSC minimization.
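For illustration only, the expensive step amounts to building a kill matrix and turning it into per-statement sensitivity estimates; the sketch below assumes, hypothetically, that a test's sensitivity for a statement is estimated as the fraction of that statement's mutants the test kills.

    # Hypothetical mutation data: how many mutants each statement has,
    # and how many each test killed (gathered by running every mutant
    # against every test, which is the costly step noted above).
    mutants_per_stmt = {"s1": 10, "s2": 4}
    kills = {
        "t1": {"s1": 7, "s2": 0},
        "t2": {"s1": 2, "s2": 3},
    }

    # Estimated probability that a test reveals a fault at a statement:
    # killed mutants / total mutants for that statement.
    sensitivity = {
        t: {s: k / mutants_per_stmt[s] for s, k in per_stmt.items()}
        for t, per_stmt in kills.items()
    }
    print(sensitivity["t1"]["s1"])  # 0.7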
Unfortunately, the potential loss of fault detection is problematic, so in any case minimization should be used with caution. It would probably be unwise to minimize on coverage criteria less stringent than those used to produce the initial test suite. For example, if the initial test suite was designed by creating a core of functional tests and then adding additional
tests to achieve edge-coverage adequacy, then it would make more sense to use a minimization tool that allows the functional requirements to be included in the criteria against which the test suite is minimized. If minimization is to be used in the context of regression testing, it may be wise to use the unminimized test suite to test the initial release, and then to choose between the unminimized and minimized test suites for later releases on the basis of how well the minimized test suite would have detected faults in the initial version.
5.3. Limitations of This Investigation and Future Work
Our empirical investigation was limited in the size and nature of the subject programs, in
the number and nature of faults seeded in the programs, and in the nature of test cases and
test suites utilized.
It is unclear to what extent our results extend to other programs. The only way to address this deficiency is to perform further experimentation on a wider range of programs,
especially additional programs as large as or larger than the Space program.
The faults used with the Siemens programs were artificial faults seeded by researchers. The faults used in the Space program were real faults found during and after the program's development. Unfortunately, we are not sure how accurately the faults that have been detected in the Space program reflect the actual distribution of faults that existed in the program at the time when minimization would have been employed. A better understanding of the nature, distribution, and severity of faults existing "in the wild" would be useful in better understanding the practical cost of the loss in the number of faults detected.
The original test cases and test suites we used were created by researchers and may not accurately reflect the test cases and test suites used in practice. In particular, our test suites were chosen from large test pools containing test cases generated according to multiple coverage criteria. Further studies using either test suites from practitioners or test suites that better model practice (for example, test suites generated from separate pools of functionally oriented and edge-coverage-oriented test cases) might provide a more accurate simulation of minimization's use in practice.
The minimization algorithms discussed in this paper do not always find the minimally sized test suite meeting all requirements; they find a test suite, as small as or smaller than the original, that satisfies all of the requirements. An empirical investigation of various test suite minimization algorithms, comparing their optimality and execution time when minimizing test suites from practice, may be useful for practitioners.
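On tiny instances, greedy reduction can be compared directly against exhaustive search for the true minimum. The coverage data below is contrived (hypothetical tests and requirements) to show the classic failure mode: the greedy heuristic keeps three tests where two suffice.

    from itertools import combinations

    # Contrived coverage data: requirements covered by each test.
    coverage = {
        "t_big":  {1, 2, 3, 4},
        "t_odd":  {1, 3, 5},
        "t_even": {2, 4, 6},
    }
    requirements = {1, 2, 3, 4, 5, 6}

    def greedy(coverage, requirements):
        # Repeatedly take the test covering the most uncovered
        # requirements; assumes the pool can cover everything.
        uncovered, chosen = set(requirements), []
        while uncovered:
            t = max(coverage, key=lambda t: len(coverage[t] & uncovered))
            chosen.append(t)
            uncovered -= coverage[t]
        return chosen

    def exact(coverage, requirements):
        # Brute force over subsets; feasible only for tiny instances.
        for size in range(1, len(coverage) + 1):
            for combo in combinations(coverage, size):
                if set().union(*(coverage[t] for t in combo)) >= requirements:
                    return list(combo)

    print(greedy(coverage, requirements))  # ['t_big', 't_odd', 't_even']
    print(exact(coverage, requirements))   # ['t_odd', 't_even']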
Test suite minimization has the potential for substantial savings in the cost of testing, but further investigation is needed to better understand the practical effect of the risks inherent in test suite minimization.
Bibliography
Balcer, M., Hasling, W., and Ostrand, T. Automatic generation of test scripts from formal test specifications. Proc. of the 3rd Symp. on Softw. Testing, Analysis, and Verification, pages 210-218, December 1989.
Beizer, B. Software Testing Techniques. New York, NY: Van Nostrand Reinhold, 1990.
Chen, T.Y. and Lau, M.F. Dividing strategies for the optimization of a test suite. Information Processing Letters, 60(3):135-141, March 1996.
Frankl, P.G. and Weiss, S.N. An experimental comparison of the effectiveness of branch testing and data flow testing. IEEE Trans. on Softw. Eng., 19(8):774-787, August 1993.
Garey, M.R. and Johnson, D.S. Computers and Intractability. New York: W.H. Freeman, 1979.
Graves, T.L., Harrold, M.J., Kim, J-M., Porter, A., and Rothermel, G. An empirical study of regression test selection techniques. Proc. 20th Int'l. Conf. on Softw. Eng., April 1998.
Harrold, M.J., Gupta, R., and Soffa, M.L. A methodology for controlling the size of a test suite. ACM Trans. on Softw. Eng. and Methodology, 2(3):270-285, July 1993.
Harrold, M.J. and Rothermel, G. Aristotle: A System for Research on and Development of Program Analysis Based Tools. Technical Report OSU-CISRC-3/97-TR17, The Ohio State University, March 1997.
Horgan, J.R. and London, S.A. ATAC: A data flow coverage testing tool for C. Proc. Symp. on Assessment of Quality Softw. Dev. Tools, pages 2-10, May 1992.
Hutchins, M., Foster, H., Goradia, T., and Ostrand, T. Experiments on the effectiveness of dataflow- and controlflow-based test adequacy criteria. Proc. 16th Int'l. Conf. on Softw. Eng., pages 191-200, May 1994.
Johnson, R. Elementary Statistics. Sixth Edition. Belmont, CA: Duxbury Press, 1992.
Moore, D.S. and McCabe, G.P. Introduction to the Practice of Statistics. Third Edition. New York: W.H. Freeman and Company, 1999.
Offutt, J., Pan, J., and Voas, J.M. Procedures for reducing the size of coverage-based test sets. Proc. of the Twelfth Int'l. Conf. on Testing Comp. Softw., pages 111-123, June 1995.
Ostrand, T.J. and Balcer, M.J. The category-partition method for specifying and generating functional tests. Comm. of the ACM, 31(6), June 1988.
Rothermel, G. and Harrold, M.J. Analyzing regression test selection techniques. IEEE Trans. on Softw. Eng., 22(8):529-551, August 1996.
Rothermel, G., Harrold, M.J., Ostrin, J., and Hong, C. An empirical study of the effects of minimization on the fault detection capabilities of test suites. Proc. of the Int'l. Conf. on Softw. Maintenance, November 1998.
Untch, R.H., Offutt, A.J., and Harrold, M.J. Mutation Analysis Using Mutant Schemata. International Symposium on Software Testing and Analysis, pages 139-148, June 1993.
Voas, J.M. PIE: A Dynamic Failure-Based Technique. IEEE Trans. on Softw. Eng., 18(8):717-727, August 1992.
Vokolos, F.I. and Frankl, P.G. Empirical evaluation of the textual differencing regression testing technique. Proc. of the Int'l. Conf. on Softw. Maintenance, pages 44-53, November 1998.
Wong, W.E., Horgan, J.R., London, S., and Mathur, A.P. Effect of test set size and block coverage on the fault detection effectiveness. Proc. Fifth Int'l. Symp. on Softw. Rel. Engr., pages 230-238, November 1994.
Wong, W.E., Horgan, J.R., London, S., and Mathur, A.P. Effect of test set minimization on fault detection effectiveness. Proc. 17th Int'l. Conf. on Softw. Eng., pages 41-50, April 1995.
Wong, W.E., Horgan, J.R., Mathur, A.P., and Pasquini, A. Test set size minimization and fault detection effectiveness: A case study in a space application. Proc. 21st Annual Int'l. Comp. Softw. & Applic. Conf., pages 522-528, August 1997.
Wong, W.E., Horgan, J.R., London, S., and Mathur, A.P. Effect of test set minimization on fault detection effectiveness. Software - Practice and Experience, 28(4):247-369, April 1998.