Test Suite Minimization: An Empirical Investigation
by
Jeffery von Ronne
A PROJECT
submitted to
Oregon State University
University Honors College
in partial fulfillment of the requirements for the degree of
Honors Bachelors of Science in Computer Science (Honors Scholar)
Presented May 28, 1999. Commencement June 1999.
AN ABSTRACT OF THE THESIS OF

Jeffery von Ronne for the degree of Honors Bachelors of Science in Computer Science
presented on May 28, 1999. Title: Test Suite Minimization: An Empirical Investigation.

Abstract approved: Gregg Rothermel
Test suite minimization techniques attempt to reduce the cost of
saving and reusing tests during software maintenance, by eliminating
redundant tests from test suites. A potential drawback of these
techniques is that in minimizing a test suite, they might reduce
the ability of that test suite to reveal faults in the software.
Previous studies have shown that sometimes this reduction is small,
but sometimes this reduction is severe. This work investigates
the minimization process, what factors can affect its performance,
and techniques for reducing this loss.
Test Suite Minimization: An Empirical Investigation

by

Jeffery von Ronne

A PROJECT

submitted to

Oregon State University

University Honors College

in partial fulfillment of the requirements for the degree of

Honors Bachelors of Science in Computer Science (Honors Scholar)

Presented May 28, 1999. Commencement June 1999.

APPROVED:

Honors Bachelors of Science in Computer Science project of Jeffery von Ronne presented on May 28, 1999.

Mentor, representing Computer Science

Committee Member, representing Mathematics

Committee Member and Chair, Department of Computer Science

Dean of University Honors College

I understand that my project will become part of the permanent collection of Oregon State
University Honors College. My signature below authorizes release of my project to any
reader upon request.

Jeffery von Ronne, Author
Acknowledgment
Many thanks are due to Dr. Rothermel, who provided much advice and guidance during
the past year, and who collaborated on the work in this thesis.
My other committee members were Dr. Robby Robson and Dr. Michael Quinn.
Dr. Roland Untch of Middle Tennessee State University provided the mutation data
necessary for the experiments with PSSC minimization. Chengyun Chu prepared the Space
program and assisted in the preparation of the mutation data.
Dr. Mary Jean Harrold and Christie Hong of Ohio State University and Jeffery Ostrin also
collaborated on parts of this work.
The "Siemens" programs were provided by Siemens Corporate Research. The Space
program came from the European Space Agency via Drs. Pasquini and Phyllis.
The NSF funded my work through a Research Experience for Undergraduates grant to Dr.
Rothermel. The equipment and other collaborators were funded in part by grants from
Microsoft and the NSF.
Thanks everyone.
Contributing Co-Authors
The second and third chapters of this thesis are based on an article entitled "Experiments
to Assess the Cost-Benefits of Test Suite Minimization" by Dr. Gregg Rothermel, Dr. Mary
Jean Harrold (Ohio State University), Christie Hong (Ohio State University), and myself,
which is currently in preparation for submission to Transactions on Software Engineering,
and is a revised and expanded version of an earlier paper, entitled "An empirical study of the
effects of minimization on the fault detection capabilities of test suites," which was authored
by Dr. Gregg Rothermel, Dr. Mary Jean Harrold, Christie Hong, and Jeffery Ostrin, and
presented at the November 1998 International Conference on Software Maintenance.
Table of Contents

1. Introduction and Motivation
1.1. Motivation
1.2. Overview of This Thesis
2. Background and Literature Review
2.1. Test suite minimization
2.2. Previous empirical work
2.2.1. The Wong98 study
2.2.2. The Wong97 study
3. Edge-Minimization Experiments
3.1. Research Questions
3.2. Measures and Tools
3.2.1. Measures
3.2.1.1. Measuring savings
3.2.1.2. Measuring costs
3.2.2. Tool infrastructure
3.3. Experiments with smaller C programs
3.3.1. Subject programs, faulty versions, test cases, and test suites
3.3.2. Experiment design
3.3.3. Threats to validity
3.3.4. Minimization of edge-coverage-adequate test suites
3.3.4.1. Test suite size reduction
3.3.4.2. Fault detection effectiveness reduction
3.3.5. Minimization of randomly generated test suites
3.3.5.1. Test suite size reduction
3.3.5.2. Fault detection effectiveness reduction
3.4. Experiment with the Space Program
3.4.1. Subject program, faulty versions, test cases, and test suites
3.4.2. Experiment design
3.4.3. Threats to validity
3.4.4. Data and Analysis
3.4.4.1. Test suite size reduction
3.4.4.2. Fault detection effectiveness reduction
3.5. Comparison to Previous Empirical Results
4. A New Minimization Technique
4.1. Mutation Analysis and Minimization
4.1.1. Mutation Analysis and Sensitivity
4.1.2. Adapting Sensitivity for use as a Coverage Criterion
4.2. An Algorithm to Facilitate Minimization based on PSSC
4.2.1. A Conventional Test Suite Minimization Heuristic
4.2.2. A Multi-Hit Minimization Algorithm
4.2.3. Using the Multi-Hit Reduction Algorithm for PSSC Minimization
4.2.4. Asymptotic Analysis of the Multi-Hit Reduction Algorithm
4.3. An Experiment with PSSC Minimization
4.3.1. Experimental Design
4.3.2. Results
4.3.2.1. Minimized Test Suite Size
4.3.2.2. Minimized Test Suite Performance
5. Conclusion
5.1. Results
5.2. Practical Implications
5.3. Limitations of This Investigation and Future Work
Bibliography
List of Figures

3-1. Percentage of Inputs that Expose Each Fault
3-2. Size Distribution among Unminimized Test Suites for the Siemens Programs
3-3. Size of Minimized vs. Size of Original Test Suites
3-4. Percent Reduction in Test Suite Size vs. Original Test Suite Size
3-5. Minimization: Percentage Effectiveness Reduction vs. Original Size
3-6. Effectiveness in Original and after Minimization vs. Original Size
3-7. Random Reduction: Percentage Effectiveness Reduction vs. Original Suite Size
3-8. Minimization and Random Reduction: Fault Detection vs. Original Size
3-9. Random Reduction: Percent Effectiveness Reduction
3-10. Percentage of Test Cases that Expose each of Space's Faults
3-11. Size of Minimized Test Suites vs. Size of Original Test Suites
3-12. Percent Reduction in Test Suite vs. Original Test Suite Size
3-13. Percent Reduction in Effectiveness vs. Original Size
3-14. Original and Minimized: Faults Detected vs. Original Size
4-1. The Harrold, Gupta, and Soffa Test Suite Minimization Algorithm
4-2. A Multi-Hit Test Suite Reduction Algorithm
4-3. A C Program
4-4. Sizes of Test Suites after PSSC Minimization
4-5. Average Test Suite Size vs. Average Number of Faults Detected
List of Tables

3-1. The Siemens Programs
3-2. Correlation Between Size Reduction and Original Size
3-3. Minimization: Correlation between Effectiveness Reduction and Original Size
3-4. Random Reduction: Correlation between Effectiveness Loss and Original Size
3-5. Comparison of Fault Detection Reduction
3-6. Comparison of Fault Detection Reduction Variance
3-7. The Space Application
3-8. Correlation between Size Reduction and Initial Size
3-9. Correlation between Initial Size and Effectiveness Reduction
3-10. Average Reductions in Fault Detection Effectiveness
3-11. Fault detection abilities of tests used in the Wong98 study
4-1. The Initial Test Suite for the Example Program
4-2. The Coverage Requirements for the Example Program
Chapter 1. Introduction and Motivation
1.1. Motivation
Testing is an important but expensive task necessary for the construction of high quality
software. As such, there is great potential for any practical technique that enables the
detection of more faults with limited software testing funds. One testing strategy is to
orient the testing regimen around concrete, achievable criteria. These include functional
tests, designed to exercise the program's documented features, and structural tests,
designed to exercise each statement in the program. It is thought that a testing regimen
designed around explicit criteria such as those just mentioned is more effective than either
random or ad hoc testing.1 In fact, experimentation, such as that done by researchers at
Siemens, has shown that structural testing based on either control-flow or data-flow coverage
criteria can achieve significantly better fault detection than random testing [Hutchins94].
Coverage criteria are also used as a stopping point to decide when a program is sufficiently
tested. In this case, additional tests are added until the test suite has achieved a specified
coverage level according to a specific adequacy criterion. For example, to achieve statement
coverage adequacy for a program, one would add test cases to the test suite
until each statement in that program is executed by at least one of the test cases.
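The adequacy check and the augmentation loop are mechanical. The following minimal sketch (Python; the data structures and helper names are illustrative assumptions, not from the thesis: each test case is mapped to the set of statements it executes) shows one way they might be realized:

```python
def is_statement_adequate(coverage, all_statements):
    """True iff every statement is executed by at least one test case.
    coverage: dict mapping each test case to the set of statements it executes."""
    covered = set().union(*coverage.values()) if coverage else set()
    return all_statements <= covered

def augment_to_adequacy(suite, candidate_pool, all_statements):
    """Add candidate test cases until the suite is statement-coverage adequate
    (or the pool is exhausted). Both suite arguments are test-to-statements dicts."""
    suite = dict(suite)
    covered = set().union(*suite.values()) if suite else set()
    for test, statements in candidate_pool.items():
        if all_statements <= covered:
            break
        if statements - covered:      # contributes at least one new statement
            suite[test] = statements
            covered |= statements
    return suite
```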
It is often the case that as a program evolves, additional tests are needed to maintain ade-
quate coverage. Sometimes, as the test suite grows, it can become prohibitively expensive
1. Random testing is selecting inputs at random, from some input distribution, and using those as test cases. Ad hoc testing is testing with inputs chosen by the tester with no explicit selection criteria.
to execute on new versions of the program. These test suites will often contain test cases
that are no longer needed to satisfy the coverage criteria, because they are now obsolete or
redundant [Chen96, Harrold93, Horgan92, Offutt95].2 For example, Harrold et al. propose
that a reduced test suite, made up of the smallest subset of the test cases that still exercises
all of the coverage items, could be used in place of the original test suite [Harrold93]. The
reduced subset of the original test suite will be referred to as a minimized test suite, and the
process of obtaining the minimized test suite will be called minimization.
Unfortunately, minimized test suites are not without drawbacks. In addition to the cost
of determining the reduced set, minimization may remove test cases that detect program
faults that are not detected by other test cases that satisfy the same criterion. In the worst
case, a minimized test suite may detect none of the faults that would be detected by the
original test suite. This work begins to quantify this loss over a limited
range of coverage criteria, programs, program faults, and test cases and compares it to the
benefit in reduced test suite size.
1.2. Overview of This Thesis
Some studies have shown that minimization can result in significant savings in test suite
size with little reduction in the ability of the minimized test suite to detect faults [Wong95,
Wong97, Wong98]. This work, however, shows that this is not necessarily the case. For the
combination of programs, faults, and types of test suites we utilized in two empirical stud-
2. Obsolete test cases no longer exercise any coverage items. Redundant test cases are those that exercise only coverage items that are also exercised by other test cases in the test suite.
ies, the loss in fault detection was substantial. While a third study showed a less extreme
loss in fault detection, that loss was still both statistically and practically significant.
These findings motivated the search for alternative coverage criteria that could be used in
place of or in conjunction with structural criteria. This resulted in a new coverage criterion:
Probabilistic Statement Sensitivity Coverage. In the process, a new minimization heuristic
was developed.
The next chapter will discuss coverage criteria, test suite minimization, and previous work.
The third chapter will discuss the experiments we conducted to assess the performance of
a conventional minimization technique. Chapter 4 introduces the PSSC criterion, explains
how it could be used, and compares its performance to conventional techniques. Finally,
the conclusion will recap our experiment results, explain the practical consequences of this
work, and suggest areas for further study.
Chapter 2. Background and Literature Review
2.1. Test suite minimization
The test suite minimization problem may be stated as follows [Harrold93]:

Given: Test suite T; a set of test case requirements r1, r2, ..., rn that must be satisfied to
provide the desired test coverage of the program; and subsets of T, T1, T2, ..., Tn, one
associated with each of the ri, such that any one of the test cases tj belonging to Ti can be
used to test ri.

Problem: Find a representative set of test cases from T that satisfies all of the ri.

The ri in the foregoing statement can represent various test case requirements, such as
source statements, decisions, definition-use associations, or specification items.

A representative set of test cases that satisfies all of the ri must contain at least one test
case from each Ti; such a set is called a hitting set of the group of sets T1, T2, ..., Tn. To
achieve a maximum reduction, it is necessary to find the smallest representative set of test
cases. However, this subset of the test suite is a minimum cardinality hitting set of the
Tis, and the problem of finding such a set is NP-complete [Garey79]. Thus, minimization
techniques resort to heuristics.
Several test suite minimization techniques have been proposed (e.g., [Chen96, Harrold93,
Horgan92, Offutt95]); in this work we utilize the technique of Harrold, Gupta, and Soffa
[Harrold93].
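To make the hitting-set formulation concrete, the sketch below implements the classic greedy heuristic, which repeatedly selects the test case that satisfies the most remaining requirements. This is a simpler stand-in, not the Harrold, Gupta, and Soffa algorithm itself (which additionally orders requirements by the cardinality of their Ti sets); Python and the data layout are illustrative assumptions.

```python
def greedy_minimize(requirements):
    """Greedy approximation of a minimum representative set.
    requirements: dict mapping each requirement r_i to the set T_i of
    test cases that satisfy it. Returns a hitting set of the T_i."""
    unsatisfied = {r: set(t) for r, t in requirements.items() if t}
    selected = set()
    while unsatisfied:
        # Count how many remaining requirements each test case satisfies.
        counts = {}
        for tests in unsatisfied.values():
            for t in tests:
                counts[t] = counts.get(t, 0) + 1
        best = max(counts, key=counts.get)
        selected.add(best)
        # Drop every requirement the chosen test case satisfies.
        unsatisfied = {r: t for r, t in unsatisfied.items() if best not in t}
    return selected
```

For example, greedy_minimize({"r1": {"t1", "t2"}, "r2": {"t2"}, "r3": {"t3"}}) returns {"t2", "t3"}.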
2.2. Previous empirical work
Many empirical studies of software testing have been performed. Some of these studies,
such as those reported in References [Frankl93, Hutchins94, Wong94], provide only indi-
rect data about the effects of test suite minimization through consideration of the effects of
test suite size on costs and benefits of testing. Other studies, such as the study reported in
Reference [Graves98], provide only indirect data about the effects of test suite minimiza-
tion through a comparison of regression test selection techniques that practice or do not
practice minimization.1
Recent studies by Wong, Horgan, London, and Mathur [Wong95, Wong98]2 and Wong,
Horgan, Mathur, and Pasquini [Wong97], however, directly examine the costs and benefits
of test suite minimization. We refer to these studies collectively as the “Wong” studies, and
individually as the “Wong98” and “Wong97” studies. We summarize the results of these
studies here; the references provide further details.
2.2.1. The Wong98 study
The Wong98 study involved ten common C UNIX utility programs, including nine pro-
grams ranging in size from 90 to 289 lines of code, and one program of 842 lines of code.
1. Whereas minimization considers a program and test suite, regression test selection considers a program, test suite, and modified program version, and selects test cases that are appropriate for that version without removing them from the test suite. The problems of regression test selection and test suite minimization are thus related but distinct. For further discussion of regression test selection see Reference [Rothermel96].
2. Reference [Wong98] (1998) extends work reported earlier in Reference [Wong95] (1995); thus, except where otherwise noted, we here focus on the most recent (1998) reference.
For each of these programs, the researchers used a random domain-based test generator to
generate an initial test case pool; the number of test cases in these pools ranged from 156
to 997. No attempt was made, in generating these pools, to achieve complete coverage of
program components (blocks, decisions, or definition-use associations).
The researchers next drew multiple distinct test suites from their test case pools, by ran-
domly selecting test cases. The resulting test suites achieved basic block coverages ranging
from 50% to 95%; overall, 1198 test suites were generated. Reference [Wong98] reports
the sizes of the resulting test suites as averages over groups of test suites that achieved
similar coverage: 270 test suites belonged to groups in which average test suite size ranged
from 9.07 to 33.73 test cases, and 928 test suites belonged to groups in which average test
suite size ranged from only 1 to 4.43 test cases.
The researchers enlisted graduate students to inject simple mutation-like faults into each
of the subject programs. The researchers excluded faults that could not be detected by any
test case. All told, 181 faulty versions of the programs were retained for use in the study.
To assess the difficulty of detecting these faults, the researchers measured the percentages
of test cases, in the associated test pools, that were able to detect the faults. Of the 181
faults, 78 (43%) were “Quartile I” faults detectable by fewer than 25% of the associated
test cases, 42 (23%) were “Quartile II” faults detectable by between 25% and 50% of the
associated test cases, 37 (20%) were “Quartile III” faults detectable by between 50% and
75% of the associated test cases, and 24 (13%) were “Quartile IV” faults detectable by at
least 75% of the associated test cases.
The researchers minimized their test suites using ATACMIN [Horgan92], a minimization
tool based on an implicit enumeration algorithm that found exact minimization solutions
for all of the test suites utilized in the study. Test suites were minimized with respect to
block, decision, and all-uses dataflow coverage. The researchers measured the reduction
in test suite size achieved through minimization, and the reduction in fault-detection effec-
tiveness of the minimized test suites. The researchers also repeated this procedure on the
entire test pools (effectively treating these test pools as if they were test suites). Finally,
they used null hypothesis checking to determine whether the minimized test suites had bet-
ter fault detection capabilities than test suites of the same size generated randomly from
the unminimized test suites.
The researchers drew several overall conclusions from the study, including the following:
• As the coverage achieved by initial test suites increased, minimization produced greater
savings with respect to those test suites, at rates ranging from 0% (for several of the
50-55% coverage suites) to 72.79% (for one of the 90-95% block coverage suites).
• As the coverage achieved by initial test suites increased, minimization produced greater
losses in the fault-detection effectiveness of those suites. However, losses in fault detec-
tion effectiveness were small compared to savings in test suite size: in all but one case,
reductions were less than 7.27 percent, and most reductions were less than 4.99 percent.
• Fault difficulty partially determined whether minimization caused losses in fault-
detection effectiveness: Quartile I and II faults were more easily missed than Quartile
III and IV faults following minimization.
• The null hypothesis testing showed that minimized test suites retain a size/effectiveness
advantage over their random counterparts.
The authors draw the following overall conclusion:
...when the size of a test set is reduced while the coverage is kept constant, there is little or no
reduction in its fault detection effectiveness.... A test set which is minimized to preserve its
coverage is likely to be as effective for detecting faults at a lower execution cost. [Wong98].
2.2.2. The Wong97 study
Whereas the Wong98 study examined test suite minimization on 10 common Unix utili-
ties, the Wong97 study involved a single C application developed for the European Space
Agency to aid in the management of large antenna arrays. At 6,100 executable lines, this
application is several times the size of the largest program used for the Wong98 study.
Unlike the Wong98 study, in which an initial pool of test cases was generated randomly
based solely on program specifications, the Wong97 study used a pool of 1000 test cases
generated based on an operational profile.
In the Wong98 study, test suites were generated and categorized based on block coverage.
For the Wong97 study, two different procedures were followed for generating test suites:
the first to create test suites of fixed size, and the second to create test suites of fixed block-
coverage. For the fixed size test suites, test cases were chosen randomly from the test
pool until the desired number of test cases had been selected. In all, 120 test suites were
generated in this manner: 30 distinct test suites for each of the target sizes of 50, 100,
150, 200. For the fixed coverage test suites, test cases were chosen randomly from the test
pool until the test suite reached the desired coverage. Only test cases that added coverage
were added to the fixed coverage test suites. In all, 180 test suites were generated in this
manner: 30 distinct test suites for each of the target coverages ranging from 50% to 75%
block coverage.
Whereas the faults in the Wong98 study were injected by graduate students, the faults used
in the Wong97 study were obtained from an error log maintained during the creation of
the application. The researchers selected 16 of these faults, of which all but one were
detected by fewer than 7% of the test cases, making them similar in detection difficulty to
the “Quartile I” faults used in the Wong98 study. The exceptional fault was detected by
320 (32%) of the test cases.
As in the Wong98 study, all of the test suites were minimized using ATACMIN. In both
studies, the size of each test suite was reduced, while the coverage was kept constant. In
the Wong97 study, however, minimization with respect to block coverage was the only
minimization attempted. Reduction in test suite size and in fault detection effectiveness
were measured. Finally, null hypothesis testing was used to compare test suites minimized
for coverage to test suites that were randomly minimized.
The researchers drew the following overall conclusions from the study:
• There were substantial reductions in size achieved from minimizing the fixed size test
suites. For the fixed coverage test suites, reductions in size also occurred but were
smaller.
• As in the Wong98 study, the effectiveness reductions of the minimized test suites
were smaller than the size reductions, so that minimized test suites resulted in a
size/effectiveness advantage over the unminimized test suites. The average effective-
ness reduction due to minimization was less than 7.3%, and most reductions were less
than 3.6%.
• The null hypothesis testing again showed that minimized test suites retain a
size/effectiveness advantage over their random counterparts.
Thus, the Wong97 study supports the findings of the Wong98 study, while broadening the
scope of the study in terms of both the programs under scrutiny and the types of initial test
suites utilized.
Chapter 3. Edge-Minimization Experiments
3.1. Research Questions
The Wong studies leave a number of open research questions, primarily concerning the
extent to which the results observed in those studies generalize to other testing situations.
Among the open questions are the following, which motivate the present work.
1. How does minimization fare in terms of costs and benefits when test suites have a
wider range of sizes than the test suites utilized in the Wong studies?
2. How does minimization fare in terms of costs and benefits when test suites are
coverage-adequate?
3. How does minimization fare in terms of costs and benefits when test suites contain
additional coverage-redundant test cases?
The first and third questions are addressed by the Wong97 study in its use of fixed-size
test suites; however, that study examines only one program. Neither of the Wong studies
considers the second question.
Test suites used in practice often contain test cases designed not for code coverage, but
rather, designed to exercise product features, specification items, or exceptional behaviors.
Such test suites may contain larger numbers of test cases, and larger numbers of coverage-
redundant test cases, than the test suites utilized in the Wong98 study, or than the coverage-
based test suites utilized in the Wong97 study.
Similarly, a typical tactic for utilizing coverage-based testing is to begin with a base of
specification-based tests, and add additional tests to achieve complete coverage. Such test
suites may also contain greater coverage-redundancy than the coverage-based test suites
utilized in the Wong studies, but can be expected to distribute coverage more evenly than
the fixed-size test suites constructed by random selection for the Wong97 study.
It is important to understand the cost-benefit tradeoffs involved in minimizing such test
suites. Thus, to investigate these tradeoffs, we performed a family of experiments.
3.2. Measures and Tools
We now discuss the measures and tools utilized in our experiments; subsequent sections
discuss the individual experiments. Let T be a test suite, and let Tmin be the reduced test
suite that results from the application of a minimization technique to T.
3.2.1. Measures
We need to measure the costs and savings of test suite minimization.
3.2.1.1. Measuring savings.
Test suite minimization lets testers spend less time executing test cases, examining test
results, and managing the data associated with testing. These savings in time are dependent
on the extent to which minimization reduces test suite size. Thus, to measure the savings
that can result from test suite minimization, we can follow the methodology used in the
Wong studies and measure the reduction in test suite size achieved by minimization. For
each program, we measure savings in terms of the number and the percentage of tests
eliminated by minimization. (The former measure provides a notion of the magnitude of
the savings; the latter lets us compare and contrast savings across test suites of varying
sizes.) The number of tests eliminated is given by |T| - |Tmin|, and the percentage of tests
eliminated is given by ((|T| - |Tmin|) / |T|) * 100.
This approach makes several assumptions: it assumes that all test cases have uniform costs,
it does not differentiate between components of cost such as CPU time or human time,
and it does not directly measure the compounding of savings that results from using the
minimized test suites over a sequence of subsequent releases. This approach, however,
has the advantage of simplicity, and using it we can draw several conclusions that are
independent of these assumptions and compare our results with those achieved in the Wong
studies.
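In code, the two savings measures are a one-liner apiece. A minimal sketch (Python; treating suites as plain collections is our assumption):

```python
def size_savings(T, T_min):
    """Savings measures of Section 3.2.1.1: number and percentage of
    tests eliminated by minimization."""
    eliminated = len(T) - len(T_min)
    return eliminated, 100.0 * eliminated / len(T)
```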
3.2.1.2. Measuring costs.
There are two costs to consider with respect to test suite minimization. The first cost is
the cost of executing a minimization tool to produce the minimized test suite. However, a
minimization tool can be run following the release of a product, automatically and during
off-peak hours, and in this case the cost of running the tool may be noncritical. Moreover,
having minimized a test suite, the cost of minimization is amortized over the uses of that
suite on subsequent product releases, and thus assumes progressively less significance in
relation to other costs.
The second cost to consider is more significant. Test suite minimization may discard some
test cases that, if executed, would reveal defects in the software. Discarding these test cases
reduces the fault detection effectiveness of the test suite. The cost of this reduced effec-
tiveness may be compounded over uses of the test suite on subsequent product releases,
and the effects of the missed faults may be critical. Thus, in this experiment, we focus on
the costs associated with discarding fault-revealing test cases.
We considered two methods for calculating reductions in fault detection effectiveness.
On a per-test-case basis: One way to measure the cost of minimization in terms of effects
on fault detection, given faulty program P and test suite T, is to identify the test cases in T
that reveal a fault in P but are not in Tmin. This quantity can be normalized by the number
of fault-revealing test cases in T. One problem with this approach is that multiple test cases
may reveal a given fault. In this case some test cases could be discarded without reducing
fault-detection effectiveness; this measure penalizes such a decision.

On a per-test-suite basis: Another approach is to classify the results of test suite min-
imization, relative to a given fault in P, in one of three ways: (1) no test case in T is
fault-revealing, and, thus, no test case in Tmin is fault-revealing; (2) some test case in both
T and Tmin is fault-revealing; or (3) some test case in T is fault-revealing, but no test case in
Tmin is fault-revealing. Case 1 denotes situations in which T is inadequate. Case 2 indicates
a use of minimization that does not reduce fault detection, and Case 3 captures situations
in which minimization compromises fault detection.
The Wong experiments utilized the second approach; we do the same. For each program,
we measure reduced effectiveness in terms of the number and the percentage of faults for
which Tmin contains no fault-revealing test cases, but T does contain fault-revealing test
cases. More precisely, if F denotes the number of faults revealed by T over the faulty
versions of program P, and Fmin denotes the number of faults revealed by Tmin over those
versions, the number of faults lost is given by F - Fmin, and the percentage reduction in
fault-detection effectiveness of minimization is given by ((F - Fmin) / F) * 100.
Note that this method of measuring the cost of minimization calculates cost relative to a
fixed set of faults. This approach also assumes that missed faults have equal costs, an
assumption that typically does not hold in practice.
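The per-test-suite measure reduces to counting, for a fixed fault set, how many faults each suite reveals. A sketch (Python; the reveals predicate is an assumed oracle mapping a test case and a fault to a pass/fail verdict, not something the thesis defines):

```python
def effectiveness_reduction(reveals, T, T_min, faults):
    """Cost measures of Section 3.2.1.2: faults lost and percent reduction.
    reveals(t, f) -> True iff test case t reveals fault f."""
    F = sum(1 for f in faults if any(reveals(t, f) for t in T))
    F_min = sum(1 for f in faults if any(reveals(t, f) for t in T_min))
    lost = F - F_min
    return lost, (100.0 * lost / F if F else 0.0)
```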
3.2.2. Tool infrastructure.
To perform our experiments we required several tools. First, we required a test suite min-
imization tool; to obtain this, we implemented the algorithm of Harrold, Gupta and Soffa
[Harrold93] within the Aristotle program analysis system [Harrold97]. The Aristotle sys-
tem also provided us with code instrumenters for use in determining edge coverage.
3.3. Experiments with smaller C programs
Our first two experiments address our research questions on several small C programs,
similar in size to the C utilities utilized in the Wong98 study. In this section we first
describe details common to these two experiments, and then we report the results of the
experiments in turn.
3.3.1. Subject programs, faulty versions, test cases, and test suites.
We used seven C programs as subjects (see Table 3-1). The programs range in size from
138 to 516 lines of C code and perform a variety of functions. Each program has several
faulty versions, each containing a single fault. Each program also has a large test pool.
The programs, versions, and test pools were assembled by researchers at Siemens Corpo-
rate Research for a study of the fault-detection capabilities of control-flow and data-flow
coverage criteria [Hutchins94]. We refer to these programs collectively as the “Siemens”
programs.
Table 3-1. The Siemens Programs

Program     Lines of Code   No. of Versions   Test Pool Size   Description
totinfo     346             23                1052             information measure
schedule1   299             9                 2650             priority scheduler
schedule2   297             10                2710             priority scheduler
tcas        138             41                1608             altitude separation
printtok1   402             7                 4130             lexical analyzer
printtok2   483             10                4115             lexical analyzer
replace     516             32                5542             pattern replacement
The researchers at Siemens sought to study the fault-detecting effectiveness of coverage
criteria. Therefore, they created faulty versions of the seven base programs by manually
seeding those programs with faults, usually by modifying a single line of code in the pro-
gram. In a few cases they modified between two and five lines of code. Their goal was
to introduce faults that were as realistic as possible, based on their experience with real
programs. Ten people performed the fault seeding, working “mostly without knowledge of
each other’s work” [Hutchins94].
For each of the seven programs, the researchers at Siemens created a large test pool
containing possible test cases for the program. To populate these test pools, they first created
an initial set of black-box test cases "according to good testing practices, based on the
tester's understanding of the program's functionality and knowledge of special values and
boundary points that are easily observable in the code" [Hutchins94], using the category
partition method and the Siemens Test Specification Language tool [Balcer89, Ostrand88].
They then augmented this set with manually-created white-box test cases to ensure that
each executable statement, edge, and definition-use pair in the base program or its control
flow graph was exercised by at least 30 test cases. To obtain meaningful results with the
seeded versions of the programs, the researchers retained only faults that were "neither too
easy nor too hard to detect" [Hutchins94], which they defined as being detectable by at
least three and at most 350 test cases in the test pool associated with each program.1
1. When we execute these faulty versions, we find four faults that are not detected, and three that are detected by only one or two test cases. This difference may be attributable to some factor involving the system on which we are executing our tests; the difference does not impact the results of our study.
Figure 3-1. Percentage of Inputs that Expose Each Fault
[Boxplots, one per subject program (totinfo, schedule1, schedule2, tcas, printtok1, printtok2, replace); y-axis: percentage of tests that reveal faults, ranging from 0 to 20%.]
Figure 3-1 shows the sensitivity to detection of the faults in the Siemens versions relative
to the test pools; the boxplots2 illustrate that the sensitivities of the faults vary within and
between versions, but overall are all lower than 19.77%. Therefore, all of these faults were,
in the terminology of the Wong studies, "Quartile I" faults, detectable by fewer than 25%
of the test pool inputs.
To investigate our research questions we required coverage-adequate test suites that exhibit
redundancy in coverage, and we required these in a range of sizes. To create these test
2. A boxplot is a standard statistical device for representing data sets [Johnson92]. In these plots, each data set's distribution is represented by a box. The box's height spans the central 50% of the data and its upper and lower ends mark the upper and lower quartiles. The middle of the three horizontal lines within the box represents the median. The vertical lines attached to the box indicate the tails of the distribution.
suites we utilized the edge coverage criterion. The edge coverage criterion is similar to
the decision coverage criterion used in the Wong98 study, but is defined on control flow
graphs.3
We used the Siemens program test pools to obtain coverage-adequate test suites for each
subject program. Our test suites consist of a varying number of test cases selected randomly
from the associated test pool, together with any additional test cases required to achieve
100% coverage of coverable edges.4 We did not add any particular test case to any particular
test suite more than once. To ensure that these test suites would possess varying ranges of
coverage redundancy, we randomly varied the number of randomly selected test cases over
sizes ranging from 0 to .5 times the number of lines of code in the program. Altogether,
we generated 1000 test suites for each program.
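A sketch of this generation procedure follows (Python's random module standing in for the C rand-based selection described in footnote 4; the edge-coverage map is an assumed input, and the sketch assumes the pool holds at least 0.5 * LOC tests, which is true of the Siemens pools):

```python
import random

def build_suite(pool_coverage, coverable_edges, loc):
    """Build one coverage-adequate, coverage-redundant test suite.
    pool_coverage: dict mapping each pool test case to the set of edges it covers.
    coverable_edges: the dynamically exercisable edges of the program.
    loc: the program's lines of code (random portion is 0 to 0.5 * loc tests)."""
    pool = list(pool_coverage)
    suite = set(random.sample(pool, random.randint(0, loc // 2)))
    covered = set().union(*(pool_coverage[t] for t in suite)) if suite else set()
    # Add further pool tests until all coverable edges are exercised.
    for t in random.sample(pool, len(pool)):      # pool in random order
        if coverable_edges <= covered:
            break
        if t not in suite and pool_coverage[t] - covered:
            suite.add(t)
            covered |= pool_coverage[t]
    return suite
```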
Figure 3-2 provides views of the range of sizes of test suites created by the process just
described. The boxplots illustrate that for each subject program, our test suite generation
procedure yielded a collection of test suites of sizes that are relatively evenly distributed
across the range of sizes utilized for that program. The all-uses-coverage-adequate suites
3. A test suite T is edge-coverage adequate for program P iff, for each edge e in each control flow graph for some procedure in P, if e is dynamically exercisable, then there exists at least one test case t in T that exercises e. A test case t exercises an edge e = (n1, n2) in control flow graph G iff t causes execution of the statement associated with n1, followed immediately by the statement associated with n2.
4. To randomly select test cases from the test pools, we used the C pseudo-random-number generator "rand", seeded initially with the output of the C "time" system call, to obtain an integer which we treated as an index i into the test pool (modulo the size of that pool).
Figure 3-2. Size Distribution among Unminimized Test Suites for the Siemens Programs
[Boxplots of test suite size (0 to 270 test cases), one per subject program (totinfo, schedule1, schedule2, tcas, printtok1, printtok2, replace).]
are larger on average than the edge-coverage-adequate suites because in general, more tests
are required to achieve all-uses coverage than to achieve edge coverage.
Analysis of the fault-detection effectiveness of these test suites shows that, except for eight
of the edge-coverage-based test suites for schedule2, every test suite revealed at least
one fault in the set of faulty versions of the associated program. Thus, although each fault
individually is difficult to detect relative to the entire test pool for the program, almost all
of the test suites utilized in the study possessed at least some fault-detection effectiveness
relative to the set of faulty programs utilized.
3.3.2. Experiment design.
The experiments were run using a full-factorial design with 1000 size-reduction and 1000
effectiveness-reduction measures per cell.5 The independent variables manipulated were:
• The subject program (the seven programs, each with a variety of faulty versions).
• Test suite size (between 0 and .5 times lines-of-code test cases randomly selected from
the test pool, together with additional test cases as necessary to achieve code coverage).
For each subject program, we applied minimization techniques to each of the sample test
suites for that program. We then computed the size and effectiveness reductions for these
test suites.
3.3.3. Threats to validity.
In this section we discuss potential threats to the validity of our experiments with the
Siemens programs.
Threats to internal validity are influences that can affect the dependent variables without
the researcher’s knowledge, and that thus affect any supposition of a causal relationship
between the phenomena underlying the independent and dependent variables. In these
experiments, our greatest concerns for internal validity involve the fact that we do not
5. The single exception involved schedule2, for which only 992 measures were available with respect to edge-coverage-based test suites, due to exclusion of the eight test suites that did not expose any faults.
control for the structure of the subject programs or the locality of program changes.
Threats to external validity are conditions that limit our ability to generalize our results.
The primary threats to external validity for this study concern the representativeness of the
artifacts utilized. The Siemens programs, though nontrivial, are small, and larger programs
may be subject to different cost-benefit tradeoffs. Also, each faulty version of each Siemens
program contains exactly one seeded fault; in practice, programs have much more complex
error patterns.
Furthermore, the faults in the Siemens programs were deliberately chosen (by the Siemens
researchers) to be faults that were relatively difficult to detect. (However, the fact that the
faults in these programs were not chosen by us does eliminate one potential source of bias.)
Finally, the test suites we utilized represent only two types of test suite that could occur in
practice if a mix of non-coverage-based and coverage-based testing were utilized. These
threats can only be addressed by additional studies utilizing a wider range of artifacts.
Threats to construct validity arise when measurement instruments do not adequately cap-
ture the concepts they are supposed to measure. For example, in this experiment our mea-
sures of cost and effectiveness are very coarse: they treat all faults as equally severe, and
all test cases as equally expensive.
3.3.4. Minimization of edge-coverage-adequate test suites
Our first experiment addresses our research questions by applying minimization to the
Siemens programs and their edge-coverage-adequate test suites. In reporting results we
first consider test suite size reduction, and then we consider fault detection effectiveness
reduction.
3.3.4.1. Test suite size reduction
Figure 3-3 depicts the sizes of the minimized edge-coverage-adequate test suites for the
seven Siemens programs, plotted against original test suite size. The data for each program
P is depicted by a scatterplot containing a point for each of the test suites utilized for
P. As the figure shows, the average sizes of the minimized test suites range from approx-
imately 5 (for tcas) to 12 (for replace). For each program, the minimized test suites
demonstrate little variance in size: tcas exhibits the least variance (between 4 and 5
test cases), and printtok1 shows the greatest variance (between 5 and 14 test cases).
Considered across the range of original test suite sizes, minimized test suite size for each
program is also relatively stable.
Figure 3-4 depicts the percentage reduction in test suite size produced by minimization in
terms of the formula discussed in Section 3.2.1.1, for each of the subject programs. The
data for each program P is represented by a scatterplot containing a point for each of the
test suites utilized for P; each point shows the percentage size reduction achieved for a
test suite versus the size of that test suite prior to minimization. Visual inspection of the
plots indicates a sharp increase in test suite size reduction over the first quartile of test suite
sizes, tapering off as size increases beyond the first quartile. The data gives the impression
of fitting a hyperbolic curve.
To verify the correctness of this impression, we performed least-squares regression to fit
the data depicted in these plots with a hyperbolic curve.
Figure 3-3. Size of Minimized vs. Size of Original Test Suites
[Scatterplots, one per program (totinfo, schedule1, schedule2, tcas, printtok1, printtok2, replace), of minimized test suite size vs. original test suite size, each with a line marking the average.]
Figure 3-4. Percent Reduction in Test Suite Size vs. Original Test Suite Size
[Scatterplots, one per program (totinfo, schedule1, schedule2, tcas, printtok1, printtok2, replace), of percentage reduction in test suite size (0 to 100%) vs. original test suite size.]
Table 3-2 shows the best-fit curve for each of the subject programs, along with its square
of correlation, r2.6 The fits indicate a strong hyperbolic correlation between percentage
reduction in test suite size (the savings of minimization) and original test suite size.
Table 3-2. Correlation Between Size Reduction and Original Size

Program     Regression equation            r^2
totinfo     y = 100 * (1 - (5.20762/x))    0.99
schedule1   y = 100 * (1 - (5.45457/x))    0.96
schedule2   y = 100 * (1 - (5.12267/x))    0.94
tcas        y = 100 * (1 - (4.97019/x))    1.00
printtok1   y = 100 * (1 - (7.49780/x))    0.90
printtok2   y = 100 * (1 - (6.77076/x))    0.93
replace     y = 100 * (1 - (12.1008/x))    0.99
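Because the model y = 100 * (1 - c/x) is linear in its single parameter c, the least-squares optimum has a closed form. The sketch below shows one way such fits and their r2 values could be computed; the thesis does not record the exact fitting tool used, so this is illustrative:

```python
def fit_hyperbola(xs, ys):
    """Least-squares fit of y = 100 * (1 - c/x) to size-reduction data.
    Setting d/dc of sum((y - 100 + 100*c/x)^2) to zero gives
    c = sum((100 - y)/x) / (100 * sum(1/x^2)).  Returns (c, r_squared)."""
    c = (sum((100 - y) / x for x, y in zip(xs, ys))
         / (100 * sum(1 / (x * x) for x in xs)))
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - 100 * (1 - c / x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return c, 1 - ss_res / ss_tot
```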
Our experimental results indicate that test suite minimization can produce savings in test
suite size on coverage-adequate, coverage-redundant test suites. The results also indicate
that as test suite size increases, the savings produced by test suite minimization increase,
a consequence of the relatively stable size of the minimized suites.
3.3.4.2. Fault detection effectiveness reduction
Figure 3-5 depicts the cost (reduction in fault detection effectiveness) incurred by mini-
mization, in terms of the formula discussed in Section 3.2.1.2, for each of the seven subject
programs. The data for each program P is represented by a scatterplot containing a point
for each of the test suites utilized for P; each point shows the percentage reduction in fault
detection effectiveness observed for a test suite versus the size of that test suite prior to
minimization.
6. r2 is a dimensionless index that ranges from zero to 1.0, inclusive, and is "the fraction of variation in the values of y that is explained by the least-squares regression of y on x" [Moore99].
Figure 3-6 illustrates the magnitude of the fault detection effectiveness reduction observed
for the seven subject programs. Again, this figure contains a scatterplot for each program;
however, we find it most revealing to depict faults detected versus original test suite size,
simultaneously for both test suites minimized for edge coverage (black) and original
test suites (grey). The solid lines in the plots denote average numbers of faults detected over
the range of original test suite sizes; the gap between these lines indicates the magnitude
of the fault detection effectiveness reduction for test suites minimized for edge coverage.
The plots show that the fault detection effectiveness of test suites can be severely com-
promised by minimization. For example, on replace, the largest of the programs, mini-
mization reduces fault-detection effectiveness by over 50%, with average fault loss ranging
from 4 to 20 faults across the range of test suite sizes, on more than half of the test suites.
Also, although there are cases in which minimization does not reduce fault-detection ef-
fectiveness (e.g., on printtok1), there are also cases in which minimization reduces the
fault-detection effectiveness of test suites by 100% (e.g., on schedule2).
Visual inspection of the plots suggests that reduction in fault detection effectiveness
increases slightly as test suite size increases. Test suites in the smallest size ranges do
produce effectiveness losses of less than 50% more frequently than they produce losses in
excess of 50%, a situation not true of the larger test suites. Even the smallest test suites,
however, exhibit effectiveness reductions in most cases: for example, on replace, test
suites containing fewer than 50 test cases exhibit an average effectiveness reduction of
nearly 40% (fault detection reduction ranging from 4 to 8 faults), and few such test suites
avoid losing effectiveness.
Figure 3-5. Minimization: Percentage Effectiveness Reduction vs. Original Size
[Scatterplots, one per program (totinfo, schedule1, schedule2, tcas, printtok1, printtok2, replace), of percentage effectiveness reduction (0 to 100%) vs. original test suite size.]
Figure 3-6. Effectiveness in Original and after Minimization vs. Original Size
[Scatterplots, one per program (totinfo, schedule1, schedule2, tcas, printtok1, printtok2, replace), of faults detected vs. original test suite size; each panel shows points for the original suites and for the suites after minimization, with average lines for both.]
Table 3-3. Minimization: Correlation between Effectiveness Reduction and Original Size

Program     Linear regression     r^2    Logarithmic regression    r^2    Quadratic regression             r^2
totinfo     y = 0.13x + 27.79     0.16   y = 9.56 ln(x) - 1.71     0.22   y = -0.002x^2 + 0.44x + 17.74    0.21
schedule1   y = 0.15x + 38.92     0.12   y = 10.03 ln(x) + 9.25    0.15   y = -0.002x^2 + 0.47x + 29.80    0.15
schedule2   y = 0.28x + 34.86     0.16   y = 17.70 ln(x) - 17.12   0.20   y = -0.004x^2 + 0.89x + 17.07    0.21
tcas        y = 0.68x + 34.89     0.38   y = 22.18 ln(x) - 16.28   0.47   y = -0.020x^2 + 2.18x + 13.41    0.46
printtok1   y = 0.16x + 22.48     0.18   y = 14.68 ln(x) - 26.34   0.20   y = -0.001x^2 + 0.44x + 10.94    0.20
printtok2   y = 0.07x + 12.57     0.11   y = 6.82 ln(x) - 10.73    0.13   y = -0.001x^2 + 0.19x + 6.95     0.13
replace     y = 0.11x + 42.67     0.20   y = 13.07 ln(x) - 4.82    0.27   y = -0.001x^2 + 0.41x + 26.79    0.28
In contrast to the plots of size reduction, the plots of fault detection effectiveness
reduction do not give a strong impression of closely fitting any curve or line: the data
is much more scattered than the data for test suite size reduction. Our attempts to fit linear,
logarithmic, and quadratic regression curves to the data validate this impression: the data
in Table 3-3 reveals little linear, logarithmic, or quadratic correlation between reduction in
fault detection effectiveness and original test suite size.
These results indicate that test suite minimization can compromise the fault-detection ef-
fectiveness of coverage-adequate, coverage-redundant test suites. However, the results
only weakly suggest that as test suite size increases, the reduction in the fault-detection
effectiveness of those test suites will increase.
One additional feature of the scatterplots of Figure 3-5 warrants discussion: on several
of the graphs, there are markedly visible "horizontal lines" of points. In the graph for
printtok1, for example, there are particularly strong horizontal lines at 0%, 20%, 25%,
33%, 40%, 50%, 60%, and 67%. Such lines indicate a tendency for minimization to ex-
clude particular percentages of faults for the programs on which they occur.
This tendency is partially explained by our use of a discrete number of faults in each subject
program. Given a test suite that exposes k faults, minimization can exclude test cases that
detect between 0 and k of these faults, yielding discrete percentages of reductions in fault-
detection effectiveness. For printtok1, for example, there are seven faults, of which the
unminimized test suites may reveal between zero and seven. When minimization is applied
to the test suites for printtok1, only 19 distinct percentages of fault detection effectiveness
reduction can occur: 100%, 86%, 83%, 80%, 75%, 71%, 67%, 60%, 57%, 50%, 43%,
40%, 33%, 29%, 25%, 20%, 17%, 14%, and 0%. Each of these percentages except 29%
and 100% is evident in the scatterplot for printtok1. With all points falling on these
17 percentages, the appearance of lines in the graph is unsurprising.
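This count is easy to check mechanically. The following C sketch (illustrative only, not part of our experimental tooling) enumerates the reduction percentages k/n achievable with at most seven faults and confirms that, rounded to whole percentages, exactly 19 distinct values arise:

    #include <stdio.h>

    int main(void)
    {
        int seen[101] = { 0 };
        int count = 0;

        /* a suite revealing n faults (1..7) can lose k of them (0..n),
           a reduction of 100*k/n percent */
        for (int n = 1; n <= 7; n++)
            for (int k = 0; k <= n; k++)
                seen[(int)(100.0 * k / n + 0.5)] = 1;   /* round to whole percent */

        for (int p = 0; p <= 100; p++)
            if (seen[p]) {
                printf("%d%% ", p);
                count++;
            }
        printf("\n%d distinct percentages\n", count);
        return 0;
    }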
It follows that as the number of faults utilized for a program increases, the presence of hor-
izontal lines should decrease; this is easily verified by inspecting the graphs, considering in
turn printtok1 with 7 faults, schedule1 with 9, schedule2 with 10, printtok2
with 10, totinfo with 23, replace with 32, and tcas with 41.
This explanation, however, is only partial: if it were complete, we would expect points to
lie more equally among the various reduction percentages (with allowances for the fact that
there may be multiple ways to achieve particular reduction percentages, such as 100%, 67%,
50%, 33%, and 0% reductions). The fact that the occurrences of reduction percentages are
not thus distributed reflects, we believe, variance in fault locations across the programs,
coupled with variance in test coverage patterns of faulty statements.
3.3.5. Minimization of randomly generated test suites
Our second experiment addresses the question of how edge-coverage-based minimization
compares to random selection as a test suite reduction technique. To facilitate discussion,
we refer to test suites whose size was minimized while keeping coverage constant as
minimized test suites, and we refer to test suites whose size was reduced to a specific level
by random selection as randomly reduced test suites.
To randomly reduce test suites, we used Perl's built-in pseudo-random number generator,
which is automatically seeded with the system time, process ID, and various other system
variables.7
For each of the test suites, the original test pool was set to the test cases in the unminimized
test suite. The random number generator returned a nonnegative integer less than the size of
the test pool. This integer was treated as an index to the test cases in the test pool, and the
indexed test case was placed in the output test suite and removed from the test pool. The
process was repeated until the output test suite reached the size of the minimized test suite.
7. This behavior depends on the version of Perl; it is described in the perlfunc man page for Perl 5.004.
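This procedure amounts to sampling without replacement until the target size is reached. Our scripts were written in Perl; the following C sketch (a hypothetical illustration, not the actual implementation) captures the same logic by swapping randomly chosen tests to the front of the pool array:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Reduces pool[0..pool_size-1] in place; on return, the first
       target_size entries form the randomly reduced test suite. */
    void randomly_reduce(int *pool, int pool_size, int target_size)
    {
        for (int chosen = 0; chosen < target_size; chosen++) {
            /* random index into the not-yet-chosen remainder of the pool */
            int r = chosen + rand() % (pool_size - chosen);
            int tmp = pool[chosen];
            pool[chosen] = pool[r];
            pool[r] = tmp;
        }
    }

    int main(void)
    {
        int suite[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
        srand((unsigned) time(NULL));   /* seed, loosely analogous to Perl's default */
        randomly_reduce(suite, 10, 4);  /* keep 4 of the 10 tests */
        printf("t%d t%d t%d t%d\n", suite[0], suite[1], suite[2], suite[3]);
        return 0;
    }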
The experiment follows a paired-T test design [Johnson92]. In a paired-T test, two
populations are compared by comparing many subjects drawn from the two populations
in matched pairs, in such a way that the pairing controls extraneous variables.
In this case, by pairing our minimized edge-coverage-adequate test suites with randomly
reduced test suites, we were able to control for differences in the unminimized test suites
and differences in minimized test suite sizes. As a result, we were able to compare the
overall fault detection effectiveness of minimized test suites with the overall fault detection
effectiveness of the randomly reduced test suites.
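We do not reproduce the exact statistical computation here, but the standard paired-T summary is the mean of the per-pair differences together with a margin of error of t * s_d / sqrt(n); the following C sketch (our illustration, with hypothetical data) shows this computation:

    #include <math.h>
    #include <stdio.h>

    /* d[0..n-1] are per-pair differences, e.g., the effectiveness reduction
       of a randomly reduced suite minus that of its paired minimized suite;
       t_crit is the critical value for the chosen confidence level. */
    void paired_t_summary(const double *d, int n, double t_crit)
    {
        double sum = 0.0, sumsq = 0.0;
        for (int i = 0; i < n; i++) {
            sum += d[i];
            sumsq += d[i] * d[i];
        }
        double mean = sum / n;
        /* sample standard deviation of the differences */
        double sd = sqrt((sumsq - n * mean * mean) / (n - 1));
        printf("mean difference = %.1f +/- %.1f\n",
               mean, t_crit * sd / sqrt((double) n));
    }

    int main(void)
    {
        double d[] = { 20.0, 15.0, 25.0, 10.0, 30.0 };   /* hypothetical data */
        paired_t_summary(d, 5, 8.610);   /* two-sided 99.9% t for 4 d.f. */
        return 0;
    }

With 1000 pairs per cell, the corresponding two-sided 99.9% critical value is approximately 3.3.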
3.3.5.1. Test suite size reduction
By design, we produced randomly reduced test suites of the same size as those produced by
minimization in our first experiment. Thus, the test suites produce the same size reductions
as those depicted in Figure 3-3 and Figure 3-4.
3.3.5.2. Fault detection effectiveness reduction
Figure 3-7 depicts the cost (reduction in fault detection) incurred by randomly selecting
a subset of the original test suite. These scatterplots look similar to those of Figure 3-5,
which depict the reduction in fault detection incurred by minimization. The only noticeable
difference is that the scatterplot for the randomly reduced test suites is somewhat denser
at high effectiveness-reduction percentages.
Figure 3-8 illustrates the magnitude of the fault detection effectiveness reduction observed
for the seven subject programs for random test suite reduction, compared with the reduction
for edge-coverage-based minimization. Again, this figure contains a scatterplot for each
program, and we depict faults detected versus original test suite size, simultaneously for
both test suites minimized for edge-coverage (black) and randomly reduced test suites
(grey). The solid lines in the plots denote average numbers of faults detected over the range
of original test suite sizes; the gap between these lines indicates the difference between the
two reduction techniques. The plots indicate a noticeable difference between the two
techniques.
As with the minimized test suites, we attempted to fit the data points for fault detection
reduction of randomly reduced test suites (Figure 3-7) to some simple functions. The
results of this attempt (shown in Table 3-4) were similar to those for minimization (Table
3-3), as both show little linear, logarithmic, or quadratic correlation between reduction in
fault detection effectiveness and the size of the original test suite. The randomly reduced
test suites, however, have even lower correlation coefficients, reflecting the more variable
nature of random reduction.
Table 3-4. Random Reduction: Correlation between Effectiveness Loss and Original Size

program     linear fit            r^2    logarithmic fit           r^2    quadratic fit                       r^2
totinfo     y = 0.14x + 45.49     0.10   y = 11.35 Ln(x) + 9.67    0.16   y = -0.002x^2 + 0.55x + 32.53       0.15
schedule1   y = 0.15x + 62.37     0.09   y = 10.63 Ln(x) + 29.92   0.14   y = -0.003x^2 + 0.58x + 50.16       0.13
schedule2   y = 0.19x + 66.50     0.09   y = 13.49 Ln(x) + 25.43   0.14   y = -0.004x^2 + 0.78x + 49.73       0.14
tcas        y = 0.61x + 48.30     0.24   y = 20.61 Ln(x) - 0.24    0.32   y = -0.020x^2 + 2.10x + 27.01       0.30
printtok1   y = 0.15x + 60.11     0.13   y = 14.62 Ln(x) + 10.81   0.16   y = -0.001x^2 + 0.44x + 48.32       0.15
printtok2   y = 0.06x + 55.26     0.02   y = 6.52 Ln(x) + 32.39    0.03   y = -0.001x^2 + 0.20x + 48.81       0.03
replace     y = 0.10x + 55.44     0.15   y = 12.53 Ln(x) + 9.96    0.20   y = -0.001x^2 + 0.36x + 42.27       0.19
Figure 3-7. Random Reduction: Percentage Effectiveness Reduction vs. Original Suite Size
[Seven scatterplot panels (totinfo, schedule1, schedule2, tcas, printtok1, printtok2, replace): percent reduction in fault-detection effectiveness (0-100) versus original test suite size for randomly reduced suites.]
Figure 3-8. Minimization and Random Reduction: Fault Detection vs Original Size
[Seven scatterplot panels (totinfo, schedule 1, schedule 2, tcas, print tokens 1, print tokens 2, replace): faults detected versus original test suite size after minimization and after random reduction, with solid lines showing the average faults detected by each technique.]
Figure 3-9. Random Reduction: Percent Effectiveness Reduction
[Side-by-side boxplots of percent effectiveness reduction (0-100) for minimized and randomly reduced test suites, for each of totinfo, schedule1, schedule2, tcas, printtok1, printtok2, and replace.]
Figure 3-9 shows boxplots representing (comparatively) the span of the fault detection
reductions for the various Siemens subjects. Boxplots for the minimized and randomly
reduced test suites are shown side by side. The boxplots show a consistent pattern of
greater loss in fault detection in the randomly reduced test suites than in their associated
minimized test suites.
Table 3-5. Comparison of Fault Detection Reduction

program     Average Reduction in Fault    Average Reduction in Fault      Average Difference in Fault
            Detection for Minimized       Detection for Randomly          Detection Reduction
            Test Suites                   Reduced Test Suites
totinfo     39.2                          58.2                            19.0 ± 2.5
schedule1   51.1                          74.2                            23.2 ± 2.7
schedule2   56.7                          81.7                            25.0 ± 3.0
tcas        60.9                          71.4                            10.6 ± 2.4
printtok1   40.8                          77.7                            36.9 ± 2.7
printtok2   21.3                          63.1                            41.7 ± 2.8
replace     57.2                          69.4                            12.2 ± 2.3
Table 3-5 shows statistical data for randomly reduced and minimized test suites. The data
confirms conclusions drawn from Figure 3-9: the minimized test suites tended to find more
faults than their randomly reduced counterparts. The fourth column shows the difference
in lost fault detection between minimized and random reduction; the margins of error are
shown for the 99.9% confidence level. The average advantage ranged from 10.6% for
tcas to 41.7% for printtok2. The differences are significant, as all of their confidence
intervals lie entirely above zero.
Table 3-6. Comparison of Fault Detection Reduction Variance

program     Sample Standard Deviation of     Sample Standard Deviation of
            Fault Detection Reduction,       Fault Detection Reduction,
            Minimized Test Suites            Randomly Reduced Test Suites
totinfo     15.7                             22.1
schedule1   18.5                             20.7
schedule2   23.2                             24.0
tcas        20.3                             22.9
printtok1   20.5                             22.7
printtok2   13.2                             25.3
replace     16.7                             18.6
Table 3-6 further quantifies results not directly apparent in Figure 3-9: for all programs,
the minimized test suites detected faults more consistently than their randomly reduced
counterparts. The smallest difference was found for schedule2, where the reduction
in fault detection for the minimized test suites had a standard deviation of 23.2, and the
reduction in fault detection for the randomly reduced test suites was only slightly less
consistent, with a standard deviation of 24.0. The largest difference was for printtok2,
where the reduction in fault detection for minimized test suites had a standard deviation of
only 13.2, while the reduction in fault detection for the randomly reduced test suites had a
standard deviation of 25.3.
3.4. Experiment with the Space Program
Our next experiment addresses our research questions by applying minimization to the
Space program utilized in the Wong97 study.
3.4.1. Subject program, faulty versions, test cases, and test suites.
Space (see Table 3-7), consisting of 9564 lines of C code (6218 executable), functions
as an interpreter for an array definition language (ADL). The program reads a file that
contains several ADL statements, and checks the contents of the file for adherence to the
ADL grammar, and to specific consistency rules. If the ADL file is correct, Space outputs
an array data file containing a list of array elements, positions, and excitations; otherwise
the program outputs error messages.
Table 3-7. The Space Application

Lines of Code (executable)   6218
No. of Versions              35
Test Pool Size               13585
Description                  language interpreter
Space has 33 associated versions, each containing a single fault that had been discovered
during the program’s development. (The Wong97 study utilized only eighteen of these
faulty versions.) Through working with this program, we discovered five additional faults,
and created versions containing just those faults. We also discovered that three of the
"faulty versions" were actually semantically equivalent to the base version. We excluded
these from our study; therefore, we ultimately utilized 35 faulty versions.

Figure 3-10. Percentage of Test Cases that Expose each of Space's Faults
[Bar graph: for each of the 35 faulty versions, the percentage of test cases (0-100) that detect the fault.]
The test pool for Space was constructed in two stages. An initial pool of 10,000 tests was
obtained from Frankl and Vokolos, who had constructed the pool for another study by ran-
domly generating test cases [Vokolos98]. Beginning with this initial pool, we instrumented
the program for edge coverage, measured coverage, and then added additional test cases to
the pool until it contained, for each dynamically executable edge in the control flow graph
for the program, at least 30 test cases that exercised that edge. This process yielded a test
pool of 13,585 test cases.
Figure 3-10 shows the sensitivity to detection of the faults in Space relative to the test
pool; the bar graph illustrates that the sensitivities of the faults vary, but overall fall
between .13% and 99.77%. In all, 74% (26/35) of these faults were, in the terminology of
the Wong studies, "Quartile I" faults, detectable by fewer than 25% of the test pool inputs.
As with the Siemens programs, we used the Space program's test pool to obtain
coverage-adequate test suites for the program; however, due to limitations in our dataflow
analyzer, we were able to create only edge-coverage-adequate suites.
As with the Siemens programs, we utilized test suites consisting of a random number of
randomly selected tests together with additional tests necessary to achieve coverage. In
addition, we also utilized a set of smaller test suites generated for coverage by beginning
with an empty test suite, and then greedily selecting test cases and adding them to the test
suite only if they added coverage, until full coverage was achieved. We call these two
varieties of test suites "extended" and "unextended", respectively. The extended test suites
ranged in size from 159 to 4712 test cases, and the unextended test suites ranged in size
from 141 to 169 tests. Because the unextended test suites were greedily generated, they
do contain coverage-redundant test cases, though far fewer than most of the extended test
suites.
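The following C sketch (with an illustrative coverage matrix, not data from the experiments) shows this greedy construction; note how an early test can later become coverage-redundant, which is why the unextended suites still admit minimization:

    #include <stdio.h>

    #define NUM_TESTS 6
    #define NUM_EDGES 4

    /* covers[t][e] != 0 iff test t exercises edge e (illustrative data) */
    static const int covers[NUM_TESTS][NUM_EDGES] = {
        { 1, 0, 0, 0 },   /* t1 */
        { 1, 1, 0, 0 },   /* t2: makes t1 coverage-redundant after the fact */
        { 1, 0, 0, 0 },   /* t3: adds no new coverage, so it is skipped */
        { 0, 0, 1, 0 },   /* t4 */
        { 0, 0, 1, 0 },   /* t5: adds no new coverage, skipped */
        { 0, 1, 1, 1 },   /* t6 */
    };

    int main(void)
    {
        int covered[NUM_EDGES] = { 0 };
        for (int t = 0; t < NUM_TESTS; t++) {
            int adds = 0;
            for (int e = 0; e < NUM_EDGES; e++)
                if (covers[t][e] && !covered[e])
                    adds = 1;
            if (adds) {   /* keep only tests that add coverage */
                for (int e = 0; e < NUM_EDGES; e++)
                    if (covers[t][e])
                        covered[e] = 1;
                printf("keep t%d\n", t + 1);
            }
        }
        return 0;   /* keeps t1, t2, t4, and t6 */
    }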
3.4.2. Experiment design.
The experiments were run using a full-factorial design with 1000 size-reduction and 1000
effectiveness-reduction measures per cell. The independent variables manipulated were:
• The test suite type (extended or unextended).
• Test suite size (for the extended suites, between 0 and .5 times lines-of-code test cases
randomly selected from the test pool, together with additional test cases as necessary to
achieve code coverage; for the unextended test suites, the range of sizes generated that
achieved coverage.)
For each test suite type, we applied minimization techniques to each of the 1000 sample
test suites of that type. We then computed the size and effectiveness reductions for these
test suites.
Also, as with the Siemens programs, we conducted an additional experimental run utilizing
randomly selected tests, using a paired-T test design. We report both experimental runs
together.
3.4.3. Threats to validity.
This experiment shares, with our experiments on the Siemens programs, the threats to
validity described in Section 3.3.3. In addition, the program’s naturally occurring faults
had been separated into single faulty versions, abstracting out effects that could occur from
interacting faults; however, this abstraction was necessary in order to be able to attribute
failures properly to faults. On the other hand, Space is a real program, with real faults
uncovered in practice, and its size is an order of magnitude greater than that of the Siemens
programs; these factors augment the external validity of the study.
3.4.4. Data and Analysis
We report results for all types of test suites and for random reduction together in this
section.
3.4.4.1. Test suite size reduction
Figure 3-11 depicts the sizes of the minimized edge-coverage-adequate test suites (both
extended and unextended) for Space. These show that the size of the minimized test
suites is relatively stable and does not show a large variance even between the extended
and unextended test suites.
Figure 3-12 shows the percentage reduction in minimized test suite size versus the size
of the original test suite for Space. The scatterplot for minimization from the extended
test suites (at left in the figure) is hyperbolic, and nearly identical to those seen in the
experiments on the Siemens programs, reflecting reductions to a nearly constant size
across a wide range of original test suite sizes. Although not apparent given the scale of
the plot, the points in the plot for the unextended test suites fit into the bottom of the
curve for the extended test suites, amidst the points for the least extended test suites.
Table 3-8. Correlation between Size Reduction and Initial Size

program   regression equation                            r^2
Space     y = -13.919 Ln(x) - 14.832                     0.7949
Space     y = 0.0065327x + 74.436                        0.4702
Space     y = -4.0311e-06x^2 + 0.025842x + 58.449        0.7256
Space     y = 100 - 100*121.012/x                        0.9994
Figure 3-11. Size of Minimized Test Suites vs Size of Original Test Suites
[Two scatterplots for Space: minimized test suite size (0-150) versus original test suite size, one for unextended suites (original sizes up to 180) and one for extended suites (original sizes up to 5000), each with the average shown.]
Figure 3-12. Percent Reduction in Test Suite vs Original Test Suite Size
[Two scatterplots for Space: percent reduction in test suite size (0-100) versus original test suite size, one for suites minimized from extended suites (original sizes up to 5000) and one for suites minimized from unextended suites (original sizes 135-175).]
Table 3-8 shows attempts to fit curves to the points in the plot for extended test suites.
Again, the best fit is the hyperbolic curve.
3.4.4.2. Fault detection effectiveness reduction
Figure 3-13 shows scatterplots for the reduction in fault detection for the minimization
and random reduction ofSpace ’s unextended and extended test suites. For all four of
the treatments, the highest loss in fault detection effectiveness is less than 40%. Table 3-8
shows attempts to fit a curve to the effectiveness reduction resulting from minimization of
the extended test suites. Unlike for size reduction, we found no curves that fit the data well.
Figure 3-14 illustrates the magnitude of the fault detection effectiveness reduction observed
for Space. Similar to the figures presented for the Siemens programs, the figure contains
a scatterplot for the extended test suites (left) and one for the unextended test suites (right).
Each scatterplot depicts faults detected versus original test suite size, simultaneously for
original test suites (black), test suites minimized for all-edges coverage (dark grey), and
randomly reduced test suites (light grey). The solid lines in the plots denote average num-
bers of faults detected over the range of original test suite sizes; the gaps between these
lines indicate differences between the test suites. The plots reveal noticeable differences
between the techniques as test suite size grows; however, on the smaller unextended test
suites, the differences are much smaller.
Table 3-9. Correlation between Initial Size and Effectiveness Reduction

program   regression equation                          r^2
Space     y = 0.0011x + 6.23                           0.1047
Space     y = 2.1771 Ln(x) - 7.4968                    0.1478
Space     y = -6.36e-07x^2 + 0.004169x + 3.708         0.1531
Figure 3-13. Percent Reduction in Effectiveness vs. Original Size
[Four scatterplot panels for Space: percent reduction in fault-detection effectiveness (0-100) versus original test suite size, for suites minimized from extended suites, minimized from unextended suites, randomly reduced from extended suites, and randomly reduced from unextended suites.]
Figure 3-14. Original and Minimized: Faults Detected vs. Original Size
[Two scatterplot panels for Space (unextended suites and extended suites): faults detected (0-50) versus original test suite size for the original suites, after minimization, and after random reduction, with lines showing the average for each.]
Table 3-10 shows average reductions in fault detection resulting from applying different
treatments to Space. Not surprisingly, the larger extended test suites exhibited a greater
average reduction in fault detection due to minimization, 8.9% versus 2.3%, and due to
random reduction, 18.1% versus 3.4%. In general, however, the losses in fault detection
effectiveness are much smaller here than those observed with the Siemens programs. As in
the experiments with the Siemens programs, fault detection reduction due to random
reduction was greater than that due to minimization (9.2 ± 0.7 for the extended test suites),
and the margins of error show that difference to be significant at α = 0.001 (99.9% confidence).
Table 3-10. Average Reductions in Fault Detection Effectiveness

Unminimized    Average Reduction in Fault    Average Reduction in Fault    Average Difference in Fault
Test Suites    Detection for Minimized       Detection for Randomly        Detection Reduction
               Test Suites                   Reduced Test Suites
extended       8.9                           18.1                          9.2 ± 0.7
unextended     2.3                           3.4                           1.2 ± 0.5
3.5. Comparison to Previous Empirical Results
Both the Wong studies and our studies indicate that test suite minimization can produce
savings (in test suite size reduction), and that these savings increase with test suite size.
Both sets of studies also support, to some degree, a claim that reduction in fault-detection
effectiveness increases as test suite size increases.
The two sets of studies differ substantially, however, in their results pertaining to fault-
detection effectiveness reduction. The authors of the Wong studies conclude that, for the
programs and test cases they considered: (1) test suites that do not add coverage are not
likely to detect additional faults, and (2) fault detection effectiveness reduction is insignif-
icant even for test suites that have high block coverage.
Our results on Space are similar in this respect to those of the Wong97 study. That study
shows an average fault detection effectiveness reduction of less than 10% for all test suite
sizes. While Figure 3-13 shows somewhat larger losses in fault detection in some cases,
the average reduction in fault detection capability of 8.9% is not far from that discovered
in Wong97.
However, this conclusion contrasts markedly with our results on the Siemens programs,
where fault-detection effectiveness was severely compromised by minimization. We would
like to know the causes of this difference, and we here discuss several potential causes.
First, the Siemens programs differ from the programs utilized in the Wong98 study; all but
one of the Siemens programs are larger than all but one of the programs used in that study.
Second, the Wong98 study used ATAC for minimization, whereas our study utilized the
algorithm of Reference [Harrold93]. Reference [Wong95] reports that ATAC achieved
minimal test selection on the cases studied; we have not yet determined whether our al-
gorithm was equally successful. However, if our algorithm is less successful than the
algorithm used in the Wong98 study, we would expect this to cause us to underestimate
possible reductions in fault detection effectiveness. A better algorithm, if possible, could
only exacerbate the already large difference in results.
Third, our experiments with the Siemens programs and the Wong98 study both utilized
seeded faults that may accurately be described as “mutation-like”. However, all of the
faults utilized in our study were Quartile I faults, whereas only 41% of the faults used
in the Wong98 study were Quartile I faults. Easily detected faults are less likely to go
unrecognized in minimized test suites than faults that are more difficult to detect; thus, we
would expect our results overall to show greater reductions in fault-detection effectiveness
than the Wong98 study. However, the authors of the Wong98 study did separately report
results for Quartile I faults, and in their study, minimized test suites missed few of these
faults. Our fourth study further considers this factor by utilizing the same faulty versions
utilized in the Wong97 study.
Fourth, a factor more likely to be responsible for differences in results of the studies, in
our opinion, involves the types of test suites utilized. The Wong98 study used test suites
that were not coverage-adequate, and used coverage-based suites that were relatively small
compared to our test suites. Overall, 928 of the 1198 test suites utilized in the Wong98
study belonged to groups of test cases whose average sizes did not exceed 4.5 test cases.
Small test suite size reduces opportunities both for minimization and for reduction in
fault-detection effectiveness. Further differences in test suites stem from the fact that the
test pools used in the Wong98 study as sources for test suites did not necessarily contain
any minimum number of test cases per covered item. These differences may contribute
to reduced redundancy in test coverage within test suites, and reduce the likelihood that
minimization will exclude fault-revealing test cases.
Finally, another plausible factor involves the specific test cases included in the test case
pools. To illustrate, we reproduce, in Table 3-11, data on the Wong98 study presented
in [Wong98]. The table lists the ten programs used in the Wong98 study (column 1), the
total number of faulty versions of those programs utilized (column 2), and the number of
tests in the test pools created for those programs (column 3). The next two columns report
data obtained when the researchers used ATAC to minimize the entire test pools for these
programs: column 4 indicates the size of the minimized test pool and column 5 indicates
the total number of faults missed by the minimized suites. From this data we derive column
6, the number of faults detected by the minimized test pools.
Table 3-11. Fault detection abilities of tests used in the Wong98 study

program   number of   test pool   minimized test   number of       number of
          faults      size        suite size       faults missed   faults detected
cal       20          162         6                1               19
checkeq   20          166         3                1               19
col       30          156         3                1               29
comm      15          754         11               3               12
crypt     15          156         2                0               15
look      15          193         6                2               13
sort      23          997         11               1               22
spline    13          700         5                1               12
tr        12          870         2                4               8
uniq      18          431         5                0               18
Consider columns 2 and 6 for program crypt. Two tests in the test pool for crypt were
able to detect all 15 faults in that program. Similarly, 3 tests in the test pool for col were
able to detect 29 of the 30 faults in that program. These are powerful tests in relation to
the faults; similarly powerful tests appear to exist for the other programs. When such tests
are included in test suites, minimized versions of those suites may well exhibit little loss in
fault detection effectiveness.
This data, and the presence of powerful tests in the test pools, suggests why minimized
test pools retain fault-detection effectiveness in the Wong98 studies. We cannot know,
however, the extent to which such powerful tests in the Wong98 test pools are distributed
among the Wong98 test suites, nor can we know that, distributed among those suites, they
would necessarily have been selected by ATAC. Nevertheless, the data supports a conjecture:
characteristics of the tests in the test pool, and their relation in terms of coverage and fault-
exposing potential to the subject programs, can affect the performance of minimization
techniques.
Chapter 4. A New Minimization Technique
Because minimization sometimes results in test suites that are significantly less effective
than the test suites from which they were minimized, we developed a new coverage cri-
terion that might manifest better behavior when used for minimization. The first section
of this chapter introduces mutation analysis and explains how we can use it as a coverage
criterion. Then, the second section shows how this coverage criterion can be used for min-
imization, and it presents an algorithm to enable it. The final section presents our findings
as to the performance of the algorithm.
4.1. Mutation Analysis and Minimization
4.1.1. Mutation Analysis and Sensitivity
One method of assessing test suite effectiveness is known as mutation analysis [Untch93].
Mutation analysis creates many mutant versions of a program: each statement in the pro-
gram is altered based on a set of rules for creating the mutants. The set of rules used to
create the mutants is called the mutagenic operators. Test cases can be run against the mu-
tant versions of the program to see if their outputs differ from that of the original version
of the program. If the output differs, the mutant is said to be killed by that test case. A test
suite that contains test cases that kill every mutant is considered mutation adequate. The
percentage of mutants killed by at least one test case in a test suite is called that test suite's
Mutation Adequacy Score. The extent to which this is an accurate assessment of test suite
effectiveness depends on the hypothesis that a test suite that detects the minor variations
represented by the mutants will also detect real faults.
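As a concrete illustration (our own, not drawn from the experimental materials), consider a single relational-operator mutant of a small C function; a test case kills the mutant exactly when the two versions produce different output:

    #include <stdio.h>

    /* original predicate:            i < 0
       mutant (relational operator):  i <= 0 */
    static const char *classify_orig(int i) { return (i < 0)  ? "negative" : "non-negative"; }
    static const char *classify_mut(int i)  { return (i <= 0) ? "negative" : "non-negative"; }

    int main(void)
    {
        /* only a test case with i == 0 distinguishes the two versions,
           i.e., only such a test kills this mutant */
        for (int i = -1; i <= 1; i++)
            printf("i=%2d: original=%s, mutant=%s\n",
                   i, classify_orig(i), classify_mut(i));
        return 0;
    }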
In [Voas92], Jeffrey Voas proposes a related technique which he calls propagation, infec-
tion, and execution (PIE) analysis. PIE analysis is a dynamic technique for estimating
three characteristics:

1. the probability that a particular section of a program is executed [execution],
2. the probability that the particular section affects the data state [infection], and
3. the probability that a data state produced by that section has an effect on program output [propagation] [Voas92]
Combined, these provide an estimate of a statement's sensitivity, the likelihood that a single
input is able to reveal a hypothetical fault in that statement.
The more sensitive a statement is, the more likely a test case is to reveal any faults in
that statement. Thus a fault in a sensitive statement may be revealed by a single test case
executing the statement, but if the statement is insensitive (has a low sensitivity), it will
often take many test cases executing that statement to reveal a fault.
Unfortunately, the sensitivity estimate is not unqualified. The "execution" component is
specific to a particular input distribution. The accuracy of the "infection" and "propaga-
tion" estimates depends on the extent to which the statistical behavior of faults simulated
through mutation correlates with that of faults found "in the wild".1

1. The mutants used in [Voas92] are slightly different from those traditionally used in mutation analysis, in that in [Voas92] propagation and infection are treated separately. For infection analysis, the code is mutated as in conventional mutation analysis, but instead of counting kills based on output, [Voas92] compares the data states of the original and mutant version immediately after the altered statement is executed. For propagation analysis, the data states, rather than the executable instructions, are perturbed.
The estimate is accurate to the extent that the test suite matches the input distribution and
the faults in the program behave like those simulated through mutation.
4.1.2. Adapting Sensitivity for use as a Coverage Criterion
Probabilistic Statement Sensitivity Coverage (PSSC) is qualitatively different from control-
flow or data-flow criteria. Instead of looking at program components and making sure
that they are all exercised, PSSC requires that each statement be executed enough times
that there is at least a minimum likelihood that, if a fault exists in the statement, it
will be revealed by one of the test cases. The number of tests required to execute a given
statement is calculated from an estimate of that statement's sensitivity.
Given the "propagation" and "infection" components of sensitivity, it is possible to deter-
mine the number of times a statement should be executed to achieve a certain confidence
level that if there were a fault, it would be revealed by testing, using the equation: T =
ln(1-c)/ln(1-Ol) where T is the number of executions, c is the confidence level, and O
lis
an estimate of the likelihood of each test revealing a fault based on the "propagation" and
"infection" components of the sensitivity of location l[Voas92].
For a test suite to be PSSC adequate at a particular confidence level, each statement must
be executed the requisite number of times. As a coverage criterion, PSSC is similar to
statement coverage, where each statement is a requirement that must be covered, but it has
the added element that a statement may need to be exercised by more than one test case
before it is sufficiently exercised. This problem is not equivalent to the traditional minimization
problem, so a multi-hit minimization algorithm must be used.
4.2. An Algorithm to Facilitate Minimization based on PSSC
PSSC, as defined in the previous section, can be used as the basis for test suite minimiza-
tion. Because of the unique nature of PSSC minimization, we require a new algorithm.
This section describes one conventional minimization heuristic, gives a new, more general
algorithm based on it, shows how this can be used for PSSC minimization, and assesses
the complexity of this algorithm.
4.2.1. A Conventional Test Suite Minimization Heuristic
As previously stated, the goal of test suite minimization is to find the smallest set of test
cases that covers all testing requirements with at least one test case. Unfortunately, finding
the minimum set is NP-complete [Garey79]. Harrold et al. present an algorithm, shown in
Figure 4-1, that finds a reduced test suite that satisfies all of the testing requirements, but is
not necessarily of minimal cardinality [Harrold93]. The algorithm uses a heuristic that
selects tests, one at a time, until all of the requirements are exercised, by choosing the test
case that hits the most requirements that are the hardest to satisfy. (In the case of a tie, it
uses the test case that hits the most requirements that are next hardest to satisfy. If neces-
sary, this is repeated until all of the requirements have been examined, and if there is still a
tie, one of the tests is chosen at random.)
Figure 4-1. The Harrold, Gupta, and Soffa Test Suite Minimization Algorithm

algorithm ReduceTestSuite
input   T1, T2, ..., Tn: associated testing sets for r1, r2, ..., rn respectively,
        containing test cases from t1, t2, ..., tnt
output  RS: a representative set of T1, T2, ..., Tn
declare MAX_CARD, CUR_CARD: 1..nt
        LIST: list of ti's
        NEXT_TEST: one of t1, t2, ..., tnt
        MARKED: array[1..n] of boolean, initially false
        MAY_REDUCE: boolean
        Max(): returns the maximum of a set of numbers
        Card(): returns the cardinality of a set
begin
    /* Step 1: initialization */
    MAX_CARD := Max_i(Card(Ti))               /* get the maximum cardinality of the Ti's */
    RS := union of all Ti with Card(Ti) = 1   /* take union of all single-element Ti's */
    foreach Ti such that Ti ∩ RS ≠ ∅ do MARKED[i] := true   /* mark all Ti containing elements in RS */
    CUR_CARD := 1                             /* consider single-element sets first */
    /* Step 2: compute RS according to the heuristic for sets of higher cardinality */
    loop
        CUR_CARD := CUR_CARD + 1              /* consider all sets with next higher cardinality */
        while there are Ti such that Card(Ti) = CUR_CARD and not MARKED[i] do
            /* process all unmarked sets of current cardinality */
            LIST := all tj ∈ Ti where Card(Ti) = CUR_CARD and not MARKED[i]
            NEXT_TEST := SelectTest(CUR_CARD, LIST)   /* get another tj to include in RS */
            RS := RS ∪ {NEXT_TEST}            /* add the test to RS */
            MAY_REDUCE := false
            foreach Ti where NEXT_TEST ∈ Ti do
                MARKED[i] := true             /* mark Ti containing NEXT_TEST */
                if Card(Ti) = MAX_CARD then MAY_REDUCE := true
            endfor
            if MAY_REDUCE then
                MAX_CARD := Max(Card(Ti)), for all i where MARKED[i] = false
        end while
    until CUR_CARD = MAX_CARD
end ReduceTestSuite

- - - - - - -

function SelectTest(SIZE, LIST)
    /* this function selects the next ti to be included in RS */
    declare COUNT: array[1..nt]
begin
    foreach ti in LIST do
        compute COUNT[ti], the number of unmarked Tj's of cardinality SIZE containing ti
    construct TESTLIST consisting of tests from LIST for which COUNT[ti] is the maximum
    if Card(TESTLIST) = 1 then return (the test case in TESTLIST)
    elseif SIZE = MAX_CARD then return (any test case in TESTLIST)
    else return (SelectTest(SIZE+1, TESTLIST))
end SelectTest
The requirements that are the hardest to satisfy are those whose associated set, the set of
test cases that can hit the requirement, has the smallest cardinality. This algorithm has been
described as "greedy on bottlenecks" [Wong98], because it greedily chooses test cases that
make the most progress towards satisfying the bottlenecks, the requirements that are
hardest to satisfy.
4.2.2. A Multi-Hit Minimization Algorithm
The Harrold, Gupta, and Soffa algorithm is general in that it can be applied to any set
covering problem. This includes minimization based on control-flow, data-flow, functional
coverage, test cases that have found faults in the past, and even combinations of these.
All that is necessary is that the problem be expressed as a set of requirements, each with
a set of test cases (or other objects) that can satisfy those requirements. One thing the
algorithm cannot handle, however, is requirements that need to be hit by more than one
test case before they are satisfied. For example, if we wanted to say that all statements in
the program needed to be executed at least twice, the Harrold, Gupta, and Soffa algorithm
could not meet this requirement.
The structure of this algorithm can be adapted to the more general case where each re-
quirement must be hit an arbitrary number of times rather than just once. What is needed
is a more general measure of the difficulty of satisfying a requirement. Instead of using
the cardinality of the requirement’s associated set, the new algorithm defines a dynamic
measure called the requirement's "hitting-factor": the number of test cases that can still
hit the requirement divided by the number of times the requirement still needs
to be hit.
The hitting-factor of a requirement depends on the cardinality of its associated set (c), the
number of times the requirement needs to be satisfied (n), and the number of test cases that
have already hit the requirement (h). The hitting-factor can be expressed as (c-h)/(n-h),
rounded up to the nearest integer. If h >= n, the requirement is satisfied and the hitting
factor is not needed. If c < n, then the number of hits required is more than the number
of test cases that can hit the requirement, and the requirement cannot be satisfied; the
algorithm presented assumes this is not the case.2 Under normal conditions, the hitting-
factor of an unsatisfied requirement is a number from one to the associated set's cardinality.
If the hitting-factor is one, it means that all of the test cases in the associated set are needed
to satisfy the requirement. Higher hitting-factors indicate that any particular test case in
the associated set is less likely to be needed to satisfy that requirement. As requirements
accumulate hits, their hitting-factors increase, but always remain less than the associated
set’s cardinality as long as the requirement is not completely satisfied. When the number
of required hits is one, the hitting-factor of an unsatisfied requirement is its associated set’s
cardinality. Thus in the degenerate case, where all of the requirements only need one hit,
this new algorithm is equivalent to that presented in [Harrold93].
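A minimal C sketch of the hitting-factor computation (our illustration; the sample values match the worked example in Section 4.2.3):

    #include <stdio.h>

    /* c = cardinality of the requirement's associated set,
       n = number of hits required, h = hits accumulated so far.
       Assumes h < n (unsatisfied) and c >= n (satisfiable). */
    int hitting_factor(int c, int n, int h)
    {
        /* (c - h) / (n - h), rounded up to the nearest integer */
        return ((c - h) + (n - h) - 1) / (n - h);
    }

    int main(void)
    {
        printf("%d\n", hitting_factor(11, 1, 0));  /* degenerate case: prints 11 */
        printf("%d\n", hitting_factor(7, 2, 0));   /* 7 candidates, 2 hits needed: prints 4 */
        printf("%d\n", hitting_factor(7, 2, 1));   /* after one hit: prints 6 */
        return 0;
    }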
The new algorithm is shown in Figure 4-2. Briefly, the algorithm first chooses all of the
tests that satisfy requirements with a hitting factor of one, adds them to the reduced suite
(RS), and updates the requirements' hitting factors accordingly.
2. In cases where the original test suite does not satisfy the requirement completely, there are a couple of options available to the implementer: the tester could be asked to add test cases until the requirement is satisfied; the number of hits required could be reduced to the number of hits on the requirement in the original test suite; or the requirement could simply be dropped as unsatisfiable.
Figure 4-2. A Multi-Hit Test Suite Reduction Algorithm

algorithm ReduceTestSuite
bounds  y: number of requirements (HGS: n)
        z: highest number of test cases (HGS: nt)
input   T1, T2, ..., Ty: associated testing sets for r1, r2, ..., ry respectively,
        containing test cases from t1, t2, ..., tz
        N1, N2, ..., Ny: the "hitting number" (minimum number of hits) for r1, r2, ..., ry respectively,
        each an integer from 1 to Card(T1), Card(T2), ..., Card(Ty) respectively
output  S: a representative set of T1, T2, ..., Ty
declare MAX_HF, CUR_HF: 1..z
        LIST: list of ti's
        NEXT_TEST: one of t1, t2, ..., tz
        H1, H2, ..., Hy: the number of test cases in S that are also in T1, T2, ..., Ty respectively
        Max(): returns the maximum of a set of numbers
        HitFac(ri): returns the hitting factor of requirement ri,
                    HitFac(ri) = roundup((Card(Ti) - Hi) / (Ni - Hi))
        IsSat(ri): returns true if Hi >= Ni, otherwise false
begin
    /* Step 1: initialization */
    MAX_HF := Max(HitFac(ri) for every i)     /* get the maximum hitting factor */
    S := union of all Ti with Card(Ti) = 1    /* take union of all single-element Ti's */
    Hi := Hi + Card(Ti ∩ S), for every i      /* record hits for Ti containing elements of S */
    CUR_HF := 1                               /* consider requirements needing all of their test cases */
    /* Step 2: compute S according to the heuristic for sets with higher hitting factors */
    loop
        CUR_HF := CUR_HF + 1                  /* consider all sets with next higher HF */
        while there are i such that HitFac(ri) = CUR_HF and not IsSat(ri) do
            /* process all unsatisfied requirements of the current HF */
            LIST := all tj ∈ Ti where HitFac(ri) = CUR_HF and not IsSat(ri)
            LIST := LIST - S                  /* but remove those we have already chosen */
            NEXT_TEST := SelectTest(CUR_HF, LIST)   /* get another tj to include in S */
            S := S ∪ {NEXT_TEST}              /* add the test to S */
            foreach Ti where NEXT_TEST ∈ Ti do
                Hi := Hi + 1                  /* record the hit on ri */
            endfor
            MAX_HF := Max({0} ∪ {HitFac(ri), for all i where not IsSat(ri)})
        end while
    until CUR_HF = MAX_HF
end ReduceTestSuite

- - - - - - -

function SelectTest(HF, LIST)
    /* this function selects the next ti to be included in S */
    declare COUNT: array[1..z]
begin
    foreach ti in LIST do
        compute COUNT[ti], the number of unsatisfied rj's of hitting-factor HF containing ti
    construct TESTLIST consisting of tests from LIST for which COUNT[ti] is the maximum
    if Card(TESTLIST) = 1 then return (the test case in TESTLIST)
    elseif HF = MAX_HF then return (any test case in TESTLIST)
    else return (SelectTest(HF+1, TESTLIST))
end SelectTest
Figure 4-3. A C Program

#include <stdio.h>

int main()
{
    char *cp;
    int i, j;

    scanf( "%d", &i );

    if( i < 0 ) {
        cp = "hello world";
    } else {
        cp = "good bye world";
    }

    scanf( "%d", &j );
    if( j >= 0 ) {
        printf( "%s", cp );
    }
}
Then, it picks test cases one at a time, and adds each one to the reduced suite, until all of
the requirements are satisfied. The test case that hits the most requirements of the smallest
hitting-factor (HF) is chosen. If necessary, the number of hit requirements of the next
higher hitting-factor is used as a tie breaker. The result is a reduced suite satisfying
all of the requirements with at least the necessary number of hits.
4.2.3. Using the Multi-Hit Reduction Algorithm for PSSC Minimization
To illustrate how a test suite can be minimized for PSSC using the multi-hit minimization
algorithm, consider the following: suppose we wish to test the program shown in Figure 4-3, and we
have the test cases given in Table 4-1. Table 4-2 shows each executable statement in the
program along with information about its sensitivity requirement. (The sensitivity values
listed in the table are chosen for the purposes of illustration only. They reflect an intuitive
estimate of the probability of an error in the statement propagating to the output; in practice
they would be obtained through a more rigorous methodology, such as mutation analysis.)
The number of hits required (HN) is calculated from the sensitivity value using
the equation HN = ln(1-0.75) / ln(1-sens), so that there is a 75% chance that
testing will reveal a fault, if one exists, on any particular line.
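As a worked instance of this formula: for a statement with sensitivity 0.40, HN = ln(1-0.75) / ln(1-0.40) = ln(0.25) / ln(0.60) ≈ 1.386/0.511 ≈ 2.7, which rounds up to the 3 hits shown for r2 and r6 in Table 4-2; for a statement with sensitivity 1.00, a single execution suffices.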
Table 4-1. The Initial Test Suite for the Example Program

test   inputs (i, j)   output
t1     -1, -1          ""
t2     -1, 0           "hello world"
t3     -1, 1           "hello world"
t4     -1, 5           "hello world"
t5     0, -1           ""
t6     0, 0            "good bye world"
t7     0, 5            "good bye world"
t8     1, -1           ""
t9     1, 0            "good bye world"
t10    1, 5            "good bye world"
t11    2, 3            "good bye world"
The algorithm proceeds in the following manner. There is no requirement with an initial
hitting-factor of one, so as the algorithm enters the main loop, each requirement's
hitting-factor is the number of tests in its associated set divided by the number of hits
required (HN), rounded up, and the state of the algorithm is:
RS: {}
HF: r1: 11  r2: 4  r3: 4  r4: 4  r5: 6  r6: 4  r7: 8
Table 4-2. The Coverage Requirements for the Example Program

statement                   req   sens   HN   tests in associated set
scanf( "%d", &i )           r1    0.75   1    t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11
if( i < 0 )                 r2    0.40   3    t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11
cp = "hello world"          r3    0.75   1    t1, t2, t3, t4
cp = "good bye world"       r4    0.60   2    t5, t6, t7, t8, t9, t10, t11
scanf( "%d", &j )           r5    0.50   2    t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11
if( j >= 0 )                r6    0.40   3    t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11
printf( "%s", cp )          r7    1.00   1    t2, t3, t4, t6, t8, t9, t10, t11
The unsatisfied requirements with the lowest hitting-factor are r2, r3, r4, and r6. In these
requirements test cases t1-t11 each hit three of the requirements. (Each of them hit r2, r6,
and one of r3 or r4.) Since all the test cases tie, they are examined against test requirements
of the next hitting-factor, but since none of the tests score hits against the nonexistent
requirements of HF=5, they all are checked against the next set of requirements: r5 with
a hitting-factor of six. Since r5 contains all of the test cases, they are all checked against
r7, which has a hitting-factor of eight. All of the test cases, except t1, t5, and t8, tie with 1
hit against requirements of HF=8. These test cases also tie on all of the remaining hitting
factors, so one of the remaining 8 tests is chosen at random. For the sake of this example,
t4 is chosen. After marking the hits of t4, r1, r3 and r7 are completely satisfied, and the
algorithm’s new state is
RS: {t4}
HF: r2: 5 r4: 4 r5: 10 r6: 5
The lowest hitting-factor is 4, so hits are first counted against r4, leaving t5-t11 with 1 hit
each. All of these tests score 2 hits against the associated sets for r2 and r6, the require-
ments with HF=5, so t5-t11 are checked against r5, where they all earn 1 hit, so one test
case from the set {t5,t6,t7,t8,t9,t10,t11} is chosen at random. Choosing t9 satisfies r5
completely, and increases the hitting factors of r2, r4, and r6, resulting in
RS: {t4,t9}
HF: r2: 9 r4: 6 r6: 9
Tests {t5,t6,t7,t8,t10,t11} hit r4, the only requirement with the lowest hitting-factor. Those
tests also hit both r2 and r6, so one of them, t11 perhaps, is chosen at random. t11 satis-
fies all of the remaining requirements, so ReduceTestSuite returns the set {t4,t9,t11}, a 72%
reduction in test suite size.
4.2.4. Asymptotic Analysis of the Multi-Hit Reduction Algorithm
This multi-hit reduction algorithm can reduce the size of a test suite, while still maintaining
the required coverage, but this is only useful if it runs in a reasonable amount of time.
The single-hit minimization algorithm on which this algorithm is based has a worst-case
runtime of O(y(y+z)r), where y is the number of requirements, z is the number of test cases,
and r is the maximum cardinality of the requirements' associated sets [Harrold93].
It can be shown that this new algorithm has similar worst-case behavior. Let y be the
number of requirements, h be the sum of the number of hits required over all of the re-
quirements, z be the number of test cases, and r be the maximum cardinality of all of the
requirements’ associated sets. As in Harrold et al.’s original algorithm, the most expensive
parts of the new algorithm occur in the SelectTest subprocedure [Harrold93]. The two im-
portant parts of the SelectTest procedure are (1) counting the number of hits of each test,
and (2) picking the best test(s) based on the count. Counting the number of hits takes at
most O(yr) time, because SelectTest will, in the worst case, examine all of the test cases
for each of the requirements. Picking the best test cases involves going through the test
cases and selecting those with the highest count, which can be done in O(z) time, but this
may have to be repeated O(r) times to resolve ties; this is done by calling SelectTest recursively.
SelectTest itself is called from the main algorithm once for each test case that makes it into
the reduced suite. In the worst case, the reduced suite will have O(h) test cases, with each
test case scoring one hit on one requirement. Thus, the overall worst-case runtime of this
multi-hit reduction algorithm is O(h(y+z)r).
4.3. An Experiment with PSSC Minimization
A final experiment was performed to assess the performance characteristics of PSSC
minimization. The experiment was similar to that in Section 3.3, except that PSSC rather than
edge-coverage was used as the adequacy criterion during minimization.
4.3.1. Experimental Design
We used the same subjects, faulty versions, test cases, and initial test suites as were used
in our first two experiments (Section 3.3): the seven small C programs, along with their
faulty versions and test cases, and extended edge-coverage-adequate test suites.
As in the earlier experiments, we used a full-factorial experimental design, minimizing
1000 test suites of various sizes for each of the cells. PSSC's dependence on a confidence
level introduced a third independent variable for the experiment, in addition to the subject
program and test suite size. A set of 14 discrete confidence levels was used, ranging from
0.05 to 0.995. The combination of the 14 confidence levels with each of the 7 programs
resulted in 98 cells for the experiment.
The threats to validity mentioned in Section 3.3.3 apply also to this experiment. Our mini-
mization program may be incorrectly implemented or fail to find the smallest adequate test
suite. We cannot control the nature or locality of the faults. The programs, faults, and test
suites may not be representative of those "in the wild". In our measurements, we treat all
test cases as equally expensive and all faults as equally severe.
4.3.2. Results
4.3.2.1. Minimized Test Suite Size
Figure 4-4 contains boxplots depicting the magnitude of the test suites after minimization
and the variability in their sizes. The boxplots are shown with one set of axes per program.
The horizontal axis for each subject program shows the PSSC-minimization confidence
levels used to produce the corresponding boxplots. Each program's vertical axis enumerates
the number of test cases.
Figure 4-4. Sizes of Test Suites after PSSC minimization
[Boxplots of minimized test suite size, one set of axes per program (totinfo, schedule1, schedule2, tcas, printtok1, printtok2, replace), at confidence levels 0.05, 0.1, 0.2, 0.25, 0.33, 0.5, 0.67, 0.75, 0.8, 0.9, 0.95, 0.975, 0.99, and 0.995.]
Not surprisingly, the higher confidence levels require more tests. For example, on schedule
the median test suite size is 10 at the 0.8 confidence level, and it increases to 23 test cases
at the 0.975 confidence level.
One difference in these results from the results of our other experiments involved the
variance in minimized test suite sizes. The high variability of test suite sizes for print_tokens,
schedule2, and tcas is quite different from the variability among minimized test suite sizes
we found in our first experiment (Section 3.3.4.1), which were very stable despite a large
range of initial test suite sizes. One explanation may be that many of the initial test suites,
while edge-coverage adequate, did not completely satisfy the PSSC criterion for which they
were being minimized. Thus the variability in test suite size may correspond to the extent
to which each of the original test suites satisfied the PSSC criterion.
4.3.2.2. Minimized Test Suite Performance
For each subject program, Figure 4-5 shows the mean sizes and mean number of faults de-
tected by the original test suites, edge-minimized test suites, randomly reduced test suites,
and PSSC-minimized test suites represented by the circles, diamonds, squares, and curved
lines, respectively. The averages for each PSSC confidence level is represented by an as-
terisk. Each mean size can be taken as a rough indicator of the cost of running each type
of test suite. Each mean count of faults detected can be thought of as a measure of the
benefit of running the tests. Generally, the PSSC curve starts near the origin with the point
for confidence level of 0.05. The number of faults rises rapidly at first and the PSSC curve
falls in between the points representing the minimized and randomly reduced test suites.
Figure 4-5. Average Test Suite Size vs. Average Number of Faults Detected
[Seven panels (totinfo, schedule, schedule2, tcas, print tokens, print tokens2, replace): average faults detected versus average test suite size for the original, edge-minimized, and randomly reduced test suites, and for PSSC minimization at confidence levels from 0.05 to 0.995.]
For some of the subjects, such as schedule2, the line is closer to the point representing
the randomly reduced test suites. For others, such as tcas, it is closer to the point
representing the minimized test suites. This suggests that, at the size of edge-minimized test
suites, PSSC-adequate test suites may be less cost-effective than edge-adequate test suites
but more cost-effective than random testing.
The confidence level on the PSSC curve that lies closest to the minimized test suite size varies. For tot_info, tcas, and replace, a confidence level of 0.2 appears comparable to edge-minimization in terms of both size and faults detected. The schedule, schedule2, and print_tokens subjects require higher confidence levels, and thus larger test suites, to detect a number of faults comparable to edge minimization. The print_tokens2 subject performs comparably to edge minimization at a confidence level of 0.67.
The variability in the confidence level required to perform comparably to edge-coverage minimization is unfortunate: it indicates that a practitioner would need either to set a low confidence level and risk losing even more effectiveness than with edge-coverage, or to set a high confidence level and risk having larger test suites than necessary for the desired level of fault detection.
The performance variation also suggests that some faults are more likely to be detected by edge-minimized test suites and others by PSSC-minimized test suites. To investigate this further, we examined the 41 faults in the tcas subject more closely. For each fault, we counted the number of edge-minimized test suites, and the number of 0.2-confidence-level PSSC-minimized test suites, that detect that fault. These test suites have similar characteristics in terms of average faults detected and average test suite size, but
differ in the criteria by which they were created. We found that 16 of the faults were detected by the PSSC-minimized test suites more than twice as often as by the edge-minimized test suites. Eight of the faults were detected by the edge-minimized suites more than twice as often as by the PSSC-minimized suites. The remaining 17 showed less than a 2:1 difference in detection rates. While not conclusive, this would seem to support the hypothesis that different coverage criteria are useful in detecting different classes of faults.
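The bookkeeping behind this comparison is straightforward; the sketch below reproduces it with made-up detection counts (the edge_hits and pssc_hits lists are illustrative, not the study's data).

    # Classify faults by how often two families of minimized test
    # suites detect them; entry i counts the suites detecting fault i.
    edge_hits = [12, 3, 0, 9, 5]   # detections by edge-minimized suites
    pssc_hits = [5, 10, 4, 8, 1]   # detections by 0.2-level PSSC suites

    pssc_favored, edge_favored, comparable = [], [], []
    for fault, (e, p) in enumerate(zip(edge_hits, pssc_hits)):
        if p > 2 * e:        # PSSC suites detect it >2x as often
            pssc_favored.append(fault)
        elif e > 2 * p:      # edge-minimized suites detect it >2x as often
            edge_favored.append(fault)
        else:                # less than a 2:1 difference either way
            comparable.append(fault)

    print(pssc_favored, edge_favored, comparable)  # [1, 2] [0, 4] [3]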
Chapter 5. Conclusion
5.1. Results
In our experiments, edge-coverage-based minimization typically resulted in substantial size reductions. Each program exhibited a remarkably stable minimized test suite size, so the amount of savings depended almost entirely on the size of the original test suite. The experiments with the unextended test suites for the Space program showed that test suites greedily generated to be edge-coverage adequate may still contain 20% more tests than are needed for edge-coverage adequacy. Test suites designed with more test cases yielded even greater savings.
Unfortunately, these savings come at the cost of missing faults that would otherwise be detected. Our experiments on the Siemens programs indicated that this loss can be drastic and unpredictable. Our experimentation with the Space program, however, showed that this cost can be small and relatively stable.
PSSC offers an alternative minimization technique. Because it can be applied at varying confidence levels, it can be used to achieve a range of results along a cost-benefit curve. Although PSSC is less effective than edge-coverage at the same test suite size, PSSC is capable of scaling to test suite sizes larger than edge-coverage yields but still smaller than the original test suites. There are some indications that PSSC may be complementary to edge-coverage, in the sense that the two are useful in detecting different faults.
The multi-hit minimization algorithm provides a new generalization of the traditional minimization algorithm. In addition to enabling PSSC minimization, this algorithm could be
used to scale other coverage criteria by, for example, requiring that each statement be exercised by at least two test cases in the test suite.
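The algorithm's details are given earlier in the thesis; the sketch below is not that implementation, but a greedy heuristic in the same spirit, in which each coverage requirement carries a required hit count (setting every count to 2 gives the "each statement at least twice" criterion just mentioned). The test and requirement names are hypothetical.

    def multi_hit_minimize(coverage, required):
        """Greedy multi-hit reduction sketch.

        coverage: dict mapping test id -> set of requirements it hits
        required: dict mapping requirement -> times it must be hit
        Returns the chosen tests and any residual (unmet) demand.
        """
        remaining = dict(required)   # hits still needed per requirement
        unused = list(coverage)
        selected = []

        def gain(t):                 # outstanding demand a test reduces
            return sum(1 for r in coverage[t] if remaining.get(r, 0) > 0)

        while any(n > 0 for n in remaining.values()) and unused:
            best = max(unused, key=gain)
            if gain(best) == 0:
                break                # no remaining test helps further
            selected.append(best)
            unused.remove(best)
            for r in coverage[best]:
                if remaining.get(r, 0) > 0:
                    remaining[r] -= 1
        return selected, remaining

    suite = {"t1": {"s1", "s2"}, "t2": {"s2", "s3"}, "t3": {"s1", "s3"}}
    picked, unmet = multi_hit_minimize(suite, {"s1": 2, "s2": 1, "s3": 1})
    print(picked)  # ['t1', 't3']: s1 hit twice, s2 and s3 once each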
5.2. Practical Implications
The savings from minimization are substantial and can be of practical importance. This
is especially true if running or verifying the correctness of each test case is expensive in
terms of human labor or external hardware.
In our experience, the cost of running edge-coverage minimization has not been excessive. Once the instrumentation and minimization system is in place, the edge-coverage minimization itself takes only a few minutes of CPU time and very little human interaction, even for the large test suites of the Space application. In addition, if used in the context of regression testing, this cost can be amortized across the several versions in which the minimized test suite is used in place of the original.
PSSC, on the other hand, though it provides an alternative to traditional minimization, required a large amount of work to determine which mutants were killed by each of the test cases. A static method of sensitivity estimation would substantially reduce the time required to perform PSSC minimization.
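For illustration only, the expensive step amounts to building a kill matrix and turning it into per-statement sensitivity estimates; the sketch below assumes, hypothetically, that a test's sensitivity for a statement is estimated as the fraction of that statement's mutants the test kills.

    # Hypothetical mutation data: how many mutants each statement has,
    # and how many each test killed (gathered by running every mutant
    # against every test, which is the costly step noted above).
    mutants_per_stmt = {"s1": 10, "s2": 4}
    kills = {
        "t1": {"s1": 7, "s2": 0},
        "t2": {"s1": 2, "s2": 3},
    }

    # Estimated probability that a test reveals a fault at a statement:
    # killed mutants / total mutants for that statement.
    sensitivity = {
        t: {s: k / mutants_per_stmt[s] for s, k in per_stmt.items()}
        for t, per_stmt in kills.items()
    }
    print(sensitivity["t1"]["s1"])  # 0.7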
Unfortunately, the potential loss of fault detection is problematic, so in any case minimization should be used with caution. It would probably be unwise to minimize on coverage criteria less stringent than those used to produce the initial test suite. For example, if the initial test suite was designed by creating a core of functional tests and then adding additional
tests to achieve edge-coverage adequacy, then it would make more sense to use a minimization tool that allows the functional requirements to be included in the criteria against which the test suite is minimized. If minimization is to be used in the context of regression testing, it may be wise to use the unminimized test suite to test the initial release, and then to choose between the unminimized and minimized test suites for later releases on the basis of how well the minimized test suite would have detected faults in the initial version.
5.3. Limitations of This Investigation and Future Work
Our empirical investigation was limited in the size and nature of the subject programs, in
the number and nature of faults seeded in the programs, and in the nature of test cases and
test suites utilized.
It is unclear to what extent our results extend to other programs. The only way to address this deficiency is to perform further experimentation on a wider range of programs,
especially additional programs as large as or larger than the Space program.
The faults used with the Siemens programs were artificial faults seeded by researchers. The faults used in the Space program were real faults found during and after the program's development. Unfortunately, we are not sure how accurately the faults that have been detected in the Space program reflect the actual distribution of faults that existed in the program at the time when minimization would have been employed. A better understanding of the nature, distribution, and severity of faults existing "in the wild" would be useful in better understanding the practical cost of the loss in the number of faults detected.
The original test cases and test suites we used were created by researchers and may not accurately reflect the test cases and test suites used in practice. In particular, our test suites were chosen from large test pools containing test cases generated according to multiple coverage criteria. Further studies using either test suites from practitioners or test suites that better model practice (for example, test suites generated from separate pools of functionally oriented and edge-coverage-oriented test cases) might provide a more accurate simulation of minimization's use in practice.
The minimization algorithms discussed in this paper do not always find the minimally sized test suite meeting all requirements; they find a test suite, as small as or smaller than the original, that satisfies all of the requirements. An empirical investigation of various test suite minimization algorithms, comparing their optimality and execution time when minimizing test suites from practice, may be useful for practitioners.
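On tiny instances, greedy reduction can be compared directly against exhaustive search for the true minimum. The coverage data below is contrived (hypothetical tests and requirements) to show the classic failure mode: the greedy heuristic keeps three tests where two suffice.

    from itertools import combinations

    # Contrived coverage data: requirements covered by each test.
    coverage = {
        "t_big":  {1, 2, 3, 4},
        "t_odd":  {1, 3, 5},
        "t_even": {2, 4, 6},
    }
    requirements = {1, 2, 3, 4, 5, 6}

    def greedy(coverage, requirements):
        # Repeatedly take the test covering the most uncovered
        # requirements; assumes the pool can cover everything.
        uncovered, chosen = set(requirements), []
        while uncovered:
            t = max(coverage, key=lambda t: len(coverage[t] & uncovered))
            chosen.append(t)
            uncovered -= coverage[t]
        return chosen

    def exact(coverage, requirements):
        # Brute force over subsets; feasible only for tiny instances.
        for size in range(1, len(coverage) + 1):
            for combo in combinations(coverage, size):
                if set().union(*(coverage[t] for t in combo)) >= requirements:
                    return list(combo)

    print(greedy(coverage, requirements))  # ['t_big', 't_odd', 't_even']
    print(exact(coverage, requirements))   # ['t_odd', 't_even']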
Test suite minimization has the potential for substantial savings in the cost of testing, but further investigation is needed to better understand the practical effect of the risks inherent in test suite minimization.
Bibliography
Balcer, M., Hasling, W., and Ostrand, T. Automatic generation of test scripts from formal test specifications. Proc. of the 3rd Symp. on Softw. Testing, Analysis, and Verification, pages 210-218, December 1989.
Beizer, B. Software Testing Techniques. New York, NY: Van Nostrand Reinhold, 1990.
Chen, T.Y. and Lau, M.F. Dividing strategies for the optimization of a test suite. Information Processing Letters, 60(3):135-141, March 1996.
Frankl, P.G. and Weiss, S.N. An experimental comparison of the effectiveness of branch testing and data flow testing. IEEE Trans. on Softw. Eng., 19(8):774-787, August 1993.
Garey, M.R. and Johnson, D.S. Computers and Intractability. New York: W.H. Freeman, 1979.
Graves, T.L., Harrold, M.J., Kim, J-M., Porter, A., and Rothermel, G. An empirical study of regression test selection techniques. Proc. 20th Int'l. Conf. on Softw. Eng., April 1998.
Harrold, M.J., Gupta, R., and Soffa, M.L. A methodology for controlling the size of a test suite. ACM Trans. on Softw. Eng. and Methodology, 2(3):270-285, July 1993.
Harrold, M.J. and Rothermel, G. Aristotle: A System for Research on and Development of Program Analysis Based Tools. Technical Report OSU-CISRC-3/97-TR17, The Ohio State University, March 1997.
Horgan, J.R. and London, S.A. ATAC: A data flow coverage testing tool for C. Proc. Symp. on Assessment of Quality Softw. Dev. Tools, pages 2-10, May 1992.
Hutchins, M., Foster, H., Goradia, T., and Ostrand, T. Experiments on the effectiveness of dataflow- and controlflow-based test adequacy criteria. Proc. 16th Int'l. Conf. on Softw. Eng., pages 191-200, May 1994.
Johnson, R. Elementary Statistics. Sixth Edition. Belmont, CA: Duxbury Press, 1992.
Moore, D.S. and McCabe, G.P. Introduction to the Practice of Statistics. Third Edition. New York: W.H. Freeman and Company, 1999.
Offutt, J., Pan, J., and Voas, J.M. Procedures for reducing the size of coverage-based test sets. Proc. of the Twelfth Int'l. Conf. on Testing Comp. Softw., pages 111-123, June 1995.
Ostrand, T.J. and Balcer, M.J. The category-partition method for specifying and generating functional tests. Comm. of the ACM, 31(6), June 1988.
Rothermel, G. and Harrold, M.J. Analyzing regression test selection techniques. IEEE Trans. on Softw. Eng., 22(8):529-551, August 1996.
Rothermel, G., Harrold, M.J., Ostrin, J., and Hong, C. An empirical study of the effects of minimization on the fault detection capabilities of test suites. Proc. of the Int'l. Conf. on Softw. Maintenance, November 1998.
Untch, R.H., Offutt, A.J., and Harrold, M.J. Mutation Analysis Using Mutant Schemata. International Symposium on Software Testing and Analysis, pages 139-148, June 1993.
Voas, J.M. PIE: A Dynamic Failure-Based Technique. IEEE Trans. on Softw. Eng., 18(8):717-727, August 1992.
Vokolos, F.I. and Frankl, P.G. Empirical evaluation of the textual differencing regression testing technique. Proc. of the Int'l. Conf. on Softw. Maintenance, pages 44-53, November 1998.
Wong, W.E., Horgan, J.R., London, S., and Mathur, A.P. Effect of test set size and block coverage on the fault detection effectiveness. Proc. Fifth Int'l. Symp. on Softw. Rel. Engr., pages 230-238, November 1994.
Wong, W.E., Horgan, J.R., London, S., and Mathur, A.P. Effect of test set minimization on fault detection effectiveness. Proc. 17th Int'l. Conf. on Softw. Eng., pages 41-50, April 1995.
Wong, W.E., Horgan, J.R., Mathur, A.P., and Pasquini, A. Test set size minimization and fault detection effectiveness: A case study in a space application. Proc. 21st Annual Int'l. Comp. Softw. & Applic. Conf., pages 522-528, August 1997.
Wong, W.E., Horgan, J.R., London, S., and Mathur, A.P. Effect of test set minimization on fault detection effectiveness. Software - Practice and Experience, 28(4):247-369, April 1998.