
IN DEGREE PROJECT COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017

Effectiveness of Inadequate Test Suites
A Case Study of Mutation Analysis

HIKARI WATANABE

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


DD221X, Degree project in Computer Science (30 ECTS credits)

Master’s program in Computer Science (120 ECTS credits)

KTH Royal Institute of Technology, Year 2017

Supervisor at CSC: Karl Meinke

Examiner at CSC: Cristian M Bogdan

Master's thesis work carried out at NASDAQ Technology AB

Abstract

How can you tell whether your test suites are reliable? This is often done through the use of coverage criteria that define a set of requirements the test suites need to fulfill in order to be considered reliable. The most widely used criteria are those referred to as code coverage, where the degree to which the code base is covered is used as an arbitrary measure of how good the test suites are. Achieving high coverage would indicate an adequate test suite, i.e. reliable according to the standards of code coverage. However, covering a line of code does not necessarily mean that it has been tested. Thus, code coverage can only tell you what parts of the code base have not been tested, as opposed to what has been tested.

Mutation testing, on the other hand, is an approach that evaluates the adequacy of test suites through their fault detection ability, rather than how much of the code base they cover.

This thesis performs mutation analysis on a project with inadequate code coverage. The present testing effort at the unit level is evaluated, and the cost and benefits of adopting mutation testing as a testing method are explored.

Sammanfattning

How do you know when tests are reliable? Often, coverage criteria are used that define a set of requirements the tests must fulfill in order to be considered reliable. The most widely used criteria are those known as code coverage, where the degree to which the code base is covered is used as a measure of test reliability. High coverage indicates adequate tests, i.e. reliable according to code coverage. However, coverage of a line of code does not necessarily mean that it has been tested. Code coverage can thus only show which parts of the code base have not been tested, rather than what has been tested.

Mutation testing, on the other hand, is a way to evaluate the effectiveness of tests through their fault detection ability, rather than how much of the code base they cover.

This thesis performs mutation analysis on a project with insufficient code coverage. The quality of the current tests at the unit level is evaluated, and the cost and benefits of adopting mutation testing as a testing method are explored.

Keywords

Mutation testing, code coverage, regression analysis

Preface

I would like to thank my team at NASDAQ, who helped and supported me throughout the project.

Special thanks to:

Kjell Paulson at NASDAQ for having me in his team

Karl Meinke at KTH for taking me on as a thesis student and guiding me throughout

Cristian M Bogdan at KTH, for evaluating my work

Contents

Introduction
1.1 Objective
1.2 Delimitations
1.3 Related Work
Background
2.1 Mutation Testing
2.1.1 RIP Model
2.1.2 Mutation Score
2.1.4 Mutation Operators
2.1.5 Equivalent Mutants
2.1.6 Cost reduction
2.2 Theory behind Mutation Testing
2.3 Mutation System
Methodology
3.1 Codebase
3.2 Sample Space
3.3 Generating Unit Tests
3.4 Mutation System
3.5 Mutation Analysis
3.6 Coverage Metrics and Mutation Coverage
Results
4.1 The Sample Space
4.2 Mutation Analysis
4.2.1 Original Test Suites
4.2.2 Generated Test Suites
4.2.3 Performance
4.3 Linear Regression Analysis
Discussion
5.1 Quality of Unit Tests
5.2 Cost and Benefits of Mutation Testing
5.4 Reflection
5.5 Sustainability and Societal Aspects
5.6 Conclusions
5.7 Future Work
References


Chapter 1

Introduction

This chapter introduces the topic and objective of this thesis project along with the research

questions and relevance of this study.

Software testing remains one of the most important, and most expensive, aspects of ensuring high quality. According to the Capgemini World Quality Report of 2015, budgets for quality assurance and testing have risen to an average of 35% of total IT spending, a significant 9% increase from 2014, with a prediction that the average will reach 40% by the year 2018 [WQR].

At its core, software testing is an endeavor for higher quality, typically through the detection of dormant faults. However, the growing size and complexity of software entail a practically infinite input space, making it infeasible to completely test entire systems. Testing is thus always a trade-off between the cost of testing and the potential cost of undiscovered faults. To overcome this fundamental limitation of testing, developers need a structured way to assess the effectiveness, or quality, of test suites in terms of detecting faults.

Intuitively, the most logical measure of a test suite's fault detection ability is simply the number of real faults it detects. Faults discovered during a product's lifetime can be used in retrospect to assess the adequacy of test suites. However, this approach does not lend itself well to a development process. Thus, a method to predict the quality of test suites based solely on the suites and the current build of the system under test (SUT) is required. The most common such approach is the use of coverage criteria [AO17]. Coverage criteria define the properties that the test suite needs to fulfill; for example, statement coverage requires that every statement be executed and branch coverage that every branch be traversed. The coverage measurements then serve as an arbitrary indicator of adequacy, e.g. a test suite with 80% statement coverage is considered higher quality than a test suite with 70% statement coverage.

Mutation testing is a fault-based testing technique that provides a coverage criterion called mutation coverage. Mutation coverage differs from the other criteria: it is based on fault detection rather than coverage of structural aspects of the code base.


Mutation analysis is the process of injecting small faults into the SUT through syntactic changes, creating copies, or mutants, each containing one fault. Test suites that are able to detect the injected faults are then considered adequate, i.e. reliable.

Mutation testing can be used for testing at both the unit level and the integration level [DM96]. It has been applied to many programming languages, e.g. Fortran, Ada, AspectJ, Java and C. Besides its use at the software implementation level, it has been applied at the software design level to test the specifications of the SUT [MR01].

The concept of mutation was first introduced in 1971 by Richard Lipton in a class term paper. It was later developed and published by DeMillo et al. in 1978. Over four decades of history and a wide range of studies have resulted in a large body of literature [JH11, OU00].

Mutation coverage subsumes many other coverage criteria [OV96], where subsumption can be defined as: coverage criterion Ca subsumes Cb if and only if every test suite that satisfies Ca also satisfies Cb. It has also been shown to predict actual fault detection ability better than other criteria in some settings, but never shown to be worse [GJ14]. However, mutation testing is computationally expensive and difficult to apply, and although there has been much research [JH11], it is still regarded as academic and not widely adopted within the industry.

1.1 Objective

NASDAQ Technology AB is an American fin-tech company. NASDAQ is a leading provider of trading,

clearing, exchange technology, listing, information and public company services across six continents

[NH17]. The business-critical nature of the financial domain necessitates a solid testing effort with

reliable test suites.

In this thesis, the test suites of one of NASDAQ's software projects are evaluated at the unit level. Historically, the project has lacked a set structure for testing, which has resulted in low coverage. The team maintaining the project is taking steps to supplement the present testing efforts; however, the abundance of legacy code with high interdependency has made it difficult to create unit tests.

Unit tests are widely recognized as an integral part of a development process. Among other benefits, they serve as a safety net during the inevitable refactoring of old code, detecting undesired behaviors and helping to locate the fault.

On investigating previous system failures, the project team discovered that only a handful of the critical failures could have been prevented with unit tests. Thus, the team is doubtful of the gain from further unit tests and reluctant to invest any resources. However, assessing the quality of the present unit tests would determine the effectiveness of past approaches and could prove useful in convincing the team otherwise.

Since the project is lacking in the number of unit tests, a new set of unit tests was generated through an automatic test suite generator. Mutation testing was applied to both the original and the generated unit tests. To assess the benefits of mutation testing, the ability of conventional coverage criteria to predict mutation coverage was explored through regression analysis.


The research questions can be defined as: What is the quality of the present unit tests, and what are the cost and benefits of adopting mutation testing?

1.2 Delimitations

The measurements used within the thesis are directly dependent on the metrics reported by the tools. Although simple coverage measurements such as line coverage can easily be cross-validated because of their prevalence, path coverage is far more difficult. Thus the ability to validate measurements is somewhat limited in this regard.

The performance comparison of mutation testing is limited by the inability to augment or modify the present test suites. The SUT is extremely large and complex; therefore, creating meaningful test suites without the assistance of a developer from the project is far too time-consuming. Without a way to augment the test suites, it is practically impossible to measure the performance of mutation testing at different degrees of testing effort.

1.3 Related Work

A study conducted by Simona et al. [NW11] attempted to assess the cost of applying mutation testing to a real-world software system. The study applied three widely recognized mutation testing tools, namely MuJava, Jumble and Javalanche, to the open source project Eclipse. The study concluded that although configuring and applying the tools is simple enough, special attention should be paid to the high execution time.

A recent study conducted by Gopinath et al. [GJ14] investigated the correlation between mutation kill ratio and widely used coverage criteria (statement, block, branch and path coverage). The study considered hundreds of open source Java projects amassed from GitHub repositories. They measured the coverage and performed mutation analysis on the test suites. The data was then analyzed through regression analysis, measuring both τb (Kendall rank correlation coefficient) and R2 (coefficient of determination). The same experiment was conducted on both the projects' original test suites and suites automatically generated through the Randoop testing tool. The study found correlation between the widely used coverage criteria and mutation kill ratio, with statement coverage being the best at R2 = 0.94 for original tests and 0.72 for generated tests. The aim of Gopinath et al. was to measure the ability of coverage criteria as a predictor of suite quality from the perspective of non-researchers, and to present a possible alternative to the computationally expensive mutation testing.

The statistical approach adopted by Gopinath et al. is the same as that of Gligoric et al. [GG13], who considered some of the same questions but from a research perspective. However, that study considered only 15 Java programs and 11 C programs and concluded that branch coverage performed best. This thesis investigates one considerably larger project and applies a similar statistical approach, reaching results similar to those of Gopinath et al.


Chapter 2

Background

This chapter presents the background material needed to understand mutation testing and the tools used throughout this thesis.

2.1 Mutation Testing

The 70s saw the rise of Van Halen. Like any other band, when Van Halen was hired to play at a venue they provided the promoter with a contract rider. The rider included everything from sound and lighting requirements to food and drinks. Listed among these was a big bowl of M&M's, but absolutely no brown ones. This was not just superstition or some rock star ridiculousness; it served a very specific purpose. They randomly buried the odd request to make sure that the contract was thoroughly read. Finding a brown M&M meant there might be other things that the promoter had missed [VA01].

Van Halen made sure the rider was thoroughly read by hiding an odd item for the promoter to find. In a similar way, mutation testing makes sure the SUT is thoroughly tested by introducing artificial faults for the test suites to find. The process creates several copies of the code, each containing one fault. Existing test cases are executed against the copies, with the objective of distinguishing the original program from the faulty ones, thereby determining the adequacy of the existing test suites.

Let P be a program that functions correctly on some test set T. The program is subjected to mutation operators that introduce small artificial faults, thereby creating mutants (refer to figure 2.1) that differ from the original program in very small ways. Note that each mutant contains only one fault. Let these mutants be called P1, P2 … Pn. Running each mutant against T, there are two possible outcomes:

1. Some Pi gives a different result than P

2. Some Pi gives the same result as P

In case (1) Pi is said to be killed and in case (2) Pi is said to be alive. If a mutant is killed, the tests were able to distinguish P from the mutant. A mutant can be alive for two reasons.


Either the tests were not sensitive enough to detect the introduced fault and must be augmented, or Pi and P turn out to be functionally equivalent (henceforth noted as Pi ≡ P) [DL78, AB79].

Program P:    if (a ≤ b)
Mutant Pi:    if (a ≥ b)

Figure 2.1: Example of a mutant

2.1.1 RIP Model

The conditions for a mutant to be considered killed can be expressed more formally as three conditions, together referred to as the RIP model [YH14, VM97, AO17].

Reachability: The location of the mutation must be reached by the test.

Infection: After the location is executed, the state of the program must be infected, i.e. differ from the corresponding state of the original program.

Propagation: The infection must propagate through execution and result in an erroneous output or final state.

2.1.2 Mutation Score

As defined by DeMillo et al. [DL78], a test set that manages to kill all mutants, except for those equivalent to P, is adequate. In other words, a test set is adequate if it distinguishes the program from the mutant programs.

The extent to which a coverage criterion is satisfied is measured as a coverage score, calculated in terms of the imposed requirements. In the case of mutation testing it is referred to as the mutation score [AO17, OU00]. Let M be the total number of mutants, D the number of killed mutants and E the number of equivalent mutants [JH11, AB79, GO92]. The mutation score can be defined as:

MS(T) = D / (M − E)    (2.1)
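For example, with hypothetical numbers: if a mutation system generates M = 120 mutants of which E = 20 are equivalent, and the test set kills D = 75 of them, then MS(T) = 75 / (120 − 20) = 0.75.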

2.1.4 Mutation Operators

A mutation operator is a syntactic or semantic transformation rule applied to a SUT to create mutants. Operators are created with one of two goals: to inject faults representative of common mistakes that programmers tend to make, or to enforce testing heuristics, e.g. executing every branch.

Key to successful mutation testing is well-designed mutation operators. Syntactically illegal mutants would be caught by the compiler and be of no value; these are called stillborn mutants and should be discarded or not generated at all. A trivial mutant, by contrast, is one that can be caught by any test.

The Mothra mutation operators are the first formalized set, consisting of 22 mutation operators for the Fortran programming language [JH11, AO17]. The operators were derived through a study of programmer errors and implemented in the Mothra mutation system [KO91, DG88]. The full list and detailed description of each operator can be found elsewhere [KO91]. The operators were adapted to Java by Ammann et al. [AO17], and one of them is:

Relational Operator Replacement - ROR

Replace each occurrence of one of the relational operators (<, ≤, >, ≥, ==, !=) by each of the other operators and by falseOp and trueOp, where falseOp always results in false and trueOp always results in true. Applying the ROR operator to, for example, the program P shown in figure 2.1 would generate seven possible mutants:

if (a < b), if (a > b), if (a ≥ b), if (a == b), if (a != b), if (false), if (true)
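As an illustration of how test inputs interact with ROR mutants, the sketch below enumerates the seven mutants of the predicate from figure 2.1 as Java predicates and checks them against one test input. The IntRelation interface and all values are hypothetical; this is not code from Mothra or any actual mutation system.

    interface IntRelation {
        boolean test(int a, int b);
    }

    public class RorExample {
        // The original predicate of program P in figure 2.1.
        static final IntRelation ORIGINAL = (a, b) -> a <= b;

        // The seven ROR mutants: the five other relational operators,
        // plus falseOp and trueOp.
        static final IntRelation[] MUTANTS = {
                (a, b) -> a < b,
                (a, b) -> a > b,
                (a, b) -> a >= b,
                (a, b) -> a == b,
                (a, b) -> a != b,
                (a, b) -> false,  // falseOp
                (a, b) -> true    // trueOp
        };

        public static void main(String[] args) {
            // A mutant is killed if it disagrees with the original
            // on some test input.
            int a = 3, b = 3;
            for (IntRelation mutant : MUTANTS) {
                boolean killed = mutant.test(a, b) != ORIGINAL.test(a, b);
                System.out.println(killed ? "killed" : "alive");
            }
        }
    }

With the boundary input a = 3, b = 3 the original predicate evaluates to true, and only the mutants that agree with a ≤ b at the boundary (≥, == and trueOp) survive, which illustrates why boundary values make good test data for killing ROR mutants.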

2.1.5 Equivalent Mutants

One of the biggest hurdles of mutation testing is the equivalent mutant problem. Some mutants can turn out to be semantically equal to the original program although they are syntactically different. Without detecting all the equivalent mutants, the tester cannot have complete confidence in the test data; there would simply be no way to be sure whether the tests are inadequate or the live mutants are equivalent.

An equivalent mutant will always produce the same output as the original program and is thus impossible to kill. Refer to figure 2.2 for an example. Although they have two different conditions, both program P and mutant Pi will act in the exact same way, hence they are equivalent.

Detecting equivalence between two programs is an undecidable problem [BA82], i.e. there is no algorithmic solution. The situation, however, is somewhat different for the equivalent mutant problem. We do not need to determine the equivalence of an arbitrary pair of programs, but rather of two syntactically very similar programs. Although this was also proven undecidable, it has been suggested that it is possible in many specific cases [OP97, OC94].

Program P:
    int a = 0;
    while (a < 5) {
        a++;
    }

Equivalent Mutant Pi:
    int a = 0;
    while (a != 5) {
        a++;
    }

Figure 2.2: Example of an equivalent mutant

2.1.6 Cost reduction

One of the hindrances to mutation testing being widely adopted in the industry is the unreasonably high computational cost of creating and running a vast number of mutant programs. The number of mutants generated for a program is roughly proportional to the number of data references times the number of data objects [OL96, OU00]. There are, for example, seven possible mutants for the single line of code shown in figure 2.1, and the number would drastically increase with every new line. As described in the previous section, at least one and potentially all of the test cases must be run against each mutant, which brings with it a large computational cost.


Several approaches have been proposed to reduce the computational cost of mutation testing. These methods can be categorized as mutant reduction and execution cost reduction. This section presents the most studied methods for each category, according to the survey by Jia et al. [JH11].

2.1.6.1 Mutant Reduction Techniques

Mutant reduction techniques aim to reduce the number of generated mutants without suffering a significant loss of effectiveness. Let MST(M) denote the mutation score for a test set T applied to the mutants M. The mutant reduction problem can be defined as the problem of finding a subset M′ of M such that MST(M) ≈ MST(M′) [JH11].

Mathur et al. proposed the idea of constrained mutation: applying mutation testing with only the crucial mutation operators. The concept was later developed by Offutt et al. [OR93] as selective mutation, an approximation technique that reduces the number of created mutants by reducing the number of mutation operators used. Mutation operators generate varying numbers of mutants; some operators are more broadly applicable and will generate far more mutants than others, many of which may turn out to be redundant [JH11, OU00, OL96, MA91].

A study on selective mutation conducted by Offutt et al. [OL96] on 10 FORTRAN programs concluded that 5 of the Mothra mutation operators are sufficient to effectively conduct mutation testing.

2.1.6.2 Execution Cost Reduction Technique

Another way to reduce the computational cost, other than reducing the number of mutants generated,

is to optimize the mutant execution process.

Traditional mutation testing is often referred to as strong mutation. In strong mutation, for a given program P, a mutant Pi is said to be killed only if the original program P and the mutant Pi produce different outputs.

Proposed by Howden [HO82], weak mutation is an approximation technique that optimizes the execution of strong mutation by relaxing the definition of "killing a mutant". Weak mutation only requires the first two conditions of the RIP (Reachability, Infection, Propagation) model to be satisfied. A program P is assumed to be constructed from components {c1, c2 … cn}. Let Pi be a mutant created by changing the component ci; the mutant Pi is said to be killed if the internal state of Pi is incorrect after the execution of the mutated component. As such, weak mutation trades test effectiveness for reduced computational cost [JH11, AO17].
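To make the difference between strong and weak mutation concrete, consider the following hypothetical Java sketch (not taken from any mutation tool), in which the mutated component infects the internal state but the infection never propagates to the output:

    public class WeakVsStrong {

        // Original: the comparison result is computed but never
        // propagates to the output.
        static int original(int a, int b) {
            boolean c = a <= b;  // component under mutation
            return 0;            // output is independent of c
        }

        // Mutant: <= replaced by >= in the component.
        static int mutant(int a, int b) {
            boolean c = a >= b;
            return 0;
        }

        public static void main(String[] args) {
            // Outputs never differ, so the mutant survives strong mutation.
            System.out.println(original(2, 3) == mutant(2, 3)); // true
        }
    }

With the input a = 2, b = 3 the internal state c differs between the two versions (true versus false), so the mutant is killed under weak mutation; the outputs are identical, so it survives under strong mutation.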

2.2 Theory behind Mutation Testing

Section 2.1 gave an overview of mutation testing. This section will present the theory that makes

mutation testing possible.

Mutation testing is grounded in two fundamental hypotheses, first introduced by DeMillo et al. in 1978 [DL78], stated as:

Competent programmer hypothesis: Programmers are usually competent and produce code that is either correct or close to correct.


Coupling effect: Tests that detect small errors are so sensitive that they implicitly detect

more complex errors.

Suppose we have a program P, which is meant to compute a function F with an input domain D. The traditional approach to determining the correctness of P would be to find a subset T of D, such that:

    if   for all x in T, P(x) = F(x)    (2.2)
    then for all x in D, P(x) = F(x)

where P(x) denotes the output actually computed by P on input x. The subset T is then referred to as a reliable test set, i.e. the set of input data needed to determine the correctness of P. However, finding T requires exhaustive testing efforts and is deemed undecidable [HO76] for any non-trivial program.

Mutation testing, on the other hand, is a technique that attempts to draw a weaker conclusion: find a subset T of D, such that:

    if   P is not pathological
    and  for all x in T, P(x) = F(x)    (2.3)
    then for all x in D, P(x) = F(x)

A program P is not pathological if it was written by a competent programmer, i.e. it follows the competent programmer hypothesis. Mutation testing assumes that P is close to the correct program Pc; hence, either P = Pc or some other program Q close to P is correct.

Figure 2.3: Neighborhood of P within the domain of all possible programs



Let Φ be the set of programs close to P, with the assumption that P or some other program Q within Φ is correct. The approach of mutation testing to finding the subset T is to eliminate the alternatives. We formulate the method as: find a subset T of D, such that:

    for all x in T, P(x) = F(x)
    and for all Q in Φ,                  (2.4)
    either Q ≡ P
    or for some x in T, Q(x) ≠ P(x)

If we can find a subset T that satisfies formula 2.4, then we say that P passes the Φ mutant test, or that T differentiates P from all other programs in Φ. This can be explained as follows: given that P performs correctly on test set T, each program Q in Φ should either be equivalent to P or produce a different output than P for some input in T. Instead of having to exhaustively test P with a practically infinite test set, we can focus on differentiating P from Φ. However, the problem remains too large.

The coupling effect hypothesis says that there is often a strong coupling between the members of Φ and a small subset μ (refer to figure 2.3). The subset μ can be thought of as a set of programs very close to P, such that if P passes the μ mutant test with test data T, then P will also pass the Φ mutant test with test data T. The subset μ is referred to as the mutants of P, and the task of differentiating P from Φ is reduced to finding μ and differentiating P from μ [BD80].

2.3 Mutation System

Mutation testing is performed using a so-called mutation system. A mutation system implements the mutation analysis process, i.e. generating the mutants and handling them.

Figure 2.4 shows a generic process for mutation analysis. Let P be a program and T a set of tests to be evaluated. When P is submitted to a mutation system, the system first creates the mutants P1, P2 … Pn. Next, T is run on P. If a test fails, we have discovered a bug within P and it needs to be corrected; otherwise, T is executed on the mutants P1, P2 … Pn. If the output of a mutant Pi is different from the output of P, we mark Pi as killed. Once all the tests in T have been executed, the mutation score is calculated. If there are still live mutants, the tester can augment T to target the live mutants, and the process is repeated. Equivalent mutants are marked, either manually or through some automated technique, and are not considered in the next iteration.

Although the augmented test set does not necessarily reveal any new faults, the mutation score gives an approximate indication of the adequacy of the test set. The process is repeated until a mutation score of 1 is achieved or a threshold set by the tester is met [AO17, OU00].


Figure 2.4: Generic mutation testing process [JH11, OU00]

The process of mutation analysis described above is based on the theory from section 2.2. Creating the mutants P1, P2 … Pn using mutation operators is an attempt to find μ. The repeated process after creating the mutants implements the method of formula 2.4, i.e. differentiating P from μ.
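The core of this loop can be summarized in code. The following is a minimal sketch under simplifying assumptions (a hypothetical Program interface, deterministic outputs, equivalent mutants already identified, and P already passing every test in T); it is not how any real mutation system is implemented:

    import java.util.List;

    public class MutationAnalysisSketch {

        interface Program {
            Object run(Object input);
        }

        // Computes MS(T) = D / (M - E) following the process of figure 2.4.
        // Assumes P already passes every test in testSet.
        static double mutationScore(Program p, List<Object> testSet,
                                    List<Program> mutants, List<Program> equivalents) {
            int killed = 0;
            for (Program mutant : mutants) {
                if (equivalents.contains(mutant)) {
                    continue; // marked equivalent, excluded from the score
                }
                for (Object input : testSet) {
                    // A differing output on any test input kills the mutant.
                    if (!p.run(input).equals(mutant.run(input))) {
                        killed++;
                        break;
                    }
                }
            }
            return (double) killed / (mutants.size() - equivalents.size());
        }
    }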



Chapter 3

Methodology

This chapter presents the methodology used throughout this thesis to answer the research questions, including the process outline and a description of each step.

An empirical approach was adopted since the problem statement of this thesis is directly reliant on measured data. The experimental model contains the following steps:

1. The codebase of the SUT is statistically analyzed.

2. An appropriate sample space is chosen from the codebase.

3. A second set of unit tests is generated through an automatic test suite generation tool, to perform mutation analysis on and compare performance against.

4. A mutation testing tool is chosen as the mutation system.

5. Mutation analysis is performed on both the original and the generated suites.

6. Common coverage criteria are compared to mutation coverage.

7. The results are evaluated and the performance of mutation analysis is compared between the two data sets.

3.1 Codebase

The SUT is a multi-module project. Each module has its own test suites, build script, dependencies and resources. It was necessary to attain certain statistical measurements of each module to understand the data set. The project was therefore configured to use SonarQube, an open source platform for development teams to continuously manage source code quality and reliability. SonarQube provides code analysis and defect hunting as its core functionality, displaying e.g. Code Smells, Vulnerabilities and Duplicated Lines [SQ01]. For this thesis, the most relevant information is mainly Lines of Code (LOC), Cyclomatic Complexity (CC), Line Coverage (LC) and the number of


unit tests (#UT). LOC is the number of executable lines of code and CC is the number of independent paths through the code.
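As a small, hypothetical illustration of the CC metric (the method is not taken from the SUT), the method below has a cyclomatic complexity of 3: one for the method entry plus one for each decision point.

    class ComplexityExample {
        // CC = 1 (method entry) + 1 (for loop) + 1 (if) = 3
        static int countAbove(int[] values, int max) {
            int n = 0;
            for (int v : values) {   // decision point 1
                if (v > max) {       // decision point 2
                    n++;
                }
            }
            return n;
        }
    }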

3.2 Sample Space

The measurements presented here and the final sample space discussed in section 4.1 can to some extent also be found in the work of Mishra [SM17], since the same codebase was evaluated; that work can be referred to for further information.

Table 3.1 contains data from the statistical measurement of the SUT. Immediately apparent when observing the measurements is that the first 3 modules are larger in terms of LOC. Unit tests are few in number and concentrated in 2 modules.

Name       LOC    CC     LC     #UT
Kenny      81448  16132  5.6%   233
Mark       61248  11520  1.2%   7
Perry      37728  7005   28.5%  172
Sally      16757  2260   0.9%   33
Martin     7269   1197   0.0%   0
Conan      6074   1384   13.1%  74
Coral      5278   965    8.4%   34
Patrick    3285   565    14.4%  10
Derek      3076   598    2.4%   1
Tommy      1745   290    0.0%   0
Brad       1137   209    0.0%   0
Daniel     1132   243    29.8%  5
Emil       917    183    19.7%  2
Uther      831    178    27.8%  17
Danny      819    154    0.0%   0
Francine   585    126    0.0%   0
Sebastian  369    48     0.0%   0
Waldo      164    41     0.0%   0

Table 3.1: Measurements of modules in the SUT

Each module is built separately and tested with its own test suite and can, as such, be looked at individually. To quantify the overall test suite quality of the clearing engine, mutation analysis would have to cover every module. Mutation analysis generates mutants for every mutable line of code, regardless of the absence of tests. Analyzing the quality of the entire project would thus inevitably result in a very low mutation score, and the data would not fairly represent the quality of the unit tests currently in place.

It was deemed appropriate to reduce the modules through a selection process. In the initial selection, modules with no tests were essentially treated as noise, since they serve no practical purpose for the analysis. Modules without tests increase the total LOC, in turn lowering the total coverage and increasing the number of live mutants. The second and last selection aimed to further filter the project down to its core. Intuitively, LC seems to be a logical indicator of a well-tested module; however, those with high coverage turned out to be relatively small. After a discussion with the project team, it was agreed that the test suites within the modules Kenny and Perry best represent the most recent testing efforts. Further, they are two of the largest and most complex modules, constituting a good chunk of the project. Hence, they were chosen as the sample space to be analyzed.

3.3 Generating Unit Tests

Performing mutation analysis on the current test suites is sufficient for evaluating the quality of the present unit tests. However, to assess the cost of mutation analysis, it is necessary to obtain a second set of measurements. Mutation analysis of test suites with a higher number of tests allows a performance comparison. Differences in execution time can be observed and explained through factors affecting the process, e.g. the number of test cases and the code coverage. The numbers of mutants created, killed and never covered by tests can be compared to further understand the differences in results.

Although theoretically possible, it was deemed impractical to create a second set of test suites by hand. Instead, an automatic test suite generation tool called EvoSuite [EVO1] was used to create a second set of test suites, which was analyzed separately from the original suites.

While test cases can be automatically generated, the task of verifying correctness remains a problem. Faults that cause exceptions and program crashes can easily be detected, but only testing for such obvious faults leads to tests of negligible value.

EvoSuite automates the creation of test suites, adopting a search-based approach and state-of-the-art techniques to create tests with small assertions, i.e. tests that also detect small faults that do not cause an exception. EvoSuite also applies an approach that first generates test suites and later optimizes them to achieve a high score on coverage criteria, e.g. line, branch and weak mutation coverage, thus generating test suites with high coverage [EVO1, FA11].
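As a rough illustration of the kind of output such a tool produces, the sketch below shows a small JUnit test in the style of a generated suite. The class under test and all values are hypothetical, not actual EvoSuite output:

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;

    public class PriceCalculatorGeneratedTest {

        // Hypothetical class under test.
        static class PriceCalculator {
            double fee(double notional, int legs) {
                return notional * 0.01 * legs;
            }
        }

        @Test(timeout = 4000)
        public void test0() {
            PriceCalculator calc = new PriceCalculator();
            double fee = calc.fee(100.0, 2);
            // A small assertion on the return value, able to kill mutants
            // that silently change the computation without crashing.
            assertEquals(2.0, fee, 0.01);
        }
    }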

For further information on the inner workings of EvoSuite, the study by Fraser et al. [FA11] can be consulted.

3.4 Mutation System

Mutation analysis can be defined as a two-step process: generate mutants, then check whether the mutants are detected by the tests. Generating mutants is essentially done by creating copies of the source or byte code with small changes. This process is very rarely done by hand and generally uses a mutation system. Although there are several mutation systems available for Java, most are old and come with certain usability issues, such as the lack of support for popular build tools such as Maven or mocking frameworks such as Mockito.

Pitest (PIT) was chosen for mutation analysis. PIT is the most recent of the systems and is actively developed with frequent releases. While other systems were built for research purposes, PIT is meant for development environments [PIT1], more accurately fitting the objective of this thesis project.


PIT applies a set of mutation operators to the byte code, generating a large number of mutant classes. Before exercising the tests against the newly created mutants, PIT first measures the line coverage (LC) of the code base. Employing the coverage information, PIT will, for each mutant, only execute the tests that cover the line containing the mutation. This optimization is significant for inadequate test suites with a large codebase, such as the one examined in this thesis.
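The idea behind this optimization can be sketched as follows, assuming hypothetical Mutant and TestCase types that carry per-test line coverage; this illustrates the principle only and is not PIT's actual implementation:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;

    public class CoverageTargetedExecution {

        record Mutant(String className, int line) {}

        record TestCase(String name, Set<Integer> coveredLines) {}

        // Selects the tests worth running against a given mutant.
        static List<TestCase> candidates(Mutant mutant, List<TestCase> suite) {
            List<TestCase> out = new ArrayList<>();
            for (TestCase test : suite) {
                if (test.coveredLines().contains(mutant.line())) {
                    out.add(test);
                }
            }
            // Mutants with an empty candidate list can never be killed
            // and are reported as "no coverage" in the result tables.
            return out;
        }
    }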

3.5 Mutation Analysis

The mutation analysis is performed using the most stable default mutation operators in PIT. They are defined in the documentation as:

1. Conditionals Boundary Mutator (CBM)
Mutates the relational operators <, <=, > and >= to their boundary counterparts.

2. Increments Mutator (IM)
Mutates increments, decrements, assignment increments and assignment decrements of local variables. For example, i++ would be mutated to i--.

3. Invert Negatives Mutator (INM)
Inverts negation of integers and floating-point numbers, e.g. -i would be mutated to i.

4. Math Mutator (MM)
Replaces binary arithmetic operations, for either integer or floating-point arithmetic, with another operation. For example, a + b would be mutated to a - b.

5. Negate Conditionals Mutator (NCM)
Mutates conditionals, i.e. ==, !=, <=, >=, < and >. This operator overlaps to some extent with the conditionals boundary mutator, but is easier to kill.

6. Return Values Mutator (RVM)
Mutates the return values of method calls. For example, in the case of a boolean return value, false would be mutated to true and vice versa.

7. Void Method Calls Mutator (VMCM)
Removes calls to methods with return type void.
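To make the operators concrete, the hypothetical snippet below annotates ordinary Java code with the mutant each default operator would produce (illustrative only, not PIT output):

    import java.io.PrintStream;

    class OperatorExamples {

        int total(int a, int b) {
            return a + b;          // MM:  a + b  ->  a - b
        }

        boolean atLimit(int i, int limit) {
            return i >= limit;     // CBM: >=  ->  >    NCM: >=  ->  <
        }

        int next(int i) {
            i++;                   // IM:  i++  ->  i--
            return i;
        }

        int negate(int x) {
            return -x;             // INM: -x  ->  x
        }

        boolean isOpen(boolean open) {
            return open;           // RVM: returned boolean is negated
        }

        void log(PrintStream out) {
            out.flush();           // VMCM: call to void method removed
        }
    }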

3.6 Coverage Metrics and Mutation Coverage

An experiment similar to that of Gopinath et al. [GJ14] and Gligoric et al. [GG13] is used in this thesis. The ability of Line Coverage (LC), Branch Coverage (BC) and Path Coverage (PC) to predict the Mutation Score (MS) is evaluated through linear regression analysis.

Regression analysis is a predictive modelling technique that estimates the relationship between a dependent variable (target) and independent variables (predictors) using a best-fit line, i.e. a regression line.


The data set used for the regression analysis is on a per-class basis. Each class was measured using the mentioned coverage criteria; each criterion was then combined with the MS and shown in a scatter graph.

The aim was to determine how well LC, BC and PC could serve as predictors of MS. For that purpose, the coefficient of determination (R2) was calculated. R2 is a measure of how well the regression line approximates the real data, i.e. a high R2 would indicate that the independent variables are good predictors.
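The calculation can be sketched for the simple one-predictor case. The following Java sketch fits a regression line and computes R2 = 1 − SSres/SStot on a handful of hypothetical (LC, MS) data points; the data are invented for illustration and are not the thesis measurements:

    public class SimpleRegression {
        public static void main(String[] args) {
            double[] lc = {0.10, 0.30, 0.50, 0.70, 0.90}; // hypothetical per-class LC
            double[] ms = {0.08, 0.25, 0.45, 0.60, 0.72}; // hypothetical per-class MS

            int n = lc.length;
            double meanX = 0, meanY = 0;
            for (int i = 0; i < n; i++) { meanX += lc[i]; meanY += ms[i]; }
            meanX /= n; meanY /= n;

            // Least-squares estimates of slope and intercept.
            double sxy = 0, sxx = 0;
            for (int i = 0; i < n; i++) {
                sxy += (lc[i] - meanX) * (ms[i] - meanY);
                sxx += (lc[i] - meanX) * (lc[i] - meanX);
            }
            double slope = sxy / sxx;
            double intercept = meanY - slope * meanX;

            // R2 = 1 - SSres/SStot: how well the regression line fits the data.
            double ssRes = 0, ssTot = 0;
            for (int i = 0; i < n; i++) {
                double predicted = intercept + slope * lc[i];
                ssRes += (ms[i] - predicted) * (ms[i] - predicted);
                ssTot += (ms[i] - meanY) * (ms[i] - meanY);
            }
            double r2 = 1 - ssRes / ssTot;
            System.out.printf("MS = %.3f + %.3f * LC, R2 = %.3f%n", intercept, slope, r2);
        }
    }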

The choice of coverage criteria is based on how likely they are to be used by development teams. LC and BC were measured using JaCoCo [JC01], a free code coverage library for Java. PC was measured using JMockit [JM01], a mocking framework meant to be used for unit testing in Java and the only free software able to measure PC for Java.


Chapter 4

Results

This chapter presents the empirical data from the experiments described in the method chapter. The sample space is motivated, followed by the results from performing mutation analysis on both the original and generated test suites. Finally, the results from the regression analysis between common coverage criteria and mutation score are presented.

4.1 The Sample Space

Table 4.1 gives an overview of the modules constituting the sample space. The selection process drastically restricted the number of modules. Although the two modules combined make up half of the SUT, it is not certain that they have similar distributions of the factors that can affect the mutation analysis. A concern at this point is whether this has resulted in a skewed sample space that could jeopardize the integrity of the analysis results.

Name   LOC    CC     LC     #UT
Kenny  81448  16132  5.6%   233
Perry  37728  7005   28.5%  172

Table 4.1: Measurements of modules in the sample space

Factors to consider are the LOC and CC of classes. High LOC indicates a large number of lines to cover, implicitly reducing coverage. High CC indicates a complicated class with a large number of paths, making it difficult to achieve high test quality.

The distribution of LOC and CC per class was measured and can be found in figure 4.1 as histograms. It is apparent from the almost identical shapes of the lines in both graphs that the sample space has kept the original distribution, lending support to the theory that the analysis done on the sample space is reflective of the whole system.


Figure 4.1: Distribution of LOC and complexity per class, after the initial selection (Covered) and last selection (Core) process.

4.2 Mutation Analysis

The analysis was performed on both the original test suites and the test suites generated through EvoSuite. In each case, the test suites for Kenny and Perry were considered separately. The results are presented in tables 4.2, 4.3, 4.5 and 4.6, where each row corresponds to one of the mutation operators. The columns display, for each operator, the number of created mutants, how many of those were killed, how many were left alive and how many were never reached due to the lack of coverage.

4.2.1 Original Test Suites

The results from the mutation analysis of the original test suites can be found in tables 4.2 and 4.3. The mutation score (MS) for both modules is very low, which was to be expected considering the low LC.

It is immediately apparent that some operators, specifically NCM, RVM and VMCM, create most of the mutants. Although this might be affected by the type of code being mutated, it is most likely due to their more applicable nature. For example, NCM overlaps to some degree with CBM but applies to far more situations.

The uneven number of mutants created between the modules can be explained by Kenny having more than twice the LOC of Perry, hence resulting in far more mutants being created.



Operator  Created  Killed      Live  No coverage
CBM       2666     103 (4%)    84    2479
IM        1589     57 (4%)     36    1496
INM       5        1 (20%)     0     4
MM        330      31 (9%)     16    283
NCM       10091    564 (9%)    153   9374
RVM       5143     230 (4%)    48    4865
VMCM      6727     183 (3%)    184   6360
Total     26551    1169 (4%)   521   24861

Table 4.2: Analysis result of Kenny's test suites

Operator  Created  Killed      Live  No coverage
CBM       452      29 (6%)     24    399
IM        184      16 (9%)     2     166
INM       0        0 (0%)      0     0
MM        69       15 (22%)    3     51
NCM       2110     638 (20%)   122   1350
RVM       3017     595 (20%)   198   2224
VMCM      1607     41 (3%)     22    1544
Total     7439     1334 (18%)  371   5734

Table 4.3: Analysis result of Perry's test suites

An observation is that, although the MS is very low for both analyses, the ratio of killed to live mutants overall leans toward the killed. Table 4.4 contains the MS recalculated to only consider mutants with coverage. The test suites are effective in the parts of the code base they cover; the MS is low due to the low coverage and would most likely increase accordingly with higher coverage.

          Kenny              Perry
Operator  Killed      Live   Killed      Live
CBM       103 (55%)   84     29 (55%)    24
IM        57 (61%)    36     16 (89%)    2
INM       1 (100%)    0      0 (0%)      0
MM        31 (66%)    16     15 (83%)    3
NCM       564 (79%)   153    638 (84%)   122
RVM       230 (83%)   48     595 (75%)   198
VMCM      183 (49%)   184    41 (65%)    22
Total     1169 (69%)  521    1334 (78%)  371

Table 4.4: Ratio between killed and live mutants for the analyses of Kenny and Perry

4.2.2 Generated Test Suites

The automatic generation of unit tests yielded new test suites with significantly more unit tests. The test suites generated for Kenny contained 4923 unit tests with 27% LC, compared to the previous 5.6%. The suites generated for Perry contained 2037 unit tests with 40% LC, compared to the previous 28.5%. Although this is a significant increase in coverage, it is still low considering that the suites were generated with the goal of achieving a high coverage score. This can be attributed to the complex codebase and is most likely difficult to remedy.

The results from the mutation analysis of the generated test suites can be found in tables 4.5 and 4.6. The generated suites were analyzed in the same manner as the original suites. The increase in coverage was reflected by a similar increase in MS, strengthening the explanation given in section 4.2.1.

It was a concern that, for automatically generated tests, the MS, i.e. the quality, might be significantly lower than the coverage; this turns out not to be the case.


Operator  Created  Killed       Live  No coverage
CBM       2666     430 (16%)    241   1995
IM        1589     208 (13%)    188   1193
INM       5        0 (0%)       0     5
MM        330      27 (8%)      50    253
NCM       10091    1654 (16%)   780   7657
RVM       5143     1357 (26%)   397   3389
VMCM      6727     1078 (16%)   648   5001
Total     26551    4754 (19%)   2304  19493

Table 4.5: Analysis result of Kenny's generated test suites

Operator  Created  Killed      Live  No coverage
CBM       452      167 (37%)   17    268
IM        184      62 (34%)    6     116
INM       0        0 (0%)      0     0
MM        69       40 (58%)    2     27
NCM       2110     942 (45%)   105   1063
RVM       3017     768 (25%)   86    2163
VMCM      1607     500 (31%)   91    1016
Total     7439     2479 (33%)  307   4653

Table 4.6: Analysis result of Perry's generated test suites

Again, it can be observed that the ratio of killed to live mutants overall leans toward the killed. Table 4.7 contains the MS recalculated to only consider mutants with coverage.

          Kenny              Perry
Operator  Killed      Live   Killed      Live
CBM       430 (64%)   241    167 (91%)   17
IM        208 (52%)   188    62 (91%)    6
INM       0 (0%)      0      0 (0%)      0
MM        27 (35%)    50     40 (95%)    2
NCM       1654 (68%)  780    942 (96%)   105
RVM       1357 (77%)  397    768 (90%)   86
VMCM      1078 (62%)  648    500 (85%)   91
Total     4754 (67%)  2304   2479 (90%)  307

Table 4.7: Ratio between killed and live mutants for the analyses of Kenny and Perry (generated test suites)

4.2.3 Performance

During the two mutation analyses, performance data was gathered for both the original test suites (OTS) and the generated test suites (GTS). Table 4.8 contains an overview with Line Coverage (LC), number of unit tests (#UT), number of covered mutants (#CM), number of executed tests (#ET) and the execution time.

Suite      LC     #UT   #CM   #ET    Exec Time
OTS-Kenny  5.6%   233   1690  4532   4 min 36 sec
GTS-Kenny  27%    4923  7058  71638  3 h 29 min 30 sec
OTS-Perry  28.5%  172   1705  24854  1 h 21 min 21 sec
GTS-Perry  40%    2037  2786  24510  50 min 56 sec

Table 4.8: Summary of mutation analysis performance

The computationally expensive nature of mutation analysis comes from the process of executing entire test suites against every mutant program. PIT, however, will only execute the tests that cover a mutation. The number of covered mutations and the number of unit tests directly determine the number of executed tests, which is the most time-consuming part of mutation analysis.


Clearly visible is the enormous increase in execution time between analyzing OTS-Kenny and GTS-Kenny. Although the LC increased moderately, it alone cannot explain the spike. The increased LC presumably led to more covered mutations; this, combined with the drastic increase in the number of unit tests, increased the number of test executions, hence the spike in execution time.

The difference in execution time between analyzing OTS-Kenny and OTS-Perry is somewhat difficult to explain. Although one test suite has higher coverage than the other, they both cover almost the same number of mutations. This, combined with a similar number of unit tests, should result in similar execution times. The most likely explanation is that only a handful of tests cover any mutations in OTS-Kenny, resulting in a lower number of executed tests than for OTS-Perry. This indicates that the number of tests and the number of covered mutations alone are insufficient predictors of execution time.

The reduction in execution time between analyzing OTS-Perry and GTS-Perry is unexpected. Judging by the case of OTS-Kenny and GTS-Kenny, the increase in LC should increase the execution time. GTS-Perry has higher LC, more unit tests and more covered mutations, yet fewer executed tests, resulting in a shorter execution time. The only explanation for this phenomenon is that, even with over ten times more unit tests, fewer tests in GTS-Perry cover any given mutation compared to OTS-Perry.

4.3 Linear Regression Analysis

The results presented in this section are shared with the thesis work of Mishra [SM17]. Although the data are the same, they are incorporated differently into the two works.

Table 4.9 displays the measured Line Coverage (LC), Branch Coverage (BC) and Path Coverage (PC) for both modules.

Name   LC      BC      PC
Kenny  5.6 %   5 %     2 %
Perry  28.5 %  27.7 %  13 %

Table 4.9: Line, branch and path coverage summary

For the same reasoning as in section 4.1, both LOC and CC were considered in the regression analysis. Table 4.10 contains the estimated coefficients for the saturated regression model between the dependent variable (target) MS and the independent variables (predictors) LC, LOC and CC. Each row in the model represents a predictor. A pValue above 0.05 indicates that the variable is insignificant, and thus LOC and CC could safely be removed from the model.


Predictor      Estimate     Std Error  tStat    pValue
Lines of code  5.0995e-05   4.011e-05  1.2714   0.20383
Complexity     -0.00026416  0.0002234  -1.1825  0.23725
Line coverage  0.81205      0.0086981  93.36    0

Table 4.10: Estimated coefficients for the saturated regression model

Figure 4.2 is the scatter plot between MS and LC. Each data point corresponds to a class in either module, with the size of the circle representing the class's LOC.

The coefficient of determination, R2, is displayed above the regression line and is perhaps the most relevant information here. It indicates how well the regression line fits the data set, i.e. how well the independent variable can predict the dependent variable, in this case how well LC predicts MS.

Figure 4.2: Scatter plot between MS and LC

Figures 4.3 and 4.4 are the results of performing the same process as above with BC and PC.


Figure 4.3: Scatter plot between MS and BC
Figure 4.4: Scatter plot between MS and PC

The result of the linear regression analysis indicates that LC is the most accurate predictor of MS. All three comparisons resulted in moderately high R2 values, which means any one of the three coverage criteria can be used as a predictor of MS with moderate to high accuracy. This result agrees with that of Gopinath et al. [GJ14] and indicates that the same relation tendency found when examining a few hundred projects is also found in this SUT.


Chapter 5

Discussion

This chapter discusses the observed results in regard to the research questions, reflects on the presented findings and concludes the work.

5.1 Quality of Unit Tests

Mutation analysis applies a set of mutation operators to create a set of mutants. The test suites are then executed against these mutants to measure how many mutants can be detected; the purpose is to measure the quality of the test suites. Mutations in uncovered parts of the codebase are never detected, directly lowering the mutation score (MS). Performing mutation analysis on a whole SUT without moderate to high coverage will always result in a low total MS.

It was assumed very early in the thesis that the mutation score for the original suites would be low. This was due to the low coverage and limited number of unit tests, and it was shown to be true immediately after the first mutation analysis.

As mentioned in sections 4.2.1 and 4.2.2, the results also support a different observation. When measuring the number of killed, live and uncovered mutants, it was noted that the ratio between killed and live mutants leaned toward the killed. This was made clearer when recalculating the mutation score with only the covered mutants. When considering only the part of the code base with coverage, the unit tests are surprisingly effective, with around 70% in mutation score.

The above observation can be explained as follows: the test suites are effective in the parts of the code base they cover; the mutation score is low due to the low coverage and would increase accordingly with higher coverage. This, of course, is only true if any newly added test suites maintain the same level of quality.

The generated test suites display the exact same behavior, with the MS being drastically higher when only considering covered mutants, adding to the plausibility of the above explanation.

It was a concern that, for automatically generated tests, the quality would be significantly lower than the line coverage (LC). The concern was based on the fact that the tests would be automatically generated and might not be able to test the behavior of methods to the same degree as hand-written tests. This, however, turns out not to be the case for either Kenny or Perry, with both maintaining an MS close to the corresponding LC. This observation indicates that automatically generated tests are of sufficient quality that developers should consider them as a replacement for unit tests if the current coverage is low, or use them to augment the current test suites. The moderate quality of automatically generated test suites should also remove any concern about the validity of comparisons between the original test suites and the generated ones.

5.2 Cost and Benefits of Mutation Testing

Mutation testing subsumes many other coverage criteria [OV96] and has been shown to predict actual fault detection ability better than other criteria in some settings, but never shown to be worse [GJ14]. Thus, it is difficult to deny the effectiveness of mutation testing. The practicality of mutation testing, however, is very much up for debate.

Pitest (PIT) was chosen for the mutation analysis. PIT applies a set of mutation operators to the byte code, generating a large number of mutant classes. Before exercising the tests against the newly created mutants, PIT first measures the line coverage (LC) of the code base. Employing the coverage information, PIT will, for each mutant, only execute the tests that cover the line containing the mutation. This optimization is significant in reducing the execution time; e.g. the longest execution time during this thesis was 3 hours 30 minutes.

The results of the mutation analysis displayed a drastic increase in execution time when the number of unit tests covering any mutation and the number of covered mutations increased. Let us refer to the situation where two unit tests cover the same mutation as overlapping. Overlapping (as was discussed in the results section) increases the number of test executions without an increase in killed mutants, thus directly increasing execution time without increasing the mutation score (MS).
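
A hypothetical sketch of overlap (the class and tests are invented for illustration): both tests below execute the same line, so PIT runs both of them against every mutant on that line, while the second test is unlikely to kill any mutant the first does not already kill.

```java
import org.junit.Test;
import static org.junit.Assert.assertEquals;

// Hypothetical class under test; PIT could, for example, mutate `+`
// into `-` on the return line.
class Calculator {
    static int add(int a, int b) {
        return a + b;
    }
}

public class CalculatorTest {
    // Both tests cover the single line of add(), i.e. they overlap:
    // every mutant on that line triggers two test executions, but the
    // second execution rarely kills anything new.
    @Test
    public void addsSmallNumbers() {
        assertEquals(5, Calculator.add(2, 3));
    }

    @Test
    public void addsLargeNumbers() {
        assertEquals(5000, Calculator.add(2000, 3000));
    }
}
```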

In a perfect world, a handful of unit tests would cover all mutations with no overlapping. However, it is reasonable to assume that as coverage increases, so does the overlap. Beyond a certain threshold, the increase in execution time when supplementing the test suite will not be worth the increase in mutation score.

The purpose of conducting regression analysis was to assess the ability of common coverage metrics to predict the mutation score, in order to determine whether mutation analysis is truly worth the cost or whether other, cheaper coverage metrics could be used instead.

The results indicate that LC is an effective predictor of the mutation score. Although LC is by no means a replacement for mutation analysis, it can serve as an indicator in practice.
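
The relationship examined by such a regression can be sketched as a simple linear model; the coefficients below are placeholders, not the fitted values reported in the results chapter:

$$MS = \beta_0 + \beta_1 \cdot LC + \varepsilon$$

where $\beta_1$ captures how strongly an increase in line coverage translates into an increase in mutation score, and $\varepsilon$ is the residual error.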

Developers can use LC as the day-to-day measurement of test suite effectiveness in practice, and schedule periodic mutation analysis of the SUT. Through mocking of dependencies and a conscious effort to minimize overlap in coverage between tests, the execution time should be manageable.
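
As a sketch of the kind of dependency mocking meant here, using JMockit [JM01]; the PriceFeed and OrderBook types are hypothetical, invented for this example, and the test assumes the JMockit agent is enabled:

```java
import mockit.Expectations;
import mockit.Mocked;
import org.junit.Test;
import static org.junit.Assert.assertEquals;

// Hypothetical slow external dependency.
interface PriceFeed {
    double bestBid(String symbol);
    double bestAsk(String symbol);
}

// Hypothetical class under test.
class OrderBook {
    private final PriceFeed feed;
    OrderBook(PriceFeed feed) { this.feed = feed; }
    double midPrice(String symbol) {
        return (feed.bestBid(symbol) + feed.bestAsk(symbol)) / 2.0;
    }
}

public class OrderBookTest {
    @Mocked
    PriceFeed feed; // JMockit replaces the dependency with a fast mock

    @Test
    public void midPriceAveragesBidAndAsk() {
        new Expectations() {{
            feed.bestBid("ABC"); result = 9.0;
            feed.bestAsk("ABC"); result = 11.0;
        }};
        // Only OrderBook itself is exercised, so each mutant execution
        // stays fast regardless of how slow the real feed would be.
        assertEquals(10.0, new OrderBook(feed).midPrice("ABC"), 1e-9);
    }
}
```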

5.4 Reflection

The decision to perform mutation analysis on the two modules was due to technical limitations. This approach can be criticized because of the risk that test suites for one module cover parts of the other module. This could result in some lost coverage that could have increased the mutation score, although most likely not in a meaningful way.

Generating a second data set did enable some comparison of performance. However, the legacy code and high interdependency may have contributed to meaningless tests, e.g. test suites that simply call class constructors to add to the LC. The measurements obtained from these test suites might therefore not be genuine.

Performing mutation analysis on the original tests and the generated tests resulted in some interesting data. However, manually creating test suites to measure the performance at different levels of LC and MS would have been more fruitful.

5.5 Sustainability and Societal Aspects

This thesis is a case study of a fault-based testing technique on a software project used within the financial industry; as such, there are very few ethical concerns. From a societal perspective, this thesis is relevant not only for the project team providing the SUT, but also for other development setups with similar projects and a need for higher-quality testing efforts.

From the perspective of economic sustainability, studies in this field contribute to preventing software failures with significant economic consequences [FT01]. This thesis can inspire anyone trying to delve into the subject of mutation testing and higher-quality testing.

5.6 Conclusions

Performing mutation analysis, what is the quality of the present test suites? The mutation score for the SUT is low, indicating that very few of the created mutants are detected. However, when considering only the part of the code base with coverage, the unit tests are surprisingly effective, with a mutation score of around 70%. Hence, it is reasonable to assume that the current unit tests are of high quality, albeit covering only a small portion of the system.

What are the costs and benefits of adopting mutation testing? Mutation testing subsumes many other coverage criteria [OV96] and has been shown to predict actual fault detection ability better than other criteria in some settings, and never shown to be worse [GJ14]. Thus, the benefits of mutation testing are difficult to dispute. The concern is the practicality of the testing method, with the execution time being the most significant factor. This turns out to be reasonable when using a more modern mutation system, such as the one used within this degree project. Two recommendations on how mutation testing could be used follow.

The overlap in coverage between tests was shown to be a major contributor to the high execution time of mutation analysis. Minimizing the number of tests, maximizing the number of covered mutations, and minimizing the overlap in coverage between tests should result in the best possible execution time.

Performing regression analysis on the original test suites showed that LC performed the best. Developers can thus use LC as the day-to-day measurement of test suite effectiveness in practice, and schedule periodic mutation analysis of the SUT.

Another, unexpected conclusion of this thesis project concerned the automatically generated unit tests. The generated test suites had significantly higher coverage than the original test suites, and mutation analysis revealed that their mutation score was also higher than that of the original suites. This means that the generated test suites not only cover more of the code base but are also effective in doing so. It can be concluded that automatically generated suites can replace hand-written test suites if the current coverage is low, or be used to augment the hand-written suites.

5.7 Future Work

Although this was a case study, further work could be done to extend the results gathered in this thesis. It would be interesting to create test suites following the approach suggested in this thesis and to measure the performance at different levels of LC. This would yield findings about scalability and either support or refute the conclusions drawn in this thesis.

References

[YH14] X. Yao, M. Harman, and Y. Jia, "A study of equivalent and stubborn mutation operators using human analysis of equivalence," in Proceedings of the International Conference on Software Engineering, pp. 919–930, 2014.

[VM97] J. Voas and G. McGraw, Software Fault Injection: Inoculating Programs Against Errors. John Wiley & Sons, 1997.

[OR93] A. J. Offutt, G. Rothermel, and C. Zapf, "An experimental evaluation of selective mutation," in Proceedings of the Fifteenth International Conference on Software Engineering, Baltimore, MD, pp. 100–107, IEEE Computer Society Press, May 1993.

[WD94] W. E. Wong, M. E. Delamaro, J. C. Maldonado, and A. P. Mathur, "Constrained mutation in C programs," in Proceedings of the 8th Brazilian Symposium on Software Engineering, Curitiba, Brazil, pp. 439–452, October 1994.

[HO82] W. E. Howden, "Weak Mutation Testing and Completeness of Test Sets," IEEE Transactions on Software Engineering, vol. 8, no. 4, pp. 371–379, July 1982.

[DG88] R. A. DeMillo, D. S. Guindi, K. N. King, W. M. McCracken, and A. J. Offutt, "An Extended Overview of the Mothra Software Testing Environment," in Proceedings of the 2nd Workshop on Software Testing, Verification, and Analysis (TVA'88), Banff, Alberta, Canada: IEEE Computer Society, July 1988, pp. 142–151.

[MA91] A. P. Mathur, "Performance, Effectiveness, and Reliability Issues in Software Testing," in Proceedings of the 5th International Computer Software and Applications Conference (COMPSAC'79), Tokyo, Japan, 11–13 September 1991, pp. 604–605.

[OL96] A. J. Offutt, A. Lee, G. Rothermel, R. Untch, and C. Zapf, "An Experimental Determination of Sufficient Mutation Operators," ACM Transactions on Software Engineering and Methodology, vol. 5, pp. 99–118, April 1996.

[KO91] K. N. King and A. J. Offutt, "A Fortran language system for mutation-based software testing," Software: Practice and Experience, 21(7):685–718, July 1991.

[OC94] A. J. Offutt and W. M. Craft, "Using Compiler Optimization Techniques to Detect Equivalent Mutants," The Journal of Software Testing, Verification and Reliability, 4(3):131–154, September 1994.

[OP97] A. J. Offutt and J. Pan, "Automatically Detecting Equivalent Mutants and Infeasible Paths," The Journal of Software Testing, Verification, and Reliability, vol. 7, no. 3, pp. 165–192, September 1997.

[BA82] T. A. Budd and D. Angluin, "Two Notions of Correctness and Their Relation to Testing," Acta Informatica, 18(1):31–45, March 1982.

[GO92] R. Geist, A. J. Offutt, and F. C. Harris, "Estimation and Enhancement of Real-Time Software Reliability Through Mutation Analysis," IEEE Transactions on Computers, 41(5), May 1992.

[AO17] P. Ammann and J. Offutt, Introduction to Software Testing, Second Edition. Cambridge University Press, New York, NY, 2017.

[HO76] W. E. Howden, "Reliability of the path analysis testing strategy," IEEE Transactions on Software Engineering, SE-2(3):208–214, September 1976.

[JH11] Y. Jia and M. Harman, "An Analysis and Survey of the Development of Mutation Testing," IEEE Transactions on Software Engineering, vol. 37, no. 5, pp. 649–678, September 2011.

[BD80] T. A. Budd, R. A. DeMillo, R. J. Lipton, and F. G. Sayward, "Theoretical and Empirical Studies on Using Program Mutation To Test The Functional Correctness of Programs," in Proceedings of the 7th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, Las Vegas, Nevada, pp. 220–233, January 28–30, 1980.

[DL78] R. A. DeMillo, R. J. Lipton, and F. G. Sayward, "Hints on Test Data Selection: Help for the Practicing Programmer," Computer, 11(4):34–41, April 1978.

[AB79] A. T. Acree, T. A. Budd, R. A. DeMillo, R. J. Lipton, and F. G. Sayward, "Mutation Analysis," Technical Report GIT-ICS-79/08, Georgia Institute of Technology, Atlanta, Georgia, 1979.

[OU00] A. J. Offutt and R. H. Untch, "Mutation 2000: Uniting the Orthogonal," in Mutation 2000: Mutation Testing in the Twentieth and the Twenty First Centuries, San Jose, CA, pp. 45–55, October 2000.

[OV96] A. J. Offutt and J. M. Voas, "Subsumption of condition coverage techniques by mutation testing," Technical Report, 1996.

[GJ14] R. Gopinath, C. Jensen, and A. Groce, "Code coverage for suite evaluation by developers," in ICSE, pp. 72–82, 2014.

[NH17] Nasdaq.com. (2017). Retrieved 16 May 2017, from http://www.nasdaq.com/about/about_nasdaq.aspx

[GG13] M. Gligoric, A. Groce, C. Zhang, R. Sharma, M. A. Alipour, and D. Marinov, "Comparing non-adequate test suites using coverage criteria," in ACM International Symposium on Software Testing and Analysis, ACM, 2013.

[NW11] S. A. Nica, R. Ramler, and F. Wotawa, "Is Mutation Testing Scalable for Real-World Software Projects?" in The Third International Conference on Advances in System Testing and Validation Lifecycle, 2011.

[PIT1] Pitest.org. (2017). Retrieved 12 May 2017, from http://pitest.org/

[EVO1] EvoSuite.org. (2017). Retrieved 16 May 2017, from http://www.evosuite.org/evosuite/

[SQ01] Documentation - SonarQube Documentation. (n.d.). Retrieved 16 May 2017, from https://docs.sonarqube.org/display/SONAR/Documentation

[WQR] "Capgemini Releases World Quality Report 2016," Entertainment Close-up, 21 September 2016.

[MR01] T. Murnane and K. Reed, "On the Effectiveness of Mutation Analysis as a Black Box Testing Technique," in 13th Australian Software Engineering Conference (ASWEC'01), Canberra, Australia, 27–28 August 2001, p. 0012.

[DM96] M. E. Delamaro, J. C. Maldonado, and A. P. Mathur, "Integration Testing Using Interface Mutation," in Proceedings of the Seventh International Symposium on Software Reliability Engineering (ISSRE'96), White Plains, NY, pp. 112–121, 1996.

[FT01] Financial Times. (n.d.). Retrieved 24 May 2017, from https://www.ft.com/content/9657d306-4d7c-11e5-b558-8a9722977189

[JC01] JaCoCo Java Code Coverage Library. (2017, March 21). Retrieved 4 June 2017, from http://www.eclemma.org/jacoco/

[JM01] JMockit - An automated testing toolkit for Java. (n.d.). Retrieved 4 June 2017, from http://jmockit.org/

[SM17] "Analysis of test coverage metrics in a business critical setup," MSc thesis, KTH Royal Institute of Technology, 2017.

[FA11] G. Fraser and A. Arcuri, "EvoSuite: automatic test suite generation for object-oriented software," in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, Szeged, Hungary, September 5–9, 2011.
