
Improving Test Adequacy and Software Reliability with Practices of Statistical Testing

Yufeng XUE 1, Lan LIN 1, John C. Tucker 2, Becky Hammons 2, Michael Wolfe 2

1. Department of Computer Science, Ball State University, Muncie, USA

2. Ontario Systems, LLC, 1150 West Kilgore Avenue, Muncie, USA

{yxue2, llin4}@bsu.edu, {john.tucker, becky.hammons, michael.wolfe}@ontariosystems.com

Abstract-Statistical testing based on a Markov chain usage model, as a rigorous testing method, has been around for more than two decades. Through the comprehensive application of statistical science to the testing of software, it provides audit trails of evidence to support correctness arguments for a software-intensive system as well as a decision that the system is of requisite quality for its intended use. This paper reports a real-world case study in which we applied standard statistical testing practices to the phone flag copy testing problem at the site of our industrial collaborator, and presents our solution from problem formalization, usage modeling, and model analysis to test case generation and analysis. Our results helped evaluate the coverage of a heuristically generated test suite, shed light on what other test cases to craft or generate to improve test coverage and adequacy, and improved reliability estimates both at the arc (usage event) level and at the path (system) level.

Keywords-statistical testing; Markov chain usage model; test case generation; test coverage; test adequacy; reliability estimate

I. INTRODUCTION AND RELATED WORK

Software-intensive systems have become quite large over the years, with systems of ten million lines of source code now common. "Any large, complex, expensive process with myriad ways to do most activities," as is the case with software development, "can have its cost-benefit profile dramatically improved by the use of statistical science." [7] Statistical testing based on a Markov chain usage model is the comprehensive application of statistical science to the testing of software [7, 6, 12, 9, 13, 8, 15, 17, 16], to solve the problems posed by industrial software development. Using the structure that statistics provide for collecting data and transforming the data into information that can improve decision making under uncertainty, statistical testing enables "efficient collection of empirical data that will quantify the behavior of the software-intensive system and support economic decisions regarding deployment of dependable systems." [7]

As a rigorous testing method developed by the University of Tennessee Software Quality Research Laboratory (UTK SQRL), Markov chain usage-based statistical testing has been around for more than two decades [7, 6, 12, 9, 13, 8, 15, 17, 16] and has been successfully applied in a variety of industry and government projects, ranging from medical devices to automotive components to scientific instrumentation, to name a few [3, 2, 14]. Sayre and Poore [14] reported a project with the Oak Ridge National Laboratory in which test models were created for approximately forty programs in a library to support theoretical physics calculations. Bauer et al. [2] reported a collaboration with the Fraunhofer Institute for Experimental Software Engineering in Germany through an example of a tool chain to support statistical testing of an embedded control unit for a car door mirror. Verum (in the Netherlands) reported a large-scale industrial application of statistical testing to medical devices [3]. They have also integrated this testing method into the Verum Compliance Test Framework for the certification of industrial software. The direct benefit statistical testing provides is a quantitative analysis of the system's quality using empirical data that can be used to demonstrate, document, and certify that the system is fit for its intended use.

Although the theory underlying statistical testing has been well established [7, 6, 12, 9, 13, 8, 15, 17, 16], it remains problem specific to work out all the details needed to implement statistical testing, and to integrate it into the existing verification and validation process. In this paper we illustrate how we applied standard statistical testing practices, i.e., usage modeling, model analysis, test case generation, and test case analysis, and the supporting tool, i.e., the JUMBL (J Usage Model Builder Library, also developed by UTK SQRL [11, 1]), to approach a real-world testing problem, and how our results helped evaluate the coverage of a heuristically generated test suite and shed light on what other test cases to craft or generate to improve test coverage and adequacy.

The remainder of the paper is organized as follows. Section II briefly overviews the statistical testing method and the process. Section III introduces our case study: the phone flag copy testing problem. Section IV presents our solution applying practices of statistical testing, from problem formalization to usage modeling, model analysis, test case generation, and test case analysis, and uses the results to evaluate a heuristically generated test suite. Finally, Section V concludes the paper and points out directions for future work.


Figure 1. Steps of Markov chain usage-based statistical testing. [Figure omitted.]

Figure 2. The phone flag copy testing problem. [Figure omitted; each table entry takes the form "phone number - phone field   phone flag". Account 1 lists Phone no 1 (home), Phone no 2 (place of employment), Phone no 3 (attorney), and Phone no 1 (user defined phone field); Account 2, a route multiple, lists Phone no 4 (home), Phone no 1 (responsible party), and Phone no 5 (place of employment); Account 3, an agency multiple, lists Phone no 6 (home), Phone no 7 (responsible party), and Phone no 1 (debtor phone).]

II. STATISTICAL TESTING BASED ON A MARKOV CHAIN USAGE MODEL

In statistical testing a Markov chain usage model is first developed, based on historical or projected usage data, that depicts the intended use of the software in the field and represents the population of all possible use cases. All kinds of statistics can then be computed routinely from the model, providing a basis for model validation, revision, and test planning. From the validated model one then generates test cases by walking the graph, by applying graph algorithms, or by sampling. Test scripts can be associated with arcs of the usage model, which become instructions to manual testers or automated test runners. Pass and fail data are recorded and analyzed for reliability estimation, coverage analysis, or stopping decisions. Fig. 1 shows the steps of the statistical testing process. Markov chain usage-based statistical testing [7, 6, 12, 9, 13, 8, 15, 17, 16] supports quantitative certification of software by statistical protocol for standards compliance, as well as for the construction and evaluation of assurance cases for dependable systems [5, 4]. A public domain tool supporting statistical testing (JUMBL: J Usage Model Builder Library, developed by UTK SQRL) is freely available [11, 1].
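
To make the generation step concrete, here is a minimal sketch in Python (our illustration, not the JUMBL's API): a toy usage model is a mapping from each state to its outgoing arcs, and one test case is drawn by a probability-weighted random walk from the source to the sink. All state and event names are invented for the example.

    import random

    # A toy usage model (illustrative, not the model of Fig. 3): each state maps
    # to its outgoing arcs, and each arc carries a usage event (stimulus), the
    # next state, and a transition probability.
    USAGE_MODEL = {
        "Enter": [("invoke feature", "Main", 1.0)],
        "Main":  [("define input", "Main", 0.5),
                  ("finish", "Exit", 0.5)],
    }

    def generate_test_case(model, source="Enter", sink="Exit"):
        """Probability-weighted random walk; the event sequence is one test case."""
        state, events = source, []
        while state != sink:
            r, acc = random.random(), 0.0
            for event, next_state, prob in model[state]:
                acc += prob
                if r < acc:                 # sample this outgoing arc
                    events.append(event)
                    state = next_state
                    break
        return events

    print(generate_test_case(USAGE_MODEL))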

III. CASE STUDY: TESTING OF THE PHONE FLAG COPY FEATURE

We were interested in testing the phone flag copy feature implemented in a real-world application by our industry collaborator. Each user account contains a table of phone fields with associated phone numbers and phone flags, as illustrated in Fig. 2 (each table entry takes the form "phone number - phone field   phone flag").

User accounts can be grouped as route multiples or agency multiples based on the following criteria:


- If two different accounts share the same person name, the same person Social Security Number (SSN), and are in the same phase of processing, they are route multiple accounts.

- If two different accounts share the same person name, the same person SSN, but are in different phases of processing, they are agency multiple accounts.

Fig. 2 assumes that Account 1 and Account 2 are route multiples, and Account 1 and Account 3 are agency multiples.

When a phone number of a specific phone field (in an account) has its phone flag changed, how the phone flag change should be copied to the same phone number in the same account or possibly different accounts depends on two things:

- The "where to copy" flag: "A" indicates the same phone number in the same account only (with possibly different phone fields), "RM" indicates the same phone number not only in the same account but also in any route mUltiple account, and "AM" indicates the same phone number in the same account, or in any route mUltiple account, or in any agency multiple account.

- Whether the changed phone flag is on a list of phone flags whose changes need to be copied: "yes" or "no".

For instance, as shown in Fig. 2, if the phone flag of "Phone no 1 - home" is changed from "B" to "P", the "where to copy" flag is "AM", and "B" is on the list of phone flags whose changes need to be copied, there are three other entries whose phone flags should be changed into "P" as a result:

- Account 1, Phone no 1 - user defined phone field (original phone flag is "G")

- Account 2, Phone no 1 - responsible party (original phone flag is "N")

- Account 3, Phone no 1 - debtor phone (original phone flag is " ")

The question is: how do we test systematically that the phone flag copy feature is implemented correctly?
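
Before formalizing, we can state our reading of the copy rule as a small Python sketch (our own formulation; the account names, field names, and flags follow the example of Fig. 2 and are otherwise hypothetical). It reproduces the three affected entries listed above:

    # Each account maps to a list of (phone number, phone field, phone flag).
    ACCOUNTS = {
        "Account 1": [("1", "home", "B"), ("2", "place of employment", "?"),
                      ("3", "attorney", "?"), ("1", "user defined phone field", "G")],
        "Account 2": [("4", "home", "?"), ("1", "responsible party", "N"),
                      ("5", "place of employment", "?")],   # route multiple of Account 1
        "Account 3": [("6", "home", "?"), ("7", "responsible party", "?"),
                      ("1", "debtor phone", " ")],          # agency multiple of Account 1
    }
    RELATION = {("Account 1", "Account 2"): "RM", ("Account 1", "Account 3"): "AM"}

    def related(a, b):
        if a == b:
            return "A"
        return RELATION.get((a, b)) or RELATION.get((b, a))

    def copy_targets(accounts, src, src_index, pn, where, on_copy_list):
        """Entries (account, field index) whose flag should change as well."""
        if not on_copy_list:                 # changed flag is not on the copy list
            return []
        allowed = {"A": {"A"}, "RM": {"A", "RM"}, "AM": {"A", "RM", "AM"}}[where]
        return [(acct, i)
                for acct, fields in accounts.items()
                if related(src, acct) in allowed
                for i, (number, field, flag) in enumerate(fields)
                if number == pn and not (acct == src and i == src_index)]

    # Changing the flag of "Phone no 1 - home" on Account 1 with "AM":
    print(copy_targets(ACCOUNTS, "Account 1", 0, "1", "AM", True))
    # -> [('Account 1', 3), ('Account 2', 1), ('Account 3', 2)]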

IV. APPLYING STATISTICAL TESTING PRACTICES TO IMPROVE TEST ADEQUACY

In this section we show how we attacked this testing problem using practices of statistical testing. We first formalized the testing problem, then developed a Markov chain usage model and performed a model analysis. We created a test suite containing path coverage test cases and a random sample, and performed a statistical analysis. We also analyzed a suite of heuristic test cases provided by our collaborator, mapped them to the usage model, and performed a statistical analysis. Usage modeling revealed 63 test paths that were not covered by heuristic testing. Path coverage sampling and random sampling combined also considerably improved reliability estimates both at the arc (usage event) level and at the path (system) level.


A. Formalization of the Testing Problem

Let A be the set of all (user) accounts, and RA, RRM, and RAM be three binary relations on A defined by

- a1 RA a2 iff a1 = a2

- a1 RRM a2 iff a1 ≠ a2, but a1 and a2 have the same person name, the same person SSN, and the same processing phase

- a1 RAM a2 iff a1 and a2 have the same person name and the same person SSN, but different processing phases.

For any a1, a2 in A, one and only one of the following must hold:

- a1 RA a2 (i.e., they are the same account)

- a1 RRM a2 (i.e., they are route multiple accounts)

- a1 RAM a2 (i.e., they are agency multiple accounts)

- ¬(a1 RA a2), ¬(a1 RRM a2), ¬(a1 RAM a2) (i.e., they are different accounts that are neither route multiples nor agency multiples).
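
This partition transcribes directly into code; the sketch below (with illustrative field names of our choosing) returns exactly one of the four mutually exclusive cases:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Account:
        account_id: str
        name: str          # person name
        ssn: str           # person Social Security Number
        phase: str         # phase of processing

    def classify(a1: Account, a2: Account) -> str:
        """Return exactly one of RA, RRM, RAM, or 'none' for a pair of accounts."""
        if a1.account_id == a2.account_id:
            return "RA"                     # the same account
        if (a1.name, a1.ssn) == (a2.name, a2.ssn):
            return "RRM" if a1.phase == a2.phase else "RAM"
        return "none"                       # neither route nor agency multiples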

We assume the following terminology and notation:

- Let N denote the set of all positive integers.

- Let PN and PF be the set of all phone numbers, and the set of all phone flags, respectively.

- Consider a function f of the form f : X → Y × Z. If f(x) = (y, z), we write f_l(x) = y and f_r(x) = z; f_l : X → Y and f_r : X → Z are the two functions defined by f. Each account a in A defines a function f_a : N → PN × PF. Informally, the i-th (i = 1, 2, 3, ...) phone field has an associated phone number f_a,l(i) and an associated phone flag f_a,r(i).

A test case must provide the following inputs:

- two accounts a1, a2 in A that can be related by RA, RRM, RAM, or none of the above

- a phone number pn for some phone field in a1 with its associated phone flag pf, i.e., there exists some i in N such that f_a1(i) = (pn, pf)

- a predicate p that is true iff pf is on the list of phone flags whose changes need to be copied

- a string indicating where to copy; the string is from the set {A, RM, AM} for copying to all accounts that are related by RA only (i.e., to this account only), by either RA or RRM (i.e., to this account and any route multiple account), or by any of RA, RRM, and RAM (i.e., to this account, any route multiple account, and any agency multiple account), respectively.


In developing a usage model for testing purposes our focus is therefore on a sequence of steps in which all the inputs needed for a test case will be defined, and all possible combinations of inputs will be covered (at a certain level of abstraction) and represented in the model.
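
For concreteness, the four inputs above can be bundled into a single record; the sketch below uses our own illustrative names, which are not part of the application under test.

    from dataclasses import dataclass

    # One test case's inputs, as enumerated above (our names, not the product's).
    @dataclass
    class PhoneFlagCopyInputs:
        a1: str            # account whose phone flag is changed
        a2: str            # second account: related by RA, RRM, RAM, or none
        pn: str            # phone number of some phone field i of a1 ...
        pf: str            # ... with f_a1(i) = (pn, pf)
        p: bool            # True iff pf is on the list of flags to be copied
        where: str         # "A", "RM", or "AM"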

B. Usage Modeling and Model Analysis

To define all the inputs needed for a test case, we first decide how a1 and a2 should be related. We consider two cases. In each case the definition of all the test inputs is done in a sequence of steps:

Case 1: a1 RA a2 (implying a1 = a2)

1. Define a1 (i.e., a2), pn, and pf subject to f_a1(i) = (pn, pf) for some i in N. Supposedly pn is the phone number on the account a1 whose phone flag pf is to be changed. Without loss of generality we assume a1 contains 15 phone fields. Each phone field corresponds to a phone number and a phone flag. Let C1 = |{i : f_a1,l(i) = pn}|. C1 represents how many times pn appears in a1 (in possibly different phone fields). We consider three sub-cases: (i) C1 = 1, (ii) 1 < C1 < 15, (iii) C1 = 15. (i) and (iii) describe two extreme cases: pn appears in a1 only once, or pn is the only phone number that appears in all the phone fields of a1, respectively. (ii) describes the case in which a1 has pn appearing in at least two phone fields, and a1 contains at least one other number different than pn.

2. Define p. We consider p = true and p = false, for pf being on, or not on, the list of phone flags whose changes need to be copied, respectively.

3. Define the string indicating where to copy. It could be A, or RM, or AM as illustrated in Section IV.A.

Case 2: ¬(a1 RA a2), implying three sub-cases: (i) a1 RRM a2, (ii) a1 RAM a2, (iii) ¬(a1 RA a2), ¬(a1 RRM a2), ¬(a1 RAM a2)

1. Define a1, pn, and pf subject to f_a1(i) = (pn, pf) for some i in N. Supposedly pn is the phone number on the account a1 whose phone flag pf is to be changed.

2. Define a2. Without loss of generality we assume a2 contains 15 phone fields. Each phone field corresponds to a phone number and a phone flag. Let C2 = |{i : f_a2,l(i) = pn}|. C2 represents how many times pn appears in a2 (in possibly different phone fields). We consider three sub-cases: (i) C2 = 0, (ii) 0 < C2 < 15, (iii) C2 = 15. (i) and (iii) describe two extreme cases: pn does not appear in a2 at all (this is possible as a1 and a2 are two different accounts), or pn is the only phone number that appears in all the phone fields of a2, respectively. (ii) describes the case in which a2 has pn appearing at least once, and a2 contains at least one other number different than pn.


Figure 3. A usage model for the phone flag copy testing. Arcs are labeled from left to right, and from top to bottom, as a1, ..., a43. We assume the uniform probability distribution for all arcs not labeled with a probability out of every state. [Figure omitted.]

TABLE I. MODEL STATISTICS

Node Count: 23 nodes
Arc Count: 43 arcs
Stimulus Count: 43 stimuli
Expected Test Case Length: 5.75 events
Test Case Length Variance: 2.844 events
Transition Matrix Density (Nonzeros): 83.1758034E-3 (44 nonzeros)
Undirected Graph Cyclomatic Number: 22

3. Define p. We consider p = true and p = false, for pf being on, or not on, the list of phone flags whose changes need to be copied, respectively.

4. Define the string indicating where to copy. It could be A, or RM, or AM as illustrated in Section IV.A.

The process discussed above defines the structure of a usage model for the phone flag copy testing. Fig. 3 shows the model structure. To avoid clutter we only have the states and some arcs labeled in the figure. Arcs are labeled from left to right, and from top to bottom, as a1, ..., a43.

Next we assigned probabilities to all the arcs of the usage model based on knowledge obtained from the customers' data. Except for the arcs out of States 2 and 9 (based on values of C1 and C2), we assume the uniform probability distribution for all the outgoing arcs from every state (see Fig. 3). This step completes the statistical phase of the usage model development.

We performed a model analysis using the JUMBL [1]. Table I shows the model statistics, including the number of nodes, arcs, and stimuli (distinct usage events) in the usage model, and the expected test case length (the mean value, i.e., the average number of steps in a randomly generated test case) and its variance.

There are a number of other statistics that are also computed for every node, every arc, and every stimulus of the usage model, i.e., occupancy, probability of occurrence, mean occurrence, and mean first passage. These statistics are validated against what is known or believed about the application domain and the environment of use.
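
These quantities follow from standard computations on the underlying Markov chain [10]. As a small illustration (a four-state toy chain, not the 23-node model of Fig. 3), the expected test case length can be read off the fundamental matrix N = (I - Q)^(-1), where Q is the transient (non-sink) block of the transition matrix:

    import numpy as np

    # A four-state illustration: states 0-2 are transient, state 3 is the sink.
    P = np.array([
        [0.0, 0.6, 0.4, 0.0],   # source
        [0.0, 0.0, 0.5, 0.5],
        [0.0, 0.0, 0.0, 1.0],
        [0.0, 0.0, 0.0, 1.0],   # sink (absorbing)
    ])
    Q = P[:3, :3]                          # transient-to-transient block
    N = np.linalg.inv(np.eye(3) - Q)       # fundamental matrix: expected visits
    print(N[0].sum())                      # expected test case length from the
                                           # source: 2.3 events for this toy chain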


Figure 4. A test case for the phone flag copy testing. [Screenshot omitted.]

TABLE II. EXCERPTS OF THE TEST CASE ANALYSIS WITH 72 EXHAUSTIVE PATHS: RELIABILITIES

Single Event Reliability: 0.919740316
Single Event Variance: 196.600032E-6
Single Event Optimum Reliability: 0.919740316
Single Event Optimum Variance: 196.600032E-6
Single Use Reliability: 0.621234246
Single Use Variance: 23.8904056E-3
Single Use Optimum Reliability: 0.621234246
Single Use Optimum Variance: 23.8904056E-3
Arc Source Entropy: 0.736566543 bits
Kullback Discrimination: 0.179227483 bits
Relative Kullback Discrimination: 24.333%
Optimum Kullback Discrimination: 0.179227483 bits
Optimum Relative Kullback Discrimination: 24.333%

Since each arc corresponds to the definition of one or more inputs that make up a test case, and the input values could change from one test case to another, the model is abstract, i.e., the same test case could succeed in some runs and fail in others depending on the specific test inputs selected for different runs. In executing a test case crafted/generated from the usage model, each arc corresponds to the selection of one or more input parameters that satisfy a specific constraint (e.g., choose how the two accounts a1 and a2 should be related; choose a phone number whose phone flag is to be changed in a1 such that this phone number appears in a1 exactly once). The concrete values of the inputs (e.g., what a1 looks like; what the phone number of the phone field on a1 whose phone flag is to be changed is) will be determined by the tester for each run.

C. Test Case Generation and Analysis

The Markov chain usage model represents the population of all possible uses. Every path that starts with the source and ends with the sink corresponds to a possible use of the system under test, and a valid test case. Note that for the phone flag copy testing the model does not contain any loops, as opposed to the model for an embedded system, for instance, due to the nature of the testing problem, i.e., each step corresponds to the definition of one or more inputs that constitute a typical test case.

It can be observed from Fig. 3 that the model contains 72 distinct paths starting with the source and ending with the sink. 18 of the 72 paths go through Node 2 (the top part of the model), and the remaining 54 paths go through Node 9 (the bottom part). We can exhaustively test all the 72 paths, although the number of possible inputs we could select for each path (test case) is astronomical.
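
The count of 72 can be checked directly against the case structure of Section IV.B; the short sketch below (with our own labels for the sub-cases) enumerates the combinations:

    from itertools import product

    # Case 1 (a1 RA a2): three sub-cases for C1, two for p, three for where-to-copy.
    case1 = list(product(["C1=1", "1<C1<15", "C1=15"],
                         ["p=true", "p=false"],
                         ["A", "RM", "AM"]))
    # Case 2 (not a1 RA a2): three relation sub-cases, then C2, p, where-to-copy.
    case2 = list(product(["RRM", "RAM", "neither"],
                         ["C2=0", "0<C2<15", "C2=15"],
                         ["p=true", "p=false"],
                         ["A", "RM", "AM"]))
    print(len(case1), len(case2), len(case1) + len(case2))   # 18 54 72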


Fig. 4 shows a test case (i.e., the topmost path) as a sequence of arcs (steps) traversing the usage model, starting with the source and ending with the sink.

We first manually crafted the 72 distinct test cases in the JUMBL, recorded them all as successful, and performed a test case analysis (see Table II). Since the 72 tests cover all paths (and therefore all nodes and all arcs) of the usage model, there is no need to generate minimum coverage test cases for that purpose (i.e., for model coverage).

TABLE III. EXCERPTS OF THE TEST CASE ANALYSIS WITH 72 EXHAUSTIVE PATHS AND 1000 RANDOM PATHS: RELIABILITIES

Single Event Reliability: 0.994013065
Single Event Variance: 919.121939E-9
Single Event Optimum Reliability: 0.994013065
Single Event Optimum Variance: 919.121939E-9
Single Use Reliability: 0.966221325
Single Use Variance: 1.00926881E-3
Single Use Optimum Reliability: 0.966221325
Single Use Optimum Variance: 1.00926881E-3
Arc Source Entropy: 0.736566543 bits
Kullback Discrimination: 4.72307613E-3 bits
Relative Kullback Discrimination: 0.641228709%
Optimum Kullback Discrimination: 4.72307613E-3 bits
Optimum Relative Kullback Discrimination: 0.641228709%

TABLE IV. NODE/ARC/STIMULUS COVERAGE FOR THE TEST SUITES INCLUDING THE 72 EXHAUSTIVE PATHS

Nodes Generated: 23 nodes / 23 nodes (1)
Arcs Generated: 43 arcs / 43 arcs (1)
Stimuli Generated: 43 stimuli / 43 stimuli (1)
Nodes Executed: 23 nodes / 23 nodes (1)
Arcs Executed: 43 arcs / 43 arcs (1)
Stimuli Executed: 43 stimuli / 43 stimuli (1)

Then we randomly generated 1000 test cases based on the arc probabilities, and recorded them all as successful. Now our test suite contains 1072 tests (72 path coverage tests + 1000 random tests). We performed a test case analysis using the JUMBL. The analysis is shown in Table III.

It can be observed that with the 72 path coverage tests only, the single event reliability (i.e., the probability of a randomly chosen arc/event being successful) is high (0.919740316), but the single use reliability (i.e., the probability of a randomly chosen use being successful) is only 0.621234246, due to the complexity of the usage model and the small sample of tests. With 1000 more randomly generated uses running successfully, both the single event reliability and the single use reliability improve: single event reliability reaches 0.994013065, and single use reliability reaches 0.966221325 given the statistical protocol.

Table IV shows the node/arc/stimulus coverage for the two test suites including the 72 path coverage tests. The generation counts (how many nodes/arcs/stimuli appear in the generated test cases) and the execution counts (how many nodes/arcs/stimuli appear in the executed test cases) all reach 100% coverage due to the achieved path coverage.


The statistics shown in Tables II, III, and IV are computed through standard Markov computations. The derivations are detailed in a technical report [10]. System end-to-end reliability (i.e., single use reliability) is computed via analytical means in closed form, following the arc-based Bayesian model [12], which takes into consideration the Markov chain structure, the probabilities on the arcs, and the experience obtained through testing the sample. Details of the derivations are beyond the scope of this paper. Interested readers should refer to [10, 12].

Also note that an arc-based Bayesian model is used to compute the single use reliability estimate [12]. It takes into consideration both the Markov chain usage model and the testing experience. Our test suite specifically covered the 72 exhaustive paths due to the nature of this problem (the model does not contain any loops) and also as a sanity check (to make sure every node/arc/stimulus is seen at least once in testing).
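
The effect of added tests on these estimates can be illustrated with a much-simplified sketch of the arc-based Bayesian idea; this is our approximation, not the exact closed-form computation of [10, 12]. Give each arc a Beta(1, 1) prior, so that s observed successes and f failures yield a posterior mean reliability of (s + 1)/(s + f + 2), and take a path's reliability to be the product over its arcs:

    def arc_reliability(successes, failures):
        """Posterior mean of a Beta(1, 1) prior after the observed runs."""
        return (successes + 1) / (successes + failures + 2)

    def path_reliability(arc_counts):
        """Product of posterior mean arc reliabilities along one path."""
        r = 1.0
        for s, f in arc_counts:
            r *= arc_reliability(s, f)
        return r

    # An untested arc contributes only 0.5; a heavily exercised, failure-free
    # arc approaches 1, which is why adding 1000 random tests raises both the
    # single event and the single use estimates.
    print(arc_reliability(0, 0), arc_reliability(1000, 0))
    print(path_reliability([(1000, 0)] * 6))   # a six-arc failure-free path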

D. Analysis of Heuristic Testing

We also analyzed the 64 heuristic test cases that were designed and executed by our collaborator. The 64 heuristic tests can be mapped to nine out of the 72 paths of the usage model, as shown in Table V. Each test case is represented by the sequence of nodes that are traversed on the path.

TABLE V. HEURISTIC TEST CASES FOR THE PHONE FLAG COPY TESTING. EACH TEST CASE IS SHOWN AS A PATH THROUGH THE USAGE MODEL STARTING WITH THE SOURCE (NODE 1) AND ENDING WITH THE SINK (NODE 23).

h1: 1, 2, 7, 10, 15, 23
h2: 1, 2, 7, 11, 15, 23
h3: 1, 3, 9, 13, 18, 20, 23
h4: 1, 3, 9, 13, 19, 20, 23
h5: 1, 3, 9, 13, 18, 21, 23
h6: 1, 3, 9, 13, 19, 21, 23
h7: 1, 3, 9, 13, 18, 22, 23
h8: 1, 3, 9, 13, 19, 22, 23
h9: 1, 3, 9, 12, 18, 22, 23

TABLE VI. EXCERPTS OF THE TEST CASE ANALYSIS WITH NINE HEURISTIC PATHS: RELIABILITIES

Single Event Reliability: 0.706534161
Single Event Variance: 1.64390122E-3
Single Event Optimum Reliability: 0.706534161
Single Event Optimum Variance: 1.64390122E-3
Single Use Reliability: 0.14162298
Single Use Variance: 21.3206386E-3
Single Use Optimum Reliability: 0.14162298
Single Use Optimum Variance: 21.3206386E-3
Arc Source Entropy: 0.736566543 bits
Kullback Discrimination: 11.948 bits
Relative Kullback Discrimination: 1,622.084%
Optimum Kullback Discrimination: 11.948 bits
Optimum Relative Kullback Discrimination: 1,622.084%

Note that eight out of the nine heuristic tests (i.e., h1 - h8) go through the two most frequent arcs (among arcs out of the same state), i.e., Arc a6 from Node 2 to Node 7 with Probability 0.60, and Arc a18 from Node 9 to Node 13 with Probability 0.98. When we generate weighted test cases from the usage model based on the probability mass of each test case, from the most probable to the least probable, it takes 36 weighted test cases (half of the exhaustive tests) to include all eight of these heuristic tests.

We first performed a statistical analysis based on the nine heuristic tests only, shown in Table VI. Single event reliability is 0.706534161, and single use reliability is low (0.14162298) given the complexity of the model and the small sample.

Next we performed a statistical analysis based on each of the nine heuristic tests being executed 1,000 times (with possibly different input parameters), with all runs successful. As shown in Table VII, the single event reliability is elevated to 0.864526384, and the single use reliability is elevated to 0.457744018.

Table VIII shows the node/arc/stimulus coverage for the two test suites including the nine heuristic tests.

TABLE VII. EXCERPTS OF THE TEST CASE ANALYSIS WITH 9,000 RUNS OF THE NINE HEURISTIC PATHS (1,000 RUNS FOR EACH PATH): RELIABILITIES

Single Event Reliability: 0.864526384
Single Event Variance: 720.895785E-6
Single Event Optimum Reliability: 0.864526384
Single Event Optimum Variance: 720.895785E-6
Single Use Reliability: 0.457744018
Single Use Variance: 0.156256243
Single Use Optimum Reliability: 0.457744018
Single Use Optimum Variance: 0.156256243
Arc Source Entropy: 0.736566543 bits
Kullback Discrimination: 11.948 bits
Relative Kullback Discrimination: 1,622.084%
Optimum Kullback Discrimination: 11.948 bits
Optimum Relative Kullback Discrimination: 1,622.084%

TABLE VIII. NODE/ARC/STIMULUS COVERAGE FOR THE TEST SUITES INCLUDING THE NINE HEURISTIC PATHS

Nodes Generated: 16 nodes / 23 nodes (0.695652174)
Arcs Generated: 23 arcs / 43 arcs (0.534883721)
Stimuli Generated: 23 stimuli / 43 stimuli (0.534883721)
Nodes Executed: 16 nodes / 23 nodes (0.695652174)
Arcs Executed: 23 arcs / 43 arcs (0.534883721)
Stimuli Executed: 23 stimuli / 43 stimuli (0.534883721)

V. CONCLUSION AND FUTURE WORK

Statistical testing based on a Markov chain usage model is a rigorous testing method that can be used to provide audit trails of evidence to support correctness arguments for a software-intensive system and a decision that the system is of requisite quality for its intended use. This paper reports a case study in which we applied standard statistical testing practices to a real-world testing problem, and presents our solution from problem formalization, usage modeling, and model analysis to test case generation and test case analysis. We also analyzed a heuristically generated test suite against the developed usage model. Our results supported the claim that usage modeling and sampling based on the Markov chain usage model can reveal important test paths not covered by heuristic testing, and therefore improve test coverage and adequacy. Coverage and random sampling combined also considerably improved reliability estimates both at the arc (usage event) level and at the path (system) level, compared with heuristic sampling.


Work is under way to implement the 63 test cases not covered in the heuristic test suite, to see if they lead to the identification of any bugs in the fielded application. We are also interested in automating the testing process for the phone flag copy testing to provide fully automated test case generation, execution, and evaluation.

ACKNOWLEDGMENT

This work was generously funded by Ontario Systems through the NSF Security and Software Engineering Research Center (S2ERC).

REFERENCES

[1] J Usage Model Builder Library (JUMBL). Software Quality Research Laboratory, The University of Tennessee, 2017. http://jumbl.sourceforge.net/jumblTop.html.

[2] T. Bauer, T. Beletski, F. Boehr, R. Eschbach, D. Landmann, and J. Poore. From requirements to statistical testing of embedded systems. In Proceedings of the 4th International Workshop on Software Engineering for Automotive Systems, pages 3-9, Minneapolis, MN, 2007.

[3] J. Bouwmeester, G. H. Broadfoot, and P. J. Hopcroft. Compliance test framework. In Proceedings of the 2nd Workshop on Model-Based Testing in Practice, pages 97-106, Enschede, The Netherlands, 2009.

[4] W. J. Gutjahr. Software dependability evaluation based on Markov usage models. Performance Evaluation, 40(4):199-222, 2000.

[5] D. Jackson, M. Thomas, and L. I. Millett, editors. Software for Dependable Systems: Sufficient Evidence? National Academies Press, 2007.

[6] J. H. Poore. Theory-practice-tools for automated statistical testing. DoD Software Tech News: Model-Driven Development, 12(4):20-24, 2010.

[7] J. H. Poore, L. Lin, R. Eschbach, and T. Bauer. Automated statistical testing for embedded systems. In J. Zander, I. Schieferdecker, and P. J. Mosterman, editors, Model-Based Testing for Embedded Systems, in the Series on Computational Analysis, Synthesis, and Design of Dynamic Systems. CRC Press-Taylor & Francis, 2011.

[8] J. H. Poore and C. J. Trammell. Engineering practices for statistical testing. Crosstalk (DoD Software Engineering Journal-Newsletter), pages 24-28, April 1998.

[9] J. H. Poore and C. J. Trammell. Application of statistical science to testing and evaluating software intensive systems. In M. L. Cohen, D. L. Steffey, and J. E. Rolph, editors, Statistics, Testing, and Defense Acquisition: Background Papers. National Academies Press, 1999.

[10] S. J. Prowell. Computations for Markov Chain Usage Models. Technical Report UT-CS-03-505, 2003.

[11] S. J. Prowell. JUMBL: A tool for model-based statistical testing. In Proceedings of the 36th Annual Hawaii International Conference on System Sciences, page 337c, Big Island, HI, 2003.

[12] S. J. Prowell and J. H. Poore. Computing system reliability using Markov chain usage models. Journal of Systems and Software, 73(2):219-225, 2004.

[13] S. J. Prowell, C. J. Trammell, R. C. Linger, and J. H. Poore. Cleanroom Software Engineering: Technology and Process. Addison-Wesley, Reading, MA, 1999.

[14] K. Sayre and J. H. Poore. Automated testing of generic computational science libraries. In Proceedings of the 40th Annual Hawaii International Conference on System Sciences, page 277c, Big Island, HI, 2007.

[15] G. H. Walton, J. H. Poore, and C. J. Trammell. Statistical testing of software based on a usage model. Software - Practice & Experience, 25(1):97-108, 1995.

[16] J. A. Whittaker and J. H. Poore. Markov analysis of software specifications. ACM Transactions on Software Engineering and Methodology, 2(1):93-106, 1993.

[17] J. A. Whittaker and M. G. Thomason. A Markov chain model for statistical software testing. IEEE Transactions on Software Engineering, 20(10):812-824, 1994.
