
Ouriques JFS, Cartaxo EG, Alves ELG, Machado PDL. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY: 1– Mar. 2017

A Hint-Based Technique for System Level Model-Based Test Case Prioritization

João F. S. Ouriques1, Emanuela G. Cartaxo1, Everton L. G. Alves1, and Patrícia D. L. Machado1

1 Software Practices Laboratory, Federal University of Campina Grande, Paraíba, 58429-900, Brazil

E-mail: {jfelipe, emanuela}@copin.ufcg.edu.br, {everton, patricia}@computacao.ufcg.edu.br

Received March 6, 2017

Abstract Test Case Prioritization (TCP) techniques aim at proposing new test case execution orders that favor the achievement of a certain testing goal, such as fault detection. Current TCP research focuses mainly on code-based regression testing; in the Model-Based Testing (MBT) context, however, further investigation is still needed. General TCP techniques do not use historical information, since this information is often unavailable; therefore, techniques resort to different sources of information to guide prioritization. We propose a novel technique that guides its operation using provided hints: the Hint-Based Adaptive Random Prioritization (HARP). Hints are indications or suggestions provided by developers about error-prone functionalities. As hints may be hard to collect automatically, we also propose an approach for collecting them. To validate our findings, we performed an experiment measuring the effect of introducing hints into HARP. It shows that hints can improve HARP's performance compared to its baseline. Then, we investigated the ability of developers/managers to provide good hints and used them in a case study. This analysis showed that developers and managers are able to provide useful hints, which improve HARP's fault detection compared to its baseline. Nonetheless, the provided hints should be consensual among the development team members.

Keywords Model-Based Testing; Test Case Prioritization; Fault Detection; Experimental Evaluation; Questionnaire.

1 Introduction

Verification and Validation (V&V) activities play an important role during software development. They often help to decrease the chances of delivering a buggy product and to increase the quality of software artifacts. Therefore, a great deal of a project's budget and resources is often devoted to V&V tasks [1]. One of these tasks is Software Testing. When testing, a tester executes a program providing a set of controlled inputs and checks the outputs, looking for errors, anomalies, and/or non-functional information [1]. Although widely used and extensively studied, software testing is known to be costly. Beizer [2], and Ammann and Offutt [3], point out that testing alone can consume nearly 50% of a project's budget. Therefore, research has been conducted aiming at reducing the costs related to software testing.

In this scenario, the Model-Based Testing (MBT) approach [4] has emerged as an alternative to traditional testing. MBT uses the system's models to automatically perform several testing tasks (e.g., test case generation). Among the main advantages of using MBT, we can list [4]: i) time and cost reduction - automation is one of the pillars of MBT, which ends up helping the development team to save time and resources when building testing artifacts; and ii) artifact soundness - as test artifacts are derived automatically from the system's models, the chances of creating incorrect artifacts decrease.

This work was partially supported by the National Institute of Science and Technology for Software Engineering (2015), funded by CNPq/Brasil, grant 573964/2008-4. The first author was also supported by CNPq grant 141215/2012-7.

arXiv:1708.03236v1 [cs.SE] 10 Aug 2017

Due to the exhaustive nature of most test case generation algorithms used in MBT (e.g., based on depth-first search), generated test suites tend to be massive, which may jeopardize test case management [4]. To cope with this problem, a number of techniques have been presented in the literature. Most of them can be classified into three categories: Test Case Selection [5–7], Test Suite Reduction [8, 9], and Test Case Prioritization (TCP) [10–12].

The goal of test case selection and test suite reduction is to propose a subset of test cases from the original suite, in other words, to help testers work with a smaller number of test cases. However, test cases that unveil faults not detected by any other may be discarded [13]. Therefore, in some scenarios, both test case selection and reduction may fall short of proposing an effective smaller test suite that still reveals all faults. On the other hand, the TCP problem is to determine a permutation of the elements of a test suite that optimizes an evaluation function, chosen according to a prioritization objective [12]. As a direct consequence, a test case prioritization technique does not eliminate any test case from the original test suite, but it helps to minimize resource usage by focusing on the most relevant ones [14]. By not discarding test cases, TCP is flexible, since it allows the tester to decide the number of test cases to be run. This decision is often taken according to the project's resource availability. Thus, TCP has been widely studied and is regarded as a good alternative to address the "huge suite" problem.

Both code-based and model-based test suites can be prioritized using TCP techniques. However, most techniques presented in the literature have been defined and evaluated only for code-based suites, and in the context of regression testing [11, 12]. Catal and Mishra [15] conducted a systematic literature mapping regarding TCP and, among the results, they suggest that the number of new model-based prioritization techniques has increased at a slower rate when compared to code-based ones. Moreover, they encourage novel work in this field.

To the best of our knowledge, there are only a few attempts at defining TCP techniques based on model information. Korel et al. [16] proposed a set of heuristics for TCP using Extended Finite State Machines (EFSM). On the other hand, Sapna and Mohanty [17], Kaur et al. [10], Stallbaum et al. [18], and Nejad et al. [19] proposed TCP techniques that prioritize test cases generated from UML Activity Diagrams. In addition, Kundu and Sarma [20] proposed a technique suitable for UML Sequence Diagrams. However, the effectiveness and applicability of current techniques still require further investigation.

In a previous work [21], we performed a set of empirical studies evaluating how several factors, such as the number of faults, elements of the model's layout (represented by the number of branches, joins, and loops it has), and failing test cases' characteristics, can affect the behavior of traditional TCP techniques in the MBT context, focusing on Labeled Transition Systems (LTS). Our results showed that, for models with different layouts, the investigated techniques did not present variation in fault detection capability, which indicates low effects of model layout. Besides, the study pointed out that characteristics of failing test cases (such as the number of branches, joins, and loops they traverse in the model) can explain the techniques' performance. Therefore, these studies suggested that TCP techniques should not rely only on layout characteristics. As a secondary result of these studies, we also observed that techniques implementing the Adaptive Random strategy [11, 22] presented a good performance, mainly because of their adaptive component, which explores the test cases more evenly. However, they were also negatively affected by random choices, which was exposed by spread boxplots.

The effectiveness of a prioritized test suite is often evaluated by its ability to reveal faults as fast as possible. Ideally, a technique should place all test cases that unveil faults in the first positions of the prioritized test suite. However, this may require key information that may not be available, such as historical data about previous test executions, as in the regression testing context [15]. General TCP techniques, on the other hand, do not take historical data into account. They can be very useful during the initial cycles of system testing, since a great number of never-executed test cases are often available and there is only preliminary information to guide the prioritization process [23]. There are some techniques that rely on the experts' knowledge as a guide to the prioritization task [18, 24, 25], but they demand a high level of interaction with the specialist, which increases their application costs. In this work, we focus on the general TCP problem in the system-level context, considering test suites generated through MBT approaches, assuming that historical information from previous testing executions is not available, and using the experts' knowledge to leverage the prioritization process.

Motivated by the lack of confidence in taking the model layout itself into account and by the promising performance of techniques based on the adaptive random strategy, we propose the Hint-Based Adaptive Random Prioritization (HARP). It comprises two main steps: the acquisition of hints from developers and managers, and the prioritization of test cases using these hints as a guidance layer. By "hints" we mean indications from developers and managers (which we refer to as the development team) regarding jeopardized portions of the system, for instance, portions of the implemented use cases that were hard to implement or subject to changes in schedule. HARP's objective is to take advantage of the expertise acquired during system development to improve the exploration of test cases and, consequently, to better order the test cases in order to anticipate fault detection.

HARP’s first step aims at collecting hintsfrom the development team with respect to theuse cases that they actually worked on througha questionnaire. It is important to remark thathints about error-prone functionalities could beobtained from a number of different sources, forinstance change impact analysis data, defect pre-diction, and risk assessment [26,27], provided thathistory on test execution is required. However, thisinvestigation is out of the scope of this paper, sincewe are not addressing the regression testing con-text.

Using the hints as a layer of guidance, HARP takes as input a test suite and suggests a new order for its elements. Each test case is a sequence of labels representing steps performed by the tester, results that the system under testing (SUT) should produce, or conditions that must be satisfied to proceed with the execution of the test case. A hint is a piece of information from the development team, which is translated into a test purpose which, in turn, is an expression defined from test case labels that filters the set of available test cases. Consequently, HARP is independent of specific kinds of models as well as of the test case generation process.

We evaluate HARP through three empirical studies: an experiment evaluating the effect of giving good and bad hints to HARP; a questionnaire with development teams from two industrial projects, evaluating whether the participants would be able to provide proper hints, i.e., whether the use case steps they indicate are actually related to faults; and a case study comparing HARP with the original ARP technique [11], involving test suites from industrial systems and using the hints collected in the questionnaire.

Our experiment suggested that HARP reaches its best performance when one provides a good hint, which means that, even with its random aspects, the algorithm benefits from hints; on the other hand, if the provided hint is bad, HARP is negatively affected and should not be used in such scenarios. Since we suggest that a particular hint be used only when the majority of the developers and managers agree on it, we believe that scenarios where a hint would misguide the prioritization are minimized. From the questionnaire, we found out that, in most cases, participants were able to indicate error-prone paths, which evidences the real applicability of our technique. From the case study, we found out that, for the investigated applications and guided by the hints provided through the questionnaire, HARP reduced the time to reveal the first fault in comparison to traditional ARP.

In summary, this work has the following contributions: i) a novel TCP technique – HARP – along with its algorithm and asymptotic analysis; ii) a way to collect, from development teams, helpful information to guide prioritization; iii) an experiment exposing the effects of providing a hint to the proposed algorithm; and iv) a case study comparing the ability of HARP and its baseline to reveal the first fault. In addition, we provide enough detail to stimulate a repetition of this investigation, as well as scripts and collected data, providing transparency and reproducibility to the investigation.

The remainder of the paper is structured as follows: Section 2 exposes the background topics involved in this work; Section 3 depicts a motivational example introducing the idea of the hint and how to obtain it; Section 4 details the proposed technique and provides an asymptotic analysis; Section 5 presents details about the conducted empirical investigation; Section 6 presents the related work; finally, Section 7 summarizes the conclusions about the work.

2 Background

This section introduces the key concepts used in this work. It covers the formal definition of the Test Case Prioritization problem (Section 2.1), the Adaptive Random strategy (Section 2.2), the activity of modeling a system (Section 2.3), and the use of Test Purposes in testing techniques (Section 2.4).

2.1 Test Case Prioritization

Test Case Prioritization (TCP) techniques reorder the test cases of a test suite in order to reach a given testing objective sooner. Testers may apply TCP in both code-based and model-based contexts, but it has been more frequently used in the code-based context and is often related to regression testing [15]. Formally, the TCP problem is defined as follows [28]:

Definition 1 (TCP Problem). Given: a test suite TS, the set of all permutations of TS, PTS, and a function from PTS to the real numbers, f : PTS → R.
Problem: to find TS′ ∈ PTS such that, for all TS′′ ∈ PTS with TS′′ ≠ TS′, f(TS′) ≥ f(TS′′).

In other words, among the set of all permutations of the available test cases, the problem is to select one that maximizes the evaluation function, i.e., a function that measures how good a prioritized test suite is.

TCP has two clear practical issues: i) analyzing the whole set of permutations of TS; and ii) defining the evaluation function f. Since the number of permutations of a set A is |A|!, if TS is big, analyzing all the elements of PTS might be impractical, as discussed by Lima et al. [29]. Thereby, TCP techniques usually select one test case at a time and place it in the desired order following certain heuristics. Moreover, the evaluation function is related to the prioritization goal, for example, to reduce test case setup time [29] or, more commonly, to accelerate fault detection. Depending on the prioritization goal, the information required to define f is not available beforehand, making its definition impossible. For instance, considering fault detection, the information regarding where the faults are or which test cases detect them is not available in advance, so a technique is not able to maximize f directly. Therefore, techniques that focus on fault detection estimate fault information through surrogates [24].
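As a minimal illustration of Definition 1 (a sketch; the suite, the evaluation function, and the names below are hypothetical), a literal brute-force solution would enumerate every permutation and keep the best one, which makes the |TS|! blow-up concrete and explains why practical techniques pick one test case at a time using heuristics instead:

from itertools import permutations

def prioritize_exhaustively(test_suite, f):
    # Literal reading of Definition 1: inspect every permutation of the
    # suite and keep one that maximizes the evaluation function f.
    # With |TS| = n this examines n! orderings, which quickly becomes
    # impractical; hence the heuristic, one-test-case-at-a-time approach.
    return max(permutations(test_suite), key=f)

# Toy usage: f is a surrogate that rewards placing longer test cases
# earlier, since real fault information is not available beforehand.
suite = ["TC1", "TC2", "TC3"]
steps = {"TC1": 9, "TC2": 16, "TC3": 16}
f = lambda order: -sum(pos * steps[tc] for pos, tc in enumerate(order))
print(prioritize_exhaustively(suite, f))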

2.2 Adaptive Random Strategy

Chen et al. [30] propose the Adaptive Random strategy as an alternative to pure random test case generation for software with numerical inputs. The authors introduce the term failure pattern, which is a region in the input space that causes failures. Thus, the adaptive random strategy lies in estimating these failure patterns and spreading the test cases more efficiently through the input space. The original strategy is suitable for on-line test case generation with numerical inputs [30]. After a test case is run, its result determines whether the algorithm should generate another one, i.e., if a test case does not reveal any fault, the strategy generates another test case, until a failure occurs. The idea is to randomly sample a set of candidate test cases and, among them, execute the one farthest away from the already executed ones. This process continues until the execution unveils a fault.

The distance concept, represented by the "farthest away" expression, is given by a set of functions that evaluate two test cases and two sets of test cases. The first one calculates the distance between two test cases, taking into account the estimated failure pattern. Chen et al. [30] propose the use of the Euclidean distance function to evaluate two test cases that manipulate numerical inputs. To compare two sets of test cases, the algorithm looks for the maximum of the minimum distances between the already executed test cases and the candidates.

Jiang et al. [11] propose the use of the adaptive random strategy to solve the general TCP problem. For that, they propose a technique, which we refer to as the Adaptive Random Prioritization (ARP) technique, by adapting the original strategy as follows: i) turning the executed set into a prioritized sequence, and ii) changing the stop criterion to repeat the process until all test cases are in the prioritized sequence. Thus, the proposed technique places test cases in order while the untreated set of test cases is not empty. The authors use Jaccard [31] as the distance function, which measures the distance between sets by comparing their union and intersection. Later, Zhou [22] suggests another use for the previously proposed technique. The author changes the distance function to the Manhattan function, but keeps the same algorithm, which compares two test cases through their branch coverage. Moreover, Zhou et al. [32] perform an empirical study comparing these two variations. The results of this study show that, in the code-based context, the Manhattan distance provides better fault detection. This is mainly because it measures just the coverage of code elements, instead of measuring the frequency of the coverage, as done by the Jaccard function.
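A small sketch of this adaptive random selection step, assuming each test case is represented by the set of model elements (or branches) it covers; the function names are ours, not Jiang et al.'s implementation:

def jaccard_distance(a, b):
    # Distance between two test cases viewed as coverage sets:
    # 1 - |A ∩ B| / |A ∪ B|.
    return 1.0 - len(a & b) / len(a | b)

def farthest_candidate(prioritized, candidates, distance=jaccard_distance):
    # Adaptive random step: for each candidate, take its minimum distance
    # to the already selected test cases, then pick the candidate whose
    # minimum distance is maximal ("farthest away").
    return max(candidates,
               key=lambda c: min(distance(c, p) for p in prioritized))

# Toy usage with hypothetical coverage sets.
tc_a = {"t1", "t2", "t3"}
tc_b = {"t1", "t4"}
tc_c = {"t2", "t3"}
print(farthest_candidate([tc_a], [tc_b, tc_c]) == tc_b)  # True: tc_b is farther from tc_a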

2.3 Modeling Systems

In Model-Based Testing, the main artifact is the system model, which often represents the system's behavior through execution flows. The first decision is to define the model notation based on the abstraction level and the testing level. In this work, since we consider a higher abstraction level at the system level, we consider transition-based models due to their high abstraction, wide use, availability of supporting tools, and suitability for control-flow representation [4]. In our work, we essentially use Labeled Transition Systems (LTS) as system models. An LTS is defined in terms of states and labeled transitions linking them. Formally, an LTS is a 4-tuple < S, L, T, s0 >, where [33]:

• S is a non-empty and finite set of states;

• L is a finite set of labels;

• T ⊆ S × L × S is a set of triples, the transition relation;

• s0 is the initial state.

To model a system using this notation, one needs to represent the user steps, system responses, and conditions as LTS transitions; to represent a decision, one can just branch the state and add a transition for each alternative. For the sake of notation, the transition's label has a dedicated prefix: "S - " for user steps, "R - " for system responses, and "C - " for conditions. It is also possible to represent the junction of more than one flow, as well as loops. One can represent three kinds of execution flows using this notation: base, describing the main use of the system, i.e., the scenario where the user does what is expected in most situations and no error occurs; alternative, defining an alternative user behavior or a different way some action can be done; and exception, specifying the occurrence of an error returned by the system. In Figure 1 we can see an example of an application that verifies login and password, modeled as an LTS. The first transition represents the system precondition, which is when it shows its main screen. It can be interpreted as an initial condition for the system operation. In the next step, the user must fill in the login field and then the system verifies whether the filled-in login is valid. Depending on the login validity, the execution follows two different paths, expressed by the branches going out from the state labeled as "4". If the login is invalid, a loop transition that shows an error message takes the execution flow back to the action of filling in the login field.
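As a concrete illustration, a fragment of the login model of Figure 1 can be encoded directly as the 4-tuple above (a sketch; only the states up to the login-validity branch are shown, and the Python names are ours):

# LTS fragment <S, L, T, s0> for the login model (states "1" to "6").
S = {"1", "2", "3", "4", "5", "6"}
L = {
    "C - Show the login screen",
    "S - Fill the login field",
    "R - Check if the login is valid",
    "C - Valid login",
    "C - Invalid login",
}
T = {
    ("1", "C - Show the login screen", "2"),
    ("2", "S - Fill the login field", "3"),
    ("3", "R - Check if the login is valid", "4"),
    ("4", "C - Valid login", "5"),
    ("4", "C - Invalid login", "6"),
}
s0 = "1"

# Sanity check: the transition relation is a subset of S × L × S.
assert all(src in S and lbl in L and dst in S for (src, lbl, dst) in T)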

2.4 Test Purpose

Test purpose is the purpose of testing! Indeed, a test purpose indicates the intention behind a testing process. According to Vain et al. [34], a test purpose is a specific objective or property of the system under test that the development team wants to test. In practice, applying a test purpose to a test suite represents a filtering process, resulting in a subset of test cases that satisfies the purpose.


Fig. 1: Model representing a use case for login and password verification. (LTS with states 1-11; transition labels include "C - Show the login screen", "S - Fill the login field", "R - Check if the login is valid", "C - Valid login", "C - Invalid login", "R - Show an error message: 'Login not found'", "S - Fill the password field", "R - Check if the login and password match", "C - match", "C - do not match", "R - Show an error message: 'Login and password do not match'", and "R - Show the main screen".)

There are different ways of representing test purposes. Jard and Jeron [35] present a formal approach for representing test purposes in conformance testing. This representation uses an extended version of Labeled Transition Systems (LTS), which includes input and output actions (IOLTS). The same format is suggested to be used for modeling the system under testing. Therefore, with the system and the test purpose respecting the same rules, the latter is able to guide the testing process through the former.

Cartaxo et al. [36] propose a test purpose also based on LTS models, but considering an "accept" or "reject" modifier. This notation is more flexible, since it does not need to define specific inputs and outputs, just the steps of the model, represented by the edges of the LTS. Moreover, this representation can also be applied in the MBT context by considering system-level test cases. This test purpose is expressed using the following guidelines:

• The purpose is a sequence of strings separated by commas;

• Each string is a label of an edge from the model or an "*", which works as a wildcard, indicating that any label satisfies the purpose;

• The last string is "accept", if the test case must agree with the test purpose definition, or "reject" otherwise.

HARP takes as input a test purpose whose format is a variation of Cartaxo's test purpose [36]. We adapted it by omitting the last string, which is accept or reject, adopting just the acceptance. This modification was needed because we focus only on hints regarding error-prone regions of the system, regardless of safer portions. Therefore, our test purpose is defined as follows (a small matching sketch is given after the list):

• A sequence of strings separated by a vertical bar, or pipe ("|");

• Each string is the label of an element of the system model, e.g., a step of a use case, or a "*", which is a wildcard, representing any text.
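Matching such a test purpose against a test case reduces to wildcard matching over the test case's label sequence (Section 4.3 later describes the filtering as similar to evaluating a regular expression); the sketch below is one possible reading of that process, with hypothetical helper names:

def matches(labels, purpose_parts):
    # "*" matches any (possibly empty) run of labels; any other part must
    # match the current label exactly.
    if not purpose_parts:
        return not labels
    head, rest = purpose_parts[0], purpose_parts[1:]
    if head == "*":
        return any(matches(labels[i:], rest) for i in range(len(labels) + 1))
    return bool(labels) and labels[0] == head and matches(labels[1:], rest)

def filter_test_cases(test_suite, test_purpose):
    # A test case (a list of labels) is kept when it satisfies the purpose.
    parts = [p.strip() for p in test_purpose.split("|")]
    return [tc for tc in test_suite if matches(tc, parts)]

# Toy usage: only the second test case traverses "C - Invalid login".
purpose = "* | C - Invalid login | *"
tc_valid = ["C - Show the login screen", "S - Fill the login field",
            "R - Check if the login is valid", "C - Valid login"]
tc_invalid = ["C - Show the login screen", "S - Fill the login field",
              "R - Check if the login is valid", "C - Invalid login"]
print(len(filter_test_cases([tc_valid, tc_invalid], purpose)))  # 1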

3 Motivating Example

We measure the success of a TCP technique that focuses on fault detection by analyzing whether it creates prioritized test suite orders that anticipate fault detection when compared to the original suite. Therefore, a TCP technique ideally places all test cases that fail in the first positions of the prioritized test suite. However, this may require key historical information that is often not available. Considering the general TCP context, some techniques use model characteristics, such as structural aspects of the models, as prioritization criteria. For instance, a greedy prioritization technique [12] would reschedule test cases according to the number of steps each one describes. Therefore, it assumes that the more steps a test case has, the more likely it is to reveal a fault.

To illustrate the problem addressed in our work, consider the LTS model depicted in Figure 1, which specifies a use case that performs login and password verification in an information system. Now, suppose that a tester applies a test case generation algorithm that traverses loops at most twice and this process creates a test suite with seven test cases (Table 1).

Now, suppose that a fault occurs when the user provides an invalid login twice in a row. Considering this scenario, TC7 is a test case that covers the respective path of the model and, consequently, should reveal the system's fault.


Table 1: Generated test cases from the example model. We use the notation based on the nodes' labels for the sake of visualization.

Label Test Case Nodes

TC1 [1, 2, 3, 4, 5, 7, 8, 9, 11]

TC2 [1, 2, 3, 4, 5, 7, 8, 10, 2, 3, 4, 5, 7, 8, 9, 11]

TC3 [1, 2, 3, 4, 5, 7, 8, 10, 2, 3, 4, 5, 7, 8, 10, 2]

TC4 [1, 2, 3, 4, 5, 7, 8, 10, 2, 3, 4, 6, 2]

TC5 [1, 2, 3, 4, 6, 2, 3, 4, 5, 7, 8, 9, 11]

TC6 [1, 2, 3, 4, 6, 2, 3, 4, 5, 7, 8, 10, 2]

TC7 [1, 2, 3, 4, 6, 2, 3, 4, 6, 2]

With that in mind, suppose that the tester intends to accelerate fault detection by applying test case prioritization. Therefore, he/she decides to apply the aforementioned greedy strategy for rescheduling the test suite. After running it, he/she obtains the following prioritized test suite: PTSG = [TC2, TC3, TC4, TC5, TC6, TC7, TC1]. As the greedy strategy considers only the number of steps as prioritization criterion, it ends up placing the single test case that reveals the fault at the penultimate position, therefore representing an undesirable order for a TCP technique.
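For this particular suite, the greedy order PTSG above can be reproduced by simply sorting the test cases of Table 1 by their number of steps in decreasing order (a sketch; ties keep the original order because Python's sort is stable):

# Test cases from Table 1 as node sequences.
suite = {
    "TC1": [1, 2, 3, 4, 5, 7, 8, 9, 11],
    "TC2": [1, 2, 3, 4, 5, 7, 8, 10, 2, 3, 4, 5, 7, 8, 9, 11],
    "TC3": [1, 2, 3, 4, 5, 7, 8, 10, 2, 3, 4, 5, 7, 8, 10, 2],
    "TC4": [1, 2, 3, 4, 5, 7, 8, 10, 2, 3, 4, 6, 2],
    "TC5": [1, 2, 3, 4, 6, 2, 3, 4, 5, 7, 8, 9, 11],
    "TC6": [1, 2, 3, 4, 6, 2, 3, 4, 5, 7, 8, 10, 2],
    "TC7": [1, 2, 3, 4, 6, 2, 3, 4, 6, 2],
}

# Greedy prioritization: test cases with more steps come first.
ptsg = sorted(suite, key=lambda tc: -len(suite[tc]))
print(ptsg)  # ['TC2', 'TC3', 'TC4', 'TC5', 'TC6', 'TC7', 'TC1']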

Now, suppose that during the development of the referred use case, due to time restrictions, the team ended up neglecting the verification of the scenarios where the login and password are not valid. Alternatively, also during development, the team may perceive that the login and password verification is not a trivial feature, e.g., logins and passwords must adhere to special patterns or rely on external resources, for example an unstable network link. In both cases, taking this information into account could potentially point to error-prone regions of the use case. Therefore, a developer may provide a hint, for example, "invalid login attempts may not have been properly handled due to resource constraints", which can be translated by a tester into a test purpose such as "* | C - Invalid Login | *", indicating that any test case that covers a scenario where the provided login is invalid would be interesting to test. Thus, a hypothetical prioritization technique that considers this hint would place test cases related to these problematic paths in the suite's first positions. For instance, test cases that cover the steps where the login is verified and an error occurs would be run first. One of the possible output test suites for this hint-based strategy is PTSH = [TC4, TC7, TC5, TC6, TC3, TC2, TC1]. As we can see, TC7 is placed at the beginning of the prioritized test suite, which is a more desirable behavior for a TCP technique.

We believe it is valid to assume that the expertise developers acquire while developing a certain functionality can be very useful for guiding the prioritization process, since they often own key information regarding problematic and/or complex software features/modules, which can lead to fault inclusion. Therefore, we propose a technique that takes into account hints provided by the development team and uses them as estimators of error-prone portions of the system.

4 Hint-Based Adaptive Random Prioritization

We propose HARP as an adaptive random prioritization technique that uses hints from developers and managers to prioritize test cases from a test suite. In this section, we detail both steps that it comprises.

4.1 Hint Acquisition

Hints are indications, given by the development team over the use case documents, of occurrences regarding the system under development/testing that may contribute to the insertion of faults, such as portions of the system that some developer considered hard to implement, that use some external and/or untrusted library, or that suffered a schedule shortening. Our hypothesis is that the aforementioned occurrences are estimators for faults, and taking them into consideration should lead to better fault detection capabilities.

We designed HARP to process hints encoded as test purposes using the notation presented in Section 2.4. Encoding hints as test purposes is a manual, but simple, task. A single hint may be a single step in a use case or a sequence of steps (or even a whole flow). In the first case, suppose that the development team indicates a step with text "single step" as a hint; the equivalent test purpose is tp1 = "* | single step | *", indicating that any test case that traverses "single step" is filtered by tp1. In the second case, suppose the hint comprises a pair of step and expected result with texts "single step" and "single system response", respectively; the equivalent test purpose is tp2 = "* | single step | single system response | *", indicating that any test case that describes the referred step and expects the related response is filtered by tp2. The same idea applies to a whole flow. Therefore, the output of this step is a set of hints TP = {tp1, tp2, ..., tpn} translated into test purposes.

As an illustration, considering the example in Section 3, suppose that the testers suspect that the error messages are not correct or are being acquired using an error-prone strategy. Therefore, they could codify hints as the set TPex = {"* | C - Invalid Login | *", "* | C - do not match | *"}. The set contains two hints, each one about a scenario where the SUT prints an error message.
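Assembling such test purposes from the indicated labels is plain string manipulation; a minimal helper sketch (the function name is ours):

def hint_to_test_purpose(labels):
    # Wrap the indicated use case steps/results in wildcards so that any
    # test case traversing that (sub)sequence is filtered by the purpose.
    return " | ".join(["*"] + list(labels) + ["*"])

print(hint_to_test_purpose(["single step"]))
# * | single step | *
print(hint_to_test_purpose(["single step", "single system response"]))
# * | single step | single system response | *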

4.2 Test Case Prioritization

HARP targets MBT test suites, potentially for manual execution, where test cases are expressed as sequences of steps, but it does not rely on any specific type of model or test case generation algorithm. It combines the advantage of exploring the test suite evenly with the guidance provided by hints in order to improve the performance of an adaptive random-based technique, since it reduces random choices.

By observing the original adaptive random prioritization algorithm [11, 22], we identified three points that could be modified: i) the selection of the first test case to be placed in order; ii) the generation of a candidate set of test cases to be placed in order in a particular iteration of the algorithm; and iii) the functions that represent the notion of resemblance and that select, among the candidates, the next test case to be placed in order. In this work, we modify these three aspects in order to reduce the effects of random choices and to take the hint-based guidance into consideration. Algorithm 1 summarizes the HARP technique. The method prioritize, which is the main function, takes as input a test suite U_TS and a set of test purposes TP, and returns a prioritized test suite.

Algorithm 1 The main procedure.

 1: function Prioritize(U_TS, TP)
 2:   PTS ← ∅
 3:   filtTCs ← filter(U_TS, TP)
 4:   first ← firstChoice(filtTCs)
 5:   PTS.append(first)
 6:   U_TS.remove(first)
 7:   filtTCs.remove(first)
 8:   while U_TS ≠ ∅ do
 9:     c ← genCandSet(U_TS, filtTCs)
10:     nextTC ← selectNext(PTS, c)
11:     PTS.append(nextTC)
12:     U_TS.remove(nextTC)
13:     filtTCs.remove(nextTC)
14:   end while
15:   return PTS
16: end function

The algorithm starts with an empty sequence of test cases (Line 2). In Line 3, it filters U_TS using every test purpose from TP, and the remainder of the algorithm uses the resulting set as the hint-related test cases. In Line 4, the algorithm defines the first test case to be placed in order (contextAwareChoice): it randomly selects one test case from the filtered set. This selected test case is then included in the prioritized sequence (Line 5) and removed from the original set (Line 6) and from the filtered set (Line 7).

The loop in Lines 8-14 assembles the rest of the prioritized suite. To do so, in its first step, it generates a set of candidates (Line 9). The candidate set generation process randomly selects one test case at a time, while the set keeps increasing branch coverage and its size is less than or equal to 10, as discussed by Jiang et al. [11] and empirically evaluated by Chen et al. [30]. We modified the candidate set generation by increasing the chances that hint-related test cases not yet placed in order are selected. In order to decide whether the next test case to compose the candidate set comes from the hint-related ones or not, we define a random variable following a uniform distribution, V ∼ U(0, 1). Then, at each iteration, if the algorithm samples a value v < 0.5, it randomly selects a test case among the ones filtered by the test purposes; otherwise, it randomly selects a test case among the ones not yet prioritized and not related to the hints.

(We use the term resemblance because, in the literature, there are two different notions, distance and similarity. Authors suggest functions following these two notions, but both may be applied here.)
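A sketch of this modified candidate set generation, assuming test cases are sequences of transitions and using transition coverage in place of the branch-coverage stopping criterion (names and details are ours; the paper only fixes the size limit of 10 and the v < 0.5 rule):

import random

MAX_CANDIDATES = 10

def gen_cand_set(untreated, hint_related, rng=random):
    # untreated: test cases not yet prioritized; hint_related: the subset of
    # untreated that is filtered by the test purposes. Test cases are lists
    # of (hashable) transitions.
    candidates, covered = [], set()
    pool_hint = list(hint_related)
    pool_rest = [tc for tc in untreated if tc not in hint_related]
    while len(candidates) < MAX_CANDIDATES and (pool_hint or pool_rest):
        # v < 0.5 favors a hint-related test case, otherwise any other one.
        use_hint = rng.random() < 0.5
        pool = pool_hint if (use_hint and pool_hint) else (pool_rest or pool_hint)
        tc = pool.pop(rng.randrange(len(pool)))
        gained = set(tc) - covered
        if candidates and not gained:
            break  # the candidate set stopped increasing coverage
        covered |= gained
        candidates.append(tc)
    return candidates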

The next step is to select, among the candidates, the next test case to be placed in order, by calling selectNextTestCase (Line 10). Algorithm 2 gives details on how it works. First, it builds a matrix (Line 5) that represents the resemblance between the candidates and the test cases that were already prioritized. Since we use the notion of similarity to represent the resemblance between test cases, the matrix contains similarity values. Then, the algorithm selects the candidate with the highest similarity (Line 8), i.e., the next test case will be the most similar to the previously prioritized ones. In order to measure the similarity between test cases, we use the similarity function proposed by Coutinho et al. [8], which was proposed specifically for the MBT context and is fully compatible with test cases generated by different algorithms, since it takes into account, besides the common transitions, their frequency in the test cases. The function is defined by:

similarity(i, j) = (nip(i, j) + |sit(i, j)|) / ((|i| + |j| + |sdt(i)| + |sdt(j)|) / 2),

where:

• i and j are two test cases;

• nip(i, j) is the number of identical transition pairs between two test cases;

• sdt(i) is the set of distinct transitions in the test case i;

• sit(i, j) is the set of identical transitions between two test cases, i.e., sdt(i) ∩ sdt(j).

Algorithm 2 The select next test case procedure

 1: function selectNextTestCase(p_TS, c_S)   ▷ p_TS is the already prioritized test sequence; c_S is the candidate set
 2:   d ← array[p_TS.size][c_S.size]
 3:   for i = 0 to p_TS.size − 1 do
 4:     for j = 0 to c_S.size − 1 do
 5:       d[i][j] ← sim(p_TS[i], c_S[j])
 6:     end for
 7:   end for
 8:   index ← maxValue(d)
 9:   nextTestCase ← c_S.get(index)
10:   return nextTestCase
11: end function
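A Python rendering of Algorithm 2 with a pluggable resemblance function (the paper instantiates it with the similarity function above; here any callable taking two test cases can be passed):

def select_next_test_case(prioritized, candidates, resemblance):
    # Build the resemblance "matrix" implicitly: for every pair formed by an
    # already prioritized test case and a candidate, compute the value and
    # keep the candidate holding the overall highest one, i.e. the candidate
    # most similar to what was already placed in order.
    best_candidate, best_value = None, float("-inf")
    for cand in candidates:
        for prev in prioritized:
            value = resemblance(prev, cand)
            if value > best_value:
                best_candidate, best_value = cand, value
    return best_candidate

In the running example of Section 4.3, this corresponds to choosing TC4 in the second iteration, since 90.90% is the highest value in Table 3.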

4.3 Running Example

In this section, we provide a visual and practical understanding of how HARP works, using an example. To do so, consider the scenario introduced in Section 3, in which we present a behavioral model of a fictitious application for login and password verification, whose generated test suite is shown in Table 1.

Besides, suppose that the development team responsible for building this system identifies the verification of the login in the database as an error-prone step of the system, particularly the invalid login verification. This step may be risky because it verifies non-encrypted data from the database and, if the developer did not think carefully when coding it, he/she might have left an open window for security attacks (e.g., SQL injection). Therefore, the tester formulates the hint as a set with a single test purpose, logError = {"* | C - Invalid login | *"}.

Considering the original test suite (refer to Table 1), to propose a new execution order, HARP chooses the test case to be placed in the first position by filtering the input test suite according to the provided hint. The algorithm filters the test suite by visiting every test case, traversing it, and verifying whether its sequence of steps matches the test purpose, in a process similar to the evaluation of a regular expression. The filtered subset is filtered = {TC4, TC5, TC6, TC7}. Then, suppose that the first test case randomly selected from the filtered set is TC6. The algorithm removes it from the untreated test suite and makes prioritizedSequence = [TC6].

Then, HARP creates a candidate set by randomly and iteratively selecting test cases from the untreated test suite or from the hint-related set. Suppose that the set is candidateSet = {TC1, TC3, TC5}, where just TC5 comes from the hint-related set. The next step is to calculate the similarities between the elements of the current prioritized sequence and the ones in the candidate set. The calculated values are in Table 2.

Table 2: 1st Iteration: Similarities among candidates and test cases already prioritized

TC6

TC1 63.15%

TC3 71.11%

TC5 72.72%

Considering these values, HARP selects the test case with the highest similarity, TC5, and the algorithm adds it to the prioritized sequence, making prioritizedSequence = [TC6, TC5].

Now, in the second iteration of the main loop of the algorithm, it proceeds by generating a new candidate set; suppose the second candidate set is candidateSet = {TC2, TC4}, where TC4 comes from the hint-related set. In the next step, it calculates the similarities between the test cases from candidateSet and prioritizedSequence, as can be seen in Table 3.

Table 3: 2nd Iteration: Similarities among candidates and test cases already prioritized

TC6 TC5

TC2 59.57% 68.08%

TC4 90.90% 72.72%

By comparing the maximum similarities for the two candidates, which are 68.08% and 90.90%, the next test case to be placed in order is TC4, making prioritizedSequence = [TC6, TC5, TC4]. The algorithm repeats this process until the last test case is placed in order. After the whole execution, adding one test case at a time, a proposed sequence could be prioritizedSequence = [TC6, TC5, TC4, TC2, TC3, TC7, TC1].

4.4 Asymptotic Analysis

To investigate how costly it would be to run HARP in practice, we analyze its asymptotic bounds. In order to perform this analysis, we consider its worst execution scenario, which is when the maximum number of candidates (10) is selected at every iteration. Moreover, consider a test suite with n elements and, for simplification, let us assume that all test cases from the original test suite have the same size, t. The execution time for every step is as follows:

1. similarity: The similarity calculation depends on the sizes of the involved test cases; however, as we are simplifying sizes to t, we can say its time is 2 · O(t), or O(t);

2. maxValue: The definition of the maximum among the maximum similarities traverses a matrix with <number of candidates> columns and <number of test cases already prioritized> lines. The number of candidates will be in the interval from 1 to 10 and, in the worst case, it will be 10 in all iterations. The number of test cases already prioritized depends on the initial suite. Thus, the time is O(10 · n), or O(n);

3. selectNextTestCase: The selection of the next test case to be placed in order, once more, depends on the number of candidates and on the number of test cases already prioritized. The similarity is calculated for every pair of test cases and, following the same reasoning, the time is 10 · n · O(t), plus the O(n) of the maxValue execution. Thus, the execution time is O(t · n) + O(n), or O(t · n);

4. generateCandidateSet: The generation of a candidate set depends on its size, which must be at most ten, and on the increasing coverage of requirements. Considering the worst case, the time is constant, thus O(1);

5. contextAwareChoice: The choice of the first test case depends on the number of test cases in the initial test suite and on the time necessary to evaluate whether a single test case matches the test purpose. The function just iterates over the initial test suite and verifies whether each test case is accepted by the test purpose. This verification depends on the size of the test case and, since we assume that all test cases have the same size (equal to t), the verification is O(t). Thus, the whole method executes in n · O(t), or O(t · n), time;

6. prioritize: This function calls the aforementioned ones, and its execution time is the total execution time of HARP. Thus, HARP's execution time is O(t · n) + n · (O(1) + O(t · n)) = O(t · n) + O(n) + O(t · n²), which equals O(n³) when n ≥ t, or O(t · n²) otherwise.

5 Empirical Investigation

HARP relies on the assumption that development team members are able to provide good hints about functionalities they developed or helped to develop. Besides providing evidence for this assumption, we also intend to discuss the boundaries of HARP's applicability. In order to provide evidence about HARP's applicability and ability to reveal faults, we investigate the following research questions:

• RQ1: Does the quality of a hint really affect HARP? In this question, we want to evaluate whether a good hint leads to a significantly better ability to reveal faults in comparison with a bad hint, evaluating the effect size of HARP's performance when equipped with these two kinds of hints. We evaluate this to observe whether the random choices in HARP's algorithm affect the guidance provided by the hints;

• RQ2: Are developers and managers able to provide good hints? We are not interested in evaluating the participants themselves. Instead, we try to find out whether the hints provided by them approximate portions of the system that really contain faults;

• RQ3: Is the hint collection process costly? This information is important, since we need to justify the costs introduced by our technique, and it will show whether it can be applied in practice. We are trying to find out how easy it can be to implement a process aiming to collect the information required to derive the test purposes, i.e., the hints;

• RQ4: Is HARP able to outperform the original Adaptive Random Prioritization technique, considering actual hints? To address this question, we compare HARP with the baseline version of the Adaptive Random Prioritization proposed by Jiang et al. [11].

To address these research questions, we perform three distinct empirical studies: an experiment controlling the quality of hints provided as input for HARP, to address RQ1; a questionnaire involving developers and managers from a partner project, addressing RQ2 and RQ3; and a case study involving industrial systems with hints provided by developers and managers, to address RQ4. We report these studies in the next sections. This research has a companion site (https://goo.gl/FH3b5m) containing the collected data, R scripts for data analysis, as well as instructions on how to use them.

5.1 Experimental Study

In this section, we report an experiment designed to investigate how HARP performs when good and bad hints are provided. Good hints are the ones that actually point to test cases that fail; bad hints, on the other hand, are the ones that do not point to test cases that fail. In order to guide this investigation and address RQ1 properly, we define a clear manual procedure to derive good and bad hints using actual fault reports, and we exercise HARP with them. Besides, as a further analysis, we also vary the distance function, aiming at assessing its effect. Therefore, we define this experiment using the framework based on key concepts suggested by Wohlin et al. [37]:

• The objects of study are HARP, hints, and the test suites generated from models that represent the System Under Testing (SUT);

• The purpose of the study is to evaluate whether HARP is really affected by the hint's quality;

• The quality focus is the capacity of suggesting a test suite capable of revealing faults earlier;

• The perspective is from the tester's point of view, i.e., the person that executes a TCP technique in a testing team;

• And the context is Model-Based Testing.

Based on the previous definition, we detail the experiment planning, which encompasses: the experiment context, its variables and experimental objects, the experimental design, the hypotheses under testing, and the validity evaluation.

This is an offline experiment, which comprises industrial applications, fault reports, and test cases generated through a test case generation algorithm. We define the artificial hints used in this study manually, since we need to represent both extremes of the situation: when the team provides either an accurate hint (one that points to test cases that fail), which we call a good hint, or a poor hint (one that does not point to any failure), which we call a bad hint. We define them with previous knowledge about faults and failures collected from the fault reports produced during the development process.

Hint Definition. Defining a bad hint is straightforward: we define a test purpose with steps not related to any fault and make sure that none of the filtered test cases reveals a fault. On the other hand, a good hint must represent a scenario where the filtered test cases reveal a fault, while not biasing the investigation. The definition of a good hint for this study follows a systematic, manual approach:

• For a specific fault of a given model, composed of its cause and the test cases that fail because of it, we define a test purpose with the label of the edge most related to the cause of the fault;

• If the whole set of test cases filtered by this test purpose fails, using it could bias the investigation, since a test case that fails would always be chosen as the first one in the sequence proposed by HARP. Therefore, we add more edges, still related to the cause, in order to avoid this scenario;

• If it is not possible to create a single test purpose able to filter both test cases that fail and test cases that do not fail, we define another test purpose related to the same hint and restart the process, aiming at filtering a set of test cases that does not bias the study.

As an example of this approach, consider the scenario suggested in our motivating example (refer to Section 3). Assuming the same suggested fault, the hint definition process begins by creating a test purpose with the label of the single transition most related to the fault, which is the one between the states labeled as 4 and 6. The first row of Table 4 shows the filtered test cases and the proportion of them that fail, considering a test purpose comprising the label of the transition most related to the considered fault. Since we know that the fault occurs when the user provides an invalid login twice in a row, we can be even more precise by adding the same label twice. The second row in the same table shows a test purpose that specifies this case. However, note that this test purpose filters only the test case that reveals the fault, which would bias the study, because HARP would always place TC7 in the first position of the prioritized test suite. Thus, according to the aforementioned approach, the test purpose "* | C - Invalid Login | *" is the final one.

Table 4: Illustration of the good hint generation

Test Purpose                                    Filtered TCs              % Failure
"*|C-Invalid Login|*"                           TC4, TC5, TC6, TC7        25%
"*|C-Invalid Login|*|C-Invalid Login|*"         TC7                       100%

We use the proportion of test cases that fail among the filtered ones to judge how good a hint is. Intuitively, for a bad hint, 0% of the filtered set fails; on the other hand, following the aforementioned approach to derive good hints leads to a proportion between 20% and 50%. These values represent the best test purposes we are able to generate using the aforementioned hint definition, considering the selected use cases and test reports, without being extreme, i.e., neither filtering 100% of the test cases (and thus being equal to the baseline ARP technique) nor filtering just the one test case that unveils the fault. We use this approach to define the possible values for a variable investigated in this study.
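Computing that proportion for a candidate test purpose is immediate once the filtered set is known (a sketch; the failing set is whatever the fault reports indicate, and the names are ours):

def failure_proportion(filtered_test_cases, failing_test_cases):
    # Fraction of failing test cases among the ones filtered by the
    # candidate purpose: 0% characterizes a bad hint, while the good hints
    # used in this experiment stay roughly between 20% and 50%.
    if not filtered_test_cases:
        return 0.0
    fails = sum(1 for tc in filtered_test_cases if tc in failing_test_cases)
    return fails / len(filtered_test_cases)

# Table 4, first row: only TC7 fails among the four filtered test cases.
print(failure_proportion(["TC4", "TC5", "TC6", "TC7"], {"TC7"}))  # 0.25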

Independent Variables. Since we want to evaluate the effects of the hint's quality on HARP, observing variations introduced by adopting different resemblance functions, we define the following independent variables:

• Hint quality

– Good hints (between 20% and 50% of the test cases filtered by the hints fail)

– Bad hints (no test case filtered by the hints fails)

• Resemblance function

– Similarity function [8] (as proposed in Section 4)

– Jaccard distance function (as originally proposed by Jiang et al. [11])

Dependent Variable. Our goal is to investigate whether hints affect HARP's results regarding early fault detection. To help this investigation, we selected the APFD (Average Percentage of Fault Detection) metric. The APFD is calculated as follows:

APFD = 1 − (TF1 + TF2 + ... + TFm) / (n · m) + 1 / (2n)    (1)

where TFi is the position of the first test case that reveals the i-th fault, m is the number of faults that the test suite is able to reveal, and n is the size of the test suite [12]. The APFD is a percentage in which a higher value indicates that the faults were revealed earlier during test execution and, analogously, a lower value indicates that the faults were unveiled later.
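A direct implementation of Equation (1), assuming the positions TFi of the first fault-revealing test cases are known from the fault reports (a sketch; names are ours):

def apfd(first_revealing_positions, suite_size):
    # first_revealing_positions: 1-based position TF_i of the first test
    # case that reveals each fault i; suite_size: n, the number of test
    # cases in the prioritized suite.
    m = len(first_revealing_positions)
    n = suite_size
    return 1.0 - sum(first_revealing_positions) / (n * m) + 1.0 / (2 * n)

# Example: a 7-test-case suite whose single fault is first revealed by the
# test case placed at position 2 of the prioritized order.
print(round(apfd([2], 7), 4))  # 0.7857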

Experiment Objects. In the MBT process, the system's models are often the input artifacts. Test cases are generated from these models and later provided to prioritization techniques. As experiment objects, we use two industrial systems: the Biometric Collector and the Document Control. Both systems were modeled through a set of Labeled Transition Systems, which represent their functionalities.

The test cases were generated by the testing team using the same test case generation algorithm, which traverses the LTS covering loops at most twice, and were uploaded to a test case execution management tool, Testlink. Then, we collected the test suites and used them in our experiment. For the purpose of our investigation, we consider only the ones that have at least one reported fault. The faults considered in the study were reported by testers, also using Testlink. Thus, our models and faults are:

• Biometric Collector

– CB01: which contains two faults, one revealed by two test cases, and the other by a single test case;

– CB03: which contains one fault revealed by a single test case;

– CB04: which contains four faults, two revealed by two test cases each, and the remaining two revealed by a single test case each;

– CB05: which contains three faults, one revealed by two test cases, and the remaining two revealed by a single test case each;

– CB06: which contains one fault revealed by a single test case.

• Document Control


– CD01: which contains two faults, one revealed by three test cases, and the other by a single test case;

– CD02: which contains one fault revealed by a single test case;

– CD03: which contains one fault revealed by a single test case;

– CD04: which contains two faults, revealed by a single test case each;

– CD05: which contains two faults, one revealed by three test cases, and the other one by just a single test case.
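The generation criterion mentioned above ("covering loops at most twice") is not detailed in this paper; the sketch below shows one plausible reading of it as a depth-first enumeration of LTS paths in which each transition may be traversed at most twice per path. The adjacency-map layout of `lts` and this interpretation of the loop bound are our assumptions, not the testing team's actual tool.

```python
def generate_paths(lts, start, finals, max_edge_visits=2):
    """Enumerate label sequences from `start` to any state in `finals`,
    traversing each transition at most `max_edge_visits` times per path.
    `lts` maps a state to a list of (label, target) pairs."""
    paths, visits = [], {}

    def dfs(state, path):
        if state in finals and path:
            paths.append(list(path))             # record one complete test case
        for label, target in lts.get(state, []):
            edge = (state, label, target)
            if visits.get(edge, 0) < max_edge_visits:
                visits[edge] = visits.get(edge, 0) + 1
                path.append(label)
                dfs(target, path)
                path.pop()
                visits[edge] -= 1

    dfs(start, [])
    return paths
```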

Since our objective was to provide a hint and detect a related fault earlier in the testing process, each experiment object was a pair <testsuite, fault>, where testsuite is the test suite related to the selected use cases and fault is a fault recorded by executing this test suite. For instance, for CB01, we considered as objects both <CB01, Fault1> and <CB01, Fault2>. Thus, according to the list of models and faults exposed previously, we worked with 19 experiment objects. Although we are not able to provide the artifacts themselves, Table 5 exposes some metrics about them.

Experiment Design Our experimental units are the pairs <HARP algorithm, resemblance function> and the treatments are both good and bad hints. In order to increase the precision and power of the statistical analysis, we independently repeated the application of each experimental unit and treatment to every experiment object 1000 times, according to the guidelines proposed by Arcuri and Briand [38]. Figure 2 presents the experiment overview; a loop-style reading of this design is sketched right after the figure.

[Figure 2 (overview): the 10 LTS models and the fault reports yield 19 <Model, Fault> pairs and their test suites; good and bad hints are applied to HARP with the Sim and Jac resemblance functions (HARPSimGood, HARPSimBad, HARPJacGood, HARPJacBad), with 1000 repetitions for each <model, fault> pair, producing APFD values.]

Fig. 2: Experiment Design Overview
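A compact way to read the design in Figure 2 is as a nested loop over objects, resemblance functions, and hint treatments, collecting 2 x 2 x 1000 APFD samples for each of the 19 objects. The sketch below is our illustration only; `harp` and `compute_apfd` are hypothetical callables standing for the technique and the metric.

```python
import random


def run_design(objects, resemblance_fns, hints, harp, compute_apfd, reps=1000):
    """Collect APFD samples for every <test suite, fault> object, resemblance
    function, and hint quality, repeating each combination `reps` times.
    `harp(suite, hint, fn, rng)` is assumed to return a prioritized order."""
    samples = []
    for obj_id, (suite, fault) in objects.items():
        for fn_name, fn in resemblance_fns.items():        # Sim / Jac
            for quality, hint in hints[obj_id].items():     # good / bad
                for rep in range(reps):
                    order = harp(suite, hint, fn, rng=random.Random(rep))
                    samples.append((obj_id, fn_name, quality,
                                    compute_apfd(order, fault)))
    return samples
```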

5.1.1 Results and Analysis

As a first insight into the collected data, we perform a visual analysis (see Figure 3). From this figure we remark: i) HARP with good hints is more accurate, given its more compact boxplots, although some outliers appear; ii) the difference between good and bad hints is already noticeable, regardless of the applied resemblance function; iii) both resemblance functions appear to behave similarly across the levels of the variable hint quality.

[Figure 3 plots "Data from the investigated techniques": boxplots of the APFD from 1000 trials (y-axis from 0.0 to 1.0) for the techniques HARPSimBad, HARPSimGood, HARPJacBad, and HARPJacGood.]

Fig. 3: Boxplots of the raw data collected in the experiment execution.

Considering both alternatives that receive good hints (HARPSimGood and HARPJacGood), one is able to notice some outliers, marked as small hollow dots in the respective boxplots. This happens mainly because HARP's candidate set generation may take many iterations to select a hint-related test case; it varies based on the values sampled from the uniform random variable that guides the process. Although very uncommon, the outliers are not mistakes; therefore, they must compose the sample used to investigate the differences through the effect sizes.

Effect Size Analysis To measure the effect sizes and hence properly compare the treatments, we perform a set of pairwise A-tests. The A statistic assumes values between 0 and 1 and represents the probability that, in a direct comparison, the first alternative yields a higher value than the second.


Table 5: Metrics about models used as experiment objects

              Biometric Collector           Document Control
           CB01 CB03 CB04 CB05 CB06     CD01 CD02 CD03 CD04 CD05
Branches      9    9   11   17    4        7    7    5    2    9
Joins         3    3    4   11    0        4    4    5    2    9
Loops         2    2    1    6    0        2    2    0    1    0
Max Depth    12   20   36   32   14       22   22   12   14   12
Min Depth     8    4   14   12    6       10   10   10   14   10
Gen. TC      16   14   16   67    3       10   10    5    3    9

Table 6 contains each of the calculated values. Regarding the statistic value (third column), we can interpret the effect size as [39] (a minimal sketch of the statistic follows this list):

• large, if the statistic is greater than 0.71 or less than 0.29;

• medium, if the statistic is greater than 0.64 or less than 0.36;

• small or negligible, otherwise.
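The sketch below shows the A statistic used in these comparisons (the Vargha-Delaney A measure [39]) together with the interpretation thresholds listed above; the helper names are ours.

```python
def a12(sample_a, sample_b):
    """Vargha-Delaney A statistic: probability that a value drawn from
    sample_a is greater than one drawn from sample_b (ties count as 0.5)."""
    greater = sum(1 for x in sample_a for y in sample_b if x > y)
    ties = sum(1 for x in sample_a for y in sample_b if x == y)
    return (greater + 0.5 * ties) / (len(sample_a) * len(sample_b))


def magnitude(a):
    """Effect size interpretation following the thresholds above [39]."""
    if a > 0.71 or a < 0.29:
        return "large"
    if a > 0.64 or a < 0.36:
        return "medium"
    return "small or negligible"
```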

Table 6: Effect sizes of pairwise comparisons

Technique A (Function - Hint)   Technique B          A12      Eff. Size
Similarity - Bad                Similarity - Good    0.0713   Large
Similarity - Bad                Jaccard - Bad        0.3628   Small
Similarity - Bad                Jaccard - Good       0.1271   Large
Similarity - Good               Jaccard - Bad        0.8484   Large
Similarity - Good               Jaccard - Good       0.5801   Small
Jaccard - Bad                   Jaccard - Good       0.2237   Large

Addressing RQ1, we evaluate the effect sizes separately for the pairs HARPSimBad — HARPSimGood and HARPJacBad — HARPJacGood. The A statistic values for these comparisons are 0.0713 and 0.2237, respectively, which suggests that the hint's quality has a large effect on HARP. Therefore, HARP performs significantly better when receiving good hints.

As a further analysis, we observe whether the investigated resemblance functions act differently during HARP's operation. To do so, we compare the effect sizes of the pairs HARPSimBad — HARPJacBad and HARPSimGood — HARPJacGood, i.e., we change the resemblance function while keeping the level of the variable hint quality. The values for the A statistic are 0.3628 and 0.5801, respectively. Both values indicate a small effect size, although the first one is close to the medium effect size threshold. Therefore, we do not have enough evidence to affirm that either function is better or more suitable for HARP; in other words, both functions express the same notion of resemblance.

Remarks Since we suggest using HARP only with hints pointed out by a majority of the team members, allied with the results of the questionnaire reported in Section 5.2, we believe that situations in which a bad hint misguides HARP's operation are minimized. As a further analysis, through a simple visual inspection, one can note a smaller interquartile distance when comparing the pairs (HARPSimGood, HARPJacGood) and (HARPSimBad, HARPJacBad), favoring the alternative equipped with the similarity function. Besides being already focused on MBT, we suggest the investigated similarity function as the more suitable for our context.

5.1.2 Validity Evaluation

We evaluate the validity of our experiment by discussing its threats and how we deal with them. Concerning the internal validity, the variation of APFD might also be affected by random choices during the techniques' execution. In order to deal with this threat, we repeated the executions according to Arcuri and Briand's work [38]. Besides, the execution order of the techniques in every repetition is defined randomly.

About construct validity, since we generate the hints based on prior knowledge about the faults and the test cases that fail, the process of defining good and bad hints may be biased. Aiming to mitigate this risk, we provide a systematic procedure to do so and we submit both good and bad hints to the same treatments in order to expose them evenly.

Concerning the external validity, even though we use real and industrial systems, a sample with only two systems is not enough to provide proper generalization. Although we are not able to generalize the results due to the sample size, we tried to be as realistic as possible by using models that represent the investigated applications, so we believe that similar conclusions would hold for similar systems.

5.2 Questionnaire

Introducing the idea of getting information from the development team, we apply a questionnaire [40] to collect indications about portions of the system that suffered some occurrence during development (these indications are the hints) and to investigate whether these hints are related to faults. This is an observational method that helps to assess opinions and attitudes of a sample of interest. In this section, we discuss its methodology and results.

Participants and Investigated Systems For this investigation, our population of interest is small development teams that interact with both industry and academia. Therefore, we consider teams from two industrial projects of Ingenico do Brasil (www.ingenico.com.br). SAFF is an information system that manages status reports of embedded devices, and TMA is a system that controls the execution of tests on a series of hardware parts and manages their results. More details regarding these systems, as well as their behaviors, cannot be unveiled due to a non-disclosure agreement (NDA).

Since it is hard to find real-world teams available to participate in our investigation, and as we do not intend to generalize the results to a larger population, we establish a non-probability sampling strategy [41]. With that sample, we believe that teams working in research laboratories in partnership with companies are well represented. Therefore, four members from SAFF and three from TMA compose our sample, as reported in Figure 4a. They present different maturity levels, varying from undergraduate students with development experience to professionals. SAFF is an ongoing project and has been under development for the last two years, whereas TMA has been developed over the last seven months. However, one participant of each team started working on the project after it had started, which balances the participants' experience in the projects. Figure 4b summarizes the time of activity of the participants in their projects. It should be noted that the participants acted in the development of the investigated use cases using the model-based process described by Jorge et al. [42]; in other words, the participants designed and/or implemented the use cases considered in this study, therefore they actually faced the reported problems.

[Figure 4 has two panels: (a) the participants' positions in the projects (managers, analysts, and developers) and (b) the time in the project, in months, for the SAFF and TMA teams.]

Fig. 4: Summary about the questionnaire's participants.



To select the target use cases, we checked the requirement documents of both systems, which are described using a natural-language-based use case notation, and the previous system testing execution results. Among them, we selected use cases that have at least one reported fault. Among the considered use cases, and aiming to keep a balanced design, we selected two of them for each system. Note that there are seven participants and that different numbers of participants from each team took part. This happened because, whereas for TMA the three participants worked on both use cases, for SAFF one developer worked on one of the investigated use cases and another developer on the other, which leads to four participants. To illustrate the dimension of the use cases used here, Figure 5 depicts one of them. It comprises four execution flows (one base flow, two exception flows, and one alternative flow) and the exact number of transitions and states of the original document, but with obfuscated labels in order to hide the actual behavior due to the aforementioned NDA.

[Figure 5 shows an LTS with 18 states whose transitions carry obfuscated labels from the base flow (BaseFlow_1-4), one alternative flow (AltFlow1_1-4), and two exception flows (ExFlow1_1-8 and ExFlow2_1-2).]

Fig. 5: Example of a use case used in the questionnaire, modeled as a Labeled Transition System, but with obfuscated labels.

Questionnaire In order to address RQ2 and RQ3, we devise a questionnaire intending to reveal three kinds of information:

• Experience: we ask the participant's name, his/her position in the team, and how long the participant had been working on the project. The intention is to evaluate whether the experience of the team really impacts the ability to give hints;

• Use case critical regions: we attach to the questionnaire the corresponding use case document and ask the participant to underline a region in the use case (a single step or an entire flow) that he/she believes was jeopardized during the development activities before system level testing. For that, we instruct them to rely solely on the experience acquired when implementing the use cases. Moreover, we ask them to provide possible reasons for the choice. We provide a list of reasons: Inherent complexity – when the use case presented aspects that led to a complex implementation; Neglected due to schedule – when some other demand had to be satisfied in advance and the schedule became shorter; and Already presented some sort of problem beforehand – when the indicated step or flow presented a non-expected behavior in early stages of development and unit testing. Besides, the participants are free to write down any other reason they find adequate. By comparing the answers to this question with the faults and failures reported, we are able to know whether it is really possible to gather good hints;

• How time-consuming and hard it was to perform the task: we also ask them to record the time in minutes they spent reading the attached artifacts and answering the questionnaire, and whether the participant considers it difficult to read the provided use case document and to answer the related questions. We want to evaluate whether the whole process of obtaining hints is feasible.


5.2.1 Results and Analysis

In order to address RQ2, we resort to the portion of the questionnaire in which the participants underline their hints and explain their reasons. We guide our analysis by observing whether the major part of the participants indicate the same portions for the same reason. A summary of this analysis is available in Table 7, where the boldface cells mark the situations in which the participants assign the same hint, i.e., they indicated the same portion of the use case and the same cause.

Considering Use Case 1 (UC1) from SAFF, the participants report an inherent complexity problem as a hint, but only two of them point to the same step. Analyzing the fault report related to this use case, we verify that the step pointed out by the majority of the participants is directly related to a fault. In turn, for Use Case 2 (UC2), there are divergences about the reasons for the indications. Two participants report an inherent complexity, whereas the other assigned it to problems that had already appeared during development and were not unveiled by early tests (unit testing). However, when relating it to the actual fault, all participants successfully report the same flow, which is also related to a fault. Therefore, for this use case, the participants also provide a useful hint.

Now, considering the first use case (UC1) of the TMA project, the participants report that an inherent complexity regarding the underlined region could be the source of some problem. Two of them point to the same step in the base flow, and this step is related to a fault, according to the fault reports. On the other hand, concerning Use Case 2, a different but possible scenario emerges. Every participant wrote freely that they had no difficulty developing this use case, but two of them reported the use case's precondition as the most affected region, and the other one did not provide any hint. The fault report contained a fault for this use case, but it is not related to the provided answers. In a conversation outside the investigation, they mentioned that the related precondition is hard to satisfy, since some network requirements must be met. This is a situation that requires further investigation, since it is hard to measure how a precondition can be related to fault detection.

Considering the use cases where a majority of participants indicate the same portion with the same cause, allied to the fact that every participant worked on the respective use case, in three out of four cases they provide a good hint, i.e., a region that actually contained a fault. Therefore, based on our results, we are able to infer that members of development teams with similar characteristics may provide good hints about functionalities that they have worked with.

To address RQ3, we analyzed a quantitative measure in minutes collected from the questionnaires and the answer to a question about how hard they consider the task of filling in the questionnaire. All participants found the task of finding error-prone regions in a use case easy to perform. Moreover, the whole questionnaire was answered in an average of 4.17 minutes for the SAFF use cases, and 3.67 minutes for the TMA ones.

Considering these results, we suggest that the impact of using this approach in other projects tends to be low. Similar questionnaires could easily be used in other industrial projects without much effort. We believe these questionnaires should be handed out right between the development and the system testing stages, which is the moment when the team still has fresh knowledge regarding what was recently developed. Alternatively, these hints could be collected iteratively during the development phase and then used right before test case execution to prioritize the system level test cases.

Remarks Based on the questionnaires, we can state that when most developers and managers indicate similar hints, these are often good estimators of faults and failures. Thus, we suggest that hints should only be used when pointed out by most of the developers and managers directly involved with the corresponding code, which also serves as a guideline for using HARP; a minimal aggregation sketch follows. In a tie situation about a particular hint, managers and testers could evaluate and deliberate about the case.
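The sketch below is a trivial illustration of this guideline, keeping only hints indicated by a strict majority of the respondents; the tuple encoding of an indication is our assumption, not a format prescribed by the paper.

```python
from collections import Counter


def consensual_hints(indications, team_size):
    """Keep only (use case region, reason) pairs indicated by more than half
    of the team; only these would be turned into test purposes for HARP."""
    votes = Counter(indications)
    return [hint for hint, count in votes.items() if count > team_size / 2]


# Example grounded in Table 7 (SAFF, UC1): two of three participants point
# to Base Flow/Step 4 with the same reason, so only that hint is kept.
print(consensual_hints([("UC1/Base Flow/Step 4", "Inherent Complexity"),
                        ("UC1/Base Flow/Step 3", "Inherent Complexity"),
                        ("UC1/Base Flow/Step 4", "Inherent Complexity")], 3))
```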

5.2.2 Validity Evaluation

In this section, we discuss aspects that may represent some threat to the validity of our study.


Table 7: The indications collected from the questionnaires, as well as their reasons. In bold we highlight regions where a majority is achieved.

                  Participant 1               Participant 2         Participant 3
SAFF  UC1   Base Flow/Step 4            Base Flow/Step 3      Base Flow/Step 4
            Inherent Complexity         Inherent Complexity   Inherent Complexity
      UC2   Exception Flow 1/Step 7     Exception Flow 1      Exception Flow 1
            Previous Problems           Inherent Complexity   Inherent Complexity
TMA   UC1   Alternative Flow 4/Step 1   Base Flow/Step 4      Base Flow/Step 4
            Inherent Complexity         Inherent Complexity   Inherent Complexity
      UC2   -                           Precondition          Precondition
            No Difficulty               No Difficulty         No Difficulty

For instance, when analyzing our results, we must be aware that the participants answered the questionnaire after the use cases were implemented. Therefore, we rely on their memory regarding the system. Moreover, as we intended to investigate developers from projects mixing undergraduates and professionals in cooperation with industry, we were limited to a small and selective sample.

Due to our limited sampling strategy, we are not able to generalize our results to different contexts. However, since all elements involved in this study come from real projects (e.g., real developers, systems, and use cases), we believe our results are still valid. Nonetheless, our results suggest that similar questionnaires can be used in different contexts.

5.3 Case Study

After controlling the hint's quality in a controlled environment, in order to address RQ4 we compare the performance of HARP with our baseline, which is the adaptive random prioritization technique proposed by Jiang et al. [11]. For the comparison, we consider the systems and hints provided in the questionnaire, reported in Section 5.2. The objective is to evaluate the application of actual hints in industrial systems, observing the differences between HARP and its baseline (ARTJac).

Setup We followed the system testing activities on the SAFF and TMA systems, which comprised manually executing the available test cases on the applications, which run on small terminals with a proprietary operating system, and recording their results, associating the test cases that failed with a cause described at a high level of abstraction. We then collected these artifacts and used them in our case study. To illustrate the dimensions of the executed test suites, Table 8 exposes some relevant testing-related metrics.

In order to use the provided hints in this case study, we apply the manual process of converting them into test purposes, as explained in Section 4. However, the number of hints for each system is different because there was a case in the TMA system where the participants did not report any problem; therefore, we consider just the hint for use case 1.

In this case study, just HARP uses hints to guide its operation, and both techniques take the two aforementioned test suites as input. We need to repeat the executions of these techniques because they both make random choices. Therefore, we run each technique 1000 times independently with the same input, as suggested by Arcuri and Briand [38].

As a measure of fault detection capability, instead of measuring the APFD, which takes into consideration the detection rate of all faults that the test suite is able to unveil, we use the F-Measure [22], which considers the ability to detect the first fault. The F-Measure counts the number of test cases executed before the first fault is revealed. A test case reveals a fault when it fails, in other words, when the system produces outputs different from the expected ones.
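A minimal sketch of the F-Measure as described here (counting the test cases executed before the first failing one); the function name is ours.

```python
def f_measure(order, failing):
    """Number of test cases executed before the first one that fails in the
    prioritized order: 0 when the very first test case reveals a fault, up to
    n-1 when only the last one does."""
    for executed_before, tc in enumerate(order):
        if tc in failing:
            return executed_before
    return len(order)   # no failing test case in the suite (not expected here)
```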


Table 8: Metrics about the test suites investigated in the case study. We measure the shortest and longest test cases in number of steps, expected results, and conditions that the test case describes.

System   #TC   Shortest TC   Longest TC   #Faults   #TC that fail   #TC filtered by hint
SAFF      60        3             89          9           13                  6
TMA       32        7             41          5            8                  8

We use the F-Measure instead of APFD because the participants provided just a single hint for each use case in the questionnaire, and when applying these hints to the whole test suite, it would be unfair if faults from use cases not investigated here affected the measurements.

5.3.1 Results and Analysis

A visual representation of the collected data is depicted in Figure 6. Besides HARP and ARTJac, we added the Random technique as a lower boundary for the techniques' performance. A good performance presents a low F-Measure value, which ranges from 0, when the first test case already reveals a fault, to n-1, when only the last test case reveals a fault. Since the two test suites have different sizes, we analyze them separately.

Regarding SAFF, based on how spread out the boxplots in Figure 6a are, HARP presents a more accurate performance than the other two techniques, with fewer outliers; however, visually it is not possible to conclude which of the techniques presents the lowest F-Measure. By calculating the effect size (as in Section 5.1) of the comparison between HARP and ARTJac, we obtain 0.4106 favoring HARP, but with a small effect size. This suggests small gains through the application of HARP.

On the other hand, visually analyzing the boxplots in Figure 6b, the improvements both in accuracy and in earlier fault detection are clearer. The effect size measurement of the comparison between them is 0.2442, which means a large effect favoring HARP. Thus, considering these two systems and addressing RQ4, HARP is able to outperform ARTJac when using actual hints.

[Figure 6 has two panels of boxplots of the F-Measure from 1000 trials for HARP, ARTJac, and Random: (a) results from the SAFF system (F-Measure axis from 0 to 20) and (b) results from the TMA system (F-Measure axis from 0 to 12).]

Fig. 6: Summary about the collected data.


Remarks Despite the small differences between the techniques regarding the median of the F-Measure (the lines in the center of the boxplots in Figure 6), this difference may still provide significant savings in the MBT context and in a context with manual test case execution. For instance, during the development of the systems considered in our evaluation, reports of the test case executions show they took between 8 and 60 minutes, including setup, evaluation, and recording the results using tools. In general, a single test case execution took 16 to 27 minutes, so any reduction in the number of test cases executed before revealing faults is significant from a process point of view.

5.3.2 Validity Evaluation

Here we discuss the validity of this case study by indicating existing threats and how we deal with them. We are not able to draw more general conclusions due to the low number of systems in our case study; however, both systems are real and of industrial size. Moreover, the hints were provided by the actual development teams. Although we use two hints from SAFF and one from TMA, we are sure that the hints were given by people who really worked on these use cases. Therefore, to reduce the effect of faults from the other use cases on the ability to reveal faults, we use the F-Measure instead of APFD in the case study.

6 Related Work

In this section, we discuss a series of relevant works related to ours. The TCP problem is more often associated with the code-based testing context, as discussed by Catal and Mishra [15]. HARP, on the other hand, is designed to be applied to model-based test suites, particularly when test cases are generated from transition systems. There is no consensus about the performance of TCP techniques that deal with new test cases or with the absence of historical data about previous test executions [43], which is a motivation for our investigation.

As discussed in Section 2, TCP techniques that focus on fault detection aim at estimating a surrogate for this information. Some of the most popular techniques use coverage data: code branches and statements [11, 12], and model elements [16, 17]. Korel et al. [16] propose heuristics for TCP using Extended Finite State Machines (EFSM). These heuristics take into consideration modifications performed during the system's evolution, hence the work focuses on regression testing. As results, the authors suggest that some aspects may not significantly affect the performance of the proposed heuristics, for instance, the number of times that a modified transition of the EFSM is executed by test cases. HARP focuses on the general TCP problem, not relying on a history of modifications.

Sapna and Mohanty [17] propose a technique to prioritize test cases generated from UML Activity Diagrams based on a fixed relation of weights for the diagram's elements. Fork/join nodes have more weight than branch/merge nodes, whereas branch/merge nodes have more weight than action nodes. The authors define this relation giving more priority to test cases that represent main scenarios of the system, which cover fewer loops, branches, and forks. Another technique suitable for activity diagrams is proposed by Kaur et al. [10], which also calculates weights, but taking into consideration the number of incoming and outgoing flows of every model element. Contrary to HARP, these techniques take the model structure into account.

Kundu et al. [20] propose STOOP, a technique that manipulates UML Sequence Diagrams. It generates test cases from input models and prioritizes the generated test cases based on structural elements of the sequence diagrams, such as messages and edges. They also propose metrics for three prioritization objectives: code coverage, confidence in reliability, and the composition of both. After calculating the metrics for every test case, according to the desired objective, the technique sorts the generated test cases in decreasing order. Contrary to HARP, the technique couples the test case generation and test case prioritization processes and focuses only on structural elements.

Other studies introduce the idea of collecting data from experts, i.e., people involved with the development of the system. Stallbaum et al. [18] propose the augmentation of UML activity diagrams with values related to the probability that specific actions in the diagram contain a fault, as well as values associated with the damage caused if a fault were revealed in that action, deriving risk scores by multiplying them. This work is closely related to our strategy because both acquire information from specialists. However, HARP considers a different prioritization objective, fault detection in contrast to risk prevention, and a non-numeric input from the specialists.

Another source of information presented in the literature is a set of pairwise comparisons performed by an expert involving a subset of the provided test cases, which guides the operation of the techniques [24, 25]. The former is focused on risk-based strategies and the latter demands a high degree of interaction with the specialists, because of the number of comparisons that the expert must perform. On the other hand, HARP does not demand a high degree of interaction from experts, as discussed in Section 5.2.

It is also worth mentioning some works related to ARP. Jiang and Chan [44], who have actively contributed to this field, tried to overcome the dependence on code elements by proposing the use of black-box information to guide the prioritization. Their work provides evidence that this strategy can deliver good results in the MBT context, since it uses an abstraction of the source-code representation. Other research that uses the adaptive random strategy, but focused on regression testing, is performed by Schwartz and Do [45]. They propose a technique also based on the adaptive random strategy, but using the Analytic Hierarchy Process (AHP) in order to create importance levels among the tested components, using expert knowledge and data from previous testing executions.

There is a field of investigation related to this research, namely defect prediction [46]. Xu et al. [26] compare 32 methods to point out the system features more prone to contain faults. The investigated methods use diverse underlying theories to build their prediction models, such as statistics and probabilities, clustering, and logistic regression. All of them share the common assumption that historical data about source code metrics, test case coverage, and fault reports is available. The methods, in turn, divide the data into training and testing sets; their performances are compared, in a detailed empirical comparison, through the area under the ROC curve, which relates the rates of true positives and negatives. The authors claim that cluster-based methods can achieve good results while retrieving a smaller set of potentially faulty system features. Since we assume that historical data is not available, we are not able to use any of these methods, but we understand their importance as a source of information to prioritize regression test cases. Radjenovic et al. [27] surveyed other metrics used to estimate the parts of the system more prone to fail, such as object-oriented metrics, traditional source code metrics, and process metrics. Some of them, mainly the process metrics, also depend on historical information, which is not available in the context of our research.

To the best of our knowledge, there is no other initiative applying ARP in the MBT context. In previous work, we conducted a set of studies, as already discussed in Section 1 [21], including two variants, with the Jaccard [11] and Manhattan [22] distance functions. Besides our main results, in the referred context the Jaccard function performed better than the Manhattan one. This result provides evidence that code-based results cannot be generalized to model-based ones, since our results contradict the ones presented by Zhou et al. [32], which encourages research in both contexts.

7 Conclusions

Test Case Prioritization (TCP) is an activity dedicated to optimizing test case execution by rescheduling the test cases aiming at achieving some testing goal. In this paper, we propose a TCP technique, named HARP, for MBT system level test suites, in a context where historical information regarding failures/faults may not be available. In order to guide prioritization, our technique uses information derived from the expertise of development teams, named "hints". We represent these hints as test purposes that work as filters, accepting test cases related to the represented information. HARP is based on ARP, which focuses on exploring the input test cases using a distance notion.

In the first study of our empirical investigation, we control the quality of the hints applied to HARP, measuring the fault detection capability through the APFD metric. As results, we provide evidence that HARP performs significantly better when a good hint is provided instead of a bad one. In other words, the random choices in HARP do not significantly affect the hints' guidance. Yet, as a secondary analysis, we suggest that different distance and/or similarity functions may capture the same resemblance notion.

Besides, we applied a questionnaire in which we ask participants to indicate steps or flows in use case models representing functionalities that, for example, suffered some unexpected schedule modification or presented an inherent complexity to their developers, and we related these indications to actual faults. Our results suggest that the questionnaire is a quick and easy way to collect that kind of information. Moreover, comparing the provided answers to the fault records of the investigated use cases, we observed that the participants were able to provide good hints. In other words, they were able to provide indications that actually point to faults. In addition, we defined guidelines for using these indications as hints in HARP: if the majority of the team indicates a common aspect of the system, this particular hint should be translated into a test purpose and used in HARP.

Furthermore, we also performed a case study comparing HARP to its baseline. Results suggest that applying the hints proposed by the development team leads to gains with respect to the ability to detect faults. In both investigated systems, HARP presented improvements over the adaptive random prioritization proposed by Jiang et al. [11]. Even though the measured gains in the first case are not statistically significant, in both situations the practical gains, considering the execution time of the test cases, are considerable, since in our context the test cases are executed manually.

As future work, we intend to evaluate HARP on a broader variety of systems, against different TCP techniques, and enabling developers and managers to give hints about multiple portions of the system simultaneously, aiming at increasing the validity of our technique. Moreover, we intend to perform another experiment using HARP, but involving different distance/similarity functions. The idea is to observe whether these functions represent different resemblance notions and to quantify the effect of these notions on HARP. We also believe that using defect prediction techniques might enable HARP to be used in other contexts, such as code-based regression testing.

References

[1] I. Sommerville, Software Engineering. New York: Addison Wesley, 8th ed., 2006.

[2] B. Beizer, Software Testing Techniques. International Thomson Computer Press, 2nd ed., June 1990.

[3] P. Ammann and J. Offutt, Introduction to Software Testing. Cambridge University Press, 2008.

[4] M. Utting and B. Legeard, Practical Model-Based Testing: A Tools Approach. Morgan Kaufmann, 1st ed., 2007.

[5] H. Hemmati, A. Arcuri, and L. C. Briand, "Achieving scalable model-based testing through test case diversity," ACM Trans. Softw. Eng. Methodol., vol. 22, no. 1, p. 6, 2013.

[6] E. G. Cartaxo, F. G. O. Neto, and P. D. L. Machado, "Test case selection using similarity function," in MOTES 2007, pp. 381-386, 2007.

[7] M. J. Harrold, R. Gupta, and M. L. Soffa, "A methodology for controlling the size of a test suite," ACM Trans. Softw. Eng. Methodol., vol. 2, no. 3, pp. 270-285, 1993.

[8] A. E. V. B. Coutinho, E. G. Cartaxo, and P. D. L. Machado, "Analysis of distance functions for similarity-based test suite reduction in the context of model-based testing," Software Quality Journal, pp. 1-39, 2014.

[9] S. Sampath and R. C. Bryce, "Improving the effectiveness of test suite reduction for user-session-based testing of web applications," Information and Software Technology, vol. 54, no. 7, pp. 724-738, 2012.


[10] P. Kaur, P. Bansal, and R. Sibal, "Prioritization of test scenarios derived from UML activity diagram using path complexity," in CUBE (V. Potdar and D. Mukhopadhyay, eds.), pp. 355-359, ACM, 2012.

[11] B. Jiang, Z. Zhang, W. K. Chan, and T. H. Tse, "Adaptive random test case prioritization," in ASE, pp. 233-244, IEEE Computer Society, 2009.

[12] S. G. Elbaum, A. G. Malishevsky, and G. Rothermel, "Test case prioritization: A family of empirical studies," IEEE Transactions on Software Engineering, February 2002.

[13] D. Jeffrey and N. Gupta, "Experiments with test case prioritization using relevant slices," Journal of Systems and Software, vol. 81, no. 2, pp. 196-221, 2008.

[14] G. Rothermel, R. Untch, C. Chu, and M. Harrold, "Test case prioritization: an empirical study," in Software Maintenance, 1999 (ICSM '99), Proceedings of the IEEE International Conference on, pp. 179-188, 1999.

[15] C. Catal and D. Mishra, "Test case prioritization: a systematic mapping study," Software Quality Journal, pp. 1-34, 2012.

[16] B. Korel, G. Koutsogiannakis, and L. Tahat, "Application of system models in regression test suite prioritization," in IEEE International Conference on Software Maintenance, pp. 247-256, 2008.

[17] P. Sapna and H. Mohanty, "Prioritization of scenarios based on UML activity diagrams," in Computational Intelligence, Communication Systems and Networks, 2009 (CICSYN '09), First International Conference on, pp. 271-276, July 2009.

[18] H. Stallbaum, A. Metzger, and K. Pohl, "An automated technique for risk-based test case generation and prioritization," in AST '08: Proceedings of the 3rd International Workshop on Automation of Software Test, (New York, NY, USA), pp. 67-70, ACM, 2008.

[19] F. M. Nejad, R. Akbari, and M. M. Dejam, "Using memetic algorithms for test case prioritization in model based software testing," in 2016 1st Conference on Swarm Intelligence and Evolutionary Computation (CSIEC), pp. 142-147, March 2016.

[20] D. Kundu, M. Sarma, D. Samanta, and R. Mall, "System testing for object-oriented systems with test case prioritization," Softw. Test., Verif. Reliab., vol. 19, no. 4, pp. 297-333, 2007.

[21] J. F. S. Ouriques, E. G. Cartaxo, and P. D. L. Machado, "Revealing influence of model structure and test case profile on the prioritization of test cases in the context of model-based testing," Journal of Software Engineering Research and Development, 2015.

[22] Z. Q. Zhou, "Using coverage information to guide test case selection in adaptive random testing," in Computer Software and Applications Conference Workshops (COMPSACW), 2010 IEEE 34th Annual, pp. 208-213, July 2010.

[23] Y. Lu, Y. Lou, S. Cheng, L. Zhang, D. Hao, Y. Zhou, and L. Zhang, "How does regression test prioritization perform in real-world software evolution?," in Proceedings of the 38th International Conference on Software Engineering, ICSE '16, (New York, NY, USA), pp. 535-546, ACM, 2016.

[24] S. Yoo, M. Harman, P. Tonella, and A. Susi, "Clustering test cases to achieve effective and scalable prioritisation incorporating expert knowledge," in Proceedings of the Eighteenth International Symposium on Software Testing and Analysis, ISSTA '09, (New York, NY, USA), pp. 201-212, ACM, 2009.

[25] P. Tonella, P. Avesani, and A. Susi, "Using the case-based ranking methodology for test case prioritization," in Software Maintenance, 2006 (ICSM '06), 22nd IEEE International Conference on, pp. 123-133, Sept 2006.

[26] Z. Xu, J. Liu, Z. Yang, G. An, and X. Jia, "The impact of feature selection on defect prediction performance: An empirical comparison," in 2016 IEEE 27th International Symposium on Software Reliability Engineering (ISSRE), pp. 309-320, Oct 2016.

[27] D. Radjenovic, M. Hericko, R. Torkar, and A. Zivkovic, "Software fault prediction metrics," Inf. Softw. Technol., vol. 55, pp. 1397-1418, Aug. 2013.

[28] S. Elbaum, A. G. Malishevsky, and G. Rothermel, "Prioritizing test cases for regression testing," in Proceedings of the 2000 ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA '00, (New York, NY, USA), pp. 102-112, ACM, 2000.

[29] L. Lima, J. Iyoda, A. Sampaio, and E. Aranha, "Test case prioritization based on data reuse: an experimental study," in Proceedings of the 2009 3rd International Symposium on Empirical Software Engineering and Measurement, ESEM '09, (Washington, DC, USA), pp. 279-290, IEEE Computer Society, 2009.

[30] T. Y. Chen, H. Leung, and I. K. Mak, "Adaptive random testing," in Advances in Computer Science - ASIAN 2004, vol. 3321 of Lecture Notes in Computer Science, pp. 320-329, Springer Berlin / Heidelberg, 2004.

[31] P. Jaccard, "Etude comparative de la distribution florale dans une portion des Alpes et des Jura," Bulletin de la Societe Vaudoise des Sciences Naturelles, vol. 37, pp. 547-579, 1901.

[32] Z. Q. Zhou, A. Sinaga, and W. Susilo, "On the fault-detection capabilities of adaptive random test case prioritization: Case studies with large test suites," in HICSS, pp. 5584-5593, IEEE Computer Society, 2012.

[33] R. G. de Vries and J. Tretmans, "On-the-fly conformance testing using SPIN," STTT, vol. 2, no. 4, pp. 382-393, 2000.

[34] J. Vain, K. Raiend, A. Kull, and J. P. Ernits, "Synthesis of test purpose directed reactive planning tester for nondeterministic systems," in ASE (R. E. K. Stirewalt, A. Egyed, and B. Fischer, eds.), pp. 363-372, ACM, 2007.

[35] C. Jard and T. Jeron, "TGV: theory, principles and algorithms: A tool for the automatic synthesis of conformance test cases for non-deterministic reactive systems," Int. J. Softw. Tools Technol. Transf., vol. 7, pp. 297-315, Aug. 2005.

[36] E. G. Cartaxo, W. L. Andrade, F. G. O. Neto, and P. D. L. Machado, "LTS-BT: a tool to generate and select functional test cases for embedded systems," in SAC '08: Proc. of the 2008 ACM Symposium on Applied Computing, vol. 2, (New York, NY, USA), pp. 1540-1544, ACM, 2008.

[37] C. Wohlin, P. Runeson, M. Host, M. C. Ohlsson, B. Regnell, and A. Wesslen, Experimentation in Software Engineering: An Introduction. Norwell, MA, USA: Kluwer Academic Publishers, 2000.

[38] A. Arcuri and L. Briand, "A practical guide for using statistical tests to assess randomized algorithms in software engineering," in Software Engineering (ICSE), 2011 33rd International Conference on, pp. 1-10, May 2011.

[39] A. Vargha and H. D. Delaney, "A critique and improvement of the CL common language effect size statistics of McGraw and Wong," Journal of Educational and Behavioral Statistics, vol. 25, no. 2, pp. 101-132, 2000.

[40] A. Adams and A. L. Cox, Questionnaires, in-depth interviews and focus groups. 2008.

[41] M. Kasunic, "Designing an effective survey," tech. rep., Carnegie Mellon University, September 2005.

[42] D. Jorge, P. D. L. Machado, F. G. O. Neto, and A. E. V. B. Coutinho, "Integrando teste baseado em modelos no desenvolvimento de uma aplicacao industrial: Beneficios e desafios," in 8th Brazilian Workshop on Systematic and Automated Software Testing (W. de Lucena Andrade and V. A. de Santiago Junior, eds.), vol. 2, pp. 101-106, 2014.

[43] C. Henard, M. Papadakis, M. Harman, Y. Jia, and Y. Le Traon, "Comparing white-box and black-box test prioritization," in Proceedings of the 38th International Conference on Software Engineering, ICSE '16, (New York, NY, USA), pp. 523-534, ACM, 2016.

[44] B. Jiang and W. Chan, "Bypassing code coverage approximation limitations via effective input-based randomized test case prioritization," in Computer Software and Applications Conference (COMPSAC), 2013 IEEE 37th Annual, pp. 190-199, July 2013.

[45] A. Schwartz and H. Do, "Cost-effective regression testing through adaptive test prioritization strategies," Journal of Systems and Software, 2016.

[46] C. Catal, "Software fault prediction: A literature review and current trends," Expert Systems with Applications, vol. 38, no. 4, pp. 4626-4636, 2011.

Joao Felipe S. Ouriques has been a Ph.D. researcher at the Federal University of Campina Grande (UFCG) since 2012. He holds Master and Bachelor degrees in Computer Science from the Federal University of Campina Grande (UFCG), obtained in 2012 and 2010, respectively. His research interests include software engineering, especially software testing and Model-Based Testing. In software testing, his focus is on the empirical investigation of approaches for test case prioritization.

Emanuela G. Cartaxo is an Associate member of the Software Practices Lab at the Federal University of Campina Grande. She received her Doctor degree in Computer Science from the Federal University of Campina Grande, Brazil, in 2011. Her doctoral research was done at the Federal University of Campina Grande in cooperation with the Italian National Research Council - CNR (Pisa) and the Motorola Brazil Test Center - Research and Development. In 2013, she did post-doctoral research in the Embedded Systems Quality Assurance Department at Fraunhofer IESE, Germany. Since 2004, she has worked with software testing with a main focus on Model-Based Testing. Her interests include Software Engineering, especially Software Testing. In Software Testing, her main interest is in Model-Based Testing approaches for test case generation and for controlling the costs of execution.

Everton L. G. Alves has been a Professor in the Computing and Systems Department at the Federal University of Campina Grande (UFCG), Brazil, since 2016. He received his Doctorate degree in Computer Science from the Federal University of Campina Grande, Brazil, in 2015, and his Master and Bachelor degrees in Computer Science also from the Federal University of Campina Grande, Brazil, in 2011 and 2009, respectively. Between 2013 and 2014 he worked as a visiting researcher at the Department of Electrical and Computer Engineering at the University of Texas at Austin. His main interests include software maintenance, regression testing, model-driven development and testing, real-time systems, and integration.

Patricia D. L. Machado has been a Professor in the Computing and Systems Department at the Federal University of Campina Grande (UFCG), Brazil, since 1995. She received her PhD degree in Computer Science from the University of Edinburgh, UK, in 2001, her Master degree in Computer Science from the Federal University of Pernambuco, Brazil, in 1994, and her Bachelor degree in Computer Science from the Federal University of Paraiba, Brazil, in 1992. Her research interests include software testing, formal methods, mobile computing, component-based software development, and model-driven development. Since 1998, she has produced a number of contributions in the area of software testing, including research projects, publications, tools, supervising, national/international cooperation, and teaching.
