UK Software Testing
Research III
5-6 September 2005
Department of Computer Science
University of Sheffield
UKTest 2005:
UK Software Testing Research III

The series of UK Software Testing Research workshops started with a workshop at the University of York in September 1999. This first event involved only invited speakers and participants and almost all were from the UK. The EPSRC-funded FORTEST network was established in 2001 and discussions during FORTEST workshops led to a second event being held in September 2003. This second workshop was open to the international testing community and all papers were reviewed by members of the programme committee. UKTest 2005 has followed this pattern.

The aim of UKTest is to bring together members of the UK Software Testing research community. It is intended to be an informal event, so presentations are relatively short (most are 20 minutes) and plenty of time has been scheduled for questions and discussion.
Organisation
• General Chair: Rob Hierons, Brunel University
• Programme Committee Chair and Local Organiser: Phil McMinn, University of Sheffield
Acknowledgements
We would like to thank our two invited speakers, Paul Gibson and Alan Richardson, for agreeing to present at UKTest 2005.
Programme Committee
• Paul Baker, Motorola
• Tony Cowling, University of Sheffield
• Kirill Bogdanov, University of Sheffield
• John Clark, University of York
• John Derrick, University of Sheffield
• Isabel Evans, Testing Solutions Group Ltd
• Ian Gilchrist, IPL
• Keith Harrison, Praxis
• Mark Harman, King's College London
• Rob Hierons, Brunel University
• Mike Holcombe, University of Sheffield
• Paul Krause, University of Surrey
• Phil McMinn, University of Sheffield
• Marc Roper, University of Strathclyde
• Clive Stewart, IBM
• Joachim Wegener, DaimlerChrysler
• Martin Woodward, University of Liverpool
• Hong Zhu, Oxford Brookes University
Contents
1. Empirical Software Testing

Collecting and Categorising Faults in Object-Oriented Code
Neil Walkinshaw, Marc Roper, Murray Wood

Study of the Reciprocal Collateral Coverage of Two Testing Methods
Derek Yates, Nicos Malevris

2. Formal Models and Approaches to Testing

Towards Unit Testing for Communicating Stream X-machine Systems
Joaquin Aguado, Michael Mendler

Testing from object machines in practice
Kirill Bogdanov, Mike Holcombe

A Formal Model for Test Frames
Tony Cowling

The Need for New Statistical Software Testing Models
John May, Maxim Ponomarev, Silke Kuball, Julio Gallardo

A Theory of Regression Testing for Behaviourally Compatible Object Types
Anthony Simons

Exploring test adequacy for database systems
David Willmor, Suzanne M Embury

3. Search-Based Software Testing

Automatic Software Test Data Generation For String Data Using Heuristic Search with Domain Specific Search Operators
Mohammad Alshraideh, Leonardo Bottaci

Use of branch cost functions to diversify the search for test data
Leonardo Bottaci

Testability Transformation for Efficient Automated Test Data Search in the Presence of Nesting
Phil McMinn, David Binkley, Mark Harman
4. Tools and Experience

Model-Driven Engineering Testing Environments
Paul Baker, Paul Bristow, Clive Jervis, David King, Rob Thomson

Towards the Holy Grail of Software Testing
Ian Gilchrist

Improving Fault Prediction Using Bayesian Networks for the Development of Embedded Software Applications
Elena Perez-Minana, Jean-Jacques Gras
1. Empirical Software Testing
Collecting and Categorising Faults in Object-Oriented Code

Neil Walkinshaw, Marc Roper, Murray Wood
Department of Computer and Information Sciences
University of Strathclyde, Glasgow G1 1XH, UK
nw, marc, murray@cis.strath.ac.uk

Abstract

A range of techniques exist to identify and isolate faults in object-oriented code, but development and evaluation of these techniques is hampered by the fact that there is very little information about the nature of object-oriented faults. This paper describes our experiences of using open source software projects to build up a picture of common faults in OO software. Open source projects are a rich source of such data but the identification of faults from problem reports is a non-trivial exercise. Heeding existing warnings against employing tree-based classification schemes, the attribute categorization scheme of Weyuker and Ostrand is adapted for the OO paradigm and employed to describe the 71 identified faults. These descriptions are then used to create an initial tree-based fault model. The resulting fault model is of course partial (we are limited by the types of bugs we have observed), but it provides a useful starting point for future refinement.

1 Introduction

A sound knowledge of the types of faults[1] that typically may reside in software products is essential for the development of effective validation and verification techniques. For example, effective inspection techniques rely on the presence of checklists that guide the inspector towards potentially problematic areas of code and these must be continually updated to represent the currently occurring problems; the effectiveness of a testing strategy may be evaluated by determining its power to uncover a representative distribution of defects; and the potency of static analysis tools can be increased by targeting them at revealing commonly occurring fault types.
As Ostrand and Weyuker observed, the goals of collecting and categorising faults include "evaluating the effectiveness of current software development, validation, and maintenance techniques, assessing the effectiveness of proposed new techniques, gathering information that will guide future development methods and projects" [14]. Although written twenty years ago, these goals are still valid.

[1] Throughout this paper the terms "fault", "bug" and "defect" are used synonymously. A fault is deemed to be the consequence of an error and a failure is a manifestation of a fault.

Since then object-orientation (OO) has become the dominant programming paradigm. The types of faults that can occur in procedural code (the subject of Ostrand and Weyuker's paper) are different to those that can occur in object-oriented code. However, there is very little information about the type and distribution of faults that are likely to occur within OO projects, which makes it very difficult to develop and evaluate techniques for OO software.

This paper describes our experiences of trying to build knowledge of the types and distributions of object-oriented faults that occur in typical software projects. We sampled 71 source code faults from three open-source OO projects with the initial aim of using them to produce an OO fault taxonomy. As this paper shows, there are several factors that make it virtually impossible to produce a taxonomy that is both definitive and objective. We demonstrate this with a survey of existing procedural and OO fault classification schemes and highlight some of their ambiguities. We then use a modified version of Ostrand and Weyuker's attribute categorisation scheme in an attempt to classify our own collection of faults. The final result is a tentative hierarchy of fault types.

The most important contribution of this paper is not the fault classification scheme itself, but the extension and employment of the attribute classification scheme which permits the faults to be used as a basis for multiple classification schemes. If they are simply recorded into a pre-defined hierarchy, there is no scope for reclassification.

2 Related Work

This section analyses some of the key contributions that have been made in the study and classification of faults, both in procedural and OO contexts.
2.1 Procedural Code Faults

A substantial amount of work has been carried out on the analysis of errors, faults and failures that occur in procedural code (a comprehensive summary may be found in Roper [17]). This section is not exhaustive but focuses on work that is particularly relevant to our aim of categorising object-oriented faults (i.e. categorisations including faults that could also occur within object methods, and studies that provide methodological details).

Glass. A sample of 100 'software problem reports' (SPRs) is taken from two real-life software projects and used as a basis for categorisation [10]. SPRs are studied in their 'raw' undoctored form so that as much original information as possible left by the programmer can be taken into account. Faults are allowed to 'self-categorise': having reviewed a fault, it is either assigned to a category describing its own nature or assigned to an existing category generated by a previous fault. The benefit of this approach is that it allows for new categories to be established. The final set of fault categories identified by Glass can be seen in table 1.

Ostrand and Weyuker. A sample of 156 faults is taken from a ten thousand line special-purpose editor project [14]. Faults are recorded on 'Software User Reports' and 'Change Report' forms are used to describe changes made to the product. They note (citing Thibodeau [18]) that the popular approach of classifying an error by placing it in a tree of separate types is prone to ambiguous, overlapping and incomplete categories, too many categories, and confusion of error causes, fault symptoms and actual faults. Their solution to this is to describe the problem symptoms, the actual problem discovered and the correction made, instead of slotting faults into a specific category. They use these descriptions, captured in terms of attributes and values, to develop their own classification scheme.
Importantly, they do not assign a fault to a single category but "attempt to identify the fault's characteristics in several distinct areas". The fault attributes and the corresponding values suggested in their paper are:

• Major category: Data Definition, Data Handling, Decision, Decision Processing, Documentation, System, Not an Error
• Type: Address, Control, Data, Loop, Branch
• Presence: Omitted, Superfluous, Incorrect
• Use: Initialize, Set, Update
Defect Type: Process Association
Function: Design
Interface: Low Level Design (LLD)
Checking: LLD or Code
Assignment: Code
Timing / Serialisation: LLD
Build / Package / Merge: Library Tools
Documentation: Publications
Algorithm: LLD

Table 2: ODC Defect types [7]

Perceptual bugs
Specification bugs
Abstraction bugs
Algorithmic bugs
Reuse bugs
Logical bugs
Semantic bugs
Syntactic bugs
Domain Adherence bugs

Table 3: Purchase and Winder's fault taxonomy [16]

Chillarege et al. Fault data can be used to provide feedback on the software development process. Chillarege et al. propose the Orthogonal Defect Classification (ODC) process, where signatures can be extracted from defects that point to triggers in the software process [7]. The goal of ODC is to identify the root cause of defects of a particular type so that the development process can be improved to address this cause. Chillarege et al. keep the number of possible defect types to a minimum in order to avoid confusion. These can be associated with a particular stage of the process and are shown in table 2.

2.2 Object-Oriented Code Faults

Huffman Hayes. Huffman Hayes looks at the possibility of applying fault-based testing to OO software [11]. The paper consolidates previous work on object-oriented software faults by Purchase and Winder [16], Firesmith [9] and Mirsky et al. [13], although the empirical basis for this, and the classifications upon which it is founded, are by no means clear. Every fault class is assigned a method that can be used for its detection. We have split Huffman Hayes' fault taxonomy in order to group faults by author. Faults proposed by Purchase and Winder, Firesmith and Mirsky et al. are shown in tables 3, 4 and 5 respectively.
Omitted Logic: Code is lacking which should be present
Failure to reset data: Reassignment of needed value to variable omitted
Regression error: Attempt to correct one error causes another
Documentation in error: Software and documentation conflict
Requirements inadequate: Specification of the problem insufficient to define desired solution
Patch in error: Temporary machine code change contains an error
Commentary in error: Source code comment is incorrect
IF statement too simple: Not all conditions necessary for an IF statement
Referenced wrong data variable: Self-explanatory
Data alignment error: Data accessed is not the same as data desired due to using wrong set of bits
Timing error causes data loss: Shared data changed by a process at an unexpected time
Failure to initialise data: Non-preset data is referenced before a value is assigned

Table 1: Fault categories reported by Glass [10]

Errors associated with Objects:
  Abstraction violated
  Persistence problems
  Documentation out of sync with code
  Incorrect state model
  Invariants violated
  Failures linked to instantiation and destruction
  Concurrency problems
  Failure to meet requirements of the object
  Syntax errors
  Failures associated with messages, exceptions, attributes or operations
Errors associated with Classes:
  Abstraction violated
  Documentation out of sync with code
  Incorrect state model
  Invariants violated
  Failures linked to instantiation and destruction
  Failures associated with inheritance
  Failure to meet requirements of the class
  Syntax errors
  Failures associated with messages, exceptions, attributes or operations
Errors associated with Scenarios:
  Failure to meet requirements of the scenario
  Correct message passed to the wrong object
  Incorrect message passed to the right object
  Correct exception raised to wrong object
  Incorrect exception raised to right object
  Failures linked to instantiation and destruction
  Concurrency problems
  Inadequate performance and missed deadlines

Table 4: Firesmith's fault taxonomy [9]
Errors with Encapsulation:
  Public interface to class not via class methods
  Implicit class to class communication
  Access module's data structure from outside
  Overuse of friend/protected mechanisms
Errors associated with Modularity:
  Method not used
  Public method not used by object users
  Instance not used
  Excessively large number of methods in class
  Too many instance variables in class
  Excessively long method
  Excessively large module
Errors with Hierarchy:
  Branching errors
  Dead-ends and cycles
  Multiple inheritance errors
  Improper placement
Errors with Abstraction:
  Class contains non-local method
  Incomplete (non-exhaustive) specialisation

Table 5: Mirsky et al.'s fault taxonomy [13]

Inconsistent Type Use (context swapping)
State Definition Anomaly (possible post-condition violation)
State Definition Inconsistency (due to state-variable hiding)
State Defined Incorrectly (possible post-condition violation)
Indirect Inconsistent State Definition

Table 6: Alexander et al.'s inheritance / polymorphism related fault categories

Alexander et al. Concentrate on the faults that can be introduced by inheritance and polymorphism and provide the 'syntactic patterns' which cause them [4]. The proposed fault categories are listed in table 6. The syntactic patterns also open up the possibility of using static-analysis tools to produce empirical results, which would be very tedious to generate manually.

Younessi. Provides a higher-level overview of paradigm features that cause problems when trying to detect faults [19]. Although no concrete fault classifications are proposed, a useful list of issues that need to be considered when dealing with object-oriented defects is provided. This is shown in table 7.

Abstraction: Reduces observability; Partial/distributed implementation
Encapsulation: Scope escalation; Hierarchy integration
Genericity: Type / behaviour variability
Inheritance: Substitutability problem; Mixing inheritance styles; Deeply nested hierarchies; Multiple inheritance
Polymorphism: Incorrect binding in a homogeneous hierarchy; Server-side change

Table 7: Object-oriented features likely to cause faults as listed by Younessi

Dunsmore et al. Dunsmore et al. [8] produce a set of defect characteristics (in the vein of Ostrand and Weyuker [14]), where a single defect can be assigned several attributes. This was used to measure the effectiveness of different code reading techniques at detecting delocalised faults (faults that cannot be detected by looking at a module of code in isolation of the rest of the system). The set of characteristics is listed in table 8.

Binder. "The number of places to look is infinite for practical purposes, so any rational testing strategy must be guided by a fault model." [6]. Binder devotes a chapter of his testing book to the "bug hazards" of object-oriented code, providing detailed explanations of why faults can occur. He also provides a fault taxonomy which is provided in table 9 (we only list the implementation faults, although he also provides requirements, design and process faults), although again the empirical basis for this is uncertain.

3 Collecting Faults

This section describes our experiences of collecting, identifying and categorising faults. The open source projects used are identified, and the process for deriving a fault from its description is described. The potential problems of trying to immediately allocate a fault into a hierarchy are illustrated and a modified attribute classification scheme is proposed.

3.1 Project Choice

Open-source software foundries such as Sourceforge are a potent resource for discovering the nature of
Use of library class: Requires understanding of class libraries
Wrong object used: Sending message to wrong object
Wrong method called: Sending incorrect message
Incorrect parameter in method call: Incorrect parameters in method call
Algorithm computation: Error in algorithm
Data flow error: Incorrect / missing variable or incorrect value
Specification clash: Clash with specification
Omission: Missing code
Commission: Incorrect or superfluous code
Locality (area of code required to be looked at to spot the defect): Method, Class or System
Method size (size of method with defect present): Small (0-4 lines of code), Medium (5-10 lines of code) or Large (11+ lines of code)
Sequence diagram clash: Defect clashes with sequence diagram given to use-case inspectors

Table 8: Dunsmore et al.'s defect characteristics

software faults. Their relatively recent appearance eliminates what used to be a significant issue of obtaining 'real-life' software for academic analysis. As far as defects are concerned, tracking tools such as Bugzilla or the basic Sourceforge bug-tracking system are valuable because they record comments about the nature of any fault / failure from both the developer and user.

Three projects were chosen to demonstrate how to collect faults using our fault scheme. The criteria for the choices were that they must be open-source, object-oriented and must maintain a relatively detailed bug-reporting system. An example of a bug report from one of the systems is shown in figure 1. The three project choices were:

• Apache Ant - A Java-based build tool [1] (17 faults)
• JHotDraw - A Java GUI framework for semantic drawing editors [3] (26 faults)
• JFreeChart - A Java class library for generating charts [2] (28 faults)

3.2 Locating a Fault

Most bug reports report a failure, but many do not explicitly detail the actual faults in the source code. Usually there is not a clear one-to-one mapping between failures and faults. In this paper we establish this mapping by referring to a fault as the fix required to eliminate a particular failure[2].

Whilst collecting fault data, the fix to a failure often had to be investigated by the author if the bug report was not detailed enough. The simplest approach to determining a fix is to use a file differencing tool on subsequent CVS versions of the source code. The Eclipse IDE is particularly suitable for this task because it supports both CVS and file differencing [15]. This is illustrated in figure 2, where the two connected boxes in the middle highlight source code that has changed from one version to the next.

3.3 The Dangers of Premature Classification

Initially this study was carried out in order to validate a speculative fault hierarchy which was intended to summarise each fault by its position in the hierarchy. Hierarchies often seem the intuitive approach to categorising such data and it is tempting to try and use them. However, as has been noted earlier, the problem with software fault hierarchies is that they are prone to ambiguous, overlapping and incomplete categories [18, 10, 14, 5]. We illustrate this with an example inspired by one of the bug reports: Figure 3 shows a superfluous method call in an argument. If the type returned by getParent() (in the faulty version) is different from comp (in the correct version) and

[2] It is important to note that there are usually several possible fixes to a given fault. Fixes may also introduce new faults into the system. Without a specification, however, it must be assumed that (at the time of correction) the fix is as close as we will get to the developer's notion of a correct system.
Method Scope
  General:
    Message sent to object without corresponding method
    Unreachable code
    Contract violation (precondition/postcondition/invariant)
    Message sent to wrong server object
    Message priority incorrect
    Message not implemented in the server
    Formal and actual message parameters inconsistent
    Syntax error
  Algorithm:
    Inefficient, too slow, consumes too much memory
    Incorrect output
    Incorrect accuracy, excessive numerical or rounding error
    Persistence incorrect - wrong object saved or not saved
    Does not terminate
  Exceptions:
    Exception missing
    Exception incorrect
    Exception not caught
    Incorrect catch
    Exception propagates out of scope
    Exception not raised
    Incorrect state after exception
  Instance variable define/use:
    Missing object (referred to, but not defined)
    Unused object (defined but no reference)
    Corruption / inconsistent usage by friend function
    Missing initialization, incorrect constructor
    Incorrect type coercion
    Server contract violated
    Incorrect or missing unit (e.g. grams vs. ounces)
    Incorrect visibility scoping
    Incorrect serialisation, resulting in a corrupted state
    Insufficient precision / range on scalar type
Class Scope
  Incorrect method under multiple inheritance due to error
  Incorrect constructor or destructor
  Incorrect parameter(s) used in generic class
  Abstract class instantiated
  Syntax errors
  Association not implemented
Cluster / Subsystem Scope
  Incorrect priority
  Incorrect serialisation
  Message sent to destroyed object
  Inconsistent garbage collection
  Incorrect message / right object
  Correct exception / wrong object
  Wrong exception / right object
  Incorrect resource allocation / deallocation
  Concurrency problems
  Inadequate performance
  Deadlock

Table 9: Binder's fault taxonomy
Figure 1: Example Bug Report from JFreeChart
Figure 2: Comparing subsequent file versions in Eclipse
popUp(comp.getParent(), newLocation);

instead of

popUp(comp, newLocation);

Figure 3: Example of a fault

the called method, popUp, is overloaded, then one of the overloading methods might be executed instead. How would this be categorised using the object-oriented fault schemes mentioned in section 2.2? Using Purchase and Winder's scheme [16] (shown in table 3), it would probably be most appropriate to list this as an 'algorithmic bug'. Using Firesmith's scheme (shown in table 4), it would be classified as a 'failure associated with messages, exceptions, attributes or operations' and 'incorrect message passed to right object'. Using Mirsky et al.'s scheme (shown in table 5) there is no applicable category. Using Dunsmore et al.'s scheme (shown in table 8) it could be classified as 'wrong method called', 'incorrect parameter in method call' and 'commission'. Using Binder's taxonomy in table 9 the fault could be categorised under 'message sent to wrong server object', 'incorrect output' and 'server contract violated'. It is clear that categorisation is certainly not straightforward using any of these approaches because we have the (by no means uncommon) situation where either there is no fault category tailored to this particular fault, or several may be applicable. This is a dangerous situation since, through inaccurate assignment of faults, it leads to the population of inaccurate and unrepresentative hierarchies, which has severe implications for their use in the development and evaluation of verification and validation techniques.

3.4 Recording Faults

Ostrand and Weyuker suggest that their attribute categorisation scheme is more flexible and accurate than inserting faults into a hierarchy [14]. It must be noted though that attribute categorisation schemes are not without their problems: the ambiguities that arise when using fault hierarchies can still persist if individual attribute values are not orthogonal. As with the hierarchies, such ambiguities can result in inconsistent data entries.
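The overload hazard behind the Figure 3 example can be sketched in code. The classes and popUp overloads below are illustrative stand-ins, not the actual code from the bug report: the superfluous getParent() call changes the compile-time type of the first argument, so Java's overload resolution silently binds the call to a different method without any compile error.

```java
// Illustrative sketch of the Figure 3 fault; all names are hypothetical.
public class OverloadHazard {

    static class Component {
        Container getParent() { return new Container(); }
    }

    static class Container extends Component { }

    // Two overloads: which one runs is decided at compile time
    // from the static type of the first argument.
    static String popUp(Component comp, int newLocation) {
        return "popUp(Component)";
    }

    static String popUp(Container comp, int newLocation) {
        return "popUp(Container)";
    }

    public static void main(String[] args) {
        Component comp = new Component();
        // Correct call: binds to popUp(Component, int).
        System.out.println(popUp(comp, 0));             // popUp(Component)
        // Faulty call: the superfluous getParent() yields a Container,
        // so the other overload is selected silently.
        System.out.println(popUp(comp.getParent(), 0)); // popUp(Container)
    }
}
```

This is why the fault is hard to place in a single category: syntactically it is a superfluous call, but its observable effect is an incorrect message sent to the right object.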
The lack of restriction on classifications can pose problems as well. Depending on the number of attributes and values, the number of possible combinations used to describe faults is potentially vast, making collected data difficult to analyse.

This restriction can however also be seen as a degree of freedom. Allowing for the definition of faults by their attributes and values without a predefined set of categories makes this a very flexible technique to use in object-oriented programming, where little fault data has been collected. It allows for the potential use of data clustering techniques to discover new fault categories.

For our data collection we expanded Ostrand and Weyuker's original attributes (see section 2.1) to include some of the classifications mentioned in section 2.2 in order to adequately describe OO faults (remember that Weyuker and Ostrand's scheme was devised prior to the widespread adoption of OO programming). This resulted in the introduction of the "Scope" attribute and its associated values. New values were generated in the vein of Glass's self-categorising approach (see section 2.1): if a particular value didn't exist for a given fault attribute, it was created. Also, some of the original values were unused and have been dropped[3] from the scheme, or have been modified to more accurately reflect the OO paradigm (e.g. within the "Type" attribute, the "Data" value has been split into "Data Value" and "Data Type"). The final scheme is shown in table 10, where bold items represent additions or modifications to the original Ostrand and Weyuker scheme. Using the enhanced categorisation scheme introduced in table 10, the example fault described earlier would be categorised as follows:

• Major Category: Call (because the fault is a superfluous call)
• Type: Message (the type of fault is an erroneous message being sent from one object to another)
• Presence: Superfluous (to fix the fault the call must be removed)
• Use: Argument (the call is made in the context of an argument within another call)
• Scope: Method (the fix only affects code within a single method)

This attribute categorisation approach has been applied to the sample of 71 faults harvested from the three projects.
Tentative groupings were established by recording faults in a spreadsheet, where each row is a fault, and ordering the data so that similar rows are adjacent to each other (this notion of similarity is a subjective one which gives slightly more weight to the major category attribute, but otherwise is a simple count of common values). An extract of the spreadsheet is shown in figure 4. To ensure that the groupings correctly reflect fault characteristics, reports for each of the faults in a group were compared

[3] The values were dropped mainly to prevent the scheme from becoming cluttered. It is always possible that a new fault may relate to one of these dropped values, but the flexibility of the scheme always allows for the reintroduction of such values.
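The row-ordering heuristic described above (a count of shared attribute values, with extra weight on the major category) can be sketched as follows. The weighting constant and the attribute names are assumptions for illustration; the paper does not give exact details of the weighting used.

```java
import java.util.Map;

public class FaultSimilarity {
    // A fault description is a map from attribute name to value, e.g.
    // {"Major Category" -> "Call", "Type" -> "Message", ...}.
    // Similarity = number of attributes on which two faults agree,
    // with the major category counted double (an assumed weight).
    static int similarity(Map<String, String> a, Map<String, String> b) {
        int score = 0;
        for (Map.Entry<String, String> e : a.entrySet()) {
            if (e.getValue().equals(b.get(e.getKey()))) {
                score += e.getKey().equals("Major Category") ? 2 : 1;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        Map<String, String> f1 = Map.of("Major Category", "Call",
                "Type", "Message", "Presence", "Superfluous",
                "Use", "Argument", "Scope", "Method");
        Map<String, String> f2 = Map.of("Major Category", "Call",
                "Type", "Message", "Presence", "Incorrect",
                "Use", "Argument", "Scope", "Method");
        // Agreement on four attributes, with the shared major
        // category weighted double.
        System.out.println(similarity(f1, f2)); // prints 5
    }
}
```

Sorting rows by pairwise similarity of this kind is what brings candidate groups of related faults next to each other in the spreadsheet.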
Major Category: Algorithmic, Abstract class, Call, Concurrency fault, Declaration, Encapsulation, Event, Exception, Inheritance / Polymorphism
Type: Address, Predicate, Control, Data value, Loop, Message, Data type
Presence: Incorrect, Omission, Superfluous
Use: Argument, Boolean operation, Cast, Definition, Method, Object
Scope: Method, Class, System

Table 10: Enhanced attribute categorisation scheme
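One way to make the enhanced scheme concrete is as a small data type. The sketch below is an assumption about representation, not part of the paper: it models the five attributes of Table 10 as plain strings (a fuller version would restrict them to the values listed in the table) and encodes the Figure 3 example fault.

```java
public class FaultRecord {
    // The five attributes of the enhanced categorisation scheme
    // (Table 10). Free strings here; a real implementation would
    // validate each value against the table's permitted values.
    final String majorCategory, type, presence, use, scope;

    FaultRecord(String majorCategory, String type, String presence,
                String use, String scope) {
        this.majorCategory = majorCategory;
        this.type = type;
        this.presence = presence;
        this.use = use;
        this.scope = scope;
    }

    // The popUp example of Figure 3: a superfluous call made in an
    // argument position, whose fix affects only a single method.
    static FaultRecord figure3Example() {
        return new FaultRecord("Call", "Message", "Superfluous",
                "Argument", "Method");
    }

    public static void main(String[] args) {
        FaultRecord f = figure3Example();
        System.out.println(f.majorCategory + " / " + f.type + " / "
                + f.presence + " / " + f.use + " / " + f.scope);
    }
}
```

Because a fault is a tuple of attribute values rather than a node in a fixed tree, the same records can later be regrouped under any number of alternative classification schemes.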
Figure 4: Illustration of Fault Descriptions
and a summary of the fault was produced. The (unordered) summaries are provided in table 11.

3.5 Analysing Faults

Capturing faults in the manner we have illustrated permits further analysis to be carried out. Although we have argued against the use of hierarchies for the sake of recording faults because of their ambiguities, that does not mean that they should be ruled out entirely. As demonstrated by Kuhn (albeit on logical specifications) [12], they can be useful for reasoning about the aspects of software development concerned with faults, such as test set generation for fault-based testing [11, 6].

A tentative attempt was made to further cluster the faults in table 11 and insert them into a hierarchy, which is shown in figure 5. Again this was primarily a manual process, mainly due to the relatively small number of faults, coupled with the fact that using attributes (as opposed to numeric values) makes automatic clustering difficult, but it resulted in an increased number of categories when compared with the initial speculative hierarchy. This is testimony to the fact that the attribute categorisation scheme encourages the discovery of new fault types. The new hierarchy includes categories such as 'Event fault' and 'Superfluous call' that would have been difficult to categorise in the initial speculative hierarchy.

The list of faults presented in this case study is naturally not a complete list of faults that could possibly occur in an object-oriented environment. It is however a useful starting point for anyone looking for a collection of representative faults which can be used to evaluate a fault isolation approach such as software inspection or testing.

4 Conclusions and Future Work

There is a lack of information concerning the nature and distribution of faults that are likely to occur within OO systems, and this is hampering the development and evaluation of verification and validation techniques aimed at this paradigm.
Some classification schemes exist but most of these have a questionable empirical basis. Open source projects typically have lists of problem reports which represent a valuable source of data. Three open source projects were chosen and their problem reports mined to extract information on the faults attributed to the problem (this is not always a straightforward task).

The dangers of immediately classifying faults within a tree-like structure are illustrated, and instead the attribute categorization mechanism introduced by Weyuker and Ostrand is extended and adapted for the OO paradigm. This is employed to describe 71 faults identified in the open source projects and used to provide a tentative classification of common OO faults. Future work will concentrate on building up this database of faults and investigating the application of automatic clustering mechanisms to define a robust fault classification scheme.
UKTest 2005
Description                                                                Number Recorded
high level algorithm error (correction required in several methods)              3
not implementing interface when it should be                                     1
incorrectly implements an interface                                              2
missing functionality (requires additional data member)                          1
missing functionality (no clear fix)                                             1
inconsistent method (does not allow property that other methods allow)           1
high level algorithm error (correction required in single method)                6
missing functionality in method body                                             1
predicate condition is incorrect                                                 3
predicate is missing a condition                                                 9
statements are in wrong sequence                                                 1
data value is not correct                                                        3
bool op can never return true given data value                                   1
data value can cause failure without additional control structures               2
inefficient loop algorithm, shouldn't iterate through entire search space        1
not calling a method when it should                                              1
calls wrong method when defining a variable value                                1
incorrect method called                                                          4
incorrect argument value                                                         4
wrong object used in method call                                                 1
incorrect boolean expression as argument                                         1
omitted object in method call                                                    1
superfluous method call                                                          2
should not use method call in variable definition                                1
concurrency fault causes failure (deadlock / race condition)                     3
Method is visible to entire package instead of being protected                   1
subclass defines new methods instead of overriding existing ones                 1
method name breaks naming convention                                             1
missing functionality                                                            2
superfluous definitions (already declared in super-interface)                    1
wrong variable type declaration                                                  2
encapsulation error: data member can be manipulated via accessor                 1
does not fire events when it should                                              1
fails to raise exception                                                         3
incorrectly implements superclass                                                2
missing calls to superclass                                                      1

Table 11: Fault Summaries
UK Software Testing Research III
• Algorithm
  ⊲ Missing functionality
  ⊲ Control Flow
    – Predicate fault
      · Incorrect predicate
      · Predicate missing a condition
    – Incorrect exit clause for loop
    – Statements in wrong sequence
  ⊲ Data
    – Data value is not correct
    – Boolean operation can never return true for a given variable
    – Variable can cause failure without additional control structures
  ⊲ Call
    – Missing call
    – Incorrect method called
    – Incorrect object used in call
    – Superfluous method call
    – Incorrect argument value

• Abstract class (interface)
  ⊲ Not implementing an interface when it should
  ⊲ Incorrectly implementing an interface

• Concurrency fault (deadlock / race condition)

• Declaration fault
  ⊲ Method given wrong visibility
  ⊲ Subclass defines new methods instead of overriding existing ones
  ⊲ Method breaks naming convention
  ⊲ Superfluous declaration (e.g. interface declaring method that is already declared in base-interface)
  ⊲ Wrong variable type declaration

• Encapsulation fault (e.g. data member can be manipulated via accessor)

• Event fault (e.g. does not fire events when it should)

• Exception fault (e.g. fails to raise exception)

• Inheritance fault
  ⊲ Incorrectly implements superclass

Figure 5: Hierarchy of summaries
References

[1] The Apache Ant project. http://ant.apache.org/.
[2] JFreeChart. http://sourceforge.net/projects/jfreechart.
[3] JHotDraw. http://sourceforge.net/projects/jhotdraw.
[4] R. Alexander, J. Offutt, and J. Bieman. Syntactic fault patterns in object-oriented programs. In Proceedings of the Eighth IEEE International Conference on Engineering of Complex Computer Systems (ICECCS '02), pages 193-202, Greenbelt, Maryland, December 2002.
[5] B. Beizer. Software Testing Techniques. Van Nostrand Reinhold, 1990.
[6] R. Binder. Testing Object-Oriented Systems. Addison Wesley, 1999.
[7] R. Chillarege, I. Bhandari, J. Chaar, M. Halliday, D. Moebus, B. Ray, and M. Wong. Orthogonal defect classification - a concept for in-process measurements. IEEE Transactions on Software Engineering, 18(11):943-956, November 1992.
[8] A. Dunsmore, M. Roper, and M. Wood. The development and evaluation of three diverse techniques for object-oriented code inspections. IEEE Transactions on Software Engineering, 29(8):677-686, August 2003.
[9] D. Firesmith. Testing object-oriented software. In Proceedings of TOOLS, March 1993.
[10] R. Glass. Persistent software errors. IEEE Transactions on Software Engineering, 7(2):162-168, March 1981.
[11] J. Huffman Hayes. Testing of object-oriented programming systems (OOPS): A fault-based approach. In Proceedings of the International Symposium on Object-Oriented Methodologies and Systems (ISOOMS), number 858 in Springer-Verlag Lecture Notes in Computer Science series, pages 205-220, Palermo, Italy, September 1994.
[12] D. Kuhn. Fault classes and error detection capability of specification-based testing. ACM Transactions on Software Engineering and Methodology, 8(4):411-424, October 1999.
[13] L. Miller, J. Huffman Hayes, and S. Mirsky. Task 7: Guidelines for the verification and validation of artificial intelligence software systems. Technical report, United States Nuclear Regulatory Commission and the Electric Power Research Institute, 1993.
[14] T. Ostrand and E. Weyuker. Collecting and categorizing software error data in an industrial environment. Journal of Systems and Software, 4(4):289-300, November 1984.
[15] OTI. Eclipse platform overview. 2003.
[16] J. Purchase and R. Winder. Debugging tools for object-oriented programming. Journal of Object-Oriented Programming, 4(3):10-27, June 1991.
[17] M. Roper. Software Testing. McGraw Hill, 1994.
[18] R. Thibodeau. The state-of-the-art in software error data collection and analysis - final report. Technical report, General Research Corp., Huntsville, AL, 1978.
[19] H. Younessi. Object-Oriented Defect Management of Software. Prentice Hall PTR, 2002.
A Study of the Reciprocal Collateral Coverage Provided by Two Testing
Methods
D. F. Yates
(© 2005 D. F. Yates)
Information Systems and Databases Laboratory
Department of Informatics
Athens University of Economics and Business
Athens, Greece
and
N. Malevris
Department of Informatics,
Athens University of Economics and Business
Athens, Greece
Abstract
Branch testing, for example, seeks to cover all branches in a piece of code, but when it is performed, as well as branches, other program elements, such as statements or p-uses, will necessarily be covered. The contemporaneous coverage of these other elements is referred to as "collateral coverage". An understanding of the extent of the collateral coverage of the various program elements that is afforded by the available testing methods can be used to determine an optimal deployment of the methods in pursuance of the aim of covering one or more of those elements. In this paper, as a step towards such an understanding, the collateral coverage that branch testing yields in respect of JJ-paths, and that JJ-path testing yields in respect of branches, is investigated. The results of the investigation do, in fact, facilitate the definition of a policy for deploying the two methods when branch coverage, JJ-path coverage, or both of these is sought. However, perhaps surprisingly, they also indicate that the widespread
preference for branch over JJ-path testing may not be justified.
Keywords: branch testing, JJ-path testing, collateral coverage, test coverage.
1.0 Introduction
The extent of branch coverage that is achieved by a practical application of branch testing may be expressed using the Ter2 metric, Brown [2], as follows:

Ter2 = (number of branches that have been executed) / (total number of branches in the code)
Woodward et al. [19] introduced a family of structural testing methods based on the concept of a JJ-path or LCSAJ, where a JJ-path is a sequence of consecutively numbered basic blocks: p, (p+1), …, q, followed by a jump to a basic block numbered r, where r ≠ (q + 1). The first member of this family seeks to cover all JJ-paths in a unit of code, and the success of the method can be measured using the Ter3 metric, also introduced by Woodward et al.; the metric being defined analogously to Ter2.
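To make the definition concrete, the following minimal sketch enumerates the JJ-paths of a code unit from a description of its basic-block numbering and jumps. The `jump_targets` and `falls_through` inputs are hypothetical abstractions introduced purely for illustration; they are not part of the tooling used in this paper.

```python
def lcsajs(num_blocks, jump_targets, falls_through):
    """Enumerate JJ-paths (LCSAJs) as (p, q, r) triples: control enters
    at block p, falls through consecutively numbered blocks up to q,
    then jumps from q to block r with r != q + 1.

    jump_targets[q]  -> blocks reachable from block q by an explicit jump
    falls_through(q) -> True if control can continue from q into q + 1
    """
    # A linear sequence can start at block 1 or at any jump target.
    starts = {1} | {r for ts in jump_targets.values() for r in ts}
    paths = []
    for p in sorted(starts):
        q = p
        while q <= num_blocks:
            for r in sorted(jump_targets.get(q, ())):
                if r != q + 1:        # a genuine jump, not fall-through
                    paths.append((p, q, r))
            if not falls_through(q):  # sequence cannot extend past q
                break
            q += 1
    return paths
```

For example, a four-block unit in which block 2 can jump to block 4 and block 4 always jumps back to block 1 yields the JJ-paths (1, 2, 4), (1, 4, 1), and (4, 4, 1).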
The influence of work on branch testing, of which [1], [7], [9], [16], and [23] represent but a small sample, has been significant in its becoming viewed as a minimum requirement for the testing of a piece of software. Branch testing does have its limitations, [4] and [5]; nevertheless, it is popular, as is reflected in the fact that it is a testing requirement in many published software standards, see [3] and [14] for example. JJ-path testing has substantially fewer adherents, and few testing standards require that it be performed; [14] is an exception, although it is mentioned in [15].
In Malevris and Yates [11], the concept of “collateral coverage” was introduced.
It may be defined as follows.
Definition 1
Let M be a testing method that specifically aims at covering some element (feature), E
say, of program code. If as a result of applying M to a unit of code in pursuance of
covering element E, some coverage of another element, H, is also realised, then this
additional coverage is defined to be the collateral coverage of H that is achieved by
M.
Upon its application, every testing method will achieve collateral coverage, which
may be extensive, of a number of different structural elements of a piece of code. It
would be unwise and inefficient not to take some advantage of this. What is more, if
the relative abilities of the various testing methods in achieving collateral coverage are
known, further advantage may clearly be gained by optimally sequencing the
application of those methods. The work presented in this paper is intended to be one
step towards assessing such relative abilities. The step taken involves an investigation
of branch and JJ-path testing in this respect. Specifically, the collateral coverage of
branches by JJ-path testing, and of JJ-paths by branch testing, is compared with
respectively the branch coverage derived by branch testing, and the JJ-path coverage
resulting from JJ-path testing. Thus, in effect, it is an investigation of the reciprocal
collateral coverage of branch and JJ-path testing. Further, since it too may prove to
be relevant in determining an optimal sequence for the application of the two
methods, the “aggregate coverage”, (branch coverage +JJ-path coverage) achieved by
the methods is also compared.
The paper is structured as follows. After introducing necessary definitions and
concepts, and defining the path generation method used in the experiments in section
2 of the paper, the sample and experimental regime are detailed in section 3. The
results of the experiments themselves, which entailed the generation and attempted
execution of well over 6,000 program paths, are presented in section 4. An overview
and discussion of the important finding of the experimental work in section 5 then
concludes the report.
2.0 Definitions and Path Generation Method
Consider C, a code unit (a program, procedure, subroutine, or other delimited and
distinguished section of program code) with n basic blocks. If C is to undergo branch
testing, an appropriate model of C's structure upon which path selection can be based
is “the control flow graph” of C, see Paige [13].
Definition 2

The control flow graph Gc = (V, A) of C is a connected directed graph in which there is a unique vertex vj ∈ V for each basic block j of C, and arc vivj ∈ A iff there is a transfer of control flow (branch) in C from block i to block j.
As a code unit may have one or more entry points, and one or more exit points, Gc may possess one or more sources (vertices with no in-coming arc), and one or more sinks (vertices with no out-going arc) respectively. It will be assumed here that Gc contains a unique source denoted S and a unique sink denoted F. Should this not be the case in practice, the existence of multiple entry and/or exit points in C can be addressed by introducing into Gc a super-source and/or super-sink, as is appropriate, together with associated arcs, see [20] for example.

Under this assumption, it can be seen that there is a one-to-one correspondence between the program paths (paths from the entry to the exit point) of C and the paths from S to F in Gc (S-to-F paths). Consequently, executing one of C's program paths is equivalent to 'executing', or covering, the corresponding S-to-F path in Gc and therefore, its constituent branches.
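Under the unique source and sink assumption, this correspondence can be illustrated by enumerating S-to-F paths directly from Gc's arc list. The sketch below handles simple (loop-free) paths only and is purely illustrative; it is not the path generation method the paper actually uses, which is defined in section 2.2.

```python
from collections import defaultdict

def s_to_f_paths(arcs, source="S", sink="F"):
    """Enumerate every simple S-to-F path in a control flow graph given
    as a list of (from_vertex, to_vertex) arcs. Each path covers its
    constituent arcs (branches), mirroring the correspondence between
    program paths of C and S-to-F paths of Gc."""
    succ = defaultdict(list)
    for u, v in arcs:
        succ[u].append(v)
    paths, stack = [], [(source, [source])]
    while stack:  # iterative depth-first search
        v, path = stack.pop()
        if v == sink:
            paths.append(path)
            continue
        for w in succ[v]:
            if w not in path:  # simple paths only; loops need unrolling
                stack.append((w, path + [w]))
    return paths
```

For a graph with arcs S→a, S→b, a→F, b→F this yields the two S-to-F paths S-a-F and S-b-F.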
Unfortunately, coverage of a path in Gc is not necessarily equivalent to executing
the corresponding program path in C since the program path may be 'infeasible', that
is, no data set exists that will force its execution. Consequently, a set of S-to-F paths
selected to give a specific value of Ter2 may not realize that value when an attempt is
made to test the corresponding program paths in C. The existence and influence of
infeasible paths will necessarily play a major role in what follows.
If, rather than branch testing, C is to undergo JJ-path testing, the control flow graph of C is not the most convenient model on which to base test path generation. In this situation C's JJ-graph, GJ, is more appropriate, see [18] and [21].
Definition 3
The JJ graph GJ = (V, S) of C is a connected directed graph in which there is a unique vertex vj ∈ V for each JJ-path j of C, and arc vivj ∈ S iff there is a transfer of control flow (jump) in C from JJ-path i to JJ-path j.
It is not unusual for GJ to have one or more sources and/or sinks, see Woodward [18]. However, the same expedient as for the control flow graph, namely, that of introducing appropriately a super-source and/or super-sink, together with the necessary associated arcs, may be adopted. Therefore, it will be assumed here that GJ contains a unique source, S, and a unique sink, F. Given the definition of GJ it is not difficult to understand that, with this assumption, the execution of a program path, and therefore, its constituent JJ-paths, is equivalent to covering the vertices and arcs of the corresponding S-to-F path.
2.1 Improvements on the Teri
By definition, full branch and full JJ-path coverage is achieved only when the
corresponding Ter metric achieves the value unity. However, given that every basic block involves a predicate, and that, in general, any program path will involve several basic blocks, there exists the possibility of there being no test data that will simultaneously satisfy all predicates associated with a specific branch, or JJ-path. In the presence of such infeasible branches and infeasible JJ-paths, less than accurate information is, therefore, provided by the Ter metrics. However, consider Ter*i, i = 2, 3, the relative Teri metrics, which are defined as:

Ter*i = (number of elements of C covered) / (number of elements of C not proven to be infeasible)
where the term ‘elements’ refers to branches, or JJ-paths, as is appropriate. (It is noted that similar ‘relative’ metrics can be defined for any testing method.)

Clearly, in order to evaluate any Ter*i, it is necessary to know the number of corresponding infeasible elements in the software under test. Testing, however, is essentially a sequential process. Thus, there may come a time when a certain element is known to be infeasible, and at that point, an error may potentially have been highlighted, but thereafter, even if no error exists, Ter*i will provide a more accurate indicator of the coverage actually achieved than Teri. It is for this reason that the Ter*i metrics, rather than the Teri, will be used in reporting the experimental results that have been derived.
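As a minimal sketch of the two measures (element counts are assumed to be known; in practice, establishing how many elements are infeasible is the hard part, as the text notes):

```python
def ter(covered, total):
    """Plain Ter_i: fraction of elements (branches for i = 2, JJ-paths
    for i = 3) covered, out of all elements in the code unit."""
    return covered / total

def relative_ter(covered, total, known_infeasible):
    """Relative Ter*_i: coverage measured against only those elements
    not (yet) proven infeasible, as defined in section 2.1."""
    return covered / (total - known_infeasible)
```

For a unit with 20 branches of which 15 are covered and 4 are proven infeasible, Ter2 = 0.75 while Ter*2 = 15/16 ≈ 0.94, a more accurate indicator of the coverage actually achievable.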
2.2 The Path Selection Method
The aim of this paper is to compare certain characteristics of the performance of two
testing methods. Such a comparison will (necessarily) be afforded only as a result
of generating sets of test paths. If an unbiased comparison is to be made, it is
imperative, therefore, that the method adopted for the generation of such path sets also
be unbiased. This section only outlines the method that was adopted in deriving the
experimental results reported herein, as a detailed specification has been published
elsewhere in Yates and Malevris [21], and [22].
In an attempt to derive a heuristic for selecting feasible test paths a priori, Yates and Hennell [20] advanced, and argued, the proposition that: a program path that involves q ≥ 0 predicates is more likely to be feasible than one involving p > q. The formal statistical investigation of this that was undertaken in Malevris et al. [12] concluded, with great statistical significance, that the feasibility of a path decays exponentially with the increasing number of predicates it involves. As a result, Yates
and Malevris [22] proposed a path selection method, extending that of Yates and
Hennell. Although initially introduced to support branch testing, the method was
founded only upon a consideration of the number of predicates on a program path.
Thus, the method does not seek optimisation in respect of any one testing criterion, and thus, may be used validly, and without bias, when attempting to fulfil any testing criterion. If, for purposes of this paper, a code element is taken to refer to either a branch, or a JJ-path, the method may be summarised as follows.
1. Generate a set of program paths, Π, whose constituent paths each involve a minimum number of predicates, and which, in the absence of infeasible paths, would cover the elements of code unit C.

2. Derive the value of Teri resulting from executing C with test data corresponding to those paths in Π that are feasible.

While condition repeat step (3):

3. Select an uncovered element, E, of C, and successively generate π_E^K, K = 1, 2, …, until for some K = λ, π_E^λ is found to be feasible, and then recalculate the value of Teri.

Here, π_E^K denotes that path through element E that involves the Kth smallest number of predicates.
In order to generate Π and the π_E^K for branch and JJ-path testing, use can be made of the fact that corresponding paths exist in respectively the control flow graph and the JJ-graph. These paths can be found by generating the kth shortest paths (in terms of the number of predicates involved) through the arcs of the control flow graph for branch testing, and the vertices of the JJ-graph for JJ-path testing. Here k = 1 when Π is to be constructed, and k = K otherwise, and standard graph theoretic algorithms are available in each case, see Gondran and Minoux [6], for example.

All that is now required to define the instantiation of the above path generation method that was used in the experiments are details of condition, and the definition of the criterion used to determine the order in which uncovered elements are selected in step (3). Details of the choices made are now given.
Although it would be ideal to select condition to be Teri < 1, for reasons of practicality it was found to be necessary to curtail the testing of certain units. This was achieved by taking condition to be: while Teri < 1 and K < 300. Thus, only a maximum of 300 paths, in addition to those contained in the initial path set, could be generated by any of the methods.
When there is more than one uncovered element to be treated in step 3 of the
method, in which order is it most appropriate to attempt to cover them? This question
was answered by making use of the substantiated thesis that: a program path that
involves q ≥ 0 predicates is more likely to be feasible than one involving p > q.
Specifically, at the beginning of each iteration of step 3, the lengths, again in terms of
the number of predicates involved, of the shortest untried path through the extant
uncovered elements, are compared. The ensuing attempt to increase coverage then
involves that element which corresponds to the shortest of these paths. Should this
process result in a tie, one from amongst the tying candidates is then selected at
random.
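Putting steps 1-3 and the ordering rule together, the selection loop might be sketched as follows. Here `paths_through` and `is_feasible` are hypothetical helpers standing in for, respectively, the k-shortest-path generation over the relevant graph and the symbolic-execution feasibility check; a path is represented simply as a tuple of its predicates, and the bookkeeping is heavily simplified relative to the experimental tooling.

```python
import random

def cover_elements(elements, paths_through, is_feasible, k_limit=300):
    """Sketch of step 3 of the path selection method: while uncovered
    elements remain, attempt the element whose shortest untried path
    involves fewest predicates; generate successively longer paths
    through it until one is feasible or the K < 300 cut-off is hit.

    paths_through(E) -> candidate paths through element E, ordered by
        increasing number of predicates (hypothetical helper).
    is_feasible(path) -> True if test data exists for the path.
    """
    candidates = {e: list(paths_through(e)) for e in elements}
    next_k = {e: 0 for e in elements}   # index of next untried path
    covered, selected = set(), []
    while len(covered) < len(elements):
        live = [e for e in elements
                if e not in covered
                and next_k[e] < min(len(candidates[e]), k_limit)]
        if not live:
            break  # remaining elements curtailed by the cut-off
        # Compare the shortest untried path through each live element;
        # ties are broken at random, as in the experimental regime.
        shortest = min(len(candidates[e][next_k[e]]) for e in live)
        e = random.choice([x for x in live
                           if len(candidates[x][next_k[x]]) == shortest])
        path = candidates[e][next_k[e]]
        next_k[e] += 1
        if is_feasible(path):
            covered.add(e)
            selected.append(path)
    return covered, selected
```

The loop always spends its next attempt on the cheapest-looking element, in line with the thesis that paths with fewer predicates are more likely to be feasible.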
3.0 The Test Sample and Experimental Regime
The results that are reported in section 4 were derived by testing a sample of code
units written in Fortran. Although Fortran is still employed in many organisations, its
usage, relative to that of other available languages, has diminished most significantly
over the last twenty-five years. There then arises the question of whether the results
and corresponding conclusions derived herein have any general significance or
relevance. This question is addressed by the following.
Suppose that infeasible paths do not exist. In such circumstances, a valid path set
derived to fulfil a testing criterion would, when executed, automatically achieve its
aim. Unfortunately, infeasible paths do exist! Therefore, it is only because of their
existence that full coverage in respect of a testing criterion may not be met in practice,
and it is also clear that this must be true irrespective of the language in which the
software under test has been encoded. Correspondingly, it is relevant to ask whether
infeasible paths are more likely to occur in code written in one programming language
than in another. In the substantially sized study performed by Malevris et al. [12] of
the feasibility, or otherwise, of program paths, statistical analysis of the results
showed that with a certainty of 99.95%, the potential infeasibility of a program path is
characterised by the number of predicates that it involves. Given such a definitive
result, one of two possibilities must obtain. Either, the encoding language plays no, or
at most, a highly insignificant and almost imperceptible role in influencing the
feasibility of a program path, or, the language itself is instrumental in deciding the
number of predicates involved in a path.
Consider the second of these possibilities, and also consider an arbitrary
generalised algorithm (not a program) for processing certain data sets. In general,
subsets of the data will be processed differently by the algorithm depending on the
specific characteristics of the subsets, and differentiation in processing will be
achieved via the use of tests (predicates). Without loss of generality, assume that it
requires x > 0, say, predicates to differentiate the processing of one of the subsets, A
say, from the others. Further, suppose that a programmer is asked to encode the
algorithm in one imperative language and then transliterate to produce another
encoding in a second imperative language. Under the relatively mild assumptions that
in the two languages, the decimal accuracy with which a real value is represented, and
with which arithmetic is performed on such values, is the same, then in the two
encodings y ≥ x tests (y = x if the programmer codes efficiently) will be needed to
distinguish the processing of subset A from that of other data subsets. Now this will
be true irrespective of the two imperative languages used in the encodings, and will
also obtain for all valid data subsets that the code is designed to process. Therefore,
the two programs produced by the programmer will use the same number of tests to
distinguish the processing of each specific pair of data subsets, and the same total
number of tests to distinguish the processing of any one data subset. Thus, under the
above assumptions, the idea that the encoding language influences coverage should be
eschewed.
Correspondingly, the authors contend that the language in which software is
encoded is immaterial, or at worst insignificant, where results derived from
experiments such as those reported below are concerned, and that such results may
justifiably be viewed as being both relevant and generally applicable. Work aimed at substantiating this contention experimentally forms part of the authors' on-going research.
3.1 The Experimental Regime
The experimental results reported in the succeeding section, were derived as a result
of testing a set of 35 Fortran subroutines chosen in a pseudo-random manner from the
NAG library. The subroutines have thus been employed extensively in industry, in
research establishments, and for various academic purposes. In all, the investigation
of well over 6000 program paths was entailed. A profile of the sample of code units,
detailing the number of branches, and JJ-paths that each contains, is presented as table
1.
Each of the 35 units was first subjected to branch testing. The path generation method adopted was that described in section 2. The values of Ter2 and Ter3 achieved for each routine at the end of step 1 of the path generation method, herein referred to as the initial coverage in respect of the units, were recorded. The increased values of the metrics resulting from the coverage of each additional branch in step 3 of the method were also recorded; the last such pair of values to be recorded being referred to as the final coverage in respect of the units. An analogous regime was then adopted for JJ-path testing.

Once the testing of all of the routines had been completed, the values of Ter*2 and Ter*3, corresponding respectively to each of the recorded values of Ter2 and Ter3, were derived.
In order to generate the control flow and JJ graphs upon which the path
generation methods rely, the TESTBED tool, [17], was used. The test paths
themselves were generated using the ESPM tool, [10], which embodies the path
selection method of section 2, and corresponding sets of test data (for the feasible
paths) were derived using the VOLCANO symbolic execution system of Koutsikas
and Malevris [8].
Profile of the sample

Unit   No. of     No. of       Unit    No. of     No. of
       branches   JJ-paths             branches   JJ-paths
 1       9          8           19      21         17
 2      12         12           20      12         12
 3       4          5           21      12         13
 4       9         10           22       6          7
 5      33         36           23      16         18
 6      16         14           24      36         32
 7      21         19           25      15         16
 8       9          9           26      10          9
 9      18         16           27      21         21
10      11         12           28      20         19
11      27         25           29      18         16
12      38         35           30      50         46
13      25         23           31      15         18
14      22         19           32      45         41
15      31         26           33      19         18
16      30         24           34      17         14
17       3          3           35      25         20
18      20         19           Total  696        657

Table 1.
4.0 Results and Discussion
The results derived as a result of the experiments are presented in three subsections.
Sections 4.1 and 4.2 focus on the relative effectiveness of the testing methods in
respect of branch and JJ-path coverage respectively, whereas section 4.3 addresses
their relative effectiveness in terms of aggregate coverage.
4.1 Branch Coverage

The initial branch coverage of each of the 35 code units as achieved by the testing methods is reported, in terms of Ter*2, in table 2. The table's size is substantial, and consequently it is placed in the appendix. However, an inspection of the table reveals certain salient points, and these are summarised in table 3.
Initial coverage of branches

Method           Total no. of     Mean no. of paths   No. of units for   No. of units for   Range of Ter*2    Mean value of Ter*2
                 paths generated  per unit generated  which Ter*2 = 1    which Ter*2 = 0    for other units   for all 35 units
Branch testing   218              6.229               13                 2                  [0.125, 0.933]    0.695
JJ-path testing  324              9.257               14                 2                  [0.125, 0.933]    0.717

Final coverage of branches

Method           Total no. of     Mean no. of paths   No. of units for   No. of units for   Range of Ter*2    Mean value of Ter*2
                 paths generated  per unit generated  which Ter*2 = 1    which Ter*2 = 0    for other units   for all 35 units
Branch testing   2455             70.143              29                 0                  [0.5, 0.789]      0.947
JJ-path testing  4096             117.029             29                 0                  [0.278, 0.818]    0.935

Table 3.
The values of the final branch coverage achieved by the testing methods in respect of
the test sample are also given in table 2, and in summary, in table 3. Attention is
drawn to the fact that for two of the units, no initial branch coverage was achieved. In
both cases the reason for this was the existence of nested loops in the code, the
innermost of which needed to be executed more times than was catered for by the
initial set of test paths.
Focussing upon the last column of table 3, the values suggest that branch testing
gives a worse return, in the mean, than JJ-path testing as measured by the initial
coverage, but that the situation is reversed when final coverage is considered.
However, these levels of coverage should also be assessed in the light of the number
of test paths generated in achieving them (column 2 of the table).
In order to facilitate further investigation, the mean values of the quantities of
interest were derived as follows.
Denote by Ter*2,j,k the value of Ter*2 < 1 achieved in respect of code unit j after a total of k paths have been generated for it in the application of one of the testing methods. Now, for some value of k, Qj say, full branch coverage of unit j will be achieved, and thus, define:

Ter*2(k, j) = Ter*2,j,k if k < Qj, and 1 otherwise.

Further, define:

P(k, j) = k if k < Qj, and Qj otherwise,

from which it can be seen that, for code unit j, the testing method achieves a value of Ter*2 = Ter*2(k, j) after P(k, j) paths have been generated in respect of testing it. Hence, using these, it can be seen that after a mean of PM(k) paths per code unit have been generated, where:

PM(k) = (1/35) * Σ_{j=1}^{35} P(k, j),

T*2(PM(k)), the mean value for Ter*2 achieved by testing the 35 code units, is:

T*2(PM(k)) = (1/35) * Σ_{j=1}^{35} Ter*2(k, j).
Note

For simplicity and conciseness in what follows, T*2(PM(k)) and PM(k) will be written as T*2(PM) and PM respectively; their dependence upon k being tacitly understood.
A trend line relating the values of T*2(PM) and PM that were obtained from the experiments was generated for each testing method. These values are given in the appendix as table 4. In deriving the trend lines, it was essential that the range PM ∈ [9.257, 70.143] be chosen. This choice facilitates an unbiased comparison of the methods, because the results for them were derived in different ranges: PM ∈ [6.229, 70.143] and PM ∈ [9.257, 117.029], for branch and JJ-path testing respectively (see columns 2 and 4 of table 4).
Of the various functions that were fitted to these data values, one of the form T*2(PM) = (A1 + A2/PM + A3 e^(-PM))^(-1), where A1, A2 and A3 are constants, proved to be the best for both methods. The resulting trend lines are defined by:

T*2(PM) = (1.0044056 + 2.8653818/PM + 466.55693 e^(-PM))^(-1) for branch testing, and
T*2(PM) = (1.014394 + 3.176849/PM + 534.84016 e^(-PM))^(-1) for JJ-path testing.

As evidenced by the corresponding values of R², 0.996 and 0.997 (3 dec. places) respectively, the regression lines represent an extremely good fit, and therefore may be relied upon. The plot of T*2(PM) against PM together with the corresponding regression line for branch testing is given in figure 1 for purposes of illustration.
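Since the reciprocal of the fitted form, 1/T*2(PM) = A1 + A2/PM + A3 e^(-PM), is linear in the three constants, such a trend line can be recovered by ordinary least squares. The sketch below is illustrative only: the data points are synthesised from the reported branch-testing constants, not taken from table 4.

```python
import numpy as np

def trend(p, a1, a2, a3):
    """Trend-line family used in the paper:
    T*(P_M) = (A1 + A2/P_M + A3 * exp(-P_M))**(-1)."""
    return 1.0 / (a1 + a2 / p + a3 * np.exp(-p))

def fit_trend(p, t):
    """Fit A1, A2, A3: the reciprocal 1/T is linear in the basis
    functions 1, 1/P and exp(-P), so plain least squares suffices
    (no iterative nonlinear fitting is needed)."""
    X = np.column_stack([np.ones_like(p), 1.0 / p, np.exp(-p)])
    coeffs, *_ = np.linalg.lstsq(X, 1.0 / t, rcond=None)
    return coeffs

# Illustrative check using the branch-testing constants reported in
# section 4.1; the p values are arbitrary points in the fitted range.
p = np.array([9.257, 15.0, 25.0, 40.0, 70.143])
t = trend(p, 1.0044056, 2.8653818, 466.55693)
a1, a2, a3 = fit_trend(p, t)
```

Because the synthetic data lie exactly on the curve, the fit recovers the three constants, confirming the linear-in-reciprocal formulation.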
A calculation using the trend lines straightforwardly reveals that branch testing provides a better mean branch coverage than does JJ-path testing in the entire range PM ∈ [9.257, 70.143]. Specifically, the mean branch coverage that branch testing yields is 3.691% greater than that obtained using JJ-path testing at PM = 9.257, but this percentage diminishes in the range of interest: at PM = 21.165 it is 2% greater, for example, and at PM = 70.143, it is only 1.38% greater. Consequently, it may be said that, in most of the experimental range, the mean branch coverage achieved by branch testing is less than 2% more than the mean collateral branch coverage achieved by JJ-path testing.
Figure 1. A plot of T*2(PM) against PM for branch testing (diamonds), and the corresponding regression line (dots).
It is also noted that, because of the form of the regression lines, exactly the same
comments can be made about the performance of the two testing methods when
branch coverage per unit test path, rather than branch coverage itself, is considered.
4.2 JJ-path Coverage

Both the initial and the final JJ-path coverage of each of the 35 code units that was achieved by branch and JJ-path testing is reported, in terms of Ter*3, in table 5. This table too has been placed in the appendix because of its size. However, a synopsis of the important values contained therein is presented in table 6.

The entries in the last column of table 6 do indeed indicate that, as measured by Ter*3, the initial and final return from JJ-path testing exceeds that of branch testing. As in the case of branch coverage, each of these values belies the effort needed to achieve it, that is, when taking the number of test paths that were generated (column 2 of the table) into account.
Initial coverage of JJ-paths

Method           Total no. of     Mean no. of      No. of units      No. of units      Range of Ter_3^*   Mean Ter_3^*
                 paths generated  paths per unit   with Ter_3^* = 1  with Ter_3^* = 0  for other units    for all 35 units
Branch testing   218              6.229            3                 2                 [0.125, 0.882]     0.592
JJ-path testing  324              9.257            12                2                 [0.125, 0.952]     0.665

Final coverage of JJ-paths

Method           Total no. of     Mean no. of      No. of units      No. of units      Range of Ter_3^*   Mean Ter_3^*
                 paths generated  paths per unit   with Ter_3^* = 1  with Ter_3^* = 0  for other units    for all 35 units
Branch testing   2455             70.143           7                 0                 [0.343, 0.909]     0.823
JJ-path testing  4096             117.029          25                0                 [0.171, 0.882]     0.892

Table 6.
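As a quick arithmetic check on table 6, the mean number of paths per unit is simply the total number of generated paths divided by the 35 code units. A minimal Python sketch (illustrative only; the figures are those in the table):

```python
# Consistency check for table 6: mean paths per unit = total paths / 35 units.
# Rounding to 3 decimal places reproduces the published means.
totals = {
    ("branch", "initial"): 218,
    ("jj-path", "initial"): 324,
    ("branch", "final"): 2455,
    ("jj-path", "final"): 4096,
}
UNITS = 35
means = {k: round(v / UNITS, 3) for k, v in totals.items()}
print(means)
```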
The level of coverage of JJ-paths achieved by the two methods was investigated in a
manner similar to that used for branch coverage. Thus, defining T_3^*(P_M(k))
analogously, T_3^*(P_M(k)) is understood to be the mean value of Ter_3^* achieved
after testing the 35 code units when P_M(k) paths per unit have been generated. Again
for simplicity and conciseness, T_3^*(P_M(k)) and P_M(k) will be written respectively
as T_3^*(P_M) and P_M in what follows.
In order to relate the experimental values of T_3^*(P_M) and P_M, these being given in
the appendix as table 7, a trend line in the range P_M ∈ [9.257, 70.143] was derived for
each testing method. As before, a function of the form (A_1 + A_2/P_M + A_3 e^{-P_M})^{-1}
provided the best fit. The regression lines derived for branch testing and JJ-path
testing respectively,

T_3^*(P_M) = (1.1559459 + 3.3515703/P_M + 159.19472 e^{-P_M})^{-1} and
T_3^*(P_M) = (1.060645 + 3.4141592/P_M + 993.92324 e^{-P_M})^{-1},

both have an associated R^2 value of 0.997, and this clearly indicates that each line
fits the corresponding data very well indeed.
Using the equations of the trend lines, it is found that JJ-path testing yields better
JJ-path coverage than branch testing over the entire range P_M ∈ [9.257, 70.143]. In
relative terms, JJ-path testing is 0.579% more effective than branch testing at
P_M = 9.257; this rises swiftly to 5% at P_M = 10.854, and to 7.843% at P_M = 70.143. It
may therefore be concluded that the extent of the mean collateral cover of JJ-paths
provided by branch testing is not commensurate with the level of coverage achieved
by JJ-path testing. Again because of the form of the regression lines, these comments
apply identically when the relative performance of the methods in respect of
coverage per unit test path is considered.
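The relative figures quoted in this section can be reproduced directly from the two trend lines. The Python sketch below is illustrative and not part of the original study; the model form and coefficients are those of the regression lines given above, and the percentage advantage is computed relative to the JJ-path value, which matches the quoted 0.579%, 5% and 7.843% figures:

```python
import math

# Trend lines for mean JJ-path coverage, T3*(PM), of the form
# (A1 + A2/PM + A3*exp(-PM))**-1, for branch and JJ-path testing.
def t3_branch(pm):
    return 1.0 / (1.1559459 + 3.3515703 / pm + 159.19472 * math.exp(-pm))

def t3_jjpath(pm):
    return 1.0 / (1.060645 + 3.4141592 / pm + 993.92324 * math.exp(-pm))

def relative_advantage(pm):
    """Percentage by which the mean JJ-path coverage of JJ-path testing
    exceeds the collateral coverage of branch testing, relative to the
    JJ-path figure."""
    b, j = t3_branch(pm), t3_jjpath(pm)
    return 100.0 * (j - b) / j

for pm in (9.257, 10.854, 70.143):
    print(pm, round(relative_advantage(pm), 3))
```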
4.3 Aggregate Coverage

By summing the relevant columns in tables 2 and 5, the values of the initial and final
aggregate coverage of each of the 35 units may be deduced straightforwardly.

Using the definitions of T_2^*(P_M) and T_3^*(P_M), the aggregate mean coverage that
is achieved by each of the testing methods is given by:

T^*(P_M) = T_2^*(P_M) + T_3^*(P_M)
The values of P_M and T^*(P_M) derived experimentally may be deduced by summing
the corresponding entries in tables 4 and 7. Using these, a trend line relating
T^*(P_M) and P_M was derived for each method, as before. The lines

T^*(P_M) = (0.53743396 + 1.5449193/P_M + 167.52041 e^{-P_M})^{-1} for branch testing, and
T^*(P_M) = (0.51848991 + 1.6462876/P_M + 374.87559 e^{-P_M})^{-1} for JJ-path testing,

each represent a good model of the aggregate coverage attained by the corresponding
method, since the associated R^2 values are both 0.997. Again by performing simple
calculations on the trend lines, it was found that branch testing yields an aggregate
coverage that is superior to that of JJ-path testing in the rather small range P_M ∈
[9.257, 10.06]; after P_M = 10.06, the situation is reversed. The performance of the
methods in relative terms shows that, in the range of interest, JJ-path testing is
between (–1.64)% and 3.127% more effective than branch testing, being 2% or more
effective for P_M > 16.179. Therefore, it may be said that, in the mean, for all values of
P_M > 10 (approximately), JJ-path testing provides a superior level of aggregate
coverage to branch testing. Identical comments can be made, and the same
conclusion drawn, in respect of aggregate coverage per test path.
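The crossover point P_M = 10.06 can be recovered numerically from the two aggregate trend lines, for example by bisection on their difference. A sketch in Python (illustrative only; the coefficients are those of the aggregate regression lines above):

```python
import math

# Aggregate-coverage trend lines T*(PM) = (A1 + A2/PM + A3*exp(-PM))**-1
# for branch testing and JJ-path testing.
def agg_branch(pm):
    return 1.0 / (0.53743396 + 1.5449193 / pm + 167.52041 * math.exp(-pm))

def agg_jjpath(pm):
    return 1.0 / (0.51848991 + 1.6462876 / pm + 374.87559 * math.exp(-pm))

def crossover(lo=9.257, hi=20.0, tol=1e-6):
    """Bisection on the difference of the two trend lines: below the root,
    branch testing gives the higher aggregate coverage; above it, JJ-path
    testing does."""
    f = lambda pm: agg_jjpath(pm) - agg_branch(pm)
    assert f(lo) < 0 < f(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

print(round(crossover(), 3))
```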
5.0 Conclusions

In this paper, an experimental investigation has been undertaken of the extent of the
collateral coverage of branches provided by JJ-path testing, and of JJ-paths provided
by branch testing. Specifically, the mean collateral coverage provided by each testing
method has been compared with the mean “natural” coverage provided by the other.
The levels of the mean aggregate coverage (branch coverage + JJ-path coverage)
achieved by the methods have likewise been compared. The values of Ter_2^*, Ter_3^*,
and (Ter_2^* + Ter_3^*) that were used to effect the comparisons were derived by
applying branch testing and JJ-path testing to a sample of 35 units of code and
recording the relevant statistics.
The results of the comparisons can be summarised as follows.
(1) Branch coverage
The results show that if fewer than 71 test paths per unit (including infeasible paths)
need to be generated, the mean branch coverage achieved by branch testing always
exceeds the mean collateral branch coverage provided by JJ-path testing. The
difference in mean branch cover is 3.691% at P_M = 9.257, which reduces to 2% at
P_M = 21.165, and thence to 1.38% at P_M = 70.143. It is noted that the experimental
results apply only to P_M, the number of test paths generated, in the range
[9.257, 70.143]. However, if the trend lines that have been derived are valid for
P_M > 70.143, the value of the mean collateral coverage will continue to approach that
of the mean natural coverage (the percentage difference between them will continue
to decrease) with increasing values of P_M.
(2) JJ-path coverage
The natural coverage of JJ-paths by JJ-path testing, in the mean, always exceeds the
collateral cover delivered by branch testing. The extent of the excess increases
rapidly from 0.579% at P_M = 9.257 to 5% at P_M = 10.854, and thereafter to 7.843%
at P_M = 70.143. Further, if the trend lines are valid beyond this value of P_M, it will
continue to increase until a maximum value is reached.
(3) Aggregate coverage
The mean aggregate coverage that branch testing yields exceeds that achieved by
JJ-path testing, by at most 1.64%, only when no more than approximately 10 test
paths per unit need to be generated (the maximum attainable aggregate coverage
being 2). If, on the other hand, between 10 and 70 paths are needed, the situation is
reversed, and the relatively better mean yield of JJ-path testing rises to more than 2%
at P_M = 16.179, after which it increases steadily to 3.127% at P_M = 70.143. Further
increases up to a maximum value are indicated if the trend lines are valid beyond this
value of P_M.
Given these results, how should the two methods be sequenced to maximal advantage in
a single testing strategy? As the testing of branches in a program is likely, in general,
to require the generation of more than 10 paths, then by virtue of (1) and (3) above,
only the extent of branch coverage will be greater if branch testing is applied first.
The aggregate coverage provided will be less (see (3)) than that attained using the
alternative sequence, as, to an even greater extent, will be the coverage of JJ-paths
(see (2)). The results therefore suggest that the order of application should be JJ-path
testing first, followed by branch testing provided that some branches remain uncovered.
However, taking into account the extent of the percentage differences between the
performance of the two methods in the above three cases, the use of only JJ-path
testing in a testing strategy, rather than in combination with branch testing, does not
by any means appear to be an unrealistic option. Moreover, it appears to the authors
to be a somewhat better option. This view is supported by the fact that, because of
both the form of the trend lines, and their quality of fit, all of the above comments
concerning relative performance also obtain when coverage per test path is
considered. For this can certainly be viewed as a measure of the effort required in
order to achieve a given level of coverage. Thus, the authors contend that, if it is
necessary to adopt only one of the testing methods in order to encompass branch and
JJ-path coverage and/or their aggregate, the choice should be JJ-path testing. Further,
they also contend that serious consideration should be given to using only JJ-path
testing even when the application of both methods is possible.
How valid and general are these conclusions? The fundamental factors of influence
here are: the method used to generate the test paths, and the sample used in the
experimentation, in the context of its being representative. As far as the first of these
is concerned, the features of the path generation method used in the experiments are,
as was made explicit in section 2.2, such as not to introduce bias into the results
derived using any testing method. Thus, although the values of Ter_2^*, Ter_3^*, and
(Ter_2^* + Ter_3^*) achieved by any other unbiased path generation method used in
connection with branch and JJ-path testing may not accord with those recorded
herein, the relative levels of coverage that they yield should remain inviolate. With
regard to the second factor, two features of the sample are germane: the language in
which units in the sample are encoded, and the extent to which the sample is
representative in respect of the number of branches and JJ-paths that were involved.
It was argued in section 3.0 that the issue of encoding language should have, at most,
an insubstantial influence on the outcome of the experiments. In terms of the sample
used being truly representative, it must first be noted that no method exists for
proving/disproving that any given sample is representative. However, the code units
in the sample possess between 3 and 50 branches, with a mean of 19.89, and between
3 and 46 JJ-paths, with a mean of 18.77. Whether these values confirm that the
sample is representative is a moot point; however, both ranges are quite
substantial. Also, it must be said that the code units constituting the sample were
selected in a pseudo-random manner; they have been used extensively in a number of
areas of endeavour; the size of the sample (35) is relatively large when compared with
other studies of test coverage that are reported in the literature. The authors suggest,
therefore, that tentative acceptance of the sample’s being representative is not
unreasonable, and thus, some credence should be afforded to the results and
conclusions deriving from the work.
8.0 References
[1] Bertolino A and Marre M. How many paths are needed for branch testing? Journal
of Systems and Software, 1996; 35(2): 95-106.
[2] Brown JR. Practical application of software tools. TRW report TRW-55-72-05,
TRW Systems, One Space Park, Redondo Beach, California, 1972
[3] European Space Agency (ESA) Software Engineering Standards. ESA PSS-05-0
Issue 2, European Space Agency (ESA), 8-10, rue Mario-Nikis, 75738 PARIS
CEDEX, France, 1991.
[4] Frankl PG, Hamlet RG, Littlewood B and Strigini L. Evaluating testing methods
by delivered reliability. IEEE Transactions on Software Engineering; 1998; 24(8)
586-601.
[5] Frankl PG and Weyuker EJ. Provable Improvements on Branch Testing, IEEE
Transactions on Software Engineering; 1993; 19(10) 962-975.
[6] Gondran M and Minoux M. Graphs and Algorithms, Wiley-Interscience, John
Wiley and Sons, Chichester, UK, 1984.
[7] Harder M, Mellen J, and Ernst MD. Improving test suites via operational
abstraction. Proceedings of the International Conference on Software
Engineering, Portland, Oregon; 2003; 60-71.
[8] Koutsikas C and Malevris N. A Unified Symbolic Execution System, Proceedings
of the ACS/IEEE International Conference on Computer Systems and
Applications, Beirut, 2001, pp. 466-469
[9] Malaiya YK, Li MN, Bieman JM, and Karcich R. Software reliability growth with
test coverage. IEEE Transactions on Reliability; 2002; 51(4): 420-426.
[10] Malevris N. An Assessment of the Number of Paths Needed for Control Flow
Testing. 3rd International Conference on Reliability, Quality and Safety of
Software-Intensive Systems (ENCRESS ’97), Athens, Greece, pp. 32-43.
[11] Malevris N and Yates DF. The Collateral Coverage of Data Flow Criteria When
Branch Testing, Information and Software Technology, to appear.
[12] Malevris N, Yates DF and Veevers A. A predictive metric for the likely
feasibility of program paths. Information and Software Technology; 1990;
32(2), 115-119.
[13] Paige MR. On partitioning program graphs, IEEE Transactions on Software
Engineering; 1977; SE-3(6) 386-393.
[14] Requirements For Safety Related Software in Defence Equipment, Part 1.
Defence Standard 00-55(PART 1)/Issue 2, Ministry of Defence, U.K., 1997.
[15] Standard for Software Component Testing. British Computer Society Special
Interest Group in Software Testing (BCS SIGIST), U.K., 2001.
[16] Sze SKS and Lyu MR. ATACOBOL - A COBOL Test coverage analysis tool
and its applications. Proceedings of the 11th International Symposium on
Software Reliability Engineering (ISSRE'00), San Jose, California; 2000; 327-
335.
[17] Testbed, LDRA Software Technology, Portside, Monks Ferry, Wirral, U.K.
[18] Woodward MR. An investigation into program paths and their representation.
Techniques et Science Informatiques; 1984; 3, 273-286
[19] Woodward MR, Hedley D and Hennell MA. Experience with path analysis and
testing of programs. IEEE Transactions on Software Engineering; 1980; SE-6:
278-286.
[20] Yates DF and Hennell MA. An approach to branch testing. Proc. of 11th
Workshop on Graph Theoretic Techniques in Computer Science, Wurtzburg,
West Germany; 1985; pp. 421-433.
[21] Yates DF and Malevris N. The effort required by LCSAJ testing: an assessment
via a new path generation strategy. Software Quality Journal; 1995; 4(3): 227-
243.
[22] Yates DF and Malevris N. Reducing the effect of infeasible paths in branch
testing. Proc. 3rd Symposium on Software Testing, Analysis and Verification
(TAV3), Key West, Florida, U.S.A.; 1989; 48-56.
[23] Zhu H, Hall P and May HR. Software unit test coverage and adequacy. ACM
Computing Surveys; 1997; 29(4):336–427.
Appendix
Branch Coverage

              Initial coverage                    Final coverage
        Branch testing   JJ-path testing    Branch testing   JJ-path testing
Unit    paths   Ter_2^*  paths   Ter_2^*    paths   Ter_2^*  paths   Ter_2^*
(“paths” = no. of paths used)
1 3 1 4 1 3 1 4 1
2 4 0.857 6 0.857 6 1 12 1
3 2 1 3 1 2 1 3 1
4 3 0.75 5 1 5 1 5 1
5 9 0.353 19 0.353 309 0.588 319 0.353
6 5 0.875 6 0.875 6 1 7 1
7 7 0.385 10 0.385 23 1 36 1
8 3 1 4 1 3 1 4 1
9 6 1 8 1 6 1 8 1
10 4 0.714 7 0.714 178 1 123 1
11 8 0.5 12 0.5 308 0.5 312 0.5
12 10 0.167 16 0.278 212 1 316 0.278
13 7 1 10 1 7 1 10 1
14 6 0.9 8 0.9 8 1 10 1
15 9 0.667 11 0.667 42 1 14 1
16 9 0.933 12 0.933 12 1 52 1
17 2 1 2 1 2 1 2 1
18 8 0.5 10 0.5 21 1 310 1
19 7 0.583 9 0.583 13 1 309 1
20 5 0.625 7 0.625 13 1 203 1
21 4 1 7 1 4 1 10 1
22 3 1 4 1 3 1 4 1
23 5 0.667 10 0.667 94 1 310 1
24 10 0.789 15 0.789 310 0.789 315 0.789
25 5 1 8 1 5 1 8 1
26 4 1 5 1 4 1 15 1
27 7 0.231 12 0.231 307 0.538 312 1
28 7 0 11 0 53 1 311 1
29 6 1 8 1 6 1 8 1
30 15 0.429 25 0.857 103 1 63 1
31 5 0 10 0 26 1 310 1
32 13 0.273 21 0.273 313 0.727 321 0.818
33 5 0.125 7 0.125 36 1 38 1
34 5 1 5 1 5 1 5 1
35 7 1 7 1 7 1 7 1
Table 2
The values of P_M and T_2^*(P_M) used in deriving the trend
lines in respect of branch coverage

     Branch Testing              JJ-path Testing
   P_M        T_2^*(P_M)       P_M        T_2^*(P_M)
10.14286 0.774 9.257143 0.717
11.85714 0.776 9.6 0.721
12.28571 0.791 10.25714 0.728
12.71429 0.793 10.85714 0.758
13.11429 0.806 11.97143 0.781
13.51429 0.824 12.51429 0.811
14.25714 0.832 14.6 0.811
14.45714 0.833 16.54286 0.812
15.37143 0.861 17.02857 0.814
18.82857 0.872 17.51429 0.828
19.42857 0.881 18.48571 0.83
21.2 0.886 19.45714 0.859
22.62857 0.892 21.88571 0.859
23.2 0.905 22.37143 0.863
26.51429 0.908 24.68571 0.88
31.91429 0.909 25.54286 0.893
33.97143 0.91 25.97143 0.893
34.25714 0.919 27.25714 0.893
48.37143 0.923 27.68571 0.893
52.88571 0.942 28.51429 0.893
53.22857 0.947 38.91429 0.909
70.14286 0.947 48.2 0.918
- - 56.74286 0.922
Table 4.
JJ-path Coverage

              Initial coverage                    Final coverage
        Branch testing   JJ-path testing    Branch testing   JJ-path testing
Unit    paths   Ter_3^*  paths   Ter_3^*    paths   Ter_3^*  paths   Ter_3^*
(“paths” = no. of paths used)
1 3 0.875 4 1 3 0.875 4 1
2 4 0.667 6 0.75 6 0.833 12 1
3 2 0.8 3 1 2 0.8 3 1
4 3 0.5 5 1 5 0.875 5 1
5 9 0.171 19 0.171 309 0.343 319 0.171
6 5 0.786 6 0.857 6 1 7 1
7 7 0.278 10 0.333 23 0.833 36 1
8 3 0.875 4 1 3 0.875 4 1
9 6 0.875 8 1 6 0.875 8 1
10 4 0.455 7 0.455 178 1 123 1
11 8 0.4 12 0.4 308 0.4 312 0.4
12 10 0.156 16 0.25 212 0.844 316 0.25
13 7 0.85 10 1 7 0.85 10 1
14 6 0.882 8 0.882 8 1 10 1
15 9 0.458 11 0.5 42 0.833 14 1
16 9 0.857 12 0.952 12 0.905 52 1
17 2 1 2 1 2 1 2 1
18 8 0.471 10 0.471 21 0.882 310 0.882
19 7 0.533 9 0.533 13 0.867 309 0.867
20 5 0.545 7 0.545 13 0.909 203 1
21 4 0.75 7 0.875 4 0.75 10 1
22 3 0.857 4 1 3 0.857 4 1
23 5 0.5 10 0.5 94 0.857 310 0.857
24 10 0.7 15 0.733 310 0.7 315 0.733
25 5 0.8 8 1 5 0.8 8 1
26 4 0.875 5 0.875 4 0.875 15 1
27 7 0.167 12 0.167 307 0.5 312 1
28 7 0 11 0 53 0.765 311 0.765
29 6 0.875 8 1 6 0.875 8 1
30 15 0.415 25 0.707 103 0.902 63 1
31 5 0 10 0 26 0.692 310 0.692
32 13 0.206 21 0.206 313 0.441 321 0.618
33 5 0.125 7 0.125 36 1 38 1
34 5 1 5 1 5 1 5 1
35 7 1 7 1 7 1 7 1
Table 5.
The values of P_M and T_3^*(P_M) used in deriving the trend
lines in respect of JJ-path coverage

     Branch Testing              JJ-path Testing
   P_M        T_3^*(P_M)       P_M        T_3^*(P_M)
10.14286 0.675 9.257143 0.666
11.85714 0.678 9.6 0.67
12.28571 0.69 10.25714 0.678
12.71429 0.693 10.85714 0.716
13.11429 0.7 11.97143 0.741
13.51429 0.717 12.51429 0.769
14.25714 0.721 14.6 0.772
14.45714 0.722 16.54286 0.774
15.37143 0.742 17.02857 0.779
18.82857 0.753 17.51429 0.793
19.42857 0.763 18.48571 0.796
21.2 0.766 19.45714 0.816
22.62857 0.771 21.88571 0.818
23.2 0.781 22.37143 0.822
26.51429 0.784 24.68571 0.839
31.91429 0.787 25.54286 0.849
33.97143 0.787 25.97143 0.85
34.25714 0.797 27.25714 0.851
48.37143 0.805 27.68571 0.852
52.88571 0.822 28.51429 0.853
53.22857 0.825 38.91429 0.865
70.14286 0.825 48.2 0.875
- - 56.74286 0.883
Table 7.
2. Formal Models and Approaches to Testing
Towards Unit Testing for
Communicating Stream X-machine Systems⋆

Joaquín Aguado and Michael Mendler

Faculty of Information Systems and Applied Computer Sciences, University of Bamberg, Germany
{joaquin.aguado,michael.mendler}@wiai.uni-bamberg.de
Abstract. This paper studies the conformance testing problem for a model of distributed systems, Communicating Stream X-machine Systems (CSXMS). An approach for the unit testing of CSXMS is suggested. It is based on generating testing sequences from the distributed specification, and exercising them on each of the components independently of the context through a distributed test system, which can be the same original system, but in which the components have some extra functionality so as to act as testers. The main technical result is twofold. First, for testing a complete system this approach has at least the same fault detection ability as the previous (product machine) approach, but it is also more efficient and it permits the separation of unit and integration testing. Second, the design for test conditions are relaxed compared to previous CSXMS techniques regarding the attainability of the memory, which means less effort is needed to prepare systems for testing.
1 Introduction
Conformance testing concentrates on the production of a set of test cases which are deemed sufficient to demonstrate that the behaviour of the implementation conforms to the specification. Conformance, in general, cannot be expected to imply full functional correctness for two reasons. First, to validate correctness fully, exhaustive testing is required, which is impossible to achieve if there is an infinite range of operating conditions and user interactions of which only a rather small and finite part can be covered by a finite test suite. Second, if the implementation under test is a concrete physical system (a set of programs running on a set of hardware processors) implementing a complex software architecture (compiled and optimized for a given operating system and middleware layer), it will be next to impossible to determine an exact model of the implementation relative to which one could guarantee that tests are exhaustive.
If conformance testing does not obtain full specification correctness, what does it achieve? How can we measure the quality of a conformance test and give rational assertions about the extent to which it achieves its goal? Any such
⋆ This work has been partially supported by the European Commission within the TYPES network IST 510996.
method, so it seems, will have to involve an abstraction both of the specification (to reduce the number and complexity of the specified behaviour) and of the implementation. The Stream X-machine Model (SXM) [1] is geared towards making such abstractions by separating finite control structure from (usually infinite) data paths. Each of these two can be tested separately. For instance, it may be assumed that, relative to a given set of control points, the implementation can be represented by an SXM with a bounded number of control states, and that the data transformations that are effected through the interaction points can be verified separately. This is known as the fundamental testing hypothesis. Relative to such an abstract model of the implementation under test (IUT) it is possible to give guarantees about full test coverage. To be more precise, according to [2] a fault model is a triplet (sι, conforms, Miut), where sι is a finite-state specification (e.g. I/O FSM, LTS, I/O Automaton, X-Machine), conforms is the conformance relation (e.g. FSM equivalence, reduction or quasi-equivalence, trace inclusion or trace equivalence) and Miut is the fault domain (i.e. the set of possible implementations). The fault domain usually reflects test assumptions (e.g. all I/O FSMs with a given number of states) which are established by a testing hypothesis. In this manner, a test suite is complete with respect to (sι, conforms, Miut) when, for all ι ∈ Miut, ι passes the test suite if and only if ι conforms to sι. If so, the test suite provides a fault coverage guarantee for the given fault model.
Fig. 1. Conformance Testing. The fault model (sι, conforms, Miut) relates the implementations under test ι ∈ Miut, the specifications sι ∈ Ms and the models of implementation mι ∈ Mm; the testing hypothesis links “ι conforms sι” with the formal relation mι ≃ sι.
In conformance testing [3, 4] we are given three sets Miut, Ms and Mm as in Fig. 1: respectively, the implementations under test, the specifications and the models of implementation. Each IUT ι ∈ Miut is to be tested with respect to a given specification sι ∈ Ms. The testing hypothesis is the assumption that each ι ∈ Miut can be captured adequately by some associated formal model mι ∈ Mm. Conformance is the result that we hope to get from running some set of tests on the implementation ι. If successful, the tests establish a relationship between
40
UKTest 2005
Miut and Ms, viz. “ι conforms sι”. This is not a formal relationship but an operational process. If we want to judge the quality of the tests we refer to the models of implementation. For we may be able to show that the tests are powerful enough that any model in Mm that passes all tests actually implements Ms in some formal sense. Depending on the purpose of testing, such an implementation relationship, written mι ≃ sι, can mean one of many things, e.g. that mι is trace equivalent to sι, that mι simulates sι, and so on. The point is that it is a precise mathematical relation that gives rational meaning to the statement that an implementation passes a set of tests. In this way the testing hypothesis links the informal notion of ι conforms to sι with a formal one, mι ≃ sι.
Testing is about deciding the implementation relation ≃, which, if it is significant, refers to an infinite number of possible input sequences, in terms of a finite but judiciously chosen test set χ. For finite state systems either the classical W-method [5] or the Wp-method [6] can be applied to obtain a test set χ from the specification automaton. The test suite generation problem for SXM has been solved [7, 8] by considering the associated (control) automaton A(Λs) of the SXM specification Λs (see Def. 1 below), treating the operations Φ on the data path as abstract input symbols (Φ is called the type of Λs). This depends only on the control states and is independent of the different possible states of the memory (i.e. memory values). Thus, it produces a much smaller test set than would a method that considers both control-states and memory-states. The test set is a set of sequences χ ⊆ Φ∗ from the type Φ of Λs which tests trace equivalence with respect to A(Λs). Let ≃T denote language equivalence relative to the words in T ⊆ Φ∗, i.e., A1 ≃T A2 iff ∀ω ∈ T. ω ∈ L(A1) ⇔ ω ∈ L(A2), where L(A) denotes the language of A. Full trace equivalence, A1 ≃ A2, is A1 ≃Φ∗ A2. As part of the testing hypothesis one assumes that the control automaton A(Λm) of an implementation under test Λm has a bounded number k of control states (essentially the number of states, depending on χ). Then if A(Λm) passes all tests in χ (which have been generated to fit k) it must be trace equivalent to A(Λs). Formally, A(Λm) ≃χ A(Λs) implies A(Λm) ≃ A(Λs). Now, if all data path operations Φ are implemented correctly (by the testing hypothesis) this further implies that Λm and Λs have the same input-output functions fΛm and fΛs, respectively (see Def. 2 below), i.e., ∀s ∈ Σ∗. fΛm(s) = fΛs(s).
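The relation ≃T can be made concrete for explicit finite automata. The Python sketch below is illustrative only; the two machines are invented for the example and are not taken from the paper:

```python
# Language equivalence relative to a word set T: A1 is equivalent to A2 with
# respect to T iff both automata accept exactly the same words of T. A DFA is
# given as (transitions, start state, accepting states); missing transitions reject.
def accepts(dfa, word):
    trans, start, accepting = dfa
    state = start
    for symbol in word:
        if (state, symbol) not in trans:
            return False
        state = trans[(state, symbol)]
    return state in accepting

def equiv_rel(a1, a2, words):
    """A1 ≃_T A2: agreement on every word of the (finite) set T."""
    return all(accepts(a1, w) == accepts(a2, w) for w in words)

# Two machines over {"a"}: A1 accepts a-strings of even length (including 0);
# A2 accepts only the empty word and "aa". They agree on short words only.
A1 = ({("p0", "a"): "p1", ("p1", "a"): "p0"}, "p0", {"p0"})
A2 = ({("q0", "a"): "q1", ("q1", "a"): "q2"}, "q0", {"q0", "q2"})

T_small = ["", "a", "aa"]
T_large = T_small + ["aaa", "aaaa"]
print(equiv_rel(A1, A2, T_small), equiv_rel(A1, A2, T_large))
```

This illustrates why the test set must be generated to fit the assumed state bound k: on too small a T the machines are indistinguishable.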
Let us overload notation and use ≃ to denote this functional equivalence for SXMs. Under the testing hypothesis, thus, a finite test set χ suffices to decide functional equivalence ≃ between two SXMs. But how do we apply χ to the control automaton, A(Λm), in practice? What we will have available for testing is the IUT corresponding to the full SXM Λm, not the abstracted control automaton. Thus, in order to exercise these sequences on Λm it is necessary to convert them into sequences from the input alphabet Σ of Λm. Here is where the fundamental test function [9, 7, 8] comes into play. A fundamental test function t : Φ∗ → Σ∗ has the property that for every ω ∈ χ, t(ω) ∈ dom(fΛm) iff ω ∈ L(A(Λm)). In other words, to find out if a test sequence ω ∈ χ is accepted by the control automaton A(Λm) it suffices to run the translation t(ω) ∈ Σ∗ on
the “real” system Λm. Obviously, then, Λm ≃ Λs iff ∀ω ∈ χ. t(ω) ∈ dom(fΛm) ⇔ ω ∈ L(A(Λs)).

How do we construct the test function? The canonical way to do this is to simulate the abstract test sequence ω = φ1φ2 · · ·φk ∈ χ incrementally on Λm, at each step providing a suitable input symbol σi ∈ Σ for controlling the execution of φi ∈ Φ on Λm, so that we are able to detect whether or not φi can be successfully executed (triggered). The abstraction made for obtaining the test set χ (by interpreting the relations in Φ as abstract symbols) removes information regarding the input-output behaviour and the machine memory. In order for the fundamental test function t to be able to fill in this information, the standard technique proceeds as follows:
1. Identify a memory invariant N ⊆ MEM that can be maintained for a restricted set of input symbols S ⊆ Σ. Formally, the type Φ of Λ is called closed with respect to (MEM, S) if the initial memory m0 ∈ N and every φ ∈ Φ executed from an arbitrary memory m ∈ N under an input symbol σ ∈ S produces a memory value m′ ∈ N.
2. Ensure that it is always possible to apply an input σk+1 ∈ S that can trigger a given operation φk+1 ∈ Φ in all memory states within the invariant set N that can possibly be assumed by the implementation Λm after having executed the functions φ1φ2 · · ·φk in this order. This condition, known as input completeness, ensures that (i) every path ω ∈ L(A(Λs)) which is also in L(A(Λm)) can actually be followed in Λm, and (ii) if at the end of this path a function φk+1 is not found to be executable in Λm (note that this φk+1 may or may not be executable in Λs) then ω cannot be extended by φk+1 in the control automaton A(Λm) either.
3. Ensure that it is always possible to determine which operation φ ∈ Φ has been executed by observing the output γ ∈ Γ produced by Λm. For non-deterministic systems it is crucial that this output also determines the next memory value. This condition, known as output-distinguishability, guarantees that the output sequence produced by the machine uniquely identifies the path followed.
Completeness and output-distinguishability are known as design for test conditions. Output-distinguishability is only mentioned briefly in Section 4.3 regarding deterministic machines. For information about different types of non-deterministic SXM the reader is referred to [10–12].
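To fix ideas, this standard technique can be exercised on a toy machine. The Python sketch below is purely illustrative: the two operations, the input set S and the control automaton are invented, and `exercise` only mimics the role of a fundamental test function under the input-completeness assumption:

```python
# A toy SXM: the memory is a counter, the type Phi consists of two partial
# functions (memory, input) -> (output, memory'), and the control automaton
# has arcs labelled with operation names.
def inc(m, sigma):
    return ("i", m + 1) if sigma == "a" else None  # triggered by input "a"

def dec(m, sigma):
    # Partial: only triggerable by "b" and when the counter is positive,
    # which keeps the memory within the invariant N = {m >= 0}.
    return ("d", m - 1) if sigma == "b" and m > 0 else None

PHI = {"inc": inc, "dec": dec}
S = ["a", "b"]  # restricted set of input symbols
ARCS = {("q0", "inc"): "q1", ("q1", "inc"): "q1", ("q1", "dec"): "q0"}

def exercise(abstract_word, q0="q0", m0=0):
    """Translate an abstract sequence over Phi into concrete inputs and run
    it, in the spirit of a fundamental test function: return the concrete
    input sequence if the abstract sequence is executable, else None."""
    q, m, inputs = q0, m0, []
    for op in abstract_word:
        if (q, op) not in ARCS:          # rejected by the control automaton
            return None
        # Input-completeness assumption: some sigma in S triggers op from m.
        sigma = next((s for s in S if PHI[op](m, s) is not None), None)
        if sigma is None:
            return None
        _, m = PHI[op](m, sigma)
        inputs.append(sigma)
        q = ARCS[(q, op)]
    return inputs

print(exercise(["inc", "dec"]), exercise(["dec"]))
```

Because each operation emits a distinct output (“i” vs “d”), the toy machine is also output-distinguishable in the sense of condition 3.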
The fault model of SXM-based testing admits a more general interpretation in terms of an abstract machine AINV which we might call an automaton invariant. The idea is that AINV encapsulates an abstraction of the combined memory and control paths which both specification and implementation are assumed to follow as part of the testing hypothesis. In this manner, the fault domain (given by the testing hypothesis) refers to the set of those implementations refined from AINV. The notion of refinement here is trace inclusion. To be more precise, for all SXM Λ with type Φ for which the testing hypothesis holds,
we have L(A(Λ)) ⊆ L(AINV). The implication of this is that any test set χ is a subset of L(AINV). This is justified because any ω /∈ L(AINV) can be executed neither in the specification nor in the implementation. Note that this filtering through L(AINV) still involves both positive and negative testing: for all ω ∈ χ ∩ L(A(Λ)) it is tested that the implementation admits these traces, while for ω ∈ χ ∩ (L(AINV) \ L(A(Λ))) we verify that ω cannot be executed.
For the stand-alone SXM testing technique, AINV is trivial, as illustrated in Fig. 2. In this AINV there is a single state representing all pairs Q × N and a self-loop on this state for every operation φ ∈ Φ subject to the input constraint S. The intended meaning is that for each (q, m) ∈ Q × N and every φ ∈ Φ there is an input σ ∈ S such that (m, σ) triggers φ and the resulting state is again in Q × N.

Fig. 2. The automaton invariant AINV of the SXM testing approach: a single state Q × N with a self-loop labelled Φ, S.
One contribution of this paper consists of defining AINV for communicating stream X-machine systems (CSXMS) (see Fig. 3 below), subject to the structural constraints and the assumptions on the operation of the functions of Φ made in the testing hypothesis.
Next, let us look at the conformance testing problem for a communicating stream X-machine system (CSXMS) specification W. In [9, 13] it is assumed by the testing hypothesis that the IUT ι is modelled by a single monolithic SXM Λm rather than by a communicating system of individual SXMs. In order to obtain tests, the specification W is “transformed” into a functionally equivalent global SXM Λ(W) [13], so that Λ(W) ≃ W. The testing is done from this resulting Λ(W) using a stand-alone testing method. If the test cases are sufficient to demonstrate that Λm ≃ Λ(W), then, since ≃ is transitive, it is possible to deduce that Λm ≃ W, and the problem reduces to that of finding a test set for Λm from Λ(W). In this approach, thus, the original structure of the CSXMS specification is lost and is not linked with any component structure of the IUT.
Using the procedures described in [7], it can be ensured that the type Φi of every component Λi is both input-complete and output-distinguishable. The following results from [9] provide the basis for test generation from Λ(W), under moderate assumptions on the structure of the CSXMS W (testing variant, simpleness):

– If the associated automaton of each component Λi of W is deterministic then the associated flat automaton Λ(W) is also deterministic.
– If the type Φi of each Λi is a set of partial functions then the type Φ of Λ(W) also consists of partial functions.
– If all Φi are input-complete then Φ is input-complete.
– If all Φi are output-distinguishable, then Φ is, too.
The advantage of this global product machine approach is that it is verygeneral and can be applied in situations where the IUT cannot be decomposedinto parts and only accessed as a whole. However, there are severe disadvantages,too. Because of state-space explosion the construction of Λ(W ) and of the testsequences from it can be very expensive. Also, since the IUT is tested as a wholeit can be very difficult to obtain a reliable guess on the maximal sequentialdepth (number of states) of Λm when the IUT is a complex system. Fortunately,in many practical situations this monolithic approach is not necessary. Oftenthe distribution of components in W has a direct match in the implementation,either because the implementation has been derived from W in a component-oriented manner or because the specification W has been tailored to reflectexisting implementation components for the purposes of testing. If the softwaresystem is implemented from the CSXMS specification it seems natural to assumethat this structure is preserved in the implementation. Programming languages,operating systems and tools provide a standard mechanism to achieve this evenin stand-alone computers.
Indeed, we believe that the main benefits of using CSXMS as a specification formalism lie in the possible separation of unit and integration testing. In unit testing the smallest piece of a system (e.g. the smallest compilable element or a communicating component) is tested in isolation. On the other hand, integration testing is defined as the activity in which components are combined and tested to evaluate the interaction between them [14]. The original approach [9, 13] described above does not support this separation of concerns as it performs at the same time and in an indivisible manner both the unit testing and the integration testing of a system specified as a CSXMS. This paper attempts to make first steps towards developing a methodology for unit and integration testing based on CSXMS.
2 Unit Testing for CSXMS
A CSXMS specification W is distributed in the sense that it is presented in termsof clearly separated and independently executing components, which interactwith each other. In order to define a fault model that exploits this for unittesting one possibility is to restrict the fault domain Miut to those IUTs whichfollow the structure (system architecture) of the specification W . Concretely,this means that for every implemented component under test (ICUTi) in theIUT, there is one and only one component Λi in the CSXMS specification Wthat prescribes its behaviour.
Therefore, at least in principle, it is possible to take the component’s specifi-cation Λi and generate a test set out of it, and then exercise the test sequencesin the component’s implementation ICUTi to observe its behaviour. However,neither this Λi nor the ICUTi are isolated entities. In the CSXMS model they
communicate with other components through input and output ports. We maynow analyse the behaviour of the ICUTi and compare it to that prescribed by Λi
under the assumption that the rest of the system is fixed and operates correctlyaccording to the CSXMS specification, or test independently of the context. Thislatter alternative is the one explored in this paper.
Let Λim be the model of ICUTi and Λi its specification. The implementation relationship is articulated as follows. We put Λim ≃ Λi iff W [Λi] ≃ W [Λim], where W [·] is an arbitrary CSXMS in which Λi and Λim can operate as a component. For W [Λi] ≃ W [Λim] we use standard CSXMS equivalence. We show how the test generation problem of deciding the strong contextual equivalence Λim ≃ Λi can be solved using the SXM testing approach on the specified component Λi with an extended communication interface. Note that the assumption that the components can be manipulated does not imply that from outside of the system (environment) it is possible to access the communication interfaces (i.e. ports) directly, but just the environment interface (i.e. streams) of the components. In order to do this, the approach suggested here is based on testing architectures. The intention of a distributed/remote architecture [4] is to provide a framework for unit testing of the ICUTs that later will be integrated into a system. A testing architecture is a description of the environment in which the IUT is executed by carrying out test cases in order to observe whether or not it has a certain behaviour. Abstract testing architectures are described in terms of observable outputs and controllable inputs with respect to the IUT [15]. In this way, a testing architecture refers to the test devices (also called testers), the connections between them and those between the test devices and the IUT [16]. Even more specifically, a testing architecture consists of the IUT, its points of control and observation (PCOs) and a test system. The PCOs are the points closest to the IUT at which input events and observation of output events take place during testing [15, 17].
Our test architecture consists of connecting independent testers to all input and output channels of the CSXMS component under test and providing them with individual test instructions, called local test sequences (LOC), that all together constitute a remote test case (RTC). In this way the test is performed in a concurrent and local fashion and also provides some amount of integration testing, as it can involve the actual communication topology in which the ICUT is supposed to operate. Two CSXMSs W′ and W are architecturally isomorphic, denoted by W′ ≅id W, if and only if there is a one-to-one correspondence between the systems' components and the structure of communication (i.e. the connections among components) is the same in both.
Now, given that all the components have been tested and conform, it is required to test the architectural isomorphism Wm ≅id W (provided that this is not given), where the CSXMS Wm is the model of the implementation (IUT). Then, because W is given and distributed, it is possible to use a simple distributed algorithm that can be implemented as part of the functionality of the components and executed when required (at least once before the system is put into operation). This integration testing algorithm operates as follows. It receives
from the input stream of each component an encoding of the identifier of the component and the identifiers of all its neighbours in W . Then each component sends its identifier to all its neighbours and receives from them their corresponding identifiers, which are placed, one by one, in the output stream. If, at the end, the output stream contains exactly the set of neighbour identifiers that are indicated in W , then the component is in the right place and correctly connected. If this is the case for all the components in the IUT, then obviously Wm ≅id W.
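This neighbour-check can be sketched as a centralised simulation (our own illustration, not from the paper; in the actual setting the algorithm runs distributed inside the components, and the names used here are hypothetical):

```python
# Sketch of the neighbour-check integration test, simulated centrally.
# spec:  {component id: set of neighbour ids prescribed by W}
# links: {component id: set of component ids it is actually connected to}

def architecturally_isomorphic(spec, links):
    """Each component sends its identifier to all of its actual neighbours;
    a component passes iff the identifiers it receives are exactly the
    neighbour set prescribed by W. All components must pass for Wm ~=_id W."""
    received = {c: set() for c in spec}
    for c in spec:
        for n in links.get(c, set()):   # c sends its id to actual neighbour n
            if n in received:
                received[n].add(c)
    return all(received[c] == spec[c] for c in spec)
```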
3 Basic Definitions
In this section we recall the standard definitions of deterministic (communicat-ing) stream X-machines.
Definition 1. A SXM is a tuple Λ = (Σ,Γ, Q,M,Φ, F, q0,m0), where:
– Σ and Γ are finite input and output alphabets, respectively,
– Q is a (finite) set of control states,
– M is a possibly infinite set called the internal memory of Λ,
– Φ, called the type of Λ, is a set of non-empty (partial) processing functions of the form φ : M × Σ ⇀ Γ × M,
– F is the (partial) next-state function, F : Q × Φ ⇀ Q,
– q0 ∈ Q and m0 ∈ M are respectively the initial control and memory states.
For determinism we require that if F(q, φ1) ≠ ⊥ and F(q, φ2) ≠ ⊥ are both defined for distinct functions φ1 ≠ φ2 then dom(φ1) ∩ dom(φ2) = ∅.
A configuration cf of a SXM is a tuple cf = (m, q, s, g), where m ∈ M, q ∈ Q, s ∈ Σ*, g ∈ Γ*. The SXM starts its execution from an initial configuration (m0, q0, s, ε), where the input stream is set to s and ε indicates that the output stream is empty. A change of configuration (m, q, σ·s′, g) ⊢ (m′, q′, s′, g·γ) is possible if there exists φ ∈ Φ with q′ = F(q, φ) and φ(m, σ) = (γ, m′). The transitive and reflexive closure of ⊢ is denoted by ⊢*. Configurations such as (m, q, ε, g) with an empty input stream are final, representing successful termination.
Definition 2. The (partial) function computed by a SXM Λ, fΛ : Σ* ⇀ Γ*, is defined by fΛ =df {(s, g) | ∃q ∈ Q, m ∈ M. (m0, q0, s, ε) ⊢* (m, q, ε, g)}.
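Definitions 1 and 2 can be prototyped directly. The following sketch is our own illustration (not from the paper), with the type Φ and the next-state function F encoded as dictionaries, and partial functions returning None where undefined:

```python
# Illustrative executor for a deterministic SXM.
# phi: {name: function (m, sigma) -> (gamma, m') or None when undefined}
# F:   {(state, name): next_state}   (the partial next-state function)

def run_sxm(phi, F, q0, m0, inputs):
    """Return the output stream g with (s, g) in f_Lambda, or None if the
    machine blocks before consuming the whole input stream."""
    q, m, out = q0, m0, []
    for sigma in inputs:
        for name, func in phi.items():
            if (q, name) in F:            # an arc labelled 'name' leaves q
                res = func(m, sigma)      # partial function: None if undefined
                if res is not None:
                    gamma, m = res
                    out.append(gamma)
                    q = F[(q, name)]
                    break
        else:
            return None                   # no processing function is enabled
    return out                            # input stream consumed: success
```

Determinism (disjoint domains of the φ leaving each state) guarantees that at most one processing function is enabled, so the iteration order over phi is irrelevant.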
The notion of CSXMS combines the advantages of the SXM, i.e., separating the concerns of modelling the control system and the functional data paths, with the possibility of sending and receiving messages among several of these abstract machines. The communication between components (also referred to as CSXMs) is implemented via local input and output ports within each component and a global communication matrix C that links every component with every other. Each entry C[i, j] in the matrix is a directed one-place buffer for messages in transit from Λi to Λj. The buffer, which may be empty or full, synchronises the components in the sense that any writing into C[i, j] by Λi is blocked until any previous value has been read by Λj so that the buffer is empty, while reading
from C[i, j] must wait until the buffer is filled by Λi. Otherwise, the execution of operations in different component SXMs is concurrent and asynchronous, i.e. any number of components whose operations are enabled can move together in one step of a global change of configuration. Let C[i, j] ← x denote the communication matrix C where cell (i, j) has been updated with value x, leaving all other cells unchanged.
Definition 3. A CSXMS with n components is a family W = (Λi)i≤n of n SXM system components of the form Λi = (Σi, Γi, Qi, DATAi, Φi, Fi, q0i, data0i) over the abstract data path DATAi = INi × Mi × OUTi × CM, subject to the following structural constraints and interpretations:
– INi and OUTi are two sets¹ called the input port and the output port respectively of the component, with the property that INi, OUTi ⊆ Mi ⊎ {λ} and λ ∈ INi ∩ OUTi. The special symbol λ is used to indicate an empty port. We will use the abbreviation MEMi = INi × Mi × OUTi, called the memory of Λi.
– CM =df Π(i,j)∈n×n OUTi is the set of communication matrices C ∈ CM of order n × n, modelling a message buffer of size 1 associating with every pair 1 ≤ i, j ≤ n a message C[i, j] ∈ OUTi en route from Λi to Λj. We assume that OUTi ⊆ INj for every pair i, j of components, so that every output message of component i can be accepted as input by every component j.
– data0i = (λ, m0i, λ, C0) is the initial state of the data path, where m0i is the initial internal memory and C0 the initial communication matrix, which we assume to be empty, i.e., C0[i, j] = λ for all i, j ≤ n.
Further, the type Φi of any component can be partitioned into processing operations Φpi and communicating operations Φci, i.e. Φi = Φpi ⊎ Φci, such that

– no processing operation depends on or modifies the communication matrix, i.e., all φpi ∈ Φpi are of type MEMi × Σi ⇀ Γi × MEMi. As a function on the full type DATAi × Σi ⇀ Γi × DATAi we have φpi(in, m, out, C, σ) = (γ, in′, m′, out′, C) where φpi(in, m, out, σ) = (γ, in′, m′, out′) under the restricted type.
– the communicating operations Φci = Φsi ⊎ Φri split into output-moves Φsi and input-moves Φri, which are partial functions of the following shapes (assuming i ≠ j):
• The output moves are sendi→j : OUTi × CM × Σi ⇀ Γi × OUTi × CM such that sendi→j(out, C, σ) is defined only if out ≠ λ = C[i, j]. The function value, if defined, is sendi→j(out, C, σ) = (γ, λ, C[i, j] ← out).
• The input moves are receivej→i : INi × CM × Σi ⇀ Γi × INi × CM such that receivej→i(in, C, σ) is defined only if C[j, i] ≠ λ = in. The result, if defined, is receivej→i(in, C, σ) = (γ, C[j, i], C[j, i] ← λ).

Both sendi→j and receivej→i operations are naturally extended to the full type DATAi × Σi ⇀ Γi × DATAi as follows: We put

sendi→j(in, m, out, C, σ) = (γ, in, m, out′, C′)

iff (γ, out′, C′) = sendi→j(out, C, σ) for the send operation, and

receivej→i(in, m, out, C, σ) = (γ, in′, m, out, C′)

iff (γ, in′, C′) = receivej→i(in, C, σ) for the receive operation.

¹ We assume that INi ≠ {λ} and OUTi ≠ {λ}.
Finally, the set of control states is partitioned, Qi = Qpi ⊎ Qci, into a set of processing states Qpi and communicating states Qci such that

– all the transitions emerging from Qpi are labelled by processing operations and all transitions emerging from Qci are labelled by communicating operations, so the next-state function Fi : Qi × Φi ⇀ Qi of component Λi has the domain dom(Fi) ⊆ (Qpi × Φpi) ∪ (Qci × Φci)
– every processing function φ ∈ Φpi clears the input port and sets the output port to a defined value, i.e., if φ(x, m, y, C, σ) = (γ, x′, m′, y′, C) then x′ = λ and y′ ≠ λ
– the initial state is a processing state, q0i ∈ Qpi
– a communicating operation always moves the machine into a processing state, i.e., if (q, φ) ∈ dom(Fi) and q ∈ Qci then Fi(q, φ) ∈ Qpi.
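The communicating operations of Def. 3 can be illustrated as partial functions on the port and matrix state. In this sketch (ours, not the paper's), λ is modelled as None, and the stream input σ and output γ are abstracted to a single passed-in output value:

```python
# Sketch of send_{i->j} and receive_{j->i} from Def. 3 (illustrative only).
# C is a dict {(i, j): message or None} acting as the one-place buffer matrix;
# the empty-port/empty-cell symbol lambda is modelled as None.

def send(i, j, out, C, gamma):
    """Defined only if out != lambda and C[i, j] == lambda (cell empty).
    Returns (gamma, cleared out-port, updated matrix), or None if blocked."""
    if out is None or C[(i, j)] is not None:
        return None                    # blocked: nothing to send, or cell full
    C2 = dict(C)
    C2[(i, j)] = out                   # C[i, j] <- out
    return (gamma, None, C2)           # the out-port is emptied

def receive(j, i, inp, C, gamma):
    """Defined only if C[j, i] != lambda and the in-port is empty."""
    if inp is not None or C[(j, i)] is None:
        return None                    # blocked: port full, or cell empty
    C2 = dict(C)
    C2[(j, i)] = None                  # the cell is emptied
    return (gamma, C[(j, i)], C2)      # the message moves onto the in-port
```

The guards mirror the definedness conditions of Def. 3: a send blocks on a full cell, a receive blocks on an empty cell, which is exactly the one-place-buffer synchronisation described in Section 3.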
We will generally use processing operations φpi, output moves sendi→j and input moves receivej→i with their more specific types. Also, note that communication operations, just like processing operations, are synchronous in the sense that they consume exactly one symbol from the input stream and produce one symbol on the output stream. Previous work on CSXMS [9] permits communication operations to run silently, but adds extra symbols for controlling and observing them as part of the so-called testing variant. Def. 3 avoids this extra instrumentation step. Also, note that Def. 3 presumes that each component of a CSXMS is simple, a condition imposed in [9] for input controllability.
Let W = (Λi)i≤n be a CSXMS. As defined above, a configuration for SXM component Λi is a tuple

cfi = (qi, ini, mi, outi, C, si, gi) ∈ Qi × MEMi × CM × Σi* × Γi*,

where ini ∈ INi and outi ∈ OUTi are the current port values, mi ∈ Mi and qi ∈ Qi the current internal memory and control state, respectively, C the current state of the communication matrix, and finally si ∈ Σi*, gi ∈ Γi* the current contents of the input and output streams. We now define the configurations of the CSXMS W by glueing together these local configurations, using the communication matrix as a shared object. Thus, a configuration of W is a tuple cf = (cf1, . . . , cfn, C) where each cfi is a configuration of component Λi and C the current global communication matrix such that C = π5(cfi) (πi refers to the ith projection function) for all i ≤ n. Each such SXM Λi performs local changes of configuration cfi ⊢ cf′i by executing a processing or communicating operation in the appropriate way. Since all Λi run under mutual exclusion regarding the matrix C, their execution can be serialised. Thus, a global change of configuration (cf1, . . . , cfn, C) ⊢ (cf′1, . . . , cf′n, C′) consists of an index i ≤ n such that
cfi ⊢ cf′i and cfj = cf′j for all other j ≠ i. Again, we let ⊢* be the reflexive and transitive closure of ⊢. A configuration cf = (cf1, . . . , cfn, C) is initial if all cfi are initial configurations of Λi and C = C0 is the initial (empty) matrix, and it is final if all cfi are final for Λi.
Definition 4. The stream relation computed by a CSXMS W = (Λi)i≤n is the relation fW : Σ1* × · · · × Σn* ↔ Γ1* × · · · × Γn* with (s1, . . . , sn) fW (g1, . . . , gn) iff there exist initial and final configurations cf0 = (cf01, . . . , cf0n, C0) and cft = (cft1, . . . , cftn, Ct), respectively, such that si = π6(cf0i) and gi = π7(cfti) for all i ≤ n and cf0 ⊢* cft.
As shown in [9], the relation fW computed by any CSXMS (under the assumptions made) actually is a partial function.
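The serialised global changes of configuration can be illustrated by a simple interleaving scheduler (our own sketch, not the paper's; a "component" is abstracted to a function that either performs one enabled local move on the matrix and returns the updated matrix, or returns None when blocked):

```python
# Illustrative interleaving scheduler for global CSXMS steps.
# components: list of functions step_fn(C) -> updated matrix C' or None.
# Each global step is one enabled local step of some component i, with all
# other components unchanged, matching the serialised semantics above.

def run_to_quiescence(components, C, max_steps=1000):
    """Repeatedly perform global steps until no component can move."""
    for _ in range(max_steps):
        for step_fn in components:
            C2 = step_fn(C)
            if C2 is not None:
                C = C2
                break                  # one local move per global step
        else:
            return C                   # terminal: every component is blocked
    return C
```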
4 Design for Test Conditions
It is not hard to see that the design for test conditions, as they have been defined in [9], are strong enough not only to allow the original testing approach, but also to support the unit testing described above. Moreover, as has been shown in [18], it is possible to perform unit testing from the model under consideration (CSXMS) with the same accuracy as the SXM testing approach provides for stand-alone SXMs. Here we generalise the results of [18] to exploit the input and output ports as extra points of control and observation. In this section we discuss our design for test conditions
– Memory Closedness– Input Completeness– Output Distinguishability
considering that for driving our tests we have available as PCOs not only theinput and output streams but also the input and output ports and the commu-nication matrix. This gives rise to a new technique of obtaining relaxed designfor test conditions, generalising the notion of attainable memory state [9].
4.1 Memory Closedness
The first part of the design for test conditions is the notion of closedness ofthe type Φi, which identifies a memory invariant that is maintained by everyoperation for a suitably restricted set of input conditions.
Definition 5. Let Λi be a CSXMS component. Let Ni ⊆ Mi be a subset of internal memory states and let Si = 〈Sφi | φ ∈ Φpi〉 with Sφi ⊆ Σi and Ti = 〈Tφi | φ ∈ Φpi〉 with Tφi ⊆ INi be two families of stream input symbols and port input values indexed by the processing operations of Λi. Then Φpi is (memory) closed with respect to (Ni, Si, Ti) if

– m0i ∈ Ni
– for all φ ∈ Φpi, m ∈ Ni, σ ∈ Sφi, τ ∈ Tφi ∪ {λ}, γ ∈ Γi, y ∈ OUTi, it is the case that if φ(τ, m, y, σ) = (γ, λ, m′, y′) then m′ ∈ Ni.
Thus Φpi is closed with respect to (Ni, Si, Ti) if the initial memory state is in Ni and every processing operation φ ∈ Φpi keeps the machine in the region Ni provided the stream input is in Sφi and the value of the input port is empty or in Tφi. The set Ni is used to approximate the memory region in which the tester maintains control of all processing operations, at the cost that the inputs through which he drives ICUTi are constrained by (Si, Ti). Thus, from the point of view of testing we would strive to make Ni small and both Si and Ti large. Note that the constraint Si can be accounted for directly at every computation step by adjusting the stream input symbols, while the constraint Ti can only be satisfied indirectly via the communication matrix and the execution of receive operations. Observe that whenever the input port is empty we trigger through the stream input only. If it is not empty, the value must have been entered by some previous receive operation, and at that point the tester that feeds the associated communication cell can make sure this value is in Tφi. Using these existing PCOs generalises previous SXM techniques, which only go through the stream input, and also relaxes the constraints on the programmer to install extraneous input control for testing.
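When Ni, the input constraints and the port value sets are finite (or finitely approximated), the closedness condition of Def. 5 can be checked by brute-force enumeration. The following sketch and its names are ours, not the paper's:

```python
# Brute-force check of memory closedness (Def. 5) over finite approximations.
# phis: {name: partial function (tau, m, y, sigma) -> (gamma, lam, m2, y2)
#        or None where undefined}; lambda is modelled as None.
# N: candidate invariant region of memory states; S, T: {name: allowed inputs};
# outs: finite set of out-port values to test against.

def is_memory_closed(phis, N, S, T, m0, outs):
    """True iff m0 lies in N and no processing operation escapes N under
    the input constraints (S, T)."""
    if m0 not in N:
        return False
    for name, phi in phis.items():
        for m in N:
            for sigma in S[name]:
                for tau in set(T[name]) | {None}:   # empty port also allowed
                    for y in outs:
                        res = phi(tau, m, y, sigma)
                        if res is not None and res[2] not in N:
                            return False            # operation escapes N
    return True
```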
Def. 5 considers processing operations only. We also need to care about communication operations. These never change the internal memory, so they preserve the invariant Ni trivially. This, however, is not enough to guarantee that we can always trigger a communication operation, which also depends on the communication buffers. In fact we need to maintain the right “fill” status of the ports and the communication matrix. To this end we enlarge the invariant Ni to an invariant INVi ⊆ Qi × MEMi on the state of component Λi such that

INVi = (Qci × MEM^{sr}_i) ∪ (Qpi × MEM^{sr}_i) ∪ (Qpi × MEM^{sr̄}_i) ∪ (Qpi × MEM^{s̄r}_i),

where MEM^{sr}_i, MEM^{sr̄}_i, MEM^{s̄r}_i are the states of the memory in which the ports are set so that, respectively, all send and all receive operations (coded sr), all send but no receive operations (sr̄), and no send but all receive operations (s̄r) are enabled, provided the communication matrix is set appropriately. Formally,

MEM^{sr}_i = {λ} × Ni × (OUTi \ {λ})
MEM^{sr̄}_i = (INi \ {λ}) × Ni × (OUTi \ {λ})
MEM^{s̄r}_i = {λ} × Ni × {λ}.

Note that all three sets are disjoint. The states MEM^{s̄r̄}_i = (INi \ {λ}) × Ni × {λ} are not attainable in Λi, and the state changes that are possible are shown in the automaton invariant AINVi of Fig. 3, whose set of control states corresponds precisely to INVi. The automaton AINVi represents an abstraction of the control structure and memory of the CSXM machine model.
Proposition 1. Let Λi be a CSXMS component which is memory closed with respect to (Ni, Si, Ti). Then the initial state (q0i, λ, m0i, λ) is in INVi and every operation of Λi preserves INVi under the input constraint (Si, Ti), i.e., for all φ ∈ Φi and all states (q, x, m, y) ∈ INVi such that x ∈ Tφi ∪ {λ}, and inputs σ ∈ Sφi: if Fi(q, φ) = q′ and φ(x, m, y, C, σ) = (γ, x′, m′, y′, C′) then we have (q′, x′, m′, y′) ∈ INVi for the next state, too. More specifically, processing and communicating operations only change the state according to AINVi.

Fig. 3. The automaton invariant AINVi for (memory-closed) CSXMS components. (The automaton has one control state per reachable region of INVi, distinguished by which ports are empty; its transitions are labelled by Φpi, Φsi and Φri under the input constraints (Si, Ti).)
Proof. We only show that INVi is a state invariant; checking the state transitions of Fig. 3 is trivial. That (q0i, λ, m0i, λ) ∈ INVi follows immediately from m0i ∈ Ni, q0i ∈ Qpi and the definition of INVi. Next observe that due to memory closedness relative to (Ni, Si, Ti) we know that under the input assumptions σ ∈ Sφi and x ∈ Tφi ∪ {λ} the next memory value m′ is in Ni, for all φ. Regarding the other components of the total state we must distinguish the two types of operations, Φpi and Φci, and use the structural constraints on Λi made in Def. 3 as follows: By assumption, every processing operation φ ∈ Φpi clears the input port, x′ = λ, and sets the output port to a defined value, y′ ≠ λ, so the next state (q′, x′, m′, y′) obviously preserves INVi regardless of whether q′ ∈ Qci or q′ ∈ Qpi. Now, let φ ∈ Φci be a communicating operation. Since (q, φ) ∈ dom(Fi), our structural assumptions imply that q ∈ Qci is a communicating state and the next state q′ must be a processing control state, i.e., q′ ∈ Qpi. The former implies x = λ and y ≠ λ by the invariant assumption. Thus, we must necessarily have x′ = y′ = λ when φ ∈ Φsi, or x′ ≠ λ and y′ ≠ λ for φ ∈ Φri. Under no circumstances can we have x′ ≠ λ and y′ = λ. In conjunction with m′ ∈ Ni this suffices to conclude (q′, x′, m′, y′) ∈ INVi as desired. ⊓⊔
At this point it is worth recalling that by the testing hypothesis each ICUTi is modelled by a CSXM Λim. Since every (memory-closed) CSXM is a refinement of AINVi, both the specification Λi and Λim operate according to AINVi. The implication of this is that AINVi gives an upper approximation of the possible test sequences and associated memory values. This is a generalisation of the notion of attainable memory [9], including also the constraints imposed on the structure of the CSXMS W in [9, 13] (i.e. testing variant, simpleness).
The tester can use the information contained in AINVi to optimise tests, in the sense that any test sequence that is not in the language L(AINVi) does not need to be exercised, since it will be neither in L(Λi) nor in L(Λim). To make this more precise, assume that the tester has driven ICUTi through a test sequence ω = φ1φ2···φk which in the specification Λi resulted in a state (q, x, m, y). The tester does not know the control state of ICUTi, but he knows that the memory state must be (x, m, y)². Now suppose the tester needs to verify a send operation φk+1 = sendi→j in ICUTi's current control state. If in the current memory the out-port y is empty, y = λ, the send is certainly not possible in ICUTi. On the other hand, according to AINVi, q is a processing state and the send is not possible in the specification Λi, either. Hence, only for a non-empty out-port does the tester have to enable any sendi→j in ICUTi by choosing a suitable stream input symbol and emptying the communication cell. The same applies to a receive operation φk+1 = receivej→i. Where this receive is blocked because of a full in-port x ≠ λ, the corresponding state q reached by Λi must be a processing state. This means that the receive operation is not possible in the specification either. Thus, the tester needs to switch free receivej→i only where x = λ. But this is easy, for we only need to find a triggering stream input and fill the communication cell.
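The tester's case analysis can be summarised in a small decision sketch (our own illustration, not from the paper; the action strings are informal placeholders):

```python
# Sketch of the tester's decision logic for the next operation in a test
# sequence; the empty-port symbol lambda is modelled as None.

def setup_for(op_kind, x, y):
    """op_kind in {'send', 'receive', 'process'}; (x, y) is the in-/out-port
    state known from running the specification. Returns the tester's setup
    actions, or 'skip' when A_INV shows the operation is impossible in
    specification and implementation alike."""
    if op_kind == "send":
        if y is None:                      # empty out-port: q is a processing
            return "skip"                  # state; send impossible on both sides
        return ["choose stream input", "empty the communication cell C[i, j]"]
    if op_kind == "receive":
        if x is not None:                  # full in-port: likewise impossible
            return "skip"
        return ["choose stream input", "fill the communication cell C[j, i]"]
    # processing operation
    if x is None:
        return ["choose stream input"]     # trigger through the stream alone
    return ["choose stream input", "supply in-port value via preceding receive"]
```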
The situation is different for a processing operation φk+1 ∈ Φpi. Suppose the ICUTi has reached a memory state with x = λ. The invariant automaton AINVi guarantees that such an operation can only ever be enabled (in both implementation and specification) in the initial state, or if the operation φk immediately preceding³ it is a processing operation or a send operation. But then x = λ in all computations that take the ICUTi through the sequence φ1φ2···φk. Thus, any triggering information we get from ICUTi with this particular value x = λ in our particular test situation is fully representative, positively and negatively, of all computations of the same abstract operator sequence. Hence, whenever we find x = λ during the test, we may safely try and trigger a given operation through the stream input alone. On the other hand, if x ≠ λ then from AINVi we can infer that the preceding φk must be a receive operation, in which case we can use this receive to supply φk+1 with a suitable value on the in-port for triggering it.
Now, if the underlying test generation method that operates on the associated control automaton of an SXM (e.g. the W-method) is restricted to the paths of AINVi, then it is possible to guarantee full fault coverage in the fault model of interest (this applies to both positive and negative triggering of functions).
² This is because by the testing hypothesis all functions operating on memory are implemented correctly.
³ The Attainφ sets [9] only capture information about the future of the test sequence as derived from the memory state, viz. that φ can be a successor in the path, while here we need information about the past.
This can easily be done, for example, by filtering out the sequences in χ that are not in the language of AINVi. The invariant automaton, which is part of the testing hypothesis here, can itself be subjected to testing. In this way, testing is factorised into two parts: a test in which ICUTi is verified to comply with the abstraction AINVi, and another in which we verify that ICUTi implements the specification Λi using AINVi as a filter. This suggests a more general methodology of X-machine testing through abstractions, which we leave to future work.
4.2 Input Completeness
Now we can introduce the concept of input completeness, which is to ensure that any operation sequence ω ∈ Φi* (from the test set χ) can be translated into a (non-interactive) sequence of stream and port inputs t(ω) that triggers the operations in ω in the specified order, starting in the initial memory state m0i. To do this we exploit AINVi, which essentially amounts to a generalisation of the notion of φ-attainable memory states introduced in [9]. The set Mattainφ is the set of all memory states in which the specification expects the operation φ to be enabled under some sequence of inputs from the start state. Our invariant yields an upper bound on this set as follows:

Mattainφ ⊆ {(x, m, y) | ∃q. (q, x, m, y, φ) ∈ dom(FINVi)},

where FINVi is the partial state transition function of AINVi. Note that the automaton invariant AINVi is more general than the Mattainφ sets [9], since from the former we also get information about the past, while the latter only capture information about the immediate successor in a test sequence.
The following definition implements the relaxed design for test condition [9]that comes with AINV .
Definition 6. Let Λi be a CSXMS component such that Φpi is memory closed with respect to (Ni, Si, Ti). Then, the type Φi is (input) complete with respect to (Ni, Si, Ti) if the following condition holds: for all φ ∈ Φi and (x, m, y) ∈ Mattainφ:

– if x = λ then there exist σ ∈ Sφi and C ∈ CM such that (λ, m, y, C, σ) ∈ dom(φ);
– if x ≠ λ then there exist σ ∈ Sφi, τ ∈ Tφi and C ∈ CM such that (τ, m, y, C, σ) ∈ dom(φ).
Def. 6 essentially says that as long as we stay within INVi, all operations in Φi can be triggered by suitable stream and port inputs. Note that if the in-port is non-empty (x ≠ λ) we trigger both by the stream input σ ∈ Sφi and the in-port value τ ∈ Tφi. This is possible since by our invariant there must be a receive operation immediately before, through which this value can be supplied. It is worthwhile to analyse some of the formal consequences of input completeness.
First, depending on what operation φ we are looking at, the condition
(τ, m, y, C, σ) ∈ dom(φ) (1)
on the existence of σ, τ, C in Def. 6 has a different operational meaning. For instance, no processing operation depends on the communication matrix, so that any C will do. If φ is a sendi→j operation then the cells C[i′, j] for i′ ≠ i are irrelevant, and for i′ = i condition (1) uniquely determines these cells C[i, j] to be empty. A similar argument applies to receive operations, for which (1) merely says that the input cells C[j, i] must be non-empty. Hence, the existence statement in Def. 6 concerning the communication matrix C is trivial. It does not impose any constraint on the CSXM component, but only on the test environment to fill the input cells C[j, i] (with any value) and clear the output cells C[i, j] at the right times.
Proposition 2. Let Λi be a CSXMS component such that Φpi is memory closed with respect to (Ni, Si, Ti). Then, the type Φi is input complete with respect to (Ni, Si, Ti) if there exist choice functions as follows:

– For every φ ∈ Φpi there is a function choiceφi : Ni × OUTi → Sφi × Sφi × Tφi such that
(λ, m, y, π1 choiceφi(m, y)) ∈ dom(φ) and
(π3 choiceφi(m, y), m, y, π2 choiceφi(m, y)) ∈ dom(φ)
– For all φ ∈ Φci there is a function choiceφi : Ni → Σi such that
• if φ = sendi→j and C[i, j] = λ ≠ y then (y, C, choiceφi(m)) ∈ dom(φ)
• if φ = receivej→i and C[j, i] ≠ λ then (λ, C, choiceφi(m)) ∈ dom(φ),
where m ∈ Ni, y ∈ OUTi, C ∈ CM.
Proof. Consider the case φ ∈ Φpi. We argue that input completeness implies the existence of a choice function as specified in the statement of Prop. 2. Specialising Def. 6 to such φ ∈ Φpi gives

∀(q, x, m, y, φ) ∈ dom(FINVi).
x ≠ λ ⇒ ∃σ ∈ Sφi, τ ∈ Tφi, C ∈ CM. (τ, m, y, C, σ) ∈ dom(φ) and (2)
x = λ ⇒ ∃σ ∈ Sφi, C ∈ CM. (λ, m, y, C, σ) ∈ dom(φ). (3)

What does this imply? Since no processing operation depends on the communication matrix, we know that if we can find a tuple (τ, m, y, C, σ) ∈ dom(φ) then in fact any C will do. Thus, the existence of C in both cases presents no restrictions. Secondly, whether or not a particular (τ, m, y, C, σ) or (λ, m, y, C, σ) is in dom(φ) does not depend on the choice of q, x. Hence, if we have constructed one solution for some given (q, x, m, y) we can use the same σ, τ, C for all (q′, x′, m, y) with the same m and y. Further, observe that for every processing operation φ ∈ Φpi and (m, y) ∈ Ni × OUTi there are (q, x) such that (q, x, m, y, φ) ∈ dom(FINVi). Thus, the implication of (2) is that for all φ ∈ Φpi there exists a function choiceφi : Ni × OUTi → Sφi × Tφi such that (removing C ∈ CM from the condition, which is trivial) (π2 choiceφi(m, y), m, y, π1 choiceφi(m, y)) ∈ dom(φ), for all (m, y) ∈ Ni × OUTi. Similarly, the import of (3) is a choice function choiceφi : Ni × OUTi → Sφi such that (λ, m, y, choiceφi(m, y)) ∈ dom(φ) for all (m, y) ∈ Ni × OUTi. In the statement of Prop. 2 we have packed both functions into one in the obvious way.
Further note that whenever we consider a communicating operation φ ∈ Φci to be triggered in a control state q, then because of the automaton invariant AINVi we may conclude that q ∈ Qci, x = λ and y ≠ λ. In other words, the ports are ready and set so that the triggering of φ can only depend on the stream input symbol σ and the communication cells being prepared accordingly. From this the existence of the choice functions as required follows easily.

We omit the other direction of the proof, that the existence of choice functions implies input completeness. ⊓⊔
4.3 Output Distinguishability
At testing time we must be able to verify, from the output produced, that ICUT_i executes the right functions as specified by the test sequence. That these functions implement the correct transformations on the data path is tested independently and assumed by way of the testing hypothesis. The final element of the design for test conditions is therefore output distinguishability, which guarantees that all functions in Φ_i can be identified uniquely from their output.
Definition 7. Let Λ_i be a CSXMS component such that Φ^p_i is memory closed with respect to (N_i, S_i, T_i). The type Φ_i is output-distinguishable with respect to (N_i, S_i, T_i) if for all φ1, φ2 ∈ Φ_i, m, m1, m2 ∈ N_i, y ∈ OUT_i, τ ∈ IN_i, γ ∈ Γ_i, if it is the case that

– (σ, τ) ∈ S^φ1_i × (T^φ1_i ∪ {λ}) and (τ, m, y, σ) ∈ dom(φ2), or
– (σ, τ) ∈ S^φ2_i × (T^φ2_i ∪ {λ}) and (τ, m, y, σ) ∈ dom(φ1)

and both φ1(τ, m, y, σ) = (γ, τ, m1, y) and φ2(τ, m, y, σ) = (γ, τ, m2, y), then φ1 = φ2 and m1 = m2.
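Over finite enumerations, the condition of Def. 7 can be checked mechanically: collect the inputs shared by two operations and demand that identical observable outputs force the operations, and their memory updates, to coincide. A hypothetical Python sketch (the dict encoding of dom(φ) and all the data are ours, not the paper's):

```python
# Illustrative check of output distinguishability over finite enumerations.
# Each operation is a dict from inputs (tau, m, y, sigma) to outputs
# (gamma, x', m', y'); its key set plays the role of dom(phi).

def output_distinguishable(ops):
    """True iff no two distinct operations agree on some shared input
    and on the observable part (gamma, x', y') of the output."""
    for name1, phi1 in ops.items():
        for name2, phi2 in ops.items():
            for key in set(phi1) & set(phi2):
                g1, x1, m1, y1 = phi1[key]
                g2, x2, m2, y2 = phi2[key]
                if (g1, x1, y1) == (g2, x2, y2):
                    # Same observable output forces phi1 = phi2 and m1 = m2.
                    if name1 != name2 or m1 != m2:
                        return False
    return True

# phi_a and phi_b produce the same output but different memory updates,
# so they cannot be told apart at testing time; phi_c differs in gamma.
phi_a = {("t", "m", "y", "s"): ("g", "t", "m1", "y")}
phi_b = {("t", "m", "y", "s"): ("g", "t", "m2", "y")}
phi_c = {("t", "m", "y", "s"): ("h", "t", "m2", "y")}
```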
In nondeterministic systems, output distinguishability is also needed to control the test sequence, since the actual memory state can only be determined at testing time. In the deterministic case we only need it to determine success or failure at testing time.
4.4 Testing Hypotheses
To conclude this section, let us establish the fault domain of the fault model under consideration by means of the following testing hypotheses.
– The IUT (derived from CSXMS specification W) is modelled by a CSXMS W_m such that W_m ≅_id W.
– Every Λ_im from W_m and Λ_i from W are refinements of the automaton invariant AINV_i of Fig. 3.
– Every Λ_im from W_m and Λ_i from W have the same set Φ_i, which is memory closed, input-complete and output-distinguishable.
– Let Q_i and Q_im be respectively the sets of control-states in Λ_i and Λ_im. There is a known integer k such that card(Q_im) − card(Q_i) ≤ k.
– Λ_im and Λ_i are deterministic and their corresponding associated automata are minimal.
– Λ_i is reachable.
5 Fundamental Test Function
Now let us see how the design for test conditions ensures the translation from ω = φ1 φ2 · · · φk ∈ χ into a concrete sequence of external inputs to drive a purported implementation Λ_im of Λ_is along the transitions φ_i in the total state space. It is assumed by way of the testing hypotheses that both machines have exactly the same type Φ_i of operations over the same total data path. Our goal is to test whether ω ∈ χ is in the language of the associated automaton L(Λ_im). As usual, we assume that A(Λ_is) is a minimal (deterministic) control automaton.
In our test architecture the test object Λ_im is embedded in a system of testers tester_j (j ≠ i) connected to the communication matrix C. They are designed to be ready to read out any value that happens to be put into C[i, j] by Λ_im and also to keep cell C[j, i] filled with a defined value of our choice at all times. Let us denote this CSXMS by W^T_i. In this way we may assume, without loss of generality, that the communication matrix is always in the form required for enabling send and receive operations. The input alphabet of tester_j is Σ^tester_j = A ∪ B ∪ {−}, where A = {(?, z) | z ∈ OUT_i} and B = {(!, w) | w ∈ IN_i}. Intuitively, the symbol (?, z) should be interpreted by tester_j as the instruction "receive z from ICUT_i" and the symbol (!, w) should be interpreted as "send w to ICUT_i". The symbol − is used by the tester to complete the operations. The output alphabet of tester_j is Γ^tester_j = A ∪ B ∪ {−, 1, 2}.
Tester
Consider tester_j in Fig. 4. For every 1 ≤ i ≤ n, there are three sequences that can take place in the tester. The first, is! · send_{j→i}, occurs when a symbol of the form (!, w) is read from the input stream. The operation is! places the same symbol (!, w) in the output stream and updates the output port with y = w; afterwards send_{j→i} is performed, which effectively sends this w to the ICUT_i, and the machine returns to the initial control-state. The other two possible sequences are is? · receive_{i→j} · right and is? · receive_{i→j} · wrong. Both of them start when a symbol of the form (?, z) is read from the input stream. The operation is? writes the same symbol (?, z) in the output stream, stores the value z in the memory (i.e. m = z) and empties the input port. Then receive_{i→j} takes place, receiving a value, say z′, from the ICUT_i (i.e. x = z′). Finally, there are two possibilities: either the processing operation right or wrong is executed when the symbol − is taken from the input stream. The former occurs when the value received from ICUT_i is the same as the value stored in memory (i.e. m = x)
is!(x, m, y, (!, w)) = ((!, w), λ, m, w)
is?(x, m, y, (?, z)) = ((?, z), λ, z, y)
right(x, m, y, −) = (−, λ, m, y), if x = m
wrong(x, m, y, −) = ('err', λ, m, y), if x ≠ m
send_{j→i}(λ, m, y, C, −) = (1, λ, m, λ, C[j, i] ← y)
receive_{i→j}(λ, m, y, C, −) = (2, C[i, j], m, y, C[i, j] ← λ)

Fig. 4. The canonical tester in position j (tester_j)
and the machine returns to the original initial control-state. The latter occurs when m ≠ x, and in this case the machine goes to a terminal state which is not the initial one.
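The tester's cycle can be mimicked in a few lines of Python. This is only an illustration of the behaviour just described, not the formal machine of Fig. 4: the communication cells are plain lists, and a pending-operation flag stands in for the control states.

```python
# Illustrative sketch of tester_j's behaviour: (!, w) latches w for sending,
# (?, z) records the expected value, and each '-' completes the pending step
# (send, receive, or the right/wrong comparison).

def run_tester(inputs, channel_out, channel_in):
    """channel_out models cell C[j, i] written by the tester;
    channel_in models values the ICUT places in C[i, j]."""
    memory = port = pending = None
    verdicts = []
    for sym in inputs:
        if sym == "-":                       # '-' completes the pending step
            if pending == "send":
                channel_out.append(port)     # send_{j->i}: write C[j, i]
                pending = None
            elif pending == "recv":
                port = channel_in.pop(0)     # receive_{i->j}: read C[i, j]
                pending = "check"
            elif pending == "check":         # right / wrong comparison
                verdicts.append("right" if port == memory else "err")
                pending = None
        elif sym[0] == "!":                  # is!: latch w for sending
            port, pending = sym[1], "send"
        elif sym[0] == "?":                  # is?: remember expected z
            memory, pending = sym[1], "recv"
    return verdicts

out = []
verdicts = run_tester([("!", 5), "-", ("?", 7), "-", "-"], out, [7])
```

A mismatch between the received value and the remembered one yields the 'err' verdict, mirroring the terminal wrong state.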
For a given component Λ_i, let Σ^p_i = Σ_i × IN_i, Σ^r_i = Σ_i × {!j | 1 ≤ j ≤ n} and Σ^s_i = Σ_i × {?j | 1 ≤ j ≤ n} × OUT_i.
Definition 8. A function t_{x,m,y} : Φ*_i → (Σ^p_i ∪ Σ^r_i ∪ Σ^s_i)* can be defined by recursion as follows:

t_{x,m,y}(λ) = λ, where λ is the empty sequence.

For φ ∈ Φ^p_i and φ* ∈ Φ*_i, if choice^φ_i(m, y) = (σ1, σ2, x1) then
t_{λ,m,y}(φ · φ*) = (σ1, x1) · t_{x′,m′,y′}(φ*), where φ(λ, m, y, σ1) = (γ, x′, m′, y′), and
t_{x,m,y}(φ · φ*) = (σ2, x1) · t_{x′,m′,y′}(φ*), where φ(x1, m, y, σ2) = (γ, x′, m′, y′).

For φ = send_{i→j} ∈ Φ^s_i and φ* ∈ Φ*_i,
t_{x,m,y}(φ · φ*) = (choice^φ_i(m), ?j, y) · t_{x,m,λ}(φ*).

For φ = receive_{j→i} ∈ Φ^r_i and φ* ∈ Φ*_i,
t_{x,m,y}(φ · φ*) = (choice^φ_i(m), !j) · t_{x,m,y}(φ*).

Henceforth, t_{m,y} with m = m0_i and y = λ is called the fundamental test function of Λ_i and is simply denoted by t.
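The processing-operation case of Def. 8 is an ordinary structural recursion and can be sketched as follows. Everything here is invented for illustration — the string "lam" encoding the empty symbol λ, the toy operation and its choice entries — and the send/receive cases are omitted.

```python
# Illustrative sketch of t for processing operations only (cf. Def. 8).
# "lam" stands in for the empty symbol; apply_op evaluates phi.

def t(seq, x, m, y, choice, apply_op):
    """Translate an abstract operation sequence into concrete input symbols."""
    if not seq:
        return []
    phi, rest = seq[0], seq[1:]
    sigma1, sigma2, x1 = choice[phi][(m, y)]
    if x == "lam":                        # empty input port: use sigma1
        gamma, x2, m2, y2 = apply_op(phi, "lam", m, y, sigma1)
        sym = (sigma1, x1)
    else:                                 # port already holds x1: use sigma2
        gamma, x2, m2, y2 = apply_op(phi, x1, m, y, sigma2)
        sym = (sigma2, x1)
    return [sym] + t(rest, x2, m2, y2, choice, apply_op)

# Toy operation "inc": increments the memory, leaves the port empty.
def apply_op(phi, x, m, y, sigma):
    return ("g", "lam", m + 1, y)

choice = {"inc": {(0, "y"): ("s1", "s2", "x1"),
                  (1, "y"): ("s1", "s2", "x1")}}
seq = t(["inc", "inc"], "lam", 0, "y", choice, apply_op)
```

Note how the recursion threads the updated (x, m, y) through, exactly as the subscripts of t do in the definition.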
A remote test case (RTC) is a tuple (s1, . . . , sn), where s_i ∈ Σ*_i is a local test sequence (LOC) for the ICUT_i and, for all 1 ≤ j ≤ n with j ≠ i, the sequence s_j ∈ Σ^tester*_j is a LOC for tester_j.
Definition 9. A function p : (Σ^p_i ∪ Σ^r_i ∪ Σ^s_i)* → Σ*_i for computing the LOC for ICUT_i can be defined as follows:

p(λ) = λ.
p((σ, x) · ψ*) = σ · p(ψ*).
p((σ, ?j, y) · ψ*) = σ · p(ψ*).
p((σ, !j) · ψ*) = σ · p(ψ*).
Definition 10. A function g_j : (Σ^p_i ∪ Σ^r_i ∪ Σ^s_i)* → Σ^tester*_j for computing the LOC for tester_j can be defined as follows:

g_j(λ) = λ.
g_j((σ, x) · ψ*) = g_j(ψ*).
g_j((σ, ?k, y) · ψ*) = (?, y) · − · − · g_j(ψ*) if k = j; g_j(ψ*) otherwise.
g_j((σ, !k) · ψ*) = (!, x) · − · g_j(ψ′*) if (k = j) ∧ (ψ* = (σ, x) · ψ′*); (!, τ) · − · g_j(ψ*) else if k = j; g_j(ψ*) otherwise,

where τ ≠ λ is an arbitrary value in IN_i.
In this form, for every abstract test sequence φ* ∈ χ, the sequence s_i = p(t(φ*)) is the LOC for the ICUT_i and, for every tester_j, the sequence s_j = g_j(t(φ*)) corresponds to its input LOC. Finally, s = (s1, . . . , sn) is the RTC for the testing system.
Let us denote by W_m ≃′ W the verdict provided by the original method, and let Λ_im ≃ Λ_i be the verdict given by our unit testing approach. It can be shown that (∀ 1 ≤ i ≤ n. Λ_im ≃ Λ_i) ⇒ W_m ≃′ W.
6 Analysis
The approach just described has two main advantages with respect to the original approach. The first is related to complexity, and the second has to do with the possibility of employing different reliability parameters in different parts of the system. Let us discuss these aspects in more detail.
6.1 Complexity
The complexity of a testing process can be split into two parts, namely the complexity of generating the test suite and the complexity of executing this test suite. Let us denote the former by Og and the latter by Oe. In order to compare the approach suggested in this paper and the original approach on their own terms, let us abstract away the complexity of the underlying method employed over the associated automaton (W-method or Wp-method). In this form, we can say that for the underlying method Og = f(card(Q), k) and Oe = f′(card(Φ), k), where f and f′ are arbitrary functions and k is the reliability parameter used to estimate the difference in control-states between the implementation and the specification. In the original approach, from the form in which the SXM is obtained [9], it is not hard to see that card(Q) = O(card(Q′)^n), where Q′ corresponds to the set of control-states of the component with the greatest number of control-states and n is the number of system components. Similarly, card(Φ) = O(card(Φ′)^n), where Φ′ is the type of the component with highest card(Φ′). The complexities of the original approach are thus Og = f(card(Q′)^n, k) and Oe = f′(card(Φ′)^n, k). For the approach discussed here, the SXM testing method is applied n times (once for each component) independently. Thus, Og = n · f(card(Q′), k′) and Oe = n · f′(card(Φ′), k′), where Q′ and Φ′ are as before and k′ is the largest k used for testing a particular component. If on top of this we consider the testing of the architectural isomorphism, it is easy to see that the complexity of the distributed algorithm is polynomial in terms of communication C (i.e. number of messages) and time T (i.e. communicating operations executed). To be more precise, C = T = O(n²), where n is the number of system components.
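The gap between the two Og expressions is essentially card(Q′)^n versus n · card(Q′). A toy Python calculation makes the orders of growth concrete (the numbers are purely illustrative, and f is taken as the identity on the state count):

```python
# Toy comparison of the state counts that drive test-generation cost:
# product machine ~ card(Q')^n states, componentwise ~ n machines of card(Q').

def product_states(q, n):
    return q ** n          # O(card(Q')^n) control-states in the product SXM

def componentwise_states(q, n):
    return n * q           # n independent applications to card(Q')-state machines

# Example: components with 10 control-states each.
sizes = {n: (product_states(10, n), componentwise_states(10, n))
         for n in (2, 4, 8)}
```

Already at n = 4 components of 10 states, the product machine has 10,000 control-states against 40 handled componentwise.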
Thus, the approach of this paper has much better complexity than the product machine approach. Moreover, observe that the pre-processing cost involved in constructing the product machine has not been taken into account in this analysis. In other words, even assuming that the product SXM can be obtained with a reasonable amount of resources (i.e. in polynomial time), the approach of this paper is superior. Note that this holds independently of the underlying method (e.g. the W-method), which in turn implies that any improvement in the underlying method will not close the gap for the product approach. Finally, because in unit testing every component is tested independently, a high degree of parallelism is possible in the process. If the testing of all the components is executed simultaneously, this results in Og = f(card(Q′), k′) and Oe = f′(card(Φ′), k′). To put it simply, in the same order of time that it takes to test a SXM with card(Q′) control-states and card(Φ′) functions, a complete CSXMS with n components of card(Q′) control-states and card(Φ′) functions each can be tested.
6.2 The value of parameter k
Different values of the parameter k (i.e. the estimated difference in control-states) do not imply any difference in the detection ability of the approaches. In fact, for any method compared with itself, the use of distinct
values of k may produce a different verdict. This does not imply that, in order to obtain this k, it is necessary to compute the global SXM in the components approach. Although from this SXM the precise number of the system's states in the specification can be obtained, the number of states in the implementation is still an estimate, and hence k is an estimated value. Moreover, if the approach suggested here is employed, then the only reason for constructing the SXM would be to obtain an accurate value of card(Q) (the number of control-states in the whole system). However, the construction of this machine demands a considerable amount of resources and, therefore, a more sensible approach would rather be to spend such resources on testing with greater values of k.
In practice, the value of k represents a coefficient of reliability between the specification and the IUT. If the value of k is increased, the test set will include longer searching sequences. Thus, more faults are found if they are there; otherwise more resources (time) are used to obtain the same verdict. The designers, programmers, testers, etc. must decide how confident they are about the implementation, or how critical it is, and from that set the value of k.
There is, however, an issue that deserves special consideration, arising from the fact that in the components approach the components are tested separately. Since the number of states in every specified component v is known, different values k_v can be employed for testing different parts of the system (different implemented components). In this form, instead of having a flat reliability parameter, the testing can be adjusted in such a way that the effort is emphasised or relaxed depending on the level of confidence required in each part of the whole system.
7 Related Work
Research has been abundant on test generation based on FSM (e.g. [2, 19]). A less common model employed in testing is the input/output transition system (IOTS), which usually models transition systems with concurrent input/output behaviour. In other words, pairs of inputs/outputs are not considered to be atomic actions (i.e. the next input can arrive before the previous output is produced). This concurrent input/output behaviour is similar to that of the CSXMS components with respect to communicating operations. A framework for testing IOTS through queues is introduced in [20]. In this framework, the tester and the system under test are two input-enabled message-passing communicating systems. Thus, both inputs and outputs are buffered using queues between them, which are assumed to be bounded. In this form the system does not block inputs from the tester, and this does not prevent the system from producing outputs. In the CSXMS model this is not necessarily the case, since the communication matrix (which can be considered a set of queues of size one) together with the communication mechanisms defined may prevent a component from executing a communicating operation because the complementary operation has not been executed by the corresponding component. In the
IOTS testing, the tester is split into two processes, an input-test process and an output-test process. The former applies inputs to the system while the latter receives outputs from the system until no more outputs are detected. In a queued-quiescence tester there is just one input-test process and one output-test process; a queued-suspension tester consists of several such input-test and output-test processes. The implementation relations that can be tested are based on traces (e.g. queued-quiescence trace-equivalence, queued-suspension trace-equivalence), and the queued-suspension tester has more fault detection ability than the queued-quiescence tester. Although this approach is architecturally similar to the one suggested in this paper, in the sense of having a number of testers for testing a component, they differ basically in their aim. Testing IOTS as in [20] checks for quiescence in the system, while testing components of CSXMS verifies functional equivalence. Moreover, in the queued-suspension technique the testers are employed to detect intermediate quiescent states in the system, so an input sequence is decomposed across several testers in order to achieve this. On the other hand, in the approach presented in this paper, an input testing sequence derived from a component is decomposed in such a way that the testers emulate the expected behaviour of the other system components with respect to the ICUT.
In practice, when a distributed system is specified, some form of communication is required amongst the several units or components of the system. Moreover, non-trivial distributed specifications (including protocols) usually require variables and operations based on variable values. Hence, the specification is normally divided into its control and data parts. This has led to the development of extensions of the finite state machine (FSM) model powerful enough to represent these systems concisely. For instance, for the communication aspects the communicating finite state machine (CFSM) can be employed, while for the data part extended finite state machines (EFSM) are applicable. Furthermore, the communicating extended finite state machine (CEFSM), as in the CSXMS case, includes both aspects, communication and data.
In the EFSM model, approaches like the W-method can be used to analyse the control flow of the specification, but this leaves the data part (of the distributed system) untested. The data part can be tested using a data flow approach. Data flow testing attempts to check the effects of data objects in software systems. In general, data flow testing is based on a data flow digraph with the nodes representing the functional units of the system and the edges representing the flow of data objects. A number of methods have been proposed in the literature for testing EFSM using control flow and/or data flow techniques (e.g. [21–24]). However, these are applicable only to stand-alone EFSM. There is another issue that requires attention: given an EFSM, if each variable has a finite domain of values, then there is a finite number of configurations (combinations of state and variable value), and an equivalent FSM with configurations as states can be obtained. In this form, testing EFSM reduces to testing FSM.
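The configuration construction just mentioned is easy to sketch: when every variable ranges over a finite domain, pair each control state with each variable value and lift the transitions. A hypothetical Python example with a single counter variable (the toy EFSM is invented for the illustration):

```python
# Illustrative expansion of an EFSM with one finite-domain variable into an
# FSM whose states are configurations (control-state, variable value).
from itertools import product

def expand(states, var_domain, trans):
    """trans maps (state, value, input) -> (state', value'); the result maps
    (configuration, input) -> configuration in the equivalent FSM."""
    fsm = {}
    for (s, v) in product(states, var_domain):
        for (s0, v0, a), (s1, v1) in trans.items():
            if (s0, v0) == (s, v):
                fsm[((s, v), a)] = (s1, v1)
    return fsm

# Toy EFSM: one control state, a saturating counter over {0, 1, 2}.
trans = {("q", v, "inc"): ("q", min(v + 1, 2)) for v in (0, 1, 2)}
fsm = expand(["q"], [0, 1, 2], trans)
```

The resulting FSM has card(states) × card(var_domain) configuration states, which is exactly why this reduction, though correct, blows up quickly.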
On the other hand, in a CFSM each component (unit) is an FSM that is able to communicate. The simplest approach for testing a system of CFSM consists of composing all the units into one machine, an FSM that models the complete system behaviour, and then generating the test cases from it (using known methods). The trouble is that this approach leads to an explosion in the number of states. To cope with this, methods for reducing the reachability analysis have been suggested for CFSM (e.g. [25, 26]). The idea is that of constructing a smaller graph representation of partial behaviour of the system that allows us to study communication properties. Alternatively, there are also heuristic procedures for test generation for CFSM such as random walk, where the selection of the next input is done randomly and on the fly. However, this may trap the test in a portion of the system behaviour [27]; to cope with this, the guided random walk technique has been suggested in [28], where instead of choosing the next input randomly, transitions with higher priority are favoured. Conversely, the notion of semi-independence for CFSM has been suggested in [29] as a means of avoiding concurrent reachability analysis altogether, by identifying and testing separately the communicating transitions (CTs) that can affect the behaviour of other machines. In this manner, it is possible (under certain conditions, namely semi-independence) to test a non-communicating transition (NCT) t by placing the machine in the source state of t (using only NCTs) and then executing t followed by a unique input/output sequence (UIO) that allows us to identify the target state of t. For testing a CT c, it is necessary to test the final state of all the machines involved. But it is possible to set up the other machines for the execution of c (using only NCTs). Then, if there is no feedback (a feedback transition being one that leads to a sequence of transitions including at least one more transition in the same machine), the output can be verified and the final state of all the machines affected by c can be checked using only UIOs. The test effort is thus reduced by finding a minimal-cost set of sequences that covers all CTs. If there are few feedback transitions, the approach is that of finding the shortest sequence of transitions that tests each of them. Otherwise, a large number of feedback transitions suggests a high degree of dependency between the components, and it is necessary to apply other approaches (e.g. [30]). Since in a CSXMS a communicating operation always moves the control into a processing state, there are no transitions with feedback, which thus avoids this problem.
For the CEFSM, only a small amount of work related to testing has been reported. Most existing methods deal only with CFSM, where the data part is not considered. As in the case of CFSM or EFSM, constructing a product machine from the system of CEFSM runs into the state explosion problem. There is, however, an alternative, namely testing CEFSM in context [31]. The framework for testing CFSM in context is due to Petrenko et al. [32, 30]. The intuition is as follows. When testing a communicating system, it could be the case that some of its components have already been tested or are not critical to the system. Hence, the distributed implementation under test can be viewed as composed of two parts. The component that requires testing is known as the embedded component, and the other components are assumed to be fault-free. Testing a
component embedded into a communicating system is known as embedded testing or testing in context. The goal is to test whether an implementation of a component conforms to its specification in the context of the other components, and it is assumed that the tester does not have direct access to the component under test (ICUT). Several approaches to testing in context have been presented in the literature; they are based on fault models [32], on reducing the problem to testing of components in isolation [30], on test suite minimisation [28, 31, 33], on fault coverage [34] and on testing the system with uncontrollable (or semi-controllable) interfaces [35]. Intuitively, the idea of these approaches is that a test case can be removed from a test suite if the test case concerns only the context or if there is another test case that detects the same fault in the embedded component. However, most of these approaches resort to reachability graphs to model the joint behaviour of all the system components, and again this exposes them to the state explosion problem. In [36], a solution called Hit-or-Jump has been suggested. This technique is a unification of exhaustive search and random walks.
The technique for testing CEFSM in context is similar to the one described above [31]. A partial product machine is obtained and then the EFSM Test Generator algorithm [21] (an algorithm that generates test cases for EFSM covering both control and data flow) is employed. In this guided procedure, each CEFSM is tested in isolation first, and then the global system is tested as follows: (1) a partial product CEFSM is obtained and (2) the test cases for it are generated. The test cases obtained when each CEFSM was tested in isolation are used as a guide to compute the partial product and to generate the test cases for it.
The approach suggested in this paper for CSXMS does not correspond to the notion of testing in context discussed above. This is because our objective has not been to test conformance of a component in the context of the other system components, but instead to test conformance of a component in all possible contexts. In this sense our unit testing is more conservative than both the product approach and testing in context: it provides strong guarantees of correctness of components. Observe that the assumption that the ICUT can be manipulated does not imply that the communication interfaces (i.e. the ports) can be accessed directly from outside the system (the environment), but only the environment interface (i.e. the streams) of the components. Thus, in order to access these communication interfaces we use the other components as testers.
8 Conclusions
The essence of the conformance testing paradigm is given by two factors: (i) the testing hypothesis and (ii) the assumption of a correct specification. The former can be summarised as follows: an abstraction from the concrete, non-formal implementation can be made such that the particularities of interest (e.g. behaviour) are formally specified in a model of the implementation. Nevertheless, it
is known that the generation of the test suite is strongly influenced by the specification model employed. In this research, the CSXMS formalism has been used.
This paper takes first steps towards a unit testing method for CSXMS, based on generating test sequences from the components of the specification. These sequences are exercised on each of the implemented components independently through a distributed test system, which can be the same system under test, but in which the components include some extra functionality to act as testers.
The justification for this approach for the model under consideration is that the product machine approach is, in general, prohibitively expensive. The usual practice is to develop techniques for reducing the size of this model. This implies that both a transformational process (transforming a distributed specification into a stand-alone specification) and a condensational process (finding an equivalent reduced specification) are required. The question that naturally arises is: why is the specification distributed in the first place? In other words, is the implemented system also distributed? If so, intuition suggests that, if there is a correspondence between the components of the specification and those of the implementation, then the whole process is overloaded. This claim is supported by the observation that at the beginning a stand-alone form is obtained by transforming and reducing an original distributed specification; then the test cases generated from it are applied "backwards" to a distributed model of the implementation. Why exercise all this transformational-condensational machinery in order to use a stand-alone testing technique when both specification and implementation are distributed and the components of both correspond? This paper suggests that the complexity can be improved if the testing process is kept distributed, and it investigates to what extent this method requires distribution. This view holds that, under some well-defined assumptions (testing conditions, testing hypotheses), some distributed testing tasks can be achieved in a local manner with no degradation in their accuracy with respect to a global version of the same.
9 Acknowledgments
We would like to thank one of our anonymous reviewers for astute questions and constructive suggestions for improving the paper.
References
1. Laycock, G.: The Theory and Practice of Specification-Based Software Testing. PhD thesis, University of Sheffield, U.K. (1993)
2. von Bochmann, G., Petrenko, A.: Protocol Testing: Review of Methods and Relevance for Software Testing. In: Proceedings of the International Symposium on Software Testing and Analysis. (1994) 109–124
3. Heerink, A.: Ins and Outs in Refusal Testing. PhD thesis, University of Twente, The Netherlands (1998)
4. International Organization for Standardization: Information Technology – Open Systems Interconnection – Conformance Testing Methodology and Framework. Parts 1–2, 4–7, IS 9646, ISO/IEC (1994)
5. Chow, T.: Testing Software Design Modeled by Finite-State Machines. IEEE Transactions on Software Engineering 4 (1978) 178–187
6. Fujiwara, S., von Bochmann, G., Khendek, F., Amalou, M., Ghedamsi, A.: Test Selection Based on Finite State Models. IEEE Transactions on Software Engineering 17 (1991) 591–603
7. Holcombe, M., Ipate, F.: Correct Systems: Building a Business Process Solution. Springer Verlag, Berlin (1998)
8. Ipate, F., Holcombe, M.: An Integration Testing Method Which is Proved to Find All Faults. International Journal of Computer Mathematics 63 (1997) 159–178
9. Ipate, F., Holcombe, M.: Testing Conditions for Communicating Stream X-Machine Systems. Formal Aspects of Computing 13 (2002) 431–446
10. Ipate, F., Holcombe, M.: Generating Test Sets from Non-Deterministic Stream X-Machines. Formal Aspects of Computing 12 (2000)
11. Hierons, R.M., Harman, M.: Testing Conformance to a Quasi-Non-Deterministic Stream X-Machine. Formal Aspects of Computing 12 (2000) 423–442
12. Hierons, R.M., Harman, M.: Testing Conformance of a Deterministic Implementation Against a Non-Deterministic Stream X-Machine. Theoretical Computer Science 4 (2004) 191–233
13. Balanescu, T., Cowling, A., Georgescu, H., Gheorghe, M., Holcombe, M., Vertan, C.: Communicating Stream X-Machines Systems are no more than X-Machines. 5 (1999) 494–507
14. Institute of Electrical and Electronics Engineers: IEEE Standard Computer Dictionary: A Compilation of IEEE Standard Computer Glossaries. Technical report, IEEE (1990)
15. Sarikaya, B.: Principles of Protocol Engineering and Conformance Testing. Ellis Horwood Series in Computer Communications and Networking (1993)
16. Walter, T., Grabowski, J.: Framework for the Specification of Test Cases for Real-Time Distributed Systems. Information and Software Technology 41 (1999) 781–789
17. Cacciari, L., Rafiq, O.: Controllability and Observability in Distributed Testing. Information and Software Technology 41 (1999) 767–780
18. Aguado, J.: Conformance Testing of Distributed Systems: an X-machine Based Approach. PhD thesis, University of Sheffield, U.K. (2004)
19. Petrenko, A.: Fault Model-Driven Test Derivation from Finite State Models: Annotated Bibliography. In: Modeling and Verification of Parallel Processes, 4th Summer School, MOVEP. (2000) 196–205
20. Petrenko, A., Yevtushenko, N.: Queued Testing of Transition Systems with Inputs and Outputs. In: FATES. (2002)
21. Bourhfir, C., Dssouli, R., Aboulhamid, E., Rico, N.: Automatic Executable Test Case Generation for Extended Finite State Machine Protocols. In: IWTCS. (1997) 75–90
22. Chanson, S., Zhu, J.: A Unified Approach to Protocol Test Sequence Generation. In: INFOCOM. (1993) 106–114
23. Huang, C., Y. Lin, M.J.: Executable Data Flow and Control Flow Protocol Test Sequence Generation for EFSM-Specified Protocols. In: IWPTS. (1995)
24. Ural, H., Yang, B.: A Test Sequence Selection Method for Protocol Testing. IEEE Transactions on Communications 39 (1991) 514–523
25. Rubin, J., West, C.: An Improved Protocol Validation Technique. Computer Networks 6 (1982) 65–73
26. Gouda, M., Yu, Y.: Protocol Validation by Maximal Progress State Exploration. IEEE Transactions on Communications 32 (1984) 94–97
27. Aleliunas, R., Karp, R., Lipton, R., Lovasz, L., Rackoff, C.: Random Walks, Universal Traversal Sequences, and the Complexity of Maze Problems. In: Proc. of the 20th Annual Symposium on Foundations of Computer Science. (1979) 218–223
28. Lee, D., Sabnani, K., Kristol, D., Paul, S.: Conformance Testing of Protocols Specified as Communicating Finite State Machines - a Guided Random Walk Based Approach. IEEE Transactions on Communications 44 (1996) 631–640
29. Hierons, R.: Testing from Semi-independent Communicating Finite State Machines with a Slow Environment. IEE Proceedings on Software Engineering 144 (1997) 291–295
30. Petrenko, A., Yevtushenko, N., von Bochmann, G., Dssouli, R.: Testing in Context: Framework and Test Derivation. Computer Communications (special issue on Protocol Engineering) 19 (1996) 1236–1249
31. Bourhfir, C., Dssouli, R., Aboulhamid, E., Rico, N.: A Guided Incremental Test Case Generation Procedure for Conformance Testing for CEFSM Specified Protocols. In: IWTCS. (1998) 275–290
32. Petrenko, A., Yevtushenko, N., von Bochmann, G.: Fault Models for Testing in Context. In: FORTE. (1996) 163–178
33. Yevtushenko, N., Cavalli, A., Lima, L.: Test Suite Minimization for Testing in Context. In: IWTCS. (1998)
34. Zhu, J., Vuong, S., Chanson, S.: Evaluation of Test Coverage for Embedded System Testing. In: IWTCS. (1998) 111–126
35. Fecko, M., Uyar, M., Sethi, A., Amer, P.: Issues in Conformance Testing: Multiple Semicontrollable Interfaces. In: FORTE. (1998) 111–126
36. Cavalli, A., Lee, D., Rinderknecht, C., Zaidi, F.: Hit-or-Jump: An Algorithm for Embedded Testing with Applications to IN Services. In: FORTE. (1999) 41–56
UKTest 2005
Testing from object machines in practice
Kirill Bogdanov and Mike Holcombe
Department of Computer Science, The University of Sheffield,
Regent Court, 211 Portobello St., Sheffield S1 4DP, UK
email: K.Bogdanov@dcs.shef.ac.uk
July 28, 2005
Abstract
Rigorous state-based testing methods for objects are capable of producing high-quality
test case sequences, but derivation of test data for them can be hard to automate; moreover,
test sequences using call-backs can be tedious to hand-code in JUnit. This paper describes
an approach for automatic construction of JUnit test sequences from templates provided by
a tester, which can be guided by one of the rigorous state-based test methods. Templates
are written in a style similar to traditional JUnit tests, from which actual test sequences are
automatically produced when JUnit requests a test suite from a tester object.
The main benefit of the described work is automation of rigorous testing of objects
which communicate with a number of collaborator objects.
1 Introduction
In many applications, objects are responsible for operations which go through a number of
states. Consider an object which has to perform a sequence of actions to accomplish a request
from a user, where each of the actions has to be delegated to collaborator objects. Each of the
actions may fail, making it necessary for the object to recover, depending on the action which
failed. For this reason, such an object has to have a state dedicated to each of these actions with
transitions corresponding to a success or a failure of actions. In addition, a user may change
his/her mind before a request has been completed. In such a situation, the object has to tell a
user that it cannot accept a new request until the previous one has either completed or
failed; a more sophisticated system could cancel the previous request and start a new one. An
action to cancel a request may also depend on the current state. Finally, an implementation
of the said object may use different forms of notification of the success or failure of actions,
namely (1) by values returned to the object from methods of its collaborator objects, (2) via
exceptions and (3) through methods of the object being called back by collaborators. Testing
objects of the described kind requires one not only to generate tests covering a variety of
possible paths through a transition diagram but also to utilise the three kinds of communication
mentioned.
The complexity of the control behaviour can make it difficult to perform traditional (such as
category-partition) testing of the state-transition behaviour of methods of an object effectively.
Consequently, it seems reasonable to attempt known rigorous methods for test generation from
a state-transition structure. The main benefits of these methods are that (1) states can be entered
by taking sequences of transitions rather than attempting to set state-related variables, which
may be hidden; (2) behaviour of each state is tested with both expected and unexpected inputs;
(3) states entered by an object during testing can be identified by attempting sequences of
inputs, hence it is often not necessary to make the internal variables available to a tester; (4)
faults targeted by state-based testing methods and the conditions under which such faults are
found are clearly stated. The downside of these test methods is that they generate numerous test
case sequences, which may be hard to run automatically. In addition, test generation requires a
state-based model of an object under test; unfortunately, traditional state-based testing methods
for objects do not make it easy to express call-backs as first-class elements of such models.
This paper describes how testing can be performed, avoiding these two problems; in addition,
it is relatively easy to make changes to a test suite to reflect changes in an object under test.
Section 2 describes how objects with call-backs can be modelled and Sect. 3 introduces one of
the testing methods which can be used for testing objects; Sect. 4 describes how the common
parts of test sequences can be highlighted by a developer. The construction of stubs to perform
testing is described in Sect. 5, followed in Sect. 6 by conclusions and a comparison to related
work.
2 A model for objects: object machines
Typical models of objects used for state-based testing [15, 14, 25, 24, 11] describe the be-
haviour of objects in terms of an extended finite-state machine with transitions taken in re-
sponse to method calls received by objects. This model does not permit one to express call-
backs where an object delegates to a collaborator object, which in turn calls the considered
object back. The approach taken in this paper considers all instances when data flows into an
object under test as an input of it and all cases when it flows out of it — as an output. The
theory of this is described in detail in [1], while this paper focuses on the practical side of test-
ing. These practical results stem from a case study conducted by the authors testing 15K lines
of Java code with 25K lines of test code, from which 3.3K JUnit tests were automatically
generated. Although not large in absolute terms, this case study will be referred to as the big case
study to differentiate it from a tiny one used in this paper to illustrate the proposed method for
testing of objects. The Java example used as an illustration in this paper is slightly different
from the code used in the big case study and can be obtained from URL
http://www.dcs.shef.ac.uk/~kirill/simple_om_test.zip. This example was chosen in preference
to a part of the big case study for two reasons: (1) the example is simple and covers the be-
haviour based on inter-object communication, exceptions and call-backs; (2) any part of the
big case study illustrating the advantages of the modelling and testing method to be described
will need extensive explanation covering the purpose of such a part and details of its operation.
The object used for testing in the example has a single method start(int) which takes an
integer as a parameter and uses a collaborator object to perform computations. The transition
diagram for the example is depicted in Fig. 1. The object under test (OUT) of the example
will further be referred to as SOUT (Simple Object Under Test). A label of every transition
on the diagram contains an input/output pair; potentially complex computations associated
with labels are not shown on the diagram. An input part of a transition label reflects any
information received by an object, be it a method call, an exception or simply a value returned
to an object from a method it called. In a similar way, an output can be a method call made by
the considered object, an exception thrown by it or a value it returns to a caller of its method.
Tables at the top of the diagram depict inputs and outputs of the SOUT and its collaborator.
Method calls are start(int), init(int) and compute(); RET(true), RET(false) and RET() are return
values and MyException is the exception which may be thrown. The described treatment of
inter-object communications is important from two points of view: (1) it forces a developer
to consider every possible call-back or an exception in every state of an object and (2) any
[Figure 1, upper part: the transition diagram of the SOUT, with states R (initial), I, C and E, and input/output pairs as transition labels: start(int)/init(int) takes R to I; RET(true)/compute() takes I to C; RET(false)/RET(false) and MyException/MyException take I back to R; RET()/RET(true) takes C to E; MyException/RET(false) takes C to R; start(int)/RET(true) takes E to R; start(int)/MyException loops on I and on C. Tables at the top list the inputs of the Object Under Test (OUT): start(int), RET(true), RET(false), RET(), MyException; its outputs: init(int), compute(), RET(true), RET(false), MyException; and the collaborator's inputs (init(int), compute()) and outputs (RET(true), RET(false), RET(), MyException). State identification: {start(int), RET()}; the responses to each of these sequences are R: {init, (impossible)}, I: {MyException, (impossible)}, C: {MyException, RET(true)}, E: {RET(true), (impossible)}.]
public boolean start(int arg) throws MyException {
    if (processingStage == stateE) {
        // State E
        processingStage = stateINIT;
        return true;
    }
    if (processingStage == stateINNER)
        throw new MyException("Processing already active");
    processingStage = stateINNER;
    try {
        // About to enter state I
        if (!collaborator.init(arg)) {
            processingStage = stateINIT;
            return false;
        }
    } catch (MyException ex) {
        processingStage = stateINIT;
        throw ex;
    }
    try {
        // About to enter state C
        collaborator.compute();
    } catch (MyException ex) {
        processingStage = stateINIT;
        return false;
    }
    processingStage = stateE;
    return true; // everything is ok.
}
Figure 1: A simple object machine and the implementation of the start(int) method
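The code in Fig. 1 refers to a collaborator object and to MyException, neither of which is shown. The following is a minimal sketch of the declarations that code appears to assume; only the method signatures are implied by the figure, while the interface name Collaborator and the comments are our own guesses, not part of the paper's case study:

```java
// Hypothetical declarations assumed by the start(int) method of Fig. 1.
// Signatures follow the figure; everything else is an illustrative guess.
class MyException extends Exception {
    public MyException(String message) { super(message); }
}

interface Collaborator {
    // Signals failure either by returning false or by throwing MyException.
    boolean init(int arg) throws MyException;
    // Has no return value; signals failure by throwing MyException.
    void compute() throws MyException;
}
```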
call/return/exception can be used in testing of any state. Given that the considered model of
objects uses an explicit transition diagram with some computations associated with labels, it
seems reasonable to use Extended Finite-State Machines to give a formal meaning to such
models. X-machines [6] (a kind of Extended Finite-State Machines) are used for this reason;
this paper does not go into detail as to how objects are modelled, but it provides both the idea
of how this can be done and how to use the X-machine testing method for testing of objects.
X-machines describing objects are further called object machines.
In the example considered, there is a single transition corresponding to a call of the start(int)
method from every state. This does not have to be the case: it is possible to introduce different
transitions corresponding to a call of start(int) with different values of the integer argument.
Even more, in principle, one can introduce labels which accept both a method call and a return
as an input and either make a call to a collaborator or return to a caller, depending on the re-
sult of computation they performed. Although handling different values of an integer method
argument with different labels does not make testing difficult, testing object machines with la-
bels accepting/generating both calls and returns or even accepting/generating returns of data of
different types makes testing substantially more complex. For this reason, this case is not con-
sidered in this paper; for details refer to [1]. The big case study did not contain the complexity
mentioned in this paragraph other than splitting labels based on values of arguments.
The part of Fig. 1 below the diagram depicts a possible implementation of the SOUT’s
only method. The code style was chosen to ensure that everything fits on a page. When a
method call is received, the SOUT has to determine its current state and respond accordingly.
Although the processingStage variable holds the current stage, the difference between
I and C states is not captured by it, since the object knows when it made the init(arg)
call. This highlights the correspondence between states in an object machine and variables in
a program code: a state in a machine has to be related to (1) instance variables of an object,
(2) local variables of the object’s methods and (3) the currently executing location in the code.
Consider an object waiting for a collaborator to respond, such as the SOUT in state I. In this
state the executing method of the considered object could have some local variables. If this
object receives a callback call (such as from a collaborator object), a corresponding method
of the considered object will be invoked and new local variables created. The behaviour may
hence be affected by values of both original and new local variables. For this reason, the above
characterisation of a state of an object machine has to read (1) protected and private instance
variables of an object, (2) a stack of local variables of object’s methods and (3) a stack of
locations in the code corresponding to the object’s methods. The mention of protected and
private variables and the omission of static ones is related to how accessible variables are to
other objects. When talking about objects, it is important to define a model which gives a
predictable outcome to an input received by a model. If variables associated with states can
be modified directly by other objects, in a faulty implementation an OUT can change a state
unpredictably; for a similar reason, state-related variables should not be directly accessible to
other objects. Any variables which do not satisfy the stated condition cannot be a part of a
state, but they can be implicitly considered inputs and outputs. This also includes instance or
local variables which are passed to other objects by reference, since nothing stops those objects
from storing these references and using them at any time in the future.
While an OUT is waiting for a response from a collaborator object after making a call to
it, nothing stops some other object calling back the OUT. The OUT may respond by making a
further call and receive a call-back again. In principle, there is no upper limit on the number
of nested callbacks received by an object. If every callback enters a separate state, this leads
to an infinite number of states in a model. Additionally, if a label has a method return as an
input, it can only be taken if an OUT has previously made a call; this implies that not every
path in a model may be executable. In this paper, it is assumed that objects do not call their
collaborators from within callbacks (for instance, the SOUT is not allowed to make a call to the
collaborator from I or C states). This assumption ensures that the number of states in object
machines is finite, makes it easy to check implementability of a model manually and makes
test application simpler. All but one of the objects used in the big case study satisfy this property and
the one which does not is nevertheless testable using the described framework.
With the constraints described above, an X-machine corresponding to the SOUT can be
defined as follows. The set of inputs is Σ = ({start} × N) ∪ ({RET} × {true, false}) ∪ {RET, MyException}. Outputs are defined in a similar way: Γ = ({init} × N) ∪ ({RET} × {true, false}) ∪ {compute, RET, MyException}. In X-machine terms, every transition label from a set of labels Φ is called a function and shares a common data store
(called memory) with all other functions of the same machine. The presence of such a store is
important, since a developer can choose what to interpret as a state on a transition diagram and
what to put in this memory. Without a store, all data will have to be included on a transition
diagram, leading to a known problem of state explosion. On the other extreme, if a developer
includes too little information in a transition diagram, state-based testing will lose its effec-
tiveness. For this reason, it is necessary to include only the relevant control behaviour in a
transition diagram; how to determine what is relevant is outside the scope of this paper. A
combination of a state on a transition diagram and a value of memory is a global state of an
X-machine, in that a response of a machine to an input depends solely on these two. Given
that objects are modelled using X-machines, only variables not accessible to other objects can
be used to define their global state.
When supplied with an input σ ∈ Σ, an X-machine decides which function φ to take, such
that (1) such a function is defined for the current memory and the input, and (2) there is a
transition from the current state labelled with that function. If there is a function satisfying
these conditions, the X-machine takes the corresponding transition and executes the function;
this yields a state change, an output produced by the function and potentially a change to the
memory. Formally, X-machine functions are defined as partial functions to take an input and
memory and produce an output and a new value of memory, φ : Σ × M → Γ × M , where M
denotes the type of memory. Traditionally, functions are given names which are included on a
transition diagram; this has also been done for the big case study, but the example included in
this paper appeared easier to understand if input/output pairs are included instead. These pairs
uniquely identify the corresponding functions and make their purpose clear. To complete the
definition of an X-machine corresponding to the SOUT, it is necessary to define a set of states
Q = {R, I, C, E}, the initial state q0 = R and the transition diagram F: Φ × Q → Q, all three depicted in Fig. 1. With m0 denoting the initial memory value, the SOUT X-machine is the tuple (Σ, Γ, Φ, M, m0, Q, q0, F). Note that since the SOUT does not actually contain memory
variables, it can be defined without M and m0; this is not a common situation in practice.
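The stepping rule just described (given an input, find the single transition from the current state whose function is defined for the current input and memory, then apply it) can be sketched as a small interpreter. The code below is our own illustration, not the authors' framework; it assumes the determinism discussed in Sect. 3 and represents states, inputs and outputs as strings and memory as an Integer:

```java
import java.util.*;
import java.util.function.*;

// Illustrative interpreter for a deterministic X-machine.
class XMachine {
    // A labelled transition: fires when its guard accepts (input, memory).
    record Fun(String from, String to,
               BiPredicate<String, Integer> guard,
               BiFunction<String, Integer, String> output,
               BiFunction<String, Integer, Integer> nextMem) {}

    String state;
    Integer mem;
    final List<Fun> funs;

    XMachine(String q0, Integer m0, List<Fun> funs) {
        state = q0; mem = m0; this.funs = funs;
    }

    // Apply input sigma: take the (assumed unique) enabled transition,
    // returning its output, or null if no transition is defined.
    String step(String sigma) {
        for (Fun f : funs)
            if (f.from().equals(state) && f.guard().test(sigma, mem)) {
                String out = f.output().apply(sigma, mem);
                mem = f.nextMem().apply(sigma, mem);
                state = f.to();
                return out;
            }
        return null; // input not accepted in this global state
    }
}
```

A global state in the sense above is the pair (state, mem): the response to any input depends on these two alone.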
3 X-machine testing of object machines
This section describes how the X-machine (extended finite-state machine) testing method [6]
can be applied to object machines and Sect. 4 shows how test sequences can be represented in
JUnit with common parts of them separated.
The main advantage of using the X-machine testing method is that it verifies by testing that
subject to certain conditions an implementation is behaviourally-equivalent to an X-machine
specification. Compared to many other testing methods, all these assumptions can be defined
formally and hence can be verified (or taken on trust). The foundation of the X-machine testing
method is finite-state machine testing methods [18, 20, 26, 2] which test a transition diagram.
They have been adapted [6] for testing X-machines by focusing on testing of the state-transition
diagram. For each state, the aim is to ensure (1) that all transitions with appropriate labels are
implemented from that state and lead to the expected states and (2) that transitions with all
other labels featuring in an object machine are not implemented from the considered state. For
instance, one would like to check that there is a transition RET(false)/RET(false) from the I
state, corresponding to the SOUT receiving a response false and returning false to the caller of
its start(int) method; moreover, there should be no transition RET(true)/RET(true) from that
state. Attempting such a function by returning true from the collaborator causes the collabo-
rator’s compute method to be called (instead of true being returned to a caller of the start(int)
method), making it clear to a tester that there is no transition with the RET(true)/RET(true)
label from the I state. Since an object may accept a variety of different inputs in any state, it is
not feasible to attempt all possible inputs from each state; for this reason, testing of a transition
diagram is limited to checking that from every state, only transitions with specified labels exist
and these transitions lead to the expected states. For instance, a transition from state I with a
label RET(false)/RET(false) should exist and enter R.
State verification can be done by observing values of object variables, however this con-
tradicts the spirit of object machines, since variables in a program code which contribute to
a global state of a corresponding object machine are not supposed to be externally acces-
sible. State-based testing methods such as the W method [26, 2] identify states by check-
ing the response of an object under test to various sequences of inputs; if each state can be
associated with a unique response, states can be identified. Compared to state variable ob-
servation, this is a more difficult approach to state verification but a more general one and
has been the main state-verification approach in the big case study. For this reason, it is
also used for testing of the simple object described in this paper. State-identification strat-
egy of the W method (when applied to object machines) is to find a set of sequences of labels
(called the W set), such that for every pair of states of an OUT there is a sequence in this
set which exists from one of these states and not from the other one. For example, state
C is the only one with the RET()/RET(true) transition from it; states C and I are the only
two with the start(int)/MyException transition; finally, the existence of a transition with the
start(int)/init(int) label uniquely determines the R state. The above three labels (comprising
the W set) make it possible to tell if the SOUT is in one of three of its four states; if it is not
in any of the three, it must be in the E one. After running a test sequence during testing, one
would first attempt an input of start(int), which may lead a potentially faulty implementation
to an arbitrary state. For this reason, it is necessary to restart a test and run the same sequence
again, following it with RET(), if such an input can be applied. This requires an implementa-
tion of an OUT to possess a reliable reset. The response of the SOUT to these two inputs is
shown in the top-right corner of Fig. 1 and it is possible to observe that states can indeed be
uniquely identified this way; ‘(impossible)’ means that an input mentioned cannot be applied
in that state. In general, the W set is not uniquely defined; moreover, state identification is not
always possible, because non-deterministic behaviour and an inability to attempt certain inputs
may lead to some states being indistinguishable from other states [13]. In the context of object
machines, the situation is sufficiently constrained so that this is not a problem, except when an
object machine contains states with identical behaviour. In this case, all but one of the states with the
same behaviour are redundant, hence it seems reasonable to assume that a developer will build
object models without redundant states.
There are two good alternatives to the W method for finite-state machines, Wp [5] and HSI
methods [13]. Both can be applied to testing of X-machines and their usage will reduce the size
of a test set compared to the W method, while still detecting all faults in an implementation.
The method mostly used for the big case study was the HSI one, since it does not require two
separate stages of testing; refer to [13] or to [19] for the description of how the HSI method
works. This paper focuses on the W method for simplicity.
Given a set of labels on transitions Φ, let a set of sequences of labels C be such that
for every state of an OUT, there is a sequence in the set C which labels a path in an object
machine from the initial state to the considered one; for the initial state, such a sequence
has to be an empty sequence (denoted 1). To simplify the presentation, only sequences of
inputs to enter states are provided, hence Cinputs = {1, start(int), start(int) RET(true), start(int) RET(true) RET()}. Such a set can be built if any state is reachable by a sequence
of labels from the initial state; unreachable states are redundant. For this reason, it is assumed
that every state can be reached in the considered object machine; this condition together with
the one about the absence of equivalent states is called minimality. According to [6], the
simplest set of test cases capable of finding all faults is C ∗ (W ∪ Φ ∗ W). Set multiplication
of two sets of sequences means pairwise concatenation of sequences in those sets. The set
of test cases is a formal representation of what has been mentioned above: the C ∗ W part
corresponds to entering every state in an implementation and verifying that the correct state
was entered, and C ∗ Φ ∗ W means that a tester should attempt every function from every state
and verify the entered state.
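Once labels are represented concretely, the set construction above is straightforward to mechanise. The sketch below is ours and purely illustrative: labels are plain strings, sequences are lists, and Φ is treated as a set of one-element sequences.

```java
import java.util.*;

// Illustrative computation of the test-case set C * (W ∪ Φ * W),
// where * denotes pairwise concatenation of sets of sequences.
class TestSet {
    static List<List<String>> concat(List<List<String>> a,
                                     List<List<String>> b) {
        List<List<String>> result = new ArrayList<>();
        for (List<String> x : a)
            for (List<String> y : b) {
                List<String> s = new ArrayList<>(x);
                s.addAll(y);
                result.add(s);
            }
        return result;
    }

    static List<List<String>> testCases(List<List<String>> c,
                                        List<List<String>> w,
                                        List<List<String>> phi) {
        List<List<String>> union = new ArrayList<>(w);  // W ∪ Φ * W
        union.addAll(concat(phi, w));
        return concat(c, union);                        // C * (W ∪ Φ * W)
    }
}
```

With |C| = 4, |W| = 3 and |Φ| = 9, as for the SOUT, this yields 4 ∗ (3 + 9 ∗ 3) = 120 sequences, the count discussed in Sect. 4.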
The extension of the W method to X-machines relies on the following main assumptions:
A tester knows which functions are present in an implementation. For object machines, there
is a clear relationship between the structure of their transition diagrams and the code,
hence checking that no unexpected functions are present in an implementation should
be relatively easy.
All functions of an implementation are correctly implemented. This condition is essentially
a requirement that there is a correspondence between functions of an object machine
and those in an implementation and the functions which correspond to each other are
behaviourally-equivalent. This typically requires testing of such functions; with the
two specific conditions (below) satisfied, such a testing can be performed together with
testing of a transition diagram [8] rather than requiring a tester to test every function
separately.
Every function can be attempted from every global state. This condition (called
input-completeness) aims to solve a problem where the value of memory in an imple-
mentation does not permit a tester to attempt a function of his/her choice. For instance,
an object implementing a stack may only resize itself when full, hence the precondi-
tion of a resize function may require a memory variable size to be of a specific value.
Following [6], the easiest way to solve this problem is to designate specific values of
arguments of methods or return values as test values, so that given a function φ, it is
possible to identify an input σφ such that regardless of the memory value, the precondi-
tion of φ will be satisfied if the input σφ is supplied to the OUT.
Consider attempting RET(false)/RET(false) after a test sequence which is expected to get
an implementation of the SOUT into state C. The only input to attempt RET(false)/RET(false)
is to return false to the SOUT from the collaborator, however in state C no value can be
returned since method compute() has no return value. This clearly means that there can-
not be a transition with the said label from the C state, but a tester cannot be complacent
about this: a faulty implementation may have entered a different state from which a
transition with such a function exists. For this reason, it is necessary both (a) to attempt
functions where the corresponding inputs can be applied and (b) to verify that inputs
for the remaining functions cannot be attempted. For example, if part (b) is not done,
verification of state I reduces to checking that calling start(int) causes an exception to
be thrown and hence makes it impossible to distinguish between states I and C.
It is assumed in this paper that both an object machine and its implementation are
deterministic, hence given an input, at most one function can be taken in response; the
definition of a deterministic X-machine [6] also requires that preconditions of functions
on transitions from the same state are disjoint.
It is possible to identify which function fired in response to an input from an output. For
the SOUT, the same input start(int) can be used to attempt any of the four functions tak-
ing start(int) as an input. During test execution, a tester needs to know which of them is
taken by an implementation in response to start(int); this is done by observing the output
from functions taken. The described condition is called output-distinguishability.
During execution, an object does not have an option of ignoring an input: it always has to
do something, i.e. execute a function. For this reason, it is possible to assume that for every
input which can be applied, an object machine has to execute a function. The fact that it is
additionally possible to verify that an input cannot be supplied means that object machines are
completely defined, i.e. it is possible to assume that for any input in any state there is a defined
response from an object machine and an implementation.
The paper [1] gives a sketch of a proof that subject to (1) the interface of an implementa-
tion of an OUT containing the expected methods with correct argument and return types, (2)
conditions underlined in this section and those outlined in Sect. 2 being satisfied, and (3) the
number of states in an implementation of the OUT being at most the number of states in the
object machine of the OUT, the X-machine testing method is capable of finding all faults. A
test set can also be generated for higher bounds on the number of states in an implementation
of an OUT; refer to [6] for details.
4 Test data generation: h-sequences
In order to run test sequences constructed from C ∗ (W ∪ Φ ∗ W), one has to identify the
actual test data, i.e. determine sequences of (a) inputs to apply to an implementation to drive it
through all the sequences of labels from a test case set, and (b) outputs which will be produced
by an object machine in response to these inputs. This task requires consideration of precon-
ditions of functions, is frequently difficult to automate and requires a detailed mathematical
description of the behaviour of labels. The authors believe that for a number of objects, such
test data has to be derived manually. This has the advantage that labels do not have to be
defined formally and although in this case an object machine cannot be used for a detailed
analysis of the behaviour of an object, the machine is still useful to guide implementation and
serve as a basis for test generation. In the big case study, object machines have been essential
for both of these. The main problem of test generation is the quantity of test cases: for the
SOUT this amounts to |C| ∗ (|W| + |Φ| ∗ |W|) = 4 ∗ (3 + 9 ∗ 3) = 120 sequences. This can
make rigorous testing such as X-machine testing prohibitively expensive, especially if every
test input is determined manually and upon any change to the model the whole work has to be
done again. The effect of this problem can be substantially reduced by the extraction of numer-
ous common parts of test sequences. This way, (1) one can identify test data corresponding to
common elements manually and re-use it across all the sequences in which it occurs; (2) it is possible to
describe tests in a compact form, so that when an object machine changes, the said form can
be modified and all test sequences be regenerated from it automatically.
Identification of common parts of test sequences can be done manually, but the task is made
relatively simple by the structure of a set of test cases and specific properties of objects under
test. First of all, every sequence in C is common to all the sequences in W ∪Φ∗W . In general,
test data for elements of W depends on the sequence of functions executed before elements of
W ; both for the big case study and the simple example considered in this paper, the choice of
test data for state identification tends to depend only on the state to identify. For this reason,
test inputs and outputs corresponding to a W set applied in a particular state are common to
all test sequences requiring identification of that state. Rather frequently, a response of an
object under test to elements of Φ is also determined by a state from which these functions are
attempted.
The compact form of encoding test sequences is called hierarchical test sequences (abbre-
viated h-sequences). The hierarchy is used to separate common parts of test sequences. In this
paper, h-sequences are encoded as nested arrays of objects because Java provides convenient
methods to initialise arrays inline. For the SOUT, a sketch of a possible (incomplete) test for
state I is shown below.
public Object[] testStateI() {
    return new Object[] { attempt_start, verify_init_called,
        new Object[] {
            new Object[] { return_true, verify_compute_called,
                           verifyEnteredC() },
            new Object[] { throw_MyException,
                           verify_start_thrown_MyException,
                           verifyEnteredR() },
        }
    };
}
The attempt_start, verify_init_called sequence ensures that the object under test enters state I; this part consists of test data corresponding to a sequence from C. Attempts to fire labels from state I are given by the inputs return_true and throw_MyException. Each of them has to be attempted after the sequence attempt_start, verify_init_called
is taken. For this reason, a nested array of sequences is included in the test sequence above.
Each element of the nested array has to be concatenated with elements preceding such a nested
array; if there are elements following the nested array, they are appended to the resulting test
sequences. Sequences in a nested array, such as those starting from return true, are h-
sequences, so that it is possible to include nested arrays in them too. For an arbitrary degree
of nesting of arrays, starting from one, odd levels correspond to elements which have to be
taken in a sequence and even levels correspond to sequences, each of which has to be con-
catenated with a sequence at a lower level. If there are two nested arrays, such as if a test se-
quence contains new Object [] {attempt start,verify init called}, new
Object [] {return true, verify compute called}, each element of the first
of them has to be concatenated with every element of the second one.
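As an illustrative sketch (not the framework's actual code; the class name and the sample identifiers are invented), the flattening of h-sequences just described might be written as follows: a nested Object[] denotes a set of alternative h-sequences, and each alternative is concatenated with every sequence built so far.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of 'flattening' h-sequences into ordinary sequences.
public class HSeq {
    public static List<List<Object>> expand(Object[] hseq) {
        List<List<Object>> result = new ArrayList<>();
        result.add(new ArrayList<>()); // start from a single empty sequence
        for (Object elem : hseq) {
            if (elem instanceof Object[]) {
                // even level: expand every alternative h-sequence...
                List<List<Object>> branches = new ArrayList<>();
                for (Object alt : (Object[]) elem)
                    branches.addAll(expand((Object[]) alt));
                // ...and concatenate each one with each preceding sequence
                List<List<Object>> crossed = new ArrayList<>();
                for (List<Object> prefix : result)
                    for (List<Object> branch : branches) {
                        List<Object> seq = new ArrayList<>(prefix);
                        seq.addAll(branch);
                        crossed.add(seq);
                    }
                result = crossed;
            } else {
                // odd level: a plain element is appended to every sequence
                for (List<Object> seq : result) seq.add(elem);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // the shape of testStateI(): a common prefix and two alternatives
        Object[] h = { "attempt_start", "verify_init_called",
            new Object[] {
                new Object[] { "return_true", "verify_compute_called" },
                new Object[] { "throw_MyException", "verify_start_thrown" } } };
        System.out.println(expand(h)); // two flat sequences sharing the prefix
    }
}
```

With two nested arrays in one h-sequence, the cross product described above emerges from applying the concatenation step once per nested array.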
The testStateI() method calls the verifyEnteredC() and verifyEnteredR()
methods to generate h-sequences verifying states C and R, respectively. This makes it possible
to use the same h-sequences to verify the same state in multiple tests. Deep nesting
of h-sequences is useful when, in order to identify a state, a number of sequences of labels have
to be attempted. Use of verifyEntered()-style methods was essential for testing in
the big case study.
The actual test inputs and outputs in h-sequences are represented by objects; attempt_start,
for instance, has to be an instance of some class. In order to run tests, h-sequences are
converted to ordinary sequences by the test framework and each of the resulting sequences is
encapsulated in a JUnit test; the framework also partially automates the running of test sequences.
Stubs representing collaborators have to be written manually by a tester, which was easy for
the big case study; the stubs are described in Sect. 5.
With h-sequences, it is easy to propagate changes made to an OUT to the corresponding
test h-sequences. When new functions are added, new sequences attempting them have to be
added to tests for every state. An extra state in an OUT can be tested by introducing a new
method returning an h-sequence to test this state and writing a new state identification method.
This was done a number of times during the work on the big case study.
Calls of methods in an OUT are described by objects extending the special testElem
class introduced in the object machine testing framework; instances of such derived classes are
typically Java anonymous inner classes implementing the run method, which is
called by the test framework during test execution. Expected calls to a stub of the collaborator
object are represented by instances of the CollaboratorStub class which are initialised
with enough information to check that data passed to the stub is correct (namely that the argu-
ment passed to the init(int) method is correct). For exceptions raised by stubs and callbacks
from stubs to the SOUT, instances of the MyEx class and the testElem class, respectively,
can be used in h-sequences. Instances of ReturnValue are used to denote values to be re-
turned from stubs; those of NotApplicable are necessary to check that a particular input
cannot be applied.
In the big case study, values to be returned from stubbed methods were often included in
the instances of stub test elements. In the simple example used in this paper, they are instead
included as separate elements of h-sequences, because this makes it easier to compose test
sequences using nested h-sequences. For instance, different return values have to be attempted
after the attempt_start, verify_init_called sequence; if return values were included in an
instance of CollaboratorStub (i.e. coupled with the verify_init_called element),
it would not be possible to share verify_init_called between all the subsequent sequences.
This was not an issue for the big case study, where state verification sequences never started
with a return value.
For the SOUT, testing of the R state can be accomplished using the h-sequence returned
by the method below, which includes instances of the appropriate objects.
public Object[] testRState() {
    return new Object[] {
        verifyRstate(), // check that we are in the R state
        new testElem("start") { // "start" is this input's name
            public void run() throws MyException { out.start(67); }
        },
        new CollaboratorStub("init", 67),
        verifyIstate(),
        new classEnd()
    };
}
A call to the SOUT is performed with out.start(67); the following element in the
sequence checks that 67 has been passed to the init(int) method of the collaborator. The h-
sequence returned by verifyIstate() checks the entered state; the sequence testing R
ends with the new classEnd() element, which ends the test and unrolls the call stack.
Unrolling is often necessary because test sequences can end within a call to a collaborator,
with the OUT waiting for a response; an unhandled exception is used to perform the unrolling.
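A minimal sketch of this unrolling mechanism is given below; classEnd and UnrollCallStackException appear in the paper's framework, but every other name here is invented for illustration.

```java
// Illustrative sketch: when a test sequence ends while the OUT is still inside
// a call to a collaborator stub, the stub throws an unchecked exception and the
// test driver catches it at the top level, unrolling the call stack.
public class UnrollDemo {
    static class UnrollCallStackException extends RuntimeException {}

    // stands in for a collaborator stub that reaches a classEnd element
    static void collaboratorStub() {
        throw new UnrollCallStackException();
    }

    // stands in for a call into the OUT that, in turn, calls the collaborator
    static void objectUnderTest() {
        collaboratorStub();
        // never reached: without unrolling, the OUT would wait for a response
    }

    public static String runTestSequence() {
        try {
            objectUnderTest();
            return "returned normally";
        } catch (UnrollCallStackException e) {
            return "call stack unrolled";
        }
    }

    public static void main(String[] args) {
        System.out.println(runTestSequence()); // prints "call stack unrolled"
    }
}
```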
As mentioned earlier, state verification is accomplished using sequences returned by meth-
ods such as verifyRstate(). In addition to the sequences mentioned in Sect. 3, an empty
sequence is included in h-sequences used for state verification; its presence is necessary to
verify states and then continue testing. For example, state verification of the SOUT involves
supplying two test inputs to it, each of which may lead to an arbitrary state in a faulty
implementation; a tester would like to attempt one of them, restart the test sequence, attempt
the other, restart the sequence again, and then attempt some functions from the now-verified
state. The need to perform this ‘verify and continue’ step in testing the big case study was a
motivation for the development of h-sequences. A method to verify the I state is given below.
public Object[] verifyIstate() {
    return new Object[] { null, // the empty sequence
        new Object[] { // calling "start" should cause an exception
            new testElem("start I(-582)") {
                public void run() {
                    try {
                        out.start(-582);
                    } catch (MyException ex) {
                        return; // everything is ok if we got here
                    }
                    fail("Exception was not thrown");
                }
            },
            new classEnd()
        },
        new Object[] {
            new NotApplicable(new ReturnValue(null)),
            new classEnd()
        }
    };
}
The first sequence is empty (represented by null) and the following two correspond to
the two inputs of the W set and the expected responses from the SOUT. In the second sequence,
a call of start(int) is attempted and the test passes if an exception is thrown by the SOUT in
response; the third sequence verifies that it is not possible to return from a stub without providing
a return value (such a verification is necessary to distinguish state I from state C).
It is possible to integrate testing of X-machine functions into testing of a transition diagram
[8]. This is based on first testing a number of functions on transitions from the initial state
using, for instance, the category-partition testing method [17]. Subsequently, a different state
may be entered using one of the tested functions, and functions on transitions from that state
can be tested in a similar way; this process can be repeated until all functions are tested. Such
a testing process can be directly integrated into h-sequences, where in addition to attempts to
verify transitions from each state, different inputs are supplied to the functions on those
transitions in order to test their behaviour.
If multiple instances of the same collaborator class can be used by an OUT, a reference to
an instance which is expected to be called by the OUT has to be stored in each test element
reflecting an expected call to a test stub.
Testing of self-delegation can follow either of two approaches: (a) stub the OUT itself and
thus capture calls directed from the OUT to itself, or (b) assume that self-delegated calls are
not visible to a tester and limit stubbing to objects other than the OUT. The choice between
these two methods depends on the degree of abstraction used when the object machine of the
OUT is built: a higher-level model will ignore self-delegated calls while a lower-level one
will take them into account.
JUnit [10] tests are often created by deriving a tester class from a JUnit-supplied class and
including test methods in it. JUnit expects test methods to follow specific syntactic
conventions, such as the requirement that the name of a test method begin with test. When
the suite() method of such a tester class is called, JUnit collects all test methods defined in
the class and packages each of them in an instance of the TestCase class; a collection of these
instances is returned from the suite() method. The framework for object testing using h-
sequences follows a similar pattern: all methods with names starting with test are
called in order to obtain h-sequences, and these sequences are expanded into ordinary sequences
by ‘flattening’ the hierarchy as explained in the introduction to h-sequences. Every sequence
obtained by this expansion is packaged in an instance of a TestCase-derived
class, and the collection of the resulting objects is returned from the suite() method.
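The collection step can be sketched with plain reflection; this is an illustrative reconstruction only (the tester class and its method bodies are invented, and the real framework packages each expanded sequence in a TestCase rather than merely collecting method names).

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch: gather all public no-argument methods whose names start
// with "test", mirroring the naming convention used by JUnit's suite() pattern.
public class SuiteSketch {
    public static class ObjectMachineTests {
        public Object[] testStateI() { return new Object[] { "..." }; }
        public Object[] testRState() { return new Object[] { "..." }; }
        public Object[] verifyIstate() { return new Object[] { "..." }; } // not collected
    }

    public static List<String> collectTestMethods(Class<?> tester) {
        List<String> names = new ArrayList<>();
        for (Method m : tester.getMethods())
            if (m.getName().startsWith("test") && m.getParameterCount() == 0)
                names.add(m.getName());
        Collections.sort(names); // reflection returns methods in no fixed order
        return names;
    }

    public static void main(String[] args) {
        System.out.println(collectTestMethods(SuiteSketch.ObjectMachineTests.class));
        // prints [testRState, testStateI]
    }
}
```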
5 Usage of stubs to run test sequences
A traditional approach to testing an object that operates in the context of other objects is to stub
the context; such stubs are known as mocks. A number of tools [9, 4] exist to generate mocks,
but these tools aim at mocking the behaviour of a single object rather than at creating
mocks that check that multiple collaborators are called in a particular order. The approach
described in this paper expects mocks to interpret test sequences constructed from h-sequences
and to communicate with the object under test as prescribed by those sequences. As a result,
(1) the order in which the collaborators of an OUT are called is verified, and (2) a single mock
can be built for each mocked interface (or object) and used to test a variety of objects
communicating via that interface (or with that object). Mock construction can be done
manually (as it was for the big case study) or automated using [9, 4] or any other mock
creation tool.
The code below for the stub of the init(int) method is a simplification of the code used in the
simple example considered in this paper, obtained primarily by deleting error-checking
code. A test sequence constructed from an h-sequence is stored in the array testData,
and the current test datum is assumed to be at index position of that array.
public boolean init(int arg) throws MyException {
    Assert.assertEquals("init",
        ((CollaboratorStub) testData[position]).argName);
    Assert.assertEquals(
        ((CollaboratorStub) testData[position]).expectedArg, arg);
    position++; // move to the next element
    while (testData[position] instanceof NotApplicable) {
        // verify that the input cannot be returned.
        ...
    }
    runSequence();
    tryMyException();
    return ((Boolean) getReturnValue()).booleanValue();
}
The first assertEquals call checks that the test sequence expects the init(int) method
to be called, and the second verifies that the argument passed to init(int) is the expected one.
If a test sequence includes checks that particular inputs cannot be applied (described by
instances of the NotApplicable class), the stubbed method verifies this in the while loop.
The runSequence() method is used by the stub to perform call-backs on the SOUT, if
call-backs are included in the test sequence. Finally, either an exception is thrown or a value
is returned to the SOUT; the former is accomplished by the helper method
tryMyException() and the latter by getReturnValue().
The runSequence() method runs tests by calling methods of the object under test, as
provided by a developer in objects extending testElem. A simplified version of runSequence()
is included below.
protected synchronized void runSequence() {
    while (position < testData.length &&
           !(testData[position] instanceof TestEx) &&
           !(testData[position] instanceof ReturnValue)) {
        int currentPosition = position;
        position++;
        if (testData[currentPosition] instanceof testElem)
            ((testElem) testData[currentPosition]).run();
        else if (testData[currentPosition] instanceof classEnd)
            throw new UnrollCallStackException();
    }
}
The while loop iterates through the test sequence and calls the run() method of every
instance of testElem. There are two cases in which runSequence() is called, as mentioned
above: to run a top-level test sequence and to perform call-backs. A top-level test sequence is
one in which each call to the OUT is made only after the OUT has returned from the previous
call (states R and E of the SOUT). An instance of classEnd or the end of the test sequence
(the first line of the condition of the while loop) terminates a top-level test sequence; the last
two lines of the while loop condition terminate a call-back sequence.
6 Conclusion
This paper has described how a rigorous state-based testing method can be used in conjunction
with JUnit for testing objects that use call-backs and exceptions for communication. By
separating out the common parts of test sequences, the approach also handles the situation
where the test method generates numerous test sequences but test data generation is difficult
to automate; this separation also makes tests easy to adapt to changes in an object machine.
The idea of test reuse is not new: it has previously been used for testing of inheritance,
where one could maintain a hierarchy of test classes in parallel with the structure of the
classes under development; another example of test reuse is IFTC [7], where tests are developed
for each interface and an object can be tested by running the tests for all the interfaces it
implements. This paper has described a testing method complementary to these two
approaches. Reference [22] uses an algebraic approach to test generation; the framework
described in this paper can be used for the resulting tests, but the authors believe that it could
be easier to find commonalities between test sequences produced by state-based testing
methods. For a detailed comparison with [16, 23, 3], refer to [1].
Reference [12] describes how to determine an order in which to test objects so as to create as
few mocks as possible. Paper [21] describes how the behaviour of objects can be recorded and
replayed as mocks for objects which are difficult to use in a test environment. The work on
object state testing primarily targets unit testing and stems from the view that an object under
test has to be controllable and observable; using real objects as collaborators, rather than mocks
made to follow test sequences, makes controllability and observability more difficult. In the
approach described in this paper, in order to facilitate the construction of stubs, (1) a stub is
built per interface (or object) and can be used without changes to run any tests, and (2)
various helper methods were introduced to help with writing stubs. In contrast, the problems
addressed by [12, 21] seem particularly important for integration testing. The authors believe
that the object machine testing method can be useful for this type of testing too, but it will
have to be applied starting from higher-level object machines.
It is worth pointing out that the described approach to object modelling and testing can be
applied to software components; in the context of the big case study, 7.6K lines of assembly
were tested using 8.6K lines of test code (180 tests). This translates to 7.6 lines per test for
Java and 48 lines per test for assembly; the substantial difference between these numbers is
primarily due to the lack of automation of test generation for assembly.
The main limitation of the described work is the limit on the number of nested call-backs
in the model. Although not demonstrated here, the framework seems capable of supporting an
arbitrary (bounded) number of nested callbacks; verifying this capability of the framework as
well as considering unbounded nested callbacks in the model is a subject of future work.
Acknowledgement
This research was in part sponsored by EPSRC grant GR/M56777 ‘MOTIVE’. The authors
would like to thank Tony Simons, Barry Norton and Mike Stannett for valuable discussions.
References
[1] K. Bogdanov, M. Holcombe, and A. Simons. A state-based model for generating com-
plete test sets for objects using interception of communication between them. To be
submitted to ACM Transactions on Software Engineering and Methodology, 2005.
[2] T. Chow. Testing software design modeled by finite-state machines. IEEE Transactions
on Software Engineering, SE-4(3):178–187, 1978.
[3] J. Davies and C. Crichton. Concurrency and refinement in the unified modeling language.
Electronic Notes in Theoretical Computer Science, 70(3), 2002.
http://www.elsevier.nl/locate/entcs/volume70.html.
[4] EasyMock web site.
http://www.easymock.org, December 2004.
[5] S. Fujiwara, G. von Bochmann, F. Khendek, M. Amalou, and A. Ghedamsi. Test selection
based on finite state models. IEEE Transactions on Software Engineering, 17(6):591 –
603, June 1991.
[6] M. Holcombe and F. Ipate. Correct Systems: building a business process solution.
Springer-Verlag Berlin and Heidelberg GmbH & Co. KG, September 1998.
[7] Interface/hierarchical test case (IFTC) web site.
http://groboutils.sourceforge.net/testing-junit/using_iftc.html,
January 2005.
[8] F. Ipate. Complete deterministic stream X-machine testing. Formal Aspects of Comput-
ing, 16(4):374–386, 2004.
[9] jMock web site.
http://jmock.codehaus.org, December 2004.
[10] JUnit web site.
http://www.junit.org, December 2004.
[11] D. Kung, Y. Lu, N. Venugopalan, P. Hsia, Y. Toyoshima, C. Chen, and J. Gao. Object state
testing and fault analysis for reliable software systems. In IEEE 7th Int’l symposium on
Software Reliability Engineering, pages 133–142. IEEE Computer Society Press, 1996.
[12] Y. Labiche, P. Thevenod-Fosse, H. Waeselynck, and M.-H. Durand. Testing levels for
object-oriented software. In Proceedings of the 22nd International Conference on Soft-
ware Engineering, pages 136–145. ACM Press, June 2000.
[13] G. Luo, A. Petrenko, and G. von Bochmann. Selecting test sequences for partially spec-
ified nondeterministic finite state machines. In IFIP Seventh International Workshop on
Protocol Test Systems, Japan, pages 95–110, 1994.
[14] J. McGregor. Constructing functional test cases using incrementally derived state ma-
chines. In Eleventh International Conference on Testing Computer Software, 1994.
[15] J. D. McGregor. Functional testing of classes. In Proc. 7th International Quality Week,
San Francisco, CA, May 1994. Software Research Institute.
[16] OMG. OMG Unified Modeling Language specification, version 1.5.
http://www.omg.org/technology/documents/formal/uml.htm,
March 2003.
[17] T. J. Ostrand and M. J. Balcer. The category-partition method for specifying and gener-
ating functional tests. Communications of the ACM, 31(6):676–686, June 1988.
[18] A. Petrenko. Fault model-driven test derivation from finite state models: Annotated bib-
liography. In Modeling and Verification of Parallel Processes (MOVEP’2000), Nantes,
France, volume 2067 of Lecture Notes in Computer Science, pages 36–43. Springer Ver-
lag, 19-23 June 2000.
[19] A. Petrenko, N. Yevtushenko, and G. v. Bochmann. Testing deterministic implementa-
tions from nondeterministic FSM specifications. In Proc. of 9th International Workshop
on Testing of Communicating Systems (IWTCS’96), pages 125–140, 1996.
[20] T. Ramalingam, A. Das, and K. Thulasiraman. On testing and diagnosis of communi-
cation protocols based on the FSM model. Computer communications, 18(5):329–337,
May 1995.
[21] D. Saff and M. Ernst. Mock object creation for test factoring. In Workshop on Program
Analysis for Software Tools and Engineering PASTE 2004. ACM, June 2004.
[22] D. Stotts, M. Lindsey, and A. Antley. An informal formal method for systematic JUnit
test case generation. Lecture Notes in Computer Science, 2418:131–143, 2002.
[23] J. Tenzer and P. Stevens. Modelling recursive calls with UML state diagrams. In
FASE’03, volume 2621, pages 135–149. Lecture Notes in Computer Science, 2003.
[24] C. D. Turner. State Based Testing - A New Method for the Testing of Object-Oriented
Programs. PhD thesis, University of Durham, UK, 1995.
[25] C. D. Turner and D. J. Robson. The testing of object-oriented programs. Technical Report
TR-13/92, Computer Science Division, University of Durham, 1993.
[26] M. Vasilevskii. Failure diagnosis of automata. Cybernetics, Plenum Publ. Corporation,
NY, 4:653–665, 1973.
A Formal Model for Test Frames
A. J. Cowling
Department of Computer Science,
University of Sheffield,
Regent Court, 211 Portobello Street,
Sheffield, S1 4DP, United Kingdom
Email: A.Cowling@dcs.shef.ac.uk
Telephone: +44 114 222 1823; Fax: +44 114 278 0972
Abstract
Motivated by errors that students have been observed to make while learning to use the category-
partition test method, this paper describes a new model that has been developed for the concept of test
frames, by defining them in terms of the characteristic conditions identifying sets of test cases. This
model is illustrated by reference to the stream X-machine model of computation, but is also applicable
to other computational models. It is shown that the model can be applied to both functional and
structural test methods, and that it leads to a view of them as processes for generating sets of test
frames, which are complemented by processes that are based on the model for generating test cases
from these test frames. The paper discusses the need for test frames to identify representative sets of
test cases, and introduces the concept of structural continuity of specifications and implementations as
a way of capturing the requirements for test frames to be representative. The application of this
concept to primitive operations is discussed, and is shown to require a hierarchical approach that
reflects the structure of the process of integration testing.
Key Words and Phrases
Category-Partition Testing, Functional Testing, Structural Testing, Test Cases, Test Methods, Structural Continuity,
Integration Testing.
1. Introduction
The term “test frame” has many different meanings, including being a synonym for “test harness” or “test driver”, and also
being used to refer to test data or specific test cases for various kinds of applications that use concepts that they call
“frames”, such as link layer network protocols, digital image processors (eg for use with cameras, scanners, etc), and
window-based graphical user interfaces. While all of these meanings are important in their own specific contexts, this
paper addresses a more general context, in which the meaning of the term is derived instead from the work of Ostrand &
Balcer [1] on the category-partition method for functional testing. Here, they define the term to mean that “a test frame
consists of a set of choices from the specification, with each category contributing either zero or one choice”.
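As an illustration of this definition (a sketch only: the categories and choices shown are invented, and for simplicity every category here contributes exactly one choice rather than "zero or one"), test frames can be generated as the cross product of the categories' choice lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of Ostrand & Balcer's definition: a test frame is one
// choice per category, so the set of frames is the cross product of the
// categories' choice lists (category and choice names are invented).
public class FrameGeneration {
    public static List<List<String>> frames(List<List<String>> categories) {
        List<List<String>> result = new ArrayList<>();
        result.add(new ArrayList<>()); // start from a single empty frame
        for (List<String> choices : categories) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> frame : result)
                for (String choice : choices) {
                    List<String> extended = new ArrayList<>(frame);
                    extended.add(choice);
                    next.add(extended);
                }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> categories = Arrays.asList(
            Arrays.asList("input town known", "input town unknown"),
            Arrays.asList("route exists", "no route exists"));
        System.out.println(frames(categories).size()); // 4 test frames
    }
}
```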
The motivation for discussing this concept stems from experience with teaching the category-partition method to a number
of successive cohorts of students. This experience has indicated that there may be a structural weakness in the method as it
is currently defined, and that the role of test frames as it stems from this definition may be the key to this weakness. The
purpose of the paper is therefore to try to address this weakness by developing a new model for test frames that is more
formal than the one given by Ostrand & Balcer. In particular, part of the weakness that has been
identified in the current model is that its formulation depends heavily on the concepts of categories
and partitions, but these are currently only defined formally in the context of the test specification
for a specific system, and so the current models only support formal reasoning about the application
of these concepts to specific systems. Hence, one of the aims for this new model is that it should
provide more support for reasoning about the generic properties of test frames than the current models do.
To achieve these aims, the structure of the rest of the paper is as follows. Section 2 explains how this weakness in the
category-partition method has been identified from experience of teaching it, by describing the assignment that the students
have been required to work through in order to demonstrate that they can apply the method successfully, and identifying the
problems that some of them have found in carrying out this assignment. Section 3 then presents the model for test frames
that has been developed to try to overcome this weakness, and section 4 shows how this model applies to the role of test
frames in functional testing, while section 5 extends this to structural test methods. An issue that is raised by the need to
relate these two kinds of test methods is that of ensuring that a test case generated from a test frame will possess the
property of being representative of the test set that corresponds to such a frame, and section 6 discusses this issue from the
perspective of functional test methods, by introducing a concept that will be called structural continuity. Section 7 shows
how this concept can be extended to structural test methods, and section 8 discusses the implications of this concept for the
process of integration testing. Finally, section 9 summarises the conclusions from this work and discusses possible future
extensions of it.
2. Experience of the Category-Partition Method
Like any topic in software engineering, software testing is a practical subject, and so a practical approach must be taken to
teaching it [2], which means that students of it need to spend a significant amount of their study time in actually carrying
out the testing of pieces of software. In this particular case, which is a course taught to final year undergraduates and
masters students, the pieces of software to be tested had been produced by second-year undergraduates as part of a course
in data structures and algorithms that the author had taught previously [3]. The requirements for these pieces of software
were given by the following scenario, which will be used as the example for the rest of the paper.
“A number of towns have express bus services between them, which simply run from one town to another: for
this purpose it is assumed that there are no through services which stop at intermediate towns, so to plan a
journey from one town to another may involve several changes of bus at the intermediate towns. The basic
requirement is therefore to produce a program which will perform two main functions, as follows.
The first main function will be to read in details of the towns that are served by buses, the pairs of towns
between which buses run, and the distances between each of these pairs. This could in practice be extended to
also allow a set of data which had been read in to be written out to a file, so that subsequently data could be
read either from the keyboard or from a file: such an extension might help simplify the eventual testing of the
system.
The second function will be to read in the names of two of the towns, determine whether there are any routes by
which someone could travel from one to the other by bus, and if there are to print out the route that involves the
minimum number of intermediate changes (or routes if there are several of them), together with the total
distance for each such route.
Since the main emphasis in this assignment is on the construction of the necessary data structures, the user
interface for the system should be made as simple as possible (where ‘simple’ is intended to mean ‘simple to
program’, rather than ‘simple to use’). To facilitate this, all town names will be represented throughout by two-
letter abbreviations (eg you might want to use SH for Sheffield, LE for Leeds, YO for York, DN for Doncaster,
BA for Barnsley, etc.). All distances will be expressed as whole numbers. If it helps, you can also assume that
most towns will only have bus services to up to four other towns, although there may be a few towns that have
services to more.”
Students taking this previous course had produced some five working systems that they were prepared to make available for
use in the software testing assignment, although (not surprisingly) there were significant variations between these systems
in the ways that their authors had interpreted the requirements. Thus, the software testing assignment required the students
firstly to select one of the candidate systems, and undertake some exploratory testing in order to determine how its authors
had interpreted the requirements. Then, the students (who were encouraged to work in groups of two or three) were
required to apply the category-partition method to each of the two main functions that were identified in the scenario, in
order to produce a test specification for each function and (using a locally-written tool) generate the corresponding sets of
test frames. Each individual student was then required to select ten test frames for the route finding function, and actually
carry out the testing for test cases produced from these frames. Finally, each group was required to produce and submit a
report on the work that it had done, and evaluate both their work and the methods that they had used.
This structure for the assignment has been in use for a number of years, and apart from minor refinements it has needed
little change, as in general students have been able to cope with it well and produce reasonable test specifications, even if
some of them have been less thorough than others. On the other hand, one mistake has been observed to occur fairly
persistently, even if not particularly frequently. This mistake arises in trying to create categories for the routes that are
output, where some students correctly identify the total distance for each route as one of the output parameters for this
function, but then (following the pattern that they have used for at least some of the inputs) identify a category concerned
with the validity of this distance. This they interpret as meaning whether or not the distance that is output is actually the
sum of the distances for the individual segments of the route, which leads them to suggest that there should be two
partitions for this category: one for the case where the distance is correct, and the other for the case where it is not.
The reason for classing this as a mistake is simply that, unless the exploratory testing has already identified cases where the
system to be tested does actually calculate the total distance wrongly, it will be impossible for the students to actually create
test cases corresponding to any test frame that involves the partition “total distance is incorrect”. In terms of the concept of
domain testability, as described by Freedman [4], the nature of this mistake is that test specifications have been created that
do not possess the property of domain testability, because the domain that is being defined by this particular category is not
controllable. As such it contrasts with the informal specification (as given above), where there is nothing that would imply
a lack of controllability (or, for that matter, a lack of observability either), so that this lack of controllability in the test
specification actually constitutes an inconsistency between the test specification and the system specification from which it
has been derived.
This inconsistency thus constitutes a fault in the test specification, but the most interesting aspect of it is the nature of the
mistake that had led to this fault occurring. On the first few occasions when this mistake was observed it was assumed that
it was simply due to the students not having understood the method properly, but as the mistake has continued to recur, so it
has begun to raise the question as to whether the problem is actually a structural one with the category-partition method (or,
at least, with the way in which it is normally described in the literature), rather than simply being a consequence of the
students not having understood it fully.
Reflecting on this issue has led to a number of hypotheses being proposed for the precise nature of the problem with the
method. The common feature of all of these is that there are some aspects of the method which are somehow being masked
by the way in which the method is normally presented, with the consequence that some of its underlying theoretical aspects
are not being properly appreciated by at least some of the students, including those who have made this particular mistake.
The most fundamental candidate for such an unappreciated aspect of the method appears to be the causality that is implicit
in the input-output relation for a system, meaning that these students at least (and maybe others as well) are not properly
appreciating the significance of the fact that the inputs cause the outputs.
In particular, it appears that some of the students are not properly appreciating that pre-conditions for inputs (such as
“being valid”), which may or may not hold depending on what data is supplied, have a different significance from the post-
conditions for outputs, where the responsibility for ensuring that they hold rests entirely with the system itself, and not with
the external environment in which it is run and from which the inputs are supplied. This difference means that, while it is
reasonable to expect a system to check whether its inputs satisfy the preconditions, and deal sensibly with cases where they
are violated, for the post-conditions the only responsibility on the system is to maintain them, and (except in the rare cases
where there could be good reasons for the system not being able to maintain these post-conditions, which are outside the
scope of this paper) it is not reasonable to expect the system to check whether they might have been violated, or to take
action if such violations have actually occurred.
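This distinction can be made concrete with a small sketch (in Python; the function, its names and its data are invented for illustration, and are not part of the scenario as specified): the input precondition is checked because the environment may violate it, while the output post-condition is maintained by construction and is never re-checked.

```python
# Hypothetical sketch: a route-distance function that checks its input
# precondition, but merely maintains its output post-condition (total
# distance = sum of segment distances) rather than checking it afterwards.

def route_distance(segments, town_map):
    # Precondition check: the environment may supply invalid data,
    # so the system is responsible for detecting it (and is testable here).
    for start, end, dist in segments:
        if start not in town_map or end not in town_map:
            raise ValueError("town not in map")
    # The post-condition holds by construction; no self-check is performed,
    # so no test input can make "total distance is incorrect" occur.
    return sum(dist for _, _, dist in segments)
```

Since the post-condition holds by construction, no choice of input data can produce an incorrect total, which is exactly why a partition such as "total distance is incorrect" yields untestable frames.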
A consequence of this, which also needs to be appreciated, is how this issue of what responsibility the system has for
checking conditions affects the kinds of test cases that should or should not be constructed for a system. If violations of a
condition can occur in the environment, so that the system is responsible for checking for them, then test cases are needed
to ensure that these checks are actually being carried out correctly, and that the system is taking appropriate actions both
when the checks are passed and when they are failed. If the system is not responsible for checking for particular
conditions, however, as is the case when it can be expected to be constructed so as to maintain them itself, then it is
pointless trying to test whether such checking is being carried out: any such checking would be outside the specified scope
of the system, and so any attempt to create test cases to determine whether such unrequired checks might fail would (as in
this particular example) simply lead to tests for which it would be impossible to construct appropriate data.
In the scenario for this particular example, there is an alternative view that might be taken of this aspect of ensuring post-
conditions, namely that the property of the total distance for a route being the sum of the distances for the individual
segments of it is an invariant that needs to be maintained. As with a post-condition, though, this means that the only
responsibility on the system is to maintain the invariant, and (in general) that it does not have any responsibility for
checking whether it has been maintained. Consequently, this alternative view still leads to the same conclusion, that since
the specification of the system does not require it to check whether this condition has been maintained, it will be impossible
to use the specification in trying to create the test data that might be needed for any test case which assumed that such a
check would be conducted and would fail.
At a more theoretical level, another aspect of the method that the students who make this mistake may not be appreciating
properly is the relationship between the partitions, the test frames that are constructed from them (by the tool) and the test
data that must then be selected to create the test cases corresponding to each of the test frames. This relationship means
that test frames can only be regarded as valid if test data can be constructed from them, and similarly that partitions can
only be regarded as valid if the test frames constructed from them are valid. Consequently, the test frames that they are
creating by specifying partitions such as “total distance is incorrect” are actually invalid ones, because there is no sensible
corresponding test data that can be selected for them, rather than simply being less meaningful than those frames for which
data can be selected.
Part of the reason for not appreciating this aspect of the method may well be that the usual descriptions of the method rather
gloss over the steps in the process where the frames are generated from the test specification (particularly since there is a
tool available for doing this step), and where the test data is selected from the frames. Of course, the original developers of
the method subsequently extended both it and its supporting toolset [5], to cover the specification of actual test data for test
frames as well as the generation of the frames, but we did not have this toolset available. If we had, then maybe its
emphasis on the need to define appropriate actual test data for test frames might have helped to avoid the problem. Given that
we did not have this toolset available, though, then part of the reason why some students are not appreciating these aspects
properly may also be that (as indicated in the introduction) the models underlying them are actually not very clear. In
particular, this is because these models are described in terms that are specific to the individual systems for which
categories and partitions are being identified, rather than in terms that are independent of the specific details of individual
systems.
This, therefore, leads to the goals for the work described in the rest of this paper, since a possible way of addressing this
problem would be to find an alternative model for the concepts of partitions and test frames, particularly if this could be
formulated so as to allow some formal reasoning about their properties at a generic level, rather than at the level of the test
specification for an individual system. A second goal is that, if such a model could be constructed, then it should also
allow a more precise description of the processes by which test frames are constructed from partitions, and by which test
data is selected to correspond to test frames, and so it should help to make these aspects of the method clearer. The third
goal is then related to this, in that it would be desirable for such a model and its associated description of the process to
also put more emphasis on the role of the input-output relation for the system, so as to help clarify this aspect too.
3. Functional Test Frames
Given these goals, the model that has been developed for test frames is derived from the observation that a test frame
specifies a set of possible test cases, and does so in terms of ranges of test data that meet criteria derived from the different
partitions being combined within that test frame. Hence, a test frame can be defined to be the characteristic condition that
specifies such a set of test cases, so that any test case satisfying this characteristic condition can be said to satisfy that test
frame. Here, the condition may involve ranges of values for input parameters to the system, outputs from the system
(which for this purpose will also be referred to generally as parameters), or possibly both. The original description of the
category-partition method also identified a third kind of parameter, known as an environment condition, but more recently
Ostrand has indicated [6] that this can best be understood in terms of data that is input to the system, with an updated
version of it then being output, and so such parameters can simply be treated as appearing as both inputs and outputs.
To make this definition more precise requires some formal model of the computation that is performed by a general system,
and for this purpose the model that will be used here for illustration is that of the stream X-machine, although the model of
test frames is not inherently dependent on the X-machine model. Indeed, it could equally well be built on top of other
models of computation, but the reason for choosing the X-machine model is that it has already been shown to be a very
successful one for obtaining useful results concerning the power of software testing methods [7, 8].
A stream X-machine consists of a finite-state machine that models the overall control structure of a computation, such as
the input of successive data items or the output of successive results, and this machine is known as the associated
automaton of the X-machine. To produce the X-machine model it is augmented with a memory that contains the working
data of the computation, and the steps in the computation are represented by functions that can be applied to this memory,
so as to read it and update it. Each function also reads the next input to the machine and produces the next output. Then, at
each step in the operation of the machine the choice of which function to apply is determined by the current control state of
the machine, and which of the functions have their preconditions satisfied by the current input and memory values.
There are a number of variants of the formulation of the X-machine model, which are reviewed in [9]: the one that will be
adopted here is that an X-machine Λ can be represented as a tuple Λ = (Σ, Γ, Q, M, Φ, F, q0, T, m0), where:
Σ is the input alphabet of the machine, which defines the type(s) of values that can be input to it;
Γ is the output alphabet of the machine, which defines the type(s) of values that can be output by it;
Q is the set of control states of the machine (ie the states of its associated automaton);
M is the set of possible values of the memory of the machine;
Φ is the set of processing functions that the machine can apply to the memory, inputs and outputs;
F is the next state function, which defines the next control state for each control state and processing function;
q0 is the initial control state of the machine;
T is the set of terminal control states of the machine, which must be a subset of Q; and
m0 is the initial memory value of the machine.
Here, it should be noted that the memory of the machine will usually be structured as the Cartesian product of a number of
components, corresponding to the different variables used in the computation, but for this purpose it is not necessary to
consider the details of this structure further. All that needs to be noted is that the machine starts in the control state q0 and
the memory state m0, and with an input stream s* in Σ*, and an empty output stream.
Then, at each step in the operation of the machine a processing function φ is selected from Φ, such that φ (m, σ) is defined
where σ is the head of the stream s* and where F is defined for the pair (q, φ). Once this function φ has been chosen, the
element σ is removed from the stream s* and the function φ is applied to yield a pair (m’, γ) = φ (m, σ). This pair is used to
update the memory value to m’, while γ is appended to the output stream g* in Γ*, and also the control state is updated to
the new value q’ = F(q, φ). These processing steps continue until both the input stream is empty and the machine enters a
control state that is in T, when the machine terminates, having computed the output stream g* from the input stream s*.
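The stepping rule just described can be sketched executably as follows (a Python illustration only; the toy machine, which keeps a running total of integer inputs, is invented, and only the select/apply/update cycle follows the definition above):

```python
# Sketch of the stream X-machine semantics: at each step a processing
# function phi defined on (memory, input) is chosen, the input is consumed,
# the output appended, and the memory and control state updated.

def run_x_machine(sigma_stream, phis, F, q0, m0, terminals):
    q, m, out = q0, m0, []
    s = list(sigma_stream)
    while s:
        sigma = s[0]
        # choose a phi applicable to (m, sigma) for which F is defined on (q, phi)
        phi = next(p for p in phis
                   if p(m, sigma) is not None and (q, p.__name__) in F)
        s.pop(0)
        m, gamma = phi(m, sigma)      # (m', gamma) = phi(m, sigma)
        out.append(gamma)             # gamma appended to the output stream
        q = F[(q, phi.__name__)]      # q' = F(q, phi)
    assert q in terminals, "machine halted outside a terminal control state"
    return out, m

# A toy machine: one control state, one function that adds each input to
# the memory and outputs the running total.
def add(m, sigma):
    return (m + sigma, m + sigma)

F = {("Running", "add"): "Running"}
outputs, final_m = run_x_machine([1, 2, 3], [add], F, "Running", 0, {"Running"})
```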
Figure 1. The associated automaton of the X-machine for typical functional systems. (States: Running, Quit; transitions labelled Function 1 … Function n.)
For a system such as the one specified in the scenario, which essentially just provides a number of functions that can be
invoked in any order, the next state function F will typically have the form illustrated by the state transition diagram in
figure 1. For such systems it will often be the case that T will be a singleton set, whose sole element will be the control
state that corresponds to reaching the end of the program. For testing a system of this kind, though, where the focus is
simply on verifying the behaviour of the individual functions, it is often convenient to regard other states (such as the state
Running in figure 1) as also being terminal, since the tester may well either not be interested in any further computation
steps, or may wish to go on to perform another test without having to terminate and restart the system.
This convention, of regarding as terminal any control state that corresponds to the end of a particular test case, also applies
to those systems where a significant part of the testing is concerned with verifying that specified sequences of control states
and transitions between them are implemented correctly. This kind of state-based testing, which includes conformance
testing [10, 11], is a more complex process than the function-based testing that is being discussed here, although it
obviously includes function-based testing as an important part, since the validity of the results obtained from it depends on
the individual processing functions Φ of the X-machine model being implemented correctly. This paper,
though, is just concerned with the function-based part of the testing activity, and so possible extensions to state-based
testing will be regarded as beyond its scope.
Given this model, then in principle the characteristic condition of a test frame f could involve any of the elements of the
tuple, but for functional test frames we shall just be concerned with three possibilities:
f involves only elements s* of Σ*, in which case it will be termed an input-only frame;
f involves only elements g* of Γ*, in which case it will be termed an output-only frame; or
f involves both elements s* of Σ* and elements g* of Γ*, in which case it will be termed an input-output frame.
Structural test frames will be considered in section 5, but these possibilities mean that the most basic form of functional test
frame will simply specify values or ranges of values for some elements of s*, g* or both. It should be noted that such basic
test frames will usually correspond to one partition of a single category, but this is not inconsistent with the original
informal definition of a test frame, since although this could be taken as implying that a frame would usually consist of
combinations of partitions from a number of categories, it did not actually require this, and so a choice of a single partition
from a single category would in fact be permitted by that definition.
In practice, of course, one wishes to construct more elaborate test frames, which will combine partitions from a number of
categories. Syntactically these can be constructed from the basic test frames using the usual operators of boolean algebra,
so that if f1 and f2 are any test frames, then f1 ∧ f2, f1 ∨ f2, and ¬f1 are also all test frames. In particular, if f1 is a test
frame that represents some partition of one category c1, and f2 is a test frame that represents some partition of a different
category c2, then f1 ∧ f2 is the test frame that represents the combination of these two partitions from the categories c1 and
c2.
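This algebra of frames can be sketched directly (Python, with invented category and partition names): a basic frame is a predicate over test cases, and the operators build compound frames from basic ones.

```python
# Sketch of test frames as characteristic conditions: a basic frame is a
# predicate over test cases, and frames combine under the usual boolean
# operators. The category/partition names here are invented for illustration.

def frame_and(f, g):
    return lambda tc: f(tc) and g(tc)

def frame_or(f, g):
    return lambda tc: f(tc) or g(tc)

def frame_not(f):
    return lambda tc: not f(tc)

# Basic frames, each representing one partition of one category:
fa = lambda tc: tc["start_town"] in tc["map"]     # input-only frame
fb = lambda tc: tc["intermediates"] == 1          # output-only frame

combined = frame_and(fa, fb)   # input-output frame combining both partitions
tc = {"start_town": "A", "map": {"A", "B", "C"}, "intermediates": 1}
```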
Of course, there is a possibility that the characteristic conditions of two frames f1 and f2 may conflict, so that there will
actually be no data that could satisfy the condition f1 ∧ f2, and any frame for which there is no data satisfying the
characteristic condition is said to be infeasible. The simplest kinds of conflict that could produce such infeasible frames are
where two categories c1 and c2 both apply to the same parameter (whether input or output), and define properties of that
parameter that are related by a constraint that prohibits the combination of the two partitions f1 and f2. More general forms
of conflict can arise whenever the specification of the system contains a constraint between two categories c1 and c2 that
apply to different parameters, such as the dependency of an output on an input: if this constraint means that the
combination of f1 and f2 is prohibited, then the frame f1 ∧ f2 will be infeasible.
This can be illustrated easily from the example, since two of the categories needed in the test specification of the route
finding function relate to the validity of the start and end town names respectively, and for each of these categories two of
the partitions would be that the named town either is or is not in the map of the bus network. Other partitions might be that
the input characters do not even constitute a possible town name, because they are not letters, or these could be separated
into categories for syntactic validity, as opposed to the semantic validity that derives from being contained within the map.
Other categories in this test specification will relate to properties of the route that is to be output, such as the number of
alternative routes and the number of intermediate towns, and for the latter the obvious partitions will be none, one and more
than one. Hence, we might have two basic frames:
f1 = “start town not in map”, and
f2 = “number of intermediate towns = one”,
but it should be clear that the frame f1 ∧ f2 will be infeasible, because if the start town is not in the map then the function
should find no routes at all, and so there certainly can not in this case be a route found with an intermediate town.
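This infeasibility can be checked mechanically on a small enough domain (a Python sketch; the three-town map and the route finder are invented for illustration): enumerating every candidate test case shows that none satisfies f1 ∧ f2.

```python
# Brute-force feasibility check for the combined frame f1 ∧ f2 over a tiny
# enumerated domain. The map and route finder are invented for illustration.

import itertools

town_map = {"A": ["B"], "B": ["C"], "C": []}   # A -> B -> C

def routes(start, end):
    """All routes from start to end in the toy acyclic map ([] if start unknown)."""
    if start not in town_map:
        return []
    if start == end:
        return [[start]]
    return [[start] + r for nxt in town_map[start] for r in routes(nxt, end)]

def intermediates(route):
    return len(route) - 2

# f1 = "start town not in map"; f2 = "number of intermediate towns = one"
def satisfies_f1_and_f2(start, route):
    return start not in town_map and intermediates(route) == 1

cases = [(s, e, r) for s, e in itertools.product(["A", "X"], ["C"])
         for r in routes(s, e)]
feasible = any(satisfies_f1_and_f2(s, r) for s, e, r in cases)   # False
```

The enumeration produces no case at all for an unknown start town (the route list is empty), so no test case can ever satisfy both frames at once.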
In principle this way of building up frames can obviously be extended until one has combined partitions from all the
categories, to produce what have sometimes been called complete test frames. In doing this, though, it is important to
avoid the construction of infeasible test frames, since (as noted in section 2) any infeasible frame is essentially invalid,
because it represents conditions that can not actually be tested. Hence, any process for constructing test frames must check
whether they are feasible, and take action to avoid producing them if they are not. This is why test specifications need to
represent the constraints between the different partitions and categories, in order to specify that if certain partitions have
been chosen, then either certain partitions from other categories must not be combined with them, or possibly even the
other categories must be ignored completely, in the sense that no partition from them may be included in the combination.
One effect of these constraints is to impose a hierarchical structure on the set of categories, and indeed Grochtmann and
Grimm’s classification tree method [12] is based on explicitly identifying such hierarchical structures. A classification tree
then contains a leaf node for each individual test case, and paths down the tree correspond to combinations of the partitions
represented by the intermediate nodes through which they pass. For instance, a part of the classification tree for this
example would be as shown in figure 2, where sets of edges are labelled with the relevant categories, and nodes are labelled
with the partitions drawn from these categories. From this it can be seen that, in order to avoid constructing infeasible test
frames, what is required here instead of a complete test frame is what might be called a sufficiently complete test frame,
meaning one that in terms of the classification tree model incorporates partitions from all the categories that are represented
by a particular path from the root of the tree to one of its leaf nodes.
Figure 2. Part of the classification tree for the route finding function. (Categories: semantic validity of towns, with partitions "start town not in map", "end town not in map" and "both towns in map"; number of intermediate towns, with partitions "none", "one" and "more than one".)
Unfortunately, from the perspective of the goals for the model being developed here, describing the concept of sufficient
completeness in this way has the weakness that any classification tree structure will be specific to a particular test
specification. Of course, there are generic aspects to such tree structures, as illustrated by their use in the method of test
templates developed by Stocks and Carrington [13, 14], but even this method can only take the generic aspects so far.
Beyond that point the construction of further levels of the tree structure has to reflect the specific properties of the
specification of the system, which in their examples they assume to have been defined in Z [15]. Also, the choice relation
framework that was developed by Chen et al for the category-partition method [16] defines some generic properties of the
relationships between choices from different partitions (namely whether one is fully embedded in another, partially
embedded or not embedded), but again the application of these relationships to any given system specification is wholly
specific to that system.
Thus, while in principle the model being developed here needs the notion of a test frame being sufficiently complete, it is
not appropriate to define this formally in terms of specific constructions such as classification trees, test templates, or the
sets of constraints in a test specification. For the moment, therefore, it will be assumed that it is going to be possible to
define this idea formally, so that testing methods can be described in terms of constructing sets of sufficiently complete test
frames, which will enable attention to be turned to the way in which these frames are used in the process. Then, the
problem of actually defining sufficient completeness will be returned to in section 6, once the application of these ideas to
structural testing has also been considered.
4. The Process of Functional Testing
The application of this model for test frames to the process of functional testing depends on two key notions. One of these
is that all significant functional test methods operate by generating a set of input-output test frames, although the validity of
this notion can only be demonstrated informally. The previous section has effectively demonstrated that this notion applies
to the category-partition method and others based on it, since the model of test frames that has been created is consistent
both with Ostrand & Balcer’s method and with the classification tree method. As functional test methods are usually described (for
instance by Roper [17]), many of them are essentially contained either within these two, or the method of cause-effect
graphing, to the extent that we will claim that any others are not significant. Then, it should be obvious that this notion
applies to the cause-effect graphing method too, since causes and effects are defined in terms of logical properties that
apply to the inputs and outputs respectively, and so map directly into the characteristic conditions of test frames that
correspond to them. Since the decision tables used in this method then simply produce test frames by constructing
disjunctions of the causes and effects, these too map directly into the structures of test frames as they are being defined in
this model, and so this method also generates a set of input-output test frames as defined here.
The other key notion can be defined formally, and this is the notion that any specification of a system will need (amongst
other things) to define the relationship between the inputs that can be supplied to that system and the corresponding outputs
that will be produced. In the case of the stream X-machine model this relationship is defined formally as the relation
consisting of all possible pairs of the form (s*, g*) where the machine computes g* from s*. If the machine is deterministic
then this relation will in fact be a function with type Σ* → Γ*, so that if this function is denoted Spec then the operation of
the machine can be described as g* = Spec (s*).
Thus, given these two key notions, when a functional test method has produced a set of input-output test frames, the
process of generating the corresponding set of test cases can be defined in terms of four steps, as follows.
1. Each input-output frame F in this set is considered separately, and it is decomposed into a pair of frames: an
input-only frame Fi and an output-only frame Fo, such that F = Fi ∧ Fo.
2. From the output-only frame Fo an equivalent input-only frame Fe is constructed by forming the relational
image of Fo under the inverse of the function Spec. Here (as in the Z notation), the postfix symbol ~ is used
to denote the operation of constructing the relation that is the inverse of a function or relation, and the special
brackets ⦇ ⦈ are used to denote the operation of forming the relational image, so that this construction of Fe can
be written as Fe = Spec~ ⦇Fo⦈.
3. A new input-only data frame Fd is constructed, as Fd = Fi ∧ Fe, and then if Fd is feasible any set of input data
d that satisfies Fd can be selected as the input for the test case.
4. Finally, the expected output from the test case must be determined, and this can be constructed as Spec (d).
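Over a finite input domain the four steps can be carried out by enumeration (a Python sketch; Spec and the frames are invented for illustration, with the relational image of Fo under the inverse of Spec computed as the set of inputs whose outputs satisfy Fo):

```python
# Sketch of the four-step process on a finite domain. Spec, the domain and
# the frames are invented for illustration only.

def Spec(s):
    return abs(s)            # a deterministic input-output function

domain = range(-5, 6)

# Step 1: decompose the input-output frame F into Fi and Fo.
Fi = lambda s: s <= 0        # input-only frame
Fo = lambda g: g >= 3        # output-only frame

# Step 2: Fe = relational image of Fo under the inverse of Spec,
# i.e. the inputs whose outputs satisfy Fo.
Fe = {s for s in domain if Fo(Spec(s))}

# Step 3: Fd = Fi ∧ Fe; if feasible, select any datum d satisfying it.
Fd = {s for s in Fe if Fi(s)}
d = min(Fd)

# Step 4: the expected output for the test case is Spec(d).
expected = Spec(d)
```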
In principle, therefore, this process defines formally how the set of test frames produced for any system by any functional
test method is converted into the equivalent set of test cases, which is a step that is usually not defined precisely in the
descriptions of the test methods themselves. As described in the previous section, though, an important issue in this
definition of the process is what should happen at the third step if the constructed data frame Fd is infeasible, since while
this obviously means that a test case can not be generated for this frame, the fact that the frame is therefore invalid also
means that some error has occurred earlier in the operation of the test method. Thus, the generation of such an infeasible
frame could indicate simply that the person who constructed the test specification for the system had not understood
correctly some aspect of the original system specification, and that therefore the test specification needed to be revised to
try to eliminate the infeasible frame. Alternatively, though, the infeasibility may indicate that some consequences of the
original system specification have not been defined clearly, so that while there might be an inconsistency between it and the
test specification, this would not necessarily indicate an error on the part of the person who had subsequently constructed
the test specification, but might just be a consequence of the weakness in the system specification.
In terms of this process, there are actually three possible situations that could give rise to an infeasible data frame Fd:
firstly Fi could be infeasible, secondly Fe could be infeasible, and thirdly Fi and Fe could both be feasible on their own,
but they could conflict. The simplest of these situations is the one where Fi is infeasible, since this clearly indicates that the
test specification did not correctly reflect the actual constraints on what inputs can be supplied to the system, so that the test
method produced a combination of partitions that could not actually occur in practice, and that should therefore have been
eliminated during the operation of the method. In this situation, therefore, the tester needs to go back and modify the test
specification, and then rerun the test method.
The same is also true of the situation where Fi and Fe are both feasible, but conflict. This indicates that the test
specification has produced a legal combination of input partitions, and a legal combination of output partitions, but that the
combination of the two is not legal. Hence, the test specification has not properly modelled the way in which this particular
combination of input partitions should be reflected in the properties of the outputs from the system, and so has allowed this
combination of input partitions to be combined in a test frame with a combination of output partitions that actually could
not result from that input. Again, therefore, the remedy is that the tester needs to go back and modify the test specification,
and then rerun the test method.
The remaining situation is the one where Fe is infeasible, which means that there are no inputs that could cause the system
to produce outputs satisfying the conditions of Fo. While this could indicate a simple error in the test specification, this is
not the only possible explanation, as it is very common for the specification of a system to be written in such a way that,
within what is notionally defined as the range of its possible outputs, there are some values that may actually never occur in
practice, so that the system may therefore not be completely controllable over the whole of its notional output domain.
Furthermore, the more complex a system is, the more likely it is that such “holes” in the output domain may occur, but they
may well not be defined directly in the system specification, and their existence may not be easy to analyse. If such “holes”
do exist, though, then any combination of output partitions Fo that is contained entirely within one of them will result in a
frame Fe that is infeasible. In such a situation, though, it may be necessary to work backwards through the system
specification from the infeasible frame in order to identify the “hole”, before the tester can then revise the test specification
so as to exclude the generation of such a combination of output partitions.
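On a finite domain, such "holes" can be found by brute force (a Python sketch; Spec and the domains are invented for illustration): enumerate the reachable outputs and subtract them from the notional output domain.

```python
# Sketch: locating "holes" in the notional output domain by enumeration.
# Spec and both domains are invented for illustration.

def Spec(s):
    return 2 * s                       # this system only ever outputs even values

notional_outputs = set(range(10))      # what the specification notionally allows
reachable = {Spec(s) for s in range(5)}
holes = notional_outputs - reachable   # outputs that can never actually occur
```

Any output-only frame Fo contained entirely within `holes` would yield an infeasible Fe, just as described above.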
In the example the operation of this extended process is that the partition “total distance is incorrect” produces an output
frame Fo that violates the specification Spec, and so the constructed input frame Fe = Spec~ ⦇Fo⦈ is infeasible. In analysing
the cause of this infeasibility (which is not a step that is required by the category-partition method as normally described) it
will then become very obvious that the partition that gives rise to Fo does lie outside the range of possible outputs of the
system as specified, which highlights that this partition must be illegal, and so can not just be ignored, because any test
specification for this system which contains such a partition must be erroneous. Hence, this form of the process makes it
much clearer than was previously the case that a test specification for this system that contains such a partition needs to be
revised, so as to remove the erroneous partition, and consequently also remove the category that would give rise to it.
The other issue that arises from this treatment of infeasible frames is that in principle the inverse relation Fe = Spec~ ⦇Fo⦈
may not necessarily be computable, so that in theory it may not be possible to determine whether or not a frame Fe is
feasible. In practical testing, though, this is not a significant issue, since the important step is to be able to find at least one
set of values that satisfy Fe. Typically, rather than actually trying to compute Spec~, a tester will approach this step by
using a method that can be described as successive approximations, in which each approximation consists of applying Spec to
some likely set of input values, to determine whether they produce a result that satisfies Fo. If such a set of input values
can be found, then Fe must be feasible; but if the tester fails to find such a set of values, then in practice they will have to
treat Fe as being infeasible, and so the question of whether or not it is theoretically infeasible will become irrelevant.
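The successive-approximation tactic amounts to a simple search (a Python sketch; Spec and the candidate inputs are invented for illustration):

```python
# Sketch of successive approximations: rather than computing the inverse of
# Spec, try likely inputs in turn and keep the first whose output satisfies Fo.

def Spec(s):
    return s * s

def find_input(Fo, candidates):
    """Return an input whose output satisfies Fo, or None if the search
    fails (in practice the frame is then treated as infeasible)."""
    for s in candidates:
        if Fo(Spec(s)):
            return s
    return None

hit = find_input(lambda g: g > 20, range(10))   # some square exceeds 20
miss = find_input(lambda g: g < 0, range(10))   # squares are never negative
```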
5. Structural Testing
The common feature of all structural test methods is that each test case can be characterised in terms of the set of structural
elements that its execution covers, where different test methods are distinguished primarily by the different definitions that
they use of what constitutes a structural element. This feature is then extended to characterizing a test set in terms of the
union of the sets of structural elements that the individual test cases cover, and typically the goal of such test methods is to
build up a test set for a system that covers some specified fraction (often 100%) of the structural elements that make up the
code of that system.
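The characterisation of a test set as the union of the structural elements covered by its individual test cases can be sketched as follows; the mapping covered_by and the element identifiers are assumptions of this illustration:

```python
# Minimal sketch of coverage as the union of the element sets covered by
# individual test cases, measured against the full set of elements.

def coverage(test_set, covered_by, all_elements):
    covered = set()
    for tc in test_set:
        covered |= covered_by(tc)          # union over individual cases
    return len(covered & all_elements) / len(all_elements)

# Toy example: elements are branch identifiers.
elements = {"b1", "b2", "b3", "b4"}
cov = {1: {"b1", "b2"}, 2: {"b2", "b3"}}
assert coverage([1, 2], cov.get, elements) == 0.75
```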
To extend the definition of test frames to structural testing, therefore, all that is required is to reverse the notion of a test
case covering a set of structural elements, so as to define the criterion for a structural test frame as being of the form that a
given set of structural elements must be covered. Then, any test case that covers this set of structural elements satisfies this
test frame. In practice, though, there is also an implicit expectation that any such test frame should be sufficiently tightly
90
UKTest 2005
defined (ie should require sufficiently many elements to be covered) that all test cases satisfying it should cover exactly the
same set of structural elements.
Given such a definition of structural test frames, we can again assert that structural test methods essentially operate by
producing a set of test frames, typically by an incremental process of analysing which structural elements are not covered
by the test cases satisfying these frames, and then adding further frames to cover these elements. The details of this
incremental process do not need to be discussed here, beyond noting that in order to do the required analysis of candidate
test frames a definition is needed of the relationship between the inputs that can be supplied to a system and the
corresponding structural elements that will be covered.
In applying these concepts to the stream X-machine model, the starting point is that structural elements can be defined in
terms of any of the components Q (the control states), M (the memory states), Φ (the processing functions) or F (the next-
state function), or of combinations of these components. The most comprehensive form of structural element is then the
execution path of the machine, which is usually denoted as an object of a type called Path. This type is defined as a
sequence of alternate states (ie pairs of control and memory states) and invocations of processing functions, where each
processing function is applied to the previous memory state and input, and takes the machine to the next control and
memory state (via the next state function), while producing the relevant output. For any deterministic stream X-machine
and input sequence s*, the corresponding path p is computed by a function that can be denoted Exec, of type Σ* → Path, so
that we can write p = Exec (s*). Coverage of elements corresponding to any of the more basic components Q, M, Φ or F
(or combinations of them) can then be obtained from the path p by projection.
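The computation p = Exec (s*) can be sketched for a deterministic stream X-machine as follows; the encoding of the machine as a single next_step callback, and the counter machine used as an example, are assumptions of this sketch and not part of the formal model:

```python
# Illustrative sketch of Exec for a deterministic stream X-machine: from a
# control state, a memory value and an input stream, build the execution
# path as alternating (control, memory) states and function invocations.

def exec_path(next_step, q0, m0, inputs):
    """next_step(q, m, s) -> (phi_name, q', m', output), or None if stuck."""
    path = [(q0, m0)]
    q, m = q0, m0
    outputs = []
    for s in inputs:
        step = next_step(q, m, s)
        if step is None:
            break                      # no applicable processing function
        phi, q, m, out = step
        path.append(phi)               # record the function invoked...
        path.append((q, m))            # ...and the state it leads to
        outputs.append(out)
    return path, outputs

# Hypothetical one-state counter machine with a single function "inc".
def counter(q, m, s):
    if s == "inc":
        return ("inc", "q0", m + 1, m + 1)
    return None

path, outs = exec_path(counter, "q0", 0, ["inc", "inc"])
```

Coverage of the more basic components Q, M, Φ or F can then be obtained by projecting from the returned path, as the text describes.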
Given this model, then the process of generating a set of test cases from the set of test frames produced by a structural test
method can be defined in terms of the following three steps, which are similar to those of the specification-based method.
1. Each frame F in this set is considered separately, and from it is constructed an equivalent input-only frame Fe,
which is defined as Fe = Exec~ (F).
2. If Fe is feasible then any set of input data d that satisfies Fe can be selected as the input for the test case.
3. Finally, the expected output from the test case must be determined, and as before this can be constructed as
Spec (d).
In this process the issue of feasibility arises because in any piece of code there will be constraints between the different
structural elements that mean that certain combinations of them can not be covered by the same test case. As an obvious
example, given any piece of code of the form
if condition
then block1
else block2
fi
that is not nested inside any loop construction, then it will be immediately apparent that any single test case can only result
in the execution of either block1 or block2, but not both. Consequently, any test frame that required coverage of structural
elements derived from both block1 and block2 would be infeasible. Typically, though, structural test methods do not
employ any form of test specification to capture such constraints, but rather they would rely on the iterative step of splitting
any such infeasible test frame into two frames, one relating to the structural elements derived from block1 and the other
relating to those derived from block2. The precise details of how such splitting of infeasible frames might be
accommodated within the test method will depend on the method itself, and in this paper it is not practical to try to discuss
this separately for each of the various structural test methods that have been proposed. The key point, though, is the
general one that this process of splitting frames is an iterative one, so that if one of the frames resulting from it is still
infeasible, then it can be split further, until eventually a set of frames is produced in which every individual frame is
feasible.
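The iterative splitting of infeasible frames can be sketched as follows; the representation of a frame as a set of required structural elements, and the feasibility and splitting callbacks, are assumptions of this illustration rather than part of any particular structural test method:

```python
# Sketch of iterative splitting: a frame demanding incompatible structural
# elements is split into smaller frames until every frame is feasible.

def split_until_feasible(frames, is_feasible, split):
    """split(frame) -> list of smaller frames, each covering a subset."""
    result = []
    work = list(frames)
    while work:
        f = work.pop()
        if is_feasible(f):
            result.append(f)
        else:
            work.extend(split(f))      # split further and re-check
    return result

# Toy example: 'block1' and 'block2' are the two arms of one conditional,
# so no single test case can cover both.
feasible = lambda f: not ({"block1", "block2"} <= f)
split = lambda f: [f - {"block1"}, f - {"block2"}]
out = split_until_feasible([frozenset({"block1", "block2", "x"})],
                           feasible, split)
```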
6. Representative Test Cases
Having shown that this model for test frames can be applied to both functional and structural testing, the next stage in
developing the model is to return to the issue of the completeness of test frames, and in particular to the aspect of
specifying how many partitions need to be included in a test frame. Underlying this issue is the observation that was made
originally by Bernot et al [18], that essentially any software testing is concerned with trying to make a generalisation, from
statements of the form “for some test case the system works correctly” to “for all test cases satisfying certain conditions the
system works correctly”. In terms of any system, and some test frame tf for it, this generalisation can be expressed as going
from
∃ a test case tc satisfying tf • system is correct for tc
to
∀ test case tc satisfying tf • system is correct for tc
which makes it immediately obvious that some additional justification is required to support the validity of the
generalisation.
In principle this additional justification will come from the property of test cases that, under some suitable criterion, it is
possible for one of them to represent some set of other similar tests, where the degree of similarity is such that it is indeed
reasonable to claim that if the system passes one of the tests in this set then it should pass all of them. Here the criterion
corresponds to what Bernot et al call the uniformity hypothesis, and it relates to the kind of test method being employed, so
that in a structural test method it will mean that the execution paths for each test case in the representative set are
sufficiently similar that each case will cover the same set of structural elements, whereas in a functional test method it will
mean that in some sense the functional behaviour of the system is the same for each test case.
In terms of test frames, the implication of this property is that a test frame can be treated as sufficiently complete if each
test case that satisfies it is representative of the whole set of test cases that satisfy it, so that we can define that the test frame
is a representative frame under the criterion being used. For structural test frames this property of them being
representative can, as indicated above, be expressed directly in terms of the similarity of the execution paths of the test
cases that satisfy the frame, but for functional test frames the equivalent notion that is required is that of similarity of
specified behaviour, and this is more difficult to define formally.
In terms of the category-partition method, another way of looking at the requirement for test frames to be representative is
that it is equivalent to requiring that each category and partition must be “small enough” that each legal combination of all
(or enough) of them must be representative. This in turn then requires that each of these combinations of the partitions
must capture just a single kind of behaviour, so that within any representative test frame there must not be any alternative
patterns of behaviour that might occur instead. On the other hand, for reasons of efficiency, we do not wish to achieve this
by making the categories or partitions smaller than they need to be, so that there should not be any cases where two
different test frames actually correspond to the same behaviour of the system.
In order to capture this property in a formal model, there needs to be a way of representing this requirement for a set of test
cases to be representative, namely that the behaviour of the system must be uniform for each case in the set. To model this
requirement, a concept is introduced that will be called structural continuity. Informally, the basis of this concept is that
system specifications or implementations, or components of them such as conditions, expressions or statements, can be
regarded as structurally continuous over any part of their domain in which their behaviour is in some sense uniform, but
that any boundary within this domain where the behaviour changes (ie from one alternative to another) represents a
structural discontinuity. In the case of specifications this can be formalised by observing that any deterministic
specification can be expressed in a clausal form, in which the set of alternative clauses can be written as
clause1 else clause2 else ... clausen
and where each clausei has the form
if PartitionCombinationi then Expressioni
and where the various conditions that are denoted PartitionCombinationi are all mutually exclusive.
The simplest case of such a clausal form specification is one that only requires a single clause, and where the expression in
the clause does not involve any alternatives either. In this case the expression is implicitly structurally continuous, and so
such a specification is defined to be structurally continuous. By contrast, if the expression does involve alternatives, in a
form which means that in order to cover all the possible behaviours allowed by the partition combination it should itself be
expressed in clausal form using more than one clause, then the specification is said to be structurally discontinuous, and
similarly any specification that requires more than one clause is in principle structurally discontinuous too. In practice, though,
there is an important intermediate situation, which is a specification that requires a number of clauses, but where the
expressions in each of the clauses are all structurally continuous. Such a specification is defined to be piecewise
structurally continuous.
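The clausal form described above can be sketched directly in code; the representation of clauses as (guard, expression) pairs, and the abs example, are assumptions of this sketch:

```python
# Sketch of a specification in clausal form: mutually exclusive
# partition-combination guards, each selecting a single structurally
# continuous expression.

def clausal_spec(clauses, x):
    """clauses: list of (guard, expression); guards mutually exclusive."""
    for guard, expr in clauses:
        if guard(x):
            return expr(x)
    raise ValueError("no clause applies: specification incomplete")

# abs as a piecewise structurally continuous spec with two clauses, each
# clause's expression (x and -x) being structurally continuous.
abs_spec = [(lambda x: x >= 0, lambda x: x),
            (lambda x: x < 0,  lambda x: -x)]
assert clausal_spec(abs_spec, -3) == 3
```

Here each guard corresponds to a partition combination, and so, as the text notes, to a test frame for the system.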
The significance of defining structural continuity in this way is that, in such a clausal form specification for a system, then
given the way in which test frames have been defined, each partition combination will be a test frame for that system.
Consequently, if a specification is piecewise structurally continuous, then (because the corresponding expression in each
clause involves no alternative behaviours), each of its partition combinations must therefore constitute a test frame for the
system that will be representative under the criterion for functional testing. This therefore provides the formal equivalent
for the informal notion of sufficient completeness that was introduced in section 3: a test frame is sufficiently complete if
the corresponding clause in the specification (meaning, the one that has this test frame as its partition combination) has an
expression that is structurally continuous. Hence, test frames that are sufficiently complete according to this definition will
also be representative under the criterion for functional testing. In both of these cases, though, the restriction to the
criterion of functional testing is important, as it can not be guaranteed that such test frames will also be representative under
criteria for structural testing. This is because there is no guarantee that the designers or the implementers of the system will
have made the implementations of each clause in the specification completely uniform for all cases of the behaviour
represented by that clause, even though this might be a reasonable thing for them to aim at doing, and indeed it is
sometimes referred to as the “reasonable implementation” principle when justifying the significance of the representative
properties of functional test sets.
Another useful property of such piecewise structurally continuous specifications is that, using the equivalence of the
partition combinations and the test frames, one can rewrite each clause in the general form
TestFramei ∧ (output = Expressioni)
Then, since the assumption has been made that the partition combinations (ie the test frames) are mutually exclusive, it will
be apparent that any such specification is actually in disjunctive normal form, as is commonly derived during the operation
of methods for generating test cases from formal specifications, such as those described originally by Dick & Faivre [19]
and then developed by the many successors to them (see, for example, [20] for a review of these).
Related to this is the property that any specification which is not in piecewise structurally continuous form can be rewritten
into this form. This follows from the fact that, in a clause of the form
if TestFramei then Expressioni
if Expressioni is not structurally continuous then it must involve alternative behaviours, and so must actually be in a form
such as if conditionj then Expressioni1 else Expressioni2. Hence, this single clause can be rewritten as the pair of clauses
[if TestFramei ∧ conditionj then Expressioni1] else
[if TestFramei ∧ ¬ conditionj then Expressioni2]
This rewriting step is very similar to the unfolding process for conditional axioms that is described by Bernot et al, and
ideally repeated applications of it should eventually reach a form in which all the combinations of test frame and condition
are mutually exclusive, and all expressions are structurally continuous. If such a form can be achieved, then the whole
specification will be in piecewise structurally continuous form. In principle, though, such a form may not be achievable:
indeed, it may not even be possible to compute the number of rewriting steps that might be required for it. In this case,
some arbitrary upper bound has to be assumed for the number of rewriting steps that will be applied, which corresponds to
what Bernot et al call the level of the regularity hypothesis.
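A single unfolding step of this rewriting can be sketched as follows; the concrete representation (a clause body is either a plain function, or a triple marking a remaining alternative) is an assumption of this sketch, not the unfolding process of Bernot et al:

```python
# Sketch of one unfolding step: a clause whose expression still contains an
# alternative (cond, b1, b2) is rewritten into two clauses with
# strengthened, mutually exclusive guards.

def unfold_once(clauses):
    out = []
    for guard, body in clauses:
        if isinstance(body, tuple):            # body = (cond, b1, b2)
            cond, b1, b2 = body
            out.append((lambda x, g=guard, c=cond: g(x) and c(x), b1))
            out.append((lambda x, g=guard, c=cond: g(x) and not c(x), b2))
        else:
            out.append((guard, body))          # already continuous
    return out

# Hypothetical spec: if x >= 0 then (if x == 0 then 0 else 1) else -1
spec = [(lambda x: x >= 0, (lambda x: x == 0, lambda x: 0, lambda x: 1)),
        (lambda x: x < 0, lambda x: -1)]
flat = unfold_once(spec)   # three clauses, all structurally continuous
```

Repeated application of such a step, up to the assumed bound on the number of rewritings, yields the piecewise structurally continuous form discussed in the text.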
This process can then be developed to make these definitions of structural continuity and piecewise structural continuity
completely rigorous, by looking more closely at the way in which expressions are built up. The simplest case is that an
expression uses variables but no operators, and in this case it must be structurally continuous. Any expression in a
specification that is more complex than this must use some operators, but any operator Opj can be assumed to have a
specification in clausal form, and by the construction above it can without loss of generality be required that this
specification be in piecewise structurally continuous form. Under these conditions it will then follow that Expressioni will
be structurally continuous if
∀ Opj used in Expressioni • (∃ clause k in OpSpecj • PartitionCombinationi ⇒ PartitionCombinationk)
Furthermore, if this condition is satisfied, then since for any Opj the PartitionCombinationk are mutually exclusive, the
clause k satisfying the condition must be unique, because the variables occurring in PartitionCombinationk will be a subset
of those occurring in PartitionCombinationi, and the specified ranges of these must correspond to a unique clause in the
specification of Expressioni.
An issue that is still left open by this construction is that of precisely how an operator is specified, since there may well be
cases where it is convenient to regard operators as structurally continuous in practice, even though in principle they are
defined in such a way that they are only piecewise structurally continuous. This issue will be returned to in section 8, but
before doing so it is appropriate to examine how the concept of structural continuity applies to implementations.
7. Structural Continuity of Implementations
The application of the concept of structural continuity to implementations in conventional programming languages is built
up in the usual fashion, starting with the most basic forms of statement and then going on to the various kinds of structured
statements. Basic statements, such as assignments, will inherently be structurally continuous if the expressions that they use
are. Thus, if the expressions just use individual variables, the statements will be structurally continuous, or if the
expressions involve any form of operator application or function call and all of the operators or functions are structurally
continuous then the expression will also be structurally continuous. Conversely, if any of the operators or functions in it are
structurally discontinuous then the whole expression will be structurally discontinuous, but if some or all of them are
piecewise structurally continuous and the rest (if any) are structurally continuous then (by the construction given in the
previous section) the whole expression can be regarded as equivalent to one that is in piecewise structurally continuous
form. Similarly, if the expression in a statement is piecewise structurally continuous, then for the purposes of analysing its
behaviour when executed the statement can also be regarded as being piecewise structurally continuous.
When such statements are composed sequentially into a block, then if all the individual statements are structurally
continuous the block will be structurally continuous too. If one of the statements is only piecewise structurally continuous
then similarly the block will be piecewise structurally continuous, and this generalises to the case where more than one of
the statements in the block is piecewise structurally continuous. Here, though, there will in principle need to be one clause
in the equivalent piecewise structurally continuous form for each possible combination of the clauses in the forms for the
individual statements, although in practice there may be constraints between the conditions of the clauses that mean that
some of the combinations can be ignored.
For any kind of structured statement that expresses a choice, such as
if condition then block fi ,
if condition then block1 else block2 fi, or more general forms such as
switch expression case block1 case block2 ... case blockn end ,
the basic definition of the concept of structural continuity means that it should be obvious that the statement will inherently
be structurally discontinuous, unless the blocks of code nested within it are all structurally continuous, in which case the
whole statement is piecewise structurally continuous. Alternatively, if each block nested within the statement is either
structurally continuous or piecewise structurally continuous, then by the construction given in the previous section the
whole statement can be regarded as being equivalent to one that is in a piecewise structurally continuous form. Thus, for
the purposes of analysing its execution behaviour, such a statement can be treated as being piecewise structurally
continuous.
The structured statements that express repetitions conventionally divide into two groups, depending on whether or not the
number of repetitions is fixed at the start of execution of the statement. Statements that provide for a fixed number of
repetitions typically have a form such as
for range of iterations do block end
and if the possibility of there being zero repetitions is ignored, then essentially they can be treated as the sequential
composition of the appropriate number of occurrences of block. Hence, in principle the number of clauses in the
equivalent piecewise structurally continuous form would be the number of clauses in the form for block, raised to the
power of the number of repetitions, which means that in practice this form will suffer from a combinatorial explosion in the
number of clauses.
A similar problem applies to the statements that provide for indefinite numbers of repetitions, such as
while condition do block end or
repeat block until condition end
since these are inherently structurally discontinuous, but again in principle if the blocks of code nested within them are
piecewise structurally continuous then it ought to be possible to treat the resultant statements as piecewise structurally
continuous. In practice, though, this would involve expanding the loop up into a series of alternatives, one for each
possible number of iterations, and since each of these alternatives would suffer from the kind of combinatorial explosion
that affects fixed numbers of repetitions, the resultant piecewise structurally continuous equivalent would suffer from an
even more severe combinatorial explosion, which would make it intractable for any serious analysis.
The final form of statement that needs to be analysed is the invocation of a procedure or equivalent kind of routine, for
which the structural continuity will be the same as for the code of the procedure or routine being invoked. In particular,
this means that any correctly formed recursive procedure or function must inherently be either structurally discontinuous or
piecewise structurally continuous, since the requirement for correct formation means that any such procedure must contain
some form of choice statement in order to select either the base or the recursive case.
8. Integration Testing
Given the problems of combinatorial explosion that can arise with the number of clauses in a piecewise structurally
continuous block of code, as described above, an issue that is obviously significant is that of just how many clauses are
required to define the behaviour of operations, and in particular whether it is possible to control this number, perhaps by
treating some of the more primitive operations as being structurally continuous in practice, even though in principle they
are only piecewise structurally continuous. The point here is that typically a primitive operation will be specified in terms
of a set of axioms, which will often have been derived from the structure of primitive type definitions that are inherently
recursive, with the consequence that the set of axioms will give rise to a piecewise structurally continuous specification.
As an illustration, at the most primitive level natural numbers are defined in terms of a recursive data type that uses two
constructor functions, typically denoted zero (a constant) and succ (a unary function). Consequently, any primitive
operation over the natural numbers will be specified in terms of axioms that define the value that it returns for each
combination of these two possible constructions. For instance, in the case of addition the relevant axioms will be:
zero + zero = zero,
zero + succ (n) = succ (n),
succ (n) + zero = succ (n), and
succ (n1) + succ (n2) = succ (succ (n1 + n2) ).
Hence, this set of axioms produces a specification that is at best piecewise structurally continuous, and that appears to
require four clauses, although actually the need to unwind the recursions means that in principle a separate clause is
required for every possible combination of natural number values.
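The four axioms can be transcribed directly as a recursive function over a zero/succ representation; the tuple encoding of the constructors is an assumption of this sketch:

```python
# The four addition axioms above, transcribed over a zero/succ
# representation of the natural numbers (tuples used as constructors).

ZERO = ("zero",)

def succ(n):
    return ("succ", n)

def add(a, b):
    if a == ZERO and b == ZERO:
        return ZERO                        # zero + zero = zero
    if a == ZERO:
        return b                           # zero + succ(n) = succ(n)
    if b == ZERO:
        return a                           # succ(n) + zero = succ(n)
    return succ(succ(add(a[1], b[1])))     # succ(n1) + succ(n2)

def to_int(n):
    return 0 if n == ZERO else 1 + to_int(n[1])

assert to_int(add(succ(succ(ZERO)), succ(ZERO))) == 3   # 2 + 1 = 3
```

The recursive call in the fourth clause is exactly the unwinding that, as the text notes, in principle demands a separate clause for every combination of values.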
Similarly, if one considers the implementation of natural numbers in terms of the usual binary representations for unsigned
integers, each number needs to be mapped into a string of bits, and then the usual half-adder and full-adder operations need
to be defined in terms of pairs or triples of bits, and extended to strings of appropriate maximum lengths. Thus, the
implementation of the basic half-adder operation will be piecewise structurally continuous, with clauses for the four pairs
(0, 0), (0, 1), (1, 0) and (1, 1), and similarly the implementation of the full-adder operation will require clauses for the eight
possible triples. Then, the extension of these to appropriate length bit strings will involve embedding these in a loop that
operates over the maximum number of bits in the string, and so in principle it will again give rise to a clause for every
possible combination of natural number values.
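This half-adder / full-adder construction, extended by a loop over a fixed-width bit string, can be sketched as follows (little-endian bit lists are an assumption of this illustration):

```python
# Sketch of the half-adder and full-adder operations, extended to
# fixed-width bit strings by a loop over the bits (little-endian lists).

def half_adder(a, b):
    return a ^ b, a & b                  # (sum, carry): clauses for 4 pairs

def full_adder(a, b, cin):
    s1, c1 = half_adder(a, b)
    s2, c2 = half_adder(s1, cin)
    return s2, c1 | c2                   # clauses for the 8 triples

def ripple_add(xs, ys):
    """xs, ys: equal-length little-endian bit lists; returns (bits, carry)."""
    out, carry = [], 0
    for a, b in zip(xs, ys):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out, carry

# 3 + 5 in three bits overflows into the carry: result 0, carry-out 1.
assert ripple_add([1, 1, 0], [1, 0, 1]) == ([0, 0, 0], 1)
```

The loop over the string is what, in principle, reintroduces a clause for every combination of values, as the text observes.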
In practice, though, when testing a piece of software that used such an addition operation, one would not want to focus on
testing every possible set of inputs to the addition, since one would normally expect that this operation had already been
thoroughly tested, probably using both functional and structural methods, and so could be relied upon to operate correctly.
Thus, one would want instead to focus on issues such as whether the expression using the addition operation had been
written correctly: for instance, to identify faults such as writing a * b or a – b, instead of the expression a + b that was
actually intended. Furthermore, in order to find test frames that will be particularly appropriate for identifying such faults,
if they have occurred, the ones that arise most naturally from either the specification or the implementation of the addition
operator may be of little value. For instance, test cases with either a = 0 or b = 0 may not help much in distinguishing the
correct expression from the incorrect ones.
What is needed, therefore, is some way of treating such primitive operations as though they are structurally continuous, in
order to avoid the combinatorial problems that would otherwise result from having to recognise that actually they are only
piecewise structurally continuous. If this is to be done, though, the decision about which operations are to be treated as
being structurally continuous in this way has to be made on the basis of how thoroughly the operations have actually been
tested, or even verified, rather than on any inherent property of their specifications or implementations. This is not a new
situation, and indeed even for the stream X-machine testing method the claim that it will find any possible faults in the
system under test is predicated on the assumption that the individual processing functions have already been thoroughly
tested, so that the only remaining faults can be those arising from the way in which these functions have been integrated
into the complete X-machine system.
What this leads to, therefore, is a new interpretation of the usual view of the process of integration testing as being one of
assembling components into a hierarchy of sub-systems that are regarded as having been tested. In this interpretation, each
node in the hierarchy now consists of a component or sub-system that, because of the way in which it has been tested or
verified, is effectively declared to be structurally continuous, even though its underlying specification or implementation is
only piecewise structurally continuous. Thus, the leaf nodes in this hierarchy will be whatever operations are deemed to be
“primitive” for the purpose of the system being tested, which might well be those built in to whatever programming
language is being used for the development. On the other hand the hierarchy does not have to stop here, and for ultimate
correctness one might want to go down further, through the various layers of the system software (compiler, loader,
operating system, etc), and perhaps even to the microcode or the hardware itself.
In the other direction, as components or sub-systems are tested to the point where it is considered safe to declare them to be
structurally continuous, then they too can be added to the hierarchy at a layer above the components on which they depend.
Thus, in the X-machine test method, the layer above the primitive operations would have the processing functions declared
to be structurally continuous and added to it, as the testing of each of them is completed. Once all of these functions have
been assembled into this hierarchy, then at the next layer up the X-machine that uses the functions can be tested according
to the method, and eventually added to the hierarchy. Then, if these machines in their turn are used as functions in a
higher-level X-machine, which is one way in which the X-machine approach can be used to specify complex systems, this
higher-level X-machine can in its turn be tested using the X-machine method, and added to the hierarchy, and so on.
In theory there are then two possible extensions to this approach. One is that it could be extended to the case where the
underlying specification or implementation of a component is structurally discontinuous rather than being piecewise
structurally continuous, although the significance of the construction presented in section 6 is that in practice this extension
should never be necessary. The other possible extension would be to allow a component or sub-system to be declared to be
piecewise structurally continuous, but with fewer clauses than its underlying specification or implementation. Here,
though, the examples given above suggest that the underlying specification or implementation may not be of much help in
identifying which clauses would need to be retained in the new piecewise structurally continuous form, but this may not
always be the case.
For instance, suppose that a component had been produced to implement a stack structure. Intrinsically the behaviour of
this might well be piecewise structurally continuous, with a clause being required for each possible value of the number of
items in the stack. Once it had been thoroughly tested, then in principle one might want to deem it to be structurally
continuous before trying to integrate it with other components. In practice, though, there will be an obvious boundary
where its behaviour will not be completely uniform, in that some operations (such as top and pop) will behave differently
when the stack is empty from when it contains some data. Hence, this would probably have to be recognised by treating the
component as piecewise structurally continuous, with either just two clauses (for empty and non-empty) or three (for
empty, full and in between), which is likely to be many fewer clauses than would be required by its intrinsic behaviour.
What this indicates is that, for such an extension to be useful, a method needs to be developed for identifying the clauses
that would be relevant to the use of such components or sub-systems. Consideration of such a method is beyond the scope
of this paper, although it can be observed that any such method would have to involve analysing the use of the component
or sub-system. For instance, in the case of this stack example, any use of an operation such as top or pop will have to
recognise that it may deliver a legitimate value, or if the stack is empty it may have to take some other action to indicate
this, such as raising an exception. Thus, one basis for such a method might be to start from the assumption that any
integrated component can in principle be treated as structurally continuous, but then in practice superimpose on top of this
whatever test frames (and their associated clauses) might arise from these different possible patterns of use of the various
operations.
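The stack example can be sketched as follows; the choice of an exception to signal the empty-stack clause is an assumption of this illustration, one of the "other actions" the text mentions:

```python
# Sketch of the stack example: pop's behaviour is piecewise structurally
# continuous with just two clauses, one for the empty stack and one for
# the non-empty case, rather than one clause per possible stack depth.

class Stack:
    def __init__(self):
        self._items = []

    def push(self, x):
        self._items.append(x)

    def pop(self):
        if not self._items:                # clause 1: empty stack
            raise IndexError("pop from empty stack")
        return self._items.pop()           # clause 2: non-empty stack

s = Stack()
s.push(1)
assert s.pop() == 1
```

Any use of pop must therefore recognise both clauses, which is precisely the kind of usage-derived test frame that the proposed method would superimpose on a component otherwise treated as structurally continuous.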
9. Summary and Conclusions
The main conclusion to be drawn from this work is that it is possible to define test frames in terms of the characteristic
conditions that identify sets of test cases, and indeed there are significant advantages in doing so. Unlike the original
definition of test frames that was given by Ostrand & Balcer, such a model is independent of the structure of categories or
partitions that are needed in the test specification for any particular system. Partly because of this independence, this model
then has the advantage that it enables a unified view to be taken of test methods, since they can all – both functional and
structural methods – be regarded as generating a set of input-only test frames. For each of these kinds of method the model
then provides a basis for defining a formal structured process for the activity of generating the required test cases (inputs
and expected outputs) from these test frames. Furthermore, through its treatment of infeasible test frames, this process
identifies more explicitly than before those situations where errors might have arisen in the operation of the test
method itself.
The second key feature of this model is that it provides a good basis for specifying formally an important property of a test
frame, namely that the test set that it generates is a representative one. This property depends on being able to specify
formally the requirements for a test set to be representative, and for this purpose the concept of structural continuity has
been introduced, and it has been shown to provide a good basis for doing this. As with the model of test frames, this
concept too is applicable to both functional and structural testing, and for both kinds of testing representative sets of test
frames can be derived directly from any construction that is in a piecewise structurally continuous form. Furthermore, it
has been shown that for a specification that is structurally discontinuous it is always possible to produce an equivalent
piecewise structurally continuous form, or at least one that can be treated as piecewise structurally continuous under some
assumed bound on the number of rewritings that are required, so that for functional testing this concept has the highly
desirable property that it can always lead to a set of functional test cases that can be taken as representative.
The third key feature of this model is that this concept of structural continuity provides a theoretical basis for the practical
approach to integration testing, in which units of code are assembled into progressively larger sub-systems until the whole
system being developed has been integrated, thus leading to a hierarchical structure for the system as a whole. Informally it
is obvious that each level of this hierarchy consists of sub-systems that have been tested sufficiently thoroughly that they
can be regarded as having been properly integrated in some sense, although usually the meaning of “properly integrated”
here is fairly imprecise. The importance of the concept of structural continuity is that it enables this to be made much more
precise, since effectively it treats “properly integrated” as meaning that at least some aspects of the behaviour of this sub-
system can now be deemed to be structurally continuous, even though intrinsically it is actually only piecewise structurally
continuous.
This still leaves two main issues open as further work. One of these issues is that of how this approach to the hierarchical
structure of the integration testing process can be extended, so that instead of treating all sub-systems that have been
integrated as structurally continuous, some can where necessary be deemed to be piecewise structurally continuous, but
with fewer clauses than would be required to model their intrinsic behaviour. As described in the previous section, this
requires some method for analysing how such sub-systems or components are used, so that the different possible cases that
need to be treated as separate clauses can be clearly distinguished. The development of such a method, and its
incorporation into the process of integration testing, is a problem that still needs to be addressed.
The other main open issue is that of how this model of test frames, and the associated concept of structural continuity,
should be extended from the function-based testing regime that has been discussed here, to state-based testing. Clearly
such an extension will not affect the basic concepts of verifying the conformance of states in the implementation to states in
the specification, and the conformance of sequences of state transitions. What still needs to be investigated, though, is
whether it might be necessary to test some particular sequences of state transitions more than once, in order to cover
different combinations of categories and partitions in the specifications of the processing functions that are invoked to
perform the state transitions. If it is necessary, then one might expect that this would be because of some kind of structural
discontinuity in one of the processing functions, but the possible nature of any structural discontinuity that might require
such combinations of test cases is not clear, and requires further analysis.
Acknowledgements
The material presented in this paper has benefited greatly from discussion with colleagues within the FORTEST network,
which was funded by EPSRC, and in particular, the contributions of Stuart Reid and Robert Hierons to these discussions
are gratefully acknowledged. The presentation of the material has also benefited greatly from the comments of the referees
on the original version, in which they identified a number of additional issues that ought to be discussed, and suggested
relevant references for these.
References
1 Ostrand TJ, Balcer MJ. The Category-Partition Method for Specifying and Generating Functional Tests. Communications of the ACM, June 1988; 31(6): 676-686.
2 Cowling AJ. What Should Graduating Software Engineers Be Able To Do? Proceedings of 16th Conference on Software Engineering Education and Training, Madrid, Spain, March 2003. IEEE Computer Society Press: Los Alamitos, CA, 2003; 88-98.
3 Cowling AJ. Teaching Data Structures and Algorithms in a Software Engineering Degree: Some Experience with Java. Proceedings of 14th Conference on Software Engineering Education and Training, Charlotte, North Carolina, USA, March 2001. IEEE Computer Society Press: Los Alamitos, CA, 2001; 247-257.
4 Freedman RS. Testability of Software Components. IEEE Transactions on Software Engineering, 1991; 17(6): 553-564.
5 Balcer MJ, Hasling WM, Ostrand TJ. Automatic Generation of Test Scripts from Formal Test Specifications. Proceedings of 3rd Symposium on Software Testing, Analysis and Verification, Key West, FL, December 1989. ACM Press: New York, NY, 1989; 210-218.
6 Ostrand TJ. Generating Formal Specifications from Test Information. Proceedings of 2nd Workshop on Formal Approaches to the Testing of Software (FATES), Brno, Czech Republic, August 2002 (at <http://www.brunel.ac.uk/~csstrmh/concur2002/fates.html>); 11-18.
7 Holcombe M. What are X-machines: Editorial to Special Issue. Formal Aspects of Computing, 2000; 12: 418-422.
8 Holcombe M, Ipate F. Correct Systems: Building a Business Process Solution. Springer Verlag (Series on Applied Computing): Berlin & London, 1998.
9 Aguado J, Cowling AJ. Foundations of the X-machine Theory for Testing. Department of Computer Science Research Report CS-02-06, University of Sheffield, 2002, at <http://www.dcs.shef.ac.uk/research/resmems/papers/CS0206.pdf>.
10 Petrenko A, Yevtushenko N, von Bochmann G, Dssouli R. Testing in context: framework and test derivation. Computer Communications, 1996; 19: 1236-1249.
11 Hierons RM, Harman M. Testing conformance of a deterministic implementation against a non-deterministic stream X-machine. Theoretical Computer Science, 2004; 323: 191-233.
12 Grochtmann M, Grimm K. Classification Trees for Partition Testing. Software Testing, Verification and Reliability, 1993; 3(2): 63-82.
13 Stocks PA, Carrington DA. Test Templates: A Specification-based Testing Framework. Proceedings of 15th International Conference on Software Engineering, Baltimore, MD, May 1993. IEEE Computer Society Press: Los Alamitos, CA, 1993; 405-414.
14 Stocks PA, Carrington DA. A Framework for Specification-based Testing. IEEE Transactions on Software Engineering, 1996; 22(11): 777-793.
15 Spivey JM. The Z Notation: A Reference Manual. Prentice Hall: New York & London, 1989.
16 Chen TY, Poon PL, Tse TH. A Choice Relation Framework for Supporting Category-Partition Test Case Generation. IEEE Transactions on Software Engineering, 2003; 29(7): 577-593.
17 Roper M. Software Testing. McGraw-Hill (International Software Quality Assurance Series): London, 1994.
18 Bernot G, Gaudel MC, Marre B. Software testing based on formal specifications: a theory and a tool. Software Engineering Journal, 1991; 6: 387-405.
19 Dick J, Faivre A. Automating the Generation and Sequencing of Test Cases from Model-Based Specifications. Proceedings of FME'93: Industrial-Strength Formal Methods, Odense, Denmark (Lecture Notes in Computer Science, vol. 670). Springer: Berlin, 1993; 268-284.
20 Offutt J, Liu S, Abdurazik A, Ammann P. Generating Test Data from State-Based Specifications. Software Testing, Verification and Reliability, 2003; 13: 25-53.
UKTest 2005
The need for new statistical software testing models
John May, Maxim Ponomarev, Silke Kuball, Julio Gallardo
Safety Systems Research Centre, University of Bristol
ABSTRACT
There is growing interest in Statistical Software Testing
(SST) as a software assurance technique. Whilst the
approach has major attractions, we show that there is a
need for new statistical models to infer failure
probabilities from SST. We construct a simple but
realistic case in which traditional models do not work.
KEY WORDS
Software testing, Software assurance, Failure probability,
Statistical test models, Statistical estimation
1 Statistical software testing
Interest in Statistical Software Testing (SST) is growing
because it provides a software assurance technique that is
both sound and practical. In common with formal proof
methods, it is one of the few techniques that offers an
objective, quantitative measure of software quality (SST
provides a failure probability estimate). In addition, SST
side-steps the famous statement “Program testing can be
used to show the presence of bugs, but never to show their absence!" [1]. The statement is true, but assumes that a
total absence of failure is the only acceptable goal. This
assumption is not made in other engineering disciplines,
where the goal is risk reduction and complete absence of
system failures is seen as unrealistic. In this context
Dijkstra’s statement loses its impact.
Statistical software testing (SST) is a dynamic testing
technique, designed in a very specific way that makes it
possible to assess the system’s probability of failure on
demand or failure per hour from the test results. No other
testing technique allows us to do this. The aim of SST is
to have and execute a test-set that facilitates the deduction
of a dependability figure (e.g. a failure probability) for the
software under test. Such a figure may be used as
evidence in a safety-case or as stand-alone assurance for
the software under test. Simply stated, the core conditions
for statistical testing are that statistical test-cases have to
be a) generated through a probabilistic simulation of the
application environment for which the dependability
statement needs to be derived, and b) they have to be
statistically independent. For more details on the
technique and its application see for example [2], [3], [4],
[5].
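Conditions (a) and (b) can be sketched in a few lines (our illustration, not from the cited works; the demand classes and profile probabilities are invented): test cases are drawn independently from a probabilistic model of the operational environment.

```python
import random

def draw_tests(profile, n, seed=0):
    """Draw n statistically independent test cases from an operational profile.

    profile: dict mapping a demand class to its probability of occurring
    in the operational environment (probabilities sum to 1)."""
    rng = random.Random(seed)
    classes = list(profile)
    weights = [profile[c] for c in classes]
    # Independent draws with replacement satisfy condition (b).
    return [rng.choices(classes, weights=weights)[0] for _ in range(n)]

# Invented operational profile for a protection system's demands:
profile = {"normal": 0.90, "alarm": 0.08, "trip": 0.02}
tests = draw_tests(profile, 1000)
assert len(tests) == 1000
assert set(tests) <= set(profile)
```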
2 The need for new statistical models of
software testing
The statistics underpinning SST invariably rely on the
Binomial model of failure [2]. This paper constructs an
example where the simple Binomial model does not
apply.
There has been research into other forms of SST model,
in particular where models are required to relate program
fail probability to the fail probabilities of program
components [8], [9], [10], [11], [12], [13], [14], [15], [16].
It is clear that the simple Binomial model does not solve
this problem, and neither does its partition variant. SST
models for component-based software remain an open
problem, and we do not address them in this paper. Our
example is one where, at first sight, the Binomial model
appears relevant, and is the only accepted model in the
SST literature.
3 The example
The example was inspired by software found inside smart
sensors. These are being used in safety critical systems,
and present a problem to safety analysts because the
presence of software makes it hard to ascertain the
reliability of such devices.
The example focuses on the simple problem of computing
a rolling average of 8 sensor readings. In each cycle, a
new reading is taken and used as the 8th reading in a
computation of an average over the most recent 8
readings, which is then output. Only 8 readings are kept in
memory, older readings are discarded. A sequence of 8
readings is regarded as a test, although tests could be
defined as sequences of any length. Thus 8 averages are
output on each test. There are no gaps between tests; the
8th reading of test n is followed directly by the 1st reading
of test n+1. Each test is chosen by a random search
mechanism analogous to placing all tests in a bag
according to an operational distribution and blind picking
a test from the bag (and replacing after each pick), as
described in detail by Miller et al. [2].
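The rolling-average behaviour, and the way a test's early outputs depend on the previous test's readings, can be sketched as follows (our illustration; the reading range is invented):

```python
import random
from collections import deque

def run_test(buffer, readings):
    """One 'test': feed 8 new readings through the 8-place rolling average.

    buffer is the sensor's persistent memory; because it survives from one
    test to the next, the first 7 outputs of a test depend on the previous
    test's readings."""
    outputs = []
    for r in readings:
        buffer.append(r)                      # deque(maxlen=8) drops the oldest
        outputs.append(sum(buffer) / len(buffer))
    return outputs

buffer = deque([0.0] * 8, maxlen=8)           # persistent state across tests
rng = random.Random(1)
test_n = [rng.uniform(0.0, 5.0) for _ in range(8)]
test_n1 = [rng.uniform(0.0, 5.0) for _ in range(8)]
run_test(buffer, test_n)
out = run_test(buffer, test_n1)
# Only the 8th output of test n+1 is free of test n's readings:
assert abs(out[7] - sum(test_n1) / 8) < 1e-9
```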
An important feature of this program is that the first 7
outputs in any test depend on inputs from the previous
test. Thus neighbouring tests are dependent. This creates a
situation where it is possible that the occurrence of an 'initial'
failure on a test is random (initial failures occur according to
the Binomial model), but the test immediately following
such an initial failure always fails. This occurs where, for
example, the average computation processes a specific
single reading (e.g. the reading ‘0’) incorrectly.
Therefore, in a case where this occurs, it is clear that the
Binomial model does not apply since in the failure
process it describes the failure probability on each test
must be equal, irrespective of previous test history.
3.1 A new statistical model of failure probability estimation
Developing the likelihood function for 0 failures under
the assumption that a failure induces at least one
follow-up failure.
We shall consider the following failure pattern. Consider
N software tests, where N is large, within which j failures
occur, where j is between 1 and N.
Whenever an initial failure occurs on a test-run Ti, then
this will lead to another failure at test-run Ti+1 due to
shared data. We call this type of failure in Ti+1 ‘follow-
up failure’. The distinction between the two types of
failure is important. Ti+1 can also produce a follow-up
failure in Ti+2, due to new data contained in Ti+1; i.e.
Ti+1 would contain both a follow-up failure and a new initial
failure. If Ti+1 did not contain an initial failure, and Ti+2
did not contain an initial failure, the failure sequence
would stop; i.e. Ti+2 would not fail.
Thus, a total of j failures can occur as a result of a
sequence of failure events, each of which is of the form:
an initial failure event on test-run Ti, which induces one
additional certain (follow-up) failure event on the next
test run Ti+1 (follow-up failure), due to shared data
between the two tests. If initial failures are rare, then the
failure pattern is likely to be failures occurring in pairs
separated by long sequences of failure free test-runs.
All we observe from the outside is the total number of
failures. However, if we are aware of the different
patterns that might lead to observing j failures, we can use
this information when constructing the probability of
observing j failures.
We aim at developing an expression for this probability.
Furthermore, the sum over such probabilities for j=1,…,N
will yield the probability Pr(at least 1 failure in N tests)
given the underlying failure process. The complement of
this probability would be the likelihood function of
observing 0 failures in N tests given the underlying failure
process. This formula depends on parameters capturing
the failure probability of the program on a single test, and
it would then be possible to go on to build estimators for
these parameters and hence the failure probability of the
program given a number of tests (although we do not do
this).
It seems plausible that the likelihood function under the
assumption of having follow-up failures would differ
from the traditional likelihood function for 0 failures in N
tests when we assume that all failures occur
independently from each other (see for example [2], [3],
[4]).
Ultimately, we are interested in comparing these two
likelihood functions to identify whether the use of a
traditional Binomial model in the case of having more
complex failure patterns would underestimate the
software failure probability.
For a test-run Ti, we call the probability of failure on Ti
caused by the newly observed data in Ti: θ. (This is less
than the overall probability of failure for Ti, since there is
also the possibility that Ti fails as a follow-up failure caused
by data read in Ti−1.) The probability of a follow-up failure
(in Ti+1) based on the same data already present in Ti and
carried over to Ti+1 is 1.
The j failures could happen in many different ways, and
the question is how this influences the likelihood
function for the event:
(0, N) – 0 failures in N tests.
We call this likelihood function Pnew(0, N | θ); it depends
on θ. We calculate it via its complement Pnew(≥ 1 failure,
N | θ), which is the sum over all probabilities
Pnew(j failures, N | θ), where j takes on any value in 1, …, N:

   Pnew(≥ 1 failure, N | θ) = Σ_{j=1}^{N} Pnew(j failures, N | θ)    (1)

It is now a counting problem for the probabilities of having
different numbers of failures inside the line (sequence) of
N tests.
The first term in eq. (1) is the probability of a single
failure inside the N tests. A single failure is possible only
at the very end of our test line, since an initial failure
anywhere earlier would induce a follow-up failure. So it can
occur in only one (the last) place, with probability θ, along
with the probability of independent 'successes', (1−θ)^(N−1),
in the remaining N−1 places of the test line. Therefore, we
have the following likelihood for exactly one failure:

   Pnew(1 failure, N | θ) = θ(1−θ)^(N−1)    (2)
The second term in eq. (1) is the probability of having two
failures inside the N tests. This can happen in two ways.
The first is an exact pair of an initial failure and its
follow-up failure: the initial failure can occur, with
probability θ, in any of the (N−1) places that leave room
for the follow-up failure immediately after it, together with
the probability of independent 'successes' in the remaining
places. This gives the first term in the sum in eq. (3). The
second term in eq. (3) gives the probability of the more
exotic case in which two initial failures happen in a pair,
the second overlapping the follow-up failure of the first,
and the follow-up failure of the second being cut off by the
end of the test line. This cut overlap can happen in one
place only – at the end boundary of the N test line – with
probability θ²(1−θ)^(N−2). So we obtain the probability of
having 2 failures in our test line as follows:

   Pnew(2 failures, N | θ) = (N−1)θ(1−θ)^(N−1) + θ²(1−θ)^(N−2)    (3)
By analogy, considering the probabilities of more failures
in our N test line, we obtain further terms for eq. (1),
Pnew(j failures, N | θ), for j = 3, 4:

   Pnew(3 failures, N | θ) = 2(N−2)θ²(1−θ)^(N−2) + θ³(1−θ)^(N−3)    (4)

   Pnew(4 failures, N | θ) = [(N−2)(N−3)/2]θ²(1−θ)^(N−2)
      + 3(N−3)θ³(1−θ)^(N−3) + θ⁴(1−θ)^(N−4)    (5)
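The combinatorics behind eqs. (2)-(5) can be checked by brute force (our sketch, not part of the paper): enumerate every pattern of initial failures for a small N, mark a test as failed if it or its predecessor suffered an initial failure, and compare the exact distribution of the number of failures with the closed forms.

```python
from itertools import product

def exact_pj(n, theta):
    """Exact P(j failures in n tests), enumerating all 2^n initial-failure
    patterns; test i fails if it has an initial failure or test i-1 did."""
    pj = [0.0] * (n + 1)
    for pattern in product([0, 1], repeat=n):
        prob = 1.0
        for bit in pattern:
            prob *= theta if bit else 1.0 - theta
        fails = sum(1 for i in range(n)
                    if pattern[i] or (i > 0 and pattern[i - 1]))
        pj[fails] += prob
    return pj

n, t = 10, 0.05
q = 1.0 - t
pj = exact_pj(n, t)
closed = [
    t * q**(n - 1),                                           # eq. (2)
    (n - 1) * t * q**(n - 1) + t**2 * q**(n - 2),             # eq. (3)
    2 * (n - 2) * t**2 * q**(n - 2) + t**3 * q**(n - 3),      # eq. (4)
    (n - 2) * (n - 3) / 2 * t**2 * q**(n - 2)                 # eq. (5)
        + 3 * (n - 3) * t**3 * q**(n - 3) + t**4 * q**(n - 4),
]
for exact, formula in zip(pj[1:5], closed):
    assert abs(exact - formula) < 1e-9
```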
For j larger than 4 these probabilities decrease
significantly, corresponding to higher orders of the small
parameters θN and (θN)². These specific conditions allow
us to approximate Pnew(≥ 1 failure, N | θ) via a truncated
(partial) sum of eq. (1) after j = 4. Furthermore, if we
exclude terms higher than the second order of magnitude
in θ and Nθ, we have the following approximation:
   Pnew(≥ 1 failure, N | θ) ≈ θ(1−θ)^(N−1) + (N−1)θ(1−θ)^(N−1)
      + θ²(1−θ)^(N−2) + 2(N−2)θ²(1−θ)^(N−2)
      + [(N−2)(N−3)/2]θ²(1−θ)^(N−2)    (6)

For a large number of tests (N ≫ 1) and small θ (so that
Nθ ≲ 1), we can estimate eq. (6) asymptotically:

   Pnew(≥ 1 failure, N | θ) ≈ (Nθ + (Nθ)²/2)e^(−Nθ)    (7)
As a next step, we want to compare the result above with
the result one would obtain if we applied the traditional
Binomial model to the test results, as follows:

   PSUM(≥ 1, M | λ) = Σ_{j=1}^{M} [M(M−1)(M−2)⋯(M−j+1)/j!] λ^j (1−λ)^(M−j)    (8)

This expresses the probability of observing at least one
random failure in M tests with a Binomial failure pattern.
Hereby the failure probability on demand is denoted as λ.
Using the Poisson approximation for eq. (8), which is
effective for large M and small λ (so that Mλ ≲ 1), we
can again argue that terms above a certain limit L can be
neglected in the sum in eq. (8). This yields the following
approximation of eq. (8):

   PSUM(≥ 1, M | λ) ≈ e^(−Mλ) Σ_{j=1}^{L} (Mλ)^j/j!    (9)

Excluding terms higher than the second order of
magnitude in Mλ, we have the truncated sum of eq. (9):

   PSUM(≥ 1 failure, M | λ) ≈ (Mλ + (Mλ)²/2)e^(−Mλ)    (10)
We can examine the differences between eqs. (7) and (10)
by considering the case λ ≈ 2θ and M = N (the same
number of tests). When the 'initial' fail rate θ is small
and N is large, the overall fail rate of the 'new' process is
approximately 2θ. In this case, we can compare the
probability of seeing 0 failures in N tests for a Binomial
failure process and a 'new' failure process with the same
overall failure rate. We merely note that the expressions
differ, so that there is scope for error if a program obeys
the 'new' failure process and the Binomial model is used
to model it. For example, if the traditional Binomial
model gives a lower value for the probability of zero
failures in N tests, it can only explain this in terms of a
lower overall failure rate estimate. The result, given the
observation of 0 failures during testing with N tests,
would be to underestimate the program failure rate.
It should be noted that here we have made these estimates
in the specific case of a large number of tests (M, N ≫ 1)
and small expected Nθ and Mλ (Nθ ≲ 1, Mλ ≲ 1).
Calculations for more general cases according to our
model (when the truncation of the sum in eq. (1) is not
possible) could give different deviations.
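To make the comparison concrete, this sketch (ours, with invented numerical values) evaluates the truncated approximations of eqs. (7) and (10) for M = N and λ = 2θ:

```python
import math

def p_new_ge1(n, theta):
    """Eq. (7): asymptotic P(>= 1 failure in n tests) for the 'new' process."""
    x = n * theta
    return (x + x * x / 2.0) * math.exp(-x)

def p_binom_ge1(m, lam):
    """Eq. (10): Poisson-approximated Binomial P(>= 1 failure in m tests)."""
    x = m * lam
    return (x + x * x / 2.0) * math.exp(-x)

n, theta = 10_000, 2e-5        # invented values: N*theta = 0.2
lam = 2.0 * theta              # same overall failure rate for the Binomial model
# Because the 'new' process clusters failures into pairs, it produces at
# least one failure less often than a Binomial process with the same
# overall rate, so the two likelihoods of observing 0 failures differ.
assert p_binom_ge1(n, lam) > p_new_ge1(n, theta)
```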
4 Discussion
SST provides test results with a useful meaning, and so
appears to be an important future software assurance
technique. However, this paper has shown that there is a
need to develop more sophisticated statistical models of
SST than are currently available. This will require input
from computer science, i.e. the models will involve software
analyses, as in the example, where the basic requirements
dictate a certain use of program memory and consequent
test dependence. It will not be possible to derive these
models using standard statistical techniques alone; there is
a need for collaborative efforts between statisticians and
computer scientists.
The example we have constructed is not a pathological
case. It is simple and realistic, and the implications are
therefore potentially important. It demonstrates that new
statistical models of SST are needed in scenarios where
traditional analysis only offers the Binomial model and its
partition variants.
Application of the Binomial model to this example has
the potential to be incorrect in a dangerous sense. In a
safety-critical context, an underestimate of the program
failure probability (i.e. over-confidence in the software
reliability) might sanction the deployment of a program
whose reliability does not meet acceptable targets.
REFERENCES
[1] Dijkstra E. Notes on Structured Programming, in
Dahl O, Dijkstra E, Hoare C (Eds.) Structured
Programming, (Academic Press 1972).
[2] Miller W.M., Morell L.J., Noonan R.E., Park S.K.,
Nicol D.M., Murrill B.W. and Voas J.M. Estimating the
probability of failure when testing reveals no failures,
IEEE Trans. on Software Engineering v18 n1 1992.
[3] Thayer R., Lipow M., and Nelson E. Software
Reliability (North Holland 1978).
[4] Ehrenberger W. Probabilistic techniques for software
verification in safety applications of computerised process
control in nuclear power plants, IAEA-TECDOC-581, Feb
1991.
[5] May J.H.R., Hughes G and Lunn A.D. Reliability
Estimation from Appropriate Testing of Plant Protection
Software, Software Engineering Journal, Nov. 1995.
[6] May J.H.R and Lunn A.D. New Statistics for
Demand-Based Software Testing, Information Processing Letters 53, 1995.
[7] Kuball, S., Hughes, G., May, J.H.R., Gallardo, J.,
John, A.: The effectiveness of Statistical Testing when applied to logic systems, Safety Science, Vol. 42, pp. 369-383, Elsevier, 2004.
[8] J.May, S.Kuball, G.Hughes, Test Statistics for System
Design Failure, International Journal on Reliability,
Quality and Safety Engineering (IJRQSE),Vol. 6, No.3,
pp 249--264 (1999).
[9] Kuball S., May J.H.R. & Hughes G. Building a
System Failure Rate Estimator by Identifying Component
Failure Rates, Procs. of the 10th Int. Symposium on Software Reliability Engineering (ISSRE'99), Boca
Raton, Florida, Nov 1-4, 1999; pp. 32-41 (IEEE Computer
Society 1999).
[10] Gokhale S.S. & Trivedi K.S. Structure-based
software reliability prediction, Procs. Advanced
Computing (ADCOMP 97), Chennai, India 1997.
[11] Smidts C. & Sova D. An architectural model for
software reliability quantification: sources of data,
Reliability Engineering & System Safety 64 (2) pp. 279-
290, 1999.
[12] Hamlet D., Mason D., Woit D. Theory of software
reliability based on components, Proceedings of the 23rd
International Conference on Software Engineering (ICSE
2001), pp12-19 May 2001, Toronto, Ontario, Canada.
IEEE Computer Society 2001, ISBN 0-7695-1050-7.
[13] Littlewood B. A reliability model for systems with
markov structure, Applied Statistics v24 n2 pp. 172-177,
1975.
[14] Shooman M. Structural models for software
reliability prediction, 2nd Int. Conf. Software Engineering, pp. 268-280, 1976.
[15] Krishnamurthy S. & Mathur A.P. On the estimation
of reliability of a software system using reliabilities of its
components, 8th International Symposium on Software
Reliability Engineering, pp. 146-155, Albuquerque, New
Mexico, 1997.
[16] Goseva-Popstojanova K, & Trivedi K.S.
Architecture-based approach to reliability assessment of
software systems, Performance Evaluation v45 n2-3,
2001.
[17] Musa J.D. Operational profiles in software reliability
engineering, IEEE Software 10(2) 1993.
[18] Littlewood B. and Wright D. Some conservative
stopping rules for the operational testing of safety-critical
software, IEEE Trans. on Fault Tolerant Computing
Symposium, pp 444-451, Pasadena, 1995.
[19] Butler R.W. & Finelli G.B. The infeasibility of
quantifying the reliability of life-critical real-time
software, IEEE Trans. on Software Engineering v19 n1
1993.
[20] Littlewood B. & Strigini L. Validation of Ultra-High
Dependability for Software-based Systems,
Communications of the ACM, 36 (11), pp.69-80, 1993.
A Theory of Regression Testing for Behaviourally
Compatible Object Types
Anthony J H Simons
Department of Computer Science, University of Sheffield,
Regent Court, 211 Portobello Street, Sheffield S1 4DP, United Kingdom
A.Simons@dcs.shef.ac.uk
http://www.dcs.shef.ac.uk/~ajhs/
Abstract. This paper presents a behavioural theory of object compatibility,
based on the refinement of object states. The theory predicts that only certain
models of state refinement yield compatible types, dictating the legitimate
design styles to be adopted in object statecharts. The theory also predicts that
standard practices in regression testing are inadequate. Functionally complete
test-sets that are applied as regression tests to subtype objects are usually
expected to cover the state-space of the original type, even if they do not cover
transitions and states introduced in the subtype. However, such regression
testing is proven to cover strictly less than this in the new context and so provides
much weaker guarantees than was previously expected. Instead, a retesting
model based on automatic test regeneration is required to guarantee equivalent
levels of correctness.

Keywords: Object-oriented, behavioural subtyping, state refinement, state-based
testing, regression testing, test generation, testing adequacy.
1 Introduction
Practical object-oriented unit testing is influenced considerably by the non-intrusive
testing philosophy of McGregor et al [1, 2]. In this approach, every object under test
(OUT) has a corresponding test-harness object (THO), which encapsulates all the test-
sets separately. This separation of concerns is the main motivation for McGregor’s
parallel design and test architecture in which an isomorphic inheritance graph of test
harness classes shadows the graph of production classes [2]. This embodies the
beguiling intuition that, since a child class is an extension of its parent, so the test-sets
for the child are extensions of the test-sets for the parent. The presumed advantage is
that test-sets can be inherited from the parent THO and applied, as a suite, to the child
OUT, in a kind of regression test. The purpose of such retesting is to ensure that the
child class still delivers all the functionality of the parent. The child THO will supply
additional test-sets to exercise methods introduced in the child OUT [1, 2].
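The parallel architecture can be sketched in a few lines (an illustration in Python's unittest rather than McGregor's notation; the Account classes and names are invented): the child test-harness class inherits the parent's test-set, so running it re-applies the parent's tests to the child object under test.

```python
import unittest

class Account:
    """Invented production class (the parent OUT)."""
    def __init__(self):
        self.balance = 0

    def deposit(self, amount):
        self.balance += amount

class SavingsAccount(Account):
    """Child OUT, extending the parent with one method."""
    def add_interest(self):
        self.balance = round(self.balance * 1.05, 2)

class AccountTHO(unittest.TestCase):
    OUT = Account                       # object under test for this harness

    def test_deposit(self):
        a = self.OUT()
        a.deposit(10)
        self.assertEqual(a.balance, 10)

class SavingsAccountTHO(AccountTHO):
    OUT = SavingsAccount                # inherited tests re-run against the child

    def test_add_interest(self):
        a = self.OUT()
        a.deposit(100)
        a.add_interest()
        self.assertEqual(a.balance, 105.0)
```

Loading SavingsAccountTHO yields two tests: the inherited test_deposit (the regression suite) plus the new test_add_interest.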
More recently, the JUnit tool has fostered a similar strategy for re-testing classes
that are subject to continuous modification and extension [3, 4]. JUnit allows
programmers to develop test scripts, which are converted into suites of methods behind
the scenes. These are executed on demand, to test objects and to re-test modified or
extended versions of those objects. One of the key benefits of JUnit is that it makes
the re-testing of refined objects semi-automatic, so it is widely used in the XP
community, in which the recycling of old test-sets has become a major part of the quality
assurance strategy. XP makes a strong claim that a programmer may incrementally
modify code in iterative cycles, so long as each modification passes all the original
unit tests: “Unit tests enable refactoring as well. After each small change the unit tests
can verify that a change in structure did not introduce a change in functionality” [5].
There are two sides to this claim. Firstly, if the modified code fails any tests, it is
clear that faults have been introduced, so there is some benefit in reusing old tests as
diagnostics. Secondly, there is the implicit assumption that modified code which
passes all the tests is still as secure as the original code. Tests are implicitly being
used as guarantees of a certain level of correctness.
In this paper, we prove that the second assumption is unsound and unsafe. In
section 2, a state-based theory of object refinement is presented, which encompasses
object extension with subtyping, the concrete satisfaction of abstract interfaces and
refactoring of implementations with unchanged behaviour. The theory predicts that
only certain models of state refinement yield compatible types, dictating the legitimate
design styles to be adopted in object statecharts. In sections 3 and 4, the theory also
predicts that standard practices in regression testing are inadequate. Functionally
complete test-sets that are applied as regression tests to subtype objects are usually
expected to cover the state-space of the original object, even if they do not cover
transitions and states introduced in the refinement. However, such regression testing is
proven to cover strictly less of the original object’s state space in the new context and
so provides much weaker guarantees than expected. After passing the recycled tests,
objects may yet contain introduced faults, which are undetected.
In place of the unsafe kinds of regression testing, we propose a new approach,
which is based on automatically generating the tests from the refined object’s state
machine. The theory predicts that simply adding new test suites to the existing
encapsulated suites does not achieve coverage. It is necessary to generate new test-sets
from scratch, in which methods are interleaved in different orders than before, to
obtain the same level of guarantee.
2 A Theory of Compatible Object Refinement
In classical automata theory, the notion of machine compatibility is judged by compar-
ing sets of traces, sequences of labels taken from transition paths computed through
the machines in question. Two machines are deemed equivalent if their trace-sets are
equivalent. A machine is behaviourally compatible with another if its trace-set in-
cludes all the traces of the other, that is, for every trace in the protocol of the reference
machine, such a trace also exists in the protocol of the modified machine. Object-
oriented design methods [6, 7] include object statecharts, which are influenced by
Harel’s statecharts [8] and SDL [9]. These notations are more complex than simple
finite state automata. Equivalence and compatibility between statecharts are judged
by considering syntactic relations between the transformed state spaces, from which
the trace behaviour follows.
2.1 McGregor’s Statechart Refinements
Fig. 1. McGregor’s structural statechart refinements. The basic state machine M0 is refined
by the compatible machines M1, M2, M3. M1 refines M0 by adding an extra transition. M2
refines M0 by introducing substates. M3 refines M0 by introducing concurrent states.
McGregor et al. proposed one of the early theories of object statechart refinement [10,
11]. In McGregor’s model, object states derive from the object’s stored variable val-
ues, as seen through observer methods. The machines have Mealy-semantics, with
quiescent states and actions on the transitions, representing the invoking of methods.
Figure 1 illustrates a contemporary reworking of McGregor’s three main structural
refinements, to allow comparison with trace models. These kinds of refinement were
deemed compatible because they observed the rules:
• all states in the base object are preserved in the refined object;
• all introduced states are wholly contained in existing states;
• all transitions in the base object are preserved in the refined object.
These structural refinements may be compared with trace models. The traces of
M0 are the set {<>, <a>, <a, b>}, where <a, b> is a sequence of method invocations
in the protocol of M0. M1 adds an extra method c to the interface of M0. This is a
derived method, analogous to function composition [12], that computes a more direct
route to the destination state S3. The traces of M1 are {<>, <a>, <a, b>, <c>} so it
is clear that this includes the traces of M0.
M2 adds two extra methods d and e, which examine state S2 at a finer granularity.
S2 is completely partitioned into substates S2.1 and S2.2. Since states are abstractions
over variable products [10], this is equivalent to dependence on disjoint subsets of
variable values. The usual statechart semantics of M2 is that entry to S2 implies entry
to the default initial substate S2.1; and the exit transition b from S2 preempts other
substate events. The statechart may therefore be flattened to a simple state machine,
with transition a leading directly from state S1 to S2.1 and an exit transition b from
both substates S2.1 and S2.2 to state S3. The traces of M2 are infinite (due to the
infinite alternation of d, e), but include {<>, <a>, <a, b>, <a, d>, <a, d, b>, <a, d,
e>, <a, d, e, b>, …} and so include all the traces of M0.
M3 introduces concurrent states S4, S5 and extra methods d and e which depend on
the new states. This represents the definition of new variables in the object subtype,
together with new methods whose behaviour is orthogonal to existing behaviour. The
usual statechart semantics is that both machines execute concurrently. Formally, this
is equivalent to a flat state machine containing the product of the states of the two
concurrent machines, which we denote as: {S1/4, S2/4, S3/4, S1/5, S2/5, S3/5}. The
traces of M3 are infinite, but include {<>, <a>, <d>, <a, d>, <d, a>, <a, b>, <d,
e>, <a, d, e>, <d, e, a>, <a, d, b>,…} and so include all the traces of M0.
2.2 Cook and Daniels’ Statechart Refinements
Fig. 2. Cook and Daniels’ additional statechart refinements. As before, M0 is the reference
machine. M4 refines M0 by adding a transition to a new state. M5 refines M0 by transition
splitting. M6 refines M0 by retargeting a transition.
In their Syntropy method [13], Cook and Daniels permit further extensions to state-
charts. Their full set of refinements includes (p207-8): adding new transitions, add-
ing new states, partitioning a state into substates, splitting transitions either at source
or destination substates, retargeting transitions onto destination substates and
composition with concurrent machines. Figure 2 illustrates the three main kinds of
transformational refinement not already covered above.
These refinements may also be compared with trace models. M4 refines M0 by
adding a new method c leading to a new state S4. This new state represents the addi-
tion of object variables, but unlike the case M3, the associated behaviour is not or-
thogonal, but tightly coupled to state S1. We sometimes refer to S4 as a new external
state, to distinguish this from a new substate, of the kind in M2. The traces of M4 are
the set {<>, <a>, <a, b>, <c>} and so include the traces of M0. However, this re-
finement breaks the second of McGregor’s rules about new states being introduced as
wholly contained substates.
M5 refines M0 by splitting the exit transition b, which no longer proceeds from the
S2 state boundary, but from the individual substates S2.1 and S2.2. This represents
the redefinition of the method b in the refinement, to depend disjointly on the intro-
duced substates. The overall response is equivalent to the original b. The traces of
M5 are {<>, <a>, <a, b>, <a, d>, <a, d, b>} and so include the traces of M0. By
the usual semantics of object statecharts, an exit transition from a superstate boundary
is equivalent to exit transitions from every substate. It is therefore inevitable that state
partitioning will split exit transitions.
Cook and Daniels [13] also allow the symmetrical case, splitting entry transitions to
target different destination substates. Mutually exclusive and exhaustive guards are
introduced to distinguish which of the substates should be reached by each partial
transition. However, fairness in partitioning incoming transitions to all substates is
later shown to be irrelevant in the retargeting rule. M6 refines M0 by retargeting the
transition a onto an arbitrary substate of S2. We choose to target S2.2 simply to illus-
trate how this is different from the default initial substate S2.1, even though the model
now cannot enter S2.1. The traces of M6 are {<>, <a>, <a, b>} and so are exactly
the traces of M0.
According to the classical theory of trace inclusion, all of the refinements M1-M6
may be substituted in place of M0 and will exhibit identical trace behaviour in re-
sponse to M0’s events. However, we argue below that this is an insufficient guarantee
of behavioural compatibility in object-oriented programming, where objects are ali-
ased by handles of multiple types. For this, a stronger theory is required.
2.3 Behaviourally Compatible Statechart Refinement
The fundamental philosophical problem to decide in the theory is how to treat the
introduction of new variables in subtype objects. Do these variables correspond to
missing pieces of the object’s earlier state, and so their concatenation in the subtype
gives rise to brand-new external states (like M4 above)? Do these variables already
exist in virtuo at the abstract level, in which case their concrete exposure in the sub-
type creates new substates (like M2 above)? Are these variables orthogonal and so
give rise to concurrent states in the subtype, equivalent to state products (like M3
above)? These different views may be in conflict.
The M3 refinement can be shown to be more general than M2. By flattening M2, a
statechart is obtained in which all a-transitions target the default initial substate, S2.1.
The product machine obtained by flattening the M3 refinement is more sophisticated,
since the a-transitions go from S1/4 and S1/5 to S2/4 and S2/5 respectively. M3 is
more sensitive to orthogonal behaviour than M2. It is reasonable to assume that we
must expect subtype objects to exhibit orthogonal behaviour at least some of the time,
so the M3 refinement is chosen over M2.
Both M3 and M2 assume that introduced state variables are exposed as substates of
existing states. This contrasts with M4, which assumes that entirely new states may be
introduced. In M4, the c-transition takes an object entirely out of the S1 state, whereas
in M3, the d-transition still leaves the object in its S1 state (going from S1/4 to S1/5).
This means that in all contexts and under all firings of d- and e-transitions, the M3
object can be abstracted to an M0 object, whereas this cannot be done for an M4 object.
Abstracting away from M4 in state S4 leaves an object in no recognizable M0 state,
and furthermore the object will deadlock in this state for any attempt to fire a-
transitions. In terms of the π-calculus process algebra [14], M3 strongly simulates
M0, whereas M4 only weakly simulates M0. This is discussed in section 5.4 below.
Fig. 3. The model of behaviourally-compatible refinement. L2 is the refined statechart result-
ing from the concurrent composition of L0 and L1, without respect to order. The states of L0
and L1 become intersecting regions in the refinement, which contains the product of states.
Since in general we must expect to support refinements like M3, in which full state
products are computed, the notion of hierarchical superstates encapsulating substates,
in the style of M2, becomes moot. It is more sensible to think of the old states as
being completely partitioned into new states. Figure 3 illustrates this in a more com-
pelling way. Here, L2 is the refinement resulting from the concurrent composition of
L0 and L1. However, it is irrelevant whether L0 is the basis and L1 is the supplement,
or vice-versa. Whereas in figure 1 we were tempted to view composition as ordered,
here we cannot. Accordingly, we cannot say that any particular superstate hierarchy is
more valid. So, we dispense with superstates and think instead of regions, intersecting
areas enclosing states that share some common transition behaviour. In figure 3, re-
gions are shown as dashed outlines. Four intersecting regions can be identified in L2
that correspond to the pairs of simple states in L0 and L1.
The process of refining a state machine then becomes a matter of turning states into
regions, whose enclosed states completely partition the original unrefined state. After
this, the main obligation is to ensure that all the transition behaviour of the base object
is preserved in the refined object. Partitioning a state will always split outgoing transi-
tions, for example, the a-transition from S1 is turned into a pair of partial a-transitions
from S1/3 and S1/4. Because we are assuming orthogonal behaviour, these also target
separate partitions of S2, the states S2/3 and S2/4. However, what if the behaviours of
c, d are not entirely independent of a, b? In this case, incoming transitions might be
retargeted onto different states.
Let a region correspond to a state that is being refined. Retargeting has no adverse
effect on the validity of the refinement, so long as the transition retargets a state within
the same region. Suppose the a-transitions were retargeted onto different states within
S2. No matter which destination states within region S2 we retarget, we should still be
able to abstract away to S2. In all cases, the partial a-transitions would be merged in a
single transition from S1 to S2. Retargeting may select an arbitrary state, or combina-
tion of states within the destination region. Suppose now that the c-transition from
S1/3 were retargeted outside the S1 region, to S2/4, within the different region S2.
The c message now interacts unfavourably with the alternating behaviour of a, b. This
means that a sequence <c, a> will deadlock from S1/3. While this modification is not
compatible with L0, it is compatible with L1. Retargeting must therefore be consid-
ered with respect to the compatibility relation desired between specific machines.
From these considerations, we obtain the statechart refinement rules for behav-
ioural compatibility. With respect to the statechart for a given object type, the state-
chart for a compatible object may introduce additional states, corresponding to the
exposure of extra variable products, and additional transitions, corresponding to the
introduction of new methods, so long as:
• Rule 1: new states are always introduced as complete partitions of existing
states, which become enclosing regions;
• Rule 2: new transitions for additional methods do not cross region bounda-
ries, but only connect states within regions;
• Rule 3: refined transitions crossing a region boundary completely partition
the old entry/exit transitions of the original unrefined state;
• Rule 4: refined transitions within a region completely partition the old self-
transitions of the original unrefined state.
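Rules 1 and 2 lend themselves to mechanical checking. The following is a minimal sketch, assuming a mapping from refined states to their enclosing regions as the input representation; all names are our own:

```python
def check_rule_1(refined_states, region_of, base_states):
    """Rule 1: the refined states completely partition the base states.
    region_of maps each refined state to the region (base state) it refines."""
    total = all(s in region_of for s in refined_states)   # every state placed
    onto = set(region_of.values()) == set(base_states)    # every region non-empty
    return total and onto

def check_rule_2(new_transitions, region_of):
    """Rule 2: transitions for newly introduced methods stay inside one region.
    new_transitions is a list of (source, method, target) triples."""
    return all(region_of[src] == region_of[dst]
               for (src, _, dst) in new_transitions)

# The refinement L2 of figure 3: four states in two intersecting regions
region_of = {"S1/3": "S1", "S1/4": "S1", "S2/3": "S2", "S2/4": "S2"}
```

Retargeting the c-transition from S1/3 to S2/4, the incompatible case discussed above, is exactly what `check_rule_2` rejects.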
Rule 1 is the fundamental rule, which preserves the hierarchy of state abstractions.
It confirms McGregor’s second rule of statechart refinement [11]. It disallows the
introduction of new external states, so rules out Cook and Daniels’ refinement by
extension (such as M4) [13]. Rule 2 defines limits on state retargeting for new meth-
ods, with respect to the chosen compatibility relationship. In section 5.4 we show how
these two rules relate to strong simulation. Rule 3 captures all of Cook and Daniels’
rules about transition splitting and retargeting within a superstate (a region, in our
approach). The important generalisation is the complete partitioning of transitions,
which ensures that the set of new transitions behaves exactly like the old single transi-
tion. Rule 4 is a similar rule to ensure that self-transitions are preserved explicitly in
the refinement. These two rules essentially describe the faithful replication of transi-
tions for states that have been partitioned. They ensure that the refined machine is a
non-minimal equivalent to the original machine.
Together, the four rules enforce a strict behavioural consistency between the re-
fined and original state machines, analogous to strong simulation (see 5.4). This is
stronger than some other trace-based models of consistency, which only look at model
executions in the absence of a theory of state and state generalisation. The invoca-
tional consistency of Ebert and Engels [15] requires the subtype to contain all the
traces of the supertype. This is equivalent to Cook, Daniels and McGregor’s position,
described above [11, 13]. Ebert and Engels’ observational consistency is weaker still,
since it merely requires all the supertype’s traces to be derivable by censoring the
subtype’s traces to remove methods that were introduced in the subtype [15].
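Ebert and Engels' censoring of subtype traces can be expressed in a one-line sketch (the function name is our own):

```python
def censor(trace, introduced):
    """Drop methods introduced in the subtype from one of its traces; under
    observational consistency, the result must be a trace of the supertype."""
    return tuple(m for m in trace if m not in introduced)
```

For example, censoring the M3 trace <a, d, e, b> of the methods {d, e} introduced in the refinement recovers the M0 trace <a, b>.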
3 The Generation of Complete Unit Test-Sets
In state-based testing approaches [16, 17, 11, 18], it is possible to develop a notion of
complete test coverage, based on the exhaustive exploration of the object’s states and
transitions. However, the nature of the guarantee obtained after testing varies from
approach to approach. The following is an adaptation of the X-Machine testing
method [18, 19], which offers stronger guarantees than other methods, in that its test-
ing assumptions are clear and it tests negatively for the absence of all undesired be-
haviour as well as positively for the presence of all desired behaviour.
3.1 State-Based Specification
Fig. 4. Abstract state machine for a Stack interface. The two states (Empty, Normal) are de-
fined on a partition of the range of the size access method. No self-transitions for access meth-
ods are notated, by convention, but all other transitions must be shown.
We assume that the object under test (OUT) exists in a series of states, which are
chosen by the designer to reflect modes in which its methods react differently to the
same message stimuli (formally, the notion of state derives from state-contingent re-
sponse and has nothing to do with whether the object has quiescent periods). The
OUT is assumed to have a unique transition to its initial state and may or may not
have a final state, a mode in which it is no longer useable, for example, an error state
(representing a corrupted representation – see figure 4), or a terminated state (repre-
senting the end of the object’s life history).
The states of an object derive ultimately from the product of its attribute variables,
but can be characterised more abstractly as the product of the ranges of its access
methods. Formally, we assume that states are a complete partition of this product.
For completeness, a finite state model must define a transition for each method in
every state. However, suitable conventions may be adopted to simplify the drawing of
the state transition diagram, in particular, to establish the meaning of missing transi-
tions. Figure 4 shows a simplified state machine for an abstract Stack interface, in
which the omitted transitions for all the access methods size, empty and top may be
inferred implicitly as self-transitions in every state.
It must be possible to determine the desired behaviour of the object, in every state,
and for each method. If more than one transition with the same method label exits
from a given state, the machine is nondeterministic. Qualifying the indistinguishable
transitions with mutually exclusive, exhaustive guards will restore determinism (in
figure 4, ambiguous pop transitions exiting the Normal state are guarded). Certain
design-for-test conditions may apply, to ensure that the OUT can be driven determin-
istically through all of its states and transitions [18]. For example, in order to know
when the final pop transition from Normal to Empty is reached, the accessor size is
required as one of Stack’s methods.
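A minimal Stack satisfying these design-for-test conditions might look as follows. This is an illustrative sketch in Python, not the paper's implementation; the predicate names are our own choice:

```python
class Stack:
    """Minimal stack used as the object under test (OUT)."""
    def __init__(self):
        self._items = []
    def push(self, e):
        self._items.append(e)
    def pop(self):
        if not self._items:
            raise IndexError("pop() called in the Empty state")
        return self._items.pop()
    def top(self):
        return self._items[-1]
    def empty(self):
        return not self._items
    def size(self):
        return len(self._items)

# External state predicates for the test harness: exhaustive and disjoint,
# defined on the range of the size() access method as in figure 4
def is_empty(out):
    return out.size() == 0

def is_normal(out):
    return out.size() > 0
```

A test harness drives the OUT through a method sequence and then calls each predicate, expecting exactly one to return true.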
3.2 State-Based Test Generation
The basic idea, when testing from a state-based specification, is to drive the OUT into
all of its states and then attempt every possible transition (both expected and un-
wanted) from each state, checking afterwards which destination states were reached.
The OUT should exhibit indistinguishable behaviour from the specification, to pass
the tests. It is assumed that the specification is a minimal state machine (with no du-
plicate, or redundant states), but the tested implementation may be non-minimal, with
more than the expected states. These notions are formalised below.
The alphabet is the set of methods m ∈ M that can be called on the interface of the
OUT (including all inherited methods). The OUT responds to all m ∈ M, and to no
other methods (which are ruled out by the syntactic checking phase of the compiler).
This puts a useful upper bound on the scope of negative testing.
The OUT has a number of control states s ∈ S, which partition its observable
memory states. A control state is defined as an equivalence class on the product of the
ranges of the OUT’s access methods. If a subset A ⊆ M of access methods exists,
then each observable state of the OUT is a tuple of length |A|. Formally, tuples fall
into equivalence classes under exhaustive, disjoint predicates p : Tuple → Boolean,
where each predicate p corresponds to a unique state s ∈ S. In practice, these predi-
cates are implemented as external functions p : Object → Boolean invoked by the test
harness upon the OUT : Object, which detect whether the OUT is in the given state
using some combination of its public access methods.
Sequences of methods, denoted <m1, m2, …>, m ∈ M, may be constructed. Lan-
guages M0, M1, M2, … are sets of sequences of specific lengths; that is, M0 is the set
of zero-length sequences: {<>} and M1 is the set of all unit-length sequences: {<m1>,
<m2>, …}, etc. The infinite language M* is the union M0 ∪ M1 ∪ M2 ∪ …, containing all arbitrary-length sequences. A predicate language P = {<p1>, <p2>, …} is a set
of predicate calls, testing exhaustively for each state s ∈ S.
In common with other state-based testing approaches, the state cover is determined
as the set C ⊆ M* consisting of the shortest sequences that will drive the OUT into all
of its states. C is chosen by inspection, or by automatic exploration of the model. An
initial test-set T0 aims to reach and then verify every state. Verification is accom-
plished by concatenating every sequence in the state cover C with every predicate in
the predicate language P, denoted: C ⊗ P, where ⊗ is the concatenated product which
appends every sequence in P to every sequence in C.
T0 = C ⊗ P (1)
A more sophisticated test-set T1 aims to reach every state and also exercise every
single method in every state. This is constructed from the transition cover, a set of
sequences K1 = C ∪ C ⊗ M1, which includes the state cover C and the concatenated
product term C ⊗ M1, denoting the attempted firing of every single transition from
every state. The states reached by the transition cover are validated again using all
singleton predicate sequences <p> ∈ P.
T1 = (C ∪ C ⊗ M1) ⊗ P (2)
An even more sophisticated test-set T2 aims to reach every state, fire every single
transition and also fire every possible pair of transitions from each state. This is con-
structed from the augmented set of sequences K2 = C ∪ C ⊗ M1 ∪ C ⊗ M2 and the
reached states are again verified using the predicate. The product term C ⊗ M2 de-
notes the attempted firing of all pairs of transitions from every state.
T2 = (C ∪ C ⊗ M1 ∪ C ⊗ M2) ⊗ P (3)
In a similar fashion, further test-sets are constructed from the state cover C and
low-order languages Mk ⊆ M*. The reached states are always verified using <p> ∈ P,
for which exactly one should return true, and all the others false. The desired Boolean
outcome is determined from the model. Each test-set subsumes the smaller test-sets of
lesser sophistication in the series. In general, the series can be factorised and ex-
pressed for test-sets of arbitrary sophistication as:
Tk = C ⊗ (M0 ∪ M1 ∪ M2 ∪ … ∪ Mk) ⊗ P (4)
For the Stack shown in figure 4, the alphabet M = {push, pop, top, empty, size}.
Note that new is not technically in the method-interface of Stack. It represents the
default initial transition, executed when an object is first constructed, which in the
formula is represented by the empty method sequence <>. The smallest state cover C
= {<>, <push>}, since the “final state” is really an exception raised by pop from the
Empty state. Other sequences are calculated as above. Test-sets generated from this
model may be used to test any Stack implementation that has identical states and tran-
sitions, for example, a LinkedStack, which uses a linked list to store its elements.
3.3 Test Completeness and Guarantees
The test-sets produced by this algorithm have important completeness properties. For
each value of k, specific guarantees are obtained about the implementation, once test-
ing is over. The set T0 guarantees that the implementation has at least all the states in
the specification. The set T1 guarantees this, and that a minimal implementation pro-
vides exactly the desired state-transition behaviour. The remaining test-sets Tk pro-
vide the same guarantees for non-minimal implementations, under weakening assump-
tions about the level of duplication in the states and transitions.
A redundant implementation is one where a programmer has inadvertently intro-
duced extra “ghost” states, which may or may not be faithful copies of states desired
in the specification. Test sequences may lead into these “ghost” states, if they exist,
and the OUT may then behave in subtle unexpected ways, exhibiting extra, or missing
transitions, or reaching unexpected destination states. Each test-set Tk provides com-
plete confidence for systems in which chains of duplicated states do not exceed length
k-1. For small values of k, such as k=3, it is possible to have a very high level of
confidence in the correct state-transition behaviour of even quite perversely-structured
implementations.
Both positive and negative testing are achieved, for example, it is confirmed that
access methods do not inadvertently modify object states. Testing avoids any uni-
formity assumption [20], since no conformity to type need be assumed in order for the
OUT to be tested. Likewise, testing avoids any regularity assumption that cycles in
the specification necessarily correspond to implementation cycles. When the OUT
“behaves correctly” with respect to the specification, this means that it has all the
same states and transitions, or, if it has extra, redundant states and transitions, then
these are semantically identical duplicates of the intended states in the specification.
Testing demonstrates full conformity up to the level of abstraction described by the
control states.
The state-based testing approach described here is an adaptation of the X-Machine
approach for complete functional testing [18, 19], replacing input/output pairs with
method invocations. The need for “witness values” in the output is eliminated by the
guaranteed binding of messages to the intended methods in the compiler. The test
generation algorithm adapts Chow’s W-method for testing finite state automata [16].
In Chow’s method, states are not directly inspectable. Instead, reached states are
verified by attempting to drive the implementation through further diagnostic se-
quences chosen from a characterisation set W ⊆ M*, each state uniquely identified by
a particular combination of diagnostic outcomes. Here, we know that the OUT’s state
is inspectable, since it must be characterised by some partition of the ranges of its
access methods.
4 Object Refinement and Test Coverage
The notion of behaviourally-compatible refinement introduced in section 2 applies
equally to the realisation of interfaces (in the UML sense that a concrete class imple-
ments an abstract interface [7]) and also to the specialisation of object subtypes. In
both cases, the notion of refinement is explained in terms of deriving a more elaborate
state transition diagram by subdividing states and adding transitions to a basic dia-
gram. In this paper, we also consider that the need to re-implement an object, in the
sense of XP’s refactoring [5, 21], constitutes a refinement in the same sense. This is
because modification typically replaces simple solutions with more complex ones, in
response to new requirements. At the unit-testing level, individual OUTs tend to
become more complex. (It is also possible, when refactoring an entire subsystem
[21], for certain objects to become simplified, at the expense of introducing new ob-
jects, or shifting the complexity onto other objects, or by deleting unnecessary code –
we do not consider this here).
4.1 Test Coverage of a Modified or Refactored Object
Figure 5 illustrates a refined object statechart for a DynamicStack, an array-based
implementation of a Stack. We may either consider this to be a concrete realisation of
the Stack interface of figure 4, or else a change in implementation policy, a refactoring
of an old linked Stack. Firstly, we wish to confirm that the DynamicStack specifica-
tion conforms to the abstract Stack specification in figure 4.
Fig. 5. Concrete machine for a DynamicStack, which realizes the Stack interface. The two
states (Loaded, Full) partition the old Normal state in fig. 4, resulting in the replication of its
transitions. The behaviour of push in the Full state must be tested.
The main difference between the DynamicStack and the earlier Stack machine is
that the old Normal state, now only shown as a dashed region, has been partitioned
into the states {Loaded, Full}, in order to model the dynamic resizing of the Dynam-
icStack (push will behave differently in the Full state, triggering a memory realloca-
tion). This is a complete partition (no other substate of Normal exists), so rule 1 is
satisfied. No new methods are introduced, so rule 2 is not applicable. The Normal
state’s old entry and exit transitions now cross over the region boundary, reaching the
exposed Loaded state. The new pair of push, pop transitions exactly replaces the old
pair (without splitting), so rule 3 is satisfied. The Normal state’s old self-transitions
are now replicated inside the region, as a consequence of splitting the state. The for-
mer push transition is first split in two (one replication for each new state) and then
the transition from Loaded is split again, with exclusive guards on size. Similarly, the
former pop transition is replicated for each new state and its former guard: size() > 1 is
preserved in both states; however, the guard need not be notated in the Full state, as
there is no other conflicting pop transition. So, rule 4 is also satisfied. The refined
DynamicStack implementation (in figure 5) is therefore compatible with the original
Stack interface’s behaviour (in figure 4).
Next, we consider the issue of test coverage. Increasing the state-space has impor-
tant implications for test guarantees. Consider the sufficiency of the T2 test-set, gen-
erated from the abstract Stack specification in figure 4. This robustly guarantees the
correct behaviour of a simple LinkedStack implementation with S = {Empty, Normal},
even in the presence of “ghost” states. T2 will include one sequence <push, push,
push, isNormal>, which robustly exercises <push, push> from the Normal state and
will even detect a “ghost” copy of the Normal state. A strong guarantee of correctness
after testing may therefore be given for a LinkedStack implementation.
In classical regression testing, saved test-sets are reapplied to modified or extended
objects in the expectation that passing all the saved tests will guarantee the same level
of correctness. If the Stack’s T2 test-set were reused to test a DynamicStack con-
structed with n ≥ 3, so having all the states {Empty, Loaded, Full} and all the transi-
tions shown in figure 5, the resizing push transition would never be reached, since this
requires a sequence of four push methods. To the tester, it would appear that the
DynamicStack had passed all the saved T2 tests, even if a fault existed in the resizing
push transition. This fault would be undetected by the saved test-set.
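The gap can be demonstrated concretely. Below, a DynamicStack carries a fault planted in its resize step; every sequence of at most three pushes (the depth reached by the saved T2 set) behaves perfectly, and the fault surfaces only on the fourth push. The implementation is an illustrative sketch, not the paper's:

```python
class DynamicStack:
    """Array-based stack with initial capacity n; resizes on overflow.
    A fault is deliberately planted in the resize step."""
    def __init__(self, n=3):
        self._items = [None] * n
        self._size = 0
    def size(self):
        return self._size
    def push(self, e):
        if self._size == len(self._items):
            self._resize()
        self._items[self._size] = e
        self._size += 1
    def pop(self):
        if self._size == 0:
            raise IndexError("pop() called in the Empty state")
        self._size -= 1
        return self._items[self._size]
    def _resize(self):
        # Planted fault: silently drops the last stored element
        self._items = self._items[:-1] + [None] * (len(self._items) + 1)

def push_pop_trace(k, n=3):
    """Push k distinct values onto a fresh DynamicStack, then pop them all back."""
    out = DynamicStack(n)
    for i in range(k):
        out.push(i)
    return [out.pop() for _ in range(k)]
```

Three pushes leave the stack intact; the fourth triggers the faulty resize, so the saved T2 test-set, whose longest sequences contain only three method calls before the state predicate, can never expose the fault.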
4.2 Test Coverage of a Subclassed or Extended Object
In more complex examples of subclassing, the refinement introduces new behaviour,
which partitions all existing states. Figure 6 illustrates the development of an abstract
class hierarchy leading to concepts like the loan items in a library. The upper state
machine describes the abstract behaviour of a Loanable entity, which oscillates be-
tween its Available and OnLoan states. The lower state machine describes a LoanItem
entity that extends the Loanable entity. This is a product machine with four states,
resulting from the concurrent composition of the Loanable machine with a supplemen-
tary Reservable machine (not illustrated), which, we may infer, oscillates between
Unreserved and Reserved states. The resulting four states are named {OnShelf, PutA-
side, NormalLoan, Recalled}. The behaviours of loaning and reserving are dependent
on each other in interesting ways.
First, we check the refinement for compatibility. The four states completely parti-
tion the two states of Loanable, so rule 1 is satisfied. The new methods {reserve,
cancel} introduced in LoanItem stay within the prescribed region boundaries, so rule
2 is satisfied. Looking now at the splitting of transitions required by rule 3, while
UK Software Testing Research III
return has been split by the partitioning of OnLoan into two states {NormalLoan,
Recalled}, the borrow transition is more interesting. One partial transition from OnShelf allows the loan to go ahead. The other partial transition from the PutAside state
is guarded, and only succeeds if the LoanItem is borrowed by the same person who
reserved it previously. While such behaviour is reasonable, it makes LoanItem in-
compatible with Loanable. The refinement of the borrow transition breaks rule 3,
since the partials are not a complete partition of the original. From Loanable’s per-
spective, borrow always succeeds from the Available state, whereas it sometimes fails
for a LoanItem. This illustrates the practical effect of breaking refinement rules.
However, compatibility may be restored by adding a borrow transition from the Available state to itself, in the Loanable abstract class, indicating the anticipated null operation. The abstract state machine is then nondeterministic, since the choice of the successful or failing borrow transition cannot yet be decided.
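Rule 3 — that the partial transitions introduced by a refinement must completely partition the original transition — can be checked on this example with a small sketch. The guard encoding and the domain of borrowers are ours, chosen for illustration:

```python
# In Loanable, borrow is enabled for every borrower b from Available.
# In LoanItem, the Available region splits into OnShelf and PutAside,
# and borrow from PutAside carries the guard [a = b] (a = reserver).
def loanitem_borrow_enabled(state, reserver, borrower):
    if state == "OnShelf":
        return True
    if state == "PutAside":
        return borrower == reserver   # guarded partial transition
    return False

borrowers = ["alice", "bob"]          # hypothetical domain of borrowers

# Rule 3 holds only if, in every sub-state of Available, the partial
# transitions jointly accept everything the original borrow accepted:
complete = all(
    loanitem_borrow_enabled(state, "alice", b)
    for state in ("OnShelf", "PutAside")
    for b in borrowers)

assert complete is False   # bob cannot borrow from PutAside: rule 3 broken
```

The failing check corresponds exactly to the incompatibility described above: from Loanable's perspective borrow always succeeds, whereas the guarded LoanItem partial sometimes refuses.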
[Figure 6 diagram: the Loanable machine has states Available (loaned() = false) and OnLoan (loaned() = true), connected by borrow(b) and return(). The LoanItem machine has states OnShelf (reserved() = false), PutAside (reserved() = true), NormalLoan (reserved() = false) and Recalled (reserved() = true), connected by reserve(a), cancel(), return(), and the borrow transitions borrow(b), borrow(b)[a ≠ b] and borrow(b)[a = b].]
Fig. 6. The upper state machine captures the behaviour of a Loanable entity, with the methods {borrow, return}. The lower LoanItem machine extends this with reservations, combining the behaviour of {borrow, return, reserve, cancel}. The refined machine is not yet wholly compatible with the base machine, but this can be addressed.
Next, we consider the issue of test coverage. Assuming that a T2 test-set is gener-
ated from the Loanable specification in figure 6, this will robustly confirm that bor-
row and return succeed and fail correctly (for a Loanable instance), even in the pres-
ence of “ghost” versions of the OnLoan and Available states. However, when the
same tests are reapplied to the extended LoanItem, they will only cover half of the
partitioned states. The saved T2 test-set includes the sequences: {<isAvailable>,
<borrow, isOnLoan>, <return, exception>, <borrow, return, isAvailable>, …} and no
UKTest 2005
sequence will contain reserve or cancel, which are first introduced in the subclass’s
protocol. The test-set will therefore oscillate between the states {OnShelf, NormalLoan} and will not reach the states {PutAside, Recalled}. Because of this, only half
of the borrow and return transitions will be exercised in the refinement, compared to
all of them in the original. Partitioning states always results in splitting transitions.
Consider now that every pair of methods like {borrow, return} and {reserve, cancel}
introduces further partitions in every existing state. The proportion of the original
transitions still covered falls off as a geometrically decreasing fraction in each succes-
sive refinement. Contrary to popular expectations that recycled regression tests con-
firm the base object’s behaviour in the refined object, regression tests actually cover
significantly less of the base object’s state space in each successive refinement.
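Under the assumption stated above — each refinement step introduces a method pair that splits every existing state in two — the fraction of the base object's transitions reached by a recycled test-set can be written down directly. This is a sketch of the trend only; real refinements need not partition so uniformly.

```python
def recycled_coverage(refinement_steps: int) -> float:
    """Fraction of the original transitions a recycled test-set still
    exercises after the given number of state-partitioning refinements,
    assuming each step splits every existing state in two."""
    return 0.5 ** refinement_steps

# Loanable -> LoanItem is one such step: half the borrow/return
# transitions are exercised, as in the example above.
assert recycled_coverage(0) == 1.0
assert recycled_coverage(1) == 0.5
assert recycled_coverage(3) == 0.125
```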
5. Conclusions: Regression versus Regeneration
The weakness in conventional regression testing comes from recycling saved test-sets
as a whole, rather than reconstructing test-sequences from scratch. This culture goes
back to the parallel design and test architecture [1, 2] (see section 1), in which test
suites are saved as methods of the THO and are inherited as a whole. The prospect of
reusing whole test suites is so beguiling that it is hard to refuse, especially after the
effort invested in developing the tests in the first place. Likewise, in JUnit [3, 4], test
scripts are saved and recycled as a whole, in the expectation that this provides a guar-
antee against the effects of entropy in the modified code.
5.1 Overestimation of Regression Test Coverage
Programmers do not expect regression tests to exercise the new features introduced in
the refinement. For this, they develop additional tests, sometimes exercising the new
features in combination with old features. However, they do expect the regression
tests to exercise all of the original features completely. This corresponds to an impov-
erished view of refinement, as illustrated by the model M4 (see section 2.2 above).
The state space of a valid refinement is actually much greater, more like the model L2
(see section 2.3 above).
Unfortunately, recycled test-sets always exercise significantly less of the refined
object than the original. As the state-space of the modified or extended object in-
creases, the guarantee offered by retesting is progressively weakened. This under-
mines the validity of popular regression testing approaches, such as parallel design-
and-test, test set inheritance and reuse of saved test scripts in JUnit. To achieve the
same level of coverage, it is vital to test all the interleavings of new methods with the
inherited methods, so exploring the state-transition diagram completely. This simply
cannot be done reliably by human intuition and manual test-script creation.
5.2 Completeness of Regenerated Test Sets
In the proposed approach, the test-sets for refined object types, such as the DynamicStack or the LoanItem introduced in section 4, should be regenerated entirely from
scratch, using the algorithm from section 3. With even very simple object state ma-
chine specifications, this process can be automated, generating test-sets to the desired
T1, T2, T3… confidence levels.
The regenerated tests are not regression tests in the normal sense, but all-new tests
in which the state-space of the refined OUT is fully explored. Regenerating the test-
set works equally well, whether or not the OUT is a behaviourally compatible refine-
ment of some original object, since the test-set is derived directly from the refined
specification, not the original one. For this reason, the proposed re-testing approach is
robust under all kinds of software evolution, whether this is by subclassing, by refac-
toring or by simple textual editing of the OUT, and works independently of behav-
ioural compatibility. However, regenerated tests do satisfy the expectations of regres-
sion testing, in that they test up to the same confidence-levels as the original tests.
In common with all test-sets generated from object state machines, regenerated
tests provide specific guarantees for specific amounts of testing. Because the test-sets
are generated systematically, the tester may choose whether to test using T1, T2, T3…
etc. up to the desired level of k in the algorithm. The significance of this is that the
same levels of guarantee may be provided for both the original and retested objects,
something that is not possible with conventional regression testing using recycled test-
sets, for which the guarantees are progressively weakened in each new context.
5.3 Testing to a Repeatable Level of Quality
This paper turns a number of regression-testing concepts on their head. Conventional
regression testing assumes that a refined object is compatible with its unrefined pre-
cursor if it passes the same tests [2, 3, 5, 21]. This was shown to be false in section 4
above. Compatibility cannot be assured directly through re-testing, but it can be
proved indirectly by verification in a formal model. Figure 7 shows the different phi-
losophies.
Compatibility is redefined as a verifiable refinement relationship between two ob-
ject specifications. Each OUT may only be proven to conform to its own specifica-
tion, by a specific test-set generated from that specification (the B-test and R-test sets
in figure 7). The refined OUT is then only provably compatible with the basic speci-
fication by virtue of the transitive composition of the R-test conforms and refines
relationships.
The strength of the guarantee obtained in conventional regression testing is badly
overestimated. Recycled test-sets exercise significantly less of the refined object than
the original, such that re-tested objects may be considerably less secure, for the same
testing effort. By comparison, in the test regeneration approach, it is possible to pro-
vide specific guarantees for levels of confidence in the OUT. After the OUT has been
refined, the same levels of confidence may be retained after re-testing using fully
regenerated test-sets. This notion of guaranteed, repeatable quality is a new and
important concept in object-oriented testing.
[Figure 7 diagram: under Regression, OUTBasic and OUTRefined each B-test conform to the single OSpecBasic. Under Regeneration, OUTBasic B-test conforms to OSpecBasic, OUTRefined R-test conforms to OSpecRefined, OSpecRefined refines OSpecBasic, and OUTRefined therefore transitively conforms to OSpecBasic.]
Fig. 7. The new philosophy for testing. The Refined OUT does not conform to the Basic
OSpec because it B-test conforms to that specification, but rather because it R-test conforms to
the Refined OSpec, which is a provably correct refinement of the Basic OSpec.
5.4 Links with Simulation in Process Calculi
As demonstrated in section 2.2, Cook and Daniels’ [13] examples of statechart re-
finement are all equivalent to the classical refinement of automata, which judges com-
patibility by trace inclusion [15]. This works so long as the subtype object aliased
through the supertype handle is only manipulated through the protocol of that super-
type. In more realistic execution contexts, objects may be aliased simultaneously by
handles of many types. This is in fact quite common in object-oriented design, where
generic algorithms are factored into parts introduced at different levels in the inheri-
tance hierarchy (see the Template Method design pattern [24, p325]). In this context,
an object may be manipulated by more than one protocol, and messages from the
different protocols may be interleaved, which may cause deadlocks [22, 23].
We showed in section 2.2 above how an M4 object could be manipulated through
the protocol of M0, until it receives <c> through another M4 protocol, at which point
the M0 protocol deadlocks. M4 is not strongly compatible with M0, although it
clearly includes the traces of M0. We therefore draw an analogy with Milner’s π-
calculus [14], which contrasts trace inclusion with the stronger simulation relation-
ship. From the viewpoint of the M0 protocol, unseen events that affect the aliased
object through the simultaneous M4 protocol are “invisible actions”, rather like τ-
actions in π-calculus. Weak simulation is where one process behaves like another up
to null assumptions about invisible τ-actions (i.e. that they do not affect behaviour).
The contrasting strong simulation is where one process behaves like another in all
contexts, irrespective of the τ-actions’ unseen behaviour. Our behavioural compatibil-
ity is like strong simulation, because the protocol of the supertype is preserved, no
matter what invisible actions may be interleaved by the protocols of subtype handles.
This is achieved by making sure, in rules 1 and 2, that invisible actions cannot force a
refined object into a state that is unrecognised by its supertype’s protocol. The rules
are therefore normative, since simulation follows from this.
5.5 Acknowledgement
This research was undertaken as part of the MOTIVE project, supported by UK
EPSRC GR/M56777.
References
1. McGregor, J. D. and Korson, T.: Integrating Object-Oriented Testing and Development
Processes. Communications of the ACM, Vol. 37, No. 9 (1994) 59-77
2. McGregor, J. D. and Kare, A.: Parallel Architecture for Component Testing of Object-
oriented Software. Proc. 9th Annual Software Quality Week, Software Research, Inc. San
Francisco, May (1996)
3. Beck, K. Gamma E. et al.: The JUnit Project. Website http://www.junit.org/ (2003)
4. Stotts, D., Lindsey, M. and Antley, A.: An Informal Method for Systematic JUnit Test
Case generation. Lecture Notes in Computer Science, Vol. 2418. Springer Verlag, Berlin
Heidelberg New York (2002) 131-143
5. Wells, D.: Unit Tests: Lessons Learned, in: The Rules and Practices of Extreme Program-
ming. Hypertext article http://www.extremeprogramming.org/rules/unittests2.html (1999)
6. Rumbaugh, J., Blaha, M., Premerlani, W., Eddy, F. and Lorensen, W.: Object-Oriented
Modeling and Design, Prentice Hall, Englewood Cliffs, NJ, 1991
7. Object Management Group, UML Resource Page. Website http://www.omg.org/uml/
(2004)
8. Harel, D. and Naamad, A: The STATEMATE Semantics of Statecharts. ACM Trans. Softw.
Eng. and Meth., Vol. 5, No 4 (1996), 293-333
9. Bjorkander, M: Real-Time Systems in UML (and SDL), Embedded Systems Engineering,
October/November, 2000, http://www.telelogic.com/download/paper/realtimerev2.pdf
10. McGregor, J. D. and Dyer, D. M.: A Note on Inheritance and State Machines. Software
Engineering Notes, Vol. 18, No. 4 (1993) 61-69
11. McGregor, J. D.: Constructing Functional Test Cases Using Incrementally-Derived State
Machines. Proc. 11th International Conference on Testing Computer Software. USPDI,
Washington (1994)
12. Liskov, B., and Wing, J. M.: A New Definition of the Subtype Relation, Proc. ECOOP
’93, LNCS 707, Springer Verlag, 1993, 118-141
13. Cook, S. and Daniels, J.: Designing Object-Oriented Systems: Object-Oriented Modelling
with Syntropy. Prentice Hall, London (1994)
14. Milner, R.: Communicating and Mobile Systems: the π-Calculus, Cambridge University
Press, 1999.
15. Ebert, J. and Engels, G.: Dynamic Models and Behavioural Views. International Sympo-
sium on Object-oriented Methods and Systems. Lecture Notes in Computer Science, Vol.
858. Springer Verlag, Berlin Heidelberg New York (1994)
16. Chow, T.: Testing Software Design Modeled by Finite State Machines. IEEE Transactions
on Software Engineering, Vol. 4 No. 3 (1978) 178-187
17. Binder, R. V.: Testing Object-Oriented Systems: a Status Report. 3rd edn. Hypertext article
http://www.rbsc.com/pages/oostat.html (2001)
18. Holcombe, W. M. L. and Ipate, F.: Correct Systems: Building a Business Process Solution.
Applied Computing Series. Springer Verlag, Berlin Heidelberg New York (1998)
19. Ipate, F. and Holcombe, W. M. L.: An Integration Testing Method that is Proved to Find
All Faults. International Journal of Computational Mathematics, Vol. 63 (1997) 159-178
20. Bernot, B., Gaudel, M.-C. and Marre, B.: Software Testing Based on Formal Specifica-
tions: a Theory and a Tool. Software Engineering Journal, Vol. 6, No. 6 (1991) 387-405
21. Beck, K.: Extreme Programming Explained: Embrace Change. Addison-Wesley, New York
(2000)
22. Simons, A. J. H., Stannett, M. P., Bogdanov, K. E. and Holcombe, W. M. L.: Plug and Play
Safely: Behavioural Rules for Compatibility. Proc. 6th IASTED International Conference
on Software Engineering and Applications. SEA-2002, Cambridge (2002) 263-268
23. Simons, A. J. H.: Letter to the Editor, Journal of Object Technology. Received December
5, 2003. Hypertext article http://www.jot.fm/general/letters/comment_simons_html (2003)
24. Gamma, E., Helm, R., Johnson, R. and Vlissides, J.: Design Patterns: Elements of Reus-
able Object-Oriented Software, Addison Wesley (1995)
Exploring test adequacy for database systems
David Willmor and Suzanne M Embury
Informatics Process Group, School of Computer Science,
University of Manchester, Oxford Road, Manchester, M13 9PL, United Kingdom
d.willmor|s.m.embury@cs.man.ac.uk
Abstract
Database systems are an important asset for many businesses. As such, it is important to test database systems thoroughly, as any faults that remain hidden may significantly impact critical business processes. However, these systems bring additional complexities that make them amongst the most difficult kinds of system to test. While software testing in general is a well-developed area, techniques specifically aimed at testing database systems are still in
their infancy. In this paper, we present a family of test adequacy criteria for database systems that can be used to deter-
mine the “quality” of a test suite. These criteria consider various aspects of database systems including the source code
structure (in terms of patterns of database operations), the existence of define–use pairs between database operations
and the interactions between different applications of the database system. The criteria we present differ from existing
adequacy criteria as we focus on a general definition of a database test case that is based on intensional constraints. This
overcomes the problems associated with adequacy being constrained to a single static database state. We also consider
transactional operators that alter the behaviour of a database system and influence adequacy.
Keywords: software testing; database systems; test adequacy criteria
1. Introduction
Database systems are an important asset for many organisations, since they contain vital business data (both historical
and current) and support critical business processes. Because of this the testing of database systems is an important
concern. Database systems often present a strong integration of computation, data and communication aspects. Not only
do they incorporate a secondary persistent database state that is used for the storage and querying of data, but they are
also often spread across a number of logical tiers. The type of system typically considered by research into software
testing exists solely within a single program state, which is volatile in nature, and so its testing reflects this. Techniques
are often focussed on: the definition and use of variables; the passing of control; the coverage of statements and paths;
and outputs from the system. Each of these techniques is based on the computational aspects of the software system —
how an output comes to exist. However, when testing database systems, data and communication aspects must also be
explicitly considered during testing. Such considerations result in new testing challenges.
A database system consists of two forms of state: program and database. These states are not simple variants of
each other; they are fundamentally different in several ways. Where program state is organised as a collection of named
locations where individual data values can be stored, database state is organised relative to the concepts provided by a
data model (such as the relational model). Large segments of a database state can be accessed or modified through the
execution of a single database statement, in contrast to a typical program statement, which will define at most
one variable. Moreover, changes to the state are controlled by a transaction mechanism and may be undone by some later
statement. Understanding how the state may be changed is further complicated due to the huge space of possible database
states. Finally, database state is persistent, so that changes made during one execution may affect the behaviour of later
executions. Whilst one test case may execute correctly, its effect on the database state may affect other test cases, possibly
causing an error. This can result in errors that are hard to isolate and correct.
The communication aspects of a database system handle the movement of data between program and database state.
Program and database states communicate using database operations often in the form of SQL, a powerful set–oriented
language with explicit semantics. SQL is used for both the definition of the database schema and for querying the data
captured by it. Often, some form of database access layer (such as JDBC for Java) is used to allow SQL queries to be
embedded directly into the program code. The locations of embedded queries (where communication occurs between the program and the database) are likely locations of faults. Therefore, these locations are often a focus of the software testing
process for database systems.
Often, in a business setting, multiple programs will interact with a single database. For example, a sales application
will modify the stock levels of items to reflect purchases, a warehouse application will also modify stock levels to reflect
deliveries and a management system may analyse stock levels whilst investigating trends. Due to the persistent nature
of database state, the effect that one system has on the database will affect the behaviour of other completely separate
systems. Whilst this effect may often be intentional, it may lead to unanticipated behaviours and possible faults.
In this paper, we present a generalised view of what database systems and test cases are, one that can be applied to the many
different types of database systems that exist (Section 2). This view is based on the concept of intensional constraints that
allow real world (and dynamic) database states to be used for testing. In Section 3, we present a family of test adequacy
criteria based on structural and data–oriented aspects of a database system. These criteria allow us to address what
sufficient testing is. These criteria also differ from those presented in existing work in that they are based on reasoning over a dynamic database state. Thus, our adequacy criteria are not constrained to one specific instance of database state. Finally,
in Section 4 we present concluding remarks and discuss possible avenues for future work.
2. A view of database testing
Considering the widespread use of database systems there has been relatively little research into their testing. The
work that has been produced differs by a number of factors, not least in the terminology that is used. In order to provide
consistency in this paper we use the following terminology:
Application: a software program designed to fulfil some specific requirement. For example, we might have separate
application programs to handle the entry of a new customer into the database, and to cancel dormant accounts once
a time-limit has passed.
Database: a collection of interrelated data, structured according to a schema, that serves one or more applications.
Database application: an application that accesses one or more databases. A database application will operate on both
program and database state.
Database system: a logical collection of databases and associated (database) applications.
Testing is more difficult (or, at least, different) when dealing with database applications. The full behaviour of a database
application program is described in terms of the manipulation of two very different kinds of state: the program state and
the database state. It is not enough to search for faults in program state; we must also generate tests that seek out faults that
manifest themselves in the database state and in the interaction between the two forms of state. A further complication for
testing is that the effects of changes to the database state may persist beyond the execution of the program that makes them,
and may thus affect the behaviour of other programs [18]. Thus, it is not possible to test database programs in isolation,
as is done traditionally in testing research. For example, a fault may be inserted into the database by one program but then
propagate to the output of a completely different program. Hence, we must create sequences of tests that search for faults
in the interactions between programs. This issue has received little attention from the testing research community. It has been shown to be particularly important for regression testing, where a change to the functionality of one program may adversely affect other programs via the database state [18].
The literature on testing database systems varies in a number of ways. A fundamental difference in the literature is in
the understanding as to exactly what a database system is. Each definition is constrained to a particular situation. There is
no definition general enough to be applied to the different scenarios in which database systems may be used. The simplest
view is when a single application interacts with a single database [3, 4, 8, 7]. This has been moderately extended to handle
the situation in which multiple databases exist [13]. Whilst the situation in which multiple applications interact with a
database has been considered in a constrained form [12, 18] there does not exist a generalised definition that is applicable
to both this situation and the previous ones. Therefore, the following is a general definition of a database system that is
applicable to all existing work on database testing:
Definition 1. A database system consists of:
• a collection of database applications P1, P2, . . . , Pn,
• a collection of databases D1, D2, . . . , Dm,
• a schema Σ describing the databases.
Conceptually, we can view the collection of databases as a single logical database D that matches the data model Σ. Multiple databases are often used because, from an implementation perspective, they are easier to understand, manage and optimise. Also, database systems are often not constructed from scratch; they must often make use of existing databases. We do not constrain Σ to a particular data model (for example, relational [5, 6], object–relational [6, 17] or object–oriented [1, 6, 14]); however, for readability, we assume for the remainder of this paper that it is relational.
As with the definition of a database system there is no agreed view as to what a database test is, but an informal
consensus is beginning to emerge. The following is a definition of database test cases and suites that can form the
foundation for the proposals for test adequacy criteria (described in the next section) and for future work. A test case
usually involves stimulating the system using some form of input, action or event. The output from the system is then
compared against a specification describing what is expected and any faulty behaviour identified. In terms of database
systems, the concept of a test case becomes more complicated. Not only must we consider program inputs and outputs we
must also consider the input and output database states. A database test case must therefore describe what these database
states are. For initial database states, existing proposals either adopt an extensional approach [13] or do not consider
database state on a per-test basis, instead specifying a fixed initial database state for all tests [3, 4, 7, 8]. For output states,
existing approaches adopt either an extensional approach [13] or intensional approach [3, 4, 7, 8].
A robust approach for testing database systems should specify both initial and output database states intensionally. This
allows test cases to be executed on a variety of different states (often real world or changing states) allowing for more
realistic testing. Before justifying this we present our definition of a database test case and then discuss the advantages of
an intensional approach:
Definition 2. A test case t is a quintuple 〈i, ∆ic, P, o, ∆oc〉 where:
• i is the application input,
• ∆ic are the intensional constraints the initial database state must satisfy,
• P is the program on which the test case is executed,
• o is the application output, and
• ∆oc are the intensional constraints the output database state must satisfy.
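Definition 2 can be transcribed directly into a data structure in which the two constraint components are predicates over a database state rather than stored states. The field names and the toy dict-based state below are our own illustration, not the paper's notation:

```python
from dataclasses import dataclass
from typing import Any, Callable

State = dict  # toy stand-in for a database state: table name -> rows

@dataclass
class DatabaseTestCase:
    """A test case t = <i, Dic, P, o, Doc> per Definition 2 (a sketch)."""
    app_input: Any                        # i
    pre: Callable[[State], bool]          # Dic: intensional input constraint
    program: Callable[[State, Any], Any]  # P: program under test
    expected_output: Any                  # o
    post: Callable[[State], bool]         # Doc: intensional output constraint

    def run(self, db: State) -> str:
        if not self.pre(db):
            return "inapplicable"         # state not valid for this test
        out = self.program(db, self.app_input)
        ok = out == self.expected_output and self.post(db)
        return "pass" if ok else "fail"
```

Because pre and post are predicates, the same test case can be applied to any database state that satisfies ∆ic, which is the point of the intensional approach.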
In this definition P , i and o represent the same concepts as the traditional notion of a test case. The database aspects of
the test case are described by constraints ∆ic and ∆oc. We have chosen to specify the input and output database states
using intensional constraints as they allow us to address a number of limitations with extensional states. In terms of input
states, extensional states are: difficult to store, especially where either database states or test suites are large; difficult to
maintain as each state must often be modified to reflect changes to the test case, application or data model; and difficult
to ensure they reflect the real–world and changes to the database state that may occur over time. In terms of the output
state, extensional states are: expensive to determine if two large states are identical; difficult to maintain as the output state
must be modified to reflect changes to the input state and the functionality of the system; and time consuming to manually
create states that reflect complex behaviour that a test case may exhibit on the initial state. Our intensional technique
specifies constraints that a test case must satisfy to determine (a) applicability (if the input state is valid for the test case)
and (b) success (if the output state is correct). Consider the following very simple example in which a new customer is
added to the database:
Test Case 1: add a new customer with <name>, <email> and <postcode>
• ∆ic : initial state constraint
1. no customer C in CUSTOMER has C.NAME=<name>, C.EMAIL=<email>
and C.POSTCODE=<postcode>
• ∆oc : output state constraint
1. exactly one customer C in CUSTOMER has C.NAME=<name>,
C.EMAIL=<email> and C.POSTCODE=<postcode>
This test case is relatively simple and imposes a single input constraint that specifies that no customer should exist in
the database that matches the customer to be added. The output constraint specifies that after executing the test case the
database should contain exactly one customer matching the customer to be added. We specify exactly one customer in the
output constraint as it allows us to cover faults where no customer was added and where multiple customers were added.
The use of intensional constraints against a real–world database raises the question of how we can deal with situations in
which the initial constraint does not hold. This is important because, whilst using a real–world database state provides us with realistic data, it does not let us create opportunities for exposing faults that might arise in the future but which are not present in the existing data.
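Test Case 1's constraints translate directly into queries over a live database. The sketch below uses Python's sqlite3 as a stand-in DBMS; the schema, the add_customer program and the sample input are assumptions for illustration:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE CUSTOMER (NAME TEXT, EMAIL TEXT, POSTCODE TEXT)")

def matching_customers(db, name, email, postcode):
    """Count rows matching the test case's <name>, <email>, <postcode>."""
    cur = db.execute(
        "SELECT COUNT(*) FROM CUSTOMER "
        "WHERE NAME=? AND EMAIL=? AND POSTCODE=?",
        (name, email, postcode))
    return cur.fetchone()[0]

def add_customer(db, name, email, postcode):   # the program under test
    db.execute("INSERT INTO CUSTOMER VALUES (?, ?, ?)",
               (name, email, postcode))

new = ("A. Customer", "a.customer@example.org", "M13 9PL")  # sample input

assert matching_customers(db, *new) == 0  # Dic: no matching customer yet
add_customer(db, *new)
assert matching_customers(db, *new) == 1  # Doc: exactly one match
```

The same checks run unchanged against any initial state satisfying ∆ic, rather than against one fixed extensional state.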
A test case aims to test a particular use of a system. However, database systems exhibit significantly more complex
functionality. For example, a sequence of related tasks may be carried out by a user interspersed with tasks of other users.
Tasks may also be spread across a number of individual programs. These cannot be captured by the execution of a single
test case since our definition of a test case assumes a single program execution. Consider the situation in which a test case
t1 adds an item to a shopping cart and t2 increases the quantity of the item added. If t1 does not correctly add the item, it
is not possible for t2 to increase its quantity. Therefore, the execution of t2 may fail not as a result of a problem with the
program but because t2 is dependent upon t1. This dependency problem can be addressed by modifying database state to
satisfy the initial constraints. However, this approach has a number of limitations. The simplest are due to the resources
required for generating database states. The most important is that, whilst we can satisfy t2's requirements from t1, we are unsure whether t1 has an unforeseen impact on t2. For example, a test case may change part of the database state
that can adversely affect the behaviour of a subsequent test case. Therefore, it is obvious that certain behaviours require
the execution of individual tests in an ordered sequence. A test sequence s is a sequence of test cases 〈t1, . . . , tn〉. Each
test of the sequence is executed in the specified order. If a test case does not meet its output conditions (the test fails), the
user is notified of the failure. The database state is then modified to allow the sequence to proceed. However, the test result
of the sequence is flagged to tell the user that it did not execute correctly. This is done, instead of simply stopping the
sequence, as the tests still provide a certain confidence in the system. Our approach to test sequences allows an individual
test case to exist in a number of test sequences. It can also be observed that test sequences can be used for more than
testing complex functionality. It can potentially take a lot of effort to set up a database for a particular test case. If several
test cases require similar input databases, then it will be much more efficient to run them all against the same database.
For example, consider the situation where a database contains records for customers. In an example sequence, the first
test case would create a customer; the second would modify the customer; and the third would delete the customer. Each
test case represents important functionality of the system which are all related through the use of the same customer. It is
therefore more efficient to use a sequence to group related test cases.
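The sequence semantics described above can be sketched as follows. This is an illustrative outline only: the class names, the dictionary-based state, and the trivial repair strategy are our assumptions, not definitions from the paper.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DBTestCase:
    """Illustrative test case: initial constraint, program run, output condition."""
    name: str
    pre: Callable    # intensional constraint the initial state must satisfy
    action: Callable # single program execution: state -> state
    post: Callable   # output condition on the resulting state

def run_sequence(sequence, state, repair):
    """Execute tests in order; flag failures, repair state, and continue."""
    failed = []
    for t in sequence:
        if not t.pre(state):
            state = repair(t, state)   # modify state so the initial constraint holds
        state = t.action(state)
        if not t.post(state):
            failed.append(t.name)      # the user is notified; the sequence proceeds
            state = repair(t, state)
    return state, failed               # a non-empty failed list flags the sequence

# Customer example: create, modify, then delete the same customer.
seq = [
    DBTestCase("create", lambda s: True,
               lambda s: {**s, 1: "Ann"}, lambda s: 1 in s),
    DBTestCase("modify", lambda s: 1 in s,
               lambda s: {**s, 1: "Anne"}, lambda s: s[1] == "Anne"),
    DBTestCase("delete", lambda s: 1 in s,
               lambda s: {k: v for k, v in s.items() if k != 1},
               lambda s: 1 not in s),
]
final, failed = run_sequence(seq, {}, repair=lambda t, s: s)
```

In a real setting the repair function would regenerate or patch the database state, which is exactly the costly step the sequence mechanism tries to amortise across related test cases.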
3. Test adequacy criteria for database systems
A test suite is a collection of test cases (or test sequences in our approach) usually targeted towards the verification of the
entire system or a specific section of the system. The manner in which a test suite is generated varies between different
situations. The simplest is to randomly generate tests for the system. However, it is common to use more formalised
techniques based on some aspect of the system, including: the system's specification, observations of the system being
used, and the structure of the system's source code. Testing every possible input is impractical for anything but the simplest
of programs. For database systems this becomes impossible. Thus, if we cannot completely test a system, what is
sufficient testing? Current work into software testing has proposed a number of test adequacy criteria that if satisfied will
sufficiently test the system according to some characteristic of the specification or implementation. In terms of database
testing, only one set of criteria exists for determining the quality of a test suite [13]. This approach is based on determining
for each database operation an extensional representation of the portion of the database defined or referenced by the
application. This approach is limited for a number of reasons: (a) It generates an extensional representation in which
non–determinism countered by using a fixed initial state. However, this means that the test suite can only be considered
valid for this particular state. (b) It does not consider the effect transaction operators have on the definition–use pairs and
so will include unnecessary test cases for interactions that will not occur. (c) Nor does it consider multiple applications or
instances of the same application accessing a single database. The only other work on test adequacy for database systems
is by Suarez-Cabal and Tuya [2] in which they present a metric for testing coverage of an SQL SELECT query, and a
method for detecting when tuples need to be added to the database to ensure better coverage, based on analysis of the
corresponding query tree (a query tree is conceptually similar to an abstract syntax tree for programming languages). Their work aims to determine adequacy of a single query (specifically the SELECT query)
and does not consider the behaviour of the system as a whole.
In this section we present a number of test adequacy criteria based on our intensional specification of database state
and intensional descriptions of the behaviour of database operations. The criteria presented are focussed on the structural
and data-oriented elements of database systems. Briefly, structural elements include branches, loops and procedure calls.
Data-oriented elements are the points at which data is defined and used in the program. However, first we present a brief
discussion about the types of faults that can occur within a database system.
The fundamental issue in database testing is whether the application behaves as specified [3, 4]. From a simple perspective this can be seen as determining if the output from a database system matches its required output. Bearing in mind
that database state is persistent, it can be observed that the output database state is dependent not only on the input to the
program but also the initial (or input) database state. Therefore, a test case execution can be seen as moving the database
from one state to another. It is this transition that we aim to verify. The first type of fault is simply that the implemented
functionality does not match the specified functionality. Other faults include: attempting to access a database entity that
does not exist, operations attempting to violate the database's constraints (such as primary key or referential integrity), and
transactions being aborted or committed incorrectly. These types of faults manifest themselves either in the database state
or as a result of an interaction between program and database state.
3.1. Structural test adequacy criteria
Structural test criteria are a commonly used class of software test adequacy criteria [20]. This form of testing is based on a
structural model that represents the physical implementation of the software application. Kapfhammer and Soffa’s [13]
approach to test adequacy is based on an extended version of a control flow graph in which extra edges are included to
capture the dependencies that exist between database operations. However, this model only captures dependencies that
exist between database operations in a single procedure. Instead we base our model on a representation that completely
describes the structure of a database system and its composite components.
Definition 3. Each application P of the database system is modelled as an interprocedural graph consisting of:
• CP , the set of control flow graphs, where each ci ∈ CP corresponds to a procedure mi of program P .
• I is a graph where each edge represents a procedure call.
This is a general model that can be tailored towards particular implementations. For example, for a Java program I is
an interclass relation graph (IRG) representing the relationships between the classes (and their methods) in P [10]. The
IRG models the complexities associated with object–oriented languages, including variable and object type information;
internal and external methods; interprocedural interactions; inheritance, polymorphism and dynamic binding; and exception handling. In the model described in Definition 3 each statement of a program is captured by a particular node.
These nodes can be categorised and annotated with additional information. In particular we use the concept of a database
operation type:
Definition 4. A database operation δ is a node of a control flow graph that consists of some form of interaction with a
database D. Each δ is annotated with:
• δ.Σadd the subset of D that is updated by δ,
• δ.Σdel the subset of D that is deleted by δ, and
• δ.Σread the subset of D that is read by δ.
The sets δ.Σadd, δ.Σdel and δ.Σread allow us to reason over the interactions with the database and possible relations between different database operations. The subsets of the database are represented intensionally as relational algebra expressions (for more information about how these sets are generated, reasoned over and used, please see [18, 19]).
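As an illustration, a δ node and its three annotation sets can be modelled as follows. This is a hedged sketch: the paper represents the sets intensionally as relational algebra expressions, whereas plain Python frozensets of (relation, key) pairs stand in here, and the class and field names are our own.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DBOperation:
    """Sketch of a δ node with its three database-interaction sets."""
    node_id: int
    sigma_add: frozenset = frozenset()   # δ.Σadd: subset of D updated by δ
    sigma_del: frozenset = frozenset()   # δ.Σdel: subset of D deleted by δ
    sigma_read: frozenset = frozenset()  # δ.Σread: subset of D read by δ

    def reads(self) -> bool:
        return bool(self.sigma_read)

    def writes(self) -> bool:
        return bool(self.sigma_add or self.sigma_del)

# An INSERT-like and a SELECT-like operation on the same tuple.
insert_op = DBOperation(1, sigma_add=frozenset({("Customer", 42)}))
select_op = DBOperation(2, sigma_read=frozenset({("Customer", 42)}))
```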
Our structural test adequacy criteria are based upon a static analysis of the model according to a specific criterion. These
structural criteria can be seen as analogous to traditional control flow–based criteria but with the additional characteristics
of database systems incorporated. Similar to control–flow based techniques we utilise the concept of a complete path π
that starts at the graph’s entry node and ends at an exit node [9]. Intuitively, it can be observed that the execution of a
particular test case will result in the execution of a particular complete path (relative to the input to the program and the
state of the database at that moment in time). Therefore we use the notation πt to refer to the complete path relative to
the execution of test case t (often referred to in the literature as the execution trace of t). As our model is based upon control–flow graphs, existing criteria, such as statement, branch
and path coverage, are applicable to database systems. For brevity we refer the reader to [20] for a detailed description of
these criteria. The use of complete paths in test adequacy criteria is complicated by the fact that we describe initial states
using intensional constraints. Therefore, multiple executions of a test case may result in different paths due to changes in
the database state. A test suite is therefore only adequate in terms of the state in which it was executed. For example, a
test suite may only satisfy a subset of the requirements of a criterion when executed against a particular state; however, as
that state changes so may the degree of satisfaction. The test suite executed at a later date may result in a different degree
of coverage.
The database system model includes a special type of node for each database operation in an application. Criterion 1
simply assesses coverage of all database operations without discriminating as to their effect. Whilst a fault may not
be caused directly by a database operation, it is through these statements that database-related faults become detectable,
either by propagation to the database state through state change or by retrieval from the database state. Since each database
operation may potentially reveal a fault, ensuring coverage of all such statements by a test suite gives some guarantee that
a wide variety of database faults will be detected.
Criterion 1. A test suite ts satisfies the All Database Operations criterion if for each database operation δ, there exists
a t ∈ ts such that δ ∈ πt.
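A minimal sketch of how adequacy against Criterion 1 might be checked, assuming we already have the set of database-operation node identifiers and the complete path π_t (as a node collection) for each executed test; the function and variable names are illustrative:

```python
# db_ops: node ids of the δ nodes in the model.
# paths: maps each test case to the node sequence of its complete path.
def all_database_operations(db_ops, paths):
    covered = set()
    for pi_t in paths.values():
        covered |= set(pi_t)         # union of all executed complete paths
    return set(db_ops) <= covered    # every δ must lie on some π_t

paths = {"t1": [1, 2, 3, 9], "t2": [1, 4, 7, 9]}
all_database_operations({3, 7}, paths)     # satisfied: both δ nodes covered
all_database_operations({3, 7, 8}, paths)  # not satisfied: node 8 unreached
```

The same membership check, restricted to read or write operations, yields Criteria 2 and 3.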
Criteria 2 and 3 assess coverage relative to two of the most common types of interactions with a database: retrieval and
update. They each potentially require fewer test cases than Criterion 1, and so can be more efficient when dealing with
kinds of programs where there is some expectation as to the kind of fault that may appear. For example, when testing a set
of batch applications, we may prefer to concentrate our testing effort on updates to the database, ensuring that the result
states produced by the programs are correct. In such programs, database retrieval is used only to pull data into memory in
order to manipulate it before writing it back to the database. Any errors in this part of the code will likely result in incorrect
update operations as well. Alternatively, where a system consists of many report-style or query-browsing applications,
we may wish to focus attention on how data is retrieved from the database, and how it is later manipulated before being
presented to the user, rather than the minor house-keeping updates that occur during report generation.
Criterion 2. A test suite ts satisfies the All Read Operations criterion if for each database operation δ where δ.Σread ≠ ∅, there exists a t ∈ ts such that δ ∈ πt.

Criterion 3. A test suite ts satisfies the All Write Operations criterion if for each database operation δ where (δ.Σadd ≠ ∅) ∨ (δ.Σdel ≠ ∅), there exists a t ∈ ts such that δ ∈ πt.
Although faults may be visible to a particular program in the middle of a transaction, the key point at which they
become fully visible externally is when the changes made by the program are made durable by execution of a transaction
commit. These are key points in the program, when a set of related changes is declared to be either consistent (i.e. legal)
and can therefore be made durable within the database state, or inconsistent and must therefore be undone. Therefore,
by ensuring that all such operations are executed at least once by a test suite, we have some guarantee that our test suite
is exercising a significant proportion of the kinds of database interactions that the application programs implement. This
leads us to propose Criterion 4, in which all commit and abort statements are required to be exercised by the test suite
for it to be considered adequate. In general, this criterion will allow smaller test suites to be used than criteria 2 and 3.
Criterion 4. A test suite ts satisfies the All Commits and Aborts criterion if for each database operation δ where
type(δ, commit) ∨ type(δ, abort), there exists a t ∈ ts such that δ ∈ πt.
The above adequacy criteria are all subsumed by the standard all–statements criterion (sometimes referred to as all–nodes), and therefore share its inherent weaknesses, in that test suites may satisfy them but may still leave a large part of the control flow graphs of a set of programs unexplored. In our context, this disadvantage is most serious in the case of the weakest criterion (Criterion 4). The
motivation behind this criterion is that the test case should execute all transactions within the programs. However, the
structure of most database programs is much more complex than this criterion implies. There will in general be many
ways of reaching a specific commit or abort operation. Many of these will be slight variants on the same basic transaction
behaviour, but others may represent very different transactions that need to be tested.
In other words, rather than being satisfied with testing one path to each commit or abort, we would ideally prefer to test
all such paths. In order to define such an adequacy criterion, we first require a notion of a transaction path.
Definition 5. A transaction path in a program P is a subpath ni, . . . , nj in a complete path of P where:
• the node immediately preceding ni is either START or a commit or abort operation,
• nj is either a commit or an abort operation, and
• the subpath ni, . . . , nj−1 is commit- and abort-free.
Based on this, we can now define an adequacy criterion that requires all transaction paths to be exercised by a test suite:
Criterion 5. A test suite ts satisfies the All Transactions criterion iff every transaction path in P is covered by some test
t ∈ ts.
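Definition 5 can be sketched operationally: scanning a complete path and cutting it at each commit or abort yields the transaction paths. The node-kind labels and the pair representation here are illustrative assumptions, not notation from the paper.

```python
# complete_path: sequence of (node_id, kind) pairs; kinds "commit" and
# "abort" terminate a transaction path (Definition 5).
def transaction_paths(complete_path):
    paths, current = [], []
    for node_id, kind in complete_path:
        current.append(node_id)
        if kind in ("commit", "abort"):
            paths.append(tuple(current))
            current = []   # the next node starts a fresh transaction path
    # a trailing commit/abort-free suffix is not a transaction path
    return paths

pi = [(1, "op"), (2, "op"), (3, "commit"), (4, "op"), (5, "abort")]
transaction_paths(pi)  # two transaction paths: (1, 2, 3) and (4, 5)
```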
In practice, of course, such a criterion would need to be used in conjunction with some mechanism for ensuring that the
presence of cyclic paths does not lead to an infinite number of transaction paths to be tested. For example, Howden’s
boundary interior criterion could be applied [11]. If testing resources are limited (as is usually the case) we may also
want to avoid repeated testing of very similar transaction paths, and would instead want to concentrate testing effort on
covering as wide a variety of transaction paths as possible. This would require some heuristic to be combined with the
criterion, in order to determine which transaction paths are deemed sufficiently different from those already explored to
be worth attention.
In database systems, we have an additional source of structure that can form the basis for further test adequacy criteria.
This is the structure of the database itself. A common strategy when testing database applications, for example, is to choose tests that cover all parts of the schema (or, since databases are often required to support a wide variety of applications, those parts of the schema actually accessed by the programs to be tested, which can in general be determined statically), and all forms of operation on each schema element. For example, if a
new database system is created that includes a Customer table, we would expect there to be code that controls the addition
of new customers to the database, that handles modifications to their details (such as name or address) and that carries
out deletion of data for customers that are no longer deemed to be active. We would also expect there to be at least one
program that reads data from the Customer table, since if it is not used then there is little point in maintaining the data.
A well designed database system will often be created with sub-routines or sub-programs that handle these basic
updates, and which ensure that the same business logic is applied, regardless of what higher level application program
they are called from. (Such systems are often constructed as three-tier systems, with an upper interface layer, a middle
business logic layer and a supporting database services layer.) In testing such systems, we may wish to ensure that each
form of operation on each part of the schema has been tested at least once, rather than testing many calls to the same
operations. This leads to a further structural adequacy criterion, which we define here in terms of the relational data
model (though the same principle could easily be applied to other models):
Criterion 6. A test suite ts for a database system with schema Σ satisfies the All Schema Elements criterion iff for every
relation r ∈ Σ, the following operations are covered by at least one test case in ts (though not necessarily the same test
case):
• an operation which retrieves data from r,
• an operation which inserts new tuples into r,
• an operation which deletes tuples from r, and
• an operation which modifies some existing tuples in r.
Further variants of this criterion can be considered, which operate at a finer granularity of database structure. For example,
we might wish to ensure that our test case will include operations which read from each attribute in each table, as well as
modifying the attribute. Another common area of focus for testing in database applications is on the relationships between
tables (modelled using foreign keys and additional integrity constraints in relational systems), since errors in modelling
the cardinality and optionality of relationships are common in database programming, and can have severe ramifications.
3.2. Define–use test adequacy criteria
Traditional programs are based around the definition and use of variables [15, 16]. A definition occurs when a variable is
on the left hand side of an assignment. A use may either occur: (a) on the right hand side of an assignment (a computation–
use) or (b) in the predicate of a conditional logic statement (a predicate–use). A definition–use pair occurs between a
statement that defines a variable and a subsequent (in the control flow graph) statement that uses that variable and there
is no intervening definition. A number of different criteria have been proposed based on the concept of definition–use
pairs [15, 16]. The simplest include the all-uses criterion (in which all uses must be covered) and the all-du-paths criterion
(in which all paths between definition–use pairs must be covered) [15, 16].
For database applications, define–use pairs relate to the definition and use of parts of the database. Whilst this is more
complicated than program statements, it has been successfully employed in testing [13, 18] and program slicing [19].
We will utilise the approach we proposed previously in the context of slicing [19] and regression testing [18] as it has a
finer level of granularity than the approach of Kapfhammer and Soffa [13] and also includes the effects of the transaction
operators: commit and abort. This is important as a definition–use pair cannot exist if the definition has been aborted
before it is used. The following is a description of a database definition–use pair (please refer to [19] for details of how we reason over database queries and construct the definition–use pairs, which are described there as database–database dependencies):
Definition 6. A database definition–use pair exists between operation δ1 and operation δ2 iff:
1. δ2.Σread ∩ (δ1.Σadd ∪ δ1.Σdel) ≠ ∅, and
2. there is a rollback-free execution path p between δ1 and δ2 such that:
• δ1.Σadd \ p.Σdel ≠ ∅, or
• δ1.Σdel \ p.Σadd ≠ ∅
For brevity, we use the notation δ1 ⇢ δ2 to denote the fact that there exists a definition–use pair between δ1 and δ2. In the above description it can be observed that definition–use pairs will only be encountered if certain paths of the program are traversed (some paths will not be rollback–free). Therefore, in order to specify coverage we define Πδ1⇢δ2 as the set of all paths in which the specific definition–use pair will occur. Using this description of a database definition–use pair it is possible to propose a criterion based on their coverage:
Criterion 7. A test suite ts satisfies the All Database Application Define–Use Pairs criterion if for each δ1 ⇢ δ2, there exists a t ∈ ts such that πt ∈ Πδ1⇢δ2.
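The existence condition of Definition 6 (part 1) can be illustrated with extensional stand-ins for the intensional sets; the dictionary keys and tuple representation are our assumptions:

```python
# δ2 must read something that δ1 added or deleted for a pair to be possible.
def may_form_du_pair(d1, d2):
    return bool(d2["read"] & (d1["add"] | d1["del"]))

d1 = {"add": {("Customer", 1)}, "del": set(), "read": set()}
d2 = {"add": set(), "del": set(), "read": {("Customer", 1)}}
may_form_du_pair(d1, d2)  # a du-pair may exist, still subject to a
                          # rollback-free path between the two operations
```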
In this criterion each instance of a define–use pair should be matched by a test case in which the use occurs after the
definition in the path of the test case. Intuitively, it can be observed that the above definition–use pairs only exist within
a single program. However, given the persistent nature of database state, data is shared between different programs.
Therefore, a program may define a part of the database that may be subsequently used by another program. Given this
situation we are able to specify a definition–use criterion across the entire database system:
Criterion 8. A test suite ts satisfies the All Database System Define–Use Pairs criterion if for each δ1 ⇢ δ2 where δ1 ∈ P1, δ2 ∈ P2, there exists a test sequence s ∈ ts where the complete path of the test sequence πs contains a rollback–free subpath p between δ1 and δ2 such that:
• δ1.Σadd \ p.Σdel ≠ ∅, or
• δ1.Σdel \ p.Σadd ≠ ∅
This criterion aims to cover each instance of a define–use pair that may exist within an entire database system. For each
define–use pair a test sequence should exist in the test suite in which the definition occurs and is then subsequently used. The rollback-free subpath condition checks that between the definition occurring and it being used, its effect on the database state has not been reversed (either through an abort command or an intervening definition).
Figure 1: Test adequacy criteria subsumption hierarchy
3.3. Subsumption hierarchy of test adequacy criteria
The test adequacy criteria proposed in this paper do not produce mutually exclusive test suites. We discuss the inclusiveness of two test adequacy criteria in terms of subsumption. A criterion C1 subsumes a criterion C2 if every test suite
that satisfies C1 also satisfies C2. Figure 1 shows the subsumption hierarchy for the criteria proposed in this paper. The
relationships presented in this hierarchy are justified as follows:
• All Database Operations subsumes All Read/Write/Commits&Aborts Operations and All Schema Elements
These are the simplest of the subsumption relationships. Read, write, and commit & abort operations are each types of database operation. Therefore, a test suite that covers all database operations will inherently cover each of these types.
• All Statements subsumes All Database Operations
A database operation is a particular type of statement that interacts with the database. Therefore, a test suite that
covers all statements inherently covers all database operations.
• All Transactions subsumes All Database Operations
A transaction is a collection of database operations. Every database operation must exist in a transaction (a number of database programming languages and access layers allow certain operations to be performed outside of transactions, or operate in an auto-commit mode in which each operation is automatically committed if it is successful; we view each such operation as existing within its own individual transaction). Therefore, a test suite that covers all transactions inherently covers all database operations.
• All Paths subsumes All Transactions
A transaction is captured by a transaction path, which in turn is a subpath of the system's code. Therefore, covering all paths (which inherently covers all subpaths) will cover all transactions.
• All Database System Define–Use Pairs subsumes All Database Application Define–Use Pairs
Define–use pairs of the system as a whole will also include all define–use pairs located in an individual program.
Therefore, to cover all database system define–use pairs inherently covers all database application define–use pairs.
4. Conclusions and future work
In this paper, we have addressed a number of fundamental issues regarding database testing, particularly: (a) what is a database system? (b) what is a database test case? and (c) what constitutes adequate testing of a database system? In response, we have presented the following contributions:
• Basic definitions for database testing:
– A definition of a database system that is applicable both to the types of systems in use in business and to existing database testing techniques. Our definition is based on the concept that a database system may consist of one or more applications that interact with one or more databases.
– A database test case. This definition uses intensional constraints to specify the requirements on the initial and output database states. The use of intensional constraints (particularly for the input state) allows us to utilise both real–world and artificial database states for testing.
– A database test sequence. This definition describes the execution of a sequence of database test cases aimed
at verifying some form of complex behaviour. Often the behaviour of a system cannot be verified by a single
test case. A test sequence also allows us to group test cases that can operate on the same initial state without
affecting each other.
• Test adequacy criteria for database systems that aim to determine what “adequate” testing is:
– Structural criteria focus upon the structural aspects of the database system. We focus on two forms of structural information: the application source code and the data model. The source code based criteria are based
on the possible patterns of database operations that may be executed. Particularly, we have presented criteria
based on covering different types of database operations, transactional statements and complete transactions.
The data model based criterion aims to cover the different entities of the data model. This translates to covering tables, columns, foreign key relationships, etc. in the relational model.
– Define–use criteria focus upon the relationships between database operations in terms of what subset of the
database they define or use. Our approach differs from existing techniques as the reasoning over the effect
of a database operation on the database is based on intensional descriptions. This allows us to determine all
of the possible define–use pairs and not just those valid for the current database state. We specify define–use
criteria for a single application and the database system as a whole.
• A subsumption hierarchy describing the relationships between our test adequacy criteria and placing them in perspective with classical program-state-based criteria.
This work has presented a number of avenues for further work. The first avenue is an empirical comparison between the
proposed test adequacy criteria. Whilst our subsumption hierarchy provides a descriptive comparison of the differences
between the criteria, it does not tell us anything about the costs associated with determining adequacy, the fault coverage
of different adequate test suites, or the cost of executing each test suite. Furthermore, we plan to investigate possible optimisations to improve testing. This is particularly important for concurrent systems, where the number of possible combinations is combinatorially explosive.
References
[1] J. Banerjee, H.-T. Chou, J. F. Garza, W. Kim, D. Woelk, N. Ballou, and H.-J. Kim. Data model issues for object-oriented
applications. ACM Trans. Inf. Syst., 5(1):3–26, 1987.
[2] M. J. Suárez-Cabal and J. Tuya. Using an SQL coverage measurement for testing database applications. In Proceedings of the 12th
ACM SIGSOFT International Symposium on Foundations of Software Engineering (SIGSOFT FSE), pages 253–262, October-
November 2004.
[3] D. Chays, S. Dan, P. G. Frankl, F. I. Vokolos, and E. J. Weyuker. A framework for testing database applications. In International
Symposium on Software Testing and Analysis (ISSTA), pages 147–157, August 2000.
[4] D. Chays, Y. Deng, P. G. Frankl, S. Dan, F. I. Vokolos, and E. J. Weyuker. An AGENDA for testing relational database applications.
Software Testing, Verification and Reliability, 14(1):17–44, 2004.
[5] E. F. Codd. A relational model of data for large shared data banks. Communications of the ACM (CACM), 13(6):377–387, 1970.
[6] T. Connolly and C. Begg. Database Systems. Addison-Wesley, 3 edition, 2002.
[7] Y. Deng and D. Chays. Testing Database Transactions with AGENDA. In Proceedings of the 27th International Conference on
Software Engineering (ICSE). IEEE Computer Society, May 2005.
[8] Y. Deng, P. G. Frankl, and Z. Chen. Testing database transaction concurrency. In International Conference on Automated Software
Engineering (ASE), pages 184–195. IEEE Computer Society, October 2003.
[9] P. G. Frankl and E. J. Weyuker. An applicable family of data flow testing criteria. IEEE Transactions on Software Engineering,
14(10):1483–1498, 1988.
[10] M. J. Harrold, J. A. Jones, T. Li, D. Liang, A. Orso, M. Pennings, S. Sinha, S. A. Spoon, and A. Gujarathi. Regression test
selection for java software. In Proceedings of the 16th Annual ACM SIGPLAN Conference on Object-Oriented Programming,
Systems, Languages, and Applications (OOPSLA), pages 312–326. ACM, October 2001.
[11] W. E. Howden. Methodology for the generation of program test data. IEEE Transactions on Computers, 24(5):554–560, 1975.
[12] G.-H. Hwang, S.-J. Chang, and H.-D. Chu. Technology for testing nondeterministic client/server database applications. IEEE
Transactions on Software Engineering, 30(1):59–77, 2004.
[13] G. M. Kapfhammer and M. L. Soffa. A family of test adequacy criteria for database-driven applications. In Proceedings of the
11th ACM SIGSOFT Symposium on Foundations of Software Engineering, pages 98–107. ACM, September 2003.
[14] C. Lecluse, P. Richard, and F. Velez. O2, an object-oriented data model. In H. Boral and P.-A. Larson, editors, Proceedings of the
1988 ACM SIGMOD International Conference on Management of Data, Chicago, Illinois, June 1-3, 1988, pages 424–433. ACM
Press, 1988.
[15] S. Rapps and E. J. Weyuker. Data flow analysis techniques for test data selection. In Proceedings of the 6th International
Conference on Software Engineering (ICSE), pages 272–278, September 1982.
[16] S. Rapps and E. J. Weyuker. Selecting software test data using data flow information. IEEE Transactions on Software Engineering,
11(4):367–375, 1985.
[17] M. Stonebraker, P. Brown, and D. Moore. Object-Relational DBMSs. Morgan Kaufmann, 2 edition, 1998.
[18] D. Willmor and S. M. Embury. A safe regression test selection technique for database–driven applications. To appear in the Proceedings of the 21st International Conference on Software Maintenance (ICSM 2005). IEEE Computer Society, September 2005.
[19] D. Willmor, S. M. Embury, and J. Shao. Program slicing in the presence of a database state. In Proceedings of the 20th
International Conference on Software Maintenance (ICSM 2004), pages 448–452. IEEE Computer Society, September 2004.
[20] H. Zhu, P. A. V. Hall, and J. H. R. May. Software unit test coverage and adequacy. ACM Computing Surveys, 29(4):366–427, 1997.
3. Search-Based Software Testing
Automatic Software Test Data Generation For String Data Using
Heuristic Search with Domain Specific Search Operators
Mohammad Alshraideh, Leonardo Bottaci
Department of Computer Science, The University of Hull,
Hull, HU6 7RX, UK.
M.Alshraideh@dcs.hull.ac.uk, L.Bottaci@hull.ac.uk
July 28, 2005
Abstract
This paper presents a novel approach for automatic software test data generation where the test data is intended to cover program branches which depend on string predicates such as string equality, string ordering and regular expression matching. Genetic algorithm search is used and initially four simple predicate cost functions are proposed and their performance is investigated on a small number of simple programs. One cost function consistently outperforms the others. It becomes clear that the simple cost functions are failing to exploit all the available domain knowledge and that performance can be improved by introducing domain specific search operators. Some operators are proposed and shown to produce a significant improvement for a small number of simple test programs.
Key words: software test data generation, genetic algorithms, string predicates.
1 Introduction
The goal of automatic test data generation is to generate test data that satisfies a given test criterion. For the purpose of empirical investigation, a specific criterion, namely branch coverage, has been adopted for this paper. A number of automatic software testing approaches have been developed to obtain control flow coverage [3]. These approaches include random test generation, symbolic execution-based test generation [1], rule-based test generation, constraint-based test generation and dynamic test generation [3], [10]. In the research reported in this paper, a dynamic test data generation approach is used to automatically generate test data. In this approach, a heuristic search is guided by a measure of closeness to a test goal. A simple branch coverage problem is illustrated in Figure 1. The problem is to find an input string s so that the required branch is executed. If s is such that the predicate fails, a cost is associated with s. This cost is used to guide the search. Given the use of a particular search technique such as a genetic algorithm, a key problem is how to compute a useful cost for this predicate failure.
Current work has been largely limited to programs whose predicates compare numbers [17], [11], [12], but not strings. This unduly restricts the practical applicability of these approaches, since string predicates are widely used in programming.
The remainder of this paper is organized as follows. Section 2 presents background and related work, Section 3 presents cost functions for strings, Sections 4 and 5 present genetic algorithms and an outline of string data generation using a GA, Sections 6 and 7 present the experimental investigation and its results, Section 8 presents improved search operations, further results are shown in Section 9 and Section 10 presents the conclusion
UK Software Testing Research III
var string: s;
...
if (s == astring) {
//execution required
}
...
Figure 1: Need to measure the similarity of string s to the string value of astring in order to measure the cost of failure to execute the required branch.
and future work.
2 Background and Related Work
There are a number of different automatic test data generation approaches such as random test data generation [6], symbolic execution based test data generation [7], [8] and dynamic test data generation. Random test data generation produces test data at random until a useful input is found [6]. In general, random test data generation performs poorly and is generally considered to be ineffective on realistic programs [14]. Often the evaluation of guided search methods uses random testing as a benchmark [3]. Symbolic execution has problems with dynamic features of programming languages, such as array subscripts and pointers.
Dynamic test data generation is a popular approach for generating test data. It relies on the execution of the program under test in order to gain information that can be used to generate suitable test data. When searching for test data, candidate test inputs are executed to identify how close each input is to meeting the test requirement [2]. With the aid of this feedback, test inputs are gradually modified until one of them satisfies the requirement. For example, suppose that a program contains the condition statement
...
if (y == 30 ) {
//execution required
}
and the true branch of the predicate should be taken. Thus, we must find an input that makes the variable y equal to 30 when the condition statement is reached. For a given input, a simple way to determine the value of variable y in the predicate is to execute the program up to the condition statement and record the value of y. Given this information, the absolute difference between y and 30 is a possible cost function to guide the search. This approach can also be applied to all the arithmetic relational operators: <, <=, ==, != (non-equality), > and >=. Techniques for handling compound predicates are described in [10]. The cost functions may be used to guide search techniques that include gradient descent, tabu search, genetic search and simulated annealing [9].
The majority of systems developed by applying these techniques have focussed on generating numeric test data. In such systems, string data is considered in terms of its underlying numeric character format, which is the form used by cost functions and search operators. For example, string equality may be interpreted as equality of the underlying binary representation [16]. A notable exception is the use of a string specific predicate in [15]. The claim in this paper is that search performance can be improved by using string specific cost functions and search operators.
3 Cost Functions For Strings
A cost function is intended to measure the search effort required to produce a solution from a candidate solution. For string data, cost functions are required for the string relational predicates of equality, ordering and regular expression matching.
3.1 String Equals
The aim is to assign a measure of the extent to which any two different strings have the same value. Initially four functions were considered, as follows:
UKTest 2005
1. Hamming Distance (HD): n · k, where n is the number of non-matching corresponding characters and k is a positive constant. For example, if s1 = “COMPARISON”, s2 = “COMPARE” and k = 2, then HD(s1, s2) = 8.
The above function takes no account of character ordering. For example, in matching “AAA” with “BBB” and with “ZZZ”, the first match appears to be closer and yet HD(“AAA”, “BBB”) = HD(“AAA”, “ZZZ”) = 6. The following function attempts to overcome this problem.
2. Character Distance (CD): the total of the absolute ASCII code differences between corresponding characters in the two strings. Let string s1 = c1_l c1_{l-1} ... c1_2 c1_1 have ASCII codes a1_l a1_{l-1} ... a1_2 a1_1, and similarly for s2; then

CD(s1, s2) = Σ_i |a1_i − a2_i|.

Strings are left aligned and absent characters are treated as nulls. For example, if s1 = “SET” and s2 = “CASE”:
CD(s1, s2) = |ascii(‘S’) − ascii(‘C’)| + |ascii(‘E’) − ascii(‘A’)| + |ascii(‘T’) − ascii(‘S’)| + |0 − ascii(‘E’)|
= |83 − 67| + |69 − 65| + |84 − 83| + |0 − 69| = 90
3. Character Value (CV): the string is considered as a number with base equal to the cardinality of the character set. Let s = c_l c_{l-1} ... c_2 c_1 be a string of length l with ASCII codes a_l ... a_2 a_1 and define

ξ(s) = a_1 + m·a_2 + m^2·a_3 + ... + m^(l−1)·a_l

where m is the size of the character set (256 here). Let s1 and s2 be two strings; then CV(s1, s2) = |ξ(s1) − ξ(s2)|. For example, if s1 = “ACDE” and s2 = “ABC”:
ξ(s1) = 256^0·69 + 256^1·68 + 256^2·67 + 256^3·65 = 69 + 17408 + 4390912 + 1090519040 = 1094927429
ξ(s2) = 256^0·67 + 256^1·66 + 256^2·65 = 67 + 16896 + 4259840 = 4276803
so CV(s1, s2) = 1090650626.
In practice ξ(s) may be very large for long strings, too large to be represented using language provided integer data types. In [15], where this function is presented, integer overflow is avoided because the algorithm comparing strings compares and searches for one character at a time using a single character distance function. This problem also occurs when this function is used for string ordering.
4. Levenshtein Distance (LD): a measure of the similarity between two strings, in terms of the number of deletions, insertions, or substitutions required to transform one string into another. For example, LD(“TEST”, “TOT”) = 2, because one deletion and one substitution are required.
All methods of the form String.EndsWith, String.StartsWith and Substring ultimately use the string Equals cost formula and so the cost functions given above can be used.
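The four equality cost functions above can be sketched compactly. The following is an illustrative reconstruction, not the authors' implementation; absent characters are treated as nulls (code 0), as in the CD definition, and k defaults to 2 as in the examples.

```python
def hamming_distance(s1, s2, k=2):
    """HD: k times the number of non-matching corresponding characters."""
    n = max(len(s1), len(s2))
    code = lambda s, i: ord(s[i]) if i < len(s) else 0  # absent char -> null
    return k * sum(1 for i in range(n) if code(s1, i) != code(s2, i))

def character_distance(s1, s2):
    """CD: sum of absolute ASCII differences of corresponding characters."""
    n = max(len(s1), len(s2))
    code = lambda s, i: ord(s[i]) if i < len(s) else 0
    return sum(abs(code(s1, i) - code(s2, i)) for i in range(n))

def character_value(s1, s2):
    """CV: strings read as base-256 numbers; floats guard against overflow."""
    xi = lambda s: sum(ord(c) * 256.0 ** i for i, c in enumerate(reversed(s)))
    return abs(xi(s1) - xi(s2))

def levenshtein(s1, s2):
    """LD: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (c1 != c2)))
        prev = cur
    return prev[-1]
```

On the worked examples, these give HD(“COMPARISON”, “COMPARE”) = 8, CD(“SET”, “CASE”) = 90, CV(“ACDE”, “ABC”) = 1090650626 and LD(“TEST”, “TOT”) = 2.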
3.2 String Ordering
Here the problem is to return a numeric value that measures the extent to which one string is less than, or precedes, another, which is usually interpreted according to lexicographical ordering. The previous character value function can be adapted as follows. Let s1 and s2 be two strings of length l with ASCII sequences a1_l ... a1_2 a1_1 and a2_l ... a2_2 a2_1. Then

cost(s1 ≤ s2) = (a1_1 − a2_1) + m·(a1_2 − a2_2) + m^2·(a1_3 − a2_3) + ... + m^(l−1)·(a1_l − a2_l)

where m is the size of the character set (256 here). As an example, cost(“ACDE” ≤ “ABC”) becomes, after padding with a null, cost(“ACDE” ≤ “ABC0”):
256^0·(69 − 0) + 256^1·(68 − 67) + 256^2·(67 − 66) + 256^3·(65 − 65) = 69 + 256 + 65536 + 0 = 65861
This cost function may produce very large values. Consider the problem of comparing cost(“AAAAAAAAAA”, “XXXXXXXXXX”) and cost(“AAAAAAAAAA”, “MXXXXXXXXX”); both these values are too large for 64 bit integers. It seems as if large integer values must be accommodated if the above string comparisons are to be correctly costed. In the implementation used for this paper, costs were represented by floating point values up to a maximum of 1.7 × 10^308, which allows strings up to a length of 80 characters to be costed, but with some loss of precision.
3.3 Cost Of Matching Regular Expressions
In principle, a cost function for matching a given string to a regular expression can be derived from the cost function for string equality as follows. A regular expression denotes a set of strings. If the given string is a member of this set then the cost of the match should be zero. If the given string is not a member of the set, it seems plausible to define the match cost as the lowest equality cost of the given string with any string in the regular expression set. The computational cost of such a cost function is of the order of the size of the regular expression set, which is unacceptably high. Muzatko [13] gives an algorithm for constructing a FSM that accepts strings with up to a defined number of mismatches. The algorithm constructs a non-deterministic machine containing l copies of the regular expression machine, where l is the maximum number of mismatches to be detected, but the algorithm complexity is exponential.
An acceptable cost function would have a computational complexity of order equal to the size of the given string. The idea behind the proposed cost function is to parse the given string using a finite state machine, in much the same way as would be done to check its membership of the regular set. Instead of the finite state machine producing a simple accept or reject output, however, the machine computes a cost by counting “mismatched” state transitions.
The example in Figure 2 shows how the cost of match(“acb”, (ab)*) may be computed. E is the machine that recognizes (ab)*. E' is the machine that computes the cost of matching any string with (ab)*. A distinction is made between not executing a transition because the current input character is not present in any acceptable string and not executing a transition because the current input character is in the wrong position in the input string, i.e. the machine E has an acceptable transition but not at the current state.
k is the cost of a mismatched transition resulting from an input symbol not present in any regular string and k/2 is the cost of a mismatched transition that results from an input symbol present in some regular string.
Where a transition has an output, k or k/2, this value is added to the current cost. In E', match(“acb”, (ab)*) begins with the character 'a' and produces a zero cost transition; the character 'c' leads to a k cost transition since 'c' is not in the alphabet of the regular set, followed by a zero cost transition due to 'b', to give a total cost of k. Match(“bca”, (ab)*) produces a cost of k/2 + k to the end of the string “bca”, followed by the empty transition to the final state at a cost of k/2, to give a total cost of 2k.
In general, E' is defined as follows. Let S be a set of strings and E be a finite state machine defining some regular subset of S. Let A be the alphabet for all strings in S. Let A_E be the alphabet for all strings in E. Using E, define another machine E' with the same states to accept any string in S, and in the process compute a cost C_E' for that string. The machine is defined as follows. For each state, add a transition from that state to itself to be traversed for any character in A\A_E. Whenever this transition is traversed, k (say the number of states in E) should be added to the value
Figure 2: The finite state machine E' computes the cost of matching a string with the set defined by E. (Diagram: E accepts (ab)*; E' adds transitions with cost k/2 for the characters a, b out of position, cost k for characters in A\{a, b}, and an empty transition to the final state with cost k/2; alphabet A = {a, b, c}.)
computed by C_E'. For each state in E, for each character c in A_E where no transition is defined at that state, add for input c a transition from the state to every other state to which a transition in E is defined for c. These transitions cause k/2 to be added to the value computed by C_E'. There is also a transition from each nonfinal state to the final state; it consumes no input but k/2 is added to the cost.
The number of states in E' is equal to the number in E, but E' has O(n^2) additional transitions in the worst case. Although E' is non-deterministic, which means that the number of states in the equivalent deterministic machine is O(2^k) in the worst case, it has fewer states than the machine proposed by Muzatko, which is O(2^lk) in the worst case.
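To make the construction concrete, the following sketch simulates a minimum-cost run of E' for E = (ab)* over the alphabet A = {a, b, c}, with k taken as the number of states in E as the text suggests. This is an illustrative reconstruction, not the authors' implementation.

```python
K = 2.0  # k = number of states in E, as suggested in the text

# E for (ab)*: state 0 (start and final) --a--> state 1 --b--> state 0
E_TRANS = {(0, 'a'): 1, (1, 'b'): 0}
E_ALPHABET = {'a', 'b'}
FINAL = {0}

def match_cost(s, k=K):
    """Minimum-cost run of the nondeterministic cost machine E' on s."""
    INF = float('inf')
    cost = {0: 0.0}  # reachable states -> cheapest cost so far
    for ch in s:
        nxt = {}
        for state, c in cost.items():
            if ch not in E_ALPHABET:
                # self-loop for characters outside E's alphabet, cost k
                nxt[state] = min(nxt.get(state, INF), c + k)
            elif (state, ch) in E_TRANS:
                # ordinary E transition, cost zero
                t = E_TRANS[(state, ch)]
                nxt[t] = min(nxt.get(t, INF), c)
            else:
                # ch accepted somewhere in E but not at this state: cost k/2
                for (st, sym), t in E_TRANS.items():
                    if sym == ch:
                        nxt[t] = min(nxt.get(t, INF), c + k / 2)
        cost = nxt
    # empty transition from a non-final state to the final state costs k/2
    return min(c + (0.0 if st in FINAL else k / 2) for st, c in cost.items())
```

This reproduces the worked examples: match_cost(“acb”) = k and match_cost(“bca”) = 2k, while members of (ab)* cost zero.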
4 Genetic Algorithms
The above cost functions can be used to guide the search in a genetic algorithm. A genetic algorithm (GA) is a search algorithm based on principles from natural selection and genetic reproduction [5], [4]. GAs have been applied successfully to a wide range of applications including optimization, scheduling and design problems. Key features that distinguish GAs from other search methods include:

1. A population of individuals, where each individual represents a potential solution to the problem to be solved.

2. A fitness function which evaluates the utility of each individual as a solution. In genetic algorithm search, as used in this paper, a cost function is called a fitness function and is used to rank the candidate solutions in the population for selection for crossover, producing a new generation that will (hopefully) be better.

3. A selection function which selects individuals for reproduction based on their fitness.

4. Genetic operators that alter selected individuals to create new individuals for further testing. These operators, e.g. crossover and mutation, attempt to explore the search space without completely losing information (partial solutions) that has already been found.

Figure 3 shows the basic steps of a GA. First the population is initialized, either randomly or with user-defined individuals. The GA then iterates through an evaluate, select and reproduce cycle until either a user defined stopping condition is satisfied or the maximum number of allowed generations is exceeded.
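The evaluate, select and reproduce cycle can be sketched generically as follows. Truncation selection and the integer toy problem are illustrative assumptions for this sketch, not details taken from the paper.

```python
import random

def ga_search(cost, random_individual, crossover, mutate,
              pop_size=60, max_gens=100000):
    """Generic GA loop: stop when some individual reaches cost zero."""
    population = [random_individual() for _ in range(pop_size)]
    for _ in range(max_gens):
        population.sort(key=cost)
        if cost(population[0]) == 0:
            return population[0]               # test goal satisfied
        parents = population[:pop_size // 2]   # truncation selection (assumed)
        offspring = []
        while len(offspring) < pop_size:
            a, b = random.sample(parents, 2)
            c1, c2 = crossover(a, b)
            offspring += [mutate(c1), mutate(c2)]
        population = offspring[:pop_size]
    return None  # generation limit exceeded

# Toy usage: search for the integer 42.
random.seed(0)
best = ga_search(cost=lambda x: abs(x - 42),
                 random_individual=lambda: random.randint(0, 100),
                 crossover=lambda a, b: ((a + b) // 2, a),
                 mutate=lambda x: x + random.choice([-1, 0, 1]))
```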
5 Outline of String Data Generation using a GA
5.1 String Representation
It is necessary to distinguish between the program input test data type in a programming language (the phenotype), which includes various types including strings, and the encoded representation of an individual solution, the chromosome or genotype (often also called a string), that is used in the GA. Figure 4 illustrates this concept: the phenotype is the representation which is evaluated and the genotype is the representation which is manipulated by the GA. For the purpose of this investigation, two genotype forms were used. One form is identical to
Figure 3: General flowchart of string test data generation using a GA. (Flowchart: initialize the population with random input string values; evaluate by executing the test program on the inputs; if the branch is satisfied, report success; otherwise select candidate inputs and use crossover and mutation to produce new inputs.)
Figure 4: Genotype and phenotype. (Diagram: the GA manipulates a character or binary genotype via crossover and mutation; the phenotype is executed by the test program to compute fitness.)
the phenotype, i.e. the string data type; the other form is a binary string formed by concatenating the binary forms of each character of the string. The genetic operators are aware of the basic data type of each part of the genotype and so string and numeric data can be represented in a single genotype.
5.2 Crossover
Crossover is a genetic operator that combines (mates) two chromosomes (parents) to produce new chromosomes (offspring).
5.2.1 Character Crossover
Character crossover operates on the character sequence genotype form. It selects a crossover point within a chromosome and then interchanges the two parent chromosome segments at this point to produce two new offspring.
Consider the following two parents which have been selected for crossover. The | symbol indicates the randomly chosen crossover point, which may extend beyond the length of the shorter string.
Parent1 = “DATA |”.
Parent2 = “GENER|ATION”.
The offspring are:
Offspring1 = “DATA |ATION”.
Offspring2 = “GENER|”.
Character crossover does not introduce any new characters; the following crossover does introduce new characters.
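A sketch of this operator (an illustrative reconstruction; when the crossover point falls beyond the end of the shorter parent, its contribution past the end is simply empty):

```python
import random

def character_crossover(p1, p2, point=None):
    """Single-point crossover on the character genotype."""
    if point is None:
        point = random.randint(0, max(len(p1), len(p2)))
    return p1[:point] + p2[point:], p2[:point] + p1[point:]
```

With the parents above and point 5 this yields the offspring “DATAATION” and “GENER”.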
5.2.2 Binary String Crossover
Multipoint crossover is used. In this crossover each character is converted to binary and then single point crossover is applied between each pair of corresponding binary characters. The following is a multipoint crossover example:
p1 = “EBWU”.
p2 = “RSCD”.
The characters are changed to binary:
parent1 = “100010|1 10000|10 101|0111 1010|101”.
parent2 = “101001|0 10100|11 100|0011 1000|100”.
offspring1 = “100010|0 10000|11 101|0011 1010|100”.
offspring2 = “101001|1 10100|10 100|0111 1000|101”.
Then, after changing binary back to characters:
offspring1 = “DCST”.
offspring2 = “SRGE”.
Note that the offspring contain a new character, G, that appears in neither parent.
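A sketch of the per-character binary crossover on the 7-bit ASCII genotype (an illustrative reconstruction; the cut points in the worked example above are 6, 5, 3 and 4 bits from the left):

```python
import random

def binary_crossover(p1, p2, bits=7, points=None):
    """Single-point crossover within each corresponding character pair."""
    cuts = points or [random.randint(1, bits - 1) for _ in p1]
    o1, o2 = [], []
    for c1, c2, cut in zip(p1, p2, cuts):  # parents assumed equal length here
        b1 = format(ord(c1), '0{}b'.format(bits))
        b2 = format(ord(c2), '0{}b'.format(bits))
        o1.append(chr(int(b1[:cut] + b2[cut:], 2)))  # prefix of p1, suffix of p2
        o2.append(chr(int(b2[:cut] + b1[cut:], 2)))
    return ''.join(o1), ''.join(o2)
```

With parents “EBWU” and “RSCD” and cut points 6, 5, 3, 4 this yields the offspring “DCST” and “SRGE”.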
5.3 Mutation
Mutation is a genetic operator that alters one or more gene values in a chromosome from its initial state. A random character is selected for mutation and converted to binary, of which one random bit is modified.
6 Experimental Investigation
In order to investigate the effectiveness of the four cost functions and two genotype representations,
a number of short programs were instrumented to compute the cost of string predicates. These programs are shown in Figures 8, 9, 10, 11 and 12. No examples of regular expression matching were included as this cost function has yet to be implemented.
The aim of each experiment is to find input data to execute all the branches in the programs. The initial input sets were randomly generated across the range of ASCII codes 0 to 255. A run with a population of 60 was made, with a maximum of 100000 generations before stopping. The average result of 30 runs is shown in Table 1.
7 Experimental Results and Discussion
The results in Table 1 show that the character distance cost function is clearly better than the other three. For all programs, it is the best performing function. Comparing the two crossover operators, the numbers of offspring are close to each other and there is no clear winner. Note that although the character crossover operator produces no new characters, the mutation operator does.
8 Improved Search Operations
In the previous part of the paper, the cost functions and genetic search operators were selected to be independent of any particular program under test. In the rest of the paper, domain and program specific search operations are considered in an attempt to improve the search performance.
In some programs, the target string for a search may be available from the program text. For example, in the simple program
function f(s:string) {
if (s.Equals("CHILD")) {
...
}
}
The first branch of this program is true when s = "CHILD". This suggests a heuristic for guiding the search for values of the string s, namely, set s to a string literal that appears in the program under test. That this heuristic may fail is clear from the example below
function f1(s:string) {
s=s+"D";
if (s.Equals("CHILD")) {
...
}
}
Although neither of the two string literals "D" nor "CHILD" is an input value that executes the target branch, they do provide reasonable starting points for a guided search.
To generalize somewhat, consider the following program fragment
function f(s:string) {
s=op(s);
if(s=="AC") {
// execution required
}
}
where op is a string operation that returns a string, possibly different from its input. Employing the heuristic of using string literals from the program as starting points for the search for the target branch, the search should begin with s = "AC". If op returns a string equal to its input, then f("AC") will execute the target branch and the heuristic has produced a solution immediately. A more likely situation, however, is that op will return a string that is not equal to its input. Assume, for example, that op reverses its input, in which case f("CA") executes the target branch. To understand the implications of this, consider a search space that consists of 9 strings only, as shown in Figure 6. The edges
of the bidirectional graph indicate possible string transformations by a search operator that may only "increment" or "decrement" a single character in the string. The graph shows that a minimum of 4 applications of the search operator are required to find the solution when the search begins from "AC", as it would do when employing the heuristic of "seeding" the search using program literals. In this example, the heuristic provides the worst possible seed for the search, as the minimum distances from the other seeds are all shorter.
The search space may be modified, however, by introducing an additional search operator. Figure 7 shows the space produced when an additional "reverse" search operator is introduced. The minimum number of applications of a search operator necessary to transform the seed string "AC" to the solution "CA" is now just one. Overall, the mean number of operations is reduced since, although the addition of the reverse operator increases the number of edges, they are all shortcuts on paths from "AC" to "CA" and so the search distance is always reduced.
The choice of the additional search operator in this example was clearly motivated by the knowledge that op reverses its input in the program under test. In practice op is, of course, unknown. Even so, some information about the sort of operation that is performed by op could reduce the search space. In general, additional genetic search operators (e.g. mutation and crossover operators) can be constructed to perform typical string operations that can be found in programs that operate on string data.
This has motivated the introduction of additional genetic operators. These operators require access to the string literals of the program, and so the random string generator was modified to bias generation towards these string literals. The test data generation tool used was modified to collect string literals from the test program.
Initial population individuals were generated by selecting randomly from the domain in Figure 5. There is a bias towards selecting strings that appear as literals in the program. This is achieved by setting a 5% probability that a random string is selected from
Figure 5: The string domain: the subdomain of literals from the program (e.g. "D", "CHILD") within the domain of all character sequences; there is a higher probability of selecting a string from the literals set.
the subdomain of literals rather than the domain of all character sequences.
The swap mutation operator exchanges two characters in the string. The insertion mutation operator selects an insertion point within the string with a probability of 10% and an insertion point at one or other end with a probability of 90%. A random string is inserted, with a 90% bias towards the program literal string set. The bias towards the ends of the string is intended to reflect the common use of the string concatenation operator. The deletion mutation operator deletes from the given string a string selected randomly from the program literal string set, if such a string is present; otherwise it deletes a random character.
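The swap and deletion operators can be sketched as follows (an illustrative reconstruction: positions and literals are passed explicitly here, whereas the tool chooses them at random, and the deletion fallback below removes the last character rather than a random one):

```python
def swap_mutation(s, i, j):
    """Exchange the characters at positions i and j."""
    chars = list(s)
    chars[i], chars[j] = chars[j], chars[i]
    return ''.join(chars)

def deletion_mutation(s, literals):
    """Delete an embedded program literal if one is present,
    otherwise delete a character (the tool picks one at random)."""
    for lit in literals:
        if lit and lit in s:
            return s.replace(lit, '', 1)
    return s[:-1]
```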
The introduction of new search operators has implications for the cost functions, as illustrated by the previous discussion. For example, in matching "ABCD" with "ABXCD" and with "ABXYZ", the first match appears to be closer (the Levenshtein distance is only 1) and yet
HD("ABCD", "ABXCD") = 6
HD("ABCD", "ABXYZ") = 6.
The idea that "ABXCD" is a relatively close match to "ABCD" is based on the possibility of transforming "ABXCD" to "ABCD" using a single character deletion. In contrast, three character deletions and two insertions are required to transform "ABXYZ" to "ABCD".
The following two functions were motivated by an attempt to more accurately cost the extent to which a string requires transformation by the available search operations.
1. Member Hamming Distance (MHD): similar to Hamming Distance but sensitive to characters that are present but in the incorrect position; each non-matching and absent character counts k, each non-matching but present character counts k/2. For example, if s1 = "ABCD", s2 = "ABXCD", s3 = "ABXYZ" and k = 2:
MHD(s1, s2) = 0 + 0 + 1 + 1 + 2 = 4
MHD(s1, s3) = 0 + 0 + 2 + 2 + 2 = 6
2. Member Character Distance (MCD): similar to Character Distance; if a character c1 in s1 is not matched by its corresponding character c2 in s2 and moreover c1 is absent from s2, then count m (the maximum character value) + |ascii(c1) − ascii(c2)|. If a character c1 in s1 is not matched by its corresponding character c2 in s2 but c1 is present in s2, then count |ascii(c1) − ascii(c2)|. The following is an example to illustrate MCD. Let s1 = "ITALY", s2 = "ISLAND" and m = 256:
MCD(s1, s2) = 0 + (|ascii('T') − ascii('S')| + 256) + |ascii('A') − ascii('L')| + |ascii('L') − ascii('A')| + (|ascii('Y') − ascii('N')| + 256) + (|ascii('D') − ascii(' ')| + 256)
= 0 + (|84 − 83| + 256) + |65 − 76| + |76 − 65| + (|89 − 78| + 256) + (|68 − 32| + 256)
= 257 + 11 + 11 + 267 + 292 = 838
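Member Hamming Distance can be sketched as follows (an illustrative reconstruction; MCD is analogous, adding the ASCII difference and an extra m when the character is absent from the other string):

```python
def member_hamming_distance(s1, s2, k=2):
    """k per mismatch where the s1 character is absent from s2;
    k/2 where it is present in s2 but in the wrong position."""
    n = max(len(s1), len(s2))
    total = 0
    for i in range(n):
        c1 = s1[i] if i < len(s1) else None
        c2 = s2[i] if i < len(s2) else None
        if c1 == c2:
            continue
        total += k // 2 if (c1 is not None and c1 in s2) else k
    return total
```

On the worked examples this gives MHD("ABCD", "ABXCD") = 4 and MHD("ABCD", "ABXYZ") = 6.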
Figure 6: The search space of nine strings "AA", "AB", "AC", "BA", "BB", "BC", "CA", "CB", "CC"; the starting point of the search is "AC" and the solution is "CA".
Figure 7: The search space after the addition of a new reverse search operator; the starting point "AC" is now one application away from the solution "CA".
function example1(s1 : String) {
if( s1.Equals("University")) {
// execution required
}
}
Figure 8: Example test program
function example2(s1 : String) {
if (s1 >= "ZZZZZX") {
// execution required
}
}
Figure 9: Example test program
function example3(s1 : String, s2: String,
s3 : String, s4: String) {
String s = s1 + s2;
s =s + s2.reverse() + s4;
if (s2 < s4 )
s.Insert(s.Length/2,s3);
else
s.Remove(1,2);
if (s.Equals("RANDOM")) {
// execution required
}
}
Figure 10: Example test program
function example4(s1 : String) {
if ( s1.Length >20)
s1=s1.Remove(2,5);
else if( s1.Length >10 )
s1=s1.Insert(0,"MICRO");
else if(s1.Length >5 )
s1=s1.Replace("A","B");
else
s1=s1+"ROSOFTDEVLOPMENT";
if ( s1.Equals("MICROSOFTDEVLOPMENT")) {
// execution required
}
}
Figure 11: Example test program
function example5(s1 : String) {
String[] sparts =split(s1,"/");
int k=sparts.Length-1;
if (sparts[k].Contains('.')) {
String[] fileparts =split(sparts[k],".");
String suffix =fileparts[1];
if (suffix == "DOC") {
if (sparts[k - 1] =="WORD") {
print("Word folder contains document");
}
else
print("document not in word folder");
}
if (suffix == "PDF") {
if (sparts[k - 1] == "ACROBAT") {
print("Acrobat folder contains PDF");
}
else
print("document not in Pdf folder");
}
}
}
Figure 12: Example test program
9 Results after introducing string specific search operators

Using these new cost functions and new genetic operators, test data was once again generated for the sample programs, without the use of the biased test data generator. Table 2 shows the results. It is clear that the introduction of the genetic operators alone leads to a slight overall improvement at best. The out-performance of the character distance based function is retained, but Member Character Distance is not quite as effective as simple Character Distance. Member Hamming Distance is a significant improvement over Hamming Distance.
When the biased string generator is introduced, the improvement is significant. The results are shown in Table 3. The good performance for the program in Figure 11 depends heavily on the insertion and deletion operators.
10 Conclusion
This paper presents a test data generation approach for program branch coverage where branch predicates include string predicates. Experiments have been conducted on simple programs containing string predicates. The preliminary experimental results show that the methodology is effective, particularly if string literals from the program under test can be used. Further work is to implement and investigate regular expression matching and also to evaluate the method on a larger class of programs.
References
[1] Beizer B., Software Testing Techniques, 2nd ed., New York: Van Nostrand Reinhold, 1990.
[2] Korel B., Dynamic method for software test data generation, Software Testing, Verification and Reliability 2 (1990), no. 4, 203–213.
[3] Korel B., Assertion-oriented automated test data generation, Proceedings of the 18th International Conference on Software Engineering (1996), 71–80.
[4] Goldberg D. E., Genetic Algorithms in Search, Optimization and Machine Learning, Addison Wesley, 1989.
[5] Holland J. H., Adaptation in Natural and Artificial Systems, University of Michigan Press (1975).
[6] Duran J. and Ntafos S., An evaluation of random testing, IEEE Transactions on Software Engineering 10 (1984), no. 4, 438–443.
[7] King J., A new approach to program testing, Proceedings of the International Conference on Reliable Software (1975), 228–233.
[8] King J., Symbolic execution and program testing, Communications of the ACM 19 (1976), no. 7, 385–394.
[9] Tracey N., Clark J. and Mander K., Automated flaw finding using simulated annealing, International Symposium on Software Testing and Analysis 30 (1998), no. 1, 73–81.
[10] Bottaci L., Predicate expression cost functions to guide evolutionary search for test data, Proceedings of the Genetic and Evolutionary Computation Conference (2003), 2455–2464.
[11] Harman M. et al., Improving evolutionary testing by flag removal, Proceedings of the Genetic and Evolutionary Computation Conference (2002), 1359–1366.
[12] McMinn P., Search-based software test data generation: a survey, Software Testing, Verification and Reliability 14 (2004), no. 2, 105–156.
[13] Muzatko P., Approximate regular expression matching, 1996.
[14] Coward P. D., Symbolic execution and testing, Information and Software Technology 33 (1991), no. 1, 229–239.
[15] Zhao R., Character string predicate based automatic software test data generation, Third International Conference on Quality Software, 2003, pp. 255–263.
[16] Jones B., Sthamer H. and Eyres E., Automatic structural testing using genetic algorithms, Software Engineering Journal 11 (1996), 299–306.
[17] Baresel A., Wegener J. and Sthamer H., Evolutionary test environment for automatic structural testing, Information and Software Technology 43 (2001), no. 14, 41–54.
Table 1: Number of offspring to find a solution, averaged over 30 trials for each of the example programs. String generation is uniform random over all strings up to length 50.

                    Binary crossover                     Character crossover
Program      CD     HD       LD       CV       CD     HD       LD       CV       Best
Figure 8     634    1210     819      916      546    1786     1067     1002     CD
Figure 9     1923   not used not used 1923     1742   not used not used 1742     CD
Figure 10    8787   30369    11322    13202    9378   34968    16280    19652    CD
Figure 11    836    10536    3162     5331     1028   11385    3359     7845     CD
Figure 12    24210  97215    71317    46214    22754  86120    64312    52125    CD
Average      7278   34833    21655    13517    7090   33565    21255    16473    CD
Table 2: Number of offspring to find a solution, averaged over 30 trials for each of the example programs. String generation is uniform random over all strings up to length 50, using Member Character and Member Hamming Distance.

             Binary crossover      Character crossover
Program      MCD    MHD            MCD    MHD            Best
Figure 8     598    1065           496    1413           MCD
Figure 9     1700   not used       1617   not used       MCD
Figure 10    8493   26998          9286   25189          MCD
Figure 11    724    8754           723    5316           MCD
Figure 12    28641  85411          27812  78645          MCD
Average      8031   30557          7987   27641
Table 3: Number of offspring to find a solution when domain specific genetic operators are used, averaged over 30 trials. The string generator is biased towards program string literals; both binary and character crossover are used with equal probability.

Example Program   CD      MCD     HD        MHD       LD        CV
Figure 8          4       4       4         4         3         3
Figure 9          4       4       not used  not used  not used  4
Figure 10         5916    4561    7675      7022      6321      6451
Figure 11         109     91      167       148       132       358
Figure 12         8392    7952    27230     24562     21361     14521
Average           2885    2522    8769      7934      6954      4267
Use of branch cost functions to diversify the search for test data
Leonardo Bottaci
Department of Computer Science, University of Hull, Hull, HU6 7RX, l.bottaci@dcs.hull.ac.uk

August 2, 2005
Abstract
Heuristic search techniques have been used with some success for the automatic generation of program test
data. A problem occurs, however, when the test goal requires the solution of a subgoal which is not included in
the cost or fitness function that guides the search. A particular example of this is the need to execute a particular
program path in order to execute a given branch even though the branch may be reached by a number of other
paths. A method is described by which the search for test data is directed at exploring diverse program paths
when data to satisfy the given test goal is difficult to find. When the search can no longer progress towards
satisfying the test goal, the test goal is augmented to search for input data that not only executes the test goal but
also executes specific branches that are consistent with the test goal but have either not been executed or executed
infrequently. These branches can be identified from the values of the branch predicate cost functions that are
necessarily computed to guide the search. The advantage of this method is that the search explores a greater
variety of execution paths through the program, increasing the likelihood of finding a solution. The method has
been implemented and tested with success on three difficult-to-test programs.
1 Introduction
A test adequacy criterion specifies the extent to which a given program should be tested. In the context of unit
testing, for example, common test adequacy criteria include statement coverage, branch coverage and multiple
condition coverage. In general, the more stringent the test adequacy criterion to which a program has been sub-
jected, the more confidence there is in the correctness of the program. Obviously, a strict adequacy criterion is to be
preferred, but in practice, test cases are usually constructed manually by a tester who may need to spend significant
time analysing the program under test. Consequently, there is much interest in the prospect of generating test data
automatically.
In the context of unit testing, an ideal automatic test data generation tool inputs the program under test and outputs
a set of test data for a given adequacy criterion. Because the general test data generation problem, i.e. the problem
of taking a given program and constructing an input that produces a given program behaviour, is well known to be
undecidable (the halting problem is a special case of this problem), no ideal tool of this kind may be constructed.
Research effort has thus been directed towards heuristic approaches and a number of heuristic search methods
have been investigated (Jones, Sthamer, and Eyres 1996) (Korel 1990) (Korel 1992) (Ferguson and Korel 1996)
(Tracey, Clark, and Mander 1998) (Tracey, Clark, Mander, and McDermid 2000) (Wegener, Baresel, and Sthamer
2001) (Baresel, Sthamer, and Schmidt 2002) (Michael, McGraw, Schatz, and Walton 1997).
A key component of all heuristic search methods is an evaluation function (also known as an objective function or
cost function) that guides the search towards the solution. A cost function provides an evaluation of each point in
the search space in terms of how “close” it is to a solution. As a simple example of a cost function in the context
of software test data generation, consider the problem of searching for test data to execute the target branch in the
following program fragment where v is an integer variable.
...
if (v == 1) {
    //TARGET BRANCH
}
Initially, test inputs are generated randomly. Any input that fails to execute the target branch, i.e. a test case that produces a value in v not equal to 1, may be assigned a cost of abs(v - 1). This cost function is positive for all non-solution points.
The role of the cost function within a heuristic search algorithm is to discriminate between the points in the search
space and thereby identify the best direction in which to explore. In the case of the example program, the search
algorithm would use the guidance of the cost function to generate program inputs with successively lower cost
until, hopefully, a solution is found. In practice, the search may be terminated after a budgeted period of time has
elapsed.
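As an illustrative sketch only (the paper does not prescribe an implementation language, and the function name is ours), the cost function for the fragment above might be written in Python as:

```python
def equality_cost(v, target=1):
    """Branch-distance cost for the predicate v == target: positive for
    every non-solution input, zero when the target branch is taken."""
    return abs(v - target)

# Lower cost means "closer" to executing the target branch.
assert equality_cost(5) == 4
assert equality_cost(2) < equality_cost(5)
assert equality_cost(1) == 0
```

A search algorithm would generate inputs and prefer those with lower cost, terminating when the cost reaches zero.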
2 Test data generation problem
The progress of a test data search algorithm is arrested when it is unable to generate any test input that has a lower
cost than the current best input. In these conditions, the search is said to be trapped on a local optimum or plateau.
As a simple example, consider the problem of searching for test data to execute the target branch in the following
program fragment.
v = 2;
if (x == 0) {
v = 0;
}
if (y == 0) {
v = v + 1;
}
if (z == 0) {
v = v - 3;
}
if (v == 1) {
//TARGET BRANCH
}
In general, in order for a particular program branch to be executed, the control dependency conditions for that
branch must be satisfied. Assuming no other branches are present, the control dependency condition for the target
branch is satisfaction of the predicate v == 1, which leads to the cost function abs(v - 1). In some cases, however,
the control dependency condition cannot be satisfied unless branches that do not appear in the condition are also
executed. In order to execute the target branch, in the above example, it is necessary that x be zero so that v may
be set to 0 and that y also be zero so that v may be set to 1 and that z be nonzero. The simple cost function,
abs(v - 1), is, however, essentially insensitive to the values of x, y and z and therefore cannot guide the search.
If the values of x, y and z at the conditionals are determined randomly then it is plausible that the probability of
satisfying x == 0 and y == 0 is low. In this situation, the execution of the target branch depends on what is
essentially a random search for an input that executes the x == 0 and y == 0 branches without executing the z
== 0 branch.
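The flatness of this cost landscape can be demonstrated directly. The following Python sketch (function names are ours) mirrors the fragment above:

```python
def run_fragment(x, y, z):
    """Mirror of the example fragment above; returns the final value of v."""
    v = 2
    if x == 0:
        v = 0
    if y == 0:
        v = v + 1
    if z == 0:
        v = v - 3
    return v

def target_cost(x, y, z):
    """Cost of executing the target branch: abs(v - 1)."""
    return abs(run_fragment(x, y, z) - 1)

# Unless x or y is exactly zero, v stays 2 and the cost is stuck at 1:
# the cost function offers the search no gradient over x, y and z.
assert target_cost(7, -3, 42) == 1
assert target_cost(100, 5, 9) == 1
# Only the rare combination x == 0, y == 0, z != 0 yields a solution.
assert target_cost(0, 0, 1) == 0
```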
It should be appreciated that in its general form, the problem just illustrated is one that eventually is faced by all
attempts to generate test data for nontrivial programs via heuristic search. Different researchers have tackled this
problem in different ways (Ferguson and Korel 1996) (Baresel and Sthamer 2003) (Harman, Hu, Hierons, Baresel,
and Sthamer 2002). One approach, the chaining method (Ferguson and Korel 1996), is to further analyse the
program under test and thereby improve the cost function. In the case of the above example, data flow analysis
may be used to determine that the target predicate uses a value of v and that furthermore, there are definitions of
v in the first three conditionals that have not been executed, and so executing them is a plausible subgoal. A more
general approach, and the one presented in this paper, is to view the problem as one of inadequate search diversity.
Search methods that maintain, at each step in the search, not one, but many candidate solutions, are inherently better
suited to avoiding the local optimum or plateau trap. If these candidate solutions are widely spaced, sufficiently
diverse, there is a lower chance that they will all lead to the same local optimum. Although the importance of
diversity is recognised it is not easy to achieve since diversity is countered by the convergence process that is
essential for the search to find a solution.
There are two broad approaches to maintaining diversity: domain-independent strategies and domain-specific strategies. The strategy adopted in the work reported here is domain specific and relies on the fact that there is usually more than one path by which an input may execute a program to satisfy the test goal. At conditional
statements, where program inputs may execute either branch without violating the control dependency condition
for the test goal, there is scope for introducing the execution of one particular branch as an additional subgoal to
the test goal. By identifying non-critical branches that have not been executed and introducing them as subgoals
for independent searches, the search is directed to inputs that execute the program differently.
To illustrate this approach, consider again the previous example program and assume that after a period of searching
for data to execute the target branch, the target branch has not been executed. At this stage the search has been
guided only by the cost abs(v � 1). The next step is to suspend the search and identify those branches that have
been reached but not executed and yet may be executed on a path that executes the target branch. The program
under test is instrumented to record the predicate costs at all branches and so this information is available.
In the example program, all the true branches will have been reached but it is unlikely that an input has been able
to satisfy any of them. The reached but unexecuted branches, that are absent from but consistent with the target
search goal, are all candidates for additional subgoals. In this example, the three branches prior to the target branch
are each combined with the target branch goal to produce three additional search goals. Three new searches are
now initiated; the original search remains suspended.
Since it is necessary to satisfy both of the first two conditionals of the program, it is still unlikely that any of these
three new searches will lead to the target. After a period during which the searches make no progress, they will
be suspended and similarly examined to identify suitable additional subgoals. In this way a search is instigated for
an input to satisfy the first two conditionals in addition to the target. It is clear that once the goal of satisfying x
== 0 and y == 0 is included in a cost function, the search may progress satisfactorily.
In general, the aim is to direct the search into as yet unexplored areas of the input domain by changing the program
behaviour at specific branches. To explain the method in more detail, the following section defines the basic
concepts of program execution and is followed by a short introduction to genetic algorithm search. A detailed
description of the branch diversity method is then presented as an algorithm. Following that, some examples are
worked through in detail. The branch diversity search algorithm has been implemented and the results obtained for
the example program above and two other example programs are given.
3 Definitions
3.1 Program structure
A program control flow graph is a directed graph CFG = (N, E, s, e), where N is a set of nodes, E is a set of edges, i.e. pairs from N, s is a unique start node and e a unique exit node. Nodes in N may correspond to the
statements or basic blocks of a program. An edge exists from one node to another (i.e. from a node to its control
flow successor) if and only if execution of one node may immediately be followed by execution of the other. Those
nodes in N that have more than a single control flow successor are called conditional nodes and correspond to
if-then statements, while statements and the like. Without loss of generality, it is assumed that a conditional node
has exactly two successors. The outgoing edges from a conditional node are called branches, and are labelled, one
with the symbol T, the other with the symbol F. Each conditional node is associated with a predicate expression.
Program execution flows along the branch labelled T when the predicate expression is true and along the branch
labelled F when the predicate expression is false.
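A minimal encoding of these definitions, sketched in Python (the representation and the example node names are our assumptions, not ones used in the paper):

```python
from dataclasses import dataclass, field

@dataclass
class CFG:
    """A control flow graph (N, E, s, e) with labelled branches."""
    nodes: set
    edges: set                 # pairs (src, dst) drawn from N
    start: str                 # the unique start node s
    exit: str                  # the unique exit node e
    branches: dict = field(default_factory=dict)   # conditional -> {'T': ..., 'F': ...}

# The first example fragment: one conditional node guarding the target branch.
g = CFG(
    nodes={'s', 'c1', 'target', 'e'},
    edges={('s', 'c1'), ('c1', 'target'), ('c1', 'e'), ('target', 'e')},
    start='s',
    exit='e',
    branches={'c1': {'T': 'target', 'F': 'e'}},
)

# A conditional node has exactly two successors, one per branch label.
assert {dst for src, dst in g.edges if src == 'c1'} == {'target', 'e'}
assert set(g.branches['c1']) == {'T', 'F'}
```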
A path in the CFG is a sequence of nodes < n0, ..., ni, ..., nk > such that a node may directly follow another in the path only if it is a control flow successor of that node. An executable path is a path where n0 = s and nk = e.
An executable path is feasible if some program input may execute it. A branch is said to be executed if the first
and second nodes of the branch are adjacent in an execution path. A set of branches is executable iff there is an
execution path that executes them. A branch is said to be reached if the first node of the branch is executed, i.e. a
member of an execution path.
The program control dependency graph (Ferrante, Ottenstein, and Warren 1987; Harrold and Rothermel 1996) may
be derived from the program control flow graph. A node x dominates a node y (x ≠ y) if x is a node through which all executable paths to y must pass. A node y post-dominates a node x (x ≠ y) if every executable path that passes through x contains y. Let x and y be nodes in a control flow graph. y is control dependent on x if there is a path p from x to y with all nodes z in p (excluding x and y) post-dominated by y, but x is not post-dominated by y.
The control dependency relation may be used to construct the control dependency graph. In this graph, the node
s is control dependent on a root node entry, as is the node e, assuming the program terminates. All other nodes
(assumed to be reachable) are directly or indirectly control dependent on entry. A path in the control dependency
graph from entry to a given node x defines a sequence (possibly empty) of branches that, if executed, will lead
to the execution of x. The conjunction of predicate conditions along a control dependency path to x is known
as a control dependency path condition. There may be more than one control dependency path to x and since a
path that reaches x must satisfy one of the control dependency path conditions for x, the disjunction of the control
dependency path conditions for x is the control dependency condition for x. The control dependency path (control
dependency condition) for a branch may be defined as the control dependency path (control dependency condition)
of the second node of the branch.
The method described in this paper requires searching for inputs that execute a set of branches as a means to
execute a target branch. These sets are produced by combining control dependency paths. Clearly, these sets of
branches should be executable. To appreciate the implications of this, consider the following example.
if (P) {
if (R) {
...;
}
if (Q) {
return;
}
}
if (T) {
//TARGET BRANCH
}
Assume that the test goal is to execute the branch T. The control dependency condition for T is (not P, T) or (P, not Q, T). To maintain diversity, each of the two disjuncts (control dependency path conditions) is used to guide a search in a separate population, which is to say that one search is made to find an input to satisfy (not P, T) and another is made to satisfy (P, not Q, T). Consider now the problem of deciding if R may be added as an additional subgoal to either of these search goals. It is necessary to determine if R and the branches in the two control dependency paths are executable.
It is easy to establish if two branches are jointly executable by examination of the transitive closure of the control
flow graph. In this graph, each node has a reach set consisting of all the nodes reachable from that node. A branch
may be said to be reachable from a given branch if the reach set of the second node of the given branch includes
the first node of the other branch.
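This reachability test can be sketched in Python using Warshall's algorithm for the transitive closure (the node names below are illustrative, loosely following the P, R, Q, T example):

```python
def reach_sets(nodes, edges):
    """Transitive closure by Warshall's algorithm: each node is mapped to
    the set of nodes reachable from it in one or more steps."""
    reach = {n: {b for (a, b) in edges if a == n} for n in nodes}
    for k in nodes:                      # intermediate node
        for n in nodes:
            if k in reach[n]:
                reach[n] |= reach[k]
    return reach

def branch_reachable(reach, b1, b2):
    """b2 may follow b1 if the reach set of b1's second node contains
    b2's first node (or they coincide)."""
    return b2[0] == b1[1] or b2[0] in reach[b1[1]]

# Skeleton of the example above: p guards r and q, t follows.
nodes = {'s', 'p', 'r', 'q', 't', 'e'}
edges = {('s', 'p'), ('p', 'r'), ('p', 't'), ('r', 'q'),
         ('q', 't'), ('q', 'e'), ('t', 'e')}
reach = reach_sets(nodes, edges)
assert branch_reachable(reach, ('p', 'r'), ('q', 't'))       # R may precede T
assert not branch_reachable(reach, ('q', 't'), ('p', 'r'))   # T cannot precede R
```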
Given a set of branches, a brute-force algorithm that examines all sequences of branches is computationally expensive. Considering how this expense might be avoided, notice that in the example above, it might be assumed that it is necessary to establish only that R and the target branch T are executable, since a control dependency path is always executable. In fact, although R and T are executable, and also the branches P, R, not Q, T, it does not follow that all control dependency paths to T are executable with R: R, not P, T is not.
Multiple control dependency paths exist only in the presence of explicit transfer of control statements such as
return, break and continue statements [1], and so the majority of control dependency conditions will consist of a
single path. Nonetheless, even in cases where a target branch is control dependent on multiple control dependency
paths it is possible to establish that the target branch together with a particular control dependency path and some
given branch are executable without examining all sequences of the branches involved.
This is justified by the following observation. Let T be a target branch and let p be one of the control dependency
paths to T and let R be a branch not in p. If the branches in p together with R are executable then these branches
may be placed in a sequence in which each successor branch is reachable from its predecessor and moreover, the
ordering of branches in p is preserved. This means that to establish the executability of p and R, it is necessary to
examine only sequences of these branches in which the ordering in p is preserved rather than examine all possible
sequences.
The justification for this observation is as follows. If p together with R is executable then there exists a sequence
of these branches, s, in which each successor branch is reachable from its predecessor. Let the set of branches p' be the subset of p that occurs in s prior to the first occurrence of R. Let b be the last branch in p' as defined by the ordering in p. It follows that the prefix of p that ends in b, followed by R, is executable. Note that although the predecessor of R in s may precede b in p, there is a path through all the branches of p and therefore from this predecessor to b.
[1] In the absence of explicit transfer of control statements, there are two control dependency paths to a loop header node, but one of the paths is produced by adding a back edge to the path to the loop body; as such it does not provide an alternative path to the loop header and thus is not used as a search guide.
If b is not the last node of p, then let c in p be the successor of R in s. From c it is possible to reach all the branches in p less the branches in the prefix up to b (recall the existence of s), and hence it is possible to reach the successor of b in p. It follows that R together with p less the prefix that ends in b is executable.
Returning to the above example, adding R to the path (not P, T) requires examination of the sequences obtained by merging the control dependency paths (not P, T) and (P, R), the latter being the control dependency condition for R. This produces the sequences (not P, P, R, T), (not P, P, T, R), (P, not P, R, T), (P, not P, T, R), (P, R, not P, T) and (not P, T, P, R), none of which are executable. Merging the control dependency paths (P, not Q, T) and (P, R) produces the sequence (P, P, R, not Q, T), which is executable.
In the above example, the control dependency condition for R consists of a single path. In general, there will be
multiple control dependency paths to a branch such as R and subject to executability, a single path from the set of
paths to R is combined with a single path to the target branch to form the search goal of a single population. Again,
executability of two control dependency paths can be established by examination of branch sequences in which the
ordering of branches in each control dependency path is preserved. This is because the argument that applies to
any prefix of the control dependency path to the target branch also applies to any prefix of the control dependency
path to the branch R.
3.2 Branch cost functions
To use a control dependency path or any set of branches as a search goal, it is necessary to determine the cost
values for each branch predicate. To do this, each conditional node in the program is associated with a real-valued
predicate cost function that is evaluated whenever the conditional node is executed. This predicate cost function
returns a positive value whenever the predicate is false and a negative value if the predicate is true. The cost of
an evaluation of a logical negation of a predicate is the arithmetic negation of the cost of the evaluation of the
predicate.
Each reached branch maintains two cost values, both derived from the associated predicate cost function. One cost value measures the requirement that all attempts to execute the branch are successful; this is called the cumulative and-cost. The other measures the requirement that at least one attempt is successful; this is called the cumulative or-cost. These costs can be illustrated with an example showing three failed and two successful attempts to satisfy the predicate a <= b for various integer values of a and b. The predicate cost function is a - b when the predicate is false, and a - b - 1 when the predicate is true. The cost values produced by relational predicates are normalised, but the unnormalised values are used in the table below in order to show the arithmetic more clearly.
Table 1: Cumulative or-cost and and-cost for the predicate a <= b for the values listed.

  a    b    cost    and-cost    or-cost
  4    1       3           3          3
  3    1       2           5        6/5
  2    1       1           6       6/11
  1    1      -1           6         -1
  1    2      -2           6         -3
The cost of a conjunction of two false costs is the sum of the costs of the conjuncts. The cost of a disjunction of two false costs is pq/(p + q), where p and q are the disjunct costs. If only one cost is false, the conjunction cost is this false cost;
the disjunction cost is the true cost. The motivation for these functions is given in (Bottaci 2003). The relevant
property of these cost functions for the work presented here (as can be seen in Table 1) is that the cumulative
and-cost increases with each failure to execute the predicate and the cumulative or-cost decreases.
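These combination rules can be checked mechanically. In the Python sketch below, the rules for combining two true (negative) costs are not stated in the text and are assumed here to be the duals of the stated rules; with that assumption the sketch reproduces the and-cost and or-cost columns of Table 1 (reading the final attempt as a = 1, b = 2):

```python
def cost_and(p, q):
    """Conjunction of two predicate costs (positive = false, negative = true)."""
    if p > 0 and q > 0:
        return p + q                 # both false: failures accumulate
    if p > 0 or q > 0:
        return max(p, q)             # one false: the false cost dominates
    return (p * q) / (p + q)         # both true: assumed dual of the or-rule

def cost_or(p, q):
    """Disjunction of two predicate costs."""
    if p > 0 and q > 0:
        return (p * q) / (p + q)     # both false
    if p > 0 or q > 0:
        return min(p, q)             # one false: the true cost is taken
    return p + q                     # both true: assumed dual of the and-rule

# Unnormalised cost for a <= b: a - b when false, a - b - 1 when true.
def predicate_cost(a, b):
    return a - b if a > b else a - b - 1

attempts = [(4, 1), (3, 1), (2, 1), (1, 1), (1, 2)]
costs = [predicate_cost(a, b) for a, b in attempts]     # 3, 2, 1, -1, -2

and_cost = or_cost = costs[0]
history = [(and_cost, or_cost)]
for c in costs[1:]:
    and_cost, or_cost = cost_and(and_cost, c), cost_or(or_cost, c)
    history.append((and_cost, or_cost))

assert [a for a, _ in history] == [3, 5, 6, 6, 6]       # cumulative and-cost
assert [round(o, 4) for _, o in history] == [3, 1.2, round(6/11, 4), -1, -3]
```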
Note that when both branches at a conditional node have been executed the and-cost is positive and the or-cost is
negative. Moreover, the magnitude of the and-cost is an indication of the number and magnitude of the failures
to satisfy the predicate. A high and-cost indicates that the predicate has rarely been satisfied. A low and-cost indicates that the failures to satisfy the predicate have been few or marginal.
The cost of a search goal is calculated, according to the formula for conjunction given earlier, as the conjunction of the individual branch goal costs. Each individual branch cost is either a branch or-cost or a branch and-cost. With this method there is the danger that a single large branch cost may dominate the overall cost value. Normalisation of costs reduces this risk. An alternative method, not used in the work reported here, is to compute a cost consisting of two components. One component, the most significant, counts the branch goals that have yet to be
satisfied. This cost component is the analogue of the approximation level used by Wegener et al. (Wegener, Baresel,
and Sthamer 2001) and Baresel et al. (Baresel, Sthamer, and Schmidt 2002). The second component is applicable
only if the first component is nonzero and is calculated as the disjunction of the unsatisfied branch goals.
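A sketch of this alternative two-component cost in Python (the tuple encoding, and the use of lexicographic comparison, are our assumptions; false costs are assumed strictly positive):

```python
def two_component_cost(goal_costs):
    """Lexicographic cost: (number of unsatisfied branch goals, disjunction
    of their costs).  A goal is satisfied when its cost is negative."""
    unsatisfied = [c for c in goal_costs if c >= 0]
    if not unsatisfied:
        return (0, 0.0)
    disj = unsatisfied[0]
    for c in unsatisfied[1:]:
        disj = (disj * c) / (disj + c)    # disjunction of false costs
    return (len(unsatisfied), disj)

# Satisfying one more goal always outranks any improvement in magnitude:
assert two_component_cost([5.0, -1.0]) < two_component_cost([0.5, 0.5])
assert two_component_cost([-2.0, -1.0]) == (0, 0.0)
```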
For each branch, there are two associated branch search goals that may be specified to guide a search, namely
branch-or (on at least one occasion that the branch is reached, it is executed) and branch-and (on every occasion
that the branch is reached, it is executed). A branch goal is satisfied if the associated or-cost or and-cost is negative.
If execution of a branch is required to satisfy branch coverage or a control dependency condition then branch-or is
the relevant branch goal. When a search goal is augmented with additional branch goals in order to guide the search
to new execution paths, these branch goals are generated from branch execution data according to the following
rules:

• If a branch (at a predicate that is not a member of the current search goal) has been reached but not executed, then branch-and is an additional branch goal.

• If a branch (again, at a predicate that is not a member of the current search goal) has been executed but its negation has not, then the negation branch-and is an additional branch goal.

• In cases where both a branch and its negation have been executed, two additional branch goals are adopted: branch-and for the branch and branch-and for the branch negation.
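The three rules can be expressed compactly. The following Python sketch (the encoding of goals as tuples is ours) returns the additional branch-and goals for one reached predicate outside the current search goal:

```python
def additional_branch_goals(reached, executed, negation_executed):
    """Additional branch-and goals for one reached predicate that is not
    part of the current search goal.  Goals are encoded as illustrative
    (side, kind) tuples."""
    if not reached:
        return []
    if executed and negation_executed:
        return [('branch', 'and'), ('negation', 'and')]
    if executed:
        return [('negation', 'and')]
    return [('branch', 'and')]

# Reached but never executed: adopt branch-and for the branch itself.
assert additional_branch_goals(True, False, False) == [('branch', 'and')]
# Executed, but its negation never: adopt branch-and for the negation.
assert additional_branch_goals(True, True, False) == [('negation', 'and')]
# Both sides executed: one branch-and goal per side.
assert additional_branch_goals(True, True, True) == \
       [('branch', 'and'), ('negation', 'and')]
```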
Note that the goal of satisfying the branch or-cost is not adopted, primarily to avoid creating an excessive number of subgoals. Moreover, if it is necessary to find an input that executes both branches at a predicate, it is hoped that such an input will be found during the search for an input to satisfy the and-cost. Recall that the and-cost is adopted as the branch goal after one branch at a predicate has already been executed.
The above rules apply to non-loop branches only. Loops are treated differently to if-statements because for pro-
grams that terminate, loop entries are eventually followed by a loop exit. For this reason, a subgoal that specifies
that a loop predicate is always true is not sensible and thus the two possible branch goals at a loop conditional are
loop entry and no loop entry.
4 Branch diversity search method
A genetic algorithm is an appropriate search technique given that the basis of the branch diversity search method
is to conduct a number of searches to explore different regions of the input domain. For the purpose of the work
presented here, a genetic algorithm can be described crudely in terms of three components, a set of candidate
solutions, called a population, a cost function (also known as a fitness function) and a set of search (genetic)
operators that can produce new candidate solutions by copying and modifying existing candidate solutions in the
population.
A basic genetic algorithm conducts a search by selecting candidate solutions from the population and using them
to produce new candidate solutions. The selection is random but biased towards the most promising candidates as
estimated by the cost function. The size of the population is usually fixed and so as new candidates are produced,
the least promising are discarded. This is survival of the fittest. Over many iterations, the population is said to
evolve towards a solution.
A multi-population genetic algorithm (Cantu-Paz 1998) extends the basic genetic algorithm by including a number
of populations. In the work reported here, each population is evolved with its own cost function. This is done
to direct the search in different populations to different regions of the input domain. The use of multiple populations is
a simple method of maintaining diversity in the set of inputs. If only a single population is used, survival of the
fittest can lead to the elimination of all individuals except those exploring the single lowest cost input region.
In general, multi-population genetic algorithms may allow individuals to “migrate” from one population to another.
An individual produced in one population is added to another population providing its evaluation according to the
cost function of the foreign population is sufficient to displace an existing individual. Migration is normally limited
in order to maintain the differences between populations.
In the work reported here, migration is unrestricted. There are two reasons for this. Firstly, each population has its
own cost function which is the overriding determinant of which individuals remain in a population irrespective of
the number of migrants from other populations. For this reason, unrestricted migration does not lead to the loss of
diversity that it might in other multi-population genetic algorithms. Secondly, it is efficient to reuse executed tests
wherever possible since the time required to execute the program under test is usually the most important factor
that determines the speed with which test data is generated. Once an input has been executed and the cost function
data collected, an evaluation of the input against any specific cost function can be produced relatively quickly.
The following algorithm describes the main iterative procedure of the search method. For each control dependency
predicate path to the target branch, a population and associated cost function is created to search for the target.
The initial populations are constructed by randomly generating and executing a number of inputs. The following
algorithm then applies.
while (test program execution count < max test program execution count) {
do {
foreach (non-stagnant population) {
evolve population for a new input
if (target branch executed)
stop
if (population stagnant) {
if (can identify suitable branches as additional subgoals) {
foreach (additional subgoal control dependency path) {
create new search goal consisting of
current search goal and subgoal control dependency path
if (new search goal not a duplicate) {
create new population with new search goal
seed new population from existing and new tests
add new population to current populations
}
}
}
}
}
}
if (all populations stagnant) {
set all populations non-stagnant
}
}
Since it is not known which population will produce a solution, each population is evolved for only one input
in turn before moving on to the next population. A genetic algorithm of the so-called steady-state variety such
as Genitor (Whitley 1989) is a convenient way to do this. Reproduction takes place between two individuals
who produce one or two offspring (depending on the choice of reproduction operator). These offspring are then
evaluated and either inserted into the original population expelling the one or two least fit or discarded if the
offspring are the least fit. The population is kept sorted according to cost and the probability of selection for
reproduction is based on rank in this ordering.
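A minimal steady-state step in this style might look as follows in Python; the genetic operators and the toy problem are illustrative, not those used in the reported experiments:

```python
import random

def steady_state_step(population, cost, crossover, mutate):
    """One Genitor-style step: rank-biased parent selection, one offspring,
    inserted only if it is fitter than the current worst individual."""
    population.sort(key=cost)                 # best (lowest cost) first
    weights = range(len(population), 0, -1)   # linear rank bias
    p1, p2 = random.choices(population, weights=weights, k=2)
    child = mutate(crossover(p1, p2))
    if cost(child) < cost(population[-1]):
        population[-1] = child                # expel the least fit

# Toy use: minimise abs(v - 1) over integer candidates.
random.seed(1)
pop = [9, 4, 7, 2]
for _ in range(50):
    steady_state_step(
        pop,
        cost=lambda v: abs(v - 1),
        crossover=lambda a, b: (a + b) // 2,
        mutate=lambda v: v + random.choice((-1, 0, 1)),
    )
assert len(pop) == 4                        # population size is fixed
assert min(abs(v - 1) for v in pop) <= 1    # the best cost never worsens
```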
An important consideration is determining when a population is no longer evolving towards a solution. In the work reported here, search progress in a single population is considered to have stopped when a sequence of input cost values of a given length k has been accumulated and, comparing each cost with the cost l positions later (where l >= k/2), the majority of comparisons do not show a cost decrease. Such a population is said to be stagnant.
Stagnant populations are not evolved but their search goals are extended and used to evolve other populations.
Whenever all populations are stagnant and the maximum execution count has not yet been reached then in order
to continue searching for inputs, the stagnant status of all populations is cleared. This is done by simply emptying
the sequence of accumulated cost values. This ensures that a formerly stagnant population is evolved for at least
k inputs before there is the possibility of once again becoming stagnant. Note that, in this scheme, since the most
effective populations will take longer to stagnate, they will be given more of the computation time.
A search goal for a population is a set (actually a conjunction) of branch search goals. New search goals are
generated as follows. The best input so far found is executed to identify the set of reached predicates that are
also absent from the current search goal. Branch goals are generated from each of these predicates as described
earlier. Branch goals that cannot be executed before the target branch are discarded. For each remaining branch
goal, and for each control dependency path to the branch goal, if the control dependency path and current search
goal are executable then the current search goal is extended by adding the branch goal and the goals of the control
dependency path. The extended search goals that are not duplicates of existing search goals are used to evolve new
populations. Note that since a search goal may contain conditions in addition to execution of the target branch, the
target branch may be executed without satisfying a search goal.
All new populations, apart from the initial population, are seeded half from the inputs in existing populations and
the other half are generated randomly. Reusing the existing tests is usually efficient since once a test is found to
execute a given branch, it need not be “rediscovered” if it is required for a later branch.
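A minimal sketch of this seeding policy (the function names and the random-input generator are assumptions, not the paper's implementation):

```python
import random

def seed_population(existing_populations, size, random_input):
    """Seed a new population half from inputs already held in existing
    populations and half from randomly generated inputs."""
    pool = [ind for pop in existing_populations for ind in pop]
    seeded = random.sample(pool, min(size // 2, len(pool)))  # reuse found tests
    while len(seeded) < size:
        seeded.append(random_input())                        # fill with random inputs
    return seeded
```

Reused inputs carry over branches they already execute, so a test that covers an early branch need not be rediscovered by the new search.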
5 Case study
To describe in practice how branch goals are selected and how search goals are generated to find inputs that execute a
particular branch, two sample programs are studied in detail. Consider the following example test program that
checks whether there is an equal number of space and non-space characters in a string of at least eight characters.
1  if (s.Length > 7) {
       mismatchcount = 0;
       i = 0;
4      while (i < s.Length) {
5          if (s[i] == ' ') {
               mismatchcount++;
           }
           else {
               mismatchcount--;
           }
           i++;
       }
12     if (mismatchcount == 0) {
           print("parity"); //TARGET BRANCH
       }
   }
To execute the target branch, the two branches of the inner conditional must be executed the same number of
times. The control dependency condition for the target branch, however, does not include these branches: it is
s.Length > 7 and mismatchcount == 0.
Given the 16-bit Unicode character set and uniform random character generation, the probability of generating a
random string that includes a space is low. As a result, the cost function produces low values only by directing the
search towards shorter strings since this is the only way to reduce the cost of abs(mismatchcount - 0). Eventually,
the population of candidate solutions may be expected to consist of strings all with a length of eight characters,
none of which are spaces. Note that in the unlikely event that any string does include a single space then the cost
function will rapidly ensure that all candidate solutions also include a single space since such strings have a cost
of only six. The crossover operator of a genetic algorithm is also likely to distribute those spaces and so the target
will eventually be satisfied. Until the first space is found, however, the search is directed by the cost function
abs(mismatchcount - 0) which in effect is a random search for a string that includes a space character.
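The cost driving this phase can be transcribed directly (a Python paraphrase for illustration; the function name is an assumption):

```python
def parity_cost(s):
    """Branch distance for the 12:To goal of the space-parity example:
    abs(mismatchcount - 0) after the counting loop has run."""
    mismatchcount = 0
    for ch in s:
        # spaces increment the counter, non-spaces decrement it
        mismatchcount += 1 if ch == ' ' else -1
    return abs(mismatchcount)
```

For an eight-character string with no spaces the cost is 8; introducing a single space drops it to 6, which is why such strings quickly dominate the population.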
The behaviour of the branch diversity algorithm for the example space-parity program is outlined below.
1. The initial search goal is the set of goal branches {1:To, 12:To}. A goal branch is shown as an integer
(this is the predicate identifier and corresponds to the line number shown) and one of the four symbols
To, Ta, Fo, Fa which denote, respectively, 'at least one execution of the predicate true branch', 'all execu-
tions of the predicate should be true', 'no execution of the predicate true branch', and 'at least one execution of
the predicate false branch'.
2. A single population with the initial search goal is evolved until stagnant, at which point the reached pred-
icates are {1, 4, 5, 12}. Since branches at predicates 1 and 12 are already present in the search goal, only
branches at predicates 4 and 5 are considered.
3. The predicate at 4 is a loop, which has been entered, and so a potential branch goal is 4:Fo, i.e. no loop entry.
If it is assumed that the predicate at 5 has not been satisfied then 5:Ta is a potential branch goal. Both 4:Fo
and 5:Ta are executable with {1:To, 12:To} and so two new search goals are constructed, {1:To, 4:Fo, 12:To}
and {1:To, 4:To, 5:Ta, 12:To}. Note that 4:To is added because it is the control dependency condition for the
branch goal 5:Ta.
4. The population with the search goal {1:To, 4:Fo, 12:To} is likely to stagnate. This prompts an attempt to
generate new search goals. Again, the reached predicates are {1, 4, 5, 12}. Since branches at predicates 1,
4 and 12 are already present in the search goal, only branches at predicate 5 are considered. If it is assumed
that the predicate at 5 has not been satisfied then 5:Ta is a potential branch goal. Note, however, that 5:Ta is
not executable with {1:To, 4:Fo, 12:To} and so may not be added to the search goal. There are no other ways
in which this search goal may be extended.
5. The population with the search goal {1:To, 4:To, 5:Ta, 12:To} is likely to find an input that executes the true
branch at 5. Once this occurs, a reduction in the cost of the 12:To branch follows.
Note that the presence of 5:Ta rather than 5:To guides the search towards inputs that have a relatively large number
of spaces and indeed, on its own, it would attempt to maximise the number of spaces in the string. The presence
of 12:To counters this tendency. In this example either 5:To or 5:Ta is adequate to guide the search to a solution
but this is not always the case. There are cases where Ta rather than To is the necessary branch goal type; the
following example illustrates this. There may also be occasions where To is required rather than Ta. Adopting
both the Ta and To variants as branch goals significantly increases the number of populations, with the danger that the
method becomes unworkable.
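The four branch goal types can be checked against the sequence of outcomes a predicate took during one execution. The sketch below is illustrative; the paper defines the four types but not this function:

```python
def goal_satisfied(goal_type, outcomes):
    """Evaluate a To/Ta/Fo/Fa branch goal over the list of boolean outcomes
    recorded for one predicate during a single execution."""
    if goal_type == 'To':   # at least one execution of the true branch
        return any(outcomes)
    if goal_type == 'Ta':   # all executions of the predicate are true
        return bool(outcomes) and all(outcomes)
    if goal_type == 'Fo':   # no execution of the true branch
        return not any(outcomes)
    if goal_type == 'Fa':   # at least one execution of the false branch
        return any(not o for o in outcomes)
    raise ValueError(goal_type)
```

For a predicate that evaluated true, false, true on one run, To and Fa are satisfied while Ta and Fo are not.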
In order to compare the method presented here more directly with the chaining method (Ferguson and Korel 1996),
the program example presented in (Ferguson and Korel 1996) (line numbers as in the original) is reworked to show
that the branch diversity method is also able to solve the test goal. The program accepts two integer arrays a and b
of length 10 and an integer target. The overall test goal is to execute the true branch of the if-statement at line
16. This branch is executed when at least one member of a and all the members of b are equal to target.
   i = 0;
   fa = false;
4  fb = false;
5  while (i < 10) {
6      if (a[i] == target) {
           fa = true;
       }
       i = i + 1;
   }
9  if (fa == true) {
       i = 0;
11     fb = true;
12     while (i < 10) {
13         if (b[i] != target) {
14             fb = false;
           }
           i = i + 1;
       }
   }
16 if (fb == true) {
       print("mess 1"); //TARGET BRANCH
   }
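For reference, the condition under which the target branch executes can be transcribed as follows (a Python paraphrase of the program above, not the paper's code):

```python
def executes_target(a, b, target):
    """Return True when the target branch of the Ferguson and Korel example
    would execute: some a[i] equals target and every b[i] equals target."""
    fa = False
    for i in range(10):
        if a[i] == target:
            fa = True
    fb = False
    if fa:
        fb = True
        for i in range(10):
            if b[i] != target:
                fb = False
    return fb
```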
1. The control dependency condition for the target branch is fb == true and is the search goal for the initial
population. This first population very quickly stagnates because the cost function for fb == true has only
two values, corresponding to the two boolean values, and is unable to provide a decreasing cost gradient.
2. The candidate subgoals are the reached branches not in the population goal, namely the while statement
at line 5, the if-statement at line 6 and the if-statement at line 9. This last conditional may be assumed to
be false because of the low probability of satisfying the if-statement at line 6; consequently, the remaining
branches are not reached. From these predicates, the following branch goals are generated: 5:Fo (no loop
entry), 6:Ta and 9:Ta.
3. The searches for 5:Fo and 9:Ta will stagnate but the search for 6:Ta will eventually prove successful. Once
the true branch of the if-statement at line 6 has been executed, the true branch at line 9 is executed, the
loop at line 12 is entered and it is very likely that the true branch of the following if-statement will also be
executed. Unfortunately, execution of this branch prevents satisfaction of the test goal. The populations will
all stagnate.
4. The reached non-executed branches not in the population goal are the false branch at line 9 and very likely
the false branch of the if-statement at line 13. Adopting this false branch as a subgoal, i.e. 13:Fo (with the
and-cost as the cost function since only the true branch has been executed) will guide the search towards
arrays b in which each element is equal to target. This leads to the execution of the target branch.
Notice that since the predicate at 13 must be false on every visit, the 13:Fa branch goal would not have led to
success.
In practice, the generation of new search goals depends on when a population becomes stagnant and the branches
reached by the current best input. An example run for the above program is given in the next section.
To compare the behaviour of the branch diversity method with that of the chaining method for this program: in the
chaining method, an initial search is initiated for an input to execute the target branch. If no such input can be found,
data flow analysis is used to identify the statements that are the last definitions of variables used in the failed branch predicate.
A definition of a variable is a last definition if there is a definition free path for that variable from the definition
successor in the control flow graph to the variable use. The chaining method creates search subgoals, represented
as event sequences. A local, non-population based search method is used to find input data that executes each last
definition along some definition free path. In the example program, the root of the tree is the event sequence that
specifies execution of the branch at line 16. From this point it identifies last definitions of fb at lines 4, 11 and
14. An attempt to execute the definition free path from the definition at line 4 has already failed and so this is not
reconsidered. The definition free paths from the definitions at lines 11 and 14 are therefore used to create the two
child event sequences of the root node. McMinn and Holcombe (2004) have adapted the
chaining method to use genetic algorithm search to satisfy the subgoals identified, but they retain event sequences.
The relative merits of using data-flow analysis rather than predicate cost values to select subgoals are an interesting ques-
tion. The methods are not incompatible with each other and so there may well be an argument for combining
them. In the first example program presented, each conditional statement body defined and used the significant
variable v. If, however, there were many more conditionals in the program, none of which defined v, then the
advantage of data flow analysis becomes clear, as it can avoid pursuing those branches that cannot affect the value
of v. The branch diversity algorithm will pursue all branches regardless of data flow. The tree of search
goals is developed breadth first according to the progress of each search.
6 Results
The branch diversity algorithm has been implemented in order to assess the efficiency of the method in practice.
One practical concern is the number of searches that may be instigated. In the worst case, there may be one
population for each subset of the set of non-target branches in a program. It is unclear how often such cases arise
in practice. This problem is mitigated somewhat, however, by the fact that the populations of ineffective searches
become stagnant and so claim only a small proportion of the computational time. It might be considered that once
a population is stagnant its evolution should be discontinued altogether. The argument against this is that the
newly instigated searches are usually directed at only a proper subset of the region that may solve the test goal.
In addition, newly discovered inputs in a new population can be copied (migrated) and used to restart progress in
a stagnant population. Notice also that since a population may become stagnant more than once, it may produce
more than one set of extended search goals. Usually, these goals duplicate previously produced goals but this need
not be the case. New tests continue to migrate to stagnant populations. This means a stagnant population may be
revived to execute a different set of branches and thereby produce a different set of subgoals.
Another concern is the total size of the populations. Currently, the implementation maintains a maximum overall
population size for each set of populations created from extending a single population, and the available
memory is divided equally between the newly created populations. There is no limit to the number of populations.
The first example program presented in this paper was submitted to an implementation of the branch diversity
search method in order to generate inputs for all branch coverage. (The program was coded as a JScript program,
the source language acceptable to the implementation.) The input domain for each integer variable,
x, y and z was [-500000, 500000]. The overall population size for each newly formed set of populations was
120. The sequence of costs used to determine if a population is stagnant had a length of 100. No attempt was
made to tune any parameters of the genetic algorithm. In each of 30 trials, inputs were found to execute
all branches. Branch coverage required an average of 20,434 executions of the example program under test. Of
particular interest is the time taken to find an effective search goal. On average, a population with an effective
search goal, i.e. one satisfying the first and second predicates, was generated after 1,842 executions of the
example program under test.
The second example program that checks the number of spaces in a string for parity was also submitted to an
implementation of the branch diversity search method in order to generate inputs for all branch coverage. The
input domain for each character variable of the input string was the 16-bit Unicode character set. The overall
population size and the sequence of costs used to determine if a population is stagnant were as before and again
no attempt was made to tune any parameters of the genetic algorithm. Over 30 trials, all branches were covered
in an average of 16,860 executions of the example program under test. On average, a population with an effective
search goal, i.e. one satisfying the inner if-statement (5:Ta), was generated after 1,749 executions of the example
program under test.
The example program from (Ferguson and Korel 1996) was also submitted to an implementation of the branch
diversity search method, again in order to generate inputs for all branch coverage. The input domain for each
of the 21 integer values was [-99, 99]. The overall population size and the sequence of costs used to determine
if a population is stagnant were as before and again no attempt was made to tune any parameters of the genetic
algorithm. Over 30 trials, all branches were covered in an average of 20,661 executions of the example program
under test. On average, a population with an effective search goal, i.e. one including the branch goals 6:Ta and
13:Fo, was generated after 1,208 executions of the example program under test.
Since stagnant populations give rise to new populations, the populations can be arranged as a tree. The root is
the initial population. The populations used for searches instigated because the search in this initial population
stagnated are children of the initial population. The tree of search goals that was developed during one trial run of
the implementation is shown below.

[Tree of search goals from one trial run: the root is the initial goal 16:To; its children are the extended goals
5:Fo, 5:To 6:Ta and 9:Ta; further branch goals are added as each population stagnates, and two of the resulting
goals are marked as effective search goals.]
The initial population search goal is 16:To. After 100 executions of the program under test (100 being the length
of the sequence of costs used to determine stagnation), the initial population is stagnant and three new search goals
are formed. These appear at the first indentation level as 5:Fo, 5:To 6:Ta and 9:Ta, these being the additional branch
goals added to the initial population. Notice that it is at this point that the first of the crucial branch goals, 6:Ta, is
created. The second population to become stagnant is 16:To 5:Fo, after a total of 295 executions. At this point, the
branch goal 9:Ta is added to 16:To 5:Fo to form a new search goal. Note also that 9:To has not yet been executed
even though the required branch goal 6:Ta has been used in the search from execution 100 onwards.
After 297 executions, 4 new populations are created by adding branch goals to the population previously created
with 9:Ta at 100 executions. At this point, notice that the second of the crucial branch goals, 13:Fo, is created
together with its control dependency branch goal 12:To. At this stage, although 6:To has been executed, unfortunately
no search goal combines both of the crucial branch goals. After 487 executions, however, the 5:To 6:Ta branch
goals are added to the search goal 16:To 9:Ta 12:To 13:Fo to provide an effective search goal.
The tree of populations continues to grow, as shown, up to 7,951 executions. Note that a second effective search goal is
created after 940 executions. In fact, the presence of 9:Ta rather than 9:To is the only difference between this search
goal and the effective search goal generated at 487 executions. After a total of 28,636 executions, the required input
is found.
7 Conclusions and further work
A branch diversity search method has been presented for avoiding local optima and plateaux during the search for
test data. It relies on identifying branches in the program under test where the flow of control may be modified
without invalidating the test goal. Additional searches are instigated when it is deemed that one or more of the
current searches are not making progress. The additional searches are designed to explore regions of the input domain
that have not yet been explored. This is achieved by identifying branches in the program where the execution
behaviour may be changed without violating any control flow constraints implied by the current search goal. The
predicate cost values that are calculated to guide the search for test data are used to select suitable branches. Three
short but difficult-to-test programs have been shown to be amenable to the method.
The immediate further work is to apply the method to a larger set of programs to gain experience with its per-
formance and uncover its problems and strengths. For example, for large programs, the best search strategy for
developing the search tree of populations (depth first, breadth first, etc.) is not clear, and neither are the computational
demands of the method. In looking for branches that are suitable as subgoals, branches that have not been
executed are preferred. Loops, as mentioned earlier, are treated differently from if-statements: the available subgoals
are loop entry and no loop entry. Although it is not sensible to search for an input that enters a loop and never
exits, it may be useful to find inputs that maximise the number of loop iterations.
References
Baresel, A. and H. Sthamer (2003). Evolutionary testing of flag conditions. In Proceedings of GECCO 2003,
pp. 2442–2454. Springer Verlag.
Baresel, A., H. Sthamer, and M. Schmidt (2002). Fitness function design to improve evolutionary structural
testing. In Proceedings of Genetic and Evolutionary Computation Conference (GECCO 2002), pp. 1329–
1336. Morgan Kaufmann.
Bottaci, L. (2003). Predicate expression cost functions to guide evolutionary search for test data. In Proceedings
of Genetic and Evolutionary Computation Conference (GECCO 2003), pp. 2455–2464. Springer Verlag.
Cantu-Paz, E. (1998). A survey of parallel genetic algorithms. Calculateurs Paralleles, Reseaux et Systemes
Repartis 10(2), 141–171.
Ferguson, R. and B. Korel (1996, Jan). The chaining approach for software test data generation. ACM Transac-
tions on Software Engineering and Methodology 5(1), 63–86.
Ferrante, J., K. J. Ottenstein, and J. D. Warren (1987, July). The program dependence graph and its use in
optimization. ACM Transactions on Programming Languages and Systems 9(3), 319–349.
Harman, M., L. Hu, R. Hierons, A. Baresel, and H. Sthamer (2002). Improving evolutionary testing by flag
removal. In Proceedings of Genetic and Evolutionary Computation Conference (GECCO 2002), pp. 1359–
1366. Morgan Kaufmann.
Harrold, M. J. and G. Rothermel (1996). Syntax-directed construction of program dependence graphs. Technical
Report OSU-CISRC-5/96-TR32, Department of Computer Information Science, The Ohio State University,
Columbus, OH.
Jones, B. F., H. Sthamer, and D. Eyres (1996). Automatic structural testing using genetic algorithms. Software
Engineering Journal 11(5), 299–306.
Korel, B. (1990, August). Automated software test data generation. IEEE Transactions on Software Engineer-
ing 16(8), 870–879.
Korel, B. (1992). Dynamic method for software test data generation. Software Testing, Verification and Relia-
bility 2(4), 203–213.
McMinn, P. and M. Holcombe (2004, June). Hybridizing evolutionary testing with the chaining approach. In
Proceedings of GECCO 2004, pp. 1363–1374. Springer Verlag.
Michael, C., G. McGraw, M. Schatz, and C. Walton (1997). Genetic algorithms for dynamic test data genera-
tion. Technical Report RSTR-003-97-11, RST Corporation, Suite 250, 21515 Ridgetop Circle, Sterling VA
20166.
Tracey, N., J. Clark, and K. Mander (1998, March). Automated program flaw finding using simulated annealing.
Software Engineering Notes 23(2), 73–81.
Tracey, N., J. Clark, K. Mander, and J. McDermid (2000). Automated test data generation for exception condi-
tions. Software – Practice and Experience 30, 61–79.
Wegener, J., A. Baresel, and H. Sthamer (2001). Evolutionary test environment for automatic structural testing.
Information and Software Technology 43, 841–854.
Whitley, D. (1989). The GENITOR algorithm and selective pressure: why rank-based allocation of reproductive
trials is best. In Proceedings of the Third International Conference on Genetic Algorithms, pp. 116–121.
Testability Transformation for Efficient
Automated Test Data Search in the Presence of Nesting
Phil McMinn
University of Sheffield,
Regent Court,
211 Portobello Street,
Sheffield, S1 4DP, UK
p.mcminn@dcs.shef.ac.uk
David Binkley
Loyola College
4501 North Charles Street
Baltimore,
MD 21210-2699, USA
binkley@cs.loyola.edu
Mark Harman
King’s College
Strand, London
WC2R 2LS, UK
mark@dcs.kcl.ac.uk
Abstract
The application of metaheuristic search techniques to the automatic generation of software
test data has been shown to be an effective approach for a variety of testing criteria. However,
for structural testing, the dependence of a target structure on nested decision statements can
cause efficiency problems for the search, and failure in severe cases. This is because all
information useful for guiding the search - in the form of the values of variables at branching
predicates - is only gradually made available as each nested conditional is satisfied, one after
the other. The provision of guidance is further restricted by the fact that the path up to that
conditional must be maintained by obeying the constraints imposed by ‘earlier’ conditionals.
An empirical study presented in this paper shows the prevalence of types of if statement
pairs in real-world code, where the second if statement in the pair is nested within the
first. A testability transformation is proposed in order to circumvent the problem. The
transformation allows all branch predicate information to be evaluated at the same time,
regardless of whether ‘earlier’ predicates in the sequence of nested conditionals have been
satisfied or not. An experimental study is then presented, which shows the power of the
approach, comparing evolutionary search with transformed and untransformed versions of
two programs with nested target structures. In the first case, the evolutionary search finds
test data in half the time for the transformed program compared to the original version. In
the second case, the evolutionary search can only find test data with the transformed version
of the program.
1 Introduction
The application of metaheuristic search techniques to the automatic generation of software test data has been shown to be an effective approach for functional [11, 21, 20], non-functional [26, 19, 27], structural [12, 13, 4, 29, 10, 17, 25, 16, 15], and grey-box [14, 24] testing criteria. The search space is the input domain of the test object. An objective function provides feedback as to how 'close' input data are to satisfying the test criteria. This information is used to provide guidance to the search.
For structural testing, each individual program structure of the coverage criteria (for example each individual program statement or branch) is taken as the individual search 'target'. The effects
void example(int a, int b, int c, int d)
{
(1)  if (a > b)
     {
(2)    if (b > c)
       {
(3)      if (c > d)
         {
(4)        // target
           ...

Figure 1: Nested targets require the succession of branching statements to be evaluated by the objective function one after the other. (In the flowchart accompanying the code, each false branch misses the target and feeds a branch distance to the objective function: b - a at node 1, c - b at node 2 and d - c at node 3.)
of input data are monitored through instrumentation of the branching conditions of the program. An objective function is computed, which decides how 'close' an input datum was to executing the target, based on the values of variables appearing in the branching conditionals which lead to its execution. For example, if a branching statement 'if (a == b)' needs to be true for a target statement to be covered, the objective function feeds back a 'branch distance' value of abs(b - a) to the search. The objective values fed back are critical in directing the search to potential new test data candidates which might execute the desired program structure.

However, the search can encounter problems when structural targets are nested within more than one conditional statement. In this case, there is a succession of branching statements which must be evaluated with a specific outcome in order for the target to be reached. For example, in Figure 1, the target is nested within three conditional statements. Each individual conditional must be true in order for execution to proceed onto the next one. Therefore, for the purposes of computing the objective function, it is not known that b > c must be true until a > b is true. Similarly, until b > c is satisfied, it is not known that c > d must also be satisfied. This gradual release of information causes efficiency problems for the search, which is forced to concentrate on satisfying each predicate individually. For example, inputs where b is close to being greater than c are of no consequence to the objective function until a > b.

Furthermore, the search is restricted when seeking inputs to satisfy 'later' conditionals, because satisfaction of the earlier conditionals must be maintained. If, when searching for input values to make b > c true, the search chooses input values so that a is not greater than b, the path taken through the program never reaches the latter conditional, and thus the search never finds out whether b > c or not. Instead it is held up again at the first conditional, which must be made true in order to reach the second conditional again. This inhibits the test data search, and the possible input values it can consider in order to satisfy predicates appearing 'later' in the sequence of nested conditionals. In severe cases the search may fail to find test data.

Ideally, all branch predicates need to be evaluated by the objective function at the same time. This paper presents a testability transformation approach in order to achieve this. A testability transformation [7] is a source-to-source program transformation that seeks to improve the performance of a test data generation technique. The transformed program produced is merely a 'means to an end', rather than an 'end' in itself, and can be discarded once it has served its intermediary purpose as a vehicle for an improved test data search.

The ability to evaluate all branch predicates at the same time results in a significant positive impact on the level of guidance that can be provided to the search. This can be seen by examining the objective function landscapes of the original and transformed versions of programs. Experiments carried out using evolutionary algorithms on two case studies confirm this. In the first study, test data was found in half the number of input data evaluations for the transformed version. In the second study, the test data search was unsuccessful unless the transformed version of the program was used.
An empirical study is presented which examines if statement pairs occurring in forty real-world programs. In this study, the latter if statement of the pair is nested in the first. The results further serve to show the benefit of the proposed transformation approach. In previous work [3], a method is presented to simultaneously evaluate all nested branch conditions, but only if no further statements occur between each pair of if statements. The empirical study shows that this occurs for only 18% of if pairs, whereas the transformation approach is also potentially applicable to the additional 82% of cases.
2 Search-Based Structural Test Data Generation
Several search methods have been proposed for structural test data generation, including the alternating variable method [12, 13, 4], simulated annealing [23, 22] and evolutionary algorithms [29, 10, 17, 25, 16, 15]. This paper is interested in the application of the alternating variable method and evolutionary algorithms to structural test data generation.
2.1 The Alternating Variable Method
The alternating variable method [12] is employed in the goal-oriented [13] and chaining [4] test data generation approaches, and is based on the idea of 'local' search. An arbitrary input vector is chosen at random, and each individual input variable is probed by changing its value by a small amount, and then monitoring the effects of this on the branch predicates of the program.

The first stage of manipulating an input variable is called the exploratory phase. This probes the neighborhood of the variable by increasing and decreasing its original value. If either move leads to an improved objective value, a pattern phase is entered. In the pattern phase, a larger move is made in the direction of the improvement. A series of similar moves is made until a minimum for the objective function is found for the variable. If the target structure is not executed, the next input variable is selected for an exploratory phase.
In the example of Figure 1, the search target is the execution of node 4. Say the program is executed with the arbitrary input (a=10, b=20, c=30, d=10). Control flow diverges away from the target down the false branch from node 1. The search attempts to minimize the objective value, which is formed from the true branch distance at node 1, i.e. b - a. Exploratory moves are made around the value of a. A decreased value leads to a worse objective value. An increased value leads to an improved, smaller objective function value. Larger moves are made to increase a until a is greater than b. Suppose the input to the program is now (a=21, b=20, c=30, d=10). Execution now proceeds down the true branch from node 1, but diverges away down the false branch at node 2. The search now attempts to minimize the objective function c - b in order to make the predicate at node 2 true. Exploratory moves around a have no effect on the objective function. Therefore exploratory moves are made around the value of b. A decreased value of b leads to a worse objective function value, whilst an increased value leads to execution taking the false branch at node 1 again. Therefore the search explores values around the current value of c. Increased values have a negative impact on the objective function, whilst decreased values lead to an improvement. Further moves are made to decrease the value of c until input is found which makes the predicate at node 2 true. Suppose this is (a=21, b=20, c=19, d=10). Execution now proceeds directly through all branching statements to target node 4.
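The walkthrough above can be sketched in code. The objective below penalises each unsatisfied nested conditional with its branch distance plus a large offset for the levels still unreached (the fixed 1000 offsets are an assumption made for this sketch; the literature normalises branch distances instead):

```python
def objective(a, b, c, d):
    """Objective for the Figure 1 example: smaller is closer to node 4."""
    if a <= b:
        return 2000 + (b - a + 1)   # diverged at node 1
    if b <= c:
        return 1000 + (c - b + 1)   # diverged at node 2
    if c <= d:
        return d - c + 1            # diverged at node 3
    return 0                        # target node 4 reached

def alternating_variable_method(start):
    """Exploratory +/-1 probes on each variable; on improvement, a pattern
    phase doubles the step in the improving direction (a sketch)."""
    vec = list(start)
    best = objective(*vec)
    improved = True
    while best > 0 and improved:
        improved = False
        for idx in range(len(vec)):
            for direction in (-1, 1):       # exploratory phase
                trial = vec[:]
                trial[idx] += direction
                cost = objective(*trial)
                step = direction
                while cost < best:          # pattern phase
                    vec, best = trial, cost
                    improved = True
                    step *= 2
                    trial = vec[:]
                    trial[idx] += step
                    cost = objective(*trial)
    return vec, best
```

Started from (a=10, b=20, c=30, d=10), this sketch first raises a past b, then adjusts b, then lowers c below d, mirroring the walkthrough above.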
2.2 Evolutionary Testing
Evolutionary testing [29, 10, 17, 25, 16, 15] employs evolutionary algorithms for the test data search. Evolutionary algorithms [28] combine characteristics of genetic algorithms and evolution strategies, using simulated evolution as a search strategy, employing operations inspired by genetics and natural selection.

Evolutionary algorithms maintain a population of candidate solutions rather than just one current solution, as with local search methods. The members of the population are iteratively
recombined and mutated in order to evolve successive generations of potential solutions. The aim is to produce 'fitter' individuals in subsequent generations, which represent better candidate solutions. Recombination forms offspring from the components of two parents selected from the current population. The new offspring form part of the new generation of candidate solutions. Mutation performs low-probability random changes to solutions, introducing new genetic information into the search. At the end of each generation, each solution is evaluated for its fitness using a 'fitness' function. The fitness function can be the direct output of an objective function, or this value ranked or scaled in some way. Using fitness values, the evolutionary search decides whether individuals should survive into the next generation or be discarded.
In applying evolutionary algorithms to structural test data generation [29, 10, 17, 25, 16, 15], 'candidate solutions' are possible test data inputs. The objective function evaluates each test data input with regard to the current structural target in question. This is performed in a slightly different way to the alternating variable method. The notion of branch distance is still key, but as the search does not work by iteratively improving one solution, the objective function incorporates another metric known as the approach level (also known as the approximation level) [25], which records how many nested conditionals are left unencountered by an input en route to the target.
Take the example of Figure 1 again. If some test data input reaches node 1 but diverges away down the false branch, its objective value is formed from the true branch distance at node 1, and an approach level value of 2 to indicate there are still two further branching nodes to be encountered (nodes 2 and 3). If the test data input evaluates node 1 in the desired way, its objective value is formed from the true branch distance at node 2, with the approach level now being one. At node 3, the approach level is zero and the branch distance is derived from the true branch predicate.
Formally, the objective function for a test data input is computed as follows:

obj_val = approach_level + normalize(branch_dist)    (1)

where the branch distance branch_dist is normalized into the range 0–1 by the function normalize, using the following formula [1]:

normalize(branch_dist) = 1 − 1.001^(−branch_dist)    (2)
thus ensuring the value added to the approach level is close to 1 when the branch distance is very large, and zero when the branch distance is zero.
The approach level, therefore, adds a value for each branch distance which remains unevaluated. Since these values are not known (the path of execution through the program has meant they were never calculated), the maximum value, i.e. 1, is added for each (this 'approximation' to the real branch distances is why the approach level is sometimes referred to as the 'approximation level'). As will be seen in the next section, the addition of this value rather than the actual branch distance can inhibit search progress.
3 Nested Search Targets
The dependence of structural targets on one or more nested decision statements can cause problems for search-based generation methods, and even failure in severe cases.
The problem stems from the fact that information valuable for guiding the search is only revealed gradually as each individual branching conditional is encountered. The search is forced to concentrate on each branch predicate one at a time, one after the other. In doing this, the outcome at previous branching conditionals must be maintained, in order to preserve the execution path up to the current branching statement. If this is not done, the current branching statement will never be reached. This restricts the search in its choice of possible inputs, narrowing the potential search space.
In case study 1 (Figure 2a), where the target of the search is node 4, the fact that c needs to be zero at node 3 is not known until a == b is true at node 1. However, in order to evaluate node
Node(s) void case_study_1_original(double a, double b)
{
(1) if (a == b)
{
(2) double c = b + 1;
(3) if (c == 0)
{
(4) // target
}
}
(e) }
(a) Original program
void case_study_1_transformed(double a, double b)
{
double _dist = 0;
_dist += branch_distance(a == b);
double c = b + 1;
_dist += branch_distance(c == 0);
if (_dist == 0.0)
// target
}
(b) Transformed version of program
Figure 2: Case study 1
3 in the desired way, the constraint a == b needs to be maintained. If the values of a and b are not -1, the search has no chance of making node 3 true, unless it backtracks to reselect the values of a and b again. However, if it were to do this, the fact that c needs to be zero at node 3 will be 'forgotten', as node 3 is no longer reached, and its true branch distance is not computed.
This phenomenon is captured in a plot of the objective function landscape (Figure 3a), which uses the output of Equation 1. The shift from satisfying the initial true branch predicate of node 1 to the secondary satisfaction of the true branch predicate of node 3 is characterized by a sudden drop in the landscape down to spikes of local minima. Any move to input values where a is not equal to b jerks the search up out of the minima and back to the area where node 1 is evaluated as false again. When stuck in the local minima, the alternating variable method cannot alter both input variables at once. As the method will not accept an inferior solution which would place it back at node 1, it declares failure. The evolutionary algorithm, meanwhile, has to change both values of a and b in order to traverse the local minima down to the global minimum of (a=-1, b=-1).

Case study 2 (Figure 4a) further demonstrates the problems of nested targets, this time with a
target within three levels of nesting. This can be seen in a plot of the objective function landscape,
[Plot: objective value over the (a, b) input space. (a) Original program; (b) Transformed version.]
Figure 3: Objective function landscape for case study 1
Node(s) void case_study2(double a, double b, double c)
{
(1) double d, e;
(2) if (a == 0)
{
(3) if (b > 1)
(4) d = b + b/2;
else
(5) d = b - b/2;
(6) if (d == 1)
{
(7) e = c + 2;
(8) if (e == 2)
{
(9) // target
}
}
}
(e) }
(a) Original program
void case_study2_transformed(double a, double b, double c)
{
double _dist = 0;
double d, e;
_dist += branch_distance(a == 0);
if (b > 1)
d = b + b/2;
else
d = b - b/2;
_dist += branch_distance(d == 1);
e = c + 2;
_dist += branch_distance(e == 2);
if (_dist == 0.0)
// target
}
(b) Transformed version of program
Figure 4: Case study 2
[Plot: objective value over the (a, b) input space. (a) Original program; (b) Transformed version.]
Figure 5: Objective function for case study 2, plotted where c = 0
shown in Figure 5a. The switch from minimizing the branch distance at node 2 to that of node 6 is again characterized by a sudden drop. Any move from a value of a = 0 has a significant negative impact on the objective value, as the focus of the search is pushed back to satisfying this initial predicate. In this area of the search space, the objective function has no regard for the value of b, which is the only variable which can affect the outcome at node 6. To select inputs in order to take the true branch from node 6, the search is constrained in the a = 0 plane of the search space.
3.1 Related Work
Baresel et al. [3] consider the nested search target problem where no further statements exist between each subsequent if decision statement, as in the example of Figure 1. It is observed that the branch distances of each branching node can simply be measured at the 'top level', i.e. before node 1 is encountered, and added together for computing the objective function. However, if statements do exist between pairs of if statements, this solution is no longer feasible. In case study 1 (Figure 2), for example, the value of c at node 3 is fixed at node 2, which occurs after node 1 is executed. In case study 2 (Figure 4), the value of d at node 6 could be fixed at node 4 or node 5, depending on the input value of b. Furthermore, the value of e is decided at node 7, which is nested in the true branches of nodes 6 and 2. A burning question, therefore, is how often such intermediary statements occur between if pairs in real-world code. Is an extended solution to the nested target problem justified?
3.2 Nesting in real world programs - an empirical study
An empirical study investigated nested if statement pairs for forty real-world programs. A description of each program, and its size measured in lines of code by the tools wc and sloc, can be found in Table 1.
The if statement pairs analyzed, for if statements P and Q, where Q is nested in P, followed the system dependence graph [9] pattern of the following form:
1. Q is control dependent on P
2. P is not transitively control dependent on Q
Control dependency [5] is informally defined as follows: “for a program node I with two exits (e.g. an if statement), program node J is control dependent on I if one exit from I always results in J being executed, while the other exit may not result in J being executed”. Rules 1 and 2, therefore, ensure that Q is nested in P and that P is differentiated from Q.
As outlined in the previous section, the issue of a possible statement sequence A existing between P and Q is an important feature which distinguishes this work from the earlier work of Baresel et al. [3]. Such occurrences were checked using the following rules:
3. A (if it exists) depends on some X which is control dependent on P (i.e. A depends on something nested in P)
4. A (if it exists) is not transitively control dependent on Q
A further, fifth rule checked whether A has a role in determining the outcome at Q, i.e. whether there is some variable assigned to in A that is used in the predicate at Q:
5. Q is transitively data dependent on A, and this dependency is not loop-carried
The condition that the dependency is not loop-carried ensures that if Q is data dependent on A, A does indeed occur in between P and Q, and does not merely appear after both P and Q within the body of a loop.
Table 1: Details of the real world programs
Program          wc LOC    sloc LOC   Description
a2ps              63,600    40,222    Postscript formatter
acct              10,182     6,764    Accounting package
barcode            5,926     3,975    Barcode generator
bc                16,763    11,173    Calculator
byacc              6,626     5,501    Berkeley YACC
cadp              12,930    10,620    Protocol engineering tool-box
compress           1,937     1,431    Data compression utility
copia              1,170     1,112    ESA signal processing code
csurf-pkgs        66,109    38,507    Code surfer slicing tool
ctags             18,663    14,298    Produces tags for ex, more, and vi
diffutils         19,811    12,705    File comparing routines
ed                13,579     9,046    Unix editor
empire            58,539    48,800    War game
EPWIC-1            9,597     5,719    Image compression tool
espresso          22,050    21,780    Logic simplification for CAD (from SPECmark)
findutils         18,558    11,843    File finding utilities
flex2-4-7         15,813    10,654    BSD scanner (version 2.4.7)
flex2-5-4         21,543    15,283    BSD scanner (version 2.5.7)
ftpd              19,470    15,361    File Transfer Protocol daemon
gcc.cpp            6,399     5,731    Gnu C Preprocessor
gnubg-0.0         10,316     6,988    Gnu Backgammon
gnuchess          17,775    14,584    Gnu chess game player
gnugo             81,652    68,301    Gnu go game player
go                29,246    25,665    The game go
ijpeg             30,505    18,585    JPEG compressor (from SPECmark)
indent             6,724     4,834    C formatter
li                 7,597     4,888    Xlisp interpreter
ntpd              47,936    30,773    Daemon for the network time protocol
oracolo2          14,864     8,333    Array processor
prepro            14,814     8,334    ESA array pre-processing code
replace              563       512    Regular expression string replacement
space              9,564     6,200    ESA ADL interpreter
spice            179,623   136,182    Digital circuit simulator
termutils          7,006     4,908    Unix terminal emulation utilities
tile-forth-2.1     4,510     2,986    Forth Environment
time-1.7           6,965     4,185    CPU resource measure
userv-0.95.0       8,009     6,132    Trust management service
wdiff.0.5          6,256     4,112    Diff front end
which              5,407     3,618    Unix utility
wpst              20,499    13,438    CodeSurfer Pointer Analysis
Sum              919,096   664,083
Average           22,977    16,602
Table 2: Nesting in real-world programs
Program          All      Nothing       Unrelated     Related
                          in between    in between    in between
a2ps             528       98           147           283
acct             105       36            30            39
barcode          116        8            40            68
bc               114       28            29            57
byacc            154       25            51            78
cadp             136       65            26            45
compress          22        6             4            12
copia              1        1             0             0
csurf-pkgs       757       97           267           393
ctags            257       85            31           141
diffutils        263       51            83           129
ed               162       38            45            79
empire         2,915      283         1,132         1,500
EPWIC-1          160       35            38            87
espresso         380       74           103           203
findutils        187       38            42           107
flex2-4-7        203       69            75            59
flex2-5-4        261       80           110            71
ftpd             900      174           203           523
gcc.cpp          187       40            35           112
gnubg-0.0        224       42            87            95
gnuchess         498      134           159           205
gnugo          1,578      384           531           663
go             1,568      375           609           584
ijpeg            277      104            64           109
indent           224       56            46           122
li               121       52            30            39
ntpd             973      197           310           466
oracolo2         282       22            65           195
prepro           263       16            65           182
replace            7        3             0             4
space            283       22            65           196
spice          3,010      428           717         1,865
termutils         77       13            16            48
tile-forth-2.1    44       20             6            18
time-1.7          17        6             8             3
userv-0.95.0     243       44            39           160
wdiff.0.5         49       18            12            19
which             33        4            12            17
wpst             360       41           128           191
average        448.5     82.8         136.5         229.2
%                        18.5%         30.4%         51.1%
The results can be seen in Table 2. 'All' is a count of all the if statement pairs analyzed. 'Nothing in between' records all P and Q pairs with no A. 'Unrelated in between' records all P and Q pairs with an A, where A does not have an effect on Q. 'Related in between', on the other hand, counts all pairs whose A has an effect on the predicate at Q.
The results show that the 'nothing in between' case, that is, the form of if pairs handled directly by the technique of Baresel et al., accounts for less than a fifth of all if pairs studied. A further 30% (the 'unrelated in between' cases) could also be handled, since the extra statements do not affect Q. The branch distance calculation could therefore still legitimately take place before P, although data dependency analysis would be required to establish this situation. The remaining 51% (the 'related in between' cases) could not plausibly be handled by the approach of Baresel et al. This is overcome by the application of the testability transformation approach described in the next section.
4 Applying a Testability Transformation
A testability transformation [7] is a source-to-source program transformation that seeks to improve the performance of a test data generation technique. The transformed program produced is merely a 'means to an end', rather than an 'end' in itself, and can be discarded once it has served its purpose as an intermediary for generating the required test data. The transformation process need not preserve the traditional meaning of a program. For example, in order to cover a chosen branch, it is only required that the transformation preserve the set of test-adequate inputs. That is, the transformed program must be guaranteed to execute the desired branch under the same initial conditions. Testability transformations have also been applied to the problem of flags for evolutionary test data generation [2, 6], and to the transformation of unstructured programs for branch coverage [8].
The philosophy behind the testability transformation proposed in this paper is to remove the constraint that the branch distances of nested decision nodes must be minimized to zero one at a time, one after the other. The transformation takes the original program and removes the decision statements on which the target is control dependent. In this way, when the program is executed, it is free to proceed into the originally nested areas of the program, regardless of whether the original branching predicates would have allowed this to happen. In place of each decision is an assignment to a variable dist, which computes the branch distance based on the original predicate. At the end of the program, the value of dist reflects the summation of the individual branch distances. This value may then be used as the objective value for the test data input.
The original version of case study 1 (Figure 2a) can therefore be transformed into the program seen in Figure 2b. The benefit of the transformation can be immediately seen in a plot of the objective function landscape (Figure 3b). The sharp drop into local minima of the original landscape (Figure 3a) is replaced with smooth planes sloping down to the global minimum.
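The branch_distance calls in Figure 2b are written as though they took a predicate; a concrete implementation needs the operands themselves. A sketch, assuming the standard distance |lhs − rhs| for an equality predicate (the helper name is ours, not the paper's):

```c
#include <math.h>

/* Distance for an equality predicate: zero if and only if lhs == rhs. */
double branch_distance_eq(double lhs, double rhs) {
    return fabs(lhs - rhs);
}

/* The transformed case study 1 (Figure 2b), with the helper spelled out.
 * The returned value is zero exactly when the original target is covered. */
double case_study_1_objective(double a, double b) {
    double dist = 0.0;
    dist += branch_distance_eq(a, b);     /* was: if (a == b) */
    double c = b + 1;
    dist += branch_distance_eq(c, 0.0);   /* was: if (c == 0) */
    return dist;
}
```

The global minimum sits at (a=-1, b=-1), where both distances are zero; every other input receives smooth gradient information from both predicates at once.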
Case study 2 (Figure 4) is of a slightly more complicated nature, with the target positioned within three levels of nesting. A further if-else decision exists at level one, before the second conditional en route to the target. Within both branches of this decision, a value is assigned to the variable d, on which the if statement at node 6 depends. The transformed version of the program can be seen in Figure 4b. Again, the benefits of the transformation can be instantly seen in a plot of the objective landscape (Figure 5b). The sharp drop in the original landscape (Figure 5a), corresponding to branching node 2 being evaluated as true and branching node 6 being encountered, is replaced by a smooth landscape sloping from all areas of the search space down into the global minimum.
5 Experimental Study
The two case studies introduced above were put to the test with an evolutionary approach. The Genetic and Evolutionary Algorithm Toolbox (GEATbx) [18] was used to perform the
Table 3: Test data evaluations for case study 1
Run Untransformed Version Transformed Version
1        35,130    11,910
2        31,350    18,390
3        17,580    13,260
4        24,060    10,560
5        27,300    14,070
6        38,100    13,260
7        39,180     9,750
8        27,300    13,800
9        30,540    12,720
10       32,700    16,500
Average  30,324    13,422
Table 4: Test data evaluations for case study 2
Run Untransformed Version Transformed Version
1        54,030    19,470
2        54,030    20,550
3        54,030    16,770
4        54,030    17,850
5        54,030    18,390
6        54,030    19,200
7        54,030    19,740
8        54,030    19,470
9        54,030    15,150
10       54,030    16,500
Average  54,030    18,309
evolutionary searches, which were conducted as follows. 300 individuals were used per generation, split into 6 subpopulations starting with 50 individuals each. Linear ranking is utilized, with a selection pressure of 1.7. The input vectors are operated on by the evolutionary algorithm 'as is', i.e. as a vector of double values. Individuals are recombined using discrete recombination, and mutated using real-valued mutation. Real-valued mutation is performed using 'number creep': the alteration of variable values through the addition of small amounts. Competition and migration is employed across subpopulations. Each evolutionary search was terminated after 200 generations if test data was not found. Each experiment with each program version was repeated ten times.
The domain of each double variable was -1000 to 1000 with a precision of 0.001, giving search space sizes of 10^11 for case study 1 and 10^17 for case study 2.
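Real-valued 'number creep' mutation can be sketched as follows; the mutation rate, the step bound, and the use of the C library random generator are illustrative assumptions, not GEATbx's actual implementation.

```c
#include <stdlib.h>

/* "Number creep" mutation: with probability `rate`, nudge each gene by a
 * small random amount drawn uniformly from [-step, +step]. */
void creep_mutate(double *genes, int n, double rate, double step) {
    for (int i = 0; i < n; i++) {
        if ((double)rand() / RAND_MAX < rate) {
            double r = 2.0 * ((double)rand() / RAND_MAX) - 1.0;  /* [-1, 1] */
            genes[i] += r * step;
        }
    }
}
```

With the 0.001 precision quoted above, a step of that order allows the search to fine-tune candidate inputs without destroying good genetic material.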
For case study 1, the evolutionary algorithm generally performed fewer than half the number of objective function evaluations (i.e. test data evaluations) for the transformed version of the program, compared to the untransformed version (Table 3). The average best objective value plot in Figure 6 shows search progress for the untransformed version of case study 1, with sudden improvements in objective value as the search navigates from local minimum to local minimum. Search progress for the transformed version, as expected, is more consistent and gradual.
The evolutionary algorithm encountered severe difficulties with the untransformed program for case study 2. Due to the existence of three levels of nesting, the search fails on each occasion. Exactly the same number of test data evaluations is performed on each of the ten repetitions of the experiment (Table 4), terminating in the 200th generation. The search has much more success
[Plot: average best objective value per generation, original vs. transformed version]
Figure 6: Average best objective value plot for case study 1
with the transformed version of the program, finding test data as early as the 55th generation in one of the ten repetitions.
6 Future Work
The transformation algorithm proposed in this paper does not accommodate decision statements that are looping constructs, such as 'while' or 'for', or if decision statements that are themselves nested within an outer loop. This is because the branch distance value for a conditional could potentially be added more than once. An advanced version of the algorithm might allow for loops by recording the minimum value of the branch distance encountered for the conditional, and adding this to the end value of the dist variable. It makes no difference to the transformation algorithm, of course, if intermediate blocks of statements occurring between nested if pairs feature self-contained loops.
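The suggested loop extension might look like the following sketch for a single conditional inside a loop. The loop and its predicate (a == i) are hypothetical, purely to illustrate recording the minimum distance and adding it once, after the loop.

```c
#include <float.h>
#include <math.h>

/* Hypothetical transformed loop: rather than adding the distance for
 * "if (a == i)" on every iteration, track the minimum distance observed
 * and fold that single value into the accumulated distance afterwards. */
double loop_min_distance(double a, int iterations) {
    double min_bd = DBL_MAX;
    for (int i = 0; i < iterations; i++) {
        double bd = fabs(a - (double)i);     /* distance for a == i */
        if (bd < min_bd) min_bd = bd;
    }
    /* If the conditional was never reached, add the maximum penalty of 1. */
    return (min_bd == DBL_MAX) ? 1.0 : min_bd;
}
```

The single returned value would then be added to the dist variable after the loop, avoiding double-counting across iterations.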
The transformation algorithm also has issues with certain types of predicates, which need to be detected unless run-time errors are allowed to occur. One example is a predicate which guards a dynamic memory reference. The following example may lead to a program error if it is transformed, due to the possibility of the array index i being less than zero or greater than the length of the array, causing an array out of bounds error:
if (i >= 0 && i < length_of_a)
{
printf("%f\n", a[i]);
}
Another issue is the possibility of introducing division by zero errors, for example in the following segment of code if the conditional were to be removed:
if (d != 0)
{
r = n / d;
}
[Plot: average best objective value per generation, original vs. transformed version]
Figure 7: Average best objective value plot for case study 2
Currently, the transformation algorithm works on a per-target basis: a separate transformation needs to be performed for each search target. An advanced version of the algorithm could modify the predicates of the program to contain function calls. The function call would record the branch distance, and then decide, on the basis of the nesting of the current target, on a boolean value to return, and ultimately on whether execution should be allowed to proceed down a specific branch. For example, in the following, tt_nesting_check records the branch distance of a == b at node A and lets execution flow down through its true branch regardless of whether a actually does equal b. However, tt_nesting_check remains true to the original predicate b == c at node B, since the current target is not nested within it, but allows execution through the true branch at node C regardless of whether c == d.
Node(A) if (tt_nesting_check(a == b))
{
(B) if (tt_nesting_check(b == c))
{
...
}
(C) if (tt_nesting_check(c == d))
{
// current target nested in here
}
}
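One possible shape for tt_nesting_check, sketched as an ordinary function: the fragment above passes it a bare predicate, but an implementation would also need the branch distance, so both the signature and the global accumulator here are our assumptions.

```c
static double _dist = 0.0;   /* accumulated branch distance */

/* Record this decision's branch distance; force the true branch when the
 * current target is nested inside it, otherwise respect the original
 * predicate outcome. */
int tt_nesting_check(int pred, double distance, int target_nested_here) {
    _dist += pred ? 0.0 : distance;   /* distance is zero when pred holds */
    return target_nested_here ? 1 : pred;
}
```

Node A above would then be instrumented as `if (tt_nesting_check(a == b, fabs(a - b), 1))`: the distance for a == b is recorded, and execution always proceeds into the nested region.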
7 Conclusions
This paper has described how targets nested within more than one conditional statement can cause problems for search-based approaches to structural test data generation. In the presence of nesting, the search is forced to concentrate on satisfying one branch predicate at a time, one after the other. This slows search progress and restricts the potential search space available for the satisfaction of branching predicates 'later' in the sequence of nested conditionals.
A testability transformation approach was presented to the problem. A testability transfor-mation is a source-to-source program transformation that seeks to improve the performance ofa test data generation technique. The transformed program produced is merely a ‘means to anend’, rather than an ‘end’ in itself, and can be discarded once it has served its purpose as anintermediary for generating the required test data.
The main idea behind the testability transformation proposed in this paper is to remove the constraint that the branch distances of nested decision nodes must be evaluated one after the other. The transformation takes the original program and removes the decision statements on which the target is control dependent. In this way, when the program is executed, it is free to proceed into the originally nested areas of the program, calculating all branch distance values in order to compute objective values which are in full possession of the facts about the input data.
The approach was put to the test with two case studies. The case studies are small examples, and by no means represent a worst-case scenario, yet they serve to demonstrate the power of the approach. The transformed version of case study 1 allowed the evolutionary search to find test data in half the number of test data evaluations required for the original version of the program. Whilst test data could not be found for the search target in the original version of case study 2, the evolutionary algorithm succeeded every time with the transformed version.
The transformation approach deals with assignments to variables in between nested conditionals which may affect the outcome at 'later' conditionals en route to the current structural target. The empirical study of if pairs in forty real-world programs, where one of the if statements of the pair is nested within the other, showed that this situation occurs just over 50% of the time. These cases cannot be dealt with by the earlier work of Baresel et al. [3], which investigated the nesting problem.
References
[1] A. Baresel. Automatisierung von Strukturtests mit evolutionären Algorithmen. Diploma Thesis, Humboldt University, Berlin, Germany, July 2000.

[2] A. Baresel, D. Binkley, M. Harman, and B. Korel. Evolutionary testing in the presence of loop-assigned flags: A testability transformation approach. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA 2004), pages 43–52, Boston, Massachusetts, USA, 2004. ACM.

[3] A. Baresel, H. Sthamer, and M. Schmidt. Fitness function design to improve evolutionary structural testing. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2002), pages 1329–1336, New York, USA, 2002. Morgan Kaufmann.

[4] R. Ferguson and B. Korel. The chaining approach for software test data generation. ACM Transactions on Software Engineering and Methodology, 5(1):63–86, 1996.

[5] J. Ferrante, K. Ottenstein, and J. D. Warren. The program dependence graph and its use in optimization. ACM Transactions on Programming Languages and Systems, 9(3):319–349, 1987.

[6] M. Harman, L. Hu, R. Hierons, A. Baresel, and H. Sthamer. Improving evolutionary testing by flag removal. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2002), pages 1359–1366, New York, USA, 2002. Morgan Kaufmann.

[7] M. Harman, L. Hu, R. Hierons, J. Wegener, H. Sthamer, A. Baresel, and M. Roper. Testability transformation. IEEE Transactions on Software Engineering, 30(1):3–16, 2004.

[8] R. Hierons, M. Harman, and C. Fox. Branch-coverage testability transformation for unstructured programs. The Computer Journal, to appear, 2005.
[9] S. Horwitz, T. Reps, and D. Binkley. Interprocedural slicing using dependence graphs. ACM Transactions on Programming Languages and Systems, 12:26–60, 1990.

[10] B. Jones, H. Sthamer, and D. Eyres. Automatic structural testing using genetic algorithms. Software Engineering Journal, 11(5):299–306, 1996.

[11] B. Jones, H. Sthamer, X. Yang, and D. Eyres. The automatic generation of software test data sets using adaptive search techniques. In Proceedings of the 3rd International Conference on Software Quality Management, pages 435–444, Seville, Spain, 1995.

[12] B. Korel. Automated software test data generation. IEEE Transactions on Software Engineering, 16(8):870–879, 1990.

[13] B. Korel. Dynamic method for software test data generation. Software Testing, Verification and Reliability, 2(4):203–213, 1992.

[14] B. Korel and A. M. Al-Yami. Assertion-oriented automated test data generation. In Proceedings of the 18th International Conference on Software Engineering (ICSE), pages 71–80, 1996.

[15] P. McMinn. Search-based software test data generation: A survey. Software Testing, Verification and Reliability, 14(2):105–156, 2004.

[16] P. McMinn and M. Holcombe. Hybridizing evolutionary testing with the chaining approach. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2004), Lecture Notes in Computer Science vol. 3103, pages 1363–1374, Seattle, USA, 2004. Springer-Verlag.

[17] R. Pargas, M. Harrold, and R. Peck. Test-data generation using genetic algorithms. Software Testing, Verification and Reliability, 9(4):263–282, 1999.

[18] H. Pohlheim. GEATbx - Genetic and Evolutionary Algorithm Toolbox, http://www.geatbx.com.

[19] P. Puschner and R. Nossal. Testing the results of static worst-case execution-time analysis. In Proceedings of the 19th IEEE Real-Time Systems Symposium, pages 134–143, Madrid, Spain, 1998. IEEE Computer Society Press.

[20] N. Tracey. A Search-Based Automated Test-Data Generation Framework for Safety Critical Software. PhD thesis, University of York, 2000.

[21] N. Tracey, J. Clark, and K. Mander. Automated program flaw finding using simulated annealing. In Software Engineering Notes, Issue 23, No. 2, Proceedings of the International Symposium on Software Testing and Analysis (ISSTA 1998), pages 73–81, 1998.

[22] N. Tracey, J. Clark, and K. Mander. The way forward for unifying dynamic test-case generation: The optimisation-based approach. In International Workshop on Dependable Computing and Its Applications, pages 169–180. Dept of Computer Science, University of Witwatersrand, Johannesburg, South Africa, 1998.

[23] N. Tracey, J. Clark, K. Mander, and J. McDermid. An automated framework for structural test-data generation. In Proceedings of the International Conference on Automated Software Engineering, pages 285–288, Hawaii, USA, 1998. IEEE Computer Society Press.

[24] N. Tracey, J. Clark, K. Mander, and J. McDermid. Automated test data generation for exception conditions. Software - Practice and Experience, 30(1):61–79, 2000.

[25] J. Wegener, A. Baresel, and H. Sthamer. Evolutionary test environment for automatic structural testing. Information and Software Technology, 43(14):841–854, 2001.
[26] J. Wegener, K. Grimm, M. Grochtmann, H. Sthamer, and B. Jones. Systematic testing of real-time systems. In Proceedings of the 4th European Conference on Software Testing, Analysis and Review (EuroSTAR 1996), Amsterdam, Netherlands, 1996.

[27] J. Wegener and M. Grochtmann. Verifying timing constraints of real-time systems by means of evolutionary testing. Real-Time Systems, 15(3):275–298, 1998.

[28] D. Whitley. An overview of evolutionary algorithms: Practical issues and common pitfalls. Information and Software Technology, 43(14):817–831, 2001.

[29] S. Xanthakis, C. Ellis, C. Skourlas, A. Le Gall, S. Katsikas, and K. Karapoulios. Application of genetic algorithms to software testing (Application des algorithmes génétiques au test des logiciels). In 5th International Conference on Software Engineering and its Applications, pages 625–636, Toulouse, France, 1992.
4. Tools and Experience
Model-Driven Engineering Testing Environments
Paul Baker, Paul Bristow, Clive Jervis, David King, Rob Thomson
{Paul.Baker, Paul.C.Bristow, Clive.Jervis, David.King, Rob.Thomson}@motorola.com
Motorola Labs
Abstract
Model-driven engineering has been prevalent in the Telecommunications industry for a number of years. Formal graphical notations such as MSC and SDL have been widely used for requirements and design respectively. These notations are ideal for expressing the message-passing behaviour of concurrent processes, i.e. exactly the Telecoms problem domain. With the advent of UML 2.0, which incorporates close relatives of these notations, the model-driven approach is starting to become more widely used both inside and outside of Telecoms, for example in the automotive industry.
With any relatively new technique, the demand for good tools exceeds the supply. Our work on testing in a model-driven engineering environment began because the tools were not available. In this paper we cover the techniques, notations, and experiences we have gained in developing testing environments for MSC/SDL and UML 2.0. Tests in these environments are derived from requirements models and run against the design models, so this is a strong form of functional conformance testing.
1 Introduction
Motorola, like many other high-tech companies, has a keen interest in improving software quality whilst reducing cost. Model-Driven Engineering (MDE) offers some hope of achieving this aim. In this approach high-level graphical notations are used: for example, MSC [11] is used for requirements specification, and SDL [10] is used for design. More recently, UML 2.0 [14], which incorporates close relatives of these notations, has started to be piloted, and is likely to supersede MSC/SDL for MDE in the long run. Using higher levels of abstraction allows for faster development and less maintenance. Having requirements in a formal notation such as MSC is a huge improvement over natural language requirements, since MSC has a clear and unambiguous semantics. Tool support also allows interpretation of the formal MSC models, whereas this is a non-trivial problem with requirements in a natural language. The graphical notations can also aid understanding by visualisation compared to a textual notation. Testing is often quoted to take 40-60% of the software lifecycle [4], and testing in a model-driven environment is no exception. However, there is more potential for automation in a model-driven environment compared with traditional approaches, since we have formal models for requirements. This enables us to generate conformance tests automatically. The generated tests will check that the signal ordering and data of the application conform exactly to the requirements. If the requirements are complete, then we can potentially generate tests that cover every scenario; we are limited only by the sheer volume of tests and their execution time.
UK Software Testing Research III
2 MDE lifecycle
In Figure 1 we show the V-model development lifecycle used with the MDE approach. There is no fundamental difference from the classical V-model; the main difference lies in the notations and tools used.
2.1 Requirements
Starting at the top-left of Figure 1, requirements are created, mainly in the form of textual statements. Requirements management at this stage is an important discipline, since defects are often traced back to ambiguous, missing, or poor requirements. An often quoted figure is that 50% of defects are traced back to errors in the requirements [13]. Tool support can be useful here: for example, many groups in Motorola use Telelogic DOORS [16] for requirements management, which allows for progress tracking, revision control, managing documentation, and providing linkage between requirements and relevant material.
[Figure 1 shows the V-model stages: Requirements; Requirements Specification; Code Design; Application Code (via Code Generation); with Test Generation feeding an Execution Environment for Simulation or Functional Testing, Component Testing, and Integration Testing.]
Figure 1: V-model development lifecycle.
Since requirements at this stage are in a textual form, problems are not easily detected, and effort is spent conducting formal reviews [6]. Problems include missing or incomplete requirements, incorrect requirements, ambiguous requirements, and requirement interaction. There is certainly scope to improve analysis at this level, for example, natural language interpretation of requirements to pick up contradictions. Such analysis techniques can potentially have a large impact on software productivity and quality.
2.2 Requirements specification: MSC/UML 2.0 SDs
After the top-level requirements we have the formal functional requirements specification. With MDE for telecoms, MSC [11] or UML 2.0 Sequence Diagrams [14] are ideal formal notations for expressing message-passing behaviour. The message data and additional data used here depend on the application domain and on data language tool support. The ASN.1 language [9] is one of the most expressive data languages, but isn't always used because of lack of tool support, or for legacy reasons.
UKTest 2005
The MSC-2000 language [11] is standardised by the ITU-T and describes the message-passing behaviour between independent concurrent processes at a high level. Figure 2 illustrates a basic MSC that describes the interactions for a simple Bank statement request. This has three instances: User, ATM, and Bank. Time progresses downwards on each instance line, but instances are asynchronous with each other, and should be thought of as three separate concurrent processes. Messages transmit data between instances, and consist of a send/receive event pair. Messages have arbitrary latency so, for example, although message Acknowledgment is sent before RequestStatement, it could be received after RequestStatement.
[Figure 2 shows the basic MSC "Statement" with instances User, ATM, and Bank: RequestStatement(cardNum) from User to ATM; Acknowledgment(userMessage) from ATM to User; RequestStatement(cardNum) from ATM to Bank; and Letter(statement) from Bank to User.]
Figure 2: Example MSC for ordering bank statements.
The MSC language contains many more constructs than we have shown here; for example, inline expressions allow fragments to be composed in parallel, or with loops, alternatives, or optional sections. There are also constructs for references, guards, general ordering, and more. Apart from the basic MSC language shown here, there is a High-level MSC language that allows the composition of basic MSCs. In addition there is an MSC document language which allows for the declaration of instances, variables, and other data aspects. There is a standard textual format for MSC, and a well-defined trace semantics. See the language standard for a complete description of MSC [11].
UML 2.0 Sequence Diagrams [14] are based on MSCs, rather than on Sequence Diagrams from UML 1.x. The notation and semantics of UML 2.0 SDs are almost identical to MSC. The main differences are in the names of constructs, the support for classes, and that UML 2.0 SDs have more inline operators. Among the naming differences, MSC instances are called lifelines, and MSC inline expressions are called combined fragments. There are more inline expressions, such as a strict sequencing operator that orders events based on their visual order, i.e. the higher an event appears on the page, the earlier it will occur, irrespective of which instance it's on. UML 2.0 Interaction Overview Diagrams are similar to High-level MSCs (HMSCs), and UML 2.0 Classes and Collaboration Diagrams can be used for MSC documents. See [8] for a more detailed comparison of UML 2.0 Interactions with MSC. For more information on UML 2.0 SDs see the UML 2.0 standard [14].
2.3 Code design: SDL/UML 2.0 Statecharts
SDL allows lower-level details of the processes to be defined compared with MSC. SDL is a state machine notation, and tools such as Telelogic Tau [17] can be used for code generation or simulation. There are constructs in SDL for sending, receiving, task boxes, states, and more. UML 2.0 Statecharts [14] are heavily based on SDL, and the same notations and semantics are used.
Figure 3: UML 2.0 Statechart for Bank.
Figure 3 is a simple UML 2.0 Statechart for the Bank instance of the MSC in Figure 2. In order we have a state Idle; a receive signal statement RequestStatement(cardNum); a decision box CheckCard(cardNum); and a send signal Letter(statement). Note that this implementation of the Bank instance doesn't match the specification MSC, since the Letter signal is not always sent, whereas the specification says that it is always sent. This is something that is detected by testing; either the original specification should be changed, by adding an alt inline construct, or the SDL should be changed, by removing the decision symbol. For the purposes of code generation or simulation, additional information is required, such as the data that is used for signals.
2.4 TTCN-3
TTCN-3 [5, 19] is a relatively new test language, although it has evolved from earlier incarnations that started in the mid-1980s. TTCN-3 is standardised by ETSI, and is starting to become the de facto language for tests. Other standards have started to express their test suites in TTCN-3, for example, the SIP and IPv6 ETSI test specifications. TTCN-3 is ideal for testing message-passing systems, such as Telecoms applications. More recently TTCN-3 has been used in other domains such as automotive, railways, and web content testing; see the TTCN-3 2005 User Conference for presentations on these. There is strong support in TTCN-3 for types, signalling, verdicts, timers, test cases, test configurations, logging, and the usual programming language constructs. One of the main strengths of TTCN-3 is that it has a runtime interface, TRI (TTCN-3 Runtime Interface), and a control interface, TCI (TTCN-3 Control Interface). These interfaces allow the same test cases to be run on different platforms or with different encoding/decoding rules by changing the adapter code, rather than the test scripts themselves.
Figure 4 is a TTCN-3 testcase for the ATM in the MSC in Figure 2. Signals contain a number of components; for example:
User.send(RequestStatement: cardNum);
will send the value cardNum with type RequestStatement over port User. These components need to be explicitly defined in TTCN-3 for a complete test. Also note that in our testcase we have a choice operator alt, which is used here to check for our expected signal, or to check for spurious signals or timeouts. Timers are used on receive signals, so that if our expected signal doesn't arrive within a fixed period of time then the test returns a fail verdict.
testcase Statement_test001() runs on MTCType
{
    log("Test case: Statement_test001");
    User.send(RequestStatement: cardNum);
    toUser.receive(Acknowledgement: userMessage);
    MaxTimer.start(5.0);
    alt {
        [] toBank.receive(RequestStatement: cardNum) {
            MaxTimer.stop;
            setverdict(pass)
        }
        [] any port.receive {
            MaxTimer.stop;
            setverdict(fail)
        }
        [] MaxTimer.timeout {
            setverdict(fail)
        }
    }
}
Figure 4: A TTCN-3 testcase for the ATM as described by the MSC in Figure 2.
Another strength of TTCN-3 is its support for types, templates, and regular expressions. For example, CardNum can be defined using a template, which would have specific values for when signals are sent, i.e. just like a record; but the template used when signals are received can contain regular expressions in some or all fields of the template. Below is an example template, which has specific expected values for accountNumber and pin, but the regular expression ? for securityCode. The ? regular expression means that the field may contain any value.
template CardNum {
    accountNumber := e_accountNumber,
    securityCode := ?,
    pin := e_pin
}
3 Test generation
Over the last eight years or so, we have developed the test generator ptk [2], which is used throughout Motorola. Originally, ptk took MSCs and generated SDL test scripts. Today ptk supports MSC and UML 2.0 Sequence Diagrams, and can generate test scripts in a number of test languages including TTCN-3. ptk takes three main inputs:
• An MSC or UML 2.0 Sequence Diagram – these are used as requirement specifications. ptk supports almost all the constructs in these languages, including HMSC and Interaction Overview Diagrams. The input representation supported is the standard textual format for MSC (called MSC-PR). For UML 2.0 Sequence Diagrams we translate to MSC-PR first, which is almost a one-to-one mapping. These diagrams can be drawn with a number of editors; for example, ptk supports Telelogic Tau [17] and ESG for drawing MSCs, and Telelogic TauG2 [17] for UML 2.0 Sequence Diagrams.
• A directives file – this file has a similar purpose to an MSC document [11] in declaring data aspects, and also contains information such as which instances represent the System Under Test (SUT), and test configuration information, such as the communication ports used and how they are connected to each other. The format for the directives file is non-standard, since there isn't a standard that contains all this information. The UML 2.0 Testing Profile [15] may provide a standard for this information in the future.
• A data specification file – this includes the types and data that will be used by the testcases. The language used for the data file is the same as the language we are generating tests in, so for TTCN-3 test generation, the data file is in TTCN-3.
ptk generates conformance tests from a master MSC, where a subset of the instances describe the behaviour of the SUT. ptk generates one or more test scripts from a single master MSC, basic or high-level, which may reference other MSCs. Specifications of message formats can also be processed along with the MSCs. Depending on which options are invoked, ptk will produce a combination of the following:
• Semantic Analysis – ptk can produce a representation of MSC event semantics in several forms including traces, partial order graphs, or trace trees. The output can be in either textual or graphical format for viewing by graph visualisation tools such as daVinci, or DOT from AT&T Labs.
• Test Analysis – the output is similar in form to the semantic analysis, but restricted just to the events that will appear in the generated tests. The analysis represents the stage before the model is split into individual test scripts.
• Test Scripts – test scripts can also be produced in a variety of formats. Abstract tree representations can be produced in textual or graphical formats; there is a dedicated SDL code generator and TTCN-3 code generator; and lastly a generator that produces code templates. The templates are processed using a C preprocessor and a macro definition library to produce the final concrete test scripts. We have developed a library for producing TTCN-2 scripts, but others could, in principle, be defined for other test languages.
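The graphical outputs mentioned above amount to emitting the event graph in a viewer's input format. As a rough illustration (not ptk's actual output, and with hypothetical event names), a partial-order graph can be written in DOT form like this:

```python
# Sketch only: emit a partial-order event graph (causal edges between MSC
# events) in DOT format, suitable for a viewer such as Graphviz's dot.
def to_dot(edges):
    lines = ["digraph events {"]
    for earlier, later in edges:
        lines.append(f'    "{earlier}" -> "{later}";')
    lines.append("}")
    return "\n".join(lines)

# Hypothetical causal edges for Figure 2 with the ATM as SUT: the initial
# send precedes both receives, which are themselves unordered.
edges = [("send cardNum", "recv Acknowledgement"),
         ("send cardNum", "recv RequestStatement(Bank)")]
print(to_dot(edges))
```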
3.1 Local versus remote test strategies
The default method of ptk is remote testing, in which the test system is connected to the SUT using the actual channels used in the full system implementation. Local testing, the alternative, assumes that the test system is connected directly to the SUT, so that latency effects in the communication channels do not arise.
For remote testing, the test events occur only in the orders that they occur in the MSC, so that if a receive event x always occurs before a receive event y in the MSC, then the same will be true in the generated tests, regardless of the possible orders in which these messages can be sent by the SUT according to the MSC. This method corresponds to a test system that sits at the ends of channels connected to the SUT that behave according to the MSC.
The other view is that the test system is connected directly to the SUT, where latency in a message's transmission between the test system and the SUT cannot occur. In this case the message events in the tests are the inversion of the events as ordered by the SUT instances. That is, the order of events is taken from the SUT ends of the messages, but a send event is replaced by a receive event and vice versa.
For the MSC of Figure 2, where the ATM instance is the SUT, for remote testing we will include three signals in the test scripts: RequestStatement from User, Acknowledgement, and RequestStatement to Bank. The latter two messages may occur in either order in the test scripts. Note that the message Letter does not appear in the test scripts, since it is not involved in interactions with the ATM. For local testing we will have the same signals in the test scripts, but the Acknowledgement signal will occur before the RequestStatement to Bank signal, since this is the order given on the SUT.
Figure 4 gives a simplified version of a local testcase for the MSC in Figure 2, where the ATM instance is the SUT. It's simplified in that the timeout checks for the first receive message have been removed, as well as the logging statements that are generated in practice. The additional data descriptions, ports, and configurations are also not included, but would usually accompany the testcases.
Remote test (left):
    User.send(cardNum)
    then either:
        User.recv(UserMessage)
        Bank.recv(cardNum)
        PASS
    or:
        Bank.recv(cardNum)
        User.recv(UserMessage)
        PASS

Local test (right):
    User.send(cardNum)
    User.recv(UserMessage)
    Bank.recv(cardNum)
    PASS
Figure 5: A remote test (on the left), and a local test (on the right) for the ATM in Figure 2.
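The local ordering can be derived mechanically from the SUT instance: take its events in order and swap the direction of each one, since the tester now sits directly at the SUT's interface. A small Python sketch of this inversion (our own illustration; the event names are hypothetical, not ptk's):

```python
# Sketch: derive a local test from the SUT instance's own event order by
# swapping send and receive, as described in Section 3.1.
def invert_for_local_test(sut_events):
    """sut_events: (direction, signal) pairs in the order the SUT performs them."""
    flip = {"send": "recv", "recv": "send"}
    return [(flip[direction], signal) for direction, signal in sut_events]

# The ATM of Figure 2 receives the request, then sends the acknowledgement,
# then sends the statement request to the Bank:
atm_events = [("recv", "RequestStatement from User"),
              ("send", "Acknowledgement to User"),
              ("send", "RequestStatement to Bank")]

for event in invert_for_local_test(atm_events):
    print(event)
```

The tester thus sends first and then receives the two signals in the SUT's own order, matching the local test in Figure 5.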
3.2 Test generation strategies
The algorithms used for test generation are described in [2], so we only summarise them briefly here. Internally ptk uses a partial-order graph representation of the MSC, where events (e.g. send or receive) are graph vertices, and causal orders are graph edges. The test generation algorithms process this graph representation. The simplest algorithm generates one trace of events per test script. Since many test languages support non-determinism of receive events (for example, TTCN-3 does this with the alt statement), traces can be combined into trees, where there is a choice of receive events at each node. The default algorithm of ptk generates tests in this form, which typically results in fewer testcases than the one-test-per-trace algorithm.
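As a rough sketch of the two strategies (our own minimal Python, not ptk's implementation), one can enumerate every trace consistent with the partial order, then merge the traces into a prefix tree so that alternative receive orders become branches of a single testcase:

```python
# Sketch: events are vertices, and causal orders are (earlier, later) edges.
def linearisations(events, order):
    """Yield every total order of `events` consistent with the partial order."""
    if not events:
        yield []
        return
    # An event is enabled if none of its predecessors is still pending.
    for e in sorted(events):
        if not any(b == e and a in events for a, b in order):
            for rest in linearisations(events - {e}, order):
                yield [e] + rest

def merge_into_tree(traces):
    """Combine traces into a prefix tree: a choice of events at each node."""
    tree = {}
    for trace in traces:
        node = tree
        for event in trace:
            node = node.setdefault(event, {})
    return tree

# Figure 2 with the ATM as SUT: the send precedes both receives, which are
# mutually unordered, so there are two traces but one combined test tree.
events = {"send cardNum", "recv ack", "recv bankReq"}
order = {("send cardNum", "recv ack"), ("send cardNum", "recv bankReq")}

traces = list(linearisations(events, order))
print(len(traces))   # 2: one test per trace with the simplest algorithm
tree = merge_into_tree(traces)
print(list(tree))    # ['send cardNum']: a single combined testcase
```

The tree has one root send followed by a choice between the two receive orders, which is exactly the shape of the remote test in Figure 5.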
4 MSC/SDL testing environment
The testing of SDL design models has typically been achieved using the following methods:
1. Using SDL simulation scripts written as MSC traces, or
2. Using co-simulation, where a simulation of the SDL model is tested using a test suite developed using TTCN-2 [12]
The following sections describe these in further detail.
4.1 SDL simulation scripts written using MSC
In this process a simulation of an SDL design model is automatically produced using the commercial Telelogic Tau tool [17]. Trace MSCs, annotated with explicit data values, are then drawn depicting valid requirement scenarios. These MSCs are then automatically converted into simulation scripts, which provide the stimulus to the simulated SDL model, as well as checking that observations received from the SDL model are correct.
Because trace MSCs typically represent a single possible path through an SDL model, many are needed to achieve a reasonable level of test coverage. Also, with such a large number of MSCs, maintenance problems are encountered due to the tight coupling of data with behaviour specification. For example, when an SDL signal is sent between instances on a trace MSC, the value of the signal is explicitly defined for each instance of that signal. This means that when the signal type is modified, each instance definition must also be modified, resulting in a very large maintenance burden for engineering teams. This also means that value definitions cannot be reused between specification, design, and testing. By providing mechanisms for decoupling data from behaviour specification, we have seen very positive results. In this particular case we parameterised MSC with TTCN-3 data [5] to provide both a decoupling of the value definition from signal instances drawn on an MSC, as well as a more abstract notation for value definition.
In addition, we introduced the ptk test generation tool to aid test coverage and the targeting of tests for different types of testing. In doing so, we found a number of advantages:
• Users could develop more abstract and intuitive HMSC/MSC models. In doing so, they could introduce factorisation through the use of MSC parameterisation and static data expressions. Typically, one abstract MSC would be equivalent to developing 3-4 trace MSCs.
• ptk could be used to generate tests for both simulator testing and target testing. For example, if SDL simulator tests were generated by ptk, the semantics of the SDL simulator (i.e. a subset of MSC semantics) would be taken into account; whereas, if tests were generated for target testing, a different semantic model could be applied (i.e. the full MSC semantics), as well as the generation of adaptive test cases.
As a consequence of our experiences with this type of model testing we are now promoting the use of instance specification as a key strategy for data reuse and reduced model maintenance when using UML 2.0. Here, an instance is a run-time entity with an identity that is distinguishable from other run-time entities. Hence, instance modelling refers to the creation of "signal" instances as objects that can be referenced and defined in an independent manner.
4.2 SDL/TTCN-2 co-simulation
In this process a simulation of an SDL design model is again produced using the commercial Telelogic Tau tool [17]. However, this time tests are manually developed from the requirements using TTCN-2 [12]. An executable test suite is then generated by compiling the TTCN-2 tests with adapter code, allowing the executable test suite to communicate with the simulated SDL model.
This approach has some distinct advantages over the previous method:
• Tests can be written in an adaptive (non-deterministic) manner.
• TTCN-2 has a well-defined operational semantics, including concepts such as default handling and concurrency that are not apparent when using SDL simulator scripts.
• The TTCN-2 tests can be reused to test an SDL simulation and the target without modification, by adapting the way in which the executable tests communicate with the System Under Test.
To enhance this approach we parameterised MSC with the TTCN-2 data language, supported by the ptk test generation tool.
5 UML 2.0 testing environment
In our UML 2.0 testing environment we have again used Telelogic's tools, this time TauG2 [17]. The TauG2 tools support UML 2.0 with editors, code generators, and simulators. TauG2 is relatively new, though, so is less advanced than Tau in some respects, such as the simulator, which is called the model verifier in TauG2.
For testing UML 2.0 SDs against UML 2.0 Statecharts in TauG2, we have had to provide a number of additional tools. The whole testing environment is illustrated in Figure 6. We have provided tools for: converting SDs to MSC; generating TTCN-3 tests from SDs using ptk; converting UML 2.0 data to TTCN-3 data using the in-house tool UMB; encoders and decoders for the signal data; and the interface code between TTCN-3 and the UML 2.0 Statecharts. The interface code was written using TCP/IP sockets.
When executing a TTCN-3 test against the model verifier we are able to monitor signals on both sides with trace logs, which are in the form of Sequence Diagrams. This can be useful for tracing defects.
Figure 6: UML 2.0 testing environment.
After testing a simulation, the next step is to test the target code. The same TTCN-3 tests can be used here, simply by changing the TTCN-3 TRI/TCI interfaces for the target platform.
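The effect of swapping adapters can be pictured with a small sketch (Python, with hypothetical class names; the real interfaces are the standardised TRI/TCI, not these): the abstract test only talks to an adapter, so retargeting means supplying a different adapter rather than rewriting the test.

```python
# Sketch of the adapter principle: the test script is platform-neutral, and
# only the adapter that carries signals to the SUT changes.
class ModelVerifierAdapter:
    """Delivers signals to the simulated model (hypothetical)."""
    def send(self, signal):
        return f"model-verifier <- {signal}"

class TargetAdapter:
    """Delivers signals to target code, e.g. over a TCP/IP socket (hypothetical)."""
    def send(self, signal):
        return f"target <- {signal}"

def run_test(adapter):
    # The same abstract test runs unchanged against either platform.
    return adapter.send("RequestStatement(cardNum)")

print(run_test(ModelVerifierAdapter()))  # model-verifier <- RequestStatement(cardNum)
print(run_test(TargetAdapter()))         # target <- RequestStatement(cardNum)
```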
6 Related work
Our focus has been on MSC/SDL and UML 2.0 models, which are extensively used in the Telecoms domain, and are now starting to be used in other domains. The reason we have provided tools such as ptk for test generation, and plug-ins for existing tools like TauG2, is that there have not been off-the-shelf tools that we could use. AutoLink [7] is available in Tau [17], but doesn't meet our needs, since it supports only a limited subset of MSC, it is semi-automatic, and it generates one test per trace in TTCN-2 only. TTCN-3 is relatively new, and some tools are starting to appear which support its graphical format, such as the Testing Technologies tool suite [18]. Recently, the OMG have been developing a UML Testing Profile standard [15], which will hopefully provide a modelling language for testing that tool vendors will support.
7 Conclusions
Model-driven engineering has proven to improve software quality, and at the same time reduce effort by allowing for more automation [3]. Tool support is essential for MDE, though, and to fully support testing in an MDE environment we have needed to provide additional tool support beyond what is available externally, such as test generators, interfaces between model code and test code, and data conversion tools, among others.

ptk, developed in Motorola Labs for test generation, has been used by several diverse groups
within Motorola over the last eight years. In general we have found there is a 33% reduction in effort compared to the manual approach. This takes into account issues of learning new technology, and the usual teething problems with migrating to a new environment. The other benefits are better coverage, better quality of tests, faster turnaround time (from changing the specification to generating the tests), and easier maintenance. There are also side-benefits such as analysing the requirement specifications for defects. Indeed, we have also developed a tool that picks up defects in MSC requirements, such as race conditions and non-local choice, which we have written about elsewhere [1].
We have briefly described two MDE testing environments in this paper, one with MSC/SDL and the other with UML 2.0. These environments allow for the early detection of errors in the requirement specifications, which saves cost compared to errors that occur later in the lifecycle. In both environments it was important to use standard notations with well-defined semantics. The TTCN-3 language in particular has proven to be well-suited for describing data and tests. The TRI/TCI interfaces to TTCN-3 allow testing to be moved from a simulation environment to a real target platform by changing the adapters rather than the tests; this again saves effort.
With the advent of UML 2.0 the MDE approach is likely to become more widespread within the telecoms industry, as well as other industries. There is currently still a need for robust and complete tool support, though. With better tool support, software quality is likely to be improved and cost reduced, which is typically the bottom line in competitive industries.
References
[1] P. Baker, P. Bristow, S. Burton, C. Jervis, D. King, B. Mitchell, and R. Thomson. Detecting and resolving semantic pathologies in UML sequence diagrams. In European Software Engineering Conference, Foundations of Software Engineering (ESEC-FSE'05), Portugal, Sept. 2005.

[2] P. Baker, P. Bristow, C. Jervis, D. King, and B. Mitchell. Automatic generation of conformance tests from Message Sequence Charts. In Telecommunications and Beyond: The Broader Applicability of MSC and SDL, LNCS 2599, pages 170-198. Springer-Verlag, 2003.

[3] P. Baker, S. Loh, and F. Weil. Model-driven engineering in a large industrial context - Motorola case study. In ACM/IEEE 8th International Conference on Model Driven Engineering Languages and Systems (MoDELS/UML 2005), Jamaica, Oct. 2005.

[4] B. Beizer. Software Testing Techniques. Van Nostrand Reinhold, New York, 1990.

[5] European Telecommunications Standards Institute (ETSI). Methods for Testing and Specification; The Testing and Control Notation version 3 (TTCN-3); Part 1: TTCN-3 Core Language, Feb. 2003. Available from http://www.etsi.org.

[6] T. Gilb and D. Graham. Software Inspection. Addison-Wesley, London, 1993. ISBN 0-201-63181-4.

[7] J. Grabowski. Specification Based Testing of Real-Time Distributed Systems. PhD thesis, University of Lübeck, 2002.

[8] Ø. Haugen. Comparing UML 2.0 interactions with MSC-2000. In SAM 2004: SDL and MSC Fourth International Workshop, LNCS 3319, pages 69-84. Springer-Verlag, 2004.

[9] International Telecommunication Union (ITU-T). ITU-T Recommendation X.680: Abstract Syntax Notation One (ASN.1): Specification of Basic Notation, 2002. Available from http://www.itu.int.

[10] International Telecommunication Union (ITU-T), Geneva. ITU-T Recommendation Z.100: Specification and Description Language (SDL), 2002. Available from http://www.itu.int.

[11] International Telecommunication Union (ITU-T), Geneva. ITU-T Recommendation Z.120: Message Sequence Chart (MSC), 2002. Available from http://www.itu.int.

[12] International Telecommunication Union (ITU-T). ITU-T Recommendation X.292: The Tree and Tabular Combined Notation (TTCN-2), 1997.

[13] M. Nelson, J. Clark, and M. A. Spurlock. Curing the software requirements and cost estimating blues: The fix is easier than you might think. Program Manager, XXVIII(6), 1999.

[14] Object Management Group (OMG). UML 2.0 Superstructure Specification, Aug. 2003. Available from http://www.omg.com.

[15] Object Management Group (OMG). UML 2.0 Testing Profile Specification, 2003. Available from http://www.omg.com.

[16] Telelogic AB, Sweden. DOORS documentation, 2005. Information available from http://www.telelogic.com.

[17] Telelogic AB, Sweden. Tau and TauG2 documentation, 2005. Information available from http://www.telelogic.com.

[18] Testing Technologies IST GmbH. TTWorkbench, 2005. Information available from http://www.testing.tech.de.

[19] C. Willcock, T. Deiß, S. Tobies, S. Keil, F. Engler, and S. Schulz. An Introduction to TTCN-3. John Wiley and Sons Ltd, England, 2005. ISBN 0-470-01224-2.
MOTOROLA and the Stylized M Logo are registered in the US Patent & Trademark Office. All other product or service names are the property of their respective owners.
Towards the Holy Grail of Software Testing
Ian Gilchrist, IPL
In my presentation to UKTest 2003 (entitled ‘Limits to Testing; Limits to Test
Automation’) I attempted to define the Holy Grail of software testing as being the
ability to extract information from a software design repository and automatically
generate test cases from this to run against code produced (by an independent route)
from the same design specification. I gave at the time a number of reasons why this
was difficult to do, but hinted that stirrings of industrially useful activity were being
detected. The aim of this presentation is to update delegates on some progress that is
being made in this area.
One of the difficulties I mentioned earlier was that no commercially viable products
will be produced unless there is a recognised software design standard which has
enough critical mass to encourage the product vendors to create saleable products. It
is easy enough (with a lot of money) to create ‘boutique’ products which will work
for one specified product chain, but much harder to create something with mass
appeal. The moment seems reasonably ripe to be able to proclaim that UML V2.0 is
now showing signs of offering the necessary power/cost ratio to encourage the growth
of a cottage industry around it, which will include test generation. Indeed, UML 2.0
has already spawned a Testing Profile Specification which is worth a look.
There are currently a number of vendors in the UML-design tool area. One of
these is the I-Logix company, with its Rhapsody product now at Version 6. Rhapsody
has for some while included the ability to generate high-level ‘test vectors’ based on
UML Sequence diagrams. These are usually used to help validate an emerging UML
model (i.e. confirm that the system design is going in the ‘right direction’). In the case
of Rhapsody this test vector execution facility is known as TestConductor. As is
reasonably well-known, a UML design in Rhapsody can also be used to generate
code, typically C++, which contains most of the C++ class templates needed to turn a
validated design into working code.
However, I-Logix also took a view approximately 2-3 years ago that helping users
generate software test scripts based on details of the UML design would be a very
useful feature. In the aerospace industry in particular the production and successful
running of software unit tests is an important aspect of getting software certified as
‘fit to fly’. Crucially it is not sufficient simply to run the tests but they also have to
show evidence of enough test coverage. At the highest (safety-critical) level this
includes Modified Condition/Decision Coverage (MC/DC) as well as the more
obvious Statement and Decision forms of coverage. Achieving this high level of
coverage is a significant burden - at least, that’s the way it’s seen.
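To make the MC/DC requirement concrete: each condition in a decision must be shown, by a pair of tests, to independently affect the decision's outcome. A minimal illustration (ours, not from the presentation) for the decision `a and b`:

```python
# Illustration: three test vectors achieve MC/DC for `a and b`, versus four
# for exhaustive multiple-condition coverage.
decision = lambda a, b: a and b

vectors = [(True, True), (True, False), (False, True)]
outcomes = [decision(a, b) for a, b in vectors]
print(outcomes)  # [True, False, False]

# (T,T) vs (T,F): only b changes and the outcome changes -> b shown independent.
# (T,T) vs (F,T): only a changes and the outcome changes -> a shown independent.
```

For decisions with many conditions, the saving over exhaustive coverage grows quickly (roughly n+1 vectors rather than 2^n), which is why MC/DC is the accepted compromise at the safety-critical level.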
Accordingly in 2002 I-Logix contracted with Bremen-based OSC to produce a tool
(ATG, standing for Automatic Test Generator) which would work from UML design
data within the Rhapsody tool database to generate test cases intended to ensure that
tests based on these would enable a ‘high’ level of coverage to be reached. In fact, test
goals can be set as options within ATG as ‘states’, ‘transitions’, ‘events’ and MC/DC.
Users also have the option of specifying whether tests should be at the unit,
integration, or high-levels of testing.
UK Software Testing Research III
ATG uses proprietary heuristics (based on standard UML data - class and object
diagrams, statecharts, activity diagrams, and implementation of operations in C++
notation) to generate a Test Case file in either ASCII or XML formats. This data can
then be exported to a variety of test script formats, one of which is Cantata++. I-Logix
chose to make Cantata++ one of the standard formats offered because they saw it as
being one of the leading test tools of its type.
If accepted, my presentation will include a short (approx 15 minutes) demonstration
of Rhapsody, ATG and Cantata++ in action, working from a UML design, through
the auto-generation of test data, and a working Cantata++ test script. Since this is a
very fast-moving area of technological development it is not possible at this stage
(March 2005) to say exactly what will be demonstrated in September 2005.
To paraphrase Francis Fukuyama’s famous declaration on the fall of the Berlin Wall
that we were witnessing the ‘end of history’, is it now reasonable to say that we are
witnessing the ‘end of testing’? Not quite. Some issues remaining will include the
following:
• How ‘meaningful’ are the generated test scenarios? Do they lead to a
heightened sense of confidence that the code being tested is really suitable for
use?
• How easy is it to calculate expected values and insert Checks for these?
• Can simulation techniques (stubs and wrappers) be easily employed to enable
true unit testing to be carried out?
These and other questions will mean that software testing practitioners are not yet out
of a job!
UKTest 2005
UKTEST 2005 September 2005
Page 1 of 18 E. Perez-Miñana, JJ Gras
Improving fault prediction using Bayesian Networks for the development of embedded
software applications
Elena Pérez-Miñana
Motorola Labs / Basingstoke, UK
elena.perez-minana@motorola.com
Jean-Jacques Gras
Motorola Labs / Paris, France
jjgras@motorola.com
ABSTRACT
Predicting faults early in software projects has become possible thanks to the
utilization of innovative techniques in causal models that can combine observations of
key quality drivers with expert opinion and historical data.

During a joint effort between Motorola Labs and a Motorola Toulouse software
engineering group, Bayesian Network (BN) models were constructed for predicting the
faults inserted or found at each phase of a software development project. Defect
predictions were compared against product and process quality goals to help the
organization drive corrective actions during the project in a more timely manner. This
methodology contributed to a reduction of latent defects and an associated Cost of Quality
(COQ) improvement.

The scope and usage of these models were expanded through the specification of a
calibration method that led to the improvement of the BNs prediction models. In addition,
the sets of models were extended to cover new defect types and new areas of software
development such as system testing.
This paper describes the results that were generated from the validation and
refinement of the improved BN models, outlining a set of clear guidelines for the
generation of a good predictor.
1. Introduction
Characterizing software production and quality is becoming ever more important
to Motorola, given the ever-growing amount of software embedded in its products. Reliability
is the main product attribute traditionally used to assess product quality both for hardware
and software. Unfortunately, the assumptions held by existing software reliability
engineering methods are far from being true in a production context and their results,
based on failure count during testing, come too late to offer a chance to perform
significant corrective actions if the predicted product quality is not meeting expectations
[4], [10]. In order to anticipate and satisfy the quality levels expected by customers while
controlling costs, high maturity organizations need to adjust their process continuously,
based on data and predictive methods.
A review of a set of existing Motorola Labs defect prediction models [6], representing
different software development activities, was conducted by the organizations involved.
In doing so, they were adapted to fit the local context using
Motorola Labs BN construction tools. The Bayesian Test Assistant (BTA), one of the
components of this set of tools, enables end-users to interact with BN models, and get
predictions or run what-if scenarios at the component and system level once the models
have been built, and satisfactorily calibrated. The modelling technique used provides the
means to build causal models that can combine observations of key quality drivers with
expert opinion and historical data, offering a better opportunity to build a good predictor
than other statistical analysis techniques [1].
Motorola Toulouse, in a process improvement effort, decided to deploy a new defect
prediction method in collaboration with Motorola Labs. Using BTA, an overall process
model was then composed from the individual activity models to match the project
structure and its process. During 2003, data was collected throughout the development
and predictions were generated periodically from early stages to improve defect
containment and product quality.
As a continuation of the initial project, the set of Bayesian Network (BN) models for
fault prediction was calibrated to the local context of this high-maturity software
development process. Research was then conducted to provide: (1) a method for the
improved calibration of prediction models built using BNs, and (2) an extension of the existing
set of models to cover new defect types and areas of software development, such as
system test.
This publication describes the results from the validation and refinement of the new
BN models that were constructed for the embedded software development process
deployed in Motorola Toulouse. The validation was conducted using data that was
collected from various participants of the software development and testing team. This
was done through a web-based questionnaire designed by looking at the input nodes of
the network models that were being built. The questions finally included in the
questionnaire covered all of the model input factors. Once the answers
provided by the development team were collected, they were fed to the network models,
allowing us to generate predictions for each of the networks' outputs.
values constitutes a distribution which we label as the “predicted distribution” (PreDst).
This distribution is compared against the “actual distribution” (ActDst) recorded by the
Motorola Toulouse team during the characterization of their software development
process (SDP).
2. Deployment approach
Figure 1 Motorola Toulouse BN structure
Figure 1 shows a high-level view of the BN model structure corresponding to a
component life cycle of the process used by the Motorola Toulouse organization; details
of the procedure followed to build these models are described in [1]. There are seven sub-
networks in the whole model. Each one comprises the factors, and the factors' inter-
dependencies, used in one phase of the life cycle. A summary of the approach
followed to deploy the BN defect prediction models is given in the steps listed below:
• Identification of the different software components developed during the project,
and used in the product or final deliverable.
• Selection of the activities associated with the component development phases. These
are based on each component's life cycle, which was similar across the
components here. It is possible to start from an existing generic library of activity
models, thus reducing the effort required to build and validate completely new
models each time. Additionally, we can reuse the BN model library developed for
similar development processes.
• Revision of the selected activity models. This step usually entails adapting them
to the local context of use (for instance updating the definition of a variable or
scaling it). This step was very important as the Motorola Labs model library was
still at a research stage.
[Figure 1 shows the following sub-networks and nodes: Requirements BBN, Design BBN, Coding BBN, Feature Integration, Inspection / Unit Test, Syst Test / Field Test (product), CORE BBN, Feature / Component 1, Size.]
• Composition, through our internal BN deployment tool, BTA, of the component
defect models. These are tailored to the project structure by connecting activity
models together in branches, each branch matching the sequence of development
activities of a component.
• Interconnection of all component branches into a product model, thus
representing the integrated component predictions at the product level.
In the case of the development of a big feature, an additional level of decomposition
into sub-components or sub-features may be necessary to give predictions with a more
appropriate granularity.
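The branch composition described above can be sketched in a few lines (a toy illustration of the idea only, not the BTA tool itself): each activity model in a branch consumes its own phase inputs plus the fault density transmitted from the upstream model.

```python
from typing import Callable, Dict, List, Sequence

# A phase model maps (phase inputs, upstream fault density) -> predicted FD.
PhaseModel = Callable[[Dict[str, float], float], float]

def run_branch(models: Sequence[PhaseModel],
               phase_inputs: Sequence[Dict[str, float]]) -> List[float]:
    """Chain activity models into one component branch: each model receives
    the fault density predicted by its upstream neighbour."""
    fd = 0.0
    predictions = []
    for model, inputs in zip(models, phase_inputs):
        fd = model(inputs, fd)
        predictions.append(fd)
    return predictions

# Hypothetical toy models for a Requirements -> Design -> Coding branch
# (the coefficients are invented purely for illustration):
req = lambda x, up: 0.05 * x["complexity"] + up
des = lambda x, up: 0.10 * x["complexity"] + 0.5 * up
cod = lambda x, up: 2.00 * x["complexity"] + 1.5 * up
```

`run_branch([req, des, cod], …)` returns one predicted fault density per phase; interconnecting several such branches at a product-level node then gives the integrated product prediction.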
Thanks to the graphic representation of BN models, only limited modeling knowledge
was necessary to review, adapt, and compose the models. This expertise was provided
through Motorola Labs support during the initial transfer phase. As mentioned
previously, encouraging results had been obtained during this phase, highlighted by a
reduction of latent defects and the associated Cost of Quality improvement, prompting
the decision to continue with this practice and expand the scope of use of the models. To
this end, we then validated and improved the network models for the following phases:
Requirements, Design, and Coding.
The calibration enhancements (mean and variance) are achieved through the
validation/calibration of the nodes that comprise the existing models using historical data
that was collected in the 2003 projects and in new projects completed in 2004 as part of
the quality process improvement activities of the organization for which the network
models are intended. For this purpose a data collection tool, the “Web Questionnaire”
was deployed in Motorola Toulouse. At the same time, the first model set representing
component development is extended by including a “code inspection” model, and a
“system test effectiveness” model for system testing.
We wanted to ensure that the different software development groups working in
Motorola use the models to their best advantage. To this end, in addition to building BNs
that represent the factors currently identified, we are keeping abreast of the findings of
research groups working in the same area in other organisations. This thread of study
provides us with information that allows us to build networks that include other factors,
for example new metrics, important in software improvement processes and currently
not recorded by our software groups. We think it is appropriate to start collecting data
with these new factors in order to build better predictors in the future. A detailed
description of the results obtained with this study is out of the scope of this paper.
Details of the new type of factors that are used to monitor the software requirements
phase are described in [12]. A brief description of how these factors can be integrated with
the information currently recorded by the software development groups we are
supporting can be found in section 3.2.
The following sections provide details of the network design process, the validation
method used, and the results obtained with the Motorola Toulouse case study.
3. Validation Procedure
The method used for validating the BNs is an iterative and incremental one due to the
nature of the problem. The phases of the software development process are conducted in
a certain order, which means that the models themselves must be computed and fine-tuned
sequentially, as each model was developed to predict the fault densities incurred in one
phase of the process, given the values of the factors judged to influence defect insertion
or removal in that phase.
The Motorola Toulouse organization calculates and records, as part of its monitoring
activities, the fault density that developers have detected in the requirements documents.
It also calculates the fault density associated with the designs, and with the software that is
produced during the coding phase. It is very clear that the number of faults detected at
any stage of the process depends on the quality of the software artefacts that are produced
in the previous phases. This is reflected in the network models through links that are used
to transmit the computed values for a particular upstream network to those of the
downstream network. We use the Netica tool [10] to build the networks. The nodes that
will be used as “links” between networks are specified in the same way in these
networks. We subsequently use the BTA tool to create these links and to be able to
transmit computations between networks. The BTA output is used to convert the model
output into a predicted range.
We entered and used the cases that were recorded by the Motorola Toulouse team for
the validation and fine-tuning. Nowadays many methods can be applied to validate and
improve probabilistic models using data describing the situation to be
modelled. For any of them, one of the most important considerations is the quality of this
data. The validation method chosen depends on the degree of confidence associated with the
data, because some methods need more reliable inputs. Last but not least, in order
to produce a good predictor, the model must not be too dependent on the data that was used to
compute it, i.e. the solution must be a good “generaliser”, and this can only be
achieved if the solution is not fitted too closely to the available data. Deciding on the
appropriate level of “closeness” of fit is also difficult, and very dependent on the situation or
problem being solved.
ACTUALS          Req (OF/page)   Des (OF/page)   Code Ph (OF/KLOC)
ACT_FD_PR1           0.01            0.18            43.63
ACT_FD_PR2           0.07            0.38             7.98
ACT_FD_C2            0.08            0.25             9.69
MEAN_ALL             0.06            0.27            20.43
Table 1 Fault Densities recorded by Motorola Toulouse
For the estimation of a set of inter-dependent probabilistic networks that are reliable
predictors of the fault density associated with a process described through a set of well-
defined factors, we have developed a procedure that makes use of two different
estimation methods, which complement one another reasonably well. A general outline
of the procedure runs as follows:
1. Arrange the network models according to the dependencies that exist between
them.
2. For each model mi :
a. Make sure that the nodes in the model cover all the factors that affect the
associated development activity. Also check the inter-dependencies
between these factors. See Section 3.2 for details.
b. Transform the range of the network output (mi) so that it matches the
range of the output factor against which it is going to be compared,
allowing for variance calibration from a full history of project data
provided by Motorola Toulouse. In this case we are predicting fault
densities; therefore the output for each network (FDmi) needs to be
mapped to the range of the fault density computed at the relevant phase.
c. The values used for the calibration of the mean FD for our models can be
found in Table 1. Each column corresponds to the FD recorded by the
development team during a particular phase of the process (Requirements,
Design, Coding). The original data was recorded during the development
of three projects. Each row in the table corresponds to the FDs that were
calculated in the different phases of the process executed for a particular
project.
d. Run the calibrated network model mi with the BTA tool using as inputs
each of the cases recorded on the Web-questionnaire. Record the FDmi
computed by the model and also the relative error with respect to the
expected fault density.
e. Depending on the variability of the data, it might not be possible to
compute a good predictor just by calibrating current models. In these cases
there are two possibilities:
i. Review the activities being modelled to determine whether there
are missing factors or whether some of the existing ones have not
been correctly specified. If this is the case, it may be necessary to
modify the structure of the existing network, nodes and/or links,
and initiate another iteration in the validation procedure.
ii. If we have enough data to produce a reasonable probabilistic
network, an attempt can be made to build a new model through
statistical analysis. We developed a new approach based on
principal component analysis (PCA) to create intermediate nodes,
and linear regression to determine the relation between these
intermediate nodes and the output node, details of the procedure
can be found in Section 3.1. The estimation features used are
available in statistical packages such as Minitab [8]. The resulting
predictor will compute very good values for the data used to build
the regression model. In our case the data set was small but the fit was
very good. Nevertheless, it is very likely that for data that
was not part of the estimation procedure the predictions will be
nowhere near as good, because predictors computed in this way usually
turn out to be poor “generalisers”.
3. Once a reasonable estimator mi has been computed, it can be linked to the
network mi+1, initiating a new iteration of the calibration and fine-tuning
procedure.
3.1. Statistical Analysis
PCA is a data reduction technique used to identify a small set of variables that
account for a large proportion of the total variance in the original variables that describe a
population. By calculating the principal components (PCs) of a population from a sample, it
is possible to determine which of the population’s descriptors (scores) are responsible for
its variance. These scores can subsequently be used to build the best estimator for
computing the predicted values which, in our case, correspond to the fault densities
recorded in an organisation. In our approach, we are using linear regression for this last
step.
Details of the steps carried out are the following:
1. Compute the PCs of the input data that has been collected with the web-
questionnaire defined for the first stages of development.
2. The PCA also provides information on how dependent the
variance of the population is on each of the components that have been calculated.
3. Select the number of principal components (PCs) that are responsible for at least
90% of the population’s variance. It is important to try and select as small a
number of PCs as possible because this will result in a simpler network.
4. Build the linear model that will allow us to estimate the desired fault density using
the scores calculated with the PCA method. This step is realized using linear
regression.
5. Structure a BN which has the following nodes:
a. One input node for each of the factors that are relevant for the activity
phase of the process under consideration.
b. One intermediate node for each of the scores that were computed with the
PCA. It might be necessary to use more than one intermediate node to
reduce the number of inputs per node, since a large number of inputs
increases the complexity of the BN.
c. An additional layer of intermediate nodes has to be built to represent the
linear regression model. These nodes will be linked to the final output
node that is used to compute the fault density.
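Steps 1–4 above can be sketched as follows (a minimal illustration using NumPy's SVD for the PCA and least squares for the regression; the function and variable names are my own, not those of the Minitab-based workflow the paper used):

```python
import numpy as np

def fit_pca_regression(X, y, var_threshold=0.90):
    """PCA on the questionnaire inputs, then linear regression from the
    PC scores to the recorded fault densities (hypothetical helper)."""
    # Step 1: centre the inputs and compute principal directions via SVD.
    mu = X.mean(axis=0)
    Xc = X - mu
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Steps 2-3: keep the smallest number of PCs explaining at least
    # var_threshold (the paper uses 90%) of the total variance.
    explained = s ** 2 / np.sum(s ** 2)
    k = int(np.searchsorted(np.cumsum(explained), var_threshold)) + 1
    scores = Xc @ Vt[:k].T            # the intermediate-node values
    # Step 4: least-squares linear model from scores (plus intercept) to y.
    A = np.column_stack([scores, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)

    def predict(X_new):
        S = (np.asarray(X_new) - mu) @ Vt[:k].T
        return np.column_stack([S, np.ones(len(S))]) @ coef

    return predict, k
```

As the text warns, a predictor fitted this way can match the training data closely yet generalise poorly; checking it against held-out cases is essential.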
In those cases in which the PCA produces results showing little variance in the input
data, it might be necessary to go back and discuss the matter with the team that provided
the information through the web-questionnaire, because it will probably be
necessary to collect further information in order to compute a good
fault-density estimator.
The approach described previously provides structure and clear guidelines to compute
the probabilistic networks that give support for monitoring and improving the overall
software development process of an organization. One thing that should be noted is that it
might not work in all cases. If the data cannot be expressed with a linear function, it is
likely that the computations would produce a very complex solution that would fit the
data well but not “explain” it better than the models produced by experts. It might also be
the case that the input data used for the estimation is not reliable enough to ensure
good results from the estimation procedure described.
This highlights the importance of implementing sound methods to collect the data
generated during any development process.
For those cases where it is possible to compute a reasonable estimator using linear
regression and PCA techniques, we still need to determine how good the solution is in
relation to predicting values when new data is fed to the model. This constitutes one of
the aims of the project for 2005, i.e. the specification of a procedure that will allow us to
determine the generalisation capability of an estimator that has been computed using our
method.
In the next section we include details of the application of the validation/estimation
procedure that was described previously. The case study is the fault density prediction
networks developed for Motorola Toulouse.
3.2. Determining the factors/nodes for a BN
When building a BN from expert opinion for a particular group, we conduct a
number of interviews with the development team to discuss the type of information they
record to measure the quality of their processes, and any new factors they identify as quality
drivers but do not currently record. Once this matter has been clarified, it is necessary to
group the factors that are inter-dependent to reflect these dependencies in the network.
For example, in the case of the BN model for the requirements phase an element that will
affect the quality of the output, i.e. the requirements specification, is its structural
complexity.
Figure 2 Example Requirements BN model
The BN model shown in Figure 2 represents the factors and factors’ inter-
dependencies of the requirements process followed in Motorola Toulouse. The input and
output nodes of the network are described in Table 2. The intermediate nodes used in the
network integrate information that, in the experts’ experience, is brought together during
the decision process at any stage of the requirements phase. In some cases, the
intermediate nodes allow the network architects to explicitly express the dependency that
exists between some of the factors used to build the network.
Apart from deciding the inter-dependencies that exist between the input nodes, it was
also necessary to decide on the strength of the links associated with each of these inputs.
For example, the REQ_Accuracy node in the requirements BN has two inputs:
domain_knowledge, and time_pressure. Given that the distribution of the REQ_Accuracy
node is estimated using linear weighting, it is necessary to associate a weight value to
each of the inputs. The process of deciding on these weights involves a certain element of
trial and error; therefore care must be taken at the time of deciding on the best solution.
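The linear weighting mentioned above can be illustrated as follows (a hypothetical sketch; the calibrated Motorola weight values are not published here). An intermediate node such as REQ_Accuracy is scored as a weighted combination of its parents:

```python
def linear_weighted_node(parent_values: dict, weights: dict) -> float:
    """Estimate an intermediate node's value as the normalised weighted
    mean of its parent nodes' values (all on a common 0-1 scale)."""
    total = sum(weights.values())
    return sum(parent_values[p] * w for p, w in weights.items()) / total

# Hypothetical weights: domain knowledge counts twice as much as time
# pressure. Parent values are assumed already oriented so that higher
# always means "better for accuracy" (i.e. time pressure inverted).
acc = linear_weighted_node(
    {"domain_knowledge": 0.8, "time_pressure": 0.2},
    {"domain_knowledge": 2.0, "time_pressure": 1.0},
)
# acc = (0.8 * 2.0 + 0.2 * 1.0) / 3.0, i.e. approximately 0.6
```

Tuning these weights is exactly the trial-and-error step the text describes: each candidate weighting is checked against the recorded project data.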
Node Name: Definition
REQ_Domain_Knowledge: a subjective assessment of the domain knowledge the project requirements team has.
GEN_REQ_F01_V08.dne: the component's generic requirements model label.
REQ_Time_Pressure: a measure of the time pressure the project team is under during the requirements phase; five states are defined for this factor.
REQ_Accuracy: a quality driven by the team's domain knowledge and the time pressure it is under.
REQ_Problem_Complexity: a subjective measure of the complexity of the application.
REQ_Uncomprehensiveness: represents comprehensiveness, completeness, and consistency (plus coverage) in requirements documents.
REQ_Stability: a subjective assessment of the evolution of the requirements at the end of the phase.
REQ_Process_knowledge: a measure of how well the requirements process is followed; five states, each defining a higher level of requirements process applicability.
REQ_Communication: a measure of the level of team communication that exists.
REQ_Challenge: the challenge the development team is confronted with as a result of, for example, missing requirements and requirements instability. Part of the holes in the requirements could be filled by the team's design capability.
Table 2 Requirements network's input/output nodes' description
In our example three nodes, i.e. problem_complexity, time_pressure, and
domain_knowledge, are specified and associated to provide a measure of
“uncomprehensiveness” which allows us to represent in the BN the impact these elements
have on the overall complexity of the final requirements. Following the guidelines of the
IEEE standards on requirements specification [13] we have produced a BN which
provides a more fine-grained description of our understanding of the term “structural
complexity” by incorporating the following factors to the BN model: unambiguity,
modifiability, and verifiability. These factors provide a better description of the quality of
a requirement’s structure because they describe it in terms of their clarity, how easy it is
to change them, and how easy it is to test them. We believe this network should be a
better predictor, as it includes more information about the requirements process used by the
development team. The extent to
which this is the case will be part of future validation research.
In addition to the specification of the dependencies that exist between the various
factors, it is also necessary to specify an output node that will compute a value that can be
compared to the fault density values recorded by the development group. This is achieved
by mapping the requirements challenge value estimated by the BN to a value that
falls into the range of the fault densities that have been recorded.
In certain cases, it might be necessary to define additional intermediate nodes. This
occurs when a large number of input factors are inter-dependent, as is the case in the
network models that were built using the statistical analysis approach. When it is
necessary to use all the inputs to calculate the principal component in the BN model, the
computational cost of calculating that node becomes prohibitive because of the exponential
growth of the node's conditional probability table. To reduce node complexity, the inputs
are distributed amongst various intermediate nodes, which are brought together again by
adding an additional layer of intermediate nodes to the network.
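The blow-up, and the benefit of splitting, can be seen with a quick count (the arities below are illustrative; the actual node state counts in the Motorola models may differ). A node's conditional probability table holds one entry per combination of parent states times its own states:

```python
from math import prod

def cpt_entries(parent_states, node_states=5):
    """Entries in a node's conditional probability table:
    (product of the parents' state counts) x (the node's own state count)."""
    return prod(parent_states) * node_states

# One 5-state node fed directly by eight 5-state inputs:
flat = cpt_entries([5] * 8)        # 5**9 = 1,953,125 entries

# The same eight inputs split over two 4-parent intermediate nodes,
# joined by a combiner node with those two intermediates as parents:
split = 2 * cpt_entries([5] * 4) + cpt_entries([5, 5])   # 6,375 entries
```

Adding one intermediate layer thus cuts the table size by roughly a factor of 300 in this example, at the cost of extra nodes to specify.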
4. Motorola-Toulouse Case Study
The following section describes the results obtained by using Bayesian networks to
build fault density predictors for Motorola Toulouse. The models were built using an
iterative procedure; therefore various network models were generated throughout. The
following discussion includes some of the solutions produced which are restricted to the
Requirements, Design, and Coding phases.
The procedure for validating/fine-tuning the Motorola Toulouse models follows
the one outlined in Section 3. Each of the models mentioned corresponds to a version of
the networks generated during an iteration of the validation/design procedure:
o INITIAL_MODELS: these are the initial networks that were built in consultation
with the Motorola Toulouse organisation. Each model is extended to produce a fault
density output FD. In the initial stage of the network development, a consultation
was carried out to decide the factors that were relevant at each stage of the
process. Once this matter was agreed, it was necessary to decide how the various
networks were going to be linked together. This step is important because it is
through this structure that information is fed between networks. The networks
were run to gain an initial insight into the type of predictions it was possible to
compute. The columns labelled INIT_MODEL in the tables of sections 4.2, 4.3,
and 4.4 list the PreDst computed by this particular set of networks. There is also
an adjacent table that lists the relative errors that each network model is
committing when attempting to compute the ActDst. The relative errors
associated with the INIT_MODEL are the highest in the entire table, showing that it
is clearly necessary to calibrate all of the network models.
o CALIB_MODELS: there are two further sub-sets of models in this group. Each
one includes a version of the network models in which the nodes were calibrated
in order to achieve better results for the various sets of inputs provided by
Motorola Toulouse. We have included two versions that differ from each other in
the weights associated to the links connecting the input nodes with the
intermediate nodes.
o REGRESSION_MODELS: this set of models was computed using linear
regression and PCA. It produces the best set of results as can be seen from the
contents of the relative errors tables listed in sections 4.2, 4.3, and 4.4. At present
we are not able to determine the generalisation capabilities of these networks;
therefore it is not possible to discuss this matter further at this stage.
The following subsections discuss the results of the network models that were built for
each phase of the Motorola Toulouse process using the validation procedure.
4.1. Data sets
The data-points that were fed to each of the networks correspond to averages over the
data collected for each project used as input in this design process.
o PR1_AVR: is the average over the data collected from the individuals of the team
working on one of the projects they developed during 2003. For confidentiality
reasons the project has been labelled PR1. Each input that was fed to the networks
is the average over the values delivered by each individual member of the
development team.
o PR1_C2: this is a data set obtained from the team working on the feature labelled
C2 for the purposes of this report. This development was completed in 2004. The
feature C2 is part of the project PR1.
o PR2_C1_WAVG: the data points produced during the execution of
project PR2 could not be treated uniformly because they were obtained from
members of four different development teams, each working on different
aspects of the features under development; it was therefore necessary to take
into account the differences between their working contexts. This was done
by associating a weight with each data point based on the size of the corresponding
component; a reasonable assumption is that the weight should be proportional to the
contribution of the component's size to the overall feature. PR2_C1_WAVG is the
average over the weighted inputs.
o PR2_C1_AVR: the average over all the data points collected from
members working on the PR2 project, computed without weighting the inputs.
We include it for comparison with PR2_C1_WAVG, to check the effect of the
weighting and to determine whether basing it on code size is a correct
assumption.
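The size-proportional weighting described for PR2_C1_WAVG can be sketched as follows. Only the weighting rule (weight proportional to component size) comes from the text; the component sizes and the factor values reported by the four teams are hypothetical.

```python
# Hypothetical component sizes (e.g. KLOC) and one input-factor value as
# reported by the four PR2 development teams.
sizes  = [12.0, 3.5, 8.0, 1.5]
factor = [0.60, 0.20, 0.45, 0.90]

simple_avg   = sum(factor) / len(factor)                    # PR2_C1_AVR style
weights      = [s / sum(sizes) for s in sizes]              # proportional to size
weighted_avg = sum(w * f for w, f in zip(weights, factor))  # PR2_C1_WAVG style

print(round(simple_avg, 4), round(weighted_avg, 4))
```

With these numbers the team owning the largest component dominates the weighted value, which is exactly the effect the weighting is meant to capture.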
REQUIREMENTS FD   INIT_MODEL   CALIBRATION 1   CALIBRATION 2   REGRESSION
PR1_AVR           0.13         0.05            0.05            0.01
PR1_C2            0.16         0.05            0.07            0.07
PR2_C1_WAVG       0.13         0.11            0.09            0.07
PR2_AVR           0.12         0.13            0.09            0.03

Table 3 Fault Density predicted by the REQUIREMENTS model
RELATIVE ERROR    INIT_MODEL   CALIBRATION 1   CALIBRATION 2   REGRESSION
PR1_AVR           979.94       295.14          329.26          -12.93
PR1_C2            100.62       -33.59          -15.28          -8.21
PR2_C1_WAVG       69.05        49.86           15.21           -1.69
PR2_AVR           67.13        51.82           19.56           -57.87

Table 4 Relative error (%) for the REQUIREMENTS model
4.2. Requirements phase networks
Table 3 and Table 4 contain the results computed with the network models
generated for the requirements phase conducted at Motorola Toulouse. Table 3 lists the
FDs predicted by each of the network models when it is fed the averages
of the different projects used in the validation. Table 4 gives the relative
errors associated with each of these FD predictions when compared against the ActFD
values recorded for each project.
Looking at the relative errors in Table 4, it is clear that the best results are computed
with the REGRESSION network. The CALIBRATION networks compute reasonable
results, but only for a subset of the actual FDs. This indicates that it will be difficult to
capture in the network the differences that exist between the projects whose data was
used for the calibration procedure.
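The excerpt does not state the exact error convention behind the tables, but a signed relative error of the kind tabulated (negative when the model under-predicts) is commonly defined as below; the example values are hypothetical.

```python
def relative_error_pct(predicted_fd, actual_fd):
    """Signed relative error in percent: negative when the model under-predicts.

    This definition is an assumption; the report does not give its formula.
    """
    return 100.0 * (predicted_fd - actual_fd) / actual_fd

# Hypothetical example: predicting an FD of 0.09 against an actual FD of
# 0.10 gives roughly -10%.
print(relative_error_pct(0.09, 0.10))
```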
Comparing the results obtained with the weighted average of the
PR2_C1_WAVG data set against those obtained when feeding the simple
PR2_C1_AVR average to the requirements network, it is clear that the differences
between the perceptions of the development groups working on different
components of the same feature must be taken into consideration.
Comparing the networks of the validation procedure against the initial requirements
network built by Toulouse, there are structural differences in addition to the
range and functions used to compute the output node. It was necessary to add
another input node, which we have labelled REUSE. Its value is a percentage indicating
how much of the requirements in the feature's specification document is being reused
from a previous release. The reuse factor should always be taken into consideration
because of the impact it has on the amount of work that needs to be carried out when
developing a software feature.
DESIGN FD         INIT_MODEL   CALIBRATION 1   CALIBRATION 2   REGRESSION
PR1_AVR           0.22         0.17            0.17            0.23
PR1_C2            0.25         0.17            0.18            0.21
PR2_C1_WAVG       0.23         0.22            0.21            0.35
PR2_AVR           0.24         0.25            0.24            0.37

Table 5 Fault Density predicted using the DESIGN models
RELATIVE ERROR    INIT_MODEL   CALIBRATION 1   CALIBRATION 2   REGRESSION
PR1_AVR           22.87        -6.03           -5.11           26.92
PR1_C2            2.68         -4.79           -27.64          -12.79
PR2_C1_WAVG       -40.01       20.63           -45.97          -8.18
PR2_AVR           -36.49       39.49           -36.92          -2.03

Table 6 Relative Error (%) for the DESIGN models
4.3. Design phase networks
Once we were satisfied that we had a working estimator for the FD of
the requirements phase, we proceeded with an analysis of the data collected by
Toulouse to improve the network model that was initially put together to reflect the
factors, and the factors' inter-dependencies, that occur at the design phase.
Structurally, the only difference with the original network is that it was necessary to
add a new output node, because the Design_Challenge computed by the original
network has nothing to do with the FD values recorded by the Toulouse team as part
of their quality processes. Table 5 lists the FDs computed with the design models that
were built. The values in Table 6 correspond to the relative error between the
predicted FD and the actual FD in each case.
For the design phase the best models are CALIBRATION 1 and the
REGRESSION model. Once again, it is possible to see that with simple calibration the
most that can be achieved is a solution that is close to only a subset of the
actual FDs. The regression model at the design level is a better estimator, although
once more it is not possible to make any claims with respect to its generalisation
potential.
4.4. Coding phase networks
The final stage of the validation/fine-tuning phase involved the network covering
the factors, and the factors' inter-dependencies, for the coding phase of the Motorola
Toulouse process. As in the previous cases, validation began once satisfactory
results had been achieved with the networks designed to cover the design phase.
CODING FD         INIT_MODEL   CALIBRATION 1   CALIBRATION 2   REGRESSION
PR1_AVR           13.34        11.98           25.90           28.73
PR1_C2            12.77        11.20           23.52           26.65
PR2_C1_WAVG       13.42        12.75           25.90           31.29
PR2_AVR           12.10        13.29           20.42           28.77

Table 7 Fault Density estimated with the CODING models
RELATIVE ERROR    INIT_MODEL   CALIBRATION 1   CALIBRATION 2   REGRESSION
PR1_AVR           -69.42       -72.54          -40.63          -34.15
PR1_C2            31.81        15.58           142.83          175.09
PR2_C1_WAVG       68.15        59.81           224.68          292.22
PR2_AVR           51.67        66.58           155.96          260.58

Table 8 Relative error (%) of the CODING models
Of all the network models this was the most difficult to compute, partly due to the
huge differences between the various coding FDs recorded by the Toulouse team,
which contrast greatly with the small differences between the values of the input
factors. This mismatch in the diversity of the inputs and outputs made it very difficult
to produce good estimators, either by calibrating the existing nodes or by using the
linear regression/PCA techniques. The low diversity found in some of the input
values made it extremely difficult to conduct a reasonable PCA.
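The diversity mismatch can be illustrated numerically: when most input factors barely vary across projects, the leading principal component is dominated by the few high-variance factors and PCA recovers almost nothing from the rest. The factor matrix below is invented for the sketch.

```python
import numpy as np

# Invented factor matrix for four projects: the first three factors barely
# vary, the fourth varies a lot, mimicking the situation described above.
X = np.array([
    [0.50, 3.0, 0.81, 12.0],
    [0.51, 3.1, 0.80, 30.0],
    [0.50, 2.9, 0.81, 55.0],
    [0.49, 3.0, 0.80,  7.0],
])

var = X.var(axis=0)                       # per-factor variance
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)   # singular values of centred data
explained = s**2 / np.sum(s**2)           # variance explained per component

# The first component absorbs nearly all the variance, so PCA cannot
# separate the projects along the near-constant factors.
print(var.round(4), explained.round(4))
```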
For this particular case, it is possible that a factor has been missed out of the network
model that would explain this difference for the PR2 project. It might also be the case
that some measurements are erroneous because, for instance, the descriptions
associated with some of the input factors were not clear enough.
The recommendation for this coding model is to continue capturing data from other
projects and to adjust the relations using the new data. Another possibility is a "model
bagging" approach that combines multiple model predictions: we could take a
coding model from the generic library and combine its results with the Motorola
Toulouse-specific coding model.
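The report does not specify how the two models' outputs would be merged. One minimal reading of "model bagging" is a convex combination of the two FD predictions; the mixing weight `alpha` and both input values below are assumptions, not taken from the report.

```python
# Hypothetical combination of a generic-library coding-FD prediction with
# the Toulouse-specific one; alpha is an assumed mixing weight.
def combine_predictions(generic_fd, specific_fd, alpha=0.5):
    """Convex combination of two fault-density predictions."""
    return alpha * generic_fd + (1 - alpha) * specific_fd

print(combine_predictions(20.0, 28.0))  # → 24.0, midway between the two estimates
```

In practice `alpha` could itself be tuned on held-out projects, favouring whichever model has tracked the actual FDs more closely.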
5. Conclusion and Next Steps
The analysis of the data and models provided by Motorola Toulouse has given us
the opportunity to produce a reasonable set of models, and also a new method for
validating and fine-tuning BNs developed to support the improvement of the
software development process for a particular organization.
The method shows two alternative ways to produce an improved network. The first
is based mainly on modifying the ranges of the existing nodes, adding inter-
dependencies between them, and/or varying the weight values associated with each
of the nodes used as inputs to the intermediate nodes of a BN model. The second
is to use linear regression and PCA to build the intermediate/output nodes of the
BN.
The second way of computing the best network generally produces better results
when the outputs are compared against the data that was used to build the
network. It does not necessarily follow that this will be the case when the network is
run on data from future Motorola Toulouse projects, because the network might not
generalise well. Further studies are required to settle this point.
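One standard way to probe the generalisation question raised here, not described in the excerpt itself, is leave-one-out cross-validation over the available projects: each project is held out in turn, the node is refit on the rest, and the held-out relative error approximates performance on a future project. The data below is invented, and the linear fit stands in for the regression-built node.

```python
import numpy as np

# Invented data: 8 projects, 3 input factors, a linear fault-density target.
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))
y = 0.2 + X @ np.array([0.03, 0.05, 0.01]) + rng.normal(scale=0.005, size=8)

# Leave-one-out: refit the linear node without project i, then measure the
# relative error (%) of its prediction for the held-out project.
loo_errors = []
for i in range(len(y)):
    mask = np.arange(len(y)) != i
    A = np.column_stack([X[mask], np.ones(mask.sum())])
    coef, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
    pred = np.append(X[i], 1.0) @ coef
    loo_errors.append(100.0 * (pred - y[i]) / y[i])

print(np.mean(np.abs(loo_errors)).round(2))  # mean absolute relative error (%)
```

A large gap between this held-out error and the fitted error on the full data set would signal exactly the over-fitting risk discussed above.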
We believe that the new methods developed for calibrating/fine-tuning the network
are the best way forward, as they allow us to better exploit the potential of the BN
model. In order to fully benefit from this approach we need to extend our current
data set with more input data and actual fault numbers, and to refine the data
collection method to increase the precision of the measurements.
To continue fine-tuning the method described, we have been increasing the number
of opportunities to apply it to other groups within Motorola in addition to
Motorola Toulouse. This will help us to improve the current models, in
addition to learning from new data, by reviewing new generic descriptions for the
nodes currently used and adding group-specific indicator variables in each case to
qualify the factors using Bayesian inference. This will make it possible to convey a
clear, unambiguous meaning for each of the factors that are part of the BN models
and to refine the measurements with the added indicators.
As the process followed in an organization changes, more effective ways to
capture the factors are devised, or the existing factors' inter-dependencies change, it is
also necessary to adapt or change the BN models that are being used. Currently, we are
working on improving the models' quality by conducting a review of the research area [1],
[5], [9], analysing the solutions reached by other research teams and some of the new
modelling techniques that they have derived. This activity triggered the
changes to the Requirements network model discussed previously, which was extended
by integrating into the model the set of factors included in the ISO metrics for
requirements management [13]. The new model is shown in Figure 3. The report
mentioned in [12] includes details of how the model was built. We intend to validate it
in the project pilots scheduled for 2005.
Figure 3 REQUIREMENTS generic network
Our goal in 2004 was to produce a clear procedure to validate/calibrate BNs for
predicting software reliability. To this end, we applied statistical techniques that
provided us with a measurement of the quality of the predictions made by our models,
and with suggestions on how the models could be improved. The results achieved at
Motorola Toulouse are encouraging; nevertheless, the number of unresolved questions
indicates that there is still a lot of research to be done, and that is what we are
currently working on.
6. REFERENCES¹

[1] Boehm B., Basili V., "Software Defect Reduction Top 10 List", IEEE Software, 2001.
[2] Bourry F., Gras J.J., "Fault Prediction using Historical Data and Bayesian
Networks", Proceedings of the Motorola S3 Symposium, 2004.
[3] Fenton N., Krause P., Neil M., "Software Measurement: Uncertainty and Causal
Modeling", IEEE Software, Jul/Aug 2002.
[4] Fenton N., Neil M., "A Critique of Software Defect Prediction Models", IEEE
Transactions on Software Engineering, vol. 25, no. 5, Sept/Oct 1999.
[5] Fenton N., Pfleeger S., Software Metrics: A Rigorous & Practical Approach,
ISBN 0-534-95425-1, PWS, 1997.
[6] Gras J.J., "End-to-end defect modelling", IEEE Software, Sept/Oct 2004.
[7] Gras J.J., McGaw D., "End-to-end defect prediction", 15th International
Symposium on Software Reliability Engineering, 2004.
[8] Minitab, http://www.minitab.com.
[9] MODIST project, http://www.modist.org/
[10] Musa J., Iannino A., Okumoto K., Software Reliability Engineering:
Measurement, Prediction, Application, ISBN 0-07-044093, McGraw-Hill, 1987.
[11] Netica, http://www.norsys.com/
[12] Pérez-Miñana E., "Extending the requirements factors of the fault prediction
BN models", Motorola internal report, August 2004.
[13] IEEE Std 830-1998, "Recommended Practice for Software Requirements
Specifications", 1998.

¹ There are a number of additional publications that describe partial results achieved using the
approach described in the paper. They are not listed in the references due to confidentiality issues.