UK Software Testing
Research III
5-6 September 2005
Department of Computer Science
University of Sheffield
UKTest 2005:
UK Software Testing Research III

The series of UK Software Testing Research workshops started with a workshop at the University of York in September 1999. This first event involved only invited speakers and participants and almost all were from the UK. The EPSRC-funded FORTEST network was established in 2001 and discussions during FORTEST workshops led to a second event being held in September 2003. This second workshop was open to the international testing community and all papers were reviewed by members of the programme committee. UKTest 2005 has followed this pattern.

The aim of UKTest is to bring together members of the UK Software Testing research community. It is intended to be an informal event, so presentations are relatively short (most are 20 minutes) and plenty of time has been scheduled for questions and discussion.
Organisation
• General Chair: Rob Hierons, Brunel University
• Programme Committee Chair and Local Organiser: Phil McMinn, University of Sheffield
Acknowledgements
We would like to thank our two invited speakers, Paul Gibson and Alan Richardson, for agreeing to present at UKTest 2005.
Programme Committee
• Paul Baker, Motorola
• Tony Cowling, University of Sheffield
• Kirill Bogdanov, University of Sheffield
• John Clark, University of York
• John Derrick, University of Sheffield
• Isabel Evans, Testing Solutions Group Ltd
• Ian Gilchrist, IPL
• Keith Harrison, Praxis
• Mark Harman, King's College London
• Rob Hierons, Brunel University
• Mike Holcombe, University of Sheffield
• Paul Krause, University of Surrey
• Phil McMinn, University of Sheffield
• Marc Roper, University of Strathclyde
• Clive Stewart, IBM
• Joachim Wegener, DaimlerChrysler
• Martin Woodward, University of Liverpool
• Hong Zhu, Oxford Brookes University
Contents
1. Empirical Software Testing

Collecting and Categorising Faults in Object-Oriented Code
Neil Walkinshaw, Marc Roper, Murray Wood

Study of the Reciprocal Collateral Coverage of Two Testing Methods
Derek Yates, Nicos Malevris

2. Formal Models and Approaches to Testing

Towards Unit Testing for Communicating Stream X-machine Systems
Joaquin Aguado, Michael Mendler

Testing from object machines in practice
Kirill Bogdanov, Mike Holcombe

A Formal Model for Test Frames
Tony Cowling

The Need for New Statistical Software Testing Models
John May, Maxim Ponomarev, Silke Kuball, Julio Gallardo

A Theory of Regression Testing for Behaviourally Compatible Object Types
Anthony Simons

Exploring test adequacy for database systems
David Willmor, Suzanne M Embury

3. Search-Based Software Testing

Automatic Software Test Data Generation For String Data Using Heuristic Search with Domain Specific Search Operators
Mohammad Alshraideh, Leonardo Bottaci

Use of branch cost functions to diversify the search for test data
Leonardo Bottaci

Testability Transformation for Efficient Automated Test Data Search in the Presence of Nesting
Phil McMinn, David Binkley, Mark Harman
4. Tools and Experience

Model-Driven Engineering Testing Environments
Paul Baker, Paul Bristow, Clive Jervis, David King, Rob Thomson

Towards the Holy Grail of Software Testing
Ian Gilchrist

Improving Fault Prediction Using Bayesian Networks for the Development of Embedded Software Applications
Elena Perez-Minana, Jean-Jacques Gras
1. Empirical Software Testing
Collecting and Categorising Faults in Object-Oriented Code

Neil Walkinshaw, Marc Roper, Murray Wood
Department of Computer and Information Sciences
University of Strathclyde, Glasgow G1 1XH, UK
nw, marc, murray@cis.strath.ac.uk

Abstract

A range of techniques exist to identify and isolate faults in object-oriented code, but development and evaluation of these techniques is hampered by the fact that there is very little information about the nature of object-oriented faults. This paper describes our experiences of using open source software projects to build up a picture of common faults in OO software. Open source projects are a rich source of such data but the identification of faults from problem reports is a non-trivial exercise. Heeding existing warnings against employing tree-based classification schemes, the attribute categorization scheme of Weyuker and Ostrand is adapted for the OO paradigm and employed to describe the 71 identified faults. These descriptions are then used to create an initial tree-based fault model. The resulting fault model is of course partial (we are limited by the types of bugs we have observed), but it provides a useful starting point for future refinement.

1 Introduction

A sound knowledge of the types of faults[1] that typically may reside in software products is essential for the development of effective validation and verification techniques. For example, effective inspection techniques rely on the presence of checklists that guide the inspector towards potentially problematic areas of code and these must be continually updated to represent the currently occurring problems; the effectiveness of a testing strategy may be evaluated by determining its power to uncover a representative distribution of defects; and the potency of static analysis tools can be increased by targeting them at revealing commonly occurring fault types.
As Ostrand and Weyuker observed, the goals of collecting and categorising faults include "evaluating the effectiveness of current software development, validation, and maintenance techniques, assessing the effectiveness of proposed new techniques, gathering information that will guide future development methods and projects" [14]. Although written twenty years ago, these goals are still valid.

[1] Throughout this paper the terms "fault", "bug" and "defect" are used synonymously. A fault is deemed to be the consequence of an error and a failure is a manifestation of a fault.

Since then object-orientation (OO) has become the dominant programming paradigm. The types of faults that can occur in procedural code (the subject of Ostrand and Weyuker's paper) are different to those that can occur in object-oriented code. However, there is very little information about the type and distribution of faults that are likely to occur within OO projects, which makes it very difficult to develop and evaluate techniques for OO software.

This paper describes our experiences of trying to build knowledge of the types and distributions of object-oriented faults that occur in typical software projects. We sampled 71 source code faults from three open-source OO projects with the initial aim of using them to produce an OO fault taxonomy. As this paper shows, there are several factors that make it virtually impossible to produce a taxonomy that is both definitive and objective. We demonstrate this with a survey of existing procedural and OO fault classification schemes and highlight some of their ambiguities. We then use a modified version of Ostrand and Weyuker's attribute categorisation scheme in an attempt to classify our own collection of faults. The final result is a tentative hierarchy of fault types.

The most important contribution of this paper is not the fault classification scheme itself, but the extension and employment of the attribute classification scheme which permits the faults to be used as a basis for multiple classification schemes. If they are simply recorded into a pre-defined hierarchy, there is no scope for reclassification.

2 Related Work

This section analyses some of the key contributions that have been made in the study and classification of faults, both in procedural and OO contexts.
2.1 Procedural Code Faults

A substantial amount of work has been carried out on the analysis of errors, faults and failures that occur in procedural code (a comprehensive summary may be found in Roper [17]). This section is not exhaustive but focuses on work that is particularly relevant to our aim of categorising object-oriented faults (i.e. categorisations including faults that could also occur within object methods, and studies that provide methodological details).

Glass. A sample of 100 'software problem reports' (SPRs) is taken from two real-life software projects and used as a basis for categorisation [10]. SPRs are studied in their 'raw' undoctored form so that as much original information as possible left by the programmer can be taken into account. Faults are allowed to 'self-categorise': having reviewed a fault, it is either assigned to a category describing its own nature or assigned to an existing category generated by a previous fault. The benefit of this approach is that it allows for new categories to be established. The final set of fault categories identified by Glass can be seen in table 1.

Ostrand and Weyuker. A sample of 156 faults is taken from a ten thousand line special-purpose editor project [14]. Faults are recorded on 'Software User Reports' and 'Change Report' forms are used to describe changes made to the product. They note (citing Thibodeau [18]) that the popular approach of classifying an error by placing it in a tree of separate types is prone to ambiguous, overlapping and incomplete categories, too many categories, and confusion of error causes, fault symptoms and actual faults. Their solution to this is to describe the problem symptoms, the actual problem discovered and the correction made, instead of slotting faults into a specific category. They use these descriptions, captured in terms of attributes and values, to develop their own classification scheme.
Importantly, they do not assign a fault to a single category but "attempt to identify the fault's characteristics in several distinct areas". The fault attributes and the corresponding values suggested in their paper are:

• Major category: Data Definition, Data Handling, Decision, Decision Processing, Documentation, System, Not an Error
• Type: Address, Control, Data, Loop, Branch
• Presence: Omitted, Superfluous, Incorrect
• Use: Initialize, Set, Update
Defect Type: Process Association
Function: Design
Interface: Low Level Design (LLD)
Checking: LLD or Code
Assignment: Code
Timing / Serialisation: LLD
Build / Package / Merge: Library Tools
Documentation: Publications
Algorithm: LLD

Table 2: ODC Defect types [7]

Perceptual bugs
Specification bugs
Abstraction bugs
Algorithmic bugs
Reuse bugs
Logical bugs
Semantic bugs
Syntactic bugs
Domain Adherence bugs

Table 3: Purchase and Winder's fault taxonomy [16]

Chillarege et al. Fault data can be used to provide feedback on the software development process. Chillarege et al. propose the Orthogonal Defect Classification (ODC) process, where signatures can be extracted from defects that point to triggers in the software process [7]. The goal of ODC is to identify the root cause of defects of a particular type so that the development process can be improved to address this cause. Chillarege et al. keep the number of possible defect types to a minimum in order to avoid confusion. These can be associated with a particular stage of the process and are shown in table 2.

2.2 Object-Oriented Code Faults

Huffman Hayes. Huffman Hayes looks at the possibility of applying fault-based testing to OO software [11]. The paper consolidates previous work on object-oriented software faults by Purchase and Winder [16], Firesmith [9] and Mirsky et al. [13], although the empirical basis for this, and the classifications upon which it is founded, are by no means clear. Every fault class is assigned a method that can be used for its detection. We have split Huffman Hayes' fault taxonomy in order to group faults by author. Faults proposed by Purchase and Winder, Firesmith and Mirsky et al. are shown in tables 3, 4 and 5 respectively.
Omitted Logic: Code is lacking which should be present
Failure to reset data: Reassignment of needed value to variable omitted
Regression error: Attempt to correct one error causes another
Documentation in error: Software and documentation conflict
Requirements inadequate: Specification of the problem insufficient to define desired solution
Patch in error: Temporary machine code change contains an error
Commentary in error: Source code comment is incorrect
IF statement too simple: Not all conditions necessary for an IF statement
Referenced wrong data variable: Self-explanatory
Data alignment error: Data accessed is not the same as data desired due to using wrong set of bits
Timing error causes data loss: Shared data changed by a process at an unexpected time
Failure to initialise data: Non-preset data is referenced before a value is assigned

Table 1: Fault categories reported by Glass [10]

Errors associated with Objects:
  Abstraction violated
  Persistence problems
  Documentation out of sync with code
  Incorrect state model
  Invariants violated
  Failures linked to instantiation and destruction
  Concurrency problems
  Failure to meet requirements of the object
  Syntax errors
  Failures associated with messages, exceptions, attributes or operations
Errors associated with Classes:
  Abstraction violated
  Documentation out of sync with code
  Incorrect state model
  Invariants violated
  Failures linked to instantiation and destruction
  Failures associated with inheritance
  Failure to meet requirements of the class
  Syntax errors
  Failures associated with messages, exceptions, attributes or operations
Errors associated with Scenarios:
  Failure to meet requirements of the scenario
  Correct message passed to the wrong object
  Incorrect message passed to the right object
  Correct exception raised to wrong object
  Incorrect exception raised to right object
  Failures linked to instantiation and destruction
  Concurrency problems
  Inadequate performance and missed deadlines

Table 4: Firesmith's fault taxonomy [9]
Errors with Encapsulation:
  Public interface to class not via class methods
  Implicit class to class communication
  Access module's data structure from outside
  Overuse of friend/protected mechanisms
Errors associated with Modularity:
  Method not used
  Public method not used by object users
  Instance not used
  Excessively large number of methods in class
  Too many instance variables in class
  Excessively long method
  Excessively large module
Errors with Hierarchy:
  Branching errors
  Dead-ends and cycles
  Multiple inheritance errors
  Improper placement
Errors with Abstraction:
  Class contains non-local method
  Incomplete (non-exhaustive) specialisation

Table 5: Mirsky et al.'s fault taxonomy [13]

Inconsistent Type Use (context swapping)
State Definition Anomaly (possible post-condition violation)
State Definition Inconsistency (due to state-variable hiding)
State Defined Incorrectly (possible post-condition violation)
Indirect Inconsistent State Definition

Table 6: Alexander et al.'s inheritance / polymorphism related fault categories

Alexander et al. Concentrate on the faults that can be introduced by inheritance and polymorphism and provide the 'syntactic patterns' which cause them [4]. The proposed fault categories are listed in table 6. The syntactic patterns also open up the possibility of using static-analysis tools to produce empirical results, which would be very tedious to generate manually.

Younessi. Provides a higher-level overview of paradigm features that cause problems when trying to detect faults [19]. Although no concrete fault classifications are proposed, a useful list of issues that need to be considered when dealing with object-oriented defects is provided. This is shown in table 7.

Abstraction: Reduces observability; Partial/distributed implementation
Encapsulation: Scope escalation; Hierarchy integration
Genericity: Type / behaviour variability
Inheritance: Substitutability problem; Mixing inheritance styles; Deeply nested hierarchies; Multiple inheritance
Polymorphism: Incorrect binding in a homogeneous hierarchy; Server-side change

Table 7: Object-oriented features likely to cause faults as listed by Younessi

Dunsmore et al. Dunsmore et al. [8] produce a set of defect characteristics (in the vein of Ostrand and Weyuker [14]), where a single defect can be assigned several attributes. This was used to measure the effectiveness of different code reading techniques at detecting delocalised faults (faults that cannot be detected by looking at a module of code in isolation of the rest of the system). The set of characteristics is listed in table 8.

Binder. "The number of places to look is infinite for practical purposes, so any rational testing strategy must be guided by a fault model." [6]. Binder devotes a chapter of his testing book to the "bug hazards" of object-oriented code, providing detailed explanations of why faults can occur. He also provides a fault taxonomy which is provided in table 9 (we only list the implementation faults, although he also provides requirements, design and process faults), although again the empirical basis for this is uncertain.

3 Collecting Faults

This section describes our experiences of collecting, identifying and categorising faults. The open source projects used are identified, and the process for deriving a fault from its description is described. The potential problems of trying to immediately allocate a fault into a hierarchy are illustrated and a modified attribute classification scheme is proposed.

3.1 Project Choice

Open-source software foundries such as Sourceforge are a potent resource for discovering the nature of
Use of library class: Requires understanding of class libraries
Wrong object used: Sending message to wrong object
Wrong method called: Sending incorrect message
Incorrect parameter in method call: Incorrect parameters in method call
Algorithm computation: Error in algorithm
Data flow error: Incorrect / missing variable or incorrect value
Specification clash: Clash with specification
Omission: Missing code
Commission: Incorrect or superfluous code
Locality (area of code required to be looked at to spot the defect): Method, Class or System
Method size (size of method with defect present): Small (0-4 lines of code), Medium (5-10 lines of code) or Large (11+ lines of code)
Sequence diagram clash: Defect clashes with sequence diagram given to use-case inspectors

Table 8: Dunsmore et al.'s defect characteristics

software faults. Their relatively recent appearance eliminates what used to be a significant issue of obtaining 'real-life' software for academic analysis. As far as defects are concerned, tracking tools such as Bugzilla or the basic Sourceforge bug-tracking system are valuable because they record comments about the nature of any fault / failure from both the developer and user.

Three projects were chosen to demonstrate how to collect faults using our fault scheme. The criteria for the choices were that they must be open-source, object-oriented and must maintain a relatively detailed bug-reporting system. An example of a bug report from one of the systems is shown in figure 1. The three project choices were:

• Apache Ant - A Java-based build tool [1] (17 faults)
• JHotDraw - A Java GUI framework for semantic drawing editors [3] (26 faults)
• JFreeChart - A Java class library for generating charts [2] (28 faults)

3.2 Locating a Fault

Most bug reports report a failure, but many do not explicitly detail the actual faults in the source code. Usually there is not a clear one-to-one mapping between failures and faults. In this paper we establish this mapping by referring to a fault as the fix required to eliminate a particular failure[2].

Whilst collecting fault data, the fix to a failure often had to be investigated by the author if the bug report was not detailed enough. The simplest approach to determining a fix is to use a file differencing tool on subsequent CVS versions of the source code. The Eclipse IDE is particularly suitable for this task because it supports both CVS and file differencing [15]. This is illustrated in figure 2, where the two connected boxes in the middle highlight source code that has changed from one version to the next.

3.3 The Dangers of Premature Classification

Initially this study was carried out in order to validate a speculative fault hierarchy which was intended to summarise each fault by its position in the hierarchy. Hierarchies often seem the intuitive approach to categorising such data and it is tempting to try and use them. However, as has been noted earlier, the problem with software fault hierarchies is that they are prone to ambiguous, overlapping and incomplete categories [18, 10, 14, 5]. We illustrate this with an example inspired by one of the bug reports: Figure 3 shows a superfluous method call in an argument. If the type returned by getParent() (in the faulty version) is different from comp (in the correct version) and

[2] It is important to note that there are usually several possible fixes to a given fault. Fixes may also introduce new faults into the system. Without a specification, however, it must be assumed that (at the time of correction) the fix is as close as we will get to the developer's notion of a correct system.
Method Scope
  General:
    Message sent to object without corresponding method
    Unreachable code
    Contract violation (precondition/postcondition/invariant)
    Message sent to wrong server object
    Message priority incorrect
    Message not implemented in the server
    Formal and actual message parameters inconsistent
    Syntax error
  Algorithm:
    Inefficient, too slow, consumes too much memory
    Incorrect output
    Incorrect accuracy, excessive numerical or rounding error
    Persistence incorrect - wrong object saved or not saved
    Does not terminate
  Exceptions:
    Exception missing
    Exception incorrect
    Exception not caught
    Incorrect catch
    Exception propagates out of scope
    Exception not raised
    Incorrect state after exception
  Instance variable define/use:
    Missing object (referred to, but not defined)
    Unused object (defined but no reference)
    Corruption / inconsistent usage by friend function
    Missing initialization, incorrect constructor
    Incorrect type coercion
    Server contract violated
    Incorrect or missing unit (e.g. grams vs. ounces)
    Incorrect visibility scoping
    Incorrect serialisation, resulting in a corrupted state
    Insufficient precision / range on scalar type
Class Scope
  Incorrect method under multiple inheritance due to error
  Incorrect constructor or destructor
  Incorrect parameter(s) used in generic class
  Abstract class instantiated
  Syntax errors
  Association not implemented
Cluster / Subsystem Scope
  Incorrect priority
  Incorrect serialisation
  Message sent to destroyed object
  Inconsistent garbage collection
  Incorrect message / right object
  Correct exception / wrong object
  Wrong exception / right object
  Incorrect resource allocation / deallocation
  Concurrency problems
  Inadequate performance
  Deadlock

Table 9: Binder's fault taxonomy
Figure 1: Example Bug Report from JFreeChart
Figure 2: Comparing subsequent file versions in Eclipse
popUp(comp.getParent(), newLocation);

instead of

popUp(comp, newLocation);

Figure 3: Example of a fault

the called method, popUp, is overloaded, then one of the overloading methods might be executed instead. How would this be categorised using the object-oriented fault schemes mentioned in section 2.2? Using Purchase and Winder's scheme [16] (shown in table 3), it would probably be most appropriate to list this as an 'algorithmic bug'. Using Firesmith's scheme (shown in table 4), it would be classified as a 'failure associated with messages, exceptions, attributes or operations' and 'incorrect message passed to right object'. Using Mirsky et al.'s scheme (shown in table 5) there is no applicable category. Using Dunsmore et al.'s scheme (shown in table 8) it could be classified as 'wrong method called', 'incorrect parameter in method call' and 'commission'. Using Binder's taxonomy in table 9 the fault could be categorised under 'message sent to wrong server object', 'incorrect output' and 'server contract violated'. It is clear that categorisation is certainly not straightforward using any of these approaches because we have the (by no means uncommon) situation where either there is no fault category tailored to this particular fault, or several may be applicable. This is a dangerous situation since, through inaccurate assignment of faults, it leads to the population of inaccurate and unrepresentative hierarchies, which has severe implications for their use in the development and evaluation of verification and validation techniques.

3.4 Recording Faults

Ostrand and Weyuker suggest that their attribute categorisation scheme is more flexible and accurate than inserting faults into a hierarchy [14]. It must be noted though that attribute categorisation schemes are not without their problems: the ambiguities that arise when using fault hierarchies can still persist if individual attribute values are not orthogonal. As with the hierarchies, such ambiguities can result in inconsistent data entries.
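The overload hazard behind the Figure 3 example can be sketched in code. The classes and popUp overloads below are illustrative stand-ins, not the actual code from the bug report: the superfluous getParent() call changes the compile-time type of the first argument, so Java's overload resolution silently binds the call to a different method without any compile error.

```java
// Illustrative sketch of the Figure 3 fault; all names are hypothetical.
public class OverloadHazard {

    static class Component {
        Container getParent() { return new Container(); }
    }

    static class Container extends Component { }

    // Two overloads: which one runs is decided at compile time
    // from the static type of the first argument.
    static String popUp(Component comp, int newLocation) {
        return "popUp(Component)";
    }

    static String popUp(Container comp, int newLocation) {
        return "popUp(Container)";
    }

    public static void main(String[] args) {
        Component comp = new Component();
        // Correct call: binds to popUp(Component, int).
        System.out.println(popUp(comp, 0));             // popUp(Component)
        // Faulty call: the superfluous getParent() yields a Container,
        // so the other overload is selected silently.
        System.out.println(popUp(comp.getParent(), 0)); // popUp(Container)
    }
}
```

This is why the fault is hard to place in a single category: syntactically it is a superfluous call, but its observable effect is an incorrect message sent to the right object.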
The lack of restriction on classifications can pose problems as well. Depending on the number of attributes and values, the number of possible combinations used to describe faults is potentially vast, making collected data difficult to analyse.

This restriction can however also be seen as a degree of freedom. Allowing for the definition of faults by their attributes and values without a predefined set of categories makes this a very flexible technique to use in object-oriented programming, where little fault data has been collected. It allows for the potential use of data clustering techniques to discover new fault categories.

For our data collection we expanded Ostrand and Weyuker's original attributes (see section 2.1) to include some of the classifications mentioned in section 2.2 in order to adequately describe OO faults (remember that Weyuker and Ostrand's scheme was devised prior to the widespread adoption of OO programming). This resulted in the introduction of the "Scope" attribute and its associated values. New values were generated in the vein of Glass's self-categorising approach (see section 2.1): if a particular value didn't exist for a given fault attribute, it was created. Also, some of the original values were unused and have been dropped[3] from the scheme, or have been modified to more accurately reflect the OO paradigm (e.g. within the "Type" attribute, the "Data" value has been split into "Data Value" and "Data Type"). The final scheme is shown in table 10, where bold items represent additions or modifications to the original Ostrand and Weyuker scheme. Using the enhanced categorisation scheme introduced in table 10, the example fault described earlier would be categorised as follows:

• Major Category: Call (because the fault is a superfluous call)
• Type: Message (the type of fault is an erroneous message being sent from one object to another)
• Presence: Superfluous (to fix the fault the call must be removed)
• Use: Argument (the call is made in the context of an argument within another call)
• Scope: Method (the fix only affects code within a single method)

This attribute categorisation approach has been applied to the sample of 71 faults harvested from the three projects.
Tentative groupings were established by recording faults in a spreadsheet, where each row is a fault, and ordering the data so that similar rows are adjacent to each other (this notion of similarity is a subjective one which gives slightly more weight to the major category attribute, but otherwise is a simple count of common values). An extract of the spreadsheet is shown in figure 4. To ensure that the groupings correctly reflect fault characteristics, reports for each of the faults in a group were compared

[3] The values were dropped mainly to prevent the scheme from becoming cluttered. It is always possible that a new fault may relate to one of these dropped values, but the flexibility of the scheme always allows for the reintroduction of such values.
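The row-ordering heuristic described above (a count of shared attribute values, with extra weight on the major category) can be sketched as follows. The weighting constant and the attribute names are assumptions for illustration; the paper does not give exact details of the weighting used.

```java
import java.util.Map;

public class FaultSimilarity {
    // A fault description is a map from attribute name to value, e.g.
    // {"Major Category" -> "Call", "Type" -> "Message", ...}.
    // Similarity = number of attributes on which two faults agree,
    // with the major category counted double (an assumed weight).
    static int similarity(Map<String, String> a, Map<String, String> b) {
        int score = 0;
        for (Map.Entry<String, String> e : a.entrySet()) {
            if (e.getValue().equals(b.get(e.getKey()))) {
                score += e.getKey().equals("Major Category") ? 2 : 1;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        Map<String, String> f1 = Map.of("Major Category", "Call",
                "Type", "Message", "Presence", "Superfluous",
                "Use", "Argument", "Scope", "Method");
        Map<String, String> f2 = Map.of("Major Category", "Call",
                "Type", "Message", "Presence", "Incorrect",
                "Use", "Argument", "Scope", "Method");
        // Agreement on four attributes, with the shared major
        // category weighted double.
        System.out.println(similarity(f1, f2)); // prints 5
    }
}
```

Sorting rows by pairwise similarity of this kind is what brings candidate groups of related faults next to each other in the spreadsheet.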
Major Category: Algorithmic, Abstract class, Call, Concurrency fault, Declaration, Encapsulation, Event, Exception, Inheritance / Polymorphism
Type: Address, Predicate, Control, Data value, Loop, Message, Data type
Presence: Incorrect, Omission, Superfluous
Use: Argument, Boolean operation, Cast, Definition, Method, Object
Scope: Method, Class, System

Table 10: Enhanced attribute categorisation scheme
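One way to make the enhanced scheme concrete is as a small data type. The sketch below is an assumption about representation, not part of the paper: it models the five attributes of Table 10 as plain strings (a fuller version would restrict them to the values listed in the table) and encodes the Figure 3 example fault.

```java
public class FaultRecord {
    // The five attributes of the enhanced categorisation scheme
    // (Table 10). Free strings here; a real implementation would
    // validate each value against the table's permitted values.
    final String majorCategory, type, presence, use, scope;

    FaultRecord(String majorCategory, String type, String presence,
                String use, String scope) {
        this.majorCategory = majorCategory;
        this.type = type;
        this.presence = presence;
        this.use = use;
        this.scope = scope;
    }

    // The popUp example of Figure 3: a superfluous call made in an
    // argument position, whose fix affects only a single method.
    static FaultRecord figure3Example() {
        return new FaultRecord("Call", "Message", "Superfluous",
                "Argument", "Method");
    }

    public static void main(String[] args) {
        FaultRecord f = figure3Example();
        System.out.println(f.majorCategory + " / " + f.type + " / "
                + f.presence + " / " + f.use + " / " + f.scope);
    }
}
```

Because a fault is a tuple of attribute values rather than a node in a fixed tree, the same records can later be regrouped under any number of alternative classification schemes.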
Figure 4: Illustration of Fault Descriptions
and a summary of the fault was produced. The (unordered) summaries are provided in table 11.

3.5 Analysing Faults

Capturing faults in the manner we have illustrated permits further analysis to be carried out. Although we have argued against the use of hierarchies for the sake of recording faults because of their ambiguities, that does not mean that they should be ruled out entirely. As demonstrated by Kuhn (albeit on logical specifications) [12], they can be useful for reasoning about the aspects of software development concerned with faults, such as test set generation for fault-based testing [11, 6].

A tentative attempt was made to further cluster the faults in table 11 and insert them into a hierarchy, which is shown in figure 5. Again this was primarily a manual process, mainly due to the relatively small number of faults, coupled with the fact that using attributes (as opposed to numeric values) makes automatic clustering difficult, but it resulted in an increased number of categories when compared with the initial speculative hierarchy. This is testimony to the fact that the attribute categorisation scheme encourages the discovery of new fault types. The new hierarchy includes categories such as 'Event fault' and 'Superfluous call' that would have been difficult to categorise in the initial speculative hierarchy.

The list of faults presented in this case study is naturally not a complete list of faults that could possibly occur in an object-oriented environment. It is however a useful starting point for anyone looking for a collection of representative faults which can be used to evaluate a fault isolation approach such as software inspection or testing.

4 Conclusions and Future Work

There is a lack of information concerning the nature and distribution of faults that are likely to occur within OO systems, and this is hampering the development and evaluation of verification and validation techniques aimed at this paradigm.
Some classification schemes exist but most of these have a questionable empirical basis. Open source projects typically have lists of problem reports which represent a valuable source of data. Three open source projects were chosen and their problem reports mined to extract information on the faults attributed to the problem (this is not always a straightforward task).

The dangers of immediately classifying faults within a tree-like structure are illustrated, and instead the attribute categorization mechanism introduced by Weyuker and Ostrand is extended and adapted for the OO paradigm. This is employed to describe 71 faults identified in the open source projects and used to provide a tentative classification of common OO faults. Future work will concentrate on building up this database of faults and investigating the application of automatic clustering mechanisms to define a robust fault classification scheme.
UKTest 2005
Description                                                                Number Recorded
high level algorithm error (correction required in several methods)              3
not implementing interface when it should be                                     1
incorrectly implements an interface                                              2
missing functionality (requires additional data member)                          1
missing functionality (no clear fix)                                             1
inconsistent method (does not allow property that other methods allow)           1
high level algorithm error (correction required in single method)                6
missing functionality in method body                                             1
predicate condition is incorrect                                                 3
predicate is missing a condition                                                 9
statements are in wrong sequence                                                 1
data value is not correct                                                        3
bool op can never return true given data value                                   1
data value can cause failure without additional control structures               2
inefficient loop algorithm, shouldn't iterate through entire search space        1
not calling a method when it should                                              1
calls wrong method when defining a variable value                                1
incorrect method called                                                          4
incorrect argument value                                                         4
wrong object used in method call                                                 1
incorrect boolean expression as argument                                         1
omitted object in method call                                                    1
superfluous method call                                                          2
should not use method call in variable definition                                1
concurrency fault causes failure (deadlock / race condition)                     3
Method is visible to entire package instead of being protected                   1
subclass defines new methods instead of overriding existing ones                 1
method name breaks naming convention                                             1
missing functionality                                                            2
superfluous definitions (already declared in super-interface)                    1
wrong variable type declaration                                                  2
encapsulation error: data member can be manipulated via accessor                 1
does not fire events when it should                                              1
fails to raise exception                                                         3
incorrectly implements superclass                                                2
missing calls to superclass                                                      1

Table 11: Fault Summaries
UK Software Testing Research III
• Algorithm
  ⊲ Missing functionality
  ⊲ Control Flow
    – Predicate fault
      · Incorrect predicate
      · Predicate missing a condition
    – Incorrect exit clause for loop
    – Statements in wrong sequence
  ⊲ Data
    – Data value is not correct
    – Boolean operation can never return true for a given variable
    – Variable can cause failure without additional control structures
  ⊲ Call
    – Missing call
    – Incorrect method called
    – Incorrect object used in call
    – Superfluous method call
    – Incorrect argument value

• Abstract class (interface)
  ⊲ Not implementing an interface when it should
  ⊲ Incorrectly implementing an interface

• Concurrency fault (deadlock / race condition)

• Declaration fault
  ⊲ Method given wrong visibility
  ⊲ Subclass defines new methods instead of overriding existing ones
  ⊲ Method breaks naming convention
  ⊲ Superfluous declaration (e.g. interface declaring method that is already declared in base-interface)
  ⊲ Wrong variable type declaration

• Encapsulation fault (e.g. data member can be manipulated via accessor)

• Event fault (e.g. does not fire events when it should)

• Exception fault (e.g. fails to raise exception)

• Inheritance fault
  ⊲ Incorrectly implements superclass

Figure 5: Hierarchy of summaries
References

[1] The Apache Ant project. http://ant.apache.org/.
[2] JFreeChart. http://sourceforge.net/projects/jfreechart.
[3] JHotDraw. http://sourceforge.net/projects/jhotdraw.
[4] R. Alexander, J. Offutt, and J. Bieman. Syntactic fault patterns in object-oriented programs. In Proceedings of the Eighth IEEE International Conference on Engineering of Complex Computer Systems (ICECCS '02), pages 193-202, Greenbelt, Maryland, December 2002.
[5] B. Beizer. Software Testing Techniques. Van Nostrand Reinhold, 1990.
[6] R. Binder. Testing Object-Oriented Systems. Addison Wesley, 1999.
[7] R. Chillarege, I. Bhandari, J. Chaar, M. Halliday, D. Moebus, B. Ray, and M. Wong. Orthogonal defect classification - a concept for in-process measurements. IEEE Transactions on Software Engineering, 18(11):943-956, November 1992.
[8] A. Dunsmore, M. Roper, and M. Wood. The development and evaluation of three diverse techniques for object-oriented code inspections. IEEE Transactions on Software Engineering, 29(8):677-686, August 2003.
[9] D. Firesmith. Testing object-oriented software. In Proceedings of TOOLS, March 1993.
[10] R. Glass. Persistent software errors. IEEE Transactions on Software Engineering, 7(2):162-168, March 1981.
[11] J. Huffman Hayes. Testing of object-oriented programming systems (OOPS): A fault-based approach. In Proceedings of the International Symposium on Object-Oriented Methodologies and Systems (ISOOMS), number 858 in Springer-Verlag Lecture Notes in Computer Science series, pages 205-220, Palermo, Italy, September 1994.
[12] D. Kuhn. Fault classes and error detection capability of specification-based testing. ACM Transactions on Software Engineering and Methodology, 8(4):411-424, October 1999.
[13] L. Miller, J. Huffman Hayes, and S. Mirsky. Task 7: Guidelines for the verification and validation of artificial intelligence software systems. Technical report, United States Nuclear Regulatory Commission and the Electric Power Research Institute, 1993.
[14] T. Ostrand and E. Weyuker. Collecting and categorizing software error data in an industrial environment. Journal of Systems and Software, 4(4):289-300, November 1984.
[15] OTI. Eclipse platform overview. 2003.
[16] J. Purchase and R. Winder. Debugging tools for object-oriented programming. Journal of Object-Oriented Programming, 4(3):10-27, June 1991.
[17] M. Roper. Software Testing. McGraw Hill, 1994.
[18] R. Thibodeau. The state-of-the-art in software error data collection and analysis - final report. Technical report, General Research Corp., Huntsville, AL, 1978.
[19] H. Younessi. Object-Oriented Defect Management of Software. Prentice Hall PTR, 2002.
A Study of the Reciprocal Collateral Coverage Provided by Two Testing
Methods
D. F. Yates
(© 2005 D. F. Yates)
Information Systems and Databases Laboratory
Department of Informatics
Athens University of Economics and Business
Athens, Greece
and
N. Malevris
Department of Informatics,
Athens University of Economics and Business
Athens, Greece
Abstract
Branch testing, for example, seeks to cover all branches in a piece of code, but when it is performed, as well as branches, other program elements, such as statements or p-uses, will necessarily be covered. The contemporaneous coverage of these other elements is referred to as "collateral coverage". An understanding of the extent of the collateral coverage of the various program elements that is afforded by the available testing methods can be used to determine an optimal deployment of the methods in pursuance of the aim of covering one or more of those elements. In this paper, as a step towards such an understanding, the collateral coverage that branch testing yields in respect of JJ-paths, and that JJ-path testing yields in respect of branches, is investigated. The results of the investigation do, in fact, facilitate the definition of a policy for deploying the two methods when branch coverage, JJ-path coverage, or both of these is sought. However, perhaps surprisingly, they also indicate that the widespread
preference for branch over JJ-path testing may not be justified.
Keywords: branch testing, JJ-path testing, collateral coverage, test coverage.
1.0 Introduction
The extent of branch coverage that is achieved by a practical application of branch testing may be expressed using the Ter2 metric, Brown [2], as follows:

Ter2 = (number of branches that have been executed) / (total number of branches in the code)
Woodward et al. [19] introduced a family of structural testing methods based on the concept of a JJ-path or LCSAJ, where a JJ-path is a sequence of consecutively numbered basic blocks: p, (p+1), …, q, followed by a jump to a basic block numbered r, where r ≠ (q + 1). The first member of this family seeks to cover all JJ-paths in a unit of code, and the success of the method can be measured using the Ter3 metric, also introduced by Woodward et al.; the metric being defined analogously to Ter2.
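To make the definition concrete, the following minimal sketch enumerates the JJ-paths of a code unit from a description of its basic-block numbering and jumps. The `jump_targets` and `falls_through` inputs are hypothetical abstractions introduced purely for illustration; they are not part of the tooling used in this paper.

```python
def lcsajs(num_blocks, jump_targets, falls_through):
    """Enumerate JJ-paths (LCSAJs) as (p, q, r) triples: control enters
    at block p, falls through consecutively numbered blocks up to q,
    then jumps from q to block r with r != q + 1.

    jump_targets[q]  -> blocks reachable from block q by an explicit jump
    falls_through(q) -> True if control can continue from q into q + 1
    """
    # A linear sequence can start at block 1 or at any jump target.
    starts = {1} | {r for ts in jump_targets.values() for r in ts}
    paths = []
    for p in sorted(starts):
        q = p
        while q <= num_blocks:
            for r in sorted(jump_targets.get(q, ())):
                if r != q + 1:        # a genuine jump, not fall-through
                    paths.append((p, q, r))
            if not falls_through(q):  # sequence cannot extend past q
                break
            q += 1
    return paths
```

For example, a four-block unit in which block 2 can jump to block 4 and block 4 always jumps back to block 1 yields the JJ-paths (1, 2, 4), (1, 4, 1), and (4, 4, 1).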
The influence of work on branch testing, of which [1], [7], [9], [16], and [23] represent but a small sample, has been significant in its becoming viewed as a minimum requirement for the testing of a piece of software. Branch testing does have its limitations, [4] and [5]; nevertheless, it is popular, as is reflected in the fact that it is a testing requirement in many published software standards, see [3] and [14] for example. JJ-path testing has substantially fewer adherents, and few testing standards require that it be performed; [14] is an exception, although it is mentioned in [15].
In Malevris and Yates [11], the concept of “collateral coverage” was introduced.
It may be defined as follows.
Definition 1
Let M be a testing method that specifically aims at covering some element (feature), E
say, of program code. If as a result of applying M to a unit of code in pursuance of
covering element E, some coverage of another element, H, is also realised, then this
additional coverage is defined to be the collateral coverage of H that is achieved by
M.
Upon its application, every testing method will achieve collateral coverage, which
may be extensive, of a number of different structural elements of a piece of code. It
would be unwise and inefficient not to take some advantage of this. What is more, if
the relative abilities of the various testing methods in achieving collateral coverage are
known, further advantage may clearly be gained by optimally sequencing the
application of those methods. The work presented in this paper is intended to be one
step towards assessing such relative abilities. The step taken involves an investigation
of branch and JJ-path testing in this respect. Specifically, the collateral coverage of
branches by JJ-path testing, and of JJ-paths by branch testing, is compared with
respectively the branch coverage derived by branch testing, and the JJ-path coverage
resulting from JJ-path testing. Thus, in effect, it is an investigation of the reciprocal
collateral coverage of branch and JJ-path testing. Further, since it too may prove to
be relevant in determining an optimal sequence for the application of the two
methods, the “aggregate coverage”, (branch coverage +JJ-path coverage) achieved by
the methods is also compared.
The paper is structured as follows. After introducing necessary definitions and
concepts, and defining the path generation method used in the experiments in section
2 of the paper, the sample and experimental regime are detailed in section 3. The
results of the experiments themselves, which entailed the generation and attempted
execution of well over 6,000 program paths, are presented in section 4. An overview
and discussion of the important finding of the experimental work in section 5 then
concludes the report.
2.0 Definitions and Path Generation Method
Consider C, a code unit (a program, procedure, subroutine, or other delimited and
distinguished section of program code) with n basic blocks. If C is to undergo branch
testing, an appropriate model of C's structure upon which path selection can be based
is “the control flow graph” of C, see Paige [13].
Definition 2

The control flow graph Gc = (V, A) of C is a connected directed graph in which there is a unique vertex vj ∈ V for each basic block j of C, and arc vivj ∈ A iff there is a transfer of control flow (branch) in C from block i to block j.
As a code unit may have one or more entry points, and one or more exit points, Gc may possess one or more sources (vertices with no in-coming arc), and one or more sinks (vertices with no out-going arc) respectively. It will be assumed here that Gc contains a unique source denoted S and a unique sink denoted F. Should this not be the case in practice, the existence of multiple entry and/or exit points in C can be addressed by introducing into Gc a super-source and/or super-sink, as is appropriate, together with associated arcs, see [20] for example.

Under this assumption, it can be seen that there is a one-to-one correspondence between the program paths (paths from the entry to the exit point) of C and the paths from S to F in Gc (S-to-F paths). Consequently, executing one of C's program paths is equivalent to 'executing', or covering, the corresponding S-to-F path in Gc and therefore, its constituent branches.
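Under the unique source and sink assumption, this correspondence can be illustrated by enumerating S-to-F paths directly from Gc's arc list. The sketch below handles simple (loop-free) paths only and is purely illustrative; it is not the path generation method the paper actually uses, which is defined in section 2.2.

```python
from collections import defaultdict

def s_to_f_paths(arcs, source="S", sink="F"):
    """Enumerate every simple S-to-F path in a control flow graph given
    as a list of (from_vertex, to_vertex) arcs. Each path covers its
    constituent arcs (branches), mirroring the correspondence between
    program paths of C and S-to-F paths of Gc."""
    succ = defaultdict(list)
    for u, v in arcs:
        succ[u].append(v)
    paths, stack = [], [(source, [source])]
    while stack:  # iterative depth-first search
        v, path = stack.pop()
        if v == sink:
            paths.append(path)
            continue
        for w in succ[v]:
            if w not in path:  # simple paths only; loops need unrolling
                stack.append((w, path + [w]))
    return paths
```

For a graph with arcs S→a, S→b, a→F, b→F this yields the two S-to-F paths S-a-F and S-b-F.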
Unfortunately, coverage of a path in Gc is not necessarily equivalent to executing
the corresponding program path in C since the program path may be 'infeasible', that
is, no data set exists that will force its execution. Consequently, a set of S-to-F paths
selected to give a specific value of Ter2 may not realize that value when an attempt is
made to test the corresponding program paths in C. The existence and influence of
infeasible paths will necessarily play a major role in what follows.
If, rather than branch testing, C is to undergo JJ-path testing, the control flow graph of C is not the most convenient model on which to base test path generation. In this situation C's JJ-graph, GJ, is more appropriate, see [18] and [21].
Definition 3
The JJ graph GJ = (V, S) of C is a connected directed graph in which there is a unique vertex vj ∈ V for each JJ-path j of C, and arc vivj ∈ S iff there is a transfer of control flow (jump) in C from JJ-path i to JJ-path j.
It is not unusual for GJ to have one or more sources and/or sinks, see Woodward [18]. However, the same expedient as for the control flow graph, namely, that of introducing appropriately a super-source and/or super-sink, together with the necessary associated arcs, may be adopted. Therefore, it will be assumed here that GJ contains a unique source, S, and a unique sink, F. Given the definition of GJ it is not difficult to understand that, with this assumption, the execution of a program path, and therefore, its constituent JJ-paths, is equivalent to covering the vertices and arcs of the corresponding S-to-F path.
2.1 Improvements on the Teri
By definition, full branch and full JJ-path coverage is achieved only when the
corresponding Ter metric achieves the value unity. However, given that every basic block involves a predicate, and that, in general, any program path will involve several basic blocks, there exists the possibility of there being no test data that will simultaneously satisfy all predicates associated with a specific branch, or JJ-path. In the presence of such infeasible branches and infeasible JJ-paths, less than accurate information is, therefore, provided by the Ter metrics. However, consider Ter*i, i = 2, 3, the relative Teri metrics, which are defined as:

Ter*i = (number of elements of C covered) / (number of elements of C not proven to be infeasible)
where the term ‘elements’ refers to branches, or JJ-paths, as is appropriate. (It is noted that similar ‘relative’ metrics can be defined for any testing method.)

Clearly, in order to evaluate any Ter*i, it is necessary to know the number of corresponding infeasible elements in the software under test. Testing, however, is essentially a sequential process. Thus, there may come a time when a certain element is known to be infeasible, and at that point, an error may potentially have been highlighted, but thereafter, even if no error exists, Ter*i will provide a more accurate indicator of the coverage actually achieved than Teri. It is for this reason that the Ter*i metrics, rather than the Teri, will be used in reporting the experimental results that have been derived.
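As a minimal sketch of the two measures (element counts are assumed to be known; in practice, establishing how many elements are infeasible is the hard part, as the text notes):

```python
def ter(covered, total):
    """Plain Ter_i: fraction of elements (branches for i = 2, JJ-paths
    for i = 3) covered, out of all elements in the code unit."""
    return covered / total

def relative_ter(covered, total, known_infeasible):
    """Relative Ter*_i: coverage measured against only those elements
    not (yet) proven infeasible, as defined in section 2.1."""
    return covered / (total - known_infeasible)
```

For a unit with 20 branches of which 15 are covered and 4 are proven infeasible, Ter2 = 0.75 while Ter*2 = 15/16 ≈ 0.94, a more accurate indicator of the coverage actually achievable.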
2.2 The Path Selection Method
The aim of this paper is to compare certain characteristics of the performance of two
testing methods. Such a comparison will (necessarily) be afforded only as a result
of generating sets of test paths. If an unbiased comparison is to be made, it is
imperative, therefore, that the method adopted for the generation of such path sets also
be unbiased. This section only outlines the method that was adopted in deriving the
experimental results reported herein, as a detailed specification has been published
elsewhere in Yates and Malevris [21], and [22].
In an attempt to derive a heuristic for selecting feasible test paths a priori, Yates and Hennell [20] advanced, and argued, the proposition that: a program path that involves q ≥ 0 predicates is more likely to be feasible than one involving p > q. The formal statistical investigation of this that was undertaken in Malevris et al. [12] concluded, with great statistical significance, that the feasibility of a path decays exponentially with the increasing number of predicates it involves. As a result, Yates
and Malevris [22] proposed a path selection method, extending that of Yates and
Hennell. Although initially introduced to support branch testing, the method was
founded only upon a consideration of the number of predicates on a program path.
Thus, the method does not seek optimisation in respect of any one testing criterion, and thus, may be used validly, and without bias, when attempting to fulfil any testing criterion. If, for purposes of this paper, a code element is taken to refer to either a branch, or a JJ-path, the method may be summarised as follows.
1. Generate a set of program paths, Π, whose constituent paths each involve a minimum number of predicates, and which, in the absence of infeasible paths, would cover the elements of code unit C.

2. Derive the value of Teri resulting from executing C with test data corresponding to those paths in Π that are feasible.

While condition repeat step (3):

3. Select an uncovered element, E, of C, and successively generate π_E^K, K = 1, 2, …, until for some K = λ, π_E^λ is found to be feasible, and then recalculate the value of Teri.

Here, π_E^K denotes that path through element E that involves the Kth smallest number of predicates.
In order to generate Π and the π_E^K for branch and JJ-path testing, use can be made of the fact that corresponding paths exist in respectively the control flow graph and the JJ-graph. These paths can be found by generating the kth shortest paths (in terms of the number of predicates involved) through the arcs of the control flow graph for branch testing, and the vertices of the JJ-graph for JJ-path testing. Here k = 1 when Π is to be constructed, and k = K otherwise, and standard graph theoretic algorithms are available in each case, see Gondran and Minoux [6], for example.

All that is now required to define the instantiation of the above path generation method that was used in the experiments are details of condition, and the definition of the criterion used to determine the order in which uncovered elements are selected in step (3). Details of the choices made are now given.
Although it would be ideal to select condition to be Teri < 1, for reasons of practicality it was found to be necessary to curtail the testing of certain units. This was achieved by taking condition to be: while Teri < 1 and K < 300. Thus, only a maximum of 300 paths, in addition to those contained in the initial path set, could be generated by any of the methods.
When there is more than one uncovered element to be treated in step 3 of the
method, in which order is it most appropriate to attempt to cover them? This question
was answered by making use of the substantiated thesis that: a program path that
involves q ≥ 0 predicates is more likely to be feasible than one involving p > q.
Specifically, at the beginning of each iteration of step 3, the lengths, again in terms of
the number of predicates involved, of the shortest untried path through the extant
uncovered elements, are compared. The ensuing attempt to increase coverage then
involves that element which corresponds to the shortest of these paths. Should this
process result in a tie, one from amongst the tying candidates is then selected at
random.
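Putting steps 1-3 and the ordering rule together, the selection loop might be sketched as follows. Here `paths_through` and `is_feasible` are hypothetical helpers standing in for, respectively, the k-shortest-path generation over the relevant graph and the symbolic-execution feasibility check; a path is represented simply as a tuple of its predicates, and the bookkeeping is heavily simplified relative to the experimental tooling.

```python
import random

def cover_elements(elements, paths_through, is_feasible, k_limit=300):
    """Sketch of step 3 of the path selection method: while uncovered
    elements remain, attempt the element whose shortest untried path
    involves fewest predicates; generate successively longer paths
    through it until one is feasible or the K < 300 cut-off is hit.

    paths_through(E) -> candidate paths through element E, ordered by
        increasing number of predicates (hypothetical helper).
    is_feasible(path) -> True if test data exists for the path.
    """
    candidates = {e: list(paths_through(e)) for e in elements}
    next_k = {e: 0 for e in elements}   # index of next untried path
    covered, selected = set(), []
    while len(covered) < len(elements):
        live = [e for e in elements
                if e not in covered
                and next_k[e] < min(len(candidates[e]), k_limit)]
        if not live:
            break  # remaining elements curtailed by the cut-off
        # Compare the shortest untried path through each live element;
        # ties are broken at random, as in the experimental regime.
        shortest = min(len(candidates[e][next_k[e]]) for e in live)
        e = random.choice([x for x in live
                           if len(candidates[x][next_k[x]]) == shortest])
        path = candidates[e][next_k[e]]
        next_k[e] += 1
        if is_feasible(path):
            covered.add(e)
            selected.append(path)
    return covered, selected
```

The loop always spends its next attempt on the cheapest-looking element, in line with the thesis that paths with fewer predicates are more likely to be feasible.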
3.0 The Test Sample and Experimental Regime
The results that are reported in section 4 were derived by testing a sample of code
units written in Fortran. Although Fortran is still employed in many organisations, its
usage, relative to that of other available languages, has diminished most significantly
over the last twenty-five years. There then arises the question of whether the results
and corresponding conclusions derived herein have any general significance or
relevance. This question is addressed by the following.
Suppose that infeasible paths do not exist. In such circumstances, a valid path set
derived to fulfil a testing criterion would, when executed, automatically achieve its
aim. Unfortunately, infeasible paths do exist! Therefore, it is only because of their
existence that full coverage in respect of a testing criterion may not be met in practice,
and it is also clear that this must be true irrespective of the language in which the
software under test has been encoded. Correspondingly, it is relevant to ask whether
infeasible paths are more likely to occur in code written in one programming language
than in another. In the substantially sized study performed by Malevris et al. [12] of
the feasibility, or otherwise, of program paths, statistical analysis of the results
showed that with a certainty of 99.95%, the potential infeasibility of a program path is
characterised by the number of predicates that it involves. Given such a definitive
result, one of two possibilities must obtain. Either, the encoding language plays no, or
at most, a highly insignificant and almost imperceptible role in influencing the
feasibility of a program path, or, the language itself is instrumental in deciding the
number of predicates involved in a path.
Consider the second of these possibilities, and also consider an arbitrary
generalised algorithm (not a program) for processing certain data sets. In general,
subsets of the data will be processed differently by the algorithm depending on the
specific characteristics of the subsets, and differentiation in processing will be
achieved via the use of tests (predicates). Without loss of generality, assume that it
requires x > 0, say, predicates to differentiate the processing of one of the subsets, A
say, from the others. Further, suppose that a programmer is asked to encode the
algorithm in one imperative language and then transliterate to produce another
encoding in a second imperative language. Under the relatively mild assumptions that
in the two languages, the decimal accuracy with which a real value is represented, and
with which arithmetic is performed on such values, is the same, then in the two
encodings y ≥ x tests (y = x if the programmer codes efficiently) will be needed to
distinguish the processing of subset A from that of other data subsets. Now this will
be true irrespective of the two imperative languages used in the encodings, and will
also obtain for all valid data subsets that the code is designed to process. Therefore,
the two programs produced by the programmer will use the same number of tests to
distinguish the processing of each specific pair of data subsets, and the same total
number of tests to distinguish the processing of any one data subset. Thus, under the
above assumptions, the idea that the encoding language influences coverage should be
eschewed.
Correspondingly, the authors contend that the language in which software is
encoded is immaterial, or at worst insignificant, where results derived from
experiments such as those reported below are concerned, and that such results may
justifiably be viewed as being both relevant and generally applicable. Work aimed at substantiating this contention experimentally forms part of the authors' on-going research.
3.1 The Experimental Regime
The experimental results reported in the succeeding section, were derived as a result
of testing a set of 35 Fortran subroutines chosen in a pseudo-random manner from the
NAG library. The subroutines have thus been employed extensively in industry, in
research establishments, and for various academic purposes. In all, the investigation
of well over 6000 program paths was entailed. A profile of the sample of code units,
detailing the number of branches, and JJ-paths that each contains, is presented as table
1.
Each of the 35 units was first subjected to branch testing. The path generation method adopted was that described in section 2. The values of Ter2 and Ter3 achieved for each routine at the end of step 1 of the path generation method, herein referred to as the initial coverage in respect of the units, were recorded. The increased values of the metrics resulting from the coverage of each additional branch in step 3 of the method were also recorded; the last such pair of values to be recorded being referred to as the final coverage in respect of the units. An analogous regime was then adopted for JJ-path testing.

Once the testing of all of the routines had been completed, the values of Ter*2 and Ter*3, corresponding respectively to each of the recorded values of Ter2 and Ter3, were derived.
In order to generate the control flow and JJ graphs upon which the path
generation methods rely, the TESTBED tool, [17], was used. The test paths
themselves were generated using the ESPM tool, [10], which embodies the path
selection method of section 2, and corresponding sets of test data (for the feasible
paths) were derived using the VOLCANO symbolic execution system of Koutsikas
and Malevris [8].
Profile of the sample

Unit   No. of     No. of       Unit    No. of     No. of
       branches   JJ-paths             branches   JJ-paths
 1       9          8           19      21         17
 2      12         12           20      12         12
 3       4          5           21      12         13
 4       9         10           22       6          7
 5      33         36           23      16         18
 6      16         14           24      36         32
 7      21         19           25      15         16
 8       9          9           26      10          9
 9      18         16           27      21         21
10      11         12           28      20         19
11      27         25           29      18         16
12      38         35           30      50         46
13      25         23           31      15         18
14      22         19           32      45         41
15      31         26           33      19         18
16      30         24           34      17         14
17       3          3           35      25         20
18      20         19           Total  696        657

Table 1.
4.0 Results and Discussion
The results derived as a result of the experiments are presented in three subsections.
Sections 4.1 and 4.2 focus on the relative effectiveness of the testing methods in
respect of branch and JJ-path coverage respectively, whereas section 4.3 addresses
their relative effectiveness in terms of aggregate coverage.
4.1 Branch Coverage

The initial branch coverage of each of the 35 code units as achieved by the testing methods is reported, in terms of Ter*2, in table 2. The table's size is substantial, and consequently it is placed in the appendix. However, an inspection of the table reveals certain salient points, and these are summarised in table 3.
Initial coverage of branches

Method           Total no. of     Mean no. of paths   No. of units for   No. of units for   Range of Ter*2    Mean value of Ter*2
                 paths generated  per unit generated  which Ter*2 = 1    which Ter*2 = 0    for other units   for all 35 units
Branch testing   218              6.229               13                 2                  [0.125, 0.933]    0.695
JJ-path testing  324              9.257               14                 2                  [0.125, 0.933]    0.717

Final coverage of branches

Method           Total no. of     Mean no. of paths   No. of units for   No. of units for   Range of Ter*2    Mean value of Ter*2
                 paths generated  per unit generated  which Ter*2 = 1    which Ter*2 = 0    for other units   for all 35 units
Branch testing   2455             70.143              29                 0                  [0.5, 0.789]      0.947
JJ-path testing  4096             117.029             29                 0                  [0.278, 0.818]    0.935

Table 3.
The values of the final branch coverage achieved by the testing methods in respect of
the test sample are also given in table 2, and in summary, in table 3. Attention is
drawn to the fact that for two of the units, no initial branch coverage was achieved. In
both cases the reason for this was the existence of nested loops in the code, the
innermost of which needed to be executed more times than was catered for by the
initial set of test paths.
Focussing upon the last column of table 3, the values suggest that branch testing
gives a worse return, in the mean, than JJ-path testing as measured by the initial
coverage, but that the situation is reversed when final coverage is considered.
However, these levels of coverage should also be assessed in the light of the number
of test paths generated in achieving them (column 2 of the table).
In order to facilitate further investigation, the mean values of the quantities of
interest were derived as follows.
Denote by Ter*2,j,k the value of Ter*2 < 1 achieved in respect of code unit j after a total of k paths have been generated for it in the application of one of the testing methods. Now, for some value of k, Qj say, full branch coverage of unit j will be achieved, and thus, define:

Ter*2(k, j) = Ter*2,j,k if k < Qj, and 1 otherwise.

Further, define:

P(k, j) = k if k < Qj, and Qj otherwise,

from which it can be seen that, for code unit j, the testing method achieves a value of Ter*2 = Ter*2(k, j) after P(k, j) paths have been generated in respect of testing it. Hence, using these, it can be seen that after a mean of PM(k) paths per code unit have been generated, where:

PM(k) = (1/35) * Σ_{j=1}^{35} P(k, j),

T*2(PM(k)), the mean value for Ter*2 achieved by testing the 35 code units, is:

T*2(PM(k)) = (1/35) * Σ_{j=1}^{35} Ter*2(k, j).
Note

For simplicity and conciseness in what follows, T*2(PM(k)) and PM(k) will be written as T*2(PM) and PM respectively; their dependence upon k being tacitly understood.
A trend line relating the values of T*2(PM) and PM that were obtained from the experiments was generated for each testing method. These values are given in the appendix as table 4. In deriving the trend lines, it was essential that the range PM ∈ [9.257, 70.143] be chosen. This choice facilitates an unbiased comparison of the methods, because the results for them were derived in different ranges: PM ∈ [6.229, 70.143] and PM ∈ [9.257, 117.029], for branch and JJ-path testing respectively (see columns 2 and 4 of table 4).
Of the various functions that were fitted to these data values, one of the form T*2(PM) = (A1 + A2/PM + A3 e^(-PM))^(-1), where A1, A2 and A3 are constants, proved to be the best for both methods. The resulting trend lines are defined by:

T*2(PM) = (1.0044056 + 2.8653818/PM + 466.55693 e^(-PM))^(-1) for branch testing, and
T*2(PM) = (1.014394 + 3.176849/PM + 534.84016 e^(-PM))^(-1) for JJ-path testing.

As evidenced by the corresponding values of R², 0.996 and 0.997 (3 dec. places) respectively, the regression lines represent an extremely good fit, and therefore may be relied upon. The plot of T*2(PM) against PM together with the corresponding regression line for branch testing is given in figure 1 for purposes of illustration.
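Since the reciprocal of the fitted form, 1/T*2(PM) = A1 + A2/PM + A3 e^(-PM), is linear in the three constants, such a trend line can be recovered by ordinary least squares. The sketch below is illustrative only: the data points are synthesised from the reported branch-testing constants, not taken from table 4.

```python
import numpy as np

def trend(p, a1, a2, a3):
    """Trend-line family used in the paper:
    T*(P_M) = (A1 + A2/P_M + A3 * exp(-P_M))**(-1)."""
    return 1.0 / (a1 + a2 / p + a3 * np.exp(-p))

def fit_trend(p, t):
    """Fit A1, A2, A3: the reciprocal 1/T is linear in the basis
    functions 1, 1/P and exp(-P), so plain least squares suffices
    (no iterative nonlinear fitting is needed)."""
    X = np.column_stack([np.ones_like(p), 1.0 / p, np.exp(-p)])
    coeffs, *_ = np.linalg.lstsq(X, 1.0 / t, rcond=None)
    return coeffs

# Illustrative check using the branch-testing constants reported in
# section 4.1; the p values are arbitrary points in the fitted range.
p = np.array([9.257, 15.0, 25.0, 40.0, 70.143])
t = trend(p, 1.0044056, 2.8653818, 466.55693)
a1, a2, a3 = fit_trend(p, t)
```

Because the synthetic data lie exactly on the curve, the fit recovers the three constants, confirming the linear-in-reciprocal formulation.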
A calculation using the trend lines straightforwardly reveals that branch testing provides a better mean branch coverage than does JJ-path testing in the entire range PM ∈ [9.257, 70.143]. Specifically, the mean branch coverage that branch testing yields is 3.691% greater than that obtained using JJ-path testing at PM = 9.257, but this percentage diminishes in the range of interest: at PM = 21.165 it is 2% greater, for example, and at PM = 70.143, it is only 1.38% greater. Consequently, it may be said that, in most of the experimental range, the mean branch coverage achieved by branch testing is less than 2% more than the mean collateral branch coverage achieved by JJ-path testing.
Figure 1. A plot of T*2(PM) against PM for branch testing (diamonds), and the corresponding regression line (dots).
It is also noted that, because of the form of the regression lines, exactly the same
comments can be made about the performance of the two testing methods when
branch coverage per unit test path, rather than branch coverage itself, is considered.
4.2 JJ-path Coverage

Both the initial and the final JJ-path coverage of each of the 35 code units that was achieved by branch and JJ-path testing is reported, in terms of Ter*3, in table 5. This table too has been placed in the appendix because of its size. However, a synopsis of the important values contained therein is presented in table 6.

The entries in the last column of table 6 do indeed indicate that, as measured by Ter*3, the initial and final return from JJ-path testing exceeds that of branch testing. As in the case of branch coverage, each of these values belies the effort needed to achieve it, that is, when taking the number of test paths that were generated (column 2 of the table) into account.
Initial coverage of JJ-paths

Method           Total no. of     Mean no. of      No. of units      No. of units      Range of Ter_3^*   Mean Ter_3^*
                 paths generated  paths per unit   with Ter_3^* = 1  with Ter_3^* = 0  for other units    for all 35 units
Branch testing   218              6.229            3                 2                 [0.125, 0.882]     0.592
JJ-path testing  324              9.257            12                2                 [0.125, 0.952]     0.665

Final coverage of JJ-paths

Method           Total no. of     Mean no. of      No. of units      No. of units      Range of Ter_3^*   Mean Ter_3^*
                 paths generated  paths per unit   with Ter_3^* = 1  with Ter_3^* = 0  for other units    for all 35 units
Branch testing   2455             70.143           7                 0                 [0.343, 0.909]     0.823
JJ-path testing  4096             117.029          25                0                 [0.171, 0.882]     0.892

Table 6.
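As a quick arithmetic check on table 6, the mean number of paths per unit is simply the total number of generated paths divided by the 35 code units. A minimal Python sketch (illustrative only; the figures are those in the table):

```python
# Consistency check for table 6: mean paths per unit = total paths / 35 units.
# Rounding to 3 decimal places reproduces the published means.
totals = {
    ("branch", "initial"): 218,
    ("jj-path", "initial"): 324,
    ("branch", "final"): 2455,
    ("jj-path", "final"): 4096,
}
UNITS = 35
means = {k: round(v / UNITS, 3) for k, v in totals.items()}
print(means)
```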
The level of coverage of JJ-paths achieved by the two methods was investigated in a
manner similar to that used for branch coverage. Thus, defining T_3^*(P_M(k))
analogously, T_3^*(P_M(k)) is understood to be the mean value of Ter_3^* achieved
after testing the 35 code units when P_M(k) paths per unit have been generated. Again
for simplicity and conciseness, T_3^*(P_M(k)) and P_M(k) will be written respectively
as T_3^*(P_M) and P_M in what follows.
In order to relate the experimental values of T_3^*(P_M) and P_M, these being given in
the appendix as table 7, a trend line in the range P_M ∈ [9.257, 70.143] was derived for
each testing method. As before, a function of the form (A_1 + A_2/P_M + A_3 e^{-P_M})^{-1}
provided the best fit. The regression lines derived for branch testing and JJ-path
testing respectively,

T_3^*(P_M) = (1.1559459 + 3.3515703/P_M + 159.19472 e^{-P_M})^{-1} and
T_3^*(P_M) = (1.060645 + 3.4141592/P_M + 993.92324 e^{-P_M})^{-1},

both have an associated R^2 value of 0.997, and this clearly indicates that each line
fits the corresponding data very well indeed.
Using the equations of the trend lines, it is found that JJ-path testing yields better
JJ-path coverage than branch testing over the entire range P_M ∈ [9.257, 70.143]. In
relative terms, JJ-path testing is 0.579% more effective than branch testing at
P_M = 9.257; this rises swiftly to 5% at P_M = 10.854, and to 7.843% at P_M = 70.143. It
may therefore be concluded that the extent of the mean collateral cover of JJ-paths
provided by branch testing is not commensurate with the level of coverage achieved
by JJ-path testing. Again because of the form of the regression lines, these comments
apply identically when the relative performance of the methods in respect of
coverage per unit test path is considered.
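The relative figures quoted in this section can be reproduced directly from the two trend lines. The Python sketch below is illustrative and not part of the original study; the model form and coefficients are those of the regression lines given above, and the percentage advantage is computed relative to the JJ-path value, which matches the quoted 0.579%, 5% and 7.843% figures:

```python
import math

# Trend lines for mean JJ-path coverage, T3*(PM), of the form
# (A1 + A2/PM + A3*exp(-PM))**-1, for branch and JJ-path testing.
def t3_branch(pm):
    return 1.0 / (1.1559459 + 3.3515703 / pm + 159.19472 * math.exp(-pm))

def t3_jjpath(pm):
    return 1.0 / (1.060645 + 3.4141592 / pm + 993.92324 * math.exp(-pm))

def relative_advantage(pm):
    """Percentage by which the mean JJ-path coverage of JJ-path testing
    exceeds the collateral coverage of branch testing, relative to the
    JJ-path figure."""
    b, j = t3_branch(pm), t3_jjpath(pm)
    return 100.0 * (j - b) / j

for pm in (9.257, 10.854, 70.143):
    print(pm, round(relative_advantage(pm), 3))
```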
4.3 Aggregate Coverage

By summing the relevant columns in tables 2 and 5, the values of the initial and final
aggregate coverage of each of the 35 units may be deduced straightforwardly.

Using the definitions of T_2^*(P_M) and T_3^*(P_M), the aggregate mean coverage that
is achieved by each of the testing methods is given by:

T^*(P_M) = T_2^*(P_M) + T_3^*(P_M)
The values of P_M and T^*(P_M) derived experimentally may be deduced by summing
the corresponding entries in tables 4 and 7. Using these, a trend line relating
T^*(P_M) and P_M was derived for each method, as before. The lines

T^*(P_M) = (0.53743396 + 1.5449193/P_M + 167.52041 e^{-P_M})^{-1} for branch testing, and
T^*(P_M) = (0.51848991 + 1.6462876/P_M + 374.87559 e^{-P_M})^{-1} for JJ-path testing,

each represent a good model of the aggregate coverage attained by the corresponding
method, since the associated R^2 values are both 0.997. Again by performing simple
calculations on the trend lines, it was found that branch testing yields an aggregate
coverage that is superior to that of JJ-path testing in the rather small range P_M ∈
[9.257, 10.06]; after P_M = 10.06, the situation is reversed. The performance of the
methods in relative terms shows that, in the range of interest, JJ-path testing is
between (–1.64)% and 3.127% more effective than branch testing, being 2% or more
effective for P_M > 16.179. Therefore, it may be said that, in the mean, for all values of
P_M > 10 (approximately), JJ-path testing provides a superior level of aggregate
coverage to branch testing. Identical comments can be made, and the same
conclusion drawn, in respect of aggregate coverage per test path.
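The crossover point P_M = 10.06 can be recovered numerically from the two aggregate trend lines, for example by bisection on their difference. A sketch in Python (illustrative only; the coefficients are those of the aggregate regression lines above):

```python
import math

# Aggregate-coverage trend lines T*(PM) = (A1 + A2/PM + A3*exp(-PM))**-1
# for branch testing and JJ-path testing.
def agg_branch(pm):
    return 1.0 / (0.53743396 + 1.5449193 / pm + 167.52041 * math.exp(-pm))

def agg_jjpath(pm):
    return 1.0 / (0.51848991 + 1.6462876 / pm + 374.87559 * math.exp(-pm))

def crossover(lo=9.257, hi=20.0, tol=1e-6):
    """Bisection on the difference of the two trend lines: below the root,
    branch testing gives the higher aggregate coverage; above it, JJ-path
    testing does."""
    f = lambda pm: agg_jjpath(pm) - agg_branch(pm)
    assert f(lo) < 0 < f(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

print(round(crossover(), 3))
```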
5.0 Conclusions

In this paper, an experimental investigation has been undertaken of the extent of the
collateral coverage of branches provided by JJ-path testing, and of JJ-paths provided
by branch testing. Specifically, the mean collateral coverage provided by each testing
method has been compared with the mean “natural” coverage provided by the other.
The levels of the mean aggregate coverage (branch coverage + JJ-path coverage)
achieved by the methods have likewise been compared. The values of Ter_2^*, Ter_3^*,
and (Ter_2^* + Ter_3^*) that were used to effect the comparisons were derived by
applying branch testing and JJ-path testing to a sample of 35 units of code and
recording the relevant statistics.
The results of the comparisons can be summarised as follows.
(1) Branch coverage
The results show that if fewer than 71 test paths per unit (including infeasible paths)
need to be generated, the mean branch coverage achieved by branch testing always
exceeds the mean collateral branch coverage provided by JJ-path testing. The
difference in mean branch cover is 3.691% at P_M = 9.257, which reduces to 2% at
P_M = 21.165, and thence to 1.38% at P_M = 70.143. It is noted that the experimental
results apply only to P_M, the number of test paths generated, in the range
[9.257, 70.143]. However, if the trend lines that have been derived are valid for
P_M > 70.143, the value of the mean collateral coverage will continue to approach that
of the mean natural coverage (the percentage difference between them will continue
to decrease) with increasing values of P_M.
(2) JJ-path coverage
The natural coverage of JJ-paths by JJ-path testing, in the mean, always exceeds the
collateral cover delivered by branch testing. The extent of the excess increases
rapidly from 0.579% at P_M = 9.257 to 5% at P_M = 10.854, and thereafter to 7.843%
at P_M = 70.143. Further, if the trend lines are valid beyond this value of P_M, it will
continue to increase until a maximum value is reached.
(3) Aggregate coverage
The mean aggregate coverage that branch testing yields exceeds that achieved by
JJ-path testing, by at most 1.64%, only when no more than approximately 10 test
paths per unit need to be generated (the maximum attainable aggregate coverage
being 2). If, on the other hand, between 10 and 70 paths are needed, the situation is
reversed, and the relatively better mean yield of JJ-path testing rises to more than 2%
at P_M = 16.179, after which it increases steadily to 3.127% at P_M = 70.143. Further
increases up to a maximum value are indicated if the trend lines are valid beyond this
value of P_M.
Given these results, how should the two methods be sequenced to maximal advantage in
a single testing strategy? As the testing of branches in a program is likely, in general,
to require the generation of more than 10 paths, then by virtue of (1) and (3) above,
only the extent of branch coverage will be greater if branch testing is applied first.
The aggregate coverage provided will be less (see (3)) than that attained using the
alternative sequence, as, to an even greater extent, will be the coverage of JJ-paths
(see (2)). The results therefore suggest that the order of application should be JJ-path
testing first, followed by branch testing provided that some branches remain uncovered.
However, taking into account the extent of the percentage differences between the
performance of the two methods in the above three cases, the use of only JJ-path
testing in a testing strategy, rather than in combination with branch testing, does not
by any means appear to be an unrealistic option. Moreover, it appears to the authors
to be a somewhat better option. This view is supported by the fact that, because of
both the form of the trend lines, and their quality of fit, all of the above comments
concerning relative performance also obtain when coverage per test path is
considered. For this can certainly be viewed as a measure of the effort required in
order to achieve a given level of coverage. Thus, the authors contend that, if it is
necessary to adopt only one of the testing methods in order to encompass branch and
JJ-path coverage and/or their aggregate, the choice should be JJ-path testing. Further,
they also contend that serious consideration should be given to using only JJ-path
testing even when the application of both methods is possible.
How valid and general are these conclusions? The fundamental factors of influence
here are: the method used to generate the test paths, and the sample used in the
experimentation, in the context of its being representative. As far as the first of these
is concerned, the features of the path generation method used in the experiments are,
as was made explicit in section 2.2, such as not to introduce bias into the results
derived using any testing method. Thus, although the values of Ter_2^*, Ter_3^*, and
(Ter_2^* + Ter_3^*) achieved by any other unbiased path generation method used in
connection with branch and JJ-path testing may not accord with those recorded
herein, the relative levels of coverage that they yield should remain inviolate. With
regard to the second factor, two features of the sample are germane: the language in
which units in the sample are encoded, and the extent to which the sample is
representative in respect of the number of branches and JJ-paths that were involved.
It was argued in section 3.0 that the issue of encoding language should have, at most,
an insubstantial influence on the outcome of the experiments. In terms of the sample
used being truly representative, it must first be noted that no method exists for
proving/disproving that any given sample is representative. However, the code units
in the sample possess between 3 and 50 branches, with a mean of 19.89, and between
3 and 46 JJ-paths, with a mean of 18.77. Whether these values confirm that the
sample is representative is a moot point; however, both ranges are quite
substantial. Also, it must be said that the code units constituting the sample were
selected in a pseudo-random manner; they have been used extensively in a number of
areas of endeavour; the size of the sample (35) is relatively large when compared with
other studies of test coverage that are reported in the literature. The authors suggest,
therefore, that tentative acceptance of the sample’s being representative is not
unreasonable, and thus, some credence should be afforded to the results and
conclusions deriving from the work.
8.0 References
[1] Bertolino A and Marre M. How many paths are needed for branch testing? Journal
of Systems and Software, 1996; 35(2): 95-106.
[2] Brown JR. Practical application of software tools. TRW report TRW-55-72-05,
TRW Systems, One Space Park, Redondo Beach, California, 1972
[3] European Space Agency (ESA) Software Engineering Standards. ESA PSS-05-0
Issue 2, European Space Agency (ESA), 8-10, rue Mario-Nikis, 75738 PARIS
CEDEX, France, 1991.
[4] Frankl PG, Hamlet RG, Littlewood B and Strigini L. Evaluating testing methods
by delivered reliability. IEEE Transactions on Software Engineering; 1998; 24(8)
586-601.
[5] Frankl PG and Weyuker EJ. Provable Improvements on Branch Testing, IEEE
Transactions on Software Engineering; 1993; 19(10) 962-975.
[6] Gondran M and Minoux M. Graphs and Algorithms, Wiley-Interscience, John
Wiley and Sons, Chichester, UK, 1984.
[7] Harder M, Mellen J, and Ernst MD. Improving test suites via operational
abstraction. Proceedings of the International Conference on Software
Engineering, Portland, Oregon; 2003; 60-71.
[8] Koutsikas C and Malevris N. A Unified Symbolic Execution System, Proceedings
of the ACS/IEEE International Conference on Computer Systems and
Applications, Beirut, 2001, pp. 466-469
[9] Malaiya YK, Li MN, Bieman JM, and Karcich R. Software reliability growth with
test coverage. IEEE Transactions on Reliability; 2002; 51(4): 420-426.
[10] Malevris N. An Assessment of the Number of Paths Needed for Control Flow
Testing. 3rd International Conference on Reliability, Quality and Safety of
Software-Intensive Systems (ENCRESS ’97), Athens, Greece, pp. 32-43.
[11] Malevris N and Yates DF. The Collateral Coverage of Data Flow Criteria When
Branch Testing, Information and Software Technology, to appear.
[12] Malevris N, Yates DF and Veevers A. A predictive metric for the likely
feasibility of program paths. Information and Software Technology; 1990;
32(2), 115-119.
[13] Paige MR. On partitioning program graphs, IEEE Transactions on Software
Engineering; 1977; SE-3(6) 386-393.
[14] Requirements For Safety Related Software in Defence Equipment, Part 1.
Defence Standard 00-55(PART 1)/Issue 2, Ministry of Defence, U.K., 1997.
[15] Standard for Software Component Testing. British Computer Society Special
Interest Group in Software Testing (BCS SIGIST), U.K., 2001.
[16] Sze SKS and Lyu MR. ATACOBOL - A COBOL Test coverage analysis tool
and its applications. Proceedings of the 11th International Symposium on
Software Reliability Engineering (ISSRE'00), San Jose, California; 2000; 327-
335.
[17] Testbed, LDRA Software Technology, Portside, Monks Ferry, Wirral, U.K.
[18] Woodward MR. An investigation into program paths and their representation.
Techniques et Science Informatiques; 1984; 3, 273-286
[19] Woodward MR, Hedley D and Hennell MA. Experience with path analysis and
testing of programs. IEEE Transactions on Software Engineering; 1980; SE-6:
278-286.
[20] Yates DF and Hennell MA. An approach to branch testing. Proc. of 11th
Workshop on Graph Theoretic Techniques in Computer Science, Wurtzburg,
West Germany; 1985; pp. 421-433.
[21] Yates DF and Malevris N. The effort required by LCSAJ testing: an assessment
via a new path generation strategy. Software Quality Journal; 1995; 4(3): 227-
243.
[22] Yates DF and Malevris N. Reducing the effect of infeasible paths in branch
testing. Proc. 3rd Symposium on Software Testing, Analysis and Verification
(TAV3), Key West, Florida, U.S.A.; 1989; 48-56.
[23] Zhu H, Hall P and May HR. Software unit test coverage and adequacy. ACM
Computing Surveys; 1997; 29(4):336–427.
Appendix
Branch Coverage

              Initial coverage                    Final coverage
        Branch testing   JJ-path testing    Branch testing   JJ-path testing
Unit    paths   Ter_2^*  paths   Ter_2^*    paths   Ter_2^*  paths   Ter_2^*
(“paths” = no. of paths used)
1 3 1 4 1 3 1 4 1
2 4 0.857 6 0.857 6 1 12 1
3 2 1 3 1 2 1 3 1
4 3 0.75 5 1 5 1 5 1
5 9 0.353 19 0.353 309 0.588 319 0.353
6 5 0.875 6 0.875 6 1 7 1
7 7 0.385 10 0.385 23 1 36 1
8 3 1 4 1 3 1 4 1
9 6 1 8 1 6 1 8 1
10 4 0.714 7 0.714 178 1 123 1
11 8 0.5 12 0.5 308 0.5 312 0.5
12 10 0.167 16 0.278 212 1 316 0.278
13 7 1 10 1 7 1 10 1
14 6 0.9 8 0.9 8 1 10 1
15 9 0.667 11 0.667 42 1 14 1
16 9 0.933 12 0.933 12 1 52 1
17 2 1 2 1 2 1 2 1
18 8 0.5 10 0.5 21 1 310 1
19 7 0.583 9 0.583 13 1 309 1
20 5 0.625 7 0.625 13 1 203 1
21 4 1 7 1 4 1 10 1
22 3 1 4 1 3 1 4 1
23 5 0.667 10 0.667 94 1 310 1
24 10 0.789 15 0.789 310 0.789 315 0.789
25 5 1 8 1 5 1 8 1
26 4 1 5 1 4 1 15 1
27 7 0.231 12 0.231 307 0.538 312 1
28 7 0 11 0 53 1 311 1
29 6 1 8 1 6 1 8 1
30 15 0.429 25 0.857 103 1 63 1
31 5 0 10 0 26 1 310 1
32 13 0.273 21 0.273 313 0.727 321 0.818
33 5 0.125 7 0.125 36 1 38 1
34 5 1 5 1 5 1 5 1
35 7 1 7 1 7 1 7 1
Table 2
The values of P_M and T_2^*(P_M) used in deriving the trend
lines in respect of branch coverage

     Branch Testing              JJ-path Testing
   P_M        T_2^*(P_M)       P_M        T_2^*(P_M)
10.14286 0.774 9.257143 0.717
11.85714 0.776 9.6 0.721
12.28571 0.791 10.25714 0.728
12.71429 0.793 10.85714 0.758
13.11429 0.806 11.97143 0.781
13.51429 0.824 12.51429 0.811
14.25714 0.832 14.6 0.811
14.45714 0.833 16.54286 0.812
15.37143 0.861 17.02857 0.814
18.82857 0.872 17.51429 0.828
19.42857 0.881 18.48571 0.83
21.2 0.886 19.45714 0.859
22.62857 0.892 21.88571 0.859
23.2 0.905 22.37143 0.863
26.51429 0.908 24.68571 0.88
31.91429 0.909 25.54286 0.893
33.97143 0.91 25.97143 0.893
34.25714 0.919 27.25714 0.893
48.37143 0.923 27.68571 0.893
52.88571 0.942 28.51429 0.893
53.22857 0.947 38.91429 0.909
70.14286 0.947 48.2 0.918
- - 56.74286 0.922
Table 4.
JJ-path Coverage

              Initial coverage                    Final coverage
        Branch testing   JJ-path testing    Branch testing   JJ-path testing
Unit    paths   Ter_3^*  paths   Ter_3^*    paths   Ter_3^*  paths   Ter_3^*
(“paths” = no. of paths used)
1 3 0.875 4 1 3 0.875 4 1
2 4 0.667 6 0.75 6 0.833 12 1
3 2 0.8 3 1 2 0.8 3 1
4 3 0.5 5 1 5 0.875 5 1
5 9 0.171 19 0.171 309 0.343 319 0.171
6 5 0.786 6 0.857 6 1 7 1
7 7 0.278 10 0.333 23 0.833 36 1
8 3 0.875 4 1 3 0.875 4 1
9 6 0.875 8 1 6 0.875 8 1
10 4 0.455 7 0.455 178 1 123 1
11 8 0.4 12 0.4 308 0.4 312 0.4
12 10 0.156 16 0.25 212 0.844 316 0.25
13 7 0.85 10 1 7 0.85 10 1
14 6 0.882 8 0.882 8 1 10 1
15 9 0.458 11 0.5 42 0.833 14 1
16 9 0.857 12 0.952 12 0.905 52 1
17 2 1 2 1 2 1 2 1
18 8 0.471 10 0.471 21 0.882 310 0.882
19 7 0.533 9 0.533 13 0.867 309 0.867
20 5 0.545 7 0.545 13 0.909 203 1
21 4 0.75 7 0.875 4 0.75 10 1
22 3 0.857 4 1 3 0.857 4 1
23 5 0.5 10 0.5 94 0.857 310 0.857
24 10 0.7 15 0.733 310 0.7 315 0.733
25 5 0.8 8 1 5 0.8 8 1
26 4 0.875 5 0.875 4 0.875 15 1
27 7 0.167 12 0.167 307 0.5 312 1
28 7 0 11 0 53 0.765 311 0.765
29 6 0.875 8 1 6 0.875 8 1
30 15 0.415 25 0.707 103 0.902 63 1
31 5 0 10 0 26 0.692 310 0.692
32 13 0.206 21 0.206 313 0.441 321 0.618
33 5 0.125 7 0.125 36 1 38 1
34 5 1 5 1 5 1 5 1
35 7 1 7 1 7 1 7 1
Table 5.
The values of P_M and T_3^*(P_M) used in deriving the trend
lines in respect of JJ-path coverage

     Branch Testing              JJ-path Testing
   P_M        T_3^*(P_M)       P_M        T_3^*(P_M)
10.14286 0.675 9.257143 0.666
11.85714 0.678 9.6 0.67
12.28571 0.69 10.25714 0.678
12.71429 0.693 10.85714 0.716
13.11429 0.7 11.97143 0.741
13.51429 0.717 12.51429 0.769
14.25714 0.721 14.6 0.772
14.45714 0.722 16.54286 0.774
15.37143 0.742 17.02857 0.779
18.82857 0.753 17.51429 0.793
19.42857 0.763 18.48571 0.796
21.2 0.766 19.45714 0.816
22.62857 0.771 21.88571 0.818
23.2 0.781 22.37143 0.822
26.51429 0.784 24.68571 0.839
31.91429 0.787 25.54286 0.849
33.97143 0.787 25.97143 0.85
34.25714 0.797 27.25714 0.851
48.37143 0.805 27.68571 0.852
52.88571 0.822 28.51429 0.853
53.22857 0.825 38.91429 0.865
70.14286 0.825 48.2 0.875
- - 56.74286 0.883
Table 7.
2. Formal Models and Approaches to Testing
Towards Unit Testing for
Communicating Stream X-machine Systems⋆

Joaquín Aguado and Michael Mendler

Faculty of Information Systems and Applied Computer Sciences, University of Bamberg, Germany
{joaquin.aguado,michael.mendler}@wiai.uni-bamberg.de
Abstract. This paper studies the conformance testing problem for a model of distributed systems, Communicating Stream X-machine Systems (CSXMS). An approach for the unit testing of CSXMS is suggested. It is based on generating testing sequences from the distributed specification, and exercising them on each of the components independently of the context through a distributed test system, which can be the same original system, but in which the components have some extra functionality so as to act as testers. The main technical result is twofold. First, for testing a complete system this approach has at least the same fault detection ability as the previous (product machine) approach, but it is also more efficient and it permits the separation of unit and integration testing. Second, the design for test conditions are relaxed compared to previous CSXMS techniques regarding the attainability of the memory, which means less effort is needed to prepare systems for testing.
1 Introduction
Conformance testing concentrates on the production of a set of test cases which are deemed sufficient to demonstrate that the behaviour of the implementation conforms to the specification. Conformance, in general, cannot be expected to imply full functional correctness for two reasons. First, to validate correctness fully, exhaustive testing is required, which is impossible to achieve if there is an infinite range of operating conditions and user interactions of which only a rather small and finite part can be covered by a finite test suite. Second, if the implementation under test is a concrete physical system (a set of programs running on a set of hardware processors) implementing a complex software architecture (compiled and optimized for a given operating system and middleware layer), it will be next to impossible to determine an exact model of the implementation relative to which one could guarantee that tests are exhaustive.
If conformance testing does not obtain full specification correctness, what does it achieve? How can we measure the quality of a conformance test and give rational assertions about the extent to which it achieves its goal? Any such
⋆ This work has been partially supported by the European Commission within the TYPES network IST 510996.
method, so it seems, will have to involve an abstraction both of the specification (to reduce the number and complexity of the specified behaviour) and of the implementation. The Stream X-machine Model (SXM) [1] is geared towards making such abstractions by separating finite control structure from (usually infinite) data paths. Each of these two can be tested separately. For instance, it may be assumed that, relative to a given set of control points, the implementation can be represented by an SXM with a bounded number of control states, and that the data transformations that are effected through the interaction points can be verified separately. This is known as the fundamental testing hypothesis. Relative to such an abstract model of the implementation under test (IUT) it is possible to give guarantees about full test coverage. To be more precise, according to [2] a fault model is a triplet (sι, conforms, Miut), where sι is a finite-state specification (e.g. I/O FSM, LTS, I/O Automaton, X-Machine), conforms is the conformance relation (e.g. FSM equivalence, reduction or quasi-equivalence, trace inclusion or trace equivalence) and Miut is the fault domain (i.e. the set of possible implementations). The fault domain usually reflects test assumptions (e.g. all I/O FSMs with a given number of states) which are established by a testing hypothesis. In this manner, a test suite is complete with respect to (sι, conforms, Miut) when, for all ι ∈ Miut, ι passes the test suite if and only if ι conforms to sι. If so, the test suite provides a fault coverage guarantee for the given fault model.
Fig. 1. Conformance Testing. The fault model (sι, conforms, Miut) relates the implementations under test ι ∈ Miut, the specifications sι ∈ Ms and the models of implementation mι ∈ Mm; the testing hypothesis links “ι conforms sι” with the formal relation mι ≃ sι.
In conformance testing [3, 4] we are given three sets Miut, Ms and Mm as in Fig. 1: respectively, the implementations under test, the specifications and the models of implementation. Each IUT ι ∈ Miut is to be tested with respect to a given specification sι ∈ Ms. The testing hypothesis is the assumption that each ι ∈ Miut can be captured adequately by some associated formal model mι ∈ Mm. Conformance is the result that we hope to get from running some set of tests on the implementation ι. If successful, the tests establish a relationship between
40
UKTest 2005
Miut and Ms, viz. “ι conforms sι”. This is not a formal relationship but an operational process. If we want to judge the quality of the tests we refer to the models of implementation. For we may be able to show that the tests are powerful enough that any model in Mm that passes all tests actually implements Ms in some formal sense. Depending on the purpose of testing, such an implementation relationship, written mι ≃ sι, can mean one of many things, e.g. that mι is trace equivalent to sι, that mι simulates sι, and so on. The point is that it is a precise mathematical relation that gives rational meaning to the statement that an implementation passes a set of tests. In this way the testing hypothesis links the informal notion of ι conforms to sι with a formal one, mι ≃ sι.
Testing is about deciding the implementation relation ≃, which, if it is significant, refers to an infinite number of possible input sequences, in terms of a finite but judiciously chosen test set χ. For finite state systems either the classical W-method [5] or the Wp-method [6] can be applied to obtain a test set χ from the specification automaton. The test suite generation problem for SXM has been solved [7, 8] by considering the associated (control) automaton A(Λs) of the SXM specification Λs (see Def. 1 below), treating the operations Φ on the data path as abstract input symbols (Φ is called the type of Λs). This depends only on the control states and is independent of the different possible states of the memory (i.e. memory values). Thus, it produces a much smaller test set than would a method that considers both control-states and memory-states. The test set is a set of sequences χ ⊆ Φ∗ from the type Φ of Λs which tests trace equivalence with respect to A(Λs). Let ≃T denote language equivalence relative to the words in T ⊆ Φ∗, i.e., A1 ≃T A2 iff ∀ω ∈ T. ω ∈ L(A1) ⇔ ω ∈ L(A2), where L(A) denotes the language of A. Full trace equivalence, A1 ≃ A2, is A1 ≃Φ∗ A2. As part of the testing hypothesis one assumes that the control automaton A(Λm) of an implementation under test Λm has a bounded number k of control states (essentially the number of states, depending on χ). Then if A(Λm) passes all tests in χ (which have been generated to fit k) it must be trace equivalent to A(Λs). Formally, A(Λm) ≃χ A(Λs) implies A(Λm) ≃ A(Λs). Now, if all data path operations Φ are implemented correctly (by the testing hypothesis) this further implies that Λm and Λs have the same input-output functions fΛm and fΛs, respectively (see Def. 2 below), i.e., ∀s ∈ Σ∗. fΛm(s) = fΛs(s).
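The relation ≃T can be made concrete for explicit finite automata. The Python sketch below is illustrative only; the two machines are invented for the example and are not taken from the paper:

```python
# Language equivalence relative to a word set T: A1 is equivalent to A2 with
# respect to T iff both automata accept exactly the same words of T. A DFA is
# given as (transitions, start state, accepting states); missing transitions reject.
def accepts(dfa, word):
    trans, start, accepting = dfa
    state = start
    for symbol in word:
        if (state, symbol) not in trans:
            return False
        state = trans[(state, symbol)]
    return state in accepting

def equiv_rel(a1, a2, words):
    """A1 ≃_T A2: agreement on every word of the (finite) set T."""
    return all(accepts(a1, w) == accepts(a2, w) for w in words)

# Two machines over {"a"}: A1 accepts a-strings of even length (including 0);
# A2 accepts only the empty word and "aa". They agree on short words only.
A1 = ({("p0", "a"): "p1", ("p1", "a"): "p0"}, "p0", {"p0"})
A2 = ({("q0", "a"): "q1", ("q1", "a"): "q2"}, "q0", {"q0", "q2"})

T_small = ["", "a", "aa"]
T_large = T_small + ["aaa", "aaaa"]
print(equiv_rel(A1, A2, T_small), equiv_rel(A1, A2, T_large))
```

This illustrates why the test set must be generated to fit the assumed state bound k: on too small a T the machines are indistinguishable.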
Let us overload notation and use ≃ to denote this functional equivalence for SXMs. Under the testing hypothesis, thus, a finite test set χ suffices to decide functional equivalence ≃ between two SXMs. But how do we apply χ to the control automaton, A(Λm), in practice? What we will have available for testing is the IUT corresponding to the full SXM Λm, not the abstracted control automaton. Thus, in order to exercise these sequences on Λm it is necessary to convert them into sequences from the input alphabet Σ of Λm. Here is where the fundamental test function [9, 7, 8] comes into play. A fundamental test function t : Φ∗ → Σ∗ has the property that for every ω ∈ χ, t(ω) ∈ dom(fΛm) iff ω ∈ L(A(Λm)). In other words, to find out if a test sequence ω ∈ χ is accepted by the control automaton A(Λm) it suffices to run the translation t(ω) ∈ Σ∗ on
the “real” system Λm. Obviously, then, Λm ≃ Λs iff ∀ω ∈ χ. t(ω) ∈ dom(fΛm) ⇔ ω ∈ L(A(Λs)).

How do we construct the test function? The canonical way to do this is to simulate the abstract test sequence ω = φ1φ2 · · ·φk ∈ χ incrementally on Λm, at each step providing a suitable input symbol σi ∈ Σ for controlling the execution of φi ∈ Φ on Λm, so that we are able to detect whether or not φi can be successfully executed (triggered). The abstraction made for obtaining the test set χ (by interpreting the relations in Φ as abstract symbols) removes information regarding the input-output behaviour and the machine memory. In order for the fundamental test function t to be able to fill in this information, the standard technique proceeds as follows:
1. Identify a memory invariant N ⊆ MEM that can be maintained for a restricted set of input symbols S ⊆ Σ. Formally, the type Φ of Λ is called closed with respect to (MEM, S) if the initial memory m0 ∈ N and every φ ∈ Φ executed from an arbitrary memory m ∈ N under an input symbol σ ∈ S produces a memory value m′ ∈ N.
2. Ensure that it is always possible to apply an input σk+1 ∈ S that can trigger a given operation φk+1 ∈ Φ in all memory states within the invariant set N that can possibly be assumed by the implementation Λm after having executed the functions φ1φ2 · · ·φk in this order. This condition, known as input completeness, ensures that (i) every path ω ∈ L(A(Λs)) which is also in L(A(Λm)) can actually be followed in Λm, and (ii) if at the end of this path a function φk+1 is not found to be executable in Λm (note that this φk+1 may or may not be executable in Λs) then ω cannot be extended by φk+1 in the control automaton A(Λm) either.
3. Ensure that it is always possible to determine which operation φ ∈ Φ has been executed by observing the output γ ∈ Γ produced by Λm. For non-deterministic systems it is crucial that this output also determines the next memory value. This condition, known as output-distinguishability, guarantees that the output sequence produced by the machine uniquely identifies the path followed.
Completeness and output-distinguishability are known as design for test conditions. Output-distinguishability is only mentioned briefly in Section 4.3 regarding deterministic machines. For information about different types of non-deterministic SXM the reader is referred to [10–12].
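To fix ideas, this standard technique can be exercised on a toy machine. The Python sketch below is purely illustrative: the two operations, the input set S and the control automaton are invented, and `exercise` only mimics the role of a fundamental test function under the input-completeness assumption:

```python
# A toy SXM: the memory is a counter, the type Phi consists of two partial
# functions (memory, input) -> (output, memory'), and the control automaton
# has arcs labelled with operation names.
def inc(m, sigma):
    return ("i", m + 1) if sigma == "a" else None  # triggered by input "a"

def dec(m, sigma):
    # Partial: only triggerable by "b" and when the counter is positive,
    # which keeps the memory within the invariant N = {m >= 0}.
    return ("d", m - 1) if sigma == "b" and m > 0 else None

PHI = {"inc": inc, "dec": dec}
S = ["a", "b"]  # restricted set of input symbols
ARCS = {("q0", "inc"): "q1", ("q1", "inc"): "q1", ("q1", "dec"): "q0"}

def exercise(abstract_word, q0="q0", m0=0):
    """Translate an abstract sequence over Phi into concrete inputs and run
    it, in the spirit of a fundamental test function: return the concrete
    input sequence if the abstract sequence is executable, else None."""
    q, m, inputs = q0, m0, []
    for op in abstract_word:
        if (q, op) not in ARCS:          # rejected by the control automaton
            return None
        # Input-completeness assumption: some sigma in S triggers op from m.
        sigma = next((s for s in S if PHI[op](m, s) is not None), None)
        if sigma is None:
            return None
        _, m = PHI[op](m, sigma)
        inputs.append(sigma)
        q = ARCS[(q, op)]
    return inputs

print(exercise(["inc", "dec"]), exercise(["dec"]))
```

Because each operation emits a distinct output (“i” vs “d”), the toy machine is also output-distinguishable in the sense of condition 3.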
The fault model of SXM-based testing admits a more general interpretation in terms of an abstract machine AINV which we might call an automaton invariant. The idea is that AINV encapsulates an abstraction of the combined memory and control paths which both specification and implementation are assumed to follow as part of the testing hypothesis. In this manner, the fault domain (given by the testing hypothesis) refers to the set of those implementations refined from AINV. The notion of refinement here is trace inclusion. To be more precise, for all SXM Λ with type Φ for which the testing hypothesis holds,
we have L(A(Λ)) ⊆ L(AINV). The implication of this is that any test set χ is a subset of L(AINV). This is justified because any ω /∈ L(AINV) can be executed neither in the specification nor in the implementation. Note that this filtering through L(AINV) still involves both positive and negative testing: for all ω ∈ χ ∩ L(A(Λ)) it is tested that the implementation admits these traces, while for ω ∈ χ ∩ (L(AINV) \ L(A(Λ))) we verify that ω cannot be executed.
For the stand-alone SXM testing technique, AINV is trivial, as illustrated in Fig. 2. In this AINV there is a single state representing all pairs Q × N and a self-loop on this state for every operation φ ∈ Φ subject to the input constraint S. The intended meaning is that for each (q, m) ∈ Q × N and every φ ∈ Φ there is an input σ ∈ S such that (m, σ) triggers φ and the resulting state is again in Q × N.

Fig. 2. The automaton invariant AINV of the SXM testing approach: a single state Q × N with a self-loop labelled Φ, S.
One contribution of this paper consists of defining AINV for communicating stream X-machine systems (CSXMS) (see Fig. 3 below), subject to the structural constraints and the assumptions on the operation of the functions of Φ made in the testing hypothesis.
Next, let us look at the conformance testing problem for a communicating stream X-machine system (CSXMS) specification W. In [9, 13] it is assumed by the testing hypothesis that the IUT ι is modelled by a single monolithic SXM Λm rather than by a communicating system of individual SXMs. In order to obtain tests, the specification W is “transformed” into a functionally equivalent global SXM Λ(W) [13], so that Λ(W) ≃ W. The testing is done from this resulting Λ(W) using a stand-alone testing method. If the test cases are sufficient to demonstrate that Λm ≃ Λ(W), then, since ≃ is transitive, it is possible to deduce that Λm ≃ W, and the problem reduces to that of finding a test set for Λm from Λ(W). In this approach, thus, the original structure of the CSXMS specification is lost and is not linked with any component structure of the IUT.
Using the procedures described in [7], it can be ensured that the type Φi of every component Λi is both input-complete and output-distinguishable. The following results from [9] provide the basis for test generation from Λ(W), under moderate assumptions on the structure of the CSXMS W (testing variant, simpleness):

– If the associated automaton of each component Λi of W is deterministic then the associated flat automaton Λ(W) is also deterministic.
– If the type Φi of each Λi is a set of partial functions then the type Φ of Λ(W) also consists of partial functions.
– If all Φi are input-complete then Φ is input-complete.
– If all Φi are output-distinguishable, then Φ is, too.
The advantage of this global product machine approach is that it is verygeneral and can be applied in situations where the IUT cannot be decomposedinto parts and only accessed as a whole. However, there are severe disadvantages,too. Because of state-space explosion the construction of Λ(W ) and of the testsequences from it can be very expensive. Also, since the IUT is tested as a wholeit can be very difficult to obtain a reliable guess on the maximal sequentialdepth (number of states) of Λm when the IUT is a complex system. Fortunately,in many practical situations this monolithic approach is not necessary. Oftenthe distribution of components in W has a direct match in the implementation,either because the implementation has been derived from W in a component-oriented manner or because the specification W has been tailored to reflectexisting implementation components for the purposes of testing. If the softwaresystem is implemented from the CSXMS specification it seems natural to assumethat this structure is preserved in the implementation. Programming languages,operating systems and tools provide a standard mechanism to achieve this evenin stand-alone computers.
Indeed, we believe that the main benefits of using CSXMS as a specification formalism lie in the possible separation of unit and integration testing. In unit testing the smallest piece of a system (e.g. the smallest compilable element or a communicating component) is tested in isolation. On the other hand, integration testing is defined as the activity in which components are combined and tested to evaluate the interaction between them [14]. The original approach [9, 13] described above does not support this separation of concerns as it performs at the same time and in an indivisible manner both the unit testing and the integration testing of a system specified as a CSXMS. This paper attempts to make first steps towards developing a methodology for unit and integration testing based on CSXMS.
2 Unit Testing for CSXMS
A CSXMS specification W is distributed in the sense that it is presented in termsof clearly separated and independently executing components, which interactwith each other. In order to define a fault model that exploits this for unittesting one possibility is to restrict the fault domain Miut to those IUTs whichfollow the structure (system architecture) of the specification W . Concretely,this means that for every implemented component under test (ICUTi) in theIUT, there is one and only one component Λi in the CSXMS specification Wthat prescribes its behaviour.
Therefore, at least in principle, it is possible to take the component’s specifi-cation Λi and generate a test set out of it, and then exercise the test sequencesin the component’s implementation ICUTi to observe its behaviour. However,neither this Λi nor the ICUTi are isolated entities. In the CSXMS model they
communicate with other components through input and output ports. We maynow analyse the behaviour of the ICUTi and compare it to that prescribed by Λi
under the assumption that the rest of the system is fixed and operates correctlyaccording to the CSXMS specification, or test independently of the context. Thislatter alternative is the one explored in this paper.
Let Λim be the model of ICUTi and Λi its specification. The implementation relationship is articulated as follows. We put Λim ≃ Λi iff W [Λi] ≃ W [Λim], where W [·] is an arbitrary CSXMS in which Λi and Λim can operate as a component. For W [Λi] ≃ W [Λim] we use standard CSXMS equivalence. We show how the test generation problem of deciding the strong contextual equivalence Λim ≃ Λi can be solved using the SXM testing approach on the specified component Λi with an extended communication interface. Note that the assumption that the components can be manipulated does not imply that from outside of the system (environment) it is possible to access the communication interfaces (i.e. ports) directly, but just the environment interface (i.e. streams) of the components. In order to do this, the approach suggested here is based on testing architectures. The intention of a distributed/remote architecture [4] is to provide a framework for unit testing of the ICUTs that later will be integrated into a system. A testing architecture is a description of the environment in which the IUT is executed by carrying out test cases in order to observe whether or not it has a certain behaviour. Abstract testing architectures are described in terms of observable outputs and controllable inputs with respect to the IUT [15]. In this way, a testing architecture refers to the test devices (also called testers), the connections between them and those between the test devices and the IUT [16]. Even more specifically, a testing architecture consists of the IUT, its points of control and observation (PCOs) and a test system. The PCOs are the points closest to the IUT at which input events and observation of output events take place during testing [15, 17].
Our test architecture consists of connecting independent testers to all input and output channels of the CSXMS component under test and providing them with individual test instructions, called local test sequences (LOC), that all together constitute a remote test case (RTC). In this way the test is performed in a concurrent and local fashion and also provides some amount of integration testing, as it can involve the actual communication topology in which the ICUT is supposed to operate. Two CSXMSs W′ and W are architecturally isomorphic, denoted by W′ ≅id W, if and only if there is a one-to-one correspondence between the systems' components and the structure of communication (i.e. the connections among components) is the same in both.
Now, given that all the components have been tested and conform, it is required to test the architectural isomorphism Wm ≅id W (provided that this is not given), where the CSXMS Wm is the model of the implementation (IUT). Then, because W is given and distributed, it is possible to use a simple distributed algorithm that can be implemented as part of the functionality of the components and executed when required (at least once before the system is put into operation). This integration testing algorithm operates as follows. It receives
from the input stream of each component an encoding of the identifier of the component and the identifiers of all its neighbours in W . Then each component sends its identifier to all its neighbours and receives from them their corresponding identifiers, which are placed, one by one, in the output stream. If, at the end, the output stream contains exactly the set of neighbour identifiers that are indicated in W , then the component is in the right place and correctly connected. If this is the case for all the components in the IUT, then obviously Wm ≅id W.
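This neighbour-check can be sketched as a centralised simulation (our own illustration, not from the paper; in the actual setting the algorithm runs distributed inside the components, and the names used here are hypothetical):

```python
# Sketch of the neighbour-check integration test, simulated centrally.
# spec:  {component id: set of neighbour ids prescribed by W}
# links: {component id: set of component ids it is actually connected to}

def architecturally_isomorphic(spec, links):
    """Each component sends its identifier to all of its actual neighbours;
    a component passes iff the identifiers it receives are exactly the
    neighbour set prescribed by W. All components must pass for Wm ~=_id W."""
    received = {c: set() for c in spec}
    for c in spec:
        for n in links.get(c, set()):   # c sends its id to actual neighbour n
            if n in received:
                received[n].add(c)
    return all(received[c] == spec[c] for c in spec)
```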
3 Basic Definitions
In this section we recall the standard definitions of deterministic (communicat-ing) stream X-machines.
Definition 1. A SXM is a tuple Λ = (Σ,Γ, Q,M,Φ, F, q0,m0), where:
– Σ and Γ are finite input and output alphabets, respectively,
– Q is a (finite) set of control states,
– M is a possibly infinite set called the internal memory of Λ,
– Φ, called the type of Λ, is a set of non-empty (partial) processing functions of the form φ : M × Σ ⇀ Γ × M,
– F is the (partial) next-state function, F : Q × Φ ⇀ Q,
– q0 ∈ Q and m0 ∈ M are respectively the initial control and memory states.
For determinism we require that if F(q, φ1) ≠ ⊥ and F(q, φ2) ≠ ⊥ are both defined for distinct functions φ1 ≠ φ2 then dom(φ1) ∩ dom(φ2) = ∅.
A configuration cf of a SXM is a tuple cf = (m, q, s, g), where m ∈ M, q ∈ Q, s ∈ Σ*, g ∈ Γ*. The SXM starts its execution from an initial configuration (m0, q0, s, ε), where the input stream is set to s and ε indicates that the output stream is empty. A change of configuration (m, q, σ·s′, g) ⊢ (m′, q′, s′, g·γ) is possible if there exists φ ∈ Φ with q′ = F(q, φ) and φ(m, σ) = (γ, m′). The transitive and reflexive closure of ⊢ is denoted by ⊢*. Configurations such as (m, q, ε, g) with an empty input stream are final, representing successful termination.
Definition 2. The (partial) function computed by a SXM Λ, fΛ : Σ* ⇀ Γ*, is defined by fΛ =df {(s, g) | ∃q ∈ Q, m ∈ M. (m0, q0, s, ε) ⊢* (m, q, ε, g)}.
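Definitions 1 and 2 can be prototyped directly. The following sketch is our own illustration (not from the paper), with the type Φ and the next-state function F encoded as dictionaries, and partial functions returning None where undefined:

```python
# Illustrative executor for a deterministic SXM.
# phi: {name: function (m, sigma) -> (gamma, m') or None when undefined}
# F:   {(state, name): next_state}   (the partial next-state function)

def run_sxm(phi, F, q0, m0, inputs):
    """Return the output stream g with (s, g) in f_Lambda, or None if the
    machine blocks before consuming the whole input stream."""
    q, m, out = q0, m0, []
    for sigma in inputs:
        for name, func in phi.items():
            if (q, name) in F:            # an arc labelled 'name' leaves q
                res = func(m, sigma)      # partial function: None if undefined
                if res is not None:
                    gamma, m = res
                    out.append(gamma)
                    q = F[(q, name)]
                    break
        else:
            return None                   # no processing function is enabled
    return out                            # input stream consumed: success
```

Determinism (disjoint domains of the φ leaving each state) guarantees that at most one processing function is enabled, so the iteration order over phi is irrelevant.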
The notion of CSXMS combines the advantages of the SXM, i.e., separating the concerns of modelling the control system and the functional data paths, with the possibility of sending and receiving messages among several of these abstract machines. The communication between components (also referred to as CSXMs) is implemented via local input and output ports within each component and a global communication matrix C that links every component with every other. Each entry C[i, j] in the matrix is a directed one-place buffer for messages in transit from Λi to Λj. The buffer, which may be empty or full, synchronises the components in the sense that any writing into C[i, j] by Λi is blocked until any previous value has been read by Λj so that the buffer is empty, while reading
from C[i, j] must wait until the buffer is filled by Λi. Otherwise, the execution of operations in different component SXMs is concurrent and asynchronous, i.e. any number of components whose operations are enabled can move together in one step of a global change of configuration. Let C[i, j] ← x denote the communication matrix C where cell (i, j) has been updated with value x, leaving all other cells unchanged.
Definition 3. A CSXMS with n components is a family W = (Λi)i≤n of n SXM system components of the form Λi = (Σi, Γi, Qi, DATAi, Φi, Fi, q0i, data0i) over the abstract data path DATAi = INi × Mi × OUTi × CM, subject to the following structural constraints and interpretations:
– INi and OUTi are two sets¹ called the input port and the output port respectively of the component, with the property that INi, OUTi ⊆ Mi ⊎ {λ} and λ ∈ INi ∩ OUTi. The special symbol λ is used to indicate an empty port. We will use the abbreviation MEMi = INi × Mi × OUTi, called the memory of Λi.
– CM =df Π(i,j)∈n×n OUTi is the set of communication matrices C ∈ CM of order n × n, modelling a message buffer of size 1 associating with every pair 1 ≤ i, j ≤ n a message C[i, j] ∈ OUTi en route from Λi to Λj. We assume that OUTi ⊆ INj for every pair i, j of components, so that every output message of component i can be accepted as input by every component j.
– data0i = (λ, m0i, λ, C0) is the initial state of the data path, where m0i is the initial internal memory and C0 the initial communication matrix, which we assume to be empty, i.e., C0[i, j] = λ for all i, j ≤ n.
Further, the type Φi of any component can be partitioned into processing operations Φpi and communicating operations Φci, i.e. Φi = Φpi ⊎ Φci, such that

– no processing operation depends on or modifies the communication matrix, i.e., all φpi ∈ Φpi are of type MEMi × Σi ⇀ Γi × MEMi. As a function on the full type DATAi × Σi ⇀ Γi × DATAi we have φpi(in, m, out, C, σ) = (γ, in′, m′, out′, C) where φpi(in, m, out, σ) = (γ, in′, m′, out′) under the restricted type.
– the communicating operations Φci = Φsi ⊎ Φri split into output-moves Φsi and input-moves Φri, which are partial functions of the following shapes (assuming i ≠ j):
• The output moves are sendi→j : OUTi × CM × Σi ⇀ Γi × OUTi × CM such that sendi→j(out, C, σ) is defined only if out ≠ λ = C[i, j]. The function value, if defined, is sendi→j(out, C, σ) = (γ, λ, C[i, j] ← out).
• The input moves are receivej→i : INi × CM × Σi ⇀ Γi × INi × CM such that receivej→i(in, C, σ) is defined only if C[j, i] ≠ λ = in. The result, if defined, is receivej→i(in, C, σ) = (γ, C[j, i], C[j, i] ← λ).

Both sendi→j and receivej→i operations are naturally extended to the full type DATAi × Σi ⇀ Γi × DATAi as follows: We put

sendi→j(in, m, out, C, σ) = (γ, in, m, out′, C′)

iff (γ, out′, C′) = sendi→j(out, C, σ) for the send operation, and

receivej→i(in, m, out, C, σ) = (γ, in′, m, out, C′)

iff (γ, in′, C′) = receivej→i(in, C, σ) for the receive operation.

¹ We assume that INi ≠ {λ} and OUTi ≠ {λ}.
Finally, the set of control states is partitioned, Qi = Qpi ⊎ Qci, into a set of processing states Qpi and communicating states Qci such that

– all the transitions emerging from Qpi are labelled by processing operations and all transitions emerging from Qci are labelled by communicating operations, so the next-state function Fi : Qi × Φi ⇀ Qi of component Λi has the domain dom(Fi) ⊆ (Qpi × Φpi) ∪ (Qci × Φci)
– every processing function φ ∈ Φpi clears the input port and sets the output port to a defined value, i.e., if φ(x, m, y, C, σ) = (γ, x′, m′, y′, C) then x′ = λ and y′ ≠ λ
– the initial state is a processing state, q0i ∈ Qpi
– a communicating operation always moves the machine into a processing state, i.e., if (q, φ) ∈ dom(Fi) and q ∈ Qci then Fi(q, φ) ∈ Qpi.
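The communicating operations of Def. 3 can be illustrated as partial functions on the port and matrix state. In this sketch (ours, not the paper's), λ is modelled as None, and the stream input σ and output γ are abstracted to a single passed-in output value:

```python
# Sketch of send_{i->j} and receive_{j->i} from Def. 3 (illustrative only).
# C is a dict {(i, j): message or None} acting as the one-place buffer matrix;
# the empty-port/empty-cell symbol lambda is modelled as None.

def send(i, j, out, C, gamma):
    """Defined only if out != lambda and C[i, j] == lambda (cell empty).
    Returns (gamma, cleared out-port, updated matrix), or None if blocked."""
    if out is None or C[(i, j)] is not None:
        return None                    # blocked: nothing to send, or cell full
    C2 = dict(C)
    C2[(i, j)] = out                   # C[i, j] <- out
    return (gamma, None, C2)           # the out-port is emptied

def receive(j, i, inp, C, gamma):
    """Defined only if C[j, i] != lambda and the in-port is empty."""
    if inp is not None or C[(j, i)] is None:
        return None                    # blocked: port full, or cell empty
    C2 = dict(C)
    C2[(j, i)] = None                  # the cell is emptied
    return (gamma, C[(j, i)], C2)      # the message moves onto the in-port
```

The guards mirror the definedness conditions of Def. 3: a send blocks on a full cell, a receive blocks on an empty cell, which is exactly the one-place-buffer synchronisation described in Section 3.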
We will generally use processing operations φpi, output moves sendi→j and input moves receivej→i with their more specific types. Also, note that communication operations, just like processing operations, are synchronous in the sense that they consume exactly one symbol from the input stream and produce one symbol on the output stream. Previous work on CSXMS [9] permits communication operations to run silently, but adds extra symbols for controlling and observing them as part of the so-called testing variant. Def. 3 avoids this extra instrumentation step. Also, note that Def. 3 presumes that each component of a CSXMS is simple, a condition imposed in [9] for input controllability.
Let W = (Λi)i≤n be a CSXMS. As defined above, a configuration for SXM component Λi is a tuple

cfi = (qi, ini, mi, outi, C, si, gi) ∈ Qi × MEMi × CM × Σi* × Γi*,

where ini ∈ INi and outi ∈ OUTi are the current port values, mi ∈ Mi and qi ∈ Qi the current internal memory and control state, respectively, C the current state of the communication matrix, and finally si ∈ Σi*, gi ∈ Γi* the current contents of the input and output streams. We now define the configurations of the CSXMS W by glueing together these local configurations, using the communication matrix as a shared object. Thus, a configuration of W is a tuple cf = (cf1, . . . , cfn, C) where each cfi is a configuration of component Λi and C the current global communication matrix such that C = π5(cfi) (πi refers to the ith projection function) for all i ≤ n. Each such SXM Λi performs local changes of configuration cfi ⊢ cf′i by executing a processing or communicating operation in the appropriate way. Since all Λi run under mutual exclusion regarding the matrix C, their execution can be serialised. Thus, a global change of configuration (cf1, . . . , cfn, C) ⊢ (cf′1, . . . , cf′n, C′) consists of an index i ≤ n such that
cfi ⊢ cf′i and cfj = cf′j for all other j ≠ i. Again, we let ⊢* be the reflexive and transitive closure of ⊢. A configuration cf = (cf1, . . . , cfn, C) is initial if all cfi are initial configurations of Λi and C = C0 is the initial (empty) matrix, and it is final if all cfi are final for Λi.
Definition 4. The stream relation computed by a CSXMS W = (Λi)i≤n is the relation fW : Σ1* × · · · × Σn* ↔ Γ1* × · · · × Γn* with (s1, . . . , sn) fW (g1, . . . , gn) iff there exist initial and final configurations cf0 = (cf01, . . . , cf0n, C0) and cft = (cft1, . . . , cftn, Ct), respectively, such that si = π6(cf0i) and gi = π7(cfti) for all i ≤ n and cf0 ⊢* cft.
As shown in [9], the relation fW computed by any CSXMS (under the assumptions made) actually is a partial function.
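The serialised global changes of configuration can be illustrated by a simple interleaving scheduler (our own sketch, not the paper's; a "component" is abstracted to a function that either performs one enabled local move on the matrix and returns the updated matrix, or returns None when blocked):

```python
# Illustrative interleaving scheduler for global CSXMS steps.
# components: list of functions step_fn(C) -> updated matrix C' or None.
# Each global step is one enabled local step of some component i, with all
# other components unchanged, matching the serialised semantics above.

def run_to_quiescence(components, C, max_steps=1000):
    """Repeatedly perform global steps until no component can move."""
    for _ in range(max_steps):
        for step_fn in components:
            C2 = step_fn(C)
            if C2 is not None:
                C = C2
                break                  # one local move per global step
        else:
            return C                   # terminal: every component is blocked
    return C
```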
4 Design for Test Conditions
It is not hard to see that the design for test conditions, as they have been defined in [9], are strong enough not only to allow the original testing approach, but also to support the unit testing described above. Moreover, as has been shown in [18], it is possible to perform unit testing from the model under consideration (CSXMS) with the same accuracy as the SXM testing approach provides for stand-alone SXMs. Here we generalise the results of [18] to exploit the input and output ports as extra points of control and observation. In this section we discuss our design for test conditions
– Memory Closedness– Input Completeness– Output Distinguishability
considering that for driving our tests we have available as PCOs not only theinput and output streams but also the input and output ports and the commu-nication matrix. This gives rise to a new technique of obtaining relaxed designfor test conditions, generalising the notion of attainable memory state [9].
4.1 Memory Closedness
The first part of the design for test conditions is the notion of closedness ofthe type Φi, which identifies a memory invariant that is maintained by everyoperation for a suitably restricted set of input conditions.
Definition 5. Let Λi be a CSXMS component. Let Ni ⊆ Mi be a subset of internal memory states and let Si = 〈Sφi | φ ∈ Φpi〉 with Sφi ⊆ Σi and Ti = 〈Tφi | φ ∈ Φpi〉 with Tφi ⊆ INi be two families of stream input symbols and port input values indexed by the processing operations of Λi. Then Φpi is (memory) closed with respect to (Ni, Si, Ti) if

– m0i ∈ Ni
– for all φ ∈ Φpi, m ∈ Ni, σ ∈ Sφi, τ ∈ Tφi ∪ {λ}, γ ∈ Γi, y ∈ OUTi, it is the case that if φ(τ, m, y, σ) = (γ, λ, m′, y′) then m′ ∈ Ni.
Thus Φpi is closed with respect to (Ni, Si, Ti) if the initial memory state is in Ni and every processing operation φ ∈ Φpi keeps the machine in the region Ni provided the stream input is in Sφi and the value of the input port is empty or in Tφi. The set Ni is used to approximate the memory region in which the tester maintains control of all processing operations, at the cost that the inputs through which he drives ICUTi are constrained by (Si, Ti). Thus, from the point of view of testing we would strive to make Ni small and both Si and Ti large. Note that the constraint Si can be accounted for directly at every computation step by adjusting the stream input symbols, while the constraint Ti can only be satisfied indirectly via the communication matrix and the execution of receive operations. Observe that whenever the input port is empty we trigger through the stream input only. If it is not empty, the value must have been entered by some previous receive operation, and at that point the tester that feeds the associated communication cell can make sure this value is in Tφi. Using these existing PCOs generalises previous SXM techniques, which only go through the stream input, and also relaxes the constraints on the programmer to install extraneous input control for testing.
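When Ni, the input constraints and the port value sets are finite (or finitely approximated), the closedness condition of Def. 5 can be checked by brute-force enumeration. The following sketch and its names are ours, not the paper's:

```python
# Brute-force check of memory closedness (Def. 5) over finite approximations.
# phis: {name: partial function (tau, m, y, sigma) -> (gamma, lam, m2, y2)
#        or None where undefined}; lambda is modelled as None.
# N: candidate invariant region of memory states; S, T: {name: allowed inputs};
# outs: finite set of out-port values to test against.

def is_memory_closed(phis, N, S, T, m0, outs):
    """True iff m0 lies in N and no processing operation escapes N under
    the input constraints (S, T)."""
    if m0 not in N:
        return False
    for name, phi in phis.items():
        for m in N:
            for sigma in S[name]:
                for tau in set(T[name]) | {None}:   # empty port also allowed
                    for y in outs:
                        res = phi(tau, m, y, sigma)
                        if res is not None and res[2] not in N:
                            return False            # operation escapes N
    return True
```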
Def. 5 considers processing operations only. We also need to care about communication operations. These never change the internal memory, so they preserve the invariant Ni trivially. This, however, is not enough to guarantee that we can always trigger a communication operation, which also depends on the communication buffers. In fact we need to maintain the right “fill” status of the ports and the communication matrix. To this end we enlarge the invariant Ni to an invariant INVi ⊆ Qi × MEMi on the state of component Λi such that

INVi = (Qci × MEM^{sr}_i) ∪ (Qpi × MEM^{sr}_i) ∪ (Qpi × MEM^{sr̄}_i) ∪ (Qpi × MEM^{s̄r}_i),

where MEM^{sr}_i, MEM^{sr̄}_i, MEM^{s̄r}_i are the states of the memory in which the ports are set so that, respectively, all send and all receive operations (coded sr), all send but no receive operations (sr̄), and no send but all receive operations (s̄r) are enabled, provided the communication matrix is set appropriately. Formally,

MEM^{sr}_i = {λ} × Ni × (OUTi \ {λ})
MEM^{sr̄}_i = (INi \ {λ}) × Ni × (OUTi \ {λ})
MEM^{s̄r}_i = {λ} × Ni × {λ}.

Note that all three sets are disjoint. The states MEM^{s̄r̄}_i = (INi \ {λ}) × Ni × {λ} are not attainable in Λi, and the state changes that are possible are shown in the automaton invariant AINVi of Fig. 3, whose set of control states corresponds precisely to INVi. The automaton AINVi represents an abstraction of the control structure and memory of the CSXM machine model.
Proposition 1. Let Λi be a CSXMS component which is memory closed with respect to (Ni, Si, Ti). Then the initial state (q0i, λ, m0i, λ) is in INVi and every operation of Λi preserves INVi under the input constraint (Si, Ti), i.e., for all φ ∈ Φi and all states (q, x, m, y) ∈ INVi such that x ∈ Tφi ∪ {λ}, and inputs σ ∈ Sφi: if Fi(q, φ) = q′ and φ(x, m, y, C, σ) = (γ, x′, m′, y′, C′) then we have (q′, x′, m′, y′) ∈ INVi for the next state, too. More specifically, processing and communicating operations only change the state according to AINVi.

Fig. 3. The automaton invariant AINVi for (memory-closed) CSXMS components. (The automaton has one control state per reachable region of INVi, distinguished by which ports are empty; its transitions are labelled by Φpi, Φsi and Φri under the input constraints (Si, Ti).)
Proof. We only show that INVi is a state invariant; checking the state transitions of Fig. 3 is trivial. That (q0i, λ, m0i, λ) ∈ INVi follows immediately from m0i ∈ Ni, q0i ∈ Qpi and the definition of INVi. Next observe that due to memory closedness relative to (Ni, Si, Ti) we know that under the input assumptions σ ∈ Sφi and x ∈ Tφi ∪ {λ} the next memory value m′ is in Ni, for all φ. Regarding the other components of the total state we must distinguish the two types of operations, Φpi and Φci, and use the structural constraints on Λi made in Def. 3 as follows: By assumption, every processing operation φ ∈ Φpi clears the input port, x′ = λ, and sets the output port to a defined value, y′ ≠ λ, so the next state (q′, x′, m′, y′) obviously preserves INVi regardless of whether q′ ∈ Qci or q′ ∈ Qpi. Now, let φ ∈ Φci be a communicating operation. Since (q, φ) ∈ dom(Fi), our structural assumptions imply that q ∈ Qci is a communicating state and the next state q′ must be a processing control state, i.e., q′ ∈ Qpi. The former implies x = λ and y ≠ λ by the invariant assumption. Thus, we must necessarily have x′ = y′ = λ when φ ∈ Φsi, or x′ ≠ λ and y′ ≠ λ for φ ∈ Φri. Under no circumstances can we have x′ ≠ λ and y′ = λ. In conjunction with m′ ∈ Ni this suffices to conclude (q′, x′, m′, y′) ∈ INVi as desired. ⊓⊔
At this point it is worth recalling that by the testing hypothesis each ICUTi is modelled by a CSXM Λim. Since every (memory-closed) CSXM is a refinement of AINVi, both the specification Λi and Λim operate according to AINVi. The implication of this is that AINVi gives an upper approximation of the possible test sequences and associated memory values. This is a generalisation of the notion of attainable memory [9], including also the constraints imposed on the structure of the CSXMS W in [9, 13] (i.e. testing variant, simpleness).
The tester can use the information contained in AINVi to optimise tests, in the sense that any test sequence that is not in the language L(AINVi) does not need to be exercised, since it will be neither in L(Λi) nor in L(Λim). To make this more precise, assume that the tester has driven ICUTi through a test sequence ω = φ1φ2···φk which in the specification Λi resulted in a state (q, x, m, y). The tester does not know the control state of ICUTi, but he knows that the memory state must be (x, m, y)². Now suppose the tester needs to verify a send operation φk+1 = sendi→j in ICUTi's current control state. If in the current memory the out-port y is empty, y = λ, the send is certainly not possible in ICUTi. On the other hand, according to AINVi, q is a processing state and the send is not possible in the specification Λi, either. Hence, only for a non-empty out-port does the tester have to enable any sendi→j in ICUTi by choosing a suitable stream input symbol and emptying the communication cell. The same applies to a receive operation φk+1 = receivej→i. Where this receive is blocked because of a full in-port x ≠ λ, the corresponding state q reached by Λi must be a processing state. This means that the receive operation is not possible in the specification either. Thus, the tester needs to switch free receivej→i only where x = λ. But this is easy, for we only need to find a triggering stream input and fill the communication cell.
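The tester's case analysis can be summarised in a small decision sketch (our own illustration, not from the paper; the action strings are informal placeholders):

```python
# Sketch of the tester's decision logic for the next operation in a test
# sequence; the empty-port symbol lambda is modelled as None.

def setup_for(op_kind, x, y):
    """op_kind in {'send', 'receive', 'process'}; (x, y) is the in-/out-port
    state known from running the specification. Returns the tester's setup
    actions, or 'skip' when A_INV shows the operation is impossible in
    specification and implementation alike."""
    if op_kind == "send":
        if y is None:                      # empty out-port: q is a processing
            return "skip"                  # state; send impossible on both sides
        return ["choose stream input", "empty the communication cell C[i, j]"]
    if op_kind == "receive":
        if x is not None:                  # full in-port: likewise impossible
            return "skip"
        return ["choose stream input", "fill the communication cell C[j, i]"]
    # processing operation
    if x is None:
        return ["choose stream input"]     # trigger through the stream alone
    return ["choose stream input", "supply in-port value via preceding receive"]
```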
The situation is different for a processing operation φk+1 ∈ Φpi. Suppose the ICUTi has reached a memory state with x = λ. The invariant automaton AINVi guarantees that such an operation can only ever be enabled (in both implementation and specification) in the initial state, or if the operation φk immediately preceding³ it is a processing operation or a send operation. But then x = λ in all computations that take the ICUTi through the sequence φ1φ2···φk. Thus, any triggering information we get from ICUTi with this particular value x = λ in our particular test situation is fully representative, positively and negatively, of all computations of the same abstract operator sequence. Hence, whenever we find x = λ during the test, we may safely try and trigger a given operation through the stream input alone. On the other hand, if x ≠ λ then from AINVi we can infer that the preceding φk must be a receive operation, in which case we can use this receive to supply φk+1 with a suitable value on the in-port for triggering it.
Now, if the underlying test generation method that operates on the associated control automaton of an SXM (e.g. the W-method) is restricted to the paths of AINVi, then it is possible to guarantee full fault coverage in the fault model of interest (this applies to both positive and negative triggering of functions).
² This is because by the testing hypothesis all functions operating on memory are implemented correctly.
³ The Attainφ sets [9] only capture information about the future of the test sequence as derived from the memory state, viz. that φ can be a successor in the path, while here we need information about the past.
This can easily be done, for example, by filtering out the sequences in χ that are not in the language of AINVi. The invariant automaton, which is part of the testing hypothesis here, can itself be subjected to testing. In this way, testing is factorised into two parts: a test in which ICUTi is verified to comply with the abstraction AINVi, and another in which we verify that ICUTi implements the specification Λi using AINVi as a filter. This suggests a more general methodology of X-machine testing through abstractions, which we leave to future work.
4.2 Input Completeness
Now we can introduce the concept of input completeness, which is to ensure that any operation sequence ω ∈ Φi* (from the test set χ) can be translated into a (non-interactive) sequence of stream and port inputs t(ω) that triggers the operations in ω in the specified order, starting in the initial memory state m0i. To do this we exploit AINVi, which essentially amounts to a generalisation of the notion of φ-attainable memory states introduced in [9]. The set Mattainφ is the set of all memory states in which the specification expects the operation φ to be enabled under some sequence of inputs from the start state. Our invariant yields an upper bound on this set as follows:

Mattainφ ⊆ {(x, m, y) | ∃q. (q, x, m, y, φ) ∈ dom(FINVi)},

where FINVi is the partial state transition function of AINVi. Note that the automaton invariant AINVi is more general than the Mattainφ sets [9], since from the former we also get information about the past, while the latter only capture information about the immediate successor in a test sequence.
The following definition implements the relaxed design for test condition [9]that comes with AINV .
Definition 6. Let Λi be a CSXMS component such that Φpi is memory closed with respect to (Ni, Si, Ti). Then, the type Φi is (input) complete with respect to (Ni, Si, Ti) if the following condition holds: for all φ ∈ Φi and (x, m, y) ∈ Mattainφ:

– if x = λ then there exist σ ∈ Sφi and C ∈ CM such that (λ, m, y, C, σ) ∈ dom(φ);
– if x ≠ λ then there exist σ ∈ Sφi, τ ∈ Tφi and C ∈ CM such that (τ, m, y, C, σ) ∈ dom(φ).
Def. 6 essentially says that as long as we stay within INVi, all operations in Φi can be triggered by suitable stream and port inputs. Note that if the in-port is non-empty (x ≠ λ) we trigger both by the stream input σ ∈ Sφi and the in-port value τ ∈ Tφi. This is possible since by our invariant there must be a receive operation immediately before, through which this value can be supplied. It is worthwhile to analyse some of the formal consequences of input completeness.
First, depending on what operation φ we are looking at, the condition
(τ, m, y, C, σ) ∈ dom(φ) (1)
on the existence of σ, τ, C in Def. 6 has a different operational meaning. For instance, no processing operation depends on the communication matrix, so that any C will do. If φ is a sendi→j operation then the cells C[i′, j] for i′ ≠ i are irrelevant, and for i′ = i condition (1) uniquely determines these cells C[i, j] to be empty. A similar argument applies to receive operations, for which (1) merely says that the input cells C[j, i] must be non-empty. Hence, the existence statement in Def. 6 concerning the communication matrix C is trivial. It does not impose any constraint on the CSXM component, but only on the test environment to fill the input cells C[j, i] (with any value) and clear the output cells C[i, j] at the right times.
Proposition 2. Let Λi be a CSXMS component such that Φpi is memory closed with respect to (Ni, Si, Ti). Then, the type Φi is input complete with respect to (Ni, Si, Ti) if there exist choice functions as follows:

– For every φ ∈ Φpi there is a function choiceφi : Ni × OUTi → Sφi × Sφi × Tφi such that
(λ, m, y, π1 choiceφi(m, y)) ∈ dom(φ) and
(π3 choiceφi(m, y), m, y, π2 choiceφi(m, y)) ∈ dom(φ)
– For all φ ∈ Φci there is a function choiceφi : Ni → Σi such that
• if φ = sendi→j and C[i, j] = λ ≠ y then (y, C, choiceφi(m)) ∈ dom(φ)
• if φ = receivej→i and C[j, i] ≠ λ then (λ, C, choiceφi(m)) ∈ dom(φ),
where m ∈ Ni, y ∈ OUTi, C ∈ CM.
Proof. Consider the case φ ∈ Φpi. We argue that input completeness implies the existence of a choice function as specified in the statement of Prop. 2. Specialising Def. 6 to such φ ∈ Φpi gives

∀(q, x, m, y, φ) ∈ dom(FINVi).
x ≠ λ ⇒ ∃σ ∈ Sφi, τ ∈ Tφi, C ∈ CM. (τ, m, y, C, σ) ∈ dom(φ) and (2)
x = λ ⇒ ∃σ ∈ Sφi, C ∈ CM. (λ, m, y, C, σ) ∈ dom(φ). (3)

What does this imply? Since no processing operation depends on the communication matrix, we know that if we can find a tuple (τ, m, y, C, σ) ∈ dom(φ) then in fact any C will do. Thus, the existence of C in both cases presents no restrictions. Secondly, whether or not a particular (τ, m, y, C, σ) or (λ, m, y, C, σ) is in dom(φ) does not depend on the choice of q, x. Hence, if we have constructed one solution for some given (q, x, m, y) we can use the same σ, τ, C for all (q′, x′, m, y) with the same m and y. Further, observe that for every processing operation φ ∈ Φpi and (m, y) ∈ Ni × OUTi there are (q, x) such that (q, x, m, y, φ) ∈ dom(FINVi). Thus, the implication of (2) is that for all φ ∈ Φpi there exists a function choiceφi : Ni × OUTi → Sφi × Tφi such that (removing C ∈ CM from the condition, which is trivial) (π2 choiceφi(m, y), m, y, π1 choiceφi(m, y)) ∈ dom(φ), for all (m, y) ∈ Ni × OUTi. Similarly, the import of (3) is a choice function choiceφi : Ni × OUTi → Sφi such that (λ, m, y, choiceφi(m, y)) ∈ dom(φ) for all (m, y) ∈ Ni × OUTi. In the statement of Prop. 2 we have packed both functions into one in the obvious way.
Further note that whenever we consider a communicating operation φ ∈ Φci to be triggered in a control state q, then because of the automaton invariant AINVi we may conclude that q ∈ Qci, x = λ and y ≠ λ. In other words, the ports are ready and set so that the triggering of φ can only depend on the stream input symbol σ and the communication cells being prepared accordingly. From this the existence of the choice functions as required follows easily.

We omit the other direction of the proof, that the existence of choice functions implies input completeness. ⊓⊔
4.3 Output Distinguishability
At testing time we must be able to verify, from the output produced, that ICUT_i executes the right functions as specified by the test sequence. That these functions implement the correct transformations on the data path is tested independently and assumed by way of the testing hypothesis. The final element of the design for test conditions is therefore output distinguishability, which guarantees that all functions in Φ_i can be identified uniquely from their output.
Definition 7. Let Λ_i be a CSXMS component such that Φ^p_i is memory closed with respect to (N_i, S_i, T_i). The type Φ_i is output-distinguishable with respect to (N_i, S_i, T_i) if for all φ1, φ2 ∈ Φ_i, m, m1, m2 ∈ N_i, y ∈ OUT_i, τ ∈ IN_i, γ ∈ Γ_i, if it is the case that

– (σ, τ) ∈ S^φ1_i × (T^φ1_i ∪ {λ}) and (τ, m, y, σ) ∈ dom(φ2), or
– (σ, τ) ∈ S^φ2_i × (T^φ2_i ∪ {λ}) and (τ, m, y, σ) ∈ dom(φ1)

and both φ1(τ, m, y, σ) = (γ, τ, m1, y) and φ2(τ, m, y, σ) = (γ, τ, m2, y), then φ1 = φ2 and m1 = m2.
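Over finite enumerations, the condition of Def. 7 can be checked mechanically: collect the inputs shared by two operations and demand that identical observable outputs force the operations, and their memory updates, to coincide. A hypothetical Python sketch (the dict encoding of dom(φ) and all the data are ours, not the paper's):

```python
# Illustrative check of output distinguishability over finite enumerations.
# Each operation is a dict from inputs (tau, m, y, sigma) to outputs
# (gamma, x', m', y'); its key set plays the role of dom(phi).

def output_distinguishable(ops):
    """True iff no two distinct operations agree on some shared input
    and on the observable part (gamma, x', y') of the output."""
    for name1, phi1 in ops.items():
        for name2, phi2 in ops.items():
            for key in set(phi1) & set(phi2):
                g1, x1, m1, y1 = phi1[key]
                g2, x2, m2, y2 = phi2[key]
                if (g1, x1, y1) == (g2, x2, y2):
                    # Same observable output forces phi1 = phi2 and m1 = m2.
                    if name1 != name2 or m1 != m2:
                        return False
    return True

# phi_a and phi_b produce the same output but different memory updates,
# so they cannot be told apart at testing time; phi_c differs in gamma.
phi_a = {("t", "m", "y", "s"): ("g", "t", "m1", "y")}
phi_b = {("t", "m", "y", "s"): ("g", "t", "m2", "y")}
phi_c = {("t", "m", "y", "s"): ("h", "t", "m2", "y")}
```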
In nondeterministic systems, output distinguishability is also needed to control the test sequence, since the actual memory state can only be determined at testing time. In the deterministic case we only need it to determine success or failure at testing time.
4.4 Testing Hypotheses
To conclude this section, let us establish the fault domain of the fault model under consideration by means of the following testing hypotheses.
– The IUT (derived from CSXMS specification W) is modelled by a CSXMS W_m such that W_m ≅_id W.
– Every Λ_im from W_m and Λ_i from W are refinements of the automaton invariant AINV_i of Fig. 3.
– Every Λ_im from W_m and Λ_i from W have the same set Φ_i, which is memory closed, input-complete and output-distinguishable.
– Let Q_i and Q_im be respectively the sets of control-states in Λ_i and Λ_im. There is a known integer k such that card(Q_im) − card(Q_i) ≤ k.
– Λ_im and Λ_i are deterministic and their corresponding associated automata are minimal.
– Λ_i is reachable.
5 Fundamental Test Function
Now let us see how the design for test conditions ensures the translation from ω = φ1 φ2 · · · φk ∈ χ into a concrete sequence of external inputs to drive a purported implementation Λ_im of Λ_is along the transitions φ_i in the total state space. It is assumed by way of the testing hypotheses that both machines have exactly the same type Φ_i of operations over the same total data path. Our goal is to test whether ω ∈ χ is in the language of the associated automaton L(Λ_im). As usual, we assume that A(Λ_is) is a minimal (deterministic) control automaton.
In our test architecture the test object Λ_im is embedded in a system of testers tester_j (j ≠ i) connected to the communication matrix C. They are designed to be ready to read out any value that happens to be put into C[i, j] by Λ_im and also to keep cell C[j, i] filled with a defined value of our choice at all times. Let us denote this CSXMS by W^T_i. In this way we may assume, without loss of generality, that the communication matrix is always in the form required for enabling send and receive operations. The input alphabet of tester_j is Σ^tester_j = A ∪ B ∪ {−}, where A = {(?, z) | z ∈ OUT_i} and B = {(!, w) | w ∈ IN_i}. Intuitively, the symbol (?, z) should be interpreted by tester_j as the instruction "receive z from ICUT_i" and the symbol (!, w) should be interpreted as "send w to ICUT_i". The symbol − is used by the tester to complete the operations. The output alphabet of tester_j is Γ^tester_j = A ∪ B ∪ {−, 1, 2}.
Tester
Consider tester_j in Fig. 4. For every 1 ≤ i ≤ n, there are three sequences that can take place in the tester. The first, is! · send_{j→i}, occurs when a symbol of the form (!, w) is read from the input stream. The operation is! places the same symbol (!, w) in the output stream and updates the output port with y = w; afterwards send_{j→i} is performed, which effectively sends this w to the ICUT_i, and the machine returns to the initial control-state. The other two possible sequences are is? · receive_{i→j} · right and is? · receive_{i→j} · wrong. Both of them start when a symbol of the form (?, z) is read from the input stream. The operation is? writes the same symbol (?, z) in the output stream, stores the value z in the memory (i.e. m = z) and empties the input port. Then receive_{i→j} takes place, receiving a value, say z′, from the ICUT_i (i.e. x = z′). Finally, there are two possibilities: either the processing operation right or wrong is executed when the symbol − is taken from the input stream. The former occurs when the value received from ICUT_i is the same as the value stored in memory (i.e. m = x)
is!(x, m, y, (!, w)) = ((!, w), λ, m, w)
is?(x, m, y, (?, z)) = ((?, z), λ, z, y)
right(x, m, y, −) = (−, λ, m, y), if x = m
wrong(x, m, y, −) = ('err', λ, m, y), if x ≠ m
send_{j→i}(λ, m, y, C, −) = (1, λ, m, λ, C[j, i] ← y)
receive_{i→j}(λ, m, y, C, −) = (2, C[i, j], m, y, C[i, j] ← λ)

Fig. 4. The canonical tester in position j (tester_j)
and the machine returns to the original initial control-state. The latter occurs when m ≠ x, and in this case the machine goes to a terminal state which is not the initial one.
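The tester's cycle can be mimicked in a few lines of Python. This is only an illustration of the behaviour just described, not the formal machine of Fig. 4: the communication cells are plain lists, and a pending-operation flag stands in for the control states.

```python
# Illustrative sketch of tester_j's behaviour: (!, w) latches w for sending,
# (?, z) records the expected value, and each '-' completes the pending step
# (send, receive, or the right/wrong comparison).

def run_tester(inputs, channel_out, channel_in):
    """channel_out models cell C[j, i] written by the tester;
    channel_in models values the ICUT places in C[i, j]."""
    memory = port = pending = None
    verdicts = []
    for sym in inputs:
        if sym == "-":                       # '-' completes the pending step
            if pending == "send":
                channel_out.append(port)     # send_{j->i}: write C[j, i]
                pending = None
            elif pending == "recv":
                port = channel_in.pop(0)     # receive_{i->j}: read C[i, j]
                pending = "check"
            elif pending == "check":         # right / wrong comparison
                verdicts.append("right" if port == memory else "err")
                pending = None
        elif sym[0] == "!":                  # is!: latch w for sending
            port, pending = sym[1], "send"
        elif sym[0] == "?":                  # is?: remember expected z
            memory, pending = sym[1], "recv"
    return verdicts

out = []
verdicts = run_tester([("!", 5), "-", ("?", 7), "-", "-"], out, [7])
```

A mismatch between the received value and the remembered one yields the 'err' verdict, mirroring the terminal wrong state.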
For a given component Λ_i, let Σ^p_i = Σ_i × IN_i, Σ^r_i = Σ_i × {!j | 1 ≤ j ≤ n} and Σ^s_i = Σ_i × {?j | 1 ≤ j ≤ n} × OUT_i.
Definition 8. A function t_{x,m,y} : Φ*_i → (Σ^p_i ∪ Σ^r_i ∪ Σ^s_i)* can be defined by recursion as follows:

t_{x,m,y}(λ) = λ, where λ is the empty sequence.

For φ ∈ Φ^p_i and φ* ∈ Φ*_i, if choice^φ_i(m, y) = (σ1, σ2, x1) then
t_{λ,m,y}(φ · φ*) = (σ1, x1) · t_{x′,m′,y′}(φ*), where φ(λ, m, y, σ1) = (γ, x′, m′, y′), and
t_{x,m,y}(φ · φ*) = (σ2, x1) · t_{x′,m′,y′}(φ*), where φ(x1, m, y, σ2) = (γ, x′, m′, y′).

For φ = send_{i→j} ∈ Φ^s_i and φ* ∈ Φ*_i,
t_{x,m,y}(φ · φ*) = (choice^φ_i(m), ?j, y) · t_{x,m,λ}(φ*).

For φ = receive_{j→i} ∈ Φ^r_i and φ* ∈ Φ*_i,
t_{x,m,y}(φ · φ*) = (choice^φ_i(m), !j) · t_{x,m,y}(φ*).

Henceforth, t_{m,y} with m = m0_i and y = λ is called the fundamental test function of Λ_i and is simply denoted by t.
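The processing-operation case of Def. 8 is an ordinary structural recursion and can be sketched as follows. Everything here is invented for illustration — the string "lam" encoding the empty symbol λ, the toy operation and its choice entries — and the send/receive cases are omitted.

```python
# Illustrative sketch of t for processing operations only (cf. Def. 8).
# "lam" stands in for the empty symbol; apply_op evaluates phi.

def t(seq, x, m, y, choice, apply_op):
    """Translate an abstract operation sequence into concrete input symbols."""
    if not seq:
        return []
    phi, rest = seq[0], seq[1:]
    sigma1, sigma2, x1 = choice[phi][(m, y)]
    if x == "lam":                        # empty input port: use sigma1
        gamma, x2, m2, y2 = apply_op(phi, "lam", m, y, sigma1)
        sym = (sigma1, x1)
    else:                                 # port already holds x1: use sigma2
        gamma, x2, m2, y2 = apply_op(phi, x1, m, y, sigma2)
        sym = (sigma2, x1)
    return [sym] + t(rest, x2, m2, y2, choice, apply_op)

# Toy operation "inc": increments the memory, leaves the port empty.
def apply_op(phi, x, m, y, sigma):
    return ("g", "lam", m + 1, y)

choice = {"inc": {(0, "y"): ("s1", "s2", "x1"),
                  (1, "y"): ("s1", "s2", "x1")}}
seq = t(["inc", "inc"], "lam", 0, "y", choice, apply_op)
```

Note how the recursion threads the updated (x, m, y) through, exactly as the subscripts of t do in the definition.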
A remote test case (RTC) is a tuple (s1, . . . , sn), where s_i ∈ Σ*_i is a local test sequence (LOC) for the ICUT_i and, for all 1 ≤ j ≤ n with j ≠ i, the sequence s_j ∈ Σ^tester*_j is a LOC for tester_j.
Definition 9. A function p : (Σ^p_i ∪ Σ^r_i ∪ Σ^s_i)* → Σ*_i for computing the LOC for ICUT_i can be defined as follows:

p(λ) = λ.
p((σ, x) · ψ*) = σ · p(ψ*).
p((σ, ?j, y) · ψ*) = σ · p(ψ*).
p((σ, !j) · ψ*) = σ · p(ψ*).
Definition 10. A function g_j : (Σ^p_i ∪ Σ^r_i ∪ Σ^s_i)* → Σ^tester*_j for computing the LOC for tester_j can be defined as follows:

g_j(λ) = λ.
g_j((σ, x) · ψ*) = g_j(ψ*).
g_j((σ, ?k, y) · ψ*) = (?, y) · − · − · g_j(ψ*) if k = j; g_j(ψ*) otherwise.
g_j((σ, !k) · ψ*) = (!, x) · − · g_j(ψ′*) if (k = j) ∧ (ψ* = (σ, x) · ψ′*); (!, τ) · − · g_j(ψ*) else if k = j; g_j(ψ*) otherwise,

where τ ≠ λ is an arbitrary value in IN_i.
In this form, for every abstract test sequence φ* ∈ χ, the sequence s_i = p(t(φ*)) is the LOC for the ICUT_i and, for every tester_j, the sequence s_j = g_j(t(φ*)) corresponds to its input LOC. Finally, s = (s1, . . . , sn) is the RTC for the testing system.
Let us denote by W_m ≃′ W the verdict provided by the original method, and let Λ_im ≃ Λ_i be the verdict given by our unit testing approach. It can be shown that (∀ 1 ≤ i ≤ n. Λ_im ≃ Λ_i) ⇒ W_m ≃′ W.
6 Analysis
The approach just described has two main advantages with respect to the original approach. The first is related to complexity, and the second has to do with the possibility of employing different reliability parameters in different parts of the system. Let us discuss these aspects in more detail.
6.1 Complexity
The complexity of a testing process can be split into two parts, namely the complexity of generating the test suite and the complexity of executing this test suite. Let us denote the former by Og and the latter by Oe. In order to compare the approach suggested in this paper and the original approach on their own terms, let us abstract away the complexity of the underlying method employed over the associated automaton (W-method or Wp-method). In this form, we can say that for the underlying method Og = f(card(Q), k) and Oe = f′(card(Φ), k), where f and f′ are arbitrary functions and k is the reliability parameter used to estimate the difference in control-states between the implementation and the specification. In the original approach, from the form in which the SXM is obtained [9], it is not hard to see that card(Q) = O(card(Q′)^n), where Q′ corresponds to the set of control-states of the component with the greatest number of control-states and n is the number of system components. Similarly, card(Φ) = O(card(Φ′)^n), where Φ′ is the type of the component with highest card(Φ′). The complexities of the original approach are thus Og = f(card(Q′)^n, k) and Oe = f′(card(Φ′)^n, k). For the approach discussed here, the SXM testing method is applied n times (once for each component) independently. Thus, Og = n · f(card(Q′), k′) and Oe = n · f′(card(Φ′), k′), where Q′ and Φ′ are as before and k′ is the largest k used for testing a particular component. If on top of this we consider the testing of the architectural isomorphism, it is easy to see that the complexity of the distributed algorithm is polynomial in terms of communication C (i.e. number of messages) and time T (i.e. communicating operations executed). To be more precise, C = T = O(n²), where n is the number of system components.
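The gap between the two Og expressions is essentially card(Q′)^n versus n · card(Q′). A toy Python calculation makes the orders of growth concrete (the numbers are purely illustrative, and f is taken as the identity on the state count):

```python
# Toy comparison of the state counts that drive test-generation cost:
# product machine ~ card(Q')^n states, componentwise ~ n machines of card(Q').

def product_states(q, n):
    return q ** n          # O(card(Q')^n) control-states in the product SXM

def componentwise_states(q, n):
    return n * q           # n independent applications to card(Q')-state machines

# Example: components with 10 control-states each.
sizes = {n: (product_states(10, n), componentwise_states(10, n))
         for n in (2, 4, 8)}
```

Already at n = 4 components of 10 states, the product machine has 10,000 control-states against 40 handled componentwise.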
Thus, the approach of this paper has much better complexity than the product machine approach. Moreover, observe that the pre-processing cost involved in constructing the product machine has not been taken into account in this analysis. In other words, even assuming that the product SXM can be obtained with a reasonable amount of resources (i.e. in polynomial time), the approach of this paper is superior. Note that this holds independently of the underlying method (e.g. the W-method), which in turn implies that any improvement in the underlying method will not close the gap for the product approach. Finally, because in unit testing every component is tested independently, a high degree of parallelism is possible in the process. If the testing of all the components is executed simultaneously, this results in Og = f(card(Q′), k′) and Oe = f′(card(Φ′), k′). To put it simply, in the same order of time that it takes to test a SXM with card(Q′) control-states and card(Φ′) functions, a complete CSXMS with n components of card(Q′) control-states and card(Φ′) functions each can be tested.
6.2 The value of parameter k
Different values of the parameter k (i.e. the estimated difference in control-states) do not imply any difference in the detection ability of the approaches. In fact, for any method compared with itself, the use of distinct
values of k may produce a different verdict. This does not imply that, in order to obtain this k, it is necessary to compute the global SXM in the components approach. Although from this SXM the precise number of the system's states in the specification can be obtained, the number of states in the implementation is still an estimate, and hence k is an estimated value. Moreover, if the approach suggested here is employed, then the only reason for constructing the SXM would be to obtain an accurate value of card(Q) (the number of control-states in the whole system). However, the construction of this machine demands a considerable amount of resources and, therefore, a more sensible approach would rather be to spend such resources on testing with greater values of k.
In practice, the value of k represents a coefficient of reliability between the specification and the IUT. If the value of k is increased, the test set will include longer searching sequences. Thus, more faults are found if they are there; otherwise more resources (time) are used to obtain the same verdict. The designers, programmers, testers, etc. must decide how confident they are about the implementation, or how critical it is, and from that set the value of k.
There is, however, an issue that deserves special consideration, arising from the fact that in the components approach the components are tested separately. Since the number of states in every specified component v is known, different values k_v can be employed for testing different parts of the system (different implemented components). In this form, instead of having a flat reliability parameter, the testing can be adjusted in such a way that the effort is emphasised or relaxed depending on the level of confidence required in each part of the whole system.
7 Related Work
Research has been abundant on test generation based on FSM (e.g. [2, 19]). A less common model employed in testing is the input/output transition system (IOTS), which usually models transition systems with concurrent input/output behaviour. In other words, pairs of inputs/outputs are not considered to be atomic actions (i.e. the next input can arrive before the previous output is produced). This concurrent input/output behaviour is similar to that of the CSXMS components with respect to communicating operations. A framework for testing IOTS through queues is introduced in [20]. In this framework, the tester and the system under test are two input-enabled message-passing communicating systems. Thus, both inputs and outputs are buffered using queues between them, which are assumed to be bounded. In this form the system does not block inputs from the tester, and this does not prevent the system from producing outputs. In the CSXMS model this is not necessarily the case, since the communication matrix (which can be considered a set of queues of size one) together with the communication mechanisms defined may prevent a component from executing a communicating operation because the complementary operation has not been executed by the corresponding component. In the
IOTS testing, the tester is split into two processes, an input-test process and an output-test process. The former applies inputs to the system while the latter receives outputs from the system until no more outputs are detected. In a queued-quiescence tester there is just one input-test process and one output-test process; a queued-suspension tester consists of several such input-test and output-test processes. The implementation relations that can be tested are based on traces (e.g. queued-quiescence trace-equivalence, queued-suspension trace-equivalence), and the queued-suspension tester has more fault detection ability than the queued-quiescence tester. Although this approach is architecturally similar to the one suggested in this paper, in the sense of having a number of testers for testing a component, they differ basically in their aim. Testing IOTS as in [20] checks for quiescence in the system, while testing components of CSXMS verifies functional equivalence. Moreover, in the queued-suspension technique the testers are employed to detect intermediate quiescent states in the system, so an input sequence is decomposed across several testers in order to achieve this. On the other hand, in the approach presented in this paper, an input testing sequence derived from a component is decomposed in such a way that the testers emulate the expected behaviour of the other system components with respect to the ICUT.
In practice, when a distributed system is specified, some form of communication is required amongst the several units or components of the system. Moreover, non-trivial distributed specifications (including protocols) usually require variables and operations based on variable values. Hence, the specification is normally divided into its control and data parts. This has led to the development of extensions of the finite state machine (FSM) model powerful enough to represent these systems concisely. For instance, for the communication aspects the communicating finite state machine (CFSM) can be employed, while for the data part extended finite state machines (EFSM) are applicable. Furthermore, the communicating extended finite state machine (CEFSM), as in the CSXMS case, includes both aspects, communication and data.
In the EFSM model, approaches like the W-method can be used to analyse the control flow of the specification, but this leaves the data part (of the distributed system) untested. The data part can be tested using a data flow approach. Data flow testing attempts to check the effects of data objects in software systems. In general, data flow testing is based on a data flow digraph with the nodes representing the functional units of the system and the edges representing the flow of data objects. A number of methods have been proposed in the literature for testing EFSM using control flow and/or data flow techniques (e.g. [21–24]). However, these are applicable only to stand-alone EFSM. There is another issue that requires attention: given an EFSM, if each variable has a finite domain of values, then there is a finite number of configurations (combinations of state and variable value), and an equivalent FSM with configurations as states can be obtained. In this form, testing EFSM reduces to testing FSM.
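The configuration construction just mentioned is easy to sketch: when every variable ranges over a finite domain, pair each control state with each variable value and lift the transitions. A hypothetical Python example with a single counter variable (the toy EFSM is invented for the illustration):

```python
# Illustrative expansion of an EFSM with one finite-domain variable into an
# FSM whose states are configurations (control-state, variable value).
from itertools import product

def expand(states, var_domain, trans):
    """trans maps (state, value, input) -> (state', value'); the result maps
    (configuration, input) -> configuration in the equivalent FSM."""
    fsm = {}
    for (s, v) in product(states, var_domain):
        for (s0, v0, a), (s1, v1) in trans.items():
            if (s0, v0) == (s, v):
                fsm[((s, v), a)] = (s1, v1)
    return fsm

# Toy EFSM: one control state, a saturating counter over {0, 1, 2}.
trans = {("q", v, "inc"): ("q", min(v + 1, 2)) for v in (0, 1, 2)}
fsm = expand(["q"], [0, 1, 2], trans)
```

The resulting FSM has card(states) × card(var_domain) configuration states, which is exactly why this reduction, though correct, blows up quickly.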
On the other hand, in a CFSM each component (unit) is an FSM that is able to communicate. The simplest approach for testing a system of CFSM consists of composing all the units into one machine, an FSM that models the complete system behaviour, and then generating the test cases from it (using known methods). The trouble is that this approach leads to an explosion in the number of states. To cope with this, methods for reducing the reachability analysis have been suggested for CFSM (e.g. [25, 26]). The idea is that of constructing a smaller graph representation of partial behaviour of the system that allows us to study communication properties. Alternatively, there are also heuristic procedures for test generation for CFSM such as random walk, where the selection of the next input is done randomly and on the fly. However, this may trap the test in a portion of the system behaviour [27]; to cope with this, the guided random walk technique has been suggested in [28], where instead of choosing the next input randomly, transitions with higher priority are favoured. Conversely, the notion of semi-independence for CFSM has been suggested in [29] as a means of avoiding concurrent reachability analysis altogether, by identifying and testing separately the communicating transitions (CTs) that can affect the behaviour of other machines. In this manner, it is possible (under certain conditions, namely semi-independence) to test a non-communicating transition (NCT) t by placing the machine in the source state of t (using only NCTs) and then executing t followed by a unique input/output sequence (UIO) that allows us to identify the target state of t. For testing a CT c, it is necessary to test the final state of all the machines involved. But it is possible to set up the other machines for the execution of c (using only NCTs). Then, if there is no feedback (a feedback transition being one that leads to a sequence of transitions including at least one more transition in the same machine), the output can be verified and the final state of all the machines affected by c can be checked using only UIOs. The test effort is thus reduced by finding a minimal-cost set of sequences that covers all CTs. If there are few feedback transitions, the approach is that of finding the shortest sequence of transitions that tests each of them. Otherwise, a large number of feedback transitions suggests a high degree of dependency between the components, and it is necessary to apply other approaches (e.g. [30]). Since in a CSXMS a communicating operation always moves the control into a processing state, there are no transitions with feedback, which thus avoids this problem.
For the CEFSM, only a small amount of work related to testing has been reported. Most existing methods deal only with CFSM, where the data part is not considered. As in the case of CFSM or EFSM, constructing a product machine from the system of CEFSM runs into the state explosion problem. There is, however, an alternative, namely testing CEFSM in context [31]. The framework for testing CFSM in context is due to Petrenko et al. [32, 30]. The intuition is as follows. When testing a communicating system, it could be the case that some of its components have already been tested or are not critical to the system. Hence, the distributed implementation under test can be viewed as composed of two parts. The component that requires testing is known as the embedded component, and the other components are assumed to be fault-free. Testing a
component embedded into a communicating system is known as embedded testing or testing in context. The goal is to test whether an implementation of a component conforms to its specification in the context of the other components, and it is assumed that the tester does not have direct access to the component under test (ICUT). Several approaches to testing in context have been presented in the literature; they are based on fault models [32], on reducing the problem to testing of components in isolation [30], on test suite minimisation [28, 31, 33], on fault coverage [34] and on testing the system with uncontrollable (or semi-controllable) interfaces [35]. Intuitively, the idea of these approaches is that a test case can be removed from a test suite if the test case concerns only the context or if there is another test case that detects the same fault in the embedded component. However, most of these approaches resort to reachability graphs to model the joint behaviour of all the system components, and again this exposes them to the state explosion problem. In [36], a solution called Hit-or-Jump has been suggested. This technique is a unification of exhaustive search and random walks.
The technique for testing CEFSM in context is similar to the one described above [31]. A partial product machine is obtained and then the EFSM Test Generator algorithm [21] (an algorithm that generates test cases for EFSM covering both control and data flow) is employed. In this guided procedure, each CEFSM is tested in isolation first, and then the global system is tested as follows: (1) a partial product CEFSM is obtained and (2) the test cases for it are generated. The test cases obtained when each CEFSM was tested in isolation are used as a guide to compute the partial product and to generate the test cases for it.
The approach suggested in this paper for CSXMS does not correspond to the notion of testing in context discussed above. This is because our objective has not been to test conformance of a component in the context of the other system components, but instead to test conformance of a component in all possible contexts. In this sense our unit testing is more conservative than both the product approach and testing in context: it provides strong guarantees of correctness of components. Observe that the assumption that the ICUT can be manipulated does not imply that the communication interfaces (i.e. the ports) can be accessed directly from outside the system (the environment), but only the environment interface (i.e. the streams) of the components. Thus, in order to access these communication interfaces we use the other components as testers.
8 Conclusions
The essence of the conformance testing paradigm is given by two factors: (i) the testing hypothesis and (ii) the assumption of a correct specification. The former can be summarised as follows: an abstraction from the concrete, non-formal implementation can be made such that the particularities of interest (e.g. behaviour) are formally specified in a model of the implementation. Nevertheless, it
is known that the generation of the test suite is strongly influenced by the specification model employed. In this research, the CSXMS formalism has been used.
This paper takes first steps towards a unit testing method for CSXMS, based on generating test sequences from the components of the specification. These sequences are exercised on each of the implemented components independently through a distributed test system, which can be the same system under test, but in which the components include some extra functionality to act as testers.
The justification for this approach for the model under consideration is that the product machine approach is, in general, prohibitively expensive. The usual practice is to develop techniques for reducing the size of this model. This implies that both a transformational process (transforming a distributed specification into a stand-alone specification) and a condensational process (finding an equivalent reduced specification) are required. The question that naturally arises is: why is the specification distributed in the first place? In other words, is the implemented system also distributed? If so, intuition suggests that, if there is a correspondence between the components of the specification and those of the implementation, then the whole process is overloaded. This claim is supported by the observation that at the beginning a stand-alone form is obtained by transforming and reducing an original distributed specification; then the test cases generated from it are applied "backwards" to a distributed model of the implementation. Why exercise all this transformational-condensational machinery in order to use a stand-alone testing technique when both specification and implementation are distributed and the components of both correspond? This paper suggests that the complexity can be improved if the testing process is kept distributed, and it investigates to what extent this method requires distribution. This view holds that, under some well-defined assumptions (testing conditions, testing hypotheses), some distributed testing tasks can be achieved in a local manner with no degradation in their accuracy with respect to a global version of the same.
9 Acknowledgments
We would like to thank one of our anonymous reviewers for astute questions and constructive suggestions for improving the paper.
References
1. Laycock, G.: The Theory and Practice of Specification-Based Software Testing. PhD thesis, University of Sheffield, U.K. (1993)
2. von Bochmann, G., Petrenko, A.: Protocol Testing: Review of Methods and Relevance for Software Testing. In: Proceedings of the International Symposium on Software Testing and Analysis. (1994) 109–124
3. Heerink, A.: Ins and Outs in Refusal Testing. PhD thesis, University of Twente, The Netherlands (1998)
4. International Organization for Standardization: Information Technology – Open Systems Interconnection – Conformance Testing Methodology and Framework. Parts 1–2, 4–7, IS 9646, ISO/IEC (1994)
5. Chow, T.: Testing Software Design Modeled by Finite-State Machines. IEEE Transactions on Software Engineering 4 (1978) 178–187
6. Fujiwara, S., von Bochmann, G., Khendek, F., Amalou, M., Ghedamsi, A.: Test Selection Based on Finite State Models. IEEE Transactions on Software Engineering 17 (1991) 591–603
7. Holcombe, M., Ipate, F.: Correct Systems: Building a Business Process Solution. Springer Verlag, Berlin (1998)
8. Ipate, F., Holcombe, M.: An Integration Testing Method Which is Proved to Find All Faults. International Journal of Computer Mathematics 63 (1997) 159–178
9. Ipate, F., Holcombe, M.: Testing Conditions for Communicating Stream X-Machine Systems. Formal Aspects of Computing 13 (2002) 431–446
10. Ipate, F., Holcombe, M.: Generating Test Sets from Non-Deterministic Stream X-Machines. Formal Aspects of Computing 12 (2000)
11. Hierons, R.M., Harman, M.: Testing Conformance to a Quasi-Non-Deterministic Stream X-Machine. Formal Aspects of Computing 12 (2000) 423–442
12. Hierons, R.M., Harman, M.: Testing Conformance of a Deterministic Implementation Against a Non-Deterministic Stream X-Machine. Theoretical Computer Science 4 (2004) 191–233
13. Balanescu, T., Cowling, A., Georgescu, H., Gheorghe, M., Holcombe, M., Vertan, C.: Communicating Stream X-Machines Systems are no more than X-Machines. 5 (1999) 494–507
14. Institute of Electrical and Electronics Engineers: IEEE Standard Computer Dictionary: A Compilation of IEEE Standard Computer Glossaries. Technical report, IEEE (1990)
15. Sarikaya, B.: Principles of Protocol Engineering and Conformance Testing. Ellis Horwood Series in Computer Communications and Networking (1993)
16. Walter, T., Grabowski, J.: Framework for the Specification of Test Cases for Real-Time Distributed Systems. Information and Software Technology 41 (1999) 781–789
17. Cacciari, L., Rafiq, O.: Controllability and Observability in Distributed Testing. Information and Software Technology 41 (1999) 767–780
18. Aguado, J.: Conformance Testing of Distributed Systems: an X-machine Based Approach. PhD thesis, University of Sheffield, U.K. (2004)
19. Petrenko, A.: Fault Model-Driven Test Derivation from Finite State Models: Annotated Bibliography. In: Modeling and Verification of Parallel Processes, 4th Summer School, MOVEP. (2000) 196–205
20. Petrenko, A., Yevtushenko, N.: Queued Testing of Transition Systems with Inputs and Outputs. In: FATES. (2002)
21. Bourhfir, C., Dssouli, R., Aboulhamid, E., Rico, N.: Automatic Executable Test Case Generation for Extended Finite State Machine Protocols. In: IWTCS. (1997) 75–90
22. Chanson, S., Zhu, J.: A Unified Approach to Protocol Test Sequence Generation. In: INFOCOM. (1993) 106–114
23. Huang, C., Y. Lin, M.J.: Executable Data Flow and Control Flow Protocol Test Sequence Generation for EFSM-Specified Protocols. In: IWPTS. (1995)
24. Ural, H., Yang, B.: A Test Sequence Selection Method for Protocol Testing. IEEE Transactions on Communications 39 (1991) 514–523
25. Rubin, J., West, C.: An Improved Protocol Validation Technique. Computer Networks 6 (1982) 65–73
26. Gouda, M., Yu, Y.: Protocol Validation by Maximal Progress State Exploration. IEEE Transactions on Communications 32 (1984) 94–97
27. Aleliunas, R., Karp, R., Lipton, R., Lovasz, L., Rackoff, C.: Random Walks, Universal Traversal Sequences, and the Complexity of Maze Problems. In: Proc. of the 20th Annual Symposium on Foundations of Computer Science. (1979) 218–223
28. Lee, D., Sabnani, K., Kristol, D., Paul, S.: Conformance Testing of Protocols Specified as Communicating Finite State Machines - a Guided Random Walk Based Approach. IEEE Transactions on Communications 44 (1996) 631–640
29. Hierons, R.: Testing from Semi-independent Communicating Finite State Machines with a Slow Environment. IEE Proceedings on Software Engineering 144 (1997) 291–295
30. Petrenko, A., Yevtushenko, N., von Bochmann, G., Dssouli, R.: Testing in Context: Framework and Test Derivation. Computer Communications (special issue on Protocol Engineering) 19 (1996) 1236–1249
31. Bourhfir, C., Dssouli, R., Aboulhamid, E., Rico, N.: A Guided Incremental Test Case Generation Procedure for Conformance Testing for CEFSM Specified Protocols. In: IWTCS. (1998) 275–290
32. Petrenko, A., Yevtushenko, N., von Bochmann, G.: Fault Models for Testing in Context. In: FORTE. (1996) 163–178
33. Yevtushenko, N., Cavalli, A., Lima, L.: Test Suite Minimization for Testing in Context. In: IWTCS. (1998)
34. Zhu, J., Vuong, S., Chanson, S.: Evaluation of Test Coverage for Embedded System Testing. In: IWTCS. (1998) 111–126
35. Fecko, M., Uyar, M., Sethi, A., Amer, P.: Issues in Conformance Testing: Multiple Semicontrollable Interfaces. In: FORTE. (1998) 111–126
36. Cavalli, A., Lee, D., Rinderknecht, C., Zaidi, F.: Hit-or-Jump: An Algorithm for Embedded Testing with Applications to IN Services. In: FORTE. (1999) 41–56
UKTest 2005
Testing from object machines in practice
Kirill Bogdanov and Mike Holcombe
Department of Computer Science, The University of Sheffield,
Regent Court, 211 Portobello St., Sheffield S1 4DP, UK
email: K.Bogdanov@dcs.shef.ac.uk
July 28, 2005
Abstract
Rigorous state-based testing methods for objects are capable of producing high-quality
test case sequences, but derivation of test data for them can be hard to automate; moreover,
test sequences using call-backs can be tedious to hand-code in JUnit. This paper describes
an approach for automatic construction of JUnit test sequences from templates provided by
a tester, which can be guided by one of the rigorous state-based test methods. Templates
are written in a style similar to traditional JUnit tests, from which actual test sequences are
automatically produced when JUnit requests a test suite from a tester object.
The main benefit of the described work is automation of rigorous testing of objects
which communicate with a number of collaborator objects.
1 Introduction
In many applications, objects are responsible for operations which go through a number of
states. Consider an object which has to perform a sequence of actions to accomplish a request
from a user, where each of the actions has to be delegated to collaborator objects. Each of the
actions may fail, making it necessary for the object to recover, depending on the action which
failed. For this reason, such an object has to have a state dedicated to each of these actions with
transitions corresponding to a success or a failure of actions. In addition, a user may change
his/her mind before a request has been completed. In such a situation, the object has to tell a
user that it cannot accept a new request until the previous one has either completed or
failed; a more sophisticated system could cancel the previous request and start a new one. An
action to cancel a request may also depend on the current state. Finally, an implementation
of the said object may use different forms of notification of the success or failure of actions,
namely (1) by values returned to the object from methods of its collaborator objects, (2) via
exceptions and (3) through methods of the object being called back by collaborators. Testing
objects of the described kind requires one not only to generate tests covering a variety of
possible paths through a transition diagram but also to utilise the three kinds of communication
mentioned.
The complexity of the control behaviour can make it difficult to perform traditional (such as
category-partition) testing of the state-transition behaviour of methods of an object effectively.
Consequently, it seems reasonable to attempt known rigorous methods for test generation from
a state-transition structure. The main benefits of these methods are that (1) states can be entered
by taking sequences of transitions rather than attempting to set state-related variables, which
may be hidden; (2) behaviour of each state is tested with both expected and unexpected inputs;
(3) states entered by an object during testing can be identified by attempting sequences of
inputs, hence it is often not necessary to make the internal variables available to a tester; (4)
faults targeted by state-based testing methods and the conditions under which such faults are
found are clearly stated. The downside of these test methods is that they generate numerous test
case sequences, which may be hard to run automatically. In addition, test generation requires a
state-based model of an object under test; unfortunately, traditional state-based testing methods
for objects do not make it easy to express call-backs as first-class elements of such models.
This paper describes how testing can be performed, avoiding these two problems; in addition,
it is relatively easy to make changes to a test suite to reflect changes in an object under test.
Section 2 describes how objects with call-backs can be modelled and Sect. 3 introduces one of
the testing methods which can be used for testing objects; Sect. 4 describes how the common
parts of test sequences can be highlighted by a developer. The construction of stubs to perform
testing is described in Sect. 5, followed in Sect. 6 by conclusions and a comparison to related
work.
2 A model for objects: object machines
Typical models of objects used for state-based testing [15, 14, 25, 24, 11] describe the be-
haviour of objects in terms of an extended finite-state machine with transitions taken in re-
sponse to method calls received by objects. This model does not permit one to express call-
backs where an object delegates to a collaborator object, which in turn calls the considered
object back. The approach taken in this paper considers all instances when data flows into an
object under test as an input of it and all cases when it flows out of it — as an output. The
theory of this is described in detail in [1], while this paper focuses on the practical side of test-
ing. These practical results stem from a case study conducted by the authors testing 15K lines
of Java code with 25K lines of test code, from which 3.3K JUnit tests were automatically
generated. Although not large in absolute terms, this case study will be referred to as the big case
study to differentiate it from a tiny one used in this paper to illustrate the proposed method for
testing of objects. The Java example used as an illustration in this paper is slightly different
from the code used in the big case study and can be obtained from URL
http://www.dcs.shef.ac.uk/~kirill/simple_om_test.zip. This example was chosen in preference
to a part of the big case study for two reasons: (1) the example is simple and covers the be-
haviour based on inter-object communication, exceptions and call-backs; (2) any part of the
big case study illustrating the advantages of the modelling and testing method to be described
will need extensive explanation covering the purpose of such a part and details of its operation.
The object used for testing in the example has a single method start(int) which takes an
integer as a parameter and uses a collaborator object to perform computations. The transition
diagram for the example is depicted in Fig. 1. The object under test (OUT) of the example
will further be referred to as SOUT (Simple Object Under Test). A label of every transition
on the diagram contains an input/output pair; potentially complex computations associated
with labels are not shown on the diagram. An input part of a transition label reflects any
information received by an object, be it a method call, an exception or simply a value returned
to an object from a method it called. In a similar way, an output can be a method call made by
the considered object, an exception thrown by it or a value it returns to a caller of its method.
Tables at the top of the diagram depict inputs and outputs of the SOUT and its collaborator.
Method calls are start(int), init(int) and compute(); RET(true), RET(false) and RET() are return
values and MyException is the exception which may be thrown. The described treatment of
inter-object communications is important from two points of view: (1) it forces a developer
to consider every possible call-back or an exception in every state of an object and (2) any
[Figure 1, upper part: the transition diagram of the SOUT, with states R (initial), I, C and E, and input/output pairs as transition labels: start(int)/init(int) takes R to I; RET(true)/compute() takes I to C; RET(false)/RET(false) and MyException/MyException take I back to R; RET()/RET(true) takes C to E; MyException/RET(false) takes C to R; start(int)/RET(true) takes E to R; start(int)/MyException loops on I and on C. Tables at the top list the inputs of the Object Under Test (OUT): start(int), RET(true), RET(false), RET(), MyException; its outputs: init(int), compute(), RET(true), RET(false), MyException; and the collaborator's inputs (init(int), compute()) and outputs (RET(true), RET(false), RET(), MyException). State identification: {start(int), RET()}; the responses to each of these sequences are R: {init, (impossible)}, I: {MyException, (impossible)}, C: {MyException, RET(true)}, E: {RET(true), (impossible)}.]
public boolean start(int arg) throws MyException {
    if (processingStage == stateE) {
        // State E
        processingStage = stateINIT;
        return true;
    }
    if (processingStage == stateINNER)
        throw new MyException("Processing already active");
    processingStage = stateINNER;
    try {
        // About to enter state I
        if (!collaborator.init(arg)) {
            processingStage = stateINIT;
            return false;
        }
    } catch (MyException ex) {
        processingStage = stateINIT;
        throw ex;
    }
    try {
        // About to enter state C
        collaborator.compute();
    } catch (MyException ex) {
        processingStage = stateINIT;
        return false;
    }
    processingStage = stateE;
    return true; // everything is ok.
}
Figure 1: A simple object machine and the implementation of the start(int) method
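The code in Fig. 1 refers to a collaborator object and to MyException, neither of which is shown. The following is a minimal sketch of the declarations that code appears to assume; only the method signatures are implied by the figure, while the interface name Collaborator and the comments are our own guesses, not part of the paper's case study:

```java
// Hypothetical declarations assumed by the start(int) method of Fig. 1.
// Signatures follow the figure; everything else is an illustrative guess.
class MyException extends Exception {
    public MyException(String message) { super(message); }
}

interface Collaborator {
    // Signals failure either by returning false or by throwing MyException.
    boolean init(int arg) throws MyException;
    // Has no return value; signals failure by throwing MyException.
    void compute() throws MyException;
}
```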
call/return/exception can be used in testing of any state. Given that the considered model of
objects uses an explicit transition diagram with some computations associated with labels, it
seems reasonable to use Extended Finite-State Machines to give a formal meaning to such
models. X-machines [6] (a kind of Extended Finite-State Machines) are used for this reason;
this paper does not go into detail as to how objects are modelled, but it provides both the idea
of how this can be done and how to use the X-machine testing method for testing of objects.
X-machines describing objects are further called object machines.
In the example considered, there is a single transition corresponding to a call of the start(int)
method from every state. This does not have to be the case: it is possible to introduce different
transitions corresponding to a call of start(int) with different values of the integer argument.
Even more, in principle, one can introduce labels which accept both a method call and a return
as an input and either make a call to a collaborator or return to a caller, depending on the re-
sult of computation they performed. Although handling different values of an integer method
argument with different labels does not make testing difficult, testing object machines with la-
bels accepting/generating both calls and returns or even accepting/generating returns of data of
different types makes testing substantially more complex. For this reason, this case is not con-
sidered in this paper; for details refer to [1]. The big case study did not contain the complexity
mentioned in this paragraph other than splitting labels based on values of arguments.
The part of Fig. 1 below the diagram depicts a possible implementation of the SOUT’s
only method. The code style was chosen to ensure that everything fits on a page. When a
method call is received, the SOUT has to determine its current state and respond accordingly.
Although the processingStage variable holds the current stage, the difference between
I and C states is not captured by it, since the object knows when it made the init(arg)
call. This highlights the correspondence between states in an object machine and variables in
a program code: a state in a machine has to be related to (1) instance variables of an object,
(2) local variables of the object’s methods and (3) the currently executing location in the code.
Consider an object waiting for a collaborator to respond, such as the SOUT in state I. In this
state the executing method of the considered object could have some local variables. If this
object receives a callback call (such as from a collaborator object), a corresponding method
of the considered object will be invoked and new local variables created. The behaviour may
hence be affected by values of both original and new local variables. For this reason, the above
characterisation of a state of an object machine has to read (1) protected and private instance
variables of an object, (2) a stack of local variables of object’s methods and (3) a stack of
locations in the code corresponding to the object’s methods. The mention of protected and
private variables and the omission of static ones is related to how accessible variables are to
other objects. When talking about objects, it is important to define a model which gives a
predictable outcome to an input received by a model. If variables associated with states can
be modified directly by other objects, in a faulty implementation an OUT can change a state
unpredictably; for a similar reason, state-related variables should not be directly accessible to
other objects. Any variables which do not satisfy the stated condition cannot be a part of a
state, but they can be implicitly considered inputs and outputs. This also includes instance or
local variables which are passed to other objects by reference, since nothing stops those objects
from storing these references and using them at any time in the future.
While an OUT is waiting for a response from a collaborator object after making a call to
it, nothing stops some other object calling back the OUT. The OUT may respond by making a
further call and receive a call-back again. In principle, there is no upper limit on the number
of nested callbacks received by an object. If every callback enters a separate state, this leads
to an infinite number of states in a model. Additionally, if a label has a method return as an
input, it can only be taken if an OUT has previously made a call; this implies that not every
path in a model may be executable. In this paper, it is assumed that objects do not call their
collaborators from within callbacks (for instance, the SOUT is not allowed to make a call to the
collaborator from I or C states). This assumption ensures that the number of states in object
machines is finite, makes it easy to check implementability of a model manually and makes
test application simpler. All but one of the objects used in the big case study satisfy this property and
the one which does not is nevertheless testable using the described framework.
With the constraints described above, an X-machine corresponding to the SOUT can be
defined as follows. The set of inputs is Σ = ({start} × N) ∪ ({RET} × {true, false}) ∪ {RET, MyException}. Outputs are defined in a similar way: Γ = ({init} × N) ∪ ({RET} × {true, false}) ∪ {compute, RET, MyException}. In X-machine terms, every transition label from a set of labels Φ is called a function and shares a common data store
(called memory) with all other functions of the same machine. The presence of such a store is
important, since a developer can choose what to interpret as a state on a transition diagram and
what to put in this memory. Without a store, all data will have to be included on a transition
diagram, leading to a known problem of state explosion. On the other extreme, if a developer
includes too little information in a transition diagram, state-based testing will lose its effec-
tiveness. For this reason, it is necessary to include only the relevant control behaviour in a
transition diagram; how to determine what is relevant is outside the scope of this paper. A
combination of a state on a transition diagram and a value of memory is a global state of an
X-machine, in that a response of a machine to an input depends solely on these two. Given
that objects are modelled using X-machines, only variables not accessible to other objects can
be used to define their global state.
When supplied with an input σ ∈ Σ, an X-machine decides which function φ to take, such
that (1) such a function is defined for the current memory and the input, and (2) there is a
transition from the current state labelled with that function. If there is a function satisfying
these conditions, the X-machine takes the corresponding transition and executes the function;
this yields a state change, an output produced by the function and potentially a change to the
memory. Formally, X-machine functions are defined as partial functions to take an input and
memory and produce an output and a new value of memory, φ : Σ × M → Γ × M , where M
denotes the type of memory. Traditionally, functions are given names which are included on a
transition diagram; this has also been done for the big case study, but the example included in
this paper appeared easier to understand if input/output pairs are included instead. These pairs
uniquely identify the corresponding functions and make their purpose clear. To complete the
definition of an X-machine corresponding to the SOUT, it is necessary to define a set of states
Q = {R, I, C, E}, the initial state q0 = R and the transition diagram F: Φ × Q → Q, all three depicted in Fig. 1. With m0 denoting the initial memory value, the SOUT X-machine is the tuple (Σ, Γ, Φ, M, m0, Q, q0, F). Note that since the SOUT does not actually contain memory
variables, it can be defined without M and m0; this is not a common situation in practice.
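The stepping rule just described (given an input, find the single transition from the current state whose function is defined for the current input and memory, then apply it) can be sketched as a small interpreter. The code below is our own illustration, not the authors' framework; it assumes the determinism discussed in Sect. 3 and represents states, inputs and outputs as strings and memory as an Integer:

```java
import java.util.*;
import java.util.function.*;

// Illustrative interpreter for a deterministic X-machine.
class XMachine {
    // A labelled transition: fires when its guard accepts (input, memory).
    record Fun(String from, String to,
               BiPredicate<String, Integer> guard,
               BiFunction<String, Integer, String> output,
               BiFunction<String, Integer, Integer> nextMem) {}

    String state;
    Integer mem;
    final List<Fun> funs;

    XMachine(String q0, Integer m0, List<Fun> funs) {
        state = q0; mem = m0; this.funs = funs;
    }

    // Apply input sigma: take the (assumed unique) enabled transition,
    // returning its output, or null if no transition is defined.
    String step(String sigma) {
        for (Fun f : funs)
            if (f.from().equals(state) && f.guard().test(sigma, mem)) {
                String out = f.output().apply(sigma, mem);
                mem = f.nextMem().apply(sigma, mem);
                state = f.to();
                return out;
            }
        return null; // input not accepted in this global state
    }
}
```

A global state in the sense above is the pair (state, mem): the response to any input depends on these two alone.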
3 X-machine testing of object machines
This section describes how the X-machine (extended finite-state machine) testing method [6]
can be applied to object machines and Sect. 4 shows how test sequences can be represented in
JUnit with common parts of them separated.
The main advantage of using the X-machine testing method is that it verifies by testing that
subject to certain conditions an implementation is behaviourally-equivalent to an X-machine
specification. Compared to many other testing methods, all these assumptions can be defined
formally and hence can be verified (or taken on trust). The foundation of the X-machine testing
method is finite-state machine testing methods [18, 20, 26, 2] which test a transition diagram.
They have been adapted [6] for testing X-machines by focusing on testing of the state-transition
diagram. For each state, the aim is to ensure (1) that all transitions with appropriate labels are
implemented from that state and lead to the expected states and (2) that transitions with all
other labels featuring in an object machine are not implemented from the considered state. For
instance, one would like to check that there is a transition RET(false)/RET(false) from the I
state, corresponding to the SOUT receiving a response false and returning false to the caller of
its start(int) method; moreover, there should be no transition RET(true)/RET(true) from that
state. Attempting such a function by returning true from the collaborator causes the collabo-
rator’s compute method to be called (instead of true being returned to a caller of the start(int)
method), making it clear to a tester that there is no transition with the RET(true)/RET(true)
label from the I state. Since an object may accept a variety of different inputs in any state, it is
not feasible to attempt all possible inputs from each state; for this reason, testing of a transition
diagram is limited to checking that from every state, only transitions with specified labels exist
and these transitions lead to the expected states. For instance, a transition from state I with a
label RET(false)/RET(false) should exist and enter R.
State verification can be done by observing values of object variables, however this con-
tradicts the spirit of object machines, since variables in a program code which contribute to
a global state of a corresponding object machine are not supposed to be externally acces-
sible. State-based testing methods such as the W method [26, 2] identify states by check-
ing the response of an object under test to various sequences of inputs; if each state can be
associated with a unique response, states can be identified. Compared to state variable ob-
servation, this is a more difficult approach to state verification but a more general one and
has been the main state-verification approach in the big case study. For this reason, it is
also used for testing of the simple object described in this paper. State-identification strat-
egy of the W method (when applied to object machines) is to find a set of sequences of labels
(called the W set), such that for every pair of states of an OUT there is a sequence in this
set which exists from one of these states and not from the other one. For example, state
C is the only one with the RET()/RET(true) transition from it; states C and I are the only
two with the start(int)/MyException transition; finally, the existence of a transition with the
start(int)/init(int) label uniquely determines the R state. The above three labels (comprising
the W set) make it possible to tell if the SOUT is in one of three of its four states; if it is not
in any of the three, it must be in the E one. After running a test sequence during testing, one
would first attempt an input of start(int), which may lead a potentially faulty implementation
to an arbitrary state. For this reason, it is necessary to restart a test and run the same sequence
again, following it with RET(), if such an input can be applied. This requires an implementa-
tion of an OUT to possess a reliable reset. The response of the SOUT to these two inputs is
shown in the top-right corner of Fig. 1 and it is possible to observe that states can indeed be
uniquely identified this way; ‘(impossible)’ means that an input mentioned cannot be applied
in that state. In general, the W set is not uniquely defined; moreover, state identification is not
always possible, because non-deterministic behaviour and an inability to attempt certain inputs
may lead to some states being indistinguishable from other states [13]. In the context of object
machines, the situation is sufficiently constrained so that this is not a problem, except when an
object machine contains states with identical behaviour. In this case, all but one of the states with the
same behaviour are redundant, hence it seems reasonable to assume that a developer will build
object models without redundant states.
There are two good alternatives to the W method for finite-state machines, Wp [5] and HSI
methods [13]. Both can be applied to testing of X-machines and their usage will reduce the size
of a test set compared to the W method, while still detecting all faults in an implementation.
The method mostly used for the big case study was the HSI one, since it does not require two
separate stages of testing; refer to [13] or to [19] for the description of how the HSI method
works. This paper focuses on the W method for simplicity.
Given a set of labels on transitions Φ, let a set of sequences of labels C be such that
for every state of an OUT, there is a sequence in the set C which labels a path in an object
machine from the initial state to the considered one; for the initial state, such a sequence
has to be an empty sequence (denoted 1). To simplify the presentation, only sequences of
inputs to enter states are provided, hence Cinputs = {1, start(int), start(int) RET(true), start(int) RET(true) RET()}. Such a set can be built if any state is reachable by a sequence
of labels from the initial state; unreachable states are redundant. For this reason, it is assumed
that every state can be reached in the considered object machine; this condition together with
the one about the absence of equivalent states is called minimality. According to [6], the
simplest set of test cases capable of finding all faults is C ∗ (W ∪ Φ ∗ W). Set multiplication
of two sets of sequences means pairwise concatenation of sequences in those sets. The set
of test cases is a formal representation of what has been mentioned above: the C ∗ W part
corresponds to entering every state in an implementation and verifying that the correct state
was entered, and C ∗ Φ ∗ W means that a tester should attempt every function from every state
and verify the entered state.
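Once labels are represented concretely, the set construction above is straightforward to mechanise. The sketch below is ours and purely illustrative: labels are plain strings, sequences are lists, and Φ is treated as a set of one-element sequences.

```java
import java.util.*;

// Illustrative computation of the test-case set C * (W ∪ Φ * W),
// where * denotes pairwise concatenation of sets of sequences.
class TestSet {
    static List<List<String>> concat(List<List<String>> a,
                                     List<List<String>> b) {
        List<List<String>> result = new ArrayList<>();
        for (List<String> x : a)
            for (List<String> y : b) {
                List<String> s = new ArrayList<>(x);
                s.addAll(y);
                result.add(s);
            }
        return result;
    }

    static List<List<String>> testCases(List<List<String>> c,
                                        List<List<String>> w,
                                        List<List<String>> phi) {
        List<List<String>> union = new ArrayList<>(w);  // W ∪ Φ * W
        union.addAll(concat(phi, w));
        return concat(c, union);                        // C * (W ∪ Φ * W)
    }
}
```

With |C| = 4, |W| = 3 and |Φ| = 9, as for the SOUT, this yields 4 ∗ (3 + 9 ∗ 3) = 120 sequences, the count discussed in Sect. 4.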
The extension of the W method to X-machines relies on the following main assumptions:
A tester knows which functions are present in an implementation. For object machines, there
is a clear relationship between the structure of their transition diagrams and the code,
hence checking that no unexpected functions are present in an implementation should
be relatively easy.
All functions of an implementation are correctly implemented. This condition is essentially
a requirement that there is a correspondence between functions of an object machine
and those in an implementation and the functions which correspond to each other are
behaviourally-equivalent. This typically requires testing of such functions; with the
two specific conditions (below) satisfied, such a testing can be performed together with
testing of a transition diagram [8] rather than requiring a tester to test every function
separately.
Every function can be attempted from every global state. This condition (called
input-completeness) aims to solve a problem where the value of memory in an imple-
mentation does not permit a tester to attempt a function of his/her choice. For instance,
an object implementing a stack may only resize itself when full, hence the precondi-
tion of a resize function may require a memory variable size to be of a specific value.
Following [6], the easiest way to solve this problem is to designate specific values of
arguments of methods or return values as test values, so that given a function φ, it is
possible to identify an input σφ such that regardless of the memory value, the precondi-
tion of φ will be satisfied if the input σφ is supplied to the OUT.
Consider attempting RET(false)/RET(false) after a test sequence which is expected to get
an implementation of the SOUT into state C. The only input to attempt RET(false)/RET(false)
is to return false to the SOUT from the collaborator, however in state C no value can be
returned since method compute() has no return value. This clearly means that there can-
not be a transition with the said label from the C state, but a tester cannot be complacent
about this: a faulty implementation may have entered a different state from which a
transition with such a function exists. For this reason, it is necessary both (a) to attempt
functions where the corresponding inputs can be applied and (b) to verify that inputs
for the remaining functions cannot be attempted. For example, if part (b) is not done,
verification of state I reduces to checking that calling start(int) causes an exception to
be thrown and hence makes it impossible to distinguish between states I and C.
It is assumed in this paper that both an object machine and its implementation are
deterministic, hence given an input, at most one function can be taken in response; the
definition of a deterministic X-machine [6] also requires that preconditions of functions
on transitions from the same state are disjoint.
It is possible to identify which function fired in response to an input from an output. For
the SOUT, the same input start(int) can be used to attempt any of the four functions tak-
ing start(int) as an input. During test execution, a tester needs to know which of them is
taken by an implementation in response to start(int); this is done by observing the output
from functions taken. The described condition is called output-distinguishability.
During execution, an object does not have an option of ignoring an input: it always has to
do something, i.e. execute a function. For this reason, it is possible to assume that for every
input which can be applied, an object machine has to execute a function. The fact that it is
additionally possible to verify that an input cannot be supplied means that object machines are
completely defined, i.e. it is possible to assume that for any input in any state there is a defined
response from an object machine and an implementation.
The paper [1] gives a sketch of a proof that subject to (1) the interface of an implementa-
tion of an OUT containing the expected methods with correct argument and return types, (2)
conditions underlined in this section and those outlined in Sect. 2 being satisfied, and (3) the
number of states in an implementation of the OUT being at most the number of states in the
object machine of the OUT, the X-machine testing method is capable of finding all faults. A
test set can also be generated for higher bounds on the number of states in an implementation
of an OUT; refer to [6] for details.
4 Test data generation: h-sequences
In order to run test sequences constructed from C ∗ (W ∪ Φ ∗ W), one has to identify the
actual test data, i.e. determine sequences of (a) inputs to apply to an implementation to drive it
through all the sequences of labels from a test case set, and (b) outputs which will be produced
by an object machine in response to these inputs. This task requires consideration of precon-
ditions of functions, is frequently difficult to automate and requires a detailed mathematical
description of the behaviour of labels. The authors believe that for a number of objects, such
test data has to be derived manually. This has the advantage that labels do not have to be
defined formally and although in this case an object machine cannot be used for a detailed
analysis of the behaviour of an object, the machine is still useful to guide implementation and
serve as a basis for test generation. In the big case study, object machines have been essential
for both of these. The main problem of test generation is the quantity of test cases: for the
SOUT this amounts to |C| ∗ (|W| + |Φ| ∗ |W|) = 4 ∗ (3 + 9 ∗ 3) = 120 sequences. This can
make rigorous testing such as X-machine testing prohibitively expensive, especially if every
test input is determined manually and upon any change to the model the whole work has to be
done again. The effect of this problem can be substantially reduced by the extraction of numer-
ous common parts of test sequences. This way, (1) one can identify test data corresponding to
common elements manually and re-use it across all the sequences in which it occurs; (2) it is possible to
describe tests in a compact form, so that when an object machine changes, the said form can
be modified and all test sequences be regenerated from it automatically.
Identification of common parts of test sequences can be done manually, but the task is made
relatively simple by the structure of a set of test cases and specific properties of objects under
test. First of all, every sequence in C is common to all the sequences in W ∪Φ∗W . In general,
test data for elements of W depends on the sequence of functions executed before elements of
W ; both for the big case study and the simple example considered in this paper, the choice of
test data for state identification tends to depend only on the state to identify. For this reason,
test inputs and outputs corresponding to a W set applied in a particular state are common to
all test sequences requiring identification of that state. Rather frequently, a response of an
object under test to elements of Φ is also determined by a state from which these functions are
attempted.
The compact form of encoding test sequences is called hierarchical test sequences (abbre-
viated h-sequences). The hierarchy is used to separate common parts of test sequences. In this
paper, h-sequences are encoded as nested arrays of objects because Java provides convenient
methods to initialise arrays inline. For the SOUT, a sketch of a possible (incomplete) test for
state I is shown below.
public Object[] testStateI() {
    return new Object[] { attempt_start, verify_init_called,
        new Object[] {
            new Object[] { return_true, verify_compute_called,
                           verifyEnteredC() },
            new Object[] { throw_MyException,
                           verify_start_thrown_MyException,
                           verifyEnteredR() },
        }
    };
}
The attempt_start, verify_init_called sequence ensures that the object under test enters state I; this part consists of test data corresponding to a sequence from C. Attempts to fire labels from state I are given by the inputs return_true and throw_MyException. Each of them has to be attempted after the sequence attempt_start, verify_init_called
is taken. For this reason, a nested array of sequences is included in the test sequence above.
Each element of the nested array has to be concatenated with elements preceding such a nested
array; if there are elements following the nested array, they are appended to the resulting test
sequences. Sequences in a nested array, such as those starting from return true, are h-
sequences, so that it is possible to include nested arrays in them too. For an arbitrary degree
of nesting of arrays, starting from one, odd levels correspond to elements which have to be
taken in a sequence and even levels correspond to sequences, each of which has to be con-
catenated with a sequence at a lower level. If there are two nested arrays, such as if a test se-
quence contains new Object [] {attempt start,verify init called}, new
Object [] {return true, verify compute called}, each element of the first
of them has to be concatenated with every element of the second one.
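As an illustrative sketch (not the framework's actual code; the class name and the sample identifiers are invented), the flattening of h-sequences just described might be written as follows: a nested Object[] denotes a set of alternative h-sequences, and each alternative is concatenated with every sequence built so far.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of 'flattening' h-sequences into ordinary sequences.
public class HSeq {
    public static List<List<Object>> expand(Object[] hseq) {
        List<List<Object>> result = new ArrayList<>();
        result.add(new ArrayList<>()); // start from a single empty sequence
        for (Object elem : hseq) {
            if (elem instanceof Object[]) {
                // even level: expand every alternative h-sequence...
                List<List<Object>> branches = new ArrayList<>();
                for (Object alt : (Object[]) elem)
                    branches.addAll(expand((Object[]) alt));
                // ...and concatenate each one with each preceding sequence
                List<List<Object>> crossed = new ArrayList<>();
                for (List<Object> prefix : result)
                    for (List<Object> branch : branches) {
                        List<Object> seq = new ArrayList<>(prefix);
                        seq.addAll(branch);
                        crossed.add(seq);
                    }
                result = crossed;
            } else {
                // odd level: a plain element is appended to every sequence
                for (List<Object> seq : result) seq.add(elem);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // the shape of testStateI(): a common prefix and two alternatives
        Object[] h = { "attempt_start", "verify_init_called",
            new Object[] {
                new Object[] { "return_true", "verify_compute_called" },
                new Object[] { "throw_MyException", "verify_start_thrown" } } };
        System.out.println(expand(h)); // two flat sequences sharing the prefix
    }
}
```

With two nested arrays in one h-sequence, the cross product described above emerges from applying the concatenation step once per nested array.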
The testStateI() method calls the verifyEnteredC() and verifyEnteredR()
methods to generate h-sequences verifying states C and R, respectively. This makes it possible
to use the same h-sequences to verify the same state in multiple tests. Deep nesting
of h-sequences is useful when, in order to identify a state, a number of sequences of labels have
to be attempted. Use of verifyEntered()-style methods was essential for testing in
the big case study.
The actual test inputs and outputs in h-sequences are represented by objects; attempt_start,
for instance, has to be an instance of some class. In order to run tests, h-sequences are
converted to ordinary sequences by the test framework and each of the resulting sequences is
encapsulated in a JUnit test; the framework also partially automates the running of test sequences.
Stubs representing collaborators have to be written manually by a tester, which was easy for
the big case study; the stubs are described in Sect. 5.
With h-sequences, it is easy to propagate changes made to an OUT to the corresponding
test h-sequences. When new functions are added, new sequences attempting them have to be
added to tests for every state. An extra state in an OUT can be tested by introducing a new
method returning an h-sequence to test this state and writing a new state identification method.
This was done a number of times during the work on the big case study.
Calls of methods in an OUT are described by objects extending the special testElem
class introduced in the object machine testing framework; instances of such derived classes are
typically Java anonymous inner classes implementing the run method, which is
called by the test framework during test execution. Expected calls to a stub of the collaborator
object are represented by instances of the CollaboratorStub class which are initialised
with enough information to check that data passed to the stub is correct (namely that the argu-
ment passed to the init(int) method is correct). For exceptions raised by stubs and callbacks
from stubs to the SOUT, instances of the MyEx class and the testElem class, respectively,
can be used in h-sequences. Instances of ReturnValue are used to denote values to be re-
turned from stubs; those of NotApplicable are necessary to check that a particular input
cannot be applied.
In the big case study, values to be returned from stubbed methods were often included in
the instances of stub test elements. In the simple example used in this paper, they are instead
included as separate elements of h-sequences, because this makes it easier to compose test
sequences using nested h-sequences. For instance, different return values have to be attempted
after the attempt_start, verify_init_called sequence; if return values were included in an
instance of CollaboratorStub (i.e. coupled with the verify_init_called element),
it would not be possible to share verify_init_called between all the subsequent sequences.
This was not an issue for the big case study, where state verification sequences never started
with a return value.
For the SOUT, testing of the R state can be accomplished using the h-sequence returned
by the method below, which includes instances of the appropriate objects.
public Object[] testRState() {
    return new Object[] {
        verifyRstate(), // check that we are in the R state
        new testElem("start") { // "start" is this input's name
            public void run() throws MyException { out.start(67); }
        },
        new CollaboratorStub("init", 67),
        verifyIstate(),
        new classEnd()
    };
}
A call to the SOUT is performed with out.start(67); the following element in the
sequence checks that 67 has been passed to the init(int) method of the collaborator. The h-
sequence returned by verifyIstate() checks the entered state; the sequence testing R
ends with the new classEnd() element, which ends the test and unrolls the call stack.
Unrolling is often necessary because test sequences can end within a call to a collaborator,
with the OUT waiting for a response; an unhandled exception is used to perform the unrolling.
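A minimal sketch of this unrolling mechanism is given below; classEnd and UnrollCallStackException appear in the paper's framework, but every other name here is invented for illustration.

```java
// Illustrative sketch: when a test sequence ends while the OUT is still inside
// a call to a collaborator stub, the stub throws an unchecked exception and the
// test driver catches it at the top level, unrolling the call stack.
public class UnrollDemo {
    static class UnrollCallStackException extends RuntimeException {}

    // stands in for a collaborator stub that reaches a classEnd element
    static void collaboratorStub() {
        throw new UnrollCallStackException();
    }

    // stands in for a call into the OUT that, in turn, calls the collaborator
    static void objectUnderTest() {
        collaboratorStub();
        // never reached: without unrolling, the OUT would wait for a response
    }

    public static String runTestSequence() {
        try {
            objectUnderTest();
            return "returned normally";
        } catch (UnrollCallStackException e) {
            return "call stack unrolled";
        }
    }

    public static void main(String[] args) {
        System.out.println(runTestSequence()); // prints "call stack unrolled"
    }
}
```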
As mentioned earlier, state verification is accomplished using sequences returned by meth-
ods such as verifyRstate(). In addition to the sequences mentioned in Sect. 3, an empty
sequence is included in h-sequences used for state verification; its presence is necessary to
verify states and then continue testing. For example, state verification of the SOUT involves
supplying two test inputs to it, each of which may lead to an arbitrary state in a faulty
implementation; a tester would like to attempt one of them, restart the test sequence, attempt
the other, restart the sequence again, and then attempt some functions from the now-verified
state. The need to perform this ‘verify and continue’ step in testing the big case study was a
motivation for the development of h-sequences. A method to verify the I state is given below.
public Object[] verifyIstate() {
    return new Object[] { null, // the empty sequence
        new Object[] { // calling "start" should cause an exception
            new testElem("start I(-582)") {
                public void run() {
                    try {
                        out.start(-582);
                    } catch (MyException ex) {
                        return; // everything is ok if we got here
                    }
                    fail("Exception was not thrown");
                }
            },
            new classEnd()
        },
        new Object[] {
            new NotApplicable(new ReturnValue(null)),
            new classEnd()
        }
    };
}
The first sequence is empty (represented by null) and the following two correspond to
the two inputs of the W set and the expected responses from the SOUT. In the second sequence,
a call of start(int) is attempted and the test passes if an exception is thrown by the SOUT in
response; the third sequence verifies that it is not possible to return from a stub without providing
a return value (such a verification is necessary to distinguish state I from state C).
It is possible to integrate testing of X-machine functions into testing of a transition diagram
[8]. This is based on first testing a number of functions on transitions from the initial state
using, for instance, the category-partition testing method [17]. Subsequently, a different state
may be entered using one of the tested functions, and functions on transitions from that state
can be tested in a similar way; this process can be repeated until all functions are tested. Such
a testing process can be directly integrated into h-sequences, where in addition to attempts to
verify transitions from each state, different inputs are supplied to the functions on those
transitions in order to test their behaviour.
If multiple instances of the same collaborator class can be used by an OUT, a reference to
an instance which is expected to be called by the OUT has to be stored in each test element
reflecting an expected call to a test stub.
Testing of self-delegation can follow either of two approaches: (a) stub the OUT itself and
thus capture calls directed from the OUT to itself, or (b) assume that self-delegated calls are
not visible to a tester and limit stubbing to objects other than the OUT. The choice between
these two methods depends on the degree of abstraction used when the object machine of the
OUT is built: a higher-level model will ignore self-delegated calls while a lower-level one
will take them into account.
JUnit [10] tests are often created by deriving a tester class from a JUnit-supplied class and
including test methods in it. JUnit expects test methods to follow specific syntactic
conventions, such as the requirement that the name of a test method begin with test. When
the suite() method of such a tester class is called, JUnit collects all test methods defined in
the class and packages each of them in an instance of the TestCase class; a collection of these
instances is returned from the suite() method. The framework for object testing using h-
sequences follows a similar pattern: all methods with names starting with test are
called in order to obtain h-sequences, and these sequences are expanded into ordinary sequences
by ‘flattening’ the hierarchy as explained in the introduction to h-sequences. Every sequence
obtained by this expansion is packaged in an instance of a TestCase-derived
class, and the collection of the resulting objects is returned from the suite() method.
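The collection step can be sketched with plain reflection; this is an illustrative reconstruction only (the tester class and its method bodies are invented, and the real framework packages each expanded sequence in a TestCase rather than merely collecting method names).

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch: gather all public no-argument methods whose names start
// with "test", mirroring the naming convention used by JUnit's suite() pattern.
public class SuiteSketch {
    public static class ObjectMachineTests {
        public Object[] testStateI() { return new Object[] { "..." }; }
        public Object[] testRState() { return new Object[] { "..." }; }
        public Object[] verifyIstate() { return new Object[] { "..." }; } // not collected
    }

    public static List<String> collectTestMethods(Class<?> tester) {
        List<String> names = new ArrayList<>();
        for (Method m : tester.getMethods())
            if (m.getName().startsWith("test") && m.getParameterCount() == 0)
                names.add(m.getName());
        Collections.sort(names); // reflection returns methods in no fixed order
        return names;
    }

    public static void main(String[] args) {
        System.out.println(collectTestMethods(SuiteSketch.ObjectMachineTests.class));
        // prints [testRState, testStateI]
    }
}
```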
5 Usage of stubs to run test sequences
A traditional approach to testing an object that operates in the context of other objects is to stub
the context; such stubs are known as mocks. A number of tools [9, 4] exist to generate mocks,
but these tools aim at mocking the behaviour of a single object rather than at creating
mocks that check that multiple collaborators are called in a particular order. The approach
described in this paper expects mocks to interpret test sequences constructed from h-sequences
and to communicate with the object under test as prescribed by those sequences. As a result,
(1) the order in which the collaborators of an OUT are called is verified, and (2) a single mock
can be built for each mocked interface (or object) and used to test a variety of objects
communicating via that interface (or with that object). Mock construction can be done
manually (as it was for the big case study) or automated using [9, 4] or any other mock
creation tool.
The code below for the stub of the init(int) method is a simplification of the code used in the
simple example considered in this paper, obtained primarily by deleting error-checking
code. A test sequence constructed from an h-sequence is stored in the array testData,
and the current test datum is assumed to be at index position of that array.
public boolean init(int arg) throws MyException {
    Assert.assertEquals("init",
        ((CollaboratorStub) testData[position]).argName);
    Assert.assertEquals(
        ((CollaboratorStub) testData[position]).expectedArg, arg);
    position++; // move to the next element
    while (testData[position] instanceof NotApplicable) {
        // verify that the input cannot be returned.
        ...
    }
    runSequence();
    tryMyException();
    return ((Boolean) getReturnValue()).booleanValue();
}
The first assertEquals call checks that the test sequence expects the init(int) method
to be called, and the second verifies that the argument passed to init(int) is the expected one.
If a test sequence includes checks that particular inputs cannot be applied (described by
instances of the NotApplicable class), the stubbed method verifies this in the while loop.
The runSequence() method is used by the stub to perform call-backs on the SOUT, if
call-backs are included in the test sequence. Finally, either an exception is thrown or a value
is returned to the SOUT; the former is accomplished by the helper method
tryMyException() and the latter by getReturnValue().
The runSequence() method runs tests by calling methods of the object under test, as
provided by a developer in objects extending testElem. A simplified version of runSequence()
is included below.
protected synchronized void runSequence() {
    while (position < testData.length &&
           !(testData[position] instanceof TestEx) &&
           !(testData[position] instanceof ReturnValue)) {
        int currentPosition = position;
        position++;
        if (testData[currentPosition] instanceof testElem)
            ((testElem) testData[currentPosition]).run();
        else if (testData[currentPosition] instanceof classEnd)
            throw new UnrollCallStackException();
    }
}
The while loop iterates through the test sequence and calls the run() method of every
instance of testElem. There are two cases in which runSequence() is called, as mentioned
above: to run a top-level test sequence and to perform call-backs. A top-level test sequence is
one in which each call to the OUT is made only after the OUT has returned from the previous
call (states R and E of the SOUT). An instance of classEnd or the end of the test sequence
(the first line of the condition of the while loop) terminates a top-level test sequence; the last
two lines of the while loop condition terminate a call-back sequence.
6 Conclusion
This paper has described how a rigorous state-based testing method can be used in conjunction
with JUnit for testing objects that use call-backs and exceptions for communication. By
separating out the common parts of test sequences, the approach also handles the situation
where the test method generates numerous test sequences but test data generation is difficult
to automate; this separation also makes tests easy to adapt to changes in an object machine.
The idea of test reuse is not new: it has previously been used for testing of inheritance,
where one could maintain a hierarchy of test classes in parallel with the structure of the
classes under development; another example of test reuse is IFTC [7], where tests are developed
for each interface and an object can be tested by running the tests for all the interfaces it
implements. This paper has described a testing method complementary to these two
approaches. Reference [22] uses an algebraic approach to test generation; the framework
described in this paper can be used for the resulting tests, but the authors believe that it could
be easier to find commonalities between test sequences produced by state-based testing
methods. For a detailed comparison with [16, 23, 3], refer to [1].
Reference [12] describes how to determine an order in which to test objects so as to create as
few mocks as possible. Paper [21] describes how the behaviour of objects can be recorded and
replayed as mocks for objects which are difficult to use in a test environment. The work on
object state testing primarily targets unit testing and stems from the view that an object under
test has to be controllable and observable; using real objects as collaborators, rather than mocks
made to follow test sequences, makes controllability and observability more difficult. In the
approach described in this paper, in order to facilitate the construction of stubs, (1) a stub is
built per interface (or object) and can be used without changes to run any tests, and (2)
various helper methods were introduced to help with writing stubs. In contrast, the problems
addressed by [12, 21] seem particularly important for integration testing. The authors believe
that the object machine testing method can be useful for this type of testing too, but it will
have to be applied starting from higher-level object machines.
It is worth pointing out that the described approach to object modelling and testing can be
applied to software components; in the context of the big case study, 7.6K lines of assembly
were tested using 8.6K lines of test code (180 tests). This translates to 7.6 lines per test for
Java and 48 lines per test for assembly; the substantial difference between these numbers is
primarily due to the lack of automation of test generation for assembly.
The main limitation of the described work is the limit on the number of nested call-backs
in the model. Although not demonstrated here, the framework seems capable of supporting an
arbitrary (bounded) number of nested callbacks; verifying this capability of the framework as
well as considering unbounded nested callbacks in the model is a subject of future work.
Acknowledgement
This research was in part sponsored by EPSRC grant GR/M56777 ‘MOTIVE’. The authors
would like to thank Tony Simons, Barry Norton and Mike Stannett for valuable discussions.
References
[1] K. Bogdanov, M. Holcombe, and A. Simons. A state-based model for generating com-
plete test sets for objects using interception of communication between them. To be
submitted to ACM Transactions on Software Engineering and Methodology, 2005.
[2] T. Chow. Testing software design modeled by finite-state machines. IEEE Transactions
on Software Engineering, SE-4(3):178–187, 1978.
[3] J. Davies and C. Crichton. Concurrency and refinement in the unified modeling language.
Electronic Notes in Theoretical Computer Science, 70(3), 2002.
http://www.elsevier.nl/locate/entcs/volume70.html.
[4] EasyMock web site.
http://www.easymock.org, December 2004.
[5] S. Fujiwara, G. von Bochmann, F. Khendek, M. Amalou, and A. Ghedamsi. Test selection
based on finite state models. IEEE Transactions on Software Engineering, 17(6):591 –
603, June 1991.
[6] M. Holcombe and F. Ipate. Correct Systems: building a business process solution.
Springer-Verlag Berlin and Heidelberg GmbH & Co. KG, September 1998.
[7] Interface/hierarchical test case (IFTC) web site.
http://groboutils.sourceforge.net/testing-junit/using_iftc.html,
January 2005.
[8] F. Ipate. Complete deterministic stream X-machine testing. Formal Aspects of Comput-
ing, 16(4):374–386, 2004.
[9] jMock web site.
http://jmock.codehaus.org, December 2004.
[10] JUnit web site.
http://www.junit.org, December 2004.
[11] D. Kung, Y. Lu, N. Venugopalan, P. Hsia, Y. Toyoshima, C. Chen, and J. Gao. Object state
testing and fault analysis for reliable software systems. In IEEE 7th Int’l symposium on
Software Reliability Engineering, pages 133–142. IEEE Computer Society Press, 1996.
[12] Y. Labiche, P. Thevenod-Fosse, H. Waeselynck, and M.-H. Durand. Testing levels for
object-oriented software. In Proceedings of the 22nd International Conference on Soft-
ware Engineering, pages 136–145. ACM Press, June 2000.
[13] G. Luo, A. Petrenko, and G. von Bochmann. Selecting test sequences for partially spec-
ified nondeterministic finite state machines. In IFIP Seventh International Workshop on
Protocol Test Systems, Japan, pages 95–110, 1994.
[14] J. McGregor. Constructing functional test cases using incrementally derived state ma-
chines. In Eleventh International Conference on Testing Computer Software, 1994.
[15] J. D. McGregor. Functional testing of classes. In Proc. 7th International Quality Week,
San Francisco, CA, May 1994. Software Research Institute.
[16] OMG. OMG Unified Modeling Language specification, version 1.5.
http://www.omg.org/technology/documents/formal/uml.htm,
March 2003.
[17] T. J. Ostrand and M. J. Balcer. The category-partition method for specifying and gener-
ating functional tests. Communications of the ACM, 31(6):676–686, June 1988.
[18] A. Petrenko. Fault model-driven test derivation from finite state models: Annotated bib-
liography. In Modeling and Verification of Parallel Processes (MOVEP’2000), Nantes,
France, volume 2067 of Lecture Notes in Computer Science, pages 36–43. Springer Ver-
lag, 19-23 June 2000.
[19] A. Petrenko, N. Yevtushenko, and G. v. Bochmann. Testing deterministic implementa-
tions from nondeterministic FSM specifications. In Proc. of 9th International Workshop
on Testing of Communicating Systems (IWTCS’96), pages 125–140, 1996.
[20] T. Ramalingam, A. Das, and K. Thulasiraman. On testing and diagnosis of communi-
cation protocols based on the FSM model. Computer communications, 18(5):329–337,
May 1995.
[21] D. Saff and M. Ernst. Mock object creation for test factoring. In Workshop on Program
Analysis for Software Tools and Engineering PASTE 2004. ACM, June 2004.
[22] D. Stotts, M. Lindsey, and A. Antley. An informal formal method for systematic JUnit
test case generation. Lecture Notes in Computer Science, 2418:131–143, 2002.
[23] J. Tenzer and P. Stevens. Modelling recursive calls with UML state diagrams. In
FASE’03, volume 2621, pages 135–149. Lecture Notes in Computer Science, 2003.
[24] C. D. Turner. State Based Testing - A New Method for the Testing of Object-Oriented
Programs. PhD thesis, University of Durham, UK, 1995.
[25] C. D. Turner and D. J. Robson. The testing of object-oriented programs. Technical Report
TR-13/92, Computer Science Division, University of Durham, 1993.
[26] M. Vasilevskii. Failure diagnosis of automata. Cybernetics, Plenum Publ. Corporation,
NY, 4:653–665, 1973.
A Formal Model for Test Frames
A. J. Cowling
Department of Computer Science,
University of Sheffield,
Regent Court, 211 Portobello Street,
Sheffield, S1 4DP, United Kingdom
Email: A.Cowling@dcs.shef.ac.uk
Telephone: +44 114 222 1823; Fax: +44 114 278 0972
Abstract
Motivated by errors that students have been observed to make while learning to use the category-
partition test method, this paper describes a new model that has been developed for the concept of test
frames, by defining them in terms of the characteristic conditions identifying sets of test cases. This
model is illustrated by reference to the stream X-machine model of computation, but is also applicable
to other computational models. It is shown that the model can be applied to both functional and
structural test methods, and that it leads to a view of them as processes for generating sets of test
frames, which are complemented by processes that are based on the model for generating test cases
from these test frames. The paper discusses the need for test frames to identify representative sets of
test cases, and introduces the concept of structural continuity of specifications and implementations as
a way of capturing the requirements for test frames to be representative. The application of this
concept to primitive operations is discussed, and is shown to require a hierarchical approach that
reflects the structure of the process of integration testing.
Key Words and Phrases
Category-Partition Testing, Functional Testing, Structural Testing, Test Cases, Test Methods, Structural Continuity,
Integration Testing.
1. Introduction
The term “test frame” has many different meanings, including being a synonym for “test harness” or “test driver”, and also
being used to refer to test data or specific test cases for various kinds of applications that use concepts that they call
“frames”, such as link layer network protocols, digital image processors (eg for use with cameras, scanners, etc), and
window-based graphical user interfaces. While all of these meanings are important in their own specific contexts, this
paper addresses a more general context, in which the meaning of the term is derived instead from the work of Ostrand &
Balcer [1] on the category-partition method for functional testing. Here, they define the term to mean that “a test frame
consists of a set of choices from the specification, with each category contributing either zero or one choice”.
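As an illustration of this definition (a sketch only: the categories and choices shown are invented, and for simplicity every category here contributes exactly one choice rather than "zero or one"), test frames can be generated as the cross product of the categories' choice lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of Ostrand & Balcer's definition: a test frame is one
// choice per category, so the set of frames is the cross product of the
// categories' choice lists (category and choice names are invented).
public class FrameGeneration {
    public static List<List<String>> frames(List<List<String>> categories) {
        List<List<String>> result = new ArrayList<>();
        result.add(new ArrayList<>()); // start from a single empty frame
        for (List<String> choices : categories) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> frame : result)
                for (String choice : choices) {
                    List<String> extended = new ArrayList<>(frame);
                    extended.add(choice);
                    next.add(extended);
                }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> categories = Arrays.asList(
            Arrays.asList("input town known", "input town unknown"),
            Arrays.asList("route exists", "no route exists"));
        System.out.println(frames(categories).size()); // 4 test frames
    }
}
```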
The motivation for discussing this concept stems from experience with teaching the category-partition method to a number
of successive cohorts of students. This experience has indicated that there may be a structural weakness in the method as it
is currently defined, and that the role of test frames as it stems from this definition may be the key to this weakness. The
purpose of the paper is therefore to try to address this weakness by developing a new model for test frames that is more
formal than the one given by Ostrand & Balcer. In particular, part of the weakness that has been
identified in the current model is that its formulation depends heavily on the concepts of categories
and partitions, but these are currently only defined formally in the context of the test specification
for a specific system, and so the current models only support formal reasoning about the application
of these concepts to specific systems. Hence, one of the aims for this new model is that it should
provide more support for reasoning about the generic properties of test frames than the current models do.
To achieve these aims, the structure of the rest of the paper is as follows. Section 2 explains how this weakness in the
category-partition method has been identified from experience of teaching it, by describing the assignment that the students
have been required to work through in order to demonstrate that they can apply the method successfully, and identifying the
problems that some of them have found in carrying out this assignment. Section 3 then presents the model for test frames
that has been developed to try to overcome this weakness, and section 4 shows how this model applies to the role of test
frames in functional testing, while section 5 extends this to structural test methods. An issue that is raised by the need to
relate these two kinds of test methods is that of ensuring that a test case generated from a test frame will possess the
property of being representative of the test set that corresponds to such a frame, and section 6 discusses this issue from the
perspective of functional test methods, by introducing a concept that will be called structural continuity. Section 7 shows
how this concept can be extended to structural test methods, and section 8 discusses the implications of this concept for the
process of integration testing. Finally, section 9 summarises the conclusions from this work and discusses possible future
extensions of it.
2. Experience of the Category-Partition Method
Like any topic in software engineering, software testing is a practical subject, and so a practical approach must be taken to
teaching it [2], which means that students of it need to spend a significant amount of their study time in actually carrying
out the testing of pieces of software. In this particular case, which is a course taught to final year undergraduates and
masters students, the pieces of software to be tested had been produced by second-year undergraduates as part of a course
in data structures and algorithms that the author had taught previously [3]. The requirements for these pieces of software
were given by the following scenario, which will be used as the example for the rest of the paper.
“A number of towns have express bus services between them, which simply run from one town to another: for
this purpose it is assumed that there are no through services which stop at intermediate towns, so to plan a
journey from one town to another may involve several changes of bus at the intermediate towns. The basic
requirement is therefore to produce a program which will perform two main functions, as follows.
The first main function will be to read in details of the towns that are served by buses, the pairs of towns
between which buses run, and the distances between each of these pairs. This could in practice be extended to
also allow a set of data which had been read in to be written out to a file, so that subsequently data could be
read either from the keyboard or from a file: such an extension might help simplify the eventual testing of the
system.
The second function will be to read in the names of two of the towns, determine whether there are any routes by
which someone could travel from one to the other by bus, and if there are to print out the route that involves the
minimum number of intermediate changes (or routes if there are several of them), together with the total
distance for each such route.
Since the main emphasis in this assignment is on the construction of the necessary data structures, the user
interface for the system should be made as simple as possible (where ‘simple’ is intended to mean ‘simple to
program’, rather than ‘simple to use’). To facilitate this, all town names will be represented throughout by two-
letter abbreviations (eg you might want to use SH for Sheffield, LE for Leeds, YO for York, DN for Doncaster,
BA for Barnsley, etc.). All distances will be expressed as whole numbers. If it helps, you can also assume that
most towns will only have bus services to up to four other towns, although there may be a few towns that have
services to more.”
Students taking this previous course had produced some five working systems that they were prepared to make available for
use in the software testing assignment, although (not surprisingly) there were significant variations between these systems
in the ways that their authors had interpreted the requirements. Thus, the software testing assignment required the students
firstly to select one of the candidate systems, and undertake some exploratory testing in order to determine how its authors
had interpreted the requirements. Then, the students (who were encouraged to work in groups of two or three) were
required to apply the category-partition method to each of the two main functions that were identified in the scenario, in
order to produce a test specification for each function and (using a locally-written tool) generate the corresponding sets of
test frames. Each individual student was then required to select ten test frames for the route finding function, and actually
carry out the testing for test cases produced from these frames. Finally, each group was required to produce and submit a
report on the work that it had done, and evaluate both their work and the methods that they had used.
This structure for the assignment has been in use for a number of years, and apart from minor refinements it has needed
little change, as in general students have been able to cope with it well and produce reasonable test specifications, even if
some of them have been less thorough than others. On the other hand, one mistake has been observed to occur fairly
persistently, even if not particularly frequently. This mistake arises in trying to create categories for the routes that are
output, where some students correctly identify the total distance for each route as one of the output parameters for this
function, but then (following the pattern that they have used for at least some of the inputs) identify a category concerned
with the validity of this distance. This they interpret as meaning whether or not the distance that is output is actually the
sum of the distances for the individual segments of the route, which leads them to suggest that there should be two
partitions for this category: one for the case where the distance is correct, and the other for the case where it is not.
The reason for classing this as a mistake is simply that, unless the exploratory testing has already identified cases where the
system to be tested does actually calculate the total distance wrongly, it will be impossible for the students to actually create
test cases corresponding to any test frame that involves the partition “total distance is incorrect”. In terms of the concept of
domain testability, as described by Freedman [4], the nature of this mistake is that test specifications have been created that
do not possess the property of domain testability, because the domain that is being defined by this particular category is not
controllable. As such it contrasts with the informal specification (as given above), where there is nothing that would imply
a lack of controllability (or, for that matter, a lack of observability either), so that this lack of controllability in the test
specification actually constitutes an inconsistency between the test specification and the system specification from which it
has been derived.
This inconsistency thus constitutes a fault in the test specification, but the most interesting aspect of it is the nature of the
mistake that had led to this fault occurring. On the first few occasions when this mistake was observed it was assumed that
it was simply due to the students not having understood the method properly, but as the mistake has continued to recur, so it
has begun to raise the question as to whether the problem is actually a structural one with the category-partition method (or,
at least, with the way in which it is normally described in the literature), rather than simply being a consequence of the
students not having understood it fully.
Reflecting on this issue has led to a number of hypotheses being proposed for the precise nature of the problem with the
method. The common feature of all of these is that there are some aspects of the method which are somehow being masked
by the way in which the method is normally presented, with the consequence that some of its underlying theoretical aspects
are not being properly appreciated by at least some of the students, including those who have made this particular mistake.
The most fundamental candidate for such an unappreciated aspect of the method appears to be the causality that is implicit
in the input-output relation for a system, meaning that these students at least (and maybe others as well) are not properly
appreciating the significance of the fact that the inputs cause the outputs.
In particular, it appears that some of the students are not properly appreciating that pre-conditions for inputs (such as
“being valid”), which may or may not hold depending on what data is supplied, have a different significance from the post-
conditions for outputs, where the responsibility for ensuring that they hold rests entirely with the system itself, and not with
the external environment in which it is run and from which the inputs are supplied. This difference means that, while it is
reasonable to expect a system to check whether its inputs satisfy the preconditions, and deal sensibly with cases where they
are violated, for the post-conditions the only responsibility on the system is to maintain them, and (except in the rare cases
where there could be good reasons for the system not being able to maintain these post-conditions, which are outside the
scope of this paper) it is not reasonable to expect the system to check whether they might have been violated, or to take
action if such violations have actually occurred.
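This distinction can be made concrete with a small sketch (in Python; the function, its names and its data are invented for illustration, and are not part of the scenario as specified): the input precondition is checked because the environment may violate it, while the output post-condition is maintained by construction and is never re-checked.

```python
# Hypothetical sketch: a route-distance function that checks its input
# precondition, but merely maintains its output post-condition (total
# distance = sum of segment distances) rather than checking it afterwards.

def route_distance(segments, town_map):
    # Precondition check: the environment may supply invalid data,
    # so the system is responsible for detecting it (and is testable here).
    for start, end, dist in segments:
        if start not in town_map or end not in town_map:
            raise ValueError("town not in map")
    # The post-condition holds by construction; no self-check is performed,
    # so no test input can make "total distance is incorrect" occur.
    return sum(dist for _, _, dist in segments)
```

Since the post-condition holds by construction, no choice of input data can produce an incorrect total, which is exactly why a partition such as "total distance is incorrect" yields untestable frames.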
A consequence of this, which also needs to be appreciated, is how this issue of what responsibility the system has for
checking conditions affects the kinds of test cases that should or should not be constructed for a system. If violations of a
condition can occur in the environment, so that the system is responsible for checking for them, then test cases are needed
to ensure that these checks are actually being carried out correctly, and that the system is taking appropriate actions both
when the checks are passed and when they are failed. If the system is not responsible for checking for particular
conditions, however, as is the case when it can be expected to be constructed so as to maintain them itself, then it is
pointless trying to test whether such checking is being carried out: any such checking would be outside the specified scope
of the system, and so any attempt to create test cases to determine whether such unrequired checks might fail would (as in
this particular example) simply lead to tests for which it would be impossible to construct appropriate data.
In the scenario for this particular example, there is an alternative view that might be taken of this aspect of ensuring post-
conditions, namely that the property of the total distance for a route being the sum of the distances for the individual
segments of it is an invariant that needs to be maintained. As with a post-condition, though, this means that the only
responsibility on the system is to maintain the invariant, and (in general) that it does not have any responsibility for
checking whether it has been maintained. Consequently, this alternative view still leads to the same conclusion, that since
the specification of the system does not require it to check whether this condition has been maintained, it will be impossible
to use the specification in trying to create the test data that might be needed for any test case which assumed that such a
check would be conducted and would fail.
At a more theoretical level, another aspect of the method that the students who make this mistake may not be appreciating
properly is the relationship between the partitions, the test frames that are constructed from them (by the tool) and the test
data that must then be selected to create the test cases corresponding to each of the test frames. This relationship means
that test frames can only be regarded as valid if test data can be constructed from them, and similarly that partitions can
only be regarded as valid if the test frames constructed from them are valid. Consequently, the test frames that they are
creating by specifying partitions such as “total distance is incorrect” are actually invalid ones, because there is no sensible
corresponding test data that can be selected for them, rather than simply being less meaningful than those frames for which
data can be selected.
Part of the reason for not appreciating this aspect of the method may well be that the usual descriptions of the method rather
gloss over the steps in the process where the frames are generated from the test specification (particularly since there is a
tool available for doing this step), and where the test data is selected from the frames. Of course, the original developers of
the method subsequently extended both it and its supporting toolset [5], to cover the specification of actual test data for test
frames as well as the generation of the frames, but we did not have this toolset available. If we had, then maybe its
emphasis on the need to define appropriate actual test data for test frames might have helped to avoid the problem. Given that
we did not have this toolset available, though, then part of the reason why some students are not appreciating these aspects
properly may also be that (as indicated in the introduction) the models underlying them are actually not very clear. In
particular, this is because these models are described in terms that are specific to the individual systems for which
categories and partitions are being identified, rather than in terms that are independent of the specific details of individual
systems.
This, therefore, leads to the goals for the work described in the rest of this paper, since a possible way of addressing this
problem would be to find an alternative model for the concepts of partitions and test frames, particularly if this could be
formulated so as to allow some formal reasoning about their properties at a generic level, rather than at the level of the test
specification for an individual system. A second goal is that, if such a model could be constructed, then it should also
allow a more precise description of the processes by which test frames are constructed from partitions, and by which test
data is selected to correspond to test frames, and so it should help to make these aspects of the method clearer. The third
goal is then related to this, in that it would be desirable for such a model and its associated description of the process to
also put more emphasis on the role of the input-output relation for the system, so as to help clarify this aspect too.
3. Functional Test Frames
Given these goals, the model that has been developed for test frames is derived from the observation that a test frame
specifies a set of possible test cases, and does so in terms of ranges of test data that meet criteria derived from the different
partitions being combined within that test frame. Hence, a test frame can be defined to be the characteristic condition that
specifies such a set of test cases, so that any test case satisfying this characteristic condition can be said to satisfy that test
frame. Here, the condition may involve ranges of values for input parameters to the system, outputs from the system
(which for this purpose will also be referred to generally as parameters), or possibly both. The original description of the
category-partition method also identified a third kind of parameter, known as an environment condition, but more recently
Ostrand has indicated [6] that this can best be understood in terms of data that is input to the system, with an updated
version of it then being output, and so such parameters can simply be treated as appearing as both inputs and outputs.
To make this definition more precise requires some formal model of the computation that is performed by a general system,
and for this purpose the model that will be used here for illustration is that of the stream X-machine, although the model of
test frames is not inherently dependent on the X-machine model. Indeed, it could equally well be built on top of other
models of computation, but the reason for choosing the X-machine model is that it has already been shown to be a very
successful one for obtaining useful results concerning the power of software testing methods [7, 8].
A stream X-machine consists of a finite-state machine that models the overall control structure of a computation, such as
the input of successive data items or the output of successive results, and this machine is known as the associated
automaton of the X-machine. To produce the X-machine model it is augmented with a memory that contains the working
data of the computation, and the steps in the computation are represented by functions that can be applied to this memory,
so as to read it and update it. Each function also reads the next input to the machine and produces the next output. Then, at
each step in the operation of the machine the choice of which function to apply is determined by the current control state of
the machine, and which of the functions have their preconditions satisfied by the current input and memory values.
There are a number of variants of the formulation of the X-machine model, which are reviewed in [9]: the one that will be
adopted here is that an X-machine Λ can be represented as a tuple Λ = (Σ, Γ, Q, M, Φ, F, q0, T, m0), where:
Σ is the input alphabet of the machine, which defines the type(s) of values that can be input to it;
Γ is the output alphabet of the machine, which defines the type(s) of values that can be output by it;
Q is the set of control states of the machine (ie the states of its associated automaton);
M is the set of possible values of the memory of the machine;
Φ is the set of processing functions that the machine can apply to the memory, inputs and outputs;
F is the next state function, which defines the next control state for each control state and processing function;
q0 is the initial control state of the machine;
T is the set of terminal control states of the machine, which must be a subset of Q; and
m0 is the initial memory value of the machine.
Here, it should be noted that the memory of the machine will usually be structured as the Cartesian product of a number of
components, corresponding to the different variables used in the computation, but for this purpose it is not necessary to
consider the details of this structure further. All that needs to be noted is that the machine starts in the control state q0 and
the memory state m0, and with an input stream s* in Σ*, and an empty output stream.
Then, at each step in the operation of the machine a processing function φ is selected from Φ, such that φ (m, σ) is defined
where σ is the head of the stream s* and where F is defined for the pair (q, φ). Once this function φ has been chosen, the
element σ is removed from the stream s* and the function φ is applied to yield a pair (m’, γ) = φ (m, σ). This pair is used to
update the memory value to m’, while γ is appended to the output stream g* in Γ*, and also the control state is updated to
the new value q’ = F(q, φ). These processing steps continue until both the input stream is empty and the machine enters a
control state that is in T, when the machine terminates, having computed the output stream g* from the input stream s*.
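The stepping rule just described can be sketched executably as follows (a Python illustration only; the toy machine, which keeps a running total of integer inputs, is invented, and only the select/apply/update cycle follows the definition above):

```python
# Sketch of the stream X-machine semantics: at each step a processing
# function phi defined on (memory, input) is chosen, the input is consumed,
# the output appended, and the memory and control state updated.

def run_x_machine(sigma_stream, phis, F, q0, m0, terminals):
    q, m, out = q0, m0, []
    s = list(sigma_stream)
    while s:
        sigma = s[0]
        # choose a phi applicable to (m, sigma) for which F is defined on (q, phi)
        phi = next(p for p in phis
                   if p(m, sigma) is not None and (q, p.__name__) in F)
        s.pop(0)
        m, gamma = phi(m, sigma)      # (m', gamma) = phi(m, sigma)
        out.append(gamma)             # gamma appended to the output stream
        q = F[(q, phi.__name__)]      # q' = F(q, phi)
    assert q in terminals, "machine halted outside a terminal control state"
    return out, m

# A toy machine: one control state, one function that adds each input to
# the memory and outputs the running total.
def add(m, sigma):
    return (m + sigma, m + sigma)

F = {("Running", "add"): "Running"}
outputs, final_m = run_x_machine([1, 2, 3], [add], F, "Running", 0, {"Running"})
```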
Figure 1. The associated automaton of the X-machine for typical functional systems. (States: Running, Quit; transitions labelled Function 1 … Function n.)
For a system such as the one specified in the scenario, which essentially just provides a number of functions that can be
invoked in any order, the next state function F will typically have the form illustrated by the state transition diagram in
figure 1. For such systems it will often be the case that T will be a singleton set, whose sole element will be the control
state that corresponds to reaching the end of the program. For testing a system of this kind, though, where the focus is
simply on verifying the behaviour of the individual functions, it is often convenient to regard other states (such as the state
Running in figure 1) as also being terminal, since the tester may well either not be interested in any further computation
steps, or may wish to go on to perform another test without having to terminate and restart the system.
This convention, of regarding as terminal any control state that corresponds to the end of a particular test case, also applies
to those systems where a significant part of the testing is concerned with verifying that specified sequences of control states
and transitions between them are implemented correctly. This kind of state-based testing, which includes conformance
testing [10, 11], is a more complex process than the function-based testing that is being discussed here, although it
obviously includes function-based testing as an important part, since the validity of the results obtained from it depends on
the individual processing functions Φ of the X-machine model being implemented correctly. This paper,
though, is just concerned with the function-based part of the testing activity, and so possible extensions to state-based
testing will be regarded as beyond its scope.
Given this model, then in principle the characteristic condition of a test frame f could involve any of the elements of the
tuple, but for functional test frames we shall just be concerned with three possibilities:
f involves only elements s* of Σ*, in which case it will be termed an input-only frame;
f involves only elements g* of Γ*, in which case it will be termed an output-only frame; or
f involves both elements s* of Σ* and elements g* of Γ*, in which case it will be termed an input-output frame.
Structural test frames will be considered in section 5, but these possibilities mean that the most basic form of functional test
frame will simply specify values or ranges of values for some elements of s*, g* or both. It should be noted that such basic
test frames will usually correspond to one partition of a single category, but this is not inconsistent with the original
informal definition of a test frame, since although this could be taken as implying that a frame would usually consist of
combinations of partitions from a number of categories, it did not actually require this, and so a choice of a single partition
from a single category would in fact be permitted by that definition.
In practice, of course, one wishes to construct more elaborate test frames, which will combine partitions from a number of
categories. Syntactically these can be constructed from the basic test frames using the usual operators of boolean algebra,
so that if f1 and f2 are any test frames, then f1 ∧ f2, f1 ∨ f2, and ¬f1 are also all test frames. In particular, if f1 is a test
frame that represents some partition of one category c1, and f2 is a test frame that represents some partition of a different
category c2, then f1 ∧ f2 is the test frame that represents the combination of these two partitions from the categories c1 and
c2.
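This algebra of frames can be sketched directly (Python, with invented category and partition names): a basic frame is a predicate over test cases, and the operators build compound frames from basic ones.

```python
# Sketch of test frames as characteristic conditions: a basic frame is a
# predicate over test cases, and frames combine under the usual boolean
# operators. The category/partition names here are invented for illustration.

def frame_and(f, g):
    return lambda tc: f(tc) and g(tc)

def frame_or(f, g):
    return lambda tc: f(tc) or g(tc)

def frame_not(f):
    return lambda tc: not f(tc)

# Basic frames, each representing one partition of one category:
fa = lambda tc: tc["start_town"] in tc["map"]     # input-only frame
fb = lambda tc: tc["intermediates"] == 1          # output-only frame

combined = frame_and(fa, fb)   # input-output frame combining both partitions
tc = {"start_town": "A", "map": {"A", "B", "C"}, "intermediates": 1}
```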
Of course, there is a possibility that the characteristic conditions of two frames f1 and f2 may conflict, so that there will
actually be no data that could satisfy the condition f1 ∧ f2, and any frame for which there is no data satisfying the
characteristic condition is said to be infeasible. The simplest kinds of conflict that could produce such infeasible frames are
where two categories c1 and c2 both apply to the same parameter (whether input or output), and define properties of that
parameter that are related by a constraint that prohibits the combination of the two partitions f1 and f2. More general forms
of conflict can arise whenever the specification of the system contains a constraint between two categories c1 and c2 that
apply to different parameters, such as the dependency of an output on an input: if this constraint means that the
combination of f1 and f2 is prohibited, then the frame f1 ∧ f2 will be infeasible.
This can be illustrated easily from the example, since two of the categories needed in the test specification of the route
finding function relate to the validity of the start and end town names respectively, and for each of these categories two of
the partitions would be that the named town either is or is not in the map of the bus network. Other partitions might be that
the input characters do not even constitute a possible town name, because they are not letters, or these could be separated
into categories for syntactic validity, as opposed to the semantic validity that derives from being contained within the map.
Other categories in this test specification will relate to properties of the route that is to be output, such as the number of
alternative routes and the number of intermediate towns, and for the latter the obvious partitions will be none, one and more
than one. Hence, we might have two basic frames:
f1 = “start town not in map”, and
f2 = “number of intermediate towns = one”,
but it should be clear that the frame f1 ∧ f2 will be infeasible, because if the start town is not in the map then the function
should find no routes at all, and so there certainly can not in this case be a route found with an intermediate town.
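This infeasibility can be checked mechanically on a small enough domain (a Python sketch; the three-town map and the route finder are invented for illustration): enumerating every candidate test case shows that none satisfies f1 ∧ f2.

```python
# Brute-force feasibility check for the combined frame f1 ∧ f2 over a tiny
# enumerated domain. The map and route finder are invented for illustration.

import itertools

town_map = {"A": ["B"], "B": ["C"], "C": []}   # A -> B -> C

def routes(start, end):
    """All routes from start to end in the toy acyclic map ([] if start unknown)."""
    if start not in town_map:
        return []
    if start == end:
        return [[start]]
    return [[start] + r for nxt in town_map[start] for r in routes(nxt, end)]

def intermediates(route):
    return len(route) - 2

# f1 = "start town not in map"; f2 = "number of intermediate towns = one"
def satisfies_f1_and_f2(start, route):
    return start not in town_map and intermediates(route) == 1

cases = [(s, e, r) for s, e in itertools.product(["A", "X"], ["C"])
         for r in routes(s, e)]
feasible = any(satisfies_f1_and_f2(s, r) for s, e, r in cases)   # False
```

The enumeration produces no case at all for an unknown start town (the route list is empty), so no test case can ever satisfy both frames at once.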
In principle this way of building up frames can obviously be extended until one has combined partitions from all the
categories, to produce what have sometimes been called complete test frames. In doing this, though, it is important to
avoid the construction of infeasible test frames, since (as noted in section 2) any infeasible frame is essentially invalid,
because it represents conditions that can not actually be tested. Hence, any process for constructing test frames must check
whether they are feasible, and take action to avoid producing them if they are not. This is why test specifications need to
represent the constraints between the different partitions and categories, in order to specify that if certain partitions have
been chosen, then either certain partitions from other categories must not be combined with them, or possibly even the
other categories must be ignored completely, in the sense that no partition from them may be included in the combination.
One effect of these constraints is to impose a hierarchical structure on the set of categories, and indeed Grochtmann and
Grimm’s classification tree method [12] is based on explicitly identifying such hierarchical structures. A classification tree
then contains a leaf node for each individual test case, and paths down the tree correspond to combinations of the partitions
represented by the intermediate nodes through which they pass. For instance, a part of the classification tree for this
example would be as shown in figure 2, where sets of edges are labelled with the relevant categories, and nodes are labelled
with the partitions drawn from these categories. From this it can be seen that, in order to avoid constructing infeasible test
frames, what is required here instead of a complete test frame is what might be called a sufficiently complete test frame,
meaning one that in terms of the classification tree model incorporates partitions from all the categories that are represented
by a particular path from the root of the tree to one of its leaf nodes.
Figure 2. Part of the classification tree for the route finding function. (Categories: semantic validity of towns, with partitions "start town not in map", "end town not in map" and "both towns in map"; number of intermediate towns, with partitions "none", "one" and "more than one".)
Unfortunately, from the perspective of the goals for the model being developed here, describing the concept of sufficient
completeness in this way has the weakness that any classification tree structure will be specific to a particular test
specification. Of course, there are generic aspects to such tree structures, as illustrated by their use in the method of test
templates developed by Stocks and Carrington [13, 14], but even this method can only take the generic aspects so far.
Beyond that point the construction of further levels of the tree structure has to reflect the specific properties of the
specification of the system, which in their examples they assume to have been defined in Z [15]. Also, the choice relation
framework that was developed by Chen et al for the category-partition method [16] defines some generic properties of the
relationships between choices from different partitions (namely whether one is fully embedded in another, partially
embedded or not embedded), but again the application of these relationships to any given system specification is wholly
specific to that system.
Thus, while in principle the model being developed here needs the notion of a test frame being sufficiently complete, it is
not appropriate to define this formally in terms of specific constructions such as classification trees, test templates, or the
sets of constraints in a test specification. For the moment, therefore, it will be assumed that it is going to be possible to
define this idea formally, so that testing methods can be described in terms of constructing sets of sufficiently complete test
frames, which will enable attention to be turned to the way in which these frames are used in the process. Then, the
problem of actually defining sufficient completeness will be returned to in section 6, once the application of these ideas to
structural testing has also been considered.
4. The Process of Functional Testing
The application of this model for test frames to the process of functional testing depends on two key notions. One of these
is that all significant functional test methods operate by generating a set of input-output test frames, although the validity of
this notion can only be demonstrated informally. The previous section has effectively demonstrated that this notion applies
to the category-partition method and others based on it, since the model of test frames that has been created is consistent
both with Ostrand & Balcer’s method and with the classification tree method. As functional test methods are usually described (for
instance by Roper [17]), many of them are essentially contained either within these two, or the method of cause-effect
graphing, to the extent that we will claim that any others are not significant. Then, it should be obvious that this notion
applies to the cause-effect graphing method too, since causes and effects are defined in terms of logical properties that
apply to the inputs and outputs respectively, and so map directly into the characteristic conditions of test frames that
correspond to them. Since the decision tables used in this method then simply produce test frames by constructing
disjunctions of the causes and effects, these too map directly into the structures of test frames as they are being defined in
this model, and so this method also generates a set of input-output test frames as defined here.
The other key notion can be defined formally, and this is the notion that any specification of a system will need (amongst
other things) to define the relationship between the inputs that can be supplied to that system and the corresponding outputs
that will be produced. In the case of the stream X-machine model this relationship is defined formally as the relation
consisting of all possible pairs of the form (s*, g*) where the machine computes g* from s*. If the machine is deterministic
then this relation will in fact be a function with type Σ* → Γ*, so that if this function is denoted Spec then the operation of
the machine can be described as g* = Spec (s*).
Thus, given these two key notions, when a functional test method has produced a set of input-output test frames, the
process of generating the corresponding set of test cases can be defined in terms of four steps, as follows.
1. Each input-output frame F in this set is considered separately, and it is decomposed into a pair of frames: an
input-only frame Fi and an output-only frame Fo, such that F = Fi ∧ Fo.
2. From the output-only frame Fo an equivalent input-only frame Fe is constructed by forming the relational
image of Fo under the inverse of the function Spec. Here (as in the Z notation), the postfix symbol ~ is used
to denote the operation of constructing the relation that is the inverse of a function or relation, and the special
brackets ⦇ ⦈ are used to denote the operation of forming the relational image, so that this construction of Fe can
be written as Fe = Spec~ ⦇Fo⦈.
3. A new input-only data frame Fd is constructed, as Fd = Fi ∧ Fe, and then if Fd is feasible any set of input data
d that satisfies Fd can be selected as the input for the test case.
4. Finally, the expected output from the test case must be determined, and this can be constructed as Spec (d).
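Over a finite input domain the four steps can be carried out by enumeration (a Python sketch; Spec and the frames are invented for illustration, with the relational image of Fo under the inverse of Spec computed as the set of inputs whose outputs satisfy Fo):

```python
# Sketch of the four-step process on a finite domain. Spec, the domain and
# the frames are invented for illustration only.

def Spec(s):
    return abs(s)            # a deterministic input-output function

domain = range(-5, 6)

# Step 1: decompose the input-output frame F into Fi and Fo.
Fi = lambda s: s <= 0        # input-only frame
Fo = lambda g: g >= 3        # output-only frame

# Step 2: Fe = relational image of Fo under the inverse of Spec,
# i.e. the inputs whose outputs satisfy Fo.
Fe = {s for s in domain if Fo(Spec(s))}

# Step 3: Fd = Fi ∧ Fe; if feasible, select any datum d satisfying it.
Fd = {s for s in Fe if Fi(s)}
d = min(Fd)

# Step 4: the expected output for the test case is Spec(d).
expected = Spec(d)
```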
In principle, therefore, this process defines formally how the set of test frames produced for any system by any functional
test method is converted into the equivalent set of test cases, which is a step that is usually not defined precisely in the
descriptions of the test methods themselves. As described in the previous section, though, an important issue in this
definition of the process is what should happen at the third step if the constructed data frame Fd is infeasible, since while
this obviously means that a test case can not be generated for this frame, the fact that the frame is therefore invalid also
means that some error has occurred earlier in the operation of the test method. Thus, the generation of such an infeasible
frame could indicate simply that the person who constructed the test specification for the system had not understood
correctly some aspect of the original system specification, and that therefore the test specification needed to be revised to
try to eliminate the infeasible frame. Alternatively, though, the infeasibility may indicate that some consequences of the
original system specification have not been defined clearly, so that while there might be an inconsistency between it and the
test specification, this would not necessarily indicate an error on the part of the person who had subsequently constructed
the test specification, but might just be a consequence of the weakness in the system specification.
In terms of this process, there are actually three possible situations that could give rise to an infeasible data frame Fd:
firstly Fi could be infeasible, secondly Fe could be infeasible, and thirdly Fi and Fe could both be feasible on their own,
but they could conflict. The simplest of these situations is the one where Fi is infeasible, since this clearly indicates that the
test specification did not correctly reflect the actual constraints on what inputs can be supplied to the system, so that the test
method produced a combination of partitions that could not actually occur in practice, and that should therefore have been
eliminated during the operation of the method. In this situation, therefore, the tester needs to go back and modify the test
specification, and then rerun the test method.
The same is also true of the situation where Fi and Fe are both feasible, but conflict. This indicates that the test
specification has produced a legal combination of input partitions, and a legal combination of output partitions, but that the
combination of the two is not legal. Hence, the test specification has not properly modelled the way in which this particular
combination of input partitions should be reflected in the properties of the outputs from the system, and so has allowed this
combination of input partitions to be combined in a test frame with a combination of output partitions that actually could
not result from that input. Again, therefore, the remedy is that the tester needs to go back and modify the test specification,
and then rerun the test method.
The remaining situation is the one where Fe is infeasible, which means that there are no inputs that could cause the system
to produce outputs satisfying the conditions of Fo. While this could indicate a simple error in the test specification, this is
not the only possible explanation, as it is very common for the specification of a system to be written in such a way that,
within what is notionally defined as the range of its possible outputs, there are some values that may actually never occur in
practice, so that the system may therefore not be completely controllable over the whole of its notional output domain.
Furthermore, the more complex a system is, the more likely it is that such “holes” in the output domain may occur, but they
may well not be defined directly in the system specification, and their existence may not be easy to analyse. If such “holes”
do exist, though, then any combination of output partitions Fo that is contained entirely within one of them will result in a
frame Fe that is infeasible. In such a situation, though, it may be necessary to work backwards through the system
specification from the infeasible frame in order to identify the “hole”, before the tester can then revise the test specification
so as to exclude the generation of such a combination of output partitions.
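On a finite domain, such "holes" can be found by brute force (a Python sketch; Spec and the domains are invented for illustration): enumerate the reachable outputs and subtract them from the notional output domain.

```python
# Sketch: locating "holes" in the notional output domain by enumeration.
# Spec and both domains are invented for illustration.

def Spec(s):
    return 2 * s                       # this system only ever outputs even values

notional_outputs = set(range(10))      # what the specification notionally allows
reachable = {Spec(s) for s in range(5)}
holes = notional_outputs - reachable   # outputs that can never actually occur
```

Any output-only frame Fo contained entirely within `holes` would yield an infeasible Fe, just as described above.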
In the example the operation of this extended process is that the partition “total distance is incorrect” produces an output
frame Fo that violates the specification Spec, and so the constructed input frame Fe = Spec~ ⦇Fo⦈ is infeasible. In analysing
the cause of this infeasibility (which is not a step that is required by the category-partition method as normally described) it
will then become very obvious that the partition that gives rise to Fo does lie outside the range of possible outputs of the
system as specified, which highlights that this partition must be illegal, and so can not just be ignored, because any test
specification for this system which contains such a partition must be erroneous. Hence, this form of the process makes it
much clearer than was previously the case that a test specification for this system that contains such a partition needs to be
revised, so as to remove the erroneous partition, and consequently also remove the category that would give rise to it.
The other issue that arises from this treatment of infeasible frames is that in principle the inverse relation Fe = Spec~ ⦇Fo⦈
may not necessarily be computable, so that in theory it may not be possible to determine whether or not a frame Fe is
feasible. In practical testing, though, this is not a significant issue, since the important step is to be able to find at least one
set of values that satisfy Fe. Typically, rather than actually trying to compute Spec~, a tester will approach this step by
using a method that can be described as successive approximations, in which each approximation consists of applying Spec to
some likely set of input values, to determine whether they produce a result that satisfies Fo. If such a set of input values
can be found, then Fe must be feasible; but if the tester fails to find such a set of values, then in practice they will have to
treat Fe as being infeasible, and so the question of whether or not it is theoretically infeasible will become irrelevant.
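The successive-approximation tactic amounts to a simple search (a Python sketch; Spec and the candidate inputs are invented for illustration):

```python
# Sketch of successive approximations: rather than computing the inverse of
# Spec, try likely inputs in turn and keep the first whose output satisfies Fo.

def Spec(s):
    return s * s

def find_input(Fo, candidates):
    """Return an input whose output satisfies Fo, or None if the search
    fails (in practice the frame is then treated as infeasible)."""
    for s in candidates:
        if Fo(Spec(s)):
            return s
    return None

hit = find_input(lambda g: g > 20, range(10))   # some square exceeds 20
miss = find_input(lambda g: g < 0, range(10))   # squares are never negative
```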
5. Structural Testing
The common feature of all structural test methods is that each test case can be characterised in terms of the set of structural
elements that its execution covers, where different test methods are distinguished primarily by the different definitions that
they use of what constitutes a structural element. This feature is then extended to characterizing a test set in terms of the
union of the sets of structural elements that the individual test cases cover, and typically the goal of such test methods is to
build up a test set for a system that covers some specified fraction (often 100%) of the structural elements that make up the
code of that system.
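The characterisation of a test set as the union of the structural elements covered by its individual test cases can be sketched as follows; the mapping covered_by and the element identifiers are assumptions of this illustration:

```python
# Minimal sketch of coverage as the union of the element sets covered by
# individual test cases, measured against the full set of elements.

def coverage(test_set, covered_by, all_elements):
    covered = set()
    for tc in test_set:
        covered |= covered_by(tc)          # union over individual cases
    return len(covered & all_elements) / len(all_elements)

# Toy example: elements are branch identifiers.
elements = {"b1", "b2", "b3", "b4"}
cov = {1: {"b1", "b2"}, 2: {"b2", "b3"}}
assert coverage([1, 2], cov.get, elements) == 0.75
```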
To extend the definition of test frames to structural testing, therefore, all that is required is to reverse the notion of a test
case covering a set of structural elements, so as to define the criterion for a structural test frame as being of the form that a
given set of structural elements must be covered. Then, any test case that covers this set of structural elements satisfies this
test frame. In practice, though, there is also an implicit expectation that any such test frame should be sufficiently tightly
90
UKTest 2005
defined (ie should require sufficiently many elements to be covered) that all test cases satisfying it should cover exactly the
same set of structural elements.
Given such a definition of structural test frames, we can again assert that structural test methods essentially operate by
producing a set of test frames, typically by an incremental process of analysing which structural elements are not covered
by the test cases satisfying these frames, and then adding further frames to cover these elements. The details of this
incremental process do not need to be discussed here, beyond noting that in order to do the required analysis of candidate
test frames a definition is needed of the relationship between the inputs that can be supplied to a system and the
corresponding structural elements that will be covered.
In applying these concepts to the stream X-machine model, the starting point is that structural elements can be defined in
terms of any of the components Q (the control states), M (the memory states), Φ (the processing functions) or F (the next-
state function), or of combinations of these components. The most comprehensive form of structural element is then the
execution path of the machine, which is usually denoted as an object of a type called Path. This type is defined as a
sequence of alternate states (ie pairs of control and memory states) and invocations of processing functions, where each
processing function is applied to the previous memory state and input, and takes the machine to the next control and
memory state (via the next state function), while producing the relevant output. For any deterministic stream X-machine
and input sequence s*, the corresponding path p is computed by a function that can be denoted Exec, of type Σ* → Path, so
that we can write p = Exec (s*). Coverage of elements corresponding to any of the more basic components Q, M, Φ or F
(or combinations of them) can then be obtained from the path p by projection.
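The computation p = Exec (s*) can be sketched for a deterministic stream X-machine as follows; the encoding of the machine as a single next_step callback, and the counter machine used as an example, are assumptions of this sketch and not part of the formal model:

```python
# Illustrative sketch of Exec for a deterministic stream X-machine: from a
# control state, a memory value and an input stream, build the execution
# path as alternating (control, memory) states and function invocations.

def exec_path(next_step, q0, m0, inputs):
    """next_step(q, m, s) -> (phi_name, q', m', output), or None if stuck."""
    path = [(q0, m0)]
    q, m = q0, m0
    outputs = []
    for s in inputs:
        step = next_step(q, m, s)
        if step is None:
            break                      # no applicable processing function
        phi, q, m, out = step
        path.append(phi)               # record the function invoked...
        path.append((q, m))            # ...and the state it leads to
        outputs.append(out)
    return path, outputs

# Hypothetical one-state counter machine with a single function "inc".
def counter(q, m, s):
    if s == "inc":
        return ("inc", "q0", m + 1, m + 1)
    return None

path, outs = exec_path(counter, "q0", 0, ["inc", "inc"])
```

Coverage of the more basic components Q, M, Φ or F can then be obtained by projecting from the returned path, as the text describes.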
Given this model, then the process of generating a set of test cases from the set of test frames produced by a structural test
method can be defined in terms of the following three steps, which are similar to those of the specification-based method.
1. Each frame F in this set is considered separately, and from it is constructed an equivalent input-only frame Fe,
which is defined as Fe = Exec~ (F).
2. If Fe is feasible then any set of input data d that satisfies Fe can be selected as the input for the test case.
3. Finally, the expected output from the test case must be determined, and as before this can be constructed as
Spec (d).
In this process the issue of feasibility arises because in any piece of code there will be constraints between the different
structural elements that mean that certain combinations of them can not be covered by the same test case. As an obvious
example, given any piece of code of the form
if condition
then block1
else block2
fi
that is not nested inside any loop construction, then it will be immediately apparent that any single test case can only result
in the execution of either block1 or block2, but not both. Consequently, any test frame that required coverage of structural
elements derived from both block1 and block2 would be infeasible. Typically, though, structural test methods do not
employ any form of test specification to capture such constraints, but rather they would rely on the iterative step of splitting
any such infeasible test frame into two frames, one relating to the structural elements derived from block1 and the other
relating to those derived from block2. The precise details of how such splitting of infeasible frames might be
accommodated within the test method will depend on the method itself, and in this paper it is not practical to try to discuss
this separately for each of the various structural test methods that have been proposed. The key point, though, is the
general one that this process of splitting frames is an iterative one, so that if one of the frames resulting from it is still
infeasible, then it can be split further, until eventually a set of frames is produced in which every individual frame is
feasible.
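The iterative splitting of infeasible frames can be sketched as follows; the representation of a frame as a set of required structural elements, and the feasibility and splitting callbacks, are assumptions of this illustration rather than part of any particular structural test method:

```python
# Sketch of iterative splitting: a frame demanding incompatible structural
# elements is split into smaller frames until every frame is feasible.

def split_until_feasible(frames, is_feasible, split):
    """split(frame) -> list of smaller frames, each covering a subset."""
    result = []
    work = list(frames)
    while work:
        f = work.pop()
        if is_feasible(f):
            result.append(f)
        else:
            work.extend(split(f))      # split further and re-check
    return result

# Toy example: 'block1' and 'block2' are the two arms of one conditional,
# so no single test case can cover both.
feasible = lambda f: not ({"block1", "block2"} <= f)
split = lambda f: [f - {"block1"}, f - {"block2"}]
out = split_until_feasible([frozenset({"block1", "block2", "x"})],
                           feasible, split)
```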
6. Representative Test Cases
Having shown that this model for test frames can be applied to both functional and structural testing, the next stage in
developing the model is to return to the issue of the completeness of test frames, and in particular to the aspect of
specifying how many partitions need to be included in a test frame. Underlying this issue is the observation that was made
originally by Bernot et al [18], that essentially any software testing is concerned with trying to make a generalisation, from
statements of the form “for some test case the system works correctly” to “for all test cases satisfying certain conditions the
system works correctly”. In terms of any system, and some test frame tf for it, this generalisation can be expressed as going
from
∃ a test case tc satisfying tf • system is correct for tc
to
∀ test case tc satisfying tf • system is correct for tc
which makes it immediately obvious that some additional justification is required to support the validity of the
generalisation.
In principle this additional justification will come from the property of test cases that, under some suitable criterion, it is
possible for one of them to represent some set of other similar tests, where the degree of similarity is such that it is indeed
reasonable to claim that if the system passes one of the tests in this set then it should pass all of them. Here the criterion
corresponds to what Bernot et al call the uniformity hypothesis, and it relates to the kind of test method being employed, so
that in a structural test method it will mean that the execution paths for each test case in the representative set are
sufficiently similar that each case will cover the same set of structural elements, whereas in a functional test method it will
mean that in some sense the functional behaviour of the system is the same for each test case.
In terms of test frames, the implication of this property is that a test frame can be treated as sufficiently complete if each
test case that satisfies it is representative of the whole set of test cases that satisfy it, so that we can define that the test frame
is a representative frame under the criterion being used. For structural test frames this property of them being
representative can, as indicated above, be expressed directly in terms of the similarity of the execution paths of the test
cases that satisfy the frame, but for functional test frames the equivalent notion that is required is that of similarity of
specified behaviour, and this is more difficult to define formally.
In terms of the category-partition method, another way of looking at the requirement for test frames to be representative is
that it is equivalent to requiring that each category and partition must be “small enough” that each legal combination of all
(or enough) of them must be representative. This in turn then requires that each of these combinations of the partitions
must capture just a single kind of behaviour, so that within any representative test frame there must not be any alternative
patterns of behaviour that might occur instead. On the other hand, for reasons of efficiency, we do not wish to achieve this
by making the categories or partitions smaller than they need to be, so that there should not be any cases where two
different test frames actually correspond to the same behaviour of the system.
In order to capture this property in a formal model, there needs to be a way of representing this requirement for a set of test
cases to be representative, namely that the behaviour of the system must be uniform for each case in the set. To model this
requirement, a concept is introduced that will be called structural continuity. Informally, the basis of this concept is that
system specifications or implementations, or components of them such as conditions, expressions or statements, can be
regarded as structurally continuous over any part of their domain in which their behaviour is in some sense uniform, but
that any boundary within this domain where the behaviour changes (ie from one alternative to another) represents a
structural discontinuity. In the case of specifications this can be formalised by observing that any deterministic
specification can be expressed in a clausal form, in which the set of alternative clauses can be written as
clause1 else clause2 else ... clausen
and where each clausei has the form
if PartitionCombinationi then Expressioni
and where the various conditions that are denoted PartitionCombinationi are all mutually exclusive.
The simplest case of such a clausal form specification is one that only requires a single clause, and where the expression in
the clause does not involve any alternatives either. In this case the expression is implicitly structurally continuous, and so
such a specification is defined to be structurally continuous. By contrast, if the expression does involve alternatives, in a
form which means that in order to cover all the possible behaviours allowed by the partition combination it should itself be
expressed in clausal form using more than one clause, then the specification is said to be structurally discontinuous, and
similarly any specification that requires more than one clause is in principle structurally discontinuous too. In practice, though,
there is an important intermediate situation, which is a specification that requires a number of clauses, but where the
expressions in each of the clauses are all structurally continuous. Such a specification is defined to be piecewise
structurally continuous.
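The clausal form described above can be sketched directly in code; the representation of clauses as (guard, expression) pairs, and the abs example, are assumptions of this sketch:

```python
# Sketch of a specification in clausal form: mutually exclusive
# partition-combination guards, each selecting a single structurally
# continuous expression.

def clausal_spec(clauses, x):
    """clauses: list of (guard, expression); guards mutually exclusive."""
    for guard, expr in clauses:
        if guard(x):
            return expr(x)
    raise ValueError("no clause applies: specification incomplete")

# abs as a piecewise structurally continuous spec with two clauses, each
# clause's expression (x and -x) being structurally continuous.
abs_spec = [(lambda x: x >= 0, lambda x: x),
            (lambda x: x < 0,  lambda x: -x)]
assert clausal_spec(abs_spec, -3) == 3
```

Here each guard corresponds to a partition combination, and so, as the text notes, to a test frame for the system.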
The significance of defining structural continuity in this way is that, in such a clausal form specification for a system, then
given the way in which test frames have been defined, each partition combination will be a test frame for that system.
Consequently, if a specification is piecewise structurally continuous, then (because the corresponding expression in each
clause involves no alternative behaviours), each of its partition combinations must therefore constitute a test frame for the
system that will be representative under the criterion for functional testing. This therefore provides the formal equivalent
for the informal notion of sufficient completeness that was introduced in section 3: a test frame is sufficiently complete if
the corresponding clause in the specification (meaning, the one that has this test frame as its partition combination) has an
expression that is structurally continuous. Hence, test frames that are sufficiently complete according to this definition will
also be representative under the criterion for functional testing. In both of these cases, though, the restriction to the
criterion of functional testing is important, as it can not be guaranteed that such test frames will also be representative under
criteria for structural testing. This is because there is no guarantee that the designers or the implementers of the system will
have made the implementations of each clause in the specification completely uniform for all cases of the behaviour
represented by that clause, even though this might be a reasonable thing for them to aim at doing, and indeed it is
sometimes referred to as the “reasonable implementation” principle when justifying the significance of the representative
properties of functional test sets.
Another useful property of such piecewise structurally continuous specifications is that, using the equivalence of the
partition combinations and the test frames, one can rewrite each clause in the general form
TestFramei ∧ (output = Expressioni)
Then, since the assumption has been made that the partition combinations (ie the test frames) are mutually exclusive, it will
be apparent that any such specification is actually in disjunctive normal form, as is commonly derived during the operation
of methods for generating test cases from formal specifications, such as those described originally by Dick & Faivre [19]
and then developed by the many successors to them (see, for example, [20] for a review of these).
Related to this is the property that any specification which is not in piecewise structurally continuous form can be rewritten
into this form. This follows from the fact that, in a clause of the form
if TestFramei then Expressioni
if Expressioni is not structurally continuous then it must involve alternative behaviours, and so must actually be in a form
such as if conditionj then Expressioni1 else Expressioni2. Hence, this single clause can be rewritten as the pair of clauses
[if TestFramei ∧ conditionj then Expressioni1] else
[if TestFramei ∧ ¬ conditionj then Expressioni2]
This rewriting step is very similar to the unfolding process for conditional axioms that is described by Bernot et al, and
ideally repeated applications of it should eventually reach a form in which all the combinations of test frame and condition
are mutually exclusive, and all expressions are structurally continuous. If such a form can be achieved, then the whole
specification will be in piecewise structurally continuous form. In principle, though, such a form may not be achievable:
indeed, it may not even be possible to compute the number of rewriting steps that might be required for it. In this case,
some arbitrary upper bound has to be assumed for the number of rewriting steps that will be applied, which corresponds to
what Bernot et al call the level of the regularity hypothesis.
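A single unfolding step of this rewriting can be sketched as follows; the concrete representation (a clause body is either a plain function, or a triple marking a remaining alternative) is an assumption of this sketch, not the unfolding process of Bernot et al:

```python
# Sketch of one unfolding step: a clause whose expression still contains an
# alternative (cond, b1, b2) is rewritten into two clauses with
# strengthened, mutually exclusive guards.

def unfold_once(clauses):
    out = []
    for guard, body in clauses:
        if isinstance(body, tuple):            # body = (cond, b1, b2)
            cond, b1, b2 = body
            out.append((lambda x, g=guard, c=cond: g(x) and c(x), b1))
            out.append((lambda x, g=guard, c=cond: g(x) and not c(x), b2))
        else:
            out.append((guard, body))          # already continuous
    return out

# Hypothetical spec: if x >= 0 then (if x == 0 then 0 else 1) else -1
spec = [(lambda x: x >= 0, (lambda x: x == 0, lambda x: 0, lambda x: 1)),
        (lambda x: x < 0, lambda x: -1)]
flat = unfold_once(spec)   # three clauses, all structurally continuous
```

Repeated application of such a step, up to the assumed bound on the number of rewritings, yields the piecewise structurally continuous form discussed in the text.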
This process can then be developed to make these definitions of structural continuity and piecewise structural continuity
completely rigorous, by looking more closely at the way in which expressions are built up. The simplest case is that an
expression uses variables but no operators, and in this case it must be structurally continuous. Any expression in a
specification that is more complex than this must use some operators, but any operator Opj can be assumed to have a
specification in clausal form, and by the construction above it can without loss of generality be required that this
specification be in piecewise structurally continuous form. Under these conditions it will then follow that Expressioni will
be structurally continuous if
∀ Opj used in Expressioni • (∃ clause k in OpSpecj • PartitionCombinationi ⇒ PartitionCombinationk)
Furthermore, if this condition is satisfied, then since for any Opj the PartitionCombinationk are mutually exclusive, the
clause k satisfying the condition must be unique, because the variables occurring in PartitionCombinationk will be a subset
of those occurring in PartitionCombinationi, and the specified ranges of these must correspond to a unique clause in the
specification of Expressioni.
An issue that is still left open by this construction is that of precisely how an operator is specified, since there may well be
cases where it is convenient to regard operators as structurally continuous in practice, even though in principle they are
defined in such a way that they are only piecewise structurally continuous. This issue will be returned to in section 8, but
before doing so it is appropriate to examine how the concept of structural continuity applies to implementations.
7. Structural Continuity of Implementations
The application of the concept of structural continuity to implementations in conventional programming languages is built
up in the usual fashion, starting with the most basic forms of statement and then going on to the various kinds of structured
statements. Basic statements, such as assignments, will inherently be structurally continuous if the expressions that they use
are. Thus, if the expressions just use individual variables, the statements will be structurally continuous, or if the
expressions involve any form of operator application or function call and all of the operators or functions are structurally
continuous then the expression will also be structurally continuous. Conversely, if any of the operators or functions in it are
structurally discontinuous then the whole expression will be structurally discontinuous, but if some or all of them are
piecewise structurally continuous and the rest (if any) are structurally continuous then (by the construction given in the
previous section) the whole expression can be regarded as equivalent to one that is in piecewise structurally continuous
form. Similarly, if the expression in a statement is piecewise structurally continuous, then for the purposes of analysing its
behaviour when executed the statement can also be regarded as being piecewise structurally continuous.
When such statements are composed sequentially into a block, then if all the individual statements are structurally
continuous the block will be structurally continuous too. If one of the statements is only piecewise structurally continuous
then similarly the block will be piecewise structurally continuous, and this generalises to the case where more than one of
the statements in the block is piecewise structurally continuous. Here, though, there will in principle need to be one clause
in the equivalent piecewise structurally continuous form for each possible combination of the clauses in the forms for the
individual statements, although in practice there may be constraints between the conditions of the clauses that mean that
some of the combinations can be ignored.
For any kind of structured statement that expresses a choice, such as
if condition then block fi ,
if condition then block1 else block2 fi, or more general forms such as
switch expression case block1 case block2 ... case blockn end ,
the basic definition of the concept of structural continuity means that it should be obvious that the statement will inherently
be structurally discontinuous, unless the blocks of code nested within it are all structurally continuous, in which case the
whole statement is piecewise structurally continuous. Alternatively, if each block nested within the statement is either
structurally continuous or piecewise structurally continuous, then by the construction given in the previous section the
whole statement can be regarded as being equivalent to one that is in a piecewise structurally continuous form. Thus, for
the purposes of analysing its execution behaviour, such a statement can be treated as being piecewise structurally
continuous.
The structured statements that express repetitions conventionally divide into two groups, depending on whether or not the
number of repetitions is fixed at the start of execution of the statement. Statements that provide for a fixed number of
repetitions typically have a form such as
for range of iterations do block end
and if the possibility of there being zero repetitions is ignored, then essentially they can be treated as the sequential
composition of the appropriate number of occurrences of block. Hence, in principle the number of clauses in the
equivalent piecewise structurally continuous form would be the number of clauses in the form for block, raised to the
power of the number of repetitions, which means that in practice this form will suffer from a combinatorial explosion in the
number of clauses.
A similar problem applies to the statements that provide for indefinite numbers of repetitions, such as
while condition do block end or
repeat block until condition end
since these are inherently structurally discontinuous, but again in principle if the blocks of code nested within them are
piecewise structurally continuous then it ought to be possible to treat the resultant statements as piecewise structurally
continuous. In practice, though, this would involve expanding the loop up into a series of alternatives, one for each
possible number of iterations, and since each of these alternatives would suffer from the kind of combinatorial explosion
that affects fixed numbers of repetitions, the resultant piecewise structurally continuous equivalent would suffer from an
even more severe combinatorial explosion, which would make it intractable for any serious analysis.
The final form of statement that needs to be analysed is the invocation of a procedure or equivalent kind of routine, for
which the structural continuity will be the same as for the code of the procedure or routine being invoked. In particular,
this means that any correctly formed recursive procedure or function must inherently be either structurally discontinuous or
piecewise structurally continuous, since the requirement for correct formation means that any such procedure must contain
some form of choice statement in order to select either the base or the recursive case.
8. Integration Testing
Given the problems of combinatorial explosion that can arise with the number of clauses in a piecewise structurally
continuous block of code, as described above, an issue that is obviously significant is that of just how many clauses are
required to define the behaviour of operations, and in particular whether it is possible to control this number, perhaps by
treating some of the more primitive operations as being structurally continuous in practice, even though in principle they
are only piecewise structurally continuous. The point here is that typically a primitive operation will be specified in terms
of a set of axioms, which will often have been derived from the structure of primitive type definitions that are inherently
recursive, with the consequence that the set of axioms will give rise to a piecewise structurally continuous specification.
As an illustration, at the most primitive level natural numbers are defined in terms of a recursive data type that uses two
constructor functions, typically denoted zero (a constant) and succ (a unary function). Consequently, any primitive
operation over the natural numbers will be specified in terms of axioms that define the value that it returns for each
combination of these two possible constructions. For instance, in the case of addition the relevant axioms will be:
zero + zero = zero,
zero + succ (n) = succ (n),
succ (n) + zero = succ (n), and
succ (n1) + succ (n2) = succ (succ (n1 + n2) ).
Hence, this set of axioms produces a specification that is at best piecewise structurally continuous, and that appears to
require four clauses, although actually the need to unwind the recursions means that in principle a separate clause is
required for every possible combination of natural number values.
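The four axioms can be transcribed directly as a recursive function over a zero/succ representation; the tuple encoding of the constructors is an assumption of this sketch:

```python
# The four addition axioms above, transcribed over a zero/succ
# representation of the natural numbers (tuples used as constructors).

ZERO = ("zero",)

def succ(n):
    return ("succ", n)

def add(a, b):
    if a == ZERO and b == ZERO:
        return ZERO                        # zero + zero = zero
    if a == ZERO:
        return b                           # zero + succ(n) = succ(n)
    if b == ZERO:
        return a                           # succ(n) + zero = succ(n)
    return succ(succ(add(a[1], b[1])))     # succ(n1) + succ(n2)

def to_int(n):
    return 0 if n == ZERO else 1 + to_int(n[1])

assert to_int(add(succ(succ(ZERO)), succ(ZERO))) == 3   # 2 + 1 = 3
```

The recursive call in the fourth clause is exactly the unwinding that, as the text notes, in principle demands a separate clause for every combination of values.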
Similarly, if one considers the implementation of natural numbers in terms of the usual binary representations for unsigned
integers, each number needs to be mapped into a string of bits, and then the usual half-adder and full-adder operations need
to be defined in terms of pairs or triples of bits, and extended to strings of appropriate maximum lengths. Thus, the
implementation of the basic half-adder operation will be piecewise structurally continuous, with clauses for the four pairs
(0, 0), (0, 1), (1, 0) and (1, 1), and similarly the implementation of the full-adder operation will require clauses for the eight
possible triples. Then, the extension of these to appropriate length bit strings will involve embedding these in a loop that
operates over the maximum number of bits in the string, and so in principle it will again give rise to a clause for every
possible combination of natural number values.
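This half-adder / full-adder construction, extended by a loop over a fixed-width bit string, can be sketched as follows (little-endian bit lists are an assumption of this illustration):

```python
# Sketch of the half-adder and full-adder operations, extended to
# fixed-width bit strings by a loop over the bits (little-endian lists).

def half_adder(a, b):
    return a ^ b, a & b                  # (sum, carry): clauses for 4 pairs

def full_adder(a, b, cin):
    s1, c1 = half_adder(a, b)
    s2, c2 = half_adder(s1, cin)
    return s2, c1 | c2                   # clauses for the 8 triples

def ripple_add(xs, ys):
    """xs, ys: equal-length little-endian bit lists; returns (bits, carry)."""
    out, carry = [], 0
    for a, b in zip(xs, ys):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out, carry

# 3 + 5 in three bits overflows into the carry: result 0, carry-out 1.
assert ripple_add([1, 1, 0], [1, 0, 1]) == ([0, 0, 0], 1)
```

The loop over the string is what, in principle, reintroduces a clause for every combination of values, as the text observes.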
In practice, though, when testing a piece of software that used such an addition operation, one would not want to focus on
testing every possible set of inputs to the addition, since one would normally expect that this operation had already been
thoroughly tested, probably using both functional and structural methods, and so could be relied upon to operate correctly.
Thus, one would want instead to focus on issues such as whether the expression using the addition operation had been
written correctly: for instance, to identify faults such as writing a * b or a – b, instead of the expression a + b that was
actually intended. Furthermore, in order to find test frames that will be particularly appropriate for identifying such faults,
if they have occurred, the ones that arise most naturally from either the specification or the implementation of the addition
operator may be of little value. For instance, test cases with either a = 0 or b = 0 may not help much in distinguishing the
correct expression from the incorrect ones.
What is needed, therefore, is some way of treating such primitive operations as though they are structurally continuous, in
order to avoid the combinatorial problems that would otherwise result from having to recognise that actually they are only
piecewise structurally continuous. If this is to be done, though, the decision about which operations are to be treated as
being structurally continuous in this way has to be made on the basis of how thoroughly the operations have actually been
tested, or even verified, rather than on any inherent property of their specifications or implementations. This is not a new
situation, and indeed even for the stream X-machine testing method the claim that it will find any possible faults in the
system under test is predicated on the assumption that the individual processing functions have already been thoroughly
tested, so that the only remaining faults can be those arising from the way in which these functions have been integrated
into the complete X-machine system.
What this leads to, therefore, is a new interpretation of the usual view of the process of integration testing as being one of
assembling components into a hierarchy of sub-systems that are regarded as having been tested. In this interpretation, each
node in the hierarchy now consists of a component or sub-system that, because of the way in which it has been tested or
verified, is effectively declared to be structurally continuous, even though its underlying specification or implementation is
only piecewise structurally continuous. Thus, the leaf nodes in this hierarchy will be whatever operations are deemed to be
“primitive” for the purpose of the system being tested, which might well be those built in to whatever programming
language is being used for the development. On the other hand the hierarchy does not have to stop here, and for ultimate
correctness one might want to go down further, through the various layers of the system software (compiler, loader,
operating system, etc), and perhaps even to the microcode or the hardware itself.
In the other direction, as components or sub-systems are tested to the point where it is considered safe to declare them to be
structurally continuous, then they too can be added to the hierarchy at a layer above the components on which they depend.
Thus, in the X-machine test method, the layer above the primitive operations would have the processing functions declared
to be structurally continuous and added to it, as the testing of each of them is completed. Once all of these functions have
been assembled into this hierarchy, then at the next layer up the X-machine that uses the functions can be tested according
to the method, and eventually added to the hierarchy. Then, if these machines in their turn are used as functions in a
higher-level X-machine, which is one way in which the X-machine approach can be used to specify complex systems, this
higher-level X-machine can in its turn be tested using the X-machine method, and added to the hierarchy, and so on.
In theory there are then two possible extensions to this approach. One is that it could be extended to the case where the
underlying specification or implementation of a component is structurally discontinuous rather than being piecewise
structurally continuous, although the significance of the construction presented in section 6 is that in practice this extension
should never be necessary. The other possible extension would be to allow a component or sub-system to be declared to be
piecewise structurally continuous, but with fewer clauses than its underlying specification or implementation. Here,
though, the examples given above suggest that the underlying specification or implementation may not be of much help in
identifying which clauses would need to be retained in the new piecewise structurally continuous form, but this may not
always be the case.
For instance, suppose that a component had been produced to implement a stack structure. Intrinsically the behaviour of
this might well be piecewise structurally continuous, with a clause being required for each possible value of the number of
items in the stack. Once it had been thoroughly tested, then in principle one might want to deem it to be structurally
continuous before trying to integrate it with other components. In practice, though, there will be an obvious boundary
where its behaviour will not be completely uniform, in that some operations (such as top and pop) will behave differently
when the stack is empty from when it contains some data. Hence, this would probably have to be recognised by treating the
component as piecewise structurally continuous, with either just two clauses (for empty and non-empty) or three (for
empty, full and in between), which is likely to be many fewer clauses than would be required by its intrinsic behaviour.
What this indicates is that, for such an extension to be useful, a method needs to be developed for identifying the clauses
that would be relevant to the use of such components or sub-systems. Consideration of such a method is beyond the scope
of this paper, although it can be observed that any such method would have to involve analysing the use of the component
or sub-system. For instance, in the case of this stack example, any use of an operation such as top or pop will have to
recognise that it may deliver a legitimate value, or if the stack is empty it may have to take some other action to indicate
this, such as raising an exception. Thus, one basis for such a method might be to start from the assumption that any
integrated component can in principle be treated as structurally continuous, but then in practice superimpose on top of this
whatever test frames (and their associated clauses) might arise from these different possible patterns of use of the various
operations.
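The stack example can be sketched as follows; the choice of an exception to signal the empty-stack clause is an assumption of this illustration, one of the "other actions" the text mentions:

```python
# Sketch of the stack example: pop's behaviour is piecewise structurally
# continuous with just two clauses, one for the empty stack and one for
# the non-empty case, rather than one clause per possible stack depth.

class Stack:
    def __init__(self):
        self._items = []

    def push(self, x):
        self._items.append(x)

    def pop(self):
        if not self._items:                # clause 1: empty stack
            raise IndexError("pop from empty stack")
        return self._items.pop()           # clause 2: non-empty stack

s = Stack()
s.push(1)
assert s.pop() == 1
```

Any use of pop must therefore recognise both clauses, which is precisely the kind of usage-derived test frame that the proposed method would superimpose on a component otherwise treated as structurally continuous.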
9. Summary and Conclusions
The main conclusion to be drawn from this work is that it is possible to define test frames in terms of the characteristic
conditions that identify sets of test cases, and indeed there are significant advantages in doing so. Unlike the original
definition of test frames that was given by Ostrand & Balcer, such a model is independent of the structure of categories or
partitions that are needed in the test specification for any particular system. Partly because of this independence, this model
then has the advantage that it enables a unified view to be taken of test methods, since they can all – both functional and
structural methods – be regarded as generating a set of input-only test frames. For each of these kinds of method the model
then provides a basis for defining a formal structured process for the activity of generating the required test cases (inputs
and expected outputs) from these test frames. Furthermore, through its treatment of infeasible test frames, this process
identifies more explicitly than before those situations where errors might have arisen in the operation of the test
method itself.
The second key feature of this model is that it provides a good basis for specifying formally an important property of a test
frame, namely that the test set that it generates is a representative one. This property depends on being able to specify
formally the requirements for a test set to be representative, and for this purpose the concept of structural continuity has
been introduced, and it has been shown to provide a good basis for doing this. As with the model of test frames, this
concept too is applicable to both functional and structural testing, and for both kinds of testing representative sets of test
frames can be derived directly from any construction that is in a piecewise structurally continuous form. Furthermore, it
has been shown that for a specification that is structurally discontinuous it is always possible to produce an equivalent
piecewise structurally continuous form, or at least one that can be treated as piecewise structurally continuous under some
assumed bound on the number of rewritings that are required, so that for functional testing this concept has the highly
desirable property that it can always lead to a set of functional test cases that can be taken as representative.
The third key feature of this model is that this concept of structural continuity provides a theoretical basis for the practical
approach to integration testing, in which units of code are assembled into progressively larger sub-systems until the whole
system being developed has been integrated, thus leading to a hierarchical structure for the system as a whole. Informally it
is obvious that each level of this hierarchy consists of sub-systems that have been tested sufficiently thoroughly that they
can be regarded as having been properly integrated in some sense, although usually the meaning of “properly integrated”
here is fairly imprecise. The importance of the concept of structural continuity is that it enables this to be made much more
precise, since effectively it treats “properly integrated” as meaning that at least some aspects of the behaviour of this sub-
system can now be deemed to be structurally continuous, even though intrinsically it is actually only piecewise structurally
continuous.
This still leaves two main issues open as further work. One of these issues is that of how this approach to the hierarchical
structure of the integration testing process can be extended, so that instead of treating all sub-systems that have been
integrated as structurally continuous, some can where necessary be deemed to be piecewise structurally continuous, but
with fewer clauses than would be required to model their intrinsic behaviour. As described in the previous section, this
requires some method for analysing how such sub-systems or components are used, so that the different possible cases that
need to be treated as separate clauses can be clearly distinguished. The development of such a method, and its
incorporation into the process of integration testing, is a problem that still needs to be addressed.
The other main open issue is that of how this model of test frames, and the associated concept of structural continuity,
should be extended from the function-based testing regime that has been discussed here, to state-based testing. Clearly
such an extension will not affect the basic concepts of verifying the conformance of states in the implementation to states in
the specification, and the conformance of sequences of state transitions. What still needs to be investigated, though, is
whether it might be necessary to test some particular sequences of state transitions more than once, in order to cover
different combinations of categories and partitions in the specifications of the processing functions that are invoked to
perform the state transitions. If it is necessary, then one might expect that this would be because of some kind of structural
discontinuity in one of the processing functions, but the possible nature of any structural discontinuity that might require
such combinations of test cases is not clear, and requires further analysis.
Acknowledgements
The material presented in this paper has benefited greatly from discussion with colleagues within the FORTEST network,
which was funded by EPSRC, and in particular, the contributions of Stuart Reid and Robert Hierons to these discussions
are gratefully acknowledged. The presentation of the material has also benefited greatly from the comments of the referees
on the original version, in which they identified a number of additional issues that ought to be discussed, and suggested
relevant references for these.
References
1 Ostrand TJ, Balcer MJ. The Category-Partition Method for Specifying and Generating Functional Tests. Communications of the ACM, June 1988; 31(6): 676-686.
2 Cowling AJ. What Should Graduating Software Engineers Be Able To Do? Proceedings of 16th Conference on Software Engineering Education and Training, Madrid, Spain, March 2003. IEEE Computer Society Press: Los Alamitos, CA, 2003; 88-98.
3 Cowling AJ. Teaching Data Structures and Algorithms in a Software Engineering Degree: Some Experience with Java. Proceedings of 14th Conference on Software Engineering Education and Training, Charlotte, North Carolina, USA, March 2001. IEEE Computer Society Press: Los Alamitos, CA, 2001; 247-257.
4 Freedman RS. Testability of Software Components. IEEE Transactions on Software Engineering, 1991; 17(6): 553-564.
5 Balcer MJ, Hasling WM, Ostrand TJ. Automatic Generation of Test Scripts from Formal Test Specifications. Proceedings of 3rd Symposium on Software Testing, Analysis and Verification, Key West, FL, December 1989. ACM Press: New York, NY, 1989; 210-218.
6 Ostrand TJ. Generating Formal Specifications from Test Information. Proceedings of 2nd Workshop on Formal Approaches to the Testing of Software (FATES), Brno, Czech Republic, August 2002 (at <http://www.brunel.ac.uk/~csstrmh/concur2002/fates.html>); 11-18.
7 Holcombe M. What are X-machines: Editorial to Special Issue. Formal Aspects of Computing, 2000; 12: 418-422.
8 Holcombe M, Ipate F. Correct Systems: Building a Business Process Solution. Springer Verlag (Series on Applied Computing): Berlin & London, 1998.
9 Aguado J, Cowling AJ. Foundations of the X-machine Theory for Testing. Department of Computer Science Research Report CS-02-06, University of Sheffield, 2002, at <http://www.dcs.shef.ac.uk/research/resmems/papers/CS0206.pdf>.
10 Petrenko A, Yevtushenko N, von Bochmann G, Dssouli R. Testing in context: framework and test derivation. Computer Communications, 1996; 19: 1236-1249.
11 Hierons RM, Harman M. Testing conformance of a deterministic implementation against a non-deterministic stream X-machine. Theoretical Computer Science, 2004; 323: 191-233.
12 Grochtmann M, Grimm K. Classification Trees for Partition Testing. Software Testing, Verification and Reliability, 1993; 3(2): 63-82.
13 Stocks PA, Carrington DA. Test Templates: A Specification-based Testing Framework. Proceedings of 15th International Conference on Software Engineering, Baltimore, MD, May 1993. IEEE Computer Society Press: Los Alamitos, CA, 1993; 405-414.
14 Stocks PA, Carrington DA. A Framework for Specification-based Testing. IEEE Transactions on Software Engineering, 1996; 22(11): 777-793.
15 Spivey JM. The Z Notation: A Reference Manual. Prentice Hall: New York & London, 1989.
16 Chen TY, Poon PL, Tse TH. A Choice Relation Framework for Supporting Category-Partition Test Case Generation. IEEE Transactions on Software Engineering, 2003; 29(7): 577-593.
17 Roper M. Software Testing. McGraw-Hill (International Software Quality Assurance Series): London, 1994.
18 Bernot G, Gaudel MC, Marre B. Software testing based on formal specifications: a theory and a tool. Software Engineering Journal, 1991; 6: 387-405.
19 Dick J, Faivre A. Automating the Generation and Sequencing of Test Cases from Model-Based Specifications. Proceedings of FME'93: Industrial-Strength Formal Methods, Odense, Denmark (Lecture Notes in Computer Science, vol. 670). Springer: Berlin, 1993; 268-284.
20 Offutt J, Liu S, Abdurazik A, Ammann P. Generating Test Data from State-Based Specifications. Software Testing, Verification and Reliability, 2003; 13: 25-53.
UKTest 2005
The need for new statistical software testing models
John May, Maxim Ponomarev, Silke Kuball, Julio Gallardo
Safety Systems Research Centre, University of Bristol
ABSTRACT
There is growing interest in Statistical Software Testing
(SST) as a software assurance technique. Whilst the
approach has major attractions, we show that there is a
need for new statistical models to infer failure
probabilities from SST. We construct a simple but
realistic case in which traditional models do not work.
KEY WORDS
Software testing, Software assurance, Failure probability,
Statistical test models, Statistical estimation
1 Statistical software testing
Interest in Statistical Software Testing (SST) is growing
because it provides a software assurance technique that is
both sound and practical. In common with formal proof
methods, it is one of the few techniques that offers an
objective, quantitative measure of software quality (SST
provides a failure probability estimate). In addition, SST
side-steps the famous statement “Program testing can be
used to show the presence of bugs, but never to show their absence!" [1]. The statement is true, but assumes that a
total absence of failure is the only acceptable goal. This
assumption is not made in other engineering disciplines,
where the goal is risk reduction and complete absence of
system failures is seen as unrealistic. In this context
Dijkstra’s statement loses its impact.
Statistical software testing (SST) is a dynamic testing
technique, designed in a very specific way that makes it
possible to assess the system’s probability of failure on
demand or failure per hour from the test results. No other
testing technique allows us to do this. The aim of SST is
to have and execute a test-set that facilitates the deduction
of a dependability figure (e.g. a failure probability) for the
software under test. Such a figure may be used as
evidence in a safety-case or as stand-alone assurance for
the software under test. Simply stated, the core conditions
for statistical testing are that statistical test-cases have to
be a) generated through a probabilistic simulation of the
application environment for which the dependability
statement needs to be derived, and b) they have to be
statistically independent. For more details on the
technique and its application see for example [2], [3], [4],
[5].
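Conditions (a) and (b) can be sketched in a few lines (our illustration, not from the cited works; the demand classes and profile probabilities are invented): test cases are drawn independently from a probabilistic model of the operational environment.

```python
import random

def draw_tests(profile, n, seed=0):
    """Draw n statistically independent test cases from an operational profile.

    profile: dict mapping a demand class to its probability of occurring
    in the operational environment (probabilities sum to 1)."""
    rng = random.Random(seed)
    classes = list(profile)
    weights = [profile[c] for c in classes]
    # Independent draws with replacement satisfy condition (b).
    return [rng.choices(classes, weights=weights)[0] for _ in range(n)]

# Invented operational profile for a protection system's demands:
profile = {"normal": 0.90, "alarm": 0.08, "trip": 0.02}
tests = draw_tests(profile, 1000)
assert len(tests) == 1000
assert set(tests) <= set(profile)
```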
2 The need for new statistical models of
software testing
The statistics underpinning SST invariably rely on the
Binomial model of failure [2]. This paper constructs an
example where the simple Binomial model does not
apply.
There has been research into other forms of SST model,
in particular where models are required to relate program
fail probability to the fail probabilities of program
components [8], [9], [10], [11], [12], [13], [14], [15], [16].
It is clear that the simple Binomial model does not solve
this problem, and neither does its partition variant. SST
models for component-based software remain an open
problem, and we do not address them in this paper. Our
example is one where, at first sight, the Binomial model
appears relevant, and is the only accepted model in the
SST literature.
3 The example
The example was inspired by software found inside smart
sensors. These are being used in safety critical systems,
and present a problem to safety analysts because the
presence of software makes it hard to ascertain the
reliability of such devices.
The example focuses on the simple problem of computing
a rolling average of 8 sensor readings. In each cycle, a
new reading is taken and used as the 8th reading in a
computation of an average over the most recent 8
readings, which is then output. Only 8 readings are kept in
memory, older readings are discarded. A sequence of 8
readings is regarded as a test, although tests could be
defined as sequences of any length. Thus 8 averages are
output on each test. There are no gaps between tests; the
8th reading of test n is followed directly by the 1st reading
of test n+1. Each test is chosen by a random search
mechanism analogous to placing all tests in a bag
according to an operational distribution and blind picking
a test from the bag (and replacing after each pick), as
described in detail by Miller et al. [2].
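The rolling-average behaviour, and the way a test's early outputs depend on the previous test's readings, can be sketched as follows (our illustration; the reading range is invented):

```python
import random
from collections import deque

def run_test(buffer, readings):
    """One 'test': feed 8 new readings through the 8-place rolling average.

    buffer is the sensor's persistent memory; because it survives from one
    test to the next, the first 7 outputs of a test depend on the previous
    test's readings."""
    outputs = []
    for r in readings:
        buffer.append(r)                      # deque(maxlen=8) drops the oldest
        outputs.append(sum(buffer) / len(buffer))
    return outputs

buffer = deque([0.0] * 8, maxlen=8)           # persistent state across tests
rng = random.Random(1)
test_n = [rng.uniform(0.0, 5.0) for _ in range(8)]
test_n1 = [rng.uniform(0.0, 5.0) for _ in range(8)]
run_test(buffer, test_n)
out = run_test(buffer, test_n1)
# Only the 8th output of test n+1 is free of test n's readings:
assert abs(out[7] - sum(test_n1) / 8) < 1e-9
```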
An important feature of this program is that the first 7
outputs in any test depend on inputs from the previous
test. Thus neighbouring tests are dependent. This creates a
situation where it is possible that the occurrence of an 'initial'
failure on a test is random (initial failures occur according to
the Binomial model), but the test immediately following
such an initial failure always fails. This occurs where, for
example, the average computation processes a specific
single reading (e.g. the reading ‘0’) incorrectly.
Therefore, in a case where this occurs, it is clear that the
Binomial model does not apply since in the failure
process it describes the failure probability on each test
must be equal, irrespective of previous test history.
3.1 A new statistical model of failure probability estimation
Developing the likelihood function for 0 failures under
the assumption that a failure induces at least one
follow-up failure.
We shall consider the following failure pattern. Consider
N software tests, where N is large, within which j failures
occur, where j is between 1 and N.
Whenever an initial failure occurs on a test-run Ti, then
this will lead to another failure at test-run Ti+1 due to
shared data. We call this type of failure in Ti+1 ‘follow-
up failure’. The distinction between the two types of
failure is important. Ti+1 can also produce a follow-up
failure in Ti+2, due to new data contained in Ti+1; i.e.
Ti+1 would contain both a follow-up failure and a new initial
failure. If Ti+1 did not contain an initial failure, and Ti+2
did not contain an initial failure, the failure sequence
would stop; i.e. Ti+2 would not fail.
Thus, a total of j failures can occur as a result of a
sequence of failure events, each of which is of the form:
an initial failure event on test-run Ti, which induces one
additional certain (follow-up) failure event on the next
test run Ti+1 (follow-up failure), due to shared data
between the two tests. If initial failures are rare, then the
failure pattern is likely to be failures occurring in pairs
separated by long sequences of failure free test-runs.
All we observe from the outside is the total number of
failures. However, if we are aware of the different
patterns that might lead to observing j failures, we can use
this information when constructing the probability of
observing j failures.
We aim at developing an expression for this probability.
Furthermore, the sum over such probabilities for j=1,…,N
will yield the probability Pr(at least 1 failure in N tests)
given the underlying failure process. The complement of
this probability would be the likelihood function of
observing 0 failures in N tests given the underlying failure
process. This formula depends on parameters capturing
the failure probability of the program on a single test, and
it would then be possible to go on to build estimators for
these parameters and hence the failure probability of the
program given a number of tests (although we do not do
this).
It seems plausible that the likelihood function under the
assumption of having follow-up failures would differ
from the traditional likelihood function for 0 failures in N
tests when we assume that all failures occur
independently from each other (see for example [2], [3],
[4]).
Ultimately, we are interested in comparing these two
likelihood functions to identify whether the use of a
traditional Binomial model in the case of having more
complex failure patterns would underestimate the
software failure probability.
For a test-run Ti, we call the probability of failure on Ti
caused by the newly observed data in Ti: θ. (This is less
than the overall probability of failure for Ti, since there is
also the possibility that Ti fails as a follow-up failure caused
by data read in Ti−1.) The probability of a follow-up failure
(in Ti+1) based on the same data already present in Ti and
carried over to Ti+1 is 1.
The j failures could happen in many different ways, and
the question is how this influences the likelihood
function for the event:
(0, N) – 0 failures in N tests.
We call this likelihood function Pnew(0, N | θ); it depends
on θ. We calculate it via its complement Pnew(≥ 1 failure,
N | θ), which is the sum over all probabilities
Pnew(j failures, N | θ), where j takes on any value in 1, …, N:

   Pnew(≥ 1 failure, N | θ) = Σ_{j=1}^{N} Pnew(j failures, N | θ)    (1)

It is now a counting problem for the probabilities of having
different numbers of failures inside the line (sequence) of
N tests.
The first term in eq. (1) is the probability of a single
failure inside the N tests. A single failure is possible only
at the very end of our test line, since an initial failure
anywhere earlier would induce a follow-up failure. So it can
occur in only one (the last) place, with probability θ, along
with the probability of independent 'successes', (1−θ)^(N−1),
in the remaining N−1 places of the test line. Therefore, we
have the following likelihood for exactly one failure:

   Pnew(1 failure, N | θ) = θ(1−θ)^(N−1)    (2)
The second term in eq. (1) is the probability of having two
failures inside the N tests. This can happen in two ways.
The first is an exact pair of an initial failure and its
follow-up failure: the initial failure can occur, with
probability θ, in any of the (N−1) places that leave room
for the follow-up failure immediately after it, together with
the probability of independent 'successes' in the remaining
places. This gives the first term in the sum in eq. (3). The
second term in eq. (3) gives the probability of the more
exotic case in which two initial failures happen in a pair,
the second overlapping the follow-up failure of the first,
and the follow-up failure of the second being cut off by the
end of the test line. This cut overlap can happen in one
place only – at the end boundary of the N test line – with
probability θ²(1−θ)^(N−2). So we obtain the probability of
having 2 failures in our test line as follows:

   Pnew(2 failures, N | θ) = (N−1)θ(1−θ)^(N−1) + θ²(1−θ)^(N−2)    (3)
By analogy, considering the probabilities of more failures
in our N test line, we obtain further terms for eq. (1),
Pnew(j failures, N | θ), for j = 3, 4:

   Pnew(3 failures, N | θ) = 2(N−2)θ²(1−θ)^(N−2) + θ³(1−θ)^(N−3)    (4)

   Pnew(4 failures, N | θ) = [(N−2)(N−3)/2]θ²(1−θ)^(N−2)
      + 3(N−3)θ³(1−θ)^(N−3) + θ⁴(1−θ)^(N−4)    (5)
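The combinatorics behind eqs. (2)-(5) can be checked by brute force (our sketch, not part of the paper): enumerate every pattern of initial failures for a small N, mark a test as failed if it or its predecessor suffered an initial failure, and compare the exact distribution of the number of failures with the closed forms.

```python
from itertools import product

def exact_pj(n, theta):
    """Exact P(j failures in n tests), enumerating all 2^n initial-failure
    patterns; test i fails if it has an initial failure or test i-1 did."""
    pj = [0.0] * (n + 1)
    for pattern in product([0, 1], repeat=n):
        prob = 1.0
        for bit in pattern:
            prob *= theta if bit else 1.0 - theta
        fails = sum(1 for i in range(n)
                    if pattern[i] or (i > 0 and pattern[i - 1]))
        pj[fails] += prob
    return pj

n, t = 10, 0.05
q = 1.0 - t
pj = exact_pj(n, t)
closed = [
    t * q**(n - 1),                                           # eq. (2)
    (n - 1) * t * q**(n - 1) + t**2 * q**(n - 2),             # eq. (3)
    2 * (n - 2) * t**2 * q**(n - 2) + t**3 * q**(n - 3),      # eq. (4)
    (n - 2) * (n - 3) / 2 * t**2 * q**(n - 2)                 # eq. (5)
        + 3 * (n - 3) * t**3 * q**(n - 3) + t**4 * q**(n - 4),
]
for exact, formula in zip(pj[1:5], closed):
    assert abs(exact - formula) < 1e-9
```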
For j larger than 4 these probabilities decrease
significantly, corresponding to higher orders of the small
parameters θN and (θN)². These specific conditions allow
us to approximate Pnew(≥ 1 failure, N | θ) via a truncated
(partial) sum of eq. (1) after j = 4. Furthermore, if we
exclude terms higher than the second order of magnitude
in θ and Nθ, we have the following approximation:
   Pnew(≥ 1 failure, N | θ) ≈ θ(1−θ)^(N−1) + (N−1)θ(1−θ)^(N−1)
      + θ²(1−θ)^(N−2) + 2(N−2)θ²(1−θ)^(N−2)
      + [(N−2)(N−3)/2]θ²(1−θ)^(N−2)    (6)

For a large number of tests (N ≫ 1) and small θ (so that
Nθ ≲ 1), we can estimate eq. (6) asymptotically:

   Pnew(≥ 1 failure, N | θ) ≈ (Nθ + (Nθ)²/2)e^(−Nθ)    (7)
As a next step, we want to compare the result above with
the result one would obtain if we applied the traditional
Binomial model to the test results, as follows:

   PSUM(≥ 1, M | λ) = Σ_{j=1}^{M} [M(M−1)(M−2)⋯(M−j+1)/j!] λ^j (1−λ)^(M−j)    (8)

This expresses the probability of observing at least one
random failure in M tests with a Binomial failure pattern.
Hereby the failure probability on demand is denoted as λ.
Using the Poisson approximation for eq. (8), which is
effective for large M and small λ (so that Mλ ≲ 1), we
can again argue that terms above a certain limit L can be
neglected in the sum in eq. (8). This yields the following
approximation of eq. (8):

   PSUM(≥ 1, M | λ) ≈ e^(−Mλ) Σ_{j=1}^{L} (Mλ)^j/j!    (9)

Excluding terms higher than the second order of
magnitude in Mλ, we have the truncated sum of eq. (9):

   PSUM(≥ 1 failure, M | λ) ≈ (Mλ + (Mλ)²/2)e^(−Mλ)    (10)
We can examine the differences between eqs. (7) and (10)
by considering the case λ ≈ 2θ and M = N (the same
number of tests). When the 'initial' fail rate θ is small
and N is large, the overall fail rate of the 'new' process is
approximately 2θ. In this case, we can compare the
probability of seeing 0 failures in N tests for a Binomial
failure process and a 'new' failure process with the same
overall failure rate. We merely note that the expressions
differ, so that there is scope for error if a program obeys
the 'new' failure process and the Binomial model is used
to model it. For example, if the traditional Binomial
model gives a lower value for the probability of zero
failures in N tests, it can only explain this in terms of a
lower overall failure rate estimate. The result, given the
observation of 0 failures during testing with N tests,
would be to underestimate the program failure rate.
It should be noted that here we have made these estimates
in the specific case of a large number of tests (M, N ≫ 1)
and small expected Nθ and Mλ (Nθ ≲ 1, Mλ ≲ 1).
Calculations for more general cases according to our
model (when the truncation of the sum in eq. (1) is not
possible) could give different deviations.
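To make the comparison concrete, this sketch (ours, with invented numerical values) evaluates the truncated approximations of eqs. (7) and (10) for M = N and λ = 2θ:

```python
import math

def p_new_ge1(n, theta):
    """Eq. (7): asymptotic P(>= 1 failure in n tests) for the 'new' process."""
    x = n * theta
    return (x + x * x / 2.0) * math.exp(-x)

def p_binom_ge1(m, lam):
    """Eq. (10): Poisson-approximated Binomial P(>= 1 failure in m tests)."""
    x = m * lam
    return (x + x * x / 2.0) * math.exp(-x)

n, theta = 10_000, 2e-5        # invented values: N*theta = 0.2
lam = 2.0 * theta              # same overall failure rate for the Binomial model
# Because the 'new' process clusters failures into pairs, it produces at
# least one failure less often than a Binomial process with the same
# overall rate, so the two likelihoods of observing 0 failures differ.
assert p_binom_ge1(n, lam) > p_new_ge1(n, theta)
```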
4 Discussion
SST provides test results with a useful meaning, and so
appears to be an important future software assurance
technique. However, this paper has shown that there is a
need to develop more sophisticated statistical models of
SST than are currently available. This will require input
from computer science, i.e. the models will involve software
analyses, as in the example, where the basic requirements
dictate a certain use of program memory and consequent
test dependence. It will not be possible to derive these
models using standard statistical techniques alone; there is
a need for collaborative efforts between statisticians and
computer scientists.
The example we have constructed is not a pathological
case. It is simple and realistic, and the implications are
therefore potentially important. It demonstrates that new
statistical models of SST are needed in scenarios where
traditional analysis only offers the Binomial model and its
partition variants.
Application of the Binomial model to this example has
the potential to be incorrect in a dangerous sense. In a
safety-critical context, an underestimate of the program
failure probability (i.e. over-confidence in the software
reliability) might sanction the deployment of a program
whose reliability does not meet acceptable targets.
REFERENCES
[1] Dijkstra E. Notes on Structured Programming, in
Dahl O, Dijkstra E, Hoare C (Eds.) Structured
Programming, (Academic Press 1972).
[2] Miller W.M., Morell L.J., Noonan R.E., Park S.K.,
Nicol D.M., Murrill B.W. and Voas J.M. Estimating the
probability of failure when testing reveals no failures,
IEEE Trans. on Software Engineering v18 n1 1992.
[3] Thayer R., Lipow M., and Nelson E. Software
Reliability (North Holland 1978).
[4] Ehrenberger W. Probabilistic techniques for software
verification in safety applications of computerised process
control in nuclear power plants, IAEA-TECDOC-581, Feb
1991.
[5] May J.H.R., Hughes G and Lunn A.D. Reliability
Estimation from Appropriate Testing of Plant Protection
Software, Software Engineering Journal, Nov. 1995.
[6] May J.H.R and Lunn A.D. New Statistics for
Demand-Based Software Testing, Information Processing Letters 53, 1995.
[7] Kuball, S., Hughes, G., May, J.H.R., Gallardo, J.,
John, A.: The effectiveness of Statistical Testing when applied to logic systems, Safety Science, Vol. 42, pp. 369-383, Elsevier, 2004.
[8] J.May, S.Kuball, G.Hughes, Test Statistics for System
Design Failure, International Journal on Reliability,
Quality and Safety Engineering (IJRQSE),Vol. 6, No.3,
pp 249--264 (1999).
[9] Kuball S., May J.H.R. & Hughes G. Building a
System Failure Rate Estimator by Identifying Component
Failure Rates, Procs. of the 10th Int. Symposium on Software Reliability Engineering (ISSRE'99), Boca
Raton, Florida, Nov 1-4, 1999; pp. 32-41 (IEEE Computer
Society 1999).
[10] Gokhale S.S. & Trivedi K.S. Structure-based
software reliability prediction, Procs. Advanced
Computing (ADCOMP 97), Chennai, India 1997.
[11] Smidts C. & Sova D. An architectural model for
software reliability quantification: sources of data,
Reliability Engineering & System Safety 64 (2) pp. 279-
290, 1999.
[12] Hamlet D., Mason D., Woit D. Theory of software
reliability based on components, Proceedings of the 23rd
International Conference on Software Engineering (ICSE
2001), pp12-19 May 2001, Toronto, Ontario, Canada.
IEEE Computer Society 2001, ISBN 0-7695-1050-7.
[13] Littlewood B. A reliability model for systems with
markov structure, Applied Statistics v24 n2 pp. 172-177,
1975.
[14] Shooman M. Structural models for software
reliability prediction, 2nd Int. Conf. Software Engineering, pp. 268-280, 1976.
[15] Krishnamurthy S. & Mathur A.P. On the estimation
of reliability of a software system using reliabilities of its
components, 8th International Symposium on Software
Reliability Engineering, pp. 146-155, Albuquerque, New
Mexico, 1997.
[16] Goseva-Popstojanova K, & Trivedi K.S.
Architecture-based approach to reliability assessment of
software systems, Performance Evaluation v45 n2-3,
2001.
[17] Musa J.D. Operational profiles in software reliability
engineering, IEEE Software 10(2) 1993.
[18] Littlewood B. and Wright D. Some conservative
stopping rules for the operational testing of safety-critical
software, IEEE Trans. on Fault Tolerant Computing
Symposium, pp 444-451, Pasadena, 1995.
[19] Butler R.W. & Finelli G.B. The infeasibility of
quantifying the reliability of life-critical real-time
software, IEEE Trans. on Software Engineering v19 n1
1993.
[20] Littlewood B. & Strigini L. Validation of Ultra-High
Dependability for Software-based Systems,
Communications of the ACM, 36 (11), pp.69-80, 1993.
A Theory of Regression Testing for Behaviourally
Compatible Object Types
Anthony J H Simons
Department of Computer Science, University of Sheffield,
Regent Court, 211 Portobello Street, Sheffield S1 4DP, United Kingdom
A.Simons@dcs.shef.ac.uk
http://www.dcs.shef.ac.uk/~ajhs/
Abstract. This paper presents a behavioural theory of object compatibility,
based on the refinement of object states. The theory predicts that only certain
models of state refinement yield compatible types, dictating the legitimate
design styles to be adopted in object statecharts. The theory also predicts that
standard practices in regression testing are inadequate. Functionally complete
test-sets that are applied as regression tests to subtype objects are usually
expected to cover the state-space of the original type, even if they do not cover
transitions and states introduced in the subtype. However, such regression
testing is proven to cover strictly less than this in the new context and so provides
much weaker guarantees than was previously expected. Instead, a retesting
model based on automatic test regeneration is required to guarantee equivalent
levels of correctness.

Keywords: Object-oriented, behavioural subtyping, state refinement, state-based
testing, regression testing, test generation, testing adequacy.
1 Introduction
Practical object-oriented unit testing is influenced considerably by the non-intrusive
testing philosophy of McGregor et al [1, 2]. In this approach, every object under test
(OUT) has a corresponding test-harness object (THO), which encapsulates all the test-
sets separately. This separation of concerns is the main motivation for McGregor’s
parallel design and test architecture in which an isomorphic inheritance graph of test
harness classes shadows the graph of production classes [2]. This embodies the
beguiling intuition that, since a child class is an extension of its parent, so the test-sets
for the child are extensions of the test-sets for the parent. The presumed advantage is
that test-sets can be inherited from the parent THO and applied, as a suite, to the child
OUT, in a kind of regression test. The purpose of such retesting is to ensure that the
child class still delivers all the functionality of the parent. The child THO will supply
additional test-sets to exercise methods introduced in the child OUT [1, 2].
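The parallel architecture can be sketched in a few lines (an illustration in Python's unittest rather than McGregor's notation; the Account classes and names are invented): the child test-harness class inherits the parent's test-set, so running it re-applies the parent's tests to the child object under test.

```python
import unittest

class Account:
    """Invented production class (the parent OUT)."""
    def __init__(self):
        self.balance = 0

    def deposit(self, amount):
        self.balance += amount

class SavingsAccount(Account):
    """Child OUT, extending the parent with one method."""
    def add_interest(self):
        self.balance = round(self.balance * 1.05, 2)

class AccountTHO(unittest.TestCase):
    OUT = Account                       # object under test for this harness

    def test_deposit(self):
        a = self.OUT()
        a.deposit(10)
        self.assertEqual(a.balance, 10)

class SavingsAccountTHO(AccountTHO):
    OUT = SavingsAccount                # inherited tests re-run against the child

    def test_add_interest(self):
        a = self.OUT()
        a.deposit(100)
        a.add_interest()
        self.assertEqual(a.balance, 105.0)
```

Loading SavingsAccountTHO yields two tests: the inherited test_deposit (the regression suite) plus the new test_add_interest.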
More recently, the JUnit tool has fostered a similar strategy for re-testing classes
that are subject to continuous modification and extension [3, 4]. JUnit allows
programmers to develop test scripts, which are converted into suites of methods behind
the scenes. These are executed on demand, to test objects and to re-test modified or
extended versions of those objects. One of the key benefits of JUnit is that it makes
the re-testing of refined objects semi-automatic, so it is widely used in the XP
community, in which the recycling of old test-sets has become a major part of the quality
assurance strategy. XP makes a strong claim that a programmer may incrementally
modify code in iterative cycles, so long as each modification passes all the original
unit tests: “Unit tests enable refactoring as well. After each small change the unit tests
can verify that a change in structure did not introduce a change in functionality” [5].
There are two sides to this claim. Firstly, if the modified code fails any tests, it is
clear that faults have been introduced, so there is some benefit in reusing old tests as
diagnostics. Secondly, there is the implicit assumption that modified code which
passes all the tests is still as secure as the original code. Tests are implicitly being
used as guarantees of a certain level of correctness.
In this paper, we prove that the second assumption is unsound and unsafe. In
section 2, a state-based theory of object refinement is presented, which encompasses
object extension with subtyping, the concrete satisfaction of abstract interfaces and
refactoring of implementations with unchanged behaviour. The theory predicts that
only certain models of state refinement yield compatible types, dictating the legitimate
design styles to be adopted in object statecharts. In sections 3 and 4, the theory also
predicts that standard practices in regression testing are inadequate. Functionally
complete test-sets that are applied as regression tests to subtype objects are usually
expected to cover the state-space of the original object, even if they do not cover
transitions and states introduced in the refinement. However, such regression testing is
proven to cover strictly less of the original object’s state space in the new context and
so provides much weaker guarantees than expected. After passing the recycled tests,
objects may yet contain introduced faults, which are undetected.
In place of the unsafe kinds of regression testing, we propose a new approach,
which is based on automatically generating the tests from the refined object’s state
machine. The theory predicts that simply adding new test suites to the existing
encapsulated suites does not achieve coverage. It is necessary to generate new test-sets
from scratch, in which methods are interleaved in different orders than before, to
obtain the same level of guarantee.
2 A Theory of Compatible Object Refinement
In classical automata theory, the notion of machine compatibility is judged by compar-
ing sets of traces, sequences of labels taken from transition paths computed through
the machines in question. Two machines are deemed equivalent if their trace-sets are
equivalent. A machine is behaviourally compatible with another if its trace-set in-
cludes all the traces of the other, that is, for every trace in the protocol of the reference
machine, such a trace also exists in the protocol of the modified machine. Object-
oriented design methods [6, 7] include object statecharts, which are influenced by
Harel’s statecharts [8] and SDL [9]. These notations are more complex than simple
finite state automata. Equivalence and compatibility between statecharts are judged
by considering syntactic relations between the transformed state spaces, from which
the trace behaviour follows.
2.1 McGregor’s Statechart Refinements
Fig. 1. McGregor’s structural statechart refinements. The basic state machine M0 is refined
by the compatible machines M1, M2, M3. M1 refines M0 by adding an extra transition. M2
refines M0 by introducing substates. M3 refines M0 by introducing concurrent states.
McGregor et al. proposed one of the early theories of object statechart refinement [10,
11]. In McGregor’s model, object states derive from the object’s stored variable val-
ues, as seen through observer methods. The machines have Mealy-semantics, with
quiescent states and actions on the transitions, representing the invoking of methods.
Figure 1 illustrates a contemporary reworking of McGregor’s three main structural
refinements, to allow comparison with trace models. These kinds of refinement were
deemed compatible because they observed the rules:
• all states in the base object are preserved in the refined object;
• all introduced states are wholly contained in existing states;
• all transitions in the base object are preserved in the refined object.
These structural refinements may be compared with trace models. The traces of
M0 are the set {<>, <a>, <a, b>}, where <a, b> is a sequence of method invocations
in the protocol of M0. M1 adds an extra method c to the interface of M0. This is a
derived method, analogous to function composition [12], that computes a more direct
route to the destination state S3. The traces of M1 are {<>, <a>, <a, b>, <c>} so it
is clear that this includes the traces of M0.
M2 adds two extra methods d and e, which examine state S2 at a finer granularity.
S2 is completely partitioned into substates S2.1 and S2.2. Since states are abstractions
over variable products [10], this is equivalent to dependence on disjoint subsets of
variable values. The usual statechart semantics of M2 is that entry to S2 implies entry
to the default initial substate S2.1; and the exit transition b from S2 preempts other
substate events. The statechart may therefore be flattened to a simple state machine,
with transition a leading directly from state S1 to S2.1 and an exit transition b from
both substates S2.1 and S2.2 to state S3. The traces of M2 are infinite (due to the
infinite alternation of d, e), but include {<>, <a>, <a, b>, <a, d>, <a, d, b>, <a, d,
e>, <a, d, e, b>, …} and so include all the traces of M0.
M3 introduces concurrent states S4, S5 and extra methods d and e which depend on
the new states. This represents the definition of new variables in the object subtype,
together with new methods whose behaviour is orthogonal to existing behaviour. The
usual statechart semantics is that both machines execute concurrently. Formally, this
is equivalent to a flat state machine containing the product of the states of the two
concurrent machines, which we denote as: {S1/4, S2/4, S3/4, S1/5, S2/5, S3/5}. The
traces of M3 are infinite, but include {<>, <a>, <d>, <a, d>, <d, a>, <a, b>, <d,
e>, <a, d, e>, <d, e, a>, <a, d, b>,…} and so include all the traces of M0.
2.2 Cook and Daniels’ Statechart Refinements
Fig. 2. Cook and Daniels’ additional statechart refinements. As before, M0 is the reference
machine. M4 refines M0 by adding a transition to a new state. M5 refines M0 by transition
splitting. M6 refines M0 by retargeting a transition.
In their Syntropy method [13], Cook and Daniels permit further extensions to state-
charts. Their full set of refinements includes (p207-8): adding new transitions, add-
ing new states, partitioning a state into substates, splitting transitions either at source
or destination substates, retargeting transitions onto destination substates and
composition with concurrent machines. Figure 2 illustrates the three main kinds of
transformational refinement not already covered above.
These refinements may also be compared with trace models. M4 refines M0 by
adding a new method c leading to a new state S4. This new state represents the addi-
tion of object variables, but unlike the case M3, the associated behaviour is not or-
thogonal, but tightly coupled to state S1. We sometimes refer to S4 as a new external
state, to distinguish this from a new substate, of the kind in M2. The traces of M4 are
the set {<>, <a>, <a, b>, <c>} and so include the traces of M0. However, this re-
finement breaks the second of McGregor’s rules about new states being introduced as
wholly contained substates.
M5 refines M0 by splitting the exit transition b, which no longer proceeds from the
S2 state boundary, but from the individual substates S2.1 and S2.2. This represents
the redefinition of the method b in the refinement, to depend disjointly on the intro-
duced substates. The overall response is equivalent to the original b. The traces of
M5 are {<>, <a>, <a, b>, <a, d>, <a, d, b>} and so include the traces of M0. By
the usual semantics of object statecharts, an exit transition from a superstate boundary
is equivalent to exit transitions from every substate. It is therefore inevitable that state
partitioning will split exit transitions.
Cook and Daniels [13] also allow the symmetrical case, splitting entry transitions to
target different destination substates. Mutually exclusive and exhaustive guards are
introduced to distinguish which of the substates should be reached by each partial
transition. However, fairness in partitioning incoming transitions to all substates is
later shown to be irrelevant in the retargeting rule. M6 refines M0 by retargeting the
transition a onto an arbitrary substate of S2. We choose to target S2.2 simply to illus-
trate how this is different from the default initial substate S2.1, even though the model
now cannot enter S2.1. The traces of M6 are {<>, <a>, <a, b>} and so are exactly
the traces of M0.
According to the classical theory of trace inclusion, all of the refinements M1-M6
may be substituted in place of M0 and will exhibit identical trace behaviour in re-
sponse to M0’s events. However, we argue below that this is an insufficient guarantee
of behavioural compatibility in object-oriented programming, where objects are ali-
ased by handles of multiple types. For this, a stronger theory is required.
2.3 Behaviourally Compatible Statechart Refinement
The fundamental philosophical problem to decide in the theory is how to treat the
introduction of new variables in subtype objects. Do these variables correspond to
missing pieces of the object’s earlier state, and so their concatenation in the subtype
gives rise to brand-new external states (like M4 above)? Do these variables already
exist in virtuo at the abstract level, in which case their concrete exposure in the sub-
type creates new substates (like M2 above)? Are these variables orthogonal and so
give rise to concurrent states in the subtype, equivalent to state products (like M3
above)? These different views may be in conflict.
The M3 refinement can be shown to be more general than M2. By flattening M2, a
statechart is obtained in which all a-transitions target the default initial substate, S2.1.
The product machine obtained by flattening the M3 refinement is more sophisticated,
since the a-transitions go from S1/4 and S1/5 to S2/4 and S2/5 respectively. M3 is
more sensitive to orthogonal behaviour than M2. It is reasonable to assume that we
must expect subtype objects to exhibit orthogonal behaviour at least some of the time,
so the M3 refinement is chosen over M2.
Both M3 and M2 assume that introduced state variables are exposed as substates of
existing states. This contrasts with M4, which assumes that entirely new states may be
introduced. In M4, the c-transition takes an object entirely out of the S1 state, whereas
in M3, the d-transition still leaves the object in its S1 state (going from S1/4 to S1/5).
This means that in all contexts and under all firings of d- and e-transitions, the M3
object can be abstracted to an M0 object, whereas this cannot be done for an M4 object.
Abstracting away from M4 in state S4 leaves an object in no recognizable M0 state,
and furthermore the object will deadlock in this state for any attempt to fire a-
transitions. In terms of the π-calculus process algebra [14], M3 strongly simulates
M0, whereas M4 only weakly simulates M0. This is discussed in section 5.4 below.
Fig. 3. The model of behaviourally-compatible refinement. L2 is the refined statechart result-
ing from the concurrent composition of L0 and L1, without respect to order. The states of L0
and L1 become intersecting regions in the refinement, which contains the product of states.
Since in general we must expect to support refinements like M3, in which full state
products are computed, the notion of hierarchical superstates encapsulating substates,
in the style of M2, becomes moot. It is more sensible to think of the old states as
being completely partitioned into new states. Figure 3 illustrates this in a more com-
pelling way. Here, L2 is the refinement resulting from the concurrent composition of
L0 and L1. However, it is irrelevant whether L0 is the basis and L1 is the supplement,
or vice-versa. Whereas in figure 1 we were tempted to view composition as ordered,
here we cannot. Accordingly, we cannot say that any particular superstate hierarchy is
more valid. So, we dispense with superstates and think instead of regions, intersecting
areas enclosing states that share some common transition behaviour. In figure 3, re-
gions are shown as dashed outlines. Four intersecting regions can be identified in L2
that correspond to the pairs of simple states in L0 and L1.
The process of refining a state machine then becomes a matter of turning states into
regions, whose enclosed states completely partition the original unrefined state. After
this, the main obligation is to ensure that all the transition behaviour of the base object
is preserved in the refined object. Partitioning a state will always split outgoing transi-
tions, for example, the a-transition from S1 is turned into a pair of partial a-transitions
from S1/3 and S1/4. Because we are assuming orthogonal behaviour, these also target
separate partitions of S2, the states S2/3 and S2/4. However, what if the behaviours of
c, d are not entirely independent of a, b? In this case, incoming transitions might be
retargeted onto different states.
Let a region correspond to a state that is being refined. Retargeting has no adverse
effect on the validity of the refinement, so long as the transition retargets a state within
the same region. Suppose the a-transitions were retargeted onto different states within
S2. No matter which destination states within region S2 we retarget, we should still be
able to abstract away to S2. In all cases, the partial a-transitions would be merged in a
single transition from S1 to S2. Retargeting may select an arbitrary state, or combina-
tion of states within the destination region. Suppose now that the c-transition from
S1/3 were retargeted outside the S1 region, to S2/4, within the different region S2.
The c message now interacts unfavourably with the alternating behaviour of a, b. This
means that a sequence <c, a> will deadlock from S1/3. While this modification is not
compatible with L0, it is compatible with L1. Retargeting must therefore be consid-
ered with respect to the compatibility relation desired between specific machines.
From these considerations, we obtain the statechart refinement rules for behav-
ioural compatibility. With respect to the statechart for a given object type, the state-
chart for a compatible object may introduce additional states, corresponding to the
exposure of extra variable products, and additional transitions, corresponding to the
introduction of new methods, so long as:
• Rule 1: new states are always introduced as complete partitions of existing
states, which become enclosing regions;
• Rule 2: new transitions for additional methods do not cross region bounda-
ries, but only connect states within regions;
• Rule 3: refined transitions crossing a region boundary completely partition
the old entry/exit transitions of the original unrefined state;
• Rule 4: refined transitions within a region completely partition the old self-
transitions of the original unrefined state.
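Rules 1 and 2 lend themselves to mechanical checking. The following is a minimal sketch, assuming a mapping from refined states to their enclosing regions as the input representation; all names are our own:

```python
def check_rule_1(refined_states, region_of, base_states):
    """Rule 1: the refined states completely partition the base states.
    region_of maps each refined state to the region (base state) it refines."""
    total = all(s in region_of for s in refined_states)   # every state placed
    onto = set(region_of.values()) == set(base_states)    # every region non-empty
    return total and onto

def check_rule_2(new_transitions, region_of):
    """Rule 2: transitions for newly introduced methods stay inside one region.
    new_transitions is a list of (source, method, target) triples."""
    return all(region_of[src] == region_of[dst]
               for (src, _, dst) in new_transitions)

# The refinement L2 of figure 3: four states in two intersecting regions
region_of = {"S1/3": "S1", "S1/4": "S1", "S2/3": "S2", "S2/4": "S2"}
```

Retargeting the c-transition from S1/3 to S2/4, the incompatible case discussed above, is exactly what `check_rule_2` rejects.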
Rule 1 is the fundamental rule, which preserves the hierarchy of state abstractions.
It confirms McGregor’s second rule of statechart refinement [11]. It disallows the
introduction of new external states, so rules out Cook and Daniels’ refinement by
extension (such as M4) [13]. Rule 2 defines limits on state retargeting for new meth-
ods, with respect to the chosen compatibility relationship. In section 5.4 we show how
these two rules relate to strong simulation. Rule 3 captures all of Cook and Daniels’
rules about transition splitting and retargeting within a superstate (a region, in our
approach). The important generalisation is the complete partitioning of transitions,
which ensures that the set of new transitions behaves exactly like the old single transi-
tion. Rule 4 is a similar rule to ensure that self-transitions are preserved explicitly in
the refinement. These two rules essentially describe the faithful replication of transi-
tions for states that have been partitioned. They ensure that the refined machine is a
non-minimal equivalent to the original machine.
Together, the four rules enforce a strict behavioural consistency between the re-
fined and original state machines, analogous to strong simulation (see 5.4). This is
stronger than some other trace-based models of consistency, which only look at model
executions in the absence of a theory of state and state generalisation. The invoca-
tional consistency of Ebert and Engels [15] requires the subtype to contain all the
traces of the supertype. This is equivalent to Cook, Daniels and McGregor’s position,
described above [11, 13]. Ebert and Engels’ observational consistency is weaker still,
since it merely requires all the supertype’s traces to be derivable by censoring the
subtype’s traces to remove methods that were introduced in the subtype [15].
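Ebert and Engels' censoring of subtype traces can be expressed in a one-line sketch (the function name is our own):

```python
def censor(trace, introduced):
    """Drop methods introduced in the subtype from one of its traces; under
    observational consistency, the result must be a trace of the supertype."""
    return tuple(m for m in trace if m not in introduced)
```

For example, censoring the M3 trace <a, d, e, b> of the methods {d, e} introduced in the refinement recovers the M0 trace <a, b>.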
3 The Generation of Complete Unit Test-Sets
In state-based testing approaches [16, 17, 11, 18], it is possible to develop a notion of
complete test coverage, based on the exhaustive exploration of the object’s states and
transitions. However, the nature of the guarantee obtained after testing varies from
approach to approach. The following is an adaptation of the X-Machine testing
method [18, 19], which offers stronger guarantees than other methods, in that its test-
ing assumptions are clear and it tests negatively for the absence of all undesired be-
haviour as well as positively for the presence of all desired behaviour.
3.1 State-Based Specification
Fig. 4. Abstract state machine for a Stack interface. The two states (Empty, Normal) are de-
fined on a partition of the range of the size access method. No self-transitions for access meth-
ods are notated, by convention, but all other transitions must be shown.
We assume that the object under test (OUT) exists in a series of states, which are
chosen by the designer to reflect modes in which its methods react differently to the
same message stimuli (formally, the notion of state derives from state-contingent re-
sponse and has nothing to do with whether the object has quiescent periods). The
OUT is assumed to have a unique transition to its initial state and may or may not
have a final state, a mode in which it is no longer useable, for example, an error state
(representing a corrupted representation – see figure 4), or a terminated state (repre-
senting the end of the object’s life history).
The states of an object derive ultimately from the product of its attribute variables,
but can be characterised more abstractly as the product of the ranges of its access
methods. Formally, we assume that states are a complete partition of this product.
For completeness, a finite state model must define a transition for each method in
every state. However, suitable conventions may be adopted to simplify the drawing of
the state transition diagram, in particular, to establish the meaning of missing transi-
tions. Figure 4 shows a simplified state machine for an abstract Stack interface, in
which the omitted transitions for all the access methods size, empty and top may be
inferred implicitly as self-transitions in every state.
It must be possible to determine the desired behaviour of the object, in every state,
and for each method. If more than one transition with the same method label exits
from a given state, the machine is nondeterministic. Qualifying the indistinguishable
transitions with mutually exclusive, exhaustive guards will restore determinism (in
figure 4, ambiguous pop transitions exiting the Normal state are guarded). Certain
design-for-test conditions may apply, to ensure that the OUT can be driven determin-
istically through all of its states and transitions [18]. For example, in order to know
when the final pop transition from Normal to Empty is reached, the accessor size is
required as one of Stack’s methods.
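A minimal Stack satisfying these design-for-test conditions might look as follows. This is an illustrative sketch in Python, not the paper's implementation; the predicate names are our own choice:

```python
class Stack:
    """Minimal stack used as the object under test (OUT)."""
    def __init__(self):
        self._items = []
    def push(self, e):
        self._items.append(e)
    def pop(self):
        if not self._items:
            raise IndexError("pop() called in the Empty state")
        return self._items.pop()
    def top(self):
        return self._items[-1]
    def empty(self):
        return not self._items
    def size(self):
        return len(self._items)

# External state predicates for the test harness: exhaustive and disjoint,
# defined on the range of the size() access method as in figure 4
def is_empty(out):
    return out.size() == 0

def is_normal(out):
    return out.size() > 0
```

A test harness drives the OUT through a method sequence and then calls each predicate, expecting exactly one to return true.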
3.2 State-Based Test Generation
The basic idea, when testing from a state-based specification, is to drive the OUT into
all of its states and then attempt every possible transition (both expected and un-
wanted) from each state, checking afterwards which destination states were reached.
The OUT should exhibit indistinguishable behaviour from the specification, to pass
the tests. It is assumed that the specification is a minimal state machine (with no du-
plicate, or redundant states), but the tested implementation may be non-minimal, with
more than the expected states. These notions are formalised below.
The alphabet is the set of methods m ∈ M that can be called on the interface of the
OUT (including all inherited methods). The OUT responds to all m ∈ M, and to no
other methods (which are ruled out by the syntactic checking phase of the compiler).
This puts a useful upper bound on the scope of negative testing.
The OUT has a number of control states s ∈ S, which partition its observable
memory states. A control state is defined as an equivalence class on the product of the
ranges of the OUT’s access methods. If a subset A ⊆ M of access methods exists,
then each observable state of the OUT is a tuple of length |A|. Formally, tuples fall
into equivalence classes under exhaustive, disjoint predicates p : Tuple → Boolean,
where each predicate p corresponds to a unique state s ∈ S. In practice, these predi-
cates are implemented as external functions p : Object → Boolean invoked by the test
harness upon the OUT : Object, which detect whether the OUT is in the given state
using some combination of its public access methods.
Sequences of methods, denoted <m1, m2, …>, m ∈ M, may be constructed. Lan-
guages M0, M1, M2, … are sets of sequences of specific lengths; that is, M0 is the set
of zero-length sequences: {<>} and M1 is the set of all unit-length sequences: {<m1>,
<m2>, …}, etc. The infinite language M* is the union M0 ∪ M1 ∪ M2 ∪ …, containing all arbitrary-length sequences. A predicate language P = {<p1>, <p2>, …} is a set
of predicate calls, testing exhaustively for each state s ∈ S.
In common with other state-based testing approaches, the state cover is determined
as the set C ⊆ M* consisting of the shortest sequences that will drive the OUT into all
of its states. C is chosen by inspection, or by automatic exploration of the model. An
initial test-set T0 aims to reach and then verify every state. Verification is accom-
plished by concatenating every sequence in the state cover C with every predicate in
the predicate language P, denoted: C ⊗ P, where ⊗ is the concatenated product which
appends every sequence in P to every sequence in C.
T0 = C ⊗ P (1)
A more sophisticated test-set T1 aims to reach every state and also exercise every
single method in every state. This is constructed from the transition cover, a set of
sequences K1 = C ∪ C ⊗ M1, which includes the state cover C and the concatenated
product term C ⊗ M1, denoting the attempted firing of every single transition from
every state. The states reached by the transition cover are validated again using all
singleton predicate sequences <p> ∈ P.
T1 = (C ∪ C ⊗ M1) ⊗ P (2)
An even more sophisticated test-set T2 aims to reach every state, fire every single
transition and also fire every possible pair of transitions from each state. This is con-
structed from the augmented set of sequences K2 = C ∪ C ⊗ M1 ∪ C ⊗ M2 and the
reached states are again verified using the predicate. The product term C ⊗ M2 de-
notes the attempted firing of all pairs of transitions from every state.
T2 = (C ∪ C ⊗ M1 ∪ C ⊗ M2) ⊗ P (3)
In a similar fashion, further test-sets are constructed from the state cover C and
low-order languages Mk ⊆ M*. The reached states are always verified using <p> ∈ P,
for which exactly one should return true, and all the others false. The desired Boolean
outcome is determined from the model. Each test-set subsumes the smaller test-sets of
lesser sophistication in the series. In general, the series can be factorised and ex-
pressed for test-sets of arbitrary sophistication as:
Tk = C ⊗ (M0 ∪ M1 ∪ M2 ∪ … ∪ Mk) ⊗ P (4)
For the Stack shown in figure 4, the alphabet M = {push, pop, top, empty, size}.
Note that new is not technically in the method-interface of Stack. It represents the
default initial transition, executed when an object is first constructed, which in the
formula is represented by the empty method sequence <>. The smallest state cover C
= {<>, <push>}, since the “final state” is really an exception raised by pop from the
Empty state. Other sequences are calculated as above. Test-sets generated from this
model may be used to test any Stack implementation that has identical states and tran-
sitions, for example, a LinkedStack, which uses a linked list to store its elements.
3.3 Test Completeness and Guarantees
The test-sets produced by this algorithm have important completeness properties. For
each value of k, specific guarantees are obtained about the implementation, once test-
ing is over. The set T0 guarantees that the implementation has at least all the states in
the specification. The set T1 guarantees this, and that a minimal implementation pro-
vides exactly the desired state-transition behaviour. The remaining test-sets Tk pro-
vide the same guarantees for non-minimal implementations, under weakening assump-
tions about the level of duplication in the states and transitions.
A redundant implementation is one where a programmer has inadvertently intro-
duced extra “ghost” states, which may or may not be faithful copies of states desired
in the specification. Test sequences may lead into these “ghost” states, if they exist,
and the OUT may then behave in subtle unexpected ways, exhibiting extra, or missing
transitions, or reaching unexpected destination states. Each test-set Tk provides com-
plete confidence for systems in which chains of duplicated states do not exceed length
k-1. For small values of k, such as k=3, it is possible to have a very high level of
confidence in the correct state-transition behaviour of even quite perversely-structured
implementations.
Both positive and negative testing are achieved, for example, it is confirmed that
access methods do not inadvertently modify object states. Testing avoids any uni-
formity assumption [20], since no conformity to type need be assumed in order for the
OUT to be tested. Likewise, testing avoids any regularity assumption that cycles in
the specification necessarily correspond to implementation cycles. When the OUT
“behaves correctly” with respect to the specification, this means that it has all the
same states and transitions, or, if it has extra, redundant states and transitions, then
these are semantically identical duplicates of the intended states in the specification.
Testing demonstrates full conformity up to the level of abstraction described by the
control states.
The state-based testing approach described here is an adaptation of the X-Machine
approach for complete functional testing [18, 19], replacing input/output pairs with
method invocations. The need for “witness values” in the output is eliminated by the
guaranteed binding of messages to the intended methods in the compiler. The test
generation algorithm adapts Chow’s W-method for testing finite state automata [16].
In Chow’s method, states are not directly inspectable. Instead, reached states are
verified by attempting to drive the implementation through further diagnostic se-
quences chosen from a characterisation set W ⊆ M*, each state uniquely identified by
a particular combination of diagnostic outcomes. Here, we know that the OUT’s state
is inspectable, since it must be characterised by some partition of the ranges of its
access methods.
4 Object Refinement and Test Coverage
The notion of behaviourally-compatible refinement introduced in section 2 applies
equally to the realisation of interfaces (in the UML sense that a concrete class imple-
ments an abstract interface [7]) and also to the specialisation of object subtypes. In
both cases, the notion of refinement is explained in terms of deriving a more elaborate
state transition diagram by subdividing states and adding transitions to a basic dia-
gram. In this paper, we also consider that the need to re-implement an object, in the
sense of XP’s refactoring [5, 21], constitutes a refinement in the same sense. This is
because modification typically replaces simple solutions with more complex ones, in
response to new requirements. At the unit-testing level, individual OUTs tend to
become more complex. (It is also possible, when refactoring an entire subsystem
[21], for certain objects to become simplified, at the expense of introducing new ob-
jects, or shifting the complexity onto other objects, or by deleting unnecessary code –
we do not consider this here).
4.1 Test Coverage of a Modified or Refactored Object
Figure 5 illustrates a refined object statechart for a DynamicStack, an array-based
implementation of a Stack. We may either consider this to be a concrete realisation of
the Stack interface of figure 4, or else a change in implementation policy, a refactoring
of an old linked Stack. Firstly, we wish to confirm that the DynamicStack specifica-
tion conforms to the abstract Stack specification in figure 4.
Fig. 5. Concrete machine for a DynamicStack, which realizes the Stack interface. The two
states (Loaded, Full) partition the old Normal state in fig. 4, resulting in the replication of its
transitions. The behaviour of push in the Full state must be tested.
The main difference between the DynamicStack and the earlier Stack machine is
that the old Normal state, now only shown as a dashed region, has been partitioned
into the states {Loaded, Full}, in order to model the dynamic resizing of the Dynam-
icStack (push will behave differently in the Full state, triggering a memory realloca-
tion). This is a complete partition (no other substate of Normal exists), so rule 1 is
satisfied. No new methods are introduced, so rule 2 is not applicable. The Normal
state’s old entry and exit transitions now cross over the region boundary, reaching the
exposed Loaded state. The new pair of push, pop transitions exactly replaces the old
pair (without splitting), so rule 3 is satisfied. The Normal state’s old self-transitions
are now replicated inside the region, as a consequence of splitting the state. The for-
mer push transition is first split in two (one replication for each new state) and then
the transition from Loaded is split again, with exclusive guards on size. Similarly, the
former pop transition is replicated for each new state and its former guard: size() > 1 is
preserved in both states; however, the guard need not be notated in the Full state, as
there is no other conflicting pop transition. So, rule 4 is also satisfied. The refined
DynamicStack implementation (in figure 5) is therefore compatible with the original
Stack interface’s behaviour (in figure 4).
Next, we consider the issue of test coverage. Increasing the state-space has impor-
tant implications for test guarantees. Consider the sufficiency of the T2 test-set, gen-
erated from the abstract Stack specification in figure 4. This robustly guarantees the
correct behaviour of a simple LinkedStack implementation with S = {Empty, Normal},
even in the presence of “ghost” states. T2 will include one sequence <push, push,
push, isNormal>, which robustly exercises <push, push> from the Normal state and
will even detect a “ghost” copy of the Normal state. A strong guarantee of correctness
after testing may therefore be given for a LinkedStack implementation.
In classical regression testing, saved test-sets are reapplied to modified or extended
objects in the expectation that passing all the saved tests will guarantee the same level
of correctness. If the Stack’s T2 test-set were reused to test a DynamicStack con-
structed with n ≥ 3, so having all the states {Empty, Loaded, Full} and all the transi-
tions shown in figure 5, the resizing push transition would never be reached, since this
requires a sequence of four push methods. To the tester, it would appear that the
DynamicStack had passed all the saved T2 tests, even if a fault existed in the resizing
push transition. This fault would be undetected by the saved test-set.
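The gap can be demonstrated concretely. Below, a DynamicStack carries a fault planted in its resize step; every sequence of at most three pushes (the depth reached by the saved T2 set) behaves perfectly, and the fault surfaces only on the fourth push. The implementation is an illustrative sketch, not the paper's:

```python
class DynamicStack:
    """Array-based stack with initial capacity n; resizes on overflow.
    A fault is deliberately planted in the resize step."""
    def __init__(self, n=3):
        self._items = [None] * n
        self._size = 0
    def size(self):
        return self._size
    def push(self, e):
        if self._size == len(self._items):
            self._resize()
        self._items[self._size] = e
        self._size += 1
    def pop(self):
        if self._size == 0:
            raise IndexError("pop() called in the Empty state")
        self._size -= 1
        return self._items[self._size]
    def _resize(self):
        # Planted fault: silently drops the last stored element
        self._items = self._items[:-1] + [None] * (len(self._items) + 1)

def push_pop_trace(k, n=3):
    """Push k distinct values onto a fresh DynamicStack, then pop them all back."""
    out = DynamicStack(n)
    for i in range(k):
        out.push(i)
    return [out.pop() for _ in range(k)]
```

Three pushes leave the stack intact; the fourth triggers the faulty resize, so the saved T2 test-set, whose longest sequences contain only three method calls before the state predicate, can never expose the fault.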
4.2 Test Coverage of a Subclassed or Extended Object
In more complex examples of subclassing, the refinement introduces new behaviour,
which partitions all existing states. Figure 6 illustrates the development of an abstract
class hierarchy leading to concepts like the loan items in a library. The upper state
machine describes the abstract behaviour of a Loanable entity, which oscillates be-
tween its Available and OnLoan states. The lower state machine describes a LoanItem
entity that extends the Loanable entity. This is a product machine with four states,
resulting from the concurrent composition of the Loanable machine with a supplemen-
tary Reservable machine (not illustrated), which, we may infer, oscillates between
Unreserved and Reserved states. The resulting four states are named {OnShelf, PutA-
side, NormalLoan, Recalled}. The behaviours of loaning and reserving are dependent
on each other in interesting ways.
First, we check the refinement for compatibility. The four states completely parti-
tion the two states of Loanable, so rule 1 is satisfied. The new methods {reserve,
cancel} introduced in LoanItem stay within the prescribed region boundaries, so rule
2 is satisfied. Looking now at the splitting of transitions required by rule 3, while
UK Software Testing Research III
return has been split by the partitioning of OnLoan into two states {NormalLoan,
Recalled}, the borrow transition is more interesting. One partial transition from OnShelf allows the loan to go ahead. The other partial transition from the PutAside state
is guarded, and only succeeds if the LoanItem is borrowed by the same person who
reserved it previously. While such behaviour is reasonable, it makes LoanItem in-
compatible with Loanable. The refinement of the borrow transition breaks rule 3,
since the partials are not a complete partition of the original. From Loanable’s per-
spective, borrow always succeeds from the Available state, whereas it sometimes fails
for a LoanItem. This illustrates the practical effect of breaking refinement rules.
However, compatibility may be restored by adding a borrow transition from the Available state to itself, in the Loanable abstract class, indicating the anticipated null operation. The abstract state machine is then nondeterministic, since the choice of the successful or failing borrow transition cannot yet be decided.
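Rule 3 — that the partial transitions introduced by a refinement must completely partition the original transition — can be checked on this example with a small sketch. The guard encoding and the domain of borrowers are ours, chosen for illustration:

```python
# In Loanable, borrow is enabled for every borrower b from Available.
# In LoanItem, the Available region splits into OnShelf and PutAside,
# and borrow from PutAside carries the guard [a = b] (a = reserver).
def loanitem_borrow_enabled(state, reserver, borrower):
    if state == "OnShelf":
        return True
    if state == "PutAside":
        return borrower == reserver   # guarded partial transition
    return False

borrowers = ["alice", "bob"]          # hypothetical domain of borrowers

# Rule 3 holds only if, in every sub-state of Available, the partial
# transitions jointly accept everything the original borrow accepted:
complete = all(
    loanitem_borrow_enabled(state, "alice", b)
    for state in ("OnShelf", "PutAside")
    for b in borrowers)

assert complete is False   # bob cannot borrow from PutAside: rule 3 broken
```

The failing check corresponds exactly to the incompatibility described above: from Loanable's perspective borrow always succeeds, whereas the guarded LoanItem partial sometimes refuses.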
[Figure 6 diagram: the Loanable machine has states Available (loaned() = false) and OnLoan (loaned() = true), connected by borrow(b) and return(). The LoanItem machine has states OnShelf (reserved() = false), PutAside (reserved() = true), NormalLoan (reserved() = false) and Recalled (reserved() = true), connected by reserve(a), cancel(), return(), and the borrow transitions borrow(b), borrow(b)[a ≠ b] and borrow(b)[a = b].]
Fig. 6. The upper state machine captures the behaviour of a Loanable entity, with the methods {borrow, return}. The lower LoanItem machine extends this with reservations, combining the behaviour of {borrow, return, reserve, cancel}. The refined machine is not yet wholly compatible with the base machine, but this can be addressed.
Next, we consider the issue of test coverage. Assuming that a T2 test-set is gener-
ated from the Loanable specification in figure 6, this will robustly confirm that bor-
row and return succeed and fail correctly (for a Loanable instance), even in the pres-
ence of “ghost” versions of the OnLoan and Available states. However, when the
same tests are reapplied to the extended LoanItem, they will only cover half of the
partitioned states. The saved T2 test-set includes the sequences: {<isAvailable>,
<borrow, isOnLoan>, <return, exception>, <borrow, return, isAvailable>, …} and no
UKTest 2005
sequence will contain reserve or cancel, which are first introduced in the subclass’s
protocol. The test-set will therefore oscillate between the states {OnShelf, NormalLoan} and will not reach the states {PutAside, Recalled}. Because of this, only half
of the borrow and return transitions will be exercised in the refinement, compared to
all of them in the original. Partitioning states always results in splitting transitions.
Consider now that every pair of methods like {borrow, return} and {reserve, cancel}
introduces further partitions in every existing state. The proportion of the original
transitions still covered falls off as a geometrically decreasing fraction in each succes-
sive refinement. Contrary to popular expectations that recycled regression tests con-
firm the base object’s behaviour in the refined object, regression tests actually cover
significantly less of the base object’s state space in each successive refinement.
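Under the assumption stated above — each refinement step introduces a method pair that splits every existing state in two — the fraction of the base object's transitions reached by a recycled test-set can be written down directly. This is a sketch of the trend only; real refinements need not partition so uniformly.

```python
def recycled_coverage(refinement_steps: int) -> float:
    """Fraction of the original transitions a recycled test-set still
    exercises after the given number of state-partitioning refinements,
    assuming each step splits every existing state in two."""
    return 0.5 ** refinement_steps

# Loanable -> LoanItem is one such step: half the borrow/return
# transitions are exercised, as in the example above.
assert recycled_coverage(0) == 1.0
assert recycled_coverage(1) == 0.5
assert recycled_coverage(3) == 0.125
```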
5. Conclusions: Regression versus Regeneration
The weakness in conventional regression testing comes from recycling saved test-sets
as a whole, rather than reconstructing test-sequences from scratch. This culture goes
back to the parallel design and test architecture [1, 2] (see section 1), in which test
suites are saved as methods of the THO and are inherited as a whole. The prospect of
reusing whole test suites is so beguiling that it is hard to refuse, especially after the
effort invested in developing the tests in the first place. Likewise, in JUnit [3, 4], test
scripts are saved and recycled as a whole, in the expectation that this provides a guar-
antee against the effects of entropy in the modified code.
5.1 Overestimation of Regression Test Coverage
Programmers do not expect regression tests to exercise the new features introduced in
the refinement. For this, they develop additional tests, sometimes exercising the new
features in combination with old features. However, they do expect the regression
tests to exercise all of the original features completely. This corresponds to an impov-
erished view of refinement, as illustrated by the model M4 (see section 2.2 above).
The state space of a valid refinement is actually much greater, more like the model L2
(see section 2.3 above).
Unfortunately, recycled test-sets always exercise significantly less of the refined
object than the original. As the state-space of the modified or extended object in-
creases, the guarantee offered by retesting is progressively weakened. This under-
mines the validity of popular regression testing approaches, such as parallel design-
and-test, test set inheritance and reuse of saved test scripts in JUnit. To achieve the
same level of coverage, it is vital to test all the interleavings of new methods with the
inherited methods, so exploring the state-transition diagram completely. This simply
cannot be done reliably by human intuition and manual test-script creation.
5.2 Completeness of Regenerated Test Sets
In the proposed approach, the test-sets for refined object types, such as the DynamicStack or the LoanItem introduced in section 4, should be regenerated entirely from
scratch, using the algorithm from section 3. With even very simple object state ma-
chine specifications, this process can be automated, generating test-sets to the desired
T1, T2, T3… confidence levels.
The regenerated tests are not regression tests in the normal sense, but all-new tests
in which the state-space of the refined OUT is fully explored. Regenerating the test-
set works equally well, whether or not the OUT is a behaviourally compatible refine-
ment of some original object, since the test-set is derived directly from the refined
specification, not the original one. For this reason, the proposed re-testing approach is
robust under all kinds of software evolution, whether this is by subclassing, by refac-
toring or by simple textual editing of the OUT, and works independently of behav-
ioural compatibility. However, regenerated tests do satisfy the expectations of regres-
sion testing, in that they test up to the same confidence-levels as the original tests.
In common with all test-sets generated from object state machines, regenerated
tests provide specific guarantees for specific amounts of testing. Because the test-sets
are generated systematically, the tester may choose whether to test using T1, T2, T3…
etc. up to the desired level of k in the algorithm. The significance of this is that the
same levels of guarantee may be provided for both the original and retested objects,
something that is not possible with conventional regression testing using recycled test-
sets, for which the guarantees are progressively weakened in each new context.
5.3 Testing to a Repeatable Level of Quality
This paper turns a number of regression-testing concepts on their head. Conventional
regression testing assumes that a refined object is compatible with its unrefined pre-
cursor if it passes the same tests [2, 3, 5, 21]. This was shown to be false in section 4
above. Compatibility cannot be assured directly through re-testing, but it can be
proved indirectly by verification in a formal model. Figure 7 shows the different phi-
losophies.
Compatibility is redefined as a verifiable refinement relationship between two ob-
ject specifications. Each OUT may only be proven to conform to its own specifica-
tion, by a specific test-set generated from that specification (the B-test and R-test sets
in figure 7). The refined OUT is then only provably compatible with the basic speci-
fication by virtue of the transitive composition of the R-test conforms and refines
relationships.
The strength of the guarantee obtained in conventional regression testing is badly
overestimated. Recycled test-sets exercise significantly less of the refined object than
the original, such that re-tested objects may be considerably less secure, for the same
testing effort. By comparison, in the test regeneration approach, it is possible to pro-
vide specific guarantees for levels of confidence in the OUT. After the OUT has been
refined, the same levels of confidence may be retained after re-testing using fully
regenerated test-sets. This notion of guaranteed, repeatable quality is a new and
important concept in object-oriented testing.
[Figure 7 diagram: under Regression, OUTBasic and OUTRefined each B-test conform to the single OSpecBasic. Under Regeneration, OUTBasic B-test conforms to OSpecBasic, OUTRefined R-test conforms to OSpecRefined, OSpecRefined refines OSpecBasic, and OUTRefined therefore transitively conforms to OSpecBasic.]
Fig. 7. The new philosophy for testing. The Refined OUT does not conform to the Basic
OSpec because it B-test conforms to that specification, but rather because it R-test conforms to
the Refined OSpec, which is a provably correct refinement of the Basic OSpec.
5.4 Links with Simulation in Process Calculi
As demonstrated in section 2.2, Cook and Daniels’ [13] examples of statechart re-
finement are all equivalent to the classical refinement of automata, which judges com-
patibility by trace inclusion [15]. This works so long as the subtype object aliased
through the supertype handle is only manipulated through the protocol of that super-
type. In more realistic execution contexts, objects may be aliased simultaneously by
handles of many types. This is in fact quite common in object-oriented design, where
generic algorithms are factored into parts introduced at different levels in the inheri-
tance hierarchy (see the Template Method design pattern [24, p325]). In this context,
an object may be manipulated by more than one protocol, and messages from the
different protocols may be interleaved, which may cause deadlocks [22, 23].
We showed in section 2.2 above how an M4 object could be manipulated through
the protocol of M0, until it receives <c> through another M4 protocol, at which point
the M0 protocol deadlocks. M4 is not strongly compatible with M0, although it
clearly includes the traces of M0. We therefore draw an analogy with Milner’s π-
calculus [14], which contrasts trace inclusion with the stronger simulation relation-
ship. From the viewpoint of the M0 protocol, unseen events that affect the aliased
object through the simultaneous M4 protocol are “invisible actions”, rather like τ-
actions in π-calculus. Weak simulation is where one process behaves like another up
to null assumptions about invisible τ-actions (i.e. that they do not affect behaviour).
The contrasting strong simulation is where one process behaves like another in all
contexts, irrespective of the τ-actions’ unseen behaviour. Our behavioural compatibil-
ity is like strong simulation, because the protocol of the supertype is preserved, no
matter what invisible actions may be interleaved by the protocols of subtype handles.
This is achieved by making sure, in rules 1 and 2, that invisible actions cannot force a
refined object into a state that is unrecognised by its supertype’s protocol. The rules
are therefore normative, since simulation follows from this.
5.5 Acknowledgement
This research was undertaken as part of the MOTIVE project, supported by UK
EPSRC GR/M56777.
References
1. McGregor, J. D. and Korson, T.: Integrating Object-Oriented Testing and Development
Processes. Communications of the ACM, Vol. 37, No. 9 (1994) 59-77
2. McGregor, J. D. and Kare, A.: Parallel Architecture for Component Testing of Object-
oriented Software. Proc. 9th Annual Software Quality Week, Software Research, Inc. San
Francisco, May (1996)
3. Beck, K. Gamma E. et al.: The JUnit Project. Website http://www.junit.org/ (2003)
4. Stotts, D., Lindsey, M. and Antley, A.: An Informal Method for Systematic JUnit Test
Case generation. Lecture Notes in Computer Science, Vol. 2418. Springer Verlag, Berlin
Heidelberg New York (2002) 131-143
5. Wells, D.: Unit Tests: Lessons Learned, in: The Rules and Practices of Extreme Program-
ming. Hypertext article http://www.extremeprogramming.org/rules/unittests2.html (1999)
6. Rumbaugh, J., Blaha, M., Premerlani, W., Eddy, F. and Lorensen, W.: Object-Oriented
Modeling and Design, Prentice Hall, Englewood Cliffs, NJ, 1991
7. Object Management Group, UML Resource Page. Website http://www.omg.org/uml/
(2004)
8. Harel, D. and Naamad, A: The STATEMATE Semantics of Statecharts. ACM Trans. Softw.
Eng. and Meth., Vol. 5, No 4 (1996), 293-333
9. Bjorkander, M: Real-Time Systems in UML (and SDL), Embedded Systems Engineering,
October/November, 2000, http://www.telelogic.com/download/paper/realtimerev2.pdf
10. McGregor, J. D. and Dyer, D. M.: A Note on Inheritance and State Machines. Software
Engineering Notes, Vol. 18, No. 4 (1993) 61-69
11. McGregor, J. D.: Constructing Functional Test Cases Using Incrementally-Derived State
Machines. Proc. 11th International Conference on Testing Computer Software. USPDI,
Washington (1994)
12. Liskov, B., and Wing, J. M.: A New Definition of the Subtype Relation, Proc. ECOOP
’93, LNCS 707, Springer Verlag, 1993, 118-141
13. Cook, S. and Daniels, J.: Designing Object-Oriented Systems: Object-Oriented Modelling
with Syntropy. Prentice Hall, London (1994)
14. Milner, R.: Communicating and Mobile Systems: the π-Calculus, Cambridge University
Press, 1999.
15. Ebert, J. and Engels, G.: Dynamic Models and Behavioural Views. International Sympo-
sium on Object-oriented Methods and Systems. Lecture Notes in Computer Science, Vol.
858. Springer Verlag, Berlin Heidelberg New York (1994)
16. Chow, T.: Testing Software Design Modeled by Finite State Machines. IEEE Transactions
on Software Engineering, Vol. 4 No. 3 (1978) 178-187
17. Binder, R. V.: Testing Object-Oriented Systems: a Status Report. 3rd edn. Hypertext article
http://www.rbsc.com/pages/oostat.html (2001)
18. Holcombe, W. M. L. and Ipate, F.: Correct Systems: Building a Business Process Solution.
Applied Computing Series. Springer Verlag, Berlin Heidelberg New York (1998)
19. Ipate, F. and Holcombe, W. M. L.: An Integration Testing Method that is Proved to Find
All Faults. International Journal of Computational Mathematics, Vol. 63 (1997) 159-178
20. Bernot, B., Gaudel, M.-C. and Marre, B.: Software Testing Based on Formal Specifica-
tions: a Theory and a Tool. Software Engineering Journal, Vol. 6, No. 6 (1991) 387-405
21. Beck, K.: Extreme Programming Explained: Embrace Change. Addison-Wesley, New York
(2000)
22. Simons, A. J. H., Stannett, M. P., Bogdanov, K. E. and Holcombe, W. M. L.: Plug and Play
Safely: Behavioural Rules for Compatibility. Proc. 6th IASTED International Conference
on Software Engineering and Applications. SEA-2002, Cambridge (2002) 263-268
23. Simons, A. J. H.: Letter to the Editor, Journal of Object Technology. Received December
5, 2003. Hypertext article http://www.jot.fm/general/letters/comment_simons_html (2003)
24. Gamma, E., Helm, R., Johnson, R. and Vlissides, J.: Design Patterns: Elements of Reus-
able Object-Oriented Software, Addison Wesley (1995)
Exploring test adequacy for database systems
David Willmor and Suzanne M Embury
Informatics Process Group, School of Computer Science,
University of Manchester, Oxford Road, Manchester, M13 9PL, United Kingdom
d.willmor|s.m.embury@cs.man.ac.uk
Abstract
Database systems are an important asset for many businesses. As such, it is important to test database systems thoroughly, as any faults that remain hidden may significantly impact critical business processes. However, these systems bring additional complexities that make them amongst the most difficult kinds of system to test. While software testing in general is a well-developed area, techniques specifically aimed at testing database systems are still in
their infancy. In this paper, we present a family of test adequacy criteria for database systems that can be used to deter-
mine the “quality” of a test suite. These criteria consider various aspects of database systems including the source code
structure (in terms of patterns of database operations), the existence of define–use pairs between database operations
and the interactions between different applications of the database system. The criteria we present differ from existing
adequacy criteria as we focus on a general definition of a database test case that is based on intensional constraints. This
overcomes the problems associated with adequacy being constrained to a single static database state. We also consider
transactional operators that alter the behaviour of a database system and influence adequacy.
Keywords: software testing; database systems; test adequacy criteria
1. Introduction
Database systems are an important asset for many organisations, since they contain vital business data (both historical
and current) and support critical business processes. Because of this the testing of database systems is an important
concern. Database systems often present a strong integration of computation, data and communication aspects. Not only
do they incorporate a secondary persistent database state that is used for the storage and querying of data, but they are
also often spread across a number of logical tiers. The type of system typically considered by research into software
testing exists solely within a single program state, which is volatile in nature, and so its testing reflects this. Techniques
are often focussed on: the definition and use of variables; the passing of control; the coverage of statements and paths;
and outputs from the system. Each of these techniques is based on the computational aspects of the software system —
how an output comes to exist. However, when testing database systems, data and communication aspects must also be
explicitly considered during testing. Such considerations result in new testing challenges.
A database system consists of two forms of state: program and database. These states are not simple variants of
each other; they are fundamentally different in several ways. Where program state is organised as a collection of named
locations where individual data values can be stored, database state is organised relative to the concepts provided by a
data model (such as the relational model). Large segments of a database state can be accessed or modified through the
execution of a single database statement, in contrast to a typical program statement, which will define at most
one variable. Moreover, changes to the state are controlled by a transaction mechanism and may be undone by some later
statement. Understanding how the state may be changed is further complicated due to the huge space of possible database
states. Finally, database state is persistent, so that changes made during one execution may affect the behaviour of later
executions. Whilst one test case may execute correctly, its effect on the database state may affect other test cases, possibly
causing an error. This can result in errors that are hard to isolate and correct.
The communication aspects of a database system handle the movement of data between program and database state.
Program and database states communicate using database operations often in the form of SQL, a powerful set–oriented
language with explicit semantics. SQL is used for both the definition of the database schema and for querying the data
captured by it. Often, some form of database access layer (such as JDBC for Java) is used to allow SQL queries to be
embedded directly into the program code. The locations of embedded queries (where communication occurs between the program and the database) are likely locations of faults. Therefore, these locations are often a focus of the software testing
process for database systems.
Often, in a business setting, multiple programs will interact with a single database. For example, a sales application
will modify the stock levels of items to reflect purchases, a warehouse application will also modify stock levels to reflect
deliveries and a management system may analyse stock levels whilst investigating trends. Due to the persistent nature
of database state, the effect that one system has on the database will affect the behaviour of other completely separate
systems. Whilst this effect may often be intentional, it may lead to unanticipated behaviours and possible faults.
In this paper, we present a generalised view of what database systems and test cases are, one that can be applied to the many
different types of database systems that exist (Section 2). This view is based on the concept of intensional constraints that
allow real world (and dynamic) database states to be used for testing. In Section 3, we present a family of test adequacy
criteria based on structural and data–oriented aspects of a database system. These criteria allow us to address what
sufficient testing is. These criteria also differ from those presented in existing work in that they are based on reasoning over a dynamic database state. Thus, our adequacy criteria are not constrained to one specific instance of database state. Finally,
in Section 4 we present concluding remarks and discuss possible avenues for future work.
2. A view of database testing
Considering the widespread use of database systems there has been relatively little research into their testing. The
work that has been produced differs by a number of factors, not least in the terminology that is used. In order to provide
consistency in this paper we use the following terminology:
Application: a software program designed to fulfil some specific requirement. For example, we might have separate
application programs to handle the entry of a new customer into the database, and to cancel dormant accounts once
a time-limit has passed.
Database: a collection of interrelated data, structured according to a schema, that serves one or more applications.
Database application: an application that accesses one or more databases. A database application will operate on both
program and database state.
Database system: a logical collection of databases and associated (database) applications.
Testing is more difficult (or, at least, different) when dealing with database applications. The full behaviour of a database
application program is described in terms of the manipulation of two very different kinds of state: the program state and
the database state. It is not enough to search for faults in program state; we must also generate tests that seek out faults that
manifest themselves in the database state and in the interaction between the two forms of state. A further complication for
testing is that the effects of changes to the database state may persist beyond the execution of the program that makes them,
and may thus affect the behaviour of other programs [18]. Thus, it is not possible to test database programs in isolation,
as is done traditionally in testing research. For example, a fault may be inserted into the database by one program but then
propagate to the output of a completely different program. Hence, we must create sequences of tests that search for faults
in the interactions between programs. This issue has received little attention from the testing research community. It has been shown to be particularly important for regression testing, where a change to the functionality of one program may adversely affect other programs via the database state [18].
The literature on testing database systems varies in a number of ways. A fundamental difference in the literature is in
the understanding as to exactly what a database system is. Each definition is constrained to a particular situation. There is
no definition general enough to be applied to the different scenarios in which database systems may be used. The simplest
view is when a single application interacts with a single database [3, 4, 8, 7]. This has been moderately extended to handle
the situation in which multiple databases exist [13]. Whilst the situation in which multiple applications interact with a
database has been considered in a constrained form [12, 18] there does not exist a generalised definition that is applicable
to both this situation and the previous ones. Therefore, the following is a general definition of a database system that is
applicable to all existing work on database testing:
Definition 1. A database system consists of:
• a collection of database applications P1, P2, . . . , Pn,
• a collection of databases D1, D2, . . . , Dm,
• a schema Σ describing the databases.
Conceptually, we can view the collection of databases as a single logical database D that matches the data model Σ. Multiple databases are often used because, from an implementation perspective, they are easier to understand, manage and optimise. Also, database systems are often not constructed from scratch; they must often make use of existing databases. We do not constrain Σ to a particular data model (for example, relational [5, 6], object–relational [6, 17] or object–oriented [1, 6, 14]); however, for readability, we assume for the remainder of this paper that it is relational.
As with the definition of a database system there is no agreed view as to what a database test is, but an informal
consensus is beginning to emerge. The following is a definition of database test cases and suites that can form the
foundation for the proposals for test adequacy criteria (described in the next section) and for future work. A test case
usually involves stimulating the system using some form of input, action or event. The output from the system is then
compared against a specification describing what is expected and any faulty behaviour identified. In terms of database
systems, the concept of a test case becomes more complicated. Not only must we consider program inputs and outputs we
must also consider the input and output database states. A database test case must therefore describe what these database
states are. For initial database states, existing proposals either adopt an extensional approach [13] or do not consider
database state on a per-test basis, instead specifying a fixed initial database state for all tests [3, 4, 7, 8]. For output states,
existing approaches adopt either an extensional approach [13] or intensional approach [3, 4, 7, 8].
A robust approach for testing database systems should specify both initial and output database states intensionally. This
allows test cases to be executed on a variety of different states (often real world or changing states) allowing for more
realistic testing. Before justifying this we present our definition of a database test case and then discuss the advantages of
an intensional approach:
Definition 2. A test case t is a quintuple 〈i, ∆ic, P, o, ∆oc〉 where:
• i is the application input,
• ∆ic are the intensional constraints the initial database state must satisfy,
• P is the program on which the test case is executed,
• o is the application output, and
• ∆oc are the intensional constraints the output database state must satisfy.
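Definition 2 can be transcribed directly into a data structure in which the two constraint components are predicates over a database state rather than stored states. The field names and the toy dict-based state below are our own illustration, not the paper's notation:

```python
from dataclasses import dataclass
from typing import Any, Callable

State = dict  # toy stand-in for a database state: table name -> rows

@dataclass
class DatabaseTestCase:
    """A test case t = <i, Dic, P, o, Doc> per Definition 2 (a sketch)."""
    app_input: Any                        # i
    pre: Callable[[State], bool]          # Dic: intensional input constraint
    program: Callable[[State, Any], Any]  # P: program under test
    expected_output: Any                  # o
    post: Callable[[State], bool]         # Doc: intensional output constraint

    def run(self, db: State) -> str:
        if not self.pre(db):
            return "inapplicable"         # state not valid for this test
        out = self.program(db, self.app_input)
        ok = out == self.expected_output and self.post(db)
        return "pass" if ok else "fail"
```

Because pre and post are predicates, the same test case can be applied to any database state that satisfies ∆ic, which is the point of the intensional approach.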
In this definition P , i and o represent the same concepts as the traditional notion of a test case. The database aspects of
the test case are described by constraints ∆ic and ∆oc. We have chosen to specify the input and output database states
using intensional constraints as they allow us to address a number of limitations with extensional states. In terms of input
states, extensional states are: difficult to store, especially where either database states or test suites are large; difficult to
maintain as each state must often be modified to reflect changes to the test case, application or data model; and difficult
to ensure they reflect the real–world and changes to the database state that may occur over time. In terms of the output
state, extensional states are: expensive to determine if two large states are identical; difficult to maintain as the output state
must be modified to reflect changes to the input state and the functionality of the system; and time consuming to manually
create states that reflect complex behaviour that a test case may exhibit on the initial state. Our intensional technique
specifies constraints that a test case must satisfy to determine (a) applicability (if the input state is valid for the test case)
and (b) success (if the output state is correct). Consider the following very simple example in which a new customer is
added to the database:
Test Case 1: add a new customer with <name>, <email> and <postcode>
• ∆ic : initial state constraint
1. no customer C in CUSTOMER has C.NAME=<name>, C.EMAIL=<email>
and C.POSTCODE=<postcode>
• ∆oc : output state constraint
1. exactly one customer C in CUSTOMER has C.NAME=<name>,
C.EMAIL=<email> and C.POSTCODE=<postcode>
This test case is relatively simple and imposes a single input constraint that specifies that no customer should exist in
the database that matches the customer to be added. The output constraint specifies that after executing the test case the
database should contain exactly one customer matching the customer to be added. We specify exactly one customer in the
output constraint as it allows us to cover faults where no customer was added and where multiple customers were added.
The use of intensional constraints against a real–world database raises the question of how we can deal with situations in
which the initial constraint does not hold. This is important because, whilst using a real–world database state provides us with realistic data, it does not let us create opportunities for exposing faults that might arise in the future but which are not present in the existing data.
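Test Case 1's constraints translate directly into queries over a live database. The sketch below uses Python's sqlite3 as a stand-in DBMS; the schema, the add_customer program and the sample input are assumptions for illustration:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE CUSTOMER (NAME TEXT, EMAIL TEXT, POSTCODE TEXT)")

def matching_customers(db, name, email, postcode):
    """Count rows matching the test case's <name>, <email>, <postcode>."""
    cur = db.execute(
        "SELECT COUNT(*) FROM CUSTOMER "
        "WHERE NAME=? AND EMAIL=? AND POSTCODE=?",
        (name, email, postcode))
    return cur.fetchone()[0]

def add_customer(db, name, email, postcode):   # the program under test
    db.execute("INSERT INTO CUSTOMER VALUES (?, ?, ?)",
               (name, email, postcode))

new = ("A. Customer", "a.customer@example.org", "M13 9PL")  # sample input

assert matching_customers(db, *new) == 0  # Dic: no matching customer yet
add_customer(db, *new)
assert matching_customers(db, *new) == 1  # Doc: exactly one match
```

The same checks run unchanged against any initial state satisfying ∆ic, rather than against one fixed extensional state.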
A test case aims to test a particular use of a system. However, database systems exhibit significantly more complex
functionality. For example, a sequence of related tasks may be carried out by a user interspersed with tasks of other users.
Tasks may also be spread across a number of individual programs. These cannot be captured by the execution of a single
test case since our definition of a test case assumes a single program execution. Consider the situation in which a test case
t1 adds an item to a shopping cart and t2 increases the quantity of the item added. If t1 does not correctly add the item, it
is not possible for t2 to increase its quantity. Therefore, the execution of t2 may fail not as a result of a problem with the
program but because t2 is dependent upon t1. This dependency problem can be addressed by modifying database state to
satisfy the initial constraints. However, this approach has a number of limitations. The simplest are due to the resources
required for generating database states. The most important is that, whilst we can satisfy t2's requirements from t1, we are unsure whether t1 has an unforeseen impact on t2. For example, a test case may change part of the database state
that can adversely affect the behaviour of a subsequent test case. Therefore, it is obvious that certain behaviours require
the execution of individual tests in an ordered sequence. A test sequence s is a sequence of test cases 〈t1, . . . , tn〉. Each
test of the sequence is executed in the specified order. If a test case does not meet its output conditions (the test fails), the
user is notified of the failure. The database state is then modified to allow the sequence to proceed. However, the test result
of the sequence is flagged to tell the user that it did not execute correctly. This is done, instead of simply stopping the
sequence, as the tests still provide a certain confidence in the system. Our approach to test sequences allows an individual
test case to exist in a number of test sequences. It can also be observed that test sequences can be used for more than
testing complex functionality. It can potentially take a lot of effort to set up a database for a particular test case. If several
test cases require similar input databases, then it will be much more efficient to run them all against the same database.
For example, consider the situation where a database contains records for customers. In an example sequence, the first
test case would create a customer; the second would modify the customer; and the third would delete the customer. Each
test case represents important functionality of the system which are all related through the use of the same customer. It is
therefore more efficient to use a sequence to group related test cases.
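The sequence semantics described above can be sketched as follows. This is an illustrative outline only: the class names, the dictionary-based state, and the trivial repair strategy are our assumptions, not definitions from the paper.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DBTestCase:
    """Illustrative test case: initial constraint, program run, output condition."""
    name: str
    pre: Callable    # intensional constraint the initial state must satisfy
    action: Callable # single program execution: state -> state
    post: Callable   # output condition on the resulting state

def run_sequence(sequence, state, repair):
    """Execute tests in order; flag failures, repair state, and continue."""
    failed = []
    for t in sequence:
        if not t.pre(state):
            state = repair(t, state)   # modify state so the initial constraint holds
        state = t.action(state)
        if not t.post(state):
            failed.append(t.name)      # the user is notified; the sequence proceeds
            state = repair(t, state)
    return state, failed               # a non-empty failed list flags the sequence

# Customer example: create, modify, then delete the same customer.
seq = [
    DBTestCase("create", lambda s: True,
               lambda s: {**s, 1: "Ann"}, lambda s: 1 in s),
    DBTestCase("modify", lambda s: 1 in s,
               lambda s: {**s, 1: "Anne"}, lambda s: s[1] == "Anne"),
    DBTestCase("delete", lambda s: 1 in s,
               lambda s: {k: v for k, v in s.items() if k != 1},
               lambda s: 1 not in s),
]
final, failed = run_sequence(seq, {}, repair=lambda t, s: s)
```

In a real setting the repair function would regenerate or patch the database state, which is exactly the costly step the sequence mechanism tries to amortise across related test cases.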
3. Test adequacy criteria for database systems
A test suite is a collection of test cases (or test sequences in our approach) usually targeted towards the verification of the
entire system or a specific section of the system. The manner in which a test suite is generated varies between different
situations. The simplest is to randomly generate tests for the system. However, it is common to use more formalised
techniques based on some aspect of the system, including: the system's specification, observations of the system being
used, and the structure of the system's source code. Testing every possible input is impractical for anything but the simplest
of programs. For database systems this becomes impossible. Thus, if we cannot completely test a system, what is
sufficient testing? Current work into software testing has proposed a number of test adequacy criteria that if satisfied will
sufficiently test the system according to some characteristic of the specification or implementation. In terms of database
testing, only one set of criteria exists for determining the quality of a test suite [13]. This approach is based on determining
for each database operation an extensional representation of the portion of the database defined or referenced by the
application. This approach is limited for a number of reasons: (a) It generates an extensional representation in which
non–determinism countered by using a fixed initial state. However, this means that the test suite can only be considered
valid for this particular state. (b) It does not consider the effect transaction operators have on the definition–use pairs and
so will include unnecessary test cases for interactions that will not occur. (c) Nor does it consider multiple applications or
instances of the same application accessing a single database. The only other work on test adequacy for database systems
is by Suarez-Cabal and Tuya [2] in which they present a metric for testing coverage of an SQL SELECT query, and a
method for detecting when tuples need to be added to the database to ensure better coverage, based on analysis of the
corresponding query tree (a query tree is conceptually similar to an abstract syntax tree for programming languages). Their work aims to determine adequacy of a single query (specifically the SELECT query)
and does not consider the behaviour of the system as a whole.
In this section we present a number of test adequacy criteria based on our intensional specification of database state
and intensional descriptions of the behaviour of database operations. The criteria presented are focussed on the structural
and data-oriented elements of database systems. Briefly, structural elements include branches, loops and procedure calls.
Data-oriented elements are the points at which data is defined and used in the program. However, first we present a brief
discussion about the types of faults that can occur within a database system.
The fundamental issue in database testing is whether the application behaves as specified [3, 4]. From a simple perspective this can be seen as determining if the output from a database system matches its required output. Bearing in mind
that database state is persistent, it can be observed that the output database state is dependent not only on the input to the
program but also the initial (or input) database state. Therefore, a test case execution can be seen as moving the database
from one state to another. It is this transition that we aim to verify. The first type of fault is simply that the implemented
functionality does not match the specified functionality. Other faults include: attempting to access a database entity that
does not exist, operations attempting to violate the database's constraints (such as primary key or referential integrity), and
transactions being aborted or committed incorrectly. These types of faults manifest themselves either in the database state
or as a result of an interaction between program and database state.
3.1. Structural test adequacy criteria
Structural test criteria are a commonly used class of software test adequacy criteria [20]. This form of testing is based on a
structural model that represents the physical implementation of the software application. Kapfhammer and Soffa’s [13]
approach to test adequacy is based on an extended version of a control flow graph in which extra edges are included to
capture the dependencies that exist between database operations. However, this model only captures dependencies that
exist between database operations in a single procedure. Instead we base our model on a representation that completely
describes the structure of a database system and its composite components.
Definition 3. Each application P of the database system is modelled as an interprocedural graph consisting of:
• CP , the set of control flow graphs, where each ci ∈ CP corresponds to a procedure mi of program P .
• I is a graph where each edge represents a procedure call.
This is a general model that can be tailored towards particular implementations. For example, for a Java program I is
an interclass relation graph (IRG) representing the relationships between the classes (and their methods) in P [10]. The
IRG models the complexities associated with object–oriented languages, including variable and object type information;
internal and external methods; interprocedural interactions; inheritance, polymorphism and dynamic binding; and exception handling. In the model described in Definition 3 each statement of a program is captured by a particular node.
These nodes can be categorised and annotated with additional information. In particular we use the concept of a database
operation type:
Definition 4. A database operation δ is a node of a control flow graph that consists of some form of interaction with a
database D. Each δ is annotated with:
• δ.Σadd the subset of D that is updated by δ,
• δ.Σdel the subset of D that is deleted by δ, and
• δ.Σread the subset of D that is read by δ.
The sets δ.Σadd, δ.Σdel and δ.Σread allow us to reason over the interactions with the database and possible relations between different database operations. The subsets of the database are represented intensionally as relational algebra expressions (for more information about how these sets are generated, reasoned over and used, please see [18, 19]).
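As an illustration, a δ node and its three annotation sets can be modelled as follows. This is a hedged sketch: the paper represents the sets intensionally as relational algebra expressions, whereas plain Python frozensets of (relation, key) pairs stand in here, and the class and field names are our own.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DBOperation:
    """Sketch of a δ node with its three database-interaction sets."""
    node_id: int
    sigma_add: frozenset = frozenset()   # δ.Σadd: subset of D updated by δ
    sigma_del: frozenset = frozenset()   # δ.Σdel: subset of D deleted by δ
    sigma_read: frozenset = frozenset()  # δ.Σread: subset of D read by δ

    def reads(self) -> bool:
        return bool(self.sigma_read)

    def writes(self) -> bool:
        return bool(self.sigma_add or self.sigma_del)

# An INSERT-like and a SELECT-like operation on the same tuple.
insert_op = DBOperation(1, sigma_add=frozenset({("Customer", 42)}))
select_op = DBOperation(2, sigma_read=frozenset({("Customer", 42)}))
```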
Our structural test adequacy criteria are based upon a static analysis of the model according to a specific criterion. These
structural criteria can be seen as analogous to traditional control flow–based criteria but with the additional characteristics
of database systems incorporated. Similar to control–flow based techniques we utilise the concept of a complete path π
that starts at the graph’s entry node and ends at an exit node [9]. Intuitively, it can be observed that the execution of a
particular test case will result in the execution of a particular complete path (relative to the input to the program and the
state of the database at that moment in time). Therefore we use the notation πt to refer to the complete path relative to
the execution of test case t (often referred to in the literature as the execution trace of t). As our model is based upon control–flow graphs, existing criteria, such as statement, branch
and path coverage, are applicable to database systems. For brevity we refer the reader to [20] for a detailed description of
these criteria. The use of complete paths in test adequacy criteria is complicated by the fact that we describe initial states
using intensional constraints. Therefore, multiple executions of a test case may result in different paths due to changes in
the database state. A test suite is therefore only adequate in terms of the state in which it was executed. For example, a
test suite may only satisfy a subset of the requirements of a criterion when executed against a particular state; however, as
that state changes so may the degree of satisfaction. The test suite executed at a later date may result in a different degree
of coverage.
The database system model includes a special type of node for each database operation in an application. Criterion 1
simply assesses coverage of all database operations without discriminating as to their effect. Whilst a fault may not
be caused directly by a database operation, it is through these statements that database-related faults become detectable,
either by propagation to the database state through state change or by retrieval from the database state. Since each database
operation may potentially reveal a fault, ensuring coverage of all such statements by a test suite gives some guarantee that
a wide variety of database faults will be detected.
Criterion 1. A test suite ts satisfies the All Database Operations criterion if for each database operation δ, there exists
a t ∈ ts such that δ ∈ πt.
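A minimal sketch of how adequacy against Criterion 1 might be checked, assuming we already have the set of database-operation node identifiers and the complete path π_t (as a node collection) for each executed test; the function and variable names are illustrative:

```python
# db_ops: node ids of the δ nodes in the model.
# paths: maps each test case to the node sequence of its complete path.
def all_database_operations(db_ops, paths):
    covered = set()
    for pi_t in paths.values():
        covered |= set(pi_t)         # union of all executed complete paths
    return set(db_ops) <= covered    # every δ must lie on some π_t

paths = {"t1": [1, 2, 3, 9], "t2": [1, 4, 7, 9]}
all_database_operations({3, 7}, paths)     # satisfied: both δ nodes covered
all_database_operations({3, 7, 8}, paths)  # not satisfied: node 8 unreached
```

The same membership check, restricted to read or write operations, yields Criteria 2 and 3.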
Criteria 2 and 3 assess coverage relative to two of the most common types of interactions with a database: retrieval and
update. They each potentially require fewer test cases than Criterion 1, and so can be more efficient when dealing with
kinds of programs where there is some expectation as to the kind of fault that may appear. For example, when testing a set
of batch applications, we may prefer to concentrate our testing effort on updates to the database, ensuring that the result
states produced by the programs are correct. In such programs, database retrieval is used only to pull data into memory in
order to manipulate it before writing it back to the database. Any errors in this part of the code will likely result in incorrect
update operations as well. Alternatively, where a system consists of many report-style or query-browsing applications,
we may wish to focus attention on how data is retrieved from the database, and how it is later manipulated before being
presented to the user, rather than the minor house-keeping updates that occur during report generation.
Criterion 2. A test suite ts satisfies the All Read Operations criterion if for each database operation δ where δ.Σread ≠ ∅, there exists a t ∈ ts such that δ ∈ πt.

Criterion 3. A test suite ts satisfies the All Write Operations criterion if for each database operation δ where (δ.Σadd ≠ ∅) ∨ (δ.Σdel ≠ ∅), there exists a t ∈ ts such that δ ∈ πt.
Although faults may be visible to a particular program in the middle of a transaction, the key point at which they
become fully visible externally is when the changes made by the program are made durable by execution of a transaction
commit. These are key points in the program, when a set of related changes is declared to be either consistent (i.e. legal)
and can therefore be made durable within the database state, or inconsistent and must therefore be undone. Therefore,
by ensuring that all such operations are executed at least once by a test suite, we have some guarantee that our test suite
is exercising a significant proportion of the kinds of database interactions that the application programs implement. This
leads us to propose Criterion 4, in which all commit and abort statements are required to be exercised by the test suite
for it to be considered adequate. In general, this criterion will allow smaller test suites to be used than criteria 2 and 3.
Criterion 4. A test suite ts satisfies the All Commits and Aborts criterion if for each database operation δ where
type(δ, commit) ∨ type(δ, abort), there exists a t ∈ ts such that δ ∈ πt.
The above adequacy criteria are all subsumed by the standard all–statements criterion (sometimes referred to as all–nodes), and therefore share its inherent weaknesses, in that test suites may satisfy them but may still leave a large part of the control flow graphs of a set of programs unexplored. In our context, this disadvantage is most serious in the case of the weakest criterion (Criterion 4). The
motivation behind this criterion is that the test case should execute all transactions within the programs. However, the
structure of most database programs is much more complex than this criterion implies. There will in general be many
ways of reaching a specific commit or abort operation. Many of these will be slight variants on the same basic transaction
behaviour, but others may represent very different transactions that need to be tested.
In other words, rather than being satisfied with testing one path to each commit or abort, we would ideally prefer to test
all such paths. In order to define such an adequacy criterion, we first require a notion of a transaction path.
Definition 5. A transaction path in a program P is a subpath ni, . . . , nj in a complete path of P where:
• the node immediately preceding ni is either START or a commit or abort operation,
• nj is either a commit or an abort operation, and
• the subpath ni, . . . , nj−1 is commit- and abort-free.
Based on this, we can now define an adequacy criterion that requires all transaction paths to be exercised by a test suite:
Criterion 5. A test suite ts satisfies the All Transactions criterion iff every transaction path in P is covered by some test
t ∈ ts.
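Definition 5 can be sketched operationally: scanning a complete path and cutting it at each commit or abort yields the transaction paths. The node-kind labels and the pair representation here are illustrative assumptions, not notation from the paper.

```python
# complete_path: sequence of (node_id, kind) pairs; kinds "commit" and
# "abort" terminate a transaction path (Definition 5).
def transaction_paths(complete_path):
    paths, current = [], []
    for node_id, kind in complete_path:
        current.append(node_id)
        if kind in ("commit", "abort"):
            paths.append(tuple(current))
            current = []   # the next node starts a fresh transaction path
    # a trailing commit/abort-free suffix is not a transaction path
    return paths

pi = [(1, "op"), (2, "op"), (3, "commit"), (4, "op"), (5, "abort")]
transaction_paths(pi)  # two transaction paths: (1, 2, 3) and (4, 5)
```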
In practice, of course, such a criterion would need to be used in conjunction with some mechanism for ensuring that the
presence of cyclic paths does not lead to an infinite number of transaction paths to be tested. For example, Howden’s
boundary interior criterion could be applied [11]. If testing resources are limited (as is usually the case) we may also
want to avoid repeated testing of very similar transaction paths, and would instead want to concentrate testing effort on
covering as wide a variety of transaction paths as possible. This would require some heuristic to be combined with the
criterion, in order to determine which transaction paths are deemed sufficiently different from those already explored to
be worth attention.
In database systems, we have an additional source of structure that can form the basis for further test adequacy criteria.
This is the structure of the database itself. A common strategy when testing database applications, for example, is to choose tests that cover all parts of the schema (or, since databases are often required to support a wide variety of applications, those parts of the schema actually accessed by the programs to be tested, which can in general be determined statically), and all forms of operation on each schema element. For example, if a
new database system is created that includes a Customer table, we would expect there to be code that controls the addition
of new customers to the database, that handles modifications to their details (such as name or address) and that carries
out deletion of data for customers that are no longer deemed to be active. We would also expect there to be at least one
program that reads data from the Customer table, since if it is not used then there is little point in maintaining the data.
A well designed database system will often be created with sub-routines or sub-programs that handle these basic
updates, and which ensure that the same business logic is applied, regardless of what higher level application program
they are called from. (Such systems are often constructed as three-tier systems, with an upper interface layer, a middle
business logic layer and a supporting database services layer.) In testing such systems, we may wish to ensure that each
form of operation on each part of the schema has been tested at least once, rather than testing many calls to the same
operations. This leads to a further structural adequacy criterion, which we define here in terms of the relational data
model (though the same principle could easily be applied to other models):
Criterion 6. A test suite ts for a database system with schema Σ satisfies the All Schema Elements criterion iff for every
relation r ∈ Σ, the following operations are covered by at least one test case in ts (though not necessarily the same test
case):
• an operation which retrieves data from r,
• an operation which inserts new tuples into r,
• an operation which deletes tuples from r, and
• an operation which modifies some existing tuples in r.
Further variants of this criterion can be considered, which operate at a finer granularity of database structure. For example,
we might wish to ensure that our test case will include operations which read from each attribute in each table, as well as
modifying the attribute. Another common area of focus for testing in database applications is on the relationships between
tables (modelled using foreign keys and additional integrity constraints in relational systems), since errors in modelling
the cardinality and optionality of relationships are common in database programming, and can have severe ramifications.
3.2. Define–use test adequacy criteria
Traditional programs are based around the definition and use of variables [15, 16]. A definition occurs when a variable is
on the left hand side of an assignment. A use may either occur: (a) on the right hand side of an assignment (a computation–
use) or (b) in the predicate of a conditional logic statement (a predicate–use). A definition–use pair occurs between a
statement that defines a variable and a subsequent (in the control flow graph) statement that uses that variable and there
is no intervening definition. A number of different criteria have been proposed based on the concept of definition–use
pairs [15, 16]. The simplest include the all-uses criterion (in which all uses must be covered) and the all-du-paths criterion
(in which all paths between definition–use pairs must be covered) [15, 16].
For database applications, define–use pairs relate to the definition and use of parts of the database. Whilst this is more
complicated than program statements, it has been successfully employed in testing [13, 18] and program slicing [19].
We will utilise the approach we proposed previously in the context of slicing [19] and regression testing [18] as it has a
finer level of granularity than the approach of Kapfhammer and Soffa [13] and also includes the effects of the transaction
operators: commit and abort. This is important as a definition–use pair cannot exist if the definition has been aborted
before it is used. The following is a description of a database definition–use pair (please refer to [19] for details of how we reason over database queries and construct the definition–use pairs, which are described there as database–database dependencies):
Definition 6. A database definition–use pair exists between operation δ1 and operation δ2 iff:
1. δ2.Σread ∩ (δ1.Σadd ∪ δ1.Σdel) ≠ ∅, and
2. there is a rollback-free execution path p between δ1 and δ2 such that:
• δ1.Σadd \ p.Σdel ≠ ∅, or
• δ1.Σdel \ p.Σadd ≠ ∅
For brevity, we use the notation δ1 ⇢ δ2 to denote the fact that there exists a definition–use pair between δ1 and δ2. In the above description it can be observed that definition–use pairs will only be encountered if certain paths of the program are traversed (some paths will not be rollback–free). Therefore, in order to specify coverage we define Πδ1⇢δ2 as the set of all paths in which the specific definition–use pair will occur. Using this description of a database definition–use pair it is possible to propose a criterion based on their coverage:
Criterion 7. A test suite ts satisfies the All Database Application Define–Use Pairs criterion if for each δ1 ⇢ δ2, there exists a t ∈ ts such that πt ∈ Πδ1⇢δ2.
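The existence condition of Definition 6 (part 1) can be illustrated with extensional stand-ins for the intensional sets; the dictionary keys and tuple representation are our assumptions:

```python
# δ2 must read something that δ1 added or deleted for a pair to be possible.
def may_form_du_pair(d1, d2):
    return bool(d2["read"] & (d1["add"] | d1["del"]))

d1 = {"add": {("Customer", 1)}, "del": set(), "read": set()}
d2 = {"add": set(), "del": set(), "read": {("Customer", 1)}}
may_form_du_pair(d1, d2)  # a du-pair may exist, still subject to a
                          # rollback-free path between the two operations
```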
In this criterion each instance of a define–use pair should be matched by a test case in which the use occurs after the
definition in the path of the test case. Intuitively, it can be observed that the above definition–use pairs only exist within
a single program. However, given the persistent nature of database state, data is shared between different programs.
Therefore, a program may define a part of the database that may be subsequently used by another program. Given this
situation we are able to specify a definition–use criterion across the entire database system:
Criterion 8. A test suite ts satisfies the All Database System Define–Use Pairs criterion if for each δ1 ⇢ δ2 where δ1 ∈ P1, δ2 ∈ P2, there exists a test sequence s ∈ ts where the complete path of the test sequence πs contains a rollback–free subpath p between δ1 and δ2 such that:
• δ1.Σadd \ p.Σdel ≠ ∅, or
• δ1.Σdel \ p.Σadd ≠ ∅
This criterion aims to cover each instance of a define–use pair that may exist within an entire database system. For each
define–use pair a test sequence should exist in the test suite in which the definition occurs and is then subsequently used. The rollback-free subpath condition checks that between the definition occurring and it being used, its effect on the database state has not been reversed (either through an abort command or an intervening definition).
Figure 1: Test adequacy criteria subsumption hierarchy
3.3. Subsumption hierarchy of test adequacy criteria
The test adequacy criteria proposed in this paper do not produce mutually exclusive test suites. We discuss the inclusiveness of two test adequacy criteria in terms of subsumption. A criterion C1 subsumes a criterion C2 if every test suite
that satisfies C1 also satisfies C2. Figure 1 shows the subsumption hierarchy for the criteria proposed in this paper. The
relationships presented in this hierarchy are justified as follows:
• All Database Operations subsumes All Read/Write/Commits&Aborts Operations and All Schema Elements
These are the simplest of the subsumption relationships. Read, write, and commit & abort operations are each types of database operation. Therefore, a test suite that covers all database operations will inherently cover each of these types.
• All Statements subsumes All Database Operations
A database operation is a particular type of statement that interacts with the database. Therefore, a test suite that
covers all statements inherently covers all database operations.
• All Transactions subsumes All Database Operations
A transaction is a collection of database operations. Every database operation must exist in a transaction (a number of database programming languages and access layers allow certain operations to be performed outside of transactions, or operate in an auto-commit mode in which each operation is automatically committed if it is successful; we view each such operation as existing within its own individual transaction). Therefore, a test suite that covers all transactions inherently covers all database operations.
• All Paths subsumes All Transactions
A transaction is captured by a transaction path, which in turn is a subpath of the system's code. Therefore, covering all paths (which inherently covers all subpaths) will cover all transactions.
• All Database System Define–Use Pairs subsumes All Database Application Define–Use Pairs
Define–use pairs of the system as a whole will also include all define–use pairs located in an individual program.
Therefore, to cover all database system define–use pairs inherently covers all database application define–use pairs.
4. Conclusions and future work
In this paper, we have addressed a number of fundamental issues regarding database testing, particularly: (a) what is a database system? (b) what is a database test case? and (c) what constitutes adequate testing of a database system? In response, we have presented the following contributions:
• Basic definitions for database testing:
– A definition of a database system that is applicable both to the types of systems in use in business and to existing database testing techniques. Our definition is based on the concept that a database system may consist of one or more applications that interact with one or more databases.
– A database test case. This definition uses intensional constraints to specify the requirements on the initial and output database states. The use of intensional constraints (particularly for the input state) allows us to utilise both real–world and artificial database states for testing.
– A database test sequence. This definition describes the execution of a sequence of database test cases aimed
at verifying some form of complex behaviour. Often the behaviour of a system cannot be verified by a single
test case. A test sequence also allows us to group test cases that can operate on the same initial state without
affecting each other.
• Test adequacy criteria for database systems that aim to determine what “adequate” testing is:
– Structural criteria focus upon the structural aspects of the database system. We focus on two forms of structural information: the application source code and the data model. The source code based criteria are based
on the possible patterns of database operations that may be executed. Particularly, we have presented criteria
based on covering different types of database operations, transactional statements and complete transactions.
The data model based criterion aims to cover the different entities of the data model. This translates to covering tables, columns, foreign key relationships, etc. in the relational model.
– Define–use criteria focus upon the relationships between database operations in terms of what subset of the
database they define or use. Our approach differs from existing techniques as the reasoning over the effect
of a database operation on the database is based on intensional descriptions. This allows us to determine all
of the possible define–use pairs and not just those valid for the current database state. We specify define–use
criteria for a single application and the database system as a whole.
• A subsumption hierarchy describing the relationships between our test adequacy criteria and placing them in perspective with classical program-state-based criteria.
This work has presented a number of avenues for further work. The first avenue is an empirical comparison between the
proposed test adequacy criteria. Whilst our subsumption hierarchy provides a descriptive comparison of the differences
between the criteria, it does not tell us anything about the costs associated with determining adequacy, the fault coverage
of different adequate test suites, or the cost of executing each test suite. Furthermore, we plan to investigate possible optimisations to improve testing. This is particularly important for concurrent systems, where the number of possible combinations is combinatorially explosive.
References
[1] J. Banerjee, H.-T. Chou, J. F. Garza, W. Kim, D. Woelk, N. Ballou, and H.-J. Kim. Data model issues for object-oriented
applications. ACM Trans. Inf. Syst., 5(1):3–26, 1987.
[2] M. J. Suárez-Cabal and J. Tuya. Using an SQL coverage measurement for testing database applications. In Proceedings of the 12th
ACM SIGSOFT International Symposium on Foundations of Software Engineering (SIGSOFT FSE), pages 253–262, October-
November 2004.
[3] D. Chays, S. Dan, P. G. Frankl, F. I. Vokolos, and E. J. Weyuker. A framework for testing database applications. In International
Symposium on Software Testing and Analysis (ISSTA), pages 147–157, August 2000.
[4] D. Chays, Y. Deng, P. G. Frankl, S. Dan, F. I. Vokolos, and E. J. Weyuker. An AGENDA for testing relational database applications.
Software Testing, Verification and Reliability, 14(1):17–44, 2004.
[5] E. F. Codd. A relational model of data for large shared data banks. Communications of the ACM (CACM), 13(6):377–387, 1970.
[6] T. Connolly and C. Begg. Database Systems. Addison-Wesley, 3 edition, 2002.
[7] Y. Deng and D. Chays. Testing Database Transactions with AGENDA. In Proceedings of the 27th International Conference on
Software Engineering (ICSE). IEEE Computer Society, May 2005.
[8] Y. Deng, P. G. Frankl, and Z. Chen. Testing database transaction concurrency. In International Conference on Automated Software
Engineering (ASE), pages 184–195. IEEE Computer Society, October 2003.
[9] P. G. Frankl and E. J. Weyuker. An applicable family of data flow testing criteria. IEEE Transactions on Software Engineering,
14(10):1483–1498, 1988.
[10] M. J. Harrold, J. A. Jones, T. Li, D. Liang, A. Orso, M. Pennings, S. Sinha, S. A. Spoon, and A. Gujarathi. Regression test
selection for java software. In Proceedings of the 16th Annual ACM SIGPLAN Conference on Object-Oriented Programming,
Systems, Languages, and Applications (OOPSLA), pages 312–326. ACM, October 2001.
[11] W. E. Howden. Methodology for the generation of program test data. IEEE Transactions on Computers, 24(5):554–560, 1975.
[12] G.-H. Hwang, S.-J. Chang, and H.-D. Chu. Technology for testing nondeterministic client/server database applications. IEEE
Transactions on Software Engineering, 30(1):59–77, 2004.
[13] G. M. Kapfhammer and M. L. Soffa. A family of test adequacy criteria for database-driven applications. In Proceedings of the
11th ACM SIGSOFT Symposium on Foundations of Software Engineering, pages 98–107. ACM, September 2003.
[14] C. Lecluse, P. Richard, and F. Velez. O2, an object-oriented data model. In H. Boral and P.-A. Larson, editors, Proceedings of the
1988 ACM SIGMOD International Conference on Management of Data, Chicago, Illinois, June 1-3, 1988, pages 424–433. ACM
Press, 1988.
[15] S. Rapps and E. J. Weyuker. Data flow analysis techniques for test data selection. In Proceedings of the 6th International
Conference on Software Engineering (ICSE), pages 272–278, September 1982.
[16] S. Rapps and E. J. Weyuker. Selecting software test data using data flow information. IEEE Transactions on Software Engineering,
11(4):367–375, 1985.
[17] M. Stonebraker, P. Brown, and D. Moore. Object-Relational DBMSs. Morgan Kaufmann, 2 edition, 1998.
[18] D. Willmor and S. M. Embury. A safe regression test selection technique for database–driven applications. To appear in the Proceedings of the 21st International Conference on Software Maintenance (ICSM 2005). IEEE Computer Society, September 2005.
[19] D. Willmor, S. M. Embury, and J. Shao. Program slicing in the presence of a database state. In Proceedings of the 20th
International Conference on Software Maintenance (ICSM 2004), pages 448–452. IEEE Computer Society, September 2004.
[20] H. Zhu, P. A. V. Hall, and J. H. R. May. Software unit test coverage and adequacy. ACM Computing Surveys, 29(4):366–427, 1997.
3. Search-Based Software Testing
Automatic Software Test Data Generation For String Data Using
Heuristic Search with Domain Specific Search Operators
Mohammad Alshraideh, Leonardo Bottaci
Department of Computer Science, The University of Hull,
Hull, HU6 7RX, UK.
M.Alshraideh@dcs.hull.ac.uk, L.Bottaci@hull.ac.uk
July 28, 2005
Abstract
This paper presents a novel approach for automatic software test data generation where the test data is intended to cover program branches which depend on string predicates such as string equality, string ordering and regular expression matching. Genetic algorithm search is used and initially four simple predicate cost functions are proposed and their performance is investigated on a small number of simple programs. One cost function consistently outperforms the others. It becomes clear that the simple cost functions are failing to exploit all the available domain knowledge and that performance can be improved by introducing domain specific search operators. Some operators are proposed and shown to produce a significant improvement for a small number of simple test programs.
Key words: software test data generation, genetic algorithms, string predicates.
1 Introduction
The goal of automatic test data generation is to generate test data that satisfies a given test criterion. For the purpose of empirical investigation, a specific criterion, namely branch coverage, has been adopted for this paper. A number of automatic software testing approaches have been developed to obtain control flow coverage [3]. These approaches include random test generation, symbolic execution-based test generation [1], rule-based test generation, constraint-based test generation and dynamic test generation [3], [10]. In the research reported in this paper, a dynamic test data generation approach is used to automatically generate test data. In this approach, a heuristic search is guided by a measure of closeness to a test goal. A simple branch coverage problem is illustrated in Figure 1. The problem is to find an input string s so that the required branch is executed. If s is such that the predicate fails, a cost is associated with s. This cost is used to guide the search. Given the use of a particular search technique such as a genetic algorithm, a key problem is how to compute a useful cost for this predicate failure.
Current work has been largely limited to programs whose predicates compare numbers [17], [11], [12], but not strings. This unduly restricts the practical applicability of these approaches, since string predicates are widely used in programming.
The remainder of this paper is organized as follows. Section 2 presents background and related work, Section 3 presents cost functions for strings, Sections 4 and 5 present genetic algorithms and an outline of string data generation using a GA, Sections 6 and 7 present the experimental investigation and its results, Section 8 presents improved search operations, further results are shown in Section 9 and Section 10 presents the conclusion
UK Software Testing Research III
var string: s;
...
if (s == astring) {
//execution required
}
...
Figure 1: Need to measure the similarity of string s to the string value of astring in order to measure the cost of failure to execute the required branch.
and future work.
2 Background and Related Work
There are a number of different automatic test data generation approaches such as random test data generation [6], symbolic execution based test data generation [7], [8] and dynamic test data generation. Random test data generation produces test data at random until a useful input is found [6]. In general, random test data generation performs poorly and is generally considered to be ineffective on realistic programs [14]. Often the evaluation of guided search methods uses random testing as a benchmark [3]. Symbolic execution has problems with dynamic features of programming languages, such as array subscripts and pointers.
Dynamic test data generation is a popular approach for generating test data. It relies on the execution of the program under test in order to gain information that can be used to generate suitable test data. When searching for test data, candidate test inputs are executed to identify how close each input is to meeting the test requirement [2]. With the aid of this feedback, test inputs are gradually modified until one of them satisfies the requirement. For example, suppose that a program contains the condition statement
...
if (y == 30 ) {
//execution required
}
and the true branch of the predicate should be taken. Thus, we must find an input that makes the variable y equal to 30 when the condition statement is reached. For a given input, a simple way to determine the value of variable y in the predicate is to execute the program up to the condition statement and record the value of y. Given this information, the absolute difference between y and 30 is a possible cost function to guide the search. This approach can also be applied to all the arithmetic relational operators: <, <=, ==, != (non-equality), > and >=. Techniques for handling compound predicates are described in [10]. The cost functions may be used to guide search techniques that include gradient descent, tabu search, genetic search and simulated annealing [9].
The majority of systems developed by applying these techniques have focussed on generating numeric test data. In such systems, string data is considered in terms of its underlying numeric character format, which is the form used by cost functions and search operators. For example, string equality may be interpreted as equality of the underlying binary representation [16]. A notable exception is the use of a string specific predicate in [15]. The claim in this paper is that search performance can be improved by using string specific cost functions and search operators.
3 Cost Functions For Strings
A cost function is intended to measure the search effort required to produce a solution from a candidate solution. For string data, cost functions are required for the string relational predicates of equality, ordering and regular expression matching.
3.1 String Equals
The aim is to assign a measure of the extent to which any two different strings have the same value. Initially four functions were considered, as follows:
UKTest 2005
1. Hamming Distance (HD): n · k, where n is the number of non-matching corresponding characters and k is a positive constant. For example, if s1 = “COMPARISON”, s2 = “COMPARE” and k = 2, then HD(s1, s2) = 8.
The above function takes no account of character ordering. For example, in matching “AAA” with “BBB” and with “ZZZ”, the first match appears to be closer and yet HD(“AAA”, “BBB”) = HD(“AAA”, “ZZZ”) = 6. The following function attempts to overcome this problem.
2. Character Distance (CD): the total of the absolute ASCII code differences between corresponding characters in the two strings. Let string s1 = c1_l c1_{l-1} ... c1_2 c1_1 have ASCII codes a1_l a1_{l-1} ... a1_2 a1_1, and similarly for s2; then

CD(s1, s2) = Σ_i |a1_i − a2_i|.

Strings are left aligned and absent characters are treated as nulls. For example, if s1 = “SET” and s2 = “CASE”:
CD(s1, s2) = |ascii(‘S’) − ascii(‘C’)| + |ascii(‘E’) − ascii(‘A’)| + |ascii(‘T’) − ascii(‘S’)| + |0 − ascii(‘E’)|
= |83 − 67| + |69 − 65| + |84 − 83| + |0 − 69| = 90
3. Character Value (CV): the string is considered as a number with base equal to the cardinality of the character set. Let s = c_l c_{l-1} ... c_2 c_1 be a string of length l with ASCII codes a_l ... a_2 a_1 and define

ξ(s) = a_1 + m·a_2 + m^2·a_3 + ... + m^(l−1)·a_l

where m is the size of the character set (256 here). Let s1 and s2 be two strings; then CV(s1, s2) = |ξ(s1) − ξ(s2)|. For example, if s1 = “ACDE” and s2 = “ABC”:
ξ(s1) = 256^0·69 + 256^1·68 + 256^2·67 + 256^3·65 = 69 + 17408 + 4390912 + 1090519040 = 1094927429
ξ(s2) = 256^0·67 + 256^1·66 + 256^2·65 = 67 + 16896 + 4259840 = 4276803
so CV(s1, s2) = 1090650626.
In practice ξ(s) may be very large for long strings, too large to be represented using language provided integer data types. In [15], where this function is presented, integer overflow is avoided because the algorithm comparing strings compares and searches for one character at a time using a single character distance function. This problem also occurs when this function is used for string ordering.
4. Levenshtein Distance (LD): a measure of the similarity between two strings, in terms of the number of deletions, insertions, or substitutions required to transform one string into another. For example, LD(“TEST”, “TOT”) = 2, because one deletion and one substitution are required.
All methods of the form String.EndsWith, String.StartsWith and Substring ultimately use the string Equals cost formula and so the cost functions given above can be used.
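The four equality cost functions above can be sketched compactly. The following is an illustrative reconstruction, not the authors' implementation; absent characters are treated as nulls (code 0), as in the CD definition, and k defaults to 2 as in the examples.

```python
def hamming_distance(s1, s2, k=2):
    """HD: k times the number of non-matching corresponding characters."""
    n = max(len(s1), len(s2))
    code = lambda s, i: ord(s[i]) if i < len(s) else 0  # absent char -> null
    return k * sum(1 for i in range(n) if code(s1, i) != code(s2, i))

def character_distance(s1, s2):
    """CD: sum of absolute ASCII differences of corresponding characters."""
    n = max(len(s1), len(s2))
    code = lambda s, i: ord(s[i]) if i < len(s) else 0
    return sum(abs(code(s1, i) - code(s2, i)) for i in range(n))

def character_value(s1, s2):
    """CV: strings read as base-256 numbers; floats guard against overflow."""
    xi = lambda s: sum(ord(c) * 256.0 ** i for i, c in enumerate(reversed(s)))
    return abs(xi(s1) - xi(s2))

def levenshtein(s1, s2):
    """LD: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (c1 != c2)))
        prev = cur
    return prev[-1]
```

On the worked examples, these give HD(“COMPARISON”, “COMPARE”) = 8, CD(“SET”, “CASE”) = 90, CV(“ACDE”, “ABC”) = 1090650626 and LD(“TEST”, “TOT”) = 2.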
3.2 String Ordering
Here the problem is to return a numeric value that measures the extent to which one string is less than, or precedes, another, which is usually interpreted according to lexicographical ordering. The previous character value function can be adapted as follows. Let s1 and s2 be two strings of length l with ASCII sequences a1_l ... a1_2 a1_1 and a2_l ... a2_2 a2_1. Then

cost(s1 ≤ s2) = (a1_1 − a2_1) + m·(a1_2 − a2_2) + m^2·(a1_3 − a2_3) + ... + m^(l−1)·(a1_l − a2_l)

where m is the size of the character set (256 here). As an example, cost(“ACDE” ≤ “ABC”) becomes, after padding with a null, cost(“ACDE” ≤ “ABC0”):
256^0·(69 − 0) + 256^1·(68 − 67) + 256^2·(67 − 66) + 256^3·(65 − 65) = 69 + 256 + 65536 + 0 = 65861
This cost function may produce very large values. Consider the problem of comparing cost(“AAAAAAAAAA”, “XXXXXXXXXX”) and cost(“AAAAAAAAAA”, “MXXXXXXXXX”); both these values are too large for 64 bit integers. It seems as if large integer values must be accommodated if the above string comparisons are to be correctly costed. In the implementation used for this paper, costs were represented by floating point values up to a maximum of 1.7 × 10^308, which allows strings up to a length of 80 characters to be costed, but with some loss of precision.
3.3 Cost Of Matching Regular Expressions
In principle, a cost function for matching a given string to a regular expression can be derived from the cost function for string equality as follows. A regular expression denotes a set of strings. If the given string is a member of this set then the cost of the match should be zero. If the given string is not a member of the set, it seems plausible to define the match cost as the lowest equality cost of the given string with any string in the regular expression set. The computational cost of such a cost function is of the order of the size of the regular expression set, which is unacceptably high. Muzatko [13] gives an algorithm for constructing a FSM that accepts strings with up to a defined number of mismatches. The algorithm constructs a non-deterministic machine containing l copies of the regular expression machine, where l is the maximum number of mismatches to be detected, but the algorithm complexity is exponential.
An acceptable cost function would have a computational complexity of order equal to the size of the given string. The idea behind the proposed cost function is to parse the given string using a finite state machine, in much the same way as would be done to check its membership of the regular set. Instead of the finite state machine producing a simple accept or reject output, however, the machine computes a cost by counting “mismatched” state transitions.
The example in Figure 2 shows how the cost of match(“acb”, (ab)*) may be computed. E is the machine that recognizes (ab)*. E' is the machine that computes the cost of matching any string with (ab)*. A distinction is made between not executing a transition because the current input character is not present in any acceptable string and not executing a transition because the current input character is in the wrong position in the input string, i.e. the machine E has an acceptable transition but not at the current state.
k is the cost of a mismatched transition resulting from an input symbol not present in any regular string and k/2 is the cost of a mismatched transition that results from an input symbol present in some regular string.
Where a transition has an output, k or k/2, this value is added to the current cost. In E', match(“acb”, (ab)*) begins with the character 'a' and produces a zero cost transition; the character 'c' leads to a k cost transition since 'c' is not in the alphabet of the regular set, followed by a zero cost transition due to 'b', to give a total cost of k. Match(“bca”, (ab)*) produces a cost of k/2 + k to the end of the string “bca”, followed by the empty transition to the final state at a cost of k/2, to give a total cost of 2k.
In general, E' is defined as follows. Let S be a set of strings and E be a finite state machine defining some regular subset of S. Let A be the alphabet for all strings in S. Let A_E be the alphabet for all strings in E. Using E, define another machine E' with the same states to accept any string in S, and in the process compute a cost C_E' for that string. The machine is defined as follows. For each state, add a transition from that state to itself to be traversed for any character in A\A_E. Whenever this transition is traversed, k (say the number of states in E) should be added to the value
Figure 2: The finite state machine E' computes the cost of matching a string with the set defined by E. (Diagram: E accepts (ab)*; E' adds transitions with cost k/2 for the characters a, b out of position, cost k for characters in A\{a, b}, and an empty transition to the final state with cost k/2; alphabet A = {a, b, c}.)
computed by C_E'. For each state in E, for each character c in A_E where no transition is defined at that state, add for input c a transition from the state to every other state to which a transition in E is defined for c. These transitions cause k/2 to be added to the value computed by C_E'. There is also a transition from each nonfinal state to the final state; it consumes no input but k/2 is added to the cost.
The number of states in E' is equal to the number in E, but E' has O(n^2) additional transitions in the worst case. Although E' is non-deterministic, which means that the number of states in the equivalent deterministic machine is O(2^k) in the worst case, it has fewer states than the machine proposed by Muzatko, which is O(2^lk) in the worst case.
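To make the construction concrete, the following sketch simulates a minimum-cost run of E' for E = (ab)* over the alphabet A = {a, b, c}, with k taken as the number of states in E as the text suggests. This is an illustrative reconstruction, not the authors' implementation.

```python
K = 2.0  # k = number of states in E, as suggested in the text

# E for (ab)*: state 0 (start and final) --a--> state 1 --b--> state 0
E_TRANS = {(0, 'a'): 1, (1, 'b'): 0}
E_ALPHABET = {'a', 'b'}
FINAL = {0}

def match_cost(s, k=K):
    """Minimum-cost run of the nondeterministic cost machine E' on s."""
    INF = float('inf')
    cost = {0: 0.0}  # reachable states -> cheapest cost so far
    for ch in s:
        nxt = {}
        for state, c in cost.items():
            if ch not in E_ALPHABET:
                # self-loop for characters outside E's alphabet, cost k
                nxt[state] = min(nxt.get(state, INF), c + k)
            elif (state, ch) in E_TRANS:
                # ordinary E transition, cost zero
                t = E_TRANS[(state, ch)]
                nxt[t] = min(nxt.get(t, INF), c)
            else:
                # ch accepted somewhere in E but not at this state: cost k/2
                for (st, sym), t in E_TRANS.items():
                    if sym == ch:
                        nxt[t] = min(nxt.get(t, INF), c + k / 2)
        cost = nxt
    # empty transition from a non-final state to the final state costs k/2
    return min(c + (0.0 if st in FINAL else k / 2) for st, c in cost.items())
```

This reproduces the worked examples: match_cost(“acb”) = k and match_cost(“bca”) = 2k, while members of (ab)* cost zero.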
4 Genetic Algorithms
The above cost functions can be used to guide the search in a genetic algorithm. A genetic algorithm (GA) is a search algorithm based on principles from natural selection and genetic reproduction [5], [4]. GAs have been applied successfully to a wide range of applications including optimization, scheduling and design problems. Key features that distinguish GAs from other search methods include:

1. A population of individuals, where each individual represents a potential solution to the problem to be solved.

2. A fitness function which evaluates the utility of each individual as a solution. In genetic algorithm search, as used in this paper, a cost function is called a fitness function and is used to rank the candidate solutions in the population for selection for crossover, producing a new generation that will (hopefully) be better.

3. A selection function which selects individuals for reproduction based on their fitness.

4. Genetic operators that alter selected individuals to create new individuals for further testing. These operators, e.g. crossover and mutation, attempt to explore the search space without completely losing information (partial solutions) that has already been found.

Figure 3 shows the basic steps of a GA. First the population is initialized, either randomly or with user-defined individuals. The GA then iterates through an evaluate, select and reproduce cycle until either a user defined stopping condition is satisfied or the maximum number of allowed generations is exceeded.
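The evaluate, select and reproduce cycle can be sketched generically as follows. Truncation selection and the integer toy problem are illustrative assumptions for this sketch, not details taken from the paper.

```python
import random

def ga_search(cost, random_individual, crossover, mutate,
              pop_size=60, max_gens=100000):
    """Generic GA loop: stop when some individual reaches cost zero."""
    population = [random_individual() for _ in range(pop_size)]
    for _ in range(max_gens):
        population.sort(key=cost)
        if cost(population[0]) == 0:
            return population[0]               # test goal satisfied
        parents = population[:pop_size // 2]   # truncation selection (assumed)
        offspring = []
        while len(offspring) < pop_size:
            a, b = random.sample(parents, 2)
            c1, c2 = crossover(a, b)
            offspring += [mutate(c1), mutate(c2)]
        population = offspring[:pop_size]
    return None  # generation limit exceeded

# Toy usage: search for the integer 42.
random.seed(0)
best = ga_search(cost=lambda x: abs(x - 42),
                 random_individual=lambda: random.randint(0, 100),
                 crossover=lambda a, b: ((a + b) // 2, a),
                 mutate=lambda x: x + random.choice([-1, 0, 1]))
```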
5 Outline of String Data Generation using a GA
5.1 String Representation
It is necessary to distinguish between the program input test data type in a programming language (the phenotype), which includes various types including strings, and the encoded representation of an individual solution, the chromosome or genotype (often also called a string), that is used in the GA. Figure 4 illustrates this concept: the phenotype is the representation which is evaluated and the genotype is the representation which is manipulated by the GA. For the purpose of this investigation, two genotype forms were used. One form is identical to
Figure 3: General flowchart of string test data generation using a GA. (Flowchart: initialize the population with random input string values; evaluate by executing the test program on the inputs; if the branch is satisfied, report success; otherwise select candidate inputs and use crossover and mutation to produce new inputs.)
Figure 4: Genotype and phenotype. (Diagram: the GA manipulates a character or binary genotype via crossover and mutation; the phenotype is executed by the test program to compute fitness.)
the phenotype, i.e. the string data type; the other form is a binary string formed by concatenating the binary forms of each character of the string. The genetic operators are aware of the basic data type of each part of the genotype and so string and numeric data can be represented in a single genotype.
5.2 Crossover
Crossover is a genetic operator that combines (mates) two chromosomes (parents) to produce new chromosomes (offspring).
5.2.1 Character Crossover
Character crossover operates on the character sequence genotype form. It selects a crossover point within a chromosome and then interchanges the two parent chromosome segments at this point to produce two new offspring.
Consider the following two parents which have been selected for crossover. The | symbol indicates the randomly chosen crossover point, which may extend beyond the length of the shorter string.
Parent1 = “DATA |”.
Parent2 = “GENER|ATION”.
The offspring are:
Offspring1 = “DATA |ATION”.
Offspring2 = “GENER|”.
Character crossover does not introduce any new characters; the following crossover does introduce new characters.
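A sketch of this operator (an illustrative reconstruction; when the crossover point falls beyond the end of the shorter parent, its contribution past the end is simply empty):

```python
import random

def character_crossover(p1, p2, point=None):
    """Single-point crossover on the character genotype."""
    if point is None:
        point = random.randint(0, max(len(p1), len(p2)))
    return p1[:point] + p2[point:], p2[:point] + p1[point:]
```

With the parents above and point 5 this yields the offspring “DATAATION” and “GENER”.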
5.2.2 Binary String Crossover
Multipoint crossover is used. In this crossover each character is converted to binary and then single point crossover is applied between each pair of corresponding binary characters. The following is a multipoint crossover example:
p1 = “EBWU”.
p2 = “RSCD”.
The characters are changed to binary:
parent1 = “100010|1 10000|10 101|0111 1010|101”.
parent2 = “101001|0 10100|11 100|0011 1000|100”.
offspring1 = “100010|0 10000|11 101|0011 1010|100”.
offspring2 = “101001|1 10100|10 100|0111 1000|101”.
Then, after changing binary back to characters:
offspring1 = “DCST”.
offspring2 = “SRGE”.
Note that the offspring contain a new character, G, that appears in neither parent.
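A sketch of the per-character binary crossover on the 7-bit ASCII genotype (an illustrative reconstruction; the cut points in the worked example above are 6, 5, 3 and 4 bits from the left):

```python
import random

def binary_crossover(p1, p2, bits=7, points=None):
    """Single-point crossover within each corresponding character pair."""
    cuts = points or [random.randint(1, bits - 1) for _ in p1]
    o1, o2 = [], []
    for c1, c2, cut in zip(p1, p2, cuts):  # parents assumed equal length here
        b1 = format(ord(c1), '0{}b'.format(bits))
        b2 = format(ord(c2), '0{}b'.format(bits))
        o1.append(chr(int(b1[:cut] + b2[cut:], 2)))  # prefix of p1, suffix of p2
        o2.append(chr(int(b2[:cut] + b1[cut:], 2)))
    return ''.join(o1), ''.join(o2)
```

With parents “EBWU” and “RSCD” and cut points 6, 5, 3, 4 this yields the offspring “DCST” and “SRGE”.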
5.3 Mutation
Mutation is a genetic operator that alters one or more gene values in a chromosome from its initial state. A random character is selected for mutation and converted to binary, of which one random bit is modified.
6 Experimental Investigation
In order to investigate the effectiveness of the four cost functions and two genotype representations,
a number of short programs were instrumented to compute the cost of string predicates. These programs are shown in Figures 8, 9, 10, 11 and 12. No examples of regular expression matching were included as this cost function has yet to be implemented.
The aim of each experiment is to find input data to execute all the branches in the programs. The initial input sets were randomly generated across the range of ASCII codes 0 to 255. A run with a population of 60 was made, with a maximum of 100000 generations before stopping. The average result of 30 runs is shown in Table 1.
7 Experimental Results and Discussion
The results in Table 1 show that the character distance cost function is clearly better than the other three. For all programs, it is the best performing function. Comparing the two crossover operators, the numbers of offspring are close to each other and there is no clear winner. Note that although the character crossover operator produces no new characters, the mutation operator does.
8 Improved Search Operations
In the previous part of the paper, the cost functions and genetic search operators were selected to be independent of any particular program under test. In the rest of the paper, domain and program specific search operations are considered in an attempt to improve the search performance.
In some programs, the target string for a search may be available from the program text. For example, in the simple program
function f(s:string) {
if (s.Equals("CHILD")) {
...
}
}
The first branch of this program is true when s = "CHILD". This suggests a heuristic for guiding the search for values of the string s, namely, set s to a string literal that appears in the program under test. That this heuristic may fail is clear from the example below
function f1(s:string) {
s=s+"D";
if (s.Equals("CHILD")) {
...
}
}
Although neither of the two string literals "D" nor "CHILD" is an input value that executes the target branch, they do provide reasonable starting points for a guided search.
To generalize somewhat, consider the following program fragment
function f(s:string) {
s=op(s);
if(s=="AC") {
// execution required
}
}
where op is a string operation that returns a string, possibly different from its input. Employing the heuristic of using string literals from the program as starting points for the search for the target branch, the search should begin with s = "AC". If op returns a string equal to its input, then f("AC") will execute the target branch and the heuristic has produced a solution immediately. A more likely situation, however, is that op will return a string that is not equal to its input. Assume, for example, that op reverses its input, in which case f("CA") executes the target branch. To understand the implications of this, consider a search space that consists of 9 strings only, as shown in Figure 6. The edges
of the bidirectional graph indicate possible string transformations by a search operator that may only "increment" or "decrement" a single character in the string. The graph shows that a minimum of 4 applications of the search operator are required to find the solution when the search begins from "AC", as it would do when employing the heuristic of "seeding" the search using program literals. In this example, the heuristic provides the worst possible seed for the search, as the minimum distances from the other seeds are all shorter.
The search space may be modified, however, by introducing an additional search operator. Figure 7 shows the space produced when an additional "reverse" search operator is introduced. The minimum number of applications of a search operator necessary to transform the seed string "AC" to the solution "CA" is now just one. Overall, the mean number of operations is reduced since, although the addition of the reverse operator increases the number of edges, they are all shortcuts on paths from "AC" to "CA" and so the search distance is always reduced.
The choice of the additional search operator in this example was clearly motivated by the knowledge that op reverses its input in the program under test. In practice op is, of course, unknown. Even so, some information about the sort of operation that is performed by op could reduce the search space. In general, additional genetic search operators (e.g. mutation and crossover operators) can be constructed to perform typical string operations that can be found in programs that operate on string data.
This has motivated the introduction of additional genetic operators. These operators require access to the string literals of the program, and so the random string generator was modified to bias generation towards these string literals. The test data generation tool used was modified to collect string literals from the test program.
Initial population individuals were generated by selecting randomly from the domain in Figure 5. There is a bias towards selecting strings that appear as literals in the program. This is achieved by setting a 5% probability that a random string is selected from
Figure 5: The string domain: the subdomain of literals from the program (e.g. "D", "CHILD") within the domain of all character sequences; there is a higher probability of selecting a string from the literals set.
the subdomain of literals rather than the domain of all character sequences.
The swap mutation operator exchanges two characters in the string. The insertion mutation operator selects an insertion point within the string with a probability of 10% and an insertion point at one or other end with a probability of 90%. A random string is inserted, with a 90% bias towards the program literal string set. The bias towards the ends of the string is intended to reflect the common use of the string concatenation operator. The deletion mutation operator deletes from the given string a string selected randomly from the program literal string set, if such a string is present; otherwise it deletes a random character.
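The swap and deletion operators can be sketched as follows (an illustrative reconstruction: positions and literals are passed explicitly here, whereas the tool chooses them at random, and the deletion fallback below removes the last character rather than a random one):

```python
def swap_mutation(s, i, j):
    """Exchange the characters at positions i and j."""
    chars = list(s)
    chars[i], chars[j] = chars[j], chars[i]
    return ''.join(chars)

def deletion_mutation(s, literals):
    """Delete an embedded program literal if one is present,
    otherwise delete a character (the tool picks one at random)."""
    for lit in literals:
        if lit and lit in s:
            return s.replace(lit, '', 1)
    return s[:-1]
```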
The introduction of new search operators has implications for the cost functions, as illustrated by the previous discussion. For example, in matching "ABCD" with "ABXCD" and with "ABXYZ", the first match appears to be closer (the Levenshtein distance is only 1) and yet
HD("ABCD", "ABXCD") = 6
HD("ABCD", "ABXYZ") = 6.
The idea that "ABXCD" is a relatively close match to "ABCD" is based on the possibility of transforming "ABXCD" to "ABCD" using a single character deletion. In contrast, three character deletions and two insertions are required to transform "ABXYZ" to "ABCD".
The following two functions were motivated by an attempt to more accurately cost the extent to which a string requires transformation by the available search operations.
1. Member Hamming Distance (MHD): similar to Hamming Distance but sensitive to characters that are present but in the incorrect position; each non-matching and absent character counts k, each non-matching but present character counts k/2. For example, if s1 = "ABCD", s2 = "ABXCD", s3 = "ABXYZ" and k = 2:
MHD(s1, s2) = 0 + 0 + 1 + 1 + 2 = 4
MHD(s1, s3) = 0 + 0 + 2 + 2 + 2 = 6
2. Member Character Distance (MCD): similar to Character Distance; if a character c1 in s1 is not matched by its corresponding character c2 in s2 and moreover c1 is absent from s2, then count m (the maximum character value) + |ascii(c1) − ascii(c2)|. If a character c1 in s1 is not matched by its corresponding character c2 in s2 but c1 is present in s2, then count |ascii(c1) − ascii(c2)|. The following is an example to illustrate MCD. Let s1 = "ITALY", s2 = "ISLAND" and m = 256:
MCD(s1, s2) = 0 + (|ascii('T') − ascii('S')| + 256) + |ascii('A') − ascii('L')| + |ascii('L') − ascii('A')| + (|ascii('Y') − ascii('N')| + 256) + (|ascii('D') − ascii(' ')| + 256)
= 0 + (|84 − 83| + 256) + |65 − 76| + |76 − 65| + (|89 − 78| + 256) + (|68 − 32| + 256)
= 257 + 11 + 11 + 267 + 292 = 838
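Member Hamming Distance can be sketched as follows (an illustrative reconstruction; MCD is analogous, adding the ASCII difference and an extra m when the character is absent from the other string):

```python
def member_hamming_distance(s1, s2, k=2):
    """k per mismatch where the s1 character is absent from s2;
    k/2 where it is present in s2 but in the wrong position."""
    n = max(len(s1), len(s2))
    total = 0
    for i in range(n):
        c1 = s1[i] if i < len(s1) else None
        c2 = s2[i] if i < len(s2) else None
        if c1 == c2:
            continue
        total += k // 2 if (c1 is not None and c1 in s2) else k
    return total
```

On the worked examples this gives MHD("ABCD", "ABXCD") = 4 and MHD("ABCD", "ABXYZ") = 6.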
Figure 6: The search space of nine strings "AA", "AB", "AC", "BA", "BB", "BC", "CA", "CB", "CC"; the starting point of the search is "AC" and the solution is "CA".
Figure 7: The search space after the addition of a new reverse search operator; the starting point "AC" is now one application away from the solution "CA".
function example1(s1 : String) {
if( s1.Equals("University")) {
// execution required
}
}
Figure 8: Example test program
function example2(s1 : String) {
if (s1 >= "ZZZZZX") {
// execution required
}
}
Figure 9: Example test program
function example3(s1 : String, s2: String,
s3 : String, s4: String) {
String s = s1 + s2;
s =s + s2.reverse() + s4;
if (s2 < s4 )
s.Insert(s.Length/2,s3);
else
s.Remove(1,2);
if (s.Equals("RANDOM")) {
// execution required
}
}
Figure 10: Example test program
function example4(s1 : String) {
if ( s1.Length >20)
s1=s1.Remove(2,5);
else if( s1.Length >10 )
s1=s1.Insert(0,"MICRO");
else if(s1.Length >5 )
s1=s1.Replace("A","B");
else
s1=s1+"ROSOFTDEVLOPMENT";
if ( s1.Equals("MICROSOFTDEVLOPMENT")) {
// execution required
}
}
Figure 11: Example test program
function example5(s1 : String) {
String[] sparts =split(s1,"/");
int k=sparts.Length-1;
if (sparts[k].Contains('.')) {
String[] fileparts =split(sparts[k],".");
String suffix =fileparts[1];
if (suffix == "DOC") {
if (sparts[k - 1] =="WORD") {
print("Word folder contains document");
}
else
print("document not in word folder");
}
if (suffix == "PDF") {
if (sparts[k - 1] == "ACROBAT") {
print("Acrobat folder contains PDF");
}
else
print("document not in Pdf folder");
}
}
}
Figure 12: Example test program
9 Results after introducing string specific search operators

Using these new cost functions and new genetic operators, test data was once again generated for the sample programs, without the use of the biased test data generator. Table 2 shows the results. It is clear that the introduction of the genetic operators alone leads to a slight overall improvement at best. The out-performance of the character distance based function is retained, but Member Character Distance is not quite as effective as simple Character Distance. Member Hamming Distance is a significant improvement over Hamming Distance.
When the biased string generator is introduced, the improvement is significant. The results are shown in Table 3. The good performance for the program in Figure 11 depends heavily on the insertion and deletion operators.
10 Conclusion
This paper presents a test data generation approach for program branch coverage where branch predicates include string predicates. Experiments have been conducted on simple programs containing string predicates. The preliminary experimental results show that the methodology is effective, particularly if string literals from the program under test can be used. Further work is to implement and investigate regular expression matching and also to evaluate the method on a larger class of programs.
References
[1] Beizer B., Software Testing Techniques, 2nd ed., New York: Van Nostrand Reinhold, 1990.
[2] Korel B., Dynamic method for software test data generation, Software Testing, Verification and Reliability 2 (1990), no. 4, 203–213.
[3] Korel B., Assertion-oriented automated test data generation, Proceedings of the 18th International Conference on Software Engineering (1996), 71–80.
[4] Goldberg D. E., Genetic Algorithms in Search, Optimization and Machine Learning, Addison Wesley, 1989.
[5] Holland J. H., Adaptation in Natural and Artificial Systems, University of Michigan Press (1975).
[6] Duran J. and Ntafos S., An evaluation of random testing, IEEE Transactions on Software Engineering 10 (1984), no. 4, 438–443.
[7] King J., A new approach to program testing, Proceedings of the International Conference on Reliable Software (1975), 228–233.
[8] King J., Symbolic execution and program testing, Communications of the ACM 19 (1976), no. 7, 385–394.
[9] Tracey N., Clark J. and Mander K., Automated flaw finding using simulated annealing, International Symposium on Software Testing and Analysis 30 (1998), no. 1, 73–81.
[10] Bottaci L., Predicate expression cost functions to guide evolutionary search for test data, Proceedings of the Genetic and Evolutionary Computation Conference (2003), 2455–2464.
[11] Harman M. et al., Improving evolutionary testing by flag removal, Proceedings of the Genetic and Evolutionary Computation Conference (2002), 1359–1366.
[12] McMinn P., Search-based software test data generation: a survey, Software Testing, Verification and Reliability 14 (2004), no. 2, 105–156.
[13] Muzatko P., Approximate regular expression matching, 1996.
[14] Coward P. D., Symbolic execution and testing, Information and Software Technology 33 (1991), no. 1, 229–239.
[15] Zhao R., Character string predicate based automatic software test data generation, Third International Conference on Quality Software, 2003, pp. 255–263.
[16] Jones B., Sthamer H. and Eyres E., Automatic structural testing using genetic algorithms, Software Engineering Journal 11 (1996), 299–306.
[17] Baresel A., Wegener J. and Sthamer H., Evolutionary test environment for automatic structural testing, Information and Software Technology 43 (2001), no. 14, 41–54.
Table 1: Number of offspring to find a solution, averaged over 30 trials for each of the example programs. String generation is uniform random over all strings up to length 50.

                    Binary crossover                     Character crossover
Program      CD     HD       LD       CV       CD     HD       LD       CV       Best
Figure 8     634    1210     819      916      546    1786     1067     1002     CD
Figure 9     1923   not used not used 1923     1742   not used not used 1742     CD
Figure 10    8787   30369    11322    13202    9378   34968    16280    19652    CD
Figure 11    836    10536    3162     5331     1028   11385    3359     7845     CD
Figure 12    24210  97215    71317    46214    22754  86120    64312    52125    CD
Average      7278   34833    21655    13517    7090   33565    21255    16473    CD
Table 2: Number of offspring to find a solution, averaged over 30 trials for each of the example programs. String generation is uniform random over all strings up to length 50, using Member Character and Member Hamming Distance.

             Binary crossover      Character crossover
Program      MCD    MHD            MCD    MHD            Best
Figure 8     598    1065           496    1413           MCD
Figure 9     1700   not used       1617   not used       MCD
Figure 10    8493   26998          9286   25189          MCD
Figure 11    724    8754           723    5316           MCD
Figure 12    28641  85411          27812  78645          MCD
Average      8031   30557          7987   27641
Table 3: Number of offspring to find a solution when domain specific genetic operators are used, averaged over 30 trials. The string generator is biased towards program string literals; both binary and character crossover are used with equal probability.

Example Program   CD      MCD     HD        MHD       LD        CV
Figure 8          4       4       4         4         3         3
Figure 9          4       4       not used  not used  not used  4
Figure 10         5916    4561    7675      7022      6321      6451
Figure 11         109     91      167       148       132       358
Figure 12         8392    7952    27230     24562     21361     14521
Average           2885    2522    8769      7934      6954      4267
Use of branch cost functions to diversify the search for test data
Leonardo Bottaci
Department of Computer Science, University of Hull, Hull, HU6 7RX, l.bottaci@dcs.hull.ac.uk

August 2, 2005
Abstract
Heuristic search techniques have been used with some success for the automatic generation of program test
data. A problem occurs, however, when the test goal requires the solution of a subgoal which is not included in
the cost or fitness function that guides the search. A particular example of this is the need to execute a particular
program path in order to execute a given branch even though the branch may be reached by a number of other
paths. A method is described by which the search for test data is directed at exploring diverse program paths
when data to satisfy the given test goal is difficult to find. When the search can no longer progress towards
satisfying the test goal, the test goal is augmented to search for input data that not only executes the test goal but
also executes specific branches that are consistent with the test goal but have either not been executed or executed
infrequently. These branches can be identified from the values of the branch predicate cost functions that are
necessarily computed to guide the search. The advantage of this method is that the search explores a greater
variety of execution paths through the program, increasing the likelihood of finding a solution. The method has
been implemented and tested with success on three difficult-to-test programs.
1 Introduction
A test adequacy criterion specifies the extent to which a given program should be tested. In the context of unit
testing, for example, common test adequacy criteria include statement coverage, branch coverage and multiple
condition coverage. In general, the more stringent the test adequacy criterion to which a program has been sub-
jected, the more confidence there is in the correctness of the program. Obviously, a strict adequacy criterion is to be
preferred, but in practice, test cases are usually constructed manually by a tester who may need to spend significant
time analysing the program under test. Consequently, there is much interest in the prospect of generating test data
automatically.
In the context of unit testing, an ideal automatic test data generation tool inputs the program under test and outputs
a set of test data for a given adequacy criterion. Because the general test data generation problem, i.e. the problem
of taking a given program and constructing an input that produces a given program behaviour, is well known to be
undecidable (the halting problem is a special case of this problem), no ideal tool of this kind may be constructed.
Research effort has thus been directed towards heuristic approaches and a number of heuristic search methods
have been investigated (Jones, Sthamer, and Eyres 1996) (Korel 1990) (Korel 1992) (Ferguson and Korel 1996)
(Tracey, Clark, and Mander 1998) (Tracey, Clark, Mander, and McDermid 2000) (Wegener, Baresel, and Sthamer
2001) (Baresel, Sthamer, and Schmidt 2002) (Michael, McGraw, Schatz, and Walton 1997).
A key component of all heuristic search methods is an evaluation function (also known as an objective function or
cost function) that guides the search towards the solution. A cost function provides an evaluation of each point in
the search space in terms of how “close” it is to a solution. As a simple example of a cost function in the context
of software test data generation, consider the problem of searching for test data to execute the target branch in the
following program fragment where v is an integer variable.
...
if (v == 1) {
    //TARGET BRANCH
}
Initially, test inputs are generated randomly. Any input that fails to execute the target branch, i.e. a test case that produces a value in v not equal to 1, may be assigned a cost of abs(v - 1). This cost function is positive for all non-solution points.
The role of the cost function within a heuristic search algorithm is to discriminate between the points in the search
space and thereby identify the best direction in which to explore. In the case of the example program, the search
algorithm would use the guidance of the cost function to generate program inputs with successively lower cost
until, hopefully, a solution is found. In practice, the search may be terminated after a budgeted period of time has
elapsed.
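As an illustrative sketch only (the paper does not prescribe an implementation language, and the function name is ours), the cost function for the fragment above might be written in Python as:

```python
def equality_cost(v, target=1):
    """Branch-distance cost for the predicate v == target: positive for
    every non-solution input, zero when the target branch is taken."""
    return abs(v - target)

# Lower cost means "closer" to executing the target branch.
assert equality_cost(5) == 4
assert equality_cost(2) < equality_cost(5)
assert equality_cost(1) == 0
```

A search algorithm would generate inputs and prefer those with lower cost, terminating when the cost reaches zero.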
2 Test data generation problem
The progress of a test data search algorithm is arrested when it is unable to generate any test input that has a lower
cost than the current best input. In these conditions, the search is said to be trapped on a local optimum or plateau.
As a simple example, consider the problem of searching for test data to execute the target branch in the following
program fragment.
v = 2;
if (x == 0) {
v = 0;
}
if (y == 0) {
v = v + 1;
}
if (z == 0) {
v = v - 3;
}
if (v == 1) {
//TARGET BRANCH
}
In general, in order for a particular program branch to be executed, the control dependency conditions for that
branch must be satisfied. Assuming no other branches are present, the control dependency condition for the target
branch is satisfaction of the predicate v == 1, which leads to the cost function abs(v - 1). In some cases, however,
the control dependency condition cannot be satisfied unless branches that do not appear in the condition are also
executed. In order to execute the target branch, in the above example, it is necessary that x be zero so that v may
be set to 0 and that y also be zero so that v may be set to 1 and that z be nonzero. The simple cost function,
abs(v - 1), is, however, essentially insensitive to the values of x, y and z and therefore cannot guide the search.
If the values of x, y and z at the conditionals are determined randomly then it is plausible that the probability of
satisfying x == 0 and y == 0 is low. In this situation, the execution of the target branch depends on what is
essentially a random search for an input that executes the x == 0 and y == 0 branches without executing the z
== 0 branch.
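The flatness of this cost landscape can be demonstrated directly. The following Python sketch (function names are ours) mirrors the fragment above:

```python
def run_fragment(x, y, z):
    """Mirror of the example fragment above; returns the final value of v."""
    v = 2
    if x == 0:
        v = 0
    if y == 0:
        v = v + 1
    if z == 0:
        v = v - 3
    return v

def target_cost(x, y, z):
    """Cost of executing the target branch: abs(v - 1)."""
    return abs(run_fragment(x, y, z) - 1)

# Unless x or y is exactly zero, v stays 2 and the cost is stuck at 1:
# the cost function offers the search no gradient over x, y and z.
assert target_cost(7, -3, 42) == 1
assert target_cost(100, 5, 9) == 1
# Only the rare combination x == 0, y == 0, z != 0 yields a solution.
assert target_cost(0, 0, 1) == 0
```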
It should be appreciated that in its general form, the problem just illustrated is one that eventually is faced by all
attempts to generate test data for nontrivial programs via heuristic search. Different researchers have tackled this
problem in different ways (Ferguson and Korel 1996) (Baresel and Sthamer 2003) (Harman, Hu, Hierons, Baresel,
and Sthamer 2002). One approach, the chaining method (Ferguson and Korel 1996), is to further analyse the
program under test and thereby improve the cost function. In the case of the above example, data flow analysis
may be used to determine that the target predicate uses a value of v and that furthermore, there are definitions of
v in the first three conditionals that have not been executed, and so executing them is a plausible subgoal. A more
general approach, and the one presented in this paper, is to view the problem as one of inadequate search diversity.
Search methods that maintain, at each step in the search, not one, but many candidate solutions, are inherently better
suited to avoiding the local optimum or plateau trap. If these candidate solutions are widely spaced, sufficiently
diverse, there is a lower chance that they will all lead to the same local optimum. Although the importance of
diversity is recognised it is not easy to achieve since diversity is countered by the convergence process that is
essential for the search to find a solution.
There are two broad approaches to maintaining diversity: domain-independent strategies and domain-specific strategies. The strategy adopted in the work reported here is domain specific and relies on the fact that there is usually more than one path by which an input may execute a program to satisfy the test goal. At conditional
statements, where program inputs may execute either branch without violating the control dependency condition
for the test goal, there is scope for introducing the execution of one particular branch as an additional subgoal to
the test goal. By identifying non-critical branches that have not been executed and introducing them as subgoals
for independent searches, the search is directed to inputs that execute the program differently.
To illustrate this approach, consider again the previous example program and assume that after a period of searching
for data to execute the target branch, the target branch has not been executed. At this stage the search has been
guided only by the cost abs(v � 1). The next step is to suspend the search and identify those branches that have
been reached but not executed and yet may be executed on a path that executes the target branch. The program
under test is instrumented to record the predicate costs at all branches and so this information is available.
In the example program, all the true branches will have been reached but it is unlikely that an input has been able
to satisfy any of them. The reached but unexecuted branches, that are absent from but consistent with the target
search goal, are all candidates for additional subgoals. In this example, the three branches prior to the target branch
are each combined with the target branch goal to produce three additional search goals. Three new searches are
now initiated; the original search remains suspended.
Since it is necessary to satisfy both of the first two conditionals of the program, it is still unlikely that any of these
three new searches will lead to the target. After a period during which the searches make no progress, they will
be suspended and similarly examined to identify suitable additional subgoals. In this way a search is instigated for
an input to satisfy the first two conditionals in addition to the target. It is clear that once the goal of satisfying x
== 0 and y == 0 is included in a cost function, the search may progress satisfactorily.
In general, the aim is to direct the search into as yet unexplored areas of the input domain by changing the program
behaviour at specific branches. To explain the method in more detail, the following section defines the basic
concepts of program execution and is followed by a short introduction to genetic algorithm search. A detailed
description of the branch diversity method is then presented as an algorithm. Following that, some examples are
worked through in detail. The branch diversity search algorithm has been implemented and the results obtained for
the example program above and two other example programs are given.
3 Definitions
3.1 Program structure
A program control flow graph is a directed graph CFG = (N, E, s, e), where N is a set of nodes, E is a set of edges, i.e. pairs from N, s is a unique start node and e a unique exit node. Nodes in N may correspond to the
statements or basic blocks of a program. An edge exists from one node to another (i.e. from a node to its control
flow successor) if and only if execution of one node may immediately be followed by execution of the other. Those
nodes in N that have more than a single control flow successor are called conditional nodes and correspond to
if-then statements, while statements and the like. Without loss of generality, it is assumed that a conditional node
has exactly two successors. The outgoing edges from a conditional node are called branches, and are labelled, one
with the symbol T, the other with the symbol F. Each conditional node is associated with a predicate expression.
Program execution flows along the branch labelled T when the predicate expression is true and along the branch
labelled F when the predicate expression is false.
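A minimal encoding of these definitions, sketched in Python (the representation and the example node names are our assumptions, not ones used in the paper):

```python
from dataclasses import dataclass, field

@dataclass
class CFG:
    """A control flow graph (N, E, s, e) with labelled branches."""
    nodes: set
    edges: set                 # pairs (src, dst) drawn from N
    start: str                 # the unique start node s
    exit: str                  # the unique exit node e
    branches: dict = field(default_factory=dict)   # conditional -> {'T': ..., 'F': ...}

# The first example fragment: one conditional node guarding the target branch.
g = CFG(
    nodes={'s', 'c1', 'target', 'e'},
    edges={('s', 'c1'), ('c1', 'target'), ('c1', 'e'), ('target', 'e')},
    start='s',
    exit='e',
    branches={'c1': {'T': 'target', 'F': 'e'}},
)

# A conditional node has exactly two successors, one per branch label.
assert {dst for src, dst in g.edges if src == 'c1'} == {'target', 'e'}
assert set(g.branches['c1']) == {'T', 'F'}
```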
A path in the CFG is a sequence of nodes < n0, ..., ni, ..., nk > such that a node may directly follow another in the path only if it is a control flow successor of that node. An executable path is a path where n0 = s and nk = e.
An executable path is feasible if some program input may execute it. A branch is said to be executed if the first
and second nodes of the branch are adjacent in an execution path. A set of branches is executable iff there is an
execution path that executes them. A branch is said to be reached if the first node of the branch is executed, i.e. a
member of an execution path.
The program control dependency graph (Ferrante, Ottenstein, and Warren 1987; Harrold and Rothermel 1996) may
be derived from the program control flow graph. A node x dominates a node y (x ≠ y) if x is a node through which all executable paths to y must pass. A node y post-dominates a node x (x ≠ y) if every executable path that passes through x contains y. Let x and y be nodes in a control flow graph. y is control dependent on x if there is a path p from x to y with all nodes z in p (excluding x and y) post-dominated by y, but x is not post-dominated by y.
The control dependency relation may be used to construct the control dependency graph. In this graph, the node
s is control dependent on a root node entry, as is the node e, assuming the program terminates. All other nodes
(assumed to be reachable) are directly or indirectly control dependent on entry. A path in the control dependency
graph from entry to a given node x defines a sequence (possibly empty) of branches that, if executed, will lead
to the execution of x. The conjunction of predicate conditions along a control dependency path to x is known
as a control dependency path condition. There may be more than one control dependency path to x and since a
path that reaches x must satisfy one of the control dependency path conditions for x, the disjunction of the control
dependency path conditions for x is the control dependency condition for x. The control dependency path (control
dependency condition) for a branch may be defined as the control dependency path (control dependency condition)
of the second node of the branch.
The method described in this paper requires searching for inputs that execute a set of branches as a means to
execute a target branch. These sets are produced by combining control dependency paths. Clearly, these sets of
branches should be executable. To appreciate the implications of this, consider the following example.
if (P) {
if (R) {
...;
}
if (Q) {
return;
}
}
if (T) {
//TARGET BRANCH
}
Assume that the test goal is to execute the branch T. The control dependency condition for T is (not P, T) or (P, not Q, T). To maintain diversity, each of the two disjuncts (control dependency path conditions) is used to guide a search in a separate population, which is to say that one search is made to find an input to satisfy (not P, T) and another is made to satisfy (P, not Q, T). Consider now the problem of deciding if R may be added as an additional subgoal to either of these search goals. It is necessary to determine if R and the branches in the two control dependency paths are executable.
It is easy to establish if two branches are jointly executable by examination of the transitive closure of the control
flow graph. In this graph, each node has a reach set consisting of all the nodes reachable from that node. A branch
may be said to be reachable from a given branch if the reach set of the second node of the given branch includes
the first node of the other branch.
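This reachability test can be sketched in Python using Warshall's algorithm for the transitive closure (the node names below are illustrative, loosely following the P, R, Q, T example):

```python
def reach_sets(nodes, edges):
    """Transitive closure by Warshall's algorithm: each node is mapped to
    the set of nodes reachable from it in one or more steps."""
    reach = {n: {b for (a, b) in edges if a == n} for n in nodes}
    for k in nodes:                      # intermediate node
        for n in nodes:
            if k in reach[n]:
                reach[n] |= reach[k]
    return reach

def branch_reachable(reach, b1, b2):
    """b2 may follow b1 if the reach set of b1's second node contains
    b2's first node (or they coincide)."""
    return b2[0] == b1[1] or b2[0] in reach[b1[1]]

# Skeleton of the example above: p guards r and q, t follows.
nodes = {'s', 'p', 'r', 'q', 't', 'e'}
edges = {('s', 'p'), ('p', 'r'), ('p', 't'), ('r', 'q'),
         ('q', 't'), ('q', 'e'), ('t', 'e')}
reach = reach_sets(nodes, edges)
assert branch_reachable(reach, ('p', 'r'), ('q', 't'))       # R may precede T
assert not branch_reachable(reach, ('q', 't'), ('p', 'r'))   # T cannot precede R
```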
Given a set of branches, a brute-force algorithm that examines all sequences of branches is computationally expensive. Considering how this expense might be avoided, notice that in the example above, it might be assumed that it is necessary to establish only that R and the target branch T are executable, since a control dependency path is always executable. In fact, although R and T are executable, and also the branches P, R, not Q, T, it does not follow that all control dependency paths to T are executable with R: R, not P, T is not.
Multiple control dependency paths exist only in the presence of explicit transfer of control statements such as
return, break and continue statements [1], and so the majority of control dependency conditions will consist of a
single path. Nonetheless, even in cases where a target branch is control dependent on multiple control dependency
paths it is possible to establish that the target branch together with a particular control dependency path and some
given branch are executable without examining all sequences of the branches involved.
This is justified by the following observation. Let T be a target branch and let p be one of the control dependency
paths to T and let R be a branch not in p. If the branches in p together with R are executable then these branches
may be placed in a sequence in which each successor branch is reachable from its predecessor and moreover, the
ordering of branches in p is preserved. This means that to establish the executability of p and R, it is necessary to
examine only sequences of these branches in which the ordering in p is preserved rather than examine all possible
sequences.
The justification for this observation is as follows. If p together with R is executable then there exists a sequence
of these branches, s, in which each successor branch is reachable from its predecessor. Let the set of branches p' be the subset of p that occurs in s prior to the first occurrence of R. Let b be the last branch in p' as defined by the ordering in p. It follows that the prefix of p that ends in b, followed by R, is executable. Note that although the predecessor of R in s may precede b in p, there is a path through all the branches of p and therefore from this predecessor to b.
[1] In the absence of explicit transfer of control statements, there are two control dependency paths to a loop header node, but one of the paths is produced by adding a back edge to the path to the loop body; as such it does not provide an alternative path to the loop header and thus is not used as a search guide.
If b is not the last node of p, then let c in p be the successor of R in s. From c it is possible to reach all the branches in p less the branches in the prefix up to b (recall the existence of s), and hence it is possible to reach the successor of b in p. It follows that R together with p less the prefix that ends in b is executable.
Returning to the above example, adding R to the path (not P, T) requires examination of the sequences obtained by merging the control dependency paths (not P, T) and (P, R), the latter being the control dependency condition for R. This produces the sequences (not P, P, R, T), (not P, P, T, R), (P, not P, R, T), (P, not P, T, R), (P, R, not P, T) and (not P, T, P, R), none of which are executable. Merging the control dependency paths (P, not Q, T) and (P, R) produces the sequence (P, P, R, not Q, T), which is executable.
In the above example, the control dependency condition for R consists of a single path. In general, there will be
multiple control dependency paths to a branch such as R and subject to executability, a single path from the set of
paths to R is combined with a single path to the target branch to form the search goal of a single population. Again,
executability of two control dependency paths can be established by examination of branch sequences in which the
ordering of branches in each control dependency path is preserved. This is because the argument that applies to
any prefix of the control dependency path to the target branch also applies to any prefix of the control dependency
path to the branch R.
3.2 Branch cost functions
To use a control dependency path or any set of branches as a search goal, it is necessary to determine the cost
values for each branch predicate. To do this, each conditional node in the program is associated with a real-valued
predicate cost function that is evaluated whenever the conditional node is executed. This predicate cost function
returns a positive value whenever the predicate is false and a negative value if the predicate is true. The cost of
an evaluation of a logical negation of a predicate is the arithmetic negation of the cost of the evaluation of the
predicate.
Each reached branch maintains two cost values, both derived from the associated predicate cost function. One cost value measures the requirement that all attempts to execute the branch are successful; this is called the cumulative and-cost. The other measures the requirement that at least one attempt is successful; this is called the cumulative or-cost. These costs can be illustrated with an example showing three failed and two successful attempts to satisfy the predicate a <= b for various integer values of a and b. The predicate cost function is a - b when the predicate is false, and a - b - 1 when the predicate is true. The cost values produced by relational predicates are normalised, but the unnormalised values are used in the table below in order to show the arithmetic more clearly.
Table 1: Cumulative or-cost and and-cost for the predicate a <= b for the values listed.

  a    b    cost    and-cost    or-cost
  4    1       3           3          3
  3    1       2           5        6/5
  2    1       1           6       6/11
  1    1      -1           6         -1
  1    2      -2           6         -3
The cost of a conjunction of two false costs is the sum of the costs of the conjuncts. The cost of a disjunction of two false costs is pq/(p + q), where p and q are the disjunct costs. If only one cost is false, the conjunction cost is this false cost;
the disjunction cost is the true cost. The motivation for these functions is given in (Bottaci 2003). The relevant
property of these cost functions for the work presented here (as can be seen in Table 1) is that the cumulative
and-cost increases with each failure to execute the predicate and the cumulative or-cost decreases.
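These combination rules can be checked mechanically. In the Python sketch below, the rules for combining two true (negative) costs are not stated in the text and are assumed here to be the duals of the stated rules; with that assumption the sketch reproduces the and-cost and or-cost columns of Table 1 (reading the final attempt as a = 1, b = 2):

```python
def cost_and(p, q):
    """Conjunction of two predicate costs (positive = false, negative = true)."""
    if p > 0 and q > 0:
        return p + q                 # both false: failures accumulate
    if p > 0 or q > 0:
        return max(p, q)             # one false: the false cost dominates
    return (p * q) / (p + q)         # both true: assumed dual of the or-rule

def cost_or(p, q):
    """Disjunction of two predicate costs."""
    if p > 0 and q > 0:
        return (p * q) / (p + q)     # both false
    if p > 0 or q > 0:
        return min(p, q)             # one false: the true cost is taken
    return p + q                     # both true: assumed dual of the and-rule

# Unnormalised cost for a <= b: a - b when false, a - b - 1 when true.
def predicate_cost(a, b):
    return a - b if a > b else a - b - 1

attempts = [(4, 1), (3, 1), (2, 1), (1, 1), (1, 2)]
costs = [predicate_cost(a, b) for a, b in attempts]     # 3, 2, 1, -1, -2

and_cost = or_cost = costs[0]
history = [(and_cost, or_cost)]
for c in costs[1:]:
    and_cost, or_cost = cost_and(and_cost, c), cost_or(or_cost, c)
    history.append((and_cost, or_cost))

assert [a for a, _ in history] == [3, 5, 6, 6, 6]       # cumulative and-cost
assert [round(o, 4) for _, o in history] == [3, 1.2, round(6/11, 4), -1, -3]
```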
Note that when both branches at a conditional node have been executed the and-cost is positive and the or-cost is
negative. Moreover, the magnitude of the and-cost is an indication of the number and magnitude of the failures
to satisfy the predicate. A high and-cost indicates that the predicate has rarely been satisfied. A low and-cost indicates that the failures to satisfy the predicate have been few or marginal.
The cost of a search goal is calculated, according to the formula for conjunction given earlier, as the conjunction of the individual branch goal costs. Each individual branch cost is either a branch or-cost or a branch and-cost. With this method there is the danger that a single large branch cost may dominate the overall cost value. Normalisation of costs reduces this risk. An alternative method, not used in the work reported here, is to compute a cost consisting of two components. One component, the most significant, counts the branch goals that have yet to be
satisfied. This cost component is the analogue of the approximation level used by Wegener et al. (Wegener, Baresel,
and Sthamer 2001) and Baresel et al. (Baresel, Sthamer, and Schmidt 2002). The second component is applicable
only if the first component is nonzero and is calculated as the disjunction of the unsatisfied branch goals.
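A sketch of this alternative two-component cost in Python (the tuple encoding, and the use of lexicographic comparison, are our assumptions; false costs are assumed strictly positive):

```python
def two_component_cost(goal_costs):
    """Lexicographic cost: (number of unsatisfied branch goals, disjunction
    of their costs).  A goal is satisfied when its cost is negative."""
    unsatisfied = [c for c in goal_costs if c >= 0]
    if not unsatisfied:
        return (0, 0.0)
    disj = unsatisfied[0]
    for c in unsatisfied[1:]:
        disj = (disj * c) / (disj + c)    # disjunction of false costs
    return (len(unsatisfied), disj)

# Satisfying one more goal always outranks any improvement in magnitude:
assert two_component_cost([5.0, -1.0]) < two_component_cost([0.5, 0.5])
assert two_component_cost([-2.0, -1.0]) == (0, 0.0)
```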
For each branch, there are two associated branch search goals that may be specified to guide a search, namely
branch-or (on at least one occasion that the branch is reached, it is executed) and branch-and (on every occasion
that the branch is reached, it is executed). A branch goal is satisfied if the associated or-cost or and-cost is negative.
If execution of a branch is required to satisfy branch coverage or a control dependency condition then branch-or is
the relevant branch goal. When a search goal is augmented with additional branch goals in order to guide the search
to new execution paths, these branch goals are generated from branch execution data according to the following
rules:

• If a branch (at a predicate that is not a member of the current search goal) has been reached but not executed, then branch-and is an additional branch goal.

• If a branch (again, at a predicate that is not a member of the current search goal) has been executed but its negation has not, then the negation branch-and is an additional branch goal.

• In cases where both a branch and its negation have been executed, two additional branch goals are adopted: branch-and for the branch and branch-and for the branch negation.
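The three rules can be expressed compactly. The following Python sketch (the encoding of goals as tuples is ours) returns the additional branch-and goals for one reached predicate outside the current search goal:

```python
def additional_branch_goals(reached, executed, negation_executed):
    """Additional branch-and goals for one reached predicate that is not
    part of the current search goal.  Goals are encoded as illustrative
    (side, kind) tuples."""
    if not reached:
        return []
    if executed and negation_executed:
        return [('branch', 'and'), ('negation', 'and')]
    if executed:
        return [('negation', 'and')]
    return [('branch', 'and')]

# Reached but never executed: adopt branch-and for the branch itself.
assert additional_branch_goals(True, False, False) == [('branch', 'and')]
# Executed, but its negation never: adopt branch-and for the negation.
assert additional_branch_goals(True, True, False) == [('negation', 'and')]
# Both sides executed: one branch-and goal per side.
assert additional_branch_goals(True, True, True) == \
       [('branch', 'and'), ('negation', 'and')]
```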
Note that the goal of satisfying the branch or-cost is not adopted, primarily to avoid creating an excessive number of subgoals. Moreover, if it is necessary to find an input that executes both branches at a predicate, it is hoped that such an input will be found during the search for an input to satisfy the and-cost. Recall that the and-cost is adopted as the branch goal after one branch at a predicate has already been executed.
The above rules apply to non-loop branches only. Loops are treated differently to if-statements because for pro-
grams that terminate, loop entries are eventually followed by a loop exit. For this reason, a subgoal that specifies
that a loop predicate is always true is not sensible and thus the two possible branch goals at a loop conditional are
loop entry and no loop entry.
4 Branch diversity search method
A genetic algorithm is an appropriate search technique given that the basis of the branch diversity search method
is to conduct a number of searches to explore different regions of the input domain. For the purpose of the work
presented here, a genetic algorithm can be described crudely in terms of three components, a set of candidate
solutions, called a population, a cost function (also known as a fitness function) and a set of search (genetic)
operators that can produce new candidate solutions by copying and modifying existing candidate solutions in the
population.
A basic genetic algorithm conducts a search by selecting candidate solutions from the population and using them
to produce new candidate solutions. The selection is random but biased towards the most promising candidates as
estimated by the cost function. The size of the population is usually fixed and so as new candidates are produced,
the least promising are discarded. This is survival of the fittest. Over many iterations, the population is said to
evolve towards a solution.
A multi-population genetic algorithm (Cantu-Paz 1998) extends the basic genetic algorithm by including a number
of populations. In the work reported here, each population is evolved with its own cost function. This is done
to direct the search in different populations to different regions of the input domain. The use of multiple populations is
a simple method of maintaining diversity in the set of inputs. If only a single population is used, survival of the
fittest can lead to the elimination of all individuals except those exploring the single lowest cost input region.
In general, multi-population genetic algorithms may allow individuals to “migrate” from one population to another.
An individual produced in one population is added to another population providing its evaluation according to the
cost function of the foreign population is sufficient to displace an existing individual. Migration is normally limited
in order to maintain the differences between populations.
In the work reported here, migration is unrestricted. There are two reasons for this. Firstly, each population has its
own cost function which is the overriding determinant of which individuals remain in a population irrespective of
the number of migrants from other populations. For this reason, unrestricted migration does not lead to the loss of
diversity that it might in other multi-population genetic algorithms. Secondly, it is efficient to reuse executed tests
wherever possible since the time required to execute the program under test is usually the most important factor
that determines the speed with which test data is generated. Once an input has been executed and the cost function
data collected, an evaluation of the input against any specific cost function can be produced relatively quickly.
The following algorithm describes the main iterative procedure of the search method. For each control dependency
predicate path to the target branch, a population and associated cost function is created to search for the target.
The initial populations are constructed by randomly generating and executing a number of inputs. The following
algorithm then applies.
while (test program execution count < max test program execution count) {
do {
foreach (non-stagnant population) {
evolve population for a new input
if (target branch executed)
stop
if (population stagnant) {
if (can identify suitable branches as additional subgoals) {
foreach (additional subgoal control dependency path) {
create new search goal consisting of
current search goal and subgoal control dependency path
if (new search goal not a duplicate) {
create new population with new search goal
seed new population from existing and new tests
add new population to current populations
}
}
}
}
}
}
if (all populations stagnant) {
set all populations non-stagnant
}
}
Since it is not known which population will produce a solution, each population is evolved for only one input
in turn before moving on to the next population. A genetic algorithm of the so-called steady-state variety such
as Genitor (Whitley 1989) is a convenient way to do this. Reproduction takes place between two individuals
who produce one or two offspring (depending on the choice of reproduction operator). These offspring are then
evaluated and either inserted into the original population expelling the one or two least fit or discarded if the
offspring are the least fit. The population is kept sorted according to cost and the probability of selection for
reproduction is based on rank in this ordering.
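A minimal steady-state step in this style might look as follows in Python; the genetic operators and the toy problem are illustrative, not those used in the reported experiments:

```python
import random

def steady_state_step(population, cost, crossover, mutate):
    """One Genitor-style step: rank-biased parent selection, one offspring,
    inserted only if it is fitter than the current worst individual."""
    population.sort(key=cost)                 # best (lowest cost) first
    weights = range(len(population), 0, -1)   # linear rank bias
    p1, p2 = random.choices(population, weights=weights, k=2)
    child = mutate(crossover(p1, p2))
    if cost(child) < cost(population[-1]):
        population[-1] = child                # expel the least fit

# Toy use: minimise abs(v - 1) over integer candidates.
random.seed(1)
pop = [9, 4, 7, 2]
for _ in range(50):
    steady_state_step(
        pop,
        cost=lambda v: abs(v - 1),
        crossover=lambda a, b: (a + b) // 2,
        mutate=lambda v: v + random.choice((-1, 0, 1)),
    )
assert len(pop) == 4                        # population size is fixed
assert min(abs(v - 1) for v in pop) <= 1    # the best cost never worsens
```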
An important consideration is determining when a population is no longer evolving towards a solution. In the work reported here, search progress in a single population is considered to have stopped when a sequence of input cost values of a given length k has been accumulated and, comparing each cost with the cost l positions later (where l >= k/2), the majority of comparisons do not show a cost decrease. Such a population is said to be stagnant.
Stagnant populations are not evolved but their search goals are extended and used to evolve other populations.
Whenever all populations are stagnant and the maximum execution count has not yet been reached then in order
to continue searching for inputs, the stagnant status of all populations is cleared. This is done by simply emptying
the sequence of accumulated cost values. This ensures that a formerly stagnant population is evolved for at least
k inputs before there is the possibility of once again becoming stagnant. Note that, in this scheme, since the most
effective populations will take longer to stagnate, they will be given more of the computation time.
A search goal for a population is a set (actually a conjunction) of branch search goals. New search goals are
generated as follows. The best input so far found is executed to identify the set of reached predicates that are
also absent from the current search goal. Branch goals are generated from each of these predicates as described
earlier. Branch goals that cannot be executed before the target branch are discarded. For each remaining branch
goal, and for each control dependency path to the branch goal, if the control dependency path and current search
goal are executable then the current search goal is extended by adding the branch goal and the goals of the control
dependency path. The extended search goals that are not duplicates of existing search goals are used to evolve new
populations. Note that since a search goal may contain conditions in addition to execution of the target branch, the
target branch may be executed without satisfying a search goal.
All new populations, apart from the initial population, are seeded half from the inputs in existing populations and
the other half are generated randomly. Reusing the existing tests is usually efficient since once a test is found to
execute a given branch, it need not be “rediscovered” if it is required for a later branch.
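A minimal sketch of this seeding policy (the function names and the random-input generator are assumptions, not the paper's implementation):

```python
import random

def seed_population(existing_populations, size, random_input):
    """Seed a new population half from inputs already held in existing
    populations and half from randomly generated inputs."""
    pool = [ind for pop in existing_populations for ind in pop]
    seeded = random.sample(pool, min(size // 2, len(pool)))  # reuse found tests
    while len(seeded) < size:
        seeded.append(random_input())                        # fill with random inputs
    return seeded
```

Reused inputs carry over branches they already execute, so a test that covers an early branch need not be rediscovered by the new search.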
5 Case study
To describe in practice how branch goals are selected and how search goals are generated to find inputs that execute a
particular branch, two sample programs are studied in detail. Consider the following example test program that
checks whether there is an equal number of space and non-space characters in a string of at least eight characters.
1  if (s.Length > 7) {
       mismatchcount = 0;
       i = 0;
4      while (i < s.Length) {
5          if (s[i] == ' ') {
               mismatchcount++;
           }
           else {
               mismatchcount--;
           }
           i++;
       }
12     if (mismatchcount == 0) {
           print("parity"); //TARGET BRANCH
       }
   }
To execute the target branch, the two branches of the inner conditional must be executed the same number of
times. The control dependency condition for the target branch, however, does not include these branches: it is
s.Length > 7 and mismatchcount == 0.
Given the 16-bit Unicode character set and uniform random character generation, the probability of generating a
random string that includes a space is low. As a result, the cost function produces low values only by directing the
search towards shorter strings since this is the only way to reduce the cost of abs(mismatchcount - 0). Eventually,
the population of candidate solutions may be expected to consist of strings all with a length of eight characters,
none of which are spaces. Note that in the unlikely event that any string does include a single space then the cost
function will rapidly ensure that all candidate solutions also include a single space since such strings have a cost
of only six. The crossover operator of a genetic algorithm is also likely to distribute those spaces and so the target
will eventually be satisfied. Until the first space is found, however, the search is directed by the cost function
abs(mismatchcount - 0) which in effect is a random search for a string that includes a space character.
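The cost driving this phase can be transcribed directly (a Python paraphrase for illustration; the function name is an assumption):

```python
def parity_cost(s):
    """Branch distance for the 12:To goal of the space-parity example:
    abs(mismatchcount - 0) after the counting loop has run."""
    mismatchcount = 0
    for ch in s:
        # spaces increment the counter, non-spaces decrement it
        mismatchcount += 1 if ch == ' ' else -1
    return abs(mismatchcount)
```

For an eight-character string with no spaces the cost is 8; introducing a single space drops it to 6, which is why such strings quickly dominate the population.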
The behaviour of the branch diversity algorithm for the example space-parity program is outlined below.
1. The initial search goal is the set of goal branches {1:To, 12:To}. A goal branch is shown as an integer
(this is the predicate identifier and corresponds to the line number shown) and one of the four symbols
To, Ta, Fo, Fa which denote, respectively, 'at least one execution of the predicate true branch', 'all execu-
tions of the predicate should be true', 'no execution of the predicate true branch', and 'at least one execution of
the predicate false branch'.
2. A single population with the initial search goal is evolved until stagnant, at which point the reached pred-
icates are {1, 4, 5, 12}. Since branches at predicates 1 and 12 are already present in the search goal, only
branches at predicates 4 and 5 are considered.
3. The predicate at 4 is a loop, which has been entered, and so a potential branch goal is 4:Fo, i.e. no loop entry.
If it is assumed that the predicate at 5 has not been satisfied then 5:Ta is a potential branch goal. Both 4:Fo
and 5:Ta are executable with {1:To, 12:To} and so two new search goals are constructed, {1:To, 4:Fo, 12:To}
and {1:To, 4:To, 5:Ta, 12:To}. Note that 4:To is added because it is the control dependency condition for the
branch goal 5:Ta.
4. The population with the search goal {1:To, 4:Fo, 12:To} is likely to stagnate. This prompts an attempt to
generate new search goals. Again, the reached predicates are {1, 4, 5, 12}. Since branches at predicates 1,
4 and 12 are already present in the search goal, only branches at predicate 5 are considered. If it is assumed
that the predicate at 5 has not been satisfied then 5:Ta is a potential branch goal. Note, however, that 5:Ta is
not executable with {1:To, 4:Fo, 12:To} and so may not be added to the search goal. There are no other ways
in which this search goal may be extended.
5. The population with the search goal {1:To, 4:To, 5:Ta, 12:To} is likely to find an input that executes the true
branch at 5. Once this occurs, a reduction in the cost of the 12:To branch follows.
Note that the presence of 5:Ta rather than 5:To guides the search towards inputs that have a relatively large number
of spaces and indeed, on its own, it would attempt to maximise the number of spaces in the string. The presence
of 12:To counters this tendency. In this example either 5:To or 5:Ta is adequate to guide the search to a solution
but this is not always the case. There are cases where Ta rather than To is the necessary branch goal type; the
following example illustrates this. There may also be occasions where To is required rather than Ta. Adopting
both the Ta and To variants as branch goals significantly increases the number of populations, with the danger that the
method becomes unworkable.
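The four branch goal types can be checked against the sequence of outcomes a predicate took during one execution. The sketch below is illustrative; the paper defines the four types but not this function:

```python
def goal_satisfied(goal_type, outcomes):
    """Evaluate a To/Ta/Fo/Fa branch goal over the list of boolean outcomes
    recorded for one predicate during a single execution."""
    if goal_type == 'To':   # at least one execution of the true branch
        return any(outcomes)
    if goal_type == 'Ta':   # all executions of the predicate are true
        return bool(outcomes) and all(outcomes)
    if goal_type == 'Fo':   # no execution of the true branch
        return not any(outcomes)
    if goal_type == 'Fa':   # at least one execution of the false branch
        return any(not o for o in outcomes)
    raise ValueError(goal_type)
```

For a predicate that evaluated true, false, true on one run, To and Fa are satisfied while Ta and Fo are not.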
In order to compare the method presented here more directly with the chaining method (Ferguson and Korel 1996),
the program example presented in (Ferguson and Korel 1996) (line numbers as in the original) is reworked to show
that the branch diversity method is also able to solve the test goal. The program accepts two integer arrays a and b
of length 10 and an integer target. The overall test goal is to execute the true branch of the if-statement at line
16. This branch is executed when at least one member of a and all the members of b are equal to target.
   i = 0;
   fa = false;
4  fb = false;
5  while (i < 10) {
6      if (a[i] == target) {
           fa = true;
       }
       i = i + 1;
   }
9  if (fa == true) {
       i = 0;
11     fb = true;
12     while (i < 10) {
13         if (b[i] != target) {
14             fb = false;
           }
           i = i + 1;
       }
   }
16 if (fb == true) {
       print("mess 1"); //TARGET BRANCH
   }
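For reference, the condition under which the target branch executes can be transcribed as follows (a Python paraphrase of the program above, not the paper's code):

```python
def executes_target(a, b, target):
    """Return True when the target branch of the Ferguson and Korel example
    would execute: some a[i] equals target and every b[i] equals target."""
    fa = False
    for i in range(10):
        if a[i] == target:
            fa = True
    fb = False
    if fa:
        fb = True
        for i in range(10):
            if b[i] != target:
                fb = False
    return fb
```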
1. The control dependency condition for the target branch is fb == true and is the search goal for the initial
population. This first population very quickly stagnates because the cost function for fb == true has only
two values, corresponding to the two boolean values, and is unable to provide a decreasing cost gradient.
2. The candidate subgoals are the reached branches not in the population goal, namely the while statement
at line 5, the if-statement at line 6 and the if-statement at line 9. This last conditional may be assumed to
be false because of the low probability of satisfying the if-statement at line 6; consequently, the remaining
branches are not reached. From these predicates, the following branch goals are generated: 5:Fo (no loop
entry), 6:Ta and 9:Ta.
3. The searches for 5:Fo and 9:Ta will stagnate but the search for 6:Ta will eventually prove successful. Once
the true branch of the if-statement at line 6 has been executed, the true branch at line 9 is executed, the
loop at line 12 is entered and it is very likely that the true branch of the following if-statement will also be
executed. Unfortunately, execution of this branch prevents satisfaction of the test goal. The populations will
all stagnate.
4. The reached non-executed branches not in the population goal are the false branch at line 9 and very likely
the false branch of the if-statement at line 13. Adopting this false branch as a subgoal, i.e. 13:Fo (with the
and-cost as the cost function since only the true branch has been executed) will guide the search towards
arrays b in which each element is equal to target. This leads to the execution of the target branch.
Notice that since the predicate at 13 must be false on every visit, the 13:Fa branch goal would not have led to
success.
In practice, the generation of new search goals depends on when a population becomes stagnant and the branches
reached by the current best input. An example run for the above program is given in the next section.
To compare the behaviour of the branch diversity method with that of the chaining method for this program: in the
chaining method, an initial search is initiated for an input to execute the target branch. If no such input can be found,
data flow analysis is used to identify the statements that are the last definitions of variables used in the failed branch predicate.
A definition of a variable is a last definition if there is a definition free path for that variable from the definition
successor in the control flow graph to the variable use. The chaining method creates search subgoals, represented
as event sequences. A local, non-population based search method is used to find input data that executes each last
definition along some definition free path. In the example program, the root of the tree is the event sequence that
specifies execution of the branch at line 16. From this point it identifies last definitions of fb at lines 4, 11 and
14. An attempt to execute the definition free path from the definition at line 4 has already failed and so this is not
reconsidered. The definition free paths from the definitions at lines 11 and 14 are therefore used to create the two
child event sequences of the root node. McMinn and Holcombe (2004) have adapted the
chaining method to use genetic algorithm search to satisfy the subgoals identified, but they retain event sequences.
The relative merits of using data-flow analysis rather than predicate cost values to select subgoals are an interesting ques-
tion. The methods are not incompatible with each other and so there may well be an argument for combining
them. In the first example program presented, each conditional statement body defined and used the significant
variable v. If, however, there were many more conditionals in the program, none of which defined v, then the
advantage of data flow analysis becomes clear, as it can avoid pursuing those branches that cannot affect the value
of v. The branch diversity algorithm will pursue all branches regardless of data flow. The tree of search
goals is developed breadth first according to the progress of each search.
6 Results
The branch diversity algorithm has been implemented in order to assess the efficiency of the method in practice.
One practical concern is the number of searches that may be instigated. In the worst case, there may be one
population for each subset of the set of non-target branches in a program. It is unclear how often such cases arise
in practice. This problem is mitigated somewhat, however, by the fact that the populations of ineffective searches
become stagnant and so claim only a small proportion of the computational time. It might be considered that once
a population is stagnant its evolution should be discontinued altogether. The argument against this is that the
newly instigated searches are usually directed at only a proper subset of the region that may solve the test goal.
In addition, newly discovered inputs in a new population can be copied (migrated) and used to restart progress in
a stagnant population. Notice also that since a population may become stagnant more than once, it may produce
more than one set of extended search goals. Usually, these goals duplicate previously produced goals but this need
not be the case. New tests continue to migrate to stagnant populations. This means a stagnant population may be
revived to execute a different set of branches and thereby produce a different set of subgoals.
Another concern is the total size of the populations. Currently, the implementation maintains a maximum overall
population size for each set of populations created from extending a single population, and the available
memory is divided equally between the newly created populations. There is no limit to the number of populations.
The first example program presented in this paper was submitted to an implementation of the branch diversity
search method in order to generate inputs for all branch coverage. (The program was coded as a JScript program,
the source language acceptable to the implementation.) The input domain for each integer variable,
x, y and z was [-500000, 500000]. The overall population size for each newly formed set of populations was
120. The sequence of costs used to determine if a population is stagnant had a length of 100. No attempt was
made to tune any parameters of the genetic algorithm. In each of 30 trials, inputs were found to execute
all branches. Branch coverage required an average of 20,434 executions of the example program under test. Of
particular interest is the time taken to find an effective search goal. On average, a population with an effective
search goal, i.e. one satisfying the first and second predicates, was generated after 1,842 executions of the
example program under test.
The second example program that checks the number of spaces in a string for parity was also submitted to an
implementation of the branch diversity search method in order to generate inputs for all branch coverage. The
input domain for each character variable of the input string was the 16-bit Unicode character set. The overall
population size and the sequence of costs used to determine if a population is stagnant were as before and again
no attempt was made to tune any parameters of the genetic algorithm. Over 30 trials, all branches were covered
in an average of 16,860 executions of the example program under test. On average, a population with an effective
search goal, i.e. one satisfying the inner if-statement (5:Ta), was generated after 1,749 executions of the example
program under test.
The example program from (Ferguson and Korel 1996) was also submitted to an implementation of the branch
diversity search method, again in order to generate inputs for all branch coverage. The input domain for each
of the 21 integer values was [-99, 99]. The overall population size and the sequence of costs used to determine
if a population is stagnant were as before and again no attempt was made to tune any parameters of the genetic
algorithm. Over 30 trials, all branches were covered in an average of 20,661 executions of the example program
under test. On average, a population with an effective search goal, i.e. one including the branch goals 6:Ta and
13:Fo, was generated after 1,208 executions of the example program under test.
Since stagnant populations give rise to new populations, the populations can be arranged as a tree. The root is
the initial population. The populations used for searches instigated because the search in this initial population
stagnated are children of the initial population. The tree of search goals that was developed during one trial run of
the implementation is shown below.

[Tree of search goals from one trial run: the root is the initial goal 16:To; its children are the extended goals
5:Fo, 5:To 6:Ta and 9:Ta; further branch goals are added as each population stagnates, and two of the resulting
goals are marked as effective search goals.]
The initial population search goal is 16:To. After 100 executions of the program under test (100 being the length
of the sequence of costs used to determine stagnation), the initial population is stagnant and three new search goals
are formed. These appear at the first indentation level as 5:Fo, 5:To 6:Ta and 9:Ta, these being the additional branch
goals added to the initial population. Notice that it is at this point that the first of the crucial branch goals, 6:Ta, is
created. The second population to become stagnant is 16:To 5:Fo, after a total of 295 executions. At this point, the
branch goal 9:Ta is added to 16:To 5:Fo to form a new search goal. Note also that 9:To has not yet been executed
even though the required branch goal 6:Ta has been used in the search from execution 100 onwards.
After 297 executions, 4 new populations are created by adding branch goals to the population previously created
with 9:Ta at 100 executions. At this point, notice that the second of the crucial branch goals, 13:Fo, is created
together with its control dependency branch goal 12:To. At this stage, although 6:To has been executed, unfortunately
no search goal combines both of the crucial branch goals. After 487 executions, however, the 5:To 6:Ta branch
goals are added to the search goal 16:To 9:Ta 12:To 13:Fo to provide an effective search goal.
The tree of populations continues to grow, as shown, up to 7,951 executions. Note that a second effective search goal is
created after 940 executions. In fact, the presence of 9:Ta rather than 9:To is the only difference between this search
goal and the effective search goal generated at 487 executions. After a total of 28,636 executions, the required input
is found.
7 Conclusions and further work
A branch diversity search method has been presented for avoiding local optima and plateaux during the search for
test data. It relies on identifying branches in the program under test where the flow of control may be modified
without invalidating the test goal. Additional searches are instigated when it is deemed that one or more of the
current searches are not making progress. The additional searches are designed to explore regions of the input domain
that have not yet been explored. This is achieved by identifying branches in the program where the execution
behaviour may be changed without violating any control flow constraints implied by the current search goal. The
predicate cost values that are calculated to guide the search for test data are used to select suitable branches. Three
short but difficult-to-test programs have been shown to be amenable to the method.
The immediate further work is to apply the method to a larger set of programs to gain experience with its per-
formance and uncover its problems and strengths. For example, for large programs, the best search strategy for
developing the search tree of populations (depth first, breadth first, etc.) is not clear, and neither are the computational
demands of the method. In looking for branches that are suitable as subgoals, branches that have not been
executed are preferred. Loops, as mentioned earlier, are treated differently from if-statements: the available subgoals
are loop entry and no loop entry. Although it is not sensible to search for an input that enters a loop and never
exits, it may be useful to find inputs that maximise the number of loop iterations.
References
Baresel, A. and H. Sthamer (2003). Evolutionary testing of flag conditions. In Proceedings of GECCO 2003,
pp. 2442–2454. Springer Verlag.
Baresel, A., H. Sthamer, and M. Schmidt (2002). Fitness function design to improve evolutionary structural
testing. In Proceedings of Genetic and Evolutionary Computation Conference (GECCO 2002), pp. 1329–
1336. Morgan Kaufmann.
Bottaci, L. (2003). Predicate expression cost functions to guide evolutionary search for test data. In Proceedings
of Genetic and Evolutionary Computation Conference (GECCO 2003), pp. 2455–2464. Springer Verlag.
Cantu-Paz, E. (1998). A survey of parallel genetic algorithms. Calculateurs Paralleles, Reseaux et Systemes
Repartis 10(2), 141–171.
Ferguson, R. and B. Korel (1996, Jan). The chaining approach for software test data generation. ACM Transac-
tions on Software Engineering and Methodology 5(1), 63–86.
Ferrante, J., K. J. Ottenstein, and J. D. Warren (1987, July). The program dependence graph and its use in
optimization. ACM Transactions on Programming Languages and Systems 9(3), 319–349.
Harman, M., L. Hu, R. Hierons, A. Baresel, and H. Sthamer (2002). Improving evolutionary testing by flag
removal. In Proceedings of Genetic and Evolutionary Computation Conference (GECCO 2002), pp. 1359–
1366. Morgan Kaufmann.
Harrold, M. J. and G. Rothermel (1996). Syntax-directed construction of program dependence graphs. Technical
Report OSU-CISRC-5/96-TR32, Department of Computer Information Science, The Ohio State University,
Columbus, OH.
Jones, B. F., H. Sthamer, and D. Eyres (1996). Automatic structural testing using genetic algorithms. Software
Engineering Journal 11(5), 299–306.
Korel, B. (1990, August). Automated software test data generation. IEEE Transactions on Software Engineer-
ing 16(8), 870–879.
Korel, B. (1992). Dynamic method for software test data generation. Software Testing, Verification and Relia-
bility 2(4), 203–213.
McMinn, P. and M. Holcombe (2004, June). Hybridizing evolutionary testing with the chaining approach. In
Proceedings of GECCO 2004, pp. 1363–1374. Springer Verlag.
Michael, C., G. McGraw, M. Schatz, and C. Walton (1997). Genetic algorithms for dynamic test data genera-
tion. Technical Report RSTR-003-97-11, RST Corporation, Suite 250, 21515 Ridgetop Circle, Sterling VA
20166.
Tracey, N., J. Clark, and K. Mander (1998, March). Automated program flaw finding using simulated annealing.
Software Engineering Notes 23(2), 73–81.
Tracey, N., J. Clark, K. Mander, and J. McDermid (2000). Automated test data generation for exception condi-
tions. Software – Practice and Experience 30, 61–79.
Wegener, J., A. Baresel, and H. Sthamer (2001). Evolutionary test environment for automatic structural testing.
Information and Software Technology 43, 841–854.
Whitley, D. (1989). The GENITOR algorithm and selective pressure: why rank-based allocation of reproductive
trials is best. In Proceedings of the Third International Conference on Genetic Algorithms, pp. 116–121.
Testability Transformation for Efficient
Automated Test Data Search in the Presence of Nesting
Phil McMinn
University of Sheffield,
Regent Court,
211 Portobello Street,
Sheffield, S1 4DP, UK
p.mcminn@dcs.shef.ac.uk
David Binkley
Loyola College
4501 North Charles Street
Baltimore,
MD 21210-2699, USA
binkley@cs.loyola.edu
Mark Harman
King’s College
Strand, London
WC2R 2LS, UK
mark@dcs.kcl.ac.uk
Abstract
The application of metaheuristic search techniques to the automatic generation of software
test data has been shown to be an effective approach for a variety of testing criteria. However,
for structural testing, the dependence of a target structure on nested decision statements can
cause efficiency problems for the search, and failure in severe cases. This is because all
information useful for guiding the search - in the form of the values of variables at branching
predicates - is only gradually made available as each nested conditional is satisfied, one after
the other. The provision of guidance is further restricted by the fact that the path up to that
conditional must be maintained by obeying the constraints imposed by ‘earlier’ conditionals.
An empirical study presented in this paper shows the prevalence of types of if statement
pairs in real-world code, where the second if statement in the pair is nested within the
first. A testability transformation is proposed in order to circumvent the problem. The
transformation allows all branch predicate information to be evaluated at the same time,
regardless of whether ‘earlier’ predicates in the sequence of nested conditionals have been
satisfied or not. An experimental study is then presented, which shows the power of the
approach, comparing evolutionary search with transformed and untransformed versions of
two programs with nested target structures. In the first case, the evolutionary search finds
test data in half the time for the transformed program compared to the original version. In
the second case, the evolutionary search can only find test data with the transformed version
of the program.
1 Introduction
The application of metaheuristic search techniques to the automatic generation of software test data has been shown to be an effective approach for functional [11, 21, 20], non-functional [26, 19, 27], structural [12, 13, 4, 29, 10, 17, 25, 16, 15], and grey-box [14, 24] testing criteria. The search space is the input domain of the test object. An objective function provides feedback as to how 'close' input data are to satisfying the test criteria. This information is used to provide guidance to the search.
For structural testing, each individual program structure of the coverage criteria (for example each individual program statement or branch) is taken as the individual search 'target'. The effects
void example(int a, int b, int c, int d)
{
(1)  if (a > b)
     {
(2)    if (b > c)
       {
(3)      if (c > d)
         {
(4)        // target
           ...

Figure 1: Nested targets require the succession of branching statements to be evaluated by the objective function one after the other. (In the flowchart accompanying the code, each false branch misses the target and feeds a branch distance to the objective function: b - a at node 1, c - b at node 2 and d - c at node 3.)
of input data are monitored through instrumentation of the branching conditions of the program. An objective function is computed, which decides how 'close' an input datum was to executing the target, based on the values of variables appearing in the branching conditionals which lead to its execution. For example, if a branching statement 'if (a == b)' needs to be true for a target statement to be covered, the objective function feeds back a 'branch distance' value of abs(b - a) to the search. The objective values fed back are critical in directing the search to potential new test data candidates which might execute the desired program structure.

However, the search can encounter problems when structural targets are nested within more than one conditional statement. In this case, there is a succession of branching statements which must be evaluated with a specific outcome in order for the target to be reached. For example, in Figure 1, the target is nested within three conditional statements. Each individual conditional must be true in order for execution to proceed onto the next one. Therefore, for the purposes of computing the objective function, it is not known that b > c must be true until a > b is true. Similarly, until b > c is satisfied, it is not known that c > d must also be satisfied. This gradual release of information causes efficiency problems for the search, which is forced to concentrate on satisfying each predicate individually. For example, inputs where b is close to being greater than c are of no consequence to the objective function until a > b.

Furthermore, the search is restricted when seeking inputs to satisfy 'later' conditionals, because satisfaction of the earlier conditionals must be maintained. If, when searching for input values to make b > c true, the search chooses input values so that a is not greater than b, the path taken through the program never reaches the latter conditional, and thus the search never finds out whether b > c or not. Instead it is held up again at the first conditional, which must be made true in order to reach the second conditional again. This inhibits the test data search, and the possible input values it can consider in order to satisfy predicates appearing 'later' in the sequence of nested conditionals. In severe cases the search may fail to find test data.

Ideally, all branch predicates need to be evaluated by the objective function at the same time. This paper presents a testability transformation approach in order to achieve this. A testability transformation [7] is a source-to-source program transformation that seeks to improve the performance of a test data generation technique. The transformed program produced is merely a 'means to an end', rather than an 'end' in itself, and can be discarded once it has served its intermediary purpose as a vehicle for an improved test data search.

The ability to evaluate all branch predicates at the same time results in a significant positive impact on the level of guidance that can be provided to the search. This can be seen by examining the objective function landscapes of the original and transformed versions of programs. Experiments carried out using evolutionary algorithms on two case studies confirm this. In the first study, test data was found in half the number of input data evaluations for the transformed version. In the second study, the test data search was unsuccessful unless the transformed version of the program was used.
An empirical study is presented which examines if statement pairs occurring in forty real-world programs. In this study, the latter if statement of the pair is nested in the first. The results further serve to show the benefit of the proposed transformation approach. In previous work [3], a method is presented to simultaneously evaluate all nested branch conditions, but only if no further statements occur between each pair of if statements. The empirical study shows that this occurs for only 18% of if pairs, whereas the transformation approach is also potentially applicable to the additional 82% of cases.
2 Search-Based Structural Test Data Generation
Several search methods have been proposed for structural test data generation, including the alternating variable method [12, 13, 4], simulated annealing [23, 22] and evolutionary algorithms [29, 10, 17, 25, 16, 15]. This paper is interested in the application of the alternating variable method and evolutionary algorithms to structural test data generation.
2.1 The Alternating Variable Method
The alternating variable method [12] is employed in the goal-oriented [13] and chaining [4] test data generation approaches, and is based on the idea of 'local' search. An arbitrary input vector is chosen at random, and each individual input variable is probed by changing its value by a small amount, and then monitoring the effects of this on the branch predicates of the program.

The first stage of manipulating an input variable is called the exploratory phase. This probes the neighborhood of the variable by increasing and decreasing its original value. If either move leads to an improved objective value, a pattern phase is entered. In the pattern phase, a larger move is made in the direction of the improvement. A series of similar moves is made until a minimum for the objective function is found for the variable. If the target structure is not executed, the next input variable is selected for an exploratory phase.
In the example of Figure 1, the search target is the execution of node 4. Say the program is executed with the arbitrary input (a=10, b=20, c=30, d=10). Control flow diverges away from the target down the false branch from node 1. The search attempts to minimize the objective value, which is formed from the true branch distance at node 1, i.e. b - a. Exploratory moves are made around the value of a. A decreased value leads to a worse objective value. An increased value leads to an improved, smaller objective function value. Larger moves are made to increase a until a is greater than b. Suppose the input to the program is now (a=21, b=20, c=30, d=10). Execution now proceeds down the true branch from node 1, but diverges away down the false branch at node 2. The search now attempts to minimize the objective function c - b in order to make the predicate at node 2 true. Exploratory moves around a have no effect on the objective function. Therefore exploratory moves are made around the value of b. A decreased value of b leads to a worse objective function value, whilst an increased value leads to execution taking the false branch at node 1 again. Therefore the search explores values around the current value of c. Increased values have a negative impact on the objective function, whilst decreased values lead to an improvement. Further moves are made to decrease the value of c until input is found which makes the predicate at node 2 true. Suppose this is (a=21, b=20, c=19, d=10). Execution now proceeds directly through all branching statements to target node 4.
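The walkthrough above can be sketched in code. The objective below penalises each unsatisfied nested conditional with its branch distance plus a large offset for the levels still unreached (the fixed 1000 offsets are an assumption made for this sketch; the literature normalises branch distances instead):

```python
def objective(a, b, c, d):
    """Objective for the Figure 1 example: smaller is closer to node 4."""
    if a <= b:
        return 2000 + (b - a + 1)   # diverged at node 1
    if b <= c:
        return 1000 + (c - b + 1)   # diverged at node 2
    if c <= d:
        return d - c + 1            # diverged at node 3
    return 0                        # target node 4 reached

def alternating_variable_method(start):
    """Exploratory +/-1 probes on each variable; on improvement, a pattern
    phase doubles the step in the improving direction (a sketch)."""
    vec = list(start)
    best = objective(*vec)
    improved = True
    while best > 0 and improved:
        improved = False
        for idx in range(len(vec)):
            for direction in (-1, 1):       # exploratory phase
                trial = vec[:]
                trial[idx] += direction
                cost = objective(*trial)
                step = direction
                while cost < best:          # pattern phase
                    vec, best = trial, cost
                    improved = True
                    step *= 2
                    trial = vec[:]
                    trial[idx] += step
                    cost = objective(*trial)
    return vec, best
```

Started from (a=10, b=20, c=30, d=10), this sketch first raises a past b, then adjusts b, then lowers c below d, mirroring the walkthrough above.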
2.2 Evolutionary Testing
Evolutionary testing [29, 10, 17, 25, 16, 15] employs evolutionary algorithms for the test data search. Evolutionary algorithms [28] combine characteristics of genetic algorithms and evolution strategies, using simulated evolution as a search strategy, employing operations inspired by genetics and natural selection.

Evolutionary algorithms maintain a population of candidate solutions rather than just one current solution, as with local search methods. The members of the population are iteratively
recombined and mutated in order to evolve successive generations of potential solutions. The aim is to produce 'fitter' individuals in subsequent generations, which represent better candidate solutions. Recombination forms offspring from the components of two parents selected from the current population. The new offspring form part of the new generation of candidate solutions. Mutation performs low-probability random changes to solutions, introducing new genetic information into the search. At the end of each generation, each solution is evaluated for its fitness using a 'fitness' function. The fitness function can be the direct output of an objective function, or this value ranked or scaled in some way. Using fitness values, the evolutionary search decides whether individuals should survive into the next generation or be discarded.
In applying evolutionary algorithms to structural test data generation [29, 10, 17, 25, 16, 15], 'candidate solutions' are possible test data inputs. The objective function evaluates each test data input with regard to the current structural target in question. This is performed in a slightly different way to the alternating variable method. The notion of branch distance is still key, but as the search does not work by iteratively improving one solution, the objective function incorporates another metric known as the approach level (also known as the approximation level) [25], which records how many nested conditionals are left unencountered by an input en route to the target.
Take the example of Figure 1 again. If some test data input reaches node 1 but diverges away down the false branch, its objective value is formed from the true branch distance at node 1, and an approach level value of 2 to indicate there are still two further branching nodes to be encountered (nodes 2 and 3). If the test data input evaluates node 1 in the desired way, its objective value is formed from the true branch distance at node 2, with the approach level now being one. At node 3, the approach level is zero and the branch distance is derived from the true branch predicate.
Formally, the objective function for a test data input is computed as follows:

obj_val = approach_level + normalize(branch_dist)    (1)

where the branch distance branch_dist is normalized into the range 0–1 by the function normalize, using the following formula [1]:

normalize(branch_dist) = 1 − 1.001^(−branch_dist)    (2)
thus ensuring the value added to the approach level is close to 1 when the branch distance is very large, and zero when the branch distance is zero.
The approach level, therefore, adds a value for each branch distance which remains unevaluated. Since these values are not known (the path of execution through the program has meant they were never calculated), the maximum value, i.e. 1, is added for each (this 'approximation' to the real branch distances is why the approach level is sometimes referred to as the 'approximation level'). As will be seen in the next section, the addition of this value rather than the actual branch distance can inhibit search progress.
3 Nested Search Targets
The dependence of structural targets on one or more nested decision statements can cause problems for search-based generation methods, and even failure in severe cases.
The problem stems from the fact that information valuable for guiding the search is only revealed gradually as each individual branching conditional is encountered. The search is forced to concentrate on each branch predicate one at a time, one after the other. In doing this, the outcome at previous branching conditionals must be maintained, in order to preserve the execution path up to the current branching statement. If this is not done, the current branching statement will never be reached. This restricts the search in its choice of possible inputs, narrowing the potential search space.
In case study 1 (Figure 2a), where the target of the search is node 4, the fact that c needs to be zero at node 3 is not known until a == b is true at node 1. However, in order to evaluate node
Node(s) void case_study_1_original(double a, double b)
{
(1) if (a == b)
{
(2) double c = b + 1;
(3) if (c == 0)
{
(4) // target
}
}
(e) }
(a) Original program
void case_study_1_transformed(double a, double b)
{
double _dist = 0;
_dist += branch_distance(a == b);
double c = b + 1;
_dist += branch_distance(c == 0);
if (_dist == 0.0)
// target
}
(b) Transformed version of program
Figure 2: Case study 1
3 in the desired way, the constraint a == b needs to be maintained. If the values of a and b are not -1, the search has no chance of making node 3 true, unless it backtracks to reselect the values of a and b again. However, if it were to do this, the fact that c needs to be zero at node 3 will be 'forgotten', as node 3 is no longer reached, and its true branch distance is not computed.
This phenomenon is captured in a plot of the objective function landscape (Figure 3a), which uses the output of Equation 1. The shift from satisfying the initial true branch predicate of node 1 to the secondary satisfaction of the true branch predicate of node 3 is characterized by a sudden drop in the landscape down to spikes of local minima. Any move to input values where a is not equal to b jerks the search up out of the minima and back to the area where node 1 is evaluated as false again. When stuck in the local minima, the alternating variable method cannot alter both input variables at once. As the method will not accept an inferior solution which would place it back at node 1, it declares failure. The evolutionary algorithm, meanwhile, has to change both values of a and b in order to traverse the local minima down to the global minimum of (a=-1, b=-1).

Case study 2 (Figure 4a) further demonstrates the problems of nested targets, this time with a
target within three levels of nesting. This can be seen in a plot of the objective function landscape,
[Plot: objective value over the (a, b) input space. (a) Original program; (b) Transformed version.]
Figure 3: Objective function landscape for case study 1
Node(s) void case_study2(double a, double b, double c)
{
(1) double d, e;
(2) if (a == 0)
{
(3) if (b > 1)
(4) d = b + b/2;
else
(5) d = b - b/2;
(6) if (d == 1)
{
(7) e = c + 2;
(8) if (e == 2)
{
(9) // target
}
}
}
(e) }
(a) Original program
void case_study2_transformed(double a, double b, double c)
{
double _dist = 0;
double d, e;
_dist += branch_distance(a == 0);
if (b > 1)
d = b + b/2;
else
d = b - b/2;
_dist += branch_distance(d == 1);
e = c + 2;
_dist += branch_distance(e == 2);
if (_dist == 0.0)
// target
}
(b) Transformed version of program
Figure 4: Case study 2
[Plot: objective value over the (a, b) input space. (a) Original program; (b) Transformed version.]
Figure 5: Objective function for case study 2, plotted where c = 0
shown in Figure 5a. The switch from minimizing the branch distance at node 2 to that of node 6 is again characterized by a sudden drop. Any move from a value of a = 0 has a significant negative impact on the objective value, as the focus of the search is pushed back to satisfying this initial predicate. In this area of the search space, the objective function has no regard for the value of b, which is the only variable which can affect the outcome at node 6. To select inputs in order to take the true branch from node 6, the search is constrained in the a = 0 plane of the search space.
3.1 Related Work
Baresel et al. [3] consider the nested search target problem where no further statements exist between each subsequent if decision statement, as in the example of Figure 1. It is observed that the branch distances of each branching node can simply be measured at the 'top level', i.e. before node 1 is encountered, and added together for computing the objective function. However, if statements do exist between pairs of if statements, this solution is no longer feasible. In case study 1 (Figure 2), for example, the value of c at node 3 is fixed at node 2, which occurs after node 1 is executed. In case study 2 (Figure 4), the value of d at node 6 could be fixed at node 4 or node 5, depending on the input value of b. Furthermore, the value of e is decided at node 7, which is nested in the true branches of nodes 6 and 2. A burning question, therefore, is how often such intermediary statements occur between if pairs in real-world code. Is an extended solution to the nested target problem justified?
3.2 Nesting in real world programs - an empirical study
An empirical study investigated nested if statement pairs for forty real-world programs. A description of each program, and its size measured in lines of code by the tools wc and sloc, can be found in Table 1.
The if statement pairs analyzed, for if statements P and Q, where Q is nested in P, followed the system dependence graph [9] pattern of the following form:
1. Q is control dependent on P
2. P is not transitively control dependent on Q
Control dependency [5] is informally defined as follows: “for a program node I with two exits (e.g. an if statement), program node J is control dependent on I if one exit from I always results in J being executed, while the other exit may not result in J being executed”. Rules 1 and 2, therefore, ensure that Q is nested in P and that P is differentiated from Q.
As outlined in the previous section, the issue of a possible statement sequence A existing between P and Q is an important feature which distinguishes this work from the earlier work of Baresel et al. [3]. Such occurrences were checked using the following rules:
3. A (if it exists) depends on some X which is control dependent on P (i.e. A depends on something nested in P)
4. A (if it exists) is not transitively control dependent on Q
A further, fifth rule checked whether A has a role in determining the outcome at Q, i.e. whether there is some variable assigned to in A that is used in the predicate at Q:
5. Q is transitively data dependent on A, and this dependency is not loop-carried
The condition that the dependency is not loop-carried ensures that if Q is data dependent on A, A does indeed occur in between P and Q, and does not merely appear after both P and Q within the body of a loop.
Table 1: Details of the real world programs
Program          wc LOC    sloc LOC   Description
a2ps              63,600    40,222    Postscript formatter
acct              10,182     6,764    Accounting package
barcode            5,926     3,975    Barcode generator
bc                16,763    11,173    Calculator
byacc              6,626     5,501    Berkeley YACC
cadp              12,930    10,620    Protocol engineering tool-box
compress           1,937     1,431    Data compression utility
copia              1,170     1,112    ESA signal processing code
csurf-pkgs        66,109    38,507    Code surfer slicing tool
ctags             18,663    14,298    Produces tags for ex, more, and vi
diffutils         19,811    12,705    File comparing routines
ed                13,579     9,046    Unix editor
empire            58,539    48,800    War game
EPWIC-1            9,597     5,719    Image compression tool
espresso          22,050    21,780    Logic simplification for CAD (from SPECmark)
findutils         18,558    11,843    File finding utilities
flex2-4-7         15,813    10,654    BSD scanner (version 2.4.7)
flex2-5-4         21,543    15,283    BSD scanner (version 2.5.7)
ftpd              19,470    15,361    File Transfer Protocol daemon
gcc.cpp            6,399     5,731    Gnu C Preprocessor
gnubg-0.0         10,316     6,988    Gnu Backgammon
gnuchess          17,775    14,584    Gnu chess game player
gnugo             81,652    68,301    Gnu go game player
go                29,246    25,665    The game go
ijpeg             30,505    18,585    JPEG compressor (from SPECmark)
indent             6,724     4,834    C formatter
li                 7,597     4,888    Xlisp interpreter
ntpd              47,936    30,773    Daemon for the network time protocol
oracolo2          14,864     8,333    Array processor
prepro            14,814     8,334    ESA array pre-processing code
replace              563       512    Regular expression string replacement
space              9,564     6,200    ESA ADL interpreter
spice            179,623   136,182    Digital circuit simulator
termutils          7,006     4,908    Unix terminal emulation utilities
tile-forth-2.1     4,510     2,986    Forth Environment
time-1.7           6,965     4,185    CPU resource measure
userv-0.95.0       8,009     6,132    Trust management service
wdiff.0.5          6,256     4,112    Diff front end
which              5,407     3,618    Unix utility
wpst              20,499    13,438    CodeSurfer Pointer Analysis
Sum              919,096   664,083
Average           22,977    16,602
Table 2: Nesting in real-world programs
Program          All      Nothing       Unrelated     Related
                          in between    in between    in between
a2ps             528       98           147           283
acct             105       36            30            39
barcode          116        8            40            68
bc               114       28            29            57
byacc            154       25            51            78
cadp             136       65            26            45
compress          22        6             4            12
copia              1        1             0             0
csurf-pkgs       757       97           267           393
ctags            257       85            31           141
diffutils        263       51            83           129
ed               162       38            45            79
empire         2,915      283         1,132         1,500
EPWIC-1          160       35            38            87
espresso         380       74           103           203
findutils        187       38            42           107
flex2-4-7        203       69            75            59
flex2-5-4        261       80           110            71
ftpd             900      174           203           523
gcc.cpp          187       40            35           112
gnubg-0.0        224       42            87            95
gnuchess         498      134           159           205
gnugo          1,578      384           531           663
go             1,568      375           609           584
ijpeg            277      104            64           109
indent           224       56            46           122
li               121       52            30            39
ntpd             973      197           310           466
oracolo2         282       22            65           195
prepro           263       16            65           182
replace            7        3             0             4
space            283       22            65           196
spice          3,010      428           717         1,865
termutils         77       13            16            48
tile-forth-2.1    44       20             6            18
time-1.7          17        6             8             3
userv-0.95.0     243       44            39           160
wdiff.0.5         49       18            12            19
which             33        4            12            17
wpst             360       41           128           191
average        448.5     82.8         136.5         229.2
%                        18.5%         30.4%         51.1%
The results can be seen in Table 2. 'All' is a count of all the if statement pairs analyzed. 'Nothing in between' records all P and Q pairs with no A. 'Unrelated in between' records all P and Q pairs with an A, where A does not have an effect on Q. 'Related in between', on the other hand, counts all pairs whose A has an effect on the predicate at Q.
The results show that the 'nothing in between' case, that is, the form of if pairs handled directly by the technique of Baresel et al., accounts for less than a fifth of all if pairs studied. A further 30% (the 'unrelated in between' cases) could also be handled, since the extra statements do not affect Q. The branch distance calculation could therefore still legitimately take place before P, although data dependency analysis would be required to establish this situation. The remaining 51% (the 'related in between' cases) could not plausibly be handled by the approach of Baresel et al. This is overcome by the application of the testability transformation approach described in the next section.
4 Applying a Testability Transformation
A testability transformation [7] is a source-to-source program transformation that seeks to improve the performance of a test data generation technique. The transformed program produced is merely a 'means to an end', rather than an 'end' in itself, and can be discarded once it has served its purpose as an intermediary for generating the required test data. The transformation process need not preserve the traditional meaning of a program. For example, in order to cover a chosen branch, it is only required that the transformation preserve the set of test-adequate inputs. That is, the transformed program must be guaranteed to execute the desired branch under the same initial conditions. Testability transformations have also been applied to the problem of flags for evolutionary test data generation [2, 6], and to the transformation of unstructured programs for branch coverage [8].
The philosophy behind the testability transformation proposed in this paper is to remove the constraint that the branch distances of nested decision nodes must be minimized to zero one at a time, one after the other. The transformation takes the original program and removes the decision statements on which the target is control dependent. In this way, when the program is executed, it is free to proceed into the originally nested areas of the program, regardless of whether the original branching predicates would have allowed this to happen. In place of each decision is an assignment to a variable dist, which computes the branch distance based on the original predicate. At the end of the program, the value of dist reflects the summation of the individual branch distances. This value may then be used as the objective value for the test data input.
The original version of case study 1 (Figure 2a) can therefore be transformed into the program seen in Figure 2b. The benefit of the transformation can be immediately seen in a plot of the objective function landscape (Figure 3b). The sharp drop into local minima of the original landscape (Figure 3a) is replaced with smooth planes sloping down to the global minimum.
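The branch_distance calls in Figure 2b are written as though they took a predicate; a concrete implementation needs the operands themselves. A sketch, assuming the standard distance |lhs − rhs| for an equality predicate (the helper name is ours, not the paper's):

```c
#include <math.h>

/* Distance for an equality predicate: zero if and only if lhs == rhs. */
double branch_distance_eq(double lhs, double rhs) {
    return fabs(lhs - rhs);
}

/* The transformed case study 1 (Figure 2b), with the helper spelled out.
 * The returned value is zero exactly when the original target is covered. */
double case_study_1_objective(double a, double b) {
    double dist = 0.0;
    dist += branch_distance_eq(a, b);     /* was: if (a == b) */
    double c = b + 1;
    dist += branch_distance_eq(c, 0.0);   /* was: if (c == 0) */
    return dist;
}
```

The global minimum sits at (a=-1, b=-1), where both distances are zero; every other input receives smooth gradient information from both predicates at once.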
Case study 2 (Figure 4) is of a slightly more complicated nature, with the target positioned within three levels of nesting. A further if-else decision exists at level one, before the second conditional en route to the target. Within both branches of this decision, a value is assigned to the variable d, on which the if statement at node 6 depends. The transformed version of the program can be seen in Figure 4b. Again, the benefits of the transformation can be instantly seen in a plot of the objective landscape (Figure 5b). The sharp drop in the original landscape (Figure 5a), corresponding to branching node 2 being evaluated as true and branching node 6 being encountered, is replaced by a smooth landscape sloping from all areas of the search space down into the global minimum.
5 Experimental Study
The two case studies introduced above were put to the test with an evolutionary approach. The Genetic and Evolutionary Algorithm Toolbox (GEATbx) [18] was used to perform the
Table 3: Test data evaluations for case study 1
Run Untransformed Version Transformed Version
1        35,130    11,910
2        31,350    18,390
3        17,580    13,260
4        24,060    10,560
5        27,300    14,070
6        38,100    13,260
7        39,180     9,750
8        27,300    13,800
9        30,540    12,720
10       32,700    16,500
Average  30,324    13,422
Table 4: Test data evaluations for case study 2
Run Untransformed Version Transformed Version
1        54,030    19,470
2        54,030    20,550
3        54,030    16,770
4        54,030    17,850
5        54,030    18,390
6        54,030    19,200
7        54,030    19,740
8        54,030    19,470
9        54,030    15,150
10       54,030    16,500
Average  54,030    18,309
evolutionary searches, which were conducted as follows. 300 individuals were used per generation, split into 6 subpopulations starting with 50 individuals each. Linear ranking is utilized, with a selection pressure of 1.7. The input vectors are operated on by the evolutionary algorithm 'as is', i.e. as a vector of double values. Individuals are recombined using discrete recombination, and mutated using real-valued mutation. Real-valued mutation is performed using 'number creep': the alteration of variable values through the addition of small amounts. Competition and migration is employed across subpopulations. Each evolutionary search was terminated after 200 generations if test data was not found. Each experiment with each program version was repeated ten times.
The domain of each double variable was -1000 to 1000 with a precision of 0.001, giving search space sizes of 10^11 for case study 1 and 10^17 for case study 2.
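Real-valued 'number creep' mutation can be sketched as follows; the mutation rate, the step bound, and the use of the C library random generator are illustrative assumptions, not GEATbx's actual implementation.

```c
#include <stdlib.h>

/* "Number creep" mutation: with probability `rate`, nudge each gene by a
 * small random amount drawn uniformly from [-step, +step]. */
void creep_mutate(double *genes, int n, double rate, double step) {
    for (int i = 0; i < n; i++) {
        if ((double)rand() / RAND_MAX < rate) {
            double r = 2.0 * ((double)rand() / RAND_MAX) - 1.0;  /* [-1, 1] */
            genes[i] += r * step;
        }
    }
}
```

With the 0.001 precision quoted above, a step of that order allows the search to fine-tune candidate inputs without destroying good genetic material.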
For case study 1, the evolutionary algorithm generally performed fewer than half the number of objective function evaluations (i.e. test data evaluations) for the transformed version of the program, compared to the untransformed version (Table 3). The average best objective value plot in Figure 6 shows search progress for the untransformed version of case study 1, with sudden improvements in objective value as the search navigates from local minimum to local minimum. Search progress for the transformed version, as expected, is more consistent and gradual.
The evolutionary algorithm encountered severe difficulties with the untransformed program for case study 2. Due to the existence of three levels of nesting, the search fails on each occasion. Exactly the same number of test data evaluations is performed on each of the ten repetitions of the experiment (Table 4), terminating in the 200th generation. The search has much more success
[Plot: average best objective value per generation, original vs. transformed version]
Figure 6: Average best objective value plot for case study 1
with the transformed version of the program, finding test data as early as the 55th generation in one of the ten repetitions.
6 Future Work
The transformation algorithm proposed in this paper does not accommodate decision statements that are looping constructs, such as 'while' or 'for', or if decision statements that are themselves nested within an outer loop. This is because the branch distance value for a conditional could potentially be added more than once. An advanced version of the algorithm might allow for loops by recording the minimum value of the branch distance encountered for the conditional, and adding this to the end value of the dist variable. It makes no difference to the transformation algorithm, of course, if intermediate blocks of statements occurring between nested if pairs feature self-contained loops.
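The suggested loop extension might look like the following sketch for a single conditional inside a loop. The loop and its predicate (a == i) are hypothetical, purely to illustrate recording the minimum distance and adding it once, after the loop.

```c
#include <float.h>
#include <math.h>

/* Hypothetical transformed loop: rather than adding the distance for
 * "if (a == i)" on every iteration, track the minimum distance observed
 * and fold that single value into the accumulated distance afterwards. */
double loop_min_distance(double a, int iterations) {
    double min_bd = DBL_MAX;
    for (int i = 0; i < iterations; i++) {
        double bd = fabs(a - (double)i);     /* distance for a == i */
        if (bd < min_bd) min_bd = bd;
    }
    /* If the conditional was never reached, add the maximum penalty of 1. */
    return (min_bd == DBL_MAX) ? 1.0 : min_bd;
}
```

The single returned value would then be added to the dist variable after the loop, avoiding double-counting across iterations.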
The transformation algorithm also has issues with certain types of predicates, which need to be detected unless run-time errors are allowed to occur. One example is a predicate which guards a dynamic memory reference. The following example may lead to a program error if it is transformed, due to the possibility of the array index i being less than zero or greater than the length of the array, causing an array out of bounds error:
if (i >= 0 && i < length_of_a)
{
printf("%f\n", a[i]);
}
Another issue is the possibility of introducing division by zero errors, for example in the following segment of code if the conditional were to be removed:
if (d != 0)
{
r = n / d;
}
[Plot: average best objective value per generation, original vs. transformed version]
Figure 7: Average best objective value plot for case study 2
Currently, the transformation algorithm works on a per-target basis: a separate transformation needs to be performed for each search target. An advanced version of the algorithm could modify the predicates of the program to contain function calls. The function call would record the branch distance, and then decide, on the basis of the nesting of the current target, on a boolean value to return, and ultimately on whether execution should be allowed to proceed down a specific branch. For example, in the following, tt_nesting_check records the branch distance of a == b at node A and lets execution flow down through its true branch regardless of whether a actually does equal b. However, tt_nesting_check remains true to the original predicate b == c at node B, since the current target is not nested within it, but allows execution through the true branch at node C regardless of whether c == d.
Node(A) if (tt_nesting_check(a == b))
{
(B) if (tt_nesting_check(b == c))
{
...
}
(C) if (tt_nesting_check(c == d))
{
// current target nested in here
}
}
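One possible shape for tt_nesting_check, sketched as an ordinary function: the fragment above passes it a bare predicate, but an implementation would also need the branch distance, so both the signature and the global accumulator here are our assumptions.

```c
static double _dist = 0.0;   /* accumulated branch distance */

/* Record this decision's branch distance; force the true branch when the
 * current target is nested inside it, otherwise respect the original
 * predicate outcome. */
int tt_nesting_check(int pred, double distance, int target_nested_here) {
    _dist += pred ? 0.0 : distance;   /* distance is zero when pred holds */
    return target_nested_here ? 1 : pred;
}
```

Node A above would then be instrumented as `if (tt_nesting_check(a == b, fabs(a - b), 1))`: the distance for a == b is recorded, and execution always proceeds into the nested region.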
7 Conclusions
This paper has described how targets nested within more than one conditional statement can cause problems for search-based approaches to structural test data generation. In the presence of nesting, the search is forced to concentrate on satisfying one branch predicate at a time, one after the other. This slows search progress and restricts the potential search space available for the satisfaction of branching predicates 'later' in the sequence of nested conditionals.
A testability transformation approach was presented to the problem. A testability transfor-mation is a source-to-source program transformation that seeks to improve the performance ofa test data generation technique. The transformed program produced is merely a ‘means to anend’, rather than an ‘end’ in itself, and can be discarded once it has served its purpose as anintermediary for generating the required test data.
The main idea behind the testability transformation proposed in this paper is to remove the constraint that the branch distances of nested decision nodes must be evaluated one after the other. The transformation takes the original program and removes the decision statements on which the target is control dependent. In this way, when the program is executed, it is free to proceed into the originally nested areas of the program, calculating all branch distance values in order to compute objective values which are in full possession of the facts about the input data.
The approach was put to the test with two case studies. The case studies are small examples, and by no means represent a worst-case scenario, yet they serve to demonstrate the power of the approach. The transformed version of case study 1 allowed the evolutionary search to find test data in half the number of test data evaluations required for the original version of the program. Whilst test data could not be found for the search target in the original version of case study 2, the evolutionary algorithm succeeded every time with the transformed version.
The transformation approach deals with assignments to variables in between nested conditionals which may affect the outcome at 'later' conditionals en route to the current structural target. The empirical study of if pairs in forty real-world programs, where one of the if statements of the pair is nested within the other, showed that this situation occurs just over 50% of the time. These cases cannot be dealt with by the earlier work of Baresel et al. [3], which investigated the nesting problem.
References
[1] A. Baresel. Automatisierung von Strukturtests mit evolutionären Algorithmen. Diploma Thesis, Humboldt University, Berlin, Germany, July 2000.

[2] A. Baresel, D. Binkley, M. Harman, and B. Korel. Evolutionary testing in the presence of loop-assigned flags: A testability transformation approach. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA 2004), pages 43–52, Boston, Massachusetts, USA, 2004. ACM.

[3] A. Baresel, H. Sthamer, and M. Schmidt. Fitness function design to improve evolutionary structural testing. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2002), pages 1329–1336, New York, USA, 2002. Morgan Kaufmann.

[4] R. Ferguson and B. Korel. The chaining approach for software test data generation. ACM Transactions on Software Engineering and Methodology, 5(1):63–86, 1996.

[5] J. Ferrante, K. Ottenstein, and J. D. Warren. The program dependence graph and its use in optimization. ACM Transactions on Programming Languages and Systems, 9(3):319–349, 1987.

[6] M. Harman, L. Hu, R. Hierons, A. Baresel, and H. Sthamer. Improving evolutionary testing by flag removal. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2002), pages 1359–1366, New York, USA, 2002. Morgan Kaufmann.

[7] M. Harman, L. Hu, R. Hierons, J. Wegener, H. Sthamer, A. Baresel, and M. Roper. Testability transformation. IEEE Transactions on Software Engineering, 30(1):3–16, 2004.

[8] R. Hierons, M. Harman, and C. Fox. Branch-coverage testability transformation for unstructured programs. The Computer Journal, to appear, 2005.
[9] S. Horwitz, T. Reps, and D. Binkley. Interprocedural slicing using dependence graphs. ACM Transactions on Programming Languages and Systems, 12:26–60, 1990.

[10] B. Jones, H. Sthamer, and D. Eyres. Automatic structural testing using genetic algorithms. Software Engineering Journal, 11(5):299–306, 1996.

[11] B. Jones, H. Sthamer, X. Yang, and D. Eyres. The automatic generation of software test data sets using adaptive search techniques. In Proceedings of the 3rd International Conference on Software Quality Management, pages 435–444, Seville, Spain, 1995.

[12] B. Korel. Automated software test data generation. IEEE Transactions on Software Engineering, 16(8):870–879, 1990.

[13] B. Korel. Dynamic method for software test data generation. Software Testing, Verification and Reliability, 2(4):203–213, 1992.

[14] B. Korel and A. M. Al-Yami. Assertion-oriented automated test data generation. In Proceedings of the 18th International Conference on Software Engineering (ICSE), pages 71–80, 1996.

[15] P. McMinn. Search-based software test data generation: A survey. Software Testing, Verification and Reliability, 14(2):105–156, 2004.

[16] P. McMinn and M. Holcombe. Hybridizing evolutionary testing with the chaining approach. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2004), Lecture Notes in Computer Science vol. 3103, pages 1363–1374, Seattle, USA, 2004. Springer-Verlag.

[17] R. Pargas, M. Harrold, and R. Peck. Test-data generation using genetic algorithms. Software Testing, Verification and Reliability, 9(4):263–282, 1999.

[18] H. Pohlheim. GEATbx - Genetic and Evolutionary Algorithm Toolbox, http://www.geatbx.com.

[19] P. Puschner and R. Nossal. Testing the results of static worst-case execution-time analysis. In Proceedings of the 19th IEEE Real-Time Systems Symposium, pages 134–143, Madrid, Spain, 1998. IEEE Computer Society Press.

[20] N. Tracey. A Search-Based Automated Test-Data Generation Framework for Safety Critical Software. PhD thesis, University of York, 2000.

[21] N. Tracey, J. Clark, and K. Mander. Automated program flaw finding using simulated annealing. In Software Engineering Notes, Issue 23, No. 2, Proceedings of the International Symposium on Software Testing and Analysis (ISSTA 1998), pages 73–81, 1998.

[22] N. Tracey, J. Clark, and K. Mander. The way forward for unifying dynamic test-case generation: The optimisation-based approach. In International Workshop on Dependable Computing and Its Applications, pages 169–180. Dept of Computer Science, University of Witwatersrand, Johannesburg, South Africa, 1998.

[23] N. Tracey, J. Clark, K. Mander, and J. McDermid. An automated framework for structural test-data generation. In Proceedings of the International Conference on Automated Software Engineering, pages 285–288, Hawaii, USA, 1998. IEEE Computer Society Press.

[24] N. Tracey, J. Clark, K. Mander, and J. McDermid. Automated test data generation for exception conditions. Software - Practice and Experience, 30(1):61–79, 2000.

[25] J. Wegener, A. Baresel, and H. Sthamer. Evolutionary test environment for automatic structural testing. Information and Software Technology, 43(14):841–854, 2001.
[26] J. Wegener, K. Grimm, M. Grochtmann, H. Sthamer, and B. Jones. Systematic testing of real-time systems. In Proceedings of the 4th European Conference on Software Testing, Analysis and Review (EuroSTAR 1996), Amsterdam, Netherlands, 1996.

[27] J. Wegener and M. Grochtmann. Verifying timing constraints of real-time systems by means of evolutionary testing. Real-Time Systems, 15(3):275–298, 1998.

[28] D. Whitley. An overview of evolutionary algorithms: Practical issues and common pitfalls. Information and Software Technology, 43(14):817–831, 2001.

[29] S. Xanthakis, C. Ellis, C. Skourlas, A. Le Gall, S. Katsikas, and K. Karapoulios. Application of genetic algorithms to software testing (Application des algorithmes génétiques au test des logiciels). In 5th International Conference on Software Engineering and its Applications, pages 625–636, Toulouse, France, 1992.
4. Tools and Experience
Model-Driven Engineering Testing Environments
Paul Baker, Paul Bristow, Clive Jervis, David King, Rob Thomson
{Paul.Baker, Paul.C.Bristow, Clive.Jervis, David.King, Rob.Thomson}@motorola.com
Motorola Labs
Abstract
Model-driven engineering has been prevalent in the Telecommunications industry for a number of years. Formal graphical notations such as MSC and SDL have been widely used for requirements and design respectively. These notations are ideal for expressing the message-passing behaviour of concurrent processes, i.e. exactly the Telecoms problem domain. With the advent of UML 2.0, which incorporates close relatives of these notations, the model-driven approach is starting to become more widely used both inside and outside of Telecoms, for example in the automotive industry.
With any relatively new technique, the demand for good tools exceeds the supply. Our work on testing in a model-driven engineering environment began because the tools were not available. In this paper we cover the techniques, notations, and experiences we have gained in developing testing environments for MSC/SDL and UML 2.0. Tests in these environments are derived from requirements models and run against the design models, so this is a strong form of functional conformance testing.
1 Introduction
Motorola, like many other high-tech companies, has a keen interest in improving software quality whilst reducing cost. Model-Driven Engineering (MDE) offers some hope of achieving this aim. In this approach high-level graphical notations are used: for example, MSC [11] is used for requirements specification, and SDL [10] is used for design. More recently, UML 2.0 [14], which incorporates close relatives of these notations, has started to be piloted, and is likely to supersede MSC/SDL for MDE in the long run. Using higher levels of abstraction allows for faster development and less maintenance. Having requirements in a formal notation such as MSC is a huge improvement over natural language requirements, since MSC has a clear and unambiguous semantics. Tool support also allows interpretation of the formal MSC models, whereas this is a non-trivial problem with requirements in a natural language. The graphical notations can also aid understanding by visualisation compared to a textual notation. Testing is often quoted to take 40-60% of the software lifecycle [4], and testing in a model-driven environment is no exception. However, there is more potential for automation in a model-driven environment compared with traditional approaches, since we have formal models for requirements. This enables us to generate conformance tests automatically. The generated tests will check that the signal ordering and data of the application conform exactly to the requirements. If the requirements are complete, then we can potentially generate tests that cover every scenario; we are limited only by the sheer volume of tests and their execution time.
UK Software Testing Research III
2 MDE lifecycle
In Figure 1 we show the V-model development lifecycle used with the MDE approach. There is no fundamental difference from the classical V-model; the main difference lies in the notations and tools used.
2.1 Requirements
Starting at the top-left of Figure 1, requirements are created, mainly in the form of textual statements. Requirements management at this stage is an important discipline, since defects are often traced back to ambiguous, missing, or poor requirements. An often quoted figure is that 50% of defects are traced back to errors in the requirements [13]. Tool support can be useful here: for example, many groups in Motorola use Telelogic DOORS [16] for requirements management, which allows for progress tracking, revision control, managing documentation, and providing linkage between requirements and relevant material.
[Figure 1 shows the V-model stages: Requirements; Requirements Specification; Code Design; Application Code (via Code Generation); with Test Generation feeding an Execution Environment for Simulation or Functional Testing, Component Testing, and Integration Testing.]
Figure 1: V-model development lifecycle.
Since requirements at this stage are in a textual form, problems are not easily detected, and effort is spent conducting formal reviews [6]. Problems include missing or incomplete requirements, incorrect requirements, ambiguous requirements, and requirement interaction. There is certainly scope to improve analysis at this level, for example, natural language interpretation of requirements to pick up contradictions. Such analysis techniques can potentially have a large impact on software productivity and quality.
2.2 Requirements specification: MSC/UML 2.0 SDs
After the top-level requirements we have the formal functional requirements specification. With MDE for telecoms, MSC [11] or UML 2.0 Sequence Diagrams [14] are ideal formal notations for expressing message-passing behaviour. The message data and additional data used here depend on the application domain and on data language tool support. The ASN.1 language [9] is one of the most expressive data languages, but isn't always used because of lack of tool support, or for legacy reasons.
UKTest 2005
The MSC-2000 language [11] is standardised by the ITU-T and describes the message-passing behaviour between independent concurrent processes at a high level. Figure 2 illustrates a basic MSC that describes the interactions for a simple Bank statement request. This has three instances: User, ATM, and Bank. Time progresses downwards on each instance line, but instances are asynchronous with each other, and should be thought of as three separate concurrent processes. Messages transmit data between instances, and consist of a send/receive event pair. Messages have arbitrary latency so, for example, although message Acknowledgment is sent before RequestStatement, it could be received after RequestStatement.
[Figure 2 shows the basic MSC "Statement" with instances User, ATM, and Bank: RequestStatement(cardNum) from User to ATM; Acknowledgment(userMessage) from ATM to User; RequestStatement(cardNum) from ATM to Bank; and Letter(statement) from Bank to User.]
Figure 2: Example MSC for ordering bank statements.
The MSC language contains many more constructs than we have shown here; for example, inline expressions allow fragments to be composed in parallel, or with loops, alternatives, or optional sections. There are also constructs for references, guards, general ordering, and more. Apart from the basic MSC language shown here, there is a High-level MSC language that allows the composition of basic MSCs. In addition there is an MSC document language which allows for the declaration of instances, variables, and other data aspects. There is a standard textual format for MSC, and a well-defined trace semantics. See the language standard for a complete description of MSC [11].
UML 2.0 Sequence Diagrams [14] are based on MSCs, rather than on Sequence Diagrams from UML 1.x. The notation and semantics of UML 2.0 SDs are almost identical to MSC. The main differences are in the names of constructs, the support for classes, and that UML 2.0 SDs have more inline operators. Among the naming differences, MSC instances are called lifelines, and MSC inline expressions are called combined fragments. There are more inline expressions, such as a strict sequencing operator that orders events based on their visual order, i.e. the higher an event appears on the page, the earlier it will occur, irrespective of which instance it's on. UML 2.0 Interaction Overview Diagrams are similar to High-level MSCs (HMSCs), and UML 2.0 Classes and Collaboration Diagrams can be used for MSC documents. See [8] for a more detailed comparison of UML 2.0 Interactions with MSC. For more information on UML 2.0 SDs see the UML 2.0 standard [14].
2.3 Code design: SDL/UML 2.0 Statecharts
SDL allows lower-level details of the processes to be defined compared with MSC. SDL is a state machine notation, and tools such as Telelogic Tau [17] can be used for code generation or simulation. There are constructs in SDL for sending, receiving, task boxes, states, and more. UML 2.0 Statecharts [14] are heavily based on SDL, and the same notations and semantics are used.
Figure 3: UML 2.0 Statechart for Bank.
Figure 3 is a simple UML 2.0 Statechart for the Bank instance of the MSC in Figure 2. In order we have a state Idle; a receive signal statement RequestStatement(cardNum); a decision box CheckCard(cardNum); and a send signal Letter(statement). Note that this implementation of the Bank instance doesn't match the specification MSC, since the Letter signal is not always sent, whereas the specification says that it is always sent. This is something that is detected by testing; either the original specification should be changed, by adding an alt inline construct, or the SDL should be changed, by removing the decision symbol. For the purposes of code generation or simulation, additional information is required, such as the data that is used for signals.
2.4 TTCN-3
TTCN-3 [5, 19] is a relatively new test language, although it has evolved from earlier incarnations that started in the mid-1980s. TTCN-3 is standardised by ETSI, and is starting to become the de facto language for tests. Other standards have started to express their test suites in TTCN-3, for example, the SIP and IPv6 ETSI test specifications. TTCN-3 is ideal for testing message-passing systems, such as Telecoms applications. More recently TTCN-3 has been used in other domains such as automotive, railways, and web content testing; see the TTCN-3 2005 User Conference for presentations on these. There is strong support in TTCN-3 for types, signalling, verdicts, timers, test cases, test configurations, logging, and the usual programming language constructs. One of the main strengths of TTCN-3 is that it has a runtime interface, TRI (TTCN-3 Runtime Interface), and a control interface, TCI (TTCN-3 Control Interface). These interfaces allow the same test cases to be run on different platforms or with different encoding/decoding rules by changing the adapter code, rather than the test scripts themselves.
Figure 4 is a TTCN-3 testcase for the ATM in the MSC in Figure 2. Signals contain a number of components; for example:
User.send(RequestStatement: cardNum);
will send the value cardNum with type RequestStatement over port User. These components need to be explicitly defined in TTCN-3 for a complete test. Also note that in our testcase we have a choice operator alt, which is used here to check for our expected signal, or to check for spurious signals or timeouts. Timers are used on receive signals, so that if our expected signal doesn't arrive within a fixed period of time then the test returns a fail verdict.
testcase Statement_test001() runs on MTCType
{
    log("Test case: Statement_test001");
    User.send(RequestStatement: cardNum);
    toUser.receive(Acknowledgement: userMessage);
    MaxTimer.start(5.0);
    alt {
        [] toBank.receive(RequestStatement: cardNum) {
            MaxTimer.stop;
            setverdict(pass)
        }
        [] any port.receive {
            MaxTimer.stop;
            setverdict(fail)
        }
        [] MaxTimer.timeout {
            setverdict(fail)
        }
    }
}
Figure 4: A TTCN-3 testcase for the ATM as described by the MSC in Figure 2.
Another strength of TTCN-3 is its support for types, templates, and regular expressions. For example, CardNum can be defined using a template, which would have specific values for when signals are sent, i.e. just like a record; but the template used when signals are received can contain regular expressions in some or all fields of the template. Below is an example template, which has specific expected values for accountNumber and pin, but the regular expression ? for securityCode. The ? regular expression means that the field may contain any value.
template CardNum {
    accountNumber := e_accountNumber,
    securityCode := ?,
    pin := e_pin
}
3 Test generation
Over the last eight years or so, we have developed the test generator ptk [2], which is used throughout Motorola. Originally, ptk took MSCs and generated SDL test scripts. Today ptk supports MSC and UML 2.0 Sequence Diagrams, and can generate test scripts in a number of test languages including TTCN-3. ptk takes three main inputs:
• An MSC or UML 2.0 Sequence Diagram – these are used as requirement specifications. ptk supports almost all the constructs in these languages, including HMSC and Interaction Overview Diagrams. The input representation supported is the standard textual format for MSC (called MSC-PR). For UML 2.0 Sequence Diagrams we translate to MSC-PR first, which is almost a one-to-one mapping. These diagrams can be drawn with a number of editors; for example, ptk supports Telelogic Tau [17] and ESG for drawing MSCs, and Telelogic TauG2 [17] for UML 2.0 Sequence Diagrams.
• A directives file – this file has a similar purpose to an MSC document [11] in declaring data aspects, and also contains information such as which instances represent the System Under Test (SUT), and test configuration information, such as the communication ports used and how they are connected to each other. The format for the directives file is non-standard, since there isn't a standard that contains all this information. The UML 2.0 Testing Profile [15] may provide a standard for this information in the future.
• A data specification file – this includes the types and data that will be used by the testcases. The language used for the data file is the same as the language we are generating tests in, so for TTCN-3 test generation, the data file is in TTCN-3.
ptk generates conformance tests from a master MSC, where a subset of the instances describe the behaviour of the SUT. ptk generates one or more test scripts from a single master MSC, basic or high-level, which may reference other MSCs. Specifications of message formats can also be processed along with the MSCs. Depending on which options are invoked, ptk will produce a combination of the following:
• Semantic Analysis – ptk can produce a representation of MSC event semantics in several forms including traces, partial order graphs, or trace trees. The output can be in either textual or graphical format for viewing by graph visualisation tools such as daVinci, or DOT from AT&T Labs.
• Test Analysis – the output is similar in form to the semantic analysis, but restricted just to the events that will appear in the generated tests. The analysis represents the stage before the model is split into individual test scripts.
• Test Scripts – test scripts can also be produced in a variety of formats. Abstract tree representations can be produced in textual or graphical formats; there is a dedicated SDL code generator and TTCN-3 code generator; and lastly a generator that produces code templates. The templates are processed using a C preprocessor and a macro definition library to produce the final concrete test scripts. We have developed a library for producing TTCN-2 scripts, but others could, in principle, be defined for other test languages.
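The graphical outputs mentioned above amount to emitting the event graph in a viewer's input format. As a rough illustration (not ptk's actual output, and with hypothetical event names), a partial-order graph can be written in DOT form like this:

```python
# Sketch only: emit a partial-order event graph (causal edges between MSC
# events) in DOT format, suitable for a viewer such as Graphviz's dot.
def to_dot(edges):
    lines = ["digraph events {"]
    for earlier, later in edges:
        lines.append(f'    "{earlier}" -> "{later}";')
    lines.append("}")
    return "\n".join(lines)

# Hypothetical causal edges for Figure 2 with the ATM as SUT: the initial
# send precedes both receives, which are themselves unordered.
edges = [("send cardNum", "recv Acknowledgement"),
         ("send cardNum", "recv RequestStatement(Bank)")]
print(to_dot(edges))
```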
3.1 Local versus remote test strategies
The default method of ptk is remote testing, in which the test system is connected to the SUT using the actual channels used in the full system implementation. Local testing, the alternative, assumes that the test system is connected directly to the SUT, so that latency effects in the communication channels do not arise.
For remote testing, the test events occur only in the orders that they occur in the MSC, so that if a receive event x always occurs before a receive event y in the MSC, then the same will be true in the generated tests, regardless of the possible orders in which these messages can be sent by the SUT according to the MSC. This method corresponds to a test system that sits at the ends of channels connected to the SUT that behave according to the MSC.
The other view is that the test system is connected directly to the SUT, where latency in a message's transmission between the test system and the SUT cannot occur. In this case the message events in the tests are the inversion of the events as ordered by the SUT instances. That is, the order of events is taken from the SUT ends of the messages, but a send event is replaced by a receive event and vice versa.
For the MSC of Figure 2, where the ATM instance is the SUT, for remote testing we will include three signals in the test scripts: RequestStatement from User, Acknowledgement, and RequestStatement to Bank. The latter two messages may occur in either order in the test scripts. Note that the message Letter does not appear in the test scripts, since it is not involved in interactions with the ATM. For local testing we will have the same signals in the test scripts, but the Acknowledgement signal will occur before the RequestStatement to Bank signal, since this is the order given on the SUT.
Figure 4 gives a simplified version of a local testcase for the MSC in Figure 2, where the ATM instance is the SUT. It's simplified in that the timeout checks for the first receive message have been removed, as well as the logging statements that are generated in practice. The additional data descriptions, ports, and configurations are also not included, but would usually accompany the testcases.
Remote test (left):
    User.send(cardNum)
    then either:
        User.recv(UserMessage)
        Bank.recv(cardNum)
        PASS
    or:
        Bank.recv(cardNum)
        User.recv(UserMessage)
        PASS

Local test (right):
    User.send(cardNum)
    User.recv(UserMessage)
    Bank.recv(cardNum)
    PASS
Figure 5: A remote test (on the left), and a local test (on the right) for the ATM in Figure 2.
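The local ordering can be derived mechanically from the SUT instance: take its events in order and swap the direction of each one, since the tester now sits directly at the SUT's interface. A small Python sketch of this inversion (our own illustration; the event names are hypothetical, not ptk's):

```python
# Sketch: derive a local test from the SUT instance's own event order by
# swapping send and receive, as described in Section 3.1.
def invert_for_local_test(sut_events):
    """sut_events: (direction, signal) pairs in the order the SUT performs them."""
    flip = {"send": "recv", "recv": "send"}
    return [(flip[direction], signal) for direction, signal in sut_events]

# The ATM of Figure 2 receives the request, then sends the acknowledgement,
# then sends the statement request to the Bank:
atm_events = [("recv", "RequestStatement from User"),
              ("send", "Acknowledgement to User"),
              ("send", "RequestStatement to Bank")]

for event in invert_for_local_test(atm_events):
    print(event)
```

The tester thus sends first and then receives the two signals in the SUT's own order, matching the local test in Figure 5.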
3.2 Test generation strategies
The algorithms used for test generation are described in [2], so we only summarise them briefly here. Internally ptk uses a partial-order graph representation of the MSC, where events (e.g. send or receive) are graph vertices, and causal orders are graph edges. The test generation algorithms process this graph representation. The simplest algorithm generates one trace of events per test script. Since many test languages support non-determinism of receive events (for example, TTCN-3 does this with the alt statement), traces can be combined into trees, where there is a choice of receive events at each node. The default algorithm of ptk generates tests in this form, which typically results in fewer testcases than the one-test-per-trace algorithm.
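As a rough sketch of the two strategies (our own minimal Python, not ptk's implementation), one can enumerate every trace consistent with the partial order, then merge the traces into a prefix tree so that alternative receive orders become branches of a single testcase:

```python
# Sketch: events are vertices, and causal orders are (earlier, later) edges.
def linearisations(events, order):
    """Yield every total order of `events` consistent with the partial order."""
    if not events:
        yield []
        return
    # An event is enabled if none of its predecessors is still pending.
    for e in sorted(events):
        if not any(b == e and a in events for a, b in order):
            for rest in linearisations(events - {e}, order):
                yield [e] + rest

def merge_into_tree(traces):
    """Combine traces into a prefix tree: a choice of events at each node."""
    tree = {}
    for trace in traces:
        node = tree
        for event in trace:
            node = node.setdefault(event, {})
    return tree

# Figure 2 with the ATM as SUT: the send precedes both receives, which are
# mutually unordered, so there are two traces but one combined test tree.
events = {"send cardNum", "recv ack", "recv bankReq"}
order = {("send cardNum", "recv ack"), ("send cardNum", "recv bankReq")}

traces = list(linearisations(events, order))
print(len(traces))   # 2: one test per trace with the simplest algorithm
tree = merge_into_tree(traces)
print(list(tree))    # ['send cardNum']: a single combined testcase
```

The tree has one root send followed by a choice between the two receive orders, which is exactly the shape of the remote test in Figure 5.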
4 MSC/SDL testing environment
The testing of SDL design models has typically been achieved using the following methods:
1. Using SDL simulation scripts written as MSC traces, or
2. Using co-simulation, where a simulation of the SDL model is tested using a test suite developed using TTCN-2 [12]
The following sections describe these in further detail.
4.1 SDL simulation scripts written using MSC
In this process a simulation of an SDL design model is automatically produced using the commercial Telelogic Tau tool [17]. Trace MSCs, annotated with explicit data values, are then drawn depicting valid requirement scenarios. These MSCs are then automatically converted into simulation scripts, which provide the stimulus to the simulated SDL model, as well as checking that observations received from the SDL model are correct.
Because trace MSCs typically represent a single possible path through an SDL model, many are needed to achieve a reasonable level of test coverage. Also, with such a large number of MSCs, maintenance problems are encountered due to the tight coupling of data with behaviour specification. For example, when an SDL signal is sent between instances on a trace MSC, the value of the signal is explicitly defined for each instance of that signal. This means that when the signal type is modified, each instance definition must also be modified, resulting in a very large maintenance burden for engineering teams. This also means that value definitions cannot be reused between specification, design, and testing. By providing mechanisms for decoupling data from behaviour specification, we have seen very positive results. In this particular case we parameterised MSC with TTCN-3 data [5] to provide both a decoupling of the value definition from signal instances drawn on an MSC, as well as a more abstract notation for value definition.
In addition, we introduced the ptk test generation tool to aid test coverage and the targeting of tests for different types of testing. In doing so, we found a number of advantages:
• Users could develop more abstract and intuitive HMSC/MSC models. In doing so, they could introduce factorisation through the use of MSC parameterisation and static data expressions. Typically, one abstract MSC would be equivalent to developing 3-4 trace MSCs.
• ptk could be used to generate tests for both simulator testing and target testing. For example, if SDL simulator tests were generated by ptk, the semantics of the SDL simulator (i.e. a subset of MSC semantics) would be taken into account; whereas, if tests were generated for target testing, a different semantic model could be applied (i.e. the full MSC semantics), as well as the generation of adaptive test cases.
As a consequence of our experiences with this type of model testing we are now promoting the use of instance specification as a key strategy for data reuse and reduced model maintenance when using UML 2.0. Here, an instance is a run-time entity with an identity that is distinguishable from other run-time entities. Hence, instance modelling refers to the creation of "signal" instances as objects that can be referenced and defined in an independent manner.
4.2 SDL/TTCN-2 co-simulation
In this process a simulation of an SDL design model is again produced using the commercial Telelogic Tau tool [17]. However, this time tests are manually developed from the requirements using TTCN-2 [12]. An executable test suite is then generated by compiling the TTCN-2 tests with adapter code, allowing the executable test suite to communicate with the simulated SDL model.
This approach has some distinct advantages over the previous method:
• Tests can be written in an adaptive (non-deterministic) manner.
• TTCN-2 has a well-defined operational semantics, including concepts such as default handling and concurrency that are not apparent when using SDL simulator scripts.
• The TTCN-2 tests can be reused to test an SDL simulation and the target without modification, by adapting the way in which the executable tests communicate with the System Under Test.
To enhance this approach we parameterised MSC with the TTCN-2 data language, supported by the ptk test generation tool.
5 UML 2.0 testing environment
In our UML 2.0 testing environment we have again used Telelogic's tools, this time TauG2 [17]. The TauG2 tools support UML 2.0 with editors, code generators, and simulators. TauG2 is relatively new, though, so is less advanced than Tau in some respects, such as the simulator, which is called the model verifier in TauG2.
For testing UML 2.0 SDs against UML 2.0 Statecharts in TauG2, we have had to provide a number of additional tools. The whole testing environment is illustrated in Figure 6. We have provided tools for: converting SDs to MSC; generating TTCN-3 tests from SDs using ptk; converting UML 2.0 data to TTCN-3 data using the in-house tool UMB; encoders and decoders for the signal data; and the interface code between TTCN-3 and the UML 2.0 Statecharts. The interface code was written using TCP/IP sockets.
When executing a TTCN-3 test against the model verifier we are able to monitor signals on both sides with trace logs, which are in the form of Sequence Diagrams. This can be useful for tracing defects.
Figure 6: UML 2.0 testing environment.
After testing a simulation, the next step is to test the target code. The same TTCN-3 tests can be used here, simply by changing the TTCN-3 TRI/TCI interfaces for the target platform.
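The effect of swapping adapters can be pictured with a small sketch (Python, with hypothetical class names; the real interfaces are the standardised TRI/TCI, not these): the abstract test only talks to an adapter, so retargeting means supplying a different adapter rather than rewriting the test.

```python
# Sketch of the adapter principle: the test script is platform-neutral, and
# only the adapter that carries signals to the SUT changes.
class ModelVerifierAdapter:
    """Delivers signals to the simulated model (hypothetical)."""
    def send(self, signal):
        return f"model-verifier <- {signal}"

class TargetAdapter:
    """Delivers signals to target code, e.g. over a TCP/IP socket (hypothetical)."""
    def send(self, signal):
        return f"target <- {signal}"

def run_test(adapter):
    # The same abstract test runs unchanged against either platform.
    return adapter.send("RequestStatement(cardNum)")

print(run_test(ModelVerifierAdapter()))  # model-verifier <- RequestStatement(cardNum)
print(run_test(TargetAdapter()))         # target <- RequestStatement(cardNum)
```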
6 Related work
Our focus has been on MSC/SDL and UML 2.0 models, which are extensively used in the Telecoms domain, and are now starting to be used in other domains. The reason we have provided tools such as ptk for test generation, and plug-ins for existing tools like TauG2, is that there have not been off-the-shelf tools that we could use. AutoLink [7] is available in Tau [17], but doesn't meet our needs, since it supports only a limited subset of MSC, it is semi-automatic, and it generates one test per trace in TTCN-2 only. TTCN-3 is relatively new, and some tools are starting to appear which support its graphical format, such as the Testing Technologies tool suite [18]. Recently, the OMG have been developing a UML Testing Profile standard [15], which will hopefully provide a modelling language for testing that tool vendors will support.
7 Conclusions
Model-driven engineering has proven to improve software quality, and at the same time reduce effort by allowing for more automation [3]. Tool support is essential for MDE, though, and to fully support testing in an MDE environment we have needed to provide additional tool support beyond what is available externally, such as test generators, interfaces between model code and test code, and data conversion tools, among others.

ptk, developed in Motorola Labs for test generation, has been used by several diverse groups
within Motorola over the last eight years. In general we have found there is a 33% reduction in effort compared to the manual approach. This takes into account issues of learning new technology, and the usual teething problems with migrating to a new environment. The other benefits are better coverage, better quality of tests, faster turnaround time (from changing the specification to generating the tests), and easier maintenance. There are also side-benefits such as analysing the requirement specifications for defects. Indeed, we have also developed a tool that picks up defects in MSC requirements, such as race conditions and non-local choice, which we have written about elsewhere [1].
We have briefly described two MDE testing environments in this paper, one with MSC/SDL and the other with UML 2.0. These environments allow for the early detection of errors in the requirement specifications, which saves cost compared to errors that occur later in the lifecycle. In both environments it was important to use standard notations with well-defined semantics. The TTCN-3 language in particular has proven to be well-suited for describing data and tests. The TRI/TCI interfaces to TTCN-3 allow testing to be moved from a simulation environment to a real target platform by changing the adapters rather than the tests; this again saves effort.
With the advent of UML 2.0 the MDE approach is likely to become more widespread within the telecoms industry, as well as other industries. There is currently still a need for robust and complete tool support, though. With better tool support, software quality is likely to be improved and cost reduced, which is typically the bottom line in competitive industries.
References
[1] P. Baker, P. Bristow, S. Burton, C. Jervis, D. King, B. Mitchell, and R. Thomson. Detecting and resolving semantic pathologies in UML sequence diagrams. In European Software Engineering Conference, Foundations of Software Engineering (ESEC-FSE'05), Portugal, Sept. 2005.

[2] P. Baker, P. Bristow, C. Jervis, D. King, and B. Mitchell. Automatic generation of conformance tests from Message Sequence Charts. In Telecommunications and Beyond: The Broader Applicability of MSC and SDL, LNCS 2599, pages 170-198. Springer-Verlag, 2003.

[3] P. Baker, S. Loh, and F. Weil. Model-driven engineering in a large industrial context - Motorola case study. In ACM/IEEE 8th International Conference on Model Driven Engineering Languages and Systems (MoDELS/UML 2005), Jamaica, Oct. 2005.

[4] B. Beizer. Software Testing Techniques. Van Nostrand Reinhold, New York, 1990.

[5] European Telecommunications Standards Institute (ETSI). Methods for Testing and Specification; The Testing and Control Notation version 3 (TTCN-3); Part 1: TTCN-3 Core Language, Feb. 2003. Available from http://www.etsi.org.

[6] T. Gilb and D. Graham. Software Inspection. Addison-Wesley, London, 1993. ISBN 0-201-63181-4.

[7] J. Grabowski. Specification Based Testing of Real-Time Distributed Systems. PhD thesis, University of Lübeck, 2002.

[8] Ø. Haugen. Comparing UML 2.0 interactions with MSC-2000. In SAM 2004: SDL and MSC Fourth International Workshop, LNCS 3319, pages 69-84. Springer-Verlag, 2004.

[9] International Telecommunication Union (ITU-T). ITU-T Recommendation X.680: Abstract Syntax Notation One (ASN.1): Specification of Basic Notation, 2002. Available from http://www.itu.int.

[10] International Telecommunication Union (ITU-T), Geneva. ITU-T Recommendation Z.100: Specification and Description Language (SDL), 2002. Available from http://www.itu.int.

[11] International Telecommunication Union (ITU-T), Geneva. ITU-T Recommendation Z.120: Message Sequence Chart (MSC), 2002. Available from http://www.itu.int.

[12] International Telecommunication Union (ITU-T). ITU-T Recommendation X.292: The Tree and Tabular Combined Notation (TTCN-2), 1997.

[13] M. Nelson, J. Clark, and M. A. Spurlock. Curing the software requirements and cost estimating blues: The fix is easier than you might think. Program Manager, XXVIII(6), 1999.

[14] Object Management Group (OMG). UML 2.0 Superstructure Specification, Aug. 2003. Available from http://www.omg.com.

[15] Object Management Group (OMG). UML 2.0 Testing Profile Specification, 2003. Available from http://www.omg.com.

[16] Telelogic AB, Sweden. DOORS documentation, 2005. Information available from http://www.telelogic.com.

[17] Telelogic AB, Sweden. Tau and TauG2 documentation, 2005. Information available from http://www.telelogic.com.

[18] Testing Technologies IST GmbH. TTWorkbench, 2005. Information available from http://www.testing.tech.de.

[19] C. Willcock, T. Deiß, S. Tobies, S. Keil, F. Engler, and S. Schulz. An Introduction to TTCN-3. John Wiley and Sons Ltd, England, 2005. ISBN 0-470-01224-2.
MOTOROLA and the Stylized M Logo are registered in the US Patent & Trademark Office. All other product or service names are the property of their respective owners.
Towards the Holy Grail of Software Testing
Ian Gilchrist, IPL
In my presentation to UKTest 2003 (entitled ‘Limits to Testing; Limits to Test
Automation’) I attempted to define the Holy Grail of software testing as being the
ability to extract information from a software design repository and automatically
generate test cases from this to run against code produced (by an independent route)
from the same design specification. I gave at the time a number of reasons why this
was difficult to do, but hinted that stirrings of industrially useful activity were being
detected. The aim of this presentation is to update delegates on some progress that is
being made in this area.
One of the difficulties I mentioned earlier was that no commercially viable products
will be produced unless there is a recognised software design standard which has
enough critical mass to encourage the product vendors to create saleable products. It
is easy enough (with a lot of money) to create ‘boutique’ products which will work
for one specified product chain, but much harder to create something with mass
appeal. The moment seems reasonably ripe to be able to proclaim that UML V2.0 is
now showing signs of offering the necessary power/cost ratio to encourage the growth
of a cottage industry around it, which will include test generation. Indeed, UML 2.0
has already spawned a Testing Profile Specification which is worth a look.
There are currently a number of vendors in the UML-design tool area. One of
these is the I-Logix company, with its Rhapsody product now at Version 6. Rhapsody
has for some while included the ability to generate high-level ‘test vectors’ based on
UML Sequence diagrams. These are usually used to help validate an emerging UML
model (i.e. confirm that the system design is going in the ‘right direction’). In the case
of Rhapsody this test vector execution facility is known as TestConductor. As is
reasonably well-known, a UML design in Rhapsody can also be used to generate
code, typically C++, which contains most of the C++ class templates needed to turn a
validated design into working code.
However, I-Logix also took a view approximately 2-3 years ago that helping users
generate software test scripts based on details of the UML design would be a very
useful feature. In the aerospace industry in particular the production and successful
running of software unit tests is an important aspect of getting software certified as
‘fit to fly’. Crucially it is not sufficient simply to run the tests but they also have to
show evidence of enough test coverage. At the highest (safety-critical) level this
includes Modified Condition/Decision Coverage (MC/DC) as well as the more
obvious Statement and Decision forms of coverage. Achieving this high level of
coverage is a significant burden - at least, that’s the way it’s seen.
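To make the MC/DC requirement concrete: each condition in a decision must be shown, by a pair of tests, to independently affect the decision's outcome. A minimal illustration (ours, not from the presentation) for the decision `a and b`:

```python
# Illustration: three test vectors achieve MC/DC for `a and b`, versus four
# for exhaustive multiple-condition coverage.
decision = lambda a, b: a and b

vectors = [(True, True), (True, False), (False, True)]
outcomes = [decision(a, b) for a, b in vectors]
print(outcomes)  # [True, False, False]

# (T,T) vs (T,F): only b changes and the outcome changes -> b shown independent.
# (T,T) vs (F,T): only a changes and the outcome changes -> a shown independent.
```

For decisions with many conditions, the saving over exhaustive coverage grows quickly (roughly n+1 vectors rather than 2^n), which is why MC/DC is the accepted compromise at the safety-critical level.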
Accordingly in 2002 I-Logix contracted with Bremen-based OSC to produce a tool
(ATG, standing for Automatic Test Generator) which would work from UML design
data within the Rhapsody tool database to generate test cases intended to ensure that
tests based on these would enable a ‘high’ level of coverage to be reached. In fact, test
goals can be set as options within ATG as ‘states’, ‘transitions’, ‘events’ and MC/DC.
Users also have the option of specifying whether tests should be at the unit,
integration, or high-levels of testing.
UK Software Testing Research III
ATG uses proprietary heuristics (based on standard UML data - class and object
diagrams, statecharts, activity diagrams, and implementation of operations in C++
notation) to generate a Test Case file in either ASCII or XML formats. This data can
then be exported to a variety of test script formats, one of which is Cantata++. I-Logix
chose to make Cantata++ one of the standard formats offered because they saw it as
being one of the leading test tools of its type.
If accepted, my presentation will include a short (approx 15 minutes) demonstration
of Rhapsody, ATG and Cantata++ in action, working from a UML design, through
the auto-generation of test data, and a working Cantata++ test script. Since this is a
very fast-moving area of technological development it is not possible at this stage
(March 2005) to say exactly what will be demonstrated in September 2005.
To paraphrase Francis Fukuyama’s famous declaration on the fall of the Berlin Wall
that we were witnessing the ‘end of history’, is it now reasonable to say that we are
witnessing the ‘end of testing’? Not quite. Some issues remaining will include the
following:
• How ‘meaningful’ are the generated test scenarios? Do they lead to a
heightened sense of confidence that the code being tested is really suitable for
use?
• How easy is it to calculate expected values and insert Checks for these?
• Can simulation techniques (stubs and wrappers) be easily employed to enable
true unit testing to be carried out?
These and other questions will mean that software testing practitioners are not yet out
of a job!
UKTest 2005
UKTEST 2005 September 2005
Page 1 of 18 E. Perez-Miñana, JJ Gras
Improving fault prediction using Bayesian Networks for the development of embedded
software applications
Elena Pérez-Miñana
Motorola Labs / Basingstoke, UK
elena.perez-minana@motorola.com
Jean-Jacques Gras
Motorola Labs / Paris, France
jjgras@motorola.com
ABSTRACT
Predicting faults early in software projects has become possible thanks to the
utilization of innovative techniques in causal models that can combine observations of
key quality drivers with expert opinion and historical data.

During a joint effort between Motorola Labs and a Motorola Toulouse software
engineering group, Bayesian Network (BN) models were constructed for predicting the
faults inserted or found at each phase of a software development project. Defect
predictions were compared against product and process quality goals to help the
organization drive corrective actions during the project in a more timely manner. This
methodology contributed to a reduction of latent defects and an associated Cost of Quality
(COQ) improvement.

The scope and usage of these models were expanded through the specification of a
calibration method that led to the improvement of the BNs prediction models. In addition,
the sets of models were extended to cover new defect types and new areas of software
development such as system testing.
This paper describes the results that were generated from the validation and
refinement of the improved BN models, outlining a set of clear guidelines for the
generation of a good predictor.
1. Introduction
Characterizing software production and quality is becoming ever more important
to Motorola, given the ever-growing amount of software embedded in its products. Reliability
is the main product attribute traditionally used to assess product quality both for hardware
and software. Unfortunately, the assumptions held by existing software reliability
engineering methods are far from being true in a production context and their results,
based on failure count during testing, come too late to offer a chance to perform
significant corrective actions if the predicted product quality is not meeting expectations
[4], [10]. In order to anticipate and satisfy the quality levels expected by customers while
controlling costs, high maturity organizations need to adjust their process continuously,
based on data and predictive methods.
A review of a set of existing Motorola Labs defect prediction models [6], representing
different software development activities, was conducted by the organizations involved.
In doing so, they were adapted to fit the local context using
Motorola Labs BN construction tools. The Bayesian Test Assistant (BTA), one of the
components of this set of tools, enables end-users to interact with BN models, and get
predictions or run what-if scenarios at the component and system level once the models
have been built, and satisfactorily calibrated. The modelling technique used provides the
means to build causal models that can combine observations of key quality drivers with
expert opinion and historical data, offering a better opportunity to build a good predictor
than other statistical analysis techniques [1].
Motorola Toulouse, in a process improvement effort, decided to deploy a new defect
prediction method in collaboration with Motorola Labs. Using BTA, an overall process
model was then composed from the individual activity models to match the project
structure and its process. During 2003, data was collected throughout the development
and predictions were generated periodically from early stages to improve defect
containment and product quality.
As a continuation of the initial project, the set of Bayesian Network (BN) models for
fault prediction was calibrated to the local context of this high-maturity software
development process. Research was then conducted to provide: (1) a method for the
improved calibration of prediction models built using BNs, and (2) an extension of the existing
set of models to cover new defect types and areas of software development, such as
system test.
This publication describes the results from the validation and refinement of the new
BN models that were constructed for the embedded software development process
deployed in Motorola Toulouse. The validation was conducted using data that was
collected from various participants of the software development and testing team. This
was done through a web-based questionnaire designed by looking at the input nodes of
the network models that were being built. The questions finally included in the
questionnaire covered all of the model input factors. Once the answers
provided by the development team were collected, they were fed to the network models,
allowing us to generate predictions for each of the networks' outputs.
values constitutes a distribution which we label as the “predicted distribution” (PreDst).
This distribution is compared against the “actual distribution” (ActDst) recorded by the
Motorola Toulouse team during the characterization of their software development
process (SDP).
2. Deployment approach
Figure 1 Motorola Toulouse BN structure
Figure 1 shows a high-level view of the BN model structure corresponding to a
component life cycle of the process used by the Motorola Toulouse organization; details
of the procedure followed to build these models are described in [1]. There are seven sub-
networks in the whole model. Each one comprises the factors, and the factors' inter-
dependencies, used in one phase of the life cycle. A summary of the approach
followed to deploy the BN defect prediction models is given in the steps listed below:
• Identification of the different software components developed during the project,
and used in the product or final deliverable.
• Selection of the activities associated with the component development phases. These
are based on each component's life cycle, which was similar across the
components here. It is possible to start from an existing generic library of activity
models, thus reducing the effort required to build and validate completely new
models each time. Additionally, we can reuse the BN model library developed for
similar development processes.
• Revision of the selected activity models. This step usually entails adapting them
to the local context of use (for instance updating the definition of a variable or
scaling it). This step was very important as the Motorola Labs model library was
still at a research stage.
[Figure 1 shows the following sub-networks and nodes: Requirements BBN, Design BBN, Coding BBN, Feature Integration, Inspection / Unit Test, Syst Test / Field Test (product), CORE BBN, Feature / Component 1, Size.]
• Composition, through our internal BN deployment tool, BTA, of the component
defect models. These are tailored to the project structure by connecting activity
models together in branches, each branch matching the sequence of development
activities of a component.
• Interconnection of all component branches into a product model, thus
representing the integrated component predictions at the product level.
In the case of the development of a big feature, an additional level of decomposition
into sub-components or sub-features may be necessary to give predictions with a more
appropriate granularity.
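The branch composition described above can be sketched in a few lines (a toy illustration of the idea only, not the BTA tool itself): each activity model in a branch consumes its own phase inputs plus the fault density transmitted from the upstream model.

```python
from typing import Callable, Dict, List, Sequence

# A phase model maps (phase inputs, upstream fault density) -> predicted FD.
PhaseModel = Callable[[Dict[str, float], float], float]

def run_branch(models: Sequence[PhaseModel],
               phase_inputs: Sequence[Dict[str, float]]) -> List[float]:
    """Chain activity models into one component branch: each model receives
    the fault density predicted by its upstream neighbour."""
    fd = 0.0
    predictions = []
    for model, inputs in zip(models, phase_inputs):
        fd = model(inputs, fd)
        predictions.append(fd)
    return predictions

# Hypothetical toy models for a Requirements -> Design -> Coding branch
# (the coefficients are invented purely for illustration):
req = lambda x, up: 0.05 * x["complexity"] + up
des = lambda x, up: 0.10 * x["complexity"] + 0.5 * up
cod = lambda x, up: 2.00 * x["complexity"] + 1.5 * up
```

`run_branch([req, des, cod], …)` returns one predicted fault density per phase; interconnecting several such branches at a product-level node then gives the integrated product prediction.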
Thanks to the graphic representation of BN models, only limited modeling knowledge
was necessary to review, adapt, and compose the models. This expertise was provided
through Motorola Labs support during the initial transfer phase. As mentioned
previously, encouraging results had been obtained during this phase, highlighted by a
reduction of latent defects and the associated Cost of Quality improvement, prompting
the decision to continue with this practice and expand the scope of use of the models. To
this end, we then validated and improved the network models for the following phases:
Requirements, Design, and Coding.
The calibration enhancements (mean and variance) are achieved through the
validation/calibration of the nodes that comprise the existing models using historical data
that was collected in the 2003 projects and in new projects completed in 2004 as part of
the quality process improvement activities of the organization for which the network
models are intended. For this purpose a data collection tool, the “Web Questionnaire”
was deployed in Motorola Toulouse. At the same time, the first model set representing
component development is extended by including a “code inspection” model, and a
“system test effectiveness” model for system testing.
We wanted to ensure that the different software development groups working in
Motorola use the models to their best advantage. To this end, in addition to building BNs
that represent the factors currently identified, we are keeping abreast of the findings of
research groups working in the same area in other organisations. This thread of study
provides us with information that allows us to build networks that include other factors,
for example new metrics, important in software improvement processes and currently
not recorded by our software groups. We think it is appropriate to start collecting data
with these new factors in order to build better predictors in the future. A detailed
description of the results obtained with this study is out of the scope of this paper.
Details of the new type of factors that are used to monitor the software requirements
phase are described in [12]. A brief description of how these factors can be integrated with
the information currently recorded by the software development groups we are
supporting can be found in section 3.2.
The following sections provide details of the network design process, the validation
method used, and the results obtained with the Motorola Toulouse case study.
3. Validation Procedure
The method used for validating the BNs is an iterative and incremental one due to the
nature of the problem. The phases of the software development process are conducted in
a certain order, which means that the models themselves must be computed and fine-tuned
sequentially, as each model was developed to predict the fault densities incurred in one
phase of the process, given the values of the factors judged to influence defect insertion
or removal in that phase.
The Motorola Toulouse organization calculates and records, as part of its monitoring
activities, the fault density that developers have detected in the requirements documents.
It also calculates the fault density associated with the designs, and with the software that is
produced during the coding phase. It is very clear that the number of faults detected at
any stage of the process depends on the quality of the software artefacts that are produced
in the previous phases. This is reflected in the network models through links that are used
to transmit the computed values for a particular upstream network to those of the
downstream network. We use the Netica tool [10] to build the networks. The nodes that
will be used as “links” between networks are specified in the same way in these
networks. We subsequently use the BTA tool to create these links and to be able to
transmit computations between networks. The BTA output is used to convert the model
output into a predicted range.
We entered and used the cases that were recorded by the Motorola Toulouse team for
the validation and fine-tuning. Nowadays many methods can be applied to validate and
improve probabilistic models using data describing the situation to be
modelled. For any of them, one of the most important considerations is the quality of this
data. The validation method chosen depends on the degree of confidence associated with the
data, because some methods need more reliable inputs. Last but not least, in order
to produce a good predictor, the model must not be too dependent on the data that was used to
compute it, i.e. the solution must be a good “generaliser”, and this can only be
achieved if the solution is not fitted too closely to the available data. Deciding on the
appropriate level of “closeness” of fit is also difficult, and very dependent on the situation or
problem being solved.
ACTUALS          Req (OF/page)   Des (OF/page)   Code Ph (OF/KLOC)
ACT_FD_PR1           0.01            0.18            43.63
ACT_FD_PR2           0.07            0.38             7.98
ACT_FD_C2            0.08            0.25             9.69
MEAN_ALL             0.06            0.27            20.43
Table 1 Fault Densities recorded by Motorola Toulouse
For the estimation of a set of inter-dependent probabilistic networks that are reliable
predictors of the fault density associated with a process described through a set of well-
defined factors, we have developed a procedure that makes use of two different
estimation methods, which complement one another reasonably well. A general outline
of the procedure runs as follows:
1. Arrange the network models according to the dependencies that exist between
them.
2. For each model mi :
a. Make sure that the nodes in the model cover all the factors that affect the
associated development activity. Also check the inter-dependencies
between these factors. See Section 3.2 for details.
b. Transform the range of the network output (mi) so that it matches the
range of the output factor against which it is going to be compared,
allowing for variance calibration from a full history of project data
provided by Motorola Toulouse. In this case we are predicting fault
densities; therefore the output for each network (FDmi) needs to be
mapped to the range of the fault density computed at the relevant phase.
c. The values used for the calibration of the mean FD for our models can be
found in Table 1. Each column corresponds to the FD recorded by the
development team during a particular phase of the process (Requirements,
Design, Coding). The original data was recorded during the development
of three projects. Each row in the table corresponds to the FDs that were
calculated in the different phases of the process executed for a particular
project.
d. Run the calibrated network model mi with the BTA tool using as inputs
each of the cases recorded on the Web-questionnaire. Record the FDmi
computed by the model and also the relative error with respect to the
expected fault density.
e. Depending on the variability of the data, it might not be possible to
compute a good predictor just by calibrating current models. In these cases
there are two possibilities:
i. Review the activities being modelled to determine whether there
are missing factors or whether some of the existing ones have not
been correctly specified. If this is the case, it may be necessary to
modify the structure of the existing network, nodes and/or links,
and initiate another iteration in the validation procedure.
ii. If we have enough data to produce a reasonable probabilistic
network, an attempt can be made to build a new model through
statistical analysis. We developed a new approach based on
principal component analysis (PCA) to create intermediate nodes,
and linear regression to determine the relation between these
intermediate nodes and the output node, details of the procedure
can be found in Section 3.1. The estimation features used are
available in statistical packages such as Minitab [8]. The resulting
predictor will compute very good values for the data used to build
the regression model. In our case the data set was small but the fit was
very good. Nevertheless, it is very likely that for data that
was not part of the estimation procedure the predictions will be
nowhere near as good, because predictors computed in this way usually
turn out to be poor “generalisers”.
3. Once a reasonable estimator mi has been computed, it can be linked to the
network mi+1, initiating a new iteration of the calibration and fine-tuning
procedure.
3.1. Statistical Analysis
PCA is a data reduction technique used to identify a small set of variables that
account for a large proportion of the total variance in the original variables that describe a
population. By calculating the principal components (PCs) of a population from a sample, it
is possible to determine which of the population’s descriptors (scores) are responsible for
its variance. These scores can subsequently be used to build the best estimator for
computing the predicted values which, in our case, correspond to the fault densities
recorded in an organisation. In our approach, we are using linear regression for this last
step.
Details of the steps carried out are the following:
1. Compute the PCs of the input data that has been collected with the web-
questionnaire defined for the first stages of development.
2. The PCA also provides information on how dependent the
variance of the population is on each of the components that have been calculated.
3. Select the number of principal components (PCs) that are responsible for at least
90% of the population’s variance. It is important to try and select as small a
number of PCs as possible because this will result in a simpler network.
4. Build the linear model that will allow us to estimate the desired fault density using
the scores calculated with the PCA method. This step is realized using linear
regression.
5. Structure a BN which has the following nodes:
a. One input node for each of the factors that are relevant for the activity
phase of the process under consideration.
b. One intermediate node for each of the scores that were computed with the
PCA. It might be necessary to use more than one intermediate node to
reduce the number of inputs per node, since a large number of inputs
increases the complexity of the BN.
c. An additional layer of intermediate nodes has to be built to represent the
linear regression model. These nodes will be linked to the final output
node that is used to compute the fault density.
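Steps 1–4 above can be sketched as follows (a minimal illustration using NumPy's SVD for the PCA and least squares for the regression; the function and variable names are my own, not those of the Minitab-based workflow the paper used):

```python
import numpy as np

def fit_pca_regression(X, y, var_threshold=0.90):
    """PCA on the questionnaire inputs, then linear regression from the
    PC scores to the recorded fault densities (hypothetical helper)."""
    # Step 1: centre the inputs and compute principal directions via SVD.
    mu = X.mean(axis=0)
    Xc = X - mu
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Steps 2-3: keep the smallest number of PCs explaining at least
    # var_threshold (the paper uses 90%) of the total variance.
    explained = s ** 2 / np.sum(s ** 2)
    k = int(np.searchsorted(np.cumsum(explained), var_threshold)) + 1
    scores = Xc @ Vt[:k].T            # the intermediate-node values
    # Step 4: least-squares linear model from scores (plus intercept) to y.
    A = np.column_stack([scores, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)

    def predict(X_new):
        S = (np.asarray(X_new) - mu) @ Vt[:k].T
        return np.column_stack([S, np.ones(len(S))]) @ coef

    return predict, k
```

As the text warns, a predictor fitted this way can match the training data closely yet generalise poorly; checking it against held-out cases is essential.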
In those cases in which the PCA produces results showing little variance in the input
data, it might be necessary to go back and discuss the matter with the team that provided
the information through the web-questionnaire, because it will probably be
necessary to collect further information in order to compute a good
fault-density estimator.
The approach described previously provides structure and clear guidelines to compute
the probabilistic networks that give support for monitoring and improving the overall
software development process of an organization. One thing that should be noted is that it
might not work in all cases. If the data cannot be expressed with a linear function, it is
likely that the computations would produce a very complex solution that would fit the
data well but not “explain” it better than the models produced by experts. It might also be
the case that the input data used for the estimation is not reliable enough to ensure
good results from the estimation procedure described.
This highlights the importance of implementing sound methods to collect the data
generated during any development process.
For those cases where it is possible to compute a reasonable estimator using linear
regression and PCA techniques, we still need to determine how good the solution is in
relation to predicting values when new data is fed to the model. This constitutes one of
the aims of the project for 2005, i.e. the specification of a procedure that will allow us to
determine the generalisation capability of an estimator that has been computed using our
method.
In the next section we include details of the application of the validation/estimation
procedure that was described previously. The case study is the fault density prediction
networks developed for Motorola Toulouse.
3.2. Determining the factors/nodes for a BN
When building a BN from expert opinion for a particular group, we conduct a
number of interviews with the development team to discuss the type of information they
record to measure the quality of their processes, and any new factors they identify as quality
drivers but do not currently record. Once this matter has been clarified, it is necessary to
group the factors that are inter-dependent to reflect these dependencies in the network.
For example, in the case of the BN model for the requirements phase an element that will
affect the quality of the output, i.e. the requirements specification, is its structural
complexity.
Figure 2 Example Requirements BN model
The BN model shown in Figure 2 represents the factors and factors’ inter-
dependencies of the requirements process followed in Motorola Toulouse. The input and
output nodes of the network are described in Table 2. The intermediate nodes used in the
network integrate information that, in the experts’ experience, is brought together during
the decision process at any stage of the requirements phase. In some cases, the
intermediate nodes allow the network architects to explicitly express the dependency that
exists between some of the factors used to build the network.
Apart from deciding the inter-dependencies that exist between the input nodes, it was
also necessary to decide on the strength of the links associated with each of these inputs.
For example, the REQ_Accuracy node in the requirements BN has two inputs:
domain_knowledge, and time_pressure. Given that the distribution of the REQ_Accuracy
node is estimated using linear weighting, it is necessary to associate a weight value to
each of the inputs. The process of deciding on these weights involves a certain element of
trial and error; therefore care must be taken at the time of deciding on the best solution.
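The linear weighting mentioned above can be illustrated as follows (a hypothetical sketch; the calibrated Motorola weight values are not published here). An intermediate node such as REQ_Accuracy is scored as a weighted combination of its parents:

```python
def linear_weighted_node(parent_values: dict, weights: dict) -> float:
    """Estimate an intermediate node's value as the normalised weighted
    mean of its parent nodes' values (all on a common 0-1 scale)."""
    total = sum(weights.values())
    return sum(parent_values[p] * w for p, w in weights.items()) / total

# Hypothetical weights: domain knowledge counts twice as much as time
# pressure. Parent values are assumed already oriented so that higher
# always means "better for accuracy" (i.e. time pressure inverted).
acc = linear_weighted_node(
    {"domain_knowledge": 0.8, "time_pressure": 0.2},
    {"domain_knowledge": 2.0, "time_pressure": 1.0},
)
# acc = (0.8 * 2.0 + 0.2 * 1.0) / 3.0, i.e. approximately 0.6
```

Tuning these weights is exactly the trial-and-error step the text describes: each candidate weighting is checked against the recorded project data.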
Node Name: Definition
REQ_Domain_Knowledge: a subjective assessment of the domain knowledge the project requirements team has.
GEN_REQ_F01_V08.dne: the component's generic requirements model label.
REQ_Time_Pressure: a measure of the time pressure the project team is under during the requirements phase; five states are defined for this factor.
REQ_Accuracy: a quality driven by the team's domain knowledge and the time pressure it is under.
REQ_Problem_Complexity: a subjective measure of the complexity of the application.
REQ_Uncomprehensiveness: represents comprehensiveness, completeness, and consistency (plus coverage) in requirements documents.
REQ_Stability: a subjective assessment of the evolution of the requirements at the end of the phase.
REQ_Process_knowledge: a measure of how well the requirements process is followed; five states, each defining a higher level of requirements process applicability.
REQ_Communication: a measure of the level of team communication that exists.
REQ_Challenge: the challenge the development team is confronted with as a result of, for example, missing requirements and requirements instability. Part of the holes in the requirements could be filled by the team's design capability.
Table 2 Requirements network's input/output nodes' description
In our example three nodes, i.e. problem_complexity, time_pressure, and
domain_knowledge, are specified and associated to provide a measure of
“uncomprehensiveness” which allows us to represent in the BN the impact these elements
have on the overall complexity of the final requirements. Following the guidelines of the
IEEE standards on requirements specification [13] we have produced a BN which
provides a more fine-grained description of our understanding of the term “structural
complexity” by incorporating the following factors to the BN model: unambiguity,
modifiability, and verifiability. These factors provide a better description of the quality of
a requirement’s structure because they describe it in terms of their clarity, how easy it is
to change them, and how easy it is to test them. We believe this network should be a
better predictor, as it includes more information about the requirements process used by the
development team. The extent to
which this is the case will be part of future validation research.
In addition to the specification of the dependencies that exist between the various
factors, it is also necessary to specify an output node that will compute a value that can be
compared to the fault density values recorded by the development group. This is achieved
by mapping the requirements challenge value estimated by the BN to a value that
falls into the range of the fault densities that have been recorded.
In certain cases, it might be necessary to define additional intermediate nodes. This
occurs when a large number of input factors are inter-dependent, as is the case in the
network models that were built using the statistical analysis approach. When it is
necessary to use all the inputs to calculate the principal component in the BN model, the
computational cost of calculating that node becomes prohibitive because of the exponential
growth of the node's conditional probability table. To reduce node complexity, the inputs
are distributed amongst various intermediate nodes, which are brought together again by
adding an additional layer of intermediate nodes to the network.
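The blow-up, and the benefit of splitting, can be seen with a quick count (the arities below are illustrative; the actual node state counts in the Motorola models may differ). A node's conditional probability table holds one entry per combination of parent states times its own states:

```python
from math import prod

def cpt_entries(parent_states, node_states=5):
    """Entries in a node's conditional probability table:
    (product of the parents' state counts) x (the node's own state count)."""
    return prod(parent_states) * node_states

# One 5-state node fed directly by eight 5-state inputs:
flat = cpt_entries([5] * 8)        # 5**9 = 1,953,125 entries

# The same eight inputs split over two 4-parent intermediate nodes,
# joined by a combiner node with those two intermediates as parents:
split = 2 * cpt_entries([5] * 4) + cpt_entries([5, 5])   # 6,375 entries
```

Adding one intermediate layer thus cuts the table size by roughly a factor of 300 in this example, at the cost of extra nodes to specify.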
4. Motorola-Toulouse Case Study
The following section describes the results obtained by using Bayesian networks to
build fault density predictors for Motorola Toulouse. The models were built using an
iterative procedure; therefore various network models were generated throughout. The
following discussion includes some of the solutions produced which are restricted to the
Requirements, Design, and Coding phases.
The procedure for validating/fine-tuning the Motorola Toulouse models follows
the one outlined in Section 3. Each of the models mentioned corresponds to a version of
the networks generated during an iteration of the validation/design procedure:
o INITIAL_MODELS: these are the initial networks that were built in consultation
with the Motorola Toulouse organisation. Each model is extended to produce a fault
density output FD. In the initial stage of the network development, a consultation
was carried out to decide the factors that were relevant at each stage of the
process. Once this matter was agreed, it was necessary to decide how the various
networks were going to be linked together. This step is important because it is
through this structure that information is fed between networks. The networks
were run to gain an initial insight into the type of predictions it was possible to
compute. The columns labelled INIT_MODEL in the tables of sections 4.2, 4.3,
and 4.4 list the PreDst computed by this particular set of networks. There is also
an adjacent table that lists the relative errors that each network model is
committing when attempting to compute the ActDst. The relative errors
associated with the INIT_MODEL are the highest in the entire table, showing that it
is clearly necessary to calibrate all of the network models.
o CALIB_MODELS: there are two further sub-sets of models in this group. Each
one includes a version of the network models in which the nodes were calibrated
in order to achieve better results for the various sets of inputs provided by
Motorola Toulouse. We have included two versions that differ from each other in
the weights associated to the links connecting the input nodes with the
intermediate nodes.
o REGRESSION_MODELS: this set of models was computed using linear
regression and PCA. It produces the best set of results as can be seen from the
contents of the relative errors tables listed in sections 4.2, 4.3, and 4.4. At present
we are not able to determine the generalisation capabilities of these networks;
therefore it is not possible to discuss this matter further at this stage.
The following subsections discuss the results of the network models that were built for
each phase of the Motorola Toulouse process using the validation procedure.
4.1. Data sets
The data-points that were fed to each of the networks correspond to averages over the
data collected for each project used as input in this design process.
o PR1_AVR: is the average over the data collected from the individuals of the team
working on one of the projects they developed during 2003. For confidentiality
reasons the project has been labelled PR1. Each input that was fed to the networks
is the average over the values delivered by each individual member of the
development team.
o PR1_C2: this is a data set obtained from the team working on the feature labelled
C2 for the purposes of this report. This development was completed in 2004. The
feature C2 is part of the project PR1.
o PR2_C1_WAVG: the data points produced during the execution of
project PR2 could not be treated uniformly because they were obtained from
members of four different development teams, each working on different
aspects of the features under development; it was therefore necessary to take
into account the differences between their working contexts. This was done
by associating a weight with each data point based on the size of the corresponding
component; a reasonable assumption is that the weight should be proportional to the
contribution of the component's size to the overall feature. PR2_C1_WAVG is the
average over the weighted inputs.
o PR2_C1_AVR: the average over all the data points collected from
members working on the PR2 project, computed without weighting the inputs.
We include it for comparison with PR2_C1_WAVG, to check the effect of the
weighting and to determine whether basing it on code size is a correct
assumption.
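The size-proportional weighting described for PR2_C1_WAVG can be sketched as follows. Only the weighting rule (weight proportional to component size) comes from the text; the component sizes and the factor values reported by the four teams are hypothetical.

```python
# Hypothetical component sizes (e.g. KLOC) and one input-factor value as
# reported by the four PR2 development teams.
sizes  = [12.0, 3.5, 8.0, 1.5]
factor = [0.60, 0.20, 0.45, 0.90]

simple_avg   = sum(factor) / len(factor)                    # PR2_C1_AVR style
weights      = [s / sum(sizes) for s in sizes]              # proportional to size
weighted_avg = sum(w * f for w, f in zip(weights, factor))  # PR2_C1_WAVG style

print(round(simple_avg, 4), round(weighted_avg, 4))
```

With these numbers the team owning the largest component dominates the weighted value, which is exactly the effect the weighting is meant to capture.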
REQUIREMENTS FD   INIT_MODEL   CALIBRATION 1   CALIBRATION 2   REGRESSION
PR1_AVR           0.13         0.05            0.05            0.01
PR1_C2            0.16         0.05            0.07            0.07
PR2_C1_WAVG       0.13         0.11            0.09            0.07
PR2_AVR           0.12         0.13            0.09            0.03

Table 3 Fault Density predicted by the REQUIREMENTS model
RELATIVE ERROR    INIT_MODEL   CALIBRATION 1   CALIBRATION 2   REGRESSION
PR1_AVR           979.94       295.14          329.26          -12.93
PR1_C2            100.62       -33.59          -15.28          -8.21
PR2_C1_WAVG       69.05        49.86           15.21           -1.69
PR2_AVR           67.13        51.82           19.56           -57.87

Table 4 Relative error (%) for the REQUIREMENTS model
4.2. Requirements phase networks
Table 3 and Table 4 contain the results computed with the network models
generated for the requirements phase conducted at Motorola Toulouse. Table 3 lists the
FDs predicted by each of the network models when it is fed the averages
of the different projects used in the validation. Table 4 gives the relative
errors associated with each of these FD predictions when compared against the ActFD
values recorded for each project.
Looking at the relative errors in Table 4, it is clear that the best results are computed
with the REGRESSION network. The CALIBRATION networks compute reasonable
results, but only for a subset of the actual FDs. This indicates that it will be difficult to
capture in the network the differences that exist between the projects whose data was
used for the calibration procedure.
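The excerpt does not state the exact error convention behind the tables, but a signed relative error of the kind tabulated (negative when the model under-predicts) is commonly defined as below; the example values are hypothetical.

```python
def relative_error_pct(predicted_fd, actual_fd):
    """Signed relative error in percent: negative when the model under-predicts.

    This definition is an assumption; the report does not give its formula.
    """
    return 100.0 * (predicted_fd - actual_fd) / actual_fd

# Hypothetical example: predicting an FD of 0.09 against an actual FD of
# 0.10 gives roughly -10%.
print(relative_error_pct(0.09, 0.10))
```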
Comparing the results obtained with the weighted average of the
PR2_C1_WAVG data set against those obtained when feeding the simple
PR2_C1_AVR average to the requirements network, it is clear that the differences
between the perceptions of the development groups working on different
components of the same feature must be taken into consideration.
Comparing the networks of the validation procedure against the initial requirements
network built by Toulouse, there are structural differences in addition to the
range and functions used to compute the output node. It was necessary to add
another input node, which we have labelled REUSE. Its value is a percentage indicating
how much of the requirements in the feature's specification document is being reused
from a previous release. The reuse factor should always be taken into consideration
because of the impact it has on the amount of work that needs to be carried out when
developing a software feature.
DESIGN FD         INIT_MODEL   CALIBRATION 1   CALIBRATION 2   REGRESSION
PR1_AVR           0.22         0.17            0.17            0.23
PR1_C2            0.25         0.17            0.18            0.21
PR2_C1_WAVG       0.23         0.22            0.21            0.35
PR2_AVR           0.24         0.25            0.24            0.37

Table 5 Fault Density predicted using the DESIGN models
RELATIVE ERROR    INIT_MODEL   CALIBRATION 1   CALIBRATION 2   REGRESSION
PR1_AVR           22.87        -6.03           -5.11           26.92
PR1_C2            2.68         -4.79           -27.64          -12.79
PR2_C1_WAVG       -40.01       20.63           -45.97          -8.18
PR2_AVR           -36.49       39.49           -36.92          -2.03

Table 6 Relative Error (%) for the DESIGN models
4.3. Design phase networks
Once we were satisfied that we had a working estimator for the FD of
the requirements phase, we proceeded with an analysis of the data collected by
Toulouse to improve the network model that was initially put together to reflect the
factors, and the factors' inter-dependencies, that occur at the design phase.
Structurally, the only difference with the original network is that it was necessary to
add a new output node, because the Design_Challenge computed by the original
network has nothing to do with the FD values recorded by the Toulouse team as part
of their quality processes. Table 5 lists the FDs computed with the design models that
were built. The values in Table 6 correspond to the relative error between the
predicted FD and the actual FD in each case.
For the design phase the best models are CALIBRATION 1 and the
REGRESSION model. Once again, it is possible to see that with simple calibration the
most that can be achieved is a solution that is close to only a subset of the
actual FDs. The regression model at the design level is a better estimator, although
once more it is not possible to make any claims with respect to its generalisation
potential.
4.4. Coding phase networks
The final stage of the validation/fine-tuning phase involved the network covering
the factors, and the factors' inter-dependencies, for the coding phase of the Motorola
Toulouse process. As in the previous cases, validation began once satisfactory
results had been achieved with the networks designed to cover the design phase.
CODING FD         INIT_MODEL   CALIBRATION 1   CALIBRATION 2   REGRESSION
PR1_AVR           13.34        11.98           25.90           28.73
PR1_C2            12.77        11.20           23.52           26.65
PR2_C1_WAVG       13.42        12.75           25.90           31.29
PR2_AVR           12.10        13.29           20.42           28.77

Table 7 Fault Density estimated with the CODING models
RELATIVE ERROR    INIT_MODEL   CALIBRATION 1   CALIBRATION 2   REGRESSION
PR1_AVR           -69.42       -72.54          -40.63          -34.15
PR1_C2            31.81        15.58           142.83          175.09
PR2_C1_WAVG       68.15        59.81           224.68          292.22
PR2_AVR           51.67        66.58           155.96          260.58

Table 8 Relative error (%) of the CODING models
Of all the network models this was the most difficult to compute, partly due to the
huge differences between the various coding FDs recorded by the Toulouse team,
which contrast greatly with the small differences between the values of the input
factors. This mismatch in the diversity of the inputs and outputs made it very difficult
to produce good estimators, either by calibrating the existing nodes or by using the
linear regression/PCA techniques. The low diversity found in some of the input
values made it extremely difficult to conduct a reasonable PCA.
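The diversity mismatch can be illustrated numerically: when most input factors barely vary across projects, the leading principal component is dominated by the few high-variance factors and PCA recovers almost nothing from the rest. The factor matrix below is invented for the sketch.

```python
import numpy as np

# Invented factor matrix for four projects: the first three factors barely
# vary, the fourth varies a lot, mimicking the situation described above.
X = np.array([
    [0.50, 3.0, 0.81, 12.0],
    [0.51, 3.1, 0.80, 30.0],
    [0.50, 2.9, 0.81, 55.0],
    [0.49, 3.0, 0.80,  7.0],
])

var = X.var(axis=0)                       # per-factor variance
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)   # singular values of centred data
explained = s**2 / np.sum(s**2)           # variance explained per component

# The first component absorbs nearly all the variance, so PCA cannot
# separate the projects along the near-constant factors.
print(var.round(4), explained.round(4))
```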
For this particular case, it is possible that a factor has been missed out of the network
model that would explain this difference for the PR2 project. It might also be the case
that some measurements are erroneous because, for instance, the descriptions
associated with some of the input factors were not clear enough.
The recommendation for this coding model is to continue capturing data from other
projects and to adjust the relations using the new data. Another possibility is a "model
bagging" approach that combines multiple model predictions: we could take a
coding model from the generic library and combine its results with the Motorola
Toulouse-specific coding model.
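The report does not specify how the two models' outputs would be merged. One minimal reading of "model bagging" is a convex combination of the two FD predictions; the mixing weight `alpha` and both input values below are assumptions, not taken from the report.

```python
# Hypothetical combination of a generic-library coding-FD prediction with
# the Toulouse-specific one; alpha is an assumed mixing weight.
def combine_predictions(generic_fd, specific_fd, alpha=0.5):
    """Convex combination of two fault-density predictions."""
    return alpha * generic_fd + (1 - alpha) * specific_fd

print(combine_predictions(20.0, 28.0))  # → 24.0, midway between the two estimates
```

In practice `alpha` could itself be tuned on held-out projects, favouring whichever model has tracked the actual FDs more closely.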
5. Conclusion and Next Steps
The analysis of the data and models provided by Motorola Toulouse has given us
the opportunity to produce a reasonable set of models, and also a new method for
validating and fine-tuning BNs developed to support the improvement of the
software development process for a particular organization.
The method shows two alternative ways to produce an improved network. The first
is based mainly on modifying the ranges of the existing nodes, adding inter-
dependencies between them, and/or varying the weight values associated with each
of the nodes used as inputs to the intermediate nodes of a BN model. The second
is to use linear regression and PCA to build the intermediate/output nodes of the
BN.
The second way of computing the best network generally produces better results
when the outputs are compared against the data that was used to build the
network. It does not necessarily follow that this will be the case when the network is
run on data from future Motorola Toulouse projects, because the network might not
generalise well. Further studies are required to settle this point.
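One standard way to probe the generalisation question raised here, not described in the excerpt itself, is leave-one-out cross-validation over the available projects: each project is held out in turn, the node is refit on the rest, and the held-out relative error approximates performance on a future project. The data below is invented, and the linear fit stands in for the regression-built node.

```python
import numpy as np

# Invented data: 8 projects, 3 input factors, a linear fault-density target.
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))
y = 0.2 + X @ np.array([0.03, 0.05, 0.01]) + rng.normal(scale=0.005, size=8)

# Leave-one-out: refit the linear node without project i, then measure the
# relative error (%) of its prediction for the held-out project.
loo_errors = []
for i in range(len(y)):
    mask = np.arange(len(y)) != i
    A = np.column_stack([X[mask], np.ones(mask.sum())])
    coef, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
    pred = np.append(X[i], 1.0) @ coef
    loo_errors.append(100.0 * (pred - y[i]) / y[i])

print(np.mean(np.abs(loo_errors)).round(2))  # mean absolute relative error (%)
```

A large gap between this held-out error and the fitted error on the full data set would signal exactly the over-fitting risk discussed above.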
We believe that the new methods developed for calibrating/fine-tuning the network
are the best way forward, as they allow us to better exploit the potential of the BN
model. In order to fully benefit from this approach we need to extend our current
data set with more input data and actual fault numbers, and to refine the data
collection method to increase the precision of the measurements.
To continue fine-tuning the method described, we have been increasing the number
of opportunities to apply it to other groups within Motorola in addition to
Motorola Toulouse. This will help us to improve the current models, in
addition to learning from new data, by reviewing new generic descriptions for the
nodes currently used and adding group-specific indicator variables in each case to
qualify the factors using Bayesian inference. This will make it possible to convey a
clear, unambiguous meaning for each of the factors that are part of the BN models
and to refine the measurements with the added indicators.
As the process followed in an organization changes, more effective ways to
capture the factors are devised, or the existing factors' inter-dependencies change, it is
also necessary to adapt or change the BN models that are being used. Currently, we are
working on improving the models' quality by conducting a review of the research area [1],
[5], [9], analysing the solutions reached by other research teams and some of the new
modelling techniques that they have derived. This activity triggered the
changes to the Requirements network model discussed previously, which was extended
by integrating into the model the set of factors included in the ISO metrics for
requirements management [13]. The new model is shown in Figure 3. The report
mentioned in [12] includes details of how the model was built. We intend to validate it
in the project pilots scheduled for 2005.
Figure 3 REQUIREMENTS generic network
Our goal in 2004 was to produce a clear procedure to validate/calibrate BNs for
predicting software reliability. To this end, we applied statistical techniques that
provided us with a measurement of the quality of the predictions made by our models,
and with suggestions on how the models could be improved. The results achieved at
Motorola Toulouse are encouraging; nevertheless, the number of unresolved questions
indicates that there is still a lot of research to be done, and that is what we are
currently working on.
6. REFERENCES¹

[1] Boehm B., Basili V., "Software Defect Reduction Top 10 List", IEEE Software, 2001.
[2] Bourry F., Gras J.J., "Fault Prediction using Historical Data and Bayesian
Networks", Proceedings of the Motorola S3 Symposium, 2004.
[3] Fenton N., Krause P., Neil M., "Software Measurement: Uncertainty and Causal
Modeling", IEEE Software, Jul/Aug 2002.
[4] Fenton N., Neil M., "A Critique of Software Defect Prediction Models", IEEE
Transactions on Software Engineering, vol. 25, no. 5, Sept/Oct 1999.
[5] Fenton N., Pfleeger S., Software Metrics: A Rigorous & Practical Approach,
ISBN 0-534-95425-1, PWS, 1997.
[6] Gras J.J., "End-to-end defect modelling", IEEE Software, Sept/Oct 2004.
[7] Gras J.J., McGaw D., "End-to-end defect prediction", 15th International
Symposium on Software Reliability Engineering, 2004.
[8] Minitab, http://www.minitab.com.
[9] MODIST project, http://www.modist.org/
[10] Musa J., Iannino A., Okumoto K., Software Reliability Engineering:
Measurement, Prediction, Application, ISBN 0-07-044093, McGraw-Hill, 1987.
[11] Netica, http://www.norsys.com/
[12] Pérez-Miñana E., "Extending the requirements factors of the fault prediction
BN models", Motorola internal report, August 2004.
[13] IEEE Std 830-1998, "Recommended Practice for Software Requirements
Specifications", 1998.

¹ There are a number of additional publications that describe partial results achieved using the
approach described in the paper. They are not listed in the references due to confidentiality issues.