DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2016

Learning-based testing of automotive ECUs

SOPHIA BÄCKSTRÖM

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION



Master's Thesis at CSC
Supervisor: Karl Meinke
Examiner: Johan Håstad

Project provider: Scania AB
Supervisor at Scania: Christopher Lidbäck

Stockholm, Sweden December 2016


Abstract

LBTest is a learning-based testing tool for black box testing, developed by the software reliability group at KTH. Learning-based testing combines model checking with a learning algorithm that incrementally learns a model of the system under test, which allows for a high degree of automation.

This thesis examines the possibility of using LBTest for testing electronic control units (ECUs) at Scania. Through two case studies, the possibility to formalise ECU requirements and to model ECU applications for LBTest is evaluated. The case studies are followed up with benchmarking against test cases currently in use at Scania.

The results of the case studies show that most of the functional requirements can, after reformulation, be formalised for LBTest and that LBTest can find previously undetected defects in ECU software. The benchmarking also shows a high error detection rate for LBTest. Finally, the thesis presents guidelines for requirement formulation and suggests improvements to LBTest.


Referat

Learning-based testing of ECUs

LBTest is a learning-based tool for black box testing developed by the software reliability group at KTH. Learning-based testing combines model checking with a learning algorithm that step by step builds a learned model of the system under test, which enables a high degree of automation.

This thesis examines the possibility of using LBTest to test electronic control units (ECUs) at Scania. Through two case studies, the possibility to formalise requirements on ECUs and to model ECU applications for LBTest is evaluated. The case studies are followed up with benchmarking against existing test cases at Scania.

The results of the case studies show that the majority of the functional requirements can be formalised for LBTest after reformulation, and that LBTest can find previously undetected errors in the software. The benchmarking shows a high degree of error detection for LBTest. The thesis also proposes guidelines for requirement formulation and possible improvements to LBTest.


Acknowledgements

I would like to express my gratitude to my supervisors at CSC and to the Scania employees who have helped me along the way.

Karl Meinke, Christopher Lidbäck, Andreas Rasmusson andHojat Khosrowjerdi – thank you for your patience and support.


Contents

1 Introduction
  1.1 Objective
  1.2 Methodology
  1.3 Delimitations
  1.4 Contributions
  1.5 Thesis outline

2 Background
  2.1 Software testing
    2.1.1 Black box and white box testing
    2.1.2 Mutation testing
  2.2 Model-based testing
    2.2.1 Model checking
    2.2.2 Linear temporal logic
  2.3 Learning-based testing
    2.3.1 LBTest
    2.3.2 Previous case studies
  2.4 Testing of ECUs
    2.4.1 System testing of ECUs at Scania
    2.4.2 Formalisation of ECU requirements

3 The case studies
  3.1 Case study 1: Low fuel level-warning
    3.1.1 Requirement formalisation
    3.1.2 Modelling and partitioning
    3.1.3 Warnings and detected errors
  3.2 Case study 2: Dual-circuit steering
    3.2.1 Requirement formalisation
    3.2.2 Modelling and partitioning
    3.2.3 Warnings and detected errors
  3.3 Benchmarking

4 Results
  4.1 Requirement formalisation
  4.2 Source of detected errors
  4.3 Benchmarking

5 Discussion
  5.1 The case studies in retrospect
  5.2 The results
    5.2.1 The formalisation
    5.2.2 Detected errors
    5.2.3 The benchmarking
  5.3 Issues and usability of LBTest

6 Conclusions and future work
  6.1 Conclusions
  6.2 Recommendations
  6.3 Future work

7 Bibliography


Chapter 1

Introduction

Software testing examines a program's behaviour given specific settings and inputs to make a judgement about its quality or to detect defects (Jorgensen 2002). The results, the outputs of the program, are compared to the expected behaviour and evaluated. The evaluation of the result is known as the test oracle, and can be achieved by comparing the result with a statement in the test case or by using a separate algorithm. These three steps (generating test data, injecting the input into the program, and the oracle step) constitute the basis of software testing. These steps can be executed manually, with some degree of automation, or completely automated.
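The three steps can be illustrated with a minimal harness. The system under test below is a hypothetical absolute-value function with a seeded defect, chosen only for illustration; it is not a program from the case studies.

```python
def generate_inputs():
    # Step 1: test data generation (here simply a fixed sample of the domain)
    return [-1, 0, 1, 100]

def sut(x):
    # Hypothetical system under test: absolute value with a seeded defect at 100
    return -x if x < 0 else (99 if x == 100 else x)

def oracle(x, output):
    # Step 3: the test oracle compares the observed output with the expectation
    return output == abs(x)

def run_tests():
    # Step 2: inject each input into the program, then apply the oracle
    return {x: oracle(x, sut(x)) for x in generate_inputs()}
```

Running `run_tests()` gives a pass verdict for every input except the seeded defect at 100.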

This thesis describes an evaluation of a fully automated testing tool, LBTest, for testing of electronic control units (ECUs) for heavy trucks. LBTest is developed by the software reliability group at KTH and implements learning-based testing. This testing strategy combines model checking with learning algorithms, which incrementally learn a model of the system under test, using test cases as queries. New test cases are generated by checking the learned model against the system's functional requirements, formulated in Linear Temporal Logic (LTL).

Scania is one of the world's leading manufacturers of heavy trucks and buses. The company was founded in 1891 in Södertälje, where the head office is still located. Today Scania has over 44 000 employees, 3 500 of whom work at Research & Development. The case study took place during the spring of 2016 at Research & Development, with the team responsible for system testing of chassis ECUs. Scania is interested in how LBTest can be used to approximate models of their software and test requirements on the ECUs. Being able to test directly from requirements could reduce the resources needed to manually write and maintain test code, as well as allow a larger number of combinations of states and scenarios to be executed during testing.

Previous studies (Feng, Lundmark, Meinke, Niu, Sindhu & Wong 2013, Nycander 2015) have shown that LBTest can model a system under test, find new bugs and detect injected errors in the code. However, several questions remain regarding the practical applicability of the tool, such as scalability, usability and efficiency with regard to the resources needed.

1.1 Objective

The goal of the study is to evaluate how suitable LBTest is for ECU testing. It should also result in added knowledge of how real-life industrial problems can be modelled for LBTest and of the degree to which the tool matches the demands of the industry. The study will examine whether the conditions for using LBTest are met given the current requirements documents, different approaches to requirement modelling for LBTest, and the effects of using the tool to test ECU software. This will be achieved by analysing the following questions:

• To what degree are the ECU requirements formalisable and expressible in LTL?

• Can ECU applications be modelled for LBTest?

• Can LBTest find undiscovered bugs in the ECU software?

• How does LBTest compare to the existing testing framework with regard to detection of injected errors?

1.2 Methodology

The research questions will be answered through comparative case studies based on ECU requirements. Since the aim is to study LBTest in a specific real-life context, the case study method is well suited. The case studies will include both qualitative and quantitative elements, with the aim both to analyse potential obstacles to using the tool in this setting and to evaluate it against the existing framework. Due to the large variation among the Scania requirements, it is preferable to study more than one ECU function. The requirements documents to be analysed will be selected by testers at Scania. Data will be gathered by examining the number of formalisable requirements, the verdicts from LBTest when testing the requirements, and the number of injected errors detected by LBTest and the existing testing framework.

1.3 Delimitations

The objective of the project is not to examine the full consequences of a complete shift to LBTest at Scania, but to evaluate the tool based on a subset of requirements. The focus will be on the practical use of the tool, mainly its industrial applicability. The specific algorithms for learning and model checking will not be discussed in detail.

This thesis does not discuss ethical or environmental aspects of software testing. However, sustainability issues and the effect improved software testing techniques would have on the environment could be relevant to investigate in future studies.


1.4 Contributions

To date, no industrial benchmarking has been conducted with LBTest against an already implemented testing framework, examining how the tool measures up to existing test methods in the industry. This thesis contributes by adding knowledge of how LBTest can be used in the automotive industry and how it compares to one of the current testing strategies.

1.5 Thesis outline

Chapter 2 provides an introduction to software testing and the relevant theory for learning-based testing and testing of ECUs. Chapter 3 describes the two case studies of ECU requirements and the benchmarking against the current testing framework, piTest. Chapter 4 presents the results of the case studies and the benchmarking. Chapter 5 provides a more detailed discussion of the results and of the experience of working with LBTest. Finally, Chapter 6 summarises the conclusions of the project, provides recommendations for requirement formulation and suggests future work.


Chapter 2

Background

This chapter presents an introduction to software testing in general, and to learning-based testing and testing of ECUs in particular. Section 2.1 describes the role of software testing given different development models, as well as basic testing strategies. It also covers mutation testing, which is used to evaluate LBTest in the benchmarking described in Chapter 3. Section 2.2 provides the background necessary to understand the concept of learning-based testing through an introduction to model-based testing, model checking and linear temporal logic. Learning-based testing and LBTest are covered in Section 2.3, and Section 2.4 focuses on testing of ECUs and formalisation of automotive requirements.

2.1 Software testing

Software testing alone cannot prove a program's correctness, only display its defects. Or as phrased by Dijkstra (1970, p. 7):

"Program testing can be used to show the presence of bugs, but never to show their absence"

Software testing tends to be a labour-intensive and costly activity, consuming up to 50% of software development costs (Ammann & Offutt 2008). One of the most important aspects of keeping costs down is early detection of defects, as the cost of correcting defects increases exponentially throughout the development cycle. The relative cost can be 15 times higher if a defect is detected during testing compared to detection during system design, and 100 times higher if it is not found before system maintenance (Crispin & Gregory 2009). This has motivated early testing and testing throughout the development cycle.

Traditionally, testing activities have been categorised into stages, where they are linked to a specific phase in the development cycle and its level of abstraction of the system under test (SUT). The V-model (figure 2.1) describes the objectives of testing and the source material used to derive test cases at each phase. The V-model stems from sequential, top-down development models such as the waterfall model, in which each development phase relies on information produced by the previous phase. Each phase must be completely closed before the next can start, locking the specifications at the previous level. Unit testing evaluates each component's functionality separately, as a unit, with respect to detailed design specifications and code coverage. Integration testing uses the architectural design to verify the integration of the components of the system, whilst system testing examines the assembled system and its functionality based on software and design specifications. At this point, detailed knowledge of the actual implementation should not be necessary. Acceptance testing finally validates the software with respect to user requirements, and is usually conducted with representatives of end users or persons with good domain knowledge (Mathur & Malik 2010). The purpose of the model is to make sure that testing is conducted throughout the development cycle and not left until the end of the project. Basic implementation errors in the source code should be detected during unit testing, so that integration and system testing can focus on broader questions regarding the design, the specification and the communication between the different parts of the SUT.

Figure 2.1. The V-model, based on (Jorgensen 2002) and (Mathur & Malik 2010)

As a response to the inflexibility of sequential development models, agile development is based on short iterations with continuous improvements, emphasising early delivery, customer collaboration and cross-functional teams. An iteration can be as short as a single week, so testing activities must start as early as possible, in parallel with development. A way to describe the testing activities in an agile strategy is the testing quadrants (figure 2.2). The testing quadrants, as presented by Crispin and Gregory (2009), describe how the different testing activities should be conducted, rather than when. The model is based on four different aspects of software testing: business-facing tests are contrasted with technology-facing tests, and team-supporting tests with product-evaluating tests. These categories are intended to serve as a basis for decisions on which tests should be automated, when specific tools should be considered and when manual testing is beneficial.


Figure 2.2. The four testing quadrants, as described by Crispin & Gregory (2009)

The team-supporting activities in Q1 and Q2 focus on the product development. The first quadrant contains unit tests and component tests, which have the purpose of verifying the code. These tests can be fully automated. The test activities in Q2 view the product from a business perspective that is comprehensible to other stakeholders, such as product owners and business analysts. These activities should also include a high degree of automation, but at a different abstraction level. However, some of the activities in Q2 cannot be automated, for example prototyping and design validation.

The activities in Q3 and Q4 evaluate the product according to its functional and non-functional requirements. Q3 contains tests that should not be automated, such as acceptance testing, exploratory testing and usability testing, validating that the product meets the functional requirements. These tests usually involve the end users of the product. The testing activities in Q4 evaluate the product from a technical perspective and explore non-functional aspects of the SUT, such as security, performance and load testing. These tests can usually be aided by specialised tools.

The main reason for the focus on test automation in agile development is the need to present a product with basic functionality at each release, keeping the iteration time to a minimum. Test automation frees resources to focus on tests that cannot be automated, and the fast execution allows for wider coverage. Building the product step by step makes regression testing, the verification of the SUT after it has been updated, especially important. With automated regression tests, the functionality of the SUT can be verified each time changes are made to the code.

2.1.1 Black box and white box testing

Two main approaches to software testing are black box testing and white box, or structural, testing. These testing strategies concern the information used in writing test cases and correlate to some degree with the current phase of testing. White box testing uses information from the actual implementation to create test cases and estimate coverage, which makes most sense for test activities that aim at verifying the source code. Black box testing focuses on the functionality of the SUT, ignoring the internal structure of the implementation. The program is viewed as an unknown function, transforming the program's input data to output according to the requirements. Black box testing strategies tend to be based on the requirements themselves, for example testing from use cases, or on analysis of the input data, such as equivalence classes or boundary value analysis. Equivalence class testing identifies input data that are equivalent with regard to how they affect the SUT and groups them into disjoint partitions covering the entire domain. Test cases are created by letting one sample represent each partition, reducing redundant test cases while still covering the entire input domain. Boundary value analysis identifies the boundaries of the input space to construct test cases, based on the assumption that errors often occur near the extreme values of an input variable (Ammann & Offutt 2008, Jorgensen 2002). Downsides to black box testing are the risk of redundant test cases due to overlapping functions, untested code, and difficulties defining the coverage of a test suite (Jorgensen 2002).
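Equivalence class testing and boundary value analysis can be sketched as follows. The partitioning assumes a hypothetical input (a fuel level from 0 to 100, warning below 10) chosen only for illustration; it is not taken from the Scania requirements.

```python
# Hypothetical input domain: an integer fuel level 0-100, with a warning
# expected below 10. The two disjoint partitions cover the whole domain.
partitions = {
    "low":    range(0, 10),    # warning expected
    "normal": range(10, 101),  # no warning expected
}

def representatives(parts):
    # Equivalence class testing: one sample stands in for each partition
    return {name: next(iter(r)) for name, r in parts.items()}

def boundary_values(parts):
    # Boundary value analysis: test at and just inside each partition edge
    values = set()
    for r in parts.values():
        lo, hi = r[0], r[-1]
        values.update({lo, lo + 1, hi - 1, hi})
    return sorted(values)
```

Here `representatives` yields one test input per partition, while `boundary_values` concentrates test inputs around the edges 0, 9, 10 and 100, where errors are assumed most likely.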

Coverage in white box testing is measured using the structure of the source code of the SUT. The program is modelled as a flow graph with statements as nodes and data flow as edges. A coverage criterion for a test suite can then be expressed using elements of the graph, such as coverage of all nodes or all paths (Ammann & Offutt 2008). Although white box testing easily describes coverage of the SUT, its main challenge is scalability. Each conditional branch in a graph doubles the number of possible paths, causing a test suite explosion as the program grows. But regardless of testing strategy, complete testing is usually not a realistic option; the number of possible input combinations for most programs is effectively infinite (Ammann & Offutt 2008).

2.1.2 Mutation testing

Mutation testing is a strategy for evaluating the quality of a test suite by injecting errors into the source code. A mutant is a small, syntactically valid modification of the program that in some way changes its behaviour. Examples of mutants are replacing || with && or x > 1 with x ≥ 1. A mutant that only changes the code but not the behaviour of the program, a so-called equivalent mutant, can never be detected by a failed test case. Avoiding the generation of equivalent mutants is however impossible, since the problem of program equivalence is undecidable (Jia & Harman 2011). Each test suite is given a mutation score: the ratio between discovered mutants and the number of non-equivalent mutants in the code. This score can be seen as a type of coverage, examining the proportion of errors caught instead of code or requirements.

Mutation testing is based on two assumptions:

- The competent programmer hypothesis. The hypothesis claims that it is not necessary to consider all conceivable errors that could be made by a programmer in order to evaluate the test suite. On the contrary, one can assume that the programmer is competent enough that the SUT is very close to the bug-free program that is aimed for. Since the actual errors in the code will differ very little from the correct version of the program, they can be recreated by a few syntactic changes (Jia & Harman 2011).

- The coupling effect. The coupling effect states that the simple syntactic errors that the mutators introduce are related to the complex errors that one would wish a test suite to detect. The assumed relationship is that complex errors are made up of a number of syntactic changes, which can be simulated by mutants. Complex errors are seen as groups of simpler errors, not as a different category of errors. Therefore a mutant can be made to represent more complex errors as well (Jia & Harman 2011).

Some of the advantages of mutation testing are that it may reveal issues with the test suite that are not picked up in manual reviews, and that it offers a consistent measure of the quality of a test suite (Baker & Habli 2013). But mutation testing is not unproblematic. Considering all possible mutations while leaving out equivalent mutants can demand large computational resources. In addition, a reliable test oracle must be constructed. Empirical studies on the subject have given mixed results. Some studies find that the majority of detected and corrected errors in software projects do involve small code segments (Purushothaman & Perry 2005); other results (Gopinath, Jensen & Groce 2014) indicate that the errors found are too large for the competent programmer hypothesis to hold. Additionally, the coupling between mutants and actual errors seems to differ between languages, which is why language-specific mutators should be considered.
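The mutation score described above can be sketched as follows. The mutants here are hypothetical hand-written stand-ins for an absolute-value program, each represented as a callable, rather than automatically generated mutations.

```python
def mutation_score(test_suite, mutants, equivalent):
    """Ratio of killed mutants to non-equivalent mutants.
    A mutant is killed if at least one test in the suite fails on it."""
    candidates = [m for m in mutants if m not in equivalent]
    killed = [m for m in candidates
              if any(not test(m) for test in test_suite)]
    return len(killed) / len(candidates)

# Hypothetical mutants of abs(): each callable stands in for a mutated
# program. Replacing "x > 0" by "x >= 0" leaves behaviour unchanged (an
# equivalent mutant); dropping the condition entirely changes behaviour.
mutant_ge  = lambda x: x if x >= 0 else -x   # equivalent mutant
mutant_neg = lambda x: -x                    # killed by positive inputs

tests = [lambda m: m(5) == 5, lambda m: m(-5) == 5]
score = mutation_score(tests, [mutant_ge, mutant_neg],
                       equivalent=[mutant_ge])
```

With the equivalent mutant excluded, the suite kills the one remaining mutant and scores 1.0; counting the equivalent mutant as a candidate would unfairly halve the score.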

2.2 Model-based testing

Model-based testing uses an abstract model of the SUT to generate relevant test cases. One strategy for automatically generating test cases in model-based testing is to let a model checker verify the model of the SUT against formally expressed requirements, and to use any counterexamples given by the model checker as test cases.

A drawback of model-based testing is the required model of the SUT. Manual model construction is a complicated task, and given an agile development style the model would need recurrent updates. Model-based testing using incomplete models has been suggested by (Groce, Peled & Yannakakis 2002) and (Groce, Fern, Pinto, Bauer, Alipour, Erwig & Lopez 2012) to enable updates of a model according to changes in the software. The basic idea, shared with the learning-based testing approach, is to use the generated counterexamples both for testing the SUT and for improving the model of the SUT. A counterexample from the model checker that does not cause a fail verdict when executed on the SUT reveals a discrepancy between the SUT and the model, and is used to improve the model of the SUT.

2.2.1 Model checking

A model checker takes as input a property of the SUT expressed in temporal logic and a transition system, such as a Kripke structure, and explores the entire state space to determine whether the model violates the given property (Fraser, Wotawa & Ammann 2007). A Kripke structure K is a tuple K = (S, S0, T, L) expressing the behaviour of the program as a finite state machine. It contains a set of states S, an initial state S0, a total transition relation T, which connects every state to at least one other, and a labelling function L that maps each state to a set of atomic propositions (Fraser et al. 2007). This means that the system must be describable by a finite set of states and inputs, where the behaviour of the system depends only on the current state and the input. If a violation is found while checking the formal statement against the Kripke structure, a counterexample, in which the negation of the property holds, is produced. If no such discrepancy is found, the property holds in all possible states.
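As an illustration, a Kripke structure and a check of a safety property G(!φ) can be sketched by exhaustively exploring the reachable states. The states and labels below describe a hypothetical fuel-warning machine, not the actual ECU models, and the checker handles only invariants, not full LTL.

```python
from collections import deque

# A Kripke structure K = (S, S0, T, L): a set of states, an initial state,
# a total transition relation and a labelling with atomic propositions.
S  = {"off", "on", "warn"}
S0 = "off"
T  = {"off": {"on"}, "on": {"off", "warn"}, "warn": {"on"}}
L  = {"off": set(), "on": {"running"}, "warn": {"running", "low_fuel"}}

def check_invariant(prop):
    """Check the safety property G(!prop) by breadth-first exploration of
    the reachable state space; return a counterexample path or None."""
    queue, seen = deque([[S0]]), {S0}
    while queue:
        path = queue.popleft()
        state = path[-1]
        if prop in L[state]:
            return path            # counterexample: a reachable state where prop holds
        for nxt in T[state] - seen:
            seen.add(nxt)
            queue.append(path + [nxt])
    return None                    # property holds in every reachable state
```

Checking G(!low_fuel) yields the counterexample path off → on → warn, which a model-based tester would use as a test case; a proposition that labels no reachable state yields no counterexample.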

Model-based testing with model checkers enables both automatic generation of relevant test cases and an automated test oracle. The oracle can compare the SUT output with the counterexample; a match in behaviour gives a fail verdict, otherwise the test case passes. To achieve coverage through counterexamples, so-called trap properties are used: negated properties for the items to be covered, such as nodes, edges or states. While trap properties for state coverage are just safety properties, other types of coverage measurement can demand more complex statements (Fraser et al. 2007).

2.2.2 Linear temporal logic

Linear temporal logic (LTL) expresses statements that can be true or false at a specific point in time. It extends classical logic, such as propositional or predicate logic, where statements are statically true or false. LTL allows for statements about possible states in the future: that something will be the case at some point (Fφ), in the next time step (Xφ), globally (Gφ) or until some other state (φ U ψ) (see table 2.1). The possibility to discriminate between different points in time makes LTL useful for modelling and expressing qualities of reactive systems, such as embedded software. Some specifically interesting qualities that are expressible in LTL are safety properties and liveness properties. Liveness is an assurance that something good will eventually happen, for example that if the program is started it will eventually terminate (φ → F(ψ)). Safety properties assert that something bad never happens, that G(!φ) (Fisher 2011).

An extension of LTL includes past operators, expressing properties that, for example, held at one (O), all (H) or the last (Y) of the previous states. Strictly speaking, the past operators do not add expressiveness: all statements expressible with past operators can be rewritten using only future operators. But they do affect the usability of the language. From a user perspective, when formalising requirements in LTL, the extension with past operators can offer statements that are easier to grasp and closer to the initial formulation. This improvement in usability does not increase the complexity of model checking (Pradella, San Pietro, Spoletini & Morzenti 2003).

Future operatorsX(φ) Next - φ holds in the next stateG(φ) Global - φ holds in all future statesF (φ) Finally - φ holds in a future state(φ U ψ) Until - φ holds until ψ holds(φ V ψ) Releases - ψ holds until φ holds, or ψ holds globally

Past operatorsY (φ) Previous - φ holds in the previous stateH(φ) Historically - φ holds in all past statesO(φ) Once - φ holds in at least one past state(φ S ψ) Since - φ holds in all states since ψ(φ T ψ) Triggered - ψ holds in all states since φ, or ψ holds historically

Table 2.1. The past and future operators in LTL, in NuSMV syntax (Cavada, Cimatti, Jochim, Keighren, Olivetti, Pistore, Roveri & Tchaltsev 2010)
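As a didactic illustration, a few of the future operators in table 2.1 can be given a finite-trace semantics in a handful of lines of Python. This is a sketch only: proper LTL semantics is defined over infinite traces, and the atom and trace names are invented for the example.

```python
# Finite-trace evaluation of a few LTL operators (didactic sketch only;
# proper LTL semantics is defined over infinite traces).

def X(phi):        # Next: phi holds in the next state
    return lambda tr, i: i + 1 < len(tr) and phi(tr, i + 1)

def G(phi):        # Global: phi holds in all remaining states
    return lambda tr, i: all(phi(tr, j) for j in range(i, len(tr)))

def F(phi):        # Finally: phi holds in some remaining state
    return lambda tr, i: any(phi(tr, j) for j in range(i, len(tr)))

def U(phi, psi):   # Until: phi holds in every state before psi holds
    return lambda tr, i: any(
        psi(tr, k) and all(phi(tr, j) for j in range(i, k))
        for k in range(i, len(tr)))

def atom(name):    # atomic proposition: name is true in the current state
    return lambda tr, i: name in tr[i]

trace = [{"started"}, {"running"}, {"running"}, {"terminated"}]
assert F(atom("terminated"))(trace, 0)                    # liveness
assert G(lambda tr, i: "crashed" not in tr[i])(trace, 0)  # safety
```

The liveness example mirrors the started-implies-eventually-terminates property above: F(terminated) holds on this trace, while the safety check confirms that "crashed" never occurs.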

2.3 Learning-based testing

In the paper Automated black-box testing of functional correctness using function approximation (Meinke 2004), black box testing is described as a constraint solving problem: a search for counterexamples to program correctness, which is solved by learning the system under test. If S is a system whose functional correctness can be modelled by pre- and postconditions in first order logic, then a successful black box test case, one that finds such a counterexample, is an assignment of values to input variables that satisfies pre and for which S terminates with output variables that satisfy ¬post. For this search the paper suggests an application of function approximation: representing an unknown underlying function by an approximation based on observed input and output. The underlying function in this case is the system S, which is approximated by a model that maps the input space of S to its output space. For each unsuccessful test case the model of S is incrementally refined and new, improved test cases are generated.



A generalisation of this strategy is learning-based testing (LBT). In LBT the approximated model of the SUT is produced by a machine learning algorithm, which together with a model checker creates an iterative feedback loop, using the test cases as queries. At each iteration a test case is created in one of three ways: by the learning algorithm, as a membership query; by the model checker, as a counterexample found when checking the current model of the SUT against the requirements; or by a random test case generator (Sindhu 2013). LBT is a heuristic approach to finding bugs in the state space of the SUT, possibly without having to explore the whole state space through complete learning. Instead LBT makes a best guess for where a bug can be found by generalising the current information about the SUT. If the model checker finds a counterexample to the requirements in the current model of the SUT, this counterexample serves as the next test case. If the behaviour of the SUT matches the counterexample, a bug has been found. If not, the model of the SUT is improved by adding the information from the input/output pair of the counterexample. This method allows for automation of test case generation, test execution and the test oracle. The concept has been developed by the software reliability group at KTH for both procedural and reactive programs, and several learning algorithms have been evaluated, such as algebraic parameter estimation, IKL (an incremental learning algorithm for Kripke structures), L* Mealy and minsplit (Meinke, Niu & Sindhu 2012, Sindhu 2013).
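The feedback loop can be caricatured in a few lines of Python. Everything here is invented for illustration: the requirement is G(output = ok) over a tiny input domain, the "model" is a plain lookup table, and the "model checker" treats any input whose predicted output is not known to be ok as a potential counterexample. Real LBTest learns a finite state machine and uses a genuine model checker.

```python
# Toy learning-based-testing loop (illustrative only; LBTest itself learns
# finite state machines and uses a real model checker such as NuSMV).

def sut(x):                         # system under test, with a seeded bug
    return "bad" if x == 7 else "ok"

def check(model, domain):
    # "Model check" G(output = ok) on the hypothesis: any input whose
    # predicted output is not known to be "ok" is a potential counterexample.
    for x in domain:
        if model.get(x) != "ok":
            return x
    return None

def lbt(domain=range(10)):
    model = {}                      # learned approximation of the SUT
    while True:
        cex = check(model, domain)
        if cex is None:
            return ("pass", None)   # hypothesis satisfies the requirement
        out = sut(cex)              # execute the candidate counterexample
        if out != "ok":
            return ("fail", cex)    # SUT confirms the counterexample: bug found
        model[cex] = out            # otherwise refine the hypothesis

verdict = lbt()                     # -> ("fail", 7)
```

The three-way outcome mirrors the text: a confirmed counterexample is a bug, a refuted one refines the model, and a hypothesis with no counterexample yields a pass.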

2.3.1 LBTest

LBTest is a tool for functional black box testing that implements the learning-based testing paradigm. Two prerequisites for using the tool are that it must be possible to model the SUT as a finite state machine and that the functional requirements for the SUT can be expressed in LTL. The additional resources needed to execute LBTest are a configuration file, a wrapper file and an executable file of the SUT (Meinke 2015). The wrapper functions as a test harness and acts as the communicator between LBTest and the SUT through the system standard input and output. LBTest does not have any direct contact with the SUT, so the accuracy of the results relies on the wrapper distributing correct information between the programs. The oracle step is handled by comparing the output from the SUT to the counterexample given by the model checker.

LBTest generates test cases from either the machine learning algorithm, the model checker or a random input generator (figure 2.3). These values are translated by the wrapper to data that can be injected into the SUT. The wrapper then reads the next state of the SUT, converts it into symbolic names that are comprehensible to LBTest and sends the information on the output stream. LBTest needs all data to be partitioned into a finite set of equivalence classes, since the model checker cannot make direct use of integer values or other data types, such as graphs or trees. The wrapper must extract the necessary data for each defined type and translate it to a predefined value with a symbolic name for LBTest.
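The wrapper's two translation duties can be sketched as follows. The symbolic names, boundary values and the toy SUT are all invented for the example; a real wrapper at Scania would talk to the emulator interface instead.

```python
# Sketch of a wrapper's two translation steps (all names and values are
# invented for illustration, not taken from an actual Scania wrapper).

SYMBOLIC_INPUT = {"empty": 0, "low": 8, "half": 50, "full": 100}  # class -> %

def partition_output(raw):
    # Map the SUT's concrete output onto a finite set of symbolic classes.
    if raw is None:
        return "NotAvailable"
    return "WarningOn" if raw else "WarningOff"

def step(symbolic_input, sut):
    fuel = SYMBOLIC_INPUT[symbolic_input]  # 1) symbolic name -> concrete stimulus
    raw = sut(fuel)                        # inject and observe
    return partition_output(raw)           # 2) concrete response -> symbolic name

toy_sut = lambda fuel: fuel < 10           # toy SUT: warn below 10 %
assert step("low", toy_sut) == "WarningOn"
assert step("half", toy_sut) == "WarningOff"
```

The point of the partition is that LBTest and the model checker only ever see the finite alphabet of class names, never the raw values.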

Figure 2.3. LBTest, as described by Meinke & Sindhu (2013)

The configuration file defines the setup for the test session. It contains the different input and output types of the SUT, the requirements to be tested, the locations of external resources and the stopping criteria. Examples of stopping criteria are limitations on execution time, the number of hypotheses to be generated or the number of equivalence checks to be made before termination. In LBTest convergence is considered to be reached when no difference is found between the hypothesis and the SUT, given a specified number of random sample queries. The configuration file also provides the possibility to optimise testing by defining other keywords, such as the learning algorithm and model checker to be used. The verdict given by LBTest is pass, fail or warning, where a warning indicates a detected counterexample that includes a loop. After a normal termination LBTest produces a dot-file containing the state machine of the last hypothesis.

LBTest claims to be well suited for agile and continuous development since it supports a very high degree of test automation (Meinke 2015). Due to the black box abstraction level, the wrapper and configuration files do not have to be altered between sprints, which allows for alterations and refactoring of the implementation.

2.3.2 Previous case studies

To date, two industrial case studies have been conducted by the software reliability group at KTH and an additional four by thesis workers. These studies have focused on showing that LBTest can model the SUTs and on examining whether LBTest finds undetected defects in the SUT. The first two studies were conducted on a Brake-by-Wire system by Volvo and an access server by Fredhopper. More detailed descriptions of these studies are found in (Feng, Lundmark, Meinke, Niu, Sindhu & Wong 2013).

Brake-by-Wire is a distributed system of five ECUs and a connecting network bus, with one ECU connected to the brake and gas pedals and the other four to one wheel each. The two pedals provided the input to the system and the output was measured by vehicle speed, rotational speeds of the wheels and torque values. Out of three requirements that were tested with LBTest, two passed and one was given a fail verdict. The counterexample for the failed requirement turned out to reveal an error in the SUT.

In the case study of the Fredhopper Access Server eleven informal requirements were formalised and translated to LTL, and nine of them passed. Two, expressing liveness properties, were given warnings since counterexamples in the form of loops were found. A loop where the desired state p is never reached breaks the property F(p) and therefore results in a warning from LBTest. It turned out that this behaviour was due to errors in the requirements (a strong until U should have been regarded as a weak until W, where the property holds either until a specified state becomes true, or forever) as well as an error in the SUT.
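The difference between the two until operators can be made concrete over a finite trace. This is a didactic Python sketch with invented state names; proper LTL is defined over infinite traces.

```python
# Strong until U versus weak until W on a finite trace (didactic sketch).
# (phi W psi) is equivalent to (phi U psi) | G(phi): weak until also holds
# if psi never occurs, as long as phi holds throughout.

def holds_U(trace, phi, psi):
    return any(psi(s) and all(phi(t) for t in trace[:k])
               for k, s in enumerate(trace))

def holds_G(trace, phi):
    return all(phi(s) for s in trace)

def holds_W(trace, phi, psi):
    return holds_U(trace, phi, psi) or holds_G(trace, phi)

busy = lambda s: s == "busy"
done = lambda s: s == "done"

loop = ["busy", "busy", "busy"]       # "done" never happens
assert not holds_U(loop, busy, done)  # strong until fails: counterexample loop
assert holds_W(loop, busy, done)      # weak until tolerates the loop
```

This is exactly the shape of the Fredhopper warnings: a looping counterexample falsifies the strong formulation even though the intended, weak property holds.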

Two of the thesis projects were conducted at TriOptima, one on a Django web application (Lundmark 2013) and the other on a microservice architecture (Nycander 2015). Both thesis workers reported difficulties in finding a suitable abstraction level for modelling the SUT. Deciding whether a certain signal should be seen as an input, transforming the system, or as an output, an indicator of the state of the system, turned out to be a non-trivial problem. Lundmark used the strategy of finding verbs that described actions that could be performed on the system as input data types, and the results of performing these actions as output data types. Five requirements were translated to LTL in his project and all were given a pass verdict by LBTest. Lundmark then continued to experiment with errors injected into the code. These errors were detected by LBTest.

In Nycander's project different abstraction levels of the SUT were considered. First a black box wrapper was implemented, only utilising the interface from a user perspective. Limiting the model of the SUT to this interaction, leaving out the implementation of the system, resulted in a model with only two states, calculating and idle. Therefore, to achieve a more reactive system, a grey box wrapper communicating directly with the internal messaging system was constructed as well. Seven requirements were tested with the grey box wrapper and all of them passed. In addition, a fault-injecting wrapper was implemented, injecting faults at runtime by triggering a restart of the SUT. This was done to examine the system's error handling and recovery. Twelve requirements were tested with this wrapper, with the result that a bug in the SUT was discovered.

Both studies emphasise the difficulty of verifying wrapper functionality and showed the challenge of finding the root cause of a warning concerning a requirement. In both projects a separate log file for the wrapper was implemented to give a better understanding of the communication between LBTest, the SUT and the wrapper.


2.4 Testing of ECUs

2.4.1 System testing of ECUs at Scania

An ECU is a real-time system that consists of both hardware and software and is specialised to control or monitor parts of the vehicle's functionality. This is done by continuously reading digital and analogue in-signals, such as switches and sensors. The output of the ECU is communicated over a Controller Area Network (CAN) that links the ECUs together and transports diagnostic messages and operational parameters. Each CAN-bus forms a sub-net, which is linked to other sub-nets by a coordinator, an ECU that also distributes information about actions made by the driver.

Scania uses a version of the V-model for testing of ECUs. The team responsible for system testing of chassis ECUs is also involved in module integration testing and testing at part-system level. The test cases are designed to be executed on either a hardware-in-the-loop (HIL) rig or a software emulator. The two platforms enable testing of specific functionality of the ECU by mocking the behaviour of its surrounding systems. Testing on the rig and on the emulator platform are complementary to each other and the current testing framework is compatible with both platforms.

The emulator is developed by Scania and works by stubbing the ECU application code wherever it polls the hardware for information. By doing this, the emulator can feed the ECU application with, for example, a voltage where the ECU would normally read from an A/D-converter. The hardware is completely replaced by a software library, so the test cases can be executed on local computers. Also, software testing can start before the ECU hardware is finalised. The emulator executes in discrete time steps, about 20 times faster than real time, which provides both fast execution of slow events and the possibility to track fast events by pausing at specific time steps. Internal variables and signals can be accessed directly through the ECU's memory area.
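The stubbing idea can be illustrated in miniature. The class names and the 5 V scale are invented for the example; Scania's emulator is a software library replacing C-level hardware reads, not this Python toy.

```python
# Illustration of hardware stubbing: the application's poll of the
# A/D-converter is replaced by a stub that a test can set directly.
# All names and scales are invented for the example.

class AdcStub:
    """Stands in for the real A/D-converter read."""
    def __init__(self):
        self.voltage = 0.0

    def read(self):
        return self.voltage        # the test injects this value

def fuel_level_percent(adc, v_max=5.0):
    # Toy "application code": convert the sensor voltage to a fuel level.
    return 100.0 * adc.read() / v_max

adc = AdcStub()
adc.voltage = 1.25                 # emulate a quarter-full tank sensor
assert fuel_level_percent(adc) == 25.0
```

Because the application only sees the `read()` call, the same code path runs unchanged against real hardware or the stub, which is what lets testing start before the hardware exists.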

HIL-testing exercises both the hardware and software of the ECU, while the behaviour of most of the surrounding vehicle is simulated. The HIL-rig communicates with the ECU through a hardware controller and provides input from I/O and CAN-traffic in real time. Besides manipulation of input, the rig enables interrupts and hardware fault injections at run time, for example to evaluate error handling in case of electronic failures. Internal signals of the ECU must be requested via a communication protocol (Keyword Protocol 2000) and cannot be read directly, which limits the number of signals that can be accessed at once. The main drawbacks of HIL-testing are the limited availability of the rigs and the real-time execution of test cases, compared to the fast execution of the emulator.

The current testing framework used for system testing of chassis ECUs at Scania is piTest, an acronym for Python interface to emulated software test. The base for piTest is the Python unit testing framework, the Python version of JUnit. The basic configuration contains the name of the ECU to be tested, the platform type and the directory for the communication signals between piTest and the platform. For testing in an emulated environment, the emulator interface module iTest is used to read from and write to the ECU software.

The main focus for the test cases written by the test team is specification-based black box testing, to verify the functional requirements. Requirement coverage is the only coverage model currently in use, with at least one test case per requirement. Although some general guidelines exist, the testing strategies and degree of coverage are also influenced by the individual testers, which is why variation between the test suites for different ECUs can be expected. Examples of testing strategies in use are boundary value analysis, combinatorial testing, experience-based testing and in-vehicle testing. White box techniques are usually not considered since structural coverage is not among the system testers' responsibilities.

2.4.2 Formalisation of ECU requirements

The ISO standard 26262 for functional safety of road vehicles (ISO 2011) has been a motivator for studies on formalisation of ECU requirements. The current version of the standard targets the possible hazards caused by malfunctioning behaviour of electronic systems in passenger cars weighing under 3500 kg. However, an adaptation of the standard for heavy vehicles is expected as well, which has led manufacturers to investigate what impact compliance with the standard would have. ISO 26262 proposes a top-down approach, where safety requirements are mapped to architectural elements and traced throughout the development life cycle. The standard states that safety requirements should be specified by a combination of natural language and formal, semi-formal or informal notations, depending on their safety integrity level.

A case study at Bosch (Post, Menzel & Podelski 2011) evaluated to which degree informal behavioural automotive requirements were formalisable, by examining their expressibility in a specification pattern system represented in a restricted English grammar: a formalised natural language automatically transformable to LTL, computation tree logic (CTL) or other logics. The reason for using this grammar was to maintain readability for stakeholders while still allowing automatic consistency checking of the requirements through formal analysis. A sample of 245 informal functional requirements from five projects in the automotive domain was randomly selected for the study. Out of these requirements, 39 turned out not to be translatable without loss of meaning. For 25 of the non-expressible requirements a branching time concept was needed, since they concerned possible rather than actual behaviour of the SUT. Other reasons for untranslatability to the restricted English grammar were statements about properties across several ECUs, not expressible at the given abstraction level, and requirements describing not functional behaviour but the appearance of the product. Another common reason for untranslatability was vagueness in the requirements, to the degree that the authors were not able to recover the properties that the requirements were intended to capture.

A similar case study was conducted at Scania (Filipovikj, Nyberg & Rodriguez-Navas 2014), exploring the possibility to formalise their automotive requirements using specification patterns based on restricted English grammar. Out of 100 gathered requirements, 30 % could not be expressed in restricted English grammar. The most common obstacle was that the requirements did not concern system behaviour. After excluding the non-functional requirements, about 8 % of the remaining requirements were still not formalisable, mainly due to ambiguous expressions and omitted information. Among the formalisable requirements, difficulties in grasping the intent and determining the scope of requirements were also encountered, requiring assistance from Scania engineers for accurate formalisation.


Chapter 3

The case studies

The third chapter describes the two case studies that were conducted to evaluate the possibility to formalise Scania's automotive requirements in LTL, how ECU applications can be modelled for LBTest and whether LBTest can find undiscovered defects in the software. It also contains a description of the benchmarking conducted to evaluate LBTest against test cases currently in use at Scania.

To evaluate LBTest, two system requirements documents were formalised, translated to LTL and tested with LBTest. The first specified the low fuel level-warning (Scania 2015a) and the second dual-circuit steering (Scania 2015b). Both documents included requirement specifications in natural language and semi-formal notation, often expressed as pseudo code. Several iterations of requirement translation and testing were conducted to detect both incomplete and vague requirements and errors in the implementation. Each translated requirement was first tested against the current implementation to make potential ambiguities in the requirements visible and to find possible deviations in the implementation. In cases where warnings or failures were given by LBTest due to incomplete or vague requirements, a reformulation was considered in order to move forward with the case study. Warnings due to discrepancies between the requirements and the implementation were followed up by the test team.

A wrapper with basic functionality for communication with the emulator, using the emulator interface module iTest, and with LBTest was already in place when the project started. Case-specific communication code was added and later reviewed by the test team, to avoid warnings due to malfunctioning test code. As suggested in previous case studies (Lundmark 2013, Nycander 2015), a wrapper log was implemented as well, to keep track of the communication between the wrapper and LBTest.

The case studies were followed up with benchmarking against test cases currently in use, using a mutation testing strategy of injecting small errors into the source code.


Figure 3.1. An illustration of the low fuel level-warning from a black box perspective

3.1 Case study 1: Low fuel level-warning

The low fuel level-warning provides the driver with an additional indication of when a refill is necessary, without the need to monitor the fuel level estimation on the instrument cluster. The information about the current fuel level is given by the internal signal total fuel level, which is calculated by another function, the fuel level estimation (figure 3.1). The output of the system is the low fuel level-warning signal. The basic functionality of the low fuel level-warning is to trigger the warning once the estimated fuel level has decreased below a threshold, and to turn it off only if the estimated fuel level has increased substantially, above a specified level.

3.1.1 Requirement formalisation

The requirements document for the low fuel level-warning included seven requirements and specified one internal input signal, one output signal, and parameter settings for tank sizes and for enabling the functionality. The initial requirement formalisation only used the information given in each requirement, together with the general instructions for the requirements, for the LTL translation. One of the seven requirements was a specification of the parameter setting for a subset of the other requirements. Since this requirement by itself was not a functional mapping between input and output it could not be tested separately, but the information regarding the parameter setting was added to the requirements concerned. The other six covered the functionality of the low fuel level-warning in three cases: specifications of general behaviour, behaviour at start up and behaviour given an error on the internal input signal. One of these requirements specified a value that was not within the given range for the signal concerned, which made the requirement untestable. The remaining five were translated to LTL, resulting in eight formalised requirements to cover the basic parameter settings. The requirements specified different boundary values depending on tank size and type, which were set by input parameters and sensor type. An additional parameter was used to switch the low fuel level-warning on or off.

The requirements could mainly be expressed in LTL as statements of the form G(φ → X(ψ)): given input φ, ψ will hold in the next time step. For example:


If totalFuelLevel has status Error or NotAvailable, output signal lowFuelLevelWarning shall be set to NotAvailable.

This could be expressed as

G((totalFuelLevel = error | totalFuelLevel = NotAvailable) → X(lowFuelLevelWarning = NotAvailable))

Halfway through the project new LTL operators were added to LBTest, allowing for expressions about past events. In this case study these became useful for expressing constant qualities, such as parameter settings for the fuel level indicator and tank sizes, by stating H(φ) & G(φ) for each of these variables.

The first strategy for requirement formalisation was based on the information given for each requirement, without adding any additional assumptions about when the requirements would or would not hold. This approach resulted in a number of warnings from LBTest, due to the structure of the original requirements. The separation between the main scenario and additional requirements for abnormal situations included an implicit assumption that these abnormal situations would not occur during the main scenario. But since this information was not explicitly stated, it was not included in the LTL translation. A simplified example of this is one general requirement, expressed as G(φ → X(ψ)), and an additional requirement for error handling, expressed as G(error → X(χ)). The intention of the first requirement is to express G(φ & !error → X(ψ)), with an exception in case of an input error. Without this clarification LBTest produces a counterexample to the requirement, stating that the requirement would not be valid in case of an input error. After discussing the requirements with the test team it became clear that the requirement should indeed be expressed as G(φ & !error → X(ψ)). Another ambiguity, without an obvious answer, was how to handle overlaps between the different scenarios specified. For example, the requirements specified one initial output value, during start up, and another output value in case of an error. But it was not clear which of these values should apply in case of an error during start up.
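The effect of the missing assumption can be replayed on a small finite trace. This is a didactic Python sketch: the states and atom names are invented, and finite-trace evaluation only approximates LTL semantics.

```python
# Replaying the implicit-assumption problem on a finite trace.
# States are sets of atoms; the requirement pattern is G(pre -> X(post)).

def g_implies_next(trace, pre, post):
    # G(pre -> X(post)) on a finite trace (the last state has no successor).
    return all(not pre(trace[i]) or post(trace[i + 1])
               for i in range(len(trace) - 1))

phi = lambda s: "phi" in s
err = lambda s: "error" in s
psi = lambda s: "psi" in s

# An input error arrives while phi holds; error handling then sets chi, not psi.
trace = [{"phi"}, {"phi", "psi", "error"}, {"chi"}]

# The naive formalisation G(phi -> X(psi)) is violated at the error step...
assert not g_implies_next(trace, phi, psi)
# ...while G(phi & !error -> X(psi)) tolerates it.
assert g_implies_next(trace, lambda s: phi(s) and not err(s), psi)
```

The first assertion corresponds to the counterexample LBTest produced; the second to the formalisation the test team agreed was intended.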

3.1.2 Modelling and partitioning

To be able to use the same wrapper to evaluate different parameter settings, these were set during the start up of each test case. Two of the three parameter settings were merged into one input variable for LBTest, due to the very specific mapping between the two values, determining the tank type and the sensor to be used, that was required for the test cases to be valid. The output for LBTest, indicating the state of the system, was the low fuel level-warning signal. The output signal was partitioned based on its four discrete values, and the parameter settings for the tank were divided into three basic cases: large tank, small tank or gas tank. The input signal, total fuel level, was partitioned into four equivalence classes.

A difficulty in the modelling process was that the internal input, total fuel level, was an estimation of the external fuel level, and could only be adjusted by manipulating the external signal. The external fuel level was set by adjusting the voltage of an analogue input pin, whilst the total fuel level was estimated using a low pass filter and a filter algorithm based on the value of the external fuel level. Due to the filtering process, the difference between the external fuel level and the estimated fuel level could be substantial. Especially smaller changes in fuel level, smaller than would be expected during a refill, were difficult to detect. The requirements did cover cases of small changes in the estimated fuel level, which made the function difficult to test.

Testing the low fuel level-warning with this configuration resulted in a model with 26 states, which took 2 hours and 39 minutes to generate, with an estimated convergence of 98 % based on 1000 random samples. However, the gaps between the external and estimated fuel levels caused unreliable verdicts. In an attempt to work around this issue more time was added to each test case, to give the filtering process time to adjust the estimated total fuel level to the value of the external fuel level. Nonetheless, the discrepancy between the actual input and the estimated value would occasionally be so large that fail verdicts were given for adequate behaviour of the SUT.
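The lag that caused these unreliable verdicts is inherent to low pass filtering, as a first-order filter sketch shows. The smoothing factor is invented for illustration; Scania's actual filter algorithm is more involved.

```python
# First-order low pass filter: the estimate approaches a step input only
# geometrically, so a test that samples too early sees a stale value.
# The smoothing factor alpha is invented for illustration.

def low_pass(samples, alpha=0.1, y0=0.0):
    y, out = y0, []
    for x in samples:
        y += alpha * (x - y)       # move a fraction of the way toward the input
        out.append(y)
    return out

# Simulate an instant "refill" from 0 % to 100 %.
estimate = low_pass([100.0] * 50)
assert estimate[5] < 50.0          # soon after the step, the estimate lags badly
assert estimate[-1] > 99.0         # given enough steps, it converges
```

A test case that reads the estimate shortly after injecting a new external level will compare the requirement against a value that is still far from the injected one, which is exactly the source of the false fail verdicts described above.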

3.1.3 Warnings and detected errors

LBTest gave several warnings for the requirements, given the current implementation. Some were due to implicit assumptions and ambiguities in the requirements, others to difficulties in modelling the functionality of the SUT for LBTest without causing accidental errors. The warnings due to requirement ambiguities were handled by adding the implicit assumptions that were lacking, in order to continue with further testing of the SUT. The warnings from LBTest caused by modelling issues proved harder to work around. The attempt to add more time to each test case resulted in a substantially increased runtime, without completely avoiding false negatives from LBTest. This made it difficult to find actual bugs or injected errors in the SUT. The case study was therefore not followed up with benchmarking.

3.2 Case study 2: Dual-circuit steering

The dual-circuit steering functionality is implemented to ensure adequate steering ability in the presence of singular faults or when the engine is not running. The requirements for the function describe when the second hydraulic system, powered by an electric motor, should be activated. Other outputs affected are two CAN signals that communicate the status of the two hydraulic systems, one internal output signal and eight trouble codes. The input to the function consists of four CAN signals, the ignition and two sensors. In addition, a parameter setting specifies whether a dual-circuit steering system is connected (figure 3.2).


Figure 3.2. An illustration of the dual-circuit steering function from a black boxperspective.

3.2.1 Requirement formalisation

The requirements document consisted of 32 requirements. The majority of these did not form a mapping between the specified input and output variables. Instead, the requirements included specifications of so called model variables and their relations to input variables, output variables and each other. The model variables described qualities of the current state of the function, such as secondary circuit handles steering, primary circuit hydraulic malfunction or vehicle is moving. In total nine model variables were used in the document. Some of the variables matched internal signals that could be accessed by reading from memory. These variables could be seen as a form of output, but not at the current, black box, abstraction level. Another interpretation was to view the model variables as internal variables, keeping track of the last registered value of some of the inputs to the function. For example the variable vehicle is moving was specified to use the last registered value of vehicle speed. These variables turned out to be used in a similar way in the actual implementation. Writing test cases for internal variables in the implementation would be a form of white box testing, evaluating the implemented code rather than the functionality.

To keep the testing at a black box abstraction level, the requirements containing model variables were reformulated to only concern the relationship between the specified input and output variables. This was achieved by tracking when the model variables were set and what effect they had on the output of the function. Some of these variables were set to true if and only if a specific diagnostic trouble code was turned on, which made it possible to replace the variable itself with the trouble code. Others turned out to depend on several conditions on input, output and other model variables. An additional complication was the naming of the variables, which was not consistent throughout the requirements document.

An example of this process is the formalisation of Req 1 below. To fully understand it, seven other requirements (Req 2 – Req 8) had to be taken into consideration and partially merged. The variable names in the example have been replaced with token names and irrelevant information has been cut out. The input variables are underlined, output variables are bold and the model variables are in italic.

Req 1
  While variable3 == true
    If input4 == not set
      output2 = on

Req 2
  While input1 == off and input2 < 10
    if input3 == off for more than 1 second
      then variable2 = true and troublecode2 = on
    if input3 == on
      then variable2 = false and troublecode2 = off
  (...)

Req 3
  If variable1 == true or troublecode3 == on, then variable3 = true

Req 4
  If variable2 == true then variable3 = true

Req 5
  If input3 == off and the vehicle is moving, then variable4 = true
  (...)

Req 6
  If variable4 == true then variable3 = true

Req 7
  While input1 == on and input2 > 400
    if input3 == off for more than 1 second
      then variable1 = true and troublecode1 = on
    if input3 == on
      then variable1 = false and troublecode1 = off
  (...)

Req 8
  If speed > moving limit
    then vehicle is moving = true
  If speed < stationary limit
    then vehicle is moving = false

The resulting LTL requirement that captured the meaning of Req 1, after partitioning the vehicle speed, became:


G(((X(troublecode1 = on | troublecode3 = on | troublecode2 = on)
    | (input3 = off & ((speed = medium | speed = high)
        | (Y(speed = medium | speed = high) & speed = low))))
  & input4 = notset) → X(output2 = on))

Nine of the original requirements were excluded from the formalisation, for different reasons. Four requirements turned out to be non-functional at system test level, describing how data should be stored and where to access internal signals. Three requirements only described qualities of model variables, without affecting the actual output of the function. In addition, two requirements concerned electronic failures, such as the electric motor being short-circuited to battery or overloaded, which had to be tested on the HIL-rig. They were therefore not considered for this case study. The remaining 23 requirements were reformulated, formalised and translated to LTL.

In the reformulation process several requirements were merged to map input variables directly to output variables. Other requirements, containing disjunctions, were separated into two or more LTL requirements. The 23 original requirements that could be formalised resulted in 30 LTL requirements. Only one requirement, which demanded the input parameter to be set to “off”, explicitly stated the value of the parameter. For the remaining 29, where the value was assumed to be “on”, the setting was not mentioned, which caused obvious counterexamples from LBTest.
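As an illustration of the splitting step (using abstract propositions a, b and ψ rather than the actual signal names), separating a requirement whose antecedent contains a disjunction relies on a standard LTL equivalence, since G distributes over conjunction:

```latex
G\big((a \lor b) \rightarrow X\,\psi\big)
\;\equiv\;
G\big(a \rightarrow X\,\psi\big) \;\land\; G\big(b \rightarrow X\,\psi\big)
```

Each conjunct can then be checked as a separate LTL requirement, so that LBTest can report a counterexample for one disjunct independently of the other.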

A majority of the requirements followed the pattern G(φ → X(ψ)), as in the previous case study. Some requirements specified more complex statements by describing the relationship between past and future events, which made the addition of the past operators to LBTest very helpful. For example, some requirements specified events that should occur if a self test had been performed. The self test was neither an input nor an output variable from a black box perspective, but could be regarded as performed if the electric motor had been on while the second sensor had a flow or no flow, since the last engine restart. This quality could be expressed using past operators as:

(O(emotor = on & (sensor2 = flow | sensor2 = noflow)) S (ignition = restart))
| (O(emotor = on & (sensor2 = flow | sensor2 = noflow)) & H(ignition = on))
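To make the past operators concrete, the following is a minimal sketch of how O (once), S (since) and H (historically) can be evaluated over a finite trace of observations. The trace representation and predicate names are illustrative only and do not reflect LBTest's internals.

```python
# Sketch: evaluating the past-time LTL operators over a finite trace of
# observation dicts. Step i is the "current" time point; past operators
# look backwards from i. Names here are illustrative, not LBTest's API.

def once(trace, i, phi):
    """O(phi): phi held at some step j <= i."""
    return any(phi(trace[j]) for j in range(i + 1))

def historically(trace, i, phi):
    """H(phi): phi held at every step j <= i."""
    return all(phi(trace[j]) for j in range(i + 1))

def since(trace, i, phi, psi):
    """phi S psi: psi held at some step j <= i, and phi has held ever since."""
    return any(psi(trace[j]) and all(phi(trace[k]) for k in range(j + 1, i + 1))
               for j in range(i + 1))

# The "self test performed" condition from the text, as a predicate on a step:
selftest = lambda s: s["emotor"] == "on" and s["sensor2"] in ("flow", "noflow")
```

For instance, `once(trace, i, selftest)` checks whether the self-test condition has held at any point up to step i, which is the building block of the formula above.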

3.2.2 Modelling and partitioning

The configuration of input and output types for LBTest mainly followed the specified input and output variables as stated in the requirements document. One exception was the internal output signal that did not have an effect on system test level and could only be detected by reading from the emulator memory. After consulting with the testers at Scania, the requirements concerning this signal were excluded from the case study.

Five of the input and output variables were discrete and could take two to six values, which were specified for LBTest. The continuous variables, engine and vehicle speed, were partitioned based on the boundaries specified in the requirements, for example when the vehicle should be regarded as moving or the speed at which the electric motor should be turned on.
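The partitioning step can be sketched as a simple mapping from a raw continuous value to the symbolic value seen by LBTest. The boundary constants below are hypothetical placeholders; the actual thresholds come from the Scania requirements document.

```python
# Sketch: partitioning a continuous input into symbolic values for LBTest.
# The boundaries are hypothetical stand-ins for the specified thresholds.

LOW_MAX = 5       # km/h, hypothetical "stationary" boundary
MEDIUM_MAX = 30   # km/h, hypothetical boundary for the electric motor

def partition_speed(speed_kmh):
    """Map a raw vehicle speed to the symbolic value used in the LTL model."""
    if speed_kmh <= LOW_MAX:
        return "low"
    if speed_kmh <= MEDIUM_MAX:
        return "medium"
    return "high"
```

Choosing the partition boundaries directly from the requirements keeps the symbolic model faithful at exactly the points where the specified behaviour changes.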

This modelling strategy resulted in a model with over 60 states. The stopping criterion given to LBTest was 300 random checks, meaning that no difference could be found between the current hypothesis of the SUT and the actual behaviour of the SUT after executing 300 random input values. This level of convergence was reached after 7 hours and 24 minutes. The final measure of convergence found 30 differences out of 1000 random samples, an estimated convergence of 97%.
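The convergence figure is a simple ratio: 30 mismatches out of 1000 random samples give an agreement of roughly 97%. The following sketch shows the idea, where `sut` and `hypothesis` are hypothetical stand-ins for the real system and the learned model:

```python
import random

# Sketch: estimating convergence between a learned hypothesis and the SUT
# by random sampling, as described in the text. sut() and hypothesis()
# are hypothetical stand-ins, not LBTest's actual interfaces.

def estimate_convergence(sut, hypothesis, inputs, samples=1000, seed=0):
    """Return the fraction of random inputs on which both agree."""
    rng = random.Random(seed)
    mismatches = sum(sut(w) != hypothesis(w)
                     for w in (rng.choice(inputs) for _ in range(samples)))
    return 1.0 - mismatches / samples
```

With 30 mismatches in 1000 samples this returns 0.97, matching the figure reported above.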

3.2.3 Warnings and detected errors

The first warning given by LBTest was due to the missing information about the setting of the parameter value. Although only one requirement was tested with this setting, it would have caused warnings for the majority of the other 28 requirements, where the parameter setting was unspecified as well.

Five LTL requirements received a fail verdict from LBTest due to discrepancies between the requirements and the implementation under test. Two of the LTL requirements stemmed from the same original requirement, which specified when the diagnostic trouble codes should be deactivated after a previous activation. LBTest found counterexamples for two different trouble codes that were deactivated by an error on an input signal and activated again after the error was discontinued; a behaviour that, according to the requirement, should not occur. Two warnings also concerned requirements describing how errors on input signals should affect output signals and trouble codes. The final warning given by LBTest was due to a time limit for when a trouble code should be activated that was not respected. These five failed requirements were discussed with testing engineers at Scania and proved to be real faults in the SUT, although they were not considered safety critical.

3.3 Benchmarking

The benchmarking was conducted using a mutation testing strategy, by injecting 10 errors, one by one, into the source code of the dual-circuit steering function. Each version was then tested with both LBTest and the test cases currently in use at the department. The injected errors were changed boundary values, mixed-up input or output variables, or altered Boolean values: small, syntactically valid changes to the source code. The faults were picked randomly, without checking for equivalent mutants. The fault-injected code was tested with both LBTest and piTest, using the emulator platform. To avoid getting alerts for ambiguities or defects that were detected during the previous case study, each affected LTL requirement was either updated with exceptions for these instances or removed.
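The kinds of mutants described above can be illustrated on a small hypothetical fragment (shown in Python rather than the actual ECU implementation language; the function and constant are invented for illustration):

```python
# Sketch: the kinds of small, syntactically valid mutants injected during
# the benchmarking, on a hypothetical fragment (not the real ECU code).

MOVING_LIMIT = 5

def vehicle_is_moving(speed):        # original
    return speed > MOVING_LIMIT

def vehicle_is_moving_m1(speed):     # mutant: changed boundary value
    return speed > MOVING_LIMIT + 1

def vehicle_is_moving_m2(speed):     # mutant: relational operator altered
    return speed >= MOVING_LIMIT

def vehicle_is_moving_m3(speed):     # mutant: Boolean result negated
    return not (speed > MOVING_LIMIT)
```

A test suite that exercises the boundary, i.e. speeds equal to MOVING_LIMIT and MOVING_LIMIT + 1, distinguishes the original from each of these mutants.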

A known issue with the current version of LBTest that affected the benchmarking process was that the produced hypothesis of the SUT was deleted after each tested requirement. For each requirement LBTest had to re-learn the SUT and build a new model, instead of reusing the old one. This made testing a large number of


requirements a tedious task. The initial plan for the benchmarking was therefore to conjoin all 30 LTL requirements into one and run the conjunction against each fault-injected version of the source code, fault by fault. Even though this had worked in previous case studies of LBTest, each attempt to apply the strategy in this project resulted in a premature termination of LBTest. The root of the problem seemed to be the model checker, which could need up to 20 minutes to verify the conjoined requirements, likely causing a timeout in the communication with LBTest.

The configurations for LBTest were based on recommendations from the software reliability group at KTH. However, during the case studies it became apparent that the recommended SAT-based bounded model checker (BMC) had difficulties handling the size of the models that LBTest produced after about 50 iterations. Aborting testing this early could lead to an unfair comparison between the test methods. On the other hand, BMC tended to detect defects faster than the model checker based on binary decision diagrams (BDD), so a complete switch of model checker would result in a substantially longer runtime, which could potentially delay the project. Therefore a compromise was used. For each injected error, the source code was tested with piTest and with LBTest using the BMC model checker. If the error was detected using the BMC model checker, no further testing was conducted for that error. In case LBTest was not able to find the error after 50 iterations using the BMC model checker, additional testing was conducted using the BDD model checker, to make sure that the verdict was not caused by the limitations of the model checker.


Chapter 4

Results

This chapter presents the results of the two case studies and the benchmarking described in Chapter 3. The results of the formalisation process, the warnings given by LBTest and the benchmarking between LBTest and piTest are displayed.

4.1 Requirement formalisation

Analysing the requirements from the two case studies one by one led to the exclusion of 11 requirements out of a total of 39 original requirements (table 4.1). These were not tested during the case studies.

The non-functional requirements described qualities of the SUT, such as data storage and accessibility of signals, instead of expected behaviour given input and output values. The other non-formalisable requirements described functionality on white box level that did not have an effect on the function's output variables, or specified values that did not match the possible values of the particular variable. In addition, two requirements were excluded during the second case study since they could only be tested on a HIL-rig. Out of the eleven requirements that were not formalised and tested with LBTest, five were not tested in the current testing framework for emulator tests either.

              Original      Non-functional  Other non-     Other not
              requirements                  formalisable   testable
Case study 1  7             1               1              0
Case study 2  32            4               3              2
Total         39            5               4              2

Table 4.1. The unformalisable requirements from the two case studies.


4.2 Source of detected errors

              Ambiguous     Wrapper code   Deviations from
              requirements  or modelling   the requirements
Case study 1  2             1              0
Case study 2  1             0              5
Total         3             1              5

Table 4.2. The root cause for the warnings and fail verdicts given by LBTest.

Five of the fail verdicts and warnings given by LBTest were due to actual discrepancies between the requirements and the implementation that had not previously been detected (table 4.2). These were found during the second case study.

4.3 Benchmarking

Fault  piTest          LBTest – BMC  LBTest – BDD
1      Not terminated  Detected      -
2      Undetected      Detected      -
3      Detected        Undetected    Undetected
4      Not terminated  Detected      -
5      Undetected      Detected      -
6      Not terminated  Undetected    Undetected
7      Detected        Detected      -
8      Undetected      Detected      -
9      Not terminated  Detected      -
10     Not terminated  Detected      -

Table 4.3. The detection of injected faults by piTest and LBTest.

LBTest gave a pass verdict for two instances of error-injected code and piTest for three. LBTest gave a fail verdict for eight instances and piTest for two. The remaining five errors caused severe problems for piTest, since the execution of the test cases relied on certain initial values being reached during the set-up phase. piTest could not terminate properly when testing the altered code and no final verdict was given.

The BDD model checker was only used when the BMC model checker could not detect the injected error. No new detections were made by using BDD.


Chapter 5

Discussion

In this fifth chapter different aspects of the project are evaluated and analysed. Sections 5.1 and 5.2 discuss the outcomes and limitations of the project, and the results are compared to previous studies. In Section 5.3 the experience of working with LBTest is evaluated, describing issues regarding the tool and its usability.

5.1 The case studies in retrospect

The case studies indicated a large variance among ECU functions and their specifications in regard to speed, reactivity and the preciseness of the requirements. Determining whether a function would be suitable for testing using LBTest turned out to be more complex than expected.

The first case study, the low fuel level warning, at first appeared to be well suited for testing with LBTest. The input and output variables were clearly specified and most of the requirements were functional mappings between these variables. As it turned out, the requirements were not as clear as first assumed, which made some functionality difficult to assess. However, these issues could be expected to hold regardless of test method. The main obstacle for testing the function was the delay on the input signal caused by the filtering of input data, which made the results of testing the function unreliable. It was also difficult to utilize the strengths of LBTest with a function that has such a low level of reactivity, with only one input signal and static parameter settings. In general, the tool assumes all combinations of input values to be sound at any given state of the SUT. But the parameter settings in the low fuel level warning were so specific that only a few combinations were valid for each option, a very small subset of the total combinatorial state space.

The second case study, on the other hand, was based on a set of requirements that were not well formulated for black box testing. However, the structure of the function still made it possible to perform fruitful testing with LBTest. All combinations of input could be expected to be valid at any given time, which enabled a high degree of relevant test cases.


5.2 The results

5.2.1 The formalisation

The proportion of requirements that could not be used for testing turned out to be much higher than in the previously mentioned case study at Bosch (Post, Menzel & Podelski 2011). They found about 16% of the requirements to be unformalisable, compared to 23% in this project and 30% in the previous case study at Scania (Filipovikj, Nyberg & Rodriguez-Navas 2014). One important difference is that the Bosch study was conducted using requirements classified as functional, although some of these turned out to be non-functional in the end. No general policy for separating functional and non-functional requirements was given for the system requirements used for the two case studies in this project, and therefore no a priori separation could be made. A more relevant comparison would therefore be to exclude the group of non-functional requirements, which would result in about 12% of the functional requirements not being formalisable. This degree of formalisability among functional requirements is within the range of the two previous studies, 16% (Post, Menzel & Podelski 2011) and 8% (Filipovikj, Nyberg & Rodriguez-Navas 2014).

As could be expected, several of the issues described by Filipovikj, Nyberg & Rodriguez-Navas were encountered during this project as well, such as ambiguous statements and explicit assumptions among the requirements. Two of the major reasons for unformalisability in the Bosch study were requirements reasoning about possible behaviours, which would need a formal language with a branching-time concept, and a very high abstraction level with requirements concerning several ECUs. Neither of these occurred during this project. Instead, a more challenging task was to handle the low abstraction level, close to the implementation.

Being able to express the requirements one by one in a formal language was only one step in constructing a formal specification of the system's behaviour. As shown in the case study of the low fuel level warning, requirements expressing functional mappings between input and output variables can still give a vague or contradictory description of the SUT, making some aspects of the functionality difficult to assess.

The past operators that were added halfway through the project allowed for a simplified formalisation process. Although these operators in theory do not expand the expressiveness of LTL, some of the requirements would have been very difficult to formulate with only future operators, especially statements describing events that depended on earlier actions and in turn triggered future events.

5.2.2 Detected errors

The two major sources for fail verdicts by LBTest during the two case studies were incomplete or ambiguous requirements and discrepancies between the requirements and the implementation (table 4.2). A common issue was that some input variables and parameter settings were not explicitly stated in the main scenarios, although


the lack of these values or settings was assumed in the requirements. Vagueness in the requirements could also make it difficult to assess whether a fail verdict from LBTest was caused by an actual implementation error, since the requirement itself was difficult to interpret. Testers and developers at the department were consulted in these situations, but opinions occasionally differed, also among the Scania employees.

The deviations from the requirements that were detected by LBTest mainly concerned the activation and deactivation of output signals given certain error states, and the relationship between the output variables. The original requirements that the LTL requirements were based on described a desired behaviour for several variables in multiple situations, which was difficult to cover using the current testing framework. The strategy used by LBTest, injecting a high number of test cases to model the actual behaviour of the SUT, proved to be more appropriate in these cases.

5.2.3 The benchmarking

The initial idea was to inject a substantially larger number of errors into the source code that handled the dual-circuit steering function, but due to the long execution time of LBTest the study had to be severely limited. However, after considering the results of the two case studies it seems that a comparison should preferably be performed using different ECUs, given the large variance. Since the benchmarking was conducted using ten injected errors covering only one function, it is far from statistically reliable and can only serve as a base for discussion. Apart from the limited size of the study, using mutation testing as an evaluation strategy can be problematic. Mutation testing provides an indication of the type of errors that can be detected, but the coupling between the mutators and real bugs is not obvious.

LBTest detected 8 out of 10 injected errors and gave a pass verdict for two of the injected errors (table 4.3), while piTest gave a pass verdict for three instances of injected errors and a fail verdict for two. The remaining five injected errors could not be handled properly by the test cases in piTest. The test cases relied on a correct response from the affected variables during the setup of the test session, and a final verdict could therefore not be given. These instances were not categorised as detections. However, one could argue that it should be clear in such a scenario that either the implementation or the test suite is defective, and that the lack of a verdict would merit further investigation to pinpoint the defect.

It is difficult to draw conclusions from such a small case study, but the types of errors that were detected or missed do give an indication of the strengths and weaknesses of the test methods. The two errors that were not detected by LBTest were changes of boundary values. These were clearly covered in the requirements and should be relatively easy to detect. No deeper investigation into why they remained undetected was conducted. One explanation could be that LBTest relies on the model checker not only to provide correct input data for a potential counterexample, but also to predict the exact output of the SUT. If the actual output turns out to be slightly different than predicted, LBTest will give a pass verdict, although


the behaviour of the SUT might still violate the requirement.

The three injected errors for which piTest gave a pass verdict were small changes

that affected the output signals in different ways, for example by manipulating the conditions for when an output signal should return an error or by reading information from the wrong sensor. After examining the test cases that piTest executed, it was discovered that these output signals actually were examined, but no assertions were made to ensure that they gave an adequate response. The status of the output signal was instead displayed through a printed statement, indicating an abnormal value. However, the output from piTest extended over 1400 lines, making such a statement nearly invisible. Since the final verdict stated that no errors were found, this was not considered a detection.

The choice to keep the test cases in piTest on the same abstraction level as the requirements, performing a form of grey box testing, could arguably be the reason why three of the errors went undetected. Trying to cover all variables, regardless of whether they are relevant for the current test level, could make it more difficult to provide sufficient coverage at the system test level. A tool such as LBTest focuses on the relationship between the specified input and output variables, which keeps the testing at an adequate abstraction level.

5.3 Issues and usability of LBTest

Although LBTest gave a high error detection ratio in the benchmarking, it is a tool that demands much of the user, and it contains several defects that need to be worked around. The following is a summary of different issues encountered during the case studies that affect the possibility for LBTest to match the expectations of the industry.

• A known defect in LBTest is that the last hypothesis of the SUT is deleted after each requirement, which requires the user to either merge all the requirements into one or handle a severely increased runtime. A merge of the requirements is only useful in the situation where one wants to verify that no bug is found in the SUT. If the intent is to find as many bugs as possible the requirements still have to be executed one by one, since LBTest can only be expected to find one defect per requirement. In addition, a merge of the requirements is not always an option, as in the case of the benchmarking during this project.

• As described in previous studies of LBTest (Lundmark 2013, Nycander 2015), one difficulty in using the tool is to establish the source of a bug in the test environment. All test automation requires code reviews, and the risk of bugs in the communication with the SUT is unavoidable. But in LBTest the wrapper code must give correct information to two systems, both the SUT and LBTest. In addition, if a problem is determined to stem from LBTest, it could be caused by the model checker, the learning algorithm, or lie in the communication between them and LBTest. Although LBTest has become more user friendly


by, for example, providing proper error messages for missing arguments in the configuration file, other issues, such as syntax errors, can still be hard to detect.

• The runtime for LBTest can be very long, depending on the execution time for each test case. Improved learning algorithms and model checkers could reduce the runtime.

• LBTest does not terminate the testing session after finding a counterexample that results in a fail or warning. To save time during the benchmarking, each session was aborted after the first hypothesis that did not yield a pass. This issue therefore did not affect the project, but it would probably be problematic in a real-life setting. LBTest keeps running until one of the stopping criteria is met, to achieve the specified coverage. The final verdict must then be assessed by reviewing the verdicts for all hypotheses. One improvement would be to stop testing once a hypothesis is given a fail verdict, without considering the convergence of the model. The counterexample is valid even though the model differs in several aspects from the SUT, since it simply predicts a behaviour that is not allowed given the current requirements. If the SUT behaviour matches the predicted behaviour, a bug has been found, regardless of the state of the model that generated the counterexample. Another solution is for the tool to summarise the results at the end of the testing session and deliver a final verdict.

• It is not possible to flag or in any other way mark a defect as already known. To not receive the same error again, the requirement that specifies the behaviour in question must be reformulated or removed. Simple improvements of the user interface, such as remembering the mapping of the last configuration file between sessions, would also simplify the use of the tool.


Chapter 6

Conclusions and future work

This final chapter summarises the project and the conclusions that can be drawn from it. Guidelines for requirement formulation, based on the results of the project, are given, and suggestions are made for future work to evaluate and improve LBTest.

6.1 Conclusions

The aim of this project was to study whether LBTest could be used for testing of automotive ECUs, by examining to which degree the ECU requirements could be formalised, whether the ECU application's behaviour could be modelled by LBTest, and to which degree LBTest could find both existing and injected errors in the software.

Although well formulated requirements that provide an unambiguous specification of the SUT on an adequate abstraction level were of high importance, a large part of the inadequate requirements could be tested with LBTest after a thorough reformulation. About 23% of the requirements examined could not be successfully formalised for black box testing, and a majority of the remaining requirements had to be reformulated. The main reason for reformulating the requirements was that they did not form a mapping between the specified input and output variables and could therefore not be used for black box testing. Other reasons for reformulation were missing assumptions about input values, parameter settings for the main scenarios, and contradictory outcomes of overlapping scenarios. A more structured way of formulating requirements is strongly recommended.

The result from the case studies suggests a variance in regard to possible modelling strategies for LBTest among ECU specifications. Based on these two case studies alone it is difficult to assess the ratio of ECU applications that would be beneficial to test with a tool such as LBTest. For ECUs that share the qualities encountered during the second case study, fast embedded systems with a high level of reactivity, LBTest should be a useful tool for verification.

The strengths of LBTest, shown during the benchmarking and the case studies, were a clear focus on the variables to be measured and a broad coverage of states that could affect the output. Five instances of discrepancies between requirements


and implementation were revealed by LBTest during the second case study. The small benchmarking that was conducted resulted in a high degree of error detection for LBTest. Eight out of ten injected errors were given a fail verdict and two a pass verdict. This can be compared with the result of the current testing framework piTest, which gave a pass verdict for three errors and a fail verdict for two. Five of the injected errors could not be handled by piTest, since the test cases relied on being able to read certain input data during the test set-up.

In summary, LBTest is, at least to some degree, suitable for testing of ECU software. The suitability depends on the structure, reactivity and speed of the function, but also on the adequacy of the requirements. However, the tool has several issues that must be managed for it to be used effectively in industry. A more formal strategy for handling the requirements would be beneficial regardless of the choice of testing framework, given the ambiguities that were brought to light during the formalisation process.

6.2 Recommendations

Based on the experiences from the project, the following guidelines are recommended to avoid unnecessary ambiguities and to allow for testing based on formal specifications on system test level.

• The requirements document should clearly specify the input and output variables concerned, on an adequate abstraction level. Verify that variable names are kept consistent for all requirements concerned and that the values they are assigned are within the domain specified for the particular variable.

• Functional and non-functional requirements should be described in separate sections of the document.

• The functional requirements should form a mapping between the specified input and output variables, i.e. without involving any additional variables.

• The preconditions for each requirement must be clearly stated, including the requirements specifying the main scenarios. For example, parameter settings and assumptions about the absence of error states should be explicitly expressed.

• Check that the ranking of the requirements is clear, so that states with contradictory outcomes will not apply at the same time.

6.3 Future work

• Given the diversity indicated in this project, a broader investigation of the

requirements of different ECU functions could shed further light on the types of systems that would benefit from testing using a tool such as LBTest.


• A larger benchmarking, based on requirements of several ECU functions, should give a more precise verdict of how LBTest measures up against the current testing framework.

• Improvements in LBTest with regard to how the SUT is learnt and how test cases are evaluated could make the testing more efficient and reduce the runtime. Two ideas discussed during the project are to use existing test cases as the basis for a more relevant model for LBTest, and to add a verification of SUT behaviour against the requirements in the oracle step. In the current version of LBTest, the output of the SUT is only compared to the predicted output of the counterexample given by the model checker. An additional check of whether the SUT output violates the requirement could allow for earlier detection of bugs.
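The proposed oracle extension can be sketched as follows. This is a hypothetical illustration of the idea only, not LBTest's actual implementation; all function and signal names are invented.

```python
# Hypothetical sketch of the proposed oracle extension: besides comparing
# the SUT output with the violating output predicted by the model
# checker's counterexample, check the SUT output directly against the
# requirement, so a bug is caught even when model and SUT disagree.
def oracle_verdict(sut_output, predicted_output, requirement_holds):
    # Standard check: the SUT reproduces the predicted violating
    # output -> confirmed failure.
    if sut_output == predicted_output:
        return "fail"
    # Proposed extension: the SUT deviates from the learned model, but
    # its actual output still violates the requirement -> the bug is
    # detected now instead of in a later learning iteration.
    if not requirement_holds(sut_output):
        return "fail"
    # The SUT complies with the requirement; the learned model was
    # wrong and should be refined with this new observation.
    return "refine-model"

# Toy requirement: the output must be a valid warning-signal value.
valid = lambda out: out in ("warning_on", "warning_off")

print(oracle_verdict("error_state", "error_state", valid))  # fail (confirmed)
print(oracle_verdict("invalid", "error_state", valid))      # fail (extension)
print(oracle_verdict("warning_off", "error_state", valid))  # refine-model
```

The second call shows the pay-off of the extension: in the current scheme that execution would merely trigger model refinement, postponing detection of the violation.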



Chapter 7

Bibliography

Ammann, P. & Offutt, J. (2008). Introduction to Software Testing. Cambridge: Cambridge University Press.

Baker, R. & Habli, I. (2013). An Empirical Evaluation of Mutation Testing for Improving the Test Quality of Safety-Critical Software. IEEE Transactions on Software Engineering, 39(6), pp. 787-805.

Cavada, R., Cimatti, A., Jochim, C. A., Keighren, G., Olivetti, E., Pistore, M., Roveri, M. & Tchaltsev, A. (2010). NuSMV 2.6 User Manual. Available at: http://nusmv.fbk.eu/NuSMV/userman/v26/nusmv.pdf [Accessed 2016-05-29].

Crispin, L. & Gregory, J. (2009). Agile Testing: A Practical Guide for Testers and Agile Teams. Upper Saddle River, NJ: Addison-Wesley.

Dijkstra, E.W. (1970). Notes On Structured Programming. T.H.-Report 70-WSK-03, Technological University Eindhoven (2nd ed.)

Feng, L., Lundmark, S., Meinke, K., Niu, F., Sindhu, M. & Wong, P. (2013). Case Studies in Learning-Based Testing. Testing Software and Systems, Lecture Notes in Computer Science, vol. 8254, pp. 164-179.

Filipovikj, P., Nyberg, M. & Rodriguez-Navas, G. (2014). Reassessing the pattern-based approach for formalizing requirements in the automotive domain. In 2014 IEEE 22nd International Requirements Engineering Conference, IEEE, pp. 444-450.

Fisher, M. (2011). An introduction to practical formal methods using temporal logic. Chichester: Wiley.

Fraser, G., Wotawa, F. & Ammann, P. (2007). Testing with model checkers: a survey. SNA Technical Report SNA-TR-2007-P2-04. Graz: Competence Network Softnet Austria.

Gopinath, R., Jensen, C. & Groce, A. (2014). Mutations: How Close are they to Real Faults? In ISSRE '14: Proceedings of the 2014 IEEE 25th International Symposium on Software Reliability Engineering, Naples, 3-6 Nov. 2014, pp. 189-200.

Groce, A., Fern, A., Pinto, J., Bauer, T., Alipour, M., Erwig, M. & Lopez, C. (2012). Learning-Based Test Programming for Programmers. In International Symposium on Leveraging Applications of Formal Methods, Verification and Validation, Heraklion, Crete, October 2012.

Groce, A., Peled, D. & Yannakakis, M. (2002). Adaptive model checking. Tools and Algorithms for the Construction and Analysis of Systems, pp. 357-370.

ISO (2011). ISO 26262: Road vehicles - Functional safety. Geneva: ISO.

Jia, Y. & Harman, M. (2011). An Analysis and Survey of the Development of Mutation Testing. IEEE Transactions on Software Engineering, 37(5), pp. 649-678.

Jorgensen, P. C. (2002). Software Testing: A Craftsman's Approach. Taylor & Francis.

Lundmark, S. (2013). Learning-based Testing of a Large Scale Django Web Application: An Exploratory Case Study Using LBTest. Master's thesis, KTH.

Mathur, S. & Malik, S. (2010). Advancements in the V-Model. International Journal of Computer Applications (IJCA).

Meinke, K. (2004). Automated black-box testing of functional correctness using function approximation. In ISSTA '04: Proceedings of the 2004 ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 143-153. New York, NY, USA.

Meinke, K. (2015). Learning-based testing with LBTest. Stockholm: KTH

Meinke, K., Niu, F. & Sindhu, M. (2012). Learning-based software testing: A tutorial. In Leveraging Applications of Formal Methods, Verification, and Validation: International Workshops, Vienna, Austria, October 17-18, 2011, Revised Selected Papers, pp. 200-219.

Meinke, K. & Sindhu, M. (2013). LBTest: A Learning-based Testing Tool for Reactive Systems. In Proceedings - IEEE 6th International Conference on Software Testing, Verification and Validation, ICST 2013, pp. 447-454.

Nycander, P. (2015). Learning-Based Testing of Microservices: An Exploratory Case Study Using LBTest. Master's thesis, KTH.

Post, A., Menzel, I. & Podelski, A. (2011). Applying Restricted English Grammar on Automotive Requirements: Does it Work? A Case Study. In 17th International Working Conference, REFSQ, Essen, Germany, March 28-30, 2011.

Purushothaman, R. & Perry, D. (2005). Toward understanding the rhetoric of small source code changes. IEEE Transactions on Software Engineering, 31(6), pp. 511-526.

Pradella, M., San Pietro, P., Spoletini, P. & Morzenti, A. (2003). Practical model checking of LTL with past. In ATVA03: 1st Workshop on Automated Technology for Verification and Analysis, pp. 135-146.

Scania (2015a) AE202 Low Fuel Level Warning

Scania (2015b) AE417 Dual-Circuit Steering

Sindhu, M. (2013). Algorithms and Tools for Learning-based Testing of Reactive Systems. Doctoral thesis, KTH.



www.kth.se