TRANSCRIPT
Lazy Systematic Unit Testing for Java
Anthony J H Simons Christopher D Thomson
Overview
Lazy Systematic Unit Testing: testing concepts and methodology
The JWalk Tester tool: flagship of the JWalk 1.0 toolset
Dynamic analysis and pruning: smart interactive generation and evaluation
Oracle building and test prediction: building a test oracle with minimal user interaction
Head-to-head evaluation: a testing contest, JWalk versus JUnit
http://www.dcs.shef.ac.uk/~ajhs/jwalk/
Motivation
State of the art in agile testing: test-driven development is good, but there is no specification to inform the selection of tests, and manual test-sets are fallible (missing, redundant cases). Can we do better in test-case selection?
Regression testing, a touchstone? There are no specifications in XP, so saved tests are used instead, which become guarantors of correct behaviour. It is an article of faith that passing saved tests guarantees no faults were introduced in the modified unit. Actually no: state partitions cause a geometric decrease in effective state coverage (Simons, 2005).
Regression Testing Model
Base object proven correct by basic test set
Derived object refines Base object in some way
Basic test set used to test regression in Derived object
Passing regression tests proves that Derived conforms to Base
But this is an unreliable assumption!
[Diagram: Btest proves Base; Derived refines Base; Derived conforms to Base]
Test assumption: retesting “proves” compatible behaviour
Coverage of Base
[State diagram: states Discharged (¬ isOnLoan()) and Issued (isOnLoan()); transitions new(), issue(a), discharge(); observer borrower() raises an error in Discharged and is OK in Issued]
All pairs validated: T2 = C (L0 L1 L2) P
Reach every state and validate every transition pair
Coverage of Derived
[State diagram: the Derived model refines Discharged and Issued into the high-level states OnShelf (¬ reserved()), OnLoan (¬ reserved()), PutAside (reserved()) and Recalled (reserved()), with transitions new(), issue(a), discharge(), reserve(b), cancel(); observer borrower() raises an error when discharged and is OK when issued]
Only some pairs reached: reusing the same T2 test-set does not cover the refined model
Test Regeneration Model
[Diagram: Btest proves Base; Dtest proves Derived; Derived refines Base, so Derived transitively conforms to Base]
Only base object proven correct by basic test set
Derived object requires all-new tests, regenerated from derived specification
Derived object conforms to derived spec. by testing
Derived spec. conforms to base spec. by verification
Derived object conforms transitively to base spec.
New idea: conformity proven by both verification and testing
The Conundrum
Regression testing is too weak: saved tests don’t exercise the refined model; manual extra tests don’t cover all path combinations; the regression guarantee is progressively weakened.
Test regeneration is more reliable: all-new tests are generated from a refined specification; automatically generated tests cover all path combinations; there is a guarantee of repeatable test quality (for Tk).
How to replicate this for agile methods? There is no up-front specification from which to generate tests; the only artifact is the evolving code, which changes. Can we make any use of this?
Lazy Systematic Unit Testing
Lazy Specification: late inference of a specification from evolving code; semi-automatic, by static and dynamic analysis of the code with limited user interaction; the specification evolves in step with the modified code.
Systematic Testing: bounded exhaustive testing, up to the specification; emphasis on completeness, conformance and correctness properties after testing; repeatable test quality.
http://en.wikipedia.org/wiki/Lazy_systematic_unit_testing
JWalk Tester
Lazy systematic unit testing for Java:
static analysis: extracts the public API of a compiled Java class
protocol walk (all paths): explores and validates all interleaved methods to a given path depth
algebra walk (memory states): explores and validates all observations on all mutator-method sequences
state walk (high-level states): explores and validates an n-switch transition cover for all high-level states
http://www.dcs.shef.ac.uk/~ajhs/jwalk/
Try me
Example: Stack
Analysis of the API (protocol, algebra)
Test reports for each test cycle
Test statistics and summary report
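The Stack class itself is not shown in the slides; a minimal sketch of the kind of class being walked (hypothetical: the field names and the choice of IllegalStateException for empty-stack errors are assumptions) might be:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical test class: a minimal Stack of the kind JWalk might walk.
// pop() and top() throw on an empty stack, which is what makes the
// exception-pruning of later slides effective.
public class Stack {
    private final List<Object> items = new ArrayList<>();

    public void push(Object e) { items.add(e); }

    public Object pop() {
        if (items.isEmpty()) throw new IllegalStateException("empty stack");
        return items.remove(items.size() - 1);
    }

    public Object top() {
        if (items.isEmpty()) throw new IllegalStateException("empty stack");
        return items.get(items.size() - 1);
    }

    public int size() { return items.size(); }
}
```

JWalk needs only the compiled class: it discovers push, pop, top and size by static analysis, then explores their interleavings dynamically.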
Load the Test Class
Choose a location: the working directory; the root of a package is its parent directory.
Choose a test class: browse for the test class within a directory, or browse for a package-qualified class within a package.
Shortcut: type the (qualified) test class name directly.
Pick Settings and Go
Strategy: protocol (all methods), algebra (all constructions), or states (all states and transitions).
Modality: inspect (the interface), explore (exercise paths), or validate (against the oracle).
Test depth: maximum path length.
Start testing: click on the JWalker to run a test series.
Protocol Inspection
Protocol analysis: static analysis of the public API of the test class; includes all inherited public methods; may or may not include the standard Object methods: specify this through the custom settings.
Algebraic Inspection
Algebraic analysis: dynamic analysis of the algebraic categories: primitive, transformer and observer operations.
Technique: compares concrete object states; identifies unchanged or re-entrant states; controlled by probe-depth and state-depth (custom settings).
State Inspection
State analysis: dynamic analysis of high-level states; automatically names the discovered states; computes the state cover.
Technique: based on public state predicate methods; seeks the boolean state product (fails gracefully); controlled by probe-depth.
Baseline Approaches
Breadth-first generation: all constructors and all interleaved methods (eg JCrasher, DSD-Crasher, Jov); or generate-and-filter by state equivalence class (eg Rostra, Java Pathfinder).
Computational cost: exponential growth, memory issues, and wasteful over-generation, even if filtering is later applied:
#paths = Σ c·m^k, for k = 0..n
Key: c = #constructors, m = #methods, k = depth
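The growth is easy to check numerically. A small sketch of the formula (with c = 1 constructor and m = 6 methods, the values that reproduce the Stack baseline column in Table 1 later in the talk):

```java
// Cumulative number of paths tried by breadth-first generation:
//   #paths = sum over k = 0..n of c * m^k
// where c = #constructors, m = #methods, n = test depth.
public class PathCount {

    static long paths(int c, int m, int n) {
        long total = 0;
        long power = 1;                   // power = m^k
        for (int k = 0; k <= n; k++) {
            total += (long) c * power;
            power *= m;
        }
        return total;
    }

    public static void main(String[] args) {
        // c = 1, m = 6 reproduces the Stack baseline column:
        // 1, 7, 43, 259, 1555, 9331
        for (int n = 0; n <= 5; n++)
            System.out.println("depth " + n + ": " + paths(1, 6, n));
    }
}
```

With c = 1 and m = 8, the same formula gives 37449 at depth 5, the baseline figure in Table 2 for the reservable book.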
Dynamic Pruning
Interleaved analysis: generate-and-evaluate, pruning active paths on the fly (eg JWalk, Randoop); redundant prefix paths are removed after each test cycle, so there is no need to expand them in the next cycle.
Increasing sophistication:
prune prefix paths ending in exceptions (they fail again): JWalk, Randoop (2007)
and prefixes ending in algebraic observers (unchanged state): JWalk 0.8 (2007)
and prefixes ending in algebraic transformers (re-entrant state): JWalk 1.0 (2009)
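The simplest level, exception pruning, can be sketched in a few lines (an illustration of the idea, not JWalk's actual code), here driving java.util.Stack with three methods:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Stack;
import java.util.function.Consumer;

// Generate-and-evaluate with exception pruning: a path whose last call
// raised an exception is recorded as a result but never extended in the
// next test cycle.
public class PruningSketch {

    static final Map<String, Consumer<Stack<Integer>>> METHODS = Map.of(
            "push", s -> s.push(1),
            "pop",  Stack::pop,    // throws EmptyStackException when empty
            "top",  Stack::peek);  // throws EmptyStackException when empty

    // Replay a path on a fresh target; true if it completes normally.
    static boolean runsNormally(List<String> path) {
        Stack<Integer> target = new Stack<>();
        try {
            for (String m : path) METHODS.get(m).accept(target);
            return true;
        } catch (RuntimeException ex) {
            return false;
        }
    }

    // Explore all method paths up to the given depth, pruning error prefixes.
    static List<List<String>> explore(int depth) {
        List<List<String>> frontier = new ArrayList<>();
        frontier.add(List.of());                     // the bare new() path
        List<List<String>> explored = new ArrayList<>(frontier);
        for (int cycle = 1; cycle <= depth; cycle++) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> prefix : frontier)
                for (String m : METHODS.keySet()) {
                    List<String> path = new ArrayList<>(prefix);
                    path.add(m);
                    explored.add(path);
                    if (runsNormally(path)) next.add(path);  // prune failures
                }
            frontier = next;
        }
        return explored;
    }

    public static void main(String[] args) {
        // Depth 2: the baseline would try 1 + 3 + 9 = 13 paths; pruning the
        // two failing depth-1 paths (pop, top) leaves 1 + 3 + 3 = 7.
        System.out.println(explore(2).size());
    }
}
```

The baseline grows as m^k, while the pruned frontier grows only from the non-failing prefixes, which is the pattern visible in Tables 1 and 2.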
Protocol Exploration
Protocol strategy: explores all interleaved methods by brute force; explores all paths up to length n (the test depth); repeats invocations of the same method.
Pruning: paths raising exceptions in test cycle i are not extended in test cycle i+1.
Baseline
[Diagram: brute-force, breadth-first exploration of all new/push/top/pop paths; key: novel state, exception]
Prune Exceptions…
[Diagram: the same exploration with error-prefixes pruned (JWalk 0.8, Randoop); key: novel state, exception]
Algebraic Exploration
Algebraic strategy: explores all algebraic constructions; grows paths using only primitive operations; observes paths ending in any kind of operation.
Pruning: prunes paths ending in exceptions (in the next cycle); also paths ending in re-entrant or unchanged states.
Prune Observers
[Diagram: exploration with error- and observer-prefixes pruned (JWalk 0.8); key: novel state, exception, unchanged state]
…Transformers
[Diagram: exploration with error-, observer- and transformer-prefixes pruned (JWalk 1.0); key: novel state, exception, unchanged state, re-entrant state]
State Exploration
State strategy: reaches every high-level state; explores all transition paths up to length n from each state; achieves n-switch coverage.
Pruning: grows only primitive paths to reach all states; prunes paths ending in exceptions (in the next cycle).
Exploration Summary
Test settings: test class, strategy, modality, depth.
Exploration summary: # executed in total; # discarded (pruned); # exercised (normal); # terminated (exception).
Technique: the number discarded is calculated from the theoretical maximum number of paths.
The Same State?
Some earlier approaches: distinguish observers and mutators by signature (Rostra); intrusive state equality predicate methods (ASTOOT); external (partial) state equality predicates (Rostra); subsumption of execution traces in the JVM (Pathfinder).
Some algebraic approaches: shallow and deep equality under all observers (TACCLE), but this assumes the observations are also comparable and is very costly to compute from first principles; serialise object states and hash them (Henkel & Diwan), but not all objects are serialisable and there is no control over the depth of comparison.
State Comparison
Reflection-and-hash: extract the state vector from each object; compute a hash code for each field; combine them into an order-sensitive hash code.
Proper depth control: shallow or deep equality settings, to a chosen depth; hash on the pointer, or recursively invoke the algorithm.
Fast state comparison: each test evaluation stores the posterior state code; fast comparison with the preceding state, or with all prior states; possible to detect unchanged or re-entrant states.
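The shallow case of reflection-and-hash can be sketched as follows (an illustration of the idea, not JWalk's actual algorithm; JWalk additionally recurses to a chosen state-depth):

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Reflection-and-hash: extract the state vector by reflection, hash each
// field, and fold the hashes in sequence so the combination is
// order-sensitive.
public class StateHash {

    // Shallow state code: hash on each field value's own hashCode, rather
    // than recursing into sub-objects.
    public static int stateCode(Object target) {
        int code = 17;
        for (Field f : target.getClass().getDeclaredFields()) {
            if (Modifier.isStatic(f.getModifiers())) continue;  // instance state only
            f.setAccessible(true);
            Object value;
            try {
                value = f.get(target);
            } catch (IllegalAccessException ex) {
                throw new IllegalStateException(ex);
            }
            code = 31 * code + (value == null ? 0 : value.hashCode());
        }
        return code;
    }
}
```

Two objects in the same concrete state always receive the same code, so the stored posterior code from each test evaluation can be compared cheaply against prior codes to detect unchanged or re-entrant states (with the usual caveat that distinct states can, rarely, collide on the same hash).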
Pruning: Stack
Depth   baseline   except.   observ.   transf.
0             1         1         1         1
1             7         7         7         7
2            43        31        13        13
3           259       139        25        19
4          1555       667        43        25
5          9331      3391        79        31

Pruned: 9,300 redundant paths. Retained: 31 significant paths (at best, 0.33%).
Table 1: Cumulative paths explored after each test cycle (Stack)
Pruning: Reservable Book
Depth   baseline   except.   observ.   transf.
0             1         1         1         1
1             9         9         9         9
2            73        73        25        25
3           585       561        49        33
4          4681      4185        97        41
5         37449   mem. ex.      169        41

Pruned: 37,408 redundant paths. Retained: 41 significant paths (at best, 0.12%).
Table 2: Cumulative paths explored after each test cycle (ReservableBook)
Validation Modality
Lazy specification: interacts with the tester to confirm key results; uses predictive rules to infer further results; stores key results in a reusable test oracle.
Technique: key results are found at the leaves of the algebra tree; the predictions are applied to the other test strategies; the tester accepts or rejects each outcome.
Test Result Prediction
Semi-automatic validation: the user confirms or rejects key results; these constitute a test oracle, used in prediction; eventually > 90% of test outcomes are predicted.
JWalk test result prediction rules:
eg predict repeat failure: new().pop().push(e) == new().pop()
eg predict same state: target.size().push(e) == target.push(e)
eg predict same result: target.push(e).pop().size() == target.size()
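Two of these rules can be checked concretely; using java.util.Stack as a stand-in target (an illustration, not JWalk's oracle):

```java
import java.util.Stack;

// Checking the "same state" and "same result" prediction rules on
// java.util.Stack: size() is an observer, so calling it before push(e)
// leaves the same state; push(e).pop() is re-entrant, so size() afterwards
// equals size() before.
public class PredictionRules {
    public static void main(String[] args) {
        // same state: target.size().push(e) == target.push(e)
        Stack<String> a = new Stack<>();
        a.size();                          // observer: no state change
        a.push("e");
        Stack<String> b = new Stack<>();
        b.push("e");
        System.out.println(a.equals(b));   // true: same state

        // same result: target.push(e).pop().size() == target.size()
        Stack<String> t = new Stack<>();
        int before = t.size();
        t.push("e");
        t.pop();
        System.out.println(t.size() == before);   // true: same result
    }
}
```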
Try me
Kinds of Prediction
Strong prediction: from known results, guarantee further outcomes in the same equivalence class. eg observer prefixes: empirically checked before making any inference, the unchanged state is guaranteed: target.push(e).size().top() == target.push(e).top()
Weak prediction: from known facts, guess further outcomes; an incorrect guess will be revealed in the next cycle. eg methods with void type usually return no result, but may raise an exception: target.pop() is predicted to have no result; target.pop().size() == -1 reveals an error.
Algebraic Validation
Algebraic testing: grows all primitive paths ending in all operations; solicits results for the leaves of the algebra tree; the best mode in which to create an oracle.
Prediction: predicts void results; predicts results saved in previous test cycles.
The oracle predicts a correct outcome; the tester confirms an outcome.
Protocol Validation
Protocol testing: create the oracle first using the algebra strategy, then apply the same oracle in the protocol strategy; most results are predicted!
Prediction: (chains of) observers don’t affect states; re-entrant methods return to earlier states.
The oracle predicts many outcomes.
State Validation
State testing: extends the oracle created for the algebra strategy; can validate thousands of transition paths for a mere few tens of user confirmations.
Prediction: all results for “nearby” states are predicted; confirmations are needed for the more “remote” states.
The oracle predicts many outcomes.
Validation Summary
Test summary: other statistics as before.
Validation summary: # passed (in total); # failed (in total); # confirmed (by user); # rejected (by user); # correct (by oracle); # incorrect (by oracle).
10x automated vs manual checks.
Amortized Interaction Costs
The number of new confirmations, amortized over 6 test cycles. con = manual confirmations, at > 25 test cases/minute; pre = JWalk’s predictions, eventually > 90% of test cases.

Test class   a1   a2   a3   s1   s2    s3
LibBk con     3    5    7    0    0     5
LibBk pre     2    8   18   18   38   133
ResBk con     3   14   56    0   11    83
ResBk pre     6   27   89   36  241  1649

eg algebra-test to depth 2: 14 new confirmations
eg state-test to depth 2: 241 predicted results
Feedback-based Methodology
Coding: the programmer prototypes a Java class in an editor.
Exploration: JWalk systematically explores method paths, providing useful instant feedback to the programmer.
Specification: JWalk infers a specification, building a test oracle based on key test results confirmed by the programmer.
Validation: JWalk tests the class to bounded exhaustive depths, based on confirmed and predicted test outcomes, using state-based test generation algorithms.
Example – Library Book
Exploration surprise: target.issue(“a”).issue(“b”).getBorrower() == “b” violates the business rules: fix the code to raise an exception.
Validation: all observations on chains of issue(), discharge(); n-switch cover on the states {Default, OnLoan}.

public class LibraryBook {
    private String borrower;
    public LibraryBook();
    public void issue(String);
    public void discharge();
    public String getBorrower();
    public Boolean isOnLoan();
}
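A minimal implementation consistent with these signatures and with the fixed business rule is sketched below (hypothetical: the slides say only "raise an exception", so the choice of IllegalStateException is an assumption):

```java
// Hypothetical implementation of the slide's LibraryBook, after the fix
// prompted by the exploration surprise: issuing a book that is already on
// loan now raises an exception rather than silently replacing the borrower.
public class LibraryBook {
    private String borrower;

    public void issue(String reader) {
        if (borrower != null)
            throw new IllegalStateException("already on loan");
        borrower = reader;
    }

    public void discharge() {
        borrower = null;   // a null-op when the book is already discharged
    }

    public String getBorrower() {
        return borrower;
    }

    public Boolean isOnLoan() {
        return borrower != null;
    }
}
```

Note that discharge() on a discharged book is a deliberate null-op, the corner case the manual tester missed in the later comparison.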
Extension – Reservable Book
Exploration: only revisits novel interleaved permutations of methods; surprise: target.reserve(“a”).issue(“b”).getBorrower() == “b”.
Validation: all observations on chains of issue(), discharge(), reserve(), cancel(); n-switch cover on the states {Default, OnLoan, Reserved, Reserved&OnLoan}.

public class ReservableBook extends LibraryBook {
    private String requester;
    public ReservableBook();
    public void reserve(String);
    public void cancel();
    public String getRequester();
    public Boolean isReserved();
}
Evaluation
User acceptance: programmers find JWalk habitable; they can concentrate on the creative aspects (coding) while JWalk handles the systematic aspects (validation, testing).
The main cost is confirmations: not so burdensome, since they are amortized over many test cycles; metric: measure the amortized confirmations per test cycle.
Comparison with JUnit: a common testing objective for manual and lazy systematic testing; evaluate coverage and testing effort. Eclipse+JUnit vs. JWalkEditor, given the task of testing the “transition cover + all equivalence partitions of inputs”.
Comparison with JUnit (manual testing method)
Manual test creation takes skill, time and effort (eg ~20 min to develop the manual cases for ReservableBook).
The programmer missed certain corner cases, eg target.discharge().discharge() - a null-op?
The programmer redundantly tested some properties, eg assertTrue(target != null) - multiple times.
The state coverage for LibraryBook was incomplete, due to the programmer missing hard-to-see cases.
The saved tests were not reusable for ReservableBook, for which all-new tests were written to test the new interleavings.
Advantages of JWalk
JWalk automates test case selection, relieving the programmer of the burden of thinking up the right test cases!
Each test case is guaranteed to test a unique property.
Interactive test result confirmation is very fast (eg ~80 sec in total for the 36 unique test cases in ReservableBook).
All states and transitions are covered, including null-ops, to the chosen depth.
The test oracle created for LibraryBook formed the basis for the new oracle for ReservableBook, but JWalk presented only those sequences involving new methods, and all their interleavings with inherited methods.
Measuring the Testing?
Suppose an ideal test set:
BR : behavioural response (a set)
T : tests to be evaluated (a bag - duplicates possible)
TE = BR ∩ T : effective tests (a set)
TR = T − TE : redundant tests (a bag)
Define test metrics:
Ef(T) = (|TE| − |TR|) / |BR| : effectiveness
Ad(T) = |TE| / |BR| : adequacy
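These two metrics are simple ratios, and can be checked against the figures on the next slide (a sketch; the sizes |BR| = 10 for LibraryBook and |BR| = 40 for ReservableBook are back-computed from the reported adequacy percentages, not stated directly):

```java
// Computing the two test metrics of the "Measuring the Testing?" slide.
public class TestMetrics {

    // Ad(T) = |TE| / |BR| : adequacy
    static double adequacy(int te, int br) {
        return (double) te / br;
    }

    // Ef(T) = (|TE| - |TR|) / |BR| : effectiveness
    static double effectiveness(int te, int tr, int br) {
        return (double) (te - tr) / br;
    }

    public static void main(String[] args) {
        // LibBk manual: TE = 9, TR = 22, assumed |BR| = 10
        System.out.println(adequacy(9, 10));          // 0.9, i.e. 90%
        // ResBk manual: TE = 21, TR = 83, assumed |BR| = 40
        System.out.println(adequacy(21, 40));         // 0.525, reported as 53%
        // ResBk jwalk: TE = 36, TR = 0
        System.out.println(effectiveness(36, 0, 40)); // 0.9
    }
}
```

Note that a heavily redundant manual suite can have negative effectiveness even while its adequacy looks respectable, which is exactly the JUnit pathology the talk describes.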
Speed and Adequacy of Testing
Test goal: transition cover + equivalence partitions of inputs. Manual testing was expensive, redundant and incomplete; JWalk testing was very efficient and close to complete.
eg the programmer wrote 104 tests, of which 21 were effective and 83 were not!
eg JWalk achieved 100% test coverage.

Test class     T    TE   TR   Adeq.   Time (min.sec)
LibBk manual   31    9   22    90%    11.00
ResBk manual  104   21   83    53%    20.00
LibBk jwalk    10   10    0   100%     0.30
ResBk jwalk    36   36    0    90%     0.46
Some Conclusions
JUnit: expert manual testing: massive over-generation of tests (w.r.t. the goal); sometimes adequate, but not effective; stronger (t2, t3), duplicated, and missed tests; hopelessly inefficient - it also requires debugging the test suites!
JWalk: lazy systematic testing: near-ideal coverage, adequate and effective; a few input partitions missed (simple generation strategy); very efficient use of the tester’s time: seconds, not minutes; or, orders of magnitude (x 1000) more tests for the same effort.
More Conclusions
Feedback-based development: an unexpected gain was the automatic validation of prototype code; c.f. Alloy’s model checking from a partial specification.
Moral for testing: automatically executing saved tests is not so great; systematic test generation tools are needed to get coverage; automate the parts that humans get wrong, and let humans focus on judging right/wrong responses.
JWalk 1.0 Toolset
JWalk Tester JWalk Utility JWalk Editor
JWalk Marker JWalk Grapher JWalk SOAR
Example: JWalk Editor
© Neil Griffiths, 2008
Any Questions?
http://www.dcs.shef.ac.uk/~ajhs/jwalk/
Put me to the test!
© Anthony Simons, 2009, with help from Chris Thomson, Neil Griffiths, Mihai Gabriel Glont, Arne-Michael Toersel
Custom Configuration
Oracle directory: the default is the test class directory; pick a new location if required.
Convention: standard excludes all of Object’s methods; custom includes some; complete includes all.
Probe depth: the maximum path length for dynamic analysis.
State depth: the tree depth for object state comparison; shallow state (including array values) by default.
Generators
The heart of JWalk: synthesise test input values on demand; try to assure an even spread of inputs for a given type; by default, supply monotonic sequences of values.
MasterGenerator: the built-in ObjectGenerator is fairly comprehensive; it synthesises basic values, arrays, standard objects, etc.
CustomGenerator: take control of how particular types are synthesised; provide custom generators and add them to a master as delegates; eg StringGenerator, EnumGenerator, InterfaceGenerator.
Custom Generators
Choose a location: the default is the test class directory.
Choose a generator: enter the generator directly, or browse within a package.
Click add/remove: add a custom generator to the list, or remove a generator from the list.
CustomGenerator Interface
Provide a generator class with:

public boolean canCreate(Class<?> type);
public Object nextValue(Class<?> type);
public void setOwner(MasterGenerator master);

Key points: it advertises which types it can synthesise; it generates a sequence of objects on demand; it may keep a handle to its owning master. eg InterfaceGenerator maps interface types onto concrete classes and invokes nextValue recursively (on its master).
Example: IndexGenerator
public class IndexGenerator implements CustomGenerator {
    private int seed = 1;
    private boolean flag = false;

    // specific to the int index type
    public boolean canCreate(Class<?> type) {
        return type == int.class;
    }

    // creates repeating pairs of indices
    public Object nextValue(Class<?> type) {
        if (flag) { flag = false; return seed++; }
        else { flag = true; return seed; }
    }

    // null-op: ignores the master generator
    public void setOwner(MasterGenerator master) {}
}
When are they Useful?
IndexGenerator: generates repeating pairs of indices; exercises put/get pairs in vector and array types.
StdIOGenerator: redirects System.in, System.out to conventional files; tests programs with I/O using prepared data in files.
FileGenerator: takes control of filenames and streams (security); tests programs using prepared data in files.
Arbitrary test set-up: take control of how the environment is established.