software robustness testingusers.ece.cmu.edu/~koopman/ballista/02_06_ballista...general testing...

Software Robustness Testing: A Ballista Retrospective

Phil [email protected]

http://ballista.org

With contributions from: Dan Siewiorek, Kobey DeVale, John DeVale, Kim Fernsler, Dave Guttendorf, Nathan Kropp, Jiantao Pan, Charles Shelton, Ying Shi

Institute for Complex Engineered Systems


Overview

Introduction

• APIs aren’t robust (and people act as if they don’t want them to be robust!)

Top 4 Reasons people give for ignoring robustness improvement

• “My API is already robust, especially for easy problems” (it’s probably not)

• “Robustness is impractical” (it is practical)

• “Robust code will be too slow” (it need not be)

• “We already know how to do it, thank you very much” (perhaps they don’t)

Conclusions

• The big future problem for “near-stationary” robustness isn’t technology; it is awareness & training


Ballista Software Testing Overview

Abstracts testing to the API/Data type level
• Most test cases are exceptional
• Test cases based on best-practice SW testing methodology

[Figure: Input space (valid inputs, invalid inputs); specified behavior (should work, undefined, should return error); module under test; response space (robust operation, reproducible failure, unreproducible failure)]

Ballista: Test Generation (fine grain testing)

Tests developed per data type/subtype; scalable via composition
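As an illustration of composing per-data-type tests, here is a minimal Python sketch. It is not Ballista's actual implementation. The pool names, the values in them, and the write-like signature are all assumptions made for the example. The only idea shown is that a function's test cases can be formed as the cross product of the test values defined for each of its parameter types.

```python
from itertools import product

# Hypothetical test-value pools, one per data type. These names and values
# are illustrative assumptions, not the pools Ballista ships with.
TEST_VALUES = {
    "fd": [-1, 0, 9999],
    "buffer": [None, b"", b"abc", bytearray(16)],
    "length": [-1, 0, 1, 2**31 - 1],
}


def generate_tests(param_types):
    """Return every combination of test values for a list of parameter types."""
    pools = [TEST_VALUES[type_name] for type_name in param_types]
    return list(product(*pools))


# Hypothetical write(fd, buffer, length)-style signature.
write_tests = generate_tests(["fd", "buffer", "length"])
print(len(write_tests))  # 3 * 4 * 4 = 48 combinations
```

Adding a new function then only requires naming its parameter types; new test values added to a pool are picked up by every function that uses that type.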


Initial Results: Most APIs Weren’t Robust

Unix & Windows systems had poor robustness scores:
• 24% to 48% of intentionally exceptional Unix tests yielded non-robust results
• Found simple “system killer” programs in Unix, Win 95/98/ME, and WinCE

Even critical systems were far from perfectly robust
• Safety critical operating systems
• DoD HLA (where their stated goal was 0% robustness failures!)

Developer reactions varied, but were often extreme
• Organizations emphasizing field reliability often wanted 100% robustness
• Organizations emphasizing development often said “core dumps are the Right Thing”
• Some people didn’t care
• Some people sent hate mail
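Some of the failure classes named in the results (Abort, Restart, Silent, Catastrophic) can be told apart by a simple harness. The sketch below is a hypothetical Python illustration for POSIX systems, not the Ballista harness. It assumes that an Abort shows up as a child process killed by a signal and that a Restart shows up as a child that hangs past a timeout. The function name classify, the timeout values, and the child snippets are invented for the example. Silent and Catastrophic failures cannot be detected by a check like this.

```python
import subprocess
import sys


def classify(child_code, timeout_s=10.0):
    """Run child_code in a separate Python process and classify the outcome."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", child_code],
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # The child hung and was killed by the harness.
        return "restart"
    if result.returncode < 0:
        # On POSIX, a negative return code means the child died from a signal.
        return "abort"
    # The child returned on its own, with either success or an error code.
    return "completed"


print(classify("import os; os.abort()"))  # expected "abort" on POSIX
print(classify("import time; time.sleep(60)", timeout_s=1.0))  # expected "restart"
```

Running each test in its own process is what keeps an Abort or a hang in the code under test from taking the harness down with it.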


Comparing Fifteen POSIX Operating Systems

[Figure: Ballista Robustness Tests for 233 POSIX Function Calls. Bar chart of normalized failure rate (axis 0% to 25%), split into Abort failures and Restart failures, for AIX 4.1, FreeBSD 2.2.5, HP-UX 9.05, HP-UX 10.20, Irix 5.3, Irix 6.2, Linux 2.0.18, LynxOS 2.4.0, NetBSD 1.3, OSF 1 3.2, OSF 1 4.0, QNX 4.22, QNX 4.24, SunOS 4.1.3, and SunOS 5.5. Four bars are annotated “1 Catastrophic” and one is annotated “2 Catastrophics”.]

Robustness: C Library vs. System Calls

[Figure: Portions of Failure Rates Due To System/C-Library. Bar chart of normalized failure rate (axis 0% to 25%), with each bar split into C Library and System Calls portions, for AIX 4.1, FreeBSD 2.2.5, HP-UX 9.05, HP-UX 10.20, Irix 5.3, Irix 6.2, Linux 2.0.18, LynxOS 2.4.0, NetBSD 1.3, OSF 1 3.2, OSF 1 4.0, QNX 4.22, QNX 4.24, SunOS 4.1.3, and SunOS 5.5. Four bars are annotated “1 Catastrophic” and one is annotated “2 Catastrophics”.]

Estimated N-Version Comparison Results

[Figure: Normalized Failure Rate by Operating System. Bar chart of normalized failure rate after analysis (axis 0% to 50%), split into Abort %, Silent %, and Restart %, for each operating system tested: SunOS 5.5, SunOS 4.1.3, QNX 4.24, QNX 4.22, OSF-1 4.0, OSF-1 3.2, NetBSD, Lynx, Linux, Irix 6.2, Irix 5.3, HPUX 10.20, HPUX 9.05, FreeBSD, and AIX. An asterisk (*) marks a Catastrophic failure.]

Failure Rates by Function Group


Even Those Who Cared Didn’t Get It Right

OS vendors didn’t accomplish their stated objectives, e.g.:
• IBM/AIX wanted few Aborts, but had 21% Aborts on POSIX tests
• FreeBSD said they would always Abort on exception (that’s the Right Thing) but had more Silent (unreported) exceptions than AIX!
• Vendors who said their results would improve dramatically on the next release were usually wrong

Safe Fast I/O (SFIO) library
• Ballista found that it wasn’t as safe as the authors thought
  – Missed: valid file checks; modes vs. permissions; buffer size/accessibility

Do people understand what is going on?
• We found four widely held misconceptions that prevented improvement in code robustness


#1: “Ballista will never find anything (important)”

1. “Robustness doesn’t matter”
• HP-UX gained a system-killer in the upgrade from Version 9 to 10
  – In newly re-written memory management functions…
  – … which had a 100% failure rate under Ballista testing
• So, robustness seems to matter!

2. “The problems you’re looking for are too trivial -- we don’t make those kinds of mistakes”
• HLA had a handful of functions that were very non-robust
• SFIO even missed some “easy” checks
• See Unix data to the right…

[Figure: Robustness failure rate (axis 0% to 100%) by function group (Dev/Class-Specific, Synchronization, Clocks & Timers, C Library, Process Prims., System Databs, Process Env., Files & Dirs, I/O Primitives, Messaging, Memory Mng., Scheduling) for OSF1 3.2, SunOS 5.5, SunOS 4.1.3, QNX 4.24, QNX 4.22, OSF1 4.0B, NetBSD 1.3, LynxOS 2.4.0, Linux 2.0.18, IRIX 6.2, IRIX 5.3, HP-UX A.09.05, AIX 4.1, HP-UX B.10.20, and FreeBSD 2.2.5.]

#2: “100% robustness is impractical”

The use of a metric – in our case Ballista – allowed us to remove all detectable robustness failures from SFIO and other API subsets
• (Our initial SFIO results weren’t entirely zero; but now they are)

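[Figure: Abort Failure Rate for Select Functions. Bar chart of % Abort failures (axis 0 to 90) for open, write, read, close, fileno, seek, sfputc, and sfgetc, comparing STDIO, Original SFIO, and Robust SFIO. Data labels in extraction order: 14, 57, 52, 79, 36, 8, 58, 50; then 6.29, 5.99, 9., 2, 2, 1, 5.01, 2, 4.5 (partly garbled); then 0 for each of the eight functions.]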

Can Even Be Done With “Ordinary” API

Memory & semaphore robustness improved for Linux
• Robustness hardening yielded 0% failure rate on standard POSIX calls below

[Figure: Failure rates for memory/process original Linux calls (all failure rates are 0% after hardening). Bar chart of failure rate (%) (axis 0.0 to 80.0) by function name, grouped into a Mem. Manipulation Module (memchr, memcmp, memcpy, memmove, memset) and a Process Synch. Module (sem_init, sem_destroy, sem_getvalue, sem_post, sem_trywait, sem_wait). Data labels in extraction order: 16.4, 31.3, 75.2, 75.2, 57.8, 7.3, 35.3, 59.5, 64.7, 41.2, 64.7.]

#3: “It will be too slow”

Solved via caching validity checks
• Completely software-implemented cache for checking validity
• Check validity once, remember result
  – Invalidate validity check when necessary

[Figure: Cache Structure in Memory. Labels: Verification Module (Reference, Lookup, Store, Result); Clear Module (Invalidate).]
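The check-once, remember, invalidate cycle can be sketched in a few lines. This is an illustrative Python model, not the Ballista cache. The class name ValidityCache, the use of a set instead of a fixed-size indexed table, and the stand-in set of live addresses are all assumptions made for the example.

```python
class ValidityCache:
    """Software cache of references that have already passed a validity check."""

    def __init__(self, verify):
        self._verify = verify   # expensive check: reference -> bool
        self._valid = set()     # references known to be valid
        self.checks_run = 0     # how many times the expensive check ran

    def is_valid(self, ref):
        if ref in self._valid:  # lookup: cache hit, skip the expensive check
            return True
        self.checks_run += 1
        ok = self._verify(ref)
        if ok:
            self._valid.add(ref)  # store: only successful results are cached
        return ok

    def invalidate(self, ref):
        """Clear a cached result, e.g. when the underlying object is freed."""
        self._valid.discard(ref)


live = {0x1000, 0x2000}  # stand-in for the set of live allocations
cache = ValidityCache(lambda addr: addr in live)

cache.is_valid(0x1000)   # runs the expensive check
cache.is_valid(0x1000)   # served from the cache
print(cache.checks_run)  # 1

live.remove(0x1000)      # the object goes away...
cache.invalidate(0x1000) # ...so its cached result must be cleared
print(cache.is_valid(0x1000))  # False
```

Skipping the invalidate step would leave a stale "valid" entry behind, which is why the clear path matters as much as the lookup path.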


Caching Speeds Up Validity Tests

Worst-case of tight loops doing nothing but “mem” calls is still fast
• L2 Cache misses would dilute effects of checking overhead further

[Figure: Slowdown of robust memory functions with tagged malloc. Slowdown (axis 0.95 to 1.25) versus buffer size in bytes (16 to 4096) for memchr, memcpy, memcmp, memset, and memmove.]

Future MicroArchitectures Will Help

Exception & validity check branches are highly predictable
• Compiler can structure code to assume validity/no exceptions
• Compiler can give hints to branch predictor
• Branch predictor will quickly figure out the “valid” path even with no hints
• Predicated execution can predicate on “unexceptional” case

Exception checks can execute in parallel with critical path
• Superscalar units seem able to execute checks & functions concurrently
• Out of order execution lets checks wait for idle cycles

The future brings more speculation; more concurrency
• Exception checking is an easy target for these techniques
• Robustness is cheap and getting cheaper (if done with a view to architecture)


#4: “We Did That On Purpose”

Variant: “Nobody could reasonably do better”
• Despite the experiences with POSIX, HLA & SFIO, this one persisted
• So, we tried an experiment in self-evaluating robustness

Three experienced commercial development teams
• Components written in Java
• Each team self-rated the robustness of their component per Maxion’s “CHILDREN” mnemonic-based technique
• We then Ballista tested their (pre-report) components for robustness
• Metric: did the teams accurately predict where their robustness vulnerabilities would be?
  – They didn’t have to be perfectly robust
  – They all felt they would understand the robustness tradeoffs they’d made


Self Report Results: Teams 1 and 2

They were close in their prediction
• Didn’t account for some language safety features (divide by zero)
• Forgot about, or assumed language would protect them against NULL in A4

[Figure: Component A and B Robustness. Bar chart of Abort failure rate (%) (axis 0 to 25), measured versus expected, for functions A1 to A4 and B1 to B4.]

Self Report Data: Team 3

Did not predict several failure modes
• Probably could benefit from additional training/tools

[Figure: Component C Total Failure Rate. Bar chart of total failure rate (%) (axis 0 to 60), measured versus expected, by method designation (C1 through C27F).]

Conclusions: Ballista Project In Perspective

General testing & wrapping approach for Ballista
• Simple tests are effective(!)
  – Scalable for both testing and hardening
• Robustness tests & wrappers can be abstracted to the data type level
  – Single validation fragment per type – i.e. checkSem(), checkFP()…

Wrappers are fast (under 5% penalty) and usually 100% effective
• Successful check results can be cached to exploit locality
  – Typical case is an index lookup, test and jump for checking cache hit
  – Typical case can execute nearly “for free” in modern hardware
• After this point, it is time to worry about resource leaks, device drivers, etc.
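The idea of one validation fragment per data type, composed into a wrapper per function, might look like the following Python sketch. It is an analogy only, not the Ballista wrapper code, which targets C APIs. The names check_writable_buffer, check_readable_buffer, check_length, harden, and copy_bytes are hypothetical, as is the choice to report invalid arguments as a negative errno value.

```python
import errno


# One validation fragment per data type (hypothetical analogs of the
# checkSem()/checkFP() style of fragment named above).
def check_writable_buffer(value):
    return isinstance(value, bytearray)


def check_readable_buffer(value):
    return isinstance(value, (bytes, bytearray))


def check_length(value):
    return isinstance(value, int) and value >= 0


def harden(*checks):
    """Build a wrapper that validates each argument with its type's fragment."""
    def decorate(func):
        def wrapper(*args):
            if len(args) != len(checks):
                return -errno.EINVAL
            for check, arg in zip(checks, args):
                if not check(arg):
                    return -errno.EINVAL  # report an error instead of failing
            return func(*args)
        return wrapper
    return decorate


@harden(check_writable_buffer, check_readable_buffer, check_length)
def copy_bytes(dst, src, n):
    """Hypothetical memcpy-like function; returns the number of bytes copied."""
    count = min(n, len(dst), len(src))
    dst[:count] = src[:count]
    return count


target = bytearray(4)
print(copy_bytes(target, b"abcd", 2))                  # 2
print(copy_bytes(None, b"abcd", 2) == -errno.EINVAL)   # True
```

Without the wrapper, passing None as the destination raises an exception inside copy_bytes; with it, the caller gets an error code. Each fragment is written once and reused by every function that takes that type.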

But, technical solution alone is not sufficient
• Case study of self-report data
  – Some developers unable to predict code response to exceptions
• Training/tools needed to bridge gap
  – Even seasoned developers need a QA tool to keep them honest
  – Stand-alone Ballista tests for Unix under GPL; Windows XP soon


Future Research Challenges In The Large

Quantifying “software aging” effects
• Simple, methodical tests for resource leaks
  – Single-threaded, multi-threaded, distributed all have different issues
  – One problem is multi-thread contention for non-reentrant resources
    » e.g., exception handling data structures without semaphore protection
• Measurement & warning systems for need for SW rejuvenation
  – Much previous work in predictive models
  – Can we create an on-line monitor to advise it is time to reboot?

Understanding robustness tradeoffs from developer point of view
• Tools to provide predictable tradeoff of effort vs. robustness
  – QA techniques to ensure that desired goal is reached
  – Ability to specify robustness level clearly, even if “perfection” is not desired
• Continued research in enabling ordinary developers to write robust code
• Need to address different needs for development vs. deployment
  – Developers want heavy-weight notification of unexpected exceptions
  – In the field, may want a more benign reaction to exceptions
