software robustness testingusers.ece.cmu.edu/~koopman/ballista/02_06_ballista...general testing...
TRANSCRIPT
Software Robustness TestingA Ballista Retrospective
Phil [email protected]
http://ballista.org
With contributions from:Dan Siewiorek, Kobey DeValeJohn DeVale, Kim Fernsler, Dave Guttendorf,
Nathan Kropp, Jiantao Pan, Charles Shelton, Ying Shi
Institutefor ComplexEngineeredSystems
2
Overview
Introduction
• APIs aren’t robust (and people act as if they don’t want them to be robust!)
Top 4 Reasons people give for ignoring robustness improvement
• “My API is already robust, especially for easy problems” (it’s probably not)
• “Robustness is impractical” (it is practical)
• “Robust code will be too slow” (it need not be)
• “We already know how to do it, thank you very much” (perhaps they don’t)
Conclusions
• The big future problem for “near-stationary” robustness isn’t technology --
it is awareness & training
3
Ballista Software Testing Overview
Abstracts testing to the API/Data type level• Most test cases are exceptional• Test cases based on best-practice SW testing methodology
INPUTSPACE
RESPONSESPACE
VALIDINPUTS
INVALIDINPUTS
SPECIFIEDBEHAVIOR
SHOULDWORK
UNDEFINED
SHOULDRETURNERROR
MODULEUNDER
TEST
ROBUSTOPERATION
REPRODUCIBLEFAILURE
UNREPRODUCIBLEFAILURE
4
Ballista: Test Generation (fine grain testing)Tests developed per data type/subtype; scalable via composition
5
Initial Results: Most APIs Weren’t RobustUnix & Windows systems had poor robustness scores:• 24% to 48% of intentionally exceptional Unix tests yielded non-robust results• Found simple “system killer” programs in Unix, Win 95/98/ME, and WinCE
Even critical systems were far from perfectly robust• Safety critical operating systems• DoD HLA (where their stated goal was 0% robustness failures!)
Developer reactions varied, but were often extreme• Organizations emphasizing field reliability often wanted 100% robustness• Organizations emphasizing development often said
“core dumps are the Right Thing”• Some people didn’t care• Some people sent hate mail
6
Comparing Fifteen POSIX Operating Systems
Normalized Failure Rate
Ballista Robustness Tests for 233 Posix Function Calls
0% 5% 10% 15% 20% 25%
AIX 4.1
QNX 4.22QNX 4.24
SunOS 4.1.3
SunOS 5.5
OSF 1 3.2OSF 1 4.0
1 Catastrophic
2 Catastrophics
Free BSD 2.2.5
Irix 5.3Irix 6.2
Linux 2.0.18
LynxOS 2.4.0NetBSD 1.3
HP-UX 9.05
1 Catastrophic
1 Catastrophic
HP-UX 10.20
Abort Failures Restart Failure
1 Catastrophic
7
Robustness: C Library vs. System Calls
Normalized Failure Rate
Portions of Failure Rates Due To System/C-Library
0% 5% 10% 15% 20% 25%
AIX 4.1
QNX 4.22QNX 4.24
SunOS 4.1.3SunOS 5.5
OSF 1 3.2OSF 1 4.0
Free BSD 2.2.5
Irix 5.3Irix 6.2
Linux 2.0.18LynxOS 2.4.0
NetBSD 1.3
HP-UX 9.05HP-UX 10.20
1 Catastrophic
2 Catastrophics
1 Catastrophic
1 Catastrophic
1 Catastrophic
C LibrarySystem Calls
8
Estimated N-Version Comparison Results
*
*
*
**
*
Normalized Failure Rate by Operating System
Normalized Failure Rate (after analysis)0% 10% 20% 30% 40% 50%
Ope
ratin
gSy
stem
Tes
ted
SunOS 5.5SunOS 4.13
QNX 4.24QNX 4.22OSF-1 4.0OSF-1 3.2
NetBSD Lynx
LinuxIrix 6.2Irix 5.3
HPUX 10.20
HPUX 9.05FreeBSD
AIX
Abort % Silent % Restart %
* Catastrophic
9
Failure Rates by Function Group
10
Even Those Who Cared Didn’t Get It RightOS Vendors didn’t accomplish their stated objectives (e.g.,):• IBM/AIX wanted few Aborts, but had 21% Aborts on POSIX tests• FreeBSD said they would always Abort on exception (that’s the Right Thing)
but had more Silent (unreported) exceptions than AIX!• Vendors who said their results would improve dramatically on the next
release were usually wrong
Safe Fast I/O (SFIO) library• Ballista found that it wasn’t as safe as the authors thought
– Missed: valid file checks; modes vs. permissions; buffer size/accessibility
Do people understand what is going on?• We found four widely held misconceptions that prevented improvement in
code robustness
11
#1: “Ballista will never find anything (important)”1. “Robustness doesn’t matter”
• HP-UX gained a system-killer in the upgrade from Version 9 to 10– In newly re-written memory
management functions…… which had a 100% failure rate under Ballista testing
• So, robustness seems to matter!
2. “The problems you’re looking for are too trivial -- we don’t make those kinds of mistakes”• HLA had a handful of functions that
were very non-robust• SFIO even missed some “easy”
checks• See Unix data to the right…
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
OSF1 3.2
SunOS 5.5SunOS 4.1.3QNX 4.24QNX 4.22OSF1 4.0BNetBSD 1.3LynxOS 2.4.0Linux 2.0.18IRIX 6.2IRIX 5.3HP-UX A.09.05
AIX 4.1
HP-UX B.10.20FreeBSD 2.2.5
Dev/C
lass-Specific
Synchronization
Clocks &
Timers
C Library
Process Prims.
System
Databs
Process Env.
Files & Dirs
I/O P
rimitives
Messaging
Mem
ory Mng.
Scheduling
Rob
ustn
ess
Failu
re R
ate
12
#2: “100% robustness is impractical”The use of a metric – in our case Ballista – allowed us to remove all detectable robustness failures from SFIO and other API subsets• (Our initial SFIO results weren’t entirely zero; but now they are)
Abort Failure Rate for Select Functions
14
57
52
79
36
8
58
50
6.29
5.99 9.
2
2 1
5.01
2 4.5
0 0 0 0 0 0 0 0
0
10
20
30
40
50
60
70
80
90
open write read close fileno seek sfputc sfgetc
Function
% A
bort
Fai
lure
s
STDIOOriginal SFIORobust SFIO
13
Can Even Be Done With “Ordinary” APIMemory & semaphore robustness improved for Linux• Robustness hardening yielded 0% failure rate on standard POSIX calls below
Failure rates for memory/process original Linux calls(All failure rates are 0% after hardening)
16.4
31.3
75.2 75.2
57.8
7.3
35.3
59.564.7
41.2
64.7
0.010.020.030.040.050.060.070.080.0
memch
rmem
cmp
memcp
ymem
move
memse
tse
m_init
sem_d
estro
yse
m_getv
alue
sem_p
ost
sem_tr
ywait
sem_w
ait
Function name
Failu
re R
ate
(%)
Mem. Manipulation Module Process Synch. Module
14
#3: “It will be too slow”Solved via caching validity checks • Completely software-implemented cache for checking validity
• Check validity once, remember result– Invalidate validity check when necessary
VerificationModule
Reference
Lookup
Store
ClearModule
Invalidate
ResultCache Structure in Memory
15
Caching Speeds Up Validity TestsWorst-case of tight loops doing nothing but “mem” calls is still fast• L2 Cache misses would dilute effects of checking overhead further
Slowdown of robust memory functions with tagged malloc
0.95
1
1.05
1.1
1.15
1.2
1.25
16 32 64 128 256 512 1024 2048 4096
Buffer size (Bytes)
Slow
dow
n
memchrmemcpymemcmpmemsetmemmove
16
Future MicroArchitectures Will HelpException & validity check branches are highly predictable• Compiler can structure code to assume validity/no exceptions• Compiler can give hints to branch predictor• Branch predictor will quickly figure out the “valid” path even with no hints• Predicated execution can predicate on “unexceptional” case
Exception checks can execute in parallel with critical path• Superscalar units seem able to execute checks & functions concurrently• Out of order execution lets checks wait for idle cycles
The future brings more speculation; more concurrency• Exception checking is an easy target for these techniques• Robustness is cheap and getting cheaper (if done with a view to architecture)
17
#4: “We Did That On Purpose”Variant: “Nobody could reasonably do better”• Despite the experiences with POSIX, HLA & SFIO, this one persisted• So, we tried an experiment in self-evaluating robustness
Three experienced commercial development teams• Components written in Java• Each team self-rated the robustness of their component per
Maxion’s “CHILDREN” mnemonic-based technique• We then Ballista tested their (pre-report) components for robustness
• Metric: did the teams accurately predict where their robustness vulnerabilities would be?
– They didn’t have to be perfectly robust– They all felt they would understand the robustness tradeoffs they’d made
18
Self Report Results: Teams 1 and 2They were close in their prediction• Didn’t account for some language safety features (divide by zero)• Forgot about, or assumed language would protect them against NULL in A4
Component A and B Robustness
0
5
10
15
20
25
A1 A2 A3 A4 B1 B2 B3 B4
Function
Abor
t Fai
lure
Rat
e (%
)
MeasuredExpected
19
Self Report Data: Team 3Did not predict several failure modes• Probably could benefit from additional training/tools
Component C Total Failure Rate
0
10
20
30
40
50
60
C1 C3 C5 C7 C9
C11C13C15C17C19FC21FC23FC25FC27F
Method Designation
Tota
l Fai
lure
Rat
e (%
)
MeasuredExpected
20
Conclusions: Ballista Project In PerspectiveGeneral testing & wrapping approach for Ballista• Simple tests are effective(!)
– Scalable for both testing and hardening • Robustness tests & wrappers can be abstracted to the data type level
– Single validation fragment per type – i.e. checkSem(), checkFP()…
Wrappers are fast (under 5% penalty) and usually 100% effective• Successful check results can be cached to exploit locality
– Typical case is an index lookup, test and jump for checking cache hit– Typical case can execute nearly “for free” in modern hardware
• After this point, it is time to worry about resource leaks, device drivers, etc.
But, technical solution alone is not sufficient• Case study of self-report data
– Some developers unable to predict code response to exceptions• Training/tools needed to bridge gap
– Even seasoned developers need a QA tool to keep them honest– Stand-alone Ballista tests for Unix under GPL; Windows XP soon
21
Future Research Challenges In The LargeQuantifying “software aging” effects• Simple, methodical tests for resource leaks
– Single-threaded, multi-threaded, distributed all have different issues– One problem is multi-thread contention for non-reentrant resources
» e.g., exception handling data structures without semaphore protection
• Measurement & warning systems for need for SW rejuvenation– Much previous work in predictive models– Can we create an on-line monitor to advise it is time to reboot?
Understanding robustness tradeoffs from developer point of view• Tools to provide predictable tradeoff of effort vs. robustness
– QA techniques to ensure that desired goal is reached– Ability to specify robustness level clearly, even if “perfection” is not desired
• Continued research in enabling ordinary developers to write robust code• Need to address different needs for development vs. deployment
– Developers want heavy-weight notification of unexpected exceptions– In the field, may want a more benign reaction to exceptions
22