Acceptability-Oriented Computing
Martin Rinard
Laboratory for Computer Science
Massachusetts Institute of Technology
Acceptability View
Figure: the execution space, with the correct execution inside an acceptability envelope of acceptable executions; an unacceptable execution lies outside the envelope, while a resilient computing execution is repaired to stay within it.
Questions
• How to identify acceptability envelope?
  • Set of acceptability properties
  • Basic properties that any execution must satisfy to be acceptable
• How to ensure program stays within envelope?
  • Acceptability monitoring
  • Acceptability enforcement
Figure: acceptability monitoring and acceptability enforcement keep the resilient computing execution (a repaired execution) within the acceptability envelope around the correct execution in the execution space.
Proposed Structure
Figure: the core system, with inputs passing through an input filter and outputs passing through an output filter and output rectification; data structure repair (probe and repair), control transfer, exception recovery, and response enforcement mechanisms surround the core.
Monitoring and Enforcement Mechanisms
• Black Box
  • Do not affect core
  • Input/output filters and correlators
• White Box – Insert new code and data into core
• Gray Box
  • No change to core program
  • Can change data structures and control flow
  • Mechanisms
    • Procedure call and system call interception
    • Ptrace interface, mmap to access address space
Reason for Acceptability-Oriented Computing: Difficulty of Delivering Perfect Software
• Difficulty in all areas of development effort
  • Understanding domain, obtaining requirements
  • Producing specification, developing software
• Change Aspiration of Development Process
  • Accept inevitability of imperfection
  • Goal is to deliver acceptable program
• Augment Development Activities
  • Identify crucial acceptability properties
  • Ensure that program does not violate them
Aspiring to Perfection Recognized as Harmful
• Defocuses development effort
  • All parts seen as equally important
  • No formal way to direct development effort to most important parts of code
• Produces brittle structure
  • Each piece of functionality implemented
    • Once (no redundancy)
    • Completely (hard and easy parts together)
  • No recovery or protection mechanisms
  • Program completely vulnerable to any error
Advantages of Acceptability-Oriented Computing
• Focused, prioritized development effort
  • Appropriately direct engineering activities
  • Ensure satisfaction of acceptability properties
• Resilient software structure
  • Redundant acceptability property enforcement
  • Mechanisms enforce partial properties
    • Simpler (easier to obtain acceptability) than complete modules in core software
  • Resulting software structure tolerates errors
Ideal Result
• Can build systems with less development effort
  • Can reduce testing effort for core
  • Can leave (infrequent) errors in system
• Can build systems with more functionality
  • Can invest saved development effort in increasing functionality of system
  • Can make larger system stable
  • Can use more aggressive, riskier algorithms
Map Example
Figure: the inputs put x 10, put y 12, get y, rem z, put z 11 flow into the Map Core, which produces one output per command.
Acceptability Property: Output must be within min and max inputs
Unacceptable Output
Figure: for the inputs put x 10, put y 11, put x 12, rem x, rem y, get x, the Map Core answers get x with 2, an unacceptable output because it lies outside the range of input values.
Input/Output Correlation
Figure: an input monitor passes the inputs to an input/output correlator, which records the minimum (10) and maximum (12) input values; an output filter checks each Map Core output against that range and catches the unacceptable output 2.
First Option: Shut Down System
Figure: on seeing the unacceptable output 2, the output filter shuts down the system.
Second Option: Return Error Code
Figure: the output filter replaces the unacceptable output 2 with the error code 0.
Third Option: Return Min or Max Value
Figure: the output filter replaces the unacceptable output 2 with the minimum input value, 10.
When to Use Each Option
• Shut down system (safe exit) when
  • It is safe and acceptable
  • External intervention is available
• Return error code (delegation) when
  • Client is able to deal with error code
• Return min or max (resilient computing) when
  • Not safe to shut down system
  • No external intervention available
  • Client not prepared to deal with error code
All options use a black box mechanism
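The correlator and the three response options can be sketched in C as a minimal illustration, assuming each put carries an integer value and that the filter sees every core output. The names correlator, observe_input, filter_output, the policy values, and ERROR_CODE are hypothetical; shutting down is modeled as a halt flag rather than an actual exit.

```c
/* Hypothetical response policies, one per option on the slides. */
enum policy { SHUT_DOWN, RETURN_ERROR_CODE, RETURN_MIN_OR_MAX };

#define ERROR_CODE 0 /* assumed error code, as in the second option */

/* Input/output correlator: remembers the range of input values seen. */
struct correlator {
    int seen; /* number of input values observed so far */
    int min;
    int max;
};

/* Input monitor hook: called with the value of every put command. */
void observe_input(struct correlator *c, int value) {
    if (c->seen == 0 || value < c->min) c->min = value;
    if (c->seen == 0 || value > c->max) c->max = value;
    c->seen++;
}

/* What the output filter passes on. */
struct filtered {
    int value; /* output delivered to the client */
    int halt;  /* nonzero: shut down and await external intervention */
};

/* Output filter enforcing "output must be within min and max inputs". */
struct filtered filter_output(const struct correlator *c, enum policy p, int out) {
    struct filtered r = { out, 0 };
    if (c->seen == 0 || (out >= c->min && out <= c->max))
        return r; /* acceptable (or nothing to compare against): pass through */
    switch (p) {
    case SHUT_DOWN:         /* safe exit */
        r.halt = 1;
        break;
    case RETURN_ERROR_CODE: /* delegation */
        r.value = ERROR_CODE;
        break;
    case RETURN_MIN_OR_MAX: /* resilient computing: clamp into range */
        r.value = out < c->min ? c->min : c->max;
        break;
    }
    return r;
}
```

With the inputs 10, 11, 12 observed, a core output of 2 yields halt under SHUT_DOWN, 0 under RETURN_ERROR_CODE, and 10 under RETURN_MIN_OR_MAX. The filter watches only inputs and outputs, so the core itself is unchanged.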
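Implementation Approach
Figure: the map is implemented as a hash table and a free list of entries, each holding a name and a value (a 1, e 7, b 3, d 4, h 10, i 11).
Acceptability Property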
Each entry has exactly one incoming reference
• From table, table entry, or free list
• Implies no cycles in table or free list
• Implies disjointness of table and free list
Checking for Acceptability Violations
• Auxiliary reference count for each entry
• Traverse data structures to compute counts
• Check that no count greater than one
• Complications
  • Invalid pointers (addressing violations)
  • Out of bounds array indices (more addressing violations)
  • Cycles (infinite traversal loops)
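A sketch of the reference-count check, assuming the index-based layout suggested by int table[M]; int freelist; on the put slide: bins and entries hold integer indices, and free entries reuse their value field as the free-list link, as free(e) does. The sizes N and M, NONE, and the function names are assumptions. Like the slide, it flags any count greater than one; it handles the complications by validating each index and stopping a traversal at the first repeated entry.

```c
#include <stdbool.h>

#define N 8       /* number of entries (assumed size) */
#define M 4       /* number of hash bins (assumed size) */
#define NONE (-1) /* null index */

/* Index-based entries; a free entry reuses value as its free-list link. */
struct entry {
    int value; /* map value, or next free-list index for a free entry */
    int next;  /* next entry in the same hash bin */
};

struct map {
    struct entry entries[N];
    int table[M];
    int freelist;
};

static bool valid(int i) { return i >= 0 && i < N; }

/* Walk one chain, adding one to the auxiliary count of each entry reached.
   Stops at the first invalid index or at the first entry counted twice,
   so a cycle cannot cause an infinite traversal. Returns 1 on a violation. */
static int count_chain(const struct map *m, int head, bool free_chain, int counts[N]) {
    for (int i = head; i != NONE;
         i = free_chain ? m->entries[i].value : m->entries[i].next) {
        if (!valid(i)) return 1;       /* addressing violation */
        if (++counts[i] > 1) return 1; /* more than one incoming reference */
    }
    return 0;
}

/* Check that no entry is referenced more than once from the table,
   a table entry, or the free list. Returns the number of violations. */
int check_map(const struct map *m) {
    int counts[N] = {0};
    int violations = 0;
    for (int b = 0; b < M; b++)
        violations += count_chain(m, m->table[b], false, counts);
    violations += count_chain(m, m->freelist, true, counts);
    return violations;
}
```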
Mechanisms for Accessing Data Structures
• White Box
  • Link monitor and checking code into core
  • Possibility of core corrupting checker (and vice versa!)
• Gray Box
  • Checker uses ptrace interface (or mmap)
  • More cumbersome to access data structures
  • But checker isolated from core
Inconsistency Responses
• Fail stop – halt program, await intervention
  • Feasible when halting acceptable
  • And intervention practical
  • May actually decrease reliability
• Delegation – return error code to client
  • Feasible when client can deal with error
• Resilient computing – fix inconsistency, continue
  • Enables continued (acceptable) execution
  • Hides effect of inconsistency from clients
Code for Put Procedure in Map Example
    int table[M];             /* hash table */
    int freelist;             /* free list */

    put(n, v) {
        e = alloc();          /* allocate and initialize new hash table entry */
        value(e) = v;         /*   (alloc does not check for empty free list) */
        strcpy(name(e), n);
        p = find(n);          /* free old entry with same name */
        if (p != NOENTRY)     /*   (free leaves entry in table) */
            free(p);
        b = bin(n);           /* insert new entry into hash table */
        next(e) = table[b];   /*   (creates cycle if entry already in table) */
        table[b] = e;
        return(v);
    }

    free(e) {                 /* insert entry into free list */
        value(e) = freelist;
        freelist = e;
    }
Problem
Program crashes if free list is empty when put is called
New Acceptability Property
Free list is not empty
Acceptability Enforcement
Repair algorithm ensures free list is not empty
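Data Structure Repair Goal
Figure: before repair, the Map Core's data structures contain invalid references, a cycle, and an empty free list; after repair, all references are valid, there are no cycles, and there are entries in the free list.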
Enforcing Consistency
• Hand-coded consistency algorithm
• Coding is difficult because must assume data structures can be arbitrarily corrupted
  • Invalid references, out of bounds indices
  • Cycles (can cause infinite loops in repair code)
• Two data structure traversals
  • First eliminates invalid references and indices
  • Second removes all but first reference to each entry (requires auxiliary marking data structure)
• Reconstruct free list
  • Any unreferenced entry put into list
  • If free list still empty, steal entry from table
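The repair algorithm might look like this under the same assumed index-based layout (N, M, NONE, and repair_map are hypothetical). One deviation from the slide: the first step scans the table and entry arrays directly instead of following links, which removes the same invalid indices without risking an infinite loop on a cycle.

```c
#include <stdbool.h>

#define N 8       /* number of entries (assumed size) */
#define M 4       /* number of hash bins (assumed size) */
#define NONE (-1) /* null index */

struct entry {
    int value; /* map value, or next free-list index for a free entry */
    int next;  /* next entry in the same hash bin */
};

struct map {
    struct entry entries[N];
    int table[M];
    int freelist;
};

static bool valid(int i) { return i >= 0 && i < N; }

/* Repair so that all references are valid, there are no cycles, each entry
   has at most one incoming reference, and the free list is not empty.
   Returns the number of link updates made (a stand-in for a repair log). */
int repair_map(struct map *m) {
    int updates = 0;
    bool marked[N] = {false}; /* auxiliary marking data structure */

    /* Step 1: eliminate invalid references and indices. */
    for (int b = 0; b < M; b++)
        if (m->table[b] != NONE && !valid(m->table[b])) { m->table[b] = NONE; updates++; }
    for (int i = 0; i < N; i++)
        if (m->entries[i].next != NONE && !valid(m->entries[i].next)) { m->entries[i].next = NONE; updates++; }

    /* Step 2: traverse the table, keeping only the first reference to each
       entry; cutting a link to an already-marked entry also breaks cycles. */
    for (int b = 0; b < M; b++) {
        int i = m->table[b];
        if (i == NONE) continue;
        if (marked[i]) { m->table[b] = NONE; updates++; continue; }
        marked[i] = true;
        while (m->entries[i].next != NONE) {
            int n = m->entries[i].next;
            if (marked[n]) { m->entries[i].next = NONE; updates++; break; }
            marked[n] = true;
            i = n;
        }
    }

    /* Step 3: reconstruct the free list from every unreferenced entry. */
    m->freelist = NONE;
    for (int i = N - 1; i >= 0; i--)
        if (!marked[i]) { m->entries[i].value = m->freelist; m->freelist = i; }

    /* Step 4: if the free list is still empty, steal an entry from the table. */
    if (m->freelist == NONE) {
        for (int b = 0; b < M; b++) {
            int i = m->table[b];
            if (i != NONE) {
                m->table[b] = m->entries[i].next; /* unlink first entry in bin */
                m->entries[i].value = NONE;
                m->freelist = i;
                updates++;
                break;
            }
        }
    }
    return updates;
}
```

Stealing an entry drops one binding from the map: a suboptimal but acceptable outcome in place of a crash in put.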
Issues
• Replace failure with potentially suboptimal (but still acceptable) execution
• Checking overhead
  • Depends on properties and application
  • Subject to optimization
• Obscured errors
  • Record violations and updates in logs
  • Use logs to reconstruct actions
• Potential errors in checking and repair code
  • Acceptability enforcement code deals with simpler properties than core
  • Should be simpler and easier to get correct
Generalizations
• Process structure consistency
  • System structured as collection of processes
  • Monitor and regenerate processes to preserve consistency properties
• System configuration consistency
  • Difficult to get configuration settings correct
  • Monitor and update to satisfy properties
  • Properties may depend on running applications, attached devices, etc.
• Both involve structural properties
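For process structure consistency, a minimal POSIX sketch of monitoring and regenerating a process: the supervisor forks a worker, waits for it, and restarts it if it crashes or exits with a failure status. The consistency property ("the worker completes normally"), the retry bound, and flaky_worker are assumptions for illustration.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* A worker runs in its own process and returns an exit status. */
typedef int (*worker_fn)(int attempt);

/* Example worker (hypothetical): fails on its first two runs, then succeeds. */
int flaky_worker(int attempt) { return attempt < 3 ? 1 : 0; }

/* Monitor a worker process and regenerate it whenever it terminates
   abnormally, up to max_attempts times. Returns the attempt on which the
   worker completed normally, or -1 if it never did. */
int supervise(worker_fn work, int max_attempts) {
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        pid_t pid = fork();
        if (pid < 0) return -1;             /* could not create a process */
        if (pid == 0) _exit(work(attempt)); /* child: run the worker */
        int status;
        if (waitpid(pid, &status, 0) < 0) return -1;
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            return attempt;                 /* consistency property holds */
        /* Crash or failure status: regenerate the worker on the next pass. */
    }
    return -1;
}
```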
Next Problem: Buffer Overrun
In put, strcpy(name(e), n) copies the name into the entry's fixed-size name field without checking the length of n.
Long Inputs Crash Core
Figure: the inputs put x 10, put y 11, put xxxxxxxxxxx 12, rem x, rem y, get xxxxxxxxxxx flow into the Map Core; the long name overruns the name buffer and the core crashes.
Truncating Input Filter
Figure: a truncating input filter shortens the long name to xxx, so the Map Core receives put x 10, put y 11, put xxx 12, rem x, rem y, get xxx and no longer crashes.
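A truncating input filter is a black box mechanism that sits in front of the core. This sketch assumes commands of the form "op name [value]" separated by single spaces and a name limit of three characters to match the figure; truncate_command and MAX_NAME are hypothetical, and a real filter would use the actual size of the core's name field.

```c
#include <stdio.h>
#include <string.h>

#define MAX_NAME 3 /* assumed capacity of the core's name field; 3 matches the figure */

/* Truncating input filter for commands of the form "<op> <name>[ <value>]".
   Copies in to out, shortening the name to at most MAX_NAME characters so
   that the core's strcpy(name(e), n) cannot overrun the name buffer. */
void truncate_command(const char *in, char *out, size_t outsize) {
    const char *name = strchr(in, ' ');
    if (name == NULL) {                /* no name argument: pass through */
        snprintf(out, outsize, "%s", in);
        return;
    }
    name++;                            /* skip the separating space */
    size_t len = strcspn(name, " ");   /* length of the name token */
    int op_len = (int)(name - in);     /* "put " including the space */
    int keep = (int)(len < MAX_NAME ? len : MAX_NAME);
    /* op, truncated name, then the rest of the command (" 12" or "") */
    snprintf(out, outsize, "%.*s%.*s%s", op_len, in, keep, name, name + len);
}
```

Distinct long names that share a prefix collapse to the same truncated name, so the core may operate on a different binding than the client intended; the filter trades exactness for continued execution.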
Classification of Techniques
• Acceptability properties can involve
  • Inputs, outputs, state, behavior, timing
  • In any combination
• Examples
  • Use data structures to filter outputs
  • Use inputs to repair data structures
  • Process structure and configuration consistency
  • Timing constraints
    • Input arrivals and triggered program actions
    • Frequency of output events
Maintenance Commands fsck(1M)
NAME fsck - check and repair file systems
SYNOPSIS fsck [ -F FSType ] [ -m ] [ -V ] [ special ... ]
fsck [ -F FSType ] [ -n | N | y | Y ] [ -V ] [ -o FSType-specific-options ] [ special ... ]
DESCRIPTION fsck audits and interactively repairs inconsistent file system conditions. If the file system is inconsistent the default action for each correction is to wait for the user to respond yes or no. If the user does not have write permission fsck defaults to a no action. Some corrective actions will result in loss of data. The amount and severity of data loss may be determined from the diagnostic output.
Ray Tracing Graphics Computations
Figure: shoot rays into the scene and compute how they interact with triangles, using each triangle's normal vector.
Figure: for a degenerate triangle (collinear vertices), the normal vector computation fails.
Acceptability-Oriented Approach
• Do not code up all degenerate cases
• Failed computation generates a signal
• Catch signal
  • Generate some likely value
  • Continue with that value
• Result
  • Several pixels are incorrect
  • But picture as a whole looks fine
  • Program simpler and works faster
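A sketch of the same idea for the normal computation. It swaps the signal handler for an explicit check: the cross product of two edges is the zero vector for collinear vertices, and the code substitutes a caller-supplied likely value (for example, a neighboring triangle's normal) instead of handling each degenerate case. vec3, triangle_normal, and the tolerance are assumptions.

```c
/* 3-D vector; the helper names here are illustrative. */
struct vec3 { double x, y, z; };

static struct vec3 sub(struct vec3 a, struct vec3 b) {
    return (struct vec3){ a.x - b.x, a.y - b.y, a.z - b.z };
}

static struct vec3 cross(struct vec3 a, struct vec3 b) {
    return (struct vec3){ a.y * b.z - a.z * b.y,
                          a.z * b.x - a.x * b.z,
                          a.x * b.y - a.y * b.x };
}

/* Normal vector of triangle (a, b, c), not normalized. For collinear
   vertices the cross product is the zero vector, so no direction exists
   and normalizing it would divide by zero. Treat that as a failed
   computation: return the likely value and let rendering continue. */
struct vec3 triangle_normal(struct vec3 a, struct vec3 b, struct vec3 c,
                            struct vec3 likely) {
    struct vec3 n = cross(sub(b, a), sub(c, a));
    double len2 = n.x * n.x + n.y * n.y + n.z * n.z;
    if (!(len2 > 1e-24)) /* also true if len2 is NaN */
        return likely;
    return n;
}
```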
Hardware Interlocks
Interlocks prevent
• Unsafe entry into enclosure while the bank is energized or not grounded
• Unsafe operation of air-disconnect while vacuum switches are closed
• Unsafe operation of ground switch(es) while air-disconnect is closed
Common Theme
Presence of acceptability-oriented features reduces need for perfection
• Safety-critical systems
• Persistent data
• Can have more ambitious core
  • More functionality
  • More aggressive, riskier algorithms
  • Can tolerate algorithms with known errors
• Less development effort
  • Less testing and certification
  • Can leave infrequent errors in system
Two Kinds of Acceptability-Oriented Computing
• Opportunistic acceptability-oriented computing
  • Observe acceptability problem
  • Develop acceptability enforcement mechanism specifically for that problem
• Systematic acceptability-oriented computing
  • Identify acceptability properties during requirements analysis and design
  • Integrate acceptability features into design
  • Implement acceptability enforcement mechanisms as normal development activity
Changes to Development Activities
• Requirements
• Specification
• Design
• Implementation
• Testing
• Deployment
• Maintenance
Problem With Standard Approaches: Aspiration of Perfection
• Flat set of requirements
• Specification expected to perfectly capture requirements
• Implementation goal: produce flawless implementation
• Testing goal: eliminate all implementation errors
• No attempt to
  • Focus on important properties
  • Build resilient system
With Acceptability-Oriented Computing
• Requirements
  • Prioritize requirements: separate what really matters for the system from what would be nice to have
  • Foundation of acceptability-oriented computing; provides basis for acceptability properties
• Specification
  • Translate prioritized requirements into acceptability properties
  • External properties: inputs and outputs
• Design
  • Identify internal acceptability properties: data structures, implementation
  • How to integrate acceptability property enforcement: how to monitor execution, how to intervene, black/gray/white box
• Implementation
  • Implement and integrate acceptability enforcement mechanisms
• Testing
  • Acceptability enforcement code helps discover and localize errors
  • Develop, deploy new acceptability properties as necessary
• Deployment
  • Turn on appropriate resilient computing mechanisms
  • Helps system execute acceptably with minimal external intervention
• Maintenance
  • Develop, deploy new acceptability properties as necessary
Potential Adoption Paths
• Adopt incrementally
  • Start with specific activity
  • Or selected part of system
  • Can stop short of complete adoption if it makes sense
• Adopt in parallel
  • Small acceptability team
  • Most developers oblivious
• Can orient entire development process around acceptability
Consequences of Systematic Acceptability-Oriented Computing
• Better (more acceptable) software
  • Improved understanding of requirements
  • Inevitable errors placed to minimize harm
  • Resilient systems that recover from errors
• Better documentation
  • Acceptability properties document what is important about the system
  • Acceptability enforcement mechanisms ensure that they accurately reflect implementation
More Consequences
• Reduced development and maintenance costs
  • Prioritized engineering effort
  • More aggressive software reuse
  • Reduced testing costs
    • Acceptability properties help tester
    • Simpler testing for acceptability enforcers
  • Can leave infrequent errors in system
Why Don’t PCs Have Memory With Parity?
Manufacturer’s Perspective
• With parity: memory error occurs, PC flags it and stops, consumer blames manufacturer
• No parity: memory error occurs, PC oblivious, keeps going; if system crashes, consumer blames Microsoft
No Incentive to Increase Parts Cost to Get “Benefit” of Parity
Consumer’s Perspective
• With parity: memory error occurs, PC flags it and stops, have to reboot
• No parity: memory error occurs, PC oblivious, keeps going; system may never crash (at least, not because of parity error); if it does crash, no big surprise
Lack of Parity Increases Reliability! Because It Makes PC Oblivious to Failure
Why Will Java Program Fail?
• a[i] = x; x = a[i]; (out of bounds array access)
• o.f = x; x = o.f; (null pointer dereference)
Standard Response: Throw an exception and terminate the program
Resilient Computing Response: Ignore error and keep executing
• Writes (a[i] = x; o.f = x;): discard value
• Reads (x = a[i]; x = o.f;): use manufactured value
Can Extend the Approach to C
• When program attempts to access illegal address
  • Discard value (writes)
  • Use manufactured value (reads)
  • Program keeps executing
• Improved version uses a Safe C compiler
  • Catches pointer and array bounds errors
  • Replace exception handler to
    • Discard value (writes)
    • Use manufactured value (reads)
    • Program keeps executing
• Improvement reduces data structure damage
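A sketch of the failure-oblivious read and write rules, assuming an array that carries its own length as a stand-in for the bounds information a Safe C compiler would track. checked_array, fo_read, fo_write, the counters, and the manufactured value 0 are all hypothetical choices.

```c
#include <stddef.h>

/* Array that carries its length, standing in for the bounds information
   a Safe C compiler would track (assumption). */
struct checked_array {
    int *data;
    size_t len;
};

/* Counts of ignored errors, e.g. for logging. */
static size_t discarded_writes;
static size_t manufactured_reads;

/* Failure-oblivious write: an out-of-bounds write is discarded. */
void fo_write(struct checked_array a, size_t i, int v) {
    if (i < a.len)
        a.data[i] = v;
    else
        discarded_writes++;
}

/* Failure-oblivious read: an out-of-bounds read gets a manufactured value. */
int fo_read(struct checked_array a, size_t i) {
    if (i < a.len)
        return a.data[i];
    manufactured_reads++;
    return 0; /* manufactured value (arbitrary choice for this sketch) */
}
```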
Why Continued Execution is so Valuable
• Systems often consist of
  • Multiple components
  • Each provides important functionality
• Artificial coupling between components
  • Components need flow of control to deliver their functionality
  • Any error in any component can deny flow of control to all others
• Continued execution enables control to continue to flow to each component
Furthermore
• Even within a component, error may not cause unacceptable execution
• Or cause of error may eventually be flushed
• Moral of the story
  • 90% of life is just showing up
  • Keep program showing up
More Ways to Ensure Continued Execution
• Eliminate special-case code
  • Poorly tested, likely to contain errors
  • Not as important as common-case code
• Locate code that causes errors and remove it (garbage collection is an instance of this idea)
Complication: Infinite Loops
• Failure-oblivious techniques can make a computation immortal
• Need a way to identify, then kill, useless or misguided computations
  • Bound loop iterations
  • Randomize branch and jump targets
  • Speculatively parallelize computation
• Lack of good mortality units
  • Can attempt to leverage existing structure: threads, transactions, components, …
  • New construct to express mortality units
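Bounding loop iterations is the simplest of these options. This sketch, with a hypothetical bounded_find over a linked list, gives the loop an iteration budget; if a corrupted next pointer forms a cycle, the search is killed when the budget runs out instead of running forever. The budget is an assumed parameter.

```c
#include <stdbool.h>
#include <stddef.h>

struct node {
    int key;
    struct node *next;
};

/* Search a list for key, giving the loop a budget of max_iters iterations.
   Returns true and sets *found when the key is located. Sets *killed when
   the budget runs out, e.g. because a corrupted next pointer formed a cycle;
   the caller can then continue with "not found". */
bool bounded_find(const struct node *head, int key, size_t max_iters,
                  const struct node **found, bool *killed) {
    size_t iters = 0;
    *killed = false;
    for (const struct node *n = head; n != NULL; n = n->next) {
        if (++iters > max_iters) {
            *killed = true; /* kill the useless computation */
            return false;
        }
        if (n->key == key) {
            *found = n;
            return true;
        }
    }
    return false;
}
```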
Acceptability-Oriented Computing is a Perspective
Figure: techniques placed along a spectrum from conservative to aggressive: data structure repair, data structure consistency checks, failure-oblivious computing, application-specific error recovery, input and output filtering, hardware interlocks, code excision, limp home modes, redundant computing, development process changes.
Key Ideas
• Reject aspiration of perfection
  • Focus on acceptability
• Acceptability properties identify acceptability envelope
• Acceptability enforcement mechanisms keep system within acceptability envelope
• Opportunistic vs. systematic approaches
• Ideal result
  • More resilient systems
  • Less development and testing effort
Example Techniques
• Filter out unacceptable inputs
  • Truncate strings to eliminate buffer overruns
  • Clamp numeric values within range
• Use data structures to filter inputs and outputs
• Use inputs to repair data structures
• Process structure and configuration consistency
• Continued execution as acceptability property
  • Failure-oblivious computing
  • Code and input variation
Aesthetics
• AOC is about how to get along in a world without perfection
• It is one thing to accept perfection as unattainable
• Another thing to view aspiration for perfection as counterproductive
• Examples from art that informed thinking
  • Bach (Little Fugue in G minor) vs. Mahler 2
    • Scale an important differentiator
  • Picasso (perfect picture at 19, cubism)
  • Michelangelo (David, unfinished)