Response to Undesired Events in Software Systems
Kimberly Hanks and Phil Varner
A Presentation brought to you by David Parnas
Undesired != Unexpected
• Undesired Events are:
– Deviation from the ideal operation of the system
– Not always correctable like errors
– A fact of life we have to deal with
• “Correct programs”, in the sense that would make UE handling unnecessary, do not exist
UE Handling Overview
• Problem: Even with "correct" programs, UEs at runtime will occur
• To deal with them, we need to:
– Know what to look for
– Successfully diagnose
– Recover if possible
• This paper proposes how
Problems with Perfect World
• Everyone makes errors
• Machines fail, causing programs/data to fail
• Programs change, new errors pop up
• Incorrect or inconsistent data may be supplied
Where Do We Start?
• Parnas assumes systems that are structured according to good information hiding and the “uses” hierarchy
• This shapes all of his proposals
• In particular, UE detection and handling is predicated on a system of levels
Levels of Detection and Handling
• The first clue that something is wrong can appear at a level other than the one where a UE originates
• Example:
– Initiate a read on a storage resource, e.g., a tape block, which turns out to be bad
– Detection occurs at the HW level when the read can’t be completed, even though initiated from some high-level application
Levels…(2)
• Should the HW be responsible for a recovery attempt?
• Parnas says no, but where should it be handled?
– At the originating level
• Why?
Levels…(3)
• The originating level is “where the knowledge is”
– The failed read happens at the HW level, but the HW doesn’t know any useful implications
– The level where the read was initiated sits on a VM that provides certain abstractions to the user
– The UE is only meaningful in the context of these abstractions
Levels…(4)
• What would meaningful handling look like?
– A diagnostic stated in the abstractions of the level
– A provision of an alternative, in the context of those abstractions
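As a rough illustration of the two points above, the sketch below restates a hardware-level read failure in the vocabulary of the level that initiated the read and attaches an alternative. Every name in it (TapeReadError, RecordUnavailable, RecordStore, read_block, the "restore from backup" hint) is hypothetical and chosen for illustration; none of it comes from the paper.

```python
class TapeReadError(Exception):
    """Hypothetical hardware-level UE: a block could not be read."""

    def __init__(self, block):
        super().__init__(f"unreadable block {block}")
        self.block = block


class RecordUnavailable(Exception):
    """The same UE restated in the abstractions of the record store."""

    def __init__(self, record_id, alternative):
        super().__init__(f"record {record_id!r} is unavailable")
        self.record_id = record_id
        self.alternative = alternative  # what the caller may do instead


def read_block(block, bad_blocks):
    """Stand-in for the hardware level: it knows blocks, not records."""
    if block in bad_blocks:
        raise TapeReadError(block)
    return f"data@{block}"


class RecordStore:
    """The level that initiated the read; it maps record ids to blocks."""

    def __init__(self, index, bad_blocks=()):
        self._index = index
        self._bad_blocks = set(bad_blocks)

    def fetch(self, record_id):
        block = self._index[record_id]
        try:
            return read_block(block, self._bad_blocks)
        except TapeReadError as err:
            # Diagnose in this level's vocabulary and offer an alternative;
            # the low-level cause stays attached for debugging.
            raise RecordUnavailable(record_id, "restore from backup") from err


store = RecordStore({"invoice-7": 3, "invoice-8": 4}, bad_blocks=[4])
print(store.fetch("invoice-7"))
try:
    store.fetch("invoice-8")
except RecordUnavailable as ue:
    print(ue, "->", ue.alternative)
```

The caller of fetch never sees block numbers; it sees a record that is unavailable and a suggestion phrased in terms it already understands.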
UE Handlers and Info Hiding
• We want to handle the UE at the level at which it is meaningful, but…
• This doesn’t necessarily mean the information necessary to handle it is available (it may have been abstracted away to effect good information hiding)
• How should we manage this tradeoff?
UE Handlers and Info Hiding (2)
• “Everything should be made as simple as possible, but not simpler.” –Einstein
• Hide all and only information which is not likely to be useful in diagnosing and recovering from UEs
• Prediction is the key
Meta-structure
• The general policy proposed constitutes a meta-structure of UE handling
• It has several advantages
– Handlers don’t violate info-hiding
– “Uses” hierarchy is intact
– Allows refinement without major revision
– Aids debugging
Separation of Case and EH/R
• Separation of Normal Case and Error Handling/Recovery
• Java: try{} catch{}
Suggestion 1
• Assign each “abstract machine” responsibility for detecting attempts to violate its own specification
– trap metaphor - hide detection mechanism, expose interface for handling UEs
– should be able to handle errors in context of VM abstraction
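One way to picture the trap metaphor is a module that keeps its detection logic private and exposes only a slot where the user installs a handler. The sketch below assumes a toy abstract machine, BoundedCounter, whose specification says its value never exceeds a limit; the class, the on_violation parameter and the handler are all hypothetical.

```python
class BoundedCounter:
    """Hypothetical abstract machine: a counter specified to stay in [0, limit]."""

    def __init__(self, limit, on_violation=None):
        self._limit = limit
        self._value = 0
        # The exposed part of the "trap": a caller-supplied handler.
        self._on_violation = on_violation or self._default_trap

    @staticmethod
    def _default_trap(operation, value):
        raise ValueError(f"{operation} would violate the specification at {value}")

    def increment(self):
        # Detection is hidden inside the machine; callers never repeat this check.
        if self._value >= self._limit:
            return self._on_violation("increment", self._value)
        self._value += 1
        return self._value

    @property
    def value(self):
        return self._value


events = []


def log_and_saturate(operation, value):
    """User-level handler: record the UE and keep the counter at its limit."""
    events.append((operation, value))
    return value


counter = BoundedCounter(2, on_violation=log_and_saturate)
results = [counter.increment() for _ in range(3)]
print(results, events)
```

The handler is written in terms of the counter's own abstraction (an operation name and a value), which is the context the slide asks for.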
Degrees of Recovery
• Hardware - handle or crash
• Instead, design for multi-level recovery
• Policy determined by cost and aim
No Recovery
local
    attempts: INTEGER
do
    if attempts < Max_attempts then
        last_character := low_level_read_function (f)
    else
        failed := True
    end
end
Simple Recovery
local
    attempts: INTEGER
do
    if attempts < Max_attempts then
        last_character := low_level_read_function (f)
    else
        failed := True
    end
rescue
    attempts := attempts + 1
    retry
end
Degrees of Recovery
local
    attempts: INTEGER
do
    if attempts < Max_attempts then
        last_character := low_level_read_function (f)
    else
        failed := True
    end
rescue
    attempts := attempts + 1
    if attempts = 1 then
        retry
    elseif attempts = 2 then
        sleep (2)
        retry
    elseif attempts = 3 then
        flush_buffers
        sleep (4)
        retry
    else
        -- No further retry: the routine fails and the exception reaches the caller.
    end
end
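For readers more used to try/except than rescue/retry, the same escalation can be sketched in Python. This is a loose translation under stated assumptions, not the presenters' code: read_with_recovery, ReadFailed and the injectable sleep parameter are hypothetical, and OSError stands in for whatever the low-level read raises.

```python
import time


class ReadFailed(Exception):
    """Raised once every recovery step has been tried without success."""


def read_with_recovery(low_level_read, flush_buffers, sleep=time.sleep):
    """Retry a read, escalating the recovery effort on each failure."""
    attempts = 0
    while True:
        try:
            return low_level_read()          # the normal case
        except OSError as err:               # assumed failure type
            attempts += 1
            if attempts == 1:
                pass                         # cheapest recovery: retry at once
            elif attempts == 2:
                sleep(2)                     # wait, then retry
            elif attempts == 3:
                flush_buffers()              # more expensive recovery
                sleep(4)
            else:
                # Out of recovery options: report the UE to the caller.
                raise ReadFailed(f"gave up after {attempts} attempts") from err


# Usage: a read that fails three times, then succeeds. The sleep function
# is replaced by a recorder so the example runs instantly.
outcomes = iter([False, False, False, True])


def flaky_read():
    if not next(outcomes):
        raise OSError("device busy")
    return "x"


pauses = []
result = read_with_recovery(flaky_read, flush_buffers=lambda: None, sleep=pauses.append)
print(result, pauses)
```

The normal case stays on a single line inside try, and the whole recovery policy sits in the except branch, so changing the policy does not touch the read itself.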
Suggestion 2
• Do not specify a module to have properties which UEs will frequently violate
• Examples:
– don’t use limited-capacity data structures when the # of objects is unknown, etc.
– don’t allow the possibility of, e.g., calling pop() on an empty stack
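A minimal sketch of both examples, assuming a hypothetical Stack class: the storage grows on demand, so there is no capacity to exceed, and pop() is specified for every state, so calling it on an empty stack is not a violation.

```python
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")


class Stack(Generic[T]):
    """Hypothetical stack whose specification leaves little room for UEs."""

    def __init__(self) -> None:
        self._items: List[T] = []   # a Python list grows as needed

    def push(self, item: T) -> None:
        self._items.append(item)

    def pop(self) -> Optional[T]:
        """Defined for every state: returns None when the stack is empty."""
        return self._items.pop() if self._items else None

    def __len__(self) -> int:
        return len(self._items)


s: Stack[int] = Stack()
first = s.pop()          # empty stack: no error, just None
s.push(1)
s.push(2)
second = s.pop()
print(first, second, len(s))
```

The tradeoff is that None becomes ambiguous if callers push None themselves; a design that needs to store None would have to pick a different "empty" signal.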
Error Indication
• Strongly "typed" errors - Java
• Limitations on values of parameters (Eiffel)
• Capacity limitations
• Requests for undefined information
• Restrictions on the order of operations (encapsulation?)
• Detection of actions which are likely to be unintentional (defined how?)
Error Indication II
• Sufficiency - ensure your module will work correctly or complain
• Priority of Traps - multiple error handling?
• Size of Trap Vector - how many commands in one trap try{}catch{}
• State after Trap - Atomicity
• Errors of Mechanism - tradeoff between simplicity and detail
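The "State after Trap" point can be illustrated with an operation that either completes fully or leaves its data untouched. The sketch below is one simple way to get that effect in single-threaded code, by working on a copy and committing only at the end; apply_all and the balances example are hypothetical.

```python
def apply_all(balances, changes):
    """Apply every (name, delta) change, or none of them if any is invalid."""
    draft = dict(balances)                     # work on a private copy
    for name, delta in changes:
        if name not in draft:
            raise KeyError(name)
        if draft[name] + delta < 0:
            raise ValueError(f"{name} would go negative")
        draft[name] += delta
    balances.update(draft)                     # commit only after all checks pass


accounts = {"a": 10, "b": 0}
apply_all(accounts, [("a", -4), ("b", 4)])     # succeeds: both changes applied

try:
    apply_all(accounts, [("a", -1), ("b", -100)])
except ValueError:
    pass                                       # trap raised: accounts is unchanged

print(accounts)
```

Because the state after a trap is the state before the call, a handler at a higher level can retry or choose an alternative without first repairing half-finished work.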
Redundancy and Efficiency
• Redundant error checks slow the system
• Can often be removed in later versions
– Retain upper level, remove lower
– Retain lower, remove upper
• Which is best, and why?