Response to Undesired Events in Software Systems

Kimberly Hanks and Phil Varner

A presentation brought to you by David Parnas

Undesired != Unexpected

• Undesired Events are:
  – Deviation from the ideal operation of the system
  – Not always correctable like errors
  – A fact of life we have to deal with

• “Correct programs”, in the sense that would make UE handling unnecessary, do not exist

UE Handling Overview

• Problem: Even with "correct" programs, UEs at runtime will occur

• To deal with them, we need to:
  – Know what to look for
  – Successfully diagnose
  – Recover if possible

• This paper proposes how

Problems with Perfect World

• Everyone makes errors

• Machines fail, causing programs/data to fail

• Programs change, new errors pop up

• Incorrect or inconsistent data may be supplied

Where Do We Start?

• Parnas assumes systems that are structured according to good information hiding and the “uses” hierarchy

• This shapes all of his proposals

• In particular, UE detection and handling is predicated on a system of levels

Levels of Detection and Handling

• The first clue that something is wrong can appear at a level other than the one where a UE originates

• Example:
  – Initiate a read on a storage resource, e.g., a tape block, which turns out to be bad
  – Detection occurs at the HW level when the read can’t be completed, even though initiated from some high-level application

Levels…(2)

• Should the HW be responsible for a recovery attempt?

• Parnas says no. But where should it be handled?
  – At the originating level

• Why?

Levels…(3)

• The originating level is “where the knowledge is”
  – The failed read happens at the HW level, but the HW doesn’t know any useful implications
  – The level where the read was initiated sits on a VM that provides certain abstractions to the user
  – The UE is only meaningful in the context of these abstractions

Levels…(4)

• What would meaningful handling look like?
  – A diagnostic stated in the abstractions of the level
  – A provision of an alternative, in the context of those abstractions
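The tape example can be sketched in code. The following is a minimal illustration, not the paper's own design: the names `TapeReadError`, `RecordUnavailable`, `read_block`, and `fetch_record` are all hypothetical. The hardware level only detects the failure; the level that initiated the read restates it in its own abstraction (records) and offers an alternative where it has one.

```python
class TapeReadError(Exception):
    """Low-level UE: the hardware layer could not read a block (hypothetical)."""

    def __init__(self, block):
        super().__init__(f"unreadable block {block}")
        self.block = block


class RecordUnavailable(Exception):
    """The same UE restated in the record-store abstraction (hypothetical)."""


def read_block(block, bad_blocks):
    # Hardware level: detects the failure but knows nothing about records.
    if block in bad_blocks:
        raise TapeReadError(block)
    return f"data@{block}"


def fetch_record(name, index, bad_blocks, backup=None):
    # Originating level: knows which record the block belongs to, so it can
    # state the diagnostic and offer an alternative in its own terms.
    try:
        return read_block(index[name], bad_blocks)
    except TapeReadError as err:
        if backup is not None and name in backup:
            return backup[name]  # alternative expressed as a record
        raise RecordUnavailable(f"record {name!r} cannot be read") from err
```

The caller of `fetch_record` never sees block numbers unless it inspects the chained cause, which keeps the diagnostic in the vocabulary of the level that asked for the read.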

UE Handlers and Info Hiding

• We want to handle the UE at the level at which it is meaningful, but…

• This doesn’t necessarily mean the information necessary to handle it is available (it may have been abstracted away to effect good information hiding)

• How should we manage this tradeoff?

UE Handlers and Info Hiding (2)

• “Everything should be made as simple as possible, but not simpler.” –Einstein

• Hide all and only information which is not likely to be useful in diagnosing and recovering from UEs

• Prediction is the key

Meta-structure

• The general policy proposed constitutes a meta-structure of UE handling

• It has several advantages:
  – Handlers don’t violate info-hiding
  – “Uses” hierarchy is intact
  – Allows refinement without major revision
  – Aids debugging

Separation of Case and EH/R

• Separation of Normal Case and Error Handling/Recovery

• Java: try{} catch{}

Suggestion 1

• Make each “abstract machine” responsible for detecting attempts to violate its own specification
  – Trap metaphor: hide the detection mechanism, expose an interface for handling UEs
  – Should be able to handle errors in the context of the VM abstraction
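One way to picture the trap metaphor is a module that checks its own specification and hands control to a handler registered by the level above. This is only a sketch under assumptions: `StackMachine`, its `on` method, and the UE names `"overflow"` and `"underflow"` are invented for illustration and do not come from the paper.

```python
class StackMachine:
    """A tiny 'abstract machine' that detects violations of its own specification."""

    def __init__(self, capacity):
        self._items = []        # hidden representation
        self._capacity = capacity
        self._traps = {}        # UE name -> handler supplied by the user level

    def on(self, ue_name, handler):
        # Expose only the interface for handling UEs, not how they are detected.
        self._traps[ue_name] = handler

    def _trap(self, ue_name, *args):
        handler = self._traps.get(ue_name)
        if handler is None:
            raise RuntimeError(f"unhandled UE: {ue_name}")
        return handler(*args)

    def push(self, item):
        if len(self._items) >= self._capacity:
            return self._trap("overflow", item)
        self._items.append(item)

    def pop(self):
        if not self._items:
            return self._trap("underflow")
        return self._items.pop()
```

The user level decides what an overflow means for it (log, grow, reject) without learning how the stack is stored or how the check is made.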

Degrees of Recovery

• Hardware - handle or crash

• Instead, design for multi-level recovery

• Policy determined by cost and aim

No Recovery

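local
    attempts: INTEGER
do
    -- No rescue clause: attempts stays 0, and an exception raised by
    -- the read propagates to the caller.
    if attempts < Max_attempts then
        last_character := low_level_read_function (f)
    else
        failed := True
    end
end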

Simple Recovery

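local
    attempts: INTEGER
do
    if attempts < Max_attempts then
        last_character := low_level_read_function (f)
    else
        -- Too many failed reads: give up and report failure.
        failed := True
    end
rescue
    -- Runs when the read raises an exception; retry re-executes
    -- the body with the value of attempts preserved.
    attempts := attempts + 1
    retry
end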

Degrees of Recovery

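local
    attempts: INTEGER
do
    if attempts < Max_attempts then
        last_character := low_level_read_function (f)
    else
        failed := True
    end
rescue
    -- Escalate the recovery effort with each failed attempt.
    attempts := attempts + 1
    if attempts = 1 then
        retry
    elseif attempts = 2 then
        sleep (2)
        retry
    elseif attempts = 3 then
        flush_buffers
        sleep (4)
        retry
    else
        -- No retry: the routine fails and the exception
        -- propagates to the caller.
    end
end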
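For readers more familiar with try/except than with rescue/retry, the same escalation can be sketched in Python. This is an illustrative translation, not code from the presentation: `read_with_recovery` and its parameters are hypothetical, and `OSError` stands in for whatever the low-level read raises.

```python
import time


def read_with_recovery(read, flush_buffers, sleep=time.sleep):
    """Try `read()`, escalating the recovery effort after each failure."""
    # Each entry is the recovery action taken before the next attempt.
    recovery_steps = [
        lambda: None,                         # 1st failure: plain retry
        lambda: sleep(2),                     # 2nd failure: wait, then retry
        lambda: (flush_buffers(), sleep(4)),  # 3rd failure: flush, wait, retry
    ]
    attempts = 0
    while True:
        try:
            return read()
        except OSError:
            if attempts >= len(recovery_steps):
                raise                         # out of options: report to caller
            recovery_steps[attempts]()
            attempts += 1
```

Passing `sleep` as a parameter is a design choice made here so the policy can be exercised without real delays; the cost and aim of the system would decide how many steps the list holds.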

Suggestion 2

• Do not specify a module to have properties which UEs will frequently violate

• Examples:
  – Don’t use limited-capacity data structures when the number of objects is unknown, etc.
  – Don’t allow the possibility of, e.g., calling pop() on an empty stack
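A small sketch of what such a specification might look like, assuming a hypothetical `Stack` class: the representation grows as needed, and `pop` is defined for every state, so neither "full" nor "empty" is a violation the caller can trigger.

```python
class Stack:
    """A stack whose specification has no capacity limit and no illegal pop."""

    def __init__(self):
        self._items = []  # a list grows as needed, so there is no fixed capacity

    def push(self, item):
        self._items.append(item)

    def pop(self, default=None):
        # Defined for every state: an empty stack yields `default`
        # instead of violating the specification.
        return self._items.pop() if self._items else default

    def __len__(self):
        return len(self._items)
```

Whether returning a default is acceptable depends on the application; the point is only that the choice is made in the specification rather than left to a UE at runtime.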

Error Indication

• Strongly "typed" errors - Java

• Limitations on values of parameters (Eiffel)

• Capacity limitations

• Requests for undefined information

• Restrictions on the order of operations (encapsulation?)

• Detection of actions which are likely to be unintentional (defined how?)

Error Indication II

• Sufficiency - ensure your module will work correctly or complain

• Priority of Traps - multiple error handling?

• Size of Trap Vector - how many commands in one trap try{}catch{}

• State after Trap - Atomicity

• Errors of Mechanism - tradeoff between simplicity and detail
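The "State after Trap" point can be illustrated with an operation that either completes fully or leaves the state untouched. This is a hedged sketch: `push_all` and its `capacity` parameter are hypothetical, and checking before mutating is just one simple way to get this property.

```python
def push_all(stack, items, capacity):
    """Append every item to `stack`, or none of them if they would not all fit."""
    # Detect the UE before changing anything, so that the state after the
    # trap is exactly the state before the call.
    if len(stack) + len(items) > capacity:
        raise OverflowError(
            f"{len(items)} items do not fit: {len(stack)} of {capacity} slots used"
        )
    stack.extend(items)
```

A handler that catches the `OverflowError` can then reason about the stack as if the call had never been made.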

Redundancy and Efficiency

• Redundant error checks slow the system

• Can often be removed in later versions
  – Retain upper level, remove lower
  – Retain lower, remove upper

• Which is best, and why?

Reliability/Dependability

• How does UE handling relate to reliability and dependability?

Conclusion

• Things go wrong

• UE detection and handling is good

• Must be correctly implemented to be useful
