Kletz: Introduction and Chapters 1, 2, 3. Phil Varner and Dave Larochelle
Failure of the Day
● Voter News Service - updated computer system after 2000 election, now it doesn't work!
● “It is brand-new and has never been test-driven.”
– implies testing is after completion
● “This is the equivalent of a NASA space shot without doing any test runs.”
– hmm, maybe not the best analogy
● “We are going to learn a few things tomorrow night in real time.”
Introduction
● “Find a little, learn a lot” – Law and Order Model
● “Find a lot, learn nothing” – Software Engineering Model
Cause
● Dictionary
– “The producer of an effect, result, or consequence”
– “The one, such as a person, event, or condition, that is responsible for such an action”
● Factor that contributed to events?
Cause means Nothing
● tempted to find things we can do nothing about
● often identify symptoms instead of problem
● air of finality
● implies singularity
● usually doesn't help us prevent further failures
● “Fate is just a lazy person's excuse for doing nothing”
Accident Models
● Spend time fitting data to model
● “distracts from free-range thinking required to uncover the less obvious ways of preventing an accident” - huh?
● ha, ha, your Crouching Lemur Formal Methods cannot defeat my Hidden Marmot Ad-Hoc Technique
Introduction (cont.)
● “chain of events”
– tells WHAT but not WHY
● “can easily miss subtle and complex couplings and interactions among failure events and omit entirely accidents involving no component failure”
● misses “structural deficiencies in the organization, the safety culture in the industry, and management deficiencies”
Leveson
● Leveson - control systems theoretical model
● “Mishap occurs when external disturbances are not adequately controlled”
– external meaning “the thing being controlled”
● “dysfunctional interaction” among components
– normal accidents
● Missing, inadequate, or unenforced constraints
Introduction
● Record all the facts
– what facts are important?
– shouldn't analysis drive investigation?
– “reports should not be too verbose or busy people will not read them”
– Oh, I'm sorry about the nuclear plant meltdown, but I didn't have time to read your report
Accidents
● Technical oversight
– “Human error”
– End “causes” - symptoms
– buffer overflow, exception mishandled
● Hazards that were not seen before
– meaning “we didn't recognize it”
– Specification, testing, review
– Switching software off when not necessary (Ariane 5)
– Using non-stack exploitable languages
Accidents
● Managerial failings
– improving management systems
– generally has to do with supervision, oversight
– training operators (programmers?)
– Forcing adequate specification, implementation, testing, verification, validation
● Are these categories adequate/rich enough?
● Do they constrain thinking?
● Leveson has similar categories, but more detailed
Defen(c)e in Depth
● Usually too much dependence on last line of defence
– Why accidents are usually attributed to human error
  ● “It was the pilot's fault he turned off the engines”
– Most effective actions are at start of chain
  ● “Don't put the power switch next to the fuel control switch”
– However, normal accidents mean we may not even know that a chain exists, much less where the start is
● Pick one
Interesting Characteristics
● Accident 1
– efficiency practice (polyethylene sheet)
  ● Korean Air?
– sensor “failure” (gas detector)
  ● TMI
● Accident 2
– unauthorized modification (lid modified)
  ● (Flixborough)
– poor quality modification (lid repairs)
  ● (Flixborough)
Problems (from Johnson)
● Supporting a systematic approach
– no approach is not an approach
● Framing any analysis of software failure
– How far back do we chain the events?
– Why bolts? Why a pump? Why do we, as a society, require large petrochemical resources to maintain our generally unhappy lifestyles?
● Making adequate recommendations
– Ariane, London examples
Events? Causes?
● Self-Destruct
● Launcher Disintegrates
● High angle of attack
● Nozzles hard over
● Diag. data sent to MC
● Backup SRI dead
● Primary SRI dead
● Backup SRI dies
● Exception thrown
● 64 bit fp to 16 bit int conversion
● Stupid French person?
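The 64-bit floating point to 16-bit integer conversion at the bottom of this chain can be sketched in a few lines. This is only an illustration in Python, not the flight code: the function names are hypothetical, and clamping is shown as one possible form of protection, not as the fix the inquiry recommended.

```python
import math

INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_unprotected(value: float) -> int:
    """Convert and raise on overflow, the analogue of an unhandled Operand Error."""
    result = int(value)  # truncates toward zero
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit in a 16-bit signed integer")
    return result


def to_int16_saturating(value: float) -> int:
    """Convert and clamp to the 16-bit range instead of raising on overflow."""
    if math.isnan(value):
        raise ValueError("NaN has no integer equivalent")
    # Clamp while still a float so that infinities are handled too.
    clamped = max(float(INT16_MIN), min(float(INT16_MAX), value))
    return int(clamped)
```

With an out-of-range input such as 40000.0, the first function raises and the second returns 32767. Whether a clamped value is acceptable downstream is a separate design question.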
Ariane 5 Pre-events
● Operand Error review
– 7 vars, 4 protected
● External reviews held
– what did they do, what did they find?
● Simulations and tests done using A4 test data
● System test excluded SRI and used simulated output
● SRI computer never tested using actual expected flight measurements
Causes?
● Official “Cause of the Failure”
– “[the failure] was caused by the complete loss of guidance and attitude information... due to specification and design errors [in the SRI]”
● Technical Cause
– “An Operand Error when converting the horizontal bias (BH), and the lack of protection of this conversion which caused the SRI computer to stop.”
● Real Cause
– Stupid French people (just as silly as the other two)
Recommendations
● 4 technical, 7 hazard, 3 management(!)
● One page long (of course, there is another one)
● Do not allow any sensor, such as the SRI, to stop sending best effort data
● Organize a specific sw qualification review for each piece of equipment w/ sw
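A rough sketch of what "keep sending best effort data" might look like in software. The class and field names are hypothetical, and falling back to the last good value is an assumption made for illustration; the real SRI design is not described in these slides.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Reading:
    value: float
    degraded: bool  # True when the value is a fallback, not a fresh measurement


class BestEffortSensor:
    """Wraps a raw read function so that an arithmetic fault never stops the data flow."""

    def __init__(self, read_raw: Callable[[], float], initial: float = 0.0):
        self._read_raw = read_raw
        self._last_good = initial

    def read(self) -> Reading:
        try:
            value = self._read_raw()
        except ArithmeticError:
            # Assumption: the last good value, flagged as degraded, is more
            # useful to the consumer than diagnostic data or silence.
            return Reading(self._last_good, degraded=True)
        self._last_good = value
        return Reading(value, degraded=False)
```

The consumer still has to decide what to do with a degraded reading; the wrapper only guarantees that it gets a value and a flag instead of a shutdown.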
Recommendations
● Review all flight software (including embedded software), and in particular:
– identify all implicit assumptions...
– verify the range of values taken by any internal or communications variables in the software
● Include trajectory data in specification and test requirements
● A more transparent organization of the cooperation among the partners in the A5 program must be considered
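As a hedged illustration of "verify the range of values" against trajectory data: the sketch below checks simulated samples against declared ranges. The variable name, the range, and the sample values are all made up for the example; none of them come from the Ariane 5 report.

```python
from typing import Dict, Iterable, List, Tuple

Range = Tuple[float, float]


def find_range_violations(
    declared: Dict[str, Range],
    samples: Iterable[Dict[str, float]],
) -> List[Tuple[int, str, float]]:
    """Return (sample index, variable name, value) for every out-of-range value."""
    violations = []
    for index, sample in enumerate(samples):
        for name, value in sample.items():
            low, high = declared[name]
            if not low <= value <= high:
                violations.append((index, name, value))
    return violations


# Hypothetical declared range and hypothetical trajectory samples.
declared_ranges = {"horizontal_bias": (-32768.0, 32767.0)}
trajectory = [{"horizontal_bias": 1200.0}, {"horizontal_bias": 40000.0}]
print(find_range_violations(declared_ranges, trajectory))
```

The check is only as good as the trajectory fed to it, which is the point of requiring the real trajectory data in the specification and test requirements.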
Chapter 2
● Drying unit control panel located in a Zone 2 area
● Electrical equipment was not flameproof or nonsparking
● Mounted in a metal cabinet
● Cabinet purged with nitrogen
● A pressure switch isolated the electrical supply if the pressure fell too low
What went wrong?
● Fuel
– Solvent leaked into the cabinet with the nitrogen
● Air
– Leaked into the cabinet
– Leaked solvent may have weakened joints
● Ignition source
– Electrical
● Pressure switch basically disarmed
Recommendations
● First layer
– Prevent back flow, etc.
– Alterations of trip set points should only be made with written authorisation
– New equipment should be scheduled for testing immediately
Second Layer
● Difficult to maintain pressure in a metal cabinet
● Trip kept operating
– Reduced trip to ¼ inch
– Reduced trip to 0
– Operators just knew it worked
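The set point story above can be sketched as a small model. Everything here is an assumption for illustration: the class name, the starting set point of 0.5, and the permit reference are hypothetical, and pressure is treated as gauge pressure so that a set point of 0 can never be undercut.

```python
from typing import Optional


class PressureTrip:
    """Isolates the electrical supply when cabinet pressure falls below the set point."""

    def __init__(self, set_point: float):
        self._set_point = set_point

    @property
    def set_point(self) -> float:
        return self._set_point

    def change_set_point(self, new_value: float, authorisation_ref: Optional[str]) -> None:
        # Mirrors the recommendation: no alteration without written authorisation.
        if not authorisation_ref:
            raise PermissionError("set point changes need written authorisation")
        # A set point of zero or below can never be undercut, so the trip is disarmed.
        if new_value <= 0:
            raise ValueError("a set point of zero or below disarms the trip")
        self._set_point = new_value

    def supply_isolated(self, cabinet_pressure: float) -> bool:
        return cabinet_pressure < self._set_point


trip = PressureTrip(set_point=0.5)                # hypothetical original setting
trip.change_set_point(0.25, "PERMIT-001")         # hypothetical permit reference
print(trip.supply_isolated(0.1))                  # True: pressure is below the set point
```

In this model the second reduction, to 0, is rejected outright. On the plant nothing rejected it, and the trip was left in place but unable to operate.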
Third Layer
● Safety equipment for a rare hazard produced a greater hazard
● Did the equipment need to be in a Zone 2 area?
– 'He did not ask if the control panel could be moved. That was not his job. His was to supply equipment suitable for the agreed classification. It was no one's job to ask if it would be possible to change the classification ...'
● It is bad to assume a trip will always work and rely on it.
Other observations
● Latent failures
● Applicability to Software
– Numerous examples of efforts to gain additional safety/security having the opposite effect.
  ● PHP
    – Paranoid sites increase logging
    – The increased logging exposes them to format string attacks
  ● Bind
    – Digital signatures added for increased security
    – New code has a buffer overflow.
– Discussion, other examples, etc.
Chapter 3: Things that almost go Boom
● Tank contained water and some oil
● Slip-plate has not been removed
● Foreman proposed to break the joint, remove the slip-plate, and remake the joint before the water drained out
● Manager agreed to the proposal
● Foreman unsuccessful, oil starts to leak
● Attempt to shut down the burners fails
● Burners were manually isolated.
5 Reports
● Slip-plate is the problem
● Failure to shut down the furnace
● Don't rush
● Management problems
● Business climate in the facility
Don't Rush
● Very few problems are too urgent to discuss for 15 minutes
● Usually better solutions if you stop to think
Management problems
● Manager inexperienced
● Foreman was respected expert
'The manager could not be blamed. Nevertheless sooner or later every manager has to learn to stand up to his foremen, not disregarding their advice, but weighing it in the balance. He should be very reluctant to overrule them if they are advocating caution, more willing to do so if, as in this case, they advocate taking a chance'
Business Climate in the Facility
● Manager was influenced by others' attitude to safety
● Was the manager given contradictory instructions?
– Get the plant on-line quickly
– Follow safety procedures
● 'no-win' situation
– After accidents, blamed for not following safety procedures.
– If the required output/efficiency is not achieved, also blamed