
Kletz: Introduction and Chapters 1, 2, 3

Phil Varner and Dave Larochelle

Failure of the Day

● Voter News Service - updated computer system after 2000 election, now it doesn't work!

● “It is brand-new and has never been test-driven.” – implies testing happens only after completion

● “This is the equivalent of a NASA space shot without doing any test runs.” – hmm, maybe not the best analogy

● “We are going to learn a few things tomorrow night in real time.”

Introduction

● “Find a little, learn a lot” – Law and Order Model

● “Find a lot, learn nothing” – Software Engineering Model

Cause

● Dictionary
– “The producer of an effect, result, or consequence”
– “The one, such as a person, event, or condition, that is responsible for such an action”

● Factor that contributed to events?

Cause means Nothing

● tempted to find things we can do nothing about

● often identify symptoms instead of problem

● air of finality

● implies singularity

● usually doesn't help us prevent further failures

● “Fate is just a lazy person's excuse for doing nothing”

Accident Models

● Spend time fitting data to model

● “distracts from free-range thinking required to uncover the less obvious ways of preventing an accident” – huh?

● ha, ha, your Crouching Lemur Formal Methods cannot defeat my Hidden Marmot Ad-Hoc Technique

Introduction (cont.)

● “chain of events”– tells WHAT but not WHY

● “can easily miss subtle and complex couplings and interactions among failure events and omit entirely accidents involving no component failure”

● misses “structural deficiencies in the organization, the safety culture in the industry, and management deficiencies”

Leveson

● Leveson – control systems theoretical model

● “Mishap occurs when external disturbances are not adequately controlled”
– external meaning “the thing being controlled”

● “dysfunctional interaction” among components– normal accidents

● Missing, inadequate, or unenforced constraints

Introduction

● Record all the facts
– what facts are important?
– shouldn't analysis drive investigation?
– “reports should not be too verbose or busy people will not read them”
– Oh, I'm sorry about the nuclear plant meltdown, but I didn't have time to read your report

Accidents

● Technical oversight – “Human error”
– End “causes” – symptoms
– buffer overflow, exception mishandled

● Hazards that were not seen before
– meaning “we didn't recognize it”
– Specification, testing, review
– Switching software off when not necessary (Ariane 5)
– Using non-stack-exploitable languages

Accidents

● Managerial failings
– improving management systems
– generally has to do with supervision, oversight
– training operators (programmers?)
– Forcing adequate specification, implementation, testing, verification, validation

● Are these categories adequate/rich enough?

● Do they constrain thinking?

● Leveson has similar categories, but more detailed

Defen(c)e in Depth

● Usually too much dependence on last line of defence
– Why accidents are usually attributed to human error
● “It was the pilot's fault he turned off the engines”
– Most effective actions are at the start of the chain
● “Don't put the power switch next to the fuel control switch”

– However, normal accidents mean we may not even know that a chain exists, much less where the start is

● Pick one

Interesting Characteristics

● Accident 1
– efficiency practice (polyethylene sheet)
● Korean Air?
– sensor “failure” (gas detector)
● TMI

● Accident 2
– unauthorized modification (lid modified)
● (Flixborough)
– poor quality modification (lid repairs)
● (Flixborough)

Flixborough

● Poorly designed and implemented modification

Problems (from Johnson)

● Supporting a systematic approach– no approach is not an approach

● Framing any analysis of software failure
– How far back do we chain the events?
– Why bolts? Why a pump? Why do we, as a society, require large petrochemical resources to maintain our generally unhappy lifestyles?

● Making adequate recommendations
– Ariane, London examples

Ariane 5

Events? Causes?

● Self-Destruct

● Launcher Disintegrates

● High angle of attack

● Nozzles hard over

● Diag. data sent to MC

● Backup SRI dead

● Primary SRI dead

● Backup SRI dies

● Exception thrown

● 64-bit fp to 16-bit int conversion

● Stupid French person?

Ariane 5 Pre-events

● Operand Error review – 7 vars, 4 protected

● External reviews held– what did they do, what did they find?

● Simulations and tests done using A4 test data

● System test excluded SRI and used simulated output

● SRI computer never tested using actual expected flight measurements

Causes?

● Official “Cause of the Failure”
– “[the failure] was caused by the complete loss of guidance and attitude information... due to specification and design errors [in the SRI]”

● Technical Cause
– “An Operand Error when converting the horizontal bias (BH), and the lack of protection of this conversion which caused the SRI computer to stop.” (see the sketch below)

● Real Cause
– Stupid French people (just as silly as the other two)
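The SRI flight code was Ada, where the unprotected conversion of an out-of-range value raises an Operand Error and, unhandled, halts the computer. As a rough sketch only, assuming an illustrative A5-sized horizontal-bias value, the same failure mode in C (where the out-of-range cast is undefined behavior rather than a clean trap):

    #include <stdint.h>
    #include <stdio.h>

    /* Rough C analogue of the SRI horizontal-bias (BH) conversion.
       In Ada this raised an Operand Error; in C the cast below is
       undefined behavior, so the same bug can pass silently. */
    int main(void) {
        double horizontal_bias = 40000.0; /* illustrative A5-sized value;
                                             A4 trajectories stayed small */

        /* Unprotected conversion: 40000 exceeds INT16_MAX (32767). */
        int16_t bh = (int16_t)horizontal_bias;

        printf("BH converted to %d\n", bh);
        return 0;
    }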

Recommendations

● 4 technical, 7 hazard, 3 management(!)

● One page long (of course, there is another one)

● Do not allow any sensor, such as the SRI, to stop sending best-effort data

● Organize a specific software qualification review for each piece of equipment with software

Recommendations

● Review all flight software (including embedded software), and in particular:
– identify all implicit assumptions...
– verify the range of values taken by any internal or communications variables in the software (see the sketch after this list)

● Include trajectory data in specification and test requirements

● A more transparent organization of the cooperation among the partners in the A5 program must be considered
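One way to read the range-verification point as code, a minimal hypothetical sketch (the BH_MIN/BH_MAX limits and the convert_bh_checked helper are illustrative, not from the inquiry report): state each variable's documented range next to its conversion and fail loudly instead of silently truncating.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical checked conversion: the documented range of the
       variable sits next to the conversion, so a reviewer can verify
       the assumption instead of discovering it in flight. */
    #define BH_MIN -32768.0
    #define BH_MAX  32767.0

    static bool convert_bh_checked(double bh, int16_t *out) {
        if (bh < BH_MIN || bh > BH_MAX)
            return false; /* out of documented range: report, don't truncate */
        *out = (int16_t)bh;
        return true;
    }

    int main(void) {
        int16_t bh;
        if (!convert_bh_checked(40000.0, &bh))
            printf("BH out of range: flag it, keep sending best-effort data\n");
        return 0;
    }

Returning an error instead of halting also matches the earlier recommendation that a sensor keep sending best-effort data rather than stopping.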

Chapter 2

● Drying unit control panel located in a Zone 2 area

● Electrical equipment was not flameproof or nonsparking

● Mounted in a metal cabinet

● Cabinet purged with nitrogen

● A pressure switch isolated the electrical supply if the pressure fell too low

System operation

● A young, inexperienced graduate operates the system

● Switches on the system

What went wrong?

● Fuel
– Solvent leaked into the cabinet with the nitrogen

● Air
– Leaked into the cabinet
– Leaked solvent may have weakened joints

● Ignition source
– Electrical

● Pressure switch basically disarmed

Recommendations

● First layer
– Prevent back flow, etc.
– Alterations of trip set points should only be made with written authorisation
– New equipment should be scheduled for testing immediately

Second Layer

● Difficult to maintain pressure in a metal cabinet

● Trip kept operating
– Reduced trip set point to ¼ inch
– Reduced trip set point to 0 (see the sketch below)
– Operators just knew it worked
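A software reader can see why a zero set point disarms the trip from a minimal hypothetical sketch (the should_trip helper and the numbers are illustrative): once the threshold is 0, a nonnegative reading can never fall below it, so the interlock still appears to work while protecting nothing.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical low-pressure trip: isolate the electrical supply
       when the purge pressure falls below the set point. */
    static bool should_trip(double pressure, double set_point) {
        return pressure < set_point;
    }

    int main(void) {
        /* Original set point (1/4 inch) catches a low reading... */
        printf("%d\n", should_trip(0.05, 0.25)); /* prints 1: trips */

        /* ...but a set point of 0 can never fire on a nonnegative
           gauge reading: disarmed, yet it looks operational. */
        printf("%d\n", should_trip(0.05, 0.0));  /* prints 0: never trips */
        return 0;
    }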

Third Layer

● Safety equipment for a rare hazard produced a greater hazard

● Did the equipment need to be in a Zone 2 area?
– 'He did not ask if the control panel could be moved. That was not his job. His was to supply equipment suitable for the agreed classification. It was no one's job to ask if it would be possible to change the classification...'

● It is bad to assume a trip will always work and rely on it.

Other observations

● Latent failures

● Applicability to Software
– Numerous examples of efforts to gain additional safety/security having the opposite effect

● PHP
– Paranoid sites increase logging
– The increased logging exposes them to format string attacks (see the sketch after this list)

● Bind
– Digital signatures added for increased security
– New code has a buffer overflow

– Discussion: other examples, etc.
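The PHP/logging point compresses to a single line of C. A hypothetical sketch (log_request and the calls are illustrative, not actual PHP or BIND code): passing attacker-influenced text as the format argument is what turns a "more logging" change into a format string hole.

    #include <syslog.h>

    /* Hypothetical logging added for extra security visibility. */
    void log_request(const char *user_input) {
        /* VULNERABLE: the request is used as the format string, so
           %s/%n sequences in it are interpreted by syslog. */
        syslog(LOG_INFO, user_input);

        /* SAFE: the request is passed as data, never as a format.
           A real patch replaces the line above with this one. */
        syslog(LOG_INFO, "%s", user_input);
    }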

Chapter 3: Things that almost go Boom

● Tank contains water and some oil

● Slip-plate has not been removed

● Foreman proposed to break the joint, remove the slip-plate, and remake the joint before the water drained out

● Manager agreed to the proposal

● Foreman unsuccessful, oil starts to leak

● Attempt to shut down the burners fails

● Burners were manually isolated

5 Reports

● Slip-plate is the problem

● Failure to shut down the furnace

● Don't rush

● Management problems

● Business climate in the facility

Don't Rush

● Very few problems are too urgent to discuss for 15 minutes

● Usually better solutions if you stop to think

Management problems

● Manager inexperienced

● Foreman was respected expert

“The manager could not be blamed. Nevertheless, sooner or later every manager has to learn to stand up to his foremen, not disregarding their advice, but weighing it in the balance. He should be very reluctant to overrule them if they are advocating caution, more willing to do so if, as in this case, they advocate taking a chance.”

Business Climate in the Facility

● Manager was influenced by others' attitudes to safety

● Was the manager given contradictory instructions?
– Get the plant on-line quickly
– Follow safety procedures

● 'no-win' situation
– After accidents, blamed for not following safety procedures
– If the required output/efficiency is not achieved, also blamed

● 'What we don't say is as important as what we do say.'

● Kletz criticizes those who want to hold individual managers and directors responsible. But is there another way to change the business climate?

● Note potential relevance to software.