
A System’s Approach to Safety

Prof. Nancy Leveson
Aeronautics and Astronautics

MIT

The Problem

Why do we need a new approach?

• New causes of accidents in complex, software-intensive systems
– Software does not “fail”; it usually issues unsafe commands

– Role of humans in systems is changing

• Traditional safety engineering approaches were developed for relatively simple electro-mechanical systems

• We need more effective techniques for these new systems and new causes

Accident with No Component Failures

Types of Accidents

• Component Failure Accidents
– Single or multiple component failures
– Usually assume random failure

• Component Interaction Accidents
– Arise in interactions among components
– Related to interactive complexity, coupling, and use of computers
– Level of interactions has reached the point where they can no longer be thoroughly
• Planned
• Understood
• Anticipated
• Guarded against

So What Do We Need to Do?
“Engineering a Safer World”

• Expand our accident causation models

• Create new hazard analysis techniques

• Use new system design techniques
– Safety-driven design
– Improved system engineering

• Improve accident analysis and learning from events

• Improve control of safety during operations

• Improve management decision-making and safety culture

An Expanded View of Accident Causes

Accident Causality Models

• Underlie all our efforts to engineer for safety
• Explain why accidents occur
• Determine the way we prevent and investigate accidents

“All models are wrong, some models are useful.”
(George Box)

Chain-of-Events (Domino) Causation Models

Assumption: Accidents are caused by chains of component failures

• Simple, direct relationship between events in chain
– Ignores non-linear relationships, feedback, etc.

• Events almost always involve component failure, human error, or energy-related event

• Forms the basis for most safety engineering and reliability engineering analysis:
e.g., FTA, PRA, FMECA, Event Trees, etc.
and design:
e.g., redundancy, over-design, safety margins, … (a toy FTA sketch follows)
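Since FTA is named here, a minimal sketch in Python (not from the original talk; the tree, gates, and event names are invented) of the kind of analysis the chain-of-events model supports: computing the minimal cut sets of a toy fault tree.

# A toy fault tree: TOP = OR(pump_fails, AND(valve_stuck, sensor_fails)).
# OR unions the children's cut sets; AND takes cross-products of them.
from itertools import product

def cut_sets(node):
    """Return the cut sets (frozensets of basic events) for a node."""
    kind = node[0]
    if kind == "basic":
        return [frozenset([node[1]])]
    children = [cut_sets(c) for c in node[1]]
    if kind == "or":
        return [cs for child in children for cs in child]
    if kind == "and":
        return [frozenset().union(*combo) for combo in product(*children)]
    raise ValueError(kind)

def minimal(sets):
    """Keep only cut sets that contain no smaller cut set."""
    return [s for s in sets if not any(t < s for t in sets)]

top = ("or", [("basic", "pump_fails"),
              ("and", [("basic", "valve_stuck"), ("basic", "sensor_fails")])])
print(minimal(cut_sets(top)))
# -> [frozenset({'pump_fails'}), frozenset({'valve_stuck', 'sensor_fails'})]

Note how the design logic the slide lists shows up: redundancy is an AND of independent failures, so it lengthens cut sets rather than eliminating hazards.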

Chain-of-events example

Limitations of Chain-of-Events Causation Models

• Oversimplifies causality

• Excludes or does not handle
– Component interaction accidents (vs. component failure accidents)

– Indirect or non-linear interactions among events

– Systemic factors in accidents

– Human “errors”

– System design errors (including software errors)

– Migration toward states of increasing risk

The Computer Revolution

• Software is simply the design of a machine abstracted from its physical realization

• Machines that were physically impossible or impractical to build become feasible

• Design can be changed without retooling or manufacturing
• Can concentrate on steps to be achieved without worrying about how steps will be realized physically

General-Purpose Machine + Software = Special-Purpose Machine

Advantages = Disadvantages

• The computer is so powerful and useful because it has eliminated many of the physical constraints of previous technology

• Both its blessing and its curse
– No longer have to worry about physical realization of our designs
– But no longer have physical laws that limit the complexity of our designs

• What does “failure” of a design (a pure abstraction) mean?

Software-Related Accidents

• Are usually caused by flawed requirements
– Incomplete or wrong assumptions about operation of controlled system or required operation of computer
– Unhandled controlled-system states and environmental conditions

• Merely trying to get the software “correct” or to make it reliable will not make it safer under these conditions.

Software-Related Accidents (2)

• Software may be highly reliable and “correct” and still be unsafe:
– Correctly implements requirements, but the specified behavior is unsafe from a system perspective
– Requirements do not specify some particular behavior required for system safety (incomplete)
– Software has unintended (and unsafe) behavior beyond what is specified in requirements
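To make this concrete, a small hypothetical sketch (the requirement, names, and limit are invented for illustration): the code below is “correct” against its stated requirement, yet unsafe at the system level because the requirement says nothing about stale sensor data, the incomplete-requirements case above.

import time

PRESSURE_LIMIT = 100.0  # hypothetical limit, arbitrary units

def relief_valve_command(pressure_reading):
    # "Correct" against the stated requirement:
    # open the relief valve when pressure exceeds the limit.
    return "OPEN" if pressure_reading > PRESSURE_LIMIT else "CLOSED"

# The requirement never bounds the age of the reading, so this reliable,
# correct implementation keeps the valve closed on a ten-minute-old sample
# while the real pressure may already be climbing.
reading, taken_at = 95.0, time.time() - 600
age = time.time() - taken_at
print(f"valve: {relief_valve_command(reading)} (reading age {age:.0f}s)")
# -> valve: CLOSED (reading age 600s): reliable, "correct", unsafe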

Safety ≠ Reliability

• Safety and reliability are NOT the same
– Sometimes increasing one can even decrease the other
– Making all the components highly reliable will have no impact on system accidents

• For relatively simple, electro-mechanical systems with primarily component failure accidents, reliability engineering can increase safety.

• But this is untrue for complex, software-intensive socio-technical systems.

It’s only a random failure, sir! It will never happen again.

Operator Error: Old View (Sidney Dekker, Jens Rasmussen)

• Operator error is cause of incidents and accidents

• So do something about operator involved (suspend, retrain, admonish)

• Or do something about operators in general
– Marginalize them by putting in more automation
– Rigidify their work by creating more rules and procedures

Operator Error: New View

• Operator error is a symptom, not a cause

• All behavior is affected by the context (system) in which it occurs

• To do something about error, must look at the system in which people work or operate machines:
– Design of equipment
– Usefulness of procedures
– Existence of goal conflicts and production pressures

Sidney Dekker, 2009

Hindsight Bias

Overcoming Hindsight Bias

• Assume nobody comes to work to do a bad job.

• Investigation reports should explain
– Why it made sense for people to do what they did
– What changes will reduce the likelihood of it happening again

Adaptation

• Systems are continually changing
– Planned changes
– Unplanned changes

• Rasmussen: systems and organizations migrate toward accidents (states of high risk) under cost, productivity, and profit pressures in an aggressive, competitive environment (a toy sketch of this drift follows below)

• During operations, need to:
– Control planned changes
– Control and/or detect unplanned changes
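A toy numerical sketch of Rasmussen's migration (all quantities invented, not a calibrated model): steady production pressure erodes the operating margin a little each period, and a weak safety pushback does not stop the drift toward the boundary of safe operation.

# Toy model: margin shrinks each week under cost/productivity pressure.
margin = 10.0          # arbitrary distance from the safety boundary
pressure_drift = 0.5   # erosion per week from cost/schedule pressure
safety_pushback = 0.1  # weak corrective effort per week

for week in range(1, 31):
    margin -= pressure_drift - safety_pushback
    if margin <= 0:
        print(f"week {week}: boundary of safe operation crossed")
        break
else:
    print(f"margin remaining after 30 weeks: {margin:.1f}")
# -> week 25: boundary of safe operation crossed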

[Figure: Simplified System Dynamics Model of the Columbia Accident]

STAMP

A new accident causation model using Systems Theory

(vs. Reliability Theory)

Applying Systems Thinking to Safety

• Losses are the result of complex processes, not simply chains of failure events

• Accidents can occur due to unsafe interactions among components
– Component Failure Accidents
– Component Interaction Accidents

– Component Interaction Accidents

• Most major accidents arise from a slow migration of the entire system toward a state of high risk
– Need to control and detect this migration

STAMP (System-Theoretic Accident Model and Processes)

• Treat safety as a dynamic control problem rather than a component failure problem

– O-ring did not control propellant gas release by sealing gap in field joint of Challenger Space Shuttle

– Software did not adequately control descent speed of Mars Polar Lander

– Public health system did not adequately control contamination of the milk supply with melamine

– Financial system did not adequately control the use of financial instruments

– Deepwater Horizon design and operations did not adequately control the release of hydrocarbons from the well.

– The Washington Metropolitan Area Transit Authority railway design and operations did not adequately control separation between trains

Safety is a Control Problem (2)

• Events are the result of inadequate control
– Result from lack of enforcement of safety constraints in system design and operations

• A change in emphasis:

“prevent failures”
↓

“enforce safety constraints on system behavior”

STAMP (2)

• Systems can be viewed as hierarchical control structures

– Systems are treated as interrelated components kept in a state of dynamic equilibrium by feedback loops of information and control

– Controllers impose constraints upon the activity at a lower level of the hierarchy: safety constraints

Example Safety Control Structure

Safety Constraints

• Each component in the control structure has
– Assigned responsibilities, authority, accountability
– Controls that can be used to enforce safety constraints

• Each component’s behavior is influenced by
– Context (environment) in which it is operating
– Knowledge about current state of process

Accidents occur when model of process is inconsistent with real state of process and controller provides inadequate control actions

[Figure: basic control loop. A Controller, containing a Model of the Process, issues Control Actions to the Controlled Process and receives Feedback from it.]

Control processes operate between levels of control.

Feedback channels are critical:
– Design
– Operation

Relationship Between Safety and Process Models

• Accidents occur when models do not match process and
– Required control commands are not given
– Incorrect (unsafe) ones are given
– Correct commands are given at the wrong time (too early, too late)
– Control stops too soon

Explains software errors, human errors, component interaction accidents …
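A minimal sketch of the first mismatch case (states and command invented for illustration, loosely echoing the familiar pattern in which braking is inhibited because software believes the aircraft is still airborne):

from dataclasses import dataclass

@dataclass
class Controller:
    # In STAMP terms, the controller acts on its *model* of the
    # process (believed_state), not on the process itself.
    believed_state: str

    def control_action(self):
        # The required command is issued only when the model says so.
        return "ENABLE_BRAKING" if self.believed_state == "on_ground" else "NO_ACTION"

real_state = "on_ground"                      # the aircraft has touched down...
ctrl = Controller(believed_state="airborne")  # ...but a missed feedback update
                                              # left the process model stale
print(real_state, "->", ctrl.control_action())
# -> on_ground -> NO_ACTION: a required command is not given (first case above)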

A Broad View of Controls

Component failures and unsafe interactions may be “controlled” through design

(e.g., redundancy, interlocks, fail-safe design, other design techniques; an interlock sketch follows below)

or through process
• Manufacturing processes and procedures
• Maintenance processes
• Operations

or through social controls (cultural, policy, regulation, individual self interest)
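As a sketch of one design-level control named above, a hypothetical interlock (names invented): the unsafe combination of states is rejected by construction, independent of procedure or operator intent.

class InterlockError(RuntimeError):
    pass

def open_access_door(chamber):
    # Interlock: the door cannot be commanded open while the chamber
    # is energized, regardless of what the operator intends.
    if chamber["energized"]:
        raise InterlockError("de-energize chamber before opening door")
    chamber["door_open"] = True

chamber = {"energized": True, "door_open": False}
try:
    open_access_door(chamber)
except InterlockError as e:
    print(f"blocked: {e}")  # -> blocked: de-energize chamber before opening door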

Summary: Accident Causality

• Accidents occur when
– Control structure or control actions do not enforce safety constraints
• Unhandled environmental disturbances or conditions
• Unhandled or uncontrolled component failures
• Dysfunctional (unsafe) interactions among components
– Control structure degrades over time (e.g., asynchronous evolution)
– Control actions inadequately coordinated among multiple controllers

A Third Source of Risk

• Control actions inadequately coordinated among multiple controllers


[Figure: two coordination patterns. Left: Controller 1 and Controller 2 each control their own process (Process 1, Process 2), with boundary areas between them. Right: Controller 1 and Controller 2 both control the same process, creating overlap areas (side effects of decisions and control actions).]

Uncoordinated “Control Agents”

[Figure: instructions to two aircraft (from Rasmussen).
– “SAFE STATE”: ATC, as the sole control agent, provides coordinated instructions to both planes.
– “SAFE STATE”: TCAS, as the sole control agent, provides coordinated instructions to both planes.
– “UNSAFE STATE”: both TCAS and ATC provide uncoordinated, independent instructions, with no coordination between them (sketched in code below).]
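A sketch of the unsafe case in the figure (instructions invented for illustration): each controller's instruction is sensible in isolation, but nothing enforces consistency between them.

def atc_instruction(plane):
    # Ground controller resolves the conflict one way...
    return {"plane_a": "descend", "plane_b": "maintain"}[plane]

def tcas_instruction(plane):
    # ...while the onboard system independently resolves it another way.
    return {"plane_a": "climb", "plane_b": "descend"}[plane]

for plane in ("plane_a", "plane_b"):
    orders = {atc_instruction(plane), tcas_instruction(plane)}
    if len(orders) > 1:
        print(f"{plane}: conflicting instructions {sorted(orders)} -> unsafe state")
# -> plane_a: conflicting instructions ['climb', 'descend'] -> unsafe state
# -> plane_b: conflicting instructions ['descend', 'maintain'] -> unsafe state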

Uses for STAMP

• More comprehensive accident/incident investigation and root cause analysis

• Basis for new, more powerful hazard analysis techniques (STPA)

• Safety-driven design (physical, operational, organizational)
– Can integrate safety into the system engineering process
– Assists in design of human-system interaction and interfaces

• Organizational and cultural risk analysis
– Identifying physical and project risks
– Defining safety metrics and performance audits
– Designing and evaluating potential policy and structural improvements
– Identifying leading indicators of increasing risk (“canary in the coal mine”)

• Improve operations and management control of safety

What is Safety Culture?

Schein: The Three Levels of Organizational Culture

Safety culture is set by the leaders who establish the values under which decisions will be made.

Safety Culture

• Safety culture is a subset of culture that reflects the general attitude and approaches to safety and risk management

• Trying to change culture without changing environment in which it is embedded is doomed to failure

• Simply changing organizational structures may lower risk over short term, but superficial fixes that do not address the set of shared values and social norms are likely to be undone over time.

• “Culture of denial”

© Copyright Nancy Leveson, Aug. 2006

Examples of Positive Cultural Values and Assumptions

• Incidents and accidents are valued as an important window into systems that are not functioning as they should – triggering causal analysis and improvement actions
– Safety information is surfaced without fear
– Safety analysis is conducted without blame

• Safety commitment is valued

Example Cultural Values and Assumptions (2)

• There is a feeling of openness and honesty, where everyone’s voice is valued. Employees feel managers are listening.
– Trust among all parties (hard to establish, easy to break)
– Employees feel psychologically safe about reporting concerns
– Employees believe that managers can be trusted to hear their concerns and will take appropriate action
– Managers believe employees are worth listening to and are worthy of respect

Types of Flawed Safety Cultures

• Culture of Denial
– Risk assessment unrealistic
– Credible risks and warnings are dismissed without appropriate investigation (only want to hear good news)
– Believe accidents are inevitable, the price of productivity

• Compliance Culture
– Focus on complying with government regulations
– Produce extensive “safety case” arguments

• Paperwork Culture
– Produce lots of paper analyses with little impact on design and operations

Safety Policy

• Reflects how the company or group values safety

• Should be easy to understand, easily operationalized

• States the way the company views safety: guiding principles

Example Operational Safety Philosophy (1) (Colonial Pipeline)

• All injuries and accidents are preventable.

• We will not compromise safety to achieve any business objective.

• Leaders are accountable for the safety of all employees, contractors, and the public.

• Each employee has primary responsibility for his/her safety and the safety of others.

• Effective communication and the sharing of information is essential to achieving an accident-free workplace.

• Employees and contractor personnel will be properly trained to perform their work safely.

Example Operational Safety Philosophy (2) (Colonial Pipeline)

• Exposure to workplace hazards shall be minimized and/or safeguarded.

• We will empower and encourage all employees and contractors to stop, correct and report any unsafe condition.

• Each employee will be evaluated on his/her performance and contribution to our safety efforts.

• We will design, construct, operate and maintain facilities and pipelines with safety in mind.

• We believe preventing accidents is good business.

Safety in Operations

Continuous Improvement and Learning

• Learning from events
– Accident/incident analysis
– Generating recommendations

• Continuous improvement
– Assigning responsibility
– Follow-up to ensure changes are implemented
– Feedback channels to determine whether changes are effective

Impediments to Learning

• Filtering and subjectivity in accident reports

• “Root cause” seduction
– Idea of a singular cause is satisfying to our desire for certainty and control
– Leads to fixing symptoms (a sophisticated game of “whack-a-mole”)

• “Blame is the enemy of safety”

• Oversimplification
– Focus on hardware component failure and operator error
– Tendency to look for linear cause-effect relationships and proximal events (rather than systemic factors)

Blame is the Enemy of Safety

• “My UK safety customers are incredibly spooked by [the Nimrod accident report] because of the way it singled out individuals in the safety assessment chain for criticism. It has made a very difficult process of assessing safety risk even more difficult.”

• People stop reporting errors and problems
– Just Culture movement

Using STAMP in Accident Analysis

• Identify the system hazard violated and the system safety design constraints

• Construct the safety control structure as it was designed to work
– Component responsibilities (requirements)
– Control actions and feedback loops

• For each component, determine whether it fulfilled its responsibilities or provided inadequate control
– If inadequate control, why? (including changes over time)

• Determine the changes that could eliminate the inadequate control (lack of enforcement of system safety constraints) in the future
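A minimal bookkeeping sketch of these steps (components and responsibilities invented): record each component of the safety control structure with its responsibilities, then flag where control was inadequate and ask why.

from dataclasses import dataclass, field

@dataclass
class Component:
    name: str
    responsibilities: list
    fulfilled: dict = field(default_factory=dict)  # responsibility -> bool

structure = [
    Component("Operator", ["follow shutdown procedure"],
              {"follow shutdown procedure": True}),
    Component("Control software", ["close vent valve on overpressure"],
              {"close vent valve on overpressure": False}),
]

for c in structure:
    for r in c.responsibilities:
        if not c.fulfilled.get(r, False):
            # Next question per the method: why was control inadequate,
            # including changes over time?
            print(f"Inadequate control: {c.name} did not enforce '{r}'")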


New Hazard Analysis Technique (STPA)

• Starts from hazards
• Identifies safety constraints (system and component safety requirements)
• Identifies scenarios leading to violation of safety constraints
– Includes scenarios (cut sets) found by Fault Tree Analysis
– Finds additional scenarios not found by FTA and other failure-oriented analyses

• Can be used on technical design and organizational design

[Figure: control loop annotated with classes of causal factors, e.g., (5) missing or wrong communication with another controller.]
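A sketch of the scenario-generation step (control-action names invented): candidate unsafe control actions are enumerated against the four ways control can be inadequate, matching the process-model slide earlier in the deck.

from itertools import product

control_actions = ["arm launch sequence", "abort launch"]
unsafe_types = [
    "not provided when required",
    "provided in an unsafe context",
    "provided too early or too late",
    "stopped too soon or applied too long",
]

# Each pairing is a candidate unsafe control action to examine against
# the system hazards and safety constraints.
for action, kind in product(control_actions, unsafe_types):
    print(f"Candidate UCA: '{action}' {kind}")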

Evaluation (1)

• Performed a non-advocate risk assessment for inadvertent launch on the new BMDS (Ballistic Missile Defense System)

• Deployment and testing of BMDS held up for 6 months because so many scenarios were identified for inadvertent launch. In many of these scenarios:
– All components were operating exactly as intended
• E.g., missing cases in software, obscure timing interactions
• Could not be found by fault trees or other standard techniques
– Complexity of component interactions led to unanticipated system behavior
– STPA also identified component failures that could cause inadvertent launch (most analysis techniques consider only these failure events)

• Now being used proactively as changes are made to the system

Evaluation (2)

• Joint research project between MIT and JAXA to determine feasibility and usefulness of STPA for JAXA projects

• Comparison between STPA and FTA for the HTV
– Problems identified?
– Resources required?

[Figure: Comparison between STPA and FTA for the HTV – causal factors around the control loop, marked as identified by both STPA and FTA or by STPA only: ISS component failures; crew mistakes in operation; crew process model inconsistent; activation missing/inappropriate; activation delayed; HTV component failures; HTV state changes over time; out-of-range radio disturbance; physical disturbance; feedback missing/inadequate, delayed, or incorrect; flight mode feedback missing/inadequate or incorrect; visual monitoring missing/inadequate; wrong information/directive from JAXA/NASA GS.]

Does it work? Is it practical? (Technical)

• Safety analysis of new missile defense system (MDA)

• Safety-driven design of new JPL outer planets explorer

• Safety analysis of the JAXA HTV (unmanned cargo spacecraft to ISS)

• Incorporating risk into early trade studies (NASA Constellation)

• Orion (Space Shuttle replacement)

• Safety of maglev trains (Japan Central Railway)

• NextGen (for NASA, just starting)

• Accident/incident analysis (aircraft, petrochemical plants, air traffic control, railway accidents, …)

Does it work? Is it practical? (Social and Managerial)

• Analysis of the management structure of the space shuttle program (post-Columbia)

• Risk management in the development of NASA’s new manned space program (Constellation)

• NASA Mission control ─ re-planning and changing mission control procedures safely

• Food safety

• Safety in pharmaceutical drug development

• Risk analysis of outpatient GI surgery at Beth Israel Deaconess Hospital

• Analysis and prevention of corporate fraud


Conclusions

• A new, more sophisticated causality model is needed to handle the new causes of accidents and the complexity of our modern systems

• Safety is a control problem, not just a failure problem

• Safety engineering and risk management need to consider operations and changes over time, not just the original engineering design

• Using STAMP, we can create much more powerful and effective safety engineering tools and techniques and operate safer systems

Nancy Leveson, Engineering a Safer World, MIT Press, 2011. http://sunnyday.mit.edu/safer-world
