responding to outages maturely

Post on 01-Sep-2014

4.019 Views

Category:

Technology

0 Downloads

Preview:

Click to see full reader

DESCRIPTION

These

TRANSCRIPT

Responding to Outages Maturely

John AllspawSVP, Tech Ops

Code As Craft, Berlin

Tuesday, April 24, 12

OPERABILITY

Tuesday, April 24, 12

PRODUCTION

Tuesday, April 24, 12

http://WhoOwnsMyAvailability.com

Tuesday, April 24, 12

Tuesday, April 24, 12

How important is this?

Tuesday, April 24, 12

Tuesday, April 24, 12

Tuesday, April 24, 12

Tuesday, April 24, 12

Tuesday, April 24, 12

Tuesday, April 24, 12

Tuesday, April 24, 12

Tuesday, April 24, 12

Tuesday, April 24, 12

Tuesday, April 24, 12

Tuesday, April 24, 12

Tuesday, April 24, 12

Tuesday, April 24, 12

How important is this?

Tuesday, April 24, 12

How Can This Happen?

Tuesday, April 24, 12

Complicated? Complex?

Tuesday, April 24, 12

Complex Systems

• Cascading Failures

• Difficult to determine boundaries

• Complex systems may be open

• Complex systems may have a memory

• Complex systems may be nested

• Dynamic network of multiplicity

• May produce emergent phenomena

• Relationships are non-linear

• Relationships contain feedback loopsTuesday, April 24, 12

How Can This Happen?It does happen.And it will again.

And again.Tuesday, April 24, 12

Tuesday, April 24, 12

Optimization

MTBF

MTTRTuesday, April 24, 12

http://www.flickr.com/photos/sparktography/75499095/Tuesday, April 24, 12

How does team troubleshooting

happen?Tuesday, April 24, 12

Time

Problem Starts

DetectionEvaluation

ResponseStable

ConfirmationAll Clear Po

stMort

em

Tuesday, April 24, 12

Time

Problem Starts

DetectionEvaluation

ResponseStable

ConfirmationAll Clear

Stress

PostM

ortem

Tuesday, April 24, 12

Forced beyond learned roles

Actions whose consequences are both important and difficult to see

Cognitively and perceptively noisy

Coordinative load increases exponentiallyTuesday, April 24, 12

Tuesday, April 24, 12

So What Can We Do?

Tuesday, April 24, 12

We Learn From Others

Tuesday, April 24, 12

Characteristics of response to escalating scenarios

Tuesday, April 24, 12

...tend to neglect how processes develop within time (awareness of rates) versus assessing how things are in the moment

Characteristics of response to escalating scenarios

“On the Difficulties People Have in Dealing With Complexity” Dietrich Doerner, 1980

Tuesday, April 24, 12

...have difficulty in dealing with exponential developments (hard to imagine how fast something can change, or accelerate)

Characteristics of response to escalating scenarios

“On the Difficulties People Have in Dealing With Complexity” Dietrich Doerner, 1980

Tuesday, April 24, 12

...inclined to think in causal series, instead of causal nets.

A therefore B,

instead of

A, therefore B and C (therefore D and E), etc.

Characteristics of response to escalating scenarios

“On the Difficulties People Have in Dealing With Complexity” Dietrich Doerner, 1980

Tuesday, April 24, 12

Thematic Vagabonding

Pitfalls

Tuesday, April 24, 12

Pitfalls

Goal Fixation(encystment)

Tuesday, April 24, 12

Pitfalls

Refusal to make decisions

Tuesday, April 24, 12

Non-communicating lone wolf-isms

Heroism

Tuesday, April 24, 12

Irrelevant noise in comm channels

Distraction

Tuesday, April 24, 12

Jens Rasmussen, 1983Senior Member, IEEE

“Skills, Rules, and Knowledge; Signals, Signs, and Symbols, and Other Distinctions in Human Performance Models”IEEE Transactions On Systems, Man, and Cybernetics, May 1983

Tuesday, April 24, 12

SKILL - BASED

Simple, routineRULE - BASED

Knowable, but unfamiliarKNOWLEDGE - BASED

WTF IS GOING ON?(Reason, 1990)

Tuesday, April 24, 12

• Which causes did you consider first?

• Which ones did you not consider at all?

• How much of what you considered comes from recent history?

• How much comes from observations from other team members?

Team Troubleshooting

Tuesday, April 24, 12

• How effective is the response team in communicating to other groups? Users?

• How long does it take to exhaust obvious cause(s)?

Team Troubleshooting

Tuesday, April 24, 12

Team Dynamics

Tuesday, April 24, 12

• Air Traffic Control

• Naval Air Operations At Sea

• Electrical Power Systems

• Etc.

High Reliability Organizations

• Complex Socio-Technical systems

• Efficiency <-> Thoroughness

• Time/Resource Constrained

• Engineering-driven

Tuesday, April 24, 12

Tuesday, April 24, 12

“The Self-Designing High-Reliability Organization: Aircraft Carrier Flight Operations at Sea”Rochlin, La Porte, and Roberts. Naval War College Review 1987

http://govleaders.org/reliability.htm

Tuesday, April 24, 12

Tuesday, April 24, 12

Close interdependence between groups

Tuesday, April 24, 12

Close reciprocal coordination and information sharing, resulting in overlapping knowledge

Tuesday, April 24, 12

High redundancy: multiple people observing the same event and sharing information

Tuesday, April 24, 12

Broad definition of who belongs to the team.

Tuesday, April 24, 12

Teammates are included in the communication loops rather than excluded.

Tuesday, April 24, 12

Lots of error correction.

Tuesday, April 24, 12

High levels of situation comprehension: maintain constant awareness of the possibility of accidents.

Tuesday, April 24, 12

High levels of interpersonal skills

Tuesday, April 24, 12

Maintenance of detailed records of past incidents that are closely examined with a view to learning from them.

Tuesday, April 24, 12

Patterns of authority are changed to meet the demands of the events: organizational flexibility.

Tuesday, April 24, 12

The reporting of errors and faults is rewarded, not punished.

Tuesday, April 24, 12

So What ElseCan We Do?

Tuesday, April 24, 12

We Drill

Tuesday, April 24, 12

We GameDay

Tuesday, April 24, 12

Tuesday, April 24, 12

We Learn To Improvise

Tuesday, April 24, 12

IMPROVISATION

Tuesday, April 24, 12

IMPROVISATION

Tuesday, April 24, 12

We Learn From Our Mistakes

Tuesday, April 24, 12

Postmortems

• Full timelines: What happened, when, who involved

• Review in public, everyone invited

• Search for “second stories” instead of “human error”

• Cultivating a blameless environment

• Giving requisite authority to individuals to improve things

Tuesday, April 24, 12

High signal:noise in comm channels?

Troubleshooting fatigue?

Troubleshooting handoff?

All tools on-hand and working?

Improvised tooling or solutions?

Metrics visibility?

Collaborative and skillful communication?

Qualifying Response

Tuesday, April 24, 12

Remediation

Tuesday, April 24, 12

We Share Near-MissEvents

Tuesday, April 24, 12

Near MissesHey everybody -

Don’t be like me. I tried to X, but that wasn’t a good idea.

It almost exploded everyone.

So, don’t do: (details about X)

Love, Joe

Tuesday, April 24, 12

• Can act like “vaccines” - help system safety without actually hurting anything

• Happen more often, so provide more data on latent failures

• Powerful reminder of hazards, and slows down the process of forgetting to be afraid

Near Misses

Tuesday, April 24, 12

Practice!

• How we troubleshoot in the moment, as a distributed team

• How we handle time pressure

• How we Observe/Orient/Decide/Act

• How we communicate during emergencies

• How we trust (or not) each other during emergencies

• How we relate to emergencies when things are normal

• How we could detect how we are protected during normal times (i.e., why aren’t we going down RIGHT NOW?)

Tuesday, April 24, 12

Resilient Response

• Can learn from other fields

• Can train for outages

• Can learn from mistakes

• Can learn from successes as well as failures

Tuesday, April 24, 12

http://www.flickr.com/photos/sparktography/75499095/Tuesday, April 24, 12

THE END

Tuesday, April 24, 12

A parting wordA parting challenge

Tuesday, April 24, 12

Two Propositions

Tuesday, April 24, 12

100 changes

6 change-related issuesTuesday, April 24, 12

100 > 6

Tuesday, April 24, 12

Proposition #1

“Ways in which things go right are special cases of the ways in which things go wrong.”

Tuesday, April 24, 12

Proposition #1

Successes = failures gone wrong

Study the failures, generalize from that.

Potential data sources: 6 out of 100

Tuesday, April 24, 12

Proposition #2

“Ways in which things go wrong are special cases of the ways in which things go right.”

Tuesday, April 24, 12

Proposition #2

Failures = successes gone wrongStudy the successes, generalize from that

Potential data sources: 94 out of 100Tuesday, April 24, 12

94/100 ?

6/100 ?

OR

Tuesday, April 24, 12

What and WHY Do Things Go RIGHT?

Tuesday, April 24, 12

Not just: why did we fail?

But also: why did we succeed?

Tuesday, April 24, 12

Mature Role of Automation

http://www.bainbrdg.demon.co.uk/Papers/Ironies.html

“Ironies of Automation” - Lisanne Bainbridge

Tuesday, April 24, 12

Mature Role of Automation

• Moves humans from manual operator to supervisor

• Extends and augments human abilities, doesn’t replace it

• Doesn’t remove “human error”

• Are brittle

• Recognize that there is always discretionary space for humans

• Recognizes the Law of Stretched Systems

Tuesday, April 24, 12

Law of Stretched Systems

“Every system is stretched to operate at its capacity; as soon as there is some improvement, for example, in the form of new technology, it will be exploited to achieve a new intensity and tempo of activity”

D. Woods, E. Hollnagel, “Joint Cognitive Systems: Patterns” 2006

Tuesday, April 24, 12

top related