incidents & accidents · @mattstratton disclaimer, part the second: this is a topic with a...

54
Matty Stratton, DevOps Evangelist, PagerDuty @mattstratton Incidents & Accidents

Upload: others

Post on 13-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

Matty Stratton, DevOps Evangelist, PagerDuty

@mattstratton

Incidents & Accidents

Page 2: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Gratuitous slide of kids to get you

on my side

Page 3: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details
Page 4: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Disclaimer, part the first: Learn from other industries,

do not take on their stresses.

Page 5: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Disclaimer, part the second: This is a topic with a surprisingly

large number of details.

Page 6: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

“Peacetime”

PE

AC

ETI

ME W

AR

TIME

Page 7: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

“Peacetime”NO

RM

AL

EM

ER

GE

NC

Y

Page 8: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

“Peacetime”

OK

NO

T OK

Page 9: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Before, during, after

Page 10: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Before

Page 11: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Have criteria defined for when to have and not have a call.

Page 12: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Any unplanned disruption or degradation of service that is actively affecting customers’

ability to use the product.

Page 13: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Post incident criteria widely.Don’t litigate during a call.

Page 14: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Monitor the business criteria,and act accordingly.

Page 15: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

People are expensive.

Page 16: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Practice still makes perfect.

Page 17: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

“Know your role”

Page 18: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Have a clear understanding of who is supposed to be

involved in each role.

Page 19: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

https://www.fema.gov/national-incident-management-system

• “National Incident Management System” (NIMS)

• Incident Command System (ICS).

• Standardized system for emergency response.

• Hierarchical role structure.

• Provides a common management framework.

• Originally developed for CA wildfire response.

Page 20: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

OPERATIONS

LIAISONS

COMMAND

Deputy Scribe

Customer Liaison

Subject Matter Expert (SME)

Subject Matter Expert (SME)

Subject Matter Expert (SME)

Subject Matter Expert (SME)

Subject Matter Expert (SME)

Incident Commander (IC)

Internal Liaison

Page 21: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

During

Page 22: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

I’m Matty.

I’m the Incident Commander.

Page 23: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Single source of reference.

Page 24: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Becomes the highest authority.

(Yes, even higher than the CEO)

Page 25: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Not a resolver. Coordinates and delegates.

KEY TAKEAWAY

Page 26: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

DON’T DO THIS

Let’s get the IC on the RC, then get a BLT for all the SME’s.

Page 27: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Clear is better than concise.

KEY TAKEAWAY

Page 28: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

The IC manages theflow of conversation.

Page 29: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

What’s wrong?

Page 30: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

What actions can we take?

Page 31: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

What are the risks involved?

Page 32: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

“Can someone…”

Page 33: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Rich, I’d like you to investigate the increased latency, try to find the cause. I’ll come back to you in 5 minutes. Understood?

Understood.

Page 34: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Humor is best in context.

Page 35: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

DT5: Roger that

GND: Delta Tug 5, you can go right on bravo

DT5: Right on bravo, taxi.

(…): Testing, testing. 1-2-3-4.

GND: Well, you can count to 4. It’s a step in the right direction. Find another frequency to test on now.

(…): Sorry

Page 36: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Have a clear rosterof who’s been engaged.

Page 37: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Rally fast, disband faster.

Page 38: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Have a way to contribute information to the call.

Page 39: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Have a clear mechanism for making decisions.

Page 40: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

“IC, I think we should do X” “The proposed action is X,

is there any strong objection?”

Page 41: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Capture everything, and call out what’s important now vs. later.

Page 42: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

“One last thing…” (Assign an owner at the

end of an incident)

Page 43: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

After

Page 44: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

“After action reports”, “Postmortems”,

“Learning Reviews”

Page 45: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

The impact to people is a part of your incident review as well.

Page 46: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Record incident calls,review them afterwards.

Page 47: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Regularly review the incident process itself.

Page 48: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Have structure in place beforehand

Practice, practice, practice

Have clearly delineated roles

Manage the conversation flow

Make clear decisions

Rally fast, disband faster

Review regularly

Page 49: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Page 50: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Don’t panic. Stay calm.

Calm people stay alive.

Page 51: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

https://response.pagerduty.com

Page 52: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Page 53: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Resources:

• Angry Air Traffic Controllers and Pilots - https://youtu.be/Zb5e4SzAkkI • Blameless Post-Mortems (Etsy Code as Craft) - https://codeascraft.com/

2012/05/22/blameless-postmortems/ • Incidents And Accidents: Examining Failure Without Blame (Arrested DevOps)

- https://www.arresteddevops.com/blameless/ • PagerDuty Incident Response Process - https://response.pagerduty.com/

Page 54: Incidents & Accidents · @mattstratton Disclaimer, part the second: This is a topic with a surprisingly large number of details

@mattstratton

Thank you!

Questions?