DATA-DRIVEN POSTMORTEMS JASON YEE, DATADOG @GITBISECT


Post on 03-Jan-2021


Page 1: data-driven postmortems 16x9-rev4-dodams-30m...DATA-DRIVEN POSTMORTEMS JASON YEE, DATADOG @GITBISECT @gitbisect Technical Writer/Evangelist “Docs & Talks” Travel Hacker & Whiskey

DATA-DRIVEN POSTMORTEMS

JASON YEE, DATADOG @GITBISECT

Page 2:

about me:

▸ @gitbisect

▸ Technical Writer/Evangelist

▸ “Docs & Talks”

▸ Travel Hacker & Whiskey Hunter

about Datadog:

▸ @Datadoghq

▸ SaaS-based monitoring platform

▸ Trillions of data points per day

▸ We’re hiring! bit.ly/datadog-jobs

Page 3:

“The problems we work on at Datadog are hard and often don't have obvious, clean-cut solutions, so it's useful to cultivate your troubleshooting skills, no matter what role you work in.”

Internal Datadog Developer Guide

TW: @gitbisect @datadoghq

Page 4:

“THE ONLY REAL MISTAKE IS THE ONE FROM WHICH WE LEARN NOTHING.” - HENRY FORD

Page 5:

COLLECTING DATA IS CHEAP; NOT HAVING IT WHEN YOU NEED IT CAN BE EXPENSIVE

SO INSTRUMENT ALL THE THINGS!
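
As a rough illustration of what instrumenting a code path can look like, here is a minimal sketch using only the Python standard library. The in-memory `METRICS` store, the `timed` decorator, and the metric name are all hypothetical stand-ins; a real setup would send these values to a monitoring backend rather than keep them in a dictionary.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory store: metric name -> list of recorded durations (seconds).
METRICS = defaultdict(list)


def timed(metric_name):
    """Record how long each call to the wrapped function takes."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Recorded even when the call raises, so failures are measured too.
                METRICS[metric_name].append(time.perf_counter() - start)
        return wrapper
    return decorator


@timed("checkout.duration_seconds")  # hypothetical metric name
def checkout(cart):
    return sum(cart)


checkout([5, 10])
print(len(METRICS["checkout.duration_seconds"]), "timing sample(s) recorded")
```

The point of the sketch is that the measurement costs one decorator line per function, which is cheap compared with not having the timing data during an incident.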

Page 7:

4 QUALITIES OF GOOD METRICS

NOT ALL METRICS ARE EQUAL

Page 8:

1. MUST BE WELL UNDERSTOOD

Page 9:

2. SUFFICIENT GRANULARITY

Page 10:

▸ 1-second granularity: peak 46%

▸ 1-minute granularity: peak 36%

▸ 5-minute granularity: peak 12%
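
The shrinking peaks above are what averaging over longer windows does to a short spike. The sketch below reproduces the effect with made-up data: the sample series, the burst size, and the function name `peak_at_granularity` are assumptions for illustration, so the numbers will not match the slide's 46% / 36% / 12%.

```python
from statistics import mean


def peak_at_granularity(samples, window):
    """Average per-second samples into `window`-second buckets and return the highest bucket average."""
    buckets = [samples[i:i + window] for i in range(0, len(samples), window)]
    return max(mean(bucket) for bucket in buckets)


# Hypothetical data: 300 seconds at 10% utilisation with one 10-second burst at 46%.
samples = [10.0] * 300
samples[100:110] = [46.0] * 10

for window in (1, 60, 300):
    print(f"{window:>3}s granularity: peak {peak_at_granularity(samples, window):.1f}%")
# 1s -> 46.0%, 60s -> 16.0%, 300s -> 11.2%
```

The burst is fully visible at 1-second granularity and almost disappears at 5 minutes, which is why coarse rollups can hide the very event a postmortem is trying to explain.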

Page 11:

3. TAGGED & FILTERABLE
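
To make "tagged and filterable" concrete, here is a small sketch of metric points that carry key/value tags and a helper that slices them by any tag combination. The `Point` class, the `filter_points` helper, and the tag names (`env`, `region`) are hypothetical and are not taken from any particular monitoring product.

```python
from dataclasses import dataclass, field


@dataclass
class Point:
    name: str
    value: float
    tags: dict = field(default_factory=dict)


def filter_points(points, name, **required_tags):
    """Return points for metric `name` whose tags match every required key/value pair."""
    return [
        p for p in points
        if p.name == name
        and all(p.tags.get(key) == value for key, value in required_tags.items())
    ]


# Hypothetical data points.
points = [
    Point("http.errors", 3, {"env": "prod", "region": "eu-west"}),
    Point("http.errors", 40, {"env": "prod", "region": "us-east"}),
    Point("http.errors", 7, {"env": "staging", "region": "us-east"}),
    Point("http.latency_ms", 120, {"env": "prod", "region": "us-east"}),
]

prod_errors = filter_points(points, "http.errors", env="prod")
print(sum(p.value for p in prod_errors))  # 43

us_east_prod = filter_points(points, "http.errors", env="prod", region="us-east")
print([p.value for p in us_east_prod])  # [40]
```

With tags, the same metric answers both "how bad was it overall?" and "which region was responsible?" without defining a new metric per host or region.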

Page 14:

4. LONG-LIVED

Page 19:

RECURSE UNTIL YOU FIND THE TECHNICAL CAUSES

Page 20:

HUMAN ELEMENT

TECHNICAL ISSUES HAVE NON-TECHNICAL CAUSES

Page 21:

IF YOU’RE STILL RESPONDING TO THE INCIDENT, IT’S NOT TIME FOR A POSTMORTEM

Page 22:

HUMAN DATA

DATA COLLECTION: WHO?

▸ Everyone!

▸ Responders

▸ Identifiers

▸ Affected Users

Page 23:

HUMAN DATA

DATA COLLECTION: WHAT?

▸ Their perspective

▸ What they did

▸ What they thought

▸ Why they thought/did it

Page 24:

“WRITING IS NATURE’S WAY OF LETTING YOU KNOW HOW SLOPPY YOUR THINKING IS.”

RICHARD GUINDON

Page 25:

TELLING STORIES

“A PICTURE IS WORTH A THOUSAND WORDS” - ANCIENT PROVERB

Page 26:

HUMAN DATA

DATA COLLECTION: WHEN?

▸ As soon as possible.

▸ Memory drops sharply within 20 minutes

▸ Susceptibility to “false memory” increases

▸ Get your project managers involved!

Page 27:

HUMAN DATA

DATA SKEW/CORRUPTION

▸ Stress

▸ Sleep deprivation

▸ Burnout

Page 28:

HUMAN DATA

DATA SKEW/CORRUPTION

▸ Blame

▸ Fear of punitive action

Page 29:

HUMAN DATA

DATA SKEW/CORRUPTION

▸ Bias

▸ Anchoring

▸ Hindsight

▸ Outcome

▸ Availability (Recency)

▸ Bandwagon Effect

Page 30:

HOW WE DO POSTMORTEMS AT DATADOG

Page 31:

DATADOG POSTMORTEMS

A FEW NOTES

▸ Postmortems emailed company-wide

▸ Scheduled recurring postmortem meetings

Page 32:

DATADOG’S POSTMORTEM TEMPLATE (1/5)

SUMMARY: WHAT HAPPENED?

▸ Describe what happened here at a high level; think of it as an abstract in a scientific paper.

▸ What was the impact on customers?

▸ What was the severity of the outage?

▸ What components were affected?

▸ What ultimately resolved the outage?

Page 35:

DATADOG’S POSTMORTEM TEMPLATE (2/5)

HOW WAS THE OUTAGE DETECTED?

▸ We want to make sure we detected the issue early and would catch the same issue if it were to repeat.

▸ Did we have a metric that showed the outage?

▸ Was there a monitor on that metric?

▸ How long did it take for us to declare an outage?
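
The last question above is easiest to answer from timestamps rather than from memory. A minimal sketch, assuming three hypothetical events (impact start, first monitor alert, outage declared) have been pulled from graphs and chat logs; the event names and times are invented for illustration.

```python
from datetime import datetime


def minutes_between(earlier, later):
    """Elapsed minutes between two datetimes."""
    return (later - earlier).total_seconds() / 60


# Hypothetical timestamps; in practice these come from metric graphs, monitors, and chat archives.
timeline = {
    "impact_started": datetime(2020, 6, 1, 14, 0),
    "monitor_alerted": datetime(2020, 6, 1, 14, 4),
    "outage_declared": datetime(2020, 6, 1, 14, 12),
}

time_to_detect = minutes_between(timeline["impact_started"], timeline["monitor_alerted"])
time_to_declare = minutes_between(timeline["impact_started"], timeline["outage_declared"])

print(f"Time to detect:  {time_to_detect:.0f} min")   # 4 min
print(f"Time to declare: {time_to_declare:.0f} min")  # 12 min
```

Separating detection from declaration shows whether the delay came from missing monitoring or from the human decision to call it an outage.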

Page 38:

DATADOG’S POSTMORTEM TEMPLATE (3/5)

HOW DID WE RESPOND?

▸ Who was the incident owner & who else was involved?

▸ Slack archive links and timeline of events!

▸ What went well?

▸ What didn’t go so well?

Page 39:

*Names changed

Page 40:

CHATOPS ARCHIVES FTW!

*Names changed

Page 41:

*Names changed

TRACK LEARNINGS AS YOU GO

Page 42:

DATADOG’S POSTMORTEM TEMPLATE (4/5)

WHY DID IT HAPPEN?

▸ Deep dive into the cause

▸ Examples from this incident:

▸ http://bit.ly/dd-statuspage

▸ http://bit.ly/alq-postmortem

Page 43:

DATADOG’S POSTMORTEM TEMPLATE (5/5)

HOW DO WE PREVENT IT IN THE FUTURE?

▸ Link to GitHub issues and Trello cards

▸ Now?

▸ Next?

▸ Later?

▸ Follow up notes

Page 44:

*Names changed

Page 45:

DATADOG’S POSTMORTEM TEMPLATE

RECAP:

▸ What happened (summary)?

▸ How did we detect it?

▸ How did we respond?

▸ Why did it happen (deep dive)?

▸ Actionable next steps!
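
One way to keep every postmortem on this five-part shape is to generate the empty document from a single definition. The sketch below is an assumption about how that could be done, not Datadog's actual tooling: the `TEMPLATE` structure and `render_skeleton` function are hypothetical, and the prompts are paraphrased from the template slides.

```python
# Section headings and prompts paraphrased from the template slides.
TEMPLATE = [
    ("Summary: what happened?", [
        "What was the impact on customers?",
        "What was the severity of the outage?",
        "What components were affected?",
        "What ultimately resolved the outage?",
    ]),
    ("How was the outage detected?", [
        "Did we have a metric that showed the outage?",
        "Was there a monitor on that metric?",
        "How long did it take for us to declare an outage?",
    ]),
    ("How did we respond?", [
        "Who was the incident owner and who else was involved?",
        "Chat archive links and timeline of events",
        "What went well?",
        "What didn't go so well?",
    ]),
    ("Why did it happen?", [
        "Deep dive into the cause",
    ]),
    ("How do we prevent it in the future?", [
        "Now?",
        "Next?",
        "Later?",
        "Follow-up notes",
    ]),
]


def render_skeleton(incident_title):
    """Return a plain-text postmortem skeleton with numbered sections and prompts."""
    lines = [f"POSTMORTEM: {incident_title}", ""]
    for number, (heading, prompts) in enumerate(TEMPLATE, start=1):
        lines.append(f"{number}. {heading}")
        lines.extend(f"   - {prompt}" for prompt in prompts)
        lines.append("")
    return "\n".join(lines)


print(render_skeleton("Example incident (hypothetical)"))
```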

Page 46:

KEEP LEARNING

MORE RESOURCES

▸ Postmortem Template: http://bit.ly/postmortem-template

▸ The Infinite Hows - John Allspaw: http://bit.ly/infinite-hows

Page 47:

SLIDES: bit.ly/dod-ams-postmortems

QUESTIONS: @gitbisect | [email protected]