How to handle incidents, downtime & outages
Devopsdays, Amsterdam 2015
David Mytton, Founder, Server Density
Cost of uptime?
• $2.9bn (Q1 2015)
• $870m (Q1 2015)
• $4.1bn (Q1 2015)
How much are you spending?
Expect downtime
• Prepare
• Respond
• Postmortem
Prepare
• On call
• Primary/secondary
• Reachability
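A minimal sketch of how a primary/secondary rota might be expressed in code. The slide only names the two roles; the `Responder` type, the `who_to_page` helper, and the 15-minute escalation timeout are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Responder:
    name: str
    phone: str  # reachability: a contact route that works out of hours


def who_to_page(rota, minutes_unacknowledged, escalate_after=15):
    """Return the responders to page for an unacknowledged alert.

    `rota` is a (primary, secondary) pair. The primary is always paged;
    the secondary is added once the alert has gone unacknowledged for
    `escalate_after` minutes (an assumed threshold, not from the talk).
    """
    primary, secondary = rota
    if minutes_unacknowledged < escalate_after:
        return [primary]
    return [primary, secondary]
```

For example, `who_to_page(rota, 0)` would return only the primary, while `who_to_page(rota, 20)` would return both responders under the assumed 15-minute threshold.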
Prepare
• On call
• Off call
• Docs
• Searchable
• Independent
Prepare
• Key info
• Team contacts
• Vendor contacts
• Key credentials
Prepare
• Key info
• Unexpected situations
• Communication
• Internet access
• Support access
Respond
• First responder
1. Load incident response checklist
2. Log into Ops War Room
3. Log incident in JIRA
4. Begin investigation
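The first responder steps above could be kept as an ordered checklist that a tool walks through. This is only a sketch: the step names come from the slide, while the `CHECKLIST` constant and the `next_step` helper are hypothetical names introduced here.

```python
# Steps taken verbatim from the slide, in order.
CHECKLIST = [
    "Load incident response checklist",
    "Log into Ops War Room",
    "Log incident in JIRA",
    "Begin investigation",
]


def next_step(done):
    """Return the first checklist step not yet in `done`, or None when all are complete."""
    for step in CHECKLIST:
        if step not in done:
            return step
    return None
```

Calling `next_step(set())` returns the first step, and passing every step in `done` returns `None`, which a responder tool could treat as "checklist complete".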
Respond
• Key response principles
• Log everything
• Frequent public updates
• Gather the team
• Escalate!
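As one way to picture "log everything" and "frequent public updates" together, the sketch below keeps a timestamped incident log and reports when a public update is overdue. The `IncidentLog` class and the 30-minute update interval are assumptions for illustration; the talk does not prescribe a tool or a cadence.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class IncidentLog:
    # Assumed cadence for public status updates; adjust to your own policy.
    update_interval: timedelta = timedelta(minutes=30)
    entries: list = field(default_factory=list)
    last_public_update: Optional[datetime] = None

    def log(self, message, when=None, public=False):
        """Record a timestamped entry; public entries reset the update clock."""
        when = when or datetime.now(timezone.utc)
        self.entries.append((when, message, public))
        if public:
            self.last_public_update = when

    def public_update_due(self, now=None):
        """True if no public update has been posted yet, or the last one is too old."""
        now = now or datetime.now(timezone.utc)
        if self.last_public_update is None:
            return True
        return now - self.last_public_update >= self.update_interval
```

Internal notes go in with `public=False`, so the same log can later feed the postmortem timeline while `public_update_due()` nudges the team to keep customers informed.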
Postmortem
• Within a few days
• Tell the story
• Appropriate technical detail
• What failed, why?
• How it’s going to be fixed