Re-thinking Incident Response Automation
Kiran Gollu, Co-founder/CEO
Neptune.io © 2015
Brief Intro: Myself & Neptune
• Architected an incident response automation platform for AWS
• Founding team at Amazon S3, DynamoDB for 5 years
Strong engineering-heavy team
Agenda
• State of incident response automation today
• Our learnings from building such a platform for AWS/Neptune
• Best practices
• Intro to Neptune
• Examples: incident response workflows
• Q/A
What is Incident Response?
How to handle incidents/outages?
Many more..
Alerts
Incident response automation is broken today!
Source: DevOps survey; VictorOps incident response
95% of time to recovery (TTR) is still manual today
Figure: incident lifecycle phases (Alert; Troubleshooting: Triage | Investigate | Identify; Resolution; Documentation); values shown: 73%, 10%, 5%, 12%
Snapshots
  • Graphs & metrics
  • Logs
  • Webpages
Service health checks
  • Internal
  • External
Host/App diagnostics
  • "top", "df -H" etc.
  • Heap dumps/Stack traces
Runbooks
  • On single host/cluster of hosts
  • Any script, any language
Cloud API/CLI actions
  • Start/Stop/Reboot
  • Scale up/down
Root-cause analysis & Audit
  • Heap dumps
  • Logs
  • Graphs
Post-mortem
  • History
  • Diagnostics
What has changed?
• Automation, uptime, and agility: #1 priority for businesses
  • e.g. people can’t imagine Gmail going down
• The number of servers, VMs, containers, and apps launched is exploding!
• Maintenance has become a huge burden
  • 13 different tools for managing an app
• Difficult to track down the root cause and what’s going on where
  • Cloud, dynamic environments => knowledge sharing is a problem
Typical incident takes 1-2 hours to diagnose & fix
Big companies built custom automation tools internally
FBAR: Facebook Auto Remediation platform. “…It’s doing the work of approximately 200 sys admins…”
“We built one for Amazon Web Services!”
40-60% of alerts get fixed automatically without human intervention
The rest are diagnosed in minutes instead of hours
Key takeaways
• Uptime, automation, and agility are critical drivers for your business
• Incident response automation gives you:
  • More uptime, better customer experience
  • Reduction in MTTR
  • Happier engineers
Maturity level of Incident Response Teams
Source: @jpaulreed, @kfinnbraun, DevOps Enterprise Summit
3 core pieces of an incident response platform
1. Analytics
• Helps identify the top 20% of alerts causing 80% of the pain
  • Sorted by frequency and MTTR
• Capture:
  • MTTA (mean time to acknowledgement)
  • MTTR (mean time to resolution)
  • Frequency of occurrence (number of times a particular alert has occurred)
• Reporting + Auditing
  • Audit all activity (both manual + automated)
  • Leads to data-driven post-mortems
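The three metrics above can be computed from a plain incident log. The following is a minimal sketch, not Neptune's implementation: the `Incident` record, its field names, and the `summarize` function are hypothetical, and timestamps are assumed to be epoch seconds. It groups incidents by alert name and sorts by frequency, then MTTR, as the slide suggests.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record layout; all timestamps are epoch seconds.
    alert: str
    triggered: float
    acknowledged: float
    resolved: float


def summarize(incidents):
    """Return per-alert stats sorted by frequency, then MTTR (both descending)."""
    groups = defaultdict(list)
    for inc in incidents:
        groups[inc.alert].append(inc)

    rows = []
    for alert, items in groups.items():
        rows.append({
            "alert": alert,
            "frequency": len(items),
            # MTTA: mean time from trigger to acknowledgement
            "mtta": mean(i.acknowledged - i.triggered for i in items),
            # MTTR: mean time from trigger to resolution
            "mttr": mean(i.resolved - i.triggered for i in items),
        })
    rows.sort(key=lambda r: (r["frequency"], r["mttr"]), reverse=True)
    return rows
```

The first few rows of the result are the candidates for automation: the alerts that fire most often and take longest to resolve.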
2. Context
• When an alert occurs:
  • Gather context automatically from 13 different tools
  • Monitoring tools, logging tools, health checks, dependent services
Use cases:
  • High memory => capture top-10 memory hogs, memory usage graphs
  • High app error rate => capture error rate, latency trends, app logs with 5xx errors
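One way to picture automatic context gathering is a registry that maps each alert type to a list of collectors. The sketch below is illustrative only: the alert dictionary shape, the `COLLECTORS` table, and the collector functions are hypothetical stand-ins that return strings instead of calling real monitoring or logging tools.

```python
from typing import Callable, Dict, List

Collector = Callable[[dict], str]


# Placeholder collectors: real ones would query monitoring/logging tools.
def top_memory_hogs(alert: dict) -> str:
    return f"top-10 memory hogs on {alert['host']}"


def memory_usage_graph(alert: dict) -> str:
    return f"memory usage graph for {alert['host']}"


def error_rate_trend(alert: dict) -> str:
    return f"error rate and latency trend for {alert['host']}"


def app_logs_5xx(alert: dict) -> str:
    return f"app logs with 5xx errors from {alert['host']}"


# Hypothetical mapping from alert type to the context worth capturing.
COLLECTORS: Dict[str, List[Collector]] = {
    "high_memory": [top_memory_hogs, memory_usage_graph],
    "high_error_rate": [error_rate_trend, app_logs_5xx],
}


def gather_context(alert: dict) -> Dict[str, str]:
    """Run every collector registered for the alert type."""
    context = {}
    for collector in COLLECTORS.get(alert["type"], []):
        try:
            context[collector.__name__] = collector(alert)
        except Exception as exc:
            # One failing tool should not block the other collectors.
            context[collector.__name__] = f"collection failed: {exc}"
    return context
```

The collected context would then be attached to the alert before it reaches the on-call engineer.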
3. Remediation
• When an alert occurs:
  • If it’s a known alert => run a remediation runbook
• Use cases:
  • Process crashed => restart process
  • Host is unpingable => restart up to 3 times and escalate if it still fails
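The "restart 3 times and escalate" use case is a bounded retry loop with a health check after each attempt. A minimal sketch, assuming the caller supplies three hypothetical callables (`action`, `check`, and `escalate`); none of these names come from Neptune.

```python
def remediate(action, check, escalate, max_attempts=3):
    """Run `action` up to `max_attempts` times, then escalate to a human.

    action:   performs the fix, e.g. restarts a host (hypothetical callable)
    check:    returns True once the alert condition has cleared
    escalate: pages the on-call engineer
    """
    for attempt in range(1, max_attempts + 1):
        action()
        if check():
            return f"resolved after {attempt} attempt(s)"
    # The fix did not help: stop retrying and hand over to a person.
    escalate()
    return "escalated"
```

A real runbook would typically wait between attempts so the host has time to come back before the health check runs.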
Our learnings
• Automate simple things first
• Have checks in place to avoid cascading failures
  • Don’t automatically fix when you don’t know the root cause
• We started with more focus on remediation, but customers really wanted automated context gathering
• Customers were not at the maturity level we expected, though they’d like to be
• Security is of paramount importance
  • Customers prefer vetted runbooks to running arbitrary scripts
• Use GitHub or Chef/Puppet recipes for runbooks (code reviewed/vetted)
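One possible check against cascading failures is to cap how many automated actions may run in a time window, so a bad rule cannot restart an entire fleet at once. This is a sketch of that idea only; the `RemediationGuard` class and its parameters are hypothetical and the slides do not say which checks Neptune uses.

```python
import time
from collections import deque


class RemediationGuard:
    """Allow at most `max_actions` automated actions per sliding window."""

    def __init__(self, max_actions, window_seconds, clock=time.monotonic):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self._clock = clock      # injectable so the guard can be tested
        self._stamps = deque()   # timestamps of recently allowed actions

    def allow(self):
        now = self._clock()
        # Forget actions that have fallen out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_actions:
            return False         # too many recent fixes: escalate instead
        self._stamps.append(now)
        return True
```

When `allow()` returns `False`, the safer path is to skip the automated fix and page a human with the gathered context.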
Neptune: Incident Response Automation-as-a-Service
IRA as a Service Monitoring as a Service Alerting as a Service
Existing tools just alert somebody, without any context or diagnostics
We provide diagnostics for unknown issues, and for known issues, we fix them automatically
Deployment Models
• SaaS Model - available today!
  • GitHub/vetted runbooks
• On-premise AWS VPC deployment model – available today!
  • Enterprise customers
• On-premise deployment model (roadmap)
Deep Dive: Architecture
Event Queue
Policy-based Rule Engine
Action Queue
Neptune Web Service
Dedicated Queue Per customer
Publish action results
Neptune Agent
REST API-based Runbook repo
Custom Tool
Read-only
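The diagram labels above suggest a flow of events through a rule engine into per-customer action queues that agents drain. The sketch below is one reading of that flow under stated assumptions, not the actual Neptune architecture: the `RULES` table, the event fields, and both functions are hypothetical, and running a runbook is replaced by a stand-in status.

```python
import queue
from collections import defaultdict

# Hypothetical policy rules: (predicate over an event, runbook to run).
RULES = [
    (lambda e: e["alert"] == "process_crashed", "restart_process"),
    (lambda e: e["alert"] == "high_memory", "collect_memory_hogs"),
]


def run_rule_engine(event_queue, action_queues):
    """Drain the event queue and enqueue matching actions per customer."""
    while not event_queue.empty():
        event = event_queue.get()
        for matches, runbook in RULES:
            if matches(event):
                # Dedicated queue per customer, as in the diagram.
                action_queues[event["customer"]].put(
                    {"host": event["host"], "runbook": runbook}
                )


def run_agent(customer, action_queues):
    """Agent side: pull actions from the customer's queue, return results."""
    results = []
    actions = action_queues[customer]
    while not actions.empty():
        action = actions.get()
        # Stand-in for executing the runbook and publishing its result.
        results.append({**action, "status": "done"})
    return results
```

Because the agent pulls work rather than receiving pushed commands, this shape fits an agent that only needs outbound access.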
Quick Demo
Use case 1: Auto-Remediation
Host-level alert – high memory
  • Collect top-10 memory hogs
  • Restart the process
Use case 2: Auto-Diagnosis
App-level alert – high error rate
  • Collect graph snapshots, logs
  • Run script on a cluster of machines
Sample error rate incident today (before Neptune)
Sample error rate incident (after Neptune)
You can get started in 10 minutes
1. Configure your monitoring tool to send alerts to Neptune
2. Install a lightweight agent on a few servers
SaaS Model: Why are we secure?
• Go-based agent:
  • No dependencies (agent code is open source)
  • Outbound access only
  • No need to open any inbound firewall ports
• Agent is lightweight, dumb, and not chatty
  • Sits idle unless there is something to do
  • Consumes < 0.01% CPU and 20 MB of memory
• Authentication:
  • Leverages AWS STS token auth: uses temporary credentials, rotated every 4 hours
  • Neptune API_KEY
Refer to On-prem AWS VPC deployment model if SaaS doesn’t work for you
SaaS Model: You Control Runbooks
• All Runbooks stay within your firewall
• Runbooks are version controlled (e.g. GitHub)
• No one can edit your runbooks
  • Even the agent has read-only access to the runbook repository
Refer to On-prem AWS VPC deployment model if SaaS doesn’t work for you