Five causes of alert fatigue -- and how to prevent them
DESCRIPTION
“Alert Spam” is a major recurring pain brought up by Ops teams: the constant flood of noisy alerts from your monitoring stack. This presentation discusses five types of spammy alerts that we hear about most often (and how we’d like to see them resolved). Most of them will sound familiar to you.

TRANSCRIPT
Alert Fatigue - and what to do about it
Elik Eizenberg, VP R&D
http://www.bigpanda.io
alert fatigue (noun)
A constant flood of noisy, non-actionable alerts, generated by your monitoring stack.
Synonyms: alert overload, alert spam
Poor Signal-to-Noise Ratio
Delayed Response
Wrong Prioritization
Constant Context Switching
Common Pitfalls
Alert Per Host

What you see: 20 critical Nagios / Zabbix alerts, all at once.

What happened:
- Unexpected traffic to your app
- You get an alert from practically every host in the cluster

In an ideal world:
- 1 alert, indicating 80% of the cluster has problems
- Don’t wake me up unless at least some % of the cluster is down
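The "don't page me until some % of the cluster is down" idea can be sketched in a few lines. This is a minimal illustration, not any real monitoring API; the function name and threshold are made up for the example:

```python
# Hypothetical sketch: collapse per-host alerts into one cluster-level alert.

def cluster_alert(alerting_hosts, cluster_size, threshold=0.5):
    """Return one summary alert only when the fraction of alerting
    hosts crosses the threshold; otherwise stay quiet."""
    fraction = len(alerting_hosts) / cluster_size
    if fraction >= threshold:
        return (f"{fraction:.0%} of cluster has problems "
                f"({len(alerting_hosts)}/{cluster_size} hosts)")
    return None  # below threshold: suppress, don't wake anyone up

# 16 of 20 hosts firing -> one aggregated alert; 2 of 20 -> silence
print(cluster_alert([f"web{i}" for i in range(16)], 20))
print(cluster_alert(["web1", "web2"], 20))
```

The key design choice is that the page-worthy unit is the cluster, not the host: 20 identical host alerts carry no more information than one alert saying "80% of the cluster has problems".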
Important != Urgent

What you see: Low disk space alert on a MongoDB host.

What happened:
- DB disk is slowly filling up, as expected
- Will become urgent in a few weeks

In an ideal world:
- No need for an alert at all!
- Automatically issue a Jira ticket and assign it to me
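One way to separate important from urgent is to project how long until the problem actually bites, and only page when that horizon is short. The sketch below routes everything else to a ticket queue (the actual Jira call is left out; the function and its parameters are illustrative assumptions, not a real client):

```python
# Sketch (assumed names): page only when the disk will fill soon;
# otherwise file a ticket instead of waking someone up.

def route_disk_alert(free_gb, fill_rate_gb_per_day, urgent_within_days=3):
    """Project days until the disk is full and pick a routing action."""
    days_left = free_gb / fill_rate_gb_per_day
    if days_left <= urgent_within_days:
        return ("page", days_left)    # truly urgent: wake someone up
    return ("ticket", days_left)      # important, not urgent: ticket it

action, days = route_disk_alert(free_gb=120, fill_rate_gb_per_day=4)
print(action, round(days))  # weeks away -> a ticket will do
```

The point is that "important" describes the problem, while "urgent" describes the deadline; only the latter should control whether a human is interrupted right now.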
Non-Adaptive Thresholds

What you see: The same high-load alerts, every Monday after lunch.

What happened:
- Monday is busy by definition
- You can’t use the same thresholds every day

In an ideal world:
- Dynamically update your thresholds
- Or focus only on anomalies (e.g. etsy/skyline)
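A simple form of a dynamic threshold (far cruder than skyline's anomaly detection, but it shows the idea) is to judge each weekday against its own history rather than a single static limit. All names here are illustrative:

```python
# Sketch: derive the threshold from the same weekday/hour slot's own
# history (mean + k * stddev), so a busy Monday is compared to past
# Mondays, not to a global static limit.
from statistics import mean, stdev

def adaptive_threshold(history, k=3.0):
    """history: load samples from the same weekday/hour slot."""
    return mean(history) + k * stdev(history)

monday_loads = [70, 75, 72, 78, 74]        # past Mondays after lunch
threshold = adaptive_threshold(monday_loads)
print(76 > threshold)   # a load of 76 is normal for a Monday: no alert
```

A static threshold of, say, 60 would have fired every Monday; the per-slot baseline stays quiet until the load is unusual *for that slot*.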
Same Issue, Different System

What you see: Incoming alerts from Nagios, Pingdom, NewRelic, Keynote & Splunk…

What happened:
- Data corruption in a couple of Mongo nodes
- Resulting in heavy disk IO and some transaction errors
- This kind of error manifests itself at the server, application & user levels

In an ideal world:
- Auto-correlate highly related alerts from different systems
- Show me one high-level incident, instead of low-level alerts
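The simplest correlation signal is time: alerts from different monitoring systems that fire within minutes of each other usually describe one underlying incident. The sketch below groups purely by time window; real correlation engines also use topology, host names, and text similarity. All names and data here are made up for illustration:

```python
# Sketch: naive time-window correlation -- alerts arriving within
# `window` seconds of the previous alert join the same incident.

def correlate(alerts, window=300):
    """alerts: list of (timestamp_sec, source, message), any order."""
    incidents, current = [], []
    for ts, source, msg in sorted(alerts):
        if current and ts - current[-1][0] > window:
            incidents.append(current)   # gap too large: close the incident
            current = []
        current.append((ts, source, msg))
    if current:
        incidents.append(current)
    return incidents

alerts = [
    (1000, "Nagios",   "heavy disk IO on mongo-2"),
    (1060, "NewRelic", "transaction errors in checkout"),
    (1100, "Pingdom",  "www slow response"),
    (9000, "Splunk",   "log volume spike"),   # unrelated, hours later
]
print(len(correlate(alerts)))   # 2 incidents instead of 4 raw alerts
```

The on-call engineer then triages two incidents, one of which already bundles the server-, application- and user-level symptoms of the same root cause.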
Transient Alerts

What you see: An issue pops up for a couple of minutes, then disappears.

What happened:
- Maybe a cronjob over-utilizes the network
- Or a random race condition in the app
- Or a rarely-used product feature that causes the backend to crash

In an ideal world:
- No need for an alert every time it happens
- Give me a monthly report of common short-lived alerts
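One common way to get both behaviors (no page per blip, but a periodic report) is to require a condition to persist for N consecutive checks before firing, and to tally anything shorter for the report. This is a minimal sketch with made-up names, not a real alerting framework:

```python
# Sketch: suppress transient alerts by requiring the condition to
# persist for `persist` consecutive checks; shorter blips are only
# counted toward a periodic report.
from collections import Counter

class TransientFilter:
    def __init__(self, persist=3):
        self.persist = persist
        self.streaks = Counter()   # consecutive failing checks per alert
        self.report = Counter()    # short-lived blips, for the report

    def check(self, name, failing):
        """Return True only when the alert should actually fire."""
        if failing:
            self.streaks[name] += 1
            return self.streaks[name] >= self.persist
        if 0 < self.streaks[name] < self.persist:
            self.report[name] += 1          # it was just a blip
        self.streaks[name] = 0
        return False

f = TransientFilter(persist=3)
print([f.check("cron-net", s) for s in (True, True, False)])
print(f.report["cron-net"])   # the blip is recorded, nobody was paged
```

A chronic blip (say, that cronjob saturating the network every night) then shows up as a high count in the monthly report, which is the right place to notice it.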
Give us a try - http://www.bigpanda.io
http://twitter.com/bigpanda
Thanks for listening!