Alert Fatigue - and what to do about it
Elik Eizenberg, VP R&D
http://www.bigpanda.io
alert fatigue (noun)
A constant flood of noisy, non-actionable alerts, generated by your monitoring stack.
Synonyms: alert overload, alert spam
Poor Signal-to-Noise Ratio
Delayed Response
Wrong Prioritization
Constant Context Switching
Common Pitfalls
Alert Per Host
What you see: 20 critical Nagios / Zabbix alerts, all at once
What happened:
- Unexpected traffic to your app
- You get an alert from practically every host in the cluster
In an ideal world:
- 1 alert, indicating 80% of the cluster has problems
- Don’t wake me up unless at least some % of the cluster is down
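A minimal sketch of the "one alert per cluster" idea, assuming you can collect the latest check result for every host in one place. The `HostCheck` type, the `cluster_alert` function and the 50% default are illustrative assumptions, not features of Nagios or Zabbix.

```python
from dataclasses import dataclass


@dataclass
class HostCheck:
    host: str        # hypothetical host name
    critical: bool   # True if the latest check on this host is critical


def cluster_alert(checks, min_fraction=0.5):
    """Return a single cluster-level message, or None if too few hosts are critical.

    min_fraction is an assumed paging threshold: stay silent unless at
    least this share of the cluster is in a critical state.
    """
    if not checks:
        return None
    critical_hosts = [c.host for c in checks if c.critical]
    fraction = len(critical_hosts) / len(checks)
    if fraction < min_fraction:
        return None
    return (f"{fraction:.0%} of cluster critical "
            f"({len(critical_hosts)}/{len(checks)} hosts)")
```

With 20 critical hosts out of 25, this yields one message reporting 80% of the cluster instead of 20 separate pages.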
Important != Urgent
What you see: Low disk space alert on a MongoDB host
What happened:
- DB disk is slowly filling up, as expected
- Will become urgent in a few weeks
In an ideal world:
- No need for an alert at all!
- Automatically issue a Jira ticket and assign it to me
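One way to separate important from urgent is to estimate how long until the disk is full and route on that. The sketch below assumes linear growth. The function names and the 2-day / 30-day cutoffs are hypothetical, and actually creating the Jira ticket is left out.

```python
def days_until_full(used_gb, capacity_gb, growth_gb_per_day):
    """Estimate days until the disk is full, assuming linear growth."""
    if growth_gb_per_day <= 0:
        return float("inf")  # not growing, so it never fills up
    return max(capacity_gb - used_gb, 0) / growth_gb_per_day


def route(used_gb, capacity_gb, growth_gb_per_day,
          page_within_days=2, ticket_within_days=30):
    """Return "page", "ticket" or "ignore" (assumed routing labels)."""
    days = days_until_full(used_gb, capacity_gb, growth_gb_per_day)
    if days <= page_within_days:
        return "page"     # urgent: wake someone up
    if days <= ticket_within_days:
        return "ticket"   # important but not urgent: file a ticket
    return "ignore"
```

A disk at 800 of 1000 GB growing 10 GB per day has about 20 days left, so it becomes a ticket rather than a page.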
Non-Adaptive Thresholds
What you see: The same high-load alerts, every Monday after lunch
What happened:
- Monday is busy by definition
- You can’t use the same thresholds every day
In an ideal world:
- Dynamically update your thresholds
- Or focus only on anomalies (e.g. etsy/skyline)
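A rough sketch of a dynamic threshold: keep a separate baseline per (weekday, hour) slot and flag only values well above what is normal for that slot. This is a simple mean-plus-k-standard-deviations rule chosen for illustration. It is not how etsy/skyline works, and the function names and `k=3.0` are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_baseline(samples):
    """samples: iterable of (weekday, hour, value) from past weeks.

    Returns {(weekday, hour): (mean, population_stdev)}.
    """
    buckets = defaultdict(list)
    for weekday, hour, value in samples:
        buckets[(weekday, hour)].append(value)
    return {slot: (mean(vals), pstdev(vals)) for slot, vals in buckets.items()}


def is_anomalous(baseline, weekday, hour, value, k=3.0):
    """True if value exceeds the slot's mean by more than k standard deviations."""
    if (weekday, hour) not in baseline:
        return False  # no history for this slot, so do not alert
    mu, sigma = baseline[(weekday, hour)]
    return value > mu + k * sigma
```

The same load value can then be normal on a busy Monday afternoon and anomalous on a quiet Tuesday.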
Same Issue, Different System
What you see: Incoming alerts from Nagios, Pingdom, NewRelic, Keynote & Splunk…
What happened:
- Data corruption in a couple of Mongo nodes
- Resulting in heavy disk IO and some transaction errors
- This kind of error manifests itself at the server, application & user level
In an ideal world:
- Auto correlate highly-related alerts from different systems
- Show me one high-level incident, instead of low-level alerts
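A deliberately simple sketch of correlation: group alerts that share a service tag and arrive close together in time into one incident. It assumes every alert has already been normalized to carry a `service` tag, which real tools do not provide out of the box. The `Alert` and `Incident` types and the 5-minute window are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    source: str       # e.g. "Nagios", "Pingdom"
    service: str      # assumed normalized tag shared across tools
    timestamp: float  # seconds since some epoch
    message: str


@dataclass
class Incident:
    service: str
    alerts: list = field(default_factory=list)


def correlate(alerts, window_seconds=300):
    """Group alerts per service; a gap longer than the window starts a new incident."""
    incidents = []
    latest_by_service = {}
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        incident = latest_by_service.get(alert.service)
        gap_too_long = (incident is not None and
                        alert.timestamp - incident.alerts[-1].timestamp > window_seconds)
        if incident is None or gap_too_long:
            incident = Incident(service=alert.service)
            incidents.append(incident)
            latest_by_service[alert.service] = incident
        incident.alerts.append(alert)
    return incidents
```

Alerts about the same Mongo problem from three tools within a few minutes then show up as one incident with three alerts attached.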
Transient Alerts
What you see: Issue pops up for a couple of minutes, then disappears.
What happened:
- Maybe a cronjob overutilizes the network
- Or a random race condition in the app
- Or a rarely used product feature that causes the backend to crash
In an ideal world:
- No need for an alert every time it happens
- Give me a monthly report of common short-lived alerts
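The report of short-lived alerts can be sketched as a simple count, assuming you keep a log of when each alert opened and resolved. The tuple layout and the 5-minute cutoff are assumptions for illustration.

```python
from collections import Counter


def transient_report(events, max_duration_seconds=300):
    """events: iterable of (name, opened_at, resolved_at), times in seconds.

    Returns [(name, count), ...] for alerts that resolved within the
    cutoff, most frequent first.
    """
    counts = Counter(
        name
        for name, opened_at, resolved_at in events
        if resolved_at - opened_at <= max_duration_seconds
    )
    return counts.most_common()
```

Feeding it one month of events gives the list of recurring blips worth investigating, without paging anyone for each occurrence.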
Give us a try - http://www.bigpanda.io
http://twitter.com/bigpanda
Thanks for listening!