Transcript
Page 1: Five Causes of Alert Fatigue -- and how to prevent them

Alert Fatigue - and what to do about it

Elik Eizenberg, VP R&D

http://www.bigpanda.io

Page 2: Five Causes of Alert Fatigue -- and how to prevent them

alert fatigue (noun)

A constant flood of noisy, non-actionable alerts, generated by your monitoring stack.

Synonyms: alert overload, alert spam

Page 3: Five Causes of Alert Fatigue -- and how to prevent them

Poor Signal-to-Noise Ratio

Delayed Response

Wrong Prioritization

Constant Context Switching

Page 4: Five Causes of Alert Fatigue -- and how to prevent them

Common Pitfalls

Page 5: Five Causes of Alert Fatigue -- and how to prevent them

What you see: 20 critical Nagios / Zabbix alerts, all at once

What happened:
- Unexpected traffic to your app
- You get an alert from practically every host in the cluster

In an ideal world:
- 1 alert, indicating 80% of the cluster has problems
- Don’t wake me up unless at least some % of the cluster is down

Alert Per Host
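
The "one alert per cluster" idea can be sketched as a small aggregation step that sits between the per-host checks and the pager. This is a minimal sketch, not a Nagios or Zabbix feature: the `HostCheck` type, the `cluster_alert` function and the 50% default threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class HostCheck:
    host: str       # hypothetical host name
    critical: bool  # True if the host's check is currently failing


def cluster_alert(checks, min_failed_fraction=0.5):
    """Return one cluster-level message, or None if too few hosts are failing."""
    if not checks:
        return None
    failed = [c.host for c in checks if c.critical]
    fraction = len(failed) / len(checks)
    if fraction < min_failed_fraction:
        return None  # below the assumed threshold: do not wake anyone up
    return (f"{fraction:.0%} of the cluster has problems "
            f"({len(failed)}/{len(checks)} hosts)")


# 16 of 20 hosts failing collapses into a single message.
checks = [HostCheck(f"web-{i:02d}", critical=i < 16) for i in range(20)]
print(cluster_alert(checks))
```

The threshold is the only tuning knob here; a real setup would likely want a different value per cluster.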

Page 6: Five Causes of Alert Fatigue -- and how to prevent them

What you see: Low disk space alert on a MongoDB host

What happened:
- DB disk is slowly filling up, as expected
- Will become urgent in a few weeks

In an ideal world:
- No need for an alert at all!
- Automatically issue a Jira ticket and assign it to me

Important != Urgent
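
One way to separate important from urgent is to project when the disk will actually fill up and route on that. The sketch below assumes a simple linear growth rate and a 2-day urgency cutoff; the function names and the "page" / "ticket" labels are hypothetical, and the hand-off to a ticketing system such as Jira is left out because it depends on the local setup.

```python
def days_until_full(used_gb, capacity_gb, growth_gb_per_day):
    """Linear projection of the days left before the disk is full."""
    if growth_gb_per_day <= 0:
        return float("inf")  # not growing: never fills under this model
    return (capacity_gb - used_gb) / growth_gb_per_day


def route_disk_alert(used_gb, capacity_gb, growth_gb_per_day,
                     urgent_within_days=2):
    """Return 'page' if the disk fills soon, otherwise 'ticket'."""
    remaining = days_until_full(used_gb, capacity_gb, growth_gb_per_day)
    return "page" if remaining <= urgent_within_days else "ticket"


# 200 GB free at 8 GB/day is 25 days away: important, not urgent.
print(route_disk_alert(used_gb=800, capacity_gb=1000, growth_gb_per_day=8))
# 10 GB free at 8 GB/day is 1.25 days away: urgent.
print(route_disk_alert(used_gb=990, capacity_gb=1000, growth_gb_per_day=8))
```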

Page 7: Five Causes of Alert Fatigue -- and how to prevent them

What you see: The same high-load alerts, every Monday after lunch

What happened:
- Monday is busy by definition
- You can’t use the same thresholds every day

In an ideal world:
- Dynamically update your thresholds
- Or focus only on anomalies (e.g. etsy/skyline)

Non-Adaptive Thresholds
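
A simple form of dynamic thresholds is to learn a separate limit for each (weekday, hour) slot from past load. The sketch below uses a plain mean plus k standard deviations rule; it is not how etsy/skyline works, and the history format, `k=3.0` and the fallback threshold are assumptions.

```python
from collections import defaultdict
from statistics import mean, stdev


def build_thresholds(history, k=3.0):
    """Map (weekday, hour) -> mean + k * stdev of the load seen in that slot.

    history is an iterable of (weekday, hour, load) samples, weekday 0 = Monday.
    """
    slots = defaultdict(list)
    for weekday, hour, load in history:
        slots[(weekday, hour)].append(load)
    return {
        slot: mean(values) + k * stdev(values)
        for slot, values in slots.items()
        if len(values) >= 2  # stdev needs at least two samples
    }


def should_alert(thresholds, weekday, hour, load, default=5.0):
    """Compare load with the slot's learned threshold (or a static default)."""
    return load > thresholds.get((weekday, hour), default)


history = [
    (0, 13, 9.0), (0, 13, 10.0), (0, 13, 11.0),  # busy Mondays after lunch
    (1, 13, 2.0), (1, 13, 3.0), (1, 13, 4.0),    # quiet Tuesdays
]
thresholds = build_thresholds(history)
print(should_alert(thresholds, weekday=0, hour=13, load=11.0))  # normal for a Monday
print(should_alert(thresholds, weekday=1, hour=13, load=11.0))  # unusual for a Tuesday
```

The same load value is quiet on Monday and alerts on Tuesday, which is the behavior a single static threshold cannot give.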

Page 8: Five Causes of Alert Fatigue -- and how to prevent them

What you see: Incoming alerts from Nagios, Pingdom, NewRelic, Keynote & Splunk…

What happened:
- Data corruption in a couple of Mongo nodes
- Resulting in heavy disk IO and some transaction errors
- This kind of error manifests itself at the server, application & user levels

In an ideal world:
- Auto-correlate highly related alerts from different systems
- Show me one high-level incident, instead of low-level alerts

Same Issue, Different System
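
Correlation can be approximated by grouping alerts that share a tag and arrive close together in time. This is a deliberately naive sketch, not a description of any product's correlation logic: the `Alert` type, the shared `service` tag and the 5-minute window are assumptions, and real alerts from different tools rarely carry a common tag without some normalization first.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    source: str       # monitoring tool that raised the alert
    service: str      # hypothetical tag shared across tools
    timestamp: float  # seconds


def correlate(alerts, window_seconds=300):
    """Group alerts with the same service tag into incidents.

    An alert joins the latest incident for its service if it arrives
    within window_seconds of that incident's most recent alert.
    """
    incidents = []
    latest = {}  # service -> most recent incident (a list of alerts)
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        incident = latest.get(alert.service)
        if incident is not None and \
                alert.timestamp - incident[-1].timestamp <= window_seconds:
            incident.append(alert)
        else:
            incident = [alert]
            incidents.append(incident)
            latest[alert.service] = incident
    return incidents


alerts = [
    Alert("Nagios", "shop", 0),
    Alert("Splunk", "shop", 60),
    Alert("NewRelic", "shop", 120),
    Alert("Pingdom", "shop", 200),
    Alert("Nagios", "mail", 90),
]
for incident in correlate(alerts):
    print(incident[0].service, [a.source for a in incident])
```

Five low-level alerts become two incidents, one per affected service.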

Page 9: Five Causes of Alert Fatigue -- and how to prevent them

What you see: Issue pops up for a couple of minutes, then disappears.

What happened:
- Maybe a cronjob overutilizes the network
- Or a random race condition in the app
- Or a rarely used product feature that causes the backend to crash

In an ideal world:
- No need for an alert every time it happens
- Give me a monthly report of common short-lived alerts

Transient Alerts
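
Short-lived alerts can be held back from notification and counted for a periodic report instead. The sketch below works on already-recorded events, so it shows the classification rule only; a live system would need to delay notifications until the cutoff has passed. The event tuple format, the alert names and the 5-minute cutoff are assumptions.

```python
from collections import Counter


def split_transient(events, min_duration_seconds=300):
    """Separate short-lived alerts from ones worth notifying about.

    events is an iterable of (name, opened_at, resolved_at) in seconds;
    resolved_at is None while the alert is still open.
    Returns (names_to_notify, Counter of transient alert names).
    """
    to_notify = []
    transient = Counter()
    for name, opened_at, resolved_at in events:
        if resolved_at is not None and \
                resolved_at - opened_at < min_duration_seconds:
            transient[name] += 1  # resolved quickly: report, do not page
        else:
            to_notify.append(name)
    return to_notify, transient


events = [
    ("net-saturation", 0, 120),
    ("net-saturation", 86400, 86490),
    ("backend-crash", 500, 560),
    ("db-down", 1000, None),
    ("high-load", 2000, 4000),
]
to_notify, transient = split_transient(events)
print(to_notify)
print(transient.most_common())  # most frequent short-lived alerts first
```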

Page 10: Five Causes of Alert Fatigue -- and how to prevent them

Give us a try - http://www.bigpanda.io
http://twitter.com/bigpanda

Thanks for listening!

