Five Causes of Alert Fatigue -- and how to prevent them

Alert Fatigue - and what to do about it
Elik Eizenberg, VP R&D
http://www.bigpanda.io

Posted on 29-Jun-2015 by bigpanda

DESCRIPTION

“Alert Spam” is a major recurring pain brought up by Ops teams: the constant flood of noisy alerts from your monitoring stack. This presentation discusses five types of spammy alerts that we hear about most often (and how we’d like to see them resolved). Most of them will sound familiar to you.

TRANSCRIPT

Page 1: Five Causes of Alert Fatigue -- and how to prevent them

Alert Fatigue - and what to do about it

Elik Eizenberg, VP R&D

http://www.bigpanda.io

Page 2

alert fatigue (noun)

A constant flood of noisy, non-actionable alerts, generated by your monitoring stack.

Synonyms: alert overload, alert spam

Page 3

- Poor Signal-to-Noise Ratio
- Delayed Response
- Wrong Prioritization
- Constant Context Switching

Page 4

Common Pitfalls

Page 5: Alert Per Host

What you see: 20 critical Nagios / Zabbix alerts, all at once

What happened:
- Unexpected traffic to your app
- You get an alert from practically every host in the cluster

In an ideal world:
- 1 alert, indicating 80% of the cluster has problems
- Don’t wake me up unless at least some % of the cluster is down
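The "one alert per cluster" idea can be sketched in a few lines: fold per-host alerts into a single summary that fires only once enough of the cluster is affected. The inventory, threshold, and message format below are illustrative assumptions, not BigPanda's actual implementation.

```python
# Sketch of cluster-level alert aggregation. CLUSTER_SIZE and MIN_FRACTION
# are assumed values for illustration.
from collections import defaultdict

CLUSTER_SIZE = {"web": 20, "db": 5}   # assumed host inventory per cluster
MIN_FRACTION = 0.3                    # page only if >= 30% of hosts alert

def aggregate(host_alerts):
    """host_alerts: list of (cluster, host) pairs for currently firing alerts.
    Returns one summary alert per cluster that crosses the threshold."""
    down = defaultdict(set)
    for cluster, host in host_alerts:
        down[cluster].add(host)
    summaries = []
    for cluster, hosts in sorted(down.items()):
        fraction = len(hosts) / CLUSTER_SIZE[cluster]
        if fraction >= MIN_FRACTION:
            summaries.append(
                f"{cluster}: {fraction:.0%} of the cluster has problems")
    return summaries

# 16 of 20 web hosts alerting -> one summary instead of sixteen pages
alerts = [("web", f"web-{i:02d}") for i in range(16)]
print(aggregate(alerts))  # ['web: 80% of the cluster has problems']
```

A single host alerting stays below the threshold and produces no page at all, which is exactly the "don't wake me up" behavior the slide asks for.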

Page 6: Important != Urgent

What you see: Low disk space alert on a MongoDB host

What happened:
- DB disk is slowly filling up, as expected
- Will become urgent in a few weeks

In an ideal world:
- No need for an alert at all!
- Automatically issue a Jira ticket and assign it to me
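One way to separate important from urgent is to extrapolate how long until the disk actually fills, and route the result accordingly: page only when it is imminent, otherwise file a ticket. The linear growth model, function name, and day threshold below are illustrative assumptions.

```python
# Sketch: route slow-burning disk-space issues to a ticket instead of a
# page. route_disk_alert and its thresholds are hypothetical.

def route_disk_alert(free_gb, fill_rate_gb_per_day, page_within_days=3):
    """Decide whether a low-disk condition should page, ticket, or be ignored,
    using a naive linear projection of the fill rate."""
    if fill_rate_gb_per_day <= 0:
        return "ignore"                  # disk is not actually filling up
    days_left = free_gb / fill_rate_gb_per_day
    if days_left <= page_within_days:
        return "page"                    # urgent: wake someone up
    return "ticket"                      # important, not urgent: Jira it

# 120 GB free, filling at 4 GB/day -> ~30 days left -> just file a ticket
print(route_disk_alert(120, 4))   # ticket
print(route_disk_alert(6, 4))     # 1.5 days left -> page
```

The "ticket" branch is where an integration would create and assign the Jira issue; the point is that the on-call human only sees the "page" branch.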

Page 7: Non-Adaptive Thresholds

What you see: The same high-load alerts, every Monday after lunch

What happened:
- Monday is busy by definition
- You can’t use the same thresholds every day

In an ideal world:
- Dynamically update your thresholds
- Or focus only on anomalies (e.g. etsy/skyline)
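A minimal adaptive threshold compares each sample against history from the *same* weekday/hour bucket, rather than one fixed number, so "busy Monday lunch" stops paging. The bucketing scheme and the 3-sigma rule below are illustrative assumptions (etsy/skyline uses a more elaborate ensemble of detectors).

```python
# Sketch of an adaptive threshold: alert only when the current value is
# far outside the historical distribution for this time slot.
import statistics

def is_anomalous(history, current, k=3.0):
    """history: past load samples for this weekday/hour bucket
    (e.g. Mondays at 13:00). Alert only if `current` exceeds
    mean + k * standard deviation of that history."""
    if len(history) < 2:
        return False                     # not enough data to judge
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return current > mean + k * stdev

monday_lunch = [82, 85, 80, 88, 84]      # Mondays are always busy
print(is_anomalous(monday_lunch, 90))    # busy as usual -> False, no alert
print(is_anomalous(monday_lunch, 140))   # genuinely abnormal -> True
```

With a static threshold of, say, 75, every one of those Monday samples would have paged; against its own history, none of them is surprising.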

Page 8: Same Issue, Different System

What you see: Incoming alerts from Nagios, Pingdom, NewRelic, Keynote & Splunk…

What happened:
- Data corruption in a couple of Mongo nodes
- Resulting in heavy disk IO and some transaction errors
- This kind of error manifests itself at the server, application & user level

In an ideal world:
- Auto-correlate highly-related alerts from different systems
- Show me one high-level incident, instead of low-level alerts
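A naive version of that correlation can be sketched as: alerts from different monitoring tools that mention the same host and fire within a short window get folded into one incident. The window size and host-based keying below are illustrative assumptions, not any vendor's actual correlation algorithm (real systems also use service topology, alert text similarity, and so on).

```python
# Sketch of time + host based alert correlation. WINDOW_SECONDS and the
# grouping key are assumptions for illustration.
WINDOW_SECONDS = 300

def correlate(alerts):
    """alerts: list of (timestamp, source, host) tuples.
    Returns incidents: lists of alerts grouped by host and time proximity."""
    incidents = []
    open_incident_for_host = {}
    for ts, source, host in sorted(alerts):
        inc = open_incident_for_host.get(host)
        if inc is not None and ts - inc[-1][0] <= WINDOW_SECONDS:
            inc.append((ts, source, host))   # same host, close in time
        else:
            inc = [(ts, source, host)]       # start a new incident
            incidents.append(inc)
            open_incident_for_host[host] = inc
    return incidents

alerts = [
    (0,   "Nagios",   "mongo-02"),   # server level
    (40,  "NewRelic", "mongo-02"),   # application level
    (90,  "Splunk",   "mongo-02"),   # log level
    (100, "Pingdom",  "web-01"),     # unrelated host
]
print(len(correlate(alerts)))  # 2 incidents instead of 4 raw alerts
```

The three mongo-02 alerts collapse into one incident the on-call engineer can reason about, while the unrelated web-01 alert stays separate.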

Page 9: Transient Alerts

What you see: Issue pops up for a couple of minutes, then disappears

What happened:
- Maybe a cronjob over-utilizes the network
- Or a random race condition in the app
- Or a rarely-used product feature that causes the backend to crash

In an ideal world:
- No need for an alert every time it happens
- Give me a monthly report of common short-lived alerts
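That behavior can be approximated with a grace period: hold each alert briefly, and if it clears on its own, suppress the page and just count it for the periodic report. The grace period, function name, and data shapes below are illustrative assumptions.

```python
# Sketch: suppress short-lived alerts, but tally them for a monthly
# report instead of paging every time. Names and thresholds are assumed.
from collections import Counter

GRACE_SECONDS = 180   # only page if the alert stays active this long

def triage(events):
    """events: list of (name, start_ts, end_ts); end_ts is None if the
    alert is still firing. Returns (alerts_to_page, transient_counts)."""
    to_page, transient = [], Counter()
    for name, start, end in events:
        if end is not None and end - start < GRACE_SECONDS:
            transient[name] += 1         # self-resolved: report, don't page
        else:
            to_page.append(name)         # long-lived or still firing: page
    return to_page, transient

events = [
    ("high-network-io", 0, 120),      # cronjob blip, self-resolved
    ("high-network-io", 3600, 3700),  # same blip, next hour
    ("mongo-disk-full", 7200, None),  # still firing -> page
]
page, report = triage(events)
print(page)                    # ['mongo-disk-full']
print(report.most_common(1))   # [('high-network-io', 2)]
```

The counter is exactly the raw material for the "monthly report of common short-lived alerts": recurring blips surface as a pattern instead of as 2 a.m. pages.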

Page 10

Give us a try: http://www.bigpanda.io
http://twitter.com/bigpanda

Thanks for listening!