Nagios Conference 2012 - Alex Solomon - Managing Your Heroes
DESCRIPTION
Alex Solomon's presentation on the people aspect of monitoring. The presentation was given during the Nagios World Conference North America, held September 25-28, 2012 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna
TRANSCRIPT
![Page 1: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/1.jpg)
MANAGING YOUR HEROES
The People Aspect of Monitoring
(a.k.a. Dealing with Outages and Failures)
Alex Solomon
![Page 2: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/2.jpg)
WHO AM I?
Alex Solomon
• Founder / CEO of PagerDuty
• Intersect Inc.
• Amazon.com
![Page 3: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/3.jpg)
DEFINITIONS
![Page 4: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/4.jpg)
Service Level Agreement (SLA)
Mean Time To Response
Mean Time To Resolution (MTTR)
Mean Time Between Failures (MTBF)
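As a rough illustration of how these metrics relate to one another, the sketch below computes them from a small list of incident records. The `Incident` record, its field names, and the sample timestamps are assumptions made for this example; the slides do not define a data model. MTBF is taken here as the average uptime between one fix and the next detection, which is one common reading of the term.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record; field names are illustrative only.
    detected: datetime      # monitoring detects the issue
    work_started: datetime  # an engineer starts working on it
    fixed: datetime         # the issue is resolved


def mean_time_to_response(incidents):
    """Average time from detection to an engineer starting work."""
    return timedelta(seconds=mean(
        (i.work_started - i.detected).total_seconds() for i in incidents))


def mean_time_to_resolution(incidents):
    """Average time from detection to the fix (MTTR)."""
    return timedelta(seconds=mean(
        (i.fixed - i.detected).total_seconds() for i in incidents))


def mean_time_between_failures(incidents):
    """Average uptime between one fix and the next detection (MTBF).

    Needs at least two incidents, otherwise there is no gap to average.
    """
    ordered = sorted(incidents, key=lambda i: i.detected)
    gaps = [(b.detected - a.fixed).total_seconds()
            for a, b in zip(ordered, ordered[1:])]
    return timedelta(seconds=mean(gaps))


# Made-up sample data.
incidents = [
    Incident(datetime(2012, 9, 1, 2, 0),
             datetime(2012, 9, 1, 2, 10),
             datetime(2012, 9, 1, 3, 0)),
    Incident(datetime(2012, 9, 8, 14, 0),
             datetime(2012, 9, 8, 14, 20),
             datetime(2012, 9, 8, 14, 30)),
]

print(mean_time_to_response(incidents))
print(mean_time_to_resolution(incidents))
print(mean_time_between_failures(incidents))
```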
![Page 5: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/5.jpg)
OUTAGES
![Page 6: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/6.jpg)
Can we prevent them?
![Page 7: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/7.jpg)
PREVENTING OUTAGES
Single Points of Failure (SPOFs) ➞ Redundant systems
Complex, monolithic systems ➞ Service-oriented architecture
![Page 8: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/8.jpg)
Netflix distributed SOA system
![Page 9: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/9.jpg)
PREVENTING OUTAGES
Change
(not much you can do about this one)
![Page 10: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/10.jpg)
OUTAGES
![Page 11: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/11.jpg)
FAILURE LIFECYCLE
![Page 12: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/12.jpg)
Monitoring (detect failure) ➞ Alert ➞ Investigate ➞ Fix ➞ Root-cause Analysis
![Page 13: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/13.jpg)
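Critical Incident Timeline
Issue is detected ➞ Engineer starts working on issue ➞ Issue is fixed
• RESPONSE TIME: from issue detected to engineer starting work (Alert)
• RESOLUTION TIME: from issue detected to issue fixed (Alert ➞ Investigate ➞ Fix)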
![Page 14: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/14.jpg)
MONITOR
![Page 15: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/15.jpg)
MONITOR EVERYTHING!
All levels of the stack:
• Data center
• Network
• Servers
• Database
• Application
• Website
• Business Metrics
![Page 16: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/16.jpg)
WHY MONITOR EVERYTHING?
Metrics!
Metrics!
Metrics!
![Page 17: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/17.jpg)
TOOLS
• Internal monitoring (behind the firewall):
•
•
• External monitoring (SaaS-based):
•
•
• Metrics:
• Graphite or
![Page 18: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/18.jpg)
ALERT
![Page 19: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/19.jpg)
Best Practice: Categorize alerts by severity.
![Page 20: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/20.jpg)
SEVERITIES
Define severities based on business impact:
• sev1 - large scale business loss (critical)
• sev2 - small to medium business loss (critical)
• sev3 - no immediate business loss, customers may be impacted (non-critical)
• sev4 - no business loss, no customers impacted (non-critical)
![Page 21: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/21.jpg)
Each severity level should have its own standard operating procedure (SOP):
• Who
• How
• Response time
![Page 22: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/22.jpg)
• Sev1: Major outage, all hands on deck
  • Notify the entire team via phone and SMS
  • Response time: 5 min
• Sev2: Critical issue
  • Notify the on-call person via phone and SMS
  • Response time: 15 min
• Sev3: Non-critical issue
  • Notify the on-call person via email
  • Response time: next day during business hours
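One way to keep these procedures consistent is to encode them as data that an alerting script can look up. The sketch below is only an illustration: the `Sop` class, the `SOPS` table, and the function names are hypothetical, while the who/how/response-time values are the ones listed on the slide. The `should_page` helper reflects the talk's point that only critical severities (sev1 and sev2) should page someone.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sop:
    who: str                         # who gets notified
    how: tuple                       # notification channels
    response_minutes: Optional[int]  # None = next day during business hours


# Values taken from the slide; the structure itself is illustrative.
SOPS = {
    1: Sop("entire team", ("phone", "sms"), 5),
    2: Sop("on-call person", ("phone", "sms"), 15),
    3: Sop("on-call person", ("email",), None),
}


def should_page(severity: int) -> bool:
    """Only critical severities (sev1 and sev2) page someone."""
    return severity in (1, 2)


def describe(severity: int) -> str:
    """Render the SOP for a severity as a one-line summary."""
    sop = SOPS[severity]
    if sop.response_minutes is not None:
        deadline = f"{sop.response_minutes} min"
    else:
        deadline = "next day during business hours"
    channels = " and ".join(sop.how)
    return f"sev{severity}: notify {sop.who} via {channels}, response time: {deadline}"


for sev in sorted(SOPS):
    print(describe(sev))
```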
![Page 23: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/23.jpg)
• Sev1 incidents
  • Rare
  • Rarely auto-generated
  • Frequently start as sev2 incidents that are upgraded to sev1
![Page 24: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/24.jpg)
• Sev2 incidents
  • More common
  • Mostly auto-generated
![Page 25: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/25.jpg)
• Sev3 incidents
  • Non-critical incidents
  • Can be auto-generated
  • Can also be manually generated
![Page 26: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/26.jpg)
• Severities can be downgraded or upgraded
  • ex. sev2 ➞ sev1 (problem got worse)
  • ex. sev1 ➞ sev2 (problem was partially fixed)
  • ex. sev2 ➞ sev3 (critical problem was fixed but we still need to investigate root cause)
![Page 27: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/27.jpg)
One more best-practice:
Alert before your systems fail completely
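A common way to alert before a complete failure is to set a warning threshold below the critical one, which is how Nagios plugins report state through their exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL). The sketch below applies that idea to disk usage. The 80% and 95% thresholds and the function names are assumptions chosen for illustration, not values from the talk.

```python
import shutil

# Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(used_percent: float, warn: float = 80.0, crit: float = 95.0) -> int:
    """Map a usage percentage to a Nagios-style state.

    The 80/95 defaults are illustrative, not recommendations from the talk.
    """
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def check_disk(path: str = "/") -> int:
    """Print a one-line status for `path` and return the state code."""
    usage = shutil.disk_usage(path)
    used_percent = 100.0 * usage.used / usage.total
    state = classify(used_percent)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[state]
    print(f"DISK {label} - {used_percent:.1f}% used on {path}")
    return state
```

Used as a Nagios plugin, the script would finish with `sys.exit(check_disk())` so that the WARNING state surfaces while there is still headroom before the disk fills up.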
![Page 28: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/28.jpg)
Main benefit of severities
Only page on critical issues (sev1 or 2)
![Page 29: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/29.jpg)
Preserve sanity
![Page 30: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/30.jpg)
Avoid “Peter and the wolf” (boy who cried wolf) scenarios
![Page 31: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/31.jpg)
ON-CALL BEST PRACTICES
Person Level
Team Level
![Page 32: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/32.jpg)
ON-CALL AT THE PERSON LEVEL
Cellphone
![Page 33: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/33.jpg)
Cellphone and/or smartphone
![Page 34: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/34.jpg)
4G / 3G internet
(don’t forget your laptop)
4G hotspot, 4G USB modem, or 3G/4G tethering
![Page 35: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/35.jpg)
Page multiple times until you respond
• Time zero: email and SMS
• 1 min later: phone-call on cell
• 5 min later: phone-call on cell
• 5 min later: phone-call on landline
• 5 min later: phone-call to girlfriend
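The retry sequence above can be sketched as a list of delays and channels. This is a minimal illustration that assumes each "N min later" is measured from the previous step; the names `RETRY_STEPS`, `notification_plan`, and `notifications_sent` are hypothetical, and the slide's tongue-in-cheek final step is left out.

```python
# (delay in minutes after the previous step, channel); values from the slide.
RETRY_STEPS = [
    (0, "email and SMS"),
    (1, "phone call on cell"),
    (5, "phone call on cell"),
    (5, "phone call on landline"),
]


def notification_plan(steps):
    """Turn relative delays into offsets (minutes) from the first alert."""
    elapsed = 0
    plan = []
    for delay, channel in steps:
        elapsed += delay
        plan.append((elapsed, channel))
    return plan


def notifications_sent(steps, acknowledged_at=None):
    """Return the notifications that fire before the responder acknowledges.

    `acknowledged_at` is minutes since the first alert; None means nobody
    has responded yet, so every step fires.
    """
    return [(offset, channel) for offset, channel in notification_plan(steps)
            if acknowledged_at is None or offset < acknowledged_at]


for offset, channel in notification_plan(RETRY_STEPS):
    print(f"t+{offset} min: {channel}")
```

Stopping the sequence as soon as someone acknowledges is the design point: the responder is paged repeatedly only for as long as the alert is unanswered.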
![Page 36: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/36.jpg)
Bonus: vibrating bluetooth bracelet
![Page 37: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/37.jpg)
ON-CALL AT THE TEAM LEVEL
Rarely send alerts to the entire team:
• sev1: OK
• sev2: NO
![Page 38: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/38.jpg)
On-call schedules:
• Simple rotation-based schedule
  • ex. weekly - everyone is on-call for a week at a time
• Set up a follow-the-sun schedule
  • people in multiple timezones
  • no night-shifts
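Both schedule types reduce to a small lookup. The sketch below is an illustration under stated assumptions: the team names, the three sites, and their UTC shift hours are all hypothetical, chosen so that each site covers roughly its own daytime.

```python
from datetime import date, datetime, timezone

TEAM = ["alice", "bob", "carol"]  # hypothetical team members

# Hypothetical follow-the-sun shifts as (start hour, end hour, site) in UTC.
SHIFTS = [
    (0, 8, "Sydney"),
    (8, 16, "London"),
    (16, 24, "San Francisco"),
]


def weekly_on_call(team, day, start):
    """Person on call on `day` for a weekly rotation that began on `start`."""
    weeks_elapsed = (day - start).days // 7
    return team[weeks_elapsed % len(team)]


def follow_the_sun(shifts, moment):
    """Site whose shift covers `moment` (a timezone-aware datetime)."""
    hour = moment.astimezone(timezone.utc).hour
    for start_hour, end_hour, site in shifts:
        if start_hour <= hour < end_hour:
            return site
    raise ValueError("shifts do not cover the full day")


print(weekly_on_call(TEAM, date(2012, 9, 25), date(2012, 9, 3)))
print(follow_the_sun(SHIFTS, datetime(2012, 9, 25, 9, 0, tzinfo=timezone.utc)))
```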
![Page 39: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/39.jpg)
What happens if the on-call person doesn’t respond at all?
![Page 40: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/40.jpg)
If you care about uptime, you need redundancy in your on-call.
![Page 41: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/41.jpg)
Set up multiple on-call levels with automatic escalation between them:
Level 1: Primary on-call
  ➞ Escalate after 15 min
Level 2: Secondary on-call
  ➞ Escalate after 20 min
Level 3: Team on-call (alert entire team)
![Page 42: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/42.jpg)
Best Practice: Put management in the on-call chain
Level 1: Primary on-call
  ➞ Escalate after 15 min
Level 2: Secondary on-call
  ➞ Escalate after 20 min
Level 3: Team on-call (alert entire team)
  ➞ Escalate after 20 min
Level 4: Manager / Director
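An escalation chain like this can be sketched as an ordered list of levels with a timeout on each. The sketch assumes the diagram means 15 minutes between levels 1 and 2 and 20 minutes between each later pair, and that each timeout starts when the previous level is paged; the names `ESCALATION_POLICY` and `level_to_notify` are hypothetical.

```python
# (level name, minutes to wait before escalating to the next level).
# The last level has nowhere left to escalate, so its timeout is None.
ESCALATION_POLICY = [
    ("Primary on-call", 15),
    ("Secondary on-call", 20),
    ("Team on-call (alert entire team)", 20),
    ("Manager / Director", None),
]


def level_to_notify(minutes_unacknowledged, policy=ESCALATION_POLICY):
    """Return the level that should currently be paged for an open alert."""
    threshold = 0
    for name, timeout in policy:
        if timeout is None:
            return name
        threshold += timeout
        if minutes_unacknowledged < threshold:
            return name
    return policy[-1][0]


for minutes in (0, 15, 35, 55):
    print(minutes, level_to_notify(minutes))
```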
![Page 43: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/43.jpg)
Best Practice: put software engineers in the on-call chain
• DevOps model
• Devs need to own the systems they write
• Getting paged provides a strong incentive to engineer better systems
![Page 44: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/44.jpg)
Best Practice: measure on-call performance
• Measure: mean-time-to-response
• Measure: % of issues that were escalated
• Set up policies to encourage good performance
  • Put managers in on-call chain
  • Pay people extra to do on-call
“You can’t improve what you don’t measure.”
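Both measurements fall out of a simple incident log. The sketch below is illustrative only: the log format, the responder names, and the numbers are made up for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical incident log: (responder, minutes to respond, was escalated).
LOG = [
    ("alice", 4, False),
    ("alice", 6, False),
    ("bob", 12, False),
    ("bob", 30, True),
]


def escalation_rate(log):
    """Percentage of incidents that had to be escalated."""
    escalated = sum(1 for _, _, was_escalated in log if was_escalated)
    return 100.0 * escalated / len(log)


def response_time_by_person(log):
    """Mean time to response, in minutes, for each responder."""
    times = defaultdict(list)
    for person, minutes, _ in log:
        times[person].append(minutes)
    return {person: mean(values) for person, values in times.items()}


print(escalation_rate(LOG))
print(response_time_by_person(LOG))
```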
![Page 45: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/45.jpg)
Network Operations Center
![Page 46: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/46.jpg)
NOC with lots of Nagios goodness
![Page 47: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/47.jpg)
NOCs:
• Reduce the mean-time-to-response drastically
• Expensive (staffed 24x7 with multiple people)
• Train NOC staff to fix a good percentage of issues
• As you scale your org, you may want a hybrid on-call approach (where the NOC handles some issues and teams handle other issues directly)
![Page 48: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/48.jpg)
Critical Incident Timeline
Issue is detected ➞ Engineer starts working on issue ➞ Issue is fixed
• RESPONSE TIME: from issue detected to engineer starting work (Alert)
• RESOLUTION TIME: from issue detected to issue fixed (Alert ➞ Investigate ➞ Fix)
Inside the Alert phase:
• Issue is detected
• Alerting system gets ahold of somebody
• Engineer is aware of issue
• Engineer gets to a computer, connects to internet
• Engineer starts working on issue
![Page 49: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/49.jpg)
Alert
Issue is detected ➞ Alerting system gets ahold of somebody ➞ Engineer is aware of issue ➞ Engineer gets to a computer, connects to internet ➞ Engineer starts working on issue
How to minimize (alerting system):
• Alert via phone & SMS
• Alert multiple times via multiple channels
• Failing that, escalate!
• Failing that, escalate to manager!
How to minimize (engineer):
• Carry 4G internet device + laptop at all times
• Set loud ringtone at night
![Page 50: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/50.jpg)
RESEARCH & FIX
![Page 51: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/51.jpg)
Investigate Fix
How do we reduce the amount of time needed to investigate and fix?
![Page 52: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/52.jpg)
Set up an Emergency Ops Guide:
• When you encounter a new failure, document it in the Guide
• Document symptoms, research steps, fixes
• Use a wiki
![Page 53: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/53.jpg)
![Page 54: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/54.jpg)
![Page 55: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/55.jpg)
Automate fixes
or
Add more fault tolerance
![Page 56: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/56.jpg)
You need the right tools:
• Tools to help you diagnose problems faster
  • Comprehensive monitoring, metrics and dashboards
  • Tools that help search for problems in log files quickly (e.g. Splunk)
• Tools to help your team communicate efficiently
  • Voice: Conference bridge, Skype, Google Hangout
  • Chat: Hipchat, Campfire
![Page 57: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/57.jpg)
Best Practice: Incident Commander
![Page 58: Nagios Conference 2012 - Alex Solomon - Managing Your Heros](https://reader034.vdocument.in/reader034/viewer/2022051515/557cf577d8b42a98158b4853/html5/thumbnails/58.jpg)
Incident Commander:
• Essential for dealing with sev1 issues
• In charge of the situation
• Provides leadership, prevents analysis paralysis
• He/she directs people to do things
• Helps save time making decisions