mean time to sleep: quantifying the on-call experience
TRANSCRIPT
@lozzd • @ryan_frantz
Mean Time to Sleep: Quantifying the on-call experience
Laurie Denness (@lozzd)
Ryan Frantz (@ryan_frantz)
Who is in an on-call rotation?
Who is on call right now?
Who feels like on-call sucks?
Welcome. How is on call?
Let’s help our people sleep
Make on-call more bearable
Incremental Changes
Email to Acknowledge
• Replying “ack” with some context makes it appear in IRC too
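A minimal sketch of how an email reply might be recognized as an acknowledgement. The function name, the pattern, and the returned dictionary are assumptions for illustration, not the tooling from the talk; passing the result on to Nagios and IRC is left out.

```python
import re

# Hypothetical rule: the first non-empty line of the reply starts with "ack",
# optionally followed by free-text context.
ACK_PATTERN = re.compile(r"^\s*ack\b[\s:,-]*(.*)$", re.IGNORECASE)


def parse_ack_reply(body):
    """Return {'ack': True, 'context': ...} if the reply acknowledges an alert, else None."""
    for line in body.splitlines():
        if not line.strip():
            continue  # skip leading blank lines
        if line.lstrip().startswith(">"):
            return None  # reached the quoted alert without finding an ack
        match = ACK_PATTERN.match(line)
        if match:
            return {"ack": True, "context": match.group(1).strip()}
        return None  # first real line is something other than an ack
    return None
```

The context string is what would be relayed to IRC so the rest of the team can see who picked up the alert and why.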
Email Only Alerts
• Do you care if RAID becomes degraded in the middle of the night?
• Do you care if one of your web/hadoop/X boxes dies in the middle of the night?
• Can it wait until the morning?
Added Context
• Previous service state
• Duration in that state
• Alert recipients
• Notes
• Link to runbook
Alert Storms
• Reduce noise when 200 things go wrong by aggregating
• Trigger an alert when the percentage of a pool over threshold gets too high
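The pool-percentage idea can be sketched as follows. The function name, the input shape, and the 20% default are assumptions chosen for illustration, not values from the talk.

```python
def pool_alert(host_states, threshold_pct=20.0):
    """Return one aggregate alert if enough of the pool is failing, else None.

    host_states maps host name -> True when that host is failing its check.
    """
    if not host_states:
        return None
    failing = sorted(host for host, bad in host_states.items() if bad)
    percent = 100.0 * len(failing) / len(host_states)
    if percent >= threshold_pct:
        # One notification for the whole pool instead of one per host.
        return {"failing": failing, "percent": percent}
    return None
```

With this shape, 200 failing hosts in one pool produce a single page that lists them, and one dead box in a large pool produces none.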
Low friction downtime
• IRC commands to downtime hosts/sets of hosts
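A rough sketch of turning a chat message into a Nagios downtime. The `!downtime <host> <duration> <comment>` syntax and the function name are hypothetical; the output follows the Nagios `SCHEDULE_HOST_DOWNTIME` external command format, and writing it to the Nagios command file is left out.

```python
import re

UNITS = {"m": 60, "h": 3600, "d": 86400}


def downtime_command(irc_line, author, now):
    """Translate '!downtime <host> <N><m|h|d> <comment>' into a Nagios external command."""
    match = re.match(r"^!downtime\s+(\S+)\s+(\d+)([mhd])\s+(.+)$", irc_line.strip())
    if not match:
        return None
    host, amount, unit, comment = match.groups()
    duration = int(amount) * UNITS[unit]
    start = int(now)
    end = start + duration
    # fixed=1 (starts and ends at the given times), trigger_id=0 (not triggered)
    return (f"[{start}] SCHEDULE_HOST_DOWNTIME;{host};{start};{end};1;0;"
            f"{duration};{author};{comment}")
```

A real bot would also need to reject comments containing semicolons, since the command format uses them as field separators.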
Downtime Reminders
• Help prevent false notifications
Event Handlers
• Teach Nagios to augment the team
• Restarting services (nscd)
• Re-running jobs (transient errors)
• Duplicate crons (Chef)
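A sketch of the decision an event handler might make before restarting something like nscd. The function name and the choice of the third attempt are assumptions; the arguments mirror the standard Nagios macros `$SERVICESTATE$`, `$SERVICESTATETYPE$` and `$SERVICEATTEMPT$`, and the restart itself is left out.

```python
def should_restart(state, state_type, attempt, restart_on_attempt=3):
    """Decide whether an event handler should try to restart the service.

    The three arguments correspond to $SERVICESTATE$, $SERVICESTATETYPE$
    and $SERVICEATTEMPT$ as passed on the handler's command line.
    """
    if state != "CRITICAL":
        return False  # OK, WARNING, UNKNOWN: leave the service alone
    if state_type == "HARD":
        return True  # retries are exhausted, so try the fix
    # SOFT state: Nagios is still retrying, so act once on a chosen attempt
    return int(attempt) == restart_on_attempt
```

Acting during SOFT states gives the fix a chance to work before the check goes HARD and a person gets notified.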
Incremental Improvements?
• Maybe
• More ideas; hoped they’d stick
• We didn’t know because we didn’t measure
Measure Everything
• “You can’t manage what you can’t measure.” - Deming (not really)
• But, we weren’t measuring anything
What should we measure?
• Volume of alerts (total, by severity)
• Alert categorization (actionable vs not)
• Alert times: Off-hours?
• Noisy hosts/services
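The four measurements above could be computed from a list of alert records roughly like this. The record fields, the function name, and the 09:00 to 18:00 working day are assumptions for illustration; this is not how Opsweekly stores or reports its data.

```python
from collections import Counter
from datetime import datetime


def summarize(alerts, day_start=9, day_end=18):
    """Summarize alert dicts with keys: time, host, service, severity, actionable."""
    by_severity = Counter(a["severity"] for a in alerts)
    # Anything outside the assumed working day counts as off-hours.
    off_hours = sum(1 for a in alerts
                    if not (day_start <= a["time"].hour < day_end))
    actionable = sum(1 for a in alerts if a["actionable"])
    noisiest = Counter((a["host"], a["service"]) for a in alerts).most_common(3)
    return {
        "total": len(alerts),
        "by_severity": dict(by_severity),
        "actionable": actionable,
        "off_hours": off_hours,
        "noisiest": noisiest,
    }
```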
Opsweekly
We have data.
Aggregate alerts
1. Look at reports
2. Wow, look at all those alerts for the same thing
3. Aggregate alerts
4. Profit
Parent relationships
• Prevent alerts due to upstream issues (downed switch)
• Standard Nagios feature
• Computers can do this for us!
Parent relationships
• signalvnoise.com
• LLDP on host shows switch info
• Put switch info into Chef using ohai
• Create Nagios host configs based on data
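The last step, generating host configs from the collected switch data, might look something like this. The function name and the `generic-host` template are hypothetical; `parents` is the standard Nagios host directive, and how the switch name gets from LLDP and ohai into this function is not shown.

```python
def host_definition(host_name, address, switch_name, template="generic-host"):
    """Render a Nagios host object whose parent is the switch reported by LLDP."""
    return "\n".join([
        "define host {",
        f"    use        {template}",  # hypothetical template name
        f"    host_name  {host_name}",
        f"    address    {address}",
        f"    parents    {switch_name}",
        "}",
    ])
```

With the parent set, Nagios can mark the host UNREACHABLE rather than DOWN when its switch is down, which is what keeps a dead switch from paging once per host behind it.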
Service Dependencies
• Hundreds of Graphite-sourced checks
• Created new template that sets a servicegroup that depends on the Graphite service.
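One way such a dependency could be expressed is a Nagios `servicedependency` object, rendered here from Python. The talk does not show the actual config, so the function name, the example names, and the failure criteria are assumptions.

```python
def graphite_dependency(graphite_host, graphite_service, dependent_group):
    """Render a servicedependency making a servicegroup depend on the Graphite service."""
    return "\n".join([
        "define servicedependency {",
        f"    host_name                      {graphite_host}",
        f"    service_description            {graphite_service}",
        f"    dependent_servicegroup_name    {dependent_group}",
        # Suppress notifications for dependent checks while the Graphite
        # service itself is WARNING, UNKNOWN or CRITICAL.
        "    notification_failure_criteria  w,u,c",
        "}",
    ])
```

The intent is that a Graphite outage produces one alert for Graphite instead of hundreds for the checks that read from it.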
Keep on analyzing
• It’s okay to just identify and delete alerts that don’t mean anything!
• Or move them to email only
More Quantification!
Reviewing the Year
• Use reports
• Use search
• Identify noisiest alerts
Reviewing the Year
(Yearly report screenshots)
Nagios Hack Day/Week
• Great time to look at this data and make improvements
• If disk space is the worst, can we rethink that?
Outsource Your Alerts
• Etsy’s Search Team has on-call rotation
• A whole subset of alerts that don’t go to Ops
• More teams starting this but Search Team is at 100%
Sleep Tracking
“Track your life!” - @ph
Did it work?
• Yes.
• Signal to noise ratio is much better
• Okay, so it’s a little more complicated than that
• Adding alerts all the time means new “annoying” things
• Keep monitoring
What’s next?
The Effect of Sleep
• We focus on people’s sleep
• But not the effect on the person when they come to work the next day
• How do we measure the impact of sleep loss/deprivation?
• Subjective: Pittsburgh Sleepiness Scale
• Objective: Psychomotor vigilance task (PVT) to measure alertness
Beyond Opsweekly
• Employee wellness program
• Security have started using past sleep data to check for weird logins to systems
More context: nagios-herald
More reports
• We have a bunch of data; we can build better reports and drill down to analyze alerting trends
• Can we attribute particular actions to reduced noise volume?
• Aggregate alerts
• Non-downtimed alerts
Thanks
Etsy Ops Team
SewMona
Open Source/Links
• http://ryanfrantz.com/mtts
• https://github.com/etsy/opsweekly
• https://github.com/etsy/nagios-herald
• https://github.com/jonlives/jawboneup_to_graphite
• http://codeascraft.com
Questions?