mean time to sleep: quantifying the on-call experience

Post on 21-Apr-2017

TRANSCRIPT

@lozzd • @ryan_frantz

Mean Time to Sleep: Quantifying the on-call experience

Laurie Denness (@lozzd)

Ryan Frantz (@ryan_frantz)

Who is in an on-call rotation?

Who is on call right now?

Who feels like on-call sucks?

Welcome. How is on call?

Let’s help our people sleep

Make on-call more bearable

Incremental Changes

Email to Acknowledge

• Replying “ack” with some context makes it appear in IRC too

Email Only Alerts

• Do you care if RAID becomes degraded in the middle of the night?

• Do you care if one of your web/hadoop/X boxes dies in the middle of the night?

• Can it wait until the morning?

Added Context

• Previous service state

• Duration in that state

• Alert recipients

• Notes

• Link to runbook
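The fields listed above could be assembled into a notification body roughly as follows. This is a minimal Python sketch for illustration only, not how nagios-herald does it: the function names, the argument names, and the example runbook URL are all assumptions.

```python
from datetime import timedelta


def format_duration(delta: timedelta) -> str:
    """Render a timedelta compactly, e.g. '2d 3h 15m'."""
    total_minutes = int(delta.total_seconds()) // 60
    days, remainder = divmod(total_minutes, 60 * 24)
    hours, minutes = divmod(remainder, 60)
    parts = []
    if days:
        parts.append(f"{days}d")
    if hours:
        parts.append(f"{hours}h")
    parts.append(f"{minutes}m")
    return " ".join(parts)


def format_alert_context(previous_state, time_in_state, recipients, notes, runbook_url):
    """Build the extra context block appended to an alert (hypothetical layout)."""
    lines = [
        f"Previous state: {previous_state} (for {format_duration(time_in_state)})",
        f"Recipients: {', '.join(recipients)}",
        f"Notes: {notes}",
        f"Runbook: {runbook_url}",
    ]
    return "\n".join(lines)


# Example values are made up for illustration.
example = format_alert_context(
    previous_state="OK",
    time_in_state=timedelta(days=2, hours=3, minutes=15),
    recipients=["ops-oncall", "search-oncall"],
    notes="Disk fills up when log rotation fails.",
    runbook_url="https://wiki.example.com/runbooks/disk-space",
)
```

The point of the sketch is that everything the on-call person needs in order to decide what to do arrives in the alert itself.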

Alert Storms

• Reduce noise when 200 things go wrong by aggregating

• Trigger an alert when the percentage of a pool that is failing goes over a threshold
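One way to picture the pool-percentage idea is a single aggregate check that looks at every host in a pool and alerts once. The sketch below is in Python; the function name and the 10%/25% thresholds are assumptions chosen for the example, not values from the talk.

```python
def pool_alert_state(host_ok, warn_pct=10.0, crit_pct=25.0):
    """Collapse per-host results into one pool-level state.

    host_ok maps host name -> True (healthy) or False (failing).
    Returns (state, percent_failing).
    """
    if not host_ok:
        return ("UNKNOWN", 0.0)
    failing = sum(1 for ok in host_ok.values() if not ok)
    pct = 100.0 * failing / len(host_ok)
    if pct >= crit_pct:
        return ("CRITICAL", pct)
    if pct >= warn_pct:
        return ("WARNING", pct)
    return ("OK", pct)


# 200 web hosts, 60 of them failing: one CRITICAL instead of 60 pages.
pool = {f"web{i:03d}": i >= 60 for i in range(200)}
state, pct = pool_alert_state(pool)
```

With this shape, a single dead box stays below the warning threshold and can wait for the morning, while a real storm produces exactly one page.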

Low friction downtime

• IRC commands to downtime hosts/sets of hosts

Downtime Reminders

• Help prevent false notifications

Event Handlers

• Teach Nagios to augment the team

• Restarting services (nscd)

• Re-running jobs (transient errors)

• Duplicate crons (Chef)
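Nagios hands an event handler the check's state through macros such as $SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$. A handler for the nscd case might decide whether to act along these lines. This is a hedged Python sketch: the restart command, the function names, and the choice of the third soft attempt are assumptions, not the handlers described in the talk.

```python
import subprocess

# Assumption: how nscd is restarted on the target host.
RESTART_COMMAND = ["service", "nscd", "restart"]


def should_restart(state, state_type, attempt, soft_attempt=3):
    """Act on a HARD CRITICAL, or on one chosen SOFT retry just before it."""
    if state != "CRITICAL":
        return False
    if state_type == "HARD":
        return True
    return state_type == "SOFT" and attempt == soft_attempt


def restart_service():
    """Run the (assumed) restart command; failures are left for Nagios to report."""
    subprocess.run(RESTART_COMMAND, check=False)


def handle_event(state, state_type, attempt, soft_attempt=3, restart=restart_service):
    """Return True if a restart was attempted for this state change."""
    if should_restart(state, state_type, attempt, soft_attempt):
        restart()
        return True
    return False
```

In a real setup the three macro values would be passed as command-line arguments from the Nagios command definition; the `restart` parameter is injectable so the decision logic can be exercised without touching a service.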

Incremental Improvements?

• Maybe

• More ideas; hoped they’d stick

• We didn’t know because we didn’t measure

Measure Everything

• “You can’t manage what you can’t measure.” - Deming (not really)

• But, we weren’t measuring anything

What should we measure?

• Volume of alerts (total, by severity)

• Alert categorization (actionable vs not)

• Alert times: Off-hours?

• Noisy hosts/services
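The four measurements above can all be derived from a flat list of alert records. The following Python sketch shows one possible summary; the record fields (`host`, `service`, `severity`, `actionable`, `time`) and the 09:00 to 18:00 weekday definition of working hours are assumptions for illustration, not the Opsweekly data model.

```python
from collections import Counter
from datetime import datetime


def is_off_hours(ts, day_start=9, day_end=18):
    """True on weekends or outside day_start..day_end local time."""
    return ts.weekday() >= 5 or not (day_start <= ts.hour < day_end)


def summarize_alerts(alerts, top_n=3):
    """Summarize volume, categorization, timing and noisy host/service pairs."""
    actionable = sum(1 for a in alerts if a["actionable"])
    noisy = Counter((a["host"], a["service"]) for a in alerts)
    return {
        "total": len(alerts),
        "by_severity": dict(Counter(a["severity"] for a in alerts)),
        "actionable": actionable,
        "not_actionable": len(alerts) - actionable,
        "off_hours": sum(1 for a in alerts if is_off_hours(a["time"])),
        "noisiest": noisy.most_common(top_n),
    }


# Made-up sample data: two disk alerts on a Monday, one MySQL alert on a Saturday.
sample = [
    {"host": "web01", "service": "Disk Space", "severity": "CRITICAL",
     "actionable": False, "time": datetime(2014, 5, 5, 3, 0)},
    {"host": "web01", "service": "Disk Space", "severity": "CRITICAL",
     "actionable": False, "time": datetime(2014, 5, 5, 14, 0)},
    {"host": "db01", "service": "MySQL", "severity": "WARNING",
     "actionable": True, "time": datetime(2014, 5, 10, 14, 0)},
]
summary = summarize_alerts(sample)
```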

Opsweekly

We have data.

Aggregate alerts

1. Look at reports

2. Wow, look at all those alerts for the same thing

3. Aggregate alerts

4. Profit

Parent relationships

• Prevent alerts due to upstream issues (downed switch)

• Standard Nagios feature

• Computers can do this for us!

Parent relationships

• signalvnoise.com

• LLDP on host shows switch info

• Put switch info into Chef using ohai

• Create Nagios host configs based on data
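The last step above, turning discovered switch data into Nagios host definitions, could look something like this. It is a Python sketch under stated assumptions: the node dictionaries stand in for whatever Chef/ohai actually exposes, the `lldp_switch` key and the `generic-host` template name are hypothetical, and only the standard `use`, `host_name`, `address` and `parents` host directives are relied on.

```python
def nagios_host_definition(host_name, address, parent_switch=None, template="generic-host"):
    """Render one Nagios host object, adding `parents` when the switch is known."""
    lines = [
        "define host {",
        f"    use        {template}",
        f"    host_name  {host_name}",
        f"    address    {address}",
    ]
    if parent_switch:
        lines.append(f"    parents    {parent_switch}")
    lines.append("}")
    return "\n".join(lines)


def render_hosts(nodes):
    """Render definitions for a list of node dicts (hypothetical shape)."""
    return "\n\n".join(
        nagios_host_definition(n["name"], n["ip"], n.get("lldp_switch"))
        for n in nodes
    )


# Made-up inventory: the second node has no LLDP data yet.
nodes = [
    {"name": "web01", "ip": "10.0.0.11", "lldp_switch": "sw-rack12"},
    {"name": "web02", "ip": "10.0.0.12"},
]
config = render_hosts(nodes)
```

With `parents` set, Nagios can treat hosts behind a downed switch as unreachable rather than down, which is what suppresses the flood of per-host alerts.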

Service Dependencies

• Hundreds of Graphite-sourced checks

• Created new template that sets a servicegroup that depends on the Graphite service.
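A dependency of a whole servicegroup on the Graphite service can be expressed as a Nagios `servicedependency` object. The Python sketch below only generates such a definition as text. The host, service and servicegroup names are made up, the failure criteria are one plausible choice, and whether `dependent_servicegroup_name` can be used without the other dependent fields should be checked against your Nagios version.

```python
def graphite_service_dependency(graphite_host, graphite_service, dependent_servicegroup):
    """Render a servicedependency: the group's notifications depend on Graphite."""
    return "\n".join([
        "define servicedependency {",
        f"    host_name                      {graphite_host}",
        f"    service_description            {graphite_service}",
        f"    dependent_servicegroup_name    {dependent_servicegroup}",
        # Keep running the dependent checks, but do not notify while
        # Graphite itself is in WARNING, UNKNOWN or CRITICAL.
        "    execution_failure_criteria     n",
        "    notification_failure_criteria  w,u,c",
        "}",
    ])


# Hypothetical names for illustration.
dependency = graphite_service_dependency("graphite01", "Graphite", "graphite-checks")
```

The effect is that when Graphite itself breaks, the on-call person gets one alert for Graphite instead of hundreds for the checks that read from it.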

Keep on analyzing

• It’s okay to just identify and delete alerts that don’t mean anything!

• Or move them to email only

More Quantification!

Reviewing the Year

• Use reports

• Use search

• Identify noisiest alerts

[Yearly report screenshots]

Nagios Hack Day/Week

• Great time to look at this data and make improvements

• If Disk Space is the worst, can we rethink that?

Outsource Your Alerts

• Etsy’s Search Team has on-call rotation

• A whole subset of alerts that don’t go to Ops

• More teams starting this but Search Team is at 100%

Sleep Tracking

“Track your life!” - @ph

Did it work?

• Yes.

• Signal to noise ratio is much better

• Okay, so it’s a little more complicated than that

• Adding alerts all the time means new “annoying” things

• Keep monitoring

What’s next?

The Effect of Sleep

• We focus on people’s sleep

• But not the effect on the person when they come to work the next day

• How do we measure the impact of sleep loss/deprivation?

• Subjective: Pittsburgh Sleepiness Scale

• Objective: Psychomotor vigilance task (PVT) to measure alertness

Beyond Opsweekly

• Employee wellness program

• Security have started using past sleep data to check for weird logins to systems
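The login check mentioned above amounts to an interval lookup: was this person recorded as asleep when the login happened? The Python sketch below shows the idea only. The data shapes, the names, and the sample times are assumptions, and the source does not say how the security team actually implements it.

```python
from datetime import datetime


def logins_during_sleep(logins, sleep_periods):
    """Return the (user, time) logins that fall inside that user's recorded sleep.

    logins: list of (user, datetime)
    sleep_periods: dict of user -> list of (sleep_start, sleep_end)
    """
    suspicious = []
    for user, when in logins:
        for start, end in sleep_periods.get(user, []):
            if start <= when < end:
                suspicious.append((user, when))
                break
    return suspicious


# Made-up data: alice was asleep from 23:30 to 07:00.
sleep = {"alice": [(datetime(2014, 5, 5, 23, 30), datetime(2014, 5, 6, 7, 0))]}
logins = [
    ("alice", datetime(2014, 5, 6, 3, 12)),   # during recorded sleep
    ("alice", datetime(2014, 5, 6, 9, 5)),    # after waking
    ("bob", datetime(2014, 5, 6, 3, 30)),     # no sleep data for bob
]
flagged = logins_during_sleep(logins, sleep)
```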

More context: nagios-herald

More reports

• We have a bunch of data; we can build better reports and drill down to analyze alerting trends

• Can we attribute particular actions to reduced noise volume?

• Aggregate alerts

• Non-downtimed alerts

Thanks

Etsy Ops Team

SewMona

Open Source/Links

• http://ryanfrantz.com/mtts

• https://github.com/etsy/opsweekly

• https://github.com/etsy/nagios-herald

• https://github.com/jonlives/jawboneup_to_graphite

• http://codeascraft.com

Questions?