Three years of breaking things to make them better
Donny [email protected]

Conclusions
1. Failure Friday is awesome, you should do it
2. Don’t automate it… yet
3. When it gets boring, switch it up

What is Failure Friday?
Failure Friday is a fault injection test against our production environment.

Our Basic Attacks
• Stop/start a process
• Suspend/resume a process
• Reboot a machine
• Network isolation
• Add latency & packet loss

Larger-scale events
• Ramp up traffic to the disaster recovery site
• Fail over database master
• Take down one data centre (one region, one AZ)

Benefits of Failure Friday
• Are you sure your process comes up after a reboot?
• If one machine is slow, does it act as a tarpit and slow down others?
• Does your DR work?
• Get people comfortable touching production
• Make sure your monitoring and alerting work

How to get started
1. Don’t automate
2. Pick reasonable problems, test in staging first
3. Track results in your task tracker (JIRA, etc.)

Our Basic Attacks
2013
• Stop/start a process
• Suspend/resume a process
• Reboot a machine
• Network isolation
• Add latency & packet loss
2016
• Stop/start a process
• Suspend/resume a process
• Reboot a machine
• Network isolation
• Add latency & packet loss

Game Day Hour - Number 1
• Test major incident response
• Cause fake incident, fix it, retro on our response focusing on communication
• Don’t make it a semi-surprise

Game Day Hour - Number 2
• Database issue, but… most of the ops team is at Devops Days Sydney
• Many variations on this:
  • team X is at an offsite
  • office Y is hit by a natural disaster

Chaos Cat
• Reboot a random machine
• Add latency/packet loss to a machine for 7 minutes
• Ultimately: run a full plan
• Remember: it was 3 years before we started automating
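
The slides give no implementation details for Chaos Cat. A minimal sketch of the selection step, assuming a plain list of hostnames and only the two attacks named above, might look like this. The function name, the plan dictionary and the hostnames are hypothetical; only the 7-minute duration comes from the slide. The sketch picks a target and describes the attack, and does not execute anything.

```python
import random
from typing import Dict, List, Optional

LATENCY_DURATION_MINUTES = 7  # duration stated on the slide


def pick_attack(machines: List[str],
                rng: Optional[random.Random] = None) -> Dict[str, object]:
    """Choose a random machine and one of the two Chaos Cat attacks.

    Returns a description of the attack; running it is left to the caller.
    """
    if not machines:
        raise ValueError("no machines to choose from")
    rng = rng or random.Random()
    plan: Dict[str, object] = {
        "target": rng.choice(machines),
        "attack": rng.choice(["reboot", "latency_and_packet_loss"]),
    }
    # Only the network attack is time-boxed; a reboot ends on its own.
    if plan["attack"] == "latency_and_packet_loss":
        plan["duration_minutes"] = LATENCY_DURATION_MINUTES
    return plan
```

Passing a seeded `random.Random` makes a run reproducible, which helps when the results are written up afterwards.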

Team-specific FFs
• A single FF is a bottleneck
• Individual teams can run their own:
  • On-call training - cause pages on a test account
  • “What’s your status?” - look at dashboards for previous time periods: some healthy, some unhealthy
• Reviewing dashboards is a gold mine!
• For all exercises, have a pair act with the rest of the team observing

More Game Days
• Team / office is unavailable
• GitHub is down
• Slack / Hangouts is down
• CI server is down

Capacity planning
• If your traffic spiked by 20%, do you have enough capacity?
• Take down servers and find out!
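
One way to turn this into a number, assuming traffic stays constant and the load balancer spreads it evenly: removing k of N servers multiplies the load on each remaining server by N / (N - k), so simulating a spike of p percent needs the smallest k with N / (N - k) >= 1 + p/100. The helper below is a sketch under those assumptions; its name and parameters are hypothetical.

```python
def servers_to_remove(total_servers: int, spike_percent: int) -> int:
    """Smallest number of servers to take down so that each remaining
    server sees at least `spike_percent` more load than before.

    Assumes constant traffic that is balanced evenly across servers.
    """
    if total_servers < 1 or spike_percent < 0:
        raise ValueError("need at least one server and a non-negative spike")
    # k >= N * p / (100 + p); integer ceiling avoids float rounding.
    k = -(-total_servers * spike_percent // (100 + spike_percent))
    if k >= total_servers:
        raise ValueError("cannot simulate this spike without removing every server")
    return k
```

For example, with 12 servers a 20% spike corresponds to taking down 2 of them, since 12 / 10 = 1.2.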

Error budget
“100% is the wrong reliability target for basically everything”
https://landing.google.com/sre/interview/ben-treynor.html

Error budget
Nines  Availability   Monthly unavailability
1      90%            3 days
2      99%            7.3 hours
3      99.9%          43.8 minutes
4      99.99%         4.38 minutes
5      99.999%        26.3 seconds
6      99.9999%       2.6 seconds
7      99.99999%      263 milliseconds
8      99.999999%     26.3 milliseconds
9      99.9999999%    2.63 milliseconds
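
The figures in this table follow from simple arithmetic. The sketch below assumes an average month of 365/12 days (730 hours), which reproduces the 43.8-minute figure for three nines; a 30-day month gives slightly smaller values, such as 7.2 hours for two nines. The function name is hypothetical.

```python
# Assumption: an average month of 365/12 days = 730 hours.
SECONDS_PER_MONTH = 730 * 3600  # 2,628,000 seconds


def monthly_unavailability_seconds(nines: int) -> float:
    """Allowed downtime per month, in seconds, for a given number of nines.

    One nine is 90% availability, two nines is 99%, and so on, so the
    allowed unavailable fraction of the month is 1 / 10**nines.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return SECONDS_PER_MONTH / 10 ** nines


if __name__ == "__main__":
    for n in range(1, 10):
        print(n, monthly_unavailability_seconds(n), "seconds")
```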

Error budget
• Crazy idea: if you’re under budget for 3 months in a row, take down your service to use up the budget
• Find hidden dependencies (priority inversion)
• Gut-check your target availability
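
The "under budget for 3 months in a row" check could be sketched as follows, assuming downtime is recorded as minutes per month and that a month is taken as 43,800 minutes (365/12 days). The function names, the input format and the example numbers are hypothetical; the slides do not describe how the budget is tracked.

```python
from typing import Sequence

MINUTES_PER_MONTH = 43_800.0  # assumption: 365/12 days


def budget_minutes(target_percent: float) -> float:
    """Monthly error budget in minutes for an availability target."""
    return MINUTES_PER_MONTH * (100.0 - target_percent) / 100.0


def under_budget_streak(downtime_minutes: Sequence[float],
                        target_percent: float,
                        months: int = 3) -> bool:
    """True if each of the last `months` months used less than the budget.

    `downtime_minutes` is ordered oldest first, one entry per month.
    """
    recent = list(downtime_minutes)[-months:]
    if len(recent) < months:
        return False  # not enough history to decide
    budget = budget_minutes(target_percent)
    return all(used < budget for used in recent)
```

With a 99.9% target the budget is about 43.8 minutes a month, so a history of 10, 5 and 0 minutes of downtime would qualify, while a single 60-minute month in the window would not.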

Conclusions
1. Failure Friday is awesome, you should do it
2. Don’t automate it… yet
3. When it gets boring, switch it up