Three years of breaking things to make them better - Devops Days Sydney 2016

Donny Nadolny

Post on 19-Jan-2017

Conclusions

1. Failure Friday is awesome, you should do it
2. Don’t automate it… yet
3. When it gets boring, switch it up

What is Failure Friday?

Failure Friday is a fault injection test against our production environment.

Our Basic Attacks

• Stop/start a process
• Suspend/resume a process
• Reboot a machine
• Network isolation
• Add latency & packet loss
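
As an illustration of the suspend/resume attack listed above, here is a minimal sketch that pauses and resumes a process with the POSIX SIGSTOP and SIGCONT signals. The talk does not say which tooling was actually used, so the helper names (`suspend`, `resume`, `run_suspend_attack`) and the throwaway child process standing in for a real service are assumptions made for this example. It only touches a process it starts itself.

```python
import os
import signal
import subprocess
import sys
import time


def suspend(pid: int) -> None:
    """Freeze a process in place (hypothetical helper, POSIX only)."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid: int) -> None:
    """Let a previously suspended process continue running."""
    os.kill(pid, signal.SIGCONT)


def run_suspend_attack(pid: int, seconds: float) -> None:
    """Suspend the target for a fixed time, then always resume it."""
    suspend(pid)
    try:
        time.sleep(seconds)
    finally:
        # Resume even if the sleep is interrupted, so the target is
        # never left frozen by accident.
        resume(pid)


if __name__ == "__main__":
    # Stand-in for a real service: a child process that just sleeps.
    victim = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    try:
        run_suspend_attack(victim.pid, seconds=1.0)
        print("still alive after resume:", victim.poll() is None)
    finally:
        victim.terminate()
        victim.wait()
```

Putting the resume in a `finally` block reflects the manual, supervised style the talk recommends: whoever runs the attack should always be able to undo it.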

Larger-scale events

• Ramp up traffic to the disaster recovery site
• Fail over database master
• Take down one data centre (one region, one AZ)

Benefits of Failure Friday

• Are you sure your process comes up after a reboot?
• If one machine is slow, does it act as a tarpit and slow down others?
• Does your DR work?
• Get people comfortable touching production
• Make sure your monitoring and alerting works

How to get started

1. Don’t automate
2. Pick reasonable problems, test in staging first
3. Track results in your task tracker (JIRA, etc.)

What’s new?

Our Basic Attacks

2013
• Stop/start a process
• Suspend/resume a process
• Reboot a machine
• Network isolation
• Add latency & packet loss

2016
• Stop/start a process
• Suspend/resume a process
• Reboot a machine
• Network isolation
• Add latency & packet loss

Game Day Hour - Number 1

• Test major incident response
• Cause fake incident, fix it, retro on our response focusing on communication
• Don’t make it a semi-surprise

Game Day Hour - Number 2

• Database issue, but… most of the ops team is at Devops Days Sydney
• Many variations on this:
  • team X is at an offsite
  • office Y is hit by a natural disaster

Chaos Cat

• Reboot a random machine
• Add latency/packet loss to a machine for 7 minutes
• Ultimately: run a full plan
• Remember: it was 3 years before we started automating
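
The two Chaos Cat actions above can be sketched as a small dry-run planner. This is not the actual Chaos Cat implementation, which the slides do not show. The host names, the `eth0` interface, the 200 ms delay, the 5% loss and the use of Linux `tc` with the `netem` qdisc are all assumptions for illustration; only the 7-minute duration and the two attack types come from the slide. The sketch builds the commands but does not run them.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Attack:
    host: str
    start_cmd: List[str]
    stop_cmd: Optional[List[str]]  # None when there is nothing to undo
    duration_s: int                # how long the fault stays in place


def reboot_attack(host: str) -> Attack:
    """Plan a reboot of one machine."""
    return Attack(host, ["sudo", "reboot"], None, 0)


def netem_attack(host: str, iface: str = "eth0", delay_ms: int = 200,
                 loss_pct: int = 5, minutes: int = 7) -> Attack:
    """Plan added latency and packet loss using tc/netem (assumed tooling)."""
    start = ["sudo", "tc", "qdisc", "add", "dev", iface, "root", "netem",
             "delay", f"{delay_ms}ms", "loss", f"{loss_pct}%"]
    stop = ["sudo", "tc", "qdisc", "del", "dev", iface, "root"]
    return Attack(host, start, stop, minutes * 60)


def pick_attack(hosts: List[str], rng: random.Random) -> Attack:
    """Choose a random machine and one of the two attack types."""
    host = rng.choice(hosts)
    kind = rng.choice(["reboot", "netem"])
    return reboot_attack(host) if kind == "reboot" else netem_attack(host)


if __name__ == "__main__":
    # Hypothetical inventory; a real one would come from your own tooling.
    inventory = ["app-1", "app-2", "db-1"]
    attack = pick_attack(inventory, random.Random())
    print("target:", attack.host)
    print("start: ", " ".join(attack.start_cmd))
    if attack.stop_cmd is not None:
        print(f"after {attack.duration_s}s:", " ".join(attack.stop_cmd))
```

Keeping the planner separate from execution makes it easy to review a plan by hand first, which fits the reminder that automation came only after three years of manual runs.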

Team-specific FFs

• A single FF is a bottleneck
• Individual teams can run their own:
  • On-call training - cause pages on a test account
  • “What’s your status?” - look at dashboards for previous time periods: some healthy, some unhealthy
• Reviewing dashboards is a gold mine!
• For all exercises, have a pair act with the rest of the team observing

What’s next?

More Game Days

• Team / office is unavailable
• GitHub is down
• Slack / Hangouts is down
• CI server is down

Capacity planning

• If your traffic spiked by 20%, do you have enough capacity?
• Take down servers and find out!
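
The arithmetic behind "take down servers and find out" can be sketched as follows. Assuming traffic is spread evenly and stays constant during the test, removing k of n servers multiplies the load on each remaining server by n / (n - k). To simulate a spike of p percent you therefore need k >= n * p / (100 + p). The function names are hypothetical, and the even-load assumption is a simplification that may not hold for your system.

```python
def servers_to_take_down(total_servers: int, spike_percent: int) -> int:
    """Smallest number of servers to remove so that the remaining ones
    see at least `spike_percent` more load each (even load assumed)."""
    if total_servers <= 0 or spike_percent < 0:
        raise ValueError("need at least one server and a non-negative spike")
    numerator = total_servers * spike_percent
    denominator = 100 + spike_percent
    removed = -(-numerator // denominator)  # ceiling division on integers
    if removed >= total_servers:
        raise ValueError("too few servers to simulate this spike safely")
    return removed


def simulated_spike_percent(total_servers: int, removed: int) -> float:
    """Extra per-server load, in percent, after removing `removed` servers."""
    if not 0 <= removed < total_servers:
        raise ValueError("removed must be between 0 and total_servers - 1")
    return (total_servers / (total_servers - removed) - 1) * 100


if __name__ == "__main__":
    fleet = 12  # hypothetical fleet size
    k = servers_to_take_down(fleet, spike_percent=20)
    print(f"take down {k} of {fleet} servers")
    print(f"simulated spike: {simulated_spike_percent(fleet, k):.1f}%")
```

With small fleets the rounding matters: removing whole servers can overshoot the target spike considerably, so it is worth checking the simulated figure before running the test.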

Error budget

“100% is the wrong reliability target for basically everything”

https://landing.google.com/sre/interview/ben-treynor.html

Error budget

Nines   Availability    Monthly Unavailability
1       90%             3 days
2       99%             7.2 hours
3       99.9%           43.8 minutes
4       99.99%          4.38 minutes
5       99.999%         25.9 seconds
6       99.9999%        2.6 seconds
7       99.99999%       263 milliseconds
8       99.999999%      26.3 milliseconds
9       99.9999999%     2.63 milliseconds
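
The figures in the table can be reproduced with a short calculation: the allowed downtime is the unavailable fraction multiplied by the length of a month. The sketch below is an assumption-laden illustration rather than the author's method. In particular, the month length is a parameter, because the result depends on it: with a 30-day month, 99.9% allows 43.2 minutes, while with an average month of 365.25 / 12 days it allows about 43.8 minutes. The function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def monthly_unavailability_seconds(nines: int,
                                   days_per_month: float = 30.0) -> float:
    """Allowed downtime per month, in seconds, for a given number of nines.

    One nine means 90% availability, two nines 99%, and so on.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailable_fraction = 10.0 ** -nines
    return unavailable_fraction * days_per_month * SECONDS_PER_DAY


def humanize(seconds: float) -> str:
    """Render a duration using the largest unit that keeps the value >= 1."""
    if seconds >= SECONDS_PER_DAY:
        return f"{seconds / SECONDS_PER_DAY:.1f} days"
    if seconds >= 3600:
        return f"{seconds / 3600:.1f} hours"
    if seconds >= 60:
        return f"{seconds / 60:.1f} minutes"
    if seconds >= 1:
        return f"{seconds:.1f} seconds"
    return f"{seconds * 1000:.3g} milliseconds"


if __name__ == "__main__":
    for n in range(1, 10):
        budget = monthly_unavailability_seconds(n)
        print(f"{n} nines: {humanize(budget)}")
```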

Error budget

• Crazy idea: if you’re under budget for 3 months in a row, take down your service to use up the budget
• Find hidden dependencies (priority inversion)
• Gut-check your target availability
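
The "under budget for 3 months in a row" rule above can be expressed as a simple check. This is a minimal sketch under stated assumptions: downtime is recorded per month in minutes, every month is treated as 30 days, and the function names are hypothetical. The slides only describe the idea, not an implementation.

```python
from typing import Sequence


def monthly_budget_minutes(target_availability: float,
                           days_in_month: int = 30) -> float:
    """Downtime allowed per month for a target such as 0.999."""
    if not 0.0 < target_availability < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return (1.0 - target_availability) * days_in_month * 24 * 60


def should_spend_budget(downtime_minutes: Sequence[float],
                        target_availability: float,
                        streak: int = 3) -> bool:
    """True when each of the last `streak` months stayed under budget."""
    if len(downtime_minutes) < streak:
        return False  # not enough history to decide
    budget = monthly_budget_minutes(target_availability)
    return all(used < budget for used in downtime_minutes[-streak:])


def unspent_minutes(last_month_downtime: float,
                    target_availability: float) -> float:
    """Budget left over from the most recent month, never negative."""
    budget = monthly_budget_minutes(target_availability)
    return max(0.0, budget - last_month_downtime)


if __name__ == "__main__":
    history = [50.0, 10.0, 5.0, 0.0]  # hypothetical minutes down per month
    target = 0.999
    if should_spend_budget(history, target):
        left = unspent_minutes(history[-1], target)
        print(f"under budget 3 months running, {left:.1f} minutes to spend")
    else:
        print("keep the budget for real incidents")
```

A deliberate outage sized from the unspent minutes is one way to surface the hidden dependencies mentioned above: services that quietly rely on yours being more available than its target.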

Conclusions

1. Failure Friday is awesome, you should do it
2. Don’t automate it… yet
3. When it gets boring, switch it up