Page 1

Getting Started with Site Reliability Engineering

Jennifer Petoff, Google Ireland

Twitter: @jennski

28/01/2019

Melbourne (March 21-22), Auckland (March 26-27), Sydney (September 10-11)

DEVOPS TALKS CONFERENCE 2019

Page 2

12 years at Google

Wide range of project management and training/education experience:

● University Programs

● DCLK Publisher Training Team

● AdWords Global Customer Service

● Site Reliability Engineering

Jennifer Petoff (aka Dr. J)

Google Ireland

Page 3

Jennifer Petoff (aka Dr. J)

Google Ireland

Senior Program Manager in SRE for >5 years

● Lead and Global Program Manager for SRE EDU

● Co-editor of the original SRE Book

Page 4

Jennifer Petoff (aka Dr. J)

Google Ireland

Fun Facts

● PhD in Chemistry

● Part-time Travel Blogger at Sidewalk Safari

Page 5

Software engineering as a discipline focuses on designing and building rather than operating and maintaining, despite estimates that 40% [1] to 90% [2] of the total costs are incurred after launch.

[1] Glass, R. (2002). Facts and Fallacies of Software Engineering, Addison-Wesley Professional; p. 115.

[2] Dehaghani, S. M. H., & Hajrahimi, N. (2013). Which Factors Affect Software Projects Maintenance Cost More? Acta Informatica Medica, 21(1), 63–66. http://doi.org/10.5455/AIM.2012.21.63-66

Software's long-term cost

Image: Pixabay License. No attribution required.

Page 6

Incentives aren't aligned.

Developers: Agility

Operators: Stability

Page 7

DevOps

is a set of practices, guidelines and culture designed to break down silos in IT development, operations, architecture, networking and security.

class SRE implements DevOps

Site Reliability Engineering

is a set of practices we've found to work, some beliefs that animate those practices, and a job role.

Page 8

Reducing product lifecycle friction

Concept → Business → Development → Operations → Market

Agile solves this

DevOps solves this

Page 9

● Originated at Google in 2003

● Framework for operating large scale systems reliably

● "SRE is what happens when you ask a software engineer to design an operations function"

● Focus on running systems in production

What is Site Reliability Engineering?

Page 10

Site Reliability Engineering Principles

1 SRE needs Service Level Objectives, with consequences.

2 SREs must have time to make tomorrow better than today.

3 SRE teams have the ability to regulate their workload.

4 Failure is an opportunity to improve.

Page 11

Product lifecycle

Concept → Business → Development → Operations → Market

Site Reliability Engineering

solves this problem

Business Process

Page 12

But getting started can feel daunting...

Image: CC0 license: https://pxhere.com/en/photo/739800

Page 13

Service Level Objectives

Page 14

● Goal for how well the system should operate

● Tracks the customer experience

○ SLOs met = happy customers

○ Unhappy customers = SLOs not met

What is a Service Level Objective?

Page 15

● 99.99% of HTTP requests per month succeed with 200 OK

● 90% of HTTP requests returned in under 300ms

● 99% of log entries processed in under 5 minutes.

Example SLOs
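
A minimal sketch of how SLOs like the examples on this slide could be checked against measured data. The function names, sample measurements and window are hypothetical and chosen only to illustrate the arithmetic. The slides do not prescribe an implementation.

```python
def fraction_meeting(values, predicate):
    """Return the fraction of values for which predicate(value) is true."""
    values = list(values)
    if not values:
        raise ValueError("no measurements in the window")
    return sum(1 for v in values if predicate(v)) / len(values)


def slo_met(values, predicate, target):
    """True when the measured fraction of good events reaches the target."""
    return fraction_meeting(values, predicate) >= target


# Hypothetical measurements for one window (not real data).
status_codes = [200] * 9999 + [500]      # 10,000 requests, one failure
latencies_ms = [120] * 92 + [450] * 8    # 100 requests, 8 slower than 300 ms

# "99.99% of HTTP requests succeed with 200 OK"
availability_ok = slo_met(status_codes, lambda s: s == 200, 0.9999)

# "90% of HTTP requests returned in under 300ms"
latency_ok = slo_met(latencies_ms, lambda ms: ms < 300, 0.90)
```

Each SLO here reduces to the same shape: count the good events, divide by all events, and compare against the target.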

Page 16

● Service Level Agreements = contractual guarantees

● SLAs met != happy customers

But What About SLAs?

Page 17

● You could implement SLOs today for your application, but this is only a foundation.

● You need consequences.

What Next?

Page 18

Error Budget Policy

Page 19

How Reliable Do You Want To Be?

The Bosses of the Senate (1889): Public Domain

Page 20

How Reliable Do You Want To Be?

More!

The Bosses of the Senate (1889): Public Domain

Page 21

“Anything that can go wrong will go wrong.”

Murphy's Law

Public Domain Image

Page 22

“Anything that can go wrong, will…

Finagle's Law of Dynamic Negatives

Public Domain Image

Page 23

Public Domain Image

“Anything that can go wrong, will…

...at the worst possible moment.”

Finagle's Law of Dynamic Negatives

Page 24

100% is the wrong reliability target for basically everything.

Benjamin Treynor Sloss, Vice President of 24x7 Engineering, Google

Page 25

Reliability

Engineering Time

Development Velocity

Cost

SRE is About Balance

williamcho Pixabay License

Page 26

So we introduce a budget

Image Source: Florent Darrault CC BY-SA 2.0

Public Domain Image

Page 27

● Gap between perfect reliability and our SLO.

● This is a budget to be spent.

● Given an uptime SLO of 99.9%, after a 20 minute outage you still have 23 minutes of budget remaining for the month!

Error Budgets
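
The arithmetic behind the 99.9% example can be sketched as follows. This assumes a 30-day month, which matches the "per 30 days" column of the availability table at the end of the deck. The function names are illustrative, not from the talk.

```python
def error_budget_minutes(slo, window_days=30):
    """Total allowed downtime, in minutes, for an uptime SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def remaining_budget_minutes(slo, downtime_minutes, window_days=30):
    """Budget left after the downtime already incurred in the window."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


budget = error_budget_minutes(0.999)              # about 43.2 minutes
remaining = remaining_budget_minutes(0.999, 20)   # about 23.2 minutes
```

A 99.9% SLO leaves roughly 43 minutes of allowed downtime per 30 days, so a 20 minute outage leaves roughly 23 minutes, as the slide states.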

Page 28

● What you agree to do when the application exceeds its error budget.

● This is not "pay $$$"

● Must be something that will visibly improve reliability.

Error Budget Policy

Page 29

Until the application is again meeting its SLO and has some Error Budget:

● "No new feature launches allowed."

● "Sprint planning may only pull Postmortem Action Items from the backlog."

● "Software Development Team must meet with SRE Team daily to outline their improvements"

Error Budget Policy Examples
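
One way the first two example policies could be encoded is sketched below. This is an assumption-laden illustration: the backlog structure, the "postmortem_action" type label and the function names are hypothetical, and a real policy would be agreed between the teams rather than hard-coded.

```python
def has_error_budget(slo_target, measured_availability):
    """True while the application meets its SLO with some budget to spare."""
    return measured_availability > slo_target


def sprint_candidates(backlog, slo_target, measured_availability):
    """Return the backlog items a sprint may pull under the policy.

    While budget remains, everything is eligible. Once the budget is
    exhausted, only postmortem action items are eligible, which also
    means no new feature launches.
    """
    if has_error_budget(slo_target, measured_availability):
        return list(backlog)
    return [item for item in backlog if item["type"] == "postmortem_action"]


# Hypothetical backlog for illustration.
backlog = [
    {"title": "New checkout flow", "type": "feature"},
    {"title": "Add alert on queue depth", "type": "postmortem_action"},
]

within_budget = sprint_candidates(backlog, 0.999, 0.9995)
over_budget = sprint_candidates(backlog, 0.999, 0.998)
```

The point of the sketch is that the consequence is mechanical: once the measurement crosses the line, the allowed work changes without a fresh negotiation.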

Page 30

SRE needs Service Level Objectives with Consequences.

SRE Principle #1

Page 31

● Even without hiring a single SRE, you can have an Error Budget Policy.

● Lever you can use to keep your customers from experiencing pain and sadness.

● You can implement this today: measure, account and act.

SRE Principle #1

Page 32

Making Tomorrow Better Than Today

Page 33

● SLOs and Error Budgets are the first step.

● The next step is staffing an SRE role...

● ...endowed with real responsibility.

Making Tomorrow Better Than Today

Page 34

● Defines and refines Service Level Objectives.

● Enacts the Error Budget Policy when necessary.

● Makes sure that the application meets the reliability expectations of its users.

Your First SRE

Page 35

● A bounded part of the role.

● Recommend less than 50% of the workload be operations.

Toil
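
A small sketch of how a team might check the "less than 50% operations" recommendation against logged time. The category names and the hours are hypothetical. How work is classified as operations is a team decision the slides leave open.

```python
def operations_share(hours_by_category,
                     ops_categories=("on-call", "tickets", "manual-ops")):
    """Fraction of logged hours spent on operational work."""
    total = sum(hours_by_category.values())
    if total == 0:
        raise ValueError("no hours logged")
    ops = sum(hours for category, hours in hours_by_category.items()
              if category in ops_categories)
    return ops / total


# Hypothetical week for one engineer.
week = {"on-call": 10, "tickets": 8, "manual-ops": 4, "project": 18}

share = operations_share(week)   # 22 of 40 hours = 0.55
over_limit = share >= 0.5        # True: operations work exceeds the bound
```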

Page 36

● Consulting on System Architecture and Design.

● Authoring and iterating on Monitoring.

● Automating repetitive work.

● Coordinating implementation of Postmortem Action Items

Project Work

Page 37

SREs have time to make tomorrow better than today.

SRE Principle #2

Page 38

SRE Principle #2

● An SRE’s job is not to suffer under operational load, but to make each day brighter.

● "Brighter" might mean different things: It depends on what your SREs find most useful to do.

● Less toil, more meaningful system improvements.

Page 39

Shared Responsibility Model

Page 40

Dumping all production services on an SRE team cannot work.

Photo By: Air Force Tech. Sgt. Jorge Intriago (Public Domain)

Page 41

An overloaded team doesn’t have time to make tomorrow better than today.

Used with permission of the image owner Jennifer Petoff, Sidewalk Safari Blog

Page 42

Implementing a mechanism to give back-pressure to dev partners provides balance.

Used with permission of the image owner Jennifer Petoff, Sidewalk Safari Blog

Page 43

● Give 5% of the operational work to the developers

● Track SRE team project work.

○ Not completing projects? → something’s wrong.

● Analyse and on-board new systems only if they can be operated safely.

● If every problem has to be escalated to its developer: why is SRE carrying the pager?

Regulating Workload

Page 44

Without leadership buy-in, SRE cannot work.

Leadership Buy-in

Image Credit: geralt Pixabay License

Page 45

● When applications miss their SLOs and run out of Error Budget, it puts additional load on the SRE team:

○ Need to devote more company resources to addressing reliability concerns.

○ or: Loosen the SLO.

Leadership Buy-in

Page 46

● Fixing a product after launch is always more expensive.

● SRE teams can and should consult up-front on designs:

○ Architecting resilient systems.

○ Maintaining consistency means fewer SREs can support more products.

Reliability & Consistency Up Front

Page 47

Three places SRE teams can benefit from Automation:

1. To eliminate their toil - don't do things over and over!

2. To do capacity planning - auto-scaling instead of manual forecasting!

3. To fix issues automatically - if you can write the fix in a playbook, you can make the computer do it!

Automation
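
The third point, turning a playbook entry into code, might look like the sketch below. Everything here is hypothetical: the alert names, the alert fields and the remediation steps are placeholders that only return a description of the action, where a real system would call its own tooling.

```python
def restart_service(alert):
    # Placeholder step: describes the action instead of performing it.
    return f"restart {alert['service']}"


def free_disk_space(alert):
    # Placeholder step: describes the action instead of performing it.
    return f"rotate logs on {alert['host']}"


# The written playbook, expressed as a lookup from alert name to fix.
PLAYBOOK = {
    "ProcessDown": restart_service,
    "DiskAlmostFull": free_disk_space,
}


def handle(alert):
    """Apply the documented fix if one exists, otherwise involve a human."""
    step = PLAYBOOK.get(alert["name"])
    if step is None:
        return "page on-call"
    return step(alert)
```

For example, `handle({"name": "ProcessDown", "service": "frontend"})` returns `"restart frontend"`, while an alert with no playbook entry falls through to a human.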

Page 48

SRE teams have the ability to regulate their workload.

SRE Principle #3

Page 49

SRE Principle #3

● Teams need to be able to prioritise and do the work.

● Each new system to maintain has a human cost.

● Must be able to push back on unreliable practices and systems.

Page 50

A Culture of Blamelessness

Page 51

I'm extremely angry right now. People should lose their jobs if this was an error.

--Hawaii State Representative Matt Lopresti (in reference to the 2018 Hawaii nuclear alert false alarm)

Recognize the Antipattern

Source: “How Hawaii Could Have Sent A False Nuclear Alarm”, Wired, Lapowski, January 13, 2018 https://www.wired.com/story/hawaii-nuclear-missile-alert-false-explanation/

Page 52

● by setting SLOs less than 100%

● by modeling blamelessness at all levels

● by stamping out blame wherever it is found

● by celebrating cases of “I made a mistake” that lead to outages being resolved faster.

Embrace Failure

Page 53

● You’ve already paid the price in an outage.

● Write a blameless postmortem.

● Make postmortems widely available so others can learn too.

Learn from Failure

Page 54

“Human” errors are really systems problems.

Page 55

● The root cause of an outage is never a person.

● Ask “why” for as many iterations as it takes to identify system-related causes.

● Prioritize system fixes that support people to make the right choices.

Keep Asking Why

Page 56

Failure is an opportunity to improve.

SRE Principle #4

Page 57

Failure is an opportunity to improve.

Not an excuse to brandish pitchforks.

SRE Principle #4

Page 58

SRE Principle #4

● Failure happens; there is no way around it.

● Stop pointing fingers.

● Embrace failure to improve MTTD and MTTR.

● Proactively addressing failure → more robust systems
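
To improve MTTD (mean time to detect) and MTTR (mean time to repair), they first have to be measured. A minimal sketch, assuming each postmortem records when the incident started, was detected and was resolved. Definitions vary between organisations: here both are measured from the start of the incident, which is an assumption, and the incident data is invented for illustration.

```python
from datetime import datetime, timedelta


def mean_duration(durations):
    """Average a non-empty list of timedelta values."""
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incident records.
incidents = [
    {"started": datetime(2019, 3, 1, 10, 0),
     "detected": datetime(2019, 3, 1, 10, 5),
     "resolved": datetime(2019, 3, 1, 10, 45)},
    {"started": datetime(2019, 3, 9, 14, 0),
     "detected": datetime(2019, 3, 9, 14, 15),
     "resolved": datetime(2019, 3, 9, 15, 15)},
]

mttd = mean_duration([i["detected"] - i["started"] for i in incidents])
mttr = mean_duration([i["resolved"] - i["started"] for i in incidents])
# mttd is 10 minutes and mttr is 60 minutes for the records above.
```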

Page 59

Site Reliability Engineering Principles

1 SRE needs Service Level Objectives, with consequences.

2 SREs must have time to make tomorrow better than today.

3 SRE teams have the ability to regulate their workload.

4 Failure is an opportunity to improve.

Page 60

Cover images used with permission. These books can be found on shop.oreilly.com.

The full text of the Google SRE books is available at www.google.com/sre

Page 61

Getting Started with Site Reliability Engineering

Jennifer Petoff, Sr. Program Manager

Google Ireland

Twitter: @jennski

28/01/2019

Melbourne (March 21-22), Auckland (March 26-27), Sydney (September 10-11)

DEVOPS TALKS CONFERENCE 2019

Page 62

Allowed unreliability window

Reliability level    per year        per quarter      per 30 days
90%                  36.5 days       9 days           3 days
95%                  18.25 days      4.5 days         1.5 days
99%                  3.65 days       21.6 hours       7.2 hours
99.5%                1.83 days       10.8 hours       3.6 hours
99.9%                8.76 hours      2.16 hours       43.2 minutes
99.95%               4.38 hours      1.08 hours       21.6 minutes
99.99%               52.6 minutes    12.96 minutes    4.32 minutes
99.999%              5.26 minutes    1.30 minutes     25.9 seconds

Source: https://landing.google.com/sre/sre-book/chapters/availability-table/

Page 63

Allowed unreliability window

Reliability level    per year        per quarter      per 30 days
90%                  36.5 days       9 days           3 days
95%                  18.25 days      4.5 days         1.5 days
99%                  3.65 days       21.6 hours       7.2 hours
99.5%                1.83 days       10.8 hours       3.6 hours
99.9%                8.76 hours      2.16 hours       43.2 minutes
99.95%               4.38 hours      1.08 hours       21.6 minutes
99.99%               52.6 minutes    12.96 minutes    4.32 minutes
99.999%              5.26 minutes    1.30 minutes     25.9 seconds

Error rate    Allowed duration
100%          21.6 minutes
10%           3.6 hours
1%            36 hours
0.1%          15 days
<0.05%        all month

Source: https://landing.google.com/sre/sre-book/chapters/availability-table/
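
Both tables on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes a 365-day year, a 90-day quarter and a 30-day month, which is what the table values imply. The error-rate figures appear consistent with a 99.95% target over 30 days; that is an inference from the numbers, not something the slide states. Function names are illustrative.

```python
def allowed_downtime_minutes(reliability_percent, window_days):
    """Minutes of full outage permitted at a reliability level over a window."""
    window_minutes = window_days * 24 * 60
    return (100.0 - reliability_percent) / 100.0 * window_minutes


def allowed_duration_minutes(error_rate_percent,
                             reliability_percent=99.95, window_days=30):
    """Minutes a sustained error rate can last before the budget is spent."""
    window_minutes = window_days * 24 * 60
    budget_percent = 100.0 - reliability_percent
    if error_rate_percent <= budget_percent:
        # The error rate fits inside the budget for the whole window.
        return float(window_minutes)
    return budget_percent / error_rate_percent * window_minutes


per_30_days = allowed_downtime_minutes(99.9, 30)    # about 43.2 minutes
per_year = allowed_downtime_minutes(99.99, 365)     # about 52.6 minutes
full_outage = allowed_duration_minutes(100)         # about 21.6 minutes
one_percent = allowed_duration_minutes(1)           # about 2160 minutes (36 hours)
```

The second function shows why partial outages matter less than total ones: a 1% error rate burns the same budget one hundred times more slowly than a full outage.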