Site Reliability Engineering - USENIX

Greg Veith, Director – Microsoft Azure SRE

7/22/16



Page 1: Site Reliability Engineering

Greg Veith, Director – Microsoft Azure SRE

Page 2: They’re Alive!

Organizations Are Living Organisms

Page 3: Evolution and Complexity

Page 4: Azure Service Offerings

Page 5: Scale

% Revenue from Startups and ISVs

k New Azure customer subscriptions/month

Distinct Azure Service Offerings

Datacenters

24 Datacenter Regions

M Messages per second processed by Azure IoT

Page 6: Transformation

Page 7: Learning Culture, Growth Mindset

Page 8: Scaling Up Operational Models

Page 9: Welcome To The Team!

SR-

Page 10: North is…

Page 11: Symptoms of Success

• Availability and reliability meet SLOs: defend customer trust
• Eliminate human touches to prod: toil elimination
• Speed up deployments: reduce inventory, ship fast, safely

All the above areas reinforce measurement. Reliability’s foundation.
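The first symptom, availability and reliability meeting SLOs, depends on measurement. A minimal sketch of such a check, assuming availability is counted as good requests over total requests and the SLO is a target fraction such as 0.999. The names `SloStatus` and `evaluate_slo` are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloStatus:
    availability: float     # fraction of requests that succeeded
    budget_consumed: float  # 1.0 means the whole error budget is spent
    met: bool               # True while availability is at or above the target


def evaluate_slo(good: int, total: int, target: float) -> SloStatus:
    """Compare measured availability against an SLO target (assumed model)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= good <= total:
        raise ValueError("good must be between 0 and total")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    availability = good / total
    error_rate = (total - good) / total
    error_budget = 1.0 - target  # share of requests allowed to fail
    return SloStatus(availability, error_rate / error_budget, availability >= target)


if __name__ == "__main__":
    # Illustrative numbers only: 500 failures in one million requests.
    print(evaluate_slo(good=999_500, total=1_000_000, target=0.999))
```

With these illustrative numbers the SLO is met and roughly half of the error budget is used. Real services often define "good" per request class or per time window, which this sketch does not attempt.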

Page 12: 3 Strategic Pillars

Prove the model – Apply Principles

Start SRE at Microsoft – Establish Principles

Accelerate and improve – Scale the Principles

Page 13: SRE Engagement Types

• Services at Planetary Scale (Ops Transformation at Scale): SRE develops solutions to close operational gaps, fire suppressant, iterate toward transformation
• Newer Service Facing Rapid Growth (Growth and Maturation): SRE attaches to team, develops targeted improvements to prepare for growth, get on call
• Greenfield Services or Redesign (Design and Architecture): Operability and continuous innovation, design for scale from the beginning

Page 14: Production Readiness

Page 15: 3 Strategic Pillars

Prove the model – Pilots – Apply Principles

Start SRE at Microsoft – Establish Principles

Accelerate and improve – Scale the Principles

Page 16: Service Facing Rapid Growth

Azure IoT

Page 17: Established Service at Planetary Scale

Azure Storage

Page 18: 3 Prong Strategy

Prove the model – Pilots – Apply Principles

Start SRE at Microsoft – Establish Principles

Accelerate and improve – Scale the Principles

Page 19: Production Virtuous Cycle

Goal: Enable this loop to run as fast and often as possible while maintaining SLOs

Code → Test → Deploy → Monitor, Measure, Alert → Mitigate → Restore → Post Mortem → Learn

SRE

Page 20: SRE Areas of Focus

• Metrics and Monitoring: Instrumentation, SLOs, Alarms, insights → actions
• Infrastructure Engineering: Tooling, infra for global optima
• Release Engineering: Change Management, Deployment
• Incident Response: Enough Said
• Common Infrastructure: Integrating existing best-in-class infra
• Capacity & Fleet Management: Buildout, decomm, fleet understanding and mgmt
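The "insights → actions" step under Metrics and Monitoring can be illustrated with a small sketch that turns a measured error rate into an alerting decision. It assumes a burn-rate model (observed error rate divided by the error rate the SLO allows). The thresholds `page_at` and `ticket_at` and all names are hypothetical; none of them come from the talk.

```python
from enum import Enum


class Action(Enum):
    PAGE = "page on-call"
    TICKET = "open ticket"
    NONE = "no action"


def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0.0:
        raise ValueError("slo_target must be below 1.0")
    if error_rate < 0.0:
        raise ValueError("error_rate must be non-negative")
    return error_rate / budget


def choose_action(error_rate: float, slo_target: float,
                  page_at: float = 10.0, ticket_at: float = 2.0) -> Action:
    """Map a burn rate to an action; thresholds are illustrative assumptions."""
    rate = burn_rate(error_rate, slo_target)
    if rate >= page_at:
        return Action.PAGE
    if rate >= ticket_at:
        return Action.TICKET
    return Action.NONE


if __name__ == "__main__":
    # Illustrative error rates against a 99.9% target.
    for observed in (0.02, 0.003, 0.0005):
        print(observed, choose_action(observed, slo_target=0.999).value)
```

Keeping the thresholds as parameters lets each service tune how aggressively it pages, which is one possible design choice rather than a prescribed one.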

Page 21: Metrics and Monitoring

Page 22: Incident Response

Page 23: Critical Moves, Learnings

Build and protect the SRE brand

Manage the change

Meet teams where they are

Grab a Shovel (and build a backhoe)

Find the bright spots