Site Reliability Engineering - USENIX

7/22/16 1

Greg Veith
Director, Microsoft Azure SRE

Site Reliability Engineering

They’re Alive!

Organizations Are Living Organisms

Evolution and Complexity

Scale

• Distinct Azure service offerings
• % revenue from startups and ISVs
• k new Azure customer subscriptions/month
• 24 datacenter regions
• M messages per second processed by Azure IoT

Transformation

Learning Culture, Growth Mindset

Scaling Up Operational Models

Welcome To The Team!

SR-

North is…

Symptoms of Success

• Availability and reliability meet SLOs: defend customer trust

• Eliminate human touches to prod: toil elimination

• Speed up deployments: reduce inventory, ship fast, safely

All the above areas reinforce measurement. Reliability’s foundation.
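
To make "meet SLOs" concrete, the arithmetic behind an availability target can be sketched as follows. This is an illustration rather than material from the talk: the function names, the 99.9% target, and the 30-day window are all assumed for the example.

```python
def availability(good_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully (hypothetical request-based SLI)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full outage a window can absorb while still meeting the SLO."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


if __name__ == "__main__":
    # Assumed example values: a 99.9% target over a 30-day window.
    print(round(allowed_downtime_minutes(0.999, 30), 1))  # about 43.2 minutes
    print(availability(999_500, 1_000_000) >= 0.999)      # True: SLO met
```

Measuring the SLI and comparing it against the target is what turns "defend customer trust" into a number a team can track.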

3 Strategic Pillars

Prove the model – Apply Principles

Start SRE at Microsoft – Establish Principles

Accelerate and improve – Scale the Principles

SRE Engagement Types

• Services at Planetary Scale (Ops Transformation at Scale): SRE develops solutions to close operational gaps, fire suppressant, iterate toward transformation

• Newer Service Facing Rapid Growth (Growth and Maturation): SRE attaches to team, develops targeted improvements to prepare for growth, gets on call

• Greenfield Services or Redesign (Design and Architecture): operability and continuous innovation, design for scale from the beginning

Production Readiness

3 Strategic Pillars

Prove the model – Pilots – Apply Principles

Start SRE at Microsoft – Establish Principles

Accelerate and improve – Scale the Principles

Service Facing Rapid Growth: Azure IoT

Established Service at Planetary Scale: Azure Storage

3 Prong Strategy

Prove the model – Pilots – Apply Principles

Start SRE at Microsoft – Establish Principles

Accelerate and improve – Scale the Principles

Production Virtuous Cycle

Goal: Enable this loop to run as fast and often as possible while maintaining SLOs

Code

Test

Deploy

Monitor, Measure, Alert

Mitigate

Restore

Postmortem

Learn

SRE
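
One common way to let the loop run "as fast and often as possible while maintaining SLOs" is to gate the Deploy step on the remaining error budget. The sketch below is an assumption-laden illustration, not the mechanism described in the talk: `SloWindow`, `may_deploy`, and the 10% reserve threshold are hypothetical names and values.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts observed over one SLO window (hypothetical structure)."""
    target: float          # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int
    failed_requests: int

    def error_budget(self) -> float:
        """Number of failed requests the window can tolerate."""
        return (1.0 - self.target) * self.total_requests

    def budget_remaining(self) -> float:
        """Fraction of the error budget still unspent (negative if overspent)."""
        budget = self.error_budget()
        if budget <= 0:
            # No budget at all: any failure means the budget is exhausted.
            return 1.0 if self.failed_requests == 0 else 0.0
        return 1.0 - self.failed_requests / budget


def may_deploy(window: SloWindow, min_remaining: float = 0.10) -> bool:
    """Allow a deploy only while an assumed reserve of budget is left."""
    return window.budget_remaining() >= min_remaining


if __name__ == "__main__":
    healthy = SloWindow(target=0.999, total_requests=1_000_000, failed_requests=200)
    strained = SloWindow(target=0.999, total_requests=1_000_000, failed_requests=950)
    print(may_deploy(healthy))   # True: roughly 80% of the budget remains
    print(may_deploy(strained))  # False: roughly 5% of the budget remains
```

With a gate like this, deploys proceed freely while reliability is healthy and slow down automatically when the Mitigate, Restore, and Postmortem steps need attention first.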

• Metrics and Monitoring: instrumentation, SLOs, alarms, insights → actions

• Infrastructure Engineering: tooling, infra for global optima

• Release Engineering: change management, deployment

• Incident Response: enough said

• Common Infrastructure: integrating existing best-in-class infra

• Capacity & Fleet Management: buildout, decomm, fleet understanding and mgmt

SRE Areas of Focus

Metrics and Monitoring
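
As one illustration of turning SLO instrumentation into alarms and then actions, a burn-rate check compares the observed error ratio with the ratio the SLO allows. This sketch is not from the talk. The function names are hypothetical, and the 14.4 threshold is an assumed value (the rate at which 2% of a 30-day budget is spent in one hour).

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    allowed_ratio = 1.0 - slo_target
    if allowed_ratio <= 0:
        raise ValueError("slo_target must be below 1.0 to leave an error budget")
    return error_ratio / allowed_ratio


def should_page(short_window_error_ratio: float,
                long_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both a short and a long window burn faster than the threshold.

    Requiring both windows filters out brief spikes (long window stays low)
    and already-recovered incidents (short window drops back down).
    """
    return (burn_rate(short_window_error_ratio, slo_target) >= threshold
            and burn_rate(long_window_error_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    # Assumed example: 99.9% SLO, 2% of requests failing in both windows.
    print(should_page(0.02, 0.02, 0.999))   # True: sustained fast burn
    print(should_page(0.02, 0.001, 0.999))  # False: short spike only
```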

Incident Response

Critical Moves, Learnings

Build and protect the SRE brand

Manage the change

Meet teams where they are

Grab a shovel (and build a backhoe)

Find the bright spots

gveith@microsoft.com
