rewriting devops


Rewriting DevOps

Matthew Boeckman
VP - Infrastructure, Craftsy
@matthewboeckman

This is not a DevOps definition

● Common Tooling
● Organizational Empathy
● Shared Responsibility

Why Rewrite?

1. Support new business initiatives
2. Scale and resilience
3. Quicker iterations

#30 on Forbes' 2015 list of Most Promising Companies
10+MM registered members, 11+MM enrolled courses
350 course enrollments/hour

DevOps 1.0
● Some Ops dev’d, and a few Devs Ops’d
● Great cross team culture, still separate teams
● Shared Oncall but heavy Ops burden
● Limited common tooling

DevOps 2.0 goals
● Integrated DevOps team and workflows
● Common tools
● Shared Oncall

Common Tooling

Common Tools

Jenkins (build, deploy, ETL, scheduled tasks)
Terraform (infrastructure configuration)
Splunk (data intelligence)
AWS (all infrastructure)

Backend

Ops

Frontend

Organizational Empathy

Site Reliability Engineering

*not DevOps

"Fundamentally, it's what happens when you ask a software engineer to design an operations function."

Ben Treynor Sloss, Vice President, Google Engineering, founder of Google SRE

SRE Phase 1 (Feb-May)

● Determine tooling
  ○ Nagios, Graphite, Splunk, Confluence
● SWAG at reliability metrics
  ○ Errors; response time
● Runbooks
● Blameless Postmortem every outage
● Iterate
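
The two first-pass reliability metrics named above, errors and response time, can be sketched from a list of request records. This is a minimal illustration, not the tooling from the talk: the `Request` record, the choice of 5xx as "error", and the nearest-rank percentile are all assumptions.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    """Hypothetical request record; field names are assumptions."""
    status: int          # HTTP status code
    duration_ms: float   # response time in milliseconds


def error_rate(requests: Sequence[Request]) -> float:
    """Fraction of requests that returned a 5xx status."""
    if not requests:
        return 0.0
    errors = sum(1 for r in requests if r.status >= 500)
    return errors / len(requests)


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


sample = [
    Request(200, 120.0),
    Request(200, 80.0),
    Request(500, 900.0),
    Request(200, 100.0),
]
durations = [r.duration_ms for r in sample]
print(error_rate(sample))         # 0.25
print(percentile(durations, 95))  # 900.0
```

Starting with a rough guess at thresholds for these two numbers and tuning them later matches the "SWAG, then iterate" approach on this slide.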

The primary hurdle to DevOps and SRE adoption is

The Skill Gap

Runbooks:
● System overview
● Escalation path
● Alert descriptions
● Common failure conditions
● Known recovery procedures
● Incident history
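
One way to keep runbooks consistent is to treat the six sections above as a fixed structure and flag the ones still empty. The sketch below is only an assumption about how that could look; the `Runbook` class, its field names, and the `checkout` service are hypothetical.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class Runbook:
    """Hypothetical runbook skeleton mirroring the six sections on the slide."""
    service: str
    system_overview: str = ""
    escalation_path: List[str] = field(default_factory=list)
    alert_descriptions: List[str] = field(default_factory=list)
    common_failure_conditions: List[str] = field(default_factory=list)
    known_recovery_procedures: List[str] = field(default_factory=list)
    incident_history: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Names of sections that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


runbook = Runbook(
    service="checkout",  # hypothetical service name
    system_overview="Node.js front end calling a Java back end",
    escalation_path=["on-call engineer", "team lead"],
)
print(runbook.missing_sections())
```

A check like this could run in the build so that a service without an escalation path or recovery procedure is visible before an outage, not during one.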

Postmortem - 7 W’s and an H

1. What (happened)
2. What (systems were impacted)
3. When (did it occur)
4. Who (was involved)
5. How (did we discover the issue)
6. Why (did it go explody)
7. What (will we do to remedy it)
8. When (will that remedy be actioned)
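
The eight questions above lend themselves to a fixed template, so every postmortem has the same shape. The following is a rough sketch under that assumption; the function names and the plain-text layout are illustrative and not from the talk.

```python
from typing import Dict, List

# The eight postmortem questions from the slide, written out as prompts.
POSTMORTEM_QUESTIONS: List[str] = [
    "What happened?",
    "What systems were impacted?",
    "When did it occur?",
    "Who was involved?",
    "How did we discover the issue?",
    "Why did it fail?",
    "What will we do to remedy it?",
    "When will that remedy be actioned?",
]


def unanswered(answers: Dict[int, str]) -> List[int]:
    """Question numbers (1-8) that have no non-blank answer yet."""
    return [
        number
        for number in range(1, len(POSTMORTEM_QUESTIONS) + 1)
        if not answers.get(number, "").strip()
    ]


def render_postmortem(title: str, answers: Dict[int, str]) -> str:
    """Render a plain-text postmortem; answers are keyed by question number."""
    lines = [f"Postmortem: {title}", ""]
    for number, question in enumerate(POSTMORTEM_QUESTIONS, start=1):
        lines.append(f"{number}. {question}")
        lines.append(f"   {answers.get(number, '(unanswered)')}")
    return "\n".join(lines)


draft = {1: "Checkout requests timed out.", 3: "  "}
print(render_postmortem("Example outage", draft))
print(unanswered(draft))
```

Note that the template asks who was involved only to reconstruct the timeline; it has no field for fault, which keeps it compatible with a blameless process.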

Having a “blameless” Post-Mortem process means that engineers whose actions have contributed to an accident can give a detailed account of:

what actions they took at what time,

what effects they observed,

expectations they had,

assumptions they had made,

and their understanding of timeline of events as they occurred.

…and that they can give this detailed account without fear of punishment or retribution

*John Allspaw, CTO - Etsy
https://codeascraft.com/2012/05/22/blameless-postmortems/

Shared Responsibility

Empathy drives action

Common tools and Runbooks bridge the skills gap

Postmortems direct iterations

Incident

Post-Mortem

Tools

Runbook

Reward

SRE Phase 2 (May -> … forever)

● Build a production environment
● Tune reliability metrics
● Load tests
● Resilience tests
● Recovery tests
● Blameless Postmortem every outage
● Runbooks
● Iterate

Fastly - Content Delivery
F5 & ELB - load balancing
FE - Node.js
BE - Java
Packer - AMIs
Consul - service discovery
Terraform - Infrastructure
Postgres/RDS - database
SQS/SNS/Lambda/S3 - everything else

SRE - Two metrics

Mean Time to Identify

Mean Time to Resolve
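
Both metrics can be computed from three timestamps per incident. The sketch below assumes each is measured from the start of the incident (some teams measure time to resolve from identification instead); the `Incident` record and the sample times are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Sequence


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record with the three timestamps needed."""
    started: datetime
    identified: datetime
    resolved: datetime


def _mean(deltas: List[timedelta]) -> timedelta:
    if not deltas:
        raise ValueError("no incidents")
    return sum(deltas, timedelta()) / len(deltas)


def mean_time_to_identify(incidents: Sequence[Incident]) -> timedelta:
    """Average gap between an incident starting and being identified."""
    return _mean([i.identified - i.started for i in incidents])


def mean_time_to_resolve(incidents: Sequence[Incident]) -> timedelta:
    """Average gap between an incident starting and being resolved."""
    return _mean([i.resolved - i.started for i in incidents])


incidents = [
    Incident(datetime(2016, 3, 1, 10, 0), datetime(2016, 3, 1, 10, 10),
             datetime(2016, 3, 1, 11, 0)),
    Incident(datetime(2016, 4, 2, 14, 0), datetime(2016, 4, 2, 14, 20),
             datetime(2016, 4, 2, 14, 40)),
]
print(mean_time_to_identify(incidents))  # 0:15:00
print(mean_time_to_resolve(incidents))   # 0:50:00
```

If postmortems already record when an issue occurred and how it was discovered, these timestamps come from the same document and need no extra tracking.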

DevOps + SRE
T-18 days
3 hours

…

This is not a DevOps definition approach

● Common Tooling
● Organizational Empathy
● Shared Responsibility
● Land and expand
● Start with pre-prod and grow

Thank you!
Questions?

@matthewboeckman
