Rewriting DevOps
Matthew Boeckman
VP - Infrastructure, Craftsy
@matthewboeckman
This is not a DevOps definition
● Common Tooling
● Organizational Empathy
● Shared Responsibility
Why Rewrite?
1. Support new business initiatives
2. Scale and resilience
3. Quicker iterations
#30 on Forbes' 2015 list of Most Promising Companies
10+MM registered members, 11+MM enrolled courses
350 course enrollments/hour
DevOps 1.0
● Some Ops dev’d, and a few Devs Ops’d
● Great cross-team culture, still separate teams
● Shared Oncall but heavy Ops burden
● Limited common tooling
DevOps 2.0 goals
● Integrated DevOps team and workflows
● Common tools
● Shared Oncall
Common Tooling
Common Tools
Jenkins (build, deploy, ETL, scheduled tasks)
Terraform (infrastructure configuration)
Splunk (data intelligence)
AWS (all infrastructure)
Backend
Ops
Frontend
Organizational Empathy
Site Reliability Engineering
*not DevOps
"Fundamentally, it's what happens when you ask a software engineer to design an operations function."
Ben Treynor Sloss, Vice President, Google Engineering, founder of Google SRE
SRE Phase 1 (Feb-May)
● Determine tooling
  ○ Nagios, Graphite, Splunk, Confluence
● SWAG at reliability metrics
  ○ Errors; response time
● Runbooks
● Blameless Postmortem every outage
● Iterate
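A first pass at the two reliability metrics named above (errors and response time) can be computed from plain request logs. The sketch below is illustrative only: the `(status_code, response_ms)` record shape, the choice of HTTP 5xx as "error", and the nearest-rank 95th percentile are assumptions, not details from the talk.

```python
import math


def reliability_summary(requests):
    """Summarize errors and response time for a batch of requests.

    `requests` is a list of (status_code, response_ms) tuples.
    This record shape is hypothetical, chosen for illustration.
    """
    total = len(requests)
    if total == 0:
        raise ValueError("no requests to summarize")

    # Assumption: any 5xx status counts as an error.
    errors = sum(1 for status, _ in requests if status >= 500)

    # Nearest-rank 95th percentile of response time.
    latencies = sorted(ms for _, ms in requests)
    rank = max(1, math.ceil(0.95 * total))
    p95_ms = latencies[rank - 1]

    return {"error_rate": errors / total, "p95_ms": p95_ms}


sample = [(200, 100), (200, 120), (500, 900), (200, 110)]
summary = reliability_summary(sample)
```

Starting with rough numbers like these and tuning them after each postmortem matches the "SWAG, then iterate" order of the list.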
The primary hurdle to DevOps and SRE adoption is
The Skill Gap
Runbooks:
● System overview
● Escalation path
● Alert descriptions
● Common failure conditions
● Known recovery procedures
● Incident history
Postmortem - 7 W’s and an H
1. What (happened)
2. What (systems were impacted)
3. When (did it occur)
4. Who (was involved)
5. How (did we discover the issue)
6. Why (did it go explody)
7. What (will we do to remedy it)
8. When (will that remedy be actioned)
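One way to keep every postmortem answering all eight questions is to treat the list as a template with a completeness check. This is a minimal sketch; the `Postmortem` class, its field names, and the `unanswered` helper are hypothetical and only mirror the questions above.

```python
from dataclasses import dataclass, fields


@dataclass
class Postmortem:
    # One field per question, in the order listed on the slide.
    what_happened: str = ""
    systems_impacted: str = ""
    when_occurred: str = ""
    who_involved: str = ""
    how_discovered: str = ""
    why_failed: str = ""
    remedy: str = ""
    remedy_when: str = ""


def unanswered(pm):
    """Return the names of questions that are still blank."""
    return [f.name for f in fields(pm) if not getattr(pm, f.name).strip()]


draft = Postmortem(
    what_happened="Checkout returned errors",
    systems_impacted="Checkout service",
    when_occurred="During the evening deploy",
    who_involved="On-call engineer",
    how_discovered="Alert on error rate",
)
missing = unanswered(draft)
```

Here `missing` lists the three questions the draft has not yet answered, which makes it easy to see when a postmortem is ready for review.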
Having a “blameless” Post-Mortem process means that engineers whose actions have contributed to an accident can give a detailed account of:
what actions they took at what time,
what effects they observed,
expectations they had,
assumptions they had made,
and their understanding of timeline of events as they occurred.
…and that they can give this detailed account without fear of punishment or retribution
*John Allspaw, CTO - Etsy
https://codeascraft.com/2012/05/22/blameless-postmortems/
Shared Responsibility
Empathy drives action
Common tools and Runbooks bridge the skills gap
Postmortems direct iterations
Incident
Post-Mortem
Tools
Runbook
Reward
SRE Phase 2 (May-> … forever)
● Build a production environment
● Tune reliability metrics
● Load tests
● Resilience tests
● Recovery tests
● Blameless Postmortem every outage
● Runbooks
● Iterate
Fastly - Content Delivery
F5 & ELB - load balancing
FE - Node.js
BE - Java
Packer - AMIs
Consul - service discovery
Terraform - Infrastructure
Postgres/RDS - database
SQS/SNS/Lambda/S3 - everything else
SRE - Two metrics
Mean Time to Identify
Mean Time to Resolve
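Both metrics can be derived from three timestamps per incident. The sketch below assumes that each is measured from the moment the incident started (identify = start to identification, resolve = start to resolution); the slides do not define the measurement window, and the `Incident` record is hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime
    identified: datetime
    resolved: datetime


def mean_delta(deltas):
    """Average a sequence of timedelta values."""
    deltas = list(deltas)
    if not deltas:
        raise ValueError("no incidents to average")
    return sum(deltas, timedelta()) / len(deltas)


def mean_time_to_identify(incidents):
    # Assumption: measured from incident start to identification.
    return mean_delta(i.identified - i.started for i in incidents)


def mean_time_to_resolve(incidents):
    # Assumption: measured from incident start to resolution.
    return mean_delta(i.resolved - i.started for i in incidents)


incidents = [
    Incident(datetime(2015, 6, 1, 10, 0), datetime(2015, 6, 1, 10, 10),
             datetime(2015, 6, 1, 11, 0)),
    Incident(datetime(2015, 6, 2, 12, 0), datetime(2015, 6, 2, 12, 30),
             datetime(2015, 6, 2, 13, 0)),
]
mtti = mean_time_to_identify(incidents)
mttr = mean_time_to_resolve(incidents)
```

With these two made-up incidents, the identify times of 10 and 30 minutes average to 20 minutes, and both incidents take 60 minutes to resolve.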
DevOps + SRE
T-18 days
3 hours
…
This is not a DevOps definition approach
● Common Tooling
● Organizational Empathy
● Shared Responsibility
● Land and expand
● Start with pre-prod and grow
Thank you!
Questions?
@matthewboeckman