Software Operability and Run Book Collaboration - DevOps Summit, Bangalore
DESCRIPTION
Making software work well in production (through good software operability) is one of the goals of DevOps. Collaboration between Dev and Ops on the 'run book' or operations manual is one way to open up communication channels between Dev and Ops, leading to improved software operability. This is the slide deck I used at DevOps Summit, Bangalore, on 18th December 2013.
TRANSCRIPT
#unidevops
Software Operability,
Run Book Collaboration,
and DevOps
Matthew Skelton
18th December 2013
DevOps Summit,
Bangalore, India
www.devops-summit.org
@matthewpskelton
softwareoperability.com
Agenda
• Software Operability
• Run Book Collaboration
• Making Operability Work
• Questions
Background
• Software systems since 1998
• Software build & deployment specialist & DevOps enthusiast
• London Continuous Delivery meetup group - londoncd.org.uk
• Experience DevOps workshops
Software
Operability
Software Operability
• Definitions
• Examples
• Why focus on operability?
• How DevOps can help
Operability?
Etymology of Operability?
• Cognates:
– Opera
– Operate
– Operational
– Inter-operability
Software Operability
• Operability: the properties of a system which make it work well in Production
Operable Systems
Since 1929,
Mallorca, Spain
Software Operability
• David Copeland (@davetron5000): “How your software runs in production is all that matters. The most amazing abstractions, cleanest code, or beautiful algorithms are meaningless if your code doesn’t run well on production.”
• http://www.naildrivin5.com/blog/2013/06/16/production-is-all-that-matters.html
Operational Criteria
• Deploy
• Monitor
• Diagnose
• Debug
• Query
• Control
• Inspect
• Clear
• ...
“Non-Functional”
Shaped by Operability
• Hooks (internal APIs) for:
– Logging
– Monitoring
– Diagnostics
– Health checks
– Data clear-down
– Service / daemon / container control
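The hooks listed above can be sketched in code. The Python below is a minimal illustration under stated assumptions, not a prescribed design: the class and method names (OperabilityHooks, register_health_check, register_cleardown, health, clear_down) are hypothetical. It shows health checks and data clear-down exposed as internal APIs that Ops tooling could call, with failures logged and reported rather than crashing the caller.

```python
import logging
from typing import Callable, Dict

logger = logging.getLogger(__name__)


class OperabilityHooks:
    """Hypothetical registry of internal hooks that Ops can call in Production."""

    def __init__(self) -> None:
        self._health_checks: Dict[str, Callable[[], bool]] = {}
        self._cleardowns: Dict[str, Callable[[], int]] = {}

    def register_health_check(self, name: str, check: Callable[[], bool]) -> None:
        self._health_checks[name] = check

    def register_cleardown(self, name: str, task: Callable[[], int]) -> None:
        self._cleardowns[name] = task

    def health(self) -> Dict[str, object]:
        """Run every health check; a check that raises is logged and counted as failed."""
        results: Dict[str, bool] = {}
        for name, check in self._health_checks.items():
            try:
                results[name] = bool(check())
            except Exception:
                logger.exception("health check %r raised an exception", name)
                results[name] = False
        status = "ok" if all(results.values()) else "degraded"
        return {"status": status, "checks": results}

    def clear_down(self, name: str) -> int:
        """Run a named clear-down task and return the number of items it removed."""
        count = self._cleardowns[name]()
        logger.info("clear-down %r removed %d items", name, count)
        return count


if __name__ == "__main__":
    hooks = OperabilityHooks()
    hooks.register_health_check("database", lambda: True)
    # Simulates a broken dependency: 1 / 0 raises ZeroDivisionError.
    hooks.register_health_check("message_queue", lambda: 1 / 0 == 0)
    hooks.register_cleardown("temp_files", lambda: 3)
    print(hooks.health())
    # {'status': 'degraded', 'checks': {'database': True, 'message_queue': False}}
    print(hooks.clear_down("temp_files"))
    # 3
```

In a real service the health() result would typically be exposed on an HTTP or management endpoint so that monitoring can poll it; that wiring is deliberately left out of this sketch.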
Ops Folk are Users Too!
Why focus on Operability?
• Deploying more rapidly and more frequently
• High cost of Production outages
• Systems are now more complicated
Outages are Embarrassing!
Operational considerations
Operational considerations
Operational considerations
How DevOps can help
• DevOps is one way to address poor operability
• Improved collaboration and communication between Dev teams and Ops teams
• Example: Run Book Collaboration
Run Book
Collaboration
Run Book Collaboration
• Feedback loops and learning
• What is a run book?
• How can run book collaboration help operability?
Feedback Loops
Gene Kim:
http://itrevolution.com/the-three-ways-principles-underpinning-devops/
Run Book
Templates
Example
• 1 Table of Contents
• 2 System Overview
– 2.1 Service Overview
– 2.2 Contributing Applications, Daemons, and Windows Services
– 2.3 Hours of Operation
– 2.4 Execution Design
– 2.5 Infrastructure and Network Design
– 2.6 Resilience, Fault Tolerance and High-Availability
– 2.7 Throttling and Partial Shutdown
– 2.8 Required Resources
– 2.9 Expected Traffic and Load
• 2.9.1 Hot or Peak Periods
• 2.9.2 Warm Periods
• 2.9.3 Cool or Quiet Periods
– 2.10 Environmental Differences
– 2.11 Tools
• 3 Security and Access Control
• 4 System Configuration
– 4.1 Configuration Management
• 5 System Backup and Restore
– 5.1 Backup Requirements
• 5.1.1 Special Files
– 5.2 Backup Procedures
– 5.3 Restore Procedures
• 6 Monitoring and Alerting
– 6.1 Error Messages
– 6.2 Events
– 6.3 Health Checks
– 6.4 Other Messages
• 7 Operational Tasks
– 7.1 Deployment
– 7.2 Batch Processing
– 7.3 Power Procedures
– 7.4 Routine Checks
• 7.4.1 System Rebuilds
– 7.5 Troubleshooting
• 8 Maintenance Tasks
– 8.1 Maintenance Procedures
• 8.1.1 Patching
– 8.1.1.1 Normal Cycle
– 8.1.1.2 Zero-Day Vulnerabilities
• 8.1.2 GMT/BST time changes
• 8.1.3 Cleardown Activities
– 8.1.3.1 Log Rotation
– 8.2 Testing
• 8.2.1 Technical Testing
• 8.2.2 Post-Deployment
• 9 Failure and Recovery Procedures
– 9.1 Failover
– 9.2 Recovery
– 9.3 Troubleshooting Failover and Recovery
• 10 Contact Details
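To keep the template in view during Dev and Ops conversations, a team might use a small script that lists which top-level template sections a draft run book has not yet covered; the gaps then become the agenda for the next conversation rather than a documentation target. This is a hypothetical sketch in Python: it assumes the draft is plain text with one numbered top-level heading per line (for example "6 Monitoring and Alerting"), and the names TEMPLATE_SECTIONS and missing_sections are illustrative.

```python
import re
from typing import List

# Top-level sections taken from the example run book template above
# ("Table of Contents" is left out because it is usually generated).
TEMPLATE_SECTIONS = [
    "System Overview",
    "Security and Access Control",
    "System Configuration",
    "System Backup and Restore",
    "Monitoring and Alerting",
    "Operational Tasks",
    "Maintenance Tasks",
    "Failure and Recovery Procedures",
    "Contact Details",
]

# Matches top-level headings such as "6 Monitoring and Alerting" or
# "6. Monitoring and Alerting"; subsections like "6.1 Error Messages" do not match.
HEADING = re.compile(r"^\s*\d+\.?\s+(.+?)\s*$")


def missing_sections(draft: str) -> List[str]:
    """Return template sections with no matching top-level heading in the draft."""
    found = set()
    for line in draft.splitlines():
        match = HEADING.match(line)
        if match:
            found.add(match.group(1).lower())
    return [s for s in TEMPLATE_SECTIONS if s.lower() not in found]


if __name__ == "__main__":
    draft = "2 System Overview\n2.1 Service Overview\n6 Monitoring and Alerting\n10 Contact Details\n"
    for section in missing_sections(draft):
        print("still to discuss:", section)
```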
Example
• 1 Table of Contents
• 2 System Overview
– 2.1 Service Overview
– 2.2 Contributing Applications, Daemons, and Windows Services
– 2.3 Hours of Operation
– 2.4 Execution Design
– 2.5 Infrastructure and Network Design
– 2.6 Resilience, Fault Tolerance and High-Availability
– 2.7 Throttling and Partial Shutdown
– 2.8 Required Resources
– 2.9 Expected Traffic and Load
• 3 Security and Access Control
• 4 System Configuration
• 5 System Backup and Restore
• 6 Monitoring and Alerting
• 7 Operational Tasks
• 8 Maintenance Tasks
• 9 Failure and Recovery Procedures
• 10 Contact Details
Example
2.1 Service Overview
2.2 Contributing Applications, Daemons, and Windows Services
2.3 Hours of Operation
2.4 Execution Design
2.5 Infrastructure and Network Design
2.6 Resilience, Fault Tolerance and High-Availability
2.7 Throttling and Partial Shutdown
2.8 Required Resources
2.9 Expected Traffic and Load
It’s Not Documentation
Focus on Collaboration
Outcomes
• Better understanding
• Better cross-team working
• Reduction in operational problems
• Fewer outages
• Reduced long-term cost-of-ownership
Run Book as Collaboration
• Focus on the collaboration
• Run book is a means, not an end
• Throw it away when complete (?)
• Aim to automate more over time
• See http://runbookcollab.info/
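One way to "automate more over time" is to turn a manual routine check from the run book into a small script whose exit code a scheduler or monitoring tool can act on. The sketch below assumes a hypothetical run book step, "check that at least 10% of disk space is free on the data volume"; the threshold, default path and function names are illustrative, and only shutil.disk_usage comes from the Python standard library.

```python
import shutil
from typing import List

# Hypothetical threshold taken from an imagined run book step.
MIN_FREE_FRACTION = 0.10


def disk_space_ok(path: str, min_free_fraction: float) -> bool:
    """Return True if the filesystem holding `path` has enough free space."""
    usage = shutil.disk_usage(path)  # named tuple (total, used, free), in bytes
    return usage.free / usage.total >= min_free_fraction


def main(argv: List[str]) -> int:
    """Return a process exit code: 0 if free space is OK, 1 if it is low."""
    path = argv[0] if argv else "/"
    ok = disk_space_ok(path, MIN_FREE_FRACTION)
    print(f"free disk space on {path}: {'OK' if ok else 'LOW'}")
    return 0 if ok else 1
```

Run as a script with `raise SystemExit(main(sys.argv[1:]))`, a non-zero exit code gives the scheduler or monitoring system something to alert on, and the run book entry can shrink to a pointer at the script.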
Making Operability
Work
Making Operability Work
• NFRs vs Operational Features
• Budget changes
• Organisation changes
• Responsibility changes
• Avoid on-call anti-patterns
“Non-Functional”
Operational Features
Features
Taking Operability Seriously
• Single product backlog
– End-user + Operational features
– New features + bugs
• Product Owner on call
– Accountable for operational failures
– Seriously!
Budget changes
• “What is your budget code?”
• Capex vs. Opex?
• Remove budget barriers to regular, effective communication
Niek Bartholomeus (@niekbartho) - http://niek.bartholomeus.be/
https://speakerdeck.com/niekbartho/self-organization-vs-global-optimization-a-comparison-between-traditional-and-modern-organizations
Organisation changes
• “I’ll need to ask my manager first”
• Lack of autonomy
• Remove reporting barriers to regular, effective communication
• More at http://bit.ly/DevOpsTopologies
“I just want to write code”
Mysterious Coding Tricks
On-call for Responsibility
On-call Anti-Patterns
• Too much overtime pay
• Too little overtime pay
• Rota team too small
• No training in incident response
• No team ownership of product
• No team autonomy for changes
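The "rota team too small" problem is simple arithmetic: on a weekly round-robin rota of N people, each person is on call one week in every N, so a team of three is on call every third week. The sketch below is a hypothetical illustration of such a rota; MIN_ROTA_SIZE and weekly_rota are illustrative names, and the minimum of four is an assumption rather than a figure from the talk.

```python
from datetime import date, timedelta
from typing import List, Tuple

# Hypothetical minimum: below this, people are on call too often.
MIN_ROTA_SIZE = 4


def weekly_rota(team: List[str], start: date, weeks: int) -> List[Tuple[date, str]]:
    """Assign one person per week in round-robin order, starting on `start`."""
    if len(team) < MIN_ROTA_SIZE:
        raise ValueError(
            f"rota team too small: {len(team)} people, need at least {MIN_ROTA_SIZE}"
        )
    # Person i % N covers week i, so each person is on call one week in every N.
    return [(start + timedelta(weeks=i), team[i % len(team)]) for i in range(weeks)]


if __name__ == "__main__":
    team = ["alice", "bob", "chitra", "dev"]
    for week_start, person in weekly_rota(team, date(2014, 1, 6), 6):
        print(week_start.isoformat(), person)
```

Raising an error, rather than silently producing a rota, makes an undersized team an explicit decision to be discussed instead of a hidden cost borne by the people on call.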
On-call - Goal
• Team members want to help make things better
• Empowered to fix problems
• Reduce how often they are woken up
The operability of operability
• Operational Features, not “NFRs”
• Sustainable collaboration
• Sensible, fair on-call rotas
• Over-compensate in time off
• Avoid burn-out
Recapitulation
Software Operability
Making software
systems work well
in Production
Run Book Collaboration
Shared focus on operability throughout the delivery cycle
Making Operability Operable
Use DevOps team patterns for sustainable operability
What’s Next?
Further Reading
• Patterns for Performance and Operability
– Ford, Gileadi, Purba, Moerman
• http://whoownsmyoperability.com/
– Recommended reading lists
Operability Book
• Software Operability – How to make software work well in Production
– Due early 2014
• Sign up at OperabilityBook.com
• Discount code for DevOps Summit attendees
Experience DevOps
• A hands-on workshop for DevOps culture
• Forthcoming dates:
– Bangalore: 19th December 2013
– London: February 2014 (tbc)
• http://experiencedevops.org/
PIPELINE Conference
• Continuous Delivery
• Tuesday 8th April 2014
• London, UK
• http://pipelineconf.info/
• @PipelineConf
Questions &
Discussion
Matthew Skelton
@matthewpskelton
softwareoperability.com
operabilitybook.com
bit.ly/DevOpsTopologies
Acknowledgements
http://pianofortekeys.files.wordpress.com/2013/04/ariadnne_wideweb__470x3300.jpg
http://www.blinkenlights.nl/images/blinkenlights-big.jpeg
http://www.danatronics.com/sdb_apps.html
http://riverbankoftruth.com/wp-content/uploads/2013/07/embarrassed-chimp22.jpg
http://www.thinkgeek.com/edm/20040709.html
http://indianaohindiana.com/wp-content/uploads/2013/10/Tome.jpg
http://www.guavaworks.com/company-blog/guava-doesnt-do-cookie-cutter.html
http://www.carpages.co.uk/ford/ford-sand-sculptures-05-09-11.asp
http://www.thisismoney.co.uk/money/experts/article-2324270/Take-smaller-pension-pots-tax-free-leave-final-salary-untouched.html
http://paranoidnews.org/wp-content/uploads/2010/10/Alien-Hunt-Alarm-Clock.jpg
http://particulations.blogspot.co.uk/2010/08/headingley-hole.html
http://marvel.wikia.com/Stephen_Strange_(Earth-616)
Further Slides
The Phoenix Project
Continuous Delivery