Heroku Operations - Heavybit · Noah Zoschke (noah@heroku.com)
TRANSCRIPT
Problem: Cloud Services
Lost EBS + EC2 API + Customer Databases
30% App Servers
Internal Ops Apps + Instances
AWS API Offline: Can’t get new capacity
PagerDuty Crushed: No alerts coming through
Heroku API Offline: Customers helpless
Solution: HA Architecture
Solution: People & Operational Culture
A Personal Account...
Oh sh*t, I think the pager is blowing up...
Problem: Culture
Feature Culture → Too Much Software: No inventory of what’s up or down; surprising dependencies
Hacker Culture → Poorly Written Software: Feature rich, not foolproof; lots of “beta” services with production workloads
Rockstar Culture → Individual Ownership: Lots of services with a low bus factor
Implicit Culture → Unclear Expectations: Can I escalate? How do we prioritize services and customers?
“Feature Culture” Side Effect: Legacy Services
Problem: Legacy Services
Two Routing Services:
Router → Nginx/Varnish → Dyno (Bamboo)
Router → Dyno (Cedar)
Two Database Services:
Shiny New Dedicated Databases (Heroku Postgres)
Years-old Legacy Shared Databases
Five Metrics Services, etc...
Solution: Sunsetting Culture
Treat Sunsetting as First-Class Product and Engineering Work
Meticulously Catalog Running Services
Celebrate Success When Shutting One Down
Recipe: Lifecycle Board
Prototype → Development → Production → Deprecated → Deactivated → Sunset
Follow Checklists to Advance
Reflect Explicit Owners
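The lifecycle board can be read as a small state machine: a service moves one stage at a time, only when the checklist for the next stage is complete, and only if it has an explicit owner. The sketch below is one possible way to model that, not the tooling Heroku used. The `Service` class, the `advance` function, the service name, and the checklist items are all hypothetical; only the stage names come from the slide.

```python
from dataclasses import dataclass, field

# Stage order taken from the slide.
STAGES = ["Prototype", "Development", "Production",
          "Deprecated", "Deactivated", "Sunset"]


@dataclass
class Service:
    name: str
    owner: str                                # explicit owner shown on the board
    stage: str = "Prototype"
    done: set = field(default_factory=set)    # checklist items completed so far


def advance(service, checklists):
    """Move a service to the next stage if that stage's checklist is complete.

    `checklists` maps a target stage to the set of items required to enter it.
    """
    index = STAGES.index(service.stage)
    if index == len(STAGES) - 1:
        raise ValueError(f"{service.name} is already at {service.stage}")
    if not service.owner:
        raise ValueError(f"{service.name} has no explicit owner")
    target = STAGES[index + 1]
    missing = checklists.get(target, set()) - service.done
    if missing:
        raise ValueError(f"cannot enter {target}; missing: {sorted(missing)}")
    service.stage = target
    return service


# Hypothetical checklist contents and service, for illustration only.
checklists = {"Production": {"ops docs", "alerting", "staging parity"}}
svc = Service(name="example-router", owner="routing-team", stage="Development")
svc.done.update({"ops docs", "alerting"})
```

Calling `advance(svc, checklists)` at this point raises an error naming the missing "staging parity" item; once that item is added to `svc.done`, the same call moves the service to Production.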
“Hacker Culture” Side Effect: Inoperable Software
Recipe: Production Checklist
Code is visible on GitHub
Has operations docs with executable instructions for common tasks
Has a high-fidelity staging setup with production parity
Alerts a human if it is down
Uses structured logging
Enforces SSL access
Any credentials and their rotation procedures are added to “cred rolls” list
Send a launch email to engineering@ describing the new component
Move to Production on the Engineering Lifecycle board
Auto-scaled to maintain the needed number of instances
Set up to terminate unhealthy instances
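The last two checklist items (keep the needed number of instances, terminate unhealthy ones) amount to a reconcile step: compare the fleet you have with the fleet you want and derive the actions. The following is a minimal sketch of that idea under stated assumptions. `Instance`, `plan_reconcile`, and the instance ids are hypothetical names, and no real cloud API is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    instance_id: str
    healthy: bool


def plan_reconcile(desired, instances):
    """Return (ids_to_terminate, number_to_launch) to reach `desired` healthy instances."""
    unhealthy = [i.instance_id for i in instances if not i.healthy]
    healthy = [i.instance_id for i in instances if i.healthy]
    surplus = healthy[desired:]                   # healthy instances beyond the target
    to_launch = max(desired - len(healthy), 0)    # replacements still needed
    return unhealthy + surplus, to_launch


# Hypothetical fleet: three instances, one of them unhealthy.
fleet = [Instance("i-a", True), Instance("i-b", False), Instance("i-c", True)]
terminate, launch = plan_reconcile(3, fleet)
print(terminate, launch)  # prints: ['i-b'] 1
```

Separating the plan from its execution keeps the decision logic testable without touching real infrastructure.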
“Implicit Culture” Side Effect: Platform and Pager Chaos
Problem: Implicit
Do I need to fix these warnings this week? Or put it off?
Can I escalate this alert? Should I?
Should I update the status site? Will someone else?
Recipe: PagerDuty Discipline
Every engineer is on-call
Every page is visible in HipChat
Monkey - Everyone should help ack pages in HipChat during work hours
Level 1 - Explicit expectation of first responder after hours
Level 2 - Explicit value that the team has each other’s back
Engineering Manager - Explicit accountability for the whole team and its body of work service
Incident Commander - Experts trained in explicit procedures around updating the status site, opening up AWS tickets, paging extra engineers, etc.
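The roles above can be sketched as an explicit escalation rule, so that "Can I escalate?" has a written answer. The role names come from the slide; everything else here is an assumption made for illustration, including the work-hours window, the 15-minute escalation interval, the order of the chain, and the `who_to_page` function itself. This is not PagerDuty's API or Heroku's actual policy.

```python
WORK_HOURS = range(9, 18)      # assumption: 09:00-17:59 local time
ESCALATE_AFTER_MIN = 15        # assumption: minutes without an ack before escalating

# Assumed escalation order, built from the roles named on the slide.
CHAIN = ["Level 1", "Level 2", "Engineering Manager"]


def who_to_page(hour, minutes_unacked, major_incident=False):
    """Return the roles that should have been notified for a page so far."""
    roles = []
    if hour in WORK_HOURS:
        roles.append("Monkey")             # everyone helps ack during work hours
    steps = min(minutes_unacked // ESCALATE_AFTER_MIN, len(CHAIN) - 1)
    roles.extend(CHAIN[: steps + 1])       # one more level per unacked interval
    if major_incident:
        roles.append("Incident Commander") # status site, AWS tickets, extra engineers
    return roles


print(who_to_page(hour=3, minutes_unacked=0))
print(who_to_page(hour=14, minutes_unacked=20))
```

Writing the rule down, in code or in a document, is the point: the expectation becomes explicit rather than something each engineer has to guess at 3 a.m.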
Recipe: Pager Metrics
Measure everything and review weekly
After-hours pages are detrimental to engineering health and well-being
Engineers deserve weeks with no pages
Engineers have power to improve the operator experience
Engineering Managers are responsible for managing balance between operations and feature work
Service Reliability Engineering (SRE) team is accountable for overall pager burden program
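A weekly review needs a few numbers per engineer: total pages, after-hours pages, and who had a week with no pages at all. The sketch below computes those from a list of page events. It is an illustration under assumptions: the input format, the work-hours definition, the `weekly_summary` function, and the engineer names are hypothetical, and a real review would pull events from the paging system's export.

```python
from collections import Counter
from datetime import datetime

WORK_HOURS = range(9, 18)   # assumption: 09:00-17:59 on weekdays counts as work hours


def is_after_hours(at):
    """True for weekends and for weekday times outside WORK_HOURS."""
    return at.weekday() >= 5 or at.hour not in WORK_HOURS


def weekly_summary(pages, engineers):
    """Summarize one week of pages.

    `pages` is a list of (engineer, datetime) tuples; `engineers` is the full
    on-call roster, so that engineers with zero pages are still reported.
    """
    total = Counter(name for name, _ in pages)
    after_hours = Counter(name for name, at in pages if is_after_hours(at))
    quiet = sorted(name for name in engineers if total[name] == 0)
    return {"total": dict(total), "after_hours": dict(after_hours), "no_pages": quiet}


# Hypothetical week of pages for a three-person roster.
pages = [
    ("ana", datetime(2014, 3, 3, 2, 30)),   # Monday, 02:30
    ("ana", datetime(2014, 3, 4, 14, 0)),   # Tuesday, 14:00
    ("bo", datetime(2014, 3, 8, 11, 0)),    # Saturday, 11:00
]
summary = weekly_summary(pages, ["ana", "bo", "cy"])
print(summary)
```

Here `summary["after_hours"]` counts one page each for "ana" and "bo", and `summary["no_pages"]` lists "cy", which is the kind of week the slide says engineers deserve.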