![Page 1: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/1.jpg)
Survey: http://bit.ly/survey-srecon19
What I Wish I Knew Before Going On-call
SRECon 2019
![Page 3: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/3.jpg)
WHO WE ARE
Yelp Local Ads
Connect people with great local businesses
Advertiser billing and analytics
![Page 4: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/4.jpg)
Our team’s challenges
WHO WE ARE
1. Financially critical systems
~90% of company revenue is from ads
2. Wears many hats 🎩
On-call + Feature + Infra
3. Owns systems with many different tech stacks
Makes being on-call more challenging
4. Majority of the team is new grad hires
Makes onboarding even more important
![Page 5: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/5.jpg)
Our story
WHO WE ARE
Joined the team as new grad hires
Learned how to be on-call the hard way...
Now mentoring other engineers
![Page 6: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/6.jpg)
Newbie on-call struggles
● No established training process
● Decentralized + outdated documentation
● So much financial impact/pressure
STORY TIME: BEING ON-CALL
![Page 7: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/7.jpg)
SURVEY RESULTS
Did you feel ready before going on-call for the first time?
![Page 8: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/8.jpg)
SURVEY RESULTS
Why didn't you feel ready?
62%
76%
54%
38%
24%
Afraid of unknown situations
Lack of confidence
Poor understanding of systems
Lack of protocol
Afraid of asking for help
Survey within Yelp Engineering (2018)
![Page 9: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/9.jpg)
Why care about good onboarding?
Win 1: Makes your team scalable!
Win 2: Improve incident response
Win 3: Teaching is the best way to learn
Win 4: Confident new hires
ONBOARDING
![Page 10: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/10.jpg)
Workshop Goal
Build an efficient on-call onboarding system for your organization
![Page 11: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/11.jpg)
Agenda
1. Common Myths about On-Call
2. How to Create a Training Program
3. Runbook for Effective Incident Response
![Page 12: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/12.jpg)
4 Common Myths About On-calls
![Page 13: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/13.jpg)
Myth #1 “I need to know everything”
You are not supposed to know everything
![Page 14: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/14.jpg)
Myth #2 “I need to solve everything by myself”
You are supposed to ask for help
![Page 15: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/15.jpg)
Myth #3 “I need to find the root cause”
Root cause finding is a non-goal
![Page 16: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/16.jpg)
Myth #4 “I need to make the best/long-term fix”
You are supposed to mitigate the issue
![Page 17: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/17.jpg)
Setting the right expectations
1. Reduce (unnecessary) fear
2. More productive + efficient on-call
![Page 18: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/18.jpg)
Set the right expectations during training!
![Page 19: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/19.jpg)
Now onto the training program...
![Page 20: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/20.jpg)
My On-call “Training”
Timeline: Join team → Join on-call rotation
● “Ads Academy” (8H)
● On-call Intro (2H)
● On-point Rotation
● Shadowing
![Page 21: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/21.jpg)
What was good about my training?
It existed
On-point rotation
Shadowing
ONBOARDING
![Page 22: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/22.jpg)
What was difficult about my training?
Information dump
No emphasis on connections between systems
No emphasis on investigation/debugging tools
ONBOARDING
![Page 23: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/23.jpg)
The Goals of the Training Program
Goal 1. Be able to draw a mental picture of your system
Goal 2. Understand failure modes/alerts for the system
Goal 3. Know the tools for investigation
![Page 24: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/24.jpg)
Exercise
Let’s make an oncall training program!
![Page 25: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/25.jpg)
Exercise Agenda
Let’s make an oncall training program!
1. Make a Curriculum
2. Create Introduction
3. Cover Failure Modes
4. List Necessary Tools
What you need: Text editor of your choice
![Page 27: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/27.jpg)
Exercise #1
Let’s make a curriculum!
![Page 28: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/28.jpg)
Anti-example
Lesson Topic
1 Everything you need to know about ads on-call (2 hours)
![Page 29: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/29.jpg)
Tip: Avoid information overload
![Page 30: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/30.jpg)
Lesson Topic
1 Oncall Expectation + Overview of Ad systems
2 Billing (Critical)
3 Ad Delivery (Critical)
4 Ad Internal Reports/Metrics (Less Critical)
5 Targeting (Less Critical)
Ask yourself a question: Is there information-overload happening?
![Page 31: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/31.jpg)
Ask yourself a question: Is there information-overload happening?
Lesson Topic
1 Oncall Expectation + Overview of Ad systems ← Should be super high level
2 Billing (Critical)
3 Ad Delivery (Critical)
4 Ad Internal Reports/Metrics (Less Critical)
5 Targeting (Less Critical)
![Page 32: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/32.jpg)
Ask yourself a question: Is there information-overload happening?
Lesson Topic
1 Oncall Expectation + Overview of Ad systems
2 Billing (Critical) ← What if this is a complicated data pipeline with many alerts?
3 Ad Delivery (Critical)
4 Ad Internal Reports/Metrics (Less Critical)
5 Targeting (Less Critical)
![Page 33: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/33.jpg)
Lesson Topic
1 Oncall Expectation + Overview of Ad systems
2 Ad Analytics Pipeline (Critical)
3 Billing Pipeline (Critical)
4 Ad Delivery (Critical)
5 Ad Internal Reports/Metrics (Less Critical)
6 Targeting (Less Critical)
Split it into a reasonable unit!
![Page 34: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/34.jpg)
Lesson Topic
1 Oncall Expectation + Overview of Ad systems
2 Ad Analytics Pipeline (Critical)
3 Billing Pipeline (Critical)
4 Ad Delivery (Critical)
5 Ad Internal Reports/Metrics (Less Critical)
6 Targeting (Less Critical)
Ask yourself a question: Does the order of the topics make sense?
![Page 36: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/36.jpg)
Lesson Topic
1 Oncall Expectation + Overview of Ad systems
2 Ad Analytics Pipeline (Critical)
3 Billing Pipeline (Critical)
4 Ad Delivery (Critical) ← This is upstream of #2 and #3
5 Ad Internal Reports/Metrics (Less Critical)
6 Targeting (Less Critical)
Ask yourself a question: Does the order of the topics make sense?
![Page 37: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/37.jpg)
Lesson Topic
1 Oncall Expectation + Overview of Ad systems
2 Ad Delivery (Critical)
3 Ad Analytics Pipeline (Critical)
4 Billing Pipeline (Critical)
5 Ad Internal Reports/Metrics (Less Critical)
6 Targeting (Less Critical)
Ask yourself a question: Does the order of the topics make sense?
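One way to check the ordering question is to write down which topics depend on which and sort them so that upstream systems come first. The sketch below is only an illustration: the `prerequisites` map and the `lesson_order` helper are hypothetical, and the dependencies are assumed from the slides (Ad Delivery feeds the Ad Analytics Pipeline, which feeds the Billing Pipeline).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each topic lists the topics to teach before it.
# The edges are assumptions based on the upstream/downstream notes in the slides.
prerequisites = {
    "Overview of Ad systems": set(),
    "Ad Delivery": {"Overview of Ad systems"},
    "Ad Analytics Pipeline": {"Ad Delivery"},
    "Billing Pipeline": {"Ad Analytics Pipeline"},
}


def lesson_order(prereqs):
    """Return topics ordered so that every prerequisite comes before its dependents."""
    return list(TopologicalSorter(prereqs).static_order())


order = lesson_order(prerequisites)
for number, topic in enumerate(order, start=1):
    print(number, topic)
```

If two topics do not depend on each other, any relative order satisfies the sort, so criticality can be used to break ties.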
![Page 38: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/38.jpg)
Exercise #1
Let’s make an oncall training curriculum!
● Come up with a list of topics
● Chunk it into a “reasonable” size
● Sort them
3 mins
![Page 40: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/40.jpg)
Exercise #2
Let’s write a 10000 ft overview of the system!
10000 ft overview Actual oncall training
![Page 41: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/41.jpg)
Exercise #2
Why give an overview in on-call training?
● Make sure students are on the same page
● Make failure points clearer
10000 ft overview Actual oncall training
![Page 42: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/42.jpg)
Exercise #2
What should a 10000 ft overview include?
● Simple diagram
● Summary of the system (what it does, what depends on it, etc.)
10000 ft overview Actual oncall training
![Page 43: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/43.jpg)
Lesson Topic
1 What is oncall? + Overview of Ad systems
2 Ad Delivery (Critical)
3 Ad Analytics Pipeline (Critical)
4 Billing Pipeline (Critical)
5 Ad Internal Reports/Metrics (Less Critical)
6 Targeting (Less Critical)
![Page 44: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/44.jpg)
Example: Ad Analytics Pipeline
Input (S3): Ad_view log, Ad_click log
MapReduce: Join logs → Aggregate data
Output (Cassandra): Data Store ad_analytics
Downstream Consumers: Billing Pipeline, Targeting System
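To make the "join logs, then aggregate" stage of the diagram concrete, here is a toy in-memory version. It is a sketch under assumptions, not the actual pipeline: the record fields (`view_id`, `ad_id`) and the `join_and_aggregate` helper are hypothetical, since the slides do not show the log schemas or the MapReduce job.

```python
from collections import Counter

# Hypothetical log records; the real log schemas are not shown in the slides.
ad_view_log = [
    {"view_id": "v1", "ad_id": "ad_1"},
    {"view_id": "v2", "ad_id": "ad_1"},
    {"view_id": "v3", "ad_id": "ad_2"},
]
ad_click_log = [
    {"view_id": "v1"},
    {"view_id": "v3"},
]


def join_and_aggregate(views, clicks):
    """Join clicks to views on view_id, then count views and clicks per ad."""
    ad_by_view = {view["view_id"]: view["ad_id"] for view in views}
    view_counts = Counter(view["ad_id"] for view in views)
    # Clicks whose view_id has no matching view are dropped in this sketch.
    click_counts = Counter(
        ad_by_view[click["view_id"]]
        for click in clicks
        if click["view_id"] in ad_by_view
    )
    return {
        ad_id: {"views": view_counts[ad_id], "clicks": click_counts[ad_id]}
        for ad_id in view_counts
    }


ad_analytics = join_and_aggregate(ad_view_log, ad_click_log)
print(ad_analytics)
```

Even a toy like this gives students something to point at when discussing failure points, for example what happens when a click record has no matching view.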
![Page 45: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/45.jpg)
Tip: Use visual aids you can reuse
![Page 48: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/48.jpg)
Exercise #2
Let’s write a 10000 ft overview of the system!
1. Pick one topic from the curriculum
2. Summarize the system, tech stack, and failure points
3. Add a diagram
3 mins
![Page 50: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/50.jpg)
Exercise #3
Let’s write the “actual on-call training”
10000 ft overview Actual oncall training
![Page 51: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/51.jpg)
Exercise #3
Let’s write the “actual on-call training”
↑ Usually covers failure modes/alerts and how to respond to them
10000 ft overview Actual oncall training
![Page 52: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/52.jpg)
Tip: Use Past Incidents
![Page 53: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/53.jpg)
Exercise #3
Why use past incidents?
● Examples are the best teachers!
● Opportunity to make it interactive
![Page 54: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/54.jpg)
Example: Ad Analytics Pipeline
Alert: Ad Analytics Data Processing Failure
![Page 55: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/55.jpg)
Example: Ad Analytics Pipeline
Alert: Ad Analytics Data Processing Failure
Input (S3): Ad_view log, Ad_click log
MapReduce
Output (Cassandra): Data Store ad_analytics
Billing Pipeline, Targeting System
![Page 56: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/56.jpg)
Example: Ad Analytics Pipeline
Alert: Ad Analytics Data Processing Failure
Past Incidents:
● Backward-incompatible input schema change
● MapReduce task timeouts
![Page 57: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/57.jpg)
Exercise #3
Let’s write the “actual on-call training”
● List alerts/failure modes
● Map them onto your 10000 ft diagram
● Find at least one past incident for each alert
3 mins
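The three steps of this exercise can be kept as a small structured note and checked mechanically. The sketch below assumes a hypothetical layout: the `alerts` dictionary, its `diagram_stage` field, and the `alerts_missing_incidents` helper are illustrative names, and the second alert is a made-up placeholder to show what a gap looks like. The stage assigned to each alert is also an assumption.

```python
# Hypothetical training notes: alert name -> where it sits in the 10000 ft
# diagram and which past incidents illustrate it.
alerts = {
    "Ad Analytics Data Processing Failure": {
        "diagram_stage": "MapReduce",  # assumed location, for illustration only
        "past_incidents": [
            "Backward-incompatible input schema change",
            "MapReduce task timeouts",
        ],
    },
    # Placeholder entry (not from the slides): an alert nobody has documented yet.
    "Example alert without an incident yet": {
        "diagram_stage": "Output (Cassandra)",
        "past_incidents": [],
    },
}


def alerts_missing_incidents(alert_notes):
    """Return the names of alerts that have no past incident attached."""
    return sorted(
        name for name, note in alert_notes.items() if not note["past_incidents"]
    )


missing = alerts_missing_incidents(alerts)
print("Alerts still needing an example incident:", missing)
```

Alerts that show up in `missing` are candidates for an imaginary incident or a wargame scenario until a real one exists.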
![Page 59: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/59.jpg)
Exercise #4
Let’s teach necessary tools and know-how
![Page 60: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/60.jpg)
Example
How to read a service SignalFx dashboard
1. Explain
2. Show a dashboard screenshot from a past incident
3. Let students debug + ask questions
![Page 62: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/62.jpg)
Example
How to read a service SignalFx dashboard
(This should ideally be in a runbook)
![Page 63: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/63.jpg)
Tip: Let students apply knowledge ASAP
![Page 67: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/67.jpg)
Exercise #4
Let’s teach necessary tools and know-how
1. List tools and know-how (based on your answers from Exercise #3)
2. Make it interactive
3 mins
![Page 68: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/68.jpg)
Congratulations!
You have a (partially complete) oncall training program!
![Page 69: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/69.jpg)
Tips (Recap)
● Avoid information overload
● Use visual aids you can reuse
● Use past incidents
● Let students apply knowledge ASAP
![Page 70: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/70.jpg)
Beyond Training
![Page 71: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/71.jpg)
Beyond training
Knowledge sharing
Oncall handoff meeting: Show and tell how recent incidents were resolved
Wargame: Gain experience in a fast and safe way
Postmortem: Learning from past incidents
![Page 72: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/72.jpg)
Wargames
Wargame: Multi-person incident simulation game
Game master
● Reproduce the incident
● Drive conversations
● Ask questions and give hints
Oncall Player(s)
● Investigate and mitigate
● Apply knowledge and practice using tools
![Page 73: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/73.jpg)
3 Steps to start a wargame
![Page 74: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/74.jpg)
Step 1: Pick a scenario
![Page 75: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/75.jpg)
Wargame
Examples
Real past incidents
Imaginary incidents
- Seasonal traffic: Black Friday
- Critical system/database crashed
- Brainstorm or discuss what could happen and how to handle it
![Page 76: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/76.jpg)
Step 2: Prepare a game
![Page 77: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/77.jpg)
Wargame
Example
Wargame template: Incident Setup
Interactive
- conduct in a safe environment
Static
- dashboards/screenshots/logs/history of code
![Page 78: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/78.jpg)
Wargame
Example
Wargame template: Incident Setup
Step-by-step instructions for how to trigger the incident
❏ Prepare bad code <link>
❏ Prepare dashboard screenshot
❏ Set up an isolated env <config file link>
❏ Cmd to run batch in the env
  ❏ python ./batch.py --config config.yaml
❏ Wait for batch to fail
![Page 79: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/79.jpg)
Wargame
Example
Wargame template
Player roles
❏ Investigator -- <name>
❏ Communicator -- <name>
❏ Commander -- <name>
Player checklist
❏ Get relevant permissions
❏ Join external wifi/set up VPN
❏ Use wargame-only communication tools
  ❏ channel #wargame
  ❏ email alias wargame
  ❏ JIRA project WARGAME
![Page 80: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/80.jpg)
Wargame
Example
Wargame template: Hints
❏ Did you read runbook?
❏ Did you check batch log?
❏ Did you check recent code changes?
❏ Did you check dashboard: <screenshot>?
...
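The template sections above (incident setup, player roles, player checklist, hints) could also be kept as structured data, so the game master can reveal hints one at a time. A minimal sketch, assuming Python 3.9+; the class name `WargameTemplate`, its field names, and the `next_hint` helper are hypothetical and not from the talk, and the scenario text is illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WargameTemplate:
    """Hypothetical container for the wargame template sections."""
    scenario: str
    setup_steps: list[str]
    roles: dict[str, str]
    player_checklist: list[str]
    hints: list[str]
    hints_given: int = 0

    def next_hint(self) -> Optional[str]:
        """Return the next unrevealed hint, or None once all are used."""
        if self.hints_given >= len(self.hints):
            return None
        hint = self.hints[self.hints_given]
        self.hints_given += 1
        return hint


# Values mirror the slide placeholders; nothing here is real configuration.
game = WargameTemplate(
    scenario="Batch fails after bad code is deployed",
    setup_steps=[
        "Prepare bad code <link>",
        "Prepare dashboard screenshot",
        "Set up an isolated env <config file link>",
        "python ./batch.py --config config.yaml",
        "Wait for batch to fail",
    ],
    roles={
        "Investigator": "<name>",
        "Communicator": "<name>",
        "Commander": "<name>",
    },
    player_checklist=[
        "Get relevant permissions",
        "Join external wifi/set up VPN",
        "Use wargame-only communication tools",
    ],
    hints=[
        "Did you read runbook?",
        "Did you check batch log?",
        "Did you check recent code changes?",
        "Did you check dashboard: <screenshot>?",
    ],
)
```

Calling `game.next_hint()` repeatedly returns the hints in order and then `None`, which lets the game master pace hints instead of showing the whole list at once.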
![Page 81: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/81.jpg)
Step 3: Run the game
![Page 82: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/82.jpg)
Wargame
Tips for running the game
Ask questions
- Ask what makes them take actions
- Make sure player(s) and audience understand the situation
Take notes
- Runbook/Training/Monitoring/Alerting improvement
- Follow-up learning process
Invite audience
![Page 83: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/83.jpg)
Wargames
Wargame
Use tools to build your game
Oncall simulation text adventure game using Twine
http://bit.ly/oncall-game
![Page 84: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/84.jpg)
Break (5 mins)
Oncall Twine game: http://bit.ly/oncall-game
Optional materials: http://bit.ly/srecon19-oncall
![Page 85: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/85.jpg)
Runbooks for Effective Incident Response
![Page 86: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/86.jpg)
SURVEY RESULTS
Survey within Yelp Engineering (2018)
70% Reviewed the team’s runbooks before going on-call
40% Didn’t feel ready due to lack of protocol
Why didn't you feel ready?
“Clear protocol of pages we can get and how to handle them”
“Runbooks should be obvious to find and execute. At 3 AM you need dummy-proof instructions.”
“Better documentation”
“More documentation”
“Update and improve documentation and runbooks”
![Page 87: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/87.jpg)
THEORY OF THE RUNBOOK
Why care about good runbooks?
Win 1: Increase efficiency
Win 2: Reduce nervousness
Win 3: Stand-in for a mentor or back-up
![Page 88: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/88.jpg)
THEORY OF THE RUNBOOK
What is a runbook?
Technical runbook
Step-by-step instructions on how to act in an incident
- Impact assessment
- Mitigation
- Disaster recovery
Non-technical runbook
Guidelines for human process
- Human roles
- Communication process
- Escalation policy
![Page 89: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/89.jpg)
Example: Symptoms of a bad runbook
http://bit.ly/srecon19-oncall
![Page 90: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/90.jpg)
STORY TIME: BATCH RECOVERY
[Diagram: Daytime Web Traffic writes the Ad_view log and Ad_click log; the Nightly Batch Job joins the logs and aggregates the data into the ad_analytics Data Store; the Billing Pipeline and Targeting System are also shown.]
![Page 91: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/91.jpg)
![Page 92: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/92.jpg)
STORY TIME: BATCH RECOVERY
What made this runbook difficult to use?
2:00 am: Paged for failed batch job.
ALERT: ad_analytics failed
2:05 am: How do I rerun? Is it idempotent? Which cmd?
2:10 am: Search internal wiki for batch name.
1 result found: [Ads]Runbooks - Operations
![Page 93: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/93.jpg)
RUNBOOK EXAMPLES
What made this runbook difficult to use?
Runbooks - Operations
● General recovering tips
  ○ Campaigns not in ad_store
  ○ Errors in ad template
● Nagios
  ○ Background
  ○ Updating Alerts
  ○ Alerts
● ad_analytics
  ○ Man tronview and man tronctl to understand how to use tron
  ○ 1. Identify which run failed
  ○ 2. Identify which action failed
  ○ 3. fix/retry broken actions
  ○ Specific Batches
    ■ calculated_ad_analytics
    ■ clculate_ad_spend
    ■ Business_ad_control
● Reports
● Rerunning procedures
  ○ Identify which days need to be rerun
  ○ Identify which batches need to be rerun
● Gearman
  ○ View the logging output of the gearman workers
  ○ View the number of gearman workers and the number of jobs in the queue
  ○ Adding the removing gearman workers for particular queues
  ○ Cleaning out a queue
![Page 94: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/94.jpg)
RUNBOOK EXAMPLES
What made this runbook difficult to use?
Alerts
TODO: This section would benefit a lot from having our actual alerts listed and detailed here.
![Page 95: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/95.jpg)
![Page 96: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/96.jpg)
RUNBOOK EXAMPLES
What made this runbook difficult to use?
3. Fix/retry broken actions
If a batch died due to an EMR, DB, or other intermittent issue, attempt to run the action manually
If a batch died due to a logic error, push a fix and run the action manually
To run manually, read the command line printed in this output. It's between the "Node:" and "Requirements:" lines. You'll have to execute this as batch yourself.
$ tronview ad_analytics.XX.the_action_name
Once they run successfully manually, resume the rest of the job by skipping the action.
tronctl skip ad_analytics.XX.the_action_name
![Page 97: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/97.jpg)
![Page 98: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/98.jpg)
STORY TIME: BATCH RECOVERY
What made this runbook difficult to use?
2:40 am: Page secondary oncall
2:50 am: Secondary oncall comes online
![Page 99: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/99.jpg)
Me: “Where can I find the rerun command?”
Secondary: “You can try looking for that in the wiki”
Me: “I just checked, but it’s not very clear.”
Secondary: “Or maybe it’s in the Google Docs repo. Oh, and I’ve got some notes in my home directory, and I think I saw some emails about that a while ago.”
Me:
![Page 100: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/100.jpg)
STORY TIME: BATCH RECOVERY
What made this runbook difficult to use?
3:10 am: Run tron cmd and find previous run.
Action: $ python -m batch.ad_analytics --date 2019-03-02
4:20 am: Rerun with correct command.
RESOLVED: ad_analytics
![Page 101: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/101.jpg)
STORY TIME: BATCH RECOVERY
What made this runbook difficult to use?
Information overload
No clear action items
Ambiguous wording
Out of date
Hard to find/search
![Page 102: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/102.jpg)
What makes a good Technical runbook?
![Page 103: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/103.jpg)
TECHNICAL RUNBOOK
Tips for writing good technical runbooks
Include actual commands/screenshots
Map alert to clear action items
Inverted pyramid
Keep format consistent
Keep it up-to-date
![Page 104: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/104.jpg)
Alert Name: <exact alert name>
Description: <1 sentence description>
Stakeholder impact: <1 sentence impact>
Mitigation steps:
1. Try restarting: <command>
2. Monitor dashboards.
3. Inspect logs to diagnose issue: <link or See steps below>
If things do not recover, follow Escalation steps.
Escalation steps: Contact <team>. Massive ingestion delays should be communicated to <upstream and downstream teams>.
Related services: <upstream and downstream dependencies>
Dashboards: <links>
Related links: <other docs or related runbooks>
http://bit.ly/srecon19-oncall
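One way to keep the format consistent across runbooks is to store each entry as structured data and render it from a single template. The sketch below assumes Python 3.9+; `RunbookEntry`, `missing_fields`, and `render` are hypothetical names, the field list mirrors the template above, and the example values are illustrative with the angle-bracket placeholders left unfilled on purpose.

```python
from dataclasses import dataclass, field, fields


@dataclass
class RunbookEntry:
    """One runbook entry per alert, mirroring the template fields."""
    alert_name: str
    description: str
    stakeholder_impact: str
    mitigation_steps: list[str]
    escalation_steps: str
    related_services: list[str] = field(default_factory=list)
    dashboards: list[str] = field(default_factory=list)
    related_links: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Return the names of template fields that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def render(self) -> str:
        """Render the entry as plain text in the template's field order."""
        lines = [
            f"Alert Name: {self.alert_name}",
            f"Description: {self.description}",
            f"Stakeholder impact: {self.stakeholder_impact}",
            "Mitigation steps:",
        ]
        lines += [f"{n}. {step}" for n, step in enumerate(self.mitigation_steps, 1)]
        lines += [
            "If things do not recover, follow Escalation steps.",
            f"Escalation steps: {self.escalation_steps}",
            f"Related services: {', '.join(self.related_services)}",
            f"Dashboards: {', '.join(self.dashboards)}",
            f"Related links: {', '.join(self.related_links)}",
        ]
        return "\n".join(lines)


# Illustrative entry; placeholders are intentionally not filled in.
entry = RunbookEntry(
    alert_name="ad_analytics failed",
    description="The nightly ad_analytics batch job failed.",
    stakeholder_impact="<1 sentence impact>",
    mitigation_steps=[
        "Try restarting: <command>",
        "Monitor dashboards.",
        "Inspect logs to diagnose issue: <link>",
    ],
    escalation_steps="Contact <team>.",
)

print(entry.render())
print("Still empty:", entry.missing_fields())
```

A check like `missing_fields()` could run during review, so an entry with no dashboards or related links is caught before someone needs it at 3 AM.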
![Page 105: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/105.jpg)
Exercise: Let’s make your own runbook!
1. List all alerts
2. Customize the template
3. Pick a home for runbooks
![Page 106: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/106.jpg)
Step 1: List all alerts
2 mins
![Page 107: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/107.jpg)
Example: Ad Analytics Pipeline
Alert: Ad Analytics Data Processing Failure
Past Incidents:
● Backward-incompatible input schema change
● MapReduce task timeouts
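Once the alerts are listed, a small check can show which ones still have no runbook. This is only a sketch: `alerts_without_runbook` is a hypothetical helper, the second alert name and the URL are made-up placeholders, and matching is on the exact alert name, as the runbook template asks for.

```python
def alerts_without_runbook(alert_names, runbook_index):
    """Return the alerts that have no runbook entry, in their original order.

    runbook_index maps an exact alert name to the location of its runbook.
    """
    return [name for name in alert_names if name not in runbook_index]


# "Ad Analytics Data Processing Failure" is the alert from the slide;
# the other alert name and the URL are placeholders for illustration.
alert_names = [
    "Ad Analytics Data Processing Failure",
    "Example Alert Without Runbook",
]
runbook_index = {
    "Ad Analytics Data Processing Failure": "https://wiki.example.com/runbooks/ad-analytics",
}

print(alerts_without_runbook(alert_names, runbook_index))
```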
![Page 108: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/108.jpg)
Step 2: Customize the template
![Page 109: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/109.jpg)
![Page 110: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/110.jpg)
![Page 111: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/111.jpg)
2 mins
![Page 112: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/112.jpg)
Step 3: Pick a home for runbooks
1 min
![Page 113: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/113.jpg)
“You can try looking for that in the wiki, or maybe it’s in the Google Docs repo. Oh, and I’ve got some notes in my home directory, and I think I saw some emails about that a while ago”
![Page 114: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/114.jpg)
TECHNICAL RUNBOOK
A good runbook is easy to find
Centralized “home”
Make runbooks searchable
Make alerts rich: Put actual commands and/or runbook link in the alert
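A rich alert might carry the runbook link and the first command in the page itself, so the responder does not have to search the wiki. A minimal sketch: `build_alert_message` is a hypothetical helper and not tied to any real alerting tool, the URL is a placeholder, and the command is the one from the batch recovery story.

```python
from typing import Optional


def build_alert_message(alert_name: str, runbook_url: str,
                        command: Optional[str] = None) -> str:
    """Build alert text that includes the runbook link and, if known,
    the first command to try."""
    lines = [f"ALERT: {alert_name}", f"Runbook: {runbook_url}"]
    if command:
        lines.append(f"First step: $ {command}")
    return "\n".join(lines)


message = build_alert_message(
    "ad_analytics failed",
    "https://wiki.example.com/runbooks/ad-analytics",  # placeholder URL
    command="python -m batch.ad_analytics --date 2019-03-02",
)
print(message)
```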
![Page 115: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/115.jpg)
![Page 116: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/116.jpg)
Example: Non-Technical Runbook
![Page 117: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/117.jpg)
RUNBOOK EXAMPLES
Non-technical runbook
Incident Response Checklist
This document is for Ads incident first responders. First assess, escalate until the appropriate team is established, and take on the appropriate role.
- Assess
- Escalate
- Communicate
- Investigate and Fix
- Clean Up
http://bit.ly/oncall-srecon19
![Page 118: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/118.jpg)
NON-TECHNICAL RUNBOOK
Non-technical runbook
Incident Response Checklist: Communicate
❏ Create a ticket in the ADS project with a brief description of the issue.
❏ Add secondary and manager as watchers
❏ Consolidate triage communications to #ads-incident.
❏ Send email to ads-incident@ to liaise with financial stakeholders and downstream consumers of data: email templates.
http://bit.ly/oncall-srecon19
![Page 119: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/119.jpg)
Oncall Training
Debunk myths
Avoid information overload
Use Visual Aid
Focus on tools
Beyond training
Knowledge share
Wargames
Effective Runbooks
Productive and Happy On-call
Continuous Improvement
![Page 121: What I Wish I Knew Before Going On-call · 1 Oncall Expectation + Overview of Ad systems 2 Ad Analytics Pipeline 3 Billing Pipeline(Critical) 4 Ad Delivery (Critical) 5 Ad Internal](https://reader034.vdocument.in/reader034/viewer/2022042206/5ea825acdc652429544134e9/html5/thumbnails/121.jpg)
REFERENCES
Additional Resources
Training new on-calls
● Accelerating SREs to On-Call and Beyond
● From Zero to Hero: Recommended Practices for Training your Ever-Evolving SRE Teams
Runbooks
● 7 Deadly Sins of Documentation
● Do Docs Better: Practical Tips
Postmortems/wargames
● Postmortem culture: learning from failure
● The oncall simulator: Building an interactive game for teaching incident response!