production monitoring platform

Ariel Smoliar

Monitoring Platform

Objective

Develop a data-driven service to understand, mitigate and prevent production outages

“You can observe a lot by just watching.”

(Yogi Berra)

Deliver a reliable and scalable intelligent monitoring platform to make customers and production happy

Leveraging Data

• Logging
• Time-series metrics
• API performance
• Normalization
• Trends on time-series data

Implement Machine Learning

• Metrics correlation
• Outlier and anomaly detection
• Predictive analytics

Embrace DevOps

• Collaboration
• MTTI and MTTR
• Failure automation
• War room

Approach to Solution

Data Monitoring

• The goal of monitoring is to detect problems before they turn into outages, not to detect outages

• In my product planning I will be focusing on the following components:
– Collecting data
– Visualizing data
– Trending and alerting

Let’s Proceed in Three Phases:

Phase 1: Interview dev and ops teams to better understand production, monitoring methods and DevOps practices

Phase 2: Implement immediate changes to the postmortem process based on challenges that were identified

Phase 3: Develop a data-driven monitoring system to handle the outages in a period of one year

Roadmap Over the Next Year

Q1 Q2 Q3 Q4

Phase 1: Interviewing

Phase 2: Outage Understanding
Outcome: Detailed and focused postmortem service

Phase 3(a): Outage Mitigation
Outcome: New capabilities to reduce mean time to identification of outages

Phase 3(b): Outage Prevention
Outcome: New capabilities to reduce mean time to resolution of outages

Phase 3(c): Continuing Outage Prevention
Outcome: Contextualized data platform to reduce and prevent outages

Phase 1: Interview Dev and Ops Teams

Discuss the following topics:

Production

• What are the main problems you see today in your production deployment?
• Can you specify any common or unusual patterns (dependency on user traffic, etc.)?
• Across how many data centers and cloud providers is the code deployed?

Monitoring

• Which monitoring and alerting systems are being used?
• Which metrics are you using to measure continuous improvement?
• What KPIs are you using?
• What data do you log?

DevOps

• Which production alerts or incidents require postmortem?
• How is knowledge shared today between Ops and Dev teams?
• How do you allocate ownership for fixing bugs after an outage?
• What is the actionable learning process after outage investigation?
• What are the communication channels?

Phase 2: Outage Understanding Immediate Changes

• Postmortem format should include four main components and not take too much time to complete:
– Description of the outage
– Timeline of the events that identify the sequence of what actually happened
– Contributing conditions analysis: why the outage occurred and what contributed to it
– Recommendations to prevent the outage in the future

• A company’s greatest asset is its people. We need to make sure that engineers and ops feel comfortable sharing the relevant information, so that root cause analysis can be conducted more effectively

• Actionable learning and ownership:
– Assign tasks to team members and track progress (field ticket/bug id)
– Update playbook (GitHub/wiki) depending on the recommendations
– Encourage discussion between engineering and ops teams in live chat rooms

Goal: Make sure the postmortem focuses on the process and the technology, not on finding someone to blame; ensure that the data allows for an actionable learning process
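The four-part postmortem format and the ownership tracking described above could be captured in a small record type. This is only a sketch: the class and field names (`Postmortem`, `TimelineEvent`, `ActionItem`, `ticket_id`) are hypothetical and chosen for illustration, not taken from an existing service.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class TimelineEvent:
    timestamp: datetime
    description: str


@dataclass
class ActionItem:
    description: str
    owner: str            # team member the task is assigned to
    ticket_id: str = ""   # ticket/bug id used to track progress
    done: bool = False


@dataclass
class Postmortem:
    # The four components of the format: description, timeline,
    # contributing conditions, recommendations.
    description: str
    timeline: List[TimelineEvent] = field(default_factory=list)
    contributing_conditions: List[str] = field(default_factory=list)
    recommendations: List[ActionItem] = field(default_factory=list)

    def add_event(self, timestamp: datetime, description: str) -> None:
        """Add an event and keep the timeline in chronological order."""
        self.timeline.append(TimelineEvent(timestamp, description))
        self.timeline.sort(key=lambda event: event.timestamp)

    def open_items(self) -> List[ActionItem]:
        """Recommendations that still need work."""
        return [item for item in self.recommendations if not item.done]
```

Note that the record has no "who caused it" field; owners appear only on follow-up tasks, which keeps the form focused on process and technology.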

Priorities for the Team

Backend/UI

• Expanding the functionalities of the service to:
– Assign ownership and prioritize tasks
– Automatically open JIRA ticket to track the progress
– Update production launch readiness checklist (optional)
– Tag events (data center, device, etc.)

• Adding screenshot of graphs to the form

• Visualizing events that lead to outage on timeline

• Storing event timelines

Data Science

• Exploring option to use monitoring tools (ganglia/CloudWatch) API to pull metric data

• Reviewing recent outage data to look for patterns

Mockups: Timeline visualization of events during an outage investigation

Phase 3(a): Outage Mitigation

• We should be able to better investigate outages with the PostMortem service
– Analyzing multiple timelines of previous outages (historical data) side by side can help to identify patterns and improve MTTI and MTTR
– If a sequence of outage events is repeated, we should make sure that the postmortem recommendations are better implemented
– Sharing knowledge, graphs and reports from the PostMortem service can improve collaboration between teams

• We will be designing an open API platform to collect and analyze data (network, databases, APM metrics, servers, system, logs, CDN) across all domains from all our monitoring systems into a single place

• We will start exploring multiple analytics areas (baselining, correlation, trending, outlier and anomaly detection) on time-series data and can expand to include categorical data

• We will set bi-monthly meetings to share information and get feedback from our internal customers in order to learn from recent outages and communicate our progress

Goal: Expand the postmortem process with new tools to reduce the time spent on identifying and investigating an outage. This phase will also involve designing the advanced platform


• Designing and implementing platform and data pipeline to collect, analyze and store timestamped numerical data

• Automating historical outage timelines comparison

• Adding reporting system and option to share analysis insights

• Tracking system of open tasks from previous outages

• Examining baseline creation for production

• Initial work on correlation analysis across multiple domains (PCA, etc.)

• Exploring open source projects (Netflix, Twitter, Etsy) for outlier and anomaly detection

• Reviewing trending algorithms
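As a starting point for the baseline and outlier items above, here is a minimal sketch of one simple technique: a trailing-window baseline with a z-score test. It is a stand-in for discussion, not the approach of any of the open source projects mentioned; the function name, `window` and `threshold` are assumptions to be tuned on real production data.

```python
from statistics import mean, stdev


def rolling_outliers(values, window=5, threshold=3.0):
    """Return the indexes of points that deviate from the baseline.

    The baseline for point i is the mean of the previous `window` points.
    A point is flagged when it is more than `threshold` standard
    deviations away from that mean.
    """
    flagged = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: any change at all is a deviation.
            if values[i] != mu:
                flagged.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical request latencies (ms) sampled once per minute.
latencies = [10, 11, 10, 12, 11, 10, 50, 11]
print(rolling_outliers(latencies))
```

One known weakness of this sketch: once a spike enters the window it inflates the baseline, so the points right after it are judged leniently. Robust statistics (median, MAD) are a common remedy.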


Mockups: Presenting multiple timelines of previous outages

Phase 3(b): Outage Prevention

• We should work with other teams to identify the business’s KPIs and then determine which metrics can be collected to create and monitor those KPIs. Some examples of KPIs:
– Availability, latency, HTTP error codes (4xx, 5xx), user experience/number of users/revenue, etc.

• As we are moving forward with the new monitoring platform, it’s important to see if we are improving these three parameters:
– Mean Time to Identification (MTTI)
– Mean Time to Resolution (MTTR)
– Number of outages

• We will focus on data quality and stress the importance of logging to the engineering teams because the results of our analytics engine (for example correlating infrastructure metrics related to end user experience with our mobile app) depend on the data we have

• We will keep automating our analytics engine to ensure that the platform is scalable and not built on top of pre-defined patterns or rules

Goal: Improve data collection, processing, normalization and correlation capabilities across the environments and data sources
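To track the three parameters above, MTTI and MTTR can be computed directly from stored outage timelines. The sketch below assumes each outage record carries three timestamps and that both means are measured from the start of the outage; teams define MTTR differently (some measure from identification), so the definition should be agreed on first. The `Outage` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Outage:
    started: datetime      # when the problem began
    identified: datetime   # when the problem was recognized as an outage
    resolved: datetime     # when service was restored


def mean_delta(deltas):
    """Average a non-empty collection of timedeltas."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)


def mtti(outages):
    """Mean Time to Identification, measured from outage start."""
    return mean_delta(o.identified - o.started for o in outages)


def mttr(outages):
    """Mean Time to Resolution, measured from outage start (an assumption)."""
    return mean_delta(o.resolved - o.started for o in outages)


history = [
    Outage(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 10), datetime(2024, 1, 1, 11, 0)),
    Outage(datetime(2024, 2, 1, 9, 0), datetime(2024, 2, 1, 9, 20), datetime(2024, 2, 1, 9, 30)),
]
print(len(history), mtti(history), mttr(history))
```

Computing these per quarter gives a simple way to check whether each phase of the roadmap is moving the numbers.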


• Building scalable and stable platform to ingest data from multiple sources

• Visualization of results:
– beautiful dashboards
– trends
– correlations

• Alerting based on trends

• Implementing better data flow and sharing (RBAC)

• Implementing trends based on time-series data

• Implementing and evaluating results of running metrics correlation on-demand

• Testing baselines and anomaly detection (AD) using ROC curves
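For the ROC-based testing above, one way to evaluate an anomaly detector is to compare its scores against labels taken from past outages. The sketch below assumes each data point has an anomaly score (higher means more suspicious) and a label of 1 for a real incident or 0 for normal behavior; both function names are illustrative.

```python
def roc_points(scores, labels):
    """Return (false_positive_rate, true_positive_rate) pairs.

    One point is produced per distinct score used as an alert threshold,
    from the strictest threshold to the loosest, starting at (0, 0).
    Requires at least one positive and one negative label.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical detector scores and ground-truth labels.
scores = [0.9, 0.8, 0.4, 0.2]
labels = [1, 1, 0, 0]
print(auc(roc_points(scores, labels)))
```

An area close to 1.0 means the detector ranks real incidents above normal behavior; close to 0.5 means it is no better than chance. The curve also shows the false positive cost of each alert threshold, which matters for alert fatigue.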


Logs are not sexy but…

Logging Practice

• Log everything: this lets us use every customer action or internal transaction to gain insights into what’s working and what’s not

• Assign transaction ID (session ID for example) through the app server for every transaction, expediting the investigation process

• Collect logs into our log management system; later alerts will be streamed to the new platform
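The transaction ID practice above can be sketched with the Python standard library: a context variable holds the ID for the current request, and a logging filter stamps it on every record. The logger name `app`, the `txn=` field and the `handle_request` function are assumptions for illustration; a real app server would set the ID in its request middleware.

```python
import contextvars
import io
import logging
import uuid

# Holds the transaction ID of the request currently being handled.
transaction_id = contextvars.ContextVar("transaction_id", default="-")


class TransactionIdFilter(logging.Filter):
    """Attach the current transaction ID to every log record."""

    def filter(self, record):
        record.transaction_id = transaction_id.get()
        return True


def build_logger(stream):
    """Create a logger whose output lines include the transaction ID."""
    logger = logging.getLogger("app")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s txn=%(transaction_id)s %(message)s"))
    handler.addFilter(TransactionIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, action):
    """Assign one ID per transaction so all of its log lines can be joined."""
    token = transaction_id.set(uuid.uuid4().hex)
    try:
        logger.info("start %s", action)
        logger.info("finish %s", action)
    finally:
        transaction_id.reset(token)


output = io.StringIO()
handle_request(build_logger(output), "checkout")
print(output.getvalue())
```

During an investigation, searching the log management system for a single `txn=` value then returns the full story of one transaction.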

API Monitoring

To enrich the data, log each API call and monitor the following information:
– Error code rate (authorization failures)
– Latency (90th, 95th percentile)
– Dependencies on 3rd party APIs as time spent on external services
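The three API measurements above can be derived from the logged calls. This sketch assumes each log entry reduces to a tuple of (HTTP status, total latency in ms, time spent on external services in ms), counts any 4xx or 5xx status as an error, and uses the nearest-rank percentile definition; other percentile definitions interpolate and give slightly different values.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value such that at least
    pct percent of the samples are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_calls(calls):
    """Summarize a non-empty list of (status, latency_ms, external_ms) tuples."""
    calls = list(calls)
    latencies = [latency for _, latency, _ in calls]
    errors = sum(1 for status, _, _ in calls if status >= 400)
    total_latency = sum(latencies)
    external = sum(ext for _, _, ext in calls)
    return {
        "error_rate": errors / len(calls),
        "p90_ms": percentile(latencies, 90),
        "p95_ms": percentile(latencies, 95),
        # Share of total request time spent waiting on 3rd party APIs.
        "external_share": external / total_latency if total_latency else 0.0,
    }


# Hypothetical sample: ten calls, two of them failing.
sample = [(200, 10 * (i + 1), 5) for i in range(8)] + [(401, 90, 5), (500, 100, 5)]
print(summarize_calls(sample))
```

Percentiles are used rather than the average because a small number of very slow calls, which is what users notice, barely moves the mean.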

Phase 3(c): Continuing Outage Prevention

• At this point our platform is already contributing to outage mitigation:
– Data across all domains is collected, analyzed and visualized
– Easier to share information based on historical data
– Trends on time-series data allow us to predict earlier if something may go wrong, preventing outages

• Improving data collection, processing, normalization and centralizing monitoring data sources is an ongoing process. Any new sources can enrich the data and help adjust the algorithms

• This phase will be critical in evaluating the machine learning algorithms and making sure we have a robust alerting platform (false positives and true positives) to reduce the number of outages

Goal: Converge the capabilities we have built towards a better system to reduce the number of outages
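As a small illustration of predicting problems from trends on time-series data, the sketch below fits a least-squares line to evenly spaced samples and estimates how many more samples remain before a rising metric reaches a limit. It is a deliberately simple stand-in for the predictive analytics planned here; the function names and the disk usage example are hypothetical, and it only handles metrics that grow toward an upper limit.

```python
def linear_trend(values):
    """Least-squares slope and intercept for evenly spaced samples
    taken at x = 0, 1, 2, ... (needs at least two samples)."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def steps_until(values, limit):
    """Estimate how many future samples remain before the fitted line
    reaches `limit`. Returns None when the trend is flat or falling."""
    slope, intercept = linear_trend(values)
    if slope <= 0:
        return None
    crossing = (limit - intercept) / slope
    return max(0.0, crossing - (len(values) - 1))


# Hypothetical disk usage (%) sampled once per hour.
disk_usage = [50, 55, 60, 65, 70]
print(steps_until(disk_usage, limit=90))
```

An alert could fire when the estimate drops below the time the team needs to react, which is earlier than a fixed threshold on the metric itself would fire.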


Backend/UI

• Improving the platform infrastructure

• Monitoring the performance of the platform with the new solution

• Visualizing outlier and anomaly detection results

• Providing visibility into potential problems (predictive)

• Configuring chat rooms, emails, teams and owners to share information/alerts

• Planning a failure automation process

Data Science

• Implementing outlier and anomaly detection and evaluating performance

• Testing predictive analytics
– alerting based on sequence of events (divergence from normal baseline) that may lead to an outage

• Open source the new AD framework

Long-Term Product Vision

Automation

Automating workflow for relevant teams and advancing failure automation will be needed for the growing number of employees and the increasingly complex infrastructure.

Collaboration

Utilizing a war room will make sure that all relevant teams are involved and monitoring together. An enhanced onboarding process will be needed for new engineers to understand potential issues with production.

Analytics

Reducing the massive data stream to a more contextualized view will allow faster escalation. Clustering, predictive analytics, and a recommendation capability will be the core for the success of the solution.

Conclusions

Deploying this three-phase approach will help to:

• Contextualize insights across all domains to make sure the best user experience is continually provided

• Reduce the time required to investigate and resolve production problems, leading to increased uptime

• Increase productivity: the right information gets to the right people at the right time
