having a pulse on your platform - endpointcon amsterdam 2015 keynote
Having a Pulse On Your Platform
Kamyar Mohager (@kamyarsayshello)
Engineering Lead, Partner Engineering
WHAT WE’LL COVER
• WHY BOTHER MONITORING?
• THE TECHNOLOGY
• HOW WE OPERATIONALIZE
WHY BOTHER MONITORING?
INTERNALLY
• Operations: Need to know the health of your platform just like any other app or frontend client. Know your API is down before your developers do
• Business: Make data-driven decisions
EXTERNALLY
• API availability impacts external apps and their business
• Provide some level of monitoring (and possibly alerting) for developers externally so they’re not left in the dark
• Developer empathy is important
Technology
APACHE KAFKA
● Pub-Sub Messaging and Queuing System
● Data backbone for LinkedIn
INGRAPHS
● Visualization Frontend for metrics
● Standard tool for all LinkedIn Eng & Ops
API-ANALYZER
● Visualization Frontend specific to LinkedIn Platform
● Used by Platform and SRE teams for Operational needs
APACHE HADOOP
● Distributed Data Storage and Processing
● Used by Platform for Business / Product Analytics
KAFKA AT A GLANCE
Diagram: Producer (API Gateway) → Broker (Kafka Topic: ExternalApiAccessEvent; diagram labels AP0, AP1) → Consumer (InGraphs, API-Analyzer, Hadoop)
EXAMPLE KAFKA TOPIC
ExternalApiAccessEvent
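The schema shown on this slide did not survive in the transcript. As a rough sketch only, an access event could carry the facets mentioned elsewhere in this deck (application ID, IP address, HTTP method and response code, latency, member ID, path, query params). Every field name below is an assumption made for illustration, not LinkedIn's actual schema, and JSON is used only as a stand-in wire format.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class ExternalApiAccessEvent:
    """Hypothetical event shape; all field names are illustrative assumptions."""
    timestamp_ms: int
    app_id: str
    partner_program: str
    ip_address: str
    http_method: str
    path: str
    status_code: int
    latency_ms: float
    member_id: str = ""
    query_params: dict = field(default_factory=dict)


def serialize(event: ExternalApiAccessEvent) -> bytes:
    """Encode the event as JSON bytes, ready to hand to a message producer."""
    return json.dumps(asdict(event), sort_keys=True).encode("utf-8")


def deserialize(payload: bytes) -> ExternalApiAccessEvent:
    """Decode JSON bytes back into an event on the consumer side."""
    return ExternalApiAccessEvent(**json.loads(payload.decode("utf-8")))
```

In this sketch the API gateway would call `serialize` once per request, and each consumer would call `deserialize` on the bytes it reads from the topic.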
INGRAPHS
• Standard visualization framework for operational metrics used @ LinkedIn
• Configuration driven with pre-selected applications to create monitoring dashboards
• Hooks into auto-alerting system
DATA FLOWING TO INGRAPH
DEVIL IN THE (MONITORING) DETAILS
WHO
● Entire Platform (aggregate)
● Per Partner Program
● Per Application
WHAT
● QPS
● Latency
● HTTP Response codes (4xx, 5xx)
● APIs / Endpoints (granular to specific HTTP methods)
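To make the WHO / WHAT split concrete, here is a minimal sketch that rolls a window of access events up into QPS, mean latency, and 4xx/5xx rates. The event keys ("app_id", "partner_program", "latency_ms", "status_code") and the function name are assumptions for illustration; this is not how InGraphs computes its metrics.

```python
from collections import defaultdict


def summarize(events, window_seconds, group_by="app_id"):
    """Aggregate QPS, mean latency, and 4xx/5xx rates per group.

    `events` is an iterable of dicts with assumed keys "latency_ms" and
    "status_code", plus whatever key `group_by` names. Pass group_by=None
    for a single platform-wide aggregate.
    """
    buckets = defaultdict(lambda: {"count": 0, "latency_sum": 0.0, "4xx": 0, "5xx": 0})
    for event in events:
        key = event[group_by] if group_by else "platform"
        b = buckets[key]
        b["count"] += 1
        b["latency_sum"] += event["latency_ms"]
        if 400 <= event["status_code"] < 500:
            b["4xx"] += 1
        elif 500 <= event["status_code"] < 600:
            b["5xx"] += 1

    summary = {}
    for key, b in buckets.items():
        summary[key] = {
            "qps": b["count"] / window_seconds,
            "mean_latency_ms": b["latency_sum"] / b["count"],
            "rate_4xx": b["4xx"] / b["count"],
            "rate_5xx": b["5xx"] / b["count"],
        }
    return summary
```

Calling `summarize(events, 60, group_by="partner_program")` would give the per-program view, and `group_by=None` the platform-wide aggregate.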
INGRAPHS FOR PLATFORM
PROS
● Efficient: filters latency/QPS/error rates/call types based on configurations
● Stable: used by all of Engineering and Ops
CONS
● Doesn’t support ad hoc queries
● Dependency on SRE team to add any configuration changes
API-ANALYZER
• Visualization frontend specifically for ExternalApiAccessEvent metrics
• Used by Platform and SRE Teams supporting API
• Ad hoc queries to help with troubleshooting
API-ANALYZER PROCESS FLOW
API-ANALYZER
PROS
● Supports fast ad hoc queries against a number of facets: appid, IP address, call types
● Free of dependencies on SRE team to maintain configurations for predefined applications
CONS
● Limited historical data available
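An ad hoc faceted query of the kind described above can be sketched as a filter followed by a count. The function and the event keys are hypothetical and say nothing about how API-Analyzer is actually implemented; the point is only that no predefined configuration is needed to ask a new question.

```python
from collections import Counter


def query(events, top_by, **filters):
    """Keep events matching every exact-match facet filter, then count by `top_by`.

    `events` is an iterable of dicts; facet names such as "app_id",
    "ip_address", and "call_type" are assumed for illustration.
    Returns (facet_value, count) pairs, most frequent first.
    """
    matching = [e for e in events if all(e.get(k) == v for k, v in filters.items())]
    return Counter(e.get(top_by) for e in matching).most_common()
```

For example, `query(events, "ip_address", app_id="a")` answers "which IP addresses is application a calling from, and how often?", which is the sort of question that comes up when triaging a partner issue.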
APACHE HADOOP
• The hub of all offline tracking data @ LinkedIn
• All ExternalApiAccessEvent data gets ETL’d into Hadoop in near real-time
• Platform team relies on Hadoop for product and business analytics
• In-depth analytics beyond just QPS, Latency, Call Types, etc
• Historical Data
How Do We Operationalize?
PARTNER ENGINEERING AT LINKEDIN
TEAM GOAL
Provide a world-class developer platform where our partners and developers can build fantastic 3rd party applications for LinkedIn members
ROLE OF A PARTNER ENGINEER
Guide and support partners and developers using our RESTful APIs and mobile SDKs
TREAT PLATFORM AS A PRODUCT
Incorporate feedback from our external developers to influence roadmap
SUPPORT MODEL
• Organized by Partner Programs
• Open Program: Stack Overflow + Developer Portal
• Partner Programs: Dedicated Partner Engineers provide white-glove support
• SLAs vary by Partner Programs (and in certain cases, by strategic partner)
THE TECHNOLOGY IN ACTION
InGraphs
● Dashboards created for a given Partner Program or a specific application
● Charts any metrics we care about (e.g. QPS)
● Set up alerts for support teams based on a given threshold
● Depending on SLA, team gets emailed and/or called (via on-call rotation)
API-Analyzer
● Used for ad hoc queries
● Fast when needing to troubleshoot and triage a production issue for a partner
Hadoop
● Long term look backs
● Provides all ExternalApiAccessEvent tracking data not available in visualization frontends (e.g. member IDs, paths, query params, etc)
● Ability to create complex, in-depth reports
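The threshold-plus-SLA alerting described for InGraphs can be sketched as follows. The rule shape, metric names, and channel names are assumptions for illustration; the real auto-alerting system is not described in this deck beyond "emailed and/or called".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical rule: fire when a metric rises above a threshold."""
    metric: str
    threshold: float
    page_on_call: bool  # True for stricter SLAs that warrant a phone call


def evaluate(rule, metrics):
    """Return the notification channels to use, or an empty list if healthy.

    `metrics` maps metric names to their latest values; a missing metric
    is treated as "no data" and does not fire the alert in this sketch.
    """
    value = metrics.get(rule.metric)
    if value is None or value <= rule.threshold:
        return []
    channels = ["email"]
    if rule.page_on_call:
        channels.append("page")
    return channels
```

A partner program with a strict SLA might use `AlertRule("rate_5xx", 0.01, page_on_call=True)`, while the open program could use the same threshold with email only.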
IN SUMMARY
• Your external apps expect 99.99% API “site up”
• Monitoring and Alerting essential for knowing health of your platform
• Use data to make business and product decisions
• It all goes back to tracking: necessary to solve operational and business needs
• Many different types of solutions: up to you to decide whether to build or buy
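To put the 99.99% figure in perspective, that target leaves roughly 52.6 minutes of downtime per year, or about 4.3 minutes per 30-day month. A small helper (the name is hypothetical) shows the arithmetic:

```python
def downtime_budget_minutes(availability_percent, period_days):
    """Minutes of allowed downtime in a period at a given availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


yearly = downtime_budget_minutes(99.99, 365)   # about 52.56 minutes
monthly = downtime_budget_minutes(99.99, 30)   # about 4.32 minutes
```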
THANKS!
Kamyar Mohager (@kamyarsayshello)
Engineering Lead, Platform