Taming the ETL beast: How LinkedIn uses metadata to run complex ETL flows reliably
DESCRIPTION
Data is the lifeblood of many LinkedIn products and must be delivered to the appropriate systems in a reliable and timely manner. This talk provides details of a metadata system that we built at LinkedIn to help manage the set of ETL flows that are responsible for data delivery at scale.
TRANSCRIPT
![Page 1: Taming the ETL beast: How LinkedIn uses metadata to run complex ETL flows reliably](https://reader035.vdocument.in/reader035/viewer/2022062401/554dc83ab4c905bd488b526d/html5/thumbnails/1.jpg)
Taming the ETL beast
How LinkedIn uses metadata to run complex ETL flows reliably
Rajappa Iyer
Strata Conference, London, November 12, 2013
![Page 2: Taming the ETL beast: How LinkedIn uses metadata to run complex ETL flows reliably](https://reader035.vdocument.in/reader035/viewer/2022062401/554dc83ab4c905bd488b526d/html5/thumbnails/2.jpg)
`whoami`
Data Infrastructure @ LinkedIn since 2011
Prior to that:
– Director of Engineering at Digg
– Enterprise Data Architect at eBay
www.linkedin.com/in/rajappaiyer/
![Page 3: Taming the ETL beast: How LinkedIn uses metadata to run complex ETL flows reliably](https://reader035.vdocument.in/reader035/viewer/2022062401/554dc83ab4c905bd488b526d/html5/thumbnails/3.jpg)
Outline of talk
Background and Context – The Why
Challenges with Data Delivery – The What
Metadata to the Rescue – The How
Q&A
![Page 4: Taming the ETL beast: How LinkedIn uses metadata to run complex ETL flows reliably](https://reader035.vdocument.in/reader035/viewer/2022062401/554dc83ab4c905bd488b526d/html5/thumbnails/4.jpg)
LinkedIn: The World’s Largest Professional Network
259M+ Members Worldwide
2 new Members Per Second
100M+ Monthly Unique Visitors
3M+ Company Pages
Connecting Talent with Opportunity. At scale…
![Page 5: Taming the ETL beast: How LinkedIn uses metadata to run complex ETL flows reliably](https://reader035.vdocument.in/reader035/viewer/2022062401/554dc83ab4c905bd488b526d/html5/thumbnails/5.jpg)
Data Driven Products and Insights
Products for Members (Professionals)
Products for Enterprises (Companies)
Insights (Analysts and Data Scientists)
Data, Platforms, Analytics
![Page 6: Taming the ETL beast: How LinkedIn uses metadata to run complex ETL flows reliably](https://reader035.vdocument.in/reader035/viewer/2022062401/554dc83ab4c905bd488b526d/html5/thumbnails/6.jpg)
Products for Members
![Page 7: Taming the ETL beast: How LinkedIn uses metadata to run complex ETL flows reliably](https://reader035.vdocument.in/reader035/viewer/2022062401/554dc83ab4c905bd488b526d/html5/thumbnails/7.jpg)
Products for Enterprises
Sell - Sales Navigator
Market - Marketing Solutions
Hire - Talent Solutions
![Page 8: Taming the ETL beast: How LinkedIn uses metadata to run complex ETL flows reliably](https://reader035.vdocument.in/reader035/viewer/2022062401/554dc83ab4c905bd488b526d/html5/thumbnails/8.jpg)
Examples of Insights
![Page 9: Taming the ETL beast: How LinkedIn uses metadata to run complex ETL flows reliably](https://reader035.vdocument.in/reader035/viewer/2022062401/554dc83ab4c905bd488b526d/html5/thumbnails/9.jpg)
Example of Deeper Insight
Job Migration After Financial Collapse
![Page 10: Taming the ETL beast: How LinkedIn uses metadata to run complex ETL flows reliably](https://reader035.vdocument.in/reader035/viewer/2022062401/554dc83ab4c905bd488b526d/html5/thumbnails/10.jpg)
LinkedIn Confidential ©2013 All Rights Reserved
Data is critical to LinkedIn’s products
It needs to be delivered in a reliable and timely manner
![Page 11: Taming the ETL beast: How LinkedIn uses metadata to run complex ETL flows reliably](https://reader035.vdocument.in/reader035/viewer/2022062401/554dc83ab4c905bd488b526d/html5/thumbnails/11.jpg)
A Simplified Overview of Data Flow
![Page 12: Taming the ETL beast: How LinkedIn uses metadata to run complex ETL flows reliably](https://reader035.vdocument.in/reader035/viewer/2022062401/554dc83ab4c905bd488b526d/html5/thumbnails/12.jpg)
Components of typical ETL jobs
Ingress / Egress of message-oriented data
– Logs and clickstream data
Ingress / Egress of record-oriented data
– Database data
Transformations
– Select, project, join
– Aggregations
– Partitioning
– Cleansing and data normalization
– Schema conversions, e.g., Nested JSON to Relational
![Page 13: Taming the ETL beast: How LinkedIn uses metadata to run complex ETL flows reliably](https://reader035.vdocument.in/reader035/viewer/2022062401/554dc83ab4c905bd488b526d/html5/thumbnails/13.jpg)
An Example ETL Flow
![Page 14: Taming the ETL beast: How LinkedIn uses metadata to run complex ETL flows reliably](https://reader035.vdocument.in/reader035/viewer/2022062401/554dc83ab4c905bd488b526d/html5/thumbnails/14.jpg)
Challenges
Complex process dependencies
– Some flows are over 30 levels deep
– Flows may span multiple platforms (Hadoop, RDBMS etc.)
Complex data dependencies
– Multiple flows may consume a data element
– Multiple data elements feed into a single flow
– Can be viewed as “data sync barriers”
Recovery
– Restartable flows that pick up from last checkpoint
– Catch up mode to compensate for downtime
Monitoring and Alerting
– Prioritization of “important” flows for ops attention
– Who do you call when things fail?
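
To make "levels deep" concrete, here is a minimal sketch of a process dependency graph, assuming a flow can be described as a mapping from each job to the jobs it depends on. The job names and the helpers `run_order` and `depth` are hypothetical illustrations, not part of LinkedIn's system; `graphlib` is in the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical flow: each job maps to the set of jobs it depends on.
deps = {
    "load_clicks": set(),
    "load_members": set(),
    "sessionize": {"load_clicks"},
    "join_members": {"sessionize", "load_members"},
    "daily_report": {"join_members"},
}

def run_order(deps):
    """Return one valid execution order, with dependencies before dependents."""
    return list(TopologicalSorter(deps).static_order())

def depth(deps):
    """Number of levels in the flow: the longest dependency chain, counted in jobs."""
    levels = {}
    for job in run_order(deps):
        # A job sits one level below its deepest dependency.
        levels[job] = 1 + max((levels[d] for d in deps[job]), default=0)
    return max(levels.values(), default=0)

order = run_order(deps)
flow_depth = depth(deps)
print(order)
print(flow_depth)
```

A job with two inputs, such as `join_members` here, acts as a small "data sync barrier": it cannot start until every upstream job has finished.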
![Page 15: Taming the ETL beast: How LinkedIn uses metadata to run complex ETL flows reliably](https://reader035.vdocument.in/reader035/viewer/2022062401/554dc83ab4c905bd488b526d/html5/thumbnails/15.jpg)
Metadata to the rescue
What metadata is collected?
– Process dependencies
– Data dependencies
– Execution history and data processing statistics
How is it used?
– Drives the ETL framework with lots of functionality
  – Check for data availability
  – Retries and restarts
  – Standardized error reporting / alerting
  – Prioritized view of business critical flows
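
One way recorded execution state could support retries and restarts is sketched below. It assumes a flow is an ordered list of named steps and that the checkpoint is a small JSON file listing the finished steps; `run_flow`, the step names, and the file layout are all hypothetical, and the talk does not describe how the actual framework stores its checkpoints.

```python
import json
import os
import tempfile

def run_flow(steps, checkpoint_path):
    """Run steps in order, skipping those a previous attempt already finished."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)
    for name, step in steps:
        if name in done:
            continue  # completed in an earlier attempt
        step()  # may raise; the checkpoint then lists only the finished steps
        done.append(name)
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)
    return done

# Hypothetical steps; transform fails on its first call only.
calls = []
attempts = {"transform": 0}

def extract():
    calls.append("extract")

def transform():
    attempts["transform"] += 1
    if attempts["transform"] == 1:
        raise RuntimeError("transient failure")
    calls.append("transform")

def load():
    calls.append("load")

steps = [("extract", extract), ("transform", transform), ("load", load)]
path = os.path.join(tempfile.mkdtemp(), "flow.ckpt")

try:
    run_flow(steps, path)
except RuntimeError:
    pass  # first attempt stops at transform
finished = run_flow(steps, path)  # restart resumes at transform, not at extract
```

The restart does not repeat `extract`, which is the property a restartable flow needs when the early steps are expensive.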
![Page 16: Taming the ETL beast: How LinkedIn uses metadata to run complex ETL flows reliably](https://reader035.vdocument.in/reader035/viewer/2022062401/554dc83ab4c905bd488b526d/html5/thumbnails/16.jpg)
Metadata: Process Dependencies
Capture process dependency graph
– Also capture metadata such as process owners, importance, SLA etc.
Capture stats for each execution of a workflow
– Time of execution
– Execution status
– Pointer to error logs
Alert on delayed processes
– Based on execution history
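
As an illustration of alerting from execution history, the sketch below flags a running process once it has exceeded its historical mean run time by a few standard deviations, then orders the alerts by importance. The `ProcessMeta` fields, the three-sigma rule, and all names are assumptions made for this example; the talk does not say which rule is actually used.

```python
from dataclasses import dataclass
from statistics import mean, stdev

@dataclass
class ProcessMeta:
    name: str
    owner: str           # who to contact when the process is late or fails
    importance: int      # 1 = most business critical (hypothetical scale)
    durations_min: list  # run times of past executions, in minutes

def is_delayed(meta, running_min, sigmas=3.0):
    """Flag a run that exceeds the historical mean by `sigmas` standard deviations."""
    if len(meta.durations_min) < 2:
        return False  # not enough history to judge
    threshold = mean(meta.durations_min) + sigmas * stdev(meta.durations_min)
    return running_min > threshold

def alerts(processes, running):
    """Return (importance, name, owner) for delayed processes, most important first."""
    late = [p for p in processes if is_delayed(p, running.get(p.name, 0))]
    return sorted((p.importance, p.name, p.owner) for p in late)

processes = [
    ProcessMeta("daily_report", "team-a", 1, [30, 32, 29, 31]),
    ProcessMeta("adhoc_export", "team-b", 3, [10, 12, 11, 9]),
]
running = {"daily_report": 55, "adhoc_export": 11}  # minutes elapsed so far
current_alerts = alerts(processes, running)
print(current_alerts)
```

Keeping the owner next to the timing data means the alert already answers "who do you call when things fail?".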
![Page 17: Taming the ETL beast: How LinkedIn uses metadata to run complex ETL flows reliably](https://reader035.vdocument.in/reader035/viewer/2022062401/554dc83ab4c905bd488b526d/html5/thumbnails/17.jpg)
Metadata: Data Dependencies
For each flow, capture input and output data elements
For each flow execution, capture stats on data element
– Number of records or messages processed
– Error counts
– Watermarks
  – Can be time based or sequence based
  – This can be per flow as more than one flow can consume a data element
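
The per-flow watermark idea can be sketched as follows, assuming sequence-based watermarks and records that arrive as `(sequence, payload)` pairs. `WatermarkStore`, `process_new`, and the flow and element names are hypothetical, and a `None` payload merely stands in for a record that fails validation.

```python
class WatermarkStore:
    """Tracks, per (flow, data element), the highest sequence already processed."""

    def __init__(self):
        self._marks = {}

    def get(self, flow, element):
        return self._marks.get((flow, element), 0)

    def advance(self, flow, element, position):
        # Watermarks only move forward; a stale update is ignored.
        key = (flow, element)
        self._marks[key] = max(self._marks.get(key, 0), position)

def process_new(store, flow, element, records):
    """Process records past this flow's watermark and return execution stats."""
    start = store.get(flow, element)
    pending = [(seq, rec) for seq, rec in records if seq > start]
    errors = sum(1 for _, rec in pending if rec is None)
    if pending:
        store.advance(flow, element, max(seq for seq, _ in pending))
    return {"processed": len(pending), "errors": errors}

store = WatermarkStore()
records = [(1, "a"), (2, None), (3, "c")]
first = process_new(store, "flow_a", "page_views", records)
records.append((4, "d"))
second = process_new(store, "flow_a", "page_views", records)  # only the new record
other = process_new(store, "flow_b", "page_views", records)   # separate watermark
```

Because the key includes the flow, `flow_b` starts from the beginning of `page_views` even though `flow_a` has already consumed it.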
![Page 18: Taming the ETL beast: How LinkedIn uses metadata to run complex ETL flows reliably](https://reader035.vdocument.in/reader035/viewer/2022062401/554dc83ab4c905bd488b526d/html5/thumbnails/18.jpg)
Metadata: Data Elements
Simple catalog of data elements
– Name, physical location, owner etc.
Data elements can have logical names
– Names resolve to one or more physical entities
– Logical names can represent useful collections
  – E.g., data as of a particular interval
Data element availability can trigger processes
– E.g., kick off hourly process when hourly data is complete and available
– Enables data driven ETL scheduling
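
A minimal sketch of logical name resolution and data driven triggering is shown below. It assumes a catalog that maps a logical name to its physical locations and a set of locations known to be complete; the catalog contents, the paths, and the functions `resolve`, `is_available`, and `ready_flows` are invented for illustration.

```python
# Hypothetical catalog: logical name -> physical locations it resolves to.
CATALOG = {
    "page_views@2013-11-12T09": [
        "/data/page_views/2013/11/12/09/part-0",
        "/data/page_views/2013/11/12/09/part-1",
    ],
}

def resolve(logical_name):
    """Resolve a logical data element name to one or more physical entities."""
    return CATALOG[logical_name]

def is_available(logical_name, complete):
    """A logical element is available once every physical entity is complete."""
    return all(path in complete for path in resolve(logical_name))

def ready_flows(triggers, complete):
    """Return the flows whose input data elements are all available."""
    return [
        flow
        for flow, inputs in triggers.items()
        if all(is_available(name, complete) for name in inputs)
    ]

triggers = {"hourly_rollup": ["page_views@2013-11-12T09"]}
complete = {"/data/page_views/2013/11/12/09/part-0"}

before = ready_flows(triggers, complete)  # one part of the hour is still missing
complete.add("/data/page_views/2013/11/12/09/part-1")
after = ready_flows(triggers, complete)   # the hour is complete, so the flow can start
```

The scheduler in this sketch never looks at the clock: the hourly flow becomes runnable when its data is complete, not at a fixed time.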
![Page 19: Taming the ETL beast: How LinkedIn uses metadata to run complex ETL flows reliably](https://reader035.vdocument.in/reader035/viewer/2022062401/554dc83ab4c905bd488b526d/html5/thumbnails/19.jpg)
Putting it all together
[Architecture diagram. Labels: ETL Framework; Metadata Management System; Scheduler; Checkpoint Execution State; Retry / Resume; Data Check; Statistics (process and data); Alerting / Monitoring; Dashboards, Reports; Data Availability Status; Execution History; Data Lineage; ETL applications; Name resolver; Log Parsers]
![Page 20: Taming the ETL beast: How LinkedIn uses metadata to run complex ETL flows reliably](https://reader035.vdocument.in/reader035/viewer/2022062401/554dc83ab4c905bd488b526d/html5/thumbnails/20.jpg)
Questions?
More at data.linkedin.com
Come Work on Challenging Data Infrastructure problems - We’re Hiring