Building a Data Pipeline from Scratch - Joe Crobak
DESCRIPTION
http://www.hakkalabs.co/articles/building-data-pipeline-scratch
TRANSCRIPT
![Page 1: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/1.jpg)
BUILDING A DATA PIPELINE
From Scratch
1
Joe Crobak @joecrobak
Tuesday, June 24, 2014
Axium Lyceum - New York, NY
![Page 2: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/2.jpg)
INTRODUCTION
2
Software Engineer @ Project Florida
Previously:
• Foursquare
• Adconion Media Group
• Joost
![Page 3: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/3.jpg)
OVERVIEW
3
Why do we care?
Defining Data Pipeline
Events
System Architecture
![Page 4: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/4.jpg)
4
DATA PIPELINES ARE EVERYWHERE
![Page 5: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/5.jpg)
RECOMMENDATIONS
5
http://blog.linkedin.com/2010/05/12/linkedin-pymk/
![Page 6: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/6.jpg)
RECOMMENDATIONS
6
Clicks
Views
Recommendations
http://blog.linkedin.com/2010/05/12/linkedin-pymk/
![Page 7: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/7.jpg)
AD NETWORKS
7
![Page 8: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/8.jpg)
AD NETWORKS
8
Clicks
Impressions
User Ad Profile
![Page 10: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/10.jpg)
SEARCH
10
Search Rankings
Page Rank
http://www.jevans.com/pubnetmap.html
![Page 12: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/12.jpg)
A / B TESTING
12
https://flic.kr/p/4ieVGa
A conversions
B conversions
Experiment Analysis
![Page 13: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/13.jpg)
DATA WAREHOUSING
13
http://gethue.com/hadoop-ui-hue-3-6-and-the-search-dashboards-are-out/
![Page 14: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/14.jpg)
DATA WAREHOUSING
14
http://gethue.com/hadoop-ui-hue-3-6-and-the-search-dashboards-are-out/
key metrics
user events
Data Warehouse
![Page 15: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/15.jpg)
15
WHAT IS A DATA PIPELINE?
![Page 16: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/16.jpg)
DATA PIPELINE
16
A Data Pipeline is a unified system for capturing events for analysis and building products.
![Page 17: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/17.jpg)
DATA PIPELINE
17
click data
user events
Data Warehouse
web visits
email sends
…
Product Features
Ad Hoc analysis
• Counting
• Machine Learning
• Extract Transform Load (ETL)
![Page 18: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/18.jpg)
DATA PIPELINE
18
A Data Pipeline is a unified system for capturing events for analysis and building products.
![Page 19: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/19.jpg)
19
EVENTS
![Page 20: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/20.jpg)
EVENTS
20
Each of these actions can be thought of as an event.
![Page 21: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/21.jpg)
COARSE-GRAINED EVENTS
21
• Events are captured as a by-product.
• Stored in text logs used primarily for debugging and secondarily for analysis.
![Page 22: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/22.jpg)
COARSE-GRAINED EVENTS
22
127.0.0.1 - - [17/Jun/2014:01:53:16 UTC] "GET / HTTP/1.1" 200 3969
Fields: IP Address, Timestamp, Action, Status
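To make the cost of coarse-grained events concrete, here is a minimal sketch of the parser an analyst would have to write for the log line above. The regular expression and the field names (`ip`, `path`, `status`, `size`) are assumptions for illustration; the trailing number is assumed to be the response size, as in the Common Log Format.

```python
import re

# Hypothetical pattern for the access-log line shown on the slide.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    event = match.groupdict()
    # Numeric fields arrive as text and have to be converted by hand.
    event["status"] = int(event["status"])
    event["size"] = int(event["size"])
    return event
```

Everything the analysis needs (which user, which view, which experiment group) still has to be inferred from `path`, which is the weakness the following slides address.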
![Page 23: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/23.jpg)
COARSE-GRAINED EVENTS
23
Implicit tracking: a “page load” event is a proxy for ≥1 other event.
e.g. event GET /newsfeed corresponds to:
• App Load (but only if this is the first time loaded this session)
• Timeline load, user is in “group A” of an A/B Test
These implementation details have to be known at analysis time.
![Page 24: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/24.jpg)
FINE-GRAINED EVENTS
24
Record events like:
• app opened
• auto refresh
• user pull down refresh
Rather than:
• GET /newsfeed
![Page 25: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/25.jpg)
FINE-GRAINED EVENTS
25
Annotate events with contextual information like:
• view the user was on
• which button was clicked
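A fine-grained, annotated event might be sketched as follows. The `Event` class, its field names, and the `record` helper are hypothetical and only illustrate the idea of naming the action explicitly and attaching context to it.

```python
import time
from dataclasses import asdict, dataclass, field


@dataclass
class Event:
    """One explicit user action plus the context it happened in (hypothetical shape)."""
    name: str                      # e.g. "app_opened", "pull_down_refresh"
    user_id: str
    timestamp: float = field(default_factory=time.time)
    context: dict = field(default_factory=dict)  # e.g. current view, clicked button


def record(event, sink):
    """Append the event to a sink; a real system would hand it to a message bus."""
    sink.append(asdict(event))


events = []
record(Event("app_opened", "user-1", timestamp=1403574796.0), events)
record(
    Event(
        "pull_down_refresh",
        "user-1",
        timestamp=1403574800.0,
        context={"view": "newsfeed"},
    ),
    events,
)
```

With events shaped like this, the analysis no longer needs to know that a particular URL happened to stand for a refresh.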
![Page 26: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/26.jpg)
FINE-GRAINED EVENTS
26
Decouple logging and analysis. Create events for everything!
![Page 27: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/27.jpg)
FINE-GRAINED EVENTS
27
A couple of schema-less formats are popular (e.g. JSON and CSV), but they have drawbacks.
• harder to change schemas
• inefficient
• require writing parsers
![Page 28: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/28.jpg)
SCHEMA
28
Used to describe data, providing a contract about fields and their types.
Two schemas are compatible if you can read data written in schema 1 with schema 2.
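The compatibility rule can be sketched as a small check. This is a simplified illustration that loosely follows Avro-style record schemas: a reader can use data from a writer if every reader field either exists in the writer schema with the same type or declares a default. Real schema resolution has more rules (type promotion, for example), and the schemas `v1`, `v2`, and `v3` below are made up.

```python
def can_read(writer_schema, reader_schema):
    """Return True if data written with writer_schema is readable with reader_schema."""
    writer_fields = {f["name"]: f["type"] for f in writer_schema["fields"]}
    for reader_field in reader_schema["fields"]:
        name = reader_field["name"]
        if name in writer_fields:
            # Field exists on both sides: this sketch requires identical types.
            if writer_fields[name] != reader_field["type"]:
                return False
        elif "default" not in reader_field:
            # Field is missing from the written data and there is no fallback.
            return False
    return True


v1 = {"fields": [{"name": "name", "type": "string"},
                 {"name": "timestamp", "type": "long"}]}
# v2 adds a field with a default, so old data can still be read.
v2 = {"fields": v1["fields"] + [{"name": "view", "type": "string", "default": ""}]}
# v3 adds a field without a default, so old data cannot be read.
v3 = {"fields": v1["fields"] + [{"name": "button", "type": "string"}]}
```

Here `can_read(v1, v2)` holds while `can_read(v1, v3)` does not, which is why new fields are usually added with defaults.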
![Page 29: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/29.jpg)
SCHEMA
29
Facilitates automated analytics: summary statistics, session/funnel analysis, A/B testing.
![Page 30: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/30.jpg)
SCHEMA
30
https://engineering.twitter.com/research/publication/the-unified-logging-infrastructure-for-data-analytics-at-twitter
Facilitates automated analytics: summary statistics, session/funnel analysis, A/B testing.
![Page 31: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/31.jpg)
SCHEMA
31
client:page:section:component:element:action
e.g.: iphone:home:mentions:tweet:button:click
Count iPhone users clicking from home page:
iphone:home:*:*:*:click
Count home clicks on buttons or avatars:
*:home:*:*:{button,avatar}:click
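The wildcard queries above can be sketched with a small matcher. This is an illustrative implementation, not Twitter's: `matches` and `count_matching` are hypothetical helpers, and the sample event names are invented to follow the six-part naming scheme.

```python
def matches(pattern, event_name):
    """Match a six-part event name against a pattern using * and {a,b} alternatives."""
    pattern_parts = pattern.split(":")
    name_parts = event_name.split(":")
    if len(pattern_parts) != len(name_parts):
        return False
    for want, got in zip(pattern_parts, name_parts):
        if want == "*":
            continue  # wildcard: any value is accepted at this level
        if want.startswith("{") and want.endswith("}"):
            if got not in want[1:-1].split(","):
                return False
        elif want != got:
            return False
    return True


def count_matching(pattern, event_names):
    """Count how many event names satisfy the pattern."""
    return sum(1 for name in event_names if matches(pattern, name))


sample_events = [
    "iphone:home:mentions:tweet:button:click",
    "iphone:home:timeline:tweet:avatar:click",
    "web:home:timeline:tweet:button:click",
    "iphone:profile:tweets:tweet:button:click",
    "web:home:timeline:tweet:link:click",
]
```

Because every event follows the same hierarchy, a single generic matcher covers both queries without any per-event parsing code.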
![Page 32: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/32.jpg)
32
KEY COMPONENTS
![Page 33: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/33.jpg)
EVENT FRAMEWORK
33
For easily generating events from your applications
![Page 34: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/34.jpg)
EVENT FRAMEWORK
34
For easily generating events from your applications
![Page 35: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/35.jpg)
BIG MESSAGE BUS
35
• Horizontally scalable
• Redundant
• APIs / easy to integrate
![Page 36: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/36.jpg)
BIG MESSAGE BUS
36
• Scribe (Facebook)
• Apache Chukwa
• Apache Flume
• Apache Kafka*
• Horizontally scalable
• Redundant
• APIs / easy to integrate
* My recommendation
![Page 37: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/37.jpg)
DATA PERSISTENCE
37
For storing your events in files for batch processing
![Page 38: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/38.jpg)
DATA PERSISTENCE
38
For storing your events in files for batch processing
Kite Software Development Kit http://kitesdk.org/
Spring Hadoop http://projects.spring.io/spring-hadoop/
![Page 39: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/39.jpg)
WORKFLOW MANAGEMENT
39
For coordinating the tasks in your data pipeline
![Page 40: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/40.jpg)
WORKFLOW MANAGEMENT
40
For coordinating the tasks in your data pipeline
… or your own system written in your own language of choice.
![Page 41: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/41.jpg)
SERIALIZATION FRAMEWORK
41
Used for converting an Event to bytes on disk. Provides an efficient, cross-language framework for serializing/deserializing data.
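As a rough illustration of what "converting an Event to bytes" means, the sketch below packs a fixed, hypothetical schema (64-bit timestamp, 32-bit user id, length-prefixed name) with Python's `struct` module. It is not the wire format of any real serialization framework; it only shows why a schema-driven binary encoding tends to be more compact than a self-describing text format.

```python
import json
import struct

# Hypothetical layout: big-endian int64 timestamp, uint32 user id, uint16 name length.
HEADER = ">qIH"


def serialize(timestamp, user_id, name):
    """Encode one event as bytes according to the fixed layout above."""
    name_bytes = name.encode("utf-8")
    return struct.pack(HEADER, timestamp, user_id, len(name_bytes)) + name_bytes


def deserialize(data):
    """Decode bytes produced by serialize back into (timestamp, user_id, name)."""
    timestamp, user_id, name_len = struct.unpack_from(HEADER, data, 0)
    offset = struct.calcsize(HEADER)
    name = data[offset:offset + name_len].decode("utf-8")
    return timestamp, user_id, name


binary = serialize(1403574796, 42, "app_opened")
as_json = json.dumps(
    {"timestamp": 1403574796, "user_id": 42, "name": "app_opened"}
).encode("utf-8")
```

The binary form omits field names because both sides agree on the schema, which is also why schema changes need the compatibility rules discussed earlier.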
![Page 42: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/42.jpg)
SERIALIZATION FRAMEWORK
42
• Apache Avro*
• Apache Thrift
• Protocol Buffers (Google)
Used for converting an Event to bytes on disk. Provides an efficient, cross-language framework for serializing/deserializing data.
![Page 43: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/43.jpg)
BATCH PROCESSING AND AD HOC ANALYSIS
43
• Apache Hadoop (MapReduce)
• Apache Hive (or other SQL-on-Hadoop)
• Apache Spark
![Page 44: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/44.jpg)
SYSTEM OVERVIEW
44
Application
logging framework
data serialization
Message Bus
Persistent Storage
Data Warehouse
Ad hoc Analysis
Product data flow
workflow engine
Production DB dumps
![Page 45: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/45.jpg)
SYSTEM OVERVIEW (OPINIONATED)
45
Application
logging framework
data serialization
Message Bus
Persistent Storage
Data Warehouse
Ad hoc Analysis
Product data flow
workflow engine
Production DB dumps
Apache Avro
Apache Kafka
Luigi
![Page 46: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/46.jpg)
NEXT STEPS
46
This architecture opens up a lot of possibilities
• Near-real-time computation: Apache Storm, Apache Samza (incubating), Apache Spark Streaming.
• Sharing information between services asynchronously, e.g. to augment user profile information.
• Cross-datacenter replication
• Columnar storage
![Page 47: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/47.jpg)
LAMBDA ARCHITECTURE
47
Term coined by Nathan Marz (creator of Apache Storm) for hybrid batch and real-time processing.
Batch processing is treated as the source of truth, and real-time processing updates models/insights between batches.
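The split between the two layers can be sketched with simple event counts. This is a toy illustration under stated assumptions: events are dicts with hypothetical `name` and `timestamp` fields, `cutoff` is the time of the last completed batch run, and a query merges both views.

```python
from collections import Counter


def batch_view(events, cutoff):
    """Counts over everything the last batch run has seen (the source of truth)."""
    return Counter(e["name"] for e in events if e["timestamp"] < cutoff)


def realtime_view(events, cutoff):
    """Counts over events that arrived after the last batch run."""
    return Counter(e["name"] for e in events if e["timestamp"] >= cutoff)


def query(name, batch, realtime):
    """Answer a query by combining the batch view with the real-time increment."""
    return batch[name] + realtime[name]


all_events = [
    {"name": "click", "timestamp": 1},
    {"name": "click", "timestamp": 5},
    {"name": "view", "timestamp": 7},
    {"name": "click", "timestamp": 12},
]
batch = batch_view(all_events, cutoff=10)
realtime = realtime_view(all_events, cutoff=10)
```

When the next batch run completes, its view replaces the old one and the real-time view is reset, so any errors in the streaming path are corrected by the batch recomputation.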
![Page 49: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/49.jpg)
SUMMARY
49
• Data Pipelines are everywhere.
• Useful to think of data as events.
• A unified data pipeline is very powerful.
• Plethora of open-source tools to build a data pipeline.
![Page 50: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/50.jpg)
FURTHER READING
50
• The Unified Logging Infrastructure for Data Analytics at Twitter
• The Log: What every software engineer should know about real-time data's unifying abstraction (Jay Kreps, LinkedIn)
• Big Data by Nathan Marz and James Warren
• Implementing Microservice Architectures
![Page 52: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/52.jpg)
52
EXTRA SLIDES
![Page 53: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/53.jpg)
WHY KAFKA?
53
• https://kafka.apache.org/documentation.html#design
• Pull model works well
• Easy to configure and deploy
• Good JVM support
• Well-integrated with the LinkedIn stack
![Page 54: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/54.jpg)
WHY LUIGI?
54
• Scripting language (you’ll end up writing scripts anyway)
• Simplicity (low learning curve)
• Idempotency
• Easy to deploy
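The idempotency point can be sketched in plain Python. Luigi decides whether a task still needs to run by checking whether its output already exists; the hypothetical `run_once` helper below mimics that idea without using the Luigi API.

```python
import os


def run_once(output_path, compute):
    """Run compute() only if output_path does not exist yet.

    Returns True if the work was done, False if it was skipped.
    """
    if os.path.exists(output_path):
        return False  # output already present: rerunning is a no-op
    tmp_path = output_path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(compute())
    # Rename last so a crash mid-write never leaves a half-written output
    # that would be mistaken for a completed task.
    os.replace(tmp_path, output_path)
    return True
```

Because a rerun skips finished work, a failed pipeline can simply be restarted from the top instead of being resumed by hand.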
![Page 55: Building a Data Pipeline from Scratch - Joe Crobak](https://reader030.vdocument.in/reader030/viewer/2022020206/53fd9e368d7f72a81c8b49f3/html5/thumbnails/55.jpg)
WHY AVRO?
55
• Self-describing files
• Integrated with nearly everything in the ecosystem
• CLI tools for dumping to JSON, CSV