LinkedIn Confidential ©2013 All Rights Reserved
The Evolution of Data Infrastructure at LinkedIn
Lei Gao, http://www.linkedin.com/in/gaolei
Outline
1. Company and Mission
2. Products and Science
3. Data Infrastructure
4. Conclusion
The World’s Largest Professional Network
• 225M+ Members Worldwide
• 2+ New Members Per Second
• 132M+ Monthly Unique Visitors
• 2.9M+ Company Pages
Connecting the world’s professionals to make them more productive and successful
Member Profiles
Large dataset
Medium writes
Very high reads
Freshness <1s
People You May Know
Large dataset
Compute intensive
High reads
Freshness ~hrs
LinkedIn Today
Moving dataset
High writes
High reads
Freshness ~mins
LinkedIn Data Infrastructure: Three-Phase Abstraction
[Diagram: Users, Application, Online Data Infra, Near-Line Infra, Offline Data Infra]

Infrastructure | Latency & Freshness Requirements | Products
Online | Activity that should be reflected immediately | Member Profiles, Company Profiles, Connections, Messages, Endorsements, Skills
Near-Line | Activity that should be reflected soon | Activity Streams, Profile Standardization, News, Recommendations, Search, Messages
Offline | Activity that can be reflected later | People You May Know, Connection Strength, News, Recommendations, Next best idea…
The Big-Data Feedback Loop
[Diagram labels: Value, Insights, Scale, Product, Science, Data, Member, Engagement, Virality, Signals, Refinement, Infrastructure, Analytics]
LinkedIn Data Infrastructure: Sample Stack
Infra challenges in 3-phase ecosystem are diverse, complex and specific
Some off-the-shelf. Significant investment in home-grown, deep and interesting platforms
Databus
The Original RDBMS Model
Streaming Transactions for Search/Connections
Databus: Timeline-Consistent Change Data Capture
LinkedIn Data Infrastructure Solutions
Streaming Transactions for Search/Connections
[Diagram labels: RO, RO, RO]
Databus at LinkedIn
[Diagram labels: DB, Bootstrap, Capture Changes, On-line Changes, Compressed Delta Since T, Consistent Snapshot at U]
• Transport independent of data source: Oracle, MySQL, …
• Transactional semantics: in-order, at-least-once delivery
• Tens of relays, hundreds of sources
• Low latency: milliseconds
[Diagram labels: Relay, Event Win, Client, Databus Client Lib, Consumer 1, Consumer n]
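With in-order, at-least-once delivery, a consumer may see the same change more than once, so it usually has to apply changes idempotently. The sketch below is a minimal illustration of that idea, not the Databus client API: the `ChangeEvent` type, the `scn` sequence-number field, and the assumption of exactly one event per sequence number are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ChangeEvent:
    scn: int    # hypothetical, strictly increasing sequence number (one event per scn)
    key: str
    value: str


@dataclass
class IdempotentConsumer:
    checkpoint: int = 0                        # highest sequence number applied so far
    state: dict = field(default_factory=dict)  # local copy derived from the change stream

    def on_event(self, event: ChangeEvent) -> bool:
        """Apply the event unless it was already applied; return True if applied."""
        if event.scn <= self.checkpoint:
            return False                       # redelivered event: skip it
        self.state[event.key] = event.value
        self.checkpoint = event.scn
        return True


def replay(consumer: IdempotentConsumer, events) -> int:
    """Feed events in order and count how many were actually applied."""
    return sum(consumer.on_event(e) for e in events)


# The event with scn=2 is delivered twice, as at-least-once delivery allows.
stream = [
    ChangeEvent(1, "member:1", "v1"),
    ChangeEvent(2, "member:2", "v1"),
    ChangeEvent(2, "member:2", "v1"),
    ChangeEvent(3, "member:1", "v2"),
]
consumer = IdempotentConsumer()
applied = replay(consumer, stream)
```

Because the stream is ordered, a single checkpoint is enough to detect duplicates; the same checkpoint is what a restarted consumer would hand back to resume from.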
Scaling Core Databases
[Diagram labels: RO, RO, RO]
Voldemort: Highly-Available Distributed KV Store
LinkedIn Data Infrastructure Solutions
Scaling Core Databases
Voldemort: Architecture
• Pluggable components
• Tunable consistency / availability
• Highly scalable key/value store
• 14 clusters, 400 nodes
• 400K peak QPS
• 100 TB data
• 2-3 ms avg latency
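Tunable consistency / availability is commonly expressed through quorum sizes: with N replicas, W write acknowledgements and R read responses, a read is guaranteed to overlap the latest write when R + W > N. The toy model below illustrates that rule only; the `Replica` class and the `write`/`read` helpers are hypothetical and are not Voldemort's client API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    version: int = 0
    value: Optional[str] = None


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True when every read quorum must intersect every write quorum."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


def write(replicas: List[Replica], w: int, version: int, value: str) -> None:
    """Worst case for readers: only the first w replicas receive the write."""
    for replica in replicas[:w]:
        replica.version = version
        replica.value = value


def read(replicas: List[Replica], r: int) -> Optional[str]:
    """Worst case for readers: contact the last r replicas, keep the newest version."""
    contacted = replicas[-r:]
    return max(contacted, key=lambda rep: rep.version).value


replicas = [Replica() for _ in range(3)]           # N = 3
write(replicas, w=2, version=1, value="profile-v1")
strong = read(replicas, r=2)   # R + W > N: the read overlaps the write
weak = read(replicas, r=1)     # R + W = N: the read can miss the write
```

Lowering R or W trades consistency for availability and latency, since fewer replicas have to respond before an operation completes.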
Scaling Core Databases
Secondary Index
Espresso: Indexed Timeline-Consistent Distributed Data Store
LinkedIn Data Infrastructure Solutions
Storage with Richer Data Model
Espresso
Application View
Hierarchical data model
Rich functionality on resources
• Conditional updates
• Partial updates
• Atomic counters
Rich functionality within resource groups
• Transactions
• Secondary index
• Text search
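The per-resource operations listed above can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not Espresso's API: the `ToyDocumentStore` class, its method names, the version-number scheme, and the path-style key are all hypothetical.

```python
import threading


class ToyDocumentStore:
    """In-memory stand-in for a document store with versioned resources."""

    def __init__(self):
        self._docs = {}                  # key -> (version, document dict)
        self._lock = threading.Lock()    # makes each operation atomic

    def get(self, key):
        with self._lock:
            version, doc = self._docs.get(key, (0, {}))
            return version, dict(doc)

    def conditional_put(self, key, doc, expected_version):
        """Conditional update: write only if nobody changed the resource."""
        with self._lock:
            version, _ = self._docs.get(key, (0, {}))
            if version != expected_version:
                return False
            self._docs[key] = (version + 1, dict(doc))
            return True

    def partial_update(self, key, fields):
        """Partial update: merge the given fields into the stored document."""
        with self._lock:
            version, doc = self._docs.get(key, (0, {}))
            self._docs[key] = (version + 1, {**doc, **fields})
            return version + 1

    def increment(self, key, field, delta=1):
        """Atomic counter: read-modify-write under the lock."""
        with self._lock:
            version, doc = self._docs.get(key, (0, {}))
            new_value = doc.get(field, 0) + delta
            self._docs[key] = (version + 1, {**doc, field: new_value})
            return new_value


store = ToyDocumentStore()
key = "/Mailbox/100/1"                   # hypothetical hierarchical resource key
first = store.conditional_put(key, {"subject": "hi"}, expected_version=0)
stale = store.conditional_put(key, {"subject": "late"}, expected_version=0)
store.partial_update(key, {"read": True})
store.increment(key, "views")
views = store.increment(key, "views")
final_version, final_doc = store.get(key)
```

The second `conditional_put` is rejected because the resource has already moved past version 0, which is the behaviour a conditional update exists to provide.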
Espresso: System Components
• Partitioning/replication
• Timeline consistency
• Change propagation
Generic Cluster Manager: Helix
• Generic Distributed State Model
• Config Management
• Automatic Load Balancing
• Fault tolerance
• Cluster expansion and rebalancing
• Espresso, Databus and Search
• Open Source Apr 2012
• https://github.com/linkedin/helix
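As a rough illustration of what load balancing and rebalancing on cluster expansion mean, the sketch below spreads partitions over nodes round-robin and recomputes the assignment after a node joins. It is a deliberately naive stand-in, not Helix's algorithm or API; `assign_partitions` and the node names are hypothetical.

```python
def assign_partitions(num_partitions, nodes):
    """Spread partition ids over the given nodes round-robin."""
    if not nodes:
        raise ValueError("at least one node is required")
    ordered = sorted(nodes)                   # deterministic node order
    assignment = {node: [] for node in ordered}
    for partition in range(num_partitions):
        owner = ordered[partition % len(ordered)]
        assignment[owner].append(partition)
    return assignment


# Three nodes share 12 partitions, then a fourth node joins the cluster.
before = assign_partitions(12, ["node-a", "node-b", "node-c"])
after = assign_partitions(12, ["node-a", "node-b", "node-c", "node-d"])
```

A real cluster manager would also try to limit how many partitions move during a rebalance and would track per-replica states; this sketch only shows the even spread.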
Streaming Non-transactional Events
Hadoop/DW
Espresso
Kafka: High-Volume Low-Latency Messaging System
LinkedIn Data Infrastructure Solutions
Ingress – Offline Data Analytics
[Diagram: Secured Hadoop/DW]
Kafka Architecture
[Diagram: Producers, Consumers, Zookeeper, and Broker 1 through Broker 4 hosting partitions topic1-part1, topic1-part2, topic2-part1, topic2-part2]
Key features
• Scale-out architecture
• High throughput
• Automatic load balancing
• Intra-cluster replication
Per day stats
• writes: 10+ billion messages
• reads: 50+ billion messages
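The scale-out design rests on splitting each topic into partitions that are spread across brokers. The sketch below shows one simple way this can work: a keyed message is mapped to a partition by hashing, and partitions are placed on brokers round-robin. It does not use the Kafka client library; `partition_for`, `place_partitions`, and the CRC32 hash choice are assumptions made for illustration, and replication is left out.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition; the same key always lands in the same one."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def place_partitions(topics: dict, brokers: list) -> dict:
    """Place every topic partition on a broker, round-robin (no replicas)."""
    placement = {broker: [] for broker in brokers}
    slot = 0
    for topic in sorted(topics):
        for part in range(1, topics[topic] + 1):
            broker = brokers[slot % len(brokers)]
            placement[broker].append(f"{topic}-part{part}")
            slot += 1
    return placement


brokers = ["Broker 1", "Broker 2", "Broker 3", "Broker 4"]
placement = place_partitions({"topic1": 2, "topic2": 2}, brokers)
chosen = partition_for("member:42", 2)
```

Adding brokers or partitions raises aggregate throughput because producers and consumers of different partitions do not contend with each other.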
Egress – Analytics Results for Online Serving
[Diagram: Secured Hadoop/DW]
WebHDFS + Faust
LinkedIn Data Infrastructure Solutions
Egress – Getting Data Out from Offline
[Diagram: Secured Hadoop/DW, WebHDFS, Kafka, Faust]
Batch Environment Data Flow
Workflow management: Azkaban
Read-only Data Generation and Transfer
• Map-reduce jobs generate read-only (RO) files
• The entire index fits in memory for fast reads
• File system cache for data
• Data transferred in parallel via WebHDFS
• Authentication always required for each file transfer out of Hadoop
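The index-in-memory, data-via-cache split can be sketched as a sorted key index that points into one immutable data blob. This is an illustrative layout only, not the actual RO file format: `build_store`, `lookup`, and the length-prefixed encoding are assumptions, and a `bytes` object stands in for a data file that would be served through the file system cache.

```python
import bisect
import struct


def build_store(records: dict):
    """Build a sorted in-memory index plus one immutable data blob."""
    keys, offsets, data = [], [], bytearray()
    for key in sorted(records):
        value = records[key].encode("utf-8")
        keys.append(key)
        offsets.append(len(data))                      # where this value starts
        data += struct.pack(">I", len(value)) + value  # 4-byte length prefix + value
    return keys, offsets, bytes(data)


def lookup(key, keys, offsets, data):
    """Binary-search the index, then read the value at the recorded offset."""
    i = bisect.bisect_left(keys, key)
    if i == len(keys) or keys[i] != key:
        return None
    start = offsets[i]
    (length,) = struct.unpack_from(">I", data, start)
    return data[start + 4:start + 4 + length].decode("utf-8")


keys, offsets, data = build_store({"member:2": "pymk-b", "member:1": "pymk-a"})
hit = lookup("member:1", keys, offsets, data)
miss = lookup("member:9", keys, offsets, data)
```

Because the files never change after the batch job writes them, lookups need no locking and a new data set can be swapped in as a whole.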
Modifiable Data Generation and Transfer
• Map-reduce jobs generate records
• In Avro format
• Annotated key and value fields
• Records published from Hadoop to Kafka
• Faust consumes records from Kafka
• Faust streams records into Voldemort, Espresso, and other serving platforms
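The flow described in these bullets can be sketched end to end with in-memory stand-ins. Nothing here is Faust's or Kafka's real interface: a `deque` plays the Kafka topic, a plain dict with a `SCHEMA` annotation plays an Avro record, `DictStorePlugin` plays a serving-store plug-in, and `max_records` is a crude, hypothetical form of throttling.

```python
from collections import deque

# Hypothetical schema annotation naming which fields are the key and the value.
SCHEMA = {"key_field": "member_id", "value_field": "recommendations"}


def publish(records, topic: deque) -> None:
    """Batch side: append generated records to the topic."""
    for record in records:
        topic.append(record)


class DictStorePlugin:
    """Stand-in for a serving-store plug-in; keeps data in a dict."""

    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value


def drain(topic: deque, schema: dict, plugin, max_records: int) -> int:
    """Consume up to max_records records and write each one into the store."""
    written = 0
    while topic and written < max_records:
        record = topic.popleft()
        plugin.put(record[schema["key_field"]], record[schema["value_field"]])
        written += 1
    return written


topic = deque()
publish(
    [
        {"member_id": 1, "recommendations": [2, 3]},
        {"member_id": 2, "recommendations": [1]},
    ],
    topic,
)
plugin = DictStorePlugin()
first_batch = drain(topic, SCHEMA, plugin, max_records=1)
second_batch = drain(topic, SCHEMA, plugin, max_records=10)
```

Annotating the key and value fields in the schema lets one generic consumer load any record type, with only the plug-in differing per serving platform.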
[Diagram: Faust (Monitoring, Throttling, Scheduling); Plug-ins: Kafka Plug-in, Databus Plug-in, V. Plug-in, E. Plug-in; data systems: Hadoop, Teradata/DWH, Kafka, Voldemort, Espresso, Other Data Sources]
Summary
Read more @ data.linkedin.com
1. E2E: The Big-Data feedback loop is essential for product design
2. Infrastructure
1. Data Infra needs continuous innovation and iteration to scale out
2. Fast moving, Big, Clean Data + Agile Metadata = Goodness
3. Data-driven products need agile feedback infrastructure and measurement methodology.
3. Methodology
1. Data-Driven experimentation enables insights and agile products
2. Recommendation-driven products have big impact.
Help us. Come Have Fun with Us!
Info: data.linkedin.com
1. Science and Data Mining: Recommendation and Optimization Problems
2. Next-generation ad-hoc and OLAP query processing on Hadoop
3. Graph Computations: Off-line mining and On-line integration loops
4. nRT Data Streams in Near-line infrastructure
5. And much more…