LinkedIn infrastructure (analytics@webscale, at FB 2013)
DESCRIPTION
This is the presentation given at analytics@webscale in 2013 (http://analyticswebscale.splashthat.com/?em=187&utm_campaign=website&utm_source=sg&utm_medium=em)
TRANSCRIPT
LinkedIn Confidential ©2013 All Rights Reserved
Data Infrastructure at LinkedIn
Jun Rao and Sam Shah
Outline
1. LinkedIn introduction
2. Online/nearline infrastructure
3. Offline infrastructure
4. Conclusion
The World’s Largest Professional Network
• 200M+ Members Worldwide
• 2 New Members Per Second
• 100M+ Monthly Unique Visitors
• 2M+ Company Pages
Connecting Talent with Opportunity. At scale…
Two Product Families
For Members: People You May Know, Who’s Viewed My Profile, Jobs You May Be Interested In, News/Sharing, Today, Search, Subscriptions
For Partners: Hire, Market, Sell
Data: Professionals, Companies, Connections, Profiles, Actions, Content
Data Infrastructure
Science and Analytics
The Big-Data Feedback Loop
(Diagram labels: Member, Engagement, Virality, Signals, Refinement, Product, Science, Data, Infrastructure, Analytics, Value, Insights, Scale)
LinkedIn Data Infrastructure: Three-Phase Abstraction
(Diagram labels: Users, Application, Online Data Infra, Near-Line Infra, Offline Data Infra)
Infrastructure / Latency & Freshness Requirements / Products
Online: activity that should be reflected immediately
• Member Profiles
• Company Profiles
• Connections
• Messages
• Endorsements
• Skills
Near-Line: activity that should be reflected soon
• Activity Streams
• Profile Standardization
• News
• Recommendations
• Search
• Messages
Offline: activity that can be reflected later
• People You May Know
• Connection Strength
• News
• Recommendations
• Next best idea…
LinkedIn Data Infrastructure: Sample Stack
Infra challenges in the 3-phase ecosystem are diverse, complex and specific.
Some off-the-shelf. Significant investment in home-grown, deep and interesting platforms.
(Stack diagram; surviving label: Databus)
Voldemort: Highly-Available Distributed KV Store
LinkedIn Data Infrastructure Solutions
• Key/value access at scale
• Pluggable components
• Tunable consistency / availability
• Key/value model, server side “views”
• 10 clusters, 100+ nodes
• Largest cluster – 10K+ qps
• Avg latency: 3ms
• Hundreds of Stores
• Largest store – 2.8TB+
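One common way to make consistency and availability tunable in a replicated key/value store is a quorum scheme with N replicas, R required reads and W required writes. The sketch below is a toy illustration of that idea only, not Voldemort's code or API: the class names, the integer version counter (a stand-in for real conflict versioning) and the `up` flag are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One storage node: key -> (version, value). `up` simulates availability."""
    data: dict = field(default_factory=dict)
    up: bool = True


class QuorumStore:
    """Toy store with N replicas, R required reads and W required writes."""

    def __init__(self, n, r, w):
        if not (1 <= r <= n and 1 <= w <= n):
            raise ValueError("need 1 <= R <= N and 1 <= W <= N")
        self.replicas = [Replica() for _ in range(n)]
        self.r, self.w = r, w

    def _live(self):
        return [rep for rep in self.replicas if rep.up]

    def put(self, key, value):
        live = self._live()
        if len(live) < self.w:
            raise RuntimeError("write quorum not reachable")
        # Simplification: a single integer version instead of real versioning.
        version = 1 + max(rep.data.get(key, (0, None))[0] for rep in live)
        for rep in live:
            rep.data[key] = (version, value)
        return version

    def get(self, key):
        live = self._live()
        if len(live) < self.r:
            raise RuntimeError("read quorum not reachable")
        # Ask only R replicas and keep the answer with the highest version.
        answers = [rep.data[key] for rep in live[: self.r] if key in rep.data]
        return max(answers, key=lambda a: a[0])[1] if answers else None


def write_while_first_replica_down(store):
    """Write while replica 0 is down, bring it back, then read."""
    store.replicas[0].up = False
    store.put("member:42", "profile-v1")
    store.replicas[0].up = True
    return store.get("member:42")


strict = write_while_first_replica_down(QuorumStore(3, 2, 2))  # R + W > N
loose = write_while_first_replica_down(QuorumStore(3, 1, 1))   # R + W <= N
```

With R + W > N every read set overlaps every write set, so `strict` sees the write even though one replica missed it. With R = W = 1 the store stays available with fewer nodes but `loose` can come back stale, which is the trade-off being tuned.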
Voldemort: Architecture
Espresso: Indexed Timeline-Consistent Distributed Data Store
• Fill in the gap between Oracle and KV store
Espresso: System Components
• Hierarchical data model
• Timeline consistency
• Rich functionality
• Transactions
• Secondary index
• Text search
• Partitioning/replication
• Change propagation
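To make "hierarchical data model" and "partitioning" concrete, here is a minimal sketch of documents addressed by a (table, resource id, sub-key) path, where the partition is chosen from the resource id alone so that everything under one resource lands together. This is an illustration under assumptions, not Espresso's implementation: the table and key names, the CRC-based hashing and the partition count are all hypothetical.

```python
import zlib

NUM_PARTITIONS = 8  # hypothetical partition count


def partition_for(resource_id):
    """Pick a partition from the leading key only (assumed scheme)."""
    return zlib.crc32(resource_id.encode("utf-8")) % NUM_PARTITIONS


class HierarchicalStore:
    """Toy document store keyed by (table, resource_id, subkey)."""

    def __init__(self):
        self.partitions = [dict() for _ in range(NUM_PARTITIONS)]

    def put(self, table, resource_id, subkey, doc):
        part = self.partitions[partition_for(resource_id)]
        part[(table, resource_id, subkey)] = doc

    def get(self, table, resource_id, subkey):
        part = self.partitions[partition_for(resource_id)]
        return part.get((table, resource_id, subkey))

    def list_under(self, table, resource_id):
        """All sub-keys below one resource; touches a single partition."""
        part = self.partitions[partition_for(resource_id)]
        return sorted(k[2] for k in part if k[:2] == (table, resource_id))


store = HierarchicalStore()
store.put("Messages", "bob", 1, {"subject": "hello"})
store.put("Messages", "bob", 2, {"subject": "re: hello"})
store.put("Messages", "alice", 1, {"subject": "welcome"})
bob_messages = store.list_under("Messages", "bob")
```

Because all of one resource's documents share a partition in this sketch, operations scoped to that resource (such as listing or updating several of its documents together) never need to cross partitions.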
Generic Cluster Manager: Helix
• Generic Distributed State Model
• Config Management
• Automatic Load Balancing
• Fault tolerance
• Cluster expansion and rebalancing
• Espresso, Databus and Search
• Open Source Apr 2012
• https://github.com/linkedin/helix
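The idea of a generic distributed state model can be sketched as a table of legal state transitions plus a function that computes a target assignment of partition replicas to nodes. The sketch below uses an OFFLINE / SLAVE / MASTER model for illustration; it does not use the Helix API, and the round-robin placement and function names are assumptions made for this example.

```python
from collections import deque

# Legal transitions of a master/slave style state model.
TRANSITIONS = {
    "OFFLINE": ["SLAVE"],
    "SLAVE": ["MASTER", "OFFLINE"],
    "MASTER": ["SLAVE"],
}


def transition_path(current, target):
    """Shortest legal sequence of states from current to target (BFS)."""
    queue = deque([[current]])
    seen = {current}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in TRANSITIONS[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    raise ValueError(f"no path from {current} to {target}")


def ideal_state(partitions, nodes, replicas):
    """Round-robin placement: first replica is MASTER, the rest are SLAVE."""
    count = min(replicas, len(nodes))
    assignment = {}
    for i, part in enumerate(partitions):
        chosen = [nodes[(i + j) % len(nodes)] for j in range(count)]
        assignment[part] = {
            node: ("MASTER" if j == 0 else "SLAVE")
            for j, node in enumerate(chosen)
        }
    return assignment


before = ideal_state(["p0", "p1"], ["n1", "n2", "n3"], replicas=2)
# Simulate losing n2: recompute the target assignment for the survivors.
after = ideal_state(["p0", "p1"], ["n1", "n3"], replicas=2)
promote = transition_path("OFFLINE", "MASTER")
```

A cluster manager built this way would compare the current state of each replica with the recomputed target and issue the transitions from `transition_path` one step at a time, which is how fault tolerance and rebalancing fall out of the same mechanism.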
Databus: Timeline-Consistent Change Data Capture
• Deliver data store changes to apps
Databus at LinkedIn
• Transport independent of data source: Oracle, MySQL, …
• Transactional semantics
• In order, at least once delivery
• Tens of relays
• Hundreds of sources
• Low latency - milliseconds
(Diagram labels: DB, Capture Changes, Relay, Event Win, On-line Changes, Bootstrap, Compressed Delta Since T, Consistent Snapshot at U, Databus Client Lib, Client, Consumer 1 … Consumer n)
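In-order, at-least-once delivery means a consumer may see the same change twice, so it needs a checkpoint to skip duplicates, and a consumer that falls behind the relay's buffered window needs a snapshot to catch up from. The sketch below illustrates both ideas with a toy relay and consumer. It is not the Databus client library: the class and method names are hypothetical, and sequence numbers are assumed to be contiguous integers.

```python
from collections import deque


class Relay:
    """Keeps only the most recent `window` change events in memory."""

    def __init__(self, window):
        self.events = deque(maxlen=window)  # (seq, key, value), in commit order

    def append(self, seq, key, value):
        self.events.append((seq, key, value))

    def oldest_seq(self):
        return self.events[0][0] if self.events else None

    def events_since(self, seq):
        return [e for e in self.events if e[0] > seq]


class Consumer:
    """Applies changes in order and ignores anything at or below its checkpoint."""

    def __init__(self):
        self.state = {}
        self.checkpoint = 0

    def on_event(self, seq, key, value):
        if seq <= self.checkpoint:
            return False  # duplicate from at-least-once redelivery
        self.state[key] = value
        self.checkpoint = seq
        return True

    def catch_up(self, relay, snapshot, snapshot_seq):
        oldest = relay.oldest_seq()
        if oldest is not None and self.checkpoint + 1 < oldest:
            # Too far behind the relay window: start from a consistent snapshot.
            # Assumes snapshot_seq is recent enough to meet the relay window.
            self.state = dict(snapshot)
            self.checkpoint = snapshot_seq
        for event in relay.events_since(self.checkpoint):
            self.on_event(*event)


relay = Relay(window=3)
changes = [(1, "a", "a1"), (2, "b", "b1"), (3, "a", "a2"), (4, "c", "c1"), (5, "b", "b2")]
for change in changes:
    relay.append(*change)

consumer = Consumer()
consumer.catch_up(relay, snapshot={"a": "a1", "b": "b1"}, snapshot_seq=2)
redelivered = consumer.on_event(4, "c", "c1")  # same event delivered again
```

The new consumer starts at checkpoint 0, finds that the relay only holds events 3 to 5, loads the snapshot taken at sequence 2, and then replays the remaining events in order. The redelivered event is dropped because its sequence number is not above the checkpoint.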
Kafka: High-Volume Low-Latency Messaging System
• Log aggregation and queuing
Kafka Architecture
(Diagram: Producers, Consumers, Zookeeper, and Broker 1 to Broker 4 hosting partitions topic1-part1, topic1-part2, topic2-part1 and topic2-part2, each appearing three times across the brokers)
Key features
• Scale-out architecture
• Automatic load balancing
• High throughput/low latency
• Rewindability
• Intra-cluster replication
Per day stats
• writes: 10+ billion messages
• reads: 50+ billion messages
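Rewindability follows naturally if each topic partition is an append-only log and each consumer tracks its own offset into it. The sketch below models that with in-memory lists. It is an illustration of the concept only and does not use a Kafka client: the class names, method names and the CRC-based key-to-partition mapping are assumptions made for the example.

```python
import zlib


class Topic:
    """A topic split into append-only partition logs."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, message):
        """Append to the partition chosen by key; return (partition, offset)."""
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append(message)
        return p, len(self.partitions[p]) - 1


class LogConsumer:
    """Reads sequentially and owns its offsets, so it can rewind at will."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)

    def poll(self, partition, max_messages=10):
        log = self.topic.partitions[partition]
        start = self.offsets[partition]
        batch = log[start:start + max_messages]
        self.offsets[partition] = start + len(batch)
        return batch

    def seek(self, partition, offset):
        self.offsets[partition] = offset


topic = Topic("page-views", num_partitions=2)
for event in ["view-1", "view-2", "view-3"]:
    partition, offset = topic.send("member-1", event)  # same key, same partition

consumer = LogConsumer(topic)
first_read = consumer.poll(partition)
caught_up = consumer.poll(partition)
consumer.seek(partition, 1)  # rewind to re-read from offset 1
replayed = consumer.poll(partition)
```

Because the broker side only appends and the read position lives with the consumer, re-processing old data is just a matter of resetting an offset, and many consumers can read the same log at different positions.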
LinkedIn Data Infrastructure: A few take-aways
1. Building infrastructure in a hyper-growth environment is challenging.
2. Few vs. many: balance over-specialized (agile) efforts against generic (leverageable) platforms (*)
3. Balance open-source products with home-grown platforms (**)