
LinkedIn Confidential ©2013 All Rights Reserved

The Evolution of Data Infrastructure at LinkedIn

Lei Gao
http://www.linkedin.com/in/gaolei


Outline

1. Company and Mission

2. Products and Science

3. Data Infrastructure

4. Conclusion

The World’s Largest Professional Network

225M+ Members Worldwide

2+ New Members Per Second

132M+ Monthly Unique Visitors

2.9M+ Company Pages

Connecting the world’s professionals to make them more productive and successful


Member Profiles

Large dataset

Medium writes

Very high reads

Freshness <1s

People You May Know

Large dataset

Compute intensive

High reads

Freshness ~hrs

LinkedIn Today

Moving dataset

High writes

High reads

Freshness ~mins


LinkedIn Data Infrastructure: Three-Phase Abstraction

Users Online Data Infra

Near-Line Infra

Application Offline Data Infra

Infrastructure, Latency & Freshness Requirements, Products

Online: activity that should be reflected immediately
• Member Profiles
• Company Profiles
• Connections
• Messages
• Endorsements
• Skills

Near-Line: activity that should be reflected soon
• Activity Streams
• Profile Standardization
• News
• Recommendations
• Search
• Messages

Offline: activity that can be reflected later
• People You May Know
• Connection Strength
• News
• Recommendations
• Next best idea…


The Big-Data Feedback Loop

[Diagram: feedback loop among Member, Product, Science, Data, Infrastructure, and Analytics, labeled with Value, Insights, Scale, Engagement, Virality, Signals, and Refinement]

LinkedIn Data Infrastructure: Sample Stack

Infra challenges in the 3-phase ecosystem are diverse, complex, and specific.

Some components are off-the-shelf, with significant investment in home-grown, deep, and interesting platforms.

Databus


The Original RDBMS Model


Streaming Transactions for Search/Connections

Databus: Timeline-Consistent Change Data Capture

LinkedIn Data Infrastructure Solutions


Databus at LinkedIn

[Diagram labels: DB, Bootstrap, Capture Changes, On-line Changes, Compressed Delta Since T, Consistent Snapshot at U]

• Transport independent of data source: Oracle, MySQL, …
• Transactional semantics
• In-order, at-least-once delivery
• Tens of relays
• Hundreds of sources
• Low latency: milliseconds
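The in-order, at-least-once delivery listed for Databus can be illustrated with a small sketch. This is not the Databus client API: `ChangeEvent`, `Relay`, `CheckpointedConsumer`, and the `scn` field (a commit sequence number) are hypothetical names, and the relay is just an in-memory list. The point is that a consumer which checkpoints after applying an event may see that event again after a crash, so applying events idempotently keeps the result correct.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ChangeEvent:
    scn: int    # hypothetical commit sequence number, increasing in commit order
    key: str
    value: str


@dataclass
class Relay:
    """Hypothetical in-memory relay that buffers change events in commit order."""
    events: list = field(default_factory=list)

    def append(self, event):
        # Enforce timeline order: every event must carry a higher SCN than the last.
        if self.events and event.scn <= self.events[-1].scn:
            raise ValueError("events must arrive in commit order")
        self.events.append(event)

    def read_since(self, scn):
        # Return every buffered event committed after the given SCN.
        return [e for e in self.events if e.scn > scn]


@dataclass
class CheckpointedConsumer:
    """Applies events idempotently, so a redelivered event is harmless."""
    checkpoint: int = 0
    state: dict = field(default_factory=dict)

    def poll(self, relay, fail_before_checkpoint=False):
        for event in relay.read_since(self.checkpoint):
            self.state[event.key] = event.value  # idempotent upsert
            if fail_before_checkpoint:
                return  # simulated crash: event applied, checkpoint not saved
            self.checkpoint = event.scn


relay = Relay()
for scn, key, value in [(1, "member:1", "v1"), (2, "member:2", "v1"), (3, "member:1", "v2")]:
    relay.append(ChangeEvent(scn, key, value))

consumer = CheckpointedConsumer()
consumer.poll(relay, fail_before_checkpoint=True)  # event 1 applied, checkpoint stays at 0
consumer.poll(relay)                               # event 1 is redelivered, then 2 and 3
```

After the second poll the consumer has seen event 1 twice, yet its state matches a single in-order pass over the stream.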

[Diagram labels: Relay, Event Win, Client, Databus Client Lib, Consumer 1 … Consumer n]

Scaling Core Databases


Voldemort: Highly-Available Distributed KV Store




• Pluggable components
• Tunable consistency / availability
• Highly scalable key/value store

• 14 clusters, 400 nodes
• 400K peak QPS
• 100TB data
• 2~3ms avg latency
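One common way to express tunable consistency and availability in a Dynamo-style key/value store is through three numbers: the replication factor N, the required reads R, and the required writes W. The sketch below assumes that model; `QuorumConfig` and its method names are hypothetical and are not Voldemort's configuration API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuorumConfig:
    replicas: int         # N: copies kept of each key
    required_reads: int   # R: replicas that must answer a read
    required_writes: int  # W: replicas that must acknowledge a write

    def reads_overlap_writes(self):
        # R + W > N means every read quorum intersects the latest write quorum.
        return self.required_reads + self.required_writes > self.replicas

    def write_failures_tolerated(self):
        # Writes still succeed while at least W replicas are reachable.
        return self.replicas - self.required_writes

    def read_failures_tolerated(self):
        # Reads still succeed while at least R replicas are reachable.
        return self.replicas - self.required_reads


consistent = QuorumConfig(replicas=3, required_reads=2, required_writes=2)
available = QuorumConfig(replicas=3, required_reads=1, required_writes=1)
```

With these assumed settings, `consistent` trades one replica of failure tolerance for overlapping quorums, while `available` tolerates two failed replicas but may serve a stale read.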

Voldemort: Architecture



Secondary Index


Espresso: Indexed Timeline-Consistent Distributed Data Store



Storage with Richer Data Model

Espresso

Application View


Hierarchical data model

Rich functionality on resources
• Conditional updates
• Partial updates
• Atomic counters

Rich functionality within resource groups
• Transactions
• Secondary index
• Text search
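The per-resource operations named above (conditional updates, partial updates, atomic counters) can be sketched against a toy in-memory store. Everything here is an assumption for illustration: `ResourceStore`, the `(table, resource_id, subresource_id)` key shape, and the version counter are hypothetical and do not reflect Espresso's actual interface.

```python
class ResourceStore:
    """Hypothetical in-memory store keyed by (table, resource_id, subresource_id)."""

    def __init__(self):
        self._docs = {}  # key -> (version, document dict)

    def get(self, key):
        # Missing documents are reported as version 0 with no body.
        return self._docs.get(key, (0, None))

    def put_if_version(self, key, doc, expected_version):
        # Conditional update: succeed only if nobody wrote since the caller read.
        version, _ = self.get(key)
        if version != expected_version:
            return False
        self._docs[key] = (version + 1, dict(doc))
        return True

    def patch(self, key, fields):
        # Partial update: merge the given fields into the existing document.
        version, doc = self.get(key)
        self._docs[key] = (version + 1, {**(doc or {}), **fields})

    def increment(self, key, field, delta=1):
        # Counter update; atomic only because this sketch is single-threaded.
        version, doc = self.get(key)
        doc = dict(doc or {})
        doc[field] = doc.get(field, 0) + delta
        self._docs[key] = (version + 1, doc)
        return doc[field]


store = ResourceStore()
key = ("Messages", "member-42", "msg-1")
first = store.put_if_version(key, {"subject": "hi", "read": False}, expected_version=0)
stale = store.put_if_version(key, {"subject": "overwritten"}, expected_version=0)
store.patch(key, {"read": True})
views = store.increment(key, "views")
version, doc = store.get(key)
```

The second `put_if_version` call is rejected because it still carries the version it read before the first write landed.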


Espresso: System Components

• Partitioning/replication
• Timeline consistency
• Change propagation


Generic Cluster Manager: Helix

• Generic Distributed State Model
• Config Management
• Automatic Load Balancing
• Fault tolerance
• Cluster expansion and rebalancing

• Espresso, Databus and Search
• Open Source Apr 2012
• https://github.com/linkedin/helix
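A toy illustration of what load balancing and cluster expansion mean for a partitioned, replicated system: each partition gets one MASTER and the remaining replicas are SLAVEs, and adding nodes spreads the masters out. This is a naive round-robin recomputation under assumed names (`assign`, `masters_per_node`); it is not the Helix API and makes no attempt to minimize partition movement.

```python
def assign(partitions, nodes, replicas=2):
    """Round-robin placement: first replica of each partition is MASTER, the rest SLAVE."""
    if replicas > len(nodes):
        raise ValueError("not enough nodes for the requested replica count")
    mapping = {}
    for p in range(partitions):
        owners = [nodes[(p + r) % len(nodes)] for r in range(replicas)]
        mapping[p] = {owners[0]: "MASTER", **{n: "SLAVE" for n in owners[1:]}}
    return mapping


def masters_per_node(mapping):
    """Count how many partitions each node masters."""
    counts = {}
    for states in mapping.values():
        for node, state in states.items():
            if state == "MASTER":
                counts[node] = counts.get(node, 0) + 1
    return counts


before = assign(6, ["n1", "n2", "n3"])
after = assign(6, ["n1", "n2", "n3", "n4", "n5", "n6"])  # cluster expanded to six nodes
```

Before expansion every node masters two partitions; afterwards each of the six nodes masters one.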


Streaming Non-transactional Events

Hadoop/DW

Espresso


Kafka: High-Volume Low-Latency Messaging System



Ingress – Offline Data Analytics

Secured Hadoop/DW

Kafka Architecture

[Diagram: Producers and Consumers connect to Broker 1 through Broker 4, which host partitions topic1-part1, topic1-part2, topic2-part1, and topic2-part2; Zookeeper is shown alongside]

Key features
• Scale-out architecture
• High throughput
• Automatic load balancing
• Intra-cluster replication

Per day stats
• Writes: 10+ billion messages
• Reads: 50+ billion messages
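Reads can outnumber writes several times over, as in the per-day stats above, when consumers read independently from a partitioned, append-only log and each tracks its own position. The sketch below models that idea in memory. `Topic` and `OffsetConsumer` are hypothetical names, and hashing with `zlib.crc32` is an assumption made for determinism; none of this is the Kafka client API.

```python
import zlib


class Topic:
    """Hypothetical in-memory topic: one append-only list per partition."""

    def __init__(self, name, partitions):
        self.name = name
        self.logs = [[] for _ in range(partitions)]

    def partition_for(self, key):
        # Stable hash so the same key always lands in the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.logs)

    def publish(self, key, message):
        p = self.partition_for(key)
        self.logs[p].append(message)
        return p, len(self.logs[p]) - 1  # (partition, offset)

    def fetch(self, partition, offset, max_messages=100):
        return self.logs[partition][offset:offset + max_messages]


class OffsetConsumer:
    """Tracks its own offset per partition, so reading never mutates the log."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = [0] * len(topic.logs)

    def poll(self):
        out = []
        for p, offset in enumerate(self.offsets):
            batch = self.topic.fetch(p, offset)
            out.extend(batch)
            self.offsets[p] = offset + len(batch)
        return out


topic = Topic("page-views", partitions=2)
for member in ["alice", "bob", "alice", "carol"]:
    topic.publish(member, f"{member} viewed a profile")

reader_a, reader_b = OffsetConsumer(topic), OffsetConsumer(topic)
first = reader_a.poll()    # all four messages
second = reader_a.poll()   # nothing new since the last poll
replay = reader_b.poll()   # an independent reader sees the full log again
```

Four messages were written once and read twice, because each reader advances only its own offsets.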


Egress – Analytics Results for Online Serving

Secured Hadoop/DW


WebHDFS + Faust



Egress – Getting Data Out from Offline

Secured Hadoop/DW

WebHDFS

Kafka

Faust


Batch Environment Data Flow


Workflow management: Azkaban


Read-only Data Generation and Transfer

• Map-reduce jobs generate read-only (RO) files
• The entire index fits in memory for fast reads
• The file system cache serves the data
• Data is transferred in parallel via WebHDFS
• Authentication is always required for each file transfer out of Hadoop
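A minimal sketch of the read-only serving idea: a batch step writes values into a data file and produces a sorted index, the index is held in memory, and each lookup is a binary search followed by one file read that the operating system's cache can absorb. The file layout, `build_store`, and `ReadOnlyStore` are hypothetical simplifications, not the on-disk format used in production.

```python
import bisect
import os
import tempfile


def build_store(records, directory):
    """Write values to one data file; return its path and a sorted (key, offset, length) index."""
    data_path = os.path.join(directory, "store.data")
    index = []
    with open(data_path, "wb") as data:
        for key in sorted(records):
            value = records[key].encode("utf-8")
            index.append((key, data.tell(), len(value)))
            data.write(value)
    return data_path, index


class ReadOnlyStore:
    def __init__(self, data_path, index):
        self._keys = [k for k, _, _ in index]        # whole index held in memory
        self._slots = [(o, n) for _, o, n in index]
        self._data = open(data_path, "rb")           # reads lean on the OS file cache

    def get(self, key):
        # Binary search the in-memory keys, then read exactly one value from disk.
        i = bisect.bisect_left(self._keys, key)
        if i == len(self._keys) or self._keys[i] != key:
            return None
        offset, length = self._slots[i]
        self._data.seek(offset)
        return self._data.read(length).decode("utf-8")

    def close(self):
        self._data.close()


with tempfile.TemporaryDirectory() as tmp:
    path, index = build_store({"member:2": "bob", "member:1": "alice"}, tmp)
    store = ReadOnlyStore(path, index)
    hit = store.get("member:1")
    miss = store.get("member:9")
    store.close()
```

Because the files never change after the batch job writes them, no locking or write path is needed on the serving side.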


Modifiable Data Generation and Transfer

• Map-reduce jobs generate records
• Records are in Avro format
• Key and value fields are annotated
• Records are published from Hadoop to Kafka
• Faust consumes records from Kafka
• Faust streams records into Voldemort, Espresso, and other serving platforms
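The flow above can be sketched end to end with stand-ins: a schema annotation names the key and value fields, a queue plays the role of the message bus, and a consumer fans each record out to store plug-ins. `SCHEMA`, `split_record`, `DictStorePlugin`, and `run_pipeline` are all hypothetical; Faust's real interfaces are not shown in the slides.

```python
from collections import deque

# Hypothetical annotation: which fields form the key and which form the value.
SCHEMA = {"key_fields": ["member_id"], "value_fields": ["score", "segment"]}


def split_record(record, schema):
    """Split one record into (key tuple, value dict) using the annotated fields."""
    key = tuple(record[f] for f in schema["key_fields"])
    value = {f: record[f] for f in schema["value_fields"]}
    return key, value


class DictStorePlugin:
    """Stand-in for a serving-store plug-in; writes into a plain dict."""

    def __init__(self):
        self.data = {}

    def write(self, key, value):
        self.data[key] = value


def run_pipeline(records, schema, plugins):
    queue = deque()
    for record in records:       # the batch job publishes records
        queue.append(record)
    delivered = 0
    while queue:                 # the consumer drains the queue in publish order
        key, value = split_record(queue.popleft(), schema)
        for plugin in plugins:   # each plug-in loads the record into its store
            plugin.write(key, value)
        delivered += 1
    return delivered


store_a, store_b = DictStorePlugin(), DictStorePlugin()
records = [
    {"member_id": 1, "score": 0.9, "segment": "a", "debug": "x"},
    {"member_id": 2, "score": 0.4, "segment": "b", "debug": "y"},
]
count = run_pipeline(records, SCHEMA, [store_a, store_b])
```

Fields that are not annotated as key or value (here `debug`) never reach the serving stores.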

[Diagram: Faust (monitoring, throttling, scheduling) with plug-ins: Kafka plug-in, Databus plug-in, V. plug-in, E. plug-in; connected systems: Hadoop, Teradata/DWH, Kafka, Voldemort, Espresso, other data sources]


Summary

Read more @ data.linkedin.com

1. E2E: The Big-Data feedback loop is essential for product design.

2. Infrastructure

   1. Data Infra needs continuous innovation and iteration to scale out.

   2. Fast moving, Big, Clean Data + Agile Metadata = Goodness

   3. Data-driven products need agile feedback infrastructure and measurement methodology.

3. Methodology

   1. Data-Driven experimentation enables insights and agile products.

   2. Recommendation-driven products have big impact.


Help us. Come Have Fun with Us!

Info: data.linkedin.com

1. Science and Data Mining: Recommendation and Optimization Problems

2. Next-generation ad-hoc and OLAP query processing on Hadoop

3. Graph Computations: Off-line mining and On-line integration loops

4. nRT Data Streams in Near-line infrastructure

5. And much more…


In Closing

[email protected]

Thank You!
