Data Infrastructure at LinkedIn


DESCRIPTION

This talk was given by Kapil Surlaker (Staff Software Engineer @ LinkedIn) at the 28th IEEE International Conference on Data Engineering (ICDE 2012).

TRANSCRIPT

Data Infrastructure at LinkedIn

Kapil Surlaker
http://www.linkedin.com/in/kapilsurlaker
@kapilsurlaker

Outline

• LinkedIn Products
• Data Ecosystem
• LinkedIn Data Infrastructure Solutions
• Next Play

LinkedIn By The Numbers

• 150M+ users*
• ~4.2B people searches in 2011**
• >2M companies with LinkedIn Company Pages**
• 16 languages
• 75% of Fortune 100 companies use LinkedIn to hire***

* As of February 9th, 2012
** As of December 31st, 2011
*** As of September 30th, 2011

Broad Range of Products & Services


User Profiles
• Large dataset
• Medium writes
• Very high reads
• Freshness < 1s

Communications
• Large dataset
• High writes
• High reads
• Freshness < 1s

People You May Know
• Large dataset
• Compute intensive
• High reads
• Freshness ~ hrs

LinkedIn Today
• Moving dataset
• High writes
• High reads
• Freshness ~ mins

Outline

• LinkedIn Products
• Data Ecosystem
• LinkedIn Data Infrastructure Solutions
• Next Play

Three Paradigms: Simplifying the Data Continuum

Online: activity that should be reflected immediately
• Member Profiles
• Company Profiles
• Connections
• Communications

Nearline: activity that should be reflected soon
• LinkedIn Today
• Profile Standardization
• News
• Recommendations
• Search
• Communications

Offline: activity that can be reflected later
• People You May Know
• Connection Strength
• News
• Recommendations
• Next best idea

LinkedIn Product Architecture


Databus: Timeline-Consistent Change Data Capture

LinkedIn Data Infrastructure Solutions

Databus at LinkedIn

[Figure: Databus architecture. Changes are captured from the source DB into a Relay, which holds a window of recent events. Consumers embed a Databus Client Lib and pull on-line changes from the Relay; a Bootstrap component, built from a consistent snapshot at time U plus subsequent captured changes, serves consumers that fall behind the relay's event window.]


• Transport independent of data source: Oracle, MySQL, …
• Transactional semantics: in-order, at-least-once delivery
• Tens of relays, hundreds of sources
• Low latency: milliseconds

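These guarantees shape what a consumer looks like: events arrive as an ordered change stream, and because delivery is at-least-once, handlers must tolerate replays. Below is a minimal Java sketch of a Databus-style consumer; the ChangeEvent interface and method names are hypothetical stand-ins for illustration, not the actual Databus client API.

```java
// Hypothetical sketch of a Databus-style change consumer. Events arrive in
// transaction commit order, at least once, so the handler must be idempotent.
interface ChangeEvent {
    String source();   // logical source, e.g. a "member_profiles" table
    long sequence();   // monotonically increasing sequence number (commit order)
    byte[] payload();  // serialized after-image of the changed row
}

public class ProfileIndexUpdater {
    private long lastApplied = -1;

    // Invoked once per captured change, in commit order.
    public void onEvent(ChangeEvent event) {
        // At-least-once delivery: drop events we have already applied.
        if (event.sequence() <= lastApplied) {
            return;
        }
        applyToSearchIndex(event.payload());
        lastApplied = event.sequence();
    }

    private void applyToSearchIndex(byte[] afterImage) {
        // Update the derived store (search index, cache, graph engine, ...).
    }
}
```

A consumer that falls behind the relay's event window switches to the Bootstrap service for the consistent snapshot, then resumes pulling on-line changes from the relay.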


Voldemort: Highly-Available Distributed KV Store

LinkedIn Data Infrastructure Solutions

• Pluggable components
• Tunable consistency / availability
• Key/value model, server-side “views”

• 10 clusters, 100+ nodes
• Largest cluster: 10K+ qps
• Avg latency: 3 ms
• Hundreds of stores
• Largest store: 2.8 TB+
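Tunable consistency/availability in a Dynamo-style store like Voldemort typically means picking, per store, a replication factor N and read/write quorums R and W, with versioned values letting the client detect conflicting writes. A minimal sketch of the resulting usage pattern, with hypothetical interfaces rather than the actual Voldemort client API:

```java
// Illustrative sketch of a Voldemort-style key/value client. The interfaces
// are hypothetical; real clients expose a similar get/put surface with
// versioned values for conflict detection.
interface Versioned<V> {
    V value();
    Object version();   // vector clock identifying the write this value came from
}

interface KeyValueStore<K, V> {
    Versioned<V> get(K key);                  // read from R replicas
    void put(K key, V value, Object version); // write succeeds once W replicas ack
}

class MemberSettingsDao {
    private final KeyValueStore<String, String> store;

    MemberSettingsDao(KeyValueStore<String, String> store) {
        this.store = store;
    }

    // Classic read-modify-write: fetch the current version, then write
    // against it so the store can detect concurrent updates.
    void updateSetting(String memberId, String newValue) {
        Versioned<String> current = store.get(memberId);
        store.put(memberId, newValue, current == null ? null : current.version());
    }
}
```

With R + W > N every read overlaps every successful write, trading some availability for consistency; lowering R or W flips that trade per store.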

Voldemort: Architecture


Kafka: High-Volume Low-Latency Messaging System

LinkedIn Data Infrastructure Solutions


Kafka: Architecture

[Figure: Kafka architecture. A web tier pushes events to a broker tier holding topics 1…N; brokers persist each topic with sequential writes and serve reads via sendfile. Consumers embed a Kafka client lib and pull events through iterators, tracking their own (topic, offset) position; ZooKeeper handles offset management and topic/partition ownership. Indicative throughput in the figure: 100 MB/sec in, 200 MB/sec out.]
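The "sequential write / sendfile" annotations point at why the broker is fast: messages are appended to the log sequentially, and on the read path the kernel copies file bytes straight to the socket without passing through user space. In Java this zero-copy path is exposed as FileChannel.transferTo; a minimal sketch, with the log file name and consumer address as illustrative placeholders:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal sketch of the zero-copy read path a Kafka-style broker uses:
// bytes go from the page cache to the socket via sendfile(2), never
// entering user space.
public class ZeroCopySend {
    public static void main(String[] args) throws IOException {
        Path log = Path.of("topic-0.log");   // illustrative log segment
        try (FileChannel file = FileChannel.open(log, StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(
                     new InetSocketAddress("consumer-host", 9092))) {
            long position = 0;
            long remaining = file.size();
            while (remaining > 0) {
                // transferTo maps to sendfile on Linux: zero-copy transfer.
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```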


• Billions of events, TBs per day
• 50K+ events per second at peak
• Inter- and intra-cluster replication
• End-to-end latency: a few seconds

• At-least-once delivery
• Very high throughput
• Low latency
• Durability
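A sketch of the push/pull pattern from the figure using the Apache Kafka Java client; note that this client API postdates the 2012 talk, and the topic, group, and broker names are illustrative:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PageViewPipeline {
    public static void main(String[] args) {
        // Web-tier side: push events to the broker tier.
        Properties p = new Properties();
        p.put("bootstrap.servers", "broker-1:9092");
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>("page-views", "member-176", "{\"page\":\"/home\"}"));
        }

        // Consumer side: pull events, tracking position as a (topic, offset) pair.
        Properties c = new Properties();
        c.put("bootstrap.servers", "broker-1:9092");
        c.put("group.id", "news-relevance");
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(List.of("page-views"));
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                // At-least-once: commit offsets only after processing.
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}
```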


Espresso: Indexed Timeline-Consistent Distributed Data Store

LinkedIn Data Infrastructure Solutions


Application View

• Hierarchical data model
• Rich functionality on resources: conditional updates, partial updates, atomic counters
• Rich functionality within resource groups: transactions, secondary index, text search
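Hierarchical addressing means a document is reached through its parents (database/table/resource), which is what makes within-group transactions and secondary indexes natural. A hedged sketch of what a conditional update could look like over a REST interface; the host, URI scheme, and use of an If-Match header are illustrative, not Espresso's actual API:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical sketch of a conditional (compare-and-set) update against a
// document store with hierarchical keys: /<database>/<table>/<resourceId>.
public class ConditionalUpdate {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Illustrative resource path: mailbox database, Message table,
        // member 176, message 42.
        URI doc = URI.create("http://espresso.example/MailboxDB/Message/176/42");
        HttpRequest req = HttpRequest.newBuilder(doc)
                .header("If-Match", "\"etag-from-last-read\"") // conditional update
                .PUT(HttpRequest.BodyPublishers.ofString("{\"read\": true}"))
                .build();
        HttpResponse<String> resp = client.send(req, HttpResponse.BodyHandlers.ofString());
        // A 412 Precondition Failed would mean another writer got there first.
        System.out.println(resp.statusCode());
    }
}
```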

Partitioning

[Figure: Espresso partition layout with master and slave replicas. A Cluster Manager maintains the routing table (partition → node, e.g. P.1 → Node 1, …, P.12 → Node 3) and each node's replica states (e.g. Node 1: master for P.1, active; slave for P.5, active). Twelve partitions are spread across 3 storage-engine nodes with 2-way replication; each node masters some partitions and holds slave replicas of others.]
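Routing in this layout is a hash of the resource key to a partition, then a routing-table lookup for that partition's current master. A minimal sketch with a hypothetical static table; in the real system the cluster manager maintains the table and pushes changes as masters move:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of master/slave partition routing.
public class PartitionRouter {
    private static final int NUM_PARTITIONS = 12;

    private final Map<Integer, String> masters; // partition -> current master node

    public PartitionRouter(Map<Integer, String> masters) {
        this.masters = masters;
    }

    // All writes for a key go through the master of its partition, which is
    // what gives timeline consistency per key.
    public String nodeForWrite(String resourceKey) {
        int partition = Math.floorMod(resourceKey.hashCode(), NUM_PARTITIONS);
        return masters.get(partition);
    }

    public static void main(String[] args) {
        // Hypothetical assignment: 12 partitions round-robined over 3 nodes.
        Map<Integer, String> masters = new HashMap<>();
        for (int p = 0; p < NUM_PARTITIONS; p++) {
            masters.put(p, "node-" + (p % 3 + 1));
        }
        System.out.println(new PartitionRouter(masters).nodeForWrite("member/176"));
    }
}
```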

Espresso: System Components


Generic Cluster Manager: Helix

• Generic distributed state model
• Centralized config management
• Automatic load balancing
• Fault tolerance
• Health monitoring
• Cluster expansion and rebalancing
• Used by Espresso, Databus, and Search
• Open-sourced April 2012: https://github.com/linkedin/helix
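The "generic distributed state model" is the core idea: you declare the states a partition replica can be in and the legal transitions, and Helix computes and dispatches the transitions needed to reach the target cluster state. A hedged sketch of a master/slave participant using the open-source Helix API named above; exact signatures may vary across versions:

```java
import org.apache.helix.NotificationContext;
import org.apache.helix.model.Message;
import org.apache.helix.participant.statemachine.StateModel;
import org.apache.helix.participant.statemachine.StateModelInfo;
import org.apache.helix.participant.statemachine.Transition;

// Sketch of a Helix participant for a MasterSlave-style state model:
// Helix decides which replica of each partition should be MASTER and
// invokes these callbacks to drive the transitions.
@StateModelInfo(initialState = "OFFLINE", states = {"MASTER", "SLAVE", "OFFLINE"})
public class MasterSlaveStateModel extends StateModel {

    @Transition(from = "OFFLINE", to = "SLAVE")
    public void onBecomeSlaveFromOffline(Message message, NotificationContext context) {
        // Open the local store for this partition and start catching up
        // from the current master's change stream.
    }

    @Transition(from = "SLAVE", to = "MASTER")
    public void onBecomeMasterFromSlave(Message message, NotificationContext context) {
        // Finish draining the replication stream, then start taking writes.
    }

    @Transition(from = "MASTER", to = "SLAVE")
    public void onBecomeSlaveFromMaster(Message message, NotificationContext context) {
        // Stop taking writes and resume following the new master.
    }
}
```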

Espresso @ LinkedIn

• Launched first application Oct 2011
• Open source 2012
• Future:
– Multi-datacenter support
– Global secondary indexes
– Time-partitioned data


Acknowledgments

Siddharth Anand, Aditya Auradkar, Chavdar Botev, Vinoth Chandar, Shirshanka Das, Dave DeMaagd, Alex Feinberg, John Fung, Phanindra Ganti, Mihir Gandhi, Lei Gao, Bhaskar Ghosh, Kishore Gopalakrishna, Brendan Harris, Rajappa Iyer, Swaroop Jagadish, Joel Koshy, Kevin Krawez, Jay Kreps, Shi Lu, Sunil Nagaraj, Neha Narkhede, Sasha Pachev, Igor Perisic, Lin Qiao, Tom Quiggle, Jun Rao, Bob Schulman, Abraham Sebastian, Oliver Seeliger, Adam Silberstein, Boris Shkolnik, Chinmay Soman, Subbu Subramaniam, Roshan Sumbaly, Kapil Surlaker, Sajid Topiwala, Cuong Tran, Balaji Varadarajan, Jemiah Westerman, Zach White, Victor Ye, David Zhang, and Jason Zhang


Questions?
