Espresso: LinkedIn's Distributed Data Serving Platform (talk)


DESCRIPTION

This talk was given by Swaroop Jagadish (Staff Software Engineer @ LinkedIn) at the ACM SIGMOD/PODS Conference (June 2013). For the paper written by the LinkedIn Espresso Team, go here: http://www.slideshare.net/amywtang/espresso-20952131

TRANSCRIPT

On Brewing Fresh Espresso: LinkedIn’s Distributed Data

Serving Platform

Swaroop Jagadish

http://www.linkedin.com/in/swaroopjagadish


Outline

LinkedIn Data Ecosystem

Espresso: Design Points

Data Model and API

Architecture

Deep Dive: Fault Tolerance

Deep Dive: Secondary Indexing

Espresso In Production

Future work


The World's Largest Professional Network

225M+ members worldwide

2 new members per second

100M+ monthly unique visitors

2M+ company pages

Connecting talent with opportunity. At scale…

LinkedIn Data Ecosystem


Espresso: Key Design Points

Source-of-truth
– Master-slave, timeline consistent
– Query-after-write
– Backup/restore
– High availability

Horizontally scalable

Rich functionality
– Hierarchical data model
– Document oriented
– Transactions within a hierarchy
– Secondary indexes

Espresso: Key Design Points

Agility – no "pause the world" operations
– "On the fly" schema evolution
– Elasticity

Integration with the data ecosystem
– Change stream with freshness in O(seconds)
– ETL to Hadoop
– Bulk import

Modular and pluggable
– Off-the-shelf: MySQL, Lucene, Avro

Data Model and API


Application View

The application sees key/value pairs addressed through a REST API, e.g. /mailbox/msg_meta/bob/2

Partitioning

/mailbox/msg_meta/bob/2

MemberId ("bob") is the partitioning key (a minimal routing sketch follows below).
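To make the routing concrete, here is a minimal sketch, assuming MD5 hashing of the leading key element and a fixed partition count (both assumptions, not Espresso's actual router code):

```python
import hashlib

NUM_PARTITIONS = 1024  # assumption; 1024 partitions is the figure quoted later in the talk

def partition_for(resource_path: str) -> int:
    """Pick a partition from the leading key element of a hierarchical key.

    For /mailbox/msg_meta/bob/2 the partitioning key is the mailbox owner
    ("bob"), so every message in bob's mailbox lands on the same partition.
    """
    # Path layout: /<database>/<table>/<partition_key>/<sub_key...>
    parts = resource_path.strip("/").split("/")
    partition_key = parts[2]
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

print(partition_for("/mailbox/msg_meta/bob/2"))   # same partition ...
print(partition_for("/mailbox/msg_meta/bob/99"))  # ... as this one
```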

Document based data model
– Richer than a plain key-value store
– Hierarchical keys
– Values are rich documents and may contain nested types

Example document in the Messages table:

from : { name : "Chris", email : "chris@linkedin.com" }
subject : "Go Giants!"
body : "World Series 2012! w00t!"
unread : true

Messages schema:

mailboxID : String
messageID : long
from : { name : String, email : String }
subject : String
body : String
unread : boolean

REST based API

• Secondary index query
  GET /MailboxDB/MessageMeta/bob/?query="+isUnread:true +isInbox:true"&start=0&count=15

• Partial updates
  POST /MailboxDB/MessageMeta/bob/1
  Content-Type: application/json
  Content-Length: 21

  {"unread" : "false"}

• Conditional operations – get a message only if recently updated
  GET /MailboxDB/MessageMeta/bob/1
  If-Modified-Since: Wed, 31 Oct 2012 02:54:12 GMT
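As a rough illustration of what these three calls might look like from a client, here is a hedged sketch using Python's requests library; the host name is hypothetical, while the paths, query string, body, and header come from the slide:

```python
import requests

BASE = "http://espresso.example.com"  # hypothetical router endpoint

# Secondary-index query: unread inbox messages for bob, first page of 15
resp = requests.get(
    f"{BASE}/MailboxDB/MessageMeta/bob/",
    params={"query": "+isUnread:true +isInbox:true", "start": 0, "count": 15},
)
messages = resp.json()

# Partial update: only the supplied field of message 1 changes
requests.post(
    f"{BASE}/MailboxDB/MessageMeta/bob/1",
    json={"unread": "false"},
)

# Conditional read: fetch message 1 only if modified since the given time
resp = requests.get(
    f"{BASE}/MailboxDB/MessageMeta/bob/1",
    headers={"If-Modified-Since": "Wed, 31 Oct 2012 02:54:12 GMT"},
)
if resp.status_code == 304:
    print("not modified, use cached copy")
```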


Transactional writes within a hierarchy

Example: mark message 0 in George's mailbox unread and bump the mailbox's unread counter in one atomic group.

Before:
  MessageCounter  George            { "numUnread": 2 }
  Message         George, msgId 0   {…, "unread": false, …}   etag 7abf8091
  Message         George, msgId 1   {…, "unread": true, …}    etag b648bc5f
  Message         George, msgId 2   {…, "unread": true, …}    etag 4fde8701

1. Read, record etags: /Message/George/0 returns {…, "unread": false, …} with etag 7abf8091
2. Prepare the after-image:
   /Message/George/0        {…, "unread": true, …}
   /MessageCounter/George   {…, "numUnread": "+1", …}
3. Update (conditional on the recorded etags; a sketch of this flow follows below)

After:
  MessageCounter  George   { "numUnread": 3 }
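A minimal, runnable sketch of that read / prepare / conditional-update flow. The FakeEspressoClient and its put_group call are assumptions standing in for the real client; the point is the etag-guarded group write over keys that share a partitioning key:

```python
import copy

class FakeEspressoClient:
    """Tiny in-memory stand-in for an Espresso client: documents keyed by
    resource path, each with an etag that changes on every successful write."""
    def __init__(self, docs):
        self.docs = {k: [v, 0] for k, v in docs.items()}   # path -> [doc, etag]

    def get(self, path):
        doc, etag = self.docs[path]
        return copy.deepcopy(doc), etag

    def put_group(self, writes):
        # Reject the whole group if any document changed since it was read.
        for path, _, expected_etag in writes:
            if self.docs[path][1] != expected_etag:
                raise RuntimeError("etag mismatch, retry the transaction")
        for path, new_doc, _ in writes:
            self.docs[path] = [new_doc, self.docs[path][1] + 1]

def mark_unread(client, mbox, msg):
    # 1. Read current versions and record their etags
    message, msg_etag = client.get(f"/Message/{mbox}/{msg}")
    counter, ctr_etag = client.get(f"/MessageCounter/{mbox}")
    # 2. Prepare the after-image of every document in the write set
    message["unread"] = True
    counter["numUnread"] += 1
    # 3. Apply the group; all keys share the partitioning key (mbox), so one
    #    storage node can commit them together.
    client.put_group([
        (f"/Message/{mbox}/{msg}", message, msg_etag),
        (f"/MessageCounter/{mbox}", counter, ctr_etag),
    ])

client = FakeEspressoClient({
    "/Message/George/0": {"subject": "Go Giants!", "unread": False},
    "/MessageCounter/George": {"numUnread": 2},
})
mark_unread(client, "George", 0)
print(client.get("/MessageCounter/George"))   # ({'numUnread': 3}, 1)
```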

Espresso Architecture

(Architecture diagrams; no transcript text for these slides.)

Cluster Management and Fault Tolerance

Generic Cluster Manager: Apache Helix

Generic cluster management
– State model + constraints
– Ideal state of distribution of partitions across the cluster
– Migrate cluster from current state to ideal state

More info:
• SoCC 2012
• http://helix.incubator.apache.org

Espresso Partition Layout: Master, Slave

3 storage engine nodes, 2-way replication, 12 partitions:

Node 1 – Master: P1 P2 P3 P4     Slave: P5 P6 P9 P10
Node 2 – Master: P5 P6 P7 P8     Slave: P1 P2 P11 P12
Node 3 – Master: P9 P10 P11 P12  Slave: P3 P4 P7 P8

Apache Helix holds the mapping from each partition to its nodes (e.g. P1 → Node 1, P12 → Node 3) and each node's per-partition state (Master / Slave / Offline). A simplified assignment sketch follows below.
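The layout above can be generated mechanically. The following is a simplified stand-in for that kind of ideal-state computation, not Helix's actual rebalancer, and it will not reproduce the exact layout on the slide; it only illustrates the constraints (masters spread evenly, each partition's replicas on distinct nodes):

```python
def assign_partitions(num_partitions: int, nodes: list, replicas: int = 2):
    """Round-robin masters across nodes; place each additional replica (slave)
    on the next node in ring order so no node holds two copies of a partition."""
    assignment = {n: {"master": [], "slave": []} for n in nodes}
    for p in range(1, num_partitions + 1):
        owner = (p - 1) % len(nodes)
        assignment[nodes[owner]]["master"].append(f"P{p}")
        for r in range(1, replicas):
            slave = (owner + r) % len(nodes)
            assignment[nodes[slave]]["slave"].append(f"P{p}")
    return assignment

for node, parts in assign_partitions(12, ["Node1", "Node2", "Node3"]).items():
    print(node, "masters:", parts["master"], "slaves:", parts["slave"])
```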

Cluster Management

Cluster Expansion

Node Failover

Cluster Expansion

Step 1: Starting from the initial state with 3 storage nodes, compute the new ideal state that includes the added (still empty) Node 4.

(Diagram: the partition layout above, with Node 4 added but not yet hosting any partitions.)

Cluster Expansion

Step 2: Bootstrap the new node's partitions by restoring from snapshot backups.

(Diagram: Node 4 restores partitions P1, P4, P7, P8, P9, P12 from snapshots while the original three nodes keep serving.)

Cluster Expansion

Step 3: Catch up from the live replication stream, starting at the sequence point of the restored snapshot (a sketch of steps 2–3 follows below).

(Diagram: Node 4's restored partitions consume the change stream until they are current.)
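Steps 2 and 3 together amount to "restore a snapshot, then replay the change stream from the snapshot's position." A toy, self-contained sketch of that sequence; the snapshot layout, change log, and SCN bookkeeping here are assumptions made for illustration, not Espresso's actual interfaces:

```python
change_log = [  # replicated change stream for one partition: (scn, key, value)
    (1, "msg/0", "a"), (2, "msg/1", "b"), (3, "msg/0", "a2"), (4, "msg/2", "c"),
]
snapshot = {"scn": 2, "rows": {"msg/0": "a", "msg/1": "b"}}   # backup taken at SCN 2

def bootstrap_partition(snapshot, change_log):
    # Step 2: restore the most recent backup; it records the SCN it was taken at.
    rows = dict(snapshot["rows"])
    applied_scn = snapshot["scn"]
    # Step 3: replay the live stream from that SCN until the partition is current,
    # after which the replica can serve as a slave (and later be promoted).
    for scn, key, value in change_log:
        if scn > applied_scn:
            rows[key] = value
            applied_scn = scn
    return rows, applied_scn

print(bootstrap_partition(snapshot, change_log))
# ({'msg/0': 'a2', 'msg/1': 'b', 'msg/2': 'c'}, 4)
```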

Cluster Expansion

Step 4: Migrate masters and slaves to rebalance.

(Diagram: after rebalancing, each node masters 3 partitions — Node 1: P1 P2 P3, Node 2: P5 P6 P7, Node 3: P9 P10 P11, Node 4: P4 P8 P12 — with the remaining replicas spread as slaves.)

Cluster Expansion

Partitions are balanced; the router starts sending traffic to the new node.

(Diagram: final layout with traffic flowing to all four nodes.)

Node Failover

• During failure or planned maintenance

(Diagram: the 4-node cluster; Helix's ideal state maps e.g. P1 → Node 1 and P12 → Node 4, and Node 4 currently masters P4 and holds a slave of P7.)

Node Failover

• Step 1: Detect node failure

(Diagram: Node 4 fails; its master partitions P4, P8, P12 and its slave replicas of P1, P7, P9 become unavailable.)

Node Failover

• Step 2: Compute the new ideal state, promoting slaves of the failed node's master partitions to master

(Diagram: the surviving slave replicas of P4, P8, and P12 are promoted to master on the remaining nodes.)

Failover Performance

(Chart: failover latency measurements; the numbers appear under "Performance" below.)

Secondary Indexing

Espresso Secondary Indexing

• Local Secondary Index Requirements

• Read after write

• Consistent with primary data under failure

• Rich query support: match, prefix, range, text search

• Cost-to-serve proportional to working set

• Pluggable Index Implementations

• MySQL B-Tree

• Inverted index using Apache Lucene with MySQL backing store

• Inverted index using Prefix Index

• Fastbit based bitmap index
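One way to read the "consistent with primary data under failure" requirement is that the local index entries are written in the same local transaction as the base row. A minimal sqlite3 sketch of that idea; the table layout and field choices are illustrative assumptions, not Espresso's MySQL schema:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE message (mbox_id TEXT, msg_id INTEGER, doc TEXT,
                          PRIMARY KEY (mbox_id, msg_id));
    CREATE TABLE message_idx (mbox_id TEXT, field TEXT, value TEXT, msg_id INTEGER);
    CREATE INDEX idx_lookup ON message_idx (mbox_id, field, value);
""")

def put_message(mbox_id, msg_id, doc):
    # Base row and index rows are written in one local transaction, so a crash
    # can never leave the index pointing at data that was not written.
    with conn:
        conn.execute("INSERT OR REPLACE INTO message VALUES (?, ?, ?)",
                     (mbox_id, msg_id, json.dumps(doc)))
        conn.execute("DELETE FROM message_idx WHERE mbox_id=? AND msg_id=?",
                     (mbox_id, msg_id))
        for field in ("unread", "isInbox"):       # hypothetical indexed fields
            if field in doc:
                conn.execute("INSERT INTO message_idx VALUES (?, ?, ?, ?)",
                             (mbox_id, field, str(doc[field]).lower(), msg_id))

put_message("bob", 1, {"subject": "Go Giants!", "unread": True, "isInbox": True})
rows = conn.execute("""SELECT m.doc FROM message_idx i JOIN message m
                       ON m.mbox_id = i.mbox_id AND m.msg_id = i.msg_id
                       WHERE i.mbox_id = ? AND i.field = 'unread' AND i.value = 'true'""",
                    ("bob",)).fetchall()
print(rows)   # bob's unread messages, found via the index
```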

Lucene based implementation

• Requires the entire index to be memory-resident to support low-latency query response times
• For the Mailbox application, we have two options

Optimizations for Lucene based implementation

• Concurrent transactions on the same Lucene index lead to inconsistency
  – Need to acquire a lock
• Opening an index repeatedly is expensive
  – Group commit to amortize the index opening cost (see the sketch below)

(Diagram: write requests 1–5 arriving concurrently are batched into a single index commit.)
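A sketch of the group-commit idea under the stated assumptions: request threads queue their index updates, and a single committer drains whatever has accumulated, takes the lock once, applies the whole batch in one index open, and then wakes the waiting requests. The queue/lock structure is an illustration, not Espresso's implementation:

```python
import queue
import threading

pending = queue.Queue()          # index updates waiting for the next group commit
index_lock = threading.Lock()    # a single writer at a time keeps the index consistent

def committer():
    while True:
        batch = [pending.get()]              # block until one update is queued
        while not pending.empty():           # then drain everything that piled up
            batch.append(pending.get_nowait())
        with index_lock:
            # Open the (Lucene) index once, apply the whole batch, commit, close.
            print(f"group commit of {len(batch)} updates with one index open")
            for _update, done in batch:
                done.set()                   # wake the request threads in the batch

threading.Thread(target=committer, daemon=True).start()

def index_write(update):
    """Called on each request thread; returns once the update is committed."""
    done = threading.Event()
    pending.put((update, done))
    done.wait()

# Five concurrent requests typically collapse into one or two index commits.
threads = [threading.Thread(target=index_write, args=(f"req-{i}",)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```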

Optimizations for Lucene based implementation

• High-value users of the site accumulate large mailboxes
  – Query performance degrades with a large index
  – Performance shouldn't get worse with more usage!
• Time-partitioned indexes: partition the index into buckets based on created time
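A sketch of the time-partitioned index idea, with hypothetical per-bucket structures standing in for per-bucket Lucene indexes: updates go to the bucket for the message's created time, and queries scan buckets newest-first so recent mail stays fast regardless of total mailbox size. The quarterly bucket size is an assumption:

```python
from datetime import datetime

BUCKET_MONTHS = 3  # assumption: one index bucket per quarter

def bucket_key(created: datetime) -> str:
    quarter = (created.month - 1) // BUCKET_MONTHS
    return f"{created.year}-Q{quarter + 1}"

class TimePartitionedIndex:
    def __init__(self):
        self.buckets = {}   # bucket key -> {msg_id: doc}; stands in for one index each

    def add(self, msg_id, doc, created: datetime):
        self.buckets.setdefault(bucket_key(created), {})[msg_id] = doc

    def query(self, predicate, limit):
        """Scan buckets newest-first and stop as soon as enough hits are found,
        so only the small, recent indexes are usually touched."""
        hits = []
        for key in sorted(self.buckets, reverse=True):
            for msg_id, doc in self.buckets[key].items():
                if predicate(doc):
                    hits.append((msg_id, doc))
                    if len(hits) == limit:
                        return hits
        return hits

idx = TimePartitionedIndex()
idx.add(1, {"unread": True}, datetime(2012, 2, 1))
idx.add(2, {"unread": True}, datetime(2012, 10, 30))
print(idx.query(lambda d: d["unread"], limit=1))   # the most recent bucket's hit comes first
```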

Espresso in Production


Espresso in Production

Unified Social Content Platform – social activity aggregation
High read:write ratio

Espresso in Production

InMail – allows members to communicate with each other
Large storage footprint
Low latency requirement for secondary index queries involving text search and relational predicates

Performance

Average failover latency with 1024 partitions is around 300ms.

Primary data reads and writes, for a single storage node on SSD, average row size = 1KB:

Operation | Average Latency | Average Throughput
Reads     | ~3ms            | 40,000 per second
Writes    | ~6ms            | 20,000 per second

Performance

Partition-key level secondary index using Lucene, one index per mailbox use-case.
Base data on SAS disks, indexes on SSDs.
Average throughput per index = ~1000 per second (after the group commit and partitioned index optimizations).

Operation                             | Average Latency
Queries (average of 5 indexed fields) | ~20ms
Writes (around 30 indexed fields)     | ~20ms

Durability and Consistency

Within a Data Center

Across Data Centers

Durability and Consistency

Within a data center
– Write latency vs. durability trade-off

Asynchronous replication
– May lead to data loss
– Tooling can mitigate some of this

Semi-synchronous replication
– Wait for at least one relay to acknowledge
– During failover, slaves wait to catch up

Consistency over availability
– Helix selects the slave with the least replication lag to take over mastership (see the sketch below)
– Failover time is ~300ms in practice
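A small sketch of that mastership hand-off decision; the sequence-number (SCN) bookkeeping is an assumption about how "least replication lag" would be measured:

```python
def choose_new_master(slaves, last_acked_scn):
    """Pick the surviving slave whose applied sequence number (SCN) is highest,
    i.e. the one with the least replication lag; it must catch up to the last
    acknowledged SCN before it starts accepting writes."""
    best = max(slaves, key=lambda s: s["applied_scn"])
    catch_up = last_acked_scn - best["applied_scn"]
    return best["node"], catch_up

slaves = [
    {"node": "Node2", "applied_scn": 1042},
    {"node": "Node3", "applied_scn": 1040},
]
node, catch_up = choose_new_master(slaves, last_acked_scn=1043)
print(node, "promotes after applying", catch_up, "pending events")
```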

Durability and Consistency

Across data centers
– Asynchronous replication
– Stale reads possible
– Active-active: conflict resolution via last-writer-wins
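Last-writer-wins can be as simple as comparing a timestamp carried with each replicated write; a minimal sketch, where tie-breaking by originating data center is an assumption added so both sides converge deterministically:

```python
def resolve(local, remote):
    """Keep whichever version was written last; break exact timestamp ties by
    data-center id so both data centers pick the same winner."""
    if (remote["ts"], remote["dc"]) > (local["ts"], local["dc"]):
        return remote
    return local

local  = {"doc": {"unread": False}, "ts": 1351652052.120, "dc": "dc-west"}
remote = {"doc": {"unread": True},  "ts": 1351652052.480, "dc": "dc-east"}
print(resolve(local, remote)["doc"])   # the later write (remote) wins
```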

Lessons learned

Dealing with transient failures

Planned upgrades

Slave reads

Storage Devices

– SSDs vs SAS disks

Scaling Cluster Management


Future work

Coprocessors

– Synchronous, Asynchronous

Richer query processing

– Group-by, Aggregation


Key Takeaways

Espresso is a timeline-consistent, document-oriented distributed database.

Feature rich: secondary indexing, transactions over related documents, seamless integration with the data ecosystem.

In production since June 2012, serving several key use-cases.


Questions?
