Espresso: LinkedIn's Distributed Data Serving Platform (Talk)


DESCRIPTION

This talk was given by Swaroop Jagadish (Staff Software Engineer @ LinkedIn) at the ACM SIGMOD/PODS Conference (June 2013). For the paper written by the LinkedIn Espresso Team, go here: http://www.slideshare.net/amywtang/espresso-20952131

TRANSCRIPT

Page 1: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

On Brewing Fresh Espresso: LinkedIn’s Distributed Data Serving Platform

Swaroop Jagadish

http://www.linkedin.com/in/swaroopjagadish

LinkedIn Confidential ©2013 All Rights Reserved

Page 2: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Outline

LinkedIn Data Ecosystem

Espresso: Design Points

Data Model and API

Architecture

Deep Dive: Fault Tolerance

Deep Dive: Secondary Indexing

Espresso In Production

Future work

2

Page 3: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

The World’s Largest Professional Network

225M+ Members Worldwide

2 New Members Per Second

100M+ Monthly Unique Visitors

2M+ Company Pages

Connecting Talent with Opportunity. At scale…

LinkedIn Confidential ©2013 All Rights Reserved 3

Page 4: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

LinkedIn Data Ecosystem

4

Page 5: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Espresso: Key Design Points

• Source-of-truth
  – Master-Slave, timeline consistent
  – Query-after-write
  – Backup/Restore
  – High Availability

• Horizontally Scalable

• Rich functionality
  – Hierarchical data model
  – Document oriented
  – Transactions within a hierarchy
  – Secondary Indexes

5

Page 6: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Espresso: Key Design Points

• Agility – no “pause the world” operations
  – “On the fly” schema evolution
  – Elasticity

• Integration with the data ecosystem
  – Change stream with freshness in O(seconds)
  – ETL to Hadoop
  – Bulk import

• Modular and Pluggable
  – Off-the-shelf: MySQL, Lucene, Avro

6

Page 7: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Data Model and API

7

Page 8: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Application View

8

(Diagram: each document is a key → value pair.)

REST API: /mailbox/msg_meta/bob/2

Page 9: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Partitioning

9

/mailbox/msg_meta/bob/2

MemberId is the partitioning key
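To make the routing concrete, here is a minimal sketch of how a request path could be mapped to a partition and then to a storage node. The hash function, partition count and routing table are illustrative assumptions; the talk only states that MemberId is the partitioning key.

```python
import hashlib

NUM_PARTITIONS = 12                                                       # matches the example layout on later slides
ROUTING_TABLE = {p: f"node{p // 4 + 1}" for p in range(NUM_PARTITIONS)}   # partition -> master node (assumed)

def partition_for(member_id: str) -> int:
    # Hash the partitioning key (the mailbox owner's member id) into a partition.
    digest = hashlib.md5(member_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def route(resource_path: str) -> str:
    # /mailbox/msg_meta/bob/2 -> database "mailbox", table "msg_meta", key "bob", subkey "2"
    _, database, table, member_id, *subkey = resource_path.split("/")
    return ROUTING_TABLE[partition_for(member_id)]

print(route("/mailbox/msg_meta/bob/2"))   # every document under "bob" routes to the same node
```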

Page 10: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Document based data model

Richer than a plain key-value store

Hierarchical keys

Values are rich documents and may contain

nested types

10

Example document:
  from    : { name : "Chris", email : "[email protected]" }
  subject : "Go Giants!"
  body    : "World Series 2012! w00t!"
  unread  : true

Messages key schema:
  mailboxID : String
  messageID : long

Messages value schema:
  from    : { name : String, email : String }
  subject : String
  body    : String
  unread  : boolean

Page 11: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

REST based API

• Secondary index query
  GET /MailboxDB/MessageMeta/bob/?query=“+isUnread:true +isInbox:true”&start=0&count=15

• Partial updates
  POST /MailboxDB/MessageMeta/bob/1
  Content-Type: application/json
  Content-Length: 21

  {“unread” : “false”}

• Conditional operations – get a message only if recently updated
  GET /MailboxDB/MessageMeta/bob/1
  If-Modified-Since: Wed, 31 Oct 2012 02:54:12 GMT
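For illustration, the three calls above can be issued with any HTTP client; a minimal Python sketch follows. The host and port are hypothetical, and the encoding details are assumptions; only the URL patterns, JSON body and header come from the slide.

```python
import requests

BASE = "http://espresso.example.com:12918"   # hypothetical router endpoint

# Secondary-index query: bob's unread inbox messages, first 15 results
resp = requests.get(f"{BASE}/MailboxDB/MessageMeta/bob/",
                    params={"query": "+isUnread:true +isInbox:true", "start": 0, "count": 15})

# Partial update: mark message 1 as read (sends a JSON body, as in the slide)
requests.post(f"{BASE}/MailboxDB/MessageMeta/bob/1", json={"unread": "false"})

# Conditional read: fetch message 1 only if it changed since the given time
resp = requests.get(f"{BASE}/MailboxDB/MessageMeta/bob/1",
                    headers={"If-Modified-Since": "Wed, 31 Oct 2012 02:54:12 GMT"})
print("fresh copy" if resp.status_code == 200 else "not modified")
```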

11

Page 12: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Transactional writes within a hierarchy

MessageCounter (before):

  mboxId | value
  George | { “numUnread”: 2 }

Message:

  mboxId | msgId | value                    | etag
  George | 0     | {…, “unread”: false, …}  | 7abf8091
  George | 1     | {…, “unread”: true, …}   | b648bc5f
  George | 2     | {…, “unread”: true, …}   | 4fde8701

1. Read, record etags:    /Message/George/0        {…, “unread”: false, …}   etag 7abf8091
2. Prepare after-image:   /Message/George/0        {…, “unread”: true, …}
                          /MessageCounter/George   {…, “numUnread”: “+1”, …}
3. Update

MessageCounter (after):

  mboxId | value
  George | { “numUnread”: 3 }
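A sketch of the three steps as they might run on the partition's master. Everything here (the partition object, its methods, the exception) is hypothetical; the slide only establishes that etags are recorded on read, an after-image is prepared, and the group of documents sharing one mailboxID is updated together.

```python
class EtagConflict(Exception):
    """Raised when a document changed between the read and the update."""

def commit_group(partition, recorded_etags, after_images):
    # recorded_etags: {"/Message/George/0": "7abf8091", ...} captured in step 1
    # after_images:   {"/Message/George/0": {..., "unread": True, ...},
    #                  "/MessageCounter/George": {"numUnread": 3}} prepared in step 2
    with partition.write_lock():                       # all keys share mboxId "George"
        for key, etag in recorded_etags.items():       # re-check etags before applying
            if partition.current_etag(key) != etag:
                raise EtagConflict(key)                # a concurrent writer won; caller retries
        partition.apply_atomically(after_images)       # step 3: applied and logged as one unit
```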

Page 13: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Espresso Architecture

13

Pages 14–19: Espresso architecture diagrams (no transcript text).

Page 20: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Cluster Management and Fault Tolerance

20

Page 21: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Generic Cluster Manager: Apache Helix

Generic cluster management
  – State model + constraints
  – Ideal state of distribution of partitions across the cluster
  – Migrate cluster from current state to ideal state

• More Info
  • SoCC 2012
  • http://helix.incubator.apache.org

21

Page 22: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Espresso Partition Layout: Master, Slave

3 Storage Engine nodes, 2-way replication

22

(Apache Helix maintains the partition-to-node map, e.g. Partition P1 → Node 1, …, Partition P12 → Node 3, and each node’s replica states, e.g. Node 1: Master for P1 – Active, Slave for P5 – Active.)

  Node 1: Master P1, P2, P3, P4;    Slave P5, P6, P9, P10
  Node 2: Master P5, P6, P7, P8;    Slave P1, P2, P11, P12
  Node 3: Master P9, P10, P11, P12; Slave P3, P4, P7, P8
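A toy computation of such an ideal state, assuming 12 partitions, 3 nodes and 2-way replication. This is only in the spirit of Helix's ideal state; real Helix placement, and the exact slave placement on this slide, differ.

```python
def ideal_state(num_partitions=12, nodes=("node1", "node2", "node3")):
    # One MASTER and one SLAVE per partition (2-way replication), spread across nodes.
    n = len(nodes)
    per_node = num_partitions // n
    state = {}
    for p in range(1, num_partitions + 1):
        owner = (p - 1) // per_node                         # contiguous master blocks, as on the slide
        state[f"P{p}"] = {nodes[owner]: "MASTER",
                          nodes[(owner + 1) % n]: "SLAVE"}  # one possible even spread of slaves
    return state

print(ideal_state()["P1"])    # {'node1': 'MASTER', 'node2': 'SLAVE'}
```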

Page 23: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Cluster Management

Cluster Expansion

Node Failover

Page 24: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Cluster Expansion

Initial state with 3 storage nodes. Step 1: Compute new ideal state

24

(Layout as on page 22, plus an empty Node 4.)

Page 25: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Cluster Expansion

Step 2: Bootstrap the new node’s partitions by restoring from backups

25

(Node 4 restores P4, P8, P12, P7, P9, P1 from snapshots of the existing nodes.)

Page 26: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Cluster Expansion

Step 3: Catch up from live replication stream

26

(The restored partitions on Node 4 catch up from the live replication stream.)

Page 27: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Cluster Expansion

Step 4: Migrate masters and slaves to rebalance

27

(After rebalancing: Node 1 masters P1, P2, P3 and holds slaves P5, P6, P10; Node 2 masters P5, P6, P7 with slaves P2, P11, P12; Node 3 masters P9, P10, P11 with slaves P3, P4, P8; Node 4 masters P4, P8, P12 with slaves P7, P9, P1.)

Page 28: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Cluster Expansion

Partitions are balanced. The router starts sending traffic to the new node.

28

(Each node now masters three partitions and holds three slave replicas, as in Step 4.)
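The four expansion steps can be summarized as the following pseudo-orchestration. All of the objects and method names here are hypothetical; they are not Helix or Espresso APIs, just a restatement of pages 24 through 28.

```python
def expand_cluster(helix_admin, cluster, new_node):
    # Step 1: compute the new ideal state including the empty node.
    ideal = helix_admin.compute_ideal_state(cluster.nodes + [new_node])

    for partition in ideal.partitions_assigned_to(new_node):
        # Step 2: bootstrap the partition by restoring a snapshot (backup) of it.
        new_node.restore_from_backup(partition)
        # Step 3: catch up on writes made since the snapshot from the live replication stream.
        new_node.catch_up_from_replication_stream(partition)

    # Step 4: migrate masters/slaves to the new ideal state; the router then
    # starts sending traffic for the moved partitions to the new node.
    helix_admin.transition_to(ideal)
```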

Page 29: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Node Failover

• During failure or planned maintenance

29

(Balanced four-node layout from the previous slide; Node 4 is master for P4, P8, P12 and slave for P7, P9, P1.)

Page 30: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Node Failover

• Step 1: Detect Node failure

30

(Same layout; Helix detects that Node 4 has failed.)

Page 31: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Node Failover

• Step 2: Compute new ideal state, promoting slaves to master

31

(The slave replicas of the failed node’s master partitions, P4, P8 and P12, are promoted to master on the surviving nodes.)

Page 32: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Failover Performance

32

Page 33: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Secondary indexing

33

Page 34: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Espresso Secondary Indexing

• Local Secondary Index Requirements

• Read after write

• Consistent with primary data under failure

• Rich query support: match, prefix, range, text search

• Cost-to-serve proportional to working set

• Pluggable Index Implementations

• MySQL B-Tree

• Inverted index using Apache Lucene with MySQL backing store

• Inverted index using Prefix Index

• FastBit-based bitmap index
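A minimal sketch of what a pluggable local index could look like, assuming a simple two-method interface; the class and method names are illustrative, not Espresso's actual plugin API, and the "B-tree" stand-in is just an in-memory toy.

```python
from abc import ABC, abstractmethod

class SecondaryIndex(ABC):
    """Pluggable local index: updated on the primary write path so reads see their own writes."""

    @abstractmethod
    def index(self, key, document): ...

    @abstractmethod
    def query(self, field, predicate): ...   # predicate covers match / prefix / range checks

class InMemoryBTreeLikeIndex(SecondaryIndex):
    """Toy stand-in for the MySQL B-tree option."""

    def __init__(self):
        self.postings = {}                   # field -> list of (value, primary key)

    def index(self, key, document):
        for field, value in document.items():
            self.postings.setdefault(field, []).append((value, key))

    def query(self, field, predicate):
        return [key for value, key in self.postings.get(field, []) if predicate(value)]

idx = InMemoryBTreeLikeIndex()
idx.index(("bob", 1), {"isUnread": True, "subject": "Go Giants!"})
print(idx.query("isUnread", lambda v: v is True))    # [('bob', 1)]
```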

Page 35: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Lucene-based implementation

• Requires the entire index to be memory-resident to support low-latency query response times

• For the Mailbox application, we have two options

Page 36: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Optimizations for the Lucene-based implementation

• Concurrent transactions on the same Lucene index lead to inconsistency
  – Need to acquire a lock

• Opening an index repeatedly is expensive
  – Group commit to amortize the index-opening cost (see the sketch below)

(Diagram: concurrent write requests 1–5 are grouped into a single index write.)
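A sketch of the group-commit idea, assuming a background committer thread and an index object with open/add/commit methods; this is not the Espresso implementation, only an illustration of amortizing the index-open and commit cost over a batch of concurrent writes.

```python
import queue
import threading

class GroupCommitIndexWriter:
    def __init__(self, index, max_batch=64):
        self.index = index                    # assumed to expose open(), add(doc), commit()
        self.requests = queue.Queue()         # (document, done-event) pairs from writer threads
        self.max_batch = max_batch
        threading.Thread(target=self._commit_loop, daemon=True).start()

    def write(self, doc):
        done = threading.Event()
        self.requests.put((doc, done))
        done.wait()                           # returns once the batch containing doc is committed

    def _commit_loop(self):
        while True:
            batch = [self.requests.get()]     # block for the first pending write
            while len(batch) < self.max_batch:
                try:
                    batch.append(self.requests.get_nowait())   # grab whatever else queued up
                except queue.Empty:
                    break
            self.index.open()                 # open the index once per batch, not per request
            for doc, _ in batch:
                self.index.add(doc)
            self.index.commit()               # single commit covers the whole group
            for _, done in batch:
                done.set()
```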

Page 37: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Optimizations for the Lucene-based implementation

• High-value users of the site accumulate large mailboxes
  – Query performance degrades with a large index
  – Performance shouldn’t get worse with more usage!

• Time-partitioned indexes: partition the index into buckets based on created time (see the sketch below)
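A sketch of time-partitioned indexing, assuming monthly buckets keyed by message-creation time; the bucket granularity and the early-exit query loop are assumptions, the point is only that recent buckets are searched first and old buckets stay cold.

```python
from collections import defaultdict
from datetime import datetime

class TimePartitionedIndex:
    def __init__(self):
        self.buckets = defaultdict(list)              # "YYYY-MM" -> list of documents

    def add(self, created: datetime, doc: dict):
        self.buckets[created.strftime("%Y-%m")].append(doc)

    def query(self, predicate, count: int):
        results = []
        for bucket in sorted(self.buckets, reverse=True):   # newest bucket first
            for doc in self.buckets[bucket]:
                if predicate(doc):
                    results.append(doc)
                    if len(results) == count:
                        return results                       # stop before touching older buckets
        return results
```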

Page 38: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Espresso in Production

38

Page 39: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Espresso in Production

Unified Social Content Platform – social activity aggregation

High read:write ratio

39

Page 40: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Espresso in Production

InMail – allows members to communicate with each other

Large storage footprint

Low-latency requirement for secondary index queries involving text search and relational predicates

40

Page 41: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Performance

Average failover latency with 1024 partitions is around 300 ms

Primary data reads and writes, for a single storage node on SSD, average row size = 1 KB:

41

  Operation | Average Latency | Average Throughput
  Reads     | ~3 ms           | 40,000 per second
  Writes    | ~6 ms           | 20,000 per second

Page 42: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Performance

Partition-key-level secondary index using Lucene
One index per mailbox use case
Base data on SAS, indexes on SSDs
Average throughput per index = ~1,000 per second (after the group commit and partitioned-index optimizations)

42

  Operation                             | Average Latency
  Queries (average of 5 indexed fields) | ~20 ms
  Writes (around 30 indexed fields)     | ~20 ms

Page 43: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Durability and Consistency

Within a Data Center

Across Data Centers

Page 44: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Durability and Consistency

Within a Data Center
  – Write latency vs. durability

• Asynchronous replication
  – May lead to data loss
  – Tooling can mitigate some of this

• Semi-synchronous replication
  – Wait for at least one relay to acknowledge
  – During failover, slaves wait for catch-up

• Consistency over availability
  – Helix selects the slave with the least replication lag to take over mastership (see the sketch below)
  – Failover time is ~300 ms in practice
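A sketch of the promotion choice, assuming each replica reports the last replicated sequence number; the field names are illustrative, not Espresso's or Helix's actual metadata.

```python
def pick_new_master(replicas):
    # replicas: [{"node": "node3", "state": "SLAVE", "replicated_seq": 10452}, ...]
    slaves = [r for r in replicas if r["state"] == "SLAVE"]
    # Least replication lag == highest replicated sequence number; the chosen slave
    # still waits for catch-up from the relay before taking over mastership.
    return max(slaves, key=lambda r: r["replicated_seq"])

print(pick_new_master([
    {"node": "node2", "state": "SLAVE", "replicated_seq": 10440},
    {"node": "node3", "state": "SLAVE", "replicated_seq": 10452},
])["node"])   # node3
```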

Page 45: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Durability and Consistency

Across data centers

– Asynchronous replication

– Stale reads possible

– Active-active: Conflict resolution via last-writer-wins
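A sketch of last-writer-wins resolution for the active-active case, assuming each replicated write carries a wall-clock timestamp and an origin data-center id as a tie-breaker; these field names are assumptions.

```python
def resolve(local: dict, remote: dict) -> dict:
    # Keep whichever version was written last; break timestamp ties deterministically
    # by origin data center so both sides converge to the same winner.
    if (remote["ts"], remote["origin_dc"]) > (local["ts"], local["origin_dc"]):
        return remote
    return local

local  = {"ts": 1351652052, "origin_dc": "dc1", "doc": {"unread": False}}
remote = {"ts": 1351652060, "origin_dc": "dc2", "doc": {"unread": True}}
print(resolve(local, remote)["origin_dc"])   # dc2 wins: it wrote later
```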

Page 46: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Lessons learned

Dealing with transient failures

Planned upgrades

Slave reads

Storage Devices

– SSDs vs SAS disks

Scaling Cluster Management

46

Page 47: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Future work

Coprocessors

– Synchronous, Asynchronous

Richer query processing

– Group-by, Aggregation

47

Page 48: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Key Takeaways

Espresso is a timeline-consistent, document-oriented distributed database

Feature rich: secondary indexing, transactions over related documents, seamless integration with the data ecosystem

In production since June 2012, serving several key use-cases

48

Page 49: Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

49

Questions?