Harvesting the Power of Samza in LinkedIn's Feed


Upload: mohamed-mahmoud

Post on 15-Apr-2017


TRANSCRIPT

Page 1: Harvesting the Power of Samza in LinkedIn's Feed

Harvesting the Power of Samza in News Feed

Providing fresh and relevant content to hundreds of millions of members

Page 2: Harvesting the Power of Samza in LinkedIn's Feed

Prerequisites: A Few Things Mentioned Here

1. Samza
2. RocksDB (a key-value store)
3. SerDe (Serializer/Deserializer)
4. Kafka (a distributed messaging system)
5. Java

Page 3: Harvesting the Power of Samza in LinkedIn's Feed

The Challenge

Page 4: Harvesting the Power of Samza in LinkedIn's Feed

Relevant content is a great way to stay informed about your professional interests; fresh relevant content is even better!

How do we keep track of what hundreds of millions of members viewed on their News Feeds?

Page 5: Harvesting the Power of Samza in LinkedIn's Feed

Tracking

Page 6: Harvesting the Power of Samza in LinkedIn's Feed

Scale: News Feed is the Landing Page for Most Members

Source: investors.linkedin.com (1: as of quarter end; 2: monthly average during the quarter)

Page 7: Harvesting the Power of Samza in LinkedIn's Feed

Client-Side Tracking

• Lightweight events that track what the member viewed
• Tiny payload (bandwidth-friendly)
• Events end up in a Kafka topic

Page 8: Harvesting the Power of Samza in LinkedIn's Feed

Server-Side Tracking

• Events that have more data about served feeds
• Rich payload
• Events end up in a Kafka topic

Page 9: Harvesting the Power of Samza in LinkedIn's Feed

Improving Member Experience Using Samza (Overview)

1. Join input streams: a stream-stream join task buffers events from both streams; matches are sent to an output Kafka stream
2. Purge stale events: a custom TTL mechanism reaps stale events every n seconds
3. Consume output stream: convert the rich data about impressions into machine learning features used for ranking items in the News Feed

Page 10: Harvesting the Power of Samza in LinkedIn's Feed

1. Join

Page 11: Harvesting the Power of Samza in LinkedIn's Feed

Overview

Client Events → Process Client Events → Output Events

Server Events → Process Server Events → Output Events

Page 12: Harvesting the Power of Samza in LinkedIn's Feed

Client-Side Events Processor Overview

Is the ID in the server-side events store?

• Yes: match events, then output to Kafka
• No: store (ID, const.)

Page 13: Harvesting the Power of Samza in LinkedIn's Feed

Client-Side Events Processor: Optimizations

• Initial capacity of the matches map (event, matched IDs) is determined by a metric (GC-friendly)
• Initial capacity of each value set is equal to |IDs|
• An empty byte array is stored in RocksDB as a dummy value for each ID (it passes through the no-op byte array SerDe), so the store acts as a set
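
As a rough illustration of the client-side flow and the optimizations listed here, the following is a minimal Java sketch. Plain in-memory maps stand in for the RocksDB-backed stores, and every name in it (ClientSideProcessorSketch, serverStore, clientStore, expectedMatchedEvents) is a hypothetical placeholder rather than the job's actual code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: plain maps stand in for the RocksDB-backed Samza stores. */
final class ClientSideProcessorSketch {
    // Shared empty value: the client-side store only needs keys, so it behaves like a set.
    private static final byte[] EMPTY = new byte[0];

    // Assumed layout: item ID -> name of the server-side event that served it.
    final Map<String, String> serverStore = new HashMap<>();
    // Assumed layout: item ID -> empty byte array (dummy value).
    final Map<String, byte[]> clientStore = new HashMap<>();

    // Stand-in for the metric-driven initial capacity of the matches map.
    private final int expectedMatchedEvents;

    ClientSideProcessorSketch(int expectedMatchedEvents) {
        this.expectedMatchedEvents = expectedMatchedEvents;
    }

    /** Returns server event -> matched IDs; IDs without a match are buffered. */
    Map<String, Set<String>> process(List<String> clientIds) {
        Map<String, Set<String>> matches = new HashMap<>(expectedMatchedEvents);
        for (String id : clientIds) {
            String serverEvent = serverStore.get(id);
            if (serverEvent != null) {
                // Yes branch: record the match; value sets start with capacity |IDs|.
                matches.computeIfAbsent(serverEvent, e -> new HashSet<>(clientIds.size())).add(id);
            } else {
                // No branch: remember the ID until the server-side event arrives.
                clientStore.put(id, EMPTY);
            }
        }
        return matches;
    }
}
```

In the real job the returned matches would be sent to the output Kafka stream; the sketch simply returns them to the caller.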

Page 14: Harvesting the Power of Samza in LinkedIn's Feed

Server-Side Events Processor Overview

Is the ID in the client-side events store?

• Yes: match events, then output to Kafka
• No: store (ID, event)

Page 15: Harvesting the Power of Samza in LinkedIn's Feed

Event Anatomy

• Header (shared event data)
• List of payloads (one for each item)
• Each payload has a join key (ID)

Example event: Shared Event Data (e.g. member ID), followed by payloads with IDs 123, 456, 789, 012, 345, 678, 901, 234
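
To make the anatomy above concrete, here is a small Java sketch of such an event: one shared header plus a list of payloads, each carrying a join key. The class and field names (FeedEventSketch, Payload, memberId) are assumptions made for illustration; the deck does not show the real schema.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical event shape: a shared header plus one payload per feed item. */
final class FeedEventSketch {
    /** One payload per item; the ID is the join key. */
    static final class Payload {
        final String id;

        Payload(String id) {
            this.id = id;
        }
    }

    // Header: data shared by all payloads (the slide's example is a member ID).
    final String memberId;
    final List<Payload> payloads;

    FeedEventSketch(String memberId, List<Payload> payloads) {
        this.memberId = memberId;
        this.payloads = Collections.unmodifiableList(new ArrayList<>(payloads));
    }

    /** Collects the join keys of all payloads, in payload order. */
    List<String> joinKeys() {
        List<String> keys = new ArrayList<>(payloads.size());
        for (Payload payload : payloads) {
            keys.add(payload.id);
        }
        return keys;
    }
}
```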

Page 16: Harvesting the Power of Samza in LinkedIn's Feed

Server-Side Events Storage

[Figure: each payload ID (123, 456, 789, 012, 345, 678, 901, 234) maps to the same UUID (1A67343FE…83B), and that UUID maps to a single stored copy of the event (shared event data plus its payloads).]

Page 17: Harvesting the Power of Samza in LinkedIn's Feed

Server-Side Events Storage: ManyKeysToOneValueStore<K, V>

• Space-efficient
• Insertion is transactional
• Rolling back a transaction is a best-effort thing
• Requires an additional lookup (but it’s worth it)
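
One way to picture a many-keys-to-one-value store is two maps: every key points at a shared handle (a UUID), and the handle points at the single stored value. The sketch below follows that idea with in-memory maps; the class name, method names, and rollback details are assumptions, since the deck does not show the actual ManyKeysToOneValueStore<K, V> implementation.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

/** Hypothetical sketch: many keys share one stored value through a UUID handle. */
final class ManyKeysToOneValueStoreSketch<K, V> {
    private final Map<K, String> keyToHandle = new HashMap<>();
    private final Map<String, V> handleToValue = new HashMap<>();

    /** Stores the value once and points every key at it. */
    void putAll(Collection<K> keys, V value) {
        String handle = UUID.randomUUID().toString();
        List<K> written = new ArrayList<>(keys.size());
        try {
            handleToValue.put(handle, value);
            for (K key : keys) {
                keyToHandle.put(key, handle);
                written.add(key);
            }
        } catch (RuntimeException e) {
            // Best-effort rollback: undo what was written; errors while undoing are ignored.
            for (K key : written) {
                try {
                    keyToHandle.remove(key);
                } catch (RuntimeException ignored) {
                    // Nothing more can be done for this key.
                }
            }
            try {
                handleToValue.remove(handle);
            } catch (RuntimeException ignored) {
                // Nothing more can be done for the value.
            }
            throw e;
        }
    }

    /** Two lookups: key -> handle, then handle -> value. Returns null if absent. */
    V get(K key) {
        String handle = keyToHandle.get(key);
        return handle == null ? null : handleToValue.get(handle);
    }

    /** Number of distinct values actually stored. */
    int storedValues() {
        return handleToValue.size();
    }
}
```

The extra lookup in get is the cost of storing a rich server-side event once instead of once per ID.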

Page 18: Harvesting the Power of Samza in LinkedIn's Feed

Event Matching

Client-Side Event: IDs 789, 012, 345, 678, 901, 234, 123, 456

Server-Side Event A: IDs 111, 456, 906, 678, 901, 431, 746

Server-Side Event B: IDs 234, 012, 123, 100, 313, 345, 333

Output Event A: IDs 901, 456, 678

Output Event B: IDs 012, 123, 345, 234
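
The matching example on this slide can be reproduced with a few lines of Java. This is only a sketch under simplifying assumptions: both server-side events are already buffered when the client-side event arrives, events are reduced to lists of IDs, and the class and method names are made up for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch of grouping a client event's IDs by the server event that served them. */
final class EventMatchingSketch {
    static Map<String, SortedSet<String>> match(Map<String, List<String>> serverEvents,
                                                List<String> clientIds) {
        // Index: item ID -> name of the server-side event containing it.
        Map<String, String> idToServerEvent = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : serverEvents.entrySet()) {
            for (String id : entry.getValue()) {
                idToServerEvent.put(id, entry.getKey());
            }
        }
        // One output event per server-side event that has at least one viewed ID.
        Map<String, SortedSet<String>> output = new TreeMap<>();
        for (String id : clientIds) {
            String serverEvent = idToServerEvent.get(id);
            if (serverEvent != null) {
                output.computeIfAbsent(serverEvent, k -> new TreeSet<>()).add(id);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        Map<String, List<String>> server = new TreeMap<>();
        server.put("A", Arrays.asList("111", "456", "906", "678", "901", "431", "746"));
        server.put("B", Arrays.asList("234", "012", "123", "100", "313", "345", "333"));
        List<String> client = Arrays.asList("789", "012", "345", "678", "901", "234", "123", "456");
        System.out.println(match(server, client));
    }
}
```

With the slide's data, event A matches IDs 456, 678, and 901, and event B matches 012, 123, 234, and 345, the same IDs as Output Events A and B; ID 789 has no server-side counterpart yet and stays unmatched.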

Page 19: Harvesting the Power of Samza in LinkedIn's Feed

Contributions to Samza: [SAMZA-647] Key-Value Store

• The access pattern is getAll(List<K>)
• RocksDB supports multiGet, which is faster than get
• Added that support to Samza's KeyValueStore
• Perf test results are consistent with RocksDB's (with caching disabled)
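
The idea behind a batched read can be sketched without any Samza or RocksDB dependency. The interface below is a hypothetical stand-in, not Samza's KeyValueStore: it offers a getAll with a fallback that issues one get per key, which a RocksDB-backed implementation could override with a single batched lookup.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical store interface used only for this sketch. */
interface BatchReadableStore<K, V> {
    V get(K key);

    /** Fallback: one get per key. A batched backend would override this. */
    default Map<K, V> getAll(List<K> keys) {
        Map<K, V> result = new HashMap<>(keys.size());
        for (K key : keys) {
            result.put(key, get(key)); // missing keys map to null
        }
        return result;
    }
}

/** In-memory implementation that relies on the fallback getAll. */
final class InMemoryBatchStore<K, V> implements BatchReadableStore<K, V> {
    private final Map<K, V> data = new HashMap<>();

    void put(K key, V value) {
        data.put(key, value);
    }

    @Override
    public V get(K key) {
        return data.get(key);
    }
}
```

Callers that already hold a list of IDs, as the join processors do, can then request them in one call and let the store decide how to batch.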

Page 20: Harvesting the Power of Samza in LinkedIn's Feed

2. TTL

Page 21: Harvesting the Power of Samza in LinkedIn's Feed

Custom TTL Mechanism

• Records the timestamp of when an event was stored
• The “death row” store: the key is the timestamp and the value is an ID
• Because the key is a timestamp, collisions occur

Collision handling: generate a timestamp; if the bucket is free, put(timestamp, ID); if the bucket is taken, retry while attempts <= max (a separate branch handles attempts > max)

Page 22: Harvesting the Power of Samza in LinkedIn's Feed

Linear Probing Timestamper

TTL calculation is not mission-critical (currentTimeMillis() is not very precise anyway); events get deleted in the next window

Keeping it simple and stupid works
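
A linear-probing timestamper might look like the Java sketch below. It assumes that a taken bucket is resolved by trying the next millisecond and that the caller is told when the attempt limit is reached; the deck shows neither detail, and the class, field, and method names are hypothetical.

```java
import java.util.NavigableMap;
import java.util.OptionalLong;
import java.util.TreeMap;

/** Hypothetical sketch of resolving timestamp collisions by linear probing. */
final class LinearProbingTimestamperSketch {
    // "Death row": timestamp -> ID, kept in key order (an in-memory stand-in for the store).
    final NavigableMap<Long, String> deathRow = new TreeMap<>();
    private final int maxAttempts;

    LinearProbingTimestamperSketch(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    /** Returns the timestamp actually used, or empty if every probed bucket was taken. */
    OptionalLong put(long nowMillis, String id) {
        long candidate = nowMillis;
        for (int attempts = 1; attempts <= maxAttempts; attempts++) {
            if (deathRow.putIfAbsent(candidate, id) == null) {
                return OptionalLong.of(candidate); // bucket was free
            }
            candidate++; // bucket taken: probe the next millisecond (assumed strategy)
        }
        return OptionalLong.empty(); // attempts > max: give up (assumed behavior)
    }
}
```

Shifting an entry by a few milliseconds is harmless here because, as noted above, the TTL does not need to be precise.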

Page 23: Harvesting the Power of Samza in LinkedIn's Feed

Reapers

Every n seconds:
  Get death rows (t < now – TTL)
  For each entry in death row:
    Remove from core stores
    Remove from death row

Page 24: Harvesting the Power of Samza in LinkedIn's Feed

Reapers: Optimizations

• Keys (timestamps) are stored in order
• A range query (0, now – TTL) is much faster than a full scan (testing all values)
• Even though TTL is on the order of minutes/hours, reaping stale events happens every 10 seconds (the window method is blocking)
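
The reaper and its range-query optimization can be sketched in Java with a sorted map standing in for the ordered death row store. This is an illustration under assumptions: the names are hypothetical, a single map plays the role of the core stores, and in the actual job this work would run from the task's periodic window method.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical sketch of a reaper that removes entries older than the TTL. */
final class ReaperSketch {
    // Death row: timestamp -> ID, sorted by timestamp.
    final NavigableMap<Long, String> deathRow = new TreeMap<>();
    // Stand-in for the core stores that hold buffered events by ID.
    final Map<String, String> coreStore = new HashMap<>();
    private final long ttlMillis;

    ReaperSketch(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Removes every entry with timestamp < now - TTL; returns how many were reaped. */
    int reap(long nowMillis) {
        // Range query on the sorted keys: only stale entries are visited.
        NavigableMap<Long, String> stale = deathRow.headMap(nowMillis - ttlMillis, false);
        int reaped = stale.size();
        for (String id : stale.values()) {
            coreStore.remove(id); // remove from core stores
        }
        stale.clear(); // the view is backed by deathRow, so this removes from death row
        return reaped;
    }
}
```

Because the keys are ordered, each run only touches the stale prefix of the death row, which keeps frequent reaping cheap.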

Page 25: Harvesting the Power of Samza in LinkedIn's Feed

Stats


Page 26: Harvesting the Power of Samza in LinkedIn's Feed

RocksDB Get All vs. Get Performance: [SAMZA-647] getAll is 23% Faster

Page 27: Harvesting the Power of Samza in LinkedIn's Feed

Timestamp Collision Resolution Metrics


Page 28: Harvesting the Power of Samza in LinkedIn's Feed

The Most Important Metric


Page 29: Harvesting the Power of Samza in LinkedIn's Feed

Billions of messages handled by the job every day

Page 30: Harvesting the Power of Samza in LinkedIn's Feed

Find out more: blog.linkedin.com | linkedin.com/in/elgeish

©2015 LinkedIn Corporation. All Rights Reserved.