The "Big Data" Ecosystem at LinkedIn

The "Big Data" Ecosystem at LinkedIn SIGMOD 2013 Roshan Sumbaly, Jay Kreps, & Sam Shah June 2013

Upload: sshah

Post on 11-May-2015

Category:

Technology


DESCRIPTION

[This work was presented at SIGMOD'13.] The use of large-scale data mining and machine learning has proliferated through the adoption of technologies such as Hadoop, with its simple programming semantics and rich and active ecosystem. This paper presents LinkedIn's Hadoop-based analytics stack, which allows data scientists and machine learning researchers to extract insights and build product features from massive amounts of data. In particular, we present our solutions to the "last mile" issues in providing a rich developer ecosystem. This includes easy ingress from and egress to online systems, and managing workflows as production processes. A key characteristic of our solution is that these distributed system concerns are completely abstracted away from researchers. For example, deploying data back into the online system is simply a 1-line Pig command that a data scientist can add to the end of their script. We also present case studies on how this ecosystem is used to solve problems ranging from recommendations to news feed updates to email digesting to descriptive analytical dashboards for our members.

TRANSCRIPT

Page 1: The "Big Data" Ecosystem at LinkedIn

The "Big Data" Ecosystem at LinkedIn
SIGMOD 2013
Roshan Sumbaly, Jay Kreps, & Sam Shah
June 2013

Page 2: The "Big Data" Ecosystem at LinkedIn

©2012 LinkedIn Corporation. All Rights Reserved.

LinkedIn: the professional profile of record

225M Members, 225M Member Profiles

Page 3: The "Big Data" Ecosystem at LinkedIn

Applications

Page 4: The "Big Data" Ecosystem at LinkedIn

Application examples

• People You May Know (2 people)
• Year In Review Email (1 person, 1 month)
• Skills and Endorsements (2 people)
• Network Updates Digest (1 person, 3 months)
• Who’s Viewed My Profile (2 people)
• Collaborative Filtering (1 person)
• Related Searches (1 person, 3 months)
• and more…

Page 5: The "Big Data" Ecosystem at LinkedIn

Skill sets

Page 6: The "Big Data" Ecosystem at LinkedIn

©2013 LinkedIn Corporation. All Rights Reserved.

Rich Hadoop-based ecosystem

Page 7: The "Big Data" Ecosystem at LinkedIn

“Last mile” problems

• Ingress – Moving data from online to offline system
• Workflow management – Managing offline processes
• Egress – Moving results from offline to online systems
  – Key/Value
  – Streams
  – OLAP

Page 8: The "Big Data" Ecosystem at LinkedIn

Application examples

• People You May Know (2 people)
• Year In Review Email (1 person, 1 month)
• Skills and Endorsements (2 people)
• Network Updates Digest (1 person, 3 months)
• Who’s Viewed My Profile (2 people)
• Collaborative Filtering (1 person)
• Related Searches (1 person, 3 months)
• and more…

Page 9: The "Big Data" Ecosystem at LinkedIn

People You May Know

Page 10: The "Big Data" Ecosystem at LinkedIn

People You May Know – Workflow

• Perform triangle closing for all members
• Rank by discounting previously shown recommendations
• Push recommendations to online service

Inputs: Connection stream, Impression stream

[Figure: Triangle closing]
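The triangle-closing step can be sketched in a few lines. This is a minimal in-memory illustration, not LinkedIn's Hadoop job: the function name `triangle_closing`, the edge-list input, and the choice to score each candidate by its number of common connections are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def triangle_closing(connections):
    """Return {member: Counter({candidate: number_of_common_connections})}.

    `connections` is an iterable of undirected (a, b) pairs (hypothetical input).
    """
    # Build an undirected adjacency map from the connection pairs.
    graph = defaultdict(set)
    for a, b in connections:
        graph[a].add(b)
        graph[b].add(a)

    candidates = {}
    for member, friends in graph.items():
        counts = Counter()
        for friend in friends:
            for friend_of_friend in graph[friend]:
                # Skip the member itself and anyone already connected.
                if friend_of_friend != member and friend_of_friend not in friends:
                    counts[friend_of_friend] += 1
        candidates[member] = counts
    return candidates


connections = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
scores = triangle_closing(connections)
print(scores[1])  # member 4 closes two triangles for member 1
```

Members 1 and 4 share two connections (2 and 3), so 4 is the strongest candidate for 1 in this toy graph.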

Page 11: The "Big Data" Ecosystem at LinkedIn

“Last mile” problems

• Ingress – Moving data from online to offline system
• Workflow management – Managing offline processes
• Egress – Moving results from offline to online systems
  – Key/Value
  – Streams
  – OLAP

Page 12: The "Big Data" Ecosystem at LinkedIn

Ingress - O(n²) data integration complexity

• Point to point
• Fragile, delayed and potentially lossy
• Non-standardized

Page 13: The "Big Data" Ecosystem at LinkedIn

Ingress - O(n) data integration
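The difference between the two slides is easy to see with a little arithmetic. The sketch below assumes that point-to-point integration needs one directed pipeline per ordered pair of systems, and that a central pipeline needs one integration per system; both function names are made up for the illustration.

```python
def point_to_point_pipelines(n):
    # Assumption: every system needs its own pipeline to every other system.
    return n * (n - 1)


def central_pipeline_integrations(n):
    # Assumption: every system integrates once with the shared pipeline.
    return n


for n in (5, 10, 50):
    print(n, point_to_point_pipelines(n), central_pipeline_integrations(n))
```

Under these assumptions, 10 systems need 90 point-to-point pipelines but only 10 integrations with a central pipeline, and the gap widens quadratically as systems are added.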

Page 14: The "Big Data" Ecosystem at LinkedIn

Ingress – Kafka

• Distributed and elastic
  – Multi-broker system
• Categorized topics
  – “PeopleYouMayKnowTopic”
  – “ConnectionUpdateTopic”
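To make the idea of categorized topics concrete, here is a toy in-memory stand-in for a topic-based log. It is not the Kafka client API: the class `TinyBroker`, its methods, and the message payloads are hypothetical, and a real deployment spreads topics across multiple brokers. Only the topic names come from the slide.

```python
from collections import defaultdict


class TinyBroker:
    """In-memory sketch of a topic-based message log (not the Kafka API)."""

    def __init__(self):
        # One append-only list of messages per topic.
        self._logs = defaultdict(list)

    def publish(self, topic, message):
        log = self._logs[topic]
        log.append(message)
        return len(log) - 1  # offset of the message just written

    def read(self, topic, offset, max_messages=100):
        # Consumers track their own offset and pull from it.
        return self._logs[topic][offset:offset + max_messages]


broker = TinyBroker()
broker.publish("ConnectionUpdateTopic", {"source": 1213, "dest": 1734})
broker.publish("ConnectionUpdateTopic", {"source": 1213, "dest": 1523})
broker.publish("PeopleYouMayKnowTopic", {"member": 1213, "shown": 6332})

offset = 0
batch = broker.read("ConnectionUpdateTopic", offset)
offset += len(batch)  # the consumer advances its own position
```

Because each consumer keeps its own offset, an offline loader and an online service can read the same topic independently, which is what lets a single pipeline serve many downstream systems.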

Page 15: The "Big Data" Ecosystem at LinkedIn

Ingress

• Standardized schemas
  – Avro
  – Central repository
  – Programmatic compatibility
• Audited ETL to Hadoop
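A programmatic compatibility check might look roughly like the sketch below. It is a deliberate simplification and does not use the Avro library: schemas are plain dictionaries in the shape of Avro record definitions, the function `can_read` is hypothetical, and the rule applied (a new field must carry a default, an existing field must keep its type) ignores the type promotions that Avro's real schema resolution allows.

```python
def can_read(new_schema, old_schema):
    """Return True if data written with old_schema is readable with new_schema.

    Simplified rule set (an assumption, stricter than Avro's own resolution).
    """
    old_fields = {field["name"]: field for field in old_schema["fields"]}
    for field in new_schema["fields"]:
        old = old_fields.get(field["name"])
        if old is None:
            # A field unknown to old writers needs a default value.
            if "default" not in field:
                return False
        elif old["type"] != field["type"]:
            # Changing the type of an existing field is rejected here.
            return False
    return True


v1 = {"type": "record", "name": "ConnectionUpdate", "fields": [
    {"name": "source", "type": "long"},
    {"name": "dest", "type": "long"},
]}
v2_ok = {"type": "record", "name": "ConnectionUpdate", "fields": [
    {"name": "source", "type": "long"},
    {"name": "dest", "type": "long"},
    {"name": "channel", "type": "string", "default": "unknown"},
]}
v2_bad = {"type": "record", "name": "ConnectionUpdate", "fields": [
    {"name": "source", "type": "string"},
    {"name": "dest", "type": "long"},
]}

print(can_read(v2_ok, v1), can_read(v2_bad, v1))
```

A central repository could run a check of this kind whenever a producer registers a new schema version, rejecting changes that would break existing readers.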

Page 16: The "Big Data" Ecosystem at LinkedIn

“Last mile” problems

• Ingress – Moving data from online to offline system
• Workflow management – Managing offline processes
• Egress – Moving results from offline to online systems
  – Key/Value
  – Streams
  – OLAP

Page 17: The "Big Data" Ecosystem at LinkedIn

People You May Know – Workflow

• Perform triangle closing for all members
• Rank by discounting previously shown recommendations
• Push recommendations to online service

Inputs: Connection stream, Impression stream

Page 18: The "Big Data" Ecosystem at LinkedIn

People You May Know – Workflow (in reality)

Page 19: The "Big Data" Ecosystem at LinkedIn

Workflow Management - Azkaban

• Dependency management
  – Historical logs
• Diverse job types
  – Pig, Hive, Java
• Scheduling
• Monitoring
• Visualization
• Configuration
• Retry/restart on failure
• Resource locking
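Dependency management and retry on failure can be illustrated with a small scheduler. This is a sketch of the general idea only, not Azkaban's implementation or configuration format: `run_workflow`, the job names, and the retry policy are assumptions, and the ordering relies on Python's standard `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter


def run_workflow(jobs, dependencies, max_attempts=2):
    """Run jobs in dependency order, retrying each failed job.

    jobs: name -> callable; dependencies: name -> set of upstream job names.
    Returns a history of (job, attempt, status) tuples.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    history = []
    for name in order:
        for attempt in range(1, max_attempts + 1):
            try:
                jobs[name]()
                history.append((name, attempt, "ok"))
                break
            except Exception:
                history.append((name, attempt, "failed"))
        else:
            # Every attempt failed: stop before running downstream jobs.
            raise RuntimeError(f"job {name!r} failed after {max_attempts} attempts")
    return history


attempts = {"rank": 0}


def flaky_rank():
    # Fails on the first call only, to exercise the retry path.
    attempts["rank"] += 1
    if attempts["rank"] == 1:
        raise RuntimeError("transient failure")


jobs = {"triangle_closing": lambda: None, "rank": flaky_rank, "push": lambda: None}
dependencies = {"rank": {"triangle_closing"}, "push": {"rank"}}
history = run_workflow(jobs, dependencies)
print(history)
```

The returned history plays the role of the historical log: it records which jobs ran, in what order, and how many attempts each one needed.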

Page 20: The "Big Data" Ecosystem at LinkedIn

People You May Know – Workflow

• Perform triangle closing for all members
• Rank by discounting previously shown recommendations
• Push recommendations to online service

Inputs: Connection stream, Impression stream

Example output:
Member Id 1213 => [ Recommended member id 1734, Recommended member id 1523, …, Recommended member id 6332 ]
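The ranking step and the resulting key/value record can be sketched as follows. The slides do not say how previously shown recommendations are discounted, so the exponential decay used here, the `discount` parameter, the raw scores, and the impression counts are all assumptions; only the member ids are taken from the slide.

```python
def rank_with_discount(scores, impressions, discount=0.5):
    """Order candidates by score, penalizing ones already shown.

    scores: candidate -> raw score; impressions: candidate -> times shown.
    Assumption: each earlier impression multiplies the score by `discount`.
    """
    adjusted = {
        candidate: score * (discount ** impressions.get(candidate, 0))
        for candidate, score in scores.items()
    }
    # Highest adjusted score first; ties broken by id for a stable order.
    return sorted(adjusted, key=lambda c: (-adjusted[c], c))


scores = {1734: 3.0, 1523: 4.0, 6332: 1.0}   # hypothetical triangle-closing scores
impressions = {1523: 2}                      # 1523 was already shown twice
record = {1213: rank_with_discount(scores, impressions)}
print(record)
```

Candidate 1523 has the highest raw score, but two earlier impressions reduce it from 4.0 to 1.0, so 1734 is ranked first in the record stored for member 1213.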

Page 21: The "Big Data" Ecosystem at LinkedIn

“Last mile” problems

• Ingress – Moving data from online to offline system
• Workflow management – Managing offline processes
• Egress – Moving results from offline to online systems
  – Key/Value
  – Streams
  – OLAP

Page 22: The "Big Data" Ecosystem at LinkedIn

Egress – Key/Value

• Voldemort
  – Based on Amazon’s Dynamo
• Distributed and elastic
• Horizontally scalable
• Bulk load pipeline from Hadoop
• Simple to use

store results into ‘url’ using KeyValue(‘member_id’)
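One way to picture a bulk load into a partitioned key/value store is to group the offline results by the node that will serve each key. The sketch below is a simplification and does not reproduce Voldemort's actual partitioning or file format: the MD5-based `partition_for`, the `partition_to_node` table, and the node names are assumptions for illustration.

```python
import hashlib
from collections import defaultdict


def partition_for(key, num_partitions):
    # Deterministic hash so that every run maps a key to the same partition.
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def build_node_batches(records, partition_to_node):
    """Group (key, value) pairs by destination node, sorted by key."""
    batches = defaultdict(list)
    for key, value in records.items():
        partition = partition_for(key, len(partition_to_node))
        batches[partition_to_node[partition]].append((key, value))
    for batch in batches.values():
        batch.sort()  # keys are unique, so this orders each batch by key
    return dict(batches)


records = {1213: [1734, 1523, 6332], 1734: [1213], 1523: [1213, 6332]}
partition_to_node = ["node-a", "node-b", "node-a", "node-b"]  # hypothetical layout
batches = build_node_batches(records, partition_to_node)
print(batches)
```

Preparing per-node batches offline means the serving nodes only have to fetch and swap in finished data, which keeps the heavy work on the Hadoop side.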

Page 23: The "Big Data" Ecosystem at LinkedIn

People You May Know - Summary

Page 24: The "Big Data" Ecosystem at LinkedIn

Application examples

• People You May Know (2 people)
• Year In Review Email (1 person, 1 month)
• Skills and Endorsements (2 people)
• Network Updates Digest (1 person, 3 months)
• Who’s Viewed My Profile (2 people)
• Collaborative Filtering (1 person)
• Related Searches (1 person, 3 months)
• and more…

Page 25: The "Big Data" Ecosystem at LinkedIn

Year In Review Email

Page 26: The "Big Data" Ecosystem at LinkedIn

Year In Review Email

-- Members whose position started within the target date range
memberPosition = LOAD '$latest_positions' USING BinaryJSON;
memberWithPositionsChangedLastYear = FOREACH (
    FILTER memberPosition BY ((start_date >= $start_date_low) AND (start_date <= $start_date_high))
  ) GENERATE member_id, start_date, end_date;

-- Connections (source -> dest) whose dest changed position
allConnections = LOAD '$latest_bidirectional_connections' USING BinaryJSON;
allConnectionsWithChange_nondistinct = FOREACH (
    JOIN memberWithPositionsChangedLastYear BY member_id, allConnections BY dest
  ) GENERATE allConnections::source AS source, allConnections::dest AS dest;
allConnectionsWithChange = DISTINCT allConnectionsWithChange_nondistinct;

-- Member pictures, restricted by the picture privacy setting
memberinfowpics = LOAD '$latest_memberinfowpics' USING BinaryJSON;
pictures = FOREACH (
    FILTER memberinfowpics BY ((cropped_picture_id is not null) AND
      ((member_picture_privacy == 'N') OR (member_picture_privacy == 'E')))
  ) GENERATE member_id, cropped_picture_id,
             first_name AS dest_first_name, last_name AS dest_last_name;

resultPic = JOIN allConnectionsWithChange BY dest, pictures BY member_id;
connectionsWithChangeWithPic = FOREACH resultPic GENERATE
    allConnectionsWithChange::source AS source_id,
    allConnectionsWithChange::dest AS member_id,
    pictures::cropped_picture_id AS pic_id,
    pictures::dest_first_name AS dest_first_name,
    pictures::dest_last_name AS dest_last_name;

-- Attach the recipient's name and email settings
joinResult = JOIN connectionsWithChangeWithPic BY source_id, memberinfowpics BY member_id;
withName = FOREACH joinResult GENERATE
    connectionsWithChangeWithPic::source_id AS source_id,
    connectionsWithChangeWithPic::member_id AS member_id,
    connectionsWithChangeWithPic::dest_first_name AS first_name,
    connectionsWithChangeWithPic::dest_last_name AS last_name,
    connectionsWithChangeWithPic::pic_id AS pic_id,
    memberinfowpics::first_name AS firstName,
    memberinfowpics::last_name AS lastName,
    memberinfowpics::gmt_offset AS gmt_offset,
    memberinfowpics::email_locale AS email_locale,
    memberinfowpics::email_address AS email_address;

resultGroup = GROUP withName BY (source_id, firstName, lastName, email_address, email_locale, gmt_offset);

-- Get the count of results per recipient
resultGroupCount = FOREACH resultGroup GENERATE group, withName AS toomany, COUNT_STAR(withName) AS num_results;
resultGroupPre = FILTER resultGroupCount BY num_results > 2;
resultGroup = FOREACH resultGroupPre {
    withName = LIMIT toomany 64;
    GENERATE group, withName, num_results;
}

x_in_review_pre_out = FOREACH resultGroup GENERATE
    FLATTEN(group) AS (source_id, firstName, lastName, email_address, email_locale, gmt_offset),
    withName.(member_id, pic_id, first_name, last_name) AS jobChanger,
    '2013' AS changeYear:chararray,
    num_results AS num_results;

x_in_review = FOREACH x_in_review_pre_out GENERATE
    source_id AS recipientID,
    gmt_offset AS gmtOffset,
    firstName AS first_name,
    lastName AS last_name,
    email_address,
    email_locale,
    TOTUPLE(changeYear, source_id, firstName, lastName, num_results, jobChanger) AS body;

rmf $xir;
STORE x_in_review INTO '$url' USING Kafka();

Page 27: The "Big Data" Ecosystem at LinkedIn

Year In Review Email – Workflow

• Find users that have changed jobs
• Join with connections and metadata (pictures)
• Group by connections of these users
• Push content to email service
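The same four steps can be traced on toy data in plain Python. This sketch mirrors the shape of the Pig script (date filter, join on the connection's destination, grouping by recipient, a minimum of three job changers, a cap of 64 listed) but leaves out the picture and profile metadata joins and the final push; the function name, record layout, and sample data are assumptions.

```python
from collections import defaultdict
from datetime import date


def year_in_review(positions, connections, start_low, start_high,
                   min_results=3, max_listed=64):
    # Step 1: members whose position started within the date range.
    changers = {
        p["member_id"] for p in positions
        if start_low <= p["start_date"] <= start_high
    }
    # Steps 2-3: join connections on dest and group by the recipient (source).
    # Using a set per recipient removes duplicate pairs, like DISTINCT in Pig.
    by_recipient = defaultdict(set)
    for source, dest in connections:
        if dest in changers:
            by_recipient[source].add(dest)
    # Keep recipients with enough job changers and cap the listed members.
    return {
        recipient: {
            "num_results": len(dests),
            "job_changers": sorted(dests)[:max_listed],
        }
        for recipient, dests in by_recipient.items()
        if len(dests) >= min_results
    }


positions = [
    {"member_id": 2, "start_date": date(2012, 3, 1)},
    {"member_id": 3, "start_date": date(2012, 7, 15)},
    {"member_id": 4, "start_date": date(2012, 11, 30)},
    {"member_id": 5, "start_date": date(2010, 1, 1)},
]
connections = [(1, 2), (1, 3), (1, 4), (1, 5), (6, 2), (6, 3)]
emails = year_in_review(positions, connections, date(2012, 1, 1), date(2012, 12, 31))
print(emails)
```

Member 1 has three connections who changed jobs in the range and receives an email; member 6 has only two and is filtered out, matching the `num_results > 2` condition in the script.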

Page 28: The "Big Data" Ecosystem at LinkedIn

“Last mile” problems

• Ingress – Moving data from online to offline system
• Workflow management – Managing offline processes
• Egress – Moving results from offline to online systems
  – Key/Value
  – Streams
  – OLAP

Page 29: The "Big Data" Ecosystem at LinkedIn

Egress - Streams

• Service acts as consumer
• “EmailContentTopic”

store emails into ‘url’ using Stream(“topic=x”)

Page 30: The "Big Data" Ecosystem at LinkedIn

Conclusion

• Hadoop: simple programmatic model, rich developer ecosystem
• Primitives for
  – Ingress: structured, complete data available; automatically handles data evolution
  – Workflow management: run and operate production processes
  – Egress: 1-line command for exporting data; horizontally scalable, little need for capacity planning
• Empowers data scientists to focus on new product ideas, not infrastructure

Page 31: The "Big Data" Ecosystem at LinkedIn

Future work: models of computation

• Graphs
• Distributed learning
  – Alternating Direction Method of Multipliers (ADMM)
  – Distributed Conjugate Gradient Descent (DCGD)
  – Distributed L-BFGS
  – Bayesian Distributed Learning (BDL)
• Near-line processing

Page 32: The "Big Data" Ecosystem at LinkedIn

data.linkedin.com