Path to 400M Members: LinkedIn’s Data Powered Journey

Post on 07-Jan-2017

Xin Fu, Carl Steinbach

Hadoop Summit Tokyo, October 26, 2016

Path to 400M* Members: LinkedIn’s Data Powered Journey

* As of Q2 2016, LinkedIn had 450M members worldwide

[Timeline graphic; year labels as extracted: 2004, 2011, 2012, 2009, 2012, 2015]

Real Time Visualization of New Sign-ups

What Does “Data-Driven” Mean at LinkedIn?

Monitoring & Learning

What is This Phase Comprised of?

● Dashboards

● Reports

● Trend explanation

○ Short term fluctuation: investigation

○ Long term trend: strategic analysis
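One way a short term fluctuation could be flagged for investigation is a trailing-window z-score on a daily metric. This is a minimal sketch, not LinkedIn's method: the function name, the 7-day window, the threshold of 3 and the sample numbers are all illustrative assumptions.

```python
from statistics import mean, stdev

def flag_fluctuations(series, window=7, threshold=3.0):
    """Return indices whose value deviates from the trailing-window mean
    by more than `threshold` sample standard deviations."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as a deviation.
            if series[i] != mu:
                flagged.append(i)
            continue
        if abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Hypothetical daily sign-up counts with one sharp dip at index 8.
daily_signups = [100, 102, 98, 101, 99, 103, 100, 101, 60, 100]
print(flag_fluctuations(daily_signups))
```

A flagged index is only a prompt for investigation; explaining a long term trend would need a different, strategic analysis.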

Past Challenges

Reliability
● Easily broken without operational support, huge time spent in maintenance

Diverse technology
● Self maintained pipelines
● Various UIs with different visualization capabilities
● Redundant computation

Standardized Reporting Tool

● Reduces dependency on 3rd party BI tools
● Closer integration with LinkedIn’s ecosystem of experimentation and anomaly detection solutions

Towards Real Time Monitoring

[Diagram: Sign-up broken down by Country, Platform, Language, Browser, Signup Type, OS]
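The sign-up dimensions on this slide (Country, Platform, Language, Browser, Signup Type, OS) suggest a per-dimension count over a recent time window. The sketch below assumes each sign-up arrives as a dictionary with a millisecond `time` field; the snake_case keys, the window length and the function name are hypothetical, and a production system would do this over a stream rather than a list.

```python
from collections import Counter

# Dimension names follow the slide; the snake_case keys are assumptions.
DIMENSIONS = ("country", "platform", "language", "browser", "signup_type", "os")

def signup_counts(events, now_ms, window_ms=60_000):
    """Count sign-ups per dimension value for events inside the window
    (now_ms - window_ms, now_ms]."""
    counts = {dim: Counter() for dim in DIMENSIONS}
    for event in events:
        if now_ms - window_ms < event["time"] <= now_ms:
            for dim in DIMENSIONS:
                # Missing dimensions are bucketed as "unknown".
                counts[dim][event.get(dim, "unknown")] += 1
    return counts

# Hypothetical sign-up events.
sample_events = [
    {"time": 1_000, "country": "JP", "platform": "mobile", "os": "iOS"},
    {"time": 50_000, "country": "JP", "platform": "desktop", "os": "Windows"},
    {"time": 59_000, "country": "US", "platform": "mobile", "os": "Android"},
]
recent = signup_counts(sample_events, now_ms=60_000, window_ms=30_000)
print(recent["country"])
```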

Experimentation & Analysis

What is This Phase Comprised of?

● Experiment design

● Experiment analysis to inform ramp decisions

● Learning from multiple experiments to identify what works and what doesn’t work
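As an illustration of experiment analysis informing a ramp decision, the sketch below compares conversion rates of a control and a treatment group with a two-proportion z-test. The slides do not say which statistics LinkedIn's platform uses, so the test, the function name and the sample counts are assumptions chosen for simplicity.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion
    rate between control (a) and treatment (b)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variance: every user converted, or none did")
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 10.0% vs 11.0% conversion on 10,000 users each.
z, p = two_proportion_z_test(1000, 10_000, 1100, 10_000)
print(f"z={z:.2f}, p={p:.4f}")
```

A single localized metric like this is exactly what the following slide warns against relying on alone; a ramp decision would weigh several metrics together.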

Past Challenges

Experiment design
● Interaction between experiments

Experiment analysis and ramp decision
● Manual analysis, extended time-to-decision
● Ramp decisions based on localized metrics
● Reruns needed sometimes due to undetected errors in setup

Worst of all, some ramps happened without A/B testing
● e.g. infrastructural changes

Experimentation Platform @ LinkedIn

● Company-wide platform for A/B testing, ramping, and advanced targeting needs

● Automated reporting and analysis capabilities

Tiering of Metrics

Metrics at different tiers have:

● Different review processes

● Different levels of visibility in dashboards and experiment scorecards

● Different computation priorities and SLAs in data pipelines

● Different life cycles
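The tier attributes listed above could be captured in a small configuration structure that the data pipeline consults when ordering work. Everything concrete here is hypothetical: the number of tiers, the field names, the SLA hours and the metric names are invented for illustration and do not come from the talk.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricTier:
    name: str
    review_process: str   # who signs off on definition changes
    on_scorecard: bool    # shown on experiment scorecards by default
    priority: int         # lower value is computed earlier
    sla_hours: int        # target delay before the metric is available

# Hypothetical tier definitions.
TIERS = {
    1: MetricTier("tier 1", "central review", True, 0, 6),
    2: MetricTier("tier 2", "team review", True, 1, 12),
    3: MetricTier("tier 3", "owner only", False, 2, 24),
}

def schedule(metrics):
    """Order metric names so that higher-priority tiers are computed first.

    `metrics` maps a metric name to its tier number."""
    return sorted(metrics, key=lambda name: (TIERS[metrics[name]].priority, name))

# Hypothetical metric-to-tier assignment.
example_metrics = {"b_local_clicks": 3, "a_signups": 1, "c_sessions": 2}
print(schedule(example_metrics))
```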

Backend Infrastructure for Tracking & Instrumentation

Tracking Data Records User Activity

InvitationClickEvent()

Scale fact: ~1000 tracking event types, ~20TB per day, hundreds of metrics & data products

Tracking Data Lifecycle and Teams

Product teams: PMs, Developers, TestEng

Infra teams: Hadoop, Kafka, DWH, ...

Data teams: Analytics, Relevance Engineers, ...

Example: How Do We Track a Profile View?

PageViewEvent
Record 1:
{
  "header" : {
    "memberId" : 12345,
    "time" : 1454745292951,
    "appName" : { "string" : "LinkedIn" },
    "pageKey" : "profile_page"
  },
  "trackingInfo" : {
    "vieweeID" : "23456",
    ...
  }
}

pageViews = LOAD '/data/tracking/PageViewEvent';

profileViews = FILTER pageViews BY header.pageKey == 'profile_page';

Example: How Do We Track a Profile View?

PageViewEvent
Record 1:
{
  "header" : {
    "memberId" : 12345,
    "time" : 1454745292951,
    "appName" : { "string" : "LinkedIn" },
    "pageKey" : "new_profile_page"
  },
  "trackingInfo" : {
    "vieweeID" : "23456",
    ...
  }
}

pageViews = LOAD '/data/tracking/PageViewEvent';

profileViews = FILTER pageViews BY header.pageKey == 'profile_page' OR header.pageKey == 'new_profile_page';

At Some Point It Becomes Unmaintainable ...

How Do We Handle Old and New?

[Diagram: Producers and Consumers]

We had been working on something that could help...

DALI: A Data Access Layer for LinkedIn

Abstract away underlying physical details to allow users to focus solely on the logical concerns

Logical Tables + Views

Logical FileSystem

DALI: Implementation Details in Context

[Architecture diagram; component labels:]
● Data Catalog + Discovery (DALI)
● DaliFileSystem Client
● Data Source (HDFS)
● Data Sink (HDFS)
● Processing Engine (MapReduce, Spark, Presto)
● DALI Datasets (Tables + Views)
● Query Layers (Hive, Pig, Spark)
● View Defs + UDFs (Artifactory, Git)
● Dataflow APIs (MR, Spark, Scalding)
● DALI CLI

Solving with DALI Views

[Diagram: Producers and Consumers]
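The idea of a view sitting between producers and consumers can be sketched in plain Python. This is not DALI's API, which the slides do not show; it is a hypothetical stand-in in which the producer owns one function that knows about both the old and the new `pageKey` values, and consumers read logical rows without ever filtering on `pageKey`. The output field names `viewerId` and `vieweeId` are assumptions.

```python
# Physical detail owned by the producer: every pageKey that means "profile view".
PROFILE_PAGE_KEYS = {"profile_page", "new_profile_page"}

def profile_view(page_view_events):
    """Hypothetical producer-owned view: yield one logical row per
    profile view, hiding which physical pageKey recorded it."""
    for event in page_view_events:
        header = event["header"]
        if header["pageKey"] in PROFILE_PAGE_KEYS:
            yield {
                "viewerId": header["memberId"],
                "vieweeId": event["trackingInfo"]["vieweeID"],
                "time": header["time"],
            }

# Sample records shaped like the PageViewEvent example in the slides.
sample_page_views = [
    {"header": {"memberId": 12345, "time": 1454745292951, "pageKey": "profile_page"},
     "trackingInfo": {"vieweeID": "23456"}},
    {"header": {"memberId": 12345, "time": 1454745292999, "pageKey": "new_profile_page"},
     "trackingInfo": {"vieweeID": "34567"}},
    {"header": {"memberId": 12345, "time": 1454745293100, "pageKey": "feed"},
     "trackingInfo": {}},
]

# Consumer code depends only on the logical rows.
for row in profile_view(sample_page_views):
    print(row["viewerId"], "viewed", row["vieweeId"])
```

When another page key appears, only `PROFILE_PAGE_KEYS` changes; consumer code stays untouched, which is the maintainability gain the preceding slides argue for.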

State of the World Today with DALI

● ~100 producer views
● ~200 consumer views
● ~80 unique tracking event data sources

What’s next?
● Views on streaming data
● Selective materialization and caching
● Open source

At the Core of “Data-Driven” is ....

Used to be Tug of War Between Speed and Quality

Before We Learned that Technology Could Break the Dichotomy Between Speed and Quality

Cultural Aspects: Partnership Between Data Scientists and Engineers

Interesting Challenges

- Metric trade-off, e.g. between engagement and monetization
- Real-time everything?
- A/B test in a social network
- Human judge for personalized search
- Value of an action

It Took a Village

Thanks to all the Data Scientists, Engineers and Product partners at LinkedIn for being part of this great journey!

https://engineering.linkedin.com/data
