Big Data and Data Standardization at LinkedIn

Upload: alexis-baird

Post on 24-May-2015

DESCRIPTION

From a talk I gave to a group of Connecticut College students in November of 2012. This looks at some of the challenges of dealing with huge amounts of member-inputted data as well as techniques used to solve these challenges and product applications of that member-inputted data.

TRANSCRIPT

Page 1: Big Data and Data Standardization at LinkedIn

Reading the Tea Leaves:

Big Data at LinkedIn

Alexis Baird, Product Manager, LinkedIn

Page 2: Big Data and Data Standardization at LinkedIn

What is LinkedIn?

§  LinkedIn’s mission: “Connect the world’s professionals to make them more productive and successful”
§  The site officially launched on May 5, 2003
§  Now has >187 million members worldwide
§  LinkedIn has >3,000 employees in offices all around the world
§  Headquartered in Mountain View, CA
§  Three different lines of revenue:
–  Subscriptions
–  Talent Solutions
–  Marketing Solutions

Page 3: Big Data and Data Standardization at LinkedIn

Who am I?

Page 4: Big Data and Data Standardization at LinkedIn

The Age of Big Data

Page 5: Big Data and Data Standardization at LinkedIn

Big Data at LinkedIn

§  187+ million members from >200 countries
§  Each month, 52 million members come to the site, generating ~2 billion page views:
–  Performing searches
–  Connecting with other members
–  Editing their profile
–  Sharing, commenting on, or liking news articles
–  Participating in group discussions
–  And much more…

Page 6: Big Data and Data Standardization at LinkedIn

Big Data Challenges

§  Storage and processing constraints
§  Noisy signal
–  Variation
–  People are not always rational or consistent

Page 7: Big Data and Data Standardization at LinkedIn

Data Messiness

§  Job titles:
–  “programmer”
–  “software developer”
–  “engineer”
–  “coding ninja”

§  Schools:
–  “Connecticut College”
–  “Conn College”
–  “Conn”
–  “CC”
–  “Conn College (NOT Uconn)”

§  Companies:
–  “Microsoft”
–  “MSFT”
–  “Bing”
–  “Microsoft/Bing”
–  “Microsoft-Mountain View”

Page 8: Big Data and Data Standardization at LinkedIn

Data Standardization

§  Take an input (usually a user-entered string) and turn it into a meaningful abstract id

“Microsoft”, “MSFT”, “Bing”, “Microsoft/Bing”, “Microsoft-Mountain View” => Company_id = 1035 (“Microsoft Corporation”)
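
As a rough illustration of the mapping above, here is a minimal sketch that assumes a hand-built alias table. The `ALIASES` table and the `normalize` and `standardize_company` helpers are hypothetical names made up for this example; only the id 1035 and the input strings come from the slide, and the talk does not say how the real lookup works.

```python
import re

# Hypothetical alias table: normalized surface form -> company id.
# Only id 1035 ("Microsoft Corporation") comes from the slide.
COMPANY_NAMES = {1035: "Microsoft Corporation"}
ALIASES = {
    "microsoft": 1035,
    "msft": 1035,
    "bing": 1035,
}

def normalize(raw):
    """Lowercase the string and replace punctuation with single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", raw.lower()).strip()

def standardize_company(raw):
    """Return a company id for a user-entered string, or None if unknown."""
    cleaned = normalize(raw)
    if cleaned in ALIASES:
        return ALIASES[cleaned]
    # Fall back to the first token that is a known alias, so that
    # "Microsoft/Bing" and "Microsoft-Mountain View" still resolve.
    for token in cleaned.split():
        if token in ALIASES:
            return ALIASES[token]
    return None

inputs = ["Microsoft", "MSFT", "Bing", "Microsoft/Bing", "Microsoft-Mountain View"]
for s in inputs:
    company_id = standardize_company(s)
    print(s, "=>", company_id, COMPANY_NAMES.get(company_id))
```

A plain alias table like this only covers strings someone has already seen, which is one reason the later slides move on to similarity and clustering.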

Page 9: Big Data and Data Standardization at LinkedIn

Why is this important?

Page 10: Big Data and Data Standardization at LinkedIn

Search

Page 11: Big Data and Data Standardization at LinkedIn

Structured data > Unstructured data

P(“linkedin” = company_id 1337) = .87
P(“ceo” = title_id 238) = .92
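
One way to read the two probabilities above is as a lookup that turns search tokens into structured tags. The sketch below assumes such a lookup table exists; `TOKEN_ENTITIES`, `tag_query`, and the `threshold` parameter are hypothetical, and only the two entries and their probabilities are taken from the slide.

```python
# Hypothetical lookup: query token -> (entity type, id, probability).
# The two entries mirror the figures on the slide; the rest is assumed.
TOKEN_ENTITIES = {
    "linkedin": ("company_id", 1337, 0.87),
    "ceo": ("title_id", 238, 0.92),
}

def tag_query(query, threshold=0.5):
    """Split a query into structured tags and leftover free-text tokens."""
    tags, free_text = [], []
    for token in query.lower().split():
        entry = TOKEN_ENTITIES.get(token)
        # Keep a structured interpretation only if it is likely enough.
        if entry is not None and entry[2] >= threshold:
            tags.append(entry)
        else:
            free_text.append(token)
    return tags, free_text

tags, rest = tag_query("LinkedIn CEO california")
print(tags)
print(rest)
```

With tags in hand, a search backend could filter on standardized ids instead of matching raw strings, which is the point of the slide title.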

Page 12: Big Data and Data Standardization at LinkedIn

Recommendations

Page 13: Big Data and Data Standardization at LinkedIn

Recommendation products at LinkedIn

§  Similar Profiles
§  Events You May Be Interested In
§  News
§  Network updates
§  Connections

Page 14: Big Data and Data Standardization at LinkedIn

LinkedIn’s recommender ecosystem

Recommendations drive:

§  >50% of connections
§  >50% of job applications
§  >50% of group joins

Page 15: Big Data and Data Standardization at LinkedIn

Jobs You Might Be Interested In

Page 16: Big Data and Data Standardization at LinkedIn

How LinkedIn matches people to jobs

[Diagram; the labels below are recovered from the slide. Other labels in the diagram: Corpus Stats, User Base, Filtered]

§  Job: title, geo, company, industry, description, functional area
§  Candidate:
–  General: expertise, specialties, education, headline, geo, experience
–  Current Position: title, summary, tenure length, industry, functional area, …
§  Feature values shown:
–  Similarity (candidate expertise, job description): 0.56
–  Similarity (candidate specialties, job description): 0.2
–  Transition probability (candidate industry, job industry): 0.43
–  Title similarity: 0.8
–  Similarity (headline, title): 0.7
–  …
§  Matching:
–  Binary: exact matches (geo, industry, …)
–  Soft: transition probabilities, similarity, …
–  Text
§  Derived: transition probabilities, connectivity, yrs of experience to reach title, education needed for this title, …
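
To make the binary and soft features concrete, here is a minimal sketch that computes a few of them for one candidate and one job. It assumes bag-of-words cosine similarity as the text similarity, which the slide does not specify; the `cosine` and `match_features` helpers, the field names, and the sample data are all hypothetical, and the sketch stops at computing features because the slide does not say how they are combined.

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    """Cosine similarity between bag-of-words counts of two strings."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def match_features(candidate, job):
    """Build a small feature dict from a candidate and a job (plain dicts)."""
    return {
        # Soft features: text similarity between profile and job fields.
        "sim_expertise_description": cosine(candidate["expertise"], job["description"]),
        "sim_headline_title": cosine(candidate["headline"], job["title"]),
        # Binary features: exact matches on already-standardized fields.
        "geo_match": float(candidate["geo"] == job["geo"]),
        "industry_match": float(candidate["industry"] == job["industry"]),
    }

# Made-up sample data for illustration only.
candidate = {
    "expertise": "python java distributed systems",
    "headline": "software engineer",
    "geo": "san francisco bay area",
    "industry": "internet",
}
job = {
    "title": "senior software engineer",
    "description": "build distributed systems in java",
    "geo": "san francisco bay area",
    "industry": "computer software",
}

for name, value in match_features(candidate, job).items():
    print(name, round(value, 2))
```

The exact-match features only work if geo and industry have already been standardized to shared values, which ties this slide back to the standardization problem.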

Page 22: Big Data and Data Standardization at LinkedIn

Data Standardization: Occupations

§  How do we know a “senior software developer” and a “software developer” are the same occupation?
–  Strip a special set of words known to indicate seniority
§  How do we know a “software developer” and a “software engineer” are the same occupation?
–  Term similarity
§  How do we know a “programmer” and a “software developer” are the same occupation but a “programmer” and a “program director” are not?
–  Need something more complicated
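
The three cases above can be tried out with a few lines of code. This is a minimal sketch under stated assumptions: the `SENIORITY_WORDS` set is a made-up vocabulary (the slide does not list the actual words), token-level Jaccard overlap stands in for whatever "term similarity" was used, and `difflib.SequenceMatcher` from the standard library serves as a generic character-level comparison.

```python
from difflib import SequenceMatcher

# Hypothetical seniority vocabulary; the slide does not list the real one.
SENIORITY_WORDS = {"senior", "sr", "junior", "jr", "lead", "principal", "staff"}

def strip_seniority(title):
    """Drop tokens that only signal seniority."""
    tokens = title.lower().replace(".", "").split()
    return " ".join(t for t in tokens if t not in SENIORITY_WORDS)

def token_jaccard(a, b):
    """Share of distinct words that two titles have in common."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def char_similarity(a, b):
    """Character-level similarity ratio from the standard library."""
    return SequenceMatcher(None, a, b).ratio()

# Case 1: seniority words removed, the two titles become identical.
print(strip_seniority("Senior Software Developer"))

# Case 2: the titles share a word, so token overlap is above zero.
print(token_jaccard("software developer", "software engineer"))

# Case 3: surface similarity points the wrong way.
print(token_jaccard("programmer", "software developer"))
print(char_similarity("programmer", "software developer"))
print(char_similarity("programmer", "program director"))
```

In the third case the two titles that mean the same thing share no words, while the character-level score ranks “program director” closer to “programmer” than “software developer” is. String comparison alone cannot separate them, hence the need for something more complicated.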

Page 23: Big Data and Data Standardization at LinkedIn

Data standardization: Occupations

1.  Rule-based string clean up:
–  ~2 million different titles => 24,000 different “cleaned” titles
–  E.g. “Sr software dev” => “senior software developer”
2.  Create “virtual profiles” for each title using various extracted and normalized profile features (e.g. skills, degree, field of study, summary, job description, honors)
3.  Cluster similar titles
4.  Get rid of uninformative titles spread across too many different topics
5.  Apply hand QA to tune the clusters/name the clusters
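
Steps 1 through 4 can be sketched end to end on toy data. Everything here is an assumption made for illustration: the `ABBREVIATIONS` rules, the use of pooled skill counts as the "virtual profile", the greedy threshold clustering, and entropy as the measure of how widely a title is spread across topics. The talk does not name the actual features, clustering algorithm, or filter, and step 5 (hand QA) is not something code can stand in for.

```python
import math
import re
from collections import Counter, defaultdict

# Step 1: hypothetical abbreviation rules for string clean up.
ABBREVIATIONS = {"sr": "senior", "jr": "junior", "dev": "developer", "eng": "engineer"}

def clean_title(raw):
    """Lowercase, drop punctuation, and expand known abbreviations."""
    tokens = re.sub(r"[^a-z ]+", " ", raw.lower()).split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

# Step 2: a "virtual profile" here is the pooled skill counts of all
# members holding a given cleaned title.
def build_virtual_profiles(members):
    profiles = defaultdict(Counter)
    for raw_title, skills in members:
        profiles[clean_title(raw_title)].update(skills)
    return profiles

def cosine(a, b):
    """Cosine similarity between two Counters."""
    dot = sum(a[k] * b[k] for k in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Step 3: greedy single-pass clustering against each cluster's first title.
def cluster_titles(profiles, threshold=0.5):
    clusters = []
    for title, profile in profiles.items():
        for cluster in clusters:
            if cosine(profile, profiles[cluster[0]]) >= threshold:
                cluster.append(title)
                break
        else:
            clusters.append([title])
    return clusters

# Step 4: entropy of the skill distribution as a rough proxy for how
# spread out a title is; very high values would flag uninformative titles.
def skill_entropy(profile):
    total = sum(profile.values())
    return -sum((c / total) * math.log2(c / total) for c in profile.values())

# Made-up members: (raw title, skills).
members = [
    ("Sr software dev", ["java", "python"]),
    ("Software Developer", ["java", "python", "sql"]),
    ("Programmer", ["python", "java"]),
    ("Program Director", ["budgeting", "fundraising"]),
]

profiles = build_virtual_profiles(members)
for cluster in cluster_titles(profiles):
    print(cluster)
```

Because titles are compared through the skills of the people who hold them rather than through their spelling, “programmer” lands with the software titles and “program director” does not, at least on this toy data.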

Page 24: Big Data and Data Standardization at LinkedIn
Page 25: Big Data and Data Standardization at LinkedIn

Lessons learned

§  Know your machine learning!
§  Know your success metric!
§  Need to allow for ambiguity within a given title

§  “Head of production”
§  DDS

§  Some titles are not standardizable:

Page 26: Big Data and Data Standardization at LinkedIn

Takeaways

§  The more information you give, the better your standardization will be
§  Why do you want LinkedIn to do a good job standardizing the data on your profile?
–  Better recommendations: news, jobs, groups, connections, etc.
–  Recruiters can find you more easily
–  Potential connections can find you

Page 27: Big Data and Data Standardization at LinkedIn

[Chart: LinkedIn Members (Millions), 2004 to 2011. Values shown: 2, 4, 8, 17, 32, 55, 90]

§  175M+
§  25th most visited website worldwide (Comscore 6-12)
§  >2M company pages
§  62% non U.S.
§  2/sec
§  85% of Fortune 500 companies use LinkedIn to hire

Thank You!

We’re Hiring!

Learn more at http://data.linkedin.com/