Search at LinkedIn, by Sriram Sankar and Kumaresh Pattabiraman


Search at LinkedIn

Sriram Sankar, Principal Staff Engineer
Kumaresh Pattabiraman, Senior Product Manager

https://www.youtube.com/watch?v=obCHKPYHuhA

Search at LinkedIn

Personalized professional search

Part of a bigger product experience

But a really big part of it

Some history . . .

Approach to Search

Off-the-shelf components (Lucene)

Extended to address Lucene limitations (Sensei, Bobo, Zoie, Content Store)

Specialized verticals (Cleo, Krati)

Stack adopted for other purposes (recommendations, newsfeed, ads, analytics, etc.)

Lucene

An open source API that supports search functionality:
– Add new documents to the index
– Delete documents from the index
– Construct queries
– Search the index using the query
– Score the retrieved documents

The Search Index

Inverted Index: Mapping from (search) terms to list of documents (they are present in)

Forward Index: Mapping from documents to metadata about them

[Figure: two example documents, one containing the terms Kumaresh and LinkedIn and the other containing Sriram and LinkedIn, shown alongside the resulting inverted index (terms Kumaresh, Sriram, LinkedIn) and forward index]
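The two mappings above can be sketched in a few lines of Python. This is only an illustration of the data structures, not how Lucene stores them; the sample documents, the `build_indexes` helper, and the whitespace tokenizer are all assumptions made for the example.

```python
from collections import defaultdict

def build_indexes(docs):
    """Build a toy inverted index and forward index.

    docs maps a document id to a (text, metadata) pair (hypothetical shape).
    """
    inverted = defaultdict(list)   # term -> posting list of document ids
    forward = {}                   # document id -> metadata about the document
    for doc_id in sorted(docs):    # ascending ids keep posting lists sorted
        text, metadata = docs[doc_id]
        forward[doc_id] = metadata
        for term in set(text.lower().split()):
            inverted[term].append(doc_id)
    return dict(inverted), forward

# Hypothetical sample documents, loosely modeled on the figure.
docs = {
    1: ("Kumaresh works at LinkedIn", {"title": "Senior Product Manager"}),
    2: ("Sriram works at LinkedIn", {"title": "Principal Staff Engineer"}),
}
inverted, forward = build_indexes(docs)
print(inverted["linkedin"])  # [1, 2]
print(inverted["sriram"])    # [2]
print(forward[1]["title"])   # Senior Product Manager
```

The inverted index answers "which documents contain this term?", while the forward index answers "what do we know about this document?" once it has been retrieved.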

The Search Index

The lists are called posting lists

Up to hundreds of millions of posting lists

Up to hundreds of millions of documents

Posting lists may contain as few as a single hit and as many as tens of millions of hits

Terms can be:
– words in the document
– inferred attributes about the document

Lucene Queries

"Sriram Sankar"

Sriram Kumaresh

+Sriram +LinkedIn

+Kumaresh connection:418001

+Kumaresh industry:software connection:418001^4
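To make the `+` (required) and `^` (boost) operators concrete, here is a minimal sketch of how such a query could be evaluated against posting lists. It is not Lucene's query parser or scorer: the `parse_query` and `search` helpers are hypothetical, phrases are not handled, each `field:value` pair is treated as a single opaque term, and the score is simply the sum of the boosts of the matching clauses.

```python
def parse_query(query):
    """Parse a tiny subset of the syntax above into (term, required, boost)."""
    clauses = []
    for token in query.split():
        required = token.startswith("+")
        term, _, boost = token.lstrip("+").partition("^")
        clauses.append((term.lower(), required, float(boost) if boost else 1.0))
    return clauses

def search(index, query):
    """Return {doc_id: score} for documents matching the query."""
    clauses = parse_query(query)
    required = [set(index.get(t, ())) for t, req, _ in clauses if req]
    optional = [set(index.get(t, ())) for t, req, _ in clauses if not req]
    if required:
        # Every required term must be present; optional terms only add score.
        candidates = set.intersection(*required)
    else:
        # With no required terms, at least one optional term must match.
        candidates = set.union(*optional) if optional else set()
    return {
        doc_id: sum(b for t, _, b in clauses if doc_id in index.get(t, ()))
        for doc_id in sorted(candidates)
    }

# Hypothetical posting lists.
index = {
    "kumaresh": [1, 3],
    "industry:software": [1, 2, 3],
    "connection:418001": [3],
}
print(search(index, "+Kumaresh industry:software connection:418001^4"))
# {1: 2.0, 3: 6.0}
```

Document 2 is excluded because it lacks the required term, and document 3 ranks first because the boosted connection clause contributes four times as much as an unboosted one.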

Lucene Scoring

As documents are added to the index, Lucene maintains some metadata on the terms (e.g., term position, tf/idf)

Lucene accepts scoring information via query modifications, boosts, etc.

Lucene assigns a score to each retrieved document using this information
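As a rough illustration of the tf/idf idea mentioned above, the following sketch scores documents by summing term frequency times inverse document frequency over the query terms. This is a simplified textbook formula, not Lucene's actual similarity function; the `tf_idf_scores` helper and the sample documents are assumptions.

```python
import math

def tf_idf_scores(docs, query_terms):
    """Score each document as the sum of tf * idf over the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    scores = {}
    for doc_id, tokens in tokenized.items():
        score = 0.0
        for term in query_terms:
            tf = tokens.count(term)                                  # term frequency
            df = sum(1 for t in tokenized.values() if term in t)     # document frequency
            if tf:
                score += tf * (math.log(n_docs / df) + 1.0)          # rarer terms weigh more
        scores[doc_id] = score
    return scores

# Hypothetical documents.
docs = {
    1: "sriram linkedin search",
    2: "kumaresh linkedin product",
    3: "search search lucene",
}
scores = tf_idf_scores(docs, ["lucene", "linkedin"])
best = max(scores, key=scores.get)
print(best)  # 3
```

Document 3 wins because "lucene" appears in only one document, so it carries a larger inverse document frequency than "linkedin", which appears in two.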

Sensei

Layer over Lucene that provides:
– Sharding
– Cluster management
– Enhanced query language

Sensei BQL

SELECT *
FROM cars
WHERE price > 2000.00
USING RELEVANCE MODEL my_model (favoriteColor:"black", favoriteTag:"cool")
  DEFINED AS (String favoriteColor, String favoriteTag)
  BEGIN
    float boost = 1.0;
    if (tags.contains(favoriteTag))
      boost += 0.5;
    if (color.equals(favoriteColor))
      boost += 1.2;
    return _INNER_SCORE * boost;
  END

Live Updates – Zoie and Content Store

The index reader has to be reopened before earlier live updates are visible

The only way to perform a live update is to replace the entire document – which requires access to the unchanged attributes also

Search Content Store

[Diagram labels: Search Content Store, Lucene Index, Activity Feeds, Inserts, Deletes]

Faceting

Typeahead (Instant Search)

Results as you type

Conventional wisdom: Inverted indices cannot support typeahead

Cleo, Krati

Fast forward to last year – and growing pains . . .

Scalability

Rebuilding index from scratch extremely difficult

Not possible to use complex algorithms during indexing

Live updates at document granularity

Inflexible scoring – both at Lucene and Sensei levels

Fragmentation

Too many open source components glued together with primary developers spread across many companies

Different instantiations starting to diverge to deal with their specific growing pains – so diverging stacks and distracted engineers

Our new search stack . . .

Two verticals already in production

Life of a Query

[Diagram: User Query → Query Rewriter/Planner → Search Shards → Results Merging → Search Results]
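The flow in this diagram is a scatter-gather: the rewritten query is sent to every shard, each shard returns its own top results, and the results are merged. The sketch below is a toy version under stated assumptions: the `rewrite`, `search_shard`, and `search` functions, the shard layout (document id mapped to a set of terms and a precomputed score), and the sample data are all hypothetical.

```python
import heapq
from itertools import islice

def rewrite(user_query):
    """Stand-in for the Query Rewriter/Planner: lowercase and split into terms."""
    return user_query.lower().split()

def search_shard(shard, terms, k):
    """Return this shard's top-k (score, doc_id) pairs, best first."""
    hits = [(score, doc_id)
            for doc_id, (doc_terms, score) in shard.items()
            if all(t in doc_terms for t in terms)]
    return heapq.nlargest(k, hits)

def search(shards, user_query, k):
    """Scatter the rewritten query to all shards, then merge their results."""
    terms = rewrite(user_query)
    per_shard = [search_shard(shard, terms, k) for shard in shards]
    # Each per-shard list is sorted best first, so a k-way merge suffices.
    return list(islice(heapq.merge(*per_shard, reverse=True), k))

# Hypothetical shards: doc_id -> (terms in the document, precomputed score).
shards = [
    {1: ({"search", "linkedin"}, 0.9), 2: ({"search"}, 0.4)},
    {3: ({"search"}, 0.7), 4: ({"lucene"}, 0.8)},
]
print(search(shards, "Search", 2))  # [(0.9, 1), (0.7, 3)]
```

Because every shard returns at most k results, the merge step handles at most k results per shard regardless of how large each shard is.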

Life of a Query – Within A Search Shard

[Diagram: Rewritten Query → Retrieve a Document → Score the Document → Top Results From Shard]

Life of a Query – Within A Rewriter

[Diagram: a series of Rewriter Modules, backed by Data Models, produces the Rewritten Query]

Life of Data - Offline

[Diagram labels: Raw Data, Derived Data, Data Model]

Benefits of New Stack

A complete search engine

Frequent reindexing possible (a full reset)

Resharding becomes easy

Clear separation of infrastructure and relevance functions

A single stack with a single identity!

Early Termination

We order documents in the index based on a static rank – from most important to least important

An offline relevance algorithm assigns a static rank to each document on which the sorting is performed

This allows retrieval to be early-terminated (assuming a strong correlation between static rank and importance of result for a specific query)

Happens to work well with personalized search also
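A minimal sketch of early termination, assuming document ids are assigned in static-rank order (id 0 is the most important document) so that every posting list is already sorted from most to least important. The `retrieve_early_terminated` helper and the sample posting lists are hypothetical.

```python
def retrieve_early_terminated(posting_lists, k):
    """Intersect posting lists in static-rank order and stop after k hits.

    Returns (hits, scanned), where scanned counts the entries of the first
    posting list that were examined before stopping.
    """
    if not posting_lists:
        return [], 0
    others = [set(p) for p in posting_lists[1:]]
    hits, scanned = [], 0
    for doc_id in posting_lists[0]:          # best static rank first
        scanned += 1
        if all(doc_id in other for other in others):
            hits.append(doc_id)
            if len(hits) == k:
                break                        # early termination
    return hits, scanned

# Hypothetical posting lists for two query terms.
hits, scanned = retrieve_early_terminated(
    [[0, 2, 5, 7, 9, 12], [2, 5, 6, 9, 12]], k=2)
print(hits, scanned)  # [2, 5] 3
```

Only three of the six entries are examined. The remaining documents have a lower static rank, so skipping them is safe only to the extent that static rank predicts relevance for the query, which is the assumption stated above.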

New Strategy for Live Updates

Lucene segments are “document-partitioned”

We have enhanced Lucene with “term-partitioned” segments

We use 3 term-partitioned segments:
– Base index (never changed)
– Live update buffer
– Snapshot index

Fault tolerant, and performant

No more content store!

[Diagram labels: Base Index, Snapshot Index, Live Update Buffer]
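The slides do not describe how the three segments interact, so the following is only one plausible arrangement, offered as an assumption: updates land in the live update buffer, the buffer is periodically flushed into the snapshot index, the base index is never modified, and a lookup unions the posting lists from all three. The `ThreeSegmentIndex` class and its methods are hypothetical names.

```python
class ThreeSegmentIndex:
    """Toy model of a base index, a snapshot index, and a live update buffer."""

    def __init__(self, base):
        # Base index: built offline and never changed afterwards.
        self._base = {term: tuple(postings) for term, postings in base.items()}
        self._snapshot = {}   # periodically refreshed from the live buffer
        self._live = {}       # receives updates as they arrive

    def add(self, term, doc_id):
        """Record a live update for a single term (assumed granularity)."""
        self._live.setdefault(term, []).append(doc_id)

    def flush(self):
        """Fold the live buffer into the snapshot and start a fresh buffer."""
        for term, postings in self._live.items():
            self._snapshot.setdefault(term, []).extend(postings)
        self._live = {}

    def postings(self, term):
        """Union of the term's postings across all three segments."""
        merged = set(self._base.get(term, ()))
        merged.update(self._snapshot.get(term, ()))
        merged.update(self._live.get(term, ()))
        return sorted(merged)

idx = ThreeSegmentIndex({"linkedin": [1, 2]})
idx.add("linkedin", 3)            # visible immediately, no reader reopen
print(idx.postings("linkedin"))   # [1, 2, 3]
idx.flush()
idx.add("search", 3)
print(idx.postings("linkedin"))   # [1, 2, 3]
print(idx.postings("search"))     # [3]
```

In this sketch an update touches only the posting lists of the affected terms, which is one reading of why term-partitioned segments remove the need to replace whole documents.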

Data Distribution

Bit torrent based data distribution framework

More details at a later time

Relevance

Offline analysis – resulting in a better index and data models

Query rewriting – for better and more accurate recall

Scoring – to fine tune each of the retrieved results

Reranking – selection of top results for overall result set quality

Blending – to combine results from multiple verticals

Machine Learned Scorers

Goal: To automatically build a function whose arguments are interesting features of the query and the document

Input to the machine learning system is a set of training data that describes how the function should behave on various combinations of feature values

The function takes the form of standard templates – a linear formula is commonly used (due to simplicity)

Linear Regression on a Single Feature
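For a single feature, the linear template is score = slope × feature + intercept, and the least-squares fit has a closed form. The sketch below uses ordinary least squares with hypothetical training points; the `fit_line` helper is an assumed name, and nothing here reflects the training setup actually used at LinkedIn.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: feature value -> desired score.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)  # 2.0 1.0
```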

LinkedIn Scorer: Different Linear Models for Different Intents

Relevance models incorporate user features:

score = P (Document | Query, User)

Tree with linear regression leaves

X10 < 0.1234 ?
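A tree with linear regression leaves can be sketched as follows: internal nodes test a feature against a threshold, and each leaf applies its own linear model. Only the feature name X10 and the threshold 0.1234 come from the slide; the `Leaf` and `Split` classes, the other feature names, and all weights are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Union

@dataclass
class Leaf:
    """A linear model: bias + sum of weight * feature value."""
    weights: Dict[str, float]
    bias: float = 0.0

    def score(self, features):
        return self.bias + sum(
            w * features.get(name, 0.0) for name, w in self.weights.items())

@dataclass
class Split:
    """An internal node that routes on `feature < threshold`."""
    feature: str
    threshold: float
    if_true: Union["Split", Leaf]
    if_false: Union["Split", Leaf]

    def score(self, features):
        branch = self.if_true if features[self.feature] < self.threshold else self.if_false
        return branch.score(features)

# Hypothetical model: one split on x10, with a different linear model per side.
model = Split("x10", 0.1234,
              if_true=Leaf({"x1": 2.0}, bias=0.5),
              if_false=Leaf({"x2": 0.25}, bias=1.0))
print(model.score({"x10": 0.05, "x1": 1.0, "x2": 4.0}))  # 2.5
print(model.score({"x10": 0.90, "x1": 1.0, "x2": 4.0}))  # 2.0
```

The split lets the scorer use different linear models for different regions of the feature space, for example for different query intents.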

Going Forward

Further standardize infrastructure for relevance components

Scatter-gather

Java GC issues

Extend infrastructure to browser/device

Reintegrate diverging stacks

Product Overview

LinkedIn’s Vision

“Create economic opportunity for every member of the global workforce”

The Economic Graph

Search is core to the economic graph vision

Who uses search?

Casual User: LI as professional identity

Outbound professional (Recruiter / Sales): LI as day job

Job Seeker: LI as a way to get the day job

Casual User

Name Search

Topic Search

Instant: Name Search

Search all members by name or approximate name

Unified Search: Topic Search

One federated search result page with all relevant entities about the topic

Outbound professional

Exploratory people search

Instant: Search Suggestions

Entity-aware suggestions for companies, skills & titles

Instant: Just one keystroke

From name search to exploratory search

People Search

Explore using facets and advanced search fields

People Search

Leverage the network through shared connections

Recruiter & Sales Navigator

Products powered by search

Job Seeker

Job Search

Instant: Search Suggestions

Entity-aware suggestions for companies, skills & titles

Job Search

Explore using facets and advanced search fields

Job Search

Leverage the network through relationship to job poster or connections in the company

Other Search Users include…

Students – University Search
Information Seekers / Researchers – Content Search
Advertisers / Content Marketers – Company & Group Search

Bringing it all together

300 Million+ members

Search the economic graph of:
– 300M profiles
– 3B Endorsements
– 300K jobs
– 3M Companies
– 2M Groups
– 25K Schools
– 100M+ pieces of professional content

One index

One unified search stack

Product

Platform
