Lucene at Yelp - By Sudarshan Gaikaiwari

Upload: lucenerevolution

Post on 14-Dec-2014

DESCRIPTION

See conference video - http://www.lucidimagination.com/devzone/events/conferences/revolution/2011

TRANSCRIPT

Page 1: Lucene at Yelp - By Sudarshan Gaikaiwari

Lucene @ Yelp

Sudarshan Gaikaiwari

Page 2: Lucene at Yelp - By Sudarshan Gaikaiwari

Bio

1. Over a decade of experience in information retrieval
2. Used IR techniques at Symantec's DLP group
3. Search Engineer at Yelp

Page 3: Lucene at Yelp - By Sudarshan Gaikaiwari

Outline

1. Overview of search services at Yelp
2. Federation Motivation
3. Lucy Indexing
4. Lucy Searching
5. Efficiently Retrieving top k hits

Page 4: Lucene at Yelp - By Sudarshan Gaikaiwari

The services we provide

Page 5: Lucene at Yelp - By Sudarshan Gaikaiwari

Lucy: business search

Page 6: Lucene at Yelp - By Sudarshan Gaikaiwari

Lucy also powers phone search

Page 7: Lucene at Yelp - By Sudarshan Gaikaiwari

Cathy: she 'talks' a lot

Page 8: Lucene at Yelp - By Sudarshan Gaikaiwari

Listsearch: it searches lists....

Page 9: Lucene at Yelp - By Sudarshan Gaikaiwari

Reviewsearch: it searches reviews....

Page 10: Lucene at Yelp - By Sudarshan Gaikaiwari

DYM: did you really mean that?

Page 11: Lucene at Yelp - By Sudarshan Gaikaiwari

Suggest: auto completion

Page 12: Lucene at Yelp - By Sudarshan Gaikaiwari

Federation Motivation

Page 13: Lucene at Yelp - By Sudarshan Gaikaiwari

Problem

Search is too slow

Page 14: Lucene at Yelp - By Sudarshan Gaikaiwari

Hard Disk Seek Latency

Disk seek: 10,000,000 ns

Source: Software Engineering Advice from Building Large-Scale Distributed Systems, Jeffrey Dean

Page 15: Lucene at Yelp - By Sudarshan Gaikaiwari

RAM read latency

Main memory reference: 100 ns

Page 16: Lucene at Yelp - By Sudarshan Gaikaiwari

Pinning Index in RAM

● vmtouch
● mlock
● http://hoytech.com/vmtouch/
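As a rough illustration of why touching an index helps, the sketch below reads every file under an index directory so the kernel pulls the pages into its page cache. This is only a warm-up under stated assumptions: the function names and the directory layout are hypothetical, and unlike mlock (or the locking mode of vmtouch) it does not pin anything, so the kernel remains free to evict the pages later.

```python
import os


def warm_page_cache(path, chunk_size=1 << 20):
    """Read a file sequentially so its pages land in the OS page cache.

    Returns the number of bytes read. This warms the cache only; it does
    not lock pages in RAM the way mlock does.
    """
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def warm_index_dir(index_dir):
    """Warm every regular file under a (hypothetical) index directory."""
    total = 0
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            total += warm_page_cache(os.path.join(root, name))
    return total
```

Calling warm_index_dir on the directory that holds the index segments returns the number of bytes read, which is a quick check that the whole index was touched.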

Page 17: Lucene at Yelp - By Sudarshan Gaikaiwari

Problem

Index is too large to fit in memory on a single machine

Page 18: Lucene at Yelp - By Sudarshan Gaikaiwari

Geographical sharding

Page 19: Lucene at Yelp - By Sudarshan Gaikaiwari

Geographical Sharding drawbacks

1. Cumbersome manual process to determine shard boundary
2. No guarantee that a boundary can be found

Page 20: Lucene at Yelp - By Sudarshan Gaikaiwari

Federation

1. Split index across multiple machines
2. Shard on business id
3. TF-IDF scores from different machines should be comparable

Page 21: Lucene at Yelp - By Sudarshan Gaikaiwari

Mapping businesses to shards

1. Assigning businesses to shards

shard = shardlist[hash(business_id) % len(shardlist)]

Problems

1. Involves re-indexing all the businesses if we want to add a new shard
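To make the re-indexing problem concrete, the sketch below measures how many businesses change shard when a fifth shard is added under the modulo scheme above. The business ids, shard names, and the SHA-256 based hash are assumptions for illustration; the slides do not say which hash function is used.

```python
import hashlib


def stable_hash(business_id):
    """Deterministic hash; Python's built-in hash() is salted per process for strings."""
    digest = hashlib.sha256(str(business_id).encode("utf-8")).hexdigest()
    return int(digest, 16)


def shard_for(business_id, shardlist):
    """The modulo assignment from the slide."""
    return shardlist[stable_hash(business_id) % len(shardlist)]


def fraction_moved(business_ids, old_shards, new_shards):
    """Fraction of businesses whose shard changes between the two shard lists."""
    moved = sum(
        1 for b in business_ids
        if shard_for(b, old_shards) != shard_for(b, new_shards)
    )
    return moved / len(business_ids)


# Hypothetical ids and shard names.
business_ids = ["biz-%d" % i for i in range(10000)]
old_shards = ["shard0", "shard1", "shard2", "shard3"]
new_shards = old_shards + ["shard4"]
moved = fraction_moved(business_ids, old_shards, new_shards)
```

For uniformly distributed hashes, a business keeps its shard only when the hash gives the same remainder modulo 4 and modulo 5, which holds for about one id in five. Roughly 80% of the businesses therefore move, which in practice means rebuilding every shard.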

Page 22: Lucene at Yelp - By Sudarshan Gaikaiwari

Virtual Nodes

Page 23: Lucene at Yelp - By Sudarshan Gaikaiwari

Advantages

1. Flexibility (move vbuckets from one shard to another)
2. Split hot spot shards
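A minimal sketch of the virtual node idea, assuming a fixed number of vbuckets and a lookup table from vbucket to shard. The vbucket count, the round-robin initial layout, and all names are hypothetical; the slides only say that vbuckets can be moved between shards.

```python
import hashlib

NUM_VBUCKETS = 1024  # hypothetical; stays fixed so business -> vbucket never changes


def vbucket_for(business_id):
    """Map a business to a vbucket; this mapping is independent of the shard count."""
    digest = hashlib.sha256(str(business_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_VBUCKETS


def initial_table(shards):
    """Assign vbuckets to shards round-robin."""
    return {vb: shards[vb % len(shards)] for vb in range(NUM_VBUCKETS)}


def shard_for(business_id, table):
    return table[vbucket_for(business_id)]


def move_vbuckets(table, vbuckets, target_shard):
    """Return a new table with the given vbuckets reassigned to target_shard."""
    new_table = dict(table)
    for vb in vbuckets:
        new_table[vb] = target_shard
    return new_table


table = initial_table(["shard0", "shard1", "shard2", "shard3"])
# Add a fifth shard by handing it every fifth vbucket.
moved_vbuckets = [vb for vb in range(NUM_VBUCKETS) if vb % 5 == 0]
new_table = move_vbuckets(table, moved_vbuckets, "shard4")

business_ids = ["biz-%d" % i for i in range(10000)]
changed = [b for b in business_ids
           if shard_for(b, table) != shard_for(b, new_table)]
```

Only businesses in the reassigned vbuckets change shard (about a fifth of them here), so adding a shard or splitting a hot one means re-indexing just those vbuckets rather than the whole corpus.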

Page 24: Lucene at Yelp - By Sudarshan Gaikaiwari

Lucy Master Slave Architecture

Separate indexing (masters): a master for each shard of a service

Searching (slaves): a slave for every replica of a service

Page 25: Lucene at Yelp - By Sudarshan Gaikaiwari

Lucy Indexing

Page 26: Lucene at Yelp - By Sudarshan Gaikaiwari
Page 27: Lucene at Yelp - By Sudarshan Gaikaiwari
Page 28: Lucene at Yelp - By Sudarshan Gaikaiwari
Page 29: Lucene at Yelp - By Sudarshan Gaikaiwari
Page 30: Lucene at Yelp - By Sudarshan Gaikaiwari

Lucy Searching

Page 31: Lucene at Yelp - By Sudarshan Gaikaiwari
Page 32: Lucene at Yelp - By Sudarshan Gaikaiwari

Federator: Combining results across shards

1. Once we distribute an index across shards, we need a component that will search all these shards and combine their results.
2. Written in Python (runs inside a Python web process).
3. Uses the Tornado IO loop to send requests to all shards.
4. The transfer protocol for the requests is JSON-RPC.
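The scatter-gather pattern described above can be sketched as follows. This is not the Federator's code: it uses asyncio from the standard library as a stand-in for the Tornado IO loop, in-memory lists in place of real shard servers, and a JSON-RPC 2.0 style request whose method and parameter names are made up. Merging by raw score relies on scores from different shards being comparable.

```python
import asyncio
import heapq
import json

# Hypothetical stand-ins for shard servers: shard name -> [(score, business_id)]
FAKE_SHARDS = {
    "shard0": [(9.1, "b17"), (4.0, "b03"), (2.5, "b40")],
    "shard1": [(8.7, "b22"), (7.9, "b08")],
    "shard2": [(9.5, "b31"), (1.2, "b99")],
}


def build_request(query, num_hits, request_id):
    """JSON-RPC 2.0 style body; 'search', 'query' and 'num_hits' are assumed names."""
    return json.dumps({
        "jsonrpc": "2.0",
        "method": "search",
        "params": {"query": query, "num_hits": num_hits},
        "id": request_id,
    })


async def search_shard(shard, request_body):
    """Pretend to send the request to one shard and return its best hits."""
    params = json.loads(request_body)["params"]
    await asyncio.sleep(0)  # placeholder for the network round trip
    hits = sorted(FAKE_SHARDS[shard], reverse=True)
    return hits[:params["num_hits"]]


async def federated_search(query, num_hits):
    """Query all shards concurrently, then merge their hits by score."""
    requests = [
        search_shard(shard, build_request(query, num_hits, i))
        for i, shard in enumerate(sorted(FAKE_SHARDS))
    ]
    per_shard = await asyncio.gather(*requests)
    all_hits = [hit for hits in per_shard for hit in hits]
    return heapq.nlargest(num_hits, all_hits)


top = asyncio.run(federated_search("pizza", 3))
```

Here top holds the three highest-scoring hits across all shards, best first. The requests are issued concurrently, so the latency of the whole search is roughly that of the slowest shard rather than the sum over shards.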

Page 33: Lucene at Yelp - By Sudarshan Gaikaiwari

Lucy Server

Page 34: Lucene at Yelp - By Sudarshan Gaikaiwari
Page 35: Lucene at Yelp - By Sudarshan Gaikaiwari
Page 36: Lucene at Yelp - By Sudarshan Gaikaiwari

Tokens to Business Attributes

Page 37: Lucene at Yelp - By Sudarshan Gaikaiwari

Executing queries

1. Gather the top results for a query
2. Collect attribute statistics for attributes like places, categories
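The two steps can be illustrated with a small sketch. The match tuples, attribute values, and the choice to count attributes over all matches (rather than only the top results) are assumptions; the slides do not specify how the statistics are computed.

```python
import heapq
from collections import Counter

# Hypothetical matches for one query: (score, business_id, category, place)
MATCHES = [
    (3.2, "b1", "pizza", "SoMa"),
    (5.1, "b2", "pizza", "Mission"),
    (4.4, "b3", "italian", "Mission"),
    (1.0, "b4", "bars", "SoMa"),
    (2.7, "b5", "pizza", "Mission"),
]


def execute(matches, num_hits):
    """Return the top hits plus per-attribute counts over all matches."""
    top = heapq.nlargest(num_hits, matches)        # step 1: best-scoring results
    categories = Counter(m[2] for m in matches)    # step 2: attribute statistics
    places = Counter(m[3] for m in matches)
    return top, categories, places


top, categories, places = execute(MATCHES, 2)
```

Counts like these are what a results page would typically use to show how many matching businesses fall under each category or neighborhood.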

Page 38: Lucene at Yelp - By Sudarshan Gaikaiwari

Lucene

1. Efficiently executes queries over the index
2. Provides a score for how relevant the business is to the words in the query (word score)
3. Upgrading Lucene to 2.9/3.1 is a work in progress

Page 39: Lucene at Yelp - By Sudarshan Gaikaiwari
Page 40: Lucene at Yelp - By Sudarshan Gaikaiwari

Successive geobounds relaxation

Page 41: Lucene at Yelp - By Sudarshan Gaikaiwari

Successive geobounds relaxation

Page 42: Lucene at Yelp - By Sudarshan Gaikaiwari

Federation

Page 43: Lucene at Yelp - By Sudarshan Gaikaiwari

Efficiently Retrieving top k hits

1. When a user moves through multiple pages, the number of hits to be returned increases

num hits = start + count

2. So if we need to retrieve 500 hits, the naive way would be to retrieve 500 hits from each shard and then sort them
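The naive approach can be sketched as below, assuming each shard returns a list of comparable scores. The function name and the toy data are hypothetical; the point is that every shard is asked for start + count hits even though most of them are discarded after the merge.

```python
import heapq


def naive_page(shard_hits, start, count):
    """Fetch start + count hits from every shard, merge, and slice out one page.

    Returns (page, number_of_hits_fetched_from_shards).
    """
    n = start + count  # num hits = start + count
    candidates = []
    for hits in shard_hits:
        candidates.extend(heapq.nlargest(n, hits))
    merged = heapq.nlargest(n, candidates)
    return merged[start:start + count], len(candidates)


# Toy scores on three hypothetical shards.
shards = [[10, 7, 4, 1], [9, 6, 3], [8, 5, 2]]
page, fetched = naive_page(shards, start=2, count=2)
```

With five shards and 500 hits needed, this scheme would move up to 2,500 hits to the merging process to produce a single page of results.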

Page 44: Lucene at Yelp - By Sudarshan Gaikaiwari

Distribution of hits in shards

Page 45: Lucene at Yelp - By Sudarshan Gaikaiwari
Page 46: Lucene at Yelp - By Sudarshan Gaikaiwari

Probability a hit is in a shard

Page 47: Lucene at Yelp - By Sudarshan Gaikaiwari

Binomial Distribution

Probability that r of the top k hits are in a particular shard

Mean

Variance

Page 48: Lucene at Yelp - By Sudarshan Gaikaiwari

Formula

Std Deviation

Formula
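The formulas on these slides did not survive in the transcript. For reference, these are the standard binomial results, written under the assumption that each of the top k hits falls in a given shard independently with probability p, and with X denoting the number of top k hits in that shard (X is a symbol introduced here):

```latex
\begin{align*}
P(X = r) &= \binom{k}{r}\, p^{r} (1-p)^{k-r} \\
\mathrm{E}[X] &= kp \\
\mathrm{Var}(X) &= kp(1-p) \\
\sigma &= \sqrt{kp(1-p)}
\end{align*}
```

For the values k = 100 and p = 0.2 used on the simulation slide, this gives a mean of 20, a variance of 16, and a standard deviation of 4.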

Page 49: Lucene at Yelp - By Sudarshan Gaikaiwari

Simulation

k = 100, p = 0.2

Hits selected from each shard    Results missed (%)
24                               0.017
32                               0.0001407
44                               0.00000
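A simulation along these lines can be sketched as follows. It assumes 5 equally likely shards (so p = 0.2) and defines "missed" as the share of the true top k hits that are lost because a shard held more of them than the per-shard limit. The slides do not state the exact metric or the number of trials, so the output is not expected to reproduce the figures above. For k = 100 and p = 0.2 the binomial mean is 20 and the standard deviation is 4, so the limits 24, 32, and 44 happen to equal the mean plus 1, 3, and 6 standard deviations.

```python
import random


def simulate_missed_fraction(k, num_shards, per_shard_limit, trials=2000, seed=0):
    """Estimate the fraction of the top k hits lost to a per-shard limit.

    Each of the k hits is placed on a uniformly random shard. A shard holding
    c of them can return at most per_shard_limit, so max(0, c - limit) are missed.
    """
    rng = random.Random(seed)
    missed = 0
    for _ in range(trials):
        counts = [0] * num_shards
        for _hit in range(k):
            counts[rng.randrange(num_shards)] += 1
        missed += sum(max(0, c - per_shard_limit) for c in counts)
    return missed / (k * trials)


# Per-shard limits from the slide, plus the naive limit of k itself.
results = {
    limit: simulate_missed_fraction(100, 5, limit)
    for limit in (24, 32, 44, 100)
}
```

Raising the limit quickly drives the missed fraction toward zero, while the naive limit of k can never miss a hit.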

Page 50: Lucene at Yelp - By Sudarshan Gaikaiwari

Simulation Graph

Page 51: Lucene at Yelp - By Sudarshan Gaikaiwari

Results

1. ~ 50% savings over 100 hits (44 hits requested from each shard)

2. 77% savings over 1000 hits (228 hits requested from each shard)
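The savings figures appear consistent with comparing the per-shard request size against the naive request of k hits per shard; that reading is an assumption, and the number of shards cancels out of the ratio. A quick check:

```python
def savings(k, per_shard_hits):
    """Fraction of hits saved versus requesting k hits from every shard."""
    return 1.0 - per_shard_hits / k


savings_100 = savings(100, 44)     # 44 of 100 requested per shard
savings_1000 = savings(1000, 228)  # 228 of 1000 requested per shard
```

Under this reading the second case works out to about 77%, matching the slide, and the first to about 56%, in line with the rounder "~ 50%" quoted above.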

Page 52: Lucene at Yelp - By Sudarshan Gaikaiwari

Future work

1. In memory index
2. Move towards real time search

Page 53: Lucene at Yelp - By Sudarshan Gaikaiwari

Come Join Us!

Page 54: Lucene at Yelp - By Sudarshan Gaikaiwari

Thank You
