Lucene @ Yelp Sudarshan Gaikaiwari

Upload: lucidimagination

Post on 07-Apr-2018

Page 1: Lucene @ Yelp

8/6/2019 Lucene @ Yelp

http://slidepdf.com/reader/full/lucene-yelp 1/54

Bio

1. Over a decade of experience in information retrieval
2. Used IR techniques at Symantec's DLP group
3. Search Engineer at Yelp

Outline

1. Overview of search services at Yelp
2. Federation Motivation
3. Lucy Indexing
4. Lucy Searching
5. Efficiently Retrieving top k hits

The services we provide

Lucy: business search

Lucy also powers phone search

Cathy: she 'talks' a lot

Listsearch: it searches lists....

Reviewsearch: it searches reviews....

DYM: did you really mean that?

Suggest: auto completion

Federation Motivation

Problem

Search is too slow

Hard Disk Seek Latency

Disk seek 10,000,000 ns

Source: "Software Engineering Advice from Building Large-Scale Distributed Systems", Jeffrey Dean

RAM read latency

Main memory reference 100 ns
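Put side by side, the disk seek and main memory figures differ by five orders of magnitude. A quick check of that arithmetic, using only the two numbers quoted on the slides (the constant and function names are illustrative):

```python
DISK_SEEK_NS = 10_000_000  # disk seek latency quoted on the slide
RAM_REF_NS = 100           # main memory reference latency quoted on the slide


def slowdown(disk_ns, ram_ns):
    """How many memory references fit in the time of one disk seek."""
    return disk_ns / ram_ns


ratio = slowdown(DISK_SEEK_NS, RAM_REF_NS)
print(f"one disk seek costs about {ratio:,.0f} memory references")
```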

Pinning Index in RAM

● vmtouch
● mlock
● http://hoytech.com/vmtouch/

Problem

Index is too large to fit in memory on a single machine

Geographical sharding

 

Geographical Sharding drawbacks

1. Cumbersome manual process to determine shard boundary
2. No guarantee that a boundary can be found.

Federation

1. Split index across multiple machines
2. Shard on business id
3. TF-IDF scores from different machines should be comparable
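Why comparability is a concern can be seen with a small sketch. It assumes a classic Lucene-style IDF, 1 + ln(numDocs / (docFreq + 1)), and invented per-shard statistics. The slides do not say how Lucy keeps scores comparable, so summing statistics across shards is shown only as one possible approach.

```python
import math


def idf(num_docs, doc_freq):
    """Classic Lucene-style IDF: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical statistics for one term: (docs in shard, docs containing the term).
shard_stats = {"shard_a": (1000, 9), "shard_b": (1000, 99)}

# Shard-local IDF: the same term gets a different weight on each shard,
# so scores computed on different machines are not directly comparable.
local_idf = {name: idf(n, df) for name, (n, df) in shard_stats.items()}

# Global IDF: sum the statistics first, then compute a single weight.
total_docs = sum(n for n, _ in shard_stats.values())
total_df = sum(df for _, df in shard_stats.values())
global_idf = idf(total_docs, total_df)

print(local_idf, global_idf)
```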

Mapping businesses to shards

1. Assigning businesses to shards

shard = shardlist[hash(business_id) % len(shardlist)]

Problems

1. Involves re-indexing all the businesses if we want to add a new shard
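The cost of adding a shard under the modulo scheme can be estimated with a short sketch. The business ids and shard names are invented, and zlib.crc32 stands in for hash() because Python's built-in string hash is not stable across processes.

```python
import zlib


def shard_for(business_id, shardlist):
    # Same shape as the slide's formula, with crc32 as a stable hash.
    return shardlist[zlib.crc32(business_id.encode()) % len(shardlist)]


def fraction_moved(ids, old_shards, new_shards):
    """Fraction of ids whose shard changes between two shard lists."""
    moved = sum(shard_for(i, old_shards) != shard_for(i, new_shards) for i in ids)
    return moved / len(ids)


ids = [f"biz-{n}" for n in range(10_000)]  # hypothetical business ids
four = ["s0", "s1", "s2", "s3"]
five = four + ["s4"]
moved = fraction_moved(ids, four, five)
print(f"{moved:.0%} of businesses change shard when going from 4 to 5 shards")
```

With a uniform hash h, a business keeps its shard only when h % 4 == h % 5, which holds for 4 of every 20 hash values, so roughly four out of five businesses have to be re-indexed on a different shard.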

Virtual Nodes

 

Advantages

1. Flexibility (move vbuckets from one shard to another)
2. Split hot spot shards
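A minimal sketch of the vbucket idea, assuming a fixed number of virtual buckets and a small lookup table from vbucket to shard. The bucket count, shard names and helper functions are all hypothetical and are not taken from Lucy.

```python
import zlib

NUM_VBUCKETS = 64  # hypothetical; fixed for the lifetime of the index


def vbucket_for(business_id):
    # The business -> vbucket mapping never changes.
    return zlib.crc32(business_id.encode()) % NUM_VBUCKETS


# vbucket -> shard table: the only thing that changes when rebalancing.
vbucket_to_shard = {vb: f"s{vb % 4}" for vb in range(NUM_VBUCKETS)}


def shard_for(business_id):
    return vbucket_to_shard[vbucket_for(business_id)]


def move_vbucket(vb, new_shard):
    vbucket_to_shard[vb] = new_shard


ids = [f"biz-{n}" for n in range(10_000)]  # hypothetical business ids
before = {i: shard_for(i) for i in ids}

# Split hot shard "s0": hand every second one of its vbuckets to a new shard "s4".
hot_vbuckets = [vb for vb, shard in vbucket_to_shard.items() if shard == "s0"]
for vb in hot_vbuckets[::2]:
    move_vbucket(vb, "s4")

after = {i: shard_for(i) for i in ids}
moved = [i for i in ids if before[i] != after[i]]
print(f"{len(moved) / len(ids):.0%} of businesses moved, all from s0 to s4")
```

Only the businesses in the moved vbuckets need re-indexing; every other business stays where it was.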

Lucy Master Slave Architecture

Separate indexing (masters)

A master for each shard of a service

Searching (slaves)

A slave for every replica of a service

Lucy Indexing

 

Lucy Searching

 

Federator: Combining results across shards

1. Once we distribute an index across shards, we need a component that searches all of these shards and combines their results.
2. Written in Python (runs inside a Python web process).
3. Uses the Tornado IO loop to send requests to all shards.
4. The transfer protocol for the requests is JSON-RPC.
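The fan-out and merge pattern described above can be sketched as follows. This is not the Federator's code: the standard library's asyncio stands in for the Tornado IO loop, an in-memory dictionary stands in for the JSON-RPC calls, and the shard contents and function names are invented.

```python
import asyncio
import heapq

# Hypothetical shard contents: each shard returns (score, business_id)
# pairs already sorted by descending score.
SHARDS = {
    "s0": [(9.1, "b7"), (4.0, "b2")],
    "s1": [(8.5, "b4"), (7.2, "b9")],
    "s2": [(6.3, "b1")],
}


async def search_shard(name, query, num_hits):
    # Stand-in for a JSON-RPC request to one shard.
    await asyncio.sleep(0)
    return SHARDS[name][:num_hits]


async def federated_search(query, num_hits):
    # Send the request to every shard concurrently and wait for all replies.
    per_shard = await asyncio.gather(
        *(search_shard(name, query, num_hits) for name in SHARDS)
    )
    # Each reply is sorted, so a k-way merge yields the global ordering.
    merged = heapq.merge(*per_shard, key=lambda hit: hit[0], reverse=True)
    return list(merged)[:num_hits]


top = asyncio.run(federated_search("pizza", 3))
print(top)
```

Merging by raw score is only meaningful if scores from different shards are comparable, which is why the Federation slide lists that as a requirement.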

Tokens to Business Attributes

Executing queries

1. Gather the top results for a query
2. Collect attribute statistics for attributes like places, categories
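The two steps can be illustrated with a small sketch. The hit records and field names are invented, and the sketch assumes the statistics are counted over all matching businesses rather than only the top results, which the slides do not specify.

```python
from collections import Counter

# Hypothetical matching businesses for one query.
hits = [
    {"id": "b1", "score": 7.5, "place": "SOMA", "categories": ["pizza", "italian"]},
    {"id": "b2", "score": 9.0, "place": "Mission", "categories": ["pizza"]},
    {"id": "b3", "score": 3.2, "place": "SOMA", "categories": ["bars"]},
]


def execute(hits, k):
    # Step 1: gather the top k results by score.
    top = sorted(hits, key=lambda h: h["score"], reverse=True)[:k]
    # Step 2: collect attribute statistics over every matching business.
    places = Counter(h["place"] for h in hits)
    categories = Counter(c for h in hits for c in h["categories"])
    return top, places, categories


top, places, categories = execute(hits, k=2)
print([h["id"] for h in top], places, categories)
```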

Successive geobounds relaxation

 

Successive geobounds relaxation

 

Federation

 

Efficiently Retrieving top k hits

1. When a user moves through multiple pages, the number of hits to be returned increases

num hits = start + count

2. So if we need to retrieve 500 hits, the naive way would be to retrieve 500 hits from each shard and then sort them
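A sketch of the naive approach, assuming five hypothetical shards that hold bare integer scores sorted in descending order. It shows how many candidates the federator has to pull in to serve the page that starts at hit 490.

```python
import heapq


def naive_page(shards, start, count):
    num_hits = start + count  # as on the slide
    # Naive: ask every shard for num_hits hits, so the federator
    # handles up to len(shards) * num_hits candidates.
    candidates = [hit for shard in shards for hit in shard[:num_hits]]
    top = heapq.nlargest(num_hits, candidates)
    return top[start:], len(candidates)


# Five hypothetical shards of 1000 scores each, sorted descending.
shards = [list(range(5000 - i, 0, -5)) for i in range(5)]
page, fetched = naive_page(shards, start=490, count=10)
print(f"fetched {fetched} candidates to return {len(page)} hits")
```

Here 2500 candidates are fetched and sorted to return a page of 10 hits.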

Distribution of hits in shards

 

Probability a hit is in a shard

Binomial Distribution

Probability that r of the top k hits are in a particular shard

Mean

Variance

Formula

Std Deviation

Formula
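For reference, these are the standard binomial results for this setup. Here k is the number of top hits and p is the probability that a given hit lies in a particular shard; taking p = 1/n for n equally likely shards is an assumption, though it is consistent with the p = 0.2 used in the simulation.

```latex
P(X = r) = \binom{k}{r}\, p^{r} (1-p)^{k-r}, \qquad
\mu = kp, \qquad
\sigma^{2} = kp(1-p), \qquad
\sigma = \sqrt{kp(1-p)}
```

For k = 100 and p = 0.2 this gives a mean of 20 hits per shard and a standard deviation of 4.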

 

Simulation

Formula    Hits selected from each shard (k = 100, p = 0.2)    Results missed (%)
           24                                                   0.017
           32                                                   0.0001407
           44                                                   0.00000
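The three cut-offs coincide with kp + σ, kp + 3σ and kp + 6σ for k = 100 and p = 0.2 (kp = 20, σ = 4), which may be what the Formula column held. A simulation of this kind can be sketched as below. The slides do not define "missed" precisely, so the sketch assumes it means the share of the true top k hits that are lost when each shard returns only its best per_shard hits, and it assumes five equally likely shards. It is not expected to reproduce the table exactly.

```python
import random


def missed_fraction(k, num_shards, per_shard, trials, seed=0):
    """Estimate the fraction of the true top-k hits lost when each
    shard returns only its best `per_shard` hits."""
    rng = random.Random(seed)
    missed = 0
    for _ in range(trials):
        # Place each of the top-k hits on a uniformly random shard.
        counts = [0] * num_shards
        for _ in range(k):
            counts[rng.randrange(num_shards)] += 1
        # A shard holding more than per_shard top hits drops the excess.
        missed += sum(max(c - per_shard, 0) for c in counts)
    return missed / (k * trials)


for n in (24, 32, 44):
    print(n, missed_fraction(k=100, num_shards=5, per_shard=n, trials=2000))
```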

Simulation Graph

Results

1. ~ 50% savings over 100 hits (44 hits requested from each shard)
2. 77% savings over 1000 hits (228 hits requested from each shard)
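The savings figures follow from comparing the hits requested per shard with the naive k hits per shard. A small check of the arithmetic, assuming "savings" means the reduction in hits transferred (the function name is illustrative):

```python
def savings(k, per_shard):
    """Reduction in hits fetched when each shard returns per_shard
    hits instead of k. The number of shards cancels out."""
    return 1 - per_shard / k


print(f"{savings(100, 44):.0%}")    # 44 hits per shard instead of 100
print(f"{savings(1000, 228):.1%}")  # 228 hits per shard instead of 1000
```

Under that assumption the first case works out to 56%, in line with the rounded "~ 50%", and the second to 77.2%.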

Future work

1. In memory index
2. Move towards real time search

Thank You

[email protected]