lucene @ yelp
TRANSCRIPT
8/6/2019 Lucene @ Yelp
http://slidepdf.com/reader/full/lucene-yelp 1/54
Bio
1. Over a decade of experience in information retrieval
2. Used IR techniques at Symantec's DLP group
3. Search Engineer at Yelp
Outline
1. Overview of search services at Yelp
2. Federation Motivation
3. Lucy Indexing
4. Lucy Searching
5. Efficiently Retrieving top k hits
The services we provide
Lucy: business search
Lucy also powers phone search
Cathy: she 'talks' a lot
Listsearch: it searches lists....
Reviewsearch: it searches reviews....
DYM: did you really mean that?
Suggest: auto completion
Federation Motivation
Problem
Search is too slow
Hard Disk Seek Latency
Disk seek 10,000,000 ns
Source: Software Engineering Advice from Building Large-Scale Distributed Systems, Jeffrey Dean
RAM read latency
Main memory reference 100 ns
Pinning Index in RAM
● vmtouch
● mlock
● http://hoytech.com/vmtouch/
Problem
Index is too large to fit in memory on a single machine
Geographical sharding
Geographical Sharding drawbacks
1. Cumbersome manual process to determine shard boundary
2. No guarantee that a boundary can be found.
Federation
1. Split index across multiple machines
2. Shard on business id
3. TF-IDF scores from different machines should be comparable
Mapping businesses to shards
1. Assigning businesses to shards
shard = shardlist[hash(business_id) % len(shardlist)]
Problems
1. Involves re-indexing all the businesses if we want to add a new shard
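The re-indexing problem can be made concrete with a small sketch. It assumes a SHA-256 based hash (Python's built-in hash() for strings is salted per process, so it is not stable across runs) and hypothetical business ids and shard names; none of these come from the slides.

```python
import hashlib


def stable_hash(business_id: str) -> int:
    """Deterministic hash of a business id (assumed; the slide only says hash())."""
    return int(hashlib.sha256(business_id.encode("utf-8")).hexdigest(), 16)


def shard_for(business_id: str, shardlist: list) -> str:
    """Modulo assignment: shardlist[hash(business_id) % len(shardlist)]."""
    return shardlist[stable_hash(business_id) % len(shardlist)]


def fraction_moved(business_ids, old_shards, new_shards) -> float:
    """Fraction of businesses whose shard changes when the shard list changes."""
    moved = sum(
        1 for b in business_ids
        if shard_for(b, old_shards) != shard_for(b, new_shards)
    )
    return moved / len(business_ids)


# Hypothetical ids and shard names, for illustration only.
business_ids = [f"biz-{i}" for i in range(10_000)]
old = ["shard0", "shard1", "shard2", "shard3"]
new = old + ["shard4"]
print(f"{fraction_moved(business_ids, old, new):.0%} of businesses change shard")
```

With a uniform hash, a business keeps its shard only when its hash has the same remainder modulo 4 and modulo 5, which holds for about one hash value in five. Going from 4 to 5 shards therefore relocates roughly 80% of the businesses.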
Virtual Nodes
Advantages
1. Flexibility (move vbuckets from one shard to another)
2. Split hot spot shards
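A minimal sketch of the virtual-node idea, under stated assumptions: businesses hash to a fixed number of vbuckets, and a separate table maps each vbucket to a shard. The bucket count, the round-robin initial assignment, and all names here are hypothetical and are not taken from the slides.

```python
import hashlib

NUM_VBUCKETS = 1024  # hypothetical; stays fixed even when shards are added


def vbucket_for(business_id: str) -> int:
    """A business always hashes to the same vbucket, regardless of shard count."""
    digest = hashlib.sha256(business_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_VBUCKETS


def build_vbucket_map(shards):
    """Initial round-robin assignment of vbuckets to shards (an assumption)."""
    return {vb: shards[vb % len(shards)] for vb in range(NUM_VBUCKETS)}


def shard_for(business_id, vbucket_map):
    return vbucket_map[vbucket_for(business_id)]


def move_vbuckets(vbucket_map, vbuckets, target_shard):
    """Return a new map with the given vbuckets reassigned to target_shard."""
    updated = dict(vbucket_map)
    for vb in vbuckets:
        updated[vb] = target_shard
    return updated


vbucket_map = build_vbucket_map(["shard0", "shard1", "shard2", "shard3"])
# Split a hot shard: hand half of shard0's vbuckets to a new shard4.
hot = [vb for vb, shard in vbucket_map.items() if shard == "shard0"]
new_map = move_vbuckets(vbucket_map, hot[: len(hot) // 2], "shard4")
changed = [vb for vb in vbucket_map if vbucket_map[vb] != new_map[vb]]
print(len(changed), "of", NUM_VBUCKETS, "vbuckets moved")
```

Only the businesses in the moved vbuckets need re-indexing. Every other business keeps its shard, because the business-to-vbucket hash never changes.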
Lucy Master Slave Architecture
Separate indexing (masters)
A master for each shard of a service
Searching (slaves)
A slave for every replica of a service
Lucy Indexing
Lucy Searching
Federator: Combining results across shards
1. Once we distribute an index across shards, we need a component that searches all these shards and combines their results.
2. Written in Python (runs inside a Python web process).
3. Uses the Tornado IO loop to send requests to all shards.
4. The transfer protocol for the requests is JSON RPC.
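The fan-out-and-merge pattern described above might look roughly like the following sketch. To keep it runnable with only the standard library, it uses asyncio in place of Tornado and in-memory fake shards in place of real JSON RPC calls. The request payload shape, the shard names, and the hit format are all assumptions.

```python
import asyncio
import heapq
import itertools
import json

# Hypothetical in-memory "shards": each holds (score, business_id) hits.
FAKE_SHARDS = {
    "shard0": [(0.91, "biz-a"), (0.40, "biz-d")],
    "shard1": [(0.87, "biz-b"), (0.12, "biz-e")],
    "shard2": [(0.55, "biz-c")],
}


async def search_shard(shard: str, query: str, num_hits: int) -> list:
    """Stand-in for one JSON RPC call; the payload fields are assumed."""
    request = json.dumps({
        "jsonrpc": "2.0",
        "method": "search",
        "params": {"query": query, "num_hits": num_hits},
        "id": 1,
    })
    await asyncio.sleep(0)  # yield to the event loop, as network IO would
    params = json.loads(request)["params"]  # a real shard would parse this
    # Each shard returns its own best hits, sorted by descending score.
    return sorted(FAKE_SHARDS[shard], reverse=True)[: params["num_hits"]]


async def federated_search(query: str, num_hits: int) -> list:
    """Query every shard concurrently, then merge the sorted result lists."""
    per_shard = await asyncio.gather(
        *(search_shard(shard, query, num_hits) for shard in FAKE_SHARDS)
    )
    merged = heapq.merge(*per_shard, reverse=True)
    return list(itertools.islice(merged, num_hits))


results = asyncio.run(federated_search("pizza", 3))
print(results)
```

Merging by raw score only works if scores from different shards are comparable, which is why the federation design asks for comparable TF-IDF scores across machines.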
Tokens to Business Attributes
Executing queries
1. Gather the top results for a query
2. Collect attribute statistics for attributes like places, categories
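As an illustration of these two steps, the sketch below ranks matching documents and counts attribute values in one pass over the matches. The document fields, the sample data, and the choice to count over all matches rather than only the top hits are assumptions; the slides do not specify them.

```python
from collections import Counter

# Hypothetical documents matching one query on one shard.
matches = [
    {"id": "biz-a", "score": 0.91, "place": "SoMa", "category": "pizza"},
    {"id": "biz-b", "score": 0.87, "place": "Mission", "category": "pizza"},
    {"id": "biz-c", "score": 0.55, "place": "SoMa", "category": "italian"},
    {"id": "biz-d", "score": 0.40, "place": "SoMa", "category": "pizza"},
]


def execute_query(matches, num_hits, attributes=("place", "category")):
    """Return the top hits plus per-attribute value counts over all matches."""
    top = sorted(matches, key=lambda doc: doc["score"], reverse=True)[:num_hits]
    stats = {
        attr: Counter(doc[attr] for doc in matches) for attr in attributes
    }
    return top, stats


top, stats = execute_query(matches, num_hits=2)
print([doc["id"] for doc in top])
print(stats["place"].most_common(), stats["category"].most_common())
```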
Successive geobounds relaxation
Federation
Efficiently Retrieving top k hits
1. When a user moves through multiple pages, the number of hits to be retrieved increases:
num hits = start + count
2. So if we need to retrieve 500 hits, the naive way would be to retrieve 500 hits from each shard and then sort them
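The naive approach can be sketched as follows, assuming five hypothetical shards whose hits are plain numeric scores. The function fetches start + count hits from every shard, sorts the union, and slices out the requested page.

```python
import heapq


def naive_page(shard_hits, start, count):
    """Fetch start + count hits from every shard, sort, and slice one page."""
    num_hits = start + count
    candidates = []
    for hits in shard_hits:
        # Each shard contributes its own top num_hits scores.
        candidates.extend(heapq.nlargest(num_hits, hits))
    page = sorted(candidates, reverse=True)[start:start + count]
    return page, len(candidates)


# Hypothetical data: scores 0..4999 spread round-robin over 5 shards.
shard_hits = [[i + 5 * j for j in range(1000)] for i in range(5)]
page, fetched = naive_page(shard_hits, start=490, count=10)
print(page)
print(fetched, "hits fetched to serve", len(page))
```

Here 2,500 hits cross the network to serve a page of 10. That overhead is what the rest of this section tries to reduce.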
Distribution of hits in shards
Probability a hit is in a shard
Binomial Distribution
Probability that r of the top k hits are in a particular shard
Mean
Variance
Formula
Std Deviation
Formula
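The formulas on these slides did not survive in the transcript. What follows is the standard binomial model, not the slide's own notation. It assumes each of the top k hits falls in a given shard independently with probability p, and X denotes the number of top-k hits in that shard.

```latex
P(X = r) = \binom{k}{r}\, p^{r} (1-p)^{k-r}, \qquad
\mu = kp, \qquad
\sigma^{2} = kp(1-p), \qquad
\sigma = \sqrt{kp(1-p)}
```

For k = 100 and p = 0.2 this gives a mean of 20, a variance of 16, and a standard deviation of 4. The per-shard hit counts in the simulation (24, 32, 44) equal the mean plus 1, 3, and 6 standard deviations. This suggests, though the transcript does not confirm it, that the number of hits requested from each shard takes the form mean + c × standard deviation.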
Simulation
k = 100, p = 0.2
Formula    Hits selected from each shard    Results missed (%)
           24                               0.017
           32                               0.0001407
           44                               0.00000
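A simulation along these lines could be sketched as below. The slides do not describe the methodology, so the sketch makes its own assumptions: p = 0.2 is modelled as 5 equally likely shards, each of the k top hits is assigned to a shard uniformly at random, and a hit counts as missed when its shard holds more top-k hits than were requested from it. The function reports the missed share as a fraction of all top-k hits and need not reproduce the figures in the table exactly.

```python
import random
from collections import Counter


def simulate_missed_fraction(k, num_shards, per_shard, trials=2000, seed=0):
    """Estimate the fraction of true top-k hits lost when only per_shard
    hits are requested from each shard."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    missed = 0
    for _ in range(trials):
        # Assign each of the k top hits to a shard uniformly at random.
        counts = Counter(rng.randrange(num_shards) for _ in range(k))
        # Hits beyond per_shard in any one shard are never returned.
        missed += sum(max(0, c - per_shard) for c in counts.values())
    return missed / (k * trials)


for per_shard in (24, 32, 44):
    fraction = simulate_missed_fraction(k=100, num_shards=5, per_shard=per_shard)
    print(per_shard, f"{fraction:.6f}")
```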
Simulation Graph
Results
1. ~50% savings over 100 hits (44 hits requested from each shard)
2. 77% savings over 1000 hits (228 hits requested from each shard)
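One way to read these savings figures, offered as an interpretation rather than the speaker's stated calculation, is to compare the hits requested per shard with the k hits the naive approach would request from each shard:

```python
def savings(k: int, per_shard: int) -> float:
    """Fraction of per-shard hits saved relative to requesting all k."""
    return 1 - per_shard / k


print(f"{savings(100, 44):.0%}")    # 56%
print(f"{savings(1000, 228):.1%}")  # 77.2%
```

Under this reading the second figure matches the quoted 77%, and the first comes out somewhat above the quoted ~50%.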
Future work
1. In memory index
2. Move towards real time search
Thank You