Beyond TF-IDF: Why, What & How
TRANSCRIPT
![Page 1: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/1.jpg)
Beyond TF-IDF
Stephen Murtagh
etsy.com
![Page 2: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/2.jpg)
![Page 3: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/3.jpg)
20,000,000 items
![Page 4: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/4.jpg)
1,000,000 sellers
![Page 5: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/5.jpg)
![Page 6: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/6.jpg)
15,000,000 daily searches
80,000,000 daily calls to Solr
![Page 7: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/7.jpg)
Etsy Engineering
• Code as Craft - our engineering blog
• http://codeascraft.etsy.com/
• Continuous Deployment
• https://github.com/etsy/deployinator
• Experiment-driven culture
• Hybrid engineering roles
• Dev-Ops
• Data-Driven Products
![Page 8: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/8.jpg)
Etsy Search
• 2 search clusters: Flip and Flop
• Master -> 20 slaves
• Only one cluster takes traffic
• Thrift (no HTTP endpoint)
• BitTorrent for index replication
• Solr 4.1
• Incremental index every 12 minutes
![Page 9: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/9.jpg)
Beyond TF-IDF
•Why?
•What?
•How?
![Page 10: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/10.jpg)
![Page 11: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/11.jpg)
Luggage tags
“unique bag”
q = unique+bag
![Page 12: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/12.jpg)
q = unique+bag
>
![Page 13: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/13.jpg)
Scoring in Lucene
![Page 14: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/14.jpg)
Scoring in Lucene
Fixed for any given query
constant
![Page 15: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/15.jpg)
Scoring in Lucene
f(term, document)
f(term)
![Page 16: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/16.jpg)
Scoring in Lucene
User content
Only measure rarity
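The formula shown on these slides did not survive extraction. For reference, the practical scoring function documented for Lucene's classic TF-IDF similarity has the shape below. Mapping the slide labels onto its factors is an interpretation rather than text from the slides: queryNorm is fixed for any given query, tf and norm are functions of both term and document (and so are driven by user-written content), and idf is a function of the term alone (it only measures rarity).

```latex
\mathrm{score}(q, d) =
  \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q) \cdot
  \sum_{t \in q} \Big(
    \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^{2} \cdot
    \mathrm{boost}(t) \cdot \mathrm{norm}(t, d)
  \Big)
```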
![Page 17: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/17.jpg)
IDF(“unique”) = 4.429547 > IDF(“bag”) = 4.32836
![Page 18: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/18.jpg)
q = unique+bag
“unique unique bag” > “unique bag bag”
![Page 19: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/19.jpg)
“unique” tells us nothing...
![Page 20: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/20.jpg)
Stop words
• Add “unique” to stop word list?
• What about “handmade” or “blue”?
• Low-information words can still be useful for matching
• ... but harmful for ranking
![Page 21: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/21.jpg)
Why not replace IDF?
![Page 22: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/22.jpg)
Beyond TF-IDF
•Why?
• IDF ignores term “usefulness”
•What?
•How?
![Page 23: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/23.jpg)
![Page 24: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/24.jpg)
What do we replace it with?
![Page 25: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/25.jpg)
Benefits of IDF
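$$
I_1 =
\begin{array}{c|ccccc}
 & \text{doc}_1 & \text{doc}_2 & \text{doc}_3 & \cdots & \text{doc}_n \\ \hline
\text{art} & 2 & 0 & 1 & \cdots & 1 \\
\text{jewelry} & 1 & 3 & 0 & \cdots & 0 \\
\vdots & & & & \ddots & \\
\text{term}_m & 1 & 0 & 1 & \cdots & 0
\end{array}
$$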
![Page 26: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/26.jpg)
Benefits of IDF
$$
\mathrm{IDF}(\text{jewelry}) = 1 + \log\left(\frac{n}{\sum_{d} i_{d,\text{jewelry}}}\right)
$$
![Page 27: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/27.jpg)
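Sharding
$$
I_1 =
\begin{array}{c|ccccc}
 & \text{doc}_1 & \text{doc}_2 & \text{doc}_3 & \cdots & \text{doc}_k \\ \hline
\text{art} & 2 & 0 & 1 & \cdots & 1 \\
\text{jewelry} & 1 & 3 & 0 & \cdots & 0 \\
\vdots & & & & \ddots & \\
\text{term}_m & 1 & 0 & 1 & \cdots & 0
\end{array}
$$
$$
I_2 =
\begin{array}{c|ccccc}
 & \text{doc}_{k+1} & \text{doc}_{k+2} & \text{doc}_{k+3} & \cdots & \text{doc}_n \\ \hline
\text{art} & 6 & 1 & 0 & \cdots & 1 \\
\text{jewelry} & 0 & 1 & 3 & \cdots & 0 \\
\vdots & & & & \ddots & \\
\text{term}_m & 0 & 1 & 1 & \cdots & 0
\end{array}
$$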
![Page 28: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/28.jpg)
Sharding
$$
\mathrm{IDF}(\text{jewelry}) = 1 + \log\left(\frac{n}{\sum_{d} i_{d,\text{jewelry}}}\right)
$$
![Page 29: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/29.jpg)
Sharding
$$
\mathrm{IDF}_1(\text{jewelry}) \neq \mathrm{IDF}_2(\text{jewelry}) \neq \mathrm{IDF}(\text{jewelry})
$$
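A minimal sketch of why per-shard IDF diverges from the global value. It assumes the sum in the IDF formula counts documents that contain the term (document frequency); the two shards and their contents are made up for illustration and are not Etsy data.

```python
import math


def idf(num_docs, doc_freq):
    """IDF in the form used on the slides: 1 + log(n / document frequency)."""
    return 1.0 + math.log(num_docs / doc_freq)


def doc_freq(shard, term):
    """Count documents in a shard whose term count for `term` is non-zero."""
    return sum(1 for doc in shard if doc.get(term, 0) > 0)


# Hypothetical shards: each document maps term -> term count.
shard_1 = [{"art": 2, "jewelry": 1}, {"jewelry": 3}, {"art": 1}, {"art": 1}]
shard_2 = [
    {"art": 6},
    {"art": 1, "jewelry": 1},
    {"jewelry": 3},
    {"art": 2},
    {"art": 1},
    {"art": 4},
]

df_1 = doc_freq(shard_1, "jewelry")
df_2 = doc_freq(shard_2, "jewelry")

# Each shard only sees its own documents, so it computes its own IDF.
idf_1 = idf(len(shard_1), df_1)
idf_2 = idf(len(shard_2), df_2)

# The global value needs statistics from every shard.
idf_global = idf(len(shard_1) + len(shard_2), df_1 + df_2)

print(idf_1, idf_2, idf_global)
```

Because each shard normalizes by its own document count and document frequency, the same term gets a different weight depending on which shard scores it.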
![Page 30: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/30.jpg)
Sharded IDF options
• Ignore it - Shards score differently
• Shards exchange stats - Messy
• Central source distributes IDF to shards
![Page 31: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/31.jpg)
Information Gain
• P(x) - Probability of "x" appearing in a listing
• P(x|y) - Probability of "x" appearing given "y" appears
$$
\mathrm{info}(y) = D\big(P(X \mid y) \,\|\, P(X)\big)
$$
$$
\mathrm{info}(y) = \sum_{x \in X} \log\left(\frac{P(x \mid y)}{P(x)}\right) \cdot P(x \mid y)
$$
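As a rough illustration of the sum above, the sketch below evaluates info(y) from explicit probabilities. All numbers and the term "tote" are hypothetical; the natural logarithm and the convention that terms with P(x|y) = 0 contribute nothing are assumptions, since the slides do not specify either.

```python
import math


def info_gain(p_given_y, p_marginal):
    """Sum of log(P(x|y) / P(x)) * P(x|y) over all terms x.

    Terms with P(x|y) == 0 are skipped (treated as contributing 0).
    """
    total = 0.0
    for term, p_xy in p_given_y.items():
        if p_xy > 0.0:
            total += math.log(p_xy / p_marginal[term]) * p_xy
    return total


# Hypothetical P(x): chance that a term appears in any listing.
p = {"bag": 0.10, "leather": 0.05, "blue": 0.20}

# Hypothetical P(x | "unique"): nearly the same as P(x), so seeing
# "unique" tells us little about the rest of the listing.
p_given_unique = {"bag": 0.11, "leather": 0.05, "blue": 0.21}

# Hypothetical P(x | "tote"): very different from P(x).
p_given_tote = {"bag": 0.80, "leather": 0.30, "blue": 0.20}

low = info_gain(p_given_unique, p)
high = info_gain(p_given_tote, p)
print(low, high)
```

A term whose presence barely changes the distribution of other terms scores near zero, while a term that strongly predicts its neighbours scores high.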
![Page 32: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/32.jpg)
Similar IDF

| Term | Info(x) | IDF |
|---|---|---|
| unique | 0.26 | 4.43 |
| bag | 1.24 | 4.33 |
| pattern | 1.20 | 4.38 |
| original | 0.85 | 4.38 |
| dress | 1.31 | 4.42 |
| man | 0.64 | 4.41 |
| photo | 0.74 | 4.37 |
| stone | 0.92 | 4.35 |
![Page 33: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/33.jpg)
Similar Info Gain

| Term | Info(x) | IDF |
|---|---|---|
| unique | 0.26 | 4.39 |
| black | 0.22 | 3.32 |
| red | 0.22 | 3.52 |
| handmade | 0.20 | 3.26 |
| two | 0.32 | 5.64 |
| white | 0.19 | 3.32 |
| three | 0.37 | 6.19 |
| for | 0.21 | 3.59 |
![Page 34: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/34.jpg)
q = unique+bag
Using IDF
score(“unique unique bag”) > score(“unique bag bag”)
Using information gain
score(“unique unique bag”) < score(“unique bag bag”)
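The flip in ordering can be checked with a simplified scorer. The sketch assumes Lucene's classic defaults of tf = sqrt(frequency) and a squared term weight, and drops coord, queryNorm, and length norm because they are identical for these two three-word titles. Whether the information gain weight was also squared in production is not stated in the slides. The weights are the Info(x) and IDF values for "unique" and "bag" from the "Similar IDF" table.

```python
import math


def score(doc_terms, query_terms, weight):
    """Simplified TF-weighted score: sum of sqrt(tf) * weight(term)^2."""
    total = 0.0
    for term in query_terms:
        freq = doc_terms.count(term)
        total += math.sqrt(freq) * weight[term] ** 2
    return total


query = ["unique", "bag"]
doc_a = ["unique", "unique", "bag"]
doc_b = ["unique", "bag", "bag"]

# Term weights taken from the slide table.
idf = {"unique": 4.43, "bag": 4.33}
info = {"unique": 0.26, "bag": 1.24}

print("IDF:      ", score(doc_a, query, idf), score(doc_b, query, idf))
print("Info gain:", score(doc_a, query, info), score(doc_b, query, info))
```

With IDF the two terms weigh almost the same, so repeating the slightly rarer "unique" wins; with information gain, "bag" carries most of the weight and the title that repeats it ranks higher.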
![Page 35: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/35.jpg)
Beyond TF-IDF
•Why?
• IDF ignores term “usefulness”
•What?
•How?
![Page 36: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/36.jpg)
Beyond TF-IDF
•Why?
• IDF ignores term “usefulness”
•What?
• Information gain accounts for term quality
•How?
![Page 37: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/37.jpg)
![Page 38: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/38.jpg)
Listing Quality
• Performance relative to rank
• Hadoop: logs -> hdfs
• cron: hdfs -> master
• bash: master -> slave
• Loaded as external file field
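Solr's external file field reads per-document values from a plain text file of `key=value` lines stored in the index data directory and named `external_<fieldname>`, optionally with a suffix. The sketch below writes a file in that format; the field name `listing_quality`, the listing ids, the scores, and the date-style suffix are all hypothetical.

```python
import os
import tempfile


def write_external_file(data_dir, field_name, scores, suffix):
    """Write key=value lines, one per document, sorted by document key."""
    path = os.path.join(data_dir, "external_%s.%s" % (field_name, suffix))
    with open(path, "w") as handle:
        for doc_id in sorted(scores):
            handle.write("%s=%s\n" % (doc_id, scores[doc_id]))
    return path


# Hypothetical quality scores keyed by listing id.
scores = {"1003": 0.57, "1001": 0.82, "1002": 0.35}

data_dir = tempfile.mkdtemp()  # stand-in for the Solr data directory
path = write_external_file(data_dir, "listing_quality", scores, "20130501")

with open(path) as handle:
    lines = handle.read().splitlines()
print(lines)
```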
![Page 39: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/39.jpg)
Computing info gain
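$$
I_1 =
\begin{array}{c|ccccc}
 & \text{doc}_1 & \text{doc}_2 & \text{doc}_3 & \cdots & \text{doc}_n \\ \hline
\text{art} & 2 & 0 & 1 & \cdots & 1 \\
\text{jewelry} & 1 & 3 & 0 & \cdots & 0 \\
\vdots & & & & \ddots & \\
\text{term}_m & 1 & 0 & 1 & \cdots & 0
\end{array}
$$
$$
\mathrm{info}(y) = D\big(P(X \mid y) \,\|\, P(X)\big)
$$
$$
\mathrm{info}(y) = \sum_{x \in X} \log\left(\frac{P(x \mid y)}{P(x)}\right) \cdot P(x \mid y)
$$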
![Page 40: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/40.jpg)
Hadoop
• Brute-force
• Count all terms
• Count all co-occurring terms
• Construct distributions
• Compute info gain for all terms
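The brute-force steps above can be sketched in memory as follows. This is not the Hadoop job from the talk: it is a single-process stand-in on a toy corpus, and it assumes terms are counted by presence per listing and that a term is excluded from its own sum. Neither choice is specified in the slides.

```python
import math
from collections import Counter
from itertools import permutations


def info_gain_all(listings):
    """Compute info(y) for every term y from tokenized listings."""
    n = len(listings)
    term_df = Counter()  # listings containing each term
    pair_df = Counter()  # listings containing both y and x, keyed (y, x)
    for listing in listings:
        terms = set(listing)
        term_df.update(terms)                    # count all terms
        pair_df.update(permutations(terms, 2))   # count co-occurring terms

    gains = {}
    for y in term_df:
        total = 0.0
        for x in term_df:
            co = pair_df[(y, x)]
            if x == y or co == 0:
                continue
            p_x_given_y = co / term_df[y]        # construct distributions
            p_x = term_df[x] / n
            total += math.log(p_x_given_y / p_x) * p_x_given_y
        gains[y] = total
    return gains


# Toy corpus: "unique" appears everywhere, "bag" only with related terms.
listings = [
    ["unique", "leather", "bag"],
    ["unique", "silver", "ring"],
    ["unique", "cotton", "dress"],
    ["leather", "bag", "strap"],
    ["silver", "ring", "band"],
    ["cotton", "dress", "pattern"],
]

gains = info_gain_all(listings)
print(gains["unique"], gains["bag"])
```

In this corpus "unique" co-occurs with every other term at exactly its base rate, so its information gain is zero, while "bag" strongly predicts "leather" and "strap".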
![Page 41: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/41.jpg)
File Distribution
• cron copies score file to master
• master replicates file to slaves
infogain=$(find /search/data/ -maxdepth 1 -type f -name 'info_gain.*' -print | sort | tail -n 1)
scp "$infogain" "user@$slave:$infogain"
![Page 42: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/42.jpg)
File Distribution
![Page 43: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/43.jpg)
schema.xml
![Page 44: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/44.jpg)
Beyond TF-IDF
•Why?
• IDF ignores term “usefulness”
•What?
• Information gain accounts for term quality
•How?
• Hadoop + similarity factory = win
![Page 45: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/45.jpg)
Fast Deploys, Careful Testing
• Idea
• Proof of Concept
• Side-By-Side
• A/B test
• 100% Live
![Page 46: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/46.jpg)
Side-by-Side
![Page 47: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/47.jpg)
![Page 48: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/48.jpg)
![Page 49: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/49.jpg)
![Page 50: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/50.jpg)
Relevant != High quality
![Page 51: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/51.jpg)
A/B Test
• Users are randomly assigned to A or B
• A sees IDF-based results
• B sees info gain-based results
![Page 52: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/52.jpg)
A/B Test
• Users are randomly assigned to A or B
• A sees IDF-based results
• B sees info gain-based results
• Small but significant decrease in clicks, page views, etc.
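One common way to implement the assignment step is a deterministic hash of the user id, so a given user always sees the same variant across requests. This is a generic sketch, not Etsy's experiment framework: the slides only say assignment is random, and the experiment name and user id format here are hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment="info_gain_ranking"):
    """Map a user to "A" (IDF) or "B" (info gain) with a stable 50/50 split."""
    key = ("%s:%s" % (experiment, user_id)).encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


# Count how 1000 hypothetical users are split between the variants.
counts = {"A": 0, "B": 0}
for i in range(1000):
    counts[assign_variant("user-%d" % i)] += 1
print(counts)
```

Salting the hash with the experiment name keeps assignments in one experiment independent of assignments in another.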
![Page 53: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/53.jpg)
More homogeneous results
Lower average quality score
![Page 54: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/54.jpg)
Next Steps
![Page 55: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/55.jpg)
Parameter Tweaking...
Rebalance relevancy and quality signals in score
![Page 56: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/56.jpg)
The Future
![Page 57: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/57.jpg)
Latent Semantic Indexing in Solr/Lucene
![Page 58: Beyond tf idf why, what & how](https://reader031.vdocument.in/reader031/viewer/2022022203/587148191a28ab55588b5d8d/html5/thumbnails/58.jpg)
Latent Semantic Indexing
• In TF-IDF, documents are sparse vectors in term space
• LSI re-maps these to dense vectors in “concept” space
• Construct transformation matrix: $T_{r \times m} : \mathbb{R}^{m}_{+} \to \mathbb{R}^{r}$
• Load file at index and query time
• Re-map query and documents
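The re-mapping step amounts to multiplying an r × m matrix by a sparse term vector. In the sketch below the matrix is hand-written for illustration (in LSI it would typically come from a truncated SVD of the term-document matrix), and the four-term vocabulary and two concepts are hypothetical.

```python
import math


def remap(transform, sparse_vector):
    """Multiply an r x m matrix by a sparse vector {term_index: weight}."""
    return [
        sum(row[j] * w for j, w in sparse_vector.items())
        for row in transform
    ]


def cosine(u, v):
    """Cosine similarity of two dense vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical 2 x 4 transform: terms 0 and 1 share one concept,
# terms 2 and 3 share another.
T = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]

doc = {0: 2.0, 2: 1.0}   # sparse term vector for a document
query = {1: 1.0}         # query shares no literal term with the document

doc_c = remap(T, doc)
query_c = remap(T, query)
print(doc_c, query_c, cosine(doc_c, query_c))
```

The query and document have no term in common, so their term-space similarity is zero, yet they are close in concept space because terms 0 and 1 map to the same concept.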