Hacking Lucene and Solr for Fun and Profit
![Page 1: Hacking Lucene and Solr for Fun and Profit](https://reader033.vdocument.in/reader033/viewer/2022051014/54c6640d4a79594b538b46ff/html5/thumbnails/1.jpg)
![Page 2: Hacking Lucene and Solr for Fun and Profit](https://reader033.vdocument.in/reader033/viewer/2022051014/54c6640d4a79594b538b46ff/html5/thumbnails/2.jpg)
HACKING LUCENE AND SOLR FOR FUN AND PROFIT
Grant Ingersoll
CTO, LucidWorks
@gsingers
![Page 3: Hacking Lucene and Solr for Fun and Profit](https://reader033.vdocument.in/reader033/viewer/2022051014/54c6640d4a79594b538b46ff/html5/thumbnails/3.jpg)
Keyword Search is so yesterday
• Search is a system building block
– text is only a part of the story
• If the algorithms fit, use them!
• Embrace fuzziness!
• Scoring features are everywhere
![Page 4: Hacking Lucene and Solr for Fun and Profit](https://reader033.vdocument.in/reader033/viewer/2022051014/54c6640d4a79594b538b46ff/html5/thumbnails/4.jpg)
Lucene and Solr can do…
• Classic: Fast, fuzzy text matching across a large document collection
• Data Quality and Analysis
– Faceting, slicing and dicing of numerical/enumerated data
– Spatial
– Spell checking, record linkage, highlighting
– Stats, Missing fields, etc.
• Top N problems
![Page 5: Hacking Lucene and Solr for Fun and Profit](https://reader033.vdocument.in/reader033/viewer/2022051014/54c6640d4a79594b538b46ff/html5/thumbnails/5.jpg)
Topics
• Search Hacks
• “Trust me, I’m a mathematician”
• “I wish I had thought of that” Hack
![Page 6: Hacking Lucene and Solr for Fun and Profit](https://reader033.vdocument.in/reader033/viewer/2022051014/54c6640d4a79594b538b46ff/html5/thumbnails/6.jpg)
Search Hacks
![Page 7: Hacking Lucene and Solr for Fun and Profit](https://reader033.vdocument.in/reader033/viewer/2022051014/54c6640d4a79594b538b46ff/html5/thumbnails/7.jpg)
Learn IR
• SimpleTextCodec Example
conf.setCodec(new SimpleTextCodec());
File simpleText = new File("simpletext");
directory = new SimpleFSDirectory(simpleText);
writer = new IndexWriter(directory, conf);
index(writer);
• Similarity:
BM25Similarity bm25Similarity = new BM25Similarity();
conf.setSimilarity(bm25Similarity);
• http://www.ibm.com/developerworks/java/library/j-solr-lucene/index.html
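To make the BM25 setting less of a black box, here is a minimal plain-Java sketch of the textbook BM25 score for a single query term. It is independent of Lucene: the class and method names are hypothetical, k1 = 1.2 and b = 0.75 are the commonly cited defaults, and Lucene's own BM25Similarity may differ in details such as how document length is stored.

```java
/** Textbook BM25 for one query term; an illustrative sketch, not Lucene's code. */
final class Bm25Sketch {
    static final double K1 = 1.2;  // term-frequency saturation (commonly cited default)
    static final double B = 0.75;  // strength of document-length normalization

    /** Inverse document frequency: rarer terms get a higher weight. */
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** Score of one term in one document, given collection statistics. */
    static double score(double termFreq, double docLen, double avgDocLen,
                        long docCount, long docFreq) {
        // Documents longer than average are penalized, shorter ones boosted.
        double lengthNorm = 1.0 - B + B * (docLen / avgDocLen);
        // Grows with termFreq but saturates below (K1 + 1).
        double tfPart = termFreq * (K1 + 1.0) / (termFreq + K1 * lengthNorm);
        return idf(docCount, docFreq) * tfPart;
    }

    public static void main(String[] args) {
        // Same term frequency, different document lengths (made-up numbers).
        System.out.println("short doc: " + score(3, 100, 120, 1000, 10));
        System.out.println("long doc:  " + score(3, 400, 120, 1000, 10));
    }
}
```

Two properties are worth noticing when experimenting: repeating a term gives diminishing returns, and the same term frequency counts for less in a longer document.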
![Page 8: Hacking Lucene and Solr for Fun and Profit](https://reader033.vdocument.in/reader033/viewer/2022051014/54c6640d4a79594b538b46ff/html5/thumbnails/8.jpg)
http://localhost:8983/solr/answer?q=what+is+trimethylbenzene&defType=qa&qa=true&qa.qf=body
![Page 9: Hacking Lucene and Solr for Fun and Profit](https://reader033.vdocument.in/reader033/viewer/2022051014/54c6640d4a79594b538b46ff/html5/thumbnails/9.jpg)
Simple QA Workflow
![Page 10: Hacking Lucene and Solr for Fun and Profit](https://reader033.vdocument.in/reader033/viewer/2022051014/54c6640d4a79594b538b46ff/html5/thumbnails/10.jpg)
Analysis
• Split into sentences
– Buffer tokens – see com.tamingtext.texttamer.solr.SentenceTokenizer
• Identify Names using OpenNLP
• Add Entity marker tokens at the same position as original token
– Could also be done with Payloads
• Index
• https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/texttamer/solr
• https://github.com/tamingtext/book/blob/master/apache-solr/solr-qa/conf/schema.xml
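The "marker token at the same position" idea can be sketched without Lucene. In the plain-Java sketch below, a position increment of 0 stacks a marker on top of the word it annotates, which is how a Lucene token stream expresses "same position". The class names, the `NE_` prefix, and the example sentence are all hypothetical, and the entity types are passed in by hand where the real pipeline would call an OpenNLP name finder.

```java
import java.util.ArrayList;
import java.util.List;

final class EntityMarkerSketch {
    /** A token plus its position increment (0 = same position as the previous token). */
    static final class Tok {
        final String text;
        final int posInc;
        Tok(String text, int posInc) { this.text = text; this.posInc = posInc; }
    }

    /**
     * entityTypes[i] is the entity type of words[i], or null when the word is
     * not part of a name. Each named word is followed by a marker token that
     * shares its position.
     */
    static List<Tok> addMarkers(String[] words, String[] entityTypes) {
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            out.add(new Tok(words[i], 1));                   // normal token: advance one position
            if (entityTypes[i] != null) {
                out.add(new Tok("NE_" + entityTypes[i], 0)); // marker: stay at the same position
            }
        }
        return out;
    }

    /** Absolute positions implied by the increments, starting at 0. */
    static List<Integer> positions(List<Tok> toks) {
        List<Integer> out = new ArrayList<>();
        int pos = -1;
        for (Tok t : toks) {
            pos += t.posInc;
            out.add(pos);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] words = {"Grant", "visited", "Berlin"};
        String[] types = {"person", null, "location"};
        List<Tok> toks = addMarkers(words, types);
        List<Integer> pos = positions(toks);
        for (int i = 0; i < toks.size(); i++) {
            System.out.println(pos.get(i) + "\t" + toks.get(i).text);
        }
    }
}
```

Because the marker shares a position with the word, a positional query can ask for an entity of a given type near the query keywords.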
![Page 11: Hacking Lucene and Solr for Fun and Profit](https://reader033.vdocument.in/reader033/viewer/2022051014/54c6640d4a79594b538b46ff/html5/thumbnails/11.jpg)
Search Side
• Custom Query Parser takes in user’s natural language query, classifies it to find the Answer Type and generates Solr query
• Retrieve candidate passages that match keywords and expected answer type
• Unlike keyword search, we need to know exactly where matches occur
• https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/qa
![Page 12: Hacking Lucene and Solr for Fun and Profit](https://reader033.vdocument.in/reader033/viewer/2022051014/54c6640d4a79594b538b46ff/html5/thumbnails/12.jpg)
Answer Type Classification
• Answer Type examples:
– Person (P), Location (L), Organization (O), Time Point (T), Duration (R), Money (M)
– See page 248 for more
• Train an OpenNLP classifier on a set of previously annotated questions, e.g.:
– P Which French monarch reinstated the divine right of the monarchy to France and was known as 'The Sun King' because of the splendour of his reign?
![Page 13: Hacking Lucene and Solr for Fun and Profit](https://reader033.vdocument.in/reader033/viewer/2022051014/54c6640d4a79594b538b46ff/html5/thumbnails/13.jpg)
“Trust me, I’m a mathematician”
![Page 14: Hacking Lucene and Solr for Fun and Profit](https://reader033.vdocument.in/reader033/viewer/2022051014/54c6640d4a79594b538b46ff/html5/thumbnails/14.jpg)
Classification
![Page 15: Hacking Lucene and Solr for Fun and Profit](https://reader033.vdocument.in/reader033/viewer/2022051014/54c6640d4a79594b538b46ff/html5/thumbnails/15.jpg)
kNN and TF/IDF Classification w/ Lucene
https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/classifier/mlt
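As a rough picture of what kNN classification over TF/IDF vectors involves, the plain-Java sketch below weights terms by TF/IDF, ranks labeled training documents by cosine similarity to the input, and lets the top k vote. It does not use Lucene and is not the code from the linked repository. All class and method names, the IDF variant, and the similarity-weighted vote are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class KnnTfIdfSketch {
    private final List<List<String>> docs = new ArrayList<>();
    private final List<String> labels = new ArrayList<>();
    private final Map<String, Integer> docFreq = new HashMap<>();

    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    /** Adds one labeled training document and updates document frequencies. */
    void add(String label, String text) {
        List<String> toks = tokenize(text);
        docs.add(toks);
        labels.add(label);
        for (String t : new HashSet<>(toks)) docFreq.merge(t, 1, Integer::sum);
    }

    /** TF/IDF vector: raw term frequency times a smoothed inverse document frequency. */
    private Map<String, Double> vector(List<String> toks) {
        Map<String, Double> tf = new HashMap<>();
        for (String t : toks) tf.merge(t, 1.0, Double::sum);
        Map<String, Double> vec = new HashMap<>();
        for (Map.Entry<String, Double> e : tf.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            double idf = Math.log(1.0 + (double) docs.size() / (1 + df));
            vec.put(e.getKey(), e.getValue() * idf);
        }
        return vec;
    }

    private static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    /** Returns the label with the largest summed similarity among the k nearest documents. */
    String classify(String text, int k) {
        Map<String, Double> query = vector(tokenize(text));
        List<double[]> scored = new ArrayList<>(); // each entry: {similarity, document index}
        for (int i = 0; i < docs.size(); i++) {
            scored.add(new double[] {cosine(query, vector(docs.get(i))), i});
        }
        scored.sort((x, y) -> Double.compare(y[0], x[0]));
        Map<String, Double> votes = new HashMap<>();
        for (int i = 0; i < Math.min(k, scored.size()); i++) {
            votes.merge(labels.get((int) scored.get(i)[1]), scored.get(i)[0], Double::sum);
        }
        String best = null;
        for (Map.Entry<String, Double> e : votes.entrySet()) {
            if (best == null || e.getValue() > votes.get(best)) best = e.getKey();
        }
        return best;
    }

    public static void main(String[] args) {
        KnnTfIdfSketch knn = new KnnTfIdfSketch();
        knn.add("sports", "the team won the football match");
        knn.add("tech", "lucene is a search library");
        System.out.println(knn.classify("football match tonight", 1));
    }
}
```

A search engine fits this pattern naturally: the ranked hit list is the nearest-neighbor list, so the index does the expensive part.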
![Page 16: Hacking Lucene and Solr for Fun and Profit](https://reader033.vdocument.in/reader033/viewer/2022051014/54c6640d4a79594b538b46ff/html5/thumbnails/16.jpg)
Lucene Classification Module
• Builds a classifier from index information
• See the org.apache.lucene.classification package
• Naïve Bayes Classifier
• kNN Classifier
• Perceptron Classifier
![Page 17: Hacking Lucene and Solr for Fun and Profit](https://reader033.vdocument.in/reader033/viewer/2022051014/54c6640d4a79594b538b46ff/html5/thumbnails/17.jpg)
Recommenders
• Cross recommendation as search
– with search used to build cross recommendation!
• Recommend content to people who exhibit certain behaviors (clicks, query terms, other)
• (Ab)use of a search engine
– but not as a search engine for content
– more like a search engine for behavior
• See Ted Dunning’s talk from Berlin Buzzwords on Multi-modal Recommendation Algorithms
– http://berlinbuzzwords.com/sessions/multi-modal-recommendation-algorithms
• Go get Mahout/Myrrix or just do it in y(our) search engine
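One way to read "a search engine for behavior": each item is indexed with a field of indicator behaviors, and a user's recent history becomes the query. The plain-Java sketch below mimics that with a map and a simple overlap count. The names and data are hypothetical, and a real engine would rank with its own scoring (TF/IDF or BM25) rather than a raw count.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class BehaviorSearchSketch {
    /**
     * indicators: item -> behaviors that signal interest in it (the indexed "document").
     * history:    the user's own behaviors (the "query").
     * Returns up to n items, ranked by how many indicators the history matches.
     */
    static List<String> recommend(Map<String, Set<String>> indicators,
                                  Set<String> history, int n) {
        Map<String, Integer> scores = new HashMap<>();
        for (Map.Entry<String, Set<String>> item : indicators.entrySet()) {
            if (history.contains(item.getKey())) continue; // skip what the user already has
            int overlap = 0;
            for (String behavior : item.getValue()) {
                if (history.contains(behavior)) overlap++;
            }
            if (overlap > 0) scores.put(item.getKey(), overlap);
        }
        List<String> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((a, b) -> {
            int byScore = Integer.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b); // tie-break by name for stable output
        });
        return new ArrayList<>(ranked.subList(0, Math.min(n, ranked.size())));
    }

    public static void main(String[] args) {
        Map<String, Set<String>> indicators = new HashMap<>();
        indicators.put("t3", new java.util.HashSet<>(java.util.Arrays.asList("t1", "t4")));
        indicators.put("t2", new java.util.HashSet<>(java.util.Arrays.asList("t4")));
        Set<String> history = new java.util.HashSet<>(java.util.Arrays.asList("t1", "t4"));
        System.out.println(recommend(indicators, history, 2));
    }
}
```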
![Page 18: Hacking Lucene and Solr for Fun and Profit](https://reader033.vdocument.in/reader033/viewer/2022051014/54c6640d4a79594b538b46ff/html5/thumbnails/18.jpg)
Recommendation Basics
• History:
User  Thing
1     3
2     4
3     4
2     3
3     2
1     1
2     1
![Page 19: Hacking Lucene and Solr for Fun and Profit](https://reader033.vdocument.in/reader033/viewer/2022051014/54c6640d4a79594b538b46ff/html5/thumbnails/19.jpg)
Recommendation Basics
• History as matrix:
    t1  t2  t3  t4
u1  1   0   1   0
u2  1   0   1   1
u3  0   1   0   1
• t1+t3 cooccur 2 times, t1+t4 once, t2+t4 once
![Page 20: Hacking Lucene and Solr for Fun and Profit](https://reader033.vdocument.in/reader033/viewer/2022051014/54c6640d4a79594b538b46ff/html5/thumbnails/20.jpg)
Recommendation Basics
• Cooccurrence
    t1  t2  t3  t4
t1  2   0   2   1
t2  0   1   0   1
t3  2   0   2   1
t4  1   1   1   2

        t3  not t3
t1      2   1
not t1  1   1
• More details at http://lucenerevolution.org/2013/Crowd-sourced-intelligence-built-into-Search-over-Hadoop
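The step from history to cooccurrence counts is small enough to check by hand or in code. The plain-Java sketch below rebuilds the user-by-thing matrix A from the (user, thing) pairs on the history slide and multiplies it by its own transpose (AᵀA). The class and method names are hypothetical. Off-diagonal cells count the users who touched both things, and diagonal cells count the users who touched each thing.

```java
final class CooccurrenceSketch {
    /** (user, thing) pairs from the history slide, 1-based ids. */
    static final int[][] HISTORY = {
        {1, 3}, {2, 4}, {3, 4}, {2, 3}, {3, 2}, {1, 1}, {2, 1}
    };

    /** Binary user-by-thing matrix: a[u][t] == 1 if user u+1 touched thing t+1. */
    static int[][] userThingMatrix(int[][] history, int users, int things) {
        int[][] a = new int[users][things];
        for (int[] pair : history) {
            a[pair[0] - 1][pair[1] - 1] = 1;
        }
        return a;
    }

    /** Cooccurrence matrix A^T A: c[i][j] = number of users who touched both i and j. */
    static int[][] cooccurrence(int[][] a) {
        int things = a[0].length;
        int[][] c = new int[things][things];
        for (int[] user : a) {
            for (int i = 0; i < things; i++) {
                for (int j = 0; j < things; j++) {
                    c[i][j] += user[i] * user[j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int[][] c = cooccurrence(userThingMatrix(HISTORY, 3, 4));
        for (int[] row : c) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```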
![Page 21: Hacking Lucene and Solr for Fun and Profit](https://reader033.vdocument.in/reader033/viewer/2022051014/54c6640d4a79594b538b46ff/html5/thumbnails/21.jpg)
“I wish I had thought of that”
![Page 22: Hacking Lucene and Solr for Fun and Profit](https://reader033.vdocument.in/reader033/viewer/2022051014/54c6640d4a79594b538b46ff/html5/thumbnails/22.jpg)
Time Space Continuum
• Leverage Solr’s new spatial capabilities to index non-spatial data, such as time ranges
– Useful for Open Hours, Shifts, etc.
• Key: multi-valued range data
• Query using rectangle intersections
– q = shift:"Intersects(0 19 23 365)"
• Credits to David Smiley and Hoss…
https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/
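The trick is easier to see with the geometry written out. The plain-Java sketch below assumes each range is indexed as the point (x = start, y = end) and that the rectangle arguments are ordered minX minY maxX maxY. The slide does not spell out either convention, so treat both as assumptions. Under that encoding, "ranges overlapping the window [qStart, qEnd]" becomes "points with start ≤ qEnd and end ≥ qStart", which is a rectangle. All names are hypothetical.

```java
final class TimeRangeAsPointSketch {
    /**
     * Rectangle {minX, minY, maxX, maxY} that contains the point (start, end)
     * of every range overlapping [qStart, qEnd], given the smallest and
     * largest values the field can hold.
     */
    static int[] overlapRectangle(int qStart, int qEnd, int fieldMin, int fieldMax) {
        // start may be anything up to qEnd; end may be anything from qStart on.
        return new int[] {fieldMin, qStart, qEnd, fieldMax};
    }

    /** True if the range [start, end], seen as a point, falls inside the rectangle. */
    static boolean intersects(int start, int end, int[] rect) {
        return start >= rect[0] && start <= rect[2]
            && end >= rect[1] && end <= rect[3];
    }

    public static void main(String[] args) {
        // With a window of [19, 23] on a 0..365 field, this yields the
        // rectangle from the slide's query: 0 19 23 365.
        int[] rect = overlapRectangle(19, 23, 0, 365);
        System.out.println(java.util.Arrays.toString(rect));
        System.out.println(intersects(20, 30, rect)); // a range that starts inside the window
        System.out.println(intersects(1, 10, rect));  // a range that ends before the window
    }
}
```

Because a document can hold several such points in a multi-valued field, it matches when any one of its ranges overlaps the window.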
![Page 23: Hacking Lucene and Solr for Fun and Profit](https://reader033.vdocument.in/reader033/viewer/2022051014/54c6640d4a79594b538b46ff/html5/thumbnails/23.jpg)
Finance Example
[Figure: % change plotted against time, with regions labeled AAPL, MSFT, and IBM]
![Page 24: Hacking Lucene and Solr for Fun and Profit](https://reader033.vdocument.in/reader033/viewer/2022051014/54c6640d4a79594b538b46ff/html5/thumbnails/24.jpg)
Resources
• http://www.manning.com/ingersoll
– http://github.com/tamingtext/book
• http://www.tamingtext.com
• Me:
– @gsingers