© Copyright 2013
Next Generation Search with Lucene and Solr 4
Grant Ingersoll, CTO, LucidWorks
Read More: http://ibm.co/1dJvL9k
© 2013 LucidWorks
Search is Dead, Long Live Search
• Search is Everywhere!
• The Bar is Raised
• Holistic view of the data AND the users is critical
Search is good for…
• Classic: Fast, fuzzy text matching across a large document collection
• NoSQL and de-normalized data: “light” relational
• Top N problems
• Faceting, slicing and dicing of numerical/enumerated data
• Spatial, spell checking, record linkage, highlighting
Lucene: Speed and Memory
• Native Near Real Time (NRT) support
- Per segment
- FieldCache can be controlled to only load new segments
- Soft commit: faster without fsync, allows quicker update visibility
• DWPT (DocumentsWriterPerThread)
- Faster, more consistent indexing speed
• Faster fuzzy & wildcard query processing
• String -> BytesRef
- Much improved data structure
- … means less memory and less garbage collection effort
Up and to the Right
• http://people.apache.org/~mikemccand/lucenebench/indexing.html
Lucene: Flexibility
• Flexible Index Formats
- New posting list codecs: Block, Simple Text, Append (HDFS..), etc.
- Pulsing codec: improves performance of primary key searches, inlining docs, positions, and payloads, saves disk seeks
• Pluggable Scoring
- Decoupled from TF/IDF
- Built-in alternatives include BM25, DFR, and others
» http://en.wikipedia.org/wiki/Okapi_BM25
» http://terrier.org/docs/v3.5/dfr_description.html
- Add your own
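To make the BM25 alternative concrete, here is a minimal sketch of the Okapi BM25 formula in plain Python. It is not Lucene's implementation: the function name, the toy documents, and the whitespace tokenization are illustrative assumptions, and k1=1.2 and b=0.75 are used only as commonly cited defaults.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF; the "1 +" inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy corpus, tokenized on whitespace.
docs = [
    "lucene is a search library".split(),
    "solr is a search server built on lucene".split(),
    "hadoop stores data".split(),
]
scores = bm25_scores(["lucene", "search"], docs)
```

Both of the first two documents contain each query term once, so the shorter one scores higher; the third contains neither term and scores zero.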
FS(A|T)
• Keys:
- byte[] – write-once
- Linear time build of min. automata (n log n if not sorted, which isn’t our case)
- Compression
- Reverse lookups
- Weights (used for auto-suggest)
- Pluggable Algebra
• Uses:
- Term Dictionary, TokenStreams, Japanese, synonyms, spelling, others
- FuzzyQuery is 100x faster: http://bit.ly/hgO65c
• More:
- http://slidesha.re/vKtpVA
- http://bit.ly/Pkjyu0
- “Smaller Representation of Finite State Automata”
» Proc. of the 16th Inter. Conf. on Implementation and Application of Automata, CIAA'2011, vol. 6807, 2011, pp. 118-192.
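For context on what a fuzzy query has to compute, the sketch below shows the brute-force baseline: measure the Levenshtein edit distance between the query and every term in the dictionary. This is deliberately not the automaton-based approach the slide refers to; it only illustrates the matching semantics that an automaton evaluates without scanning every term. The function names and the sample terms are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def fuzzy_terms(query, dictionary, max_edits=2):
    """Return the dictionary terms within max_edits of query, checking every term."""
    return sorted(t for t in dictionary if edit_distance(query, t) <= max_edits)

# Hypothetical term dictionary.
terms = ["lucene", "lucine", "lucent", "solr", "license"]
matches = fuzzy_terms("lucene", terms, max_edits=1)
```

The cost of this baseline grows with the size of the term dictionary, which is the work a precomputed automaton over the terms is meant to avoid.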
Recent Additions
• Replication module
• New Faceting capabilities
• New Suggester to handle infix suggestions
• Analysis Additions
- Norwegian, Scandinavian alternatives
• Memory and FST improvements
Solr 4: New Features
• Search/Faceting/Relevance
- New Relevance Function Queries (tf, df, others)
- Pivot Faceting
- Pseudo-join
- Improved Spatial (more later)
- Full support for Lucene Codecs, pluggable scoring
• Indexing
- New Update Processors, including scripting option
- Near real time
• Codec/Similarity support from Lucene 4
• Other
- New Admin UI
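Pivot faceting and pseudo-join are both driven by request parameters, so a short sketch of how such requests might be assembled can help. The host, collection name, and field names (category, brand, manufacturer_id) are assumptions made up for illustration; only the `facet.pivot` parameter and the `{!join from=... to=...}` query parser syntax come from Solr itself.

```python
from urllib.parse import urlencode

def solr_select_url(base_url, params):
    """Build a URL for the /select handler of a Solr collection."""
    return base_url.rstrip("/") + "/select?" + urlencode(params)

# Assumed local Solr instance and collection name.
BASE = "http://localhost:8983/solr/collection1"

# Pivot faceting: nested counts of brand within each category (hypothetical fields).
pivot_url = solr_select_url(BASE, {
    "q": "*:*",
    "rows": 0,
    "facet": "true",
    "facet.pivot": "category,brand",
})

# Pseudo-join: find docs matching name:ipod, collect their manufacturer_id values,
# and return the docs whose id equals one of those values (hypothetical fields).
join_url = solr_select_url(BASE, {
    "q": "{!join from=manufacturer_id to=id}name:ipod",
})
```

The join returns only the "to" side documents, which is why it is described as a pseudo-join rather than a relational one.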
Geospatial improvements
• Index shapes other than points (circles, polygons, etc.)
• More complex interactions than point in a circle using Well Known Text
• Indexing:
- "geo":"43.17614,-90.57341"
- "geo":"Circle(4.56,1.23 d=0.0710)"
- "geo":"POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"
• Searching:
- fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"
- fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))"
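The filter queries above follow a regular pattern, so they can be generated rather than typed by hand. The sketch below builds the two `Intersects` filters from the slide; the helper names are hypothetical, and the closed-ring check reflects the Well Known Text convention that a polygon's first and last points coincide.

```python
from urllib.parse import urlencode

def intersects_bbox(field, min_x, min_y, max_x, max_y):
    """Filter for shapes intersecting a rectangle given as min/max corners."""
    return f'{field}:"Intersects({min_x} {min_y} {max_x} {max_y})"'

def intersects_polygon(field, points):
    """Filter for shapes intersecting a WKT polygon; points are (x, y) pairs."""
    if len(points) < 4 or points[0] != points[-1]:
        raise ValueError("a WKT polygon ring must be closed (first point == last point)")
    ring = ", ".join(f"{x} {y}" for x, y in points)
    return f'{field}:"Intersects(POLYGON(({ring})))"'

bbox_fq = intersects_bbox("geo", -74.093, 41.042, -69.347, 44.558)
poly_fq = intersects_polygon(
    "geo", [(-10, 30), (-40, 40), (-10, -20), (40, 20), (0, 0), (-10, 30)]
)

# URL-encoded parameters, ready to append to a select request.
query_string = urlencode({"q": "*:*", "fq": bbox_fq})
```

Both helpers reproduce the `fq` values shown on the slide for the `geo` field.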
Scaling Solr
• Distributed/sharded indexing & search
- Auto distributes updates and queries to appropriate shards
- Near Real Time (NRT) indexing capable
• Dynamically scalable
- New SolrCloud instances add indexing and query capacity
- Supports re-balancing
• Reliable
- No single point of failure
- Transactions logged
- Robust, automatic recovery
• http://wiki.apache.org/solr/SolrCloud
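One way to picture "auto distributes updates to appropriate shards" is hash-range routing: each shard owns a slice of a 32-bit hash space, and a document goes to the shard whose slice contains the hash of its id. The sketch below is a simplified illustration, not SolrCloud's code; in particular `zlib.crc32` is a stand-in hash chosen because it ships with Python, and the function names are hypothetical.

```python
import zlib

def shard_ranges(num_shards, hash_bits=32):
    """Split the hash space [0, 2**hash_bits - 1] into contiguous per-shard ranges."""
    size = 2 ** hash_bits
    step = size // num_shards
    ranges = []
    for i in range(num_shards):
        start = i * step
        # The last shard absorbs any remainder so the whole space is covered.
        end = size - 1 if i == num_shards - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges

def route(doc_id, ranges):
    """Return the index of the shard whose range contains the hash of doc_id."""
    h = zlib.crc32(doc_id.encode("utf-8"))  # stand-in hash, unsigned 32-bit
    for shard, (start, end) in enumerate(ranges):
        if start <= h <= end:
            return shard
    raise ValueError("hash outside all shard ranges")

ranges = shard_ranges(4)
assignments = {doc_id: route(doc_id, ranges) for doc_id in ("doc1", "doc2", "doc3")}
```

Because routing depends only on the document id, any node can compute the owning shard for an update or a lookup without consulting the others.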
Solr as NoSQL
• Characteristics
- Non-traditional data stores
- Not designed for SQL-type queries
- Distributed, fault-tolerant architecture
- Document oriented, data format agnostic (JSON, XML, CSV, binary)
• Update durability via transaction log
• Real-time /get fetches latest version w/o hard commit
• Versioning and optimistic locking
- w/ Real Time GET, allows read/write/update w/o conflicts
• Atomic updates
- Can add/remove/change and increment a field in existing doc w/o re-indexing
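Atomic updates and optimistic locking are expressed in the update payload itself, which a small sketch can show. The helper below only builds the JSON body; it does not contact a server. The document id, field names, and version number are made up for illustration, while the `set`/`add`/`inc` modifiers and the `_version_` field are Solr's own.

```python
import json

ALLOWED_OPS = {"set", "add", "inc"}

def atomic_update(doc_id, ops, version=None):
    """Build a JSON body for an atomic update.

    ops maps a field name to an (operation, value) pair. If version is given,
    it is sent as _version_ so the update is rejected when the stored document
    has a different version (optimistic locking).
    """
    doc = {"id": doc_id}
    for field, (op, value) in ops.items():
        if op not in ALLOWED_OPS:
            raise ValueError(f"unsupported atomic operation: {op}")
        doc[field] = {op: value}
    if version is not None:
        doc["_version_"] = version
    return json.dumps([doc])

# Hypothetical document: bump a counter, append a tag, replace the title.
payload = atomic_update(
    "book1",
    {"copies_sold": ("inc", 5), "tags": ("add", "search"), "title": ("set", "Taming Text")},
    version=1234567890,
)
```

A typical flow would be to read the document and its `_version_` through the real-time `/get` handler, build a payload like this one, post it to the update handler as JSON, and retry from the read if the server reports a version conflict.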
Recent Additions
• HDFS backed directory for storing index and transaction logs in Apache Hadoop
• New Core discovery capabilities
• Schemaless/External Schema/Field Guessing
• Schema APIs
• Add documents from the Admin UI
Applications
… Find your Keys, Store Your Content
• Lucene/Solr is a fast key-value store
- Bonus: search your values!
• NoSQL before NoSQL was cool
• Solr: distributed key/value
- Durable, Isolated, Redundant, Fast, Real-time
- Joins, Column Storage
• Solr or Tika + Lucene can index popular office formats
• Solr can backup/replicate and scale as content grows
… Find Love! Upsell! Cross-sell!
• Cross recommendation as search
- with search used to build cross recommendation!
• Recommend content to people who exhibit certain behaviors (clicks, query terms, other)
• (Ab)use of a search engine
- but not as a search engine for content
- more like a search engine for behavior
• See Ted Dunning’s talk from Berlin Buzzwords on Multi-modal Recommendation Algorithms
- http://berlinbuzzwords.com/sessions/multi-modal-recommendation-algorithms
• Go get Mahout/Myrrix or just do it in y(our) search engine
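The "search engine for behavior" idea can be sketched without any search engine at all: store, for each item, a list of items that tend to co-occur with it in user histories, then treat a user's recent history as the query against that list. The version below is a toy under stated assumptions: it ranks by raw co-occurrence counts rather than the statistical tests a production system would likely use, and all names and data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations

def build_indicators(histories, top_k=2):
    """For each item, keep the top_k items that most often co-occur with it."""
    cooc = defaultdict(Counter)
    for items in histories:
        for a, b in permutations(sorted(set(items)), 2):
            cooc[a][b] += 1
    indicators = {}
    for item, counts in cooc.items():
        # Sort by count (descending), then name, so ties are deterministic.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        indicators[item] = [other for other, _ in ranked[:top_k]]
    return indicators

def recommend(user_history, indicators, limit=3):
    """Score unseen items by how many of their indicator items the user has touched."""
    seen = set(user_history)
    scores = {}
    for item, indicator_items in indicators.items():
        if item in seen:
            continue
        overlap = len(seen.intersection(indicator_items))
        if overlap:
            scores[item] = overlap
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:limit]]

# Hypothetical click histories, one list per user.
histories = [
    ["lucene-book", "solr-book", "mahout-book"],
    ["lucene-book", "solr-book"],
    ["solr-book", "hadoop-book"],
    ["lucene-book", "solr-book", "taming-text"],
]
indicators = build_indicators(histories)
recs = recommend(["lucene-book", "solr-book"], indicators)
```

In a search engine, the indicator list would be an indexed field on each item document and `recommend` would be an ordinary query whose terms are the user's history, with the engine's ranking replacing the overlap count.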
… Avoid Delays
… Wibbly-wobbly Timey-wimey Stuff
• Leverage Solr’s new spatial capabilities to index non-spatial data, such as time ranges
- Useful for Open Hours, Shifts, etc.
• Query using rectangle intersections
- q=shift:"Intersects(0 19 23 365)"
https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/
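The rectangle in the query above makes sense once each range is viewed as a 2D point. The sketch below assumes every shift is indexed as the point (start, end) on a 0 to 365 scale; that encoding is an assumption inferred from the query shape, and the function names and sample shifts are hypothetical. A stored range overlaps the query range [19, 23] exactly when start <= 23 and end >= 19, which is the rectangle with x in [0, 23] and y in [19, 365].

```python
def overlap_rectangle(query_start, query_end, min_value=0, max_value=365):
    """Rectangle (min_x, min_y, max_x, max_y) containing every (start, end) point
    whose range overlaps [query_start, query_end]."""
    return (min_value, query_start, query_end, max_value)

def point_in_rectangle(point, rect):
    """True if the (start, end) point lies inside the rectangle, borders included."""
    x, y = point
    min_x, min_y, max_x, max_y = rect
    return min_x <= x <= max_x and min_y <= y <= max_y

def shift_query(field, query_start, query_end):
    """Format the rectangle as an Intersects query on the given field."""
    min_x, min_y, max_x, max_y = overlap_rectangle(query_start, query_end)
    return f'{field}:"Intersects({min_x} {min_y} {max_x} {max_y})"'

# Hypothetical shifts stored as (start, end) points.
shifts = {"early": (1, 18), "mid": (15, 20), "late": (22, 40), "later": (24, 60)}
rect = overlap_rectangle(19, 23)
matching = sorted(name for name, point in shifts.items() if point_in_rectangle(point, rect))
query = shift_query("shift", 19, 23)
```

Here "mid" and "late" overlap the query range, while "early" ends too soon and "later" starts too late; the generated query string matches the one on the slide.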
Summary
• Lucene/Solr 4.x:
- Faster
- More Flexible
- Easier than ever scaling
- More reliable than ever
• If you need to rank a bunch of stuff according to some notion of similarity, a search engine is the way to go
Where to Next?
• Full article: http://ibm.co/1dJvL9k
• http://www.lucidworks.com
• http://lucene.apache.org/
• Training: http://bit.ly/lws-training
• LucidWorks Search (Solr++) more info: http://bit.ly/lws-more-info
• Twitter: @gsingers, @LucidWorks
• Taming Text: http://www.manning.com/ingersoll