What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchange DC


Upload: lucidworks-archived

Post on 27-Jan-2015


TRANSCRIPT

Page 1: What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchange DC

What’s new in Lucene and Solr?

Grant Ingersoll
CTO, LucidWorks
Lucene/Solr Committer

Page 2

Sink or Swim?

Page 3

Search is good for…
• Traditional: fast, fuzzy text matching across a large document collection
• De-normalized data
  – “Light” relational
• Top N problems
  – Key-value (top 1)
  – Recommendations, “good enough” classification, clustering
• Faceting, slicing and dicing of numerical/enumerated data
• Spatial, spell checking, record linkage, highlighting
• NoSQL

Page 4

What’s New?

• Community

• Lucene

• Solr

Page 5

Relax, You’re Among Friends
• Large, diverse search community with many non-traditional search engine usages
  – Object stores, record linkage, social, mobile -> web
• “The Apache Way”
  – Meritocracy – Those who do, decide!
• Always Be Testing
  – Randomized system tests are all the rage
  – http://vimeo.com/32087114
• Patches Welcome!

Page 6

Acceleration!

Page 7

Coming Soon: Lucene and Solr 4.8

Java 1.7

Page 8
Page 9

Lucene: Speed and Memory
• Native Near Real Time (NRT) support
  – Per segment
  – FieldCache can be controlled to only load new segments
  – Soft commit: faster without fsync, allows quicker update visibility
• DWPT (DocumentsWriterPerThread)
  – Faster, more consistent indexing speed
• Faster fuzzy & wildcard query processing
• Automatic compression of stored fields and term vectors
• String -> BytesRef
  – Much improved data structure
  – … means less memory and less garbage collection effort
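The soft commit bullet can be illustrated from the Solr side of the stack. The sketch below only builds update URLs and sends nothing. The host, port, and collection name (`localhost:8983`, `collection1`) are assumptions; `softCommit` and `commit` are request parameters of Solr's update handler.

```python
from urllib.parse import urlencode

# Assumed base URL; adjust host and collection for a real deployment.
SOLR_UPDATE = "http://localhost:8983/solr/collection1/update"


def commit_url(soft=True):
    """Build an update URL that asks for a soft or a hard commit.

    A soft commit makes new documents searchable without forcing an fsync;
    a hard commit also flushes index files durably to disk.
    """
    param = "softCommit" if soft else "commit"
    return SOLR_UPDATE + "?" + urlencode({param: "true"})


soft_url = commit_url(soft=True)
hard_url = commit_url(soft=False)
print(soft_url)
print(hard_url)
```

In practice the trade-off is usually configured once (automatic soft commits at a short interval, hard commits at a longer one) rather than requested on every update.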

Page 10

Lucene: Flexibility
• Flexible Index Formats
  – New posting list codecs: Block, Simple Text, HDFS, etc.
  – Pulsing codec: improves performance of primary key searches by inlining docs, positions, and payloads, which saves disk seeks
• Pluggable Scoring
  – Decoupled from TF/IDF
  – Built-in alternatives include BM25, DFR, and others
    • http://en.wikipedia.org/wiki/Okapi_BM25
    • http://terrier.org/docs/v3.5/dfr_description.html
  – Add your own
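To make the BM25 alternative concrete, here is a minimal sketch of the textbook Okapi BM25 per-term score, using the non-negative IDF variant described on the Wikipedia page linked above. It is not Lucene's implementation; the function name, the defaults `k1=1.2` and `b=0.75`, and the sample numbers are assumptions chosen for illustration.

```python
import math


def bm25_term(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score one query term against one document with textbook BM25.

    tf          -- occurrences of the term in the document
    doc_len     -- length of the document in tokens
    avg_doc_len -- average document length in the collection
    n_docs      -- number of documents in the collection
    doc_freq    -- number of documents containing the term
    """
    # Rare terms get a higher inverse document frequency.
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalisation: longer-than-average documents are penalised.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    # Term frequency saturates: the fraction approaches (k1 + 1) from below.
    return idf * tf * (k1 + 1.0) / (tf + norm)


once = bm25_term(tf=1, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=10)
five = bm25_term(tf=5, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=10)
long_doc = bm25_term(tf=1, doc_len=200, avg_doc_len=100, n_docs=1000, doc_freq=10)
print(once, five, long_doc)
```

Unlike plain TF/IDF, repeated occurrences of a term give diminishing returns, and `b` controls how strongly document length counts against a match.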

Page 11

FS(A|T)
• Keys:
  – byte[] – write-once
  – Linear time build of min. automata
  – Compression, reverse lookups
  – Weights (used for auto-suggest)
  – Pluggable algebra
• Uses:
  – Term dictionary, TokenStreams, Japanese, synonyms, spelling, others
  – FuzzyQuery is 100x faster -- http://bit.ly/hgO65c
• More:
  – http://slidesha.re/vKtpVA, http://bit.ly/Pkjyu0
  – “Smaller Representation of Finite State Automata”
    • Proc. of the 16th Inter. Conf. on Implementation and Application of Automata, CIAA'2011, vol. 6807, 2011, pp. 118-192.
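The "weights (used for auto-suggest)" idea can be sketched with a much simpler structure than an FST. The example below is a plain weighted prefix trie, not a minimal automaton: it shares prefixes only, whereas Lucene's FSTs also share suffixes and carry outputs on arcs. The class name `SuggestTrie` and the sample keys are hypothetical.

```python
import heapq


class SuggestTrie:
    """Weighted prefix trie: a simplified stand-in for an FST-based suggester."""

    def __init__(self):
        self.children = {}
        self.weight = None  # set only on nodes that end a complete key

    def add(self, key, weight):
        node = self
        for ch in key:
            node = node.children.setdefault(ch, SuggestTrie())
        node.weight = weight

    def suggest(self, prefix, k=3):
        # Walk down to the node that represents the prefix.
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Collect every completed key below that node with its weight.
        found = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.weight is not None:
                found.append((current.weight, text))
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        # Return the k heaviest completions, heaviest first.
        return [text for _, text in heapq.nlargest(k, found)]


trie = SuggestTrie()
for key, weight in [("solr", 10), ("solrcloud", 7), ("sort", 3), ("lucene", 9)]:
    trie.add(key, weight)
print(trie.suggest("so", k=2))
```

A real weighted FST can find the top completions without visiting every key under the prefix; this sketch trades that efficiency for brevity.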

Page 12

Grab Bag
• Lots of new suggesters
  – Available in Solr
• Doc Values
  – Column-oriented store
  – Numeric and binary variants are updatable (coming to Solr soon)
• Overhauled term vectors APIs
  – Now look a lot like Terms
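The "column-oriented store" description of Doc Values can be pictured with a toy model. The sketch below is pure Python and says nothing about Lucene's on-disk encoding; the field names and sample documents are made up. It shows why a per-field column indexed by document number suits sorting and faceting: only the one field is touched.

```python
from collections import Counter

# Row-oriented view: each document carries all of its fields (sample data).
docs = [
    {"title": "a", "price": 30, "category": "books"},
    {"title": "b", "price": 10, "category": "music"},
    {"title": "c", "price": 50, "category": "books"},
    {"title": "d", "price": 20, "category": "games"},
]

# Column-oriented view: one array per field, indexed by document number.
price_column = [doc["price"] for doc in docs]
category_column = [doc["category"] for doc in docs]


def top_docs(column, n):
    """Return the document numbers with the n largest values in a column."""
    order = sorted(range(len(column)), key=column.__getitem__, reverse=True)
    return order[:n]


def facet_counts(column):
    """Count how many documents carry each distinct value."""
    return Counter(column)


print(top_docs(price_column, 2))
print(facet_counts(category_column))
```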

Page 13
Page 14

Solr 4: New Features
• Search/Faceting/Relevance
  – New Relevance Function Queries (tf, df, others)
  – Pivot Faceting
  – Pseudo-join
  – Improved Spatial (more later)
  – Full support for Lucene Codecs, pluggable scoring
• Indexing
  – New Update Processors, including scripting option
  – Near real time
• Schema and Config APIs + Schemaless
• Cursors (aka Deep Paging)
• Admin UI
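As an illustration of the cursor (deep paging) feature, the sketch below follows the `cursorMark` protocol: start with `cursorMark=*`, sort on a clause that includes the unique key, and stop once the returned `nextCursorMark` equals the one that was sent. The `fetch` callable is a hypothetical stand-in for an HTTP call to Solr, and `fake_fetch` simulates it in memory using integer offsets as cursor marks (real marks are opaque strings).

```python
def iterate_cursor(fetch, query="*:*", sort="id asc", rows=2):
    """Yield every matching document by following cursor marks."""
    cursor = "*"  # the documented starting value
    while True:
        params = {"q": query, "sort": sort, "rows": rows, "cursorMark": cursor}
        response = fetch(params)
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:  # unchanged mark: no more results
            break
        cursor = next_cursor


def fake_fetch(params):
    """In-memory stand-in for Solr (assumption: five documents, ids 0..4)."""
    all_docs = [{"id": str(i)} for i in range(5)]
    mark = params["cursorMark"]
    start = 0 if mark == "*" else int(mark)
    page = all_docs[start:start + params["rows"]]
    next_mark = str(start + len(page)) if page else mark
    return {"response": {"docs": page}, "nextCursorMark": next_mark}


ids = [doc["id"] for doc in iterate_cursor(fake_fetch)]
print(ids)
```

Because each request resumes from the last sort position instead of skipping `start` rows, the cost per page stays roughly constant however deep the client goes.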

Page 15

Geospatial improvements
• Index shapes other than points (circles, polygons, etc.)
• More complex interactions than point in a circle
• Indexing:
  – "geo":"43.17614,-90.57341"
  – "geo":"Circle(4.56,1.23 d=0.0710)"
  – "geo":"POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"
• Searching:
  – fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"
  – fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))"
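The filter queries above contain spaces, quotes, and parentheses, so they need URL encoding before being sent. The sketch below rebuilds the slide's polygon filter and encodes it. The helper names are hypothetical, and the field name `geo` is taken from the slide.

```python
from urllib.parse import parse_qs, urlencode


def polygon_wkt(points):
    """Format (x, y) pairs as a WKT polygon, closing the ring if needed."""
    ring = list(points)
    if ring[0] != ring[-1]:
        ring.append(ring[0])  # a WKT ring must end where it starts
    coords = ", ".join(f"{x} {y}" for x, y in ring)
    return f"POLYGON(({coords}))"


def intersects_fq(field, shape):
    """Build an Intersects filter query in the form shown on the slide."""
    return f'{field}:"Intersects({shape})"'


poly = polygon_wkt([(-10, 30), (-40, 40), (-10, -20), (40, 20), (0, 0)])
fq = intersects_fq("geo", poly)
query_string = urlencode({"q": "*:*", "fq": fq})
print(fq)
print(query_string)
```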

Page 16

Scaling Solr
• Distributed/sharded indexing & search
  – Auto distributes updates and queries to appropriate shards
  – Near Real Time (NRT) indexing capable
  – Document routing extensions
• Dynamically scalable
  – New SolrCloud instances add indexing and query capacity
  – Supports re-balancing (shard-splitting)
• Reliable
  – No single point of failure
  – Transactions logged
  – Robust, automatic recovery
• http://wiki.apache.org/solr/SolrCloud
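The routing idea behind "auto distributes updates to appropriate shards" can be sketched as hashing a routing key to a shard. This is a deliberate simplification: SolrCloud assigns hash ranges to shards and uses MurmurHash3, while the sketch uses `zlib.crc32` and a modulo purely as stand-ins. The `prefix!id` form mirrors the composite id convention, where documents sharing a prefix are co-located.

```python
import zlib


def route(doc_id, num_shards):
    """Pick a shard index from a document id (simplified stand-in hashing).

    For an id such as "tenantA!doc42", only the part before "!" is hashed,
    so every document with that prefix lands on the same shard.
    """
    key = doc_id.split("!", 1)[0] if "!" in doc_id else doc_id
    return zlib.crc32(key.encode("utf-8")) % num_shards


shard_a1 = route("tenantA!doc1", 4)
shard_a2 = route("tenantA!doc2", 4)
shard_plain = route("doc99", 4)
print(shard_a1, shard_a2, shard_plain)
```

Hashing by range rather than by modulo is what makes shard splitting practical: a shard's range can be divided in two without moving documents that belong to other shards.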

Page 17

Solr as NoSQL
• Non-traditional data stores
• Not designed for SQL type queries
• Distributed fault tolerant architecture
• Document oriented, data format agnostic (JSON, XML, CSV, binary)

Page 18

Go Deep!

Page 19

APIs
• New APIs for Schema and Solr Config
  – XML becoming more of an implementation detail
• Managed Schema mode
• Data-driven schema (aka schemaless)
• Synonyms, stopwords, request handlers
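To show what "XML becoming more of an implementation detail" can look like, the sketch below prepares (but does not send) a request that adds a field through the Schema API in managed schema mode. The endpoint path, the JSON array payload, the collection name, and the field definition are assumptions based on the Solr 4.x Schema REST API and should be verified against the reference guide for the version in use.

```python
import json
from urllib.request import Request

# Assumed endpoint for a collection named "collection1" on a local Solr.
SCHEMA_FIELDS_URL = "http://localhost:8983/solr/collection1/schema/fields"


def add_fields_request(fields):
    """Prepare a POST that submits new field definitions as a JSON array."""
    body = json.dumps(fields).encode("utf-8")
    return Request(
        SCHEMA_FIELDS_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical field definition used only for illustration.
new_fields = [{"name": "title", "type": "text_general", "stored": True}]
request = add_fields_request(new_fields)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the prepared request to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without a Solr instance.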

Page 20

Beyond Solr: LucidWorks Open Source
• Effortless AWS deployment and monitoring: http://www.github.com/lucidworks/solr-scale-tk
• Logstash for Solr: https://github.com/LucidWorks/solrlogmanager
• Banana (Kibana for Solr): https://github.com/LucidWorks/banana
• Data Quality Toolkit: https://github.com/LucidWorks/data-quality
• Coming Soon for Big Data: Hadoop, Pig, Hive 2-way support w/ Lucene and Solr, different file formats, pipelines, Logstash

Page 21

Summary
• Lucene/Solr 4.x:
  – Faster
  – More flexible
  – Easier than ever scaling
  – More reliable than ever
• Go forth and rank!

Page 22

Resources
• Me
  – [email protected]
  – @gsingers on Twitter
• LucidWorks
  – http://www.lucidworks.com
  – http://www.lucidworks.com/support-services/ask-the-experts/