Apache Solr Beyond The Box Chris Hostetter 2008-11-05 http://people.apache.org/~hossman/apachecon2008us/ http://lucene.apache.org/solr/



Apache Solr: Beyond The Box

Chris Hostetter
2008-11-05

http://people.apache.org/~hossman/apachecon2008us/

http://lucene.apache.org/solr/


Why Are We Here?

Plugins!

● What, How, Where, When, Why?
● Solr Internals In A Nutshell
● Real World Examples
● Testing
● Questions


What, How, Where, Who, When, Why?


What Is Solr (To Users)

● Information Retrieval Application
● Index/Query Via HTTP
● Comprehensive HTML Administration Interfaces
● Scalability - Efficient Replication To Other Solr Search Servers
● Highly Configurable Caching
● Flexible And Adaptable With XML Configuration
    ● Customizable Request Handlers And Response Writers
    ● Data Schema With Dynamic Fields And Unique Keys
    ● Analyzers Created At Runtime From Tokenizers And TokenFilters


What Is Solr (To Developers)

● Information Retrieval Application
● Java5 WebApp (WAR) With A Web Services-ish API
● Extensible Plugin Architecture
● MVC-ish Framework Around The Java Lucene Search Library
● Allows Custom Business Logic And Text Analysis Rules To Live Close To The Data
● Abstracts Away The Tricky Stuff:
    ● Index Consistency
    ● Data Replication
    ● Cache Management


How It Started


When/Why To Write A Plugin

“X can be done more efficiently closer to the data.”

OR

“To force X for all clients.”


Solr Internals In A Nutshell


50,000' View

[Diagram: requests enter over HTTP (SolrDispatchFilter) or directly from Java (EmbeddedSolrServer). A CoreContainer holds several SolrCores. Other labeled components: SolrRequestHandler, SolrQuery(Request/Response), QueryResponseWriter.]


MVC-ish

● SolrRequestHandler ... A Controller
    ● handleRequest(SolrQueryRequest, SolrQueryResponse)
● SolrQueryRequest ... An Event (++)
    ● Input Parameters
    ● List Of ContentStreams
    ● Maintains SolrCore & SolrIndexSearcher References
● SolrQueryResponse ... Model
    ● Tree Of "Simple" Objects And DocLists
● ResponseWriter ... View
    ● write(Writer, SolrQueryRequest, SolrQueryResponse)


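Hello World

// Package names as of Solr 1.3.
import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;

public class HelloWorld extends RequestHandlerBase {
  public void handleRequestBody(SolrQueryRequest req,
                                SolrQueryResponse rsp) {
    // Read request parameters; getInt returns null if "age" is absent.
    String name = req.getParams().get("name");
    Integer age = req.getParams().getInt("age");
    // Values added to the response are rendered by the chosen response writer.
    rsp.add("greeting", "Hello " + name);
    rsp.add("yourage", age);
  }
  // Descriptive metadata about this plugin.
  public String getVersion() { return "$Revision:$"; }
  public String getSource() { return "$Id:$"; }
  public String getSourceId() { return "$URL:$"; }
  public String getDescription() { return "Says Hello"; }
}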


Hello World Output

http://localhost:8983/solr/hello?name=Hoss&age=32&wt=xml

    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1</int>
      </lst>
      <str name="greeting">Hello Hoss</str>
      <int name="yourage">32</int>
    </response>

http://localhost:8983/solr/hello?name=Hoss&age=32&wt=json

    { "responseHeader":{ "status":0, "QTime":1},
      "greeting":"Hello Hoss",
      "yourage":32
    }


Types Of Plugins

● SolrRequestHandler
    ● SearchComponent
    ● QParserPlugin
    ● ValueSourceParser
● SolrHighlighter
    ● SolrFragmenter
    ● SolrFormatter
● UpdateRequestProcessorFactory
● QueryResponseWriter
● Similarity(Factory)
● Analyzer
    ● TokenizerFactory
    ● TokenFilterFactory
● FieldType
● SolrCache
    ● CacheRegenerator
● SolrEventListener
● UpdateHandler

Italics: Only One Per SolrCore
Color: Likelihood Of Needing To Write Your Own


Real World Examples


Tibetan And Himalayan Digital Library Tools


Tsheg Analysis Factories

public class TshegBarTokenizerFactory
             extends BaseTokenizerFactory {
  public TokenStream create(Reader input) {
    return new TshegBarTokenizer(input);
  }
}

public class EdgeTshegTrimmerFactory
             extends BaseTokenFilterFactory {
  public TokenStream create(TokenStream input) {
    return new EdgeTshegTrimmer(input);
  }
}


DFLL


DFLL: Faceted Browsing


DFLL Category Metadata

● Category ID and Label: 3126 == “Tablet PCs”
● Category Query: tablet_form:[* TO *]
● Ordered List of Facets
    ● Facet ID and Label: 500016 == “OS Provided”
    ● Facet Display Info: Count vs. Alphabetical, etc...
    ● Ordered List of Constraints
        ● Constraint ID and Label: 111536 == “Apple OS X”
        ● Constraint Query: os:("OSX10.1" "OSX10.2" ...)


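DfllHandler Pseudo-Code

// Look up the category's metadata document and parse it (parsed form is cached).
Document catMetaDoc = searcher.getFirstMatch(catDocId);
Metadata m = parseAndCacheMetadata(catMetaDoc, searcher);
// Clone before setting counts, so the cached copy is not modified.
m = m.clone();

// One call returns both the ordered page of results and the full result set.
DocListAndSet results =
              searcher.getDocListAndSet(m.catQuery, ...);
response.add("products", results.docList);

// Count each constraint by intersecting its query with the full result set.
for (Facet f : m) {
  for (Constraint c : f) {
    c.setCount(searcher.numDocs(c.query,
                                results.docSet));
  }
}
response.add("metadata", m.asSimpleObjects());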
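
The counting loop in the pseudo-code comes down to intersecting each constraint's matching documents with the full result set. Below is a minimal, self-contained sketch of that idea. It uses java.util.BitSet as a stand-in for Solr's DocSet, and the names FacetCountSketch, countConstraints and docs are hypothetical, not part of Solr.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

public class FacetCountSketch {

    // Hypothetical stand-in for searcher.numDocs(query, docSet): the number of
    // documents that match the constraint AND belong to the result set.
    static int numDocs(BitSet constraintMatches, BitSet resultSet) {
        BitSet both = (BitSet) constraintMatches.clone(); // leave the inputs unmodified
        both.and(resultSet);
        return both.cardinality();
    }

    // Computes one count per constraint label, preserving the given order.
    static Map<String, Integer> countConstraints(Map<String, BitSet> constraints,
                                                 BitSet resultSet) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> e : constraints.entrySet()) {
            counts.put(e.getKey(), numDocs(e.getValue(), resultSet));
        }
        return counts;
    }

    // Helper: builds a set of internal document ids.
    static BitSet docs(int... ids) {
        BitSet set = new BitSet();
        for (int id : ids) {
            set.set(id);
        }
        return set;
    }

    public static void main(String[] args) {
        BitSet results = docs(1, 2, 3, 5, 8); // all documents matching the category query
        Map<String, BitSet> constraints = new LinkedHashMap<>();
        constraints.put("manu:Dell", docs(1, 2, 4));
        constraints.put("manu:HP", docs(5, 9));
        System.out.println(countConstraints(constraints, results));
        // prints {manu:Dell=2, manu:HP=1}
    }
}
```

Because every constraint is counted against the same result set, that set only has to be computed once per request.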


Conceptual Picture

[Diagram: getDocListAndSet(Query, Query[], Sort, offset, n) produces a DocList (section of ordered results) and a DocSet (unordered set of all results); numDocs() is applied against the DocSet, and the results feed the Query Response.
Queries and sort shown: tablet_form:[* TO *], os:("OSX10.1" "OSX10.2" ...), memory:[1GB TO *], price asc, proc_manu:Intel, proc_manu:AMD, price:[0 TO 500], price:[500 TO 1000], manu:Dell, manu:HP, manu:Lenovo.
Counts shown: 594, 382, 247, 689, 104, 92, 75.]


DFLL Response

<result name="products" numFound="394" start="0">...</result>
<lst name="metadata">
 ...
 <lst name="500016">
   <int name="rankDir">0</int><int name="datatype">1</int>
   <int name="rating">88</int><str name="name">OS provided</str>
   <lst name="values">
     <lst name="111536">
       <int name="valueId">111536</int>
       <str name="label">Apple Mac OS X</str>
       <str name="rating">50</str>
       <int name="count">1</int>
     </lst>
     ...
   </lst>
 </lst>
 ...
</lst>


DfllCacheRegenerator

SolrCore “auto-warms” all SolrCaches when new versions of the index are opened for searching (after a commit).

public interface CacheRegenerator {
  public boolean regenerateItem(SolrIndexSearcher newSearcher,
                                SolrCache newCache,
                                SolrCache oldCache,
                                Object oldKey,
                                Object oldVal)
         throws IOException;
}
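
To make the auto-warming flow concrete, here is a simplified, self-contained model of it. This is a sketch under stated assumptions, not Solr's implementation: the caches are plain Maps, the "new searcher" is modelled as a Function that recomputes a value for a key, and the names AutoWarmSketch, Regenerator, warm and maxItems are hypothetical. The boolean result is treated as "keep warming", which is how I understand the real interface's return value.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class AutoWarmSketch {

    // Simplified stand-in for CacheRegenerator (hypothetical, generic version).
    interface Regenerator<K, V> {
        // Returns true to continue warming, false to stop early.
        boolean regenerateItem(Function<K, V> newSearcher,
                               Map<K, V> newCache,
                               Map<K, V> oldCache,
                               K oldKey,
                               V oldVal);
    }

    // Walks the old cache and lets the regenerator populate the new one,
    // visiting at most maxItems entries.
    static <K, V> Map<K, V> warm(Map<K, V> oldCache,
                                 Function<K, V> newSearcher,
                                 Regenerator<K, V> regenerator,
                                 int maxItems) {
        Map<K, V> newCache = new LinkedHashMap<>();
        int visited = 0;
        for (Map.Entry<K, V> entry : oldCache.entrySet()) {
            if (visited >= maxItems) {
                break;
            }
            visited++;
            boolean keepGoing = regenerator.regenerateItem(
                    newSearcher, newCache, oldCache, entry.getKey(), entry.getValue());
            if (!keepGoing) {
                break;
            }
        }
        return newCache;
    }

    public static void main(String[] args) {
        Map<String, Integer> oldCache = new LinkedHashMap<>();
        oldCache.put("tablets", 10); // stale values from the previous index version
        oldCache.put("laptops", 20);

        // Pretend "searching" a key against the new index yields its length.
        Function<String, Integer> newSearcher = String::length;

        // Ignore the stale value and recompute each key against the new index.
        Regenerator<String, Integer> recompute = (searcher, fresh, old, key, val) -> {
            fresh.put(key, searcher.apply(key));
            return true;
        };

        System.out.println(warm(oldCache, newSearcher, recompute, 10));
        // prints {tablets=7, laptops=7}
    }
}
```

The point of the design is that only the keys carry over from the old cache; each value is rebuilt against the new index, so stale entries never become visible to searches.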


DataImportHandler


DataImportHandler

Builds and incrementally updates indexes based on configured SQL or XPath queries.

<entity name="item" pk="ID" query="select * from ITEM"
   deltaQuery="select ID ... where
               ITEMDATE > '${dataimporter.last_index_time}'">
 <field column="NAME" name="name" />
 ...
 <entity name="f" pk="ITEMID"
    query="select DESC from FEATURE where ITEMID='${item.ID}'"
    deltaQuery="select ITEMID from FEATURE where
                UPDATEDATE > '${dataimporter.last_index_time}'"
    parentDeltaQuery="select ID from ITEM where ID=${f.ITEMID}">
  <field name="features" column="DESC" />
  ...
 </entity>
</entity>
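
The ${...} placeholders in the queries above (for example ${item.ID} and ${dataimporter.last_index_time}) are replaced with current values before each query runs. The sketch below shows one way such substitution can work. It is an illustration only, not DataImportHandler's own resolver, and the names TemplateSketch and resolve are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemplateSketch {

    // Matches ${name} and captures the name between the braces.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replaces every ${name} whose name is present in vars;
    // unknown placeholders are left untouched.
    static String resolve(String template, Map<String, String> vars) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = vars.get(m.group(1));
            String replacement = (value != null) ? value : m.group();
            // quoteReplacement keeps '$' and '\' in the value literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> vars = new HashMap<>();
        vars.put("item.ID", "42"); // value taken from the current parent row
        System.out.println(
                resolve("select DESC from FEATURE where ITEMID='${item.ID}'", vars));
        // prints select DESC from FEATURE where ITEMID='42'
    }
}
```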


DataImportHandler Plugins

● DataSource
    ● FileDataSource
    ● HttpDataSource
    ● JdbcDataSource
● EntityProcessor
    ● FileListEntityProcessor
    ● SqlEntityProcessor
        ● CachedSqlEntityProcessor
    ● XPathEntityProcessor
● Transformer
    ● DateFormatTransformer
    ● NumberFormatTransformer
    ● RegexTransformer
    ● ScriptTransformer
    ● TemplateTransformer


LocalSolr


LocalSolr


LocalUpdateProcessorFactory

● Uses lat/lon fields to compute Cartesian Tier info
● Adds grid boxes of various sizes as new fields

 <updateRequestProcessorChain name="standard" default="true">
   <processor class="....LocalUpdateProcessorFactory">
      <str name="latField">lat</str>
      <str name="lngField">lng</str>
      <int name="startTier">9</int>
      <int name="endTier">17</int>
   </processor>
   <processor class="solr.LogUpdateProcessorFactory" />
   <processor class="solr.RunUpdateProcessorFactory" />
 </updateRequestProcessorChain>


LocalSolr Cartesian Tiers


LocalSolrQueryComponent

● Use in place of default QueryComponent
● Augments regular query with DistanceQuery and DistanceSortSource
● Can use a custom SolrCache for distances for commonly used points

  <searchComponent name="geoquery"
                   class="....LocalSolrQueryComponent" />
  <requestHandler name="geo" class="solr.SearchHandler">
     <arr name="components">
       <str>geoquery</str>
       ...
     </arr>
  </requestHandler>


GuardianComponent


GuardianComponent Goal

● When Searching Really Short Docs, Rule Out Matches That Are “Significantly” Longer Than The Query
● Increase Precision At The Expense Of Recall

    q = Dance Party

  Dance Party (1995)
  Dance Party (2005) (V)
  Dance Party, USA (2006)
  Workout Party... Let's Dance! (2004) (V)
  Shrek in the Swamp Karaoke Dance Party (2001) (V)


Implementation

● SearchComponent
● Configured To Run After QueryComponent
● Post-Processes DocList
    ● Pick MAX_LEN Based On Number Of Query Clauses
    ● Re-analyze Stored “title” Field
    ● Eliminate Any Results With More Than MAX_LEN Tokens In “title”
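
The post-processing steps can be sketched in plain Java as follows. Everything here is an assumption made for illustration: whitespace splitting stands in for re-analyzing the stored “title” field, the rule MAX_LEN = clauses + 2 is invented (the slides do not say how MAX_LEN is picked), and the names GuardianSketch, maxLen, countTokens and filter are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GuardianSketch {

    // Hypothetical rule: allow two more tokens than the query has clauses.
    static int maxLen(int numQueryClauses) {
        return numQueryClauses + 2;
    }

    // Whitespace tokenization as a stand-in for running the field's analyzer.
    static int countTokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Keeps only the titles with at most MAX_LEN tokens, preserving rank order.
    static List<String> filter(String query, List<String> rankedTitles) {
        int limit = maxLen(countTokens(query));
        List<String> kept = new ArrayList<>();
        for (String title : rankedTitles) {
            if (countTokens(title) <= limit) {
                kept.add(title);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
                "Dance Party (1995)",
                "Dance Party (2005) (V)",
                "Dance Party, USA (2006)",
                "Workout Party... Let's Dance! (2004) (V)",
                "Shrek in the Swamp Karaoke Dance Party (2001) (V)");
        // Two query clauses give MAX_LEN = 4 under the rule assumed above,
        // so the 6-token and 9-token titles are dropped.
        for (String title : filter("Dance Party", titles)) {
            System.out.println(title);
        }
    }
}
```

Filtering after the search keeps the original ranking, at the cost of returning fewer results than the main query matched.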


Alternate Approach

● <copyField source="title" dest="titleLen"/>
● Write TokenCountingTokenFilter For titleLen
● Write MaxLenQParserPlugin
    ● Subclass Your Favorite QParser
    ● Pick MAX_LEN Based On Number Of Query Clauses From Super
    ● Add +titleLen:[* TO MAX_LEN] Clause To Query


Testing Your Plugins


AbstractSolrTestCase

import org.apache.solr.util.AbstractSolrTestCase;

public class YourTest extends AbstractSolrTestCase {
  ...
  public void testSomeStuff() throws Exception {
    // Index two documents, then commit so they become searchable.
    assertU(adoc("id", "7",    "description", "Travel Guide",
                 "title", "Paris in 10 Days"));
    assertU(adoc("id", "42",   "description", "Cool Book",
                 "title", "Hitch Hiker's Guide to the Galaxy"));
    assertU(commit());
    // "guide" matches both docs; title is boosted over description,
    // so doc 42 (title match) should rank above doc 7.
    assertQ("multi qf", req("q",  "guide",
                            "qt", "dismax",
                            "qf", "title^2 description^1")
            ,"//*[@numFound='2']"
            ,"//result/doc[1]/int[@name='id'][.='42']"
            ,"//result/doc[2]/int[@name='id'][.='7']"
            );
  }
}


Questions?

http://lucene.apache.org/solr/
