Apache Solr: Beyond The Box
Chris Hostetter
2008-11-05
http://people.apache.org/~hossman/apachecon2008us/
http://lucene.apache.org/solr/
Why Are We Here?
Plugins!
● What, How, Where, When, Why?
● Solr Internals In A Nutshell
● Real World Examples
● Testing
● Questions
What, How, Where, Who, When, Why?
What Is Solr (To Users)
● Information Retrieval Application
● Index/Query Via HTTP
● Comprehensive HTML Administration Interfaces
● Scalability - Efficient Replication To Other Solr Search Servers
● Highly Configurable Caching
● Flexible And Adaptable With XML Configuration
  - Customizable Request Handlers And Response Writers
  - Data Schema With Dynamic Fields And Unique Keys
  - Analyzers Created At Runtime From Tokenizers And TokenFilters
What Is Solr (To Developers)
● Information Retrieval Application
● Java5 WebApp (WAR) With A Web Services-ish API
● Extensible Plugin Architecture
● MVC-ish Framework Around The Java Lucene Search Library
● Allows Custom Business Logic and Text Analysis Rules To Live Close To The Data
● Abstracts Away The Tricky Stuff:
  - Index Consistency
  - Data Replication
  - Cache Management
How It Started
When/Why To Write A Plugin
“X can be done more efficiently closer to the data.”
OR
“To force X for all clients.”
Solr Internals In A Nutshell
50,000' View
[Diagram: HTTP requests enter through SolrDispatchFilter; Java callers use EmbeddedSolrServer. Both reach a CoreContainer holding several SolrCores. Other labels: SolrRequestHandler, SolrQuery(Request/Response), QueryResponseWriter.]
MVC-ish
● SolrRequestHandler ... A Controller
  - handleRequest( SolrQueryRequest, SolrQueryResponse )
● SolrQueryRequest ... An Event (++)
  - Input Parameters
  - List of ContentStreams
  - Maintains SolrCore & SolrIndexSearcher References
● SolrQueryResponse ... Model
  - Tree of "Simple" Objects and DocLists
● ResponseWriter ... View
  - write(Writer, SolrQueryRequest, SolrQueryResponse)
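Hello World
// Imports assume Solr 1.3 package names.
import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;

public class HelloWorld extends RequestHandlerBase {
  // Reads "name" and "age" from the request and adds two values to the response.
  public void handleRequestBody(SolrQueryRequest req,
                                SolrQueryResponse rsp) {
    String name = req.getParams().get("name");
    Integer age = req.getParams().getInt("age");  // null if "age" is absent
    rsp.add("greeting", "Hello " + name);
    rsp.add("yourage", age);
  }
  // Descriptive metadata about the plugin
  public String getVersion() { return "$Revision:$"; }
  public String getSource() { return "$Id:$"; }
  public String getSourceId() { return "$URL:$"; }
  public String getDescription() { return "Says Hello"; }
}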
Hello World Output
http://localhost:8983/solr/hello?name=Hoss&age=32&wt=xml
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
  </lst>
  <str name="greeting">Hello Hoss</str>
  <int name="yourage">32</int>
</response>
http://localhost:8983/solr/hello?name=Hoss&age=32&wt=json
{ "responseHeader":{ "status":0, "QTime":1},
  "greeting":"Hello Hoss",
  "yourage":32
}
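The two outputs above come from one response model rendered by two different views. A minimal sketch of that split in plain Java, with no Solr dependency: SketchWriter, XmlSketchWriter, JsonSketchWriter and HelloWriters are hypothetical names, not Solr classes, and the sketch skips the responseHeader and any escaping.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Builds the model once and renders it with whichever writer "wt" selects. */
class HelloWriters {
    static Map<String, Object> helloModel(String name, int age) {
        Map<String, Object> model = new LinkedHashMap<String, Object>();
        model.put("greeting", "Hello " + name);
        model.put("yourage", age);
        return model;
    }

    /** Hypothetical lookup: "json" selects the JSON view, anything else XML. */
    static SketchWriter pick(String wt) {
        return "json".equals(wt) ? new JsonSketchWriter() : new XmlSketchWriter();
    }

    public static void main(String[] args) {
        Map<String, Object> model = helloModel("Hoss", 32);
        System.out.println(pick("xml").write(model));
        System.out.println(pick("json").write(model));
    }
}

/** Hypothetical stand-in for a response writer: turns a model into text. */
interface SketchWriter {
    String write(Map<String, Object> model);
}

class XmlSketchWriter implements SketchWriter {
    public String write(Map<String, Object> model) {
        StringBuilder sb = new StringBuilder("<response>");
        for (Map.Entry<String, Object> e : model.entrySet()) {
            // Integers become <int>, everything else <str> (no escaping in this sketch).
            String tag = (e.getValue() instanceof Integer) ? "int" : "str";
            sb.append('<').append(tag).append(" name=\"").append(e.getKey()).append("\">")
              .append(e.getValue())
              .append("</").append(tag).append('>');
        }
        return sb.append("</response>").toString();
    }
}

class JsonSketchWriter implements SketchWriter {
    public String write(Map<String, Object> model) {
        StringBuilder sb = new StringBuilder("{");
        String sep = "";
        for (Map.Entry<String, Object> e : model.entrySet()) {
            sb.append(sep).append('"').append(e.getKey()).append("\":");
            if (e.getValue() instanceof Integer) {
                sb.append(e.getValue());
            } else {
                sb.append('"').append(e.getValue()).append('"');
            }
            sep = ",";
        }
        return sb.append('}').toString();
    }
}
```

The handler never needs to know which format the client asked for; only the writer choice changes.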
Types Of Plugins
● SolrRequestHandler
  - SearchComponent
  - QParserPlugin
  - ValueSourceParser
● SolrHighlighter
  - SolrFragmenter
  - SolrFormatter
● UpdateRequestProcessorFactory
● QueryResponseWriter
● Similarity(Factory)
● Analyzer
  - TokenizerFactory
  - TokenFilterFactory
● FieldType
● SolrCache
  - CacheRegenerator
● SolrEventListener
● UpdateHandler
Italics: Only One Per SolrCore
Color: Likelihood Of Needing To Write Your Own
Real World Examples
Tibetan And Himalayan Digital Library Tools
Tsheg Analysis Factories
// Factory that builds the tokenizer
public class TshegBarTokenizerFactory
    extends BaseTokenizerFactory {
  public TokenStream create(Reader input) {
    return new TshegBarTokenizer(input);
  }
}
// Factory that wraps a TokenStream in the token filter
public class EdgeTshegTrimmerFactory
    extends BaseTokenFilterFactory {
  public TokenStream create(TokenStream input) {
    return new EdgeTshegTrimmer(input);
  }
}
DFLL
DFLL: Faceted Browsing
DFLL Category Metadata
● Category ID and Label: 3126 == “Tablet PCs”
● Category Query: tablet_form:[* TO *]
● Ordered List of Facets
  - Facet ID and Label: 500016 == “OS Provided”
  - Facet Display Info: Count vs. Alphabetical, etc...
  - Ordered List of Constraints
    ● Constraint ID and Label: 111536 == “Apple OS X”
    ● Constraint Query: os:(“OSX10.1” “OSX10.2” ...)
DfllHandler Pseudo-Code
Document catMetaDoc = searcher.getFirstMatch(catDocId)
Metadata m = parseAndCacheMetadata(catMetaDoc, searcher)
m = m.clone()   // counts below are set on the copy, not on the cached object
DocListAndSet results =
    searcher.getDocListAndSet(m.catQuery, ...)
response.add("products", results.docList)
foreach (Facet f : m) {
  foreach (Constraint c : f) {
    // count of docs matching the constraint within the full result set
    c.setCount(searcher.numDocs(c.query,
                                results.docSet))
  }
}
response.add("metadata", m.asSimpleObjects())
Conceptual Picture
[Diagram: getDocListAndSet(Query, Query[], Sort, offset, n) takes queries such as tablet_form:[* TO *], os:(“OSX10.1” “OSX10.2” ...), memory:[1GB TO *] and the sort "price asc". It produces a DocList (section of ordered results, used for the query response) and a DocSet (unordered set of all results). numDocs() is applied to constraint queries (proc_manu:Intel, proc_manu:AMD, price:[0 TO 500], price:[500 TO 1000], manu:Dell, manu:HP, manu:Lenovo); counts shown: 594, 382, 247, 689, 104, 92, 75.]
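A minimal sketch of the DocList/DocSet idea in plain Java, assuming documents are plain integer ids and each query has already been resolved to a set of ids. DocListAndSetSketch and its methods are hypothetical stand-ins, not SolrIndexSearcher; the point is only that the ordered page and the unordered full set are separate, and that a facet count is the size of an intersection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DocListAndSetSketch {
    /** One ordered page of results (stand-in for a DocList). */
    final List<Integer> docList;
    /** Unordered set of all matching ids (stand-in for a DocSet). */
    final Set<Integer> docSet;

    DocListAndSetSketch(List<Integer> docList, Set<Integer> docSet) {
        this.docList = docList;
        this.docSet = docSet;
    }

    /** Sorts the matches by ascending price and keeps n ids starting at offset. */
    static DocListAndSetSketch search(Set<Integer> matches, int[] priceByDoc,
                                      int offset, int n) {
        List<Integer> sorted = new ArrayList<Integer>(matches);
        sorted.sort(Comparator.comparingInt((Integer id) -> priceByDoc[id]));
        int from = Math.min(offset, sorted.size());
        int to = Math.min(offset + n, sorted.size());
        return new DocListAndSetSketch(
            new ArrayList<Integer>(sorted.subList(from, to)), matches);
    }

    /** Facet count: how many docs in the result set also match the constraint. */
    static int numDocs(Set<Integer> constraintMatches, Set<Integer> results) {
        int count = 0;
        for (Integer id : constraintMatches) {
            if (results.contains(id)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Set<Integer> matches = new HashSet<Integer>(Arrays.asList(0, 1, 2, 3, 4));
        int[] prices = {500, 300, 900, 100, 700, 50};   // made-up prices, indexed by doc id
        DocListAndSetSketch r = search(matches, prices, 0, 3);
        Set<Integer> dell = new HashSet<Integer>(Arrays.asList(1, 2, 5));
        System.out.println("page: " + r.docList);
        System.out.println("manu:Dell count: " + numDocs(dell, r.docSet));
    }
}
```

The page size affects only the DocList; counts are always taken against the full DocSet.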
DFLL Response
<result name="products" numFound="394" start="0">...</result>
<lst name="metadata">
  ...
  <lst name="500016">
    <int name="rankDir">0</int><int name="datatype">1</int>
    <int name="rating">88</int><str name="name">OS provided</str>
    <lst name="values">
      <lst name="111536">
        <int name="valueId">111536</int>
        <str name="label">Apple Mac OS X</str>
        <str name="rating">50</str>
        <int name="count">1</int>
      </lst>
      ...
    </lst>
DfllCacheRegenerator
SolrCore “auto-warms” all SolrCaches when a new version of the index is opened for searching (after a commit).
public interface CacheRegenerator {
  public boolean regenerateItem(SolrIndexSearcher newSearcher,
                                SolrCache newCache,
                                SolrCache oldCache,
                                Object oldKey,
                                Object oldVal)
      throws IOException;
}
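A minimal sketch of what auto-warming amounts to, in plain Java. It assumes caches are simple maps and the "new searcher" is just a function from key to fresh value; AutoWarmSketch and SketchRegenerator are hypothetical names that mirror the shape of the interface above without using Solr types. As with the real interface, returning false is taken to mean "stop warming".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class AutoWarmSketch {
    /** Replays up to maxItems old cache keys against the new searcher. */
    static <K, V> Map<K, V> warm(Map<K, V> oldCache,
                                 Function<K, V> newSearcher,
                                 SketchRegenerator<K, V> regenerator,
                                 int maxItems) {
        Map<K, V> newCache = new LinkedHashMap<K, V>();
        int done = 0;
        for (Map.Entry<K, V> e : oldCache.entrySet()) {
            if (done++ >= maxItems) {
                break;
            }
            boolean keepGoing = regenerator.regenerateItem(
                newSearcher, newCache, oldCache, e.getKey(), e.getValue());
            if (!keepGoing) {
                break;
            }
        }
        return newCache;
    }

    public static void main(String[] args) {
        Map<String, Integer> oldCache = new LinkedHashMap<String, Integer>();
        oldCache.put("manu:Dell", 104);
        oldCache.put("manu:HP", 92);
        // Stand-in for re-running a query against the new index (made-up values).
        Function<String, Integer> newSearcher = q -> q.length();
        SketchRegenerator<String, Integer> regen =
            (searcher, newCache, old, key, oldVal) -> {
                newCache.put(key, searcher.apply(key));  // recompute, do not copy oldVal
                return true;
            };
        System.out.println(warm(oldCache, newSearcher, regen, 10));
    }
}

/** Hypothetical, simplified analogue of a cache regenerator. */
interface SketchRegenerator<K, V> {
    boolean regenerateItem(Function<K, V> newSearcher,
                           Map<K, V> newCache,
                           Map<K, V> oldCache,
                           K oldKey,
                           V oldVal);
}
```

The old value is passed in but usually cannot be reused as-is, because it was computed against the previous version of the index.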
DataImportHandler
Builds and incrementally updates indexes based on configured SQL or XPath queries.
<entity name="item" pk="ID" query="select * from ITEM"
        deltaQuery="select ID ... where
                    ITEMDATE > '${dataimporter.last_index_time}'">
  <field column="NAME" name="name" />
  ...
  <entity name="f" pk="ITEMID"
          query="select DESC from FEATURE where ITEMID='${item.ID}'"
          deltaQuery="select ITEMID from FEATURE where
                      UPDATEDATE > '${dataimporter.last_index_time}'"
          parentDeltaQuery="select ID from ITEM where ID=${f.ITEMID}">
    <field name="features" column="DESC" />
    ...
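The queries in the configuration above rely on ${...} placeholders such as ${item.ID} and ${dataimporter.last_index_time}. A minimal sketch of that kind of substitution in plain Java, assuming a flat map from placeholder name to value: PlaceholderSketch is a hypothetical helper, not DataImportHandler's own resolver, and it does plain text replacement only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderSketch {
    // Matches ${some.name} and captures the name between the braces.
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces each known ${name} with its value; unknown names are left as-is. */
    static String resolve(String template, Map<String, String> vars) {
        Matcher m = VAR.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = vars.get(m.group(1));
            String replacement = (value != null) ? value : m.group(0);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> vars = new HashMap<String, String>();
        vars.put("f.ITEMID", "42");                                 // made-up row value
        vars.put("dataimporter.last_index_time", "2008-11-05 00:00:00");  // made-up timestamp
        System.out.println(resolve("select ID from ITEM where ID=${f.ITEMID}", vars));
        System.out.println(resolve(
            "select ITEMID from FEATURE where UPDATEDATE > '${dataimporter.last_index_time}'",
            vars));
    }
}
```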
DataImportHandler Plugins
● DataSource
  - FileDataSource
  - HttpDataSource
  - JdbcDataSource
● EntityProcessor
  - FileListEntityProcessor
  - SqlEntityProcessor
    ● CachedSqlEntityProcessor
  - XPathEntityProcessor
● Transformer
  - DateFormatTransformer
  - NumberFormatTransformer
  - RegexTransformer
  - ScriptTransformer
  - TemplateTransformer
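A rough sketch of the transformer idea in plain Java, assuming a row is a map from column name to value and each transformer rewrites the row before it is indexed. RowTransformerSketch and TransformerChainSketch are hypothetical names, and the two transformers only loosely imitate a regex and a number-format step; none of this is DataImportHandler's actual API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TransformerChainSketch {
    /** Rewrites a string column with a regex replacement. */
    static RowTransformerSketch regexReplace(String column, String regex, String replacement) {
        return row -> {
            Object v = row.get(column);
            if (v instanceof String) {
                row.put(column, ((String) v).replaceAll(regex, replacement));
            }
            return row;
        };
    }

    /** Parses a string column such as "1,299" into an Integer. */
    static RowTransformerSketch parseInteger(String column) {
        return row -> {
            Object v = row.get(column);
            if (v instanceof String) {
                row.put(column, Integer.valueOf(((String) v).replace(",", "")));
            }
            return row;
        };
    }

    /** Runs the transformers in order on a copy of the row. */
    static Map<String, Object> apply(List<RowTransformerSketch> chain, Map<String, Object> row) {
        Map<String, Object> current = new LinkedHashMap<String, Object>(row);
        for (RowTransformerSketch t : chain) {
            current = t.transformRow(current);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<String, Object>();
        row.put("NAME", "Tablet   PC");   // made-up row data
        row.put("PRICE", "1,299");
        List<RowTransformerSketch> chain = Arrays.asList(
            regexReplace("NAME", "\\s+", " "),
            parseInteger("PRICE"));
        System.out.println(apply(chain, row));
    }
}

/** Hypothetical single-method transformer: takes a row, returns the rewritten row. */
interface RowTransformerSketch {
    Map<String, Object> transformRow(Map<String, Object> row);
}
```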
LocalSolr
LocalUpdateProcessorFactory
● Uses lat/lon fields to compute Cartesian Tier info
● Adds grid boxes of various sizes as new fields
<updateRequestProcessorChain name="standard" default="true">
  <processor class="....LocalUpdateProcessorFactory">
    <str name="latField">lat</str>
    <str name="lngField">lng</str>
    <int name="startTier">9</int>
    <int name="endTier">17</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
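A simplified sketch of the tier idea in plain Java. It assumes that tier t splits the world into 2^t by 2^t equal cells over raw latitude and longitude, and that one field per tier stores the id of the cell containing the point. This is an illustration only, not LocalSolr's actual Cartesian tier projection; TierSketch and the field names "tier_9" ... "tier_17" are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TierSketch {
    /** Id of the grid cell containing (lat, lng) at the given tier. */
    static long boxId(double lat, double lng, int tier) {
        long cells = 1L << tier;   // cells per axis at this tier
        long col = clamp((long) Math.floor((lng + 180.0) / 360.0 * cells), cells);
        long row = clamp((long) Math.floor((lat + 90.0) / 180.0 * cells), cells);
        return row * cells + col;
    }

    /** Keeps edge values (lat = 90, lng = 180) inside the grid. */
    private static long clamp(long v, long cells) {
        return Math.max(0, Math.min(cells - 1, v));
    }

    /** One field per tier, from coarse (startTier) to fine (endTier). */
    static Map<String, Long> tierFields(double lat, double lng, int startTier, int endTier) {
        Map<String, Long> fields = new LinkedHashMap<String, Long>();
        for (int t = startTier; t <= endTier; t++) {
            fields.put("tier_" + t, boxId(lat, lng, t));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up coordinates; tiers 9..17 match the configuration above.
        System.out.println(tierFields(40.0, -74.0, 9, 17));
    }
}
```

Indexing several tiers lets a query pick the cell size that best fits its search radius and then match on a handful of box ids instead of computing a distance for every document.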
LocalSolr Cartesian Tiers
LocalSolrQueryComponent
● Use in place of default QueryComponent
● Augments regular query with DistanceQuery and DistanceSortSource
● Can use a custom SolrCache for distances for commonly used points
<searchComponent name="geoquery"
                 class="....LocalSolrQueryComponent" />
<requestHandler name="geo" class="solr.SearchHandler">
  <arr name="components">
    <str>geoquery</str>
    ...
  </arr>
</requestHandler>
GuardianComponent
GuardianComponent Goal
● When Searching Really Short Docs, Rule Out Matches That Are “Significantly” Longer Than The Query
● Increase Precision At The Expense Of Recall
q = Dance Party
  Dance Party (1995)
  Dance Party (2005) (V)
  Dance Party, USA (2006)
  Workout Party... Let's Dance! (2004) (V)
  Shrek in the Swamp Karaoke Dance Party (2001) (V)
Implementation
● SearchComponent
● Configured To Run After QueryComponent
● Post-Processes DocList
  - Pick MAX_LEN Based On Number Of Query Clauses
  - Re-analyze Stored “title” Field
  - Eliminate Any Results With More Than MAX_LEN Tokens In “title”
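The post-processing steps above can be sketched in plain Java. The slides do not say how MAX_LEN is derived or how titles are analyzed, so both are assumptions here: MAX_LEN is taken as twice the number of query terms, and tokens are runs of letters or digits. TitleLengthGuardSketch is a hypothetical name and works on plain strings rather than a DocList.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TitleLengthGuardSketch {
    /** Counts tokens, where a token is a run of letters or digits (assumed analysis). */
    static int countTokens(String text) {
        int count = 0;
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    /** Assumed rule, not from the slides: allow titles up to twice the query length. */
    static int maxLen(int queryClauses) {
        return 2 * queryClauses;
    }

    /** Keeps only titles with at most MAX_LEN tokens, preserving result order. */
    static List<String> filter(String query, List<String> titles) {
        int max = maxLen(countTokens(query));
        List<String> kept = new ArrayList<String>();
        for (String title : titles) {
            if (countTokens(title) <= max) {
                kept.add(title);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
            "Dance Party (1995)",
            "Dance Party (2005) (V)",
            "Dance Party, USA (2006)",
            "Workout Party... Let's Dance! (2004) (V)",
            "Shrek in the Swamp Karaoke Dance Party (2001) (V)");
        for (String kept : filter("Dance Party", titles)) {
            System.out.println(kept);
        }
    }
}
```

Under these assumptions the two longest titles are dropped, trading recall for precision as described in the goal.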
Alternate Approach
● <copyField source="title" dest="titleLen"/>
● Write TokenCountingTokenFilter For titleLen
● Write MaxLenQParserPlugin
  - Subclass Your Favorite QParser
  - Pick MAX_LEN Based On Number Of Query Clauses From Super
  - Add +titleLen:[* TO MAX_LEN] Clause To Query
Testing Your Plugins
AbstractSolrTestCase
public class YourTest extends AbstractSolrTestCase {
  ...
  public void testSomeStuff() throws Exception {
    // Index two documents and commit so they become searchable
    assertU(adoc("id", "7", "description", "Travel Guide",
                 "title", "Paris in 10 Days"));
    assertU(adoc("id", "42", "description", "Cool Book",
                 "title", "Hitch Hiker's Guide to the Galaxy"));
    assertU(commit());
    // Query with dismax and check the response with XPath expressions
    assertQ("multi qf", req("q", "guide",
                            "qt", "dismax",
                            "qf", "title^2 description^1")
            ,"//*[@numFound='2']"
            ,"//result/doc[1]/int[@name='id'][.='42']"
            ,"//result/doc[2]/int[@name='id'][.='7']"
            );
  }
}
Questions?
http://lucene.apache.org/solr/