Make Plone Search Act Like Google Using Solr
DESCRIPTION
Solr is a powerful open source search engine server that has become a popular choice for extending the search capabilities of Plone sites. The default configuration works well, but how do you answer the client's request to "Make my search just like Google's"? In this talk we will look at the options available in Solr's schema and configuration. We will discuss how to set up stop words, spell checking, n-grams and alternate query handlers. We will see what effect these settings have on the search results and find out how to debug problems when they arise.
TRANSCRIPT
Clayton Parker | Senior Web Developer
Make Plone Search Act Like Google Using Solr
PLONE CONFERENCE 2011
Who Am I
What will we learn?
• Intro to Solr
• Brief overview of Plone integration points
• Solr configuration
• Solr schema setup
• Debugging tips and tricks
What is Solr?
Version Madness
1.x (up to 1.4)
1.5 (number abandoned)
3.x (merge of Lucene and Solr)
Books
Integration
alm.solrindex
collective.solr
Solr Configuration
Query Handlers
• Standard
• Disjunction Max (DisMax)
• Extended DisMax (experimental)
DisMax
• Multiple index searches
• Boosting
• Friendlier to end users
DisMax
qf=SearchableText^1.0 substring^0.2
Each entry is an index name followed by ^ and its weight: SearchableText has weight 1.0 and substring has weight 0.2.
MinShouldMatch
mm=100%: all terms required
mm=50%: half of the terms required
mm=-2: all but two terms required
MinShouldMatch
mm=2<-25% 9<-3
2 or fewer terms: all are required
3 to 9 terms: all but 25% are required
More than 9 terms: all but three are required
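The mm rules above can be checked with a small calculator. This is a minimal Python sketch written from the behaviour described on the slides, not Solr's own code; the function name `min_should_match` is hypothetical, and it assumes percentages are rounded toward zero.

```python
def _simple_spec(term_count, spec):
    """Apply a single non-conditional spec such as '100%', '50%', '-2' or '-25%'."""
    if spec.endswith("%"):
        percent = int(spec[:-1])
        # Assumption: fractional results are truncated toward zero.
        magnitude = abs(term_count * percent) // 100
        calc = -magnitude if percent < 0 else magnitude
    else:
        calc = int(spec)
    # Negative values mean "all but this many terms".
    result = term_count + calc if calc < 0 else calc
    return max(0, min(term_count, result))


def min_should_match(term_count, spec):
    """Return how many of term_count optional terms must match for an mm spec."""
    spec = spec.strip()
    if "<" not in spec:
        return _simple_spec(term_count, spec)
    result = term_count
    # Conditional specs look like '2<-25% 9<-3' and are read left to right.
    for part in spec.split():
        upper_bound, sub_spec = part.split("<", 1)
        if term_count <= int(upper_bound):
            return result
        result = _simple_spec(term_count, sub_spec)
    return result


if __name__ == "__main__":
    for count in (2, 4, 9, 10):
        print(count, "terms ->", min_should_match(count, "2<-25% 9<-3"), "required")
```

Running the script prints the number of required terms for a few query lengths, which is a quick way to sanity-check an mm expression before putting it in the request handler defaults.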
Spelling Component
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="classname">solr.IndexBasedSpellChecker</str>
    <str name="buildOnCommit">true</str>
    <str name="spellcheckIndexDir">path/to/spellcheck</str>
    <!-- The field that will contain the dynamic spelling data -->
    <str name="field">spell</str>
    <str name="accuracy">0.5</str>
  </lst>
  <!-- Control indexing and query of spelling data -->
  <str name="queryAnalyzerFieldType">spell-text</str>
</searchComponent>
Spelling Schema
<fieldType name="spell-text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>
Solr Schema
Index vs Query
http://www.cominvent.com/2011/04/04/solr-architecture-diagram/
Character Filters
Tokenizer
Filters
Complete Field
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="1"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            preserveOriginal="1"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="1"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            preserveOriginal="1"/>
    <filter class="solr.PositionFilterFactory"/>
  </analyzer>
</fieldType>
Copy Field
<copyField source="SearchableText" dest="spell"/>
<copyField source="SearchableText" dest="substring"/>
Character Filters
• Process text before tokenizing
• Remove irrelevant characters
Pattern Replace
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^a-zA-Z0-9_\s-]" replacement="" replace="all"/>
Input: 'That WAS a narrow escape!' said Alice, a good deal frightened
Output: That WAS a narrow escape said Alice a good deal frightened
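The effect of this character filter can be approximated outside Solr with a regular expression. The sketch below is plain Python, not Solr code; the function name is hypothetical, and it assumes the character class is meant to keep letters, digits, underscores, hyphens and whitespace (a range written as `A-z` would also keep characters such as `[`, `^` and the backtick).

```python
import re

# Assumption: everything except letters, digits, underscore, hyphen and
# whitespace is removed before the text reaches the tokenizer.
IRRELEVANT_CHARACTERS = re.compile(r"[^a-zA-Z0-9_\s-]")


def strip_irrelevant_characters(text):
    """Remove punctuation the way the pattern replace char filter is meant to."""
    return IRRELEVANT_CHARACTERS.sub("", text)


if __name__ == "__main__":
    sample = "'That WAS a narrow escape!' said Alice, a good deal frightened"
    print(strip_irrelevant_characters(sample))
```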
Mapping
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
# œ => oe
"\u0153" => "oe"
# ß => ss
"\u00DF" => "ss"
HTML Strip
<charFilter class="solr.HTMLStripCharFilterFactory"/>
Tokenizers
• Split raw text into tokens / terms
• Typically the first step
Whitespace Tokenizer
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
Input: 'That WAS a narrow escape!' said Alice
Tokens: 'That | WAS | a | narrow | escape!' | said | Alice
ICU Tokenizer
<tokenizer class="solr.ICUTokenizerFactory"/>
Input: 'That WAS a narrow escape!' said Alice
Tokens: That | WAS | a | narrow | escape | said | Alice
Pattern Tokenizer
<tokenizer class="solr.PatternTokenizerFactory" pattern=";\s*" />
Input: one; two; three
Tokens: one | two | three
Path Hierarchy
<tokenizer class="solr.PathHierarchyTokenizerFactory"/>
Input: /usr/local/etc/nginx
Tokens: /usr | /usr/local | /usr/local/etc | /usr/local/etc/nginx
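The path hierarchy output is every ancestor prefix of the path plus the full path itself. A minimal Python sketch of that idea follows; the function name is hypothetical and it only imitates the default behaviour shown on the slide, not the tokenizer's other options.

```python
def path_hierarchy_tokens(path, delimiter="/"):
    """Return each ancestor prefix of path, shortest first, ending with path itself."""
    tokens = []
    # Start searching at index 1 so a leading delimiter does not yield an empty token.
    position = path.find(delimiter, 1)
    while position != -1:
        tokens.append(path[:position])
        position = path.find(delimiter, position + 1)
    if path:
        tokens.append(path)
    return tokens


if __name__ == "__main__":
    print(path_hierarchy_tokens("/usr/local/etc/nginx"))
```

Indexing paths this way means a query for a parent folder matches every document stored beneath it.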
Token Filters
• Process after tokenizing
• Normalization of terms
Lower Case
<filter class="solr.LowerCaseFilterFactory"/>
Input: Foo | bAr | BAZ
Output: foo | bar | baz
ASCII Folding
<filter class="solr.ASCIIFoldingFilterFactory"/>
Input: idée | bête | grüßen
Output: idee | bete | grussen
ICU Folding
<filter class="solr.ICUFoldingFilterFactory"/>
Input: Idée | BÊTE | GrüßeN
Output: idee | bete | grussen
Pattern Replace
<filter class="solr.PatternReplaceFilterFactory" pattern="[^a-zA-Z0-9_-]" replacement="" replace="all"/>
Input: 'That | WAS | a | narrow | escape!' | said | Alice
Output: That | WAS | a | narrow | escape | said | Alice
Word Delimiter
<filter class="solr.WordDelimiterFilterFactory"
        splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="1"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="0" catenateNumbers="0" catenateAll="0"
        preserveOriginal="1"/>
Input: StudlyCaps | 1234-5678
Output: StudlyCaps | 1234-5678 | Caps | Studly | 1234 | 5678
Edge N Gram
<filter class="solr.EdgeNGramFilterFactory" minGramSize="4" maxGramSize="100" side="front"/>
Input: Conqueror
Output: Conqueror | Conquero | Conquer | Conque | Conqu | Conq
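Front edge n-grams are simply the prefixes of a token between the minimum and maximum gram size, which is what makes search-as-you-type matching possible. A minimal Python sketch follows; the function name is hypothetical and the defaults mirror the minGramSize and maxGramSize values used in the filter above.

```python
def front_edge_ngrams(token, min_gram_size=4, max_gram_size=100):
    """Return the prefixes of token from min_gram_size up to max_gram_size characters."""
    longest = min(len(token), max_gram_size)
    # Tokens shorter than min_gram_size produce no grams at all.
    return [token[:size] for size in range(min_gram_size, longest + 1)]


if __name__ == "__main__":
    print(front_edge_ngrams("Conqueror"))
```

With these settings a user typing "Conq" already matches a document containing "Conqueror", while tokens of three characters or fewer produce no grams.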
Stop Words
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
Input: That | WAS | a | narrow | escape | said | Alice | a | good | deal | frightened
Output: narrow | escape | said | Alice | good | deal | frightened
Synonyms
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
# synonyms.txt
# add multiple terms
foozball, foosball, baby-foot
# merge into one
tv, t.v., tele => television
Input: foosball
Output: foozball | foosball | baby-foot
Input: tele | t.v. | tv
Output: television | television | television
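The two rule styles in synonyms.txt behave differently: a comma-separated line expands any of its terms to all of them, while a `=>` line replaces the terms on the left with the term on the right. The Python sketch below imitates that for single-word terms only; both function names are hypothetical and this is an illustration of the idea, not how Solr implements it.

```python
def parse_synonyms(lines):
    """Build a lookup of lower-cased term -> list of replacement terms."""
    mapping = {}
    for raw_line in lines:
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Explicit mapping: terms on the left are replaced by the right side.
            left, right = line.split("=>", 1)
            sources = [term.strip().lower() for term in left.split(",")]
            targets = [term.strip().lower() for term in right.split(",")]
        else:
            # Equivalent terms: with expansion, each term maps to the whole group.
            targets = [term.strip().lower() for term in line.split(",")]
            sources = targets
        for source in sources:
            mapping[source] = targets
    return mapping


def apply_synonyms(tokens, mapping):
    """Replace each token by its synonyms, leaving unknown tokens untouched."""
    output = []
    for token in tokens:
        output.extend(mapping.get(token.lower(), [token]))
    return output


if __name__ == "__main__":
    rules = parse_synonyms([
        "# add multiple terms",
        "foozball, foosball, baby-foot",
        "# merge into one",
        "tv, t.v., tele => television",
    ])
    print(apply_synonyms(["foosball"], rules))
    print(apply_synonyms(["tele", "t.v.", "tv"], rules))
```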
Language Stemming
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
Input: dry | drying | dried
Output: dri | dri | dri
Language Stemming
<filter class="solr.ElisionFilterFactory" articles="stopwordarticles.txt"/>
Input: qu'il | ne | comprend | pas | l'anglais
Output: il | ne | comprend | pas | anglais
<filter class="solr.SnowballPorterFilterFactory" language="French"/>
Input: considere | consideres | considerent
Output: consider | consider | consider
Solr Debugging
Schema Browser
Analysis
Search Interface
Crafting a URL
http://localhost:8983/solr/select?qf=SearchableText^1.0&rows=10&fl=*,score&debugQuery=on&explainOther=True&indent=true&defType=dismax&q=test
q=test
qf=SearchableText^1.0
defType=dismax
debugQuery=on
explainOther=on
indent=on
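Debug URLs like this one are easier to build in code than by hand. The following Python sketch uses only the standard library; the function name `build_debug_url` is hypothetical, the base URL is the localhost address from the slide, and explainOther is left out here to keep the example small.

```python
from urllib.parse import urlencode


def build_debug_url(query, base="http://localhost:8983/solr/select"):
    """Return a Solr select URL that runs a DisMax query with debugging enabled."""
    params = {
        "q": query,
        "defType": "dismax",          # use the DisMax query handler
        "qf": "SearchableText^1.0",   # index name and weight
        "rows": 10,
        "fl": "*,score",              # return all fields plus the score
        "debugQuery": "on",           # include the parsed query and explain output
        "indent": "on",               # pretty-print the response
    }
    # urlencode percent-escapes characters such as ^ and * in the values.
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_debug_url("test"))
```

Pasting the printed URL into a browser should return the same kind of response that the debugging slides walk through, assuming a Solr instance is listening on that address.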
Verbose XML*
* like there is any other kind
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">2</int>
  <lst name="params">
    <str name="explainOther">True</str>
    <str name="fl">*,score</str>
    <str name="debugQuery">on</str>
    <str name="indent">true</str>
    <str name="q">test</str>
    <str name="qf">SearchableText^1.0</str>
    <str name="rows">10</str>
    <str name="defType">dismax</str>
  </lst>
</lst>
<result name="response" numFound="2" start="0" maxScore="0.70710677">
  <doc>
    <float name="score">0.70710677</float>
    <int name="docid">-643919099</int>
  </doc>
  <doc>
    <float name="score">0.3788861</float>
    <int name="docid">-643919097</int>
  </doc>
</result>
<lst name="debug">
  <str name="rawquerystring">test</str>
  <str name="querystring">test</str>
  <str name="parsedquery">+DisjunctionMaxQuery((SearchableText:test)) ()</str>
  <str name="parsedquery_toString">+(SearchableText:test) ()</str>
  <lst name="explain">
    <str name="-643919099">
0.70710677 = (MATCH) sum of:
  0.70710677 = (MATCH) fieldWeight(SearchableText:test in 4), product of:
    1.4142135 = tf(termFreq(SearchableText:test)=2)
    1.0 = idf(docFreq=5, maxDocs=6)
    0.5 = fieldNorm(field=SearchableText, doc=4)
    </str>
    <str name="-643919097">
0.3788861 = (MATCH) sum of:
  0.3788861 = (MATCH) fieldWeight(SearchableText:test in 0), product of:
    1.7320508 = tf(termFreq(SearchableText:test)=3)
    1.0 = idf(docFreq=5, maxDocs=6)
    0.21875 = fieldNorm(field=SearchableText, doc=0)
    </str>
  </lst>
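The numbers in the explain output can be reproduced by hand, which helps when working out why one document outranks another. The Python sketch below assumes Lucene's classic TF-IDF similarity, where tf is the square root of the term frequency and idf is 1 + ln(maxDocs / (docFreq + 1)); the function names are hypothetical and the fieldNorm values are copied from the output rather than computed.

```python
import math


def tf(term_freq):
    """Term frequency factor: square root of the raw count."""
    return math.sqrt(term_freq)


def idf(doc_freq, max_docs):
    """Inverse document frequency under the classic TF-IDF formula."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(term_freq, doc_freq, max_docs, field_norm):
    """Product of tf, idf and fieldNorm, as listed in the explain output."""
    return tf(term_freq) * idf(doc_freq, max_docs) * field_norm


if __name__ == "__main__":
    # Values taken from the two documents in the explain output.
    print(field_weight(term_freq=2, doc_freq=5, max_docs=6, field_norm=0.5))
    print(field_weight(term_freq=3, doc_freq=5, max_docs=6, field_norm=0.21875))
```

The second document contains the term more often, but its much smaller fieldNorm (typically a sign of a longer field) pulls its score below the first.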
Links
• Solr (http://lucene.apache.org/solr)
• Solr Wiki (http://wiki.apache.org/solr)
• Books (http://www.packtpub.com/books/all?keys=solr)
• SolrIndex (http://pypi.python.org/pypi/alm.solrindex/)
• collective.solr (http://pypi.python.org/pypi/collective.solr)
Flickr Credits
• http://www.flickr.com/photos/naturegeak/5642083189/ (who)
• http://www.flickr.com/photos/eklektikos/2541408630/ (schema)
• http://www.flickr.com/photos/sidelong/13954593/ (char filter)
• http://www.flickr.com/photos/benimoto/2214240119/ (tokenizers)
• http://www.flickr.com/photos/chaunceydavis/3264077445/ (filters)
• http://www.flickr.com/photos/comedynose/3271760209/ (configuration)
• http://www.flickr.com/photos/nicksart/4821509371/ (debugging)
Thanks to
Check out
sixfeetup.com/demos
Questions?