OCTOBER 11-14, 2016 • BOSTON, MA
Rebuilding Solr 6 examples – layer by layer
Alexandre Rafalovitch
www.solr-start.com
Who am I
• Software developer with 20+ years of experience
– Including 3 years as Senior Tech Support (BEA Weblogic)
• Solr popularizer
• Published book author on Solr Indexing (for Solr 4.3)
• Run http://www.solr-start.com resource site
• Solr committer (since August 2016)
• Past and present Solr focus on onboarding, usability, tooling, information sharing
Example catch-22
• Search is, surprisingly, a complex area of expertise
• Solr is a complex product
– Wide
– Deep
– History-rich
• And so are its many examples
Fasten the seatbelt
• Review all of the (Solr 6.2) OOTB examples
• Make a small one from scratch
• Deconstruct a real shipped example
• Next learning action...
OOTB Examples – how many?
bin/solr start -e
-e <example> Name of the example to run; available examples:
cloud: SolrCloud example
techproducts: Comprehensive example illustrating many of Solr's core capabilities
dih: Data Import Handler
schemaless: Schema-less example
techproducts example
• Used to be collection1
• solr.home: example/techproducts/solr
– Can restart with bin/solr start -s example/techproducts/solr
– Actual core at example/techproducts/solr/techproducts
techproducts example (cont.)
• Source configuration
– server/solr/configsets/sample_techproducts_configs
– Not actually a configset (copy, not share)
• Can be rebuilt: rm -rf example/techproducts
• Has data (14 files of products, money, utf8 tests): bin/post -c techproducts example/exampledocs/*.xml
schemaless example
• solr.home: example/schemaless/solr
• Actual core: example/schemaless/solr/gettingstarted
• Source configuration:
– server/solr/configsets/data_driven_schema_configs
– Config you get when you do not specify one: bin/solr create -c newcore
• No data, but can take (nearly) anything: bin/post -c <name> example/exampledocs/*.xml
schemaless mode?
• “Let us guess what you mean”
– Auto-guess field type based on first content occurrence
– Create explicit field definitions
• booleans, dates, numbers, strings
• Always multivalued (because: who knows?!?)
• Can be configured (URP chain in solrconfig.xml)
– Rewrites managed-schema (comments begone!)
– Makes search work with <copyField source="*" dest="_text_"/>
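The guessing described on this slide can be sketched in a few lines of Python. This is not Solr's code: the check order, the single accepted date layout, and the mapping onto the type names booleans, tlongs, tdoubles, tdates and strings are all assumptions made for illustration. The point it shows is that the type is fixed by the first value seen for a field.

```python
from datetime import datetime


def guess_field_type(value: str) -> str:
    """Guess a multiValued field type name from one raw string value.

    Hypothetical helper: the check order and type names are assumptions,
    not Solr's actual update request processor chain.
    """
    text = value.strip()
    if text.lower() in ("true", "false"):
        return "booleans"
    try:
        int(text)
        return "tlongs"
    except ValueError:
        pass
    try:
        float(text)
        return "tdoubles"
    except ValueError:
        pass
    try:
        # Assumption: only this one ISO-8601 UTC layout counts as a date.
        datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
        return "tdates"
    except ValueError:
        return "strings"


def guess_schema(docs):
    """Fix each field's type from its first occurrence only."""
    schema = {}
    for doc in docs:
        for name, value in doc.items():
            # setdefault keeps the first guess; later values never change it.
            schema.setdefault(name, guess_field_type(value))
    return schema
```

Because the first occurrence wins, a later document with "price": "free" would not turn the field back into a string; in this sketch the field simply stays numeric, which is one reason to pre-add important fields by hand.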
techproducts vs schemaless
• Configured techproducts vs auto-detecting schemaless
• Strings
techproducts: "name":"Test with some GB18030 encoded characters",
schemaless: "name":["Test with some GB18030 encoded characters"],
• Numbers
techproducts: "price":0.0, "price_c":"0.0,USD",
schemaless: "price":[0.0],
• Booleans
techproducts: "inStock":true,
schemaless: "inStock":[true],
cloud example
• Highly configurable (unless using -noprompt)
• solr.home: example/cloud/nodeX/solr
• Source configuration is a choice:
Please choose a configuration for the gettingstarted collection, available options are: basic_configs, data_driven_schema_configs, or sample_techproducts_configs [data_driven_schema_configs]
• Can be rebuilt:
bin/solr stop -all
rm -rf example/cloud
• Demonstrates Config API (configoverlay.json)
dih example(s)
• Data import handler – legacy, but still kicking
• solr.home: example/example-DIH/solr
• Has 5 (five!) different cores
– db - database import (example/example-DIH/hsqldb/ex.*)
– solr - import from another Solr core (configured for db core)
– mail - import from IMAP (needs some configuration)
– tika - import rich-content (example/exampledocs/solr-word.pdf)
– rss - external XML feed (very broken right now)
• Cannot be rebuilt – only emptied: bin/post -c db -type 'application/json' -d '{delete: {query:"*:*"}}'
What about: bin/solr start?
• solr.home: server/solr
• No initial collection/cores, have to create explicitly:
– With script (see bin/solr create_core -h for details): bin/solr create -c <corename> -d <name or path>
– With Core Admin UI for non-SolrCloud: http://localhost:8983/solr/admin/cores?action=CREATE&…
– With Collection API for SolrCloud: http://localhost:8983/solr/admin/collections?action=CREATE&…
basic_configs configuration
• Available for cloud example and explicit creation
• Schemaless mode is configured, not enabled
• “Minimal Solr configuration” !?!
– managed-schema: 1005 lines
– solrconfig.xml: 1484 lines
files example
• Specifically tuned for file indexing
– Augmented schemaless mode with language, content-type guessing
– Custom /browse end-point
– Source configuration: example/files/conf
– Setup instructions: example/files/README.txt
– Bring your own data
films example
• Schemaless (based on data_driven_schema_configs)
– Uses Schema API to add custom fields
– Uses schemaless for rest of fields
• Comes with its own data (1100 film records)
• Uses velocity (/browse), Schema API, Request Parameters API (params.json)
• Setup instructions: example/films/README.txt
That was the good news
• Many examples
• Easy to get one running
• Some come with data
• Some you can throw your own data into
• Lots of comments
This is the bad news
                      Files  Types  Fields  Dynamic Fields  managed-schema size  solrconfig.xml size
basic                    46     71       4              73                 1005                 1484
data_driven              46     71       4              73                 1005                 1482
techproducts            101     66      33              28                 1149                 1701
dih db                   62     62      31              28                 1129                 1490
dih tika                  6     61       3              27                  901                 1466
files                    69     73       9              73                  517                 1508
films (data_driven+)     46     71       8              73                  481                 1482
Tip – getting these numbers
• XML extraction with XMLStarlet (XSLT CLI)
– xml sel -t -m "//fieldType" -v @name -n managed-schema
– xml sel -t -m "//copyField" -c . -n managed-schema | wc -l
– xml sel -t -m "//*[@docValues]" -v "concat(local-name(), ' ', @name, ' docValues:', @docValues)" -n managed-schema
– xml sel -t -m "//requestHandler" -v "@name" -n solrconfig.xml
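Where XMLStarlet is not installed, roughly the same counts can be pulled with Python's standard library. The sketch below is an illustrative equivalent, not part of the talk: SAMPLE_SCHEMA is a made-up miniature schema and schema_stats is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up miniature schema, used only to exercise the helper below.
SAMPLE_SCHEMA = """<schema name="sample" version="1.6">
  <field name="id" type="string" docValues="true"/>
  <field name="name" type="text_general"/>
  <dynamicField name="*_s" type="string"/>
  <copyField source="*" dest="_text_"/>
  <fieldType name="string" class="solr.StrField"/>
  <fieldType name="text_general" class="solr.TextField"/>
</schema>"""


def schema_stats(schema_xml: str) -> dict:
    """Count schema elements, mirroring the xml sel queries on the slide."""
    root = ET.fromstring(schema_xml)
    return {
        "types": len(root.findall(".//fieldType")),
        "fields": len(root.findall(".//field")),
        "dynamic_fields": len(root.findall(".//dynamicField")),
        "copy_fields": len(root.findall(".//copyField")),
        # Same report as the concat(...) query: tag, name, docValues flag.
        "doc_values": [
            f"{el.tag} {el.get('name')} docValues:{el.get('docValues')}"
            for el in root.iter()
            if el.get("docValues") is not None
        ],
    }


if __name__ == "__main__":
    print(schema_stats(SAMPLE_SCHEMA))
```

To run it against a real core, read the managed-schema file into a string and pass that in place of SAMPLE_SCHEMA.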
Why is it like this?
• Many examples predate Solr Reference Guide
• grep for options, possibilities, defaults
• Each example is a kitchen sink
“Too much of a good thing is also a bad thing”
Source: 1980s Soviet joke about Virtual Reality
Go small – managed-schema
<schema name="demo" version="1.6">
<dynamicField name="*" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="text" type="text_basic" indexed="true" stored="false" multiValued="true"/>
<copyField source="*" dest="text"/>
…
Go small – managed-schema (2)
…
<fieldType name="string" class="solr.StrField"/>
<fieldType name="text_basic" class="solr.TextField">
<analyzer>
<tokenizer class="solr.LowerCaseTokenizerFactory" />
</analyzer>
</fieldType>
</schema>
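The text_basic type has a single analysis step, the lower-case tokenizer, which splits text at non-letters and lower-cases what remains. The Python approximation below makes that behaviour concrete. It is an illustration only: str.isalpha stands in for Lucene's letter test, and lowercase_letter_tokens is a hypothetical name.

```python
from itertools import groupby


def lowercase_letter_tokens(text: str) -> list:
    """Approximate a lower-case letter tokenizer.

    Runs of letters become tokens; digits, punctuation and whitespace
    only act as separators and are dropped.
    """
    return [
        "".join(chars).lower()
        for is_letter, chars in groupby(text, key=str.isalpha)
        if is_letter
    ]


if __name__ == "__main__":
    print(lowercase_letter_tokens("Test with some GB18030 encoded characters"))
```

Note that digits vanish entirely, so "GB18030" is searchable only as "gb". That is acceptable for a minimal demo and one reason richer types such as text_general exist.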
Go small – solrconfig.xml
<config>
<luceneMatchVersion>6.2.0</luceneMatchVersion>
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="df">text</str>
</lst>
</requestHandler>
</config>
Go small – load and test
• bin/solr create -c demo -d .../demo-config/
• bin/post -c demo example/exampledocs/*.xml
• Test it works, using HTTPie (HTTP CLI)
Go small - review
• Minimal example could be very minimal
• Some things will not work
– No uniqueKey – no way to update documents, no SolrCloud
– No _version_ – no SolrCloud
– Everything is multiValued – no sorting
– copyField * => text – no meaningful relevancy, no specialized analyzer chain processing
Deconstructing films example
• bin/solr create -c films
• curl http://localhost:8983/solr/films/schema ... (add name, initial_release_date)
• Index 1100 records from
– (Solr) XML,
– (generic) JSON (doc), or
– CSV format
• Search for batman
• Use /browse end-point and search for batman
• Enable highlighting in results
Initial stats for films core
Sizes (line counts)
managed-schema* 481
solrconfig.xml 1482
params.json 20
File count in conf
.txt 41
.xml 3
.json 1
managed-‐schema (xml) 1
* already has no comments
Deconstructing – just straight tags
• managed-schema lost comments during construction
• Let's remove comments from solrconfig.xml
• xml ed -L -d "//comment()" solrconfig.xml
– -L: edit in place
– -d: delete nodes matching the XPath
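The same comment stripping can be approximated without XMLStarlet. The sketch below relies on the fact that Python's default ElementTree parser discards comments, so parsing and re-serializing is enough. It is a stand-in for the xml ed command, not a replica: SAMPLE_CONFIG is made up, and the XML declaration and original formatting are not preserved.

```python
import xml.etree.ElementTree as ET

# Made-up fragment standing in for a heavily commented solrconfig.xml.
SAMPLE_CONFIG = """<config>
  <!-- Which Lucene version to emulate -->
  <luceneMatchVersion>6.2.0</luceneMatchVersion>
  <!--
    A comment that spans
    several lines
  -->
  <jmx/>
</config>"""


def strip_comments(xml_text: str) -> str:
    """Return the document without comments or the blank lines they leave."""
    root = ET.fromstring(xml_text)  # the default parser drops comments
    serialized = ET.tostring(root, encoding="unicode")
    return "\n".join(line for line in serialized.splitlines() if line.strip())


if __name__ == "__main__":
    cleaned = strip_comments(SAMPLE_CONFIG)
    print(len(SAMPLE_CONFIG.splitlines()), "->", len(cleaned.splitlines()), "lines")
```

Counting lines before and after gives the same kind of before/after figure the following slides track.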
solrconfig.xml without comments
Sizes (line counts)
managed-schema 481
solrconfig.xml 1482 → 278
params.json 20
File count in conf
.txt 41
.xml 3
.json 1
managed-‐schema (xml) 1
Deconstructing – what to clean
• Currently
– (explicit) fields: 8
– dynamic fields: 73
• xml sel -t -m "//dynamicField" -v @name -n managed-schema | wc -l
– types: 71
– copyFields: 1
• Let's start from dynamic fields
Deconstructing – dynamic fields
• Used dynamic fields
– do NOT modify schema
– DO show up in Admin UI, if used
– Example from different schema:
• Used/matched fields
• Generic definitions
Deconstructing – in use dynamic fields
• NO dynamic fields are used
– * is a copyField instruction
• Can remove them all
• xml ed -L -d "//dynamicField" managed-schema
Remove dynamicFields
Sizes (line counts)
managed-schema 481 → 409
solrconfig.xml 278
params.json 20
File count in conf
.txt 41
.xml 3
.json 1
managed-‐schema (xml) 1
Deconstructing – field types
• How many types out of 71 do we use?
– xml sel -t -m "//field|//dynamicField" -v "@type" -n conf/managed-schema | sort -u
– long, string, strings, tdate, text_general
• But also some in solrconfig.xml
– booleans, string, strings, tdates, tdoubles, text_general, tlongs
• Combined total: 9 field type definitions
• Delete the rest (by hand)
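Deleting by hand is easier with a list of what is safe to delete. The sketch below computes the fieldType definitions that no field or dynamicField references, with an extra_used argument for type names that only appear in solrconfig.xml. The helper name and the sample schema are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Made-up schema: five types defined, only two referenced by fields.
SAMPLE_SCHEMA = """<schema name="sample" version="1.6">
  <field name="id" type="string"/>
  <field name="name" type="text_general"/>
  <fieldType name="string" class="solr.StrField"/>
  <fieldType name="long" class="solr.TrieLongField"/>
  <fieldType name="tdoubles" class="solr.TrieDoubleField" multiValued="true"/>
  <fieldType name="text_general" class="solr.TextField"/>
  <fieldType name="text_en" class="solr.TextField"/>
</schema>"""


def unused_field_types(schema_xml: str, extra_used=()) -> list:
    """List fieldType names not referenced by any field or dynamicField.

    extra_used holds type names referenced elsewhere, for example by the
    schemaless chain in solrconfig.xml, so they are never reported.
    """
    root = ET.fromstring(schema_xml)
    defined = {ft.get("name") for ft in root.iter("fieldType")}
    used = {
        el.get("type")
        for tag in ("field", "dynamicField")
        for el in root.iter(tag)
    }
    used.update(extra_used)
    return sorted(defined - used)


if __name__ == "__main__":
    print(unused_field_types(SAMPLE_SCHEMA, extra_used={"tdoubles"}))
```

The extra_used argument matters here because, as the slide notes, the types named in solrconfig.xml must survive even though no explicit field uses them yet.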
Remove no-longer used types
Sizes (line counts)
managed-schema 409 → 34 (!!!)
solrconfig.xml 278
params.json 20
File count in conf
.txt 41
.xml 3
.json 1
managed-‐schema (xml) 1
Deconstructing – support files
• Inside lang directory (38 files)
– find lang -name 'stopwords_*.txt' | wc -l
• stopwords_*.txt: 30 files
• contractions_*.txt: 4 files
– find lang -type f | egrep -v 'stopwords_|contractions_'
• hyphenations_ga.txt, stemdict_nl.txt, stoptags_ja.txt, userdict_ja.txt
Support files – still in use?
• Check for usage
– grep -o 'stopwords_.*.txt' managed-schema solrconfig.xml
– grep -o 'contractions_.*.txt' ...
– ...
• NO Matches (we no longer have related types)
– Delete the whole lang directory
• What about files just inside config directory
– Don't need currency.xml, protwords.txt
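The grep checks above can be generalized into one pass that collects every .txt or .xml name mentioned in the configuration and reports which files on disk are never mentioned. The sketch below is a rough heuristic under stated assumptions: it scans only attribute values and element text, the regular expression is a guess at what a file reference looks like, and both helper names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Made-up type definition that references two support files.
SAMPLE_SCHEMA = """<schema name="sample" version="1.6">
  <fieldType name="text_general" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
    </analyzer>
  </fieldType>
</schema>"""

# Heuristic: a path-like token ending in .txt or .xml.
FILE_REF = re.compile(r"[\w./-]+\.(?:txt|xml)")


def referenced_files(config_xml: str) -> set:
    """Collect file names found in attribute values and element text."""
    names = set()
    for el in ET.fromstring(config_xml).iter():
        for value in list(el.attrib.values()) + [el.text or ""]:
            names.update(FILE_REF.findall(value))
    return names


def unreferenced(files, *configs) -> list:
    """Return the given files that none of the config documents mention."""
    used = set().union(*(referenced_files(c) for c in configs))
    return sorted(set(files) - used)


if __name__ == "__main__":
    on_disk = ["stopwords.txt", "synonyms.txt", "protwords.txt", "lang/stopwords_en.txt"]
    print(unreferenced(on_disk, SAMPLE_SCHEMA))
```

A file reported here is only a candidate for deletion; reloading the core after removing it is still the real test.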
Remove no-longer used types
Sizes (line counts)
managed-schema 34
solrconfig.xml 278
params.json 20
File count in conf
.txt 41 → 2
.xml 3 → 2
.json 1
managed-‐schema (xml) 1
Deconstructing – actual field usage
Actual field usage - _root_
The mystery of _root_
• In the original schema – no explanations
• Documentation – used for nested documents:
To support nested documents, the schema must include an indexed/non-stored field _root_. The value of that field is populated automatically and is the same for all documents in the block, regardless of the inheritance depth.
• We are not using nested documents
• And neither does any other shipped example...
Remove _root_
Sizes (line counts)
managed-schema 34 → 33
solrconfig.xml 278
params.json 20
File count in conf
.txt 2
.xml 2
.json 1
managed-‐schema (xml) 1
Deconstructing – text_general type
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
text_general support files
• stopwords.txt
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
• synonyms.txt
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
# ...
#-----------------------------------------------------------------------
#some test synonym mappings unlikely to appear in real input text
aaafoo => aaabar
bbbfoo => bbbfoo bbbbar
cccfoo => cccbar cccbaz
fooaaa,baraaa,bazaaa
# Some synonym groups specific to this example
GB,gib,gigabyte,gigabytes
MB,mib,megabyte,megabytes
Television, Televisions, TV, TVs
#notice we use "gib" instead of "GiB" so any WordDelimiterFilter coming
#after us won't split it into two words.
# Synonym mappings can be used for spelling correction too
pixima => pixma
text_general's empty stopwords
• No file => default stopwords => English
• Empty file => disabled stopwords
• Currently – NOT used
text_general simplified definition
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Remove stopwords and synonyms
Sizes (line counts)
managed-schema 33 → 26
solrconfig.xml 278
params.json 20
File count in conf
.txt 2 → 0
.xml 2
.json 1
managed-‐schema (xml) 1
How far did we get
Sizes (line counts)
managed-schema* 481 → 26
solrconfig.xml 1482 → 278
params.json 20
File count in conf
.txt 41 → 0
.xml 3 → 2
.json 1
managed-schema (xml) 1
* already has no comments
Deconstructing – solrconfig.xml
• solrconfig.xml is more complex than schema
• Heterogeneous sections
• Nested definitions
• Alternative implementations (e.g. highlighter)
• Also remember
– configoverlay.json – overrides solrconfig.xml
– params.json – additional configuration parameters
solrconfig.xml – feature counts
11 requestHandler
8 lib
5 searchComponent
3 queryResponseWriter
2 initParams
1 updateRequestProcessorChain
1 updateHandler
1 requestDispatcher
1 query
1 luceneMatchVersion
1 jmx
1 indexConfig
1 directoryFactory
1 dataDir
1 codecFactory
solrconfig.xml – line counts
55:<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
52:<searchComponent class="solr.HighlightComponent" name="highlight">
18:<query>
17:<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
15:<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
13:<updateHandler class="solr.DirectUpdateHandler2">
9:<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
8:<requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy">
8:<requestHandler name="/tvrh" class="solr.SearchHandler" startup="lazy">
7:<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
7:<requestHandler name="/query" class="solr.SearchHandler">
6:<requestHandler name="/debug/dump" class="solr.DumpRequestHandler">
......
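Both listings (how many of each top-level element, and how many lines each one occupies) can be approximated with a short script. The slides do not say how the original numbers were produced, so this is only one possible approach: it counts lines in each element as re-serialized by ElementTree, after comments have been dropped. SAMPLE_CONFIG and both function names are made up.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Made-up miniature solrconfig.xml.
SAMPLE_CONFIG = """<config>
  <luceneMatchVersion>6.2.0</luceneMatchVersion>
  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="df">text</str>
    </lst>
  </requestHandler>
  <requestHandler name="/query" class="solr.SearchHandler"/>
</config>"""


def feature_counts(config_xml: str) -> Counter:
    """Count top-level elements by tag name."""
    return Counter(child.tag for child in ET.fromstring(config_xml))


def line_counts(config_xml: str) -> list:
    """Return (lines, label) per top-level element, largest first."""
    sizes = []
    for child in ET.fromstring(config_xml):
        text = ET.tostring(child, encoding="unicode").strip()
        label = f"{child.tag} {child.get('name', '')}".strip()
        sizes.append((len(text.splitlines()), label))
    return sorted(sizes, reverse=True)


if __name__ == "__main__":
    for tag, count in feature_counts(SAMPLE_CONFIG).most_common():
        print(count, tag)
    for lines, label in line_counts(SAMPLE_CONFIG):
        print(f"{lines}:{label}")
```

Sorting by size puts the largest sections first, which is where pruning pays off most.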
Remember, this works!
<config>
<luceneMatchVersion>6.2.0</luceneMatchVersion>
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="df">text</str>
</lst>
</requestHandler>
</config>
add-unknown-fields-to-the-schema
• Famous "schemaless" mode
• Generic, but fully configurable
• Far from perfect
– Remember, we had to manually pre-add fields
– Development, not production
– Has normalization side-effects (normalizes dates)
• Cannot remove it in our example
solrconfig.xml - highlighter
<searchComponent class="solr.HighlightComponent" name="highlight">
<highlighting>
<fragmenter name="gap" default="true" class="solr.highlight.GapFragmenter">
<lst name="defaults">
<int name="hl.fragsize">100</int>
</lst>
</fragmenter>
<fragmenter name="regex" class="solr.highlight.RegexFragmenter">
<lst name="defaults">
<int name="hl.fragsize">70</int>
<float name="hl.regex.slop">0.5</float>
<str name="hl.regex.pattern">[-\w ,/\n\"']{20,200}</str>
</lst>
</fragmenter>
<formatter name="html" default="true" class="solr.highlight.HtmlFormatter">
<lst name="defaults">
<str name="hl.simple.pre"><![CDATA[<em>]]></str>
<str name="hl.simple.post"><![CDATA[</em>]]></str>
</lst>
</formatter>
<encoder name="html" class="solr.highlight.HtmlEncoder"/>
<fragListBuilder name="simple" class="solr.highlight.SimpleFragListBuilder"/>
<fragListBuilder name="single" class="solr.highlight.SingleFragListBuilder"/>
.......
• fragmenters
• encoders
• fragListBuilders
• fragmentBuilders
• boundaryScanners
• ....
highlighter – the truth
• Highlighter searchComponent is in default stack
• The params are a mix of standard highlighter and alternative FastVector highlighter
• Cannot use FastVector version as schema fields are missing termVectors, etc.
• And standard highlighter params are the same as the implicit values
• Therefore, we can remove the WHOLE definition
Remove highlighter
Sizes (line counts)
managed-schema 26
solrconfig.xml 278 → 226
params.json 20
File count in conf
.txt 0
.xml 2
.json 1
managed-‐schema (xml) 1
Other searchComponents
• Not on the default stack
– spellcheck
– term
– termVector
– elevator
• Have dedicated requestHandlers
• Inception (example within example)
• Can be deleted
– also delete elevate.xml
15:<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
17:<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
1:<searchComponent name="terms" class="solr.TermsComponent"/>
9:<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
1:<searchComponent name="tvComponent" class="solr.TermVectorComponent"/>
8:<requestHandler name="/tvrh" class="solr.SearchHandler" startup="lazy">
4:<searchComponent name="elevator" class="solr.QueryElevationComponent">
8:<requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy">
Remove custom searchComponents
Sizes (line counts)
managed-schema 26
solrconfig.xml 226 → 163
params.json 20
File count in conf
.txt 0
.xml 2 → 1
.json 1
managed-‐schema (xml) 1
solrconfig.xml – more stuff
• There is more that can be taken out
– query section, since you have to tune it anyway
– updateHandler, and revert to basic commits
– jmx
– enableRemoteStreaming – definitely take that out
• But keep velocity, browse, search support
Next action
• Join the (virtual) Solr Example Reading Group
– Starts November 2016
– Register at http://bit.ly/SolrERG
• Join mailing list at http://www.solr-start.com
– Get the link to the presentation source
– Learn about other similar projects
– Get news of Solr articles and projects on the web
Rebuilding Solr 6 examples – layer by layer
Alexandre Rafalovitch
www.solr-start.com