How SOLR Search Works
Rajat Jain - 20th Dec, 2016
Agenda
• What do you mean by Search?
• Search Requirements
• Comparison of SOLR with SQL/NoSQL
• SOLR Architecture
• SOLR Usage in Trellis
• How Google Search Works
• Other Search Technologies
What do you mean by Search?
Search Requirements
• Text Search – e.g. “Architects”
• Filters – e.g. “In New Delhi”, “iOS”
• Sorting – e.g. “Best Match”, “Highest Rating”, etc.
• And more:
  • Facets
  • Stemming
  • Fuzzy Matching
  • Image Search, etc.
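The requirements above map onto standard Solr query parameters: `q` for the text, `fq` for filters, `sort` for ordering, and `facet` / `facet.field` for facets. A minimal sketch of building such a request URL follows. The host, the collection name `professionals`, and the field names `city`, `rating`, and `platform` are assumptions made up for illustration; they are not from the deck.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint and collection name, for illustration only.
SOLR_SELECT_URL = "http://localhost:8983/solr/professionals/select"


def build_search_url(text, city=None, sort=None, facet_field=None, rows=10):
    """Build a Solr /select URL from text, filter, sort and facet choices."""
    params = [("q", text), ("rows", rows), ("wt", "json")]
    if city:
        # fq is a filter query: it restricts the result set without changing scores.
        params.append(("fq", f'city:"{city}"'))  # "city" is an assumed field name
    if sort:
        params.append(("sort", sort))
    if facet_field:
        params.append(("facet", "true"))
        params.append(("facet.field", facet_field))
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_search_url(
    "architects", city="New Delhi", sort="rating desc", facet_field="platform"
)
print(url)
```

The sketch only builds the URL; sending it requires a running Solr instance with a matching schema.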
Search Requirements
• Full Text Search
• Fast reads (writes can be slower)
• Various Combinations of Filters
• Various Combinations of Sorting
• Non Features:
  • Real-time – usually staleness is not a problem
  • Data Integrity – usually not a source of storage – can be ‘lossy’
Search Requirements – Faceted Search
• A type of filtering with suggestions
• In most cases, the suggestions are sorted by the number of matching results
• Helps the user narrow down the search without having to ‘guess’ how to narrow it
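A minimal sketch of the idea, assuming a small in-memory result set: count how many results fall under each value of a field and list the values by count. The documents and the `city` field are invented for illustration; Solr computes facet counts inside the index rather than over a Python list.

```python
from collections import Counter

# Made-up search results; "city" is a hypothetical facet field.
results = [
    {"name": "A", "city": "New Delhi"},
    {"name": "B", "city": "Mumbai"},
    {"name": "C", "city": "New Delhi"},
    {"name": "D", "city": "Pune"},
    {"name": "E", "city": "New Delhi"},
    {"name": "F", "city": "Mumbai"},
]


def facet_counts(docs, field):
    """Return (value, count) pairs for a field, highest count first."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    return counts.most_common()


facets = facet_counts(results, "city")
for value, count in facets:
    print(f"{value} ({count})")
```

Each printed line is a suggestion the user can click to narrow the search.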
Conventional Storage for Search
• SQL (MySQL)
  • Relational tables
  • Normalized data
  • Assumes keys / indexes are used for reads & writes
  • Optimized for reads, writes & transactional data (ACID transactions)
  • Lots of security, etc.
  • Table data stored in the file system
  • Indexing – individual columns or sets of columns
  • Full-text search – a recent addition (full-text index)
Conventional Storage for Search
• NoSQL (think MongoDB)
  • Key-value pairs
  • De-normalized data
  • Unstructured data
  • Optimized for reads – writes can be slightly slower (in the transactional case)
  • Data stored in the file system
  • Indexing – individual fields
  • Full-text search – has built-in support
Advantages of SOLR over MySQL/NoSQL
• Inverted index
• Mind-blowing Text-analysis / stemming / scoring / fuzziness
• Weighting fields / boosting – custom scoring functions
• Single document concept – no relations (in general)
• Faceting support out-of-the box
• Optimized for search and search alone (at scale without performance drop)
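To make field weighting concrete, here is a minimal sketch in which a match in one field counts more than a match in another. The boost values, field names, and documents are assumptions for illustration; Solr's own boosting is configured through its query parsers and combines with Lucene's relevance scoring, which this toy function does not reproduce.

```python
# Hypothetical per-field boosts: a title match counts 3x a body match.
FIELD_BOOSTS = {"title": 3.0, "body": 1.0}


def boosted_score(doc, query_terms, boosts=FIELD_BOOSTS):
    """Sum term occurrences per field, weighted by that field's boost."""
    score = 0.0
    for field, boost in boosts.items():
        tokens = doc.get(field, "").lower().split()
        matches = sum(tokens.count(term) for term in query_terms)
        score += boost * matches
    return score


docs = [
    {"id": 1, "title": "Solr search basics", "body": "An introduction to indexing"},
    {"id": 2, "title": "Database tuning", "body": "Search tips and more search tricks"},
]
ranked = sorted(docs, key=lambda d: boosted_score(d, ["search"]), reverse=True)
print([d["id"] for d in ranked])
```

Document 1 ranks first even though document 2 mentions the term more often, because the single title match outweighs two body matches.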
SOLR Architecture – Indexing
• Take a ‘document’ and its fields
• For each field, apply a set of tokenizers / filters
• Convert the field text to individual tokens
• Update the ‘inverted’ index based on the tokens
• In general, the index also keeps track of stats, etc. for the various terms
• A different index per field
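The steps above can be sketched as a toy inverted index. The analyzer (lowercase, split on non-alphanumerics, drop a few stop words) and the sample documents are assumptions for illustration; Lucene's analyzers and on-disk index format are far more elaborate.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "the", "in", "of", "and"}  # tiny illustrative list


def analyze(text):
    """Tokenizer plus filters: lowercase, split on non-alphanumerics, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_index(docs):
    """Return {field: {token: {doc_id: term_frequency}}}, one inverted index per field."""
    index = defaultdict(lambda: defaultdict(dict))
    for doc_id, doc in docs.items():
        for field, value in doc.items():
            for token in analyze(value):
                postings = index[field][token]
                # The term frequency is the kind of per-term stat the index tracks.
                postings[doc_id] = postings.get(doc_id, 0) + 1
    return index


docs = {
    1: {"title": "Architects in New Delhi", "body": "Top rated architects and designers"},
    2: {"title": "iOS developers", "body": "Developers in New Delhi"},
}
index = build_index(docs)
print(index["title"]["delhi"])
print(index["body"]["delhi"])
```

Looking up a token now returns the matching document ids directly, instead of scanning every document.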
SOLR Architecture - Indexing
[Diagram: Solr indexing pipeline. Update handlers receive documents over HTTP POST: XML Update Handler (/update), CSV Update Handler (/update/csv), XML Update with custom processor chain (/update/xml), and Extracting RequestHandler for PDF, Word, … (/update/extract). The Data Import Handler pulls from a SQL DB or an RSS feed and applies simple transforms. Each handler has an update processor chain (Remove Duplicates, Logging, Index, and Custom Transform processors), and documents pass through Lucene's text index analyzers into the Lucene index.]
SOLR Architecture – Searching
• User enters query
• Parse the query, i.e. apply the required filters and tokenizers
• Converted to tokens
• Parallel search across multiple indexes (per field)
• Score all the documents
• Sort in async fashion
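A minimal sketch of the query side, assuming a toy single-field index and a simplified TF-IDF score: tokenize the query, look up each token's postings, add up a score per document, and sort. The index contents and the scoring formula are assumptions for illustration; Lucene's actual similarity is more involved.

```python
import math
from collections import defaultdict

# Toy single-field inverted index: token -> {doc_id: term frequency}.
INDEX = {
    "architects": {1: 2, 3: 1},
    "delhi": {1: 1, 2: 1, 3: 1},
    "ios": {2: 1},
}
TOTAL_DOCS = 3


def search(query, index=INDEX, total_docs=TOTAL_DOCS):
    """Return (doc_id, score) pairs, best match first."""
    tokens = query.lower().split()  # stand-in for the query-time analyzer
    scores = defaultdict(float)
    for token in tokens:
        postings = index.get(token, {})
        if not postings:
            continue
        # Rare terms weigh more than terms that appear in most documents.
        idf = math.log(1 + total_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


hits = search("Architects Delhi")
print([doc_id for doc_id, _ in hits])
```

Only documents that appear in some posting list are ever scored, which is why lookups stay fast as the collection grows.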
SOLR Architecture - Full
SOLR Architecture – Updating Index
• Types of Index Updates
  • Instant Index
  • Incremental Indexing
  • Full Indexing
• Index Update Strategies
  • Instant / incremental indexing cannot run continuously
  • Too many updates cause performance degradation
  • Run a full index periodically to optimize the index
SOLR Architecture – Scalability
• Sharding
  • Splitting collections across servers – search in parallel
• Replication
  • More than one copy of the data for failover
• SolrCloud
  • Using Zookeeper for managing clusters
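Sharding can be sketched as "route each document to one shard by hashing its id, then send every query to all shards and merge the partial results". The shard count, the CRC32-based routing, and the occurrence-count score below are assumptions for illustration, not SolrCloud's actual router or scoring; the shards are also searched one after another here, where a real cluster would query them in parallel.

```python
import heapq
import zlib

NUM_SHARDS = 3  # hypothetical cluster size


def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Route a document to a shard by hashing its id (stable across runs)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def index_docs(docs):
    """Distribute documents across in-memory shards."""
    shards = [dict() for _ in range(NUM_SHARDS)]
    for doc_id, text in docs.items():
        shards[shard_for(doc_id)][doc_id] = text
    return shards


def search_shard(shard, term):
    """Score each document in one shard by the number of occurrences of the term."""
    hits = []
    for doc_id, text in shard.items():
        count = text.lower().split().count(term)
        if count:
            hits.append((count, doc_id))
    return hits


def search_all(shards, term, top_n=2):
    """Scatter the query to every shard, then gather and merge the partial results."""
    partial = [hit for shard in shards for hit in search_shard(shard, term)]
    return heapq.nlargest(top_n, partial)


docs = {"doc-1": "solr solr search", "doc-2": "mysql", "doc-3": "solr index", "doc-4": "search"}
shards = index_docs(docs)
print(search_all(shards, "solr"))
```

Replication would add copies of each shard on other servers, so a query can still be answered when one copy is down.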
SOLR Architecture – Other Features
• Stemming
  • Identify the root word and variations of the word, e.g. "stems", "stemmer", "stemming", "stemmed" are all based on "stem"
• Fuzzy Matching
  • Similar Words / Misspellings
  • Edit Distance
• NLP
  • Identify Entities / Nouns in Search Query
  • OpenNLP Plugin for SOLR
• And much more…
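Edit distance is the number of single-character insertions, deletions, and substitutions needed to turn one word into another. A minimal sketch using plain Levenshtein distance follows; the vocabulary and the `max_distance` threshold are assumptions for illustration, and Lucene's fuzzy matching is implemented differently for speed.

```python
def edit_distance(a, b):
    """Levenshtein distance computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def fuzzy_match(word, vocabulary, max_distance=2):
    """Return vocabulary words within max_distance edits, closest first."""
    candidates = sorted((edit_distance(word, v), v) for v in vocabulary)
    return [v for distance, v in candidates if distance <= max_distance]


print(edit_distance("kitten", "sitting"))
print(fuzzy_match("architcet", ["architect", "artist", "archive"]))
```

A misspelled query term can then be matched against indexed terms that are only a small number of edits away.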
SOLR Usage in Trellis
• Architecture
  • Data-in from MySQL
  • Index Update Strategy
• AutoComplete
• Basic Search
• Advanced Search
• Filters / Sorting / Facets & More
• Demo (Incl. Config Files)
How Google Search Works
• Crawling
  • Robots.txt
• Indexing
  • Multiple Indexes – Instant / Daily / Weekly / Long Tail
• Searching
  • NLP, Stemming, Auto-correct, etc.
• Ranking – PageRank
• Video - https://www.youtube.com/watch?v=BNHR6IQJGZs
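PageRank treats a link from one page to another as a vote and repeatedly redistributes rank along links until the values settle. The sketch below is the textbook power-iteration form on a made-up four-page graph; the damping factor of 0.85, the iteration count, and the graph are assumptions for illustration, and this is not a description of how Google ranks results in production.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank; every link target must also be a key in `links`."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share, the rest flows along links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Page "c" ends up on top because three of the four pages link to it, while "d" has no incoming links and keeps only the baseline share.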
Other Search Technologies
• ElasticSearch
  • Much newer than Solr
  • Built-in scalability
  • Uses same Lucene as the base
  • JSON instead of XML
  • Good for Analytical querying
• Others
  • Splunk
  • Sphinx
That’s All Folks
References
• SOLR Home Page - http://lucene.apache.org/solr/
• Tutorials
  • http://www.solrtutorial.com/index.html
  • https://lucene.apache.org/solr/4_10_0/tutorial.html
• Just Google the rest!!