solr presentation
TorreSaracena Group
Gestione delle Informazioni su Web (Web Information Management)
Information Retrieval project
Francesco Maglia
Ilario Maiolo
Gianluca Porcino
Matteo Cannaviccio
Apache Solr
Solr is an open source enterprise search platform from the Apache Lucene project.
Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Apache Tomcat or Jetty.
Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it usable from most popular programming languages.
Solr's powerful external configuration allows it to be tailored to many types of application without Java coding, and it has a plugin architecture to support more advanced customization.
Summary
Dataset Indexing
Dataset Querying
Implementation of additional Features
Web Application for Search
Dataset Indexing
“Search engine indexing collects, parses, and stores data to facilitate fast and accurate information
retrieval.” Wikipedia
What needs to be indexed: meta tags and relevant fields
What does not: useless fields
Terms Indexing
Meta tags
Relevant fields
Image Indexing
Meta tags (where present)
Image fields
Check the image
Dataset Querying
Queries are made through Request Handlers, which are responsible for answering each request.
Two SearchHandler instances have been implemented:
The first handler manages the search of terms:
<requestHandler name="/lighthouse"
class="solr.SearchHandler">.../>
The second handler manages the search of images:
<requestHandler name="/lighthouseimg" class="solr.SearchHandler">.../>
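As a rough illustration of how a client could address the two handlers, the sketch below only builds the request URLs; it does not contact a server. The base URL, the core name `collection1`, and the helper `build_search_url` are assumptions made for the example and do not come from the slides. `q`, `rows`, and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Hypothetical base URL: the slides do not give host, port, or core name.
SOLR_BASE = "http://localhost:8983/solr/collection1"


def build_search_url(handler: str, query: str, rows: int = 10) -> str:
    """Return the URL that sends `query` to the given request handler."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{SOLR_BASE}{handler}?{urlencode(params)}"


# One URL per handler named in the slides.
terms_url = build_search_url("/lighthouse", "open source search")
images_url = build_search_url("/lighthouseimg", "lighthouse")
print(terms_url)
print(images_url)
```

Keeping the handler path as a parameter means the same client code can serve both the term search and the image search.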
Querying settings
The handler allows you to assign different weights to the terms in the index in order to sort the search results.
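One common way to express such weights in Solr is a per-field boost list passed in the `qf` parameter of the eDisMax query parser. The sketch below assumes that mechanism; the slides do not say which parser or which weights the project used, so the numbers and the helper `boost_params` are purely illustrative.

```python
from urllib.parse import urlencode

# Illustrative weights only: the actual values are not given in the slides.
FIELD_WEIGHTS = {"title": 3.0, "keywords": 2.0, "description": 1.5, "body": 1.0}


def boost_params(query: str, weights: dict) -> dict:
    """Build query parameters that boost each field by its weight."""
    # Solr's boost syntax is field^weight, with fields separated by spaces.
    qf = " ".join(f"{field}^{weight:g}" for field, weight in weights.items())
    return {"q": query, "defType": "edismax", "qf": qf}


params = boost_params("solr", FIELD_WEIGHTS)
print(params["qf"])
print(urlencode(params))
```

With these example weights a match in the title counts three times as much as a match in the body, which pushes pages with the term in their title toward the top of the results.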
Highlights
Solr provides a collection of highlighting utilities which can be called by Request Handlers to include "highlighted" matches in field values.
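When highlighting is enabled (standard parameters `hl=true` and `hl.fl=<fields>`), the JSON response carries a separate `highlighting` section keyed by document id. The sketch below reads that section from a hand-written sample; the sample response, the unique key name `id`, and the helper `snippets_by_doc` are assumptions for illustration, not output from the project.

```python
import json

# Made-up response in the general shape Solr returns with hl=true; not real output.
SAMPLE = """
{
  "response": {"numFound": 1, "start": 0,
               "docs": [{"id": "page-1", "title": "Apache Solr"}]},
  "highlighting": {"page-1": {"body": ["an open source <em>search</em> platform"]}}
}
"""


def snippets_by_doc(raw: str, field: str = "body") -> dict:
    """Map each returned document id to its highlighted snippets for `field`."""
    data = json.loads(raw)
    highlights = data.get("highlighting", {})
    result = {}
    for doc in data["response"]["docs"]:
        doc_id = doc["id"]  # assumes the unique key field is called "id"
        result[doc_id] = highlights.get(doc_id, {}).get(field, [])
    return result


print(snippets_by_doc(SAMPLE))
```

Documents without a highlighted match in the chosen field simply map to an empty list, so the caller can fall back to a plain description.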
Configuration Files
schema.xml
Describes the structure of the data index. It consists of several parts:
field definitions: body, title, description, keywords, alt, src, text_autocomplete
type definitions (tokenizer):
text_html: body, alt
text_general: title, description, keywords
text_auto: text_autocomplete
copyField section: copies body, title, description into text_autocomplete
solrconfig.xml
Configuration file for search components and request handlers
Analysis Process
Each document (HTML page) consists of searchable fields. The rules for searching each field are defined using field type definitions.
When a document is added or updated, its fields are analyzed and tokenized, and those tokens are stored in the index.
The analysis process in Solr consists of the following phases:
Pre-tokenization analysis (CharFilter classes)
Tokenization (Tokenizer class)
Post-tokenization analysis (Filter classes)
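The three phases can be mimicked in a few lines of plain Python. This is a simplified stand-in, not Solr's implementation: the regular expressions, the tiny stopword set, and the function names are all assumptions chosen to show how the output of each phase feeds the next.

```python
import html
import re

# Tiny illustrative stopword list; a real stopwords.txt is much longer.
STOPWORDS = {"a", "an", "and", "is", "of", "the", "to"}


def char_filter(text: str) -> str:
    """Pre-tokenization: replace tags with spaces and decode entities."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def tokenize(text: str) -> list:
    """Tokenization: split the cleaned text into runs of word characters."""
    return re.findall(r"\w+", text)


def token_filters(tokens: list) -> list:
    """Post-tokenization: lowercase every token, then drop stopwords."""
    lowered = (token.lower() for token in tokens)
    return [token for token in lowered if token not in STOPWORDS]


def analyze(text: str) -> list:
    """Run the three phases in order."""
    return token_filters(tokenize(char_filter(text)))


print(analyze("<p>Solr is the <b>Search</b> Server</p>"))
# ['solr', 'search', 'server']
```

The order matters: the character filter sees raw text, the tokenizer sees cleaned text, and the filters only ever see individual tokens.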
Tokenizers and Filters
text_html
1. solr.HTMLStripCharFilterFactory (strips HTML elements from the analyzed text)
2. solr.StandardTokenizerFactory
3. solr.LowerCaseFilterFactory
4. solr.StopFilterFactory (stopwords.txt)
text_auto
solr.WhitespaceTokenizerFactory (divides text at whitespace)
solr.WordDelimiterFilterFactory (splits on intra-word delimiters, e.g. “Wi-Fi” → “Wi”, “Fi”)
solr.LowerCaseFilterFactory
text_general
Steps 2, 3, 4 of text_html
solr.SynonymFilterFactory (query time only, e.g. “I-pod” → “Ipod”)
solr.SnowballPorterFilterFactory (both query and index time)
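To make the text_auto chain concrete, here is a rough Python imitation of its three steps applied to the “Wi-Fi” example. It is only an approximation: the real solr.WordDelimiterFilterFactory has more rules (case changes, letter/digit transitions, configurable options), and this sketch handles ASCII letters and digits only. The function names are invented for the example.

```python
import re


def whitespace_tokenize(text: str) -> list:
    """Divide the text at whitespace."""
    return text.split()


def split_on_delimiters(tokens: list) -> list:
    """Break each token on non-alphanumeric characters, dropping empty parts."""
    parts = []
    for token in tokens:
        parts.extend(p for p in re.split(r"[^0-9A-Za-z]+", token) if p)
    return parts


def analyze_auto(text: str) -> list:
    """Whitespace tokenization, delimiter splitting, then lowercasing."""
    return [t.lower() for t in split_on_delimiters(whitespace_tokenize(text))]


print(analyze_auto("Free Wi-Fi zone"))
# ['free', 'wi', 'fi', 'zone']
```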
Misspelling
For the misspelling feature, an Index Based Spell Checker has been implemented.
Solr uses one of the configured fields of the indexed documents as dictionary input and draws its spelling suggestions from it.
The field used as dictionary input is “name_autocomplete”.
The spellcheck distance measure used is the Levenshtein distance.
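The Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one word into another. The sketch below computes it with the usual dynamic-programming table and uses it to rank candidate corrections from a small dictionary. The `suggest` helper, its `max_distance` threshold, and the word list are assumptions for illustration; Solr's spell checker has its own candidate selection and scoring.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, keeping one table row at a time."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def suggest(word: str, dictionary: list, max_distance: int = 2) -> list:
    """Return dictionary terms within max_distance of word, closest first."""
    scored = [(levenshtein(word, term), term) for term in dictionary]
    return [term for dist, term in sorted(scored) if 0 < dist <= max_distance]


print(levenshtein("kitten", "sitting"))  # 3
print(suggest("serch", ["search", "server", "solr"]))  # ['search']
```

Here “serch” is one insertion away from “search”, while the other two terms are more than two edits away and are therefore not suggested.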
Autocomplete
The aim of Autocomplete is to suggest individual words that begin with the letters typed by the user.
To configure the suggester, we had to prepare an appropriate field on which to build the hints. In our case, we use the field “name_autocomplete”.
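The core idea of prefix completion can be shown without Solr: keep the candidate terms sorted, jump to the first term that is not smaller than the prefix, and collect terms while they still start with it. This is a minimal stand-in under that assumption; Solr's suggester uses its own data structures, and the term list and function names below are invented for the example.

```python
from bisect import bisect_left


def build_terms(values: list) -> list:
    """Lowercase, de-duplicate, and sort the terms used for hints."""
    return sorted({value.lower() for value in values})


def complete(prefix: str, terms: list, limit: int = 5) -> list:
    """Return up to `limit` terms that begin with `prefix`."""
    prefix = prefix.lower()
    start = bisect_left(terms, prefix)  # first term >= prefix
    matches = []
    for term in terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        matches.append(term)
        if len(matches) == limit:
            break
    return matches


terms = build_terms(["Lighthouse", "Light", "Lucene", "Solr", "Search"])
print(complete("li", terms))  # ['light', 'lighthouse']
print(complete("s", terms))   # ['search', 'solr']
```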
Web Application
Search terms → /lighthouse → response → $.parseJSON()
Search images → /lighthouseimg → response → $.parseJSON()
Screenshots
Menu and textbox
Search with Autocomplete (suggestions appear while typing)
Misspelling
Snippet Results (number of results, page hyperlink)
Sorted Results
Images Search