solr presentation

TorreSaracena Group. Web Information Management (Gestione delle Informazioni su Web). IR Project (Esperienza IR). Francesco Maglia, Ilario Maiolo, Gianluca Porcino, Matteo Cannaviccio

Upload: matteo-cannaviccio

Post on 11-May-2015


Page 1: Solr Presentation

TorreSaracena Group

Web Information Management (Gestione delle Informazioni su Web)

IR Project (Esperienza IR)

Francesco Maglia

Ilario Maiolo

Gianluca Porcino

Matteo Cannaviccio

Page 2: Solr Presentation

Apache SOLR

Solr is an open source enterprise search platform from the Apache Lucene project.

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Apache Tomcat or Jetty.

Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it usable from most popular programming languages.
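
As a rough illustration of the HTTP API, the sketch below builds a query URL with Python's standard library. The host, port, core name (`collection1`) and the `/select` path are assumptions for a default local install, not values from the slides; `q`, `rows` and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Assumed base URL of a local Solr core (hypothetical, not from the slides).
SOLR_BASE = "http://localhost:8983/solr/collection1"

def build_select_url(query, rows=10, base=SOLR_BASE):
    """Return a URL that asks Solr for JSON results matching `query`."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base}/select?{urlencode(params)}"

url = build_select_url("lighthouse keeper")
print(url)
```

Any HTTP client in any language can then fetch this URL, which is what makes the API usable from most programming languages.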

Solr's powerful external configuration allows it to be tailored to many types of application without Java coding, and it has a plugin architecture to support more advanced customization.

Page 3: Solr Presentation

Summary

Dataset Indexing

Dataset Querying

Implementation of additional Features

Web Application for Search

Page 4: Solr Presentation

Dataset Indexing

“Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval.” (Wikipedia)

What needs to be indexed: meta tags, relevant fields

What does not: useless fields

Page 5: Solr Presentation

Terms Indexing

Meta tags

Relevant fields

Page 6: Solr Presentation

Image Indexing

Meta tags (where present)

Image fields

Check the image

Page 7: Solr Presentation

Dataset Querying

Queries are made through Request Handlers, which are responsible for answering each request.

Two SearchHandler instances have been implemented:

The first handler manages the search of terms

<requestHandler name="/lighthouse" class="solr.SearchHandler">...</requestHandler>

The second handler manages the search of images

<requestHandler name="/lighthouseimg" class="solr.SearchHandler">...</requestHandler>

Page 8: Solr Presentation

Querying settings

The handler lets you assign different weights to the terms in the index, which determine how the search results are ranked.

Highlights

Solr provides a collection of highlighting utilities that Request Handlers can call to include "highlighted" matches in field values.
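
A minimal sketch of how weights and highlighting might be expressed as query parameters. The slides do not name the query parser or the actual weights, so `edismax` and the boost values are assumptions; `qf`, `hl` and `hl.fl` are standard Solr parameters, and the field names are the ones listed on the configuration slide.

```python
from urllib.parse import urlencode

def build_search_params(query, field_weights, highlight_fields):
    """Build query parameters with per-field boosts and highlighting enabled."""
    # "title^3.0 body^1.0": a higher boost makes matches in that field count more.
    qf = " ".join(f"{field}^{weight}" for field, weight in field_weights.items())
    return {
        "q": query,
        "defType": "edismax",  # assumption: the slides do not name the query parser
        "qf": qf,
        "hl": "true",          # turn highlighting on
        "hl.fl": ",".join(highlight_fields),
        "wt": "json",
    }

# Hypothetical weights, chosen only for illustration.
params = build_search_params(
    "lighthouse",
    {"title": 3.0, "description": 2.0, "body": 1.0},
    ["body"],
)
print("/lighthouse?" + urlencode(params))
```

The same parameters can also be set once as handler defaults in solrconfig.xml, so that clients only need to send `q`.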

Page 9: Solr Presentation

Configuration Files

schema.xml

Describes the structure of the index. It consists of several parts:

field definitions: body, title, description, keywords, alt, src, text_autocomplete

type definitions (tokenizer):

text_html: body, alt

text_general: title, description, keywords

text_auto: text_autocomplete

copyField section: copies body, title and description into text_autocomplete

solrconfig.xml

Configuration file for search components and request handlers

Page 10: Solr Presentation

Analysis Process

Each document (HTML page) consists of searchable fields. The rules for searching each field are defined using field type definitions.

When a document is added or updated, its fields are analyzed and tokenized, and those tokens are stored in the index.

The analysis process in Solr consists of the following phases:

Pre-tokenization analysis (CharFilter class)

Tokenization (Tokenizer class)

Post-tokenization analysis (Filter classes)
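
The three phases can be imitated in a few lines of Python. This is only a toy model of the idea, not Solr's implementation: the regular expressions and the stopword set are assumptions made for the example.

```python
import re

STOPWORDS = {"the", "a", "of", "and"}  # hypothetical stand-in for a stopword file

def char_filter(text):
    """Pre-tokenization: drop HTML tags (a crude stand-in for a CharFilter)."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    """Tokenization: split into runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)

def token_filters(tokens):
    """Post-tokenization: lowercase, then remove stopwords."""
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered if t not in STOPWORDS]

def analyze(text):
    """Run the three phases in order and return the tokens to index."""
    return token_filters(tokenize(char_filter(text)))

print(analyze("<h1>The Lighthouse</h1><p>History of a tower</p>"))
```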

Page 11: Solr Presentation

Tokenizers and Filters

text_html

solr.HTMLStripCharFilterFactory (strips HTML elements from the analyzed text)

solr.StandardTokenizerFactory

solr.LowerCaseFilterFactory

solr.StopFilterFactory (stopwords.txt)

text_auto

solr.WhitespaceTokenizerFactory (divides text at whitespace)

solr.WordDelimiterFilterFactory (splits on intra-word delimiters, e.g. “Wi-Fi” → “Wi”, “Fi”)

solr.LowerCaseFilterFactory

text_general 2, 3, 4

solr.SynonymFilterFactory (query time only, e.g. I-pod → Ipod)

solr.SnowballPorterFilterFactory (both query and index time)
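
To make the text_auto chain concrete, the sketch below imitates its three steps on the “Wi-Fi” example. It models only splitting on non-alphanumeric delimiters; the real WordDelimiterFilterFactory has further options that are not reproduced here.

```python
import re

def whitespace_tokenize(text):
    """Divide text at whitespace."""
    return text.split()

def split_on_delimiters(tokens):
    """Split each token on non-alphanumeric characters, e.g. 'Wi-Fi' -> 'Wi', 'Fi'."""
    parts = []
    for token in tokens:
        parts.extend(p for p in re.split(r"[^A-Za-z0-9]+", token) if p)
    return parts

def analyze_auto(text):
    """Whitespace tokenizer, then delimiter split, then lowercase."""
    return [t.lower() for t in split_on_delimiters(whitespace_tokenize(text))]

print(analyze_auto("Wi-Fi router"))
```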

Page 12: Solr Presentation

Misspelling

An index-based spell checker has been implemented for the misspelling feature.

Solr uses one of the configured fields of the indexed documents as dictionary input for spelling suggestions.

The field used as dictionary input is “name_autocomplete”.

The spellcheck distance measure used is the Levenshtein distance.
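
For reference, the Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The sketch below computes it and picks the closest term from a small, hypothetical dictionary by brute force; it illustrates the measure only and does not reproduce how Solr selects its candidates.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def suggest(word, dictionary):
    """Return the dictionary term with the smallest distance to `word`."""
    return min(dictionary, key=lambda term: levenshtein(word, term))

terms = ["lighthouse", "lightning", "harbour"]  # hypothetical indexed terms
print(suggest("lighthuose", terms))
```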

Page 13: Solr Presentation

Autocomplete

The aim of Autocomplete is to suggest individual words that begin with the letters typed by the user.

To configure the suggester, we had to prepare an appropriate field on which to build the suggestions. In our case, we use the field “name_autocomplete”.
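
The prefix matching behind autocomplete can be sketched with a sorted term list and a binary search. The term list is hypothetical and assumed to be lowercase; this shows the idea only and is not Solr's suggester.

```python
from bisect import bisect_left

def suggest_prefix(prefix, sorted_terms, limit=5):
    """Return up to `limit` terms from a sorted list that start with `prefix`."""
    prefix = prefix.lower()
    start = bisect_left(sorted_terms, prefix)  # first term >= prefix
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        matches.append(term)
        if len(matches) == limit:
            break
    return matches

terms = sorted(["light", "lighthouse", "lightning", "harbour", "lamp"])
print(suggest_prefix("lig", terms))
```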

Page 14: Solr Presentation

Web Application

Search terms → /lighthouse → response → $.parseJSON()

Search images → /lighthouseimg → response → $.parseJSON()
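
The web application parses each handler response with jQuery's `$.parseJSON()`. A Python analogue of that step is sketched below. The response body is a hypothetical example in the shape Solr returns for `wt=json`; the document values are invented for illustration.

```python
import json

# Hypothetical response body in the shape Solr returns for wt=json.
raw = """
{
  "responseHeader": {"status": 0, "QTime": 3},
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {"title": "Lighthouse history", "description": "A short history"},
      {"title": "Famous lighthouses", "description": "A list of towers"}
    ]
  }
}
"""

def extract_results(body):
    """Return (total hits, list of titles) from a Solr JSON response body."""
    data = json.loads(body)
    response = data["response"]
    return response["numFound"], [doc.get("title") for doc in response["docs"]]

total, titles = extract_results(raw)
print(total, titles)
```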

Page 15: Solr Presentation

Screenshots

MENU

TEXTBOX

Page 16: Solr Presentation

Search with Autocomplete

SUGGESTS

TYPING…

Page 17: Solr Presentation

Misspelling

MISSPELLING

Page 18: Solr Presentation

Snippet Results

NUMBER RESULTS

Page 19: Solr Presentation

Snippet Results

PAGE

HYPERLINK

Sorted Results

Page 20: Solr Presentation

Images Search

Page 21: Solr Presentation

Images Search