solr presentation

Post on 11-May-2015

TorreSaracena Group

Web Information Management

IR Experience.

Francesco Maglia

Ilario Maiolo

Gianluca Porcino

Matteo Cannaviccio

Apache SOLR

Solr is an open source enterprise search platform from the Apache Lucene project.

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Apache Tomcat or Jetty.

Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it usable from most popular programming languages.

Solr's powerful external configuration allows it to be tailored to many types of application without Java coding, and it has a plugin architecture to support more advanced customization.

Summary

Dataset Indexing

Dataset Querying

Implementation of additional Features

Web Application for Search

Dataset Indexing

“Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval.” (Wikipedia)

What needs to be indexed
Meta tags
Relevant fields

What does not
Useless fields

Terms Indexing
Meta tags
Relevant fields

Image Indexing
Meta tags (where present)
Image fields
Check the image

Dataset Querying

Queries are made through Request Handlers, which are responsible for answering each request.

Two SearchHandlers have been implemented:

The first handler manages the search of terms

<requestHandler name="/lighthouse" class="solr.SearchHandler"> ... </requestHandler>

The second handler manages the search of images

<requestHandler name="/lighthouseimg" class="solr.SearchHandler"> ... </requestHandler>

Querying settings

The handler allows you to assign different weights to the terms in the index in order to rank the search results
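As a rough illustration of such weighting, a handler could set per-field boosts through the `qf` parameter of the eDisMax query parser. This is only a sketch: the use of `edismax`, the boost values, and the `rows` setting are assumptions, not the configuration elided in the slides. Only the handler name and the field names come from the presentation.

```xml
<!-- Hypothetical solrconfig.xml fragment; parser choice and boosts are assumed -->
<requestHandler name="/lighthouse" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- eDisMax matches one user query against several fields -->
    <str name="defType">edismax</str>
    <!-- field^boost: a match in title counts more than a match in body -->
    <str name="qf">title^3.0 keywords^2.0 description^1.5 body^1.0</str>
    <!-- number of results returned per page (assumed) -->
    <int name="rows">10</int>
  </lst>
</requestHandler>
```

With settings like these, a document that matches the query in its title would tend to rank above one that matches only in its body.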

Highlights

Solr provides a collection of highlighting utilities which can be called by Request Handlers to include "highlighted" matches in field values.
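A minimal sketch of how highlighting might be switched on in a handler's defaults. `hl`, `hl.fl`, `hl.snippets`, `hl.fragsize`, `hl.simple.pre` and `hl.simple.post` are standard Solr highlighting parameters, but the values and the choice of highlighted fields shown here are assumptions.

```xml
<!-- Hypothetical defaults block; it would sit inside a <requestHandler> element -->
<lst name="defaults">
  <str name="hl">true</str>
  <!-- fields from which snippets are built (assumed) -->
  <str name="hl.fl">title,body</str>
  <!-- at most two snippets per field, about 150 characters each (assumed) -->
  <int name="hl.snippets">2</int>
  <int name="hl.fragsize">150</int>
  <!-- markup wrapped around each matched term, escaped for XML -->
  <str name="hl.simple.pre">&lt;em&gt;</str>
  <str name="hl.simple.post">&lt;/em&gt;</str>
</lst>
```

Fields used for highlighting generally need to be stored in the index, since the snippets are cut from the stored text.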

Configuration Files

Schema.xml

Describes the structure of the index. It consists of several parts:

field definitions: body, title, description, keywords, alt, src, text_autocomplete

type definitions (tokenizer):
text_html: body, alt
text_general: title, description, keywords
text_auto: text_autocomplete

copyField section: copies body, title, description into text_autocomplete
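The parts listed above could look roughly like the following schema.xml fragment. Field names and their types follow the slide. The `indexed`, `stored` and `multiValued` attributes, and the `string` type for `src`, are assumptions.

```xml
<!-- Hypothetical schema.xml fragment; attribute values are assumed -->
<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="keywords" type="text_general" indexed="true" stored="true"/>
<field name="body" type="text_html" indexed="true" stored="true"/>
<field name="alt" type="text_html" indexed="true" stored="true"/>
<!-- the slides give no type for src; a plain string type is assumed -->
<field name="src" type="string" indexed="true" stored="true"/>
<!-- multiValued because several source fields are copied into it -->
<field name="text_autocomplete" type="text_auto" indexed="true" stored="true" multiValued="true"/>

<!-- copyField duplicates the raw source value into the destination at index time -->
<copyField source="body" dest="text_autocomplete"/>
<copyField source="title" dest="text_autocomplete"/>
<copyField source="description" dest="text_autocomplete"/>
```

Because copyField copies the original text before analysis, the destination field is analyzed with its own type (`text_auto` here), independently of the source field's type.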

Solrconfig.xml

Configuration file for search components and request handlers

Analysis Process

Each document (HTML page) consists of searchable fields. The rules for searching each field are defined using field type definitions.

When a document is added or updated, its fields are analyzed and tokenized, and those tokens are stored in the index.

The analysis process in SOLR consists of the following phases:

Pre-tokenization analysis (CharFilter classes)
Tokenization (Tokenizer class)
Post-tokenization analysis (Filter classes)

Tokenizers and Filters

Text_html
1. solr.HTMLStripCharFilterFactory (strips HTML elements from the analyzed text)
2. solr.StandardTokenizerFactory
3. solr.LowerCaseFilterFactory
4. solr.StopFilterFactory (stopwords.txt)

Text_auto
solr.WhitespaceTokenizerFactory (divides text at whitespace)
solr.WordDelimiterFilterFactory (splits on intra-word delimiters, e.g. “Wi-Fi” → “Wi”, “Fi”)
solr.LowerCaseFilterFactory

Text_general
Steps 2, 3 and 4 of Text_html
solr.SynonymFilterFactory (query only, e.g. I-pod → Ipod)
solr.SnowballPorterFilterFactory (both query and index)
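Put together, these chains might be declared in schema.xml along the following lines. The factory classes are the ones named above. The attribute values, the `synonyms.txt` file name, the stemmer language, and the assumption that text_general reuses the standard tokenizer, lowercase and stop filters of text_html are not taken from the slides.

```xml
<!-- Hypothetical schema.xml fragment; attribute values are assumed -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- pre-tokenization: remove HTML markup -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- tokenization -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- post-tokenization filters -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- stemming at index time; the language is assumed -->
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- synonyms are applied to queries only -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <!-- the same stemmer at query time, so query and index terms match -->
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>

<fieldType name="text_auto" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- splits "Wi-Fi" into "Wi" and "Fi" -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

A single `<analyzer>` element applies to both indexing and querying, while separate `type="index"` and `type="query"` analyzers allow a filter such as the synonym filter to run on one side only.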

Misspelling

For the misspelling feature, an index-based spell checker has been implemented.

Solr uses one of the configured fields of the indexed documents as dictionary input for spelling suggestions.

The field used as dictionary input is “name_autocomplete”.

The distance measure used by the spell checker is the Levenshtein distance.
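A spell checker of this kind might be declared in solrconfig.xml roughly as follows. The field name comes from the slide. The component name, dictionary name, index directory and `buildOnCommit` setting are assumptions. The distance class is written as Lucene spelled it in releases of that period (`LevensteinDistance`); newer Lucene releases name it `LevenshteinDistance`.

```xml
<!-- Hypothetical solrconfig.xml fragment; names and paths are assumed -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <!-- builds its dictionary from the terms of an indexed field -->
    <str name="classname">solr.IndexBasedSpellChecker</str>
    <str name="field">name_autocomplete</str>
    <!-- directory holding the auxiliary spelling index (assumed) -->
    <str name="spellcheckIndexDir">./spellchecker</str>
    <!-- edit distance used to rank candidate corrections -->
    <str name="distanceMeasure">org.apache.lucene.search.spell.LevensteinDistance</str>
    <!-- rebuild the dictionary after every commit (assumed) -->
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>
```

The component only returns suggestions when it is attached to a request handler and the request carries `spellcheck=true`.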

Autocomplete

The aim of Autocomplete is to suggest individual words that begin with the letters typed by the user.

To configure the suggester, we had to prepare an appropriate field on which the hints are built. In our case, we use the field “name_autocomplete”.
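One possible shape for such a suggester is sketched below, assuming a Solr version that ships `solr.SuggestComponent`. The slides do not show the actual configuration, so the component and handler names, the lookup and dictionary implementations, and the result count are all assumptions. Only the field name is taken from the text above.

```xml
<!-- Hypothetical solrconfig.xml fragment; everything except the field name is assumed -->
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">wordSuggester</str>
    <!-- ternary search tree lookup for prefix matching -->
    <str name="lookupImpl">TSTLookupFactory</str>
    <!-- dictionary built from the terms of the indexed field -->
    <str name="dictionaryImpl">HighFrequencyDictionaryFactory</str>
    <str name="field">name_autocomplete</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.dictionary">wordSuggester</str>
    <!-- number of hints returned for each prefix (assumed) -->
    <str name="suggest.count">5</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>
```

Under this sketch, the letters typed so far would be sent as the `suggest.q` parameter of a request to the hypothetical `/suggest` handler.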

Web Application

Search terms → /lighthouse → response → $.parseJSON()

Search images → /lighthouseimg → response → $.parseJSON()

Screenshots

Home page (labels: menu, text box)

Search with Autocomplete (labels: typing, suggestions)

Misspelling (label: misspelling)

Snippet Results (labels: number of results, page hyperlink)

Sorted Results

Images Search
