Introduction to Search Engine-Building with Lucene

Kai Chan
SoCal Code Camp, June 2012

How to Search

• One (common) approach to searching all your documents:

    for each document d {
        if (query is a substring of d's content) {
            add d to the list of results
        }
    }
    sort the results

How to Search

• Problems
  – Slow: reads the whole database for each search
  – Not scalable: if your database grows by 10x, your search slows down by 10x
  – How to show the most relevant documents first?

Inverted Index

• (term -> document list) map

Documents:

    T0 = "it is what it is"
    T1 = "what is it"
    T2 = "it is a banana"

Inverted index:

    "a": {2}
    "banana": {2}
    "is": {0, 1, 2}
    "it": {0, 1, 2}
    "what": {0, 1}

Example taken from Wikipedia (http://en.wikipedia.org/wiki/Inverted_index)
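The map above can be built in a few lines. This is a minimal sketch, not Lucene's implementation: it assumes terms are simply whitespace-separated words, and the class name `InvertedIndexSketch` is made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexSketch {
    // Builds a (term -> sorted set of document IDs) map.
    // Assumption: terms are whitespace-separated words, with no further analysis.
    static Map<String, SortedSet<Integer>> build(List<String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : docs.get(docId).split("\\s+")) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        // Print one line per term, followed by the IDs of the documents containing it.
        build(docs).forEach((term, ids) -> System.out.println("\"" + term + "\": " + ids));
    }
}
```

A search for a single term then becomes one map lookup instead of a scan over every document.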

Inverted Index

• (term -> <document, position> list) map

    T0 = "it is what it is"   (term positions: 0 1 2 3 4)
    T1 = "what is it"         (term positions: 0 1 2)
    T2 = "it is a banana"     (term positions: 0 1 2 3)

Inverted Index

• (term -> <document, position> list) map

    T0 = "it is what it is"
    T1 = "what is it"
    T2 = "it is a banana"

    "a": {(2, 2)}
    "banana": {(2, 3)}
    "is": {(0, 1), (0, 4), (1, 1), (2, 1)}
    "it": {(0, 0), (0, 3), (1, 2), (2, 0)}
    "what": {(0, 2), (1, 0)}
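The positional variant only needs each posting to carry the term's position as well. Again a rough sketch under the same whitespace-splitting assumption; `PositionalIndexSketch` and its `format` helper are hypothetical names, not Lucene classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

class PositionalIndexSketch {
    // Builds a (term -> list of {docId, position}) map.
    // Assumption: terms are whitespace-separated words; positions start at 0.
    static Map<String, List<int[]>> build(List<String> docs) {
        Map<String, List<int[]>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            String[] terms = docs.get(docId).split("\\s+");
            for (int pos = 0; pos < terms.length; pos++) {
                index.computeIfAbsent(terms[pos], k -> new ArrayList<>())
                     .add(new int[] {docId, pos});
            }
        }
        return index;
    }

    // Renders a posting list in the slide's notation, e.g. {(0, 2), (1, 0)}.
    static String format(List<int[]> postings) {
        StringJoiner joiner = new StringJoiner(", ", "{", "}");
        for (int[] p : postings) {
            joiner.add("(" + p[0] + ", " + p[1] + ")");
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        build(docs).forEach((term, postings) ->
            System.out.println("\"" + term + "\": " + format(postings)));
    }
}
```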

Inverted Index

• Speed
  – Term list
    • Very small compared to documents' content
    • Tends to grow at a slower speed than documents (after a certain level)
  – Term lookup: O(1) to O(log of number of terms)
  – Document lists are very small
  – Document + position lists still small

Inverted Index

• Relevance
  – Extra information in the index
    • Stored in an easily accessible way
    • Determines the relevance of each document to the query
  – Enables sorting by (decreasing) relevance

Determining Relevancy

• Two models used in the searching process
  – Boolean model
    • AND, OR, NOT, etc.
    • Either a document matches a query, or not
  – Vector space model
    • How often a query term appears in a document vs. how often the term appears in all documents
    • Scoring and sorting by relevancy possible

Determining Relevancy

all documents -> filtering (Boolean Model) -> some documents (unsorted) -> scoring (Vector Space Model) -> some documents (sorted by score)

Lucene uses both models

Vector Space Model

[Figure: the query, document 1, and document 2 drawn as vectors; axes: f(frequency of term A) and f(frequency of term B)]

Scoring

• Term frequency (TF)
  – How many times does this term (t) appear in this document (d)?
  – Score proportional to TF
• Document frequency (DF)
  – How many documents have this term (t)?
  – Score proportional to the inverse of DF (IDF)
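To make TF and IDF concrete, here is a small sketch. It assumes the default formulas documented for Lucene 3.x's DefaultSimilarity, tf = sqrt(freq) and idf = 1 + ln(numDocs / (docFreq + 1)); the class `TfIdfSketch` is hypothetical, and a custom Similarity may use different formulas.

```java
class TfIdfSketch {
    // Assumed default: term-frequency factor grows with the square root of the raw count.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Assumed default: rarer terms (lower docFreq) get a higher factor.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        int numDocs = 3;
        // "banana" appears in 1 of 3 documents, "is" appears in all 3.
        System.out.println("idf(banana) = " + idf(1, numDocs));
        System.out.println("idf(is)     = " + idf(3, numDocs));
        // A term appearing 4 times in a document vs. once.
        System.out.println("tf(4) = " + tf(4) + ", tf(1) = " + tf(1));
    }
}
```

With these formulas a rare term such as "banana" contributes more to the score than a term like "is" that appears in every document.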

Scoring

• Coordination factor (coord)
  – Documents that contain all or most query terms get higher scores
• Normalizing factor (norm)
  – Adjusts for field length and query complexity

Scoring

• Boost
  – "Manual override": ask Lucene to give a higher score to some particular thing
  – Index-time
    • Document
    • Field (of a particular document)
  – Search-time
    • Query

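Scoring

$$\mathrm{score}(q, d) = \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \left( \mathrm{tf}(t \text{ in } d) \cdot \mathrm{idf}(t)^2 \cdot \mathrm{boost}(t) \cdot \mathrm{norm}(t, d) \right)$$

• coord(q, d): coordination factor
• queryNorm(q): query normalizing factor
• tf(t in d): term frequency
• idf(t): inverse document frequency
• boost(t): term boost
• norm(t, d): document boost, field boost, length normalizing factor

http://lucene.apache.org/core/3_6_0/scoring.html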

Work Flow

• Indexing
  – Index: storage of inverted index + documents
  – Add fields to a document
  – Add the document to the index
  – Repeat for every document
• Searching
  – Generate a query
  – Search with this query
  – Get back a sorted document list (top N docs)

Adding Field to Document

• Store?
• Index?
  – Analyzed (split text into multiple terms)
  – Not analyzed (treat the whole text as ONE term)
  – Not indexed (this field will not be searchable)
  – Store norms?

Analyzed vs. Not Analyzed

Text: “the quick brown fox”

Analyzed: 4 terms
1. the
2. quick
3. brown
4. fox

Not analyzed: 1 term
1. the quick brown fox

Index-time Analysis

• Analyzer
  – Determines which TokenStream classes to use
• TokenStream
  – Does the actual hard work
  – Tokenizer: text to tokens
  – Token filter: tokens to tokens

Text: San Francisco, the Bay Area’s city-county http://www.ci.sf.ca.us controller@sfgov.org

WhitespaceAnalyzer: [San] [Francisco,] [the] [Bay] [Area’s] [city-county] [http://www.ci.sf.ca.us/] [controller@sfgov.org]

StopAnalyzer: [san] [francisco] [bay] [area] [s] [city] [county] [http] [www] [ci] [sf] [ca] [us] [controller] [sfgov] [org]

StandardAnalyzer: [san] [francisco] [bay] [area's] [city] [county] [http] [www.ci.sf.ca.us] [controller] [sfgov.org]
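The tokenizer-then-filters idea can be imitated in plain Java. This is only an illustration of the chain (text to tokens, then tokens to tokens), not how Lucene's analyzers are implemented; the class `AnalysisChainSketch`, its method names, and its tiny stop-word list are all made up for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

class AnalysisChainSketch {
    // Hypothetical stop-word list, far shorter than any real analyzer's.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "the"));

    // "Tokenizer": text to tokens, splitting on whitespace.
    static List<String> tokenize(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
                     .filter(s -> !s.isEmpty())
                     .collect(Collectors.toList());
    }

    // "Token filter": tokens to tokens, lowercasing each one.
    static List<String> lowerCase(List<String> tokens) {
        return tokens.stream()
                     .map(t -> t.toLowerCase(Locale.ROOT))
                     .collect(Collectors.toList());
    }

    // "Token filter": tokens to tokens, dropping stop words.
    static List<String> removeStopWords(List<String> tokens) {
        return tokens.stream()
                     .filter(t -> !STOP_WORDS.contains(t))
                     .collect(Collectors.toList());
    }

    // "Analyzer": decides which tokenizer and filters to chain together.
    static List<String> analyze(String text) {
        return removeStopWords(lowerCase(tokenize(text)));
    }

    public static void main(String[] args) {
        System.out.println(analyze("The quick brown fox"));
    }
}
```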

Notable TokenStream Classes

• ASCIIFoldingFilter
  – Converts accented and other non-ASCII characters to their ASCII equivalents (e.g. é -> e)
• PorterStemFilter
  – Reduces tokens into their stems
• SynonymTokenFilter
  – Converts words to their synonyms
• ShingleFilter
  – Creates shingles (n-grams)

Tokens

• Information about a token
  – Field
  – Text
  – Start offset, end offset
  – Position increment

Attributes

• Past versions of Lucene: Token object
• Recent versions of Lucene: attributes
  – Efficiency, flexibility
  – Ask for attributes you want
  – Receive attribute objects
  – Use these objects for information about tokens

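    // imports needed by this snippet (Lucene 3.x packages)
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

    // create token stream
    TokenStream tokenStream = analyzer.reusableTokenStream(fieldName, reader);
    tokenStream.reset();

    // obtain each attribute you want to know
    CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
    OffsetAttribute offset = tokenStream.addAttribute(OffsetAttribute.class);
    PositionIncrementAttribute posInc = tokenStream.addAttribute(PositionIncrementAttribute.class);

    // go to the next token
    while (tokenStream.incrementToken()) {
        // use information about the current token
        doSomething(term.toString(), offset.startOffset(), offset.endOffset(),
                    posInc.getPositionIncrement());
    }

    // close token stream
    tokenStream.end();
    tokenStream.close();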

Query-time Analysis

• Text in a query is analyzed like fields
• Use the same analyzer that analyzed the particular field

+field1:“quick brown fox” +(field2:“lazy dog” field2:“cozy cat”)

[Figure: the quoted text in the query above (brown fox, lazy dog, cozy cat) is what goes through the analyzer]

Query Formation

• Query parsing
  – A query parser in core code
  – Additional query parsers in contributed code
• Or build query from the Lucene query classes

Term Query

• Matches documents with a particular term
  – Field
  – Text

Term Range Query

• Matches documents with any of the terms in a particular range
  – Field
  – Lowest term text
  – Highest term text
  – Include lowest term text?
  – Include highest term text?

Prefix Query

• Matches documents with any of the terms with a particular prefix
  – Field
  – Prefix

Wildcard/Regex Query

• Matches documents with any of the terms that match a particular pattern
  – Field
  – Pattern
    • Wildcard: * for 0+ characters, ? for exactly 1 character
    • Regular expression
    • Pattern matching on individual terms only
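One way to picture wildcard matching on a single term is to translate the wildcard into a regular expression. The sketch below is not Lucene's implementation; it assumes * matches any character sequence (including the empty one) and ? matches exactly one character, as the WildcardQuery documentation describes, and the class `WildcardSketch` is hypothetical.

```java
import java.util.regex.Pattern;

class WildcardSketch {
    // Translates a wildcard pattern into an equivalent regular expression.
    // Assumption: '*' = any character sequence, '?' = exactly one character;
    // every other character is matched literally.
    static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    // The whole term must match, mirroring "pattern matching on individual terms only".
    static boolean matches(String wildcard, String term) {
        return toPattern(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("te?t", "test"));
        System.out.println(matches("te*", "terminal"));
    }
}
```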

Fuzzy Query

• Matches documents with any of the terms that are “similar” to a particular term
  – Levenshtein distance (“edit distance”): number of character insertions, deletions or substitutions needed to transform one string into another
    • e.g. kitten -> sitten -> sittin -> sitting (3 edits)
  – Field
  – Text
  – Minimum similarity score
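The edit distance itself can be computed with the textbook dynamic-programming algorithm. This sketch shows only the distance; it does not show how Lucene turns the distance into a similarity score, and the class `EditDistanceSketch` is a made-up name.

```java
class EditDistanceSketch {
    // Levenshtein distance: minimum number of single-character insertions,
    // deletions or substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            curr[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                    Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                    prev[j - 1] + substitution);            // substitution or match
            }
            prev = curr;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```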

Phrase Query

• Matches documents with all the given words present and being “near” each other
  – Field
  – Terms
  – Slop
    • Number of “moves of words” permitted
    • Slop = 0 means exact phrase match required

Boolean Query

• Conceptually similar to boolean operators (“AND”, “OR”, “NOT”), but not identical
• Why Not AND, OR, And NOT?
  – http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/
  – In short, boolean operators do not handle > 2 clauses well

Boolean Query

• Three types of clauses
  – Must
  – Should
  – Must not
• For a boolean query to match a document
  – All “must” clauses must match
  – All “must not” clauses must not match
  – At least one “must” or “should” clause must match
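The three matching rules can be written down directly. In this rough sketch each clause is simplified to a single term and a document to its set of terms; the class `BooleanMatchSketch` is hypothetical and says nothing about how Lucene evaluates or scores boolean queries internally.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class BooleanMatchSketch {
    // Assumption: a clause "matches" when its term is among the document's terms.
    static boolean matches(Set<String> docTerms, List<String> must,
                           List<String> should, List<String> mustNot) {
        for (String term : must) {
            if (!docTerms.contains(term)) return false; // all "must" clauses must match
        }
        for (String term : mustNot) {
            if (docTerms.contains(term)) return false;  // no "must not" clause may match
        }
        if (!must.isEmpty()) return true;               // a "must" clause has matched
        for (String term : should) {
            if (docTerms.contains(term)) return true;   // otherwise a "should" clause must match
        }
        return false;                                   // e.g. only "must not" clauses: no match
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("it", "is", "a", "banana"));
        List<String> none = Collections.emptyList();
        System.out.println(matches(doc, Arrays.asList("banana"), none, none));
        System.out.println(matches(doc, none, none, Arrays.asList("apple")));
    }
}
```

Note the last rule: a query made up of only “must not” clauses matches nothing, because no “must” or “should” clause has matched.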

Span Query

• Similar to other queries, but matches spans
• Span
  – Particular place/part of a particular document
  – <document ID, start position, end position> tuple

    T0 = "it is what it is"   (term positions: 0 1 2 3 4)
    T1 = "what is it"         (term positions: 0 1 2)
    T2 = "it is a banana"     (term positions: 0 1 2 3)

“it is”: <doc ID, start pos., end pos.>
    <0, 0, 2>
    <0, 3, 5>
    <2, 0, 2>
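The span tuples above can be reproduced by scanning each document for the phrase. This is a sketch under the assumption that terms are whitespace-separated and that the end position is exclusive (one past the last matching term), as the tuples suggest; `SpanSketch` is a hypothetical class, not one of Lucene's span queries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SpanSketch {
    // Returns {docId, startPosition, endPosition} for each exact occurrence of the phrase.
    // Assumption: endPosition is exclusive.
    static List<int[]> findSpans(List<String> docs, String phrase) {
        String[] target = phrase.split("\\s+");
        List<int[]> spans = new ArrayList<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            String[] terms = docs.get(docId).split("\\s+");
            for (int start = 0; start + target.length <= terms.length; start++) {
                String[] window = Arrays.copyOfRange(terms, start, start + target.length);
                if (Arrays.equals(window, target)) {
                    spans.add(new int[] {docId, start, start + target.length});
                }
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        for (int[] span : findSpans(docs, "it is")) {
            System.out.println("<" + span[0] + ", " + span[1] + ", " + span[2] + ">");
        }
    }
}
```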

Span Query

• SpanTermQuery
  – Same as TermQuery, except you can build other span queries with it
• SpanOrQuery
  – Matches spans that are matched by any of some span queries
• SpanNotQuery
  – Matches spans that are matched by one span query but not the other span query

[Figure: spans in text containing “apple” and “orange”, as matched by spanTerm(apple), spanTerm(orange), spanOr([apple, orange]) and spanNot(apple, orange)]

Span Query

• SpanNearQuery
  – Matches spans that are within a certain “slop” of each other
  – Slop: max number of positions between spans
  – Can specify whether order matters

[Figure: spans in the text “the quick brown fox”]

1. spanNear([brown, fox, the, quick], slop = 4, inOrder = false)

4. spanNear([brown, fox, the, quick], slop = 3, inOrder = true)

5. spanNear([the, quick, brown, fox], slop = 3, inOrder = true)

Filtering

• A Filter narrows down the search results
  – Creates a set of document IDs
  – Decides what documents get processed further
  – Does not affect scoring, i.e. does not score/rank documents that pass the filter
  – Can be cached easily
  – Useful for access control, presets, etc.

Notable Filter classes

• TermsFilter
  – Allows documents with any of the given terms
• TermRangeFilter
  – Filter version of TermRangeQuery
• PrefixFilter
  – Filter version of PrefixQuery
• QueryWrapperFilter
  – “Adapts” a query into a filter
• CachingWrapperFilter
  – Caches the result of the wrapped filter

Sorting

• Score (default)
• Index order
• Field
  – Requires the field be indexed & not analyzed
  – Specify type (string, int, etc.)
  – Normal or reverse order
  – Single or multiple fields

Interfacing Lucene with “Outside”

• Embedding directly
• Language bridge
  – E.g. PHP/Java Bridge
• Web service
  – E.g. Jetty + your own request handler
• Solr
  – Lucene + Jetty + lots of useful functionality

• Lucene in Action, 2nd Edition
  – Written by 3 committers and PMC members
  – http://www.manning.com/hatcher3/
• Introduction to Information Retrieval
  – Not specific to Lucene, but about IR concepts
  – Free e-book
  – http://nlp.stanford.edu/IR-book/

Web Resources

• Official Website
  – http://lucene.apache.org/
• StackOverflow
  – http://stackoverflow.com/questions/tagged/lucene
• Mailing lists
  – http://lucene.apache.org/core/discussion.html
• Blogs
  – http://www.lucidimagination.com/blog/
  – http://blog.mikemccandless.com/
  – http://lucene.grantingersoll.com/

Getting Started

• Getting started
  – Download lucene-3.6.0.zip (or .tgz)
  – Add lucene-core-3.6.0.jar to your classpath
  – Consider using an IDE (e.g. Eclipse)
  – Luke (Lucene Index Toolbox): http://code.google.com/p/luke/
