Introduction to Search Engine-Building with Lucene

Kai Chan
SoCal Code Camp, June 2012

How to Search

• One (common) approach to searching all your documents:

  for each document d {
    if (query is a substring of d’s content) {
      add d to the list of results
    }
  }
  sort the results
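The loop above can be written out as a minimal Java sketch. The class name NaiveSearch and the sample documents are illustrative assumptions, and results are simply returned in document order rather than sorted by relevance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NaiveSearch {
    /** Returns the indices of all documents whose content contains the query. */
    static List<Integer> search(List<String> docs, String query) {
        List<Integer> results = new ArrayList<>();
        for (int d = 0; d < docs.size(); d++) {   // every search reads every document
            if (docs.get(d).contains(query)) {
                results.add(d);
            }
        }
        return results;                            // in document order, not ranked
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        System.out.println(search(docs, "what is"));
    }
}
```

The cost of each call grows with the total size of all documents, which is the scaling problem this approach runs into.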

How to Search

• Problems
  – Slow: Reads the whole database for each search
  – Not scalable: If your database grows by 10x, your search slows down by 10x
  – How to show the most relevant documents first?

Inverted Index

• (term -> document list) map

Documents:
  T0 = "it is what it is"
  T1 = "what is it"
  T2 = "it is a banana"

Inverted index:
  "a": {2}
  "banana": {2}
  "is": {0, 1, 2}
  "it": {0, 1, 2}
  "what": {0, 1}

Example taken from Wikipedia (http://en.wikipedia.org/wiki/Inverted_index)
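A (term -> document list) map of this kind can be built in a few lines. The following is a minimal sketch, assuming terms are separated by whitespace; the class name InvertedIndex is illustrative and this is not how Lucene stores its index internally.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

final class InvertedIndex {
    /** Builds a (term -> sorted set of document IDs) map; terms are split on whitespace. */
    static Map<String, SortedSet<Integer>> build(List<String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : docs.get(docId).split("\\s+")) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        // prints {a=[2], banana=[2], is=[0, 1, 2], it=[0, 1, 2], what=[0, 1]}
        System.out.println(build(docs));
    }
}
```

A search for a single term is then one map lookup instead of a scan over every document.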

Inverted Index

• (term -> <document, position> list) map


T0 = "it is what it is"
      0  1  2    3  4

T1 = "what is it"
      0    1  2

T2 = "it is a banana"
      0  1  2 3

Inverted Index

• (term -> <document, position> list) map

Documents:
  T0 = "it is what it is"
  T1 = "what is it"
  T2 = "it is a banana"

Inverted index:
  "a": {(2, 2)}
  "banana": {(2, 3)}
  "is": {(0, 1), (0, 4), (1, 1), (2, 1)}
  "it": {(0, 0), (0, 3), (1, 2), (2, 0)}
  "what": {(0, 2), (1, 0)}
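The positional variant only needs the position recorded next to the document ID. A minimal sketch, again assuming whitespace-separated terms; the class name PositionalIndex and the format helper are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PositionalIndex {
    /** Builds a (term -> list of {docId, position}) map; terms are split on whitespace. */
    static Map<String, List<int[]>> build(List<String> docs) {
        Map<String, List<int[]>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            String[] terms = docs.get(docId).split("\\s+");
            for (int pos = 0; pos < terms.length; pos++) {
                index.computeIfAbsent(terms[pos], t -> new ArrayList<>())
                     .add(new int[] {docId, pos});
            }
        }
        return index;
    }

    /** Formats postings as "(doc, pos), (doc, pos), ...". */
    static String format(List<int[]> postings) {
        StringBuilder sb = new StringBuilder();
        for (int[] p : postings) {
            if (sb.length() > 0) sb.append(", ");
            sb.append('(').append(p[0]).append(", ").append(p[1]).append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        // prints (0, 1), (0, 4), (1, 1), (2, 1)
        System.out.println(format(build(docs).get("is")));
    }
}
```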

Inverted Index

• Speed
  – Term list
    • Very small compared to documents’ content
    • Tends to grow at a slower speed than documents (after a certain level)
  – Term lookup: O(1) to O(log of number of terms)
  – Document lists are very small
  – Document + position lists still small

Inverted Index

• Relevance
  – Extra information in the index
    • Stored in an easily accessible way
    • Determine relevance of each document to the query
  – Enables sorting by (decreasing) relevance

Determining Relevancy

• Two models used in the searching process
  – Boolean model
    • AND, OR, NOT, etc.
    • Either a document matches a query, or not
  – Vector space model
    • How often a query term appears in a document vs. how often the term appears in all documents
    • Scoring and sorting by relevancy possible

Determining Relevancy

all documents
  -> filtering (Boolean Model) ->
some documents (unsorted)
  -> scoring (Vector Space Model) ->
some documents (sorted by score)

Lucene uses both models

Vector Space Model

[Figure: query, document 1 and document 2 plotted against two axes, f(frequency of term A) and f(frequency of term B)]

Scoring

• Term frequency (TF)
  – How many times does this term (t) appear in this document (d)?
  – Score proportional to TF
• Document frequency (DF)
  – How many documents have this term (t)?
  – Score proportional to the inverse of DF (IDF)

Scoring

• Coordination factor (coord)
  – Documents that contain all or most query terms get higher scores
• Normalizing factor (norm)
  – Adjusts for field length and query complexity

Scoring

• Boost
  – “Manual override”: ask Lucene to give a higher score to some particular thing
  – Index-time
    • Document
    • Field (of a particular document)
  – Search-time
    • Query

Scoring

score(q, d) = coord(q, d) · queryNorm(q) · Σ_{t in q} ( tf(t in d) · idf(t)^2 · boost(t) · norm(t, d) )

• coord(q, d): coordination factor
• queryNorm(q): query normalizing factor
• tf(t in d): term frequency
• idf(t): inverse document frequency
• boost(t): term boost
• norm(t, d): document boost, field boost, length normalizing factor

http://lucene.apache.org/core/3_6_0/scoring.html
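To make the tf and idf terms concrete, here is a simplified sketch of the sum in the scoring formula. It assumes coord, queryNorm, boost and norm are all 1, and uses tf = sqrt(freq) and idf = 1 + ln(numDocs / (docFreq + 1)), the defaults documented for Lucene 3.x's DefaultSimilarity. The class name TfIdfSketch and the count helper are illustrative, and documents are assumed to be whitespace-separated.

```java
import java.util.Arrays;
import java.util.List;

final class TfIdfSketch {
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    /** Counts occurrences of term in a whitespace-separated document. */
    static int count(String term, String doc) {
        int n = 0;
        for (String t : doc.split("\\s+")) {
            if (t.equals(term)) n++;
        }
        return n;
    }

    /** Sum over query terms of tf * idf^2; all other factors are assumed to be 1. */
    static double score(List<String> queryTerms, String doc, List<String> allDocs) {
        double sum = 0.0;
        for (String term : queryTerms) {
            int freq = count(term, doc);
            if (freq == 0) continue;               // absent term contributes nothing
            int docFreq = 0;
            for (String other : allDocs) {
                if (count(term, other) > 0) docFreq++;
            }
            double w = idf(docFreq, allDocs.size());
            sum += tf(freq) * w * w;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        List<String> query = Arrays.asList("it", "banana");
        for (String doc : docs) {
            System.out.printf("%.4f  %s%n", score(query, doc, docs), doc);
        }
    }
}
```

In this sketch the rare term "banana" outweighs the common term "it", so the third document ranks first even though "it" appears twice in the first one.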

Work Flow

• Indexing
  – Index: storage of inverted index + documents
  – Add fields to a document
  – Add the document to the index
  – Repeat for every document
• Searching
  – Generate a query
  – Search with this query
  – Get back a sorted document list (top N docs)

Adding Field to Document

• Store?
• Index?
  – Analyzed (split text into multiple terms)
  – Not analyzed (treat the whole text as ONE term)
  – Not indexed (this field will not be searchable)
  – Store norms?

Analyzed vs. Not Analyzed

Text: “the quick brown fox”

Analyzed: 4 terms
  1. the
  2. quick
  3. brown
  4. fox

Not analyzed: 1 term
  1. the quick brown fox

Index-time Analysis

• Analyzer
  – Determines which TokenStream classes to use
• TokenStream
  – Does the actual hard work
  – Tokenizer: text to tokens
  – Token filter: tokens to tokens

Text:
  San Francisco, the Bay Area’s city-county http://www.ci.sf.ca.us controller@sfgov.org

WhitespaceAnalyzer:
  [San] [Francisco,] [the] [Bay] [Area’s] [city-county] [http://www.ci.sf.ca.us] [controller@sfgov.org]

StopAnalyzer:
  [san] [francisco] [bay] [area] [s] [city] [county] [http] [www] [ci] [sf] [ca] [us] [controller] [sfgov] [org]

StandardAnalyzer:
  [san] [francisco] [bay] [area's] [city] [county] [http] [www.ci.sf.ca.us] [controller] [sfgov.org]
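The tokenizer-then-filter chain from the Index-time Analysis slide can be imitated in plain Java to show the idea. This is a sketch only: it does not use Lucene's Analyzer or TokenStream classes, and the class name AnalysisSketch, its method names and its tiny stop-word list are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class AnalysisSketch {
    // Illustrative stop-word list, not Lucene's.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "the"));

    /** Tokenizer: text to tokens (split on whitespace). */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String s : text.trim().split("\\s+")) {
            if (!s.isEmpty()) tokens.add(s);
        }
        return tokens;
    }

    /** Token filter: tokens to tokens (lowercases each token). */
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) out.add(t.toLowerCase(Locale.ROOT));
        return out;
    }

    /** Token filter: tokens to tokens (drops stop words). */
    static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) out.add(t);
        }
        return out;
    }

    /** The "analyzer" decides which tokenizer and filters are chained. */
    static List<String> analyze(String text) {
        return removeStopWords(lowerCase(tokenize(text)));
    }

    public static void main(String[] args) {
        System.out.println(analyze("The quick brown fox"));   // prints [quick, brown, fox]
    }
}
```

The order of the filters matters here: lowercasing runs first so that "The" matches the lowercase stop word "the".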

Notable TokenStream Classes

• ASCIIFoldingFilter
  – Converts alphabetic characters into basic forms
• PorterStemFilter
  – Reduces tokens into their stems
• SynonymTokenFilter
  – Converts words to their synonyms
• ShingleFilter
  – Creates shingles (n-grams)

Tokens

• Information about a token
  – Field
  – Text
  – Start offset, end offset
  – Position increment

Attributes

• Past versions of Lucene: Token object
• Recent versions of Lucene: attributes
  – Efficiency, flexibility
  – Ask for attributes you want
  – Receive attribute objects
  – Use these objects for information about tokens

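import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// analyzer, fieldName, reader and doSomething(...) are defined elsewhere

// create token stream
TokenStream tokenStream = analyzer.reusableTokenStream(fieldName, reader);
tokenStream.reset();

// obtain each attribute you want to know
CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
OffsetAttribute offset = tokenStream.addAttribute(OffsetAttribute.class);
PositionIncrementAttribute posInc = tokenStream.addAttribute(PositionIncrementAttribute.class);

// go to the next token
while (tokenStream.incrementToken()) {
    // use information about the current token
    doSomething(term.toString(), offset.startOffset(), offset.endOffset(),
                posInc.getPositionIncrement());
}

// close token stream
tokenStream.end();
tokenStream.close();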

Query-time Analysis

• Text in a query is analyzed like fields
• Use the same analyzer that analyzed the particular field

Query:
  +field1:“quick brown fox” +(field2:“lazy dog” field2:“cozy cat”)

Text that gets analyzed:
  “quick brown fox” (field1), “lazy dog” and “cozy cat” (field2)

Query Formation

• Query parsing
  – A query parser in core code
  – Additional query parsers in contributed code
• Or build query from the Lucene query classes

Term Query

• Matches documents with a particular term
  – Field
  – Text

Term Range Query

• Matches documents with any of the terms in a particular range
  – Field
  – Lowest term text
  – Highest term text
  – Include lowest term text?
  – Include highest term text?

Prefix Query

• Matches documents with any of the terms with a particular prefix
  – Field
  – Prefix
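Term range and prefix lookups both amount to picking a contiguous slice out of a sorted term list. The sketch below shows this with a java.util.TreeSet standing in for the term dictionary; it is not Lucene's implementation, and the class name TermDictionarySketch and its methods are illustrative. The prefix lookup assumes no term contains the character '\uffff'.

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

final class TermDictionarySketch {
    /** Terms between lowest and highest, with each bound optionally included. */
    static NavigableSet<String> range(NavigableSet<String> terms,
                                      String lowest, boolean includeLowest,
                                      String highest, boolean includeHighest) {
        return terms.subSet(lowest, includeLowest, highest, includeHighest);
    }

    /** Terms starting with prefix, expressed as the range [prefix, prefix + '\uffff'). */
    static NavigableSet<String> prefix(NavigableSet<String> terms, String prefix) {
        return terms.subSet(prefix, true, prefix + Character.MAX_VALUE, false);
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>(
                Arrays.asList("apple", "apply", "banana", "band", "bandana", "cat"));
        System.out.println(prefix(terms, "ban"));                       // [banana, band, bandana]
        System.out.println(range(terms, "apply", true, "band", false)); // [apply, banana]
    }
}
```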

Wildcard/Regex Query

• Matches documents with any of the terms that match a particular pattern
  – Field
  – Pattern
    • Wildcard: * for 0+ characters, ? for exactly 1 character
    • Regular expression
    • Pattern matching on individual terms only
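Wildcard matching on individual terms can be sketched by translating the pattern into a regular expression. The sketch assumes the semantics given in Lucene's WildcardQuery documentation: * matches any character sequence, including the empty one, and ? matches any single character. The class name WildcardSketch and its methods are illustrative, and this is not how Lucene evaluates such queries internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class WildcardSketch {
    /** Translates a wildcard pattern into a regex; other characters are taken literally. */
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    /** Returns the terms that match the pattern as a whole (no partial matches). */
    static List<String> matchingTerms(List<String> terms, String wildcard) {
        Pattern p = toRegex(wildcard);
        List<String> out = new ArrayList<>();
        for (String t : terms) {
            if (p.matcher(t).matches()) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("test", "text", "tent", "tests", "tet");
        System.out.println(matchingTerms(terms, "te?t"));   // [test, text, tent]
        System.out.println(matchingTerms(terms, "te*"));    // all five terms
    }
}
```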

Fuzzy Query

• Matches documents with any of the terms that are “similar” to a particular term
  – Levenshtein distance (“edit distance”): number of character insertions, deletions or substitutions needed to transform one string into another
    • e.g. kitten -> sitten -> sittin -> sitting (3 edits)
  – Field
  – Text
  – Minimum similarity score
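The edit distance itself can be computed with the textbook dynamic-programming recurrence. The sketch below is that standard algorithm with two rolling rows, not Lucene's FuzzyQuery implementation; the class name Levenshtein is illustrative.

```java
final class Levenshtein {
    /** Number of single-character insertions, deletions or substitutions turning a into b. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;                                  // empty prefix of a -> j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;                                  // i deletions to reach empty b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                            prev[j] + 1),      // deletion
                                   prev[j - 1] + cost);        // substitution or match
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting"));   // prints 3
    }
}
```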

Phrase Query

• Matches documents with all the given words present and being “near” each other
  – Field
  – Terms
  – Slop
    • Number of “moves of words” permitted
    • Slop = 0 means exact phrase match required

Boolean Query

• Conceptually similar to boolean operators (“AND”, “OR”, “NOT”), but not identical
• Why Not AND, OR, And NOT?
  – http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/
  – In short, boolean operators do not handle > 2 clauses well

Boolean Query

• Three types of clauses
  – Must
  – Should
  – Must not
• For a boolean query to match a document
  – All “must” clauses must match
  – All “must not” clauses must not match
  – At least one “must” or “should” clause must match
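The three matching rules can be expressed directly in code. In this sketch each clause is simplified to a single term and a document to its set of terms; the class name BooleanMatchSketch and the method signature are assumptions for illustration, not Lucene's BooleanQuery API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class BooleanMatchSketch {
    static boolean matches(Set<String> docTerms,
                           List<String> must, List<String> should, List<String> mustNot) {
        for (String t : mustNot) {
            if (docTerms.contains(t)) return false;    // a "must not" clause matched
        }
        for (String t : must) {
            if (!docTerms.contains(t)) return false;   // a "must" clause failed
        }
        if (!must.isEmpty()) return true;              // at least one "must" clause matched
        for (String t : should) {
            if (docTerms.contains(t)) return true;     // at least one "should" clause matched
        }
        return false;                                  // nothing positive matched
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("it", "is", "a", "banana"));
        List<String> none = Arrays.asList();
        System.out.println(matches(doc, Arrays.asList("it"), none, Arrays.asList("what")));  // true
        System.out.println(matches(doc, none, none, Arrays.asList("what")));                 // false
    }
}
```

The second call shows the consequence of the last rule: a query made only of "must not" clauses matches nothing.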

Span Query

• Similar to other queries, but matches spans
• Span
  – Particular place/part of a particular document
  – <document ID, start position, end position> tuple

T0 = "it is what it is"
      0  1  2    3  4

T1 = "what is it"
      0    1  2

T2 = "it is a banana"
      0  1  2 3

“it is”: <doc ID, start pos., end pos.>
  <0, 0, 2>
  <0, 3, 5>
  <2, 0, 2>
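The span tuples for “it is” can be reproduced with a small scan over term positions. This is a sketch for exact, in-order phrases only, assuming whitespace-separated terms and an exclusive end position as in the tuples above; the class name SpanSketch is illustrative and the code does not use Lucene's span query classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SpanSketch {
    /** Returns {docId, start, end} for every place the phrase occurs; end is exclusive. */
    static List<int[]> phraseSpans(List<String> docs, String... phrase) {
        List<int[]> spans = new ArrayList<>();
        if (phrase.length == 0) return spans;
        for (int docId = 0; docId < docs.size(); docId++) {
            String[] terms = docs.get(docId).split("\\s+");
            for (int start = 0; start + phrase.length <= terms.length; start++) {
                String[] window = Arrays.copyOfRange(terms, start, start + phrase.length);
                if (Arrays.equals(window, phrase)) {
                    spans.add(new int[] {docId, start, start + phrase.length});
                }
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        // prints [[0, 0, 2], [0, 3, 5], [2, 0, 2]]
        System.out.println(Arrays.deepToString(phraseSpans(docs, "it", "is").toArray()));
    }
}
```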

Span Query

• SpanTermQuery
  – Same as TermQuery, except you can build other span queries with it
• SpanOrQuery
  – Matches spans that are matched by any of some span queries
• SpanNotQuery
  – Matches spans that are matched by one span query but not the other span query

[Figure: spans over a text containing “apple” and “orange”, as matched by spanTerm(apple), spanTerm(orange), spanOr([apple, orange]) and spanNot(apple, orange)]

Span Query

• SpanNearQuery
  – Matches spans that are within a certain “slop” of each other
  – Slop: max number of positions between spans
  – Can specify whether order matters

Text: the quick brown fox

1. spanNear([brown, fox, the, quick], slop = 4, inOrder = false)
4. spanNear([brown, fox, the, quick], slop = 3, inOrder = true)
5. spanNear([the, quick, brown, fox], slop = 3, inOrder = true)

[Figure: match results for each query, shown as ✔ / ✖ marks]

Filtering

• A Filter narrows down the search result
  – Creates a set of document IDs
  – Decides what documents get processed further
  – Does not affect scoring, i.e. does not score/rank documents that pass the filter
  – Can be cached easily
  – Useful for access control, presets, etc.

Notable Filter classes

• TermsFilter
  – Allows documents with any of the given terms
• TermRangeFilter
  – Filter version of TermRangeQuery
• PrefixFilter
  – Filter version of PrefixQuery
• QueryWrapperFilter
  – “Adapts” a query into a filter
• CachingWrapperFilter
  – Caches the result of the wrapped filter

Sorting

• Score (default)
• Index order
• Field
  – Requires the field be indexed & not analyzed
  – Specify type (string, int, etc.)
  – Normal or reverse order
  – Single or multiple fields

Interfacing Lucene with “Outside”

• Embedding directly
• Language bridge
  – E.g. PHP/Java Bridge
• Web service
  – E.g. Jetty + your own request handler
• Solr
  – Lucene + Jetty + lots of useful functionality

Books

• Lucene in Action, 2nd Edition
  – Written by 3 committers and PMC members
  – http://www.manning.com/hatcher3/
• Introduction to Information Retrieval
  – Not specific to Lucene, but about IR concepts
  – Free e-book
  – http://nlp.stanford.edu/IR-book/

Web Resources

• Official Website
  – http://lucene.apache.org/
• StackOverflow
  – http://stackoverflow.com/questions/tagged/lucene
• Mailing lists
  – http://lucene.apache.org/core/discussion.html
• Blogs
  – http://www.lucidimagination.com/blog/
  – http://blog.mikemccandless.com/
  – http://lucene.grantingersoll.com/

Getting Started

• Getting started
  – Download lucene-3.6.0.zip (or .tgz)
  – Add lucene-core-3.6.0.jar to your classpath
  – Consider using an IDE (e.g. Eclipse)
  – Luke (Lucene Index Toolbox): http://code.google.com/p/luke/


