
These are the slides for the session I presented at SoCal Code Camp San Diego on June 24, 2012. http://www.socalcodecamp.com/session.aspx?sid=f9e83f56-3c56-4aa1-9cff-154c6537ccbe

Introduction to Search Engine-Building with Lucene

Kai Chan
SoCal Code Camp, June 2012

How to Search

• One (common) approach to searching all your documents:

for each document d {
    if (query is a substring of d’s content) {
        add d to the list of results
    }
}
sort the results

How to Search

• Problems
  – Slow: Reads the whole database for each search
  – Not scalable: If your database grows by 10x, your search slows down by 10x
  – How to show the most relevant documents first?

Inverted Index

• (term -> document list) map

Example taken from Wikipedia (http://en.wikipedia.org/wiki/Inverted_index)

Documents:
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"

Inverted index:
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
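The (term -> document list) map can be sketched in plain Java. This is only an illustration of the data structure, not Lucene's implementation; the class name InvertedIndexSketch and the whitespace-based splitting are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class; not part of Lucene.
final class InvertedIndexSketch {

    /** Maps each term to the sorted set of IDs of the documents that contain it. */
    static Map<String, SortedSet<Integer>> build(List<String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // Assumption: terms are separated by whitespace only.
            for (String term : docs.get(docId).toLowerCase().split("\\s+")) {
                if (term.isEmpty()) {
                    continue;
                }
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        build(docs).forEach((term, ids) -> System.out.println(term + " -> " + ids));
    }
}
```

Looking up a term is then a single map access, instead of a scan over every document.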

Inverted Index

• (term -> <document, position> list) map

Example taken from Wikipedia (http://en.wikipedia.org/wiki/Inverted_index)

T0 = "it is what it is"
      0  1  2    3  4

T1 = "what is it"
      0    1  2

T2 = "it is a banana"
      0  1  2 3

Inverted Index

• (term -> <document, position> list) map

Example taken from Wikipedia (http://en.wikipedia.org/wiki/Inverted_index)

T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"

"a": {(2, 2)}
"banana": {(2, 3)}
"is": {(0, 1), (0, 4), (1, 1), (2, 1)}
"it": {(0, 0), (0, 3), (1, 2), (2, 0)}
"what": {(0, 2), (1, 0)}
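A minimal plain-Java sketch of the positional variant, where every posting is a <document, position> pair. The class name PositionalIndexSketch and the use of int arrays for postings are assumptions for illustration; Lucene stores this information quite differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Lucene.
final class PositionalIndexSketch {

    /** Maps each term to its postings; each posting is {document ID, position}. */
    static Map<String, List<int[]>> build(List<String> docs) {
        Map<String, List<int[]>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // Assumption: terms are separated by whitespace only.
            String[] terms = docs.get(docId).toLowerCase().split("\\s+");
            for (int pos = 0; pos < terms.length; pos++) {
                if (terms[pos].isEmpty()) {
                    continue;
                }
                index.computeIfAbsent(terms[pos], k -> new ArrayList<>())
                     .add(new int[] {docId, pos});
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        build(docs).forEach((term, postings) -> {
            StringBuilder line = new StringBuilder(term + ":");
            for (int[] p : postings) {
                line.append(" (").append(p[0]).append(", ").append(p[1]).append(")");
            }
            System.out.println(line);
        });
    }
}
```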

Inverted Index

• Speed
  – Term list
    • Very small compared to documents’ content
    • Tends to grow at a slower speed than documents (after a certain level)
  – Term lookup: O(1) to O(log of number of terms)
  – Document lists are very small
  – Document + position lists still small

Inverted Index

• Relevance
  – Extra information in the index
    • Stored in an easily accessible way
    • Determine relevance of each document to the query
  – Enables sorting by (decreasing) relevance

Determining Relevancy

• Two models used in the searching process
  – Boolean model
    • AND, OR, NOT, etc.
    • Either a document matches a query, or not
  – Vector space model
    • How often a query term appears in a document vs. how often the term appears in all documents
    • Scoring and sorting by relevancy possible

Determining Relevancy

all documents
  -> filtering (Boolean Model)
  -> some documents (unsorted)
  -> scoring (Vector Space Model)
  -> some documents (sorted by score)

Lucene uses both models

Vector Space Model

[Figure: the query, document 1, and document 2 plotted as vectors. Axes: f(frequency of term A) and f(frequency of term B)]
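In the vector space model, a query and a document are compared by the angle between their term vectors. A minimal sketch of that comparison as cosine similarity follows; the class name CosineSketch and the two-term vectors are made up for illustration, and Lucene's actual scoring adds further factors.

```java
// Hypothetical illustration class; not part of Lucene.
final class CosineSketch {

    /** Cosine of the angle between two term-weight vectors of equal length. */
    static double cosine(double[] a, double[] b) {
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0; // an all-zero vector has no direction
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] query = {1, 1};      // weights of term A and term B (made-up numbers)
        double[] document1 = {3, 2};
        double[] document2 = {0, 4};
        System.out.println("document 1: " + cosine(query, document1));
        System.out.println("document 2: " + cosine(query, document2));
    }
}
```

A document whose vector points in nearly the same direction as the query gets a value close to 1, regardless of its length.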

Scoring

• Term frequency (TF)
  – How many times does this term (t) appear in this document (d)?
  – Score proportional to TF
• Document frequency (DF)
  – How many documents have this term (t)?
  – Score proportional to the inverse of DF (IDF)

Scoring

• Coordination factor (coord)
  – Documents that contain all or most query terms get higher scores
• Normalizing factor (norm)
  – Adjust for field length and query complexity

Scoring

• Boost
  – “Manual override”: ask Lucene to give a higher score to some particular thing
  – Index-time
    • Document
    • Field (of a particular document)
  – Search-time
    • Query

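Scoring

$$\mathrm{score}(q, d) = \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \Bigl( \mathrm{tf}(t \text{ in } d) \cdot \mathrm{idf}(t)^2 \cdot \mathrm{boost}(t) \cdot \mathrm{norm}(t, d) \Bigr)$$

• coord(q, d): coordination factor
• queryNorm(q): query normalizing factor
• tf(t in d): term frequency
• idf(t): inverse document frequency
• boost(t): term boost
• norm(t, d): document boost, field boost, length normalizing factor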

http://lucene.apache.org/core/3_6_0/scoring.html
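A simplified plain-Java sketch of how the factors above combine. It assumes tf = sqrt(frequency), idf = 1 + ln(numDocs / (docFreq + 1)), and a length norm of 1 / sqrt(number of terms), which follow the defaults described on the scoring page linked above; it leaves out queryNorm and all boosts (effectively treating them as 1). The class name TfIdfSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; a simplified sketch, not Lucene's scorer.
final class TfIdfSketch {

    /** Scores one tokenized document against a tokenized query. */
    static double score(String[] query, String[] doc, List<String[]> allDocs) {
        if (query.length == 0) {
            return 0;
        }
        int matchedTerms = 0;
        double sum = 0;
        for (String term : query) {
            int freq = count(doc, term);
            if (freq == 0) {
                continue;
            }
            matchedTerms++;
            int docFreq = 0;
            for (String[] other : allDocs) {
                if (count(other, term) > 0) {
                    docFreq++;
                }
            }
            double tf = Math.sqrt(freq);
            double idf = 1.0 + Math.log((double) allDocs.size() / (docFreq + 1));
            double norm = 1.0 / Math.sqrt(doc.length); // shorter fields score higher
            sum += tf * idf * idf * norm;
        }
        double coord = (double) matchedTerms / query.length; // share of query terms matched
        return coord * sum;
    }

    private static int count(String[] doc, String term) {
        int n = 0;
        for (String t : doc) {
            if (t.equals(term)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
            "it is what it is".split(" "), "what is it".split(" "), "it is a banana".split(" "));
        String[] query = {"what"};
        for (int i = 0; i < docs.size(); i++) {
            System.out.println("T" + i + ": " + score(query, docs.get(i), docs));
        }
    }
}
```

For the query "what", T1 ranks above T0 in this sketch: both contain the term once, but T1 is the shorter document.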

Work Flow

• Indexing
  – Index: storage of inverted index + documents
  – Add fields to a document
  – Add the document to the index
  – Repeat for every document
• Searching
  – Generate a query
  – Search with this query
  – Get back a sorted document list (top N docs)

Adding Field to Document

• Store?
• Index?
  – Analyzed (split text into multiple terms)
  – Not analyzed (treat the whole text as ONE term)
  – Not indexed (this field will not be searchable)
  – Store norms?

Analyzed vs. Not Analyzed

Text: “the quick brown fox”

Analyzed: 4 terms
1. the
2. quick
3. brown
4. fox

Not analyzed: 1 term
1. the quick brown fox

Index-time Analysis

• Analyzer
  – Determine which TokenStream classes to use
• TokenStream
  – Does the actual hard work
  – Tokenizer: text to tokens
  – Token filter: tokens to tokens
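The tokenizer-plus-filters chain can be imitated in a few lines of plain Java. This sketch does not use Lucene's classes; AnalysisSketch, its method names, and the three-word stop list are assumptions chosen only to show the "text to tokens, then tokens to tokens" flow.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; it only imitates the shape of an analysis chain.
final class AnalysisSketch {

    // Assumption: a tiny stop list for the example, not Lucene's default list.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "the"));

    /** Tokenizer step: text to tokens (splits on anything that is not a letter). */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Token filter step: tokens to tokens (lowercases every token). */
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** Token filter step: tokens to tokens (drops stop words). */
    static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    /** Analyzer step: decides which tokenizer and filters run, and in what order. */
    static List<String> analyze(String text) {
        return removeStopWords(lowerCase(tokenize(text)));
    }

    public static void main(String[] args) {
        System.out.println(analyze("The quick brown fox"));
    }
}
```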

Text:
San Francisco, the Bay Area’s city-county http://www.ci.sf.ca.us controller@sfgov.org

WhitespaceAnalyzer:
[San] [Francisco,] [the] [Bay] [Area’s] [city-county] [http://www.ci.sf.ca.us/] [controller@sfgov.org]

StopAnalyzer:
[san] [francisco] [bay] [area] [s] [city] [county] [http] [www] [ci] [sf] [ca] [us] [controller] [sfgov] [org]

StandardAnalyzer:
[san] [francisco] [bay] [area's] [city] [county] [http] [www.ci.sf.ca.us] [controller] [sfgov.org]

Notable TokenStream Classes

• ASCIIFoldingFilter
  – Converts alphabetic characters into basic forms
• PorterStemFilter
  – Reduces tokens into their stems
• SynonymTokenFilter
  – Converts words to their synonyms
• ShingleFilter
  – Creates shingles (n-grams)
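As an illustration of what a shingle is, here is a plain-Java sketch that joins every run of n adjacent tokens into one token. It is not ShingleFilter itself, whose output options differ; the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not Lucene's ShingleFilter.
final class ShingleSketch {

    /** Returns every run of `size` adjacent tokens, joined with a single space. */
    static List<String> shingles(List<String> tokens, int size) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be at least 1");
        }
        List<String> out = new ArrayList<>();
        for (int start = 0; start + size <= tokens.size(); start++) {
            out.add(String.join(" ", tokens.subList(start, start + size)));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shingles(Arrays.asList("the", "quick", "brown", "fox"), 2));
    }
}
```

Indexing such word n-grams lets a multi-word expression be matched as a single term.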

Tokens

• Information about a token
  – Field
  – Text
  – Start offset, end offset
  – Position increment

Attributes

• Past versions of Lucene: Token object
• Recent versions of Lucene: attributes
  – Efficiency, flexibility
  – Ask for attributes you want
  – Receive attribute objects
  – Use these objects for information about tokens

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// create token stream
TokenStream tokenStream = analyzer.reusableTokenStream(fieldName, reader);
tokenStream.reset();

// obtain each attribute you want to know
CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
OffsetAttribute offset = tokenStream.addAttribute(OffsetAttribute.class);
PositionIncrementAttribute posInc = tokenStream.addAttribute(PositionIncrementAttribute.class);

// go to the next token
while (tokenStream.incrementToken()) {
    // use information about the current token
    doSomething(term.toString(), offset.startOffset(), offset.endOffset(), posInc.getPositionIncrement());
}

// close token stream
tokenStream.end();
tokenStream.close();

Query-time Analysis

• Text in a query is analyzed like fields
• Use the same analyzer that analyzed the particular field

+field1:“quick brown fox” +(field2:“lazy dog” field2:“cozy cat”)

[Figure: the phrases in the query are analyzed into the terms quick, brown, fox, lazy, dog, cozy, cat]

Query Formation

• Query parsing
  – A query parser in core code
  – Additional query parsers in contributed code
• Or build query from the Lucene query classes

Term Query

• Matches documents with a particular term
  – Field
  – Text

Term Range Query

• Matches documents with any of the terms in a particular range
  – Field
  – Lowest term text
  – Highest term text
  – Include lowest term text?
  – Include highest term text?

Prefix Query

• Matches documents with any of the terms with a particular prefix
  – Field
  – Prefix

Wildcard/Regex Query

• Matches documents with any of the terms that match a particular pattern
  – Field
  – Pattern
    • Wildcard: * for 0+ characters, ? for exactly 1 character
    • Regular expression
    • Pattern matching on individual terms only

Fuzzy Query

• Matches documents with any of the terms that are “similar” to a particular term
  – Levenshtein distance (“edit distance”): Number of character insertions, deletions or substitutions needed to transform one string into another
    • e.g. kitten -> sitten -> sittin -> sitting (3 edits)
  – Field
  – Text
  – Minimum similarity score
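The edit distance itself can be computed with the standard dynamic-programming recurrence. The following is a textbook sketch in plain Java, not the code Lucene uses for fuzzy matching; the class name LevenshteinSketch is hypothetical.

```java
// Hypothetical illustration class; a textbook implementation, not Lucene's.
final class LevenshteinSketch {

    /** Number of single-character insertions, deletions or substitutions turning a into b. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building b's first j characters from an empty string
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting a's first i characters
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```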

Phrase Query

• Matches documents with all the given words present and being “near” each other
  – Field
  – Terms
  – Slop
    • Number of “moves of words” permitted
    • Slop = 0 means exact phrase match required

Boolean Query

• Conceptually similar to boolean operators (“AND”, “OR”, “NOT”), but not identical

• Why Not AND, OR, And NOT?
  – http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/
  – In short, boolean operators do not handle > 2 clauses well

Boolean Query

• Three types of clauses
  – Must
  – Should
  – Must not
• For a boolean query to match a document
  – All “must” clauses must match
  – All “must not” clauses must not match
  – At least one “must” or “should” clause must match
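The three matching rules can be written down directly. This plain-Java sketch models each clause as a predicate over a document's text; BooleanMatchSketch and hasTerm are hypothetical names, and the code only mirrors the rules listed above rather than Lucene's BooleanQuery implementation.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration class; it mirrors the three rules, not Lucene's code.
final class BooleanMatchSketch {

    /** A clause that matches documents containing the given whitespace-separated term. */
    static Predicate<String> hasTerm(String term) {
        return doc -> Arrays.asList(doc.split("\\s+")).contains(term);
    }

    static boolean matches(String doc,
                           List<Predicate<String>> must,
                           List<Predicate<String>> should,
                           List<Predicate<String>> mustNot) {
        for (Predicate<String> clause : mustNot) {
            if (clause.test(doc)) {
                return false; // a "must not" clause matched
            }
        }
        for (Predicate<String> clause : must) {
            if (!clause.test(doc)) {
                return false; // a "must" clause failed
            }
        }
        if (!must.isEmpty()) {
            return true; // every "must" clause matched, so at least one clause matched
        }
        for (Predicate<String> clause : should) {
            if (clause.test(doc)) {
                return true; // no "must" clauses: one "should" clause has to match
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Predicate<String>> must = Arrays.asList(hasTerm("lucene"));
        List<Predicate<String>> should = Collections.emptyList();
        List<Predicate<String>> mustNot = Arrays.asList(hasTerm("solr"));
        System.out.println(matches("apache lucene search", must, should, mustNot));
        System.out.println(matches("apache solr lucene", must, should, mustNot));
    }
}
```

Note that under these rules a query made only of "must not" clauses matches nothing.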

Span Query

• Similar to other queries, but matches spans
• Span
  – particular place/part of a particular document
  – <document ID, start position, end position> tuple

T0 = "it is what it is"
      0  1  2    3  4

T1 = "what is it"
      0    1  2

T2 = "it is a banana"
      0  1  2 3

“it is”: <doc ID, start pos., end pos.>
<0, 0, 2>
<0, 3, 5>
<2, 0, 2>
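A small plain-Java sketch that reproduces the span tuples for the three example documents. It assumes whitespace-separated terms and an exclusive end position, as in the tuples shown; SpanSketch and findSpans are hypothetical names, not Lucene's span query classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not Lucene's span query implementation.
final class SpanSketch {

    /** Returns {doc ID, start position, end position (exclusive)} for each exact occurrence. */
    static List<int[]> findSpans(List<String> docs, String phrase) {
        String[] words = phrase.split("\\s+");
        List<int[]> spans = new ArrayList<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            String[] terms = docs.get(docId).split("\\s+");
            for (int start = 0; start + words.length <= terms.length; start++) {
                String[] window = Arrays.copyOfRange(terms, start, start + words.length);
                if (Arrays.equals(window, words)) {
                    spans.add(new int[] {docId, start, start + words.length});
                }
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        for (int[] span : findSpans(docs, "it is")) {
            System.out.println("<" + span[0] + ", " + span[1] + ", " + span[2] + ">");
        }
    }
}
```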

Span Query

• SpanTermQuery
  – Same as TermQuery, except you can build other span queries with it
• SpanOrQuery
  – Matches spans that are matched by any of some span queries
• SpanNotQuery
  – Matches spans that are matched by one span query but not the other span query

[Figure: spans matched in text containing “apple” and “orange” by spanTerm(apple), spanTerm(orange), spanOr([apple, orange]), and spanNot(apple, orange)]

Span Query

• SpanNearQuery
  – Matches spans that are within a certain “slop” of each other
  – Slop: max number of positions between spans
  – Can specify whether order matters

[Figure: sample text made up of the words “the”, “quick”, “brown”, and “fox”]

1. spanNear([brown, fox, the, quick], slop = 4, inOrder = false)

2. spanNear([brown, fox, the, quick], slop = 3, inOrder = false)

3. spanNear([brown, fox, the, quick], slop = 2, inOrder = false)

4. spanNear([brown, fox, the, quick], slop = 3, inOrder = true)

5. spanNear([the, quick, brown, fox], slop = 3, inOrder = true)

Filtering

• A Filter narrows down the search result
  – Creates a set of document IDs
  – Decides what documents get processed further
  – Does not affect scoring, i.e. does not score/rank documents that pass the filter
  – Can be cached easily
  – Useful for access control, presets, etc.
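The idea of a filter as a cacheable set of document IDs can be sketched with a java.util.BitSet. The access-control scenario, the owners array, and the names FilterSketch, allowedDocs, and apply are all made up for this illustration; Lucene's Filter classes have their own API.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical illustration class; not Lucene's Filter API.
final class FilterSketch {

    /** Builds the set of document IDs the user may see; the result can be cached and reused. */
    static BitSet allowedDocs(String[] owners, String user) {
        BitSet allowed = new BitSet(owners.length);
        for (int docId = 0; docId < owners.length; docId++) {
            if (owners[docId].equals(user)) {
                allowed.set(docId);
            }
        }
        return allowed;
    }

    /** Keeps only the hits that pass the filter; their order (ranking) is left untouched. */
    static List<Integer> apply(int[] rankedDocIds, BitSet allowed) {
        List<Integer> kept = new ArrayList<>();
        for (int docId : rankedDocIds) {
            if (allowed.get(docId)) {
                kept.add(docId);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] owners = {"ann", "bob", "ann"};  // owner of documents 0, 1, 2 (made-up data)
        int[] rankedDocIds = {2, 1, 0};           // hits already sorted by score
        System.out.println(apply(rankedDocIds, allowedDocs(owners, "ann")));
    }
}
```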

Notable Filter classes

• TermsFilter
  – Allows documents with any of the given terms
• TermRangeFilter
  – Filter version of TermRangeQuery
• PrefixFilter
  – Filter version of PrefixQuery
• QueryWrapperFilter
  – “Adapts” a query into a filter
• CachingWrapperFilter
  – Caches the result of the wrapped filter

Sorting

• Score (default)
• Index order
• Field
  – Requires the field be indexed & not analyzed
  – Specify type (string, int, etc.)
  – Normal or reverse order
  – Single or multiple fields

Interfacing Lucene with “Outside”

• Embedding directly
• Language bridge
  – E.g. PHP/Java Bridge
• Web service
  – E.g. Jetty + your own request handler
• Solr
  – Lucene + Jetty + lots of useful functionality

Books

• Lucene in Action, 2nd Edition
  – Written by 3 committers and PMC members
  – http://www.manning.com/hatcher3/
• Introduction to Information Retrieval
  – Not specific to Lucene, but about IR concepts
  – Free e-book
  – http://nlp.stanford.edu/IR-book/

Web Resources

• Official Website
  – http://lucene.apache.org/
• StackOverflow
  – http://stackoverflow.com/questions/tagged/lucene
• Mailing lists
  – http://lucene.apache.org/core/discussion.html
• Blogs
  – http://www.lucidimagination.com/blog/
  – http://blog.mikemccandless.com/
  – http://lucene.grantingersoll.com/

Getting Started

• Getting started
  – Download lucene-3.6.0.zip (or .tgz)
  – Add lucene-core-3.6.0.jar to your classpath
  – Consider using an IDE (e.g. Eclipse)
  – Luke (Lucene Index Toolbox) http://code.google.com/p/luke/
