Post on 25-May-2015
Introduction to Search Engine-Building with Lucene
Kai Chan
SoCal Code Camp, June 2012

How to Search
• One (common) approach to searching all your documents:
for each document d {
    if (query is a substring of d’s content) {
        add d to the list of results
    }
}
sort the results

How to Search
• Problems
  – Slow: reads the whole database for each search
  – Not scalable: if your database grows by 10x, your search slows down by 10x
  – How to show the most relevant documents first?

Inverted Index
• (term -> document list) map
Example taken from Wikipedia (http://en.wikipedia.org/wiki/Inverted_index)
Documents:
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"
Inverted index:
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
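The (term -> document list) map can be sketched in plain Java. This is a minimal illustration, not Lucene code: the class name `InvertedIndexSketch` and its methods are hypothetical, and it assumes terms are whitespace-separated words compared in lowercase.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only: names are hypothetical, not Lucene APIs.
final class InvertedIndexSketch {

    // Builds a (term -> sorted set of document IDs) map.
    // Assumption: terms are whitespace-separated words, compared in lowercase.
    static Map<String, SortedSet<Integer>> build(List<String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : docs.get(docId).toLowerCase(Locale.ROOT).split("\\s+")) {
                if (term.isEmpty()) {
                    continue; // leading whitespace yields an empty first piece
                }
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    // Documents that contain every given term: intersect the document lists.
    static SortedSet<Integer> searchAll(Map<String, SortedSet<Integer>> index, String... terms) {
        SortedSet<Integer> result = null;
        for (String term : terms) {
            SortedSet<Integer> docs = index.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        Map<String, SortedSet<Integer>> index =
                build(Arrays.asList("it is what it is", "what is it", "it is a banana"));
        System.out.println(index);
        System.out.println(searchAll(index, "what", "is", "it"));
    }
}
```

A lookup touches only the document lists of the query terms, so the cost no longer depends on reading every document's content.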

Inverted Index
• (term -> <document, position> list) map
Documents, with the position of each word:
T0 = "it is what it is"   (it: 0, is: 1, what: 2, it: 3, is: 4)
T1 = "what is it"         (what: 0, is: 1, it: 2)
T2 = "it is a banana"     (it: 0, is: 1, a: 2, banana: 3)

Inverted Index
• (term -> <document, position> list) map
Documents:
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"
Inverted index:
"a": {(2, 2)}
"banana": {(2, 3)}
"is": {(0, 1), (0, 4), (1, 1), (2, 1)}
"it": {(0, 0), (0, 3), (1, 2), (2, 0)}
"what": {(0, 2), (1, 0)}
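A positional index can be sketched the same way, storing a (document, position) pair for every occurrence. The names `PositionalIndexSketch` and `Posting` are hypothetical, and whitespace tokenization is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: names are hypothetical, not Lucene APIs.
final class PositionalIndexSketch {

    // One posting: the term occurs in document docId at word position `position`.
    static final class Posting {
        final int docId;
        final int position;

        Posting(int docId, int position) {
            this.docId = docId;
            this.position = position;
        }

        @Override
        public String toString() {
            return "(" + docId + ", " + position + ")";
        }
    }

    // Builds a (term -> list of <document, position>) map.
    // Assumption: terms are whitespace-separated words.
    static Map<String, List<Posting>> build(List<String> docs) {
        Map<String, List<Posting>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            String[] terms = docs.get(docId).trim().split("\\s+");
            for (int pos = 0; pos < terms.length; pos++) {
                if (terms[pos].isEmpty()) {
                    continue; // an empty document yields one empty piece
                }
                index.computeIfAbsent(terms[pos], k -> new ArrayList<>())
                        .add(new Posting(docId, pos));
            }
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(build(Arrays.asList("it is what it is", "what is it", "it is a banana")));
    }
}
```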

Inverted Index
• Speed
  – Term list
    • Very small compared to documents’ content
    • Tends to grow more slowly than the documents (after a certain level)
  – Term lookup: O(1) to O(log of number of terms)
  – Document lists are very small
  – Document + position lists still small

Inverted Index
• Relevance
  – Extra information in the index
    • Stored in an easily accessible way
    • Used to determine the relevance of each document to the query
  – Enables sorting by (decreasing) relevance

Determining Relevancy
• Two models used in the searching process
  – Boolean model
    • AND, OR, NOT, etc.
    • Either a document matches a query, or not
  – Vector space model
    • How often a query term appears in a document vs. how often the term appears in all documents
    • Scoring and sorting by relevancy possible

Determining Relevancy
• Lucene uses both models
all documents -> filtering (Boolean Model) -> some documents (unsorted) -> scoring (Vector Space Model) -> some documents (sorted by score)

Vector Space Model
[Figure: the query, document 1, and document 2 drawn as vectors; axes: f(frequency of term A) and f(frequency of term B)]

Scoring
• Term frequency (TF)
  – How many times does this term (t) appear in this document (d)?
  – Score proportional to TF
• Document frequency (DF)
  – How many documents have this term (t)?
  – Score proportional to the inverse of DF (IDF)

Scoring
• Coordination factor (coord)
  – Documents that contain all or most query terms get higher scores
• Normalizing factor (norm)
  – Adjust for field length and query complexity

Scoring
• Boost
  – “Manual override”: ask Lucene to give a higher score to some particular thing
  – Index-time
    • Document
    • Field (of a particular document)
  – Search-time
    • Query
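
Scoring
$$\mathrm{score}(q, d) = \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \left( \mathrm{tf}(t \text{ in } d) \cdot \mathrm{idf}(t)^2 \cdot \mathrm{boost}(t) \cdot \mathrm{norm}(t, d) \right)$$
• coord(q, d): coordination factor
• queryNorm(q): query normalizing factor
• tf(t in d): term frequency
• idf(t): inverse document frequency
• boost(t): term boost
• norm(t, d): document boost, field boost, length normalizing factor
http://lucene.apache.org/core/3_6_0/scoring.html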
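To make the TF and IDF parts of the formula concrete, here is a simplified sketch that computes only the sum of tf · idf² over the query terms. It is not Lucene's implementation: coord, queryNorm, boost, and norm are all treated as 1, and the particular choices tf = sqrt(frequency) and idf = 1 + ln(numDocs / (docFreq + 1)) are assumptions made for this example (they follow what Lucene 3.x's default similarity is commonly described as using). The class name `TfIdfSketch` is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: a reduced tf * idf^2 score, not Lucene's full scoring.
final class TfIdfSketch {

    // How many times the term appears in the document (whitespace-separated words).
    static int termFreq(String term, String doc) {
        int count = 0;
        for (String word : doc.split("\\s+")) {
            if (word.equals(term)) {
                count++;
            }
        }
        return count;
    }

    // How many documents contain the term at least once.
    static int docFreq(String term, List<String> docs) {
        int count = 0;
        for (String doc : docs) {
            if (termFreq(term, doc) > 0) {
                count++;
            }
        }
        return count;
    }

    // Sum over query terms of tf * idf^2; coord, queryNorm, boost, norm assumed to be 1.
    static double score(List<String> queryTerms, String doc, List<String> docs) {
        double sum = 0.0;
        for (String term : queryTerms) {
            double tf = Math.sqrt(termFreq(term, doc));                                       // assumed tf shape
            double idf = 1.0 + Math.log((double) docs.size() / (docFreq(term, docs) + 1));    // assumed idf shape
            sum += tf * idf * idf;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        for (String doc : docs) {
            System.out.println(doc + " -> " + score(Arrays.asList("what", "is"), doc, docs));
        }
    }
}
```

With these assumptions, a rare term such as "banana" contributes more per occurrence than a term that appears in every document, which is the effect the IDF factor is meant to have.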

Work Flow
• Indexing
  – Index: storage of inverted index + documents
  – Add fields to a document
  – Add the document to the index
  – Repeat for every document
• Searching
  – Generate a query
  – Search with this query
  – Get back a sorted document list (top N docs)

Adding Field to Document
• Store?
• Index?
  – Analyzed (split text into multiple terms)
  – Not analyzed (treat the whole text as ONE term)
  – Not indexed (this field will not be searchable)
  – Store norms?

Analyzed vs. Not Analyzed
Text: “the quick brown fox”
Analyzed: 4 terms
1. the
2. quick
3. brown
4. fox
Not analyzed: 1 term
1. the quick brown fox

Index-time Analysis
• Analyzer
  – Determines which TokenStream classes to use
• TokenStream
  – Does the actual hard work
  – Tokenizer: text to tokens
  – Token filter: tokens to tokens

Text:
San Francisco, the Bay Area’s city-county http://www.ci.sf.ca.us controller@sfgov.org
WhitespaceAnalyzer:
[San] [Francisco,] [the] [Bay] [Area’s] [city-county] [http://www.ci.sf.ca.us] [controller@sfgov.org]
StopAnalyzer:
[san] [francisco] [bay] [area] [s] [city] [county] [http] [www] [ci] [sf] [ca] [us] [controller] [sfgov] [org]
StandardAnalyzer:
[san] [francisco] [bay] [area's] [city] [county] [http] [www.ci.sf.ca.us] [controller] [sfgov.org]
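The difference between the first two outputs can be imitated in plain Java. This sketch does not use Lucene: `whitespaceTokens` only splits on whitespace, while `letterLowercaseStopTokens` splits on non-letters, lowercases, and drops stop words, which is roughly the behavior the StopAnalyzer output above shows. The class name, method names, and the tiny stop list are assumptions for illustration; Lucene's own default stop set is larger.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: imitates analyzer output without using Lucene.
final class AnalysisSketch {

    // Assumption: a tiny stop list for the example, not Lucene's actual default set.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the"));

    // Split on whitespace only; punctuation and case are kept as they are.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Split on every non-letter, lowercase each token, then drop stop words.
    static List<String> letterLowercaseStopTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // Run one step past the end so the last token is flushed.
        for (int i = 0; i <= text.length(); i++) {
            char c = i < text.length() ? text.charAt(i) : ' ';
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                String token = current.toString();
                if (!STOP_WORDS.contains(token)) {
                    tokens.add(token);
                }
                current.setLength(0);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "San Francisco, the Bay Area's city-county "
                + "http://www.ci.sf.ca.us controller@sfgov.org";
        System.out.println(whitespaceTokens(text));
        System.out.println(letterLowercaseStopTokens(text));
    }
}
```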

Notable TokenStream Classes
• ASCIIFoldingFilter
  – Converts accented and other non-ASCII characters to their ASCII equivalents, where one exists (e.g. é -> e)
• PorterStemFilter
  – Reduces tokens to their stems
• SynonymTokenFilter
  – Converts words to their synonyms
• ShingleFilter
  – Creates shingles (word n-grams)
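As an illustration of what a shingle is, the following sketch builds word n-grams of one fixed size from a token list. It is plain Java rather than Lucene's ShingleFilter, which has more options than shown here; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: fixed-size word n-grams, not Lucene's ShingleFilter.
final class ShingleSketch {

    // Returns every run of `size` consecutive tokens, joined by single spaces.
    static List<String> shingles(List<String> tokens, int size) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be at least 1");
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + size)));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shingles(Arrays.asList("the", "quick", "brown", "fox"), 2));
    }
}
```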

Tokens
• Information about a token
  – Field
  – Text
  – Start offset, end offset
  – Position increment

Attributes
• Past versions of Lucene: Token object
• Recent versions of Lucene: attributes
  – Efficiency, flexibility
  – Ask for attributes you want
  – Receive attribute objects
  – Use these objects for information about tokens
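
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// analyzer, fieldName, reader and doSomething(...) come from the surrounding code.

// Create the token stream.
TokenStream tokenStream = analyzer.reusableTokenStream(fieldName, reader);
tokenStream.reset();

// Obtain each attribute you want to know about.
CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
OffsetAttribute offset = tokenStream.addAttribute(OffsetAttribute.class);
PositionIncrementAttribute posInc = tokenStream.addAttribute(PositionIncrementAttribute.class);

// Go to the next token; the loop ends when there are no more tokens.
while (tokenStream.incrementToken()) {
    // Use information about the current token.
    doSomething(term.toString(), offset.startOffset(), offset.endOffset(),
            posInc.getPositionIncrement());
}

// Close the token stream.
tokenStream.end();
tokenStream.close();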

Query-time Analysis
• Text in a query is analyzed like fields
• Use the same analyzer that analyzed the particular field
Query: +field1:“quick brown fox” +(field2:“lazy dog” field2:“cozy cat”)
Text that gets analyzed: quick brown fox (field1); lazy dog, cozy cat (field2)

Query Formation
• Query parsing
  – A query parser in core code
  – Additional query parsers in contributed code
• Or build the query from the Lucene query classes

Term Query
• Matches documents with a particular term
  – Field
  – Text

Term Range Query
• Matches documents with any of the terms in a particular range
  – Field
  – Lowest term text
  – Highest term text
  – Include lowest term text?
  – Include highest term text?

Prefix Query
• Matches documents with any of the terms with a particular prefix
  – Field
  – Prefix
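Range and prefix lookups both come down to selecting a slice of a sorted term list. The sketch below shows the idea with a `TreeSet` standing in for the term dictionary; it is not how Lucene stores terms, the names are hypothetical, and it assumes no term contains the character U+FFFF.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustrative only: a sorted set stands in for the index's term dictionary.
final class TermDictionarySketch {

    // Terms between low and high, with each bound optionally included.
    static SortedSet<String> range(TreeSet<String> terms, String low, boolean includeLow,
                                   String high, boolean includeHigh) {
        return terms.subSet(low, includeLow, high, includeHigh);
    }

    // Terms starting with the prefix: they sort at or after the prefix itself
    // and before the prefix followed by the largest char value.
    // Assumption: no term contains U+FFFF.
    static SortedSet<String> withPrefix(TreeSet<String> terms, String prefix) {
        return terms.subSet(prefix, true, prefix + Character.MAX_VALUE, false);
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(
                Arrays.asList("apple", "apply", "banana", "band", "bandana", "cat"));
        System.out.println(range(terms, "apply", true, "band", false));
        System.out.println(withPrefix(terms, "ban"));
    }
}
```

The documents matching the query are then the union of the document lists of the selected terms.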

Wildcard/Regex Query
• Matches documents with any of the terms that match a particular pattern
  – Field
  – Pattern
    • Wildcard: * for 0 or more characters, ? for exactly 1 character
    • Regular expression
    • Pattern matching on individual terms only
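One way to picture wildcard matching on individual terms is to translate the pattern into a regular expression and test each term against it. The sketch below does this with `java.util.regex`; it assumes `*` stands for any run of characters (including none) and `?` for exactly one character, which is how Lucene's WildcardQuery documents them. It is not Lucene's own matching code, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: wildcard matching over a list of terms via java.util.regex.
final class WildcardSketch {

    // Translates a wildcard pattern to a regex: '*' -> ".*", '?' -> ".",
    // every other character is matched literally.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    // Returns the terms that match the whole pattern, in their original order.
    static List<String> matchingTerms(Collection<String> terms, String wildcard) {
        Pattern pattern = toRegex(wildcard);
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            if (pattern.matcher(term).matches()) {
                out.add(term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("fox", "fix", "fax", "foxes", "box");
        System.out.println(matchingTerms(terms, "f?x"));
        System.out.println(matchingTerms(terms, "fox*"));
    }
}
```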

Fuzzy Query
• Matches documents with any of the terms that are “similar” to a particular term
  – Levenshtein distance (“edit distance”): number of character insertions, deletions or substitutions needed to transform one string into another
    • e.g. kitten -> sitten -> sittin -> sitting (3 edits)
  – Field
  – Text
  – Minimum similarity score
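The edit distance itself can be computed with the classic dynamic-programming recurrence. This is a standalone sketch (class name hypothetical) and says nothing about how Lucene turns the distance into a similarity score.

```java
// Illustrative only: textbook Levenshtein distance, two rows of the DP table.
final class LevenshteinSketch {

    // Minimum number of single-character insertions, deletions or
    // substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting"));
    }
}
```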

Phrase Query
• Matches documents with all the given words present and “near” each other
  – Field
  – Terms
  – Slop
    • Number of “moves of words” permitted
    • Slop = 0 means exact phrase match required

Boolean Query
• Conceptually similar to boolean operators (“AND”, “OR”, “NOT”), but not identical
• Why Not AND, OR, And NOT?
  – http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/
  – In short, boolean operators do not handle more than 2 clauses well

Boolean Query
• Three types of clauses
  – Must
  – Should
  – Must not
• For a boolean query to match a document
  – All “must” clauses must match
  – All “must not” clauses must not match
  – At least one “must” or “should” clause must match
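The three matching rules can be written down directly as a predicate. This sketch models each clause as a `Predicate<String>` over a document's text; it only mirrors the rules listed above and is not Lucene's BooleanQuery (which also scores documents). The names `BooleanMatchSketch`, `matches`, and `hasTerm` are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

// Illustrative only: the must / should / must-not matching rules as a predicate.
final class BooleanMatchSketch {

    // A clause that matches when the document contains the term as a whole word.
    static Predicate<String> hasTerm(String term) {
        return doc -> Arrays.asList(doc.split("\\s+")).contains(term);
    }

    static boolean matches(String doc, List<Predicate<String>> must,
                           List<Predicate<String>> should, List<Predicate<String>> mustNot) {
        // All "must not" clauses must not match.
        for (Predicate<String> clause : mustNot) {
            if (clause.test(doc)) {
                return false;
            }
        }
        // All "must" clauses must match.
        for (Predicate<String> clause : must) {
            if (!clause.test(doc)) {
                return false;
            }
        }
        // At least one "must" or "should" clause must match.
        if (!must.isEmpty()) {
            return true; // every must clause matched, so at least one clause matched
        }
        for (Predicate<String> clause : should) {
            if (clause.test(doc)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String doc = "the quick brown fox";
        System.out.println(matches(doc,
                Arrays.asList(hasTerm("quick")),
                Collections.<Predicate<String>>emptyList(),
                Arrays.asList(hasTerm("lazy"))));
    }
}
```

Note the consequence of the last rule: a query with only "must not" clauses matches nothing.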

Span Query
• Similar to other queries, but matches spans
• Span
  – Particular place/part of a particular document
  – <document ID, start position, end position> tuple

T0 = "it is what it is"   (it: 0, is: 1, what: 2, it: 3, is: 4)
T1 = "what is it"         (what: 0, is: 1, it: 2)
T2 = "it is a banana"     (it: 0, is: 1, a: 2, banana: 3)
“it is”: <doc ID, start pos., end pos.>
<0, 0, 2>
<0, 3, 5>
<2, 0, 2>
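The span tuples in this example can be reproduced with a small scan over the documents. The sketch below looks for an exact run of consecutive words and reports <doc ID, start, end> with an exclusive end position, matching the tuples listed above. It is plain Java with hypothetical names, not Lucene's span machinery.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: finds exact-phrase spans as {docId, start, end} triples.
final class SpanSketch {

    // end is exclusive: a two-word match starting at position 3 is {doc, 3, 5}.
    static List<int[]> phraseSpans(List<String> docs, String... phrase) {
        List<int[]> spans = new ArrayList<>();
        if (phrase.length == 0) {
            return spans; // nothing to look for
        }
        for (int docId = 0; docId < docs.size(); docId++) {
            String[] terms = docs.get(docId).trim().split("\\s+");
            for (int start = 0; start + phrase.length <= terms.length; start++) {
                String[] window = Arrays.copyOfRange(terms, start, start + phrase.length);
                if (Arrays.equals(window, phrase)) {
                    spans.add(new int[] {docId, start, start + phrase.length});
                }
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        for (int[] span : phraseSpans(docs, "it", "is")) {
            System.out.println(Arrays.toString(span));
        }
    }
}
```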

Span Query
• SpanTermQuery
  – Same as TermQuery, except you can build other span queries with it
• SpanOrQuery
  – Matches spans that are matched by any of some span queries
• SpanNotQuery
  – Matches spans that are matched by one span query but not the other span query

[Figure: spans matched in a text containing "apple" and "orange" by spanTerm(apple), spanTerm(orange), spanOr([apple, orange]), and spanNot(apple, orange)]

Span Query
• SpanNearQuery
  – Matches spans that are within a certain “slop” of each other
  – Slop: max number of positions between spans
  – Can specify whether order matters

[Figure: the words "the", "quick", "brown", "fox", annotated with the numbers 2, 1, 0]
1. spanNear([brown, fox, the, quick], slop = 4, inOrder = false)
2. spanNear([brown, fox, the, quick], slop = 3, inOrder = false)
3. spanNear([brown, fox, the, quick], slop = 2, inOrder = false)
4. spanNear([brown, fox, the, quick], slop = 3, inOrder = true)
5. spanNear([the, quick, brown, fox], slop = 3, inOrder = true)
Marks shown on the slide: ✔ ✖ ✖ ✔ ✔

Filtering
• A Filter narrows down the search result
  – Creates a set of document IDs
  – Decides what documents get processed further
  – Does not affect scoring, i.e. does not score/rank documents that pass the filter
  – Can be cached easily
  – Useful for access control, presets, etc.
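Conceptually, a filter is just a set of allowed document IDs with no scores attached. The sketch below models that with a `java.util.BitSet` for an access-control case; the owner list, the class name, and the method names are all hypothetical, and Lucene's real Filter API is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Illustrative only: a filter modeled as a bit set of allowed document IDs.
final class FilterSketch {

    // Marks the documents owned by the given user; owners.get(i) is the owner of doc i.
    static BitSet allowedDocs(List<String> owners, String user) {
        BitSet allowed = new BitSet(owners.size());
        for (int docId = 0; docId < owners.size(); docId++) {
            if (owners.get(docId).equals(user)) {
                allowed.set(docId);
            }
        }
        return allowed;
    }

    // Drops filtered-out documents; the relative order (the ranking) is untouched.
    static List<Integer> applyFilter(List<Integer> rankedDocIds, BitSet allowed) {
        List<Integer> out = new ArrayList<>();
        for (int docId : rankedDocIds) {
            if (allowed.get(docId)) {
                out.add(docId);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        BitSet allowed = allowedDocs(Arrays.asList("alice", "bob", "alice", "carol"), "alice");
        System.out.println(applyFilter(Arrays.asList(3, 2, 1, 0), allowed));
    }
}
```

Because the bit set depends only on the user and the index contents, it can be computed once and reused across queries, which is what makes filters easy to cache.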

Notable Filter Classes
• TermsFilter
  – Allows documents with any of the given terms
• TermRangeFilter
  – Filter version of TermRangeQuery
• PrefixFilter
  – Filter version of PrefixQuery
• QueryWrapperFilter
  – “Adapts” a query into a filter
• CachingWrapperFilter
  – Caches the result of the wrapped filter

Sorting
• Score (default)
• Index order
• Field
  – Requires the field be indexed & not analyzed
  – Specify type (string, int, etc.)
  – Normal or reverse order
  – Single or multiple fields

Interfacing Lucene with “Outside”
• Embedding directly
• Language bridge
  – E.g. PHP/Java Bridge
• Web service
  – E.g. Jetty + your own request handler
• Solr
  – Lucene + Jetty + lots of useful functionality

Books
• Lucene in Action, 2nd Edition
  – Written by 3 committers and PMC members
  – http://www.manning.com/hatcher3/
• Introduction to Information Retrieval
  – Not specific to Lucene, but about IR concepts
  – Free e-book
  – http://nlp.stanford.edu/IR-book/

Web Resources
• Official Website
  – http://lucene.apache.org/
• StackOverflow
  – http://stackoverflow.com/questions/tagged/lucene
• Mailing lists
  – http://lucene.apache.org/core/discussion.html
• Blogs
  – http://www.lucidimagination.com/blog/
  – http://blog.mikemccandless.com/
  – http://lucene.grantingersoll.com/

Getting Started
• Getting started
  – Download lucene-3.6.0.zip (or .tgz)
  – Add lucene-core-3.6.0.jar to your classpath
  – Consider using an IDE (e.g. Eclipse)
  – Luke (Lucene Index Toolbox): http://code.google.com/p/luke/