Lucene

Brian Nisonger

Feb 08, 2006

What is it?

Doug Cutting’s wife’s middle name (and her maternal grandmother’s first name)
An open source set of Java classes
  Search engine / document classifier / indexer
  http://lucene.sourceforge.net/talks/pisa/
Developed by Doug Cutting, 1996
  Xerox/Apple/Excite/Nutch
  Wrote several papers in IR

What is it - Nuts and Bolts

Modules for IR
  Analysis
    Tokenization
    Where tokens are indexed
  Document
    Where the document ID is created
    Date of document is extracted
    Title of document is extracted

Nuts and Bolts - II

Modules, Cont’d
  Index
    Provides access to indexes
    Maintains indexes
  Query Parser
    Where the magic of the query happens
  Search
    Searches across indexes

Nuts and Bolts - III

Modules, Cont’d
  Search Spans
    Spans
    K +/- words
    Example:
      Find me a document that has Rachael Ray and Alton Brown within 100 words of each other and that also has the term cooking
  Store/Util
    Stores the indexes and handles other housekeeping
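The span example above can be sketched in a few lines of plain Python. This is only an illustration of the idea of a "within K words" match, not Lucene's span API; the function names (`positions`, `within`, `matches`) and the simple lowercase tokenizer are assumptions made for the sketch.

```python
import re

def positions(tokens, phrase):
    """Return the start positions where the phrase (a list of tokens) occurs."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]

def within(tokens, phrase_a, phrase_b, k):
    """True if some occurrence of phrase_a starts within k tokens of phrase_b."""
    pos_a = positions(tokens, phrase_a)
    pos_b = positions(tokens, phrase_b)
    return any(abs(a - b) <= k for a in pos_a for b in pos_b)

def matches(text, k=100):
    """The slide's query: both names within k words, plus the term 'cooking'."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    near = within(tokens, ["rachael", "ray"], ["alton", "brown"], k)
    return near and "cooking" in tokens
```

A document that mentions both names but more than `k` tokens apart fails the match, even if it contains "cooking".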

Theory

Space Optimization for Total Ranking
  Cutting et al 1996
  RIAO (Computer Assisted IR) 1997
  http://lucene.sf.net/papers/riao97.ps
Lucene lecture at Pisa
  Doug Cutting
  Slides from lecture at University of Pisa, 2004
  See previous link

Vector

Vectors let us measure a mathematical distance between terms
  Uses the cosine of the angle between two vectors to determine how close terms/documents are (higher means closer)
  This measure can then be used for WSD/Clustering/IR
  Example:
    Bass, fishing: .6506
    Bass, guitar: .000423
    This tells us the document is about fishing, not about guitars
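A minimal sketch of the cosine measure, assuming each term is represented by a vector of co-occurrence counts. The counts and the four context words below are invented for illustration and are not the data behind the numbers on the slide.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical co-occurrence counts over the context words
# [water, boat, amp, string] in some document.
bass = [8, 5, 0, 1]
fishing = [9, 6, 0, 0]
guitar = [0, 0, 7, 9]

closer_to_fishing = cosine(bass, fishing) > cosine(bass, guitar)
```

With these made-up counts, "bass" scores much closer to "fishing" than to "guitar", which is the kind of evidence the slide uses to pick the fishing sense.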

Vectors - IR

“Vector-space search engines use the notion of a term space, where each document is represented as a vector in a high-dimensional space. There are as many dimensions as there are unique words in the entire collection. Because a document's position in the term space is determined by the words it contains, documents with many words in common end up close together, while documents with few shared words end up far apart.”
  http://www.perl.com/pub/a/2003/02/19/engine.html
Intro to Comp Ling and its applications to IR
  Nisonger 2005 :P
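The term space in the quote can be made concrete with a toy collection. This sketch assumes whitespace tokenization and raw term counts; the three documents and the helper names are made up for illustration.

```python
def build_term_space(docs):
    """Map each unique word in the collection to one dimension."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    return {w: i for i, w in enumerate(vocab)}

def to_vector(doc, space):
    """Represent a document as term counts, one slot per dimension."""
    vec = [0] * len(space)
    for w in doc.lower().split():
        vec[space[w]] += 1
    return vec

def shared(u, v):
    """Number of dimensions where both documents have a non-zero count."""
    return sum(1 for a, b in zip(u, v) if a and b)

docs = ["bass fishing lake", "lake fishing boat", "bass guitar amp"]
space = build_term_space(docs)
vectors = [to_vector(d, space) for d in docs]
```

The collection has six unique words, so every vector has six dimensions. The two fishing documents overlap in two dimensions, while the second and third documents overlap in none.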

Inverted Index

Term/Doc Id/Weight
  Term
    “A Token, the basic unit of indexing in Lucene, represents a single word to be indexed after any document domain transformation -- such as stop-word elimination, stemming, filtering, term normalization, or language translation -- has been applied.”
    http://www.javaworld.com/javaworld/jw-09-2000/jw-0915-lucene-p2.html
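The transformations named in the quote can be sketched as a small pipeline. This is not Lucene's analyzer code: the stop-word list is a short illustrative sample, and `naive_stem` is a crude stand-in for a real stemmer.

```python
import re

# Illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def naive_stem(word):
    """Crude suffix stripping that keeps at least three characters of the word."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def analyze(text):
    """Lowercase, tokenize, drop stop words, then stem what is left."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]
```

Only the tokens that survive this chain would be written to the index as terms.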

Inverted Index - Cont’d

Doc Id
  A unique “key” that identifies each document
Weight
  Binary
  Freq Count
  Weighting Algorithm
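A minimal sketch of the Term/Doc Id/Weight structure, assuming an in-memory dictionary and whitespace tokenization. It shows the binary and frequency-count weights from the slide; a weighting algorithm such as tf-idf would replace the value stored per posting. The function name and sample documents are illustrative.

```python
from collections import Counter, defaultdict

def build_index(docs, weighting="freq"):
    """docs: {doc_id: text}. Returns {term: {doc_id: weight}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        for term, count in counts.items():
            # Binary weight records presence only; freq keeps the raw count.
            index[term][doc_id] = 1 if weighting == "binary" else count
    return dict(index)

docs = {1: "bass fishing bass", 2: "bass guitar"}
freq_index = build_index(docs)
binary_index = build_index(docs, weighting="binary")
```

Looking up a term returns the documents that contain it together with their weights, which is what makes the index "inverted" relative to the documents themselves.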

Index Merge

Basic/Basket/Basketball
  Only keeps track of the differences between words
Periodically merges indexes
  Allows new documents to be added easily
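One common way to "keep track of the differences between words" is front coding: each term in a sorted list is stored as the length of the prefix it shares with the previous term plus the remaining suffix. The sketch below assumes that reading of the slide and is not taken from Lucene's file format; `merge_indexes` likewise shows merging only in the simplest in-memory form.

```python
def front_encode(sorted_terms):
    """Store each term as (shared prefix length, remaining suffix)."""
    encoded, prev = [], ""
    for term in sorted_terms:
        shared = 0
        while shared < min(len(prev), len(term)) and prev[shared] == term[shared]:
            shared += 1
        encoded.append((shared, term[shared:]))
        prev = term
    return encoded

def front_decode(encoded):
    """Rebuild the full terms from (shared prefix length, suffix) pairs."""
    terms, prev = [], ""
    for shared, suffix in encoded:
        prev = prev[:shared] + suffix
        terms.append(prev)
    return terms

def merge_indexes(a, b):
    """Merge two {term: {doc_id: weight}} indexes into a new one."""
    merged = {term: dict(postings) for term, postings in a.items()}
    for term, postings in b.items():
        merged.setdefault(term, {}).update(postings)
    return merged

encoded = front_encode(["basic", "basket", "basketball"])
```

New documents can be indexed into a small separate index and folded into the main one later with a merge, which is why additions stay cheap.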

Query

Boolean Search
  Only searches documents with at least 1 term in the query
  “Boolean Search Engine”
Parallel Search
  Each term in the query is searched in parallel
  Partial scores are added to a queue of docs
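A sketch of how partial scores accumulate per document, assuming the {term: {doc_id: weight}} index shape and a simple sum of weights as the score. The terms are visited one after another here rather than truly in parallel, and the index contents are made up.

```python
from collections import defaultdict

def score(index, query_terms):
    """Sum per-term weights into partial scores; return docs sorted by score."""
    partial = defaultdict(float)
    for term in query_terms:
        # Only documents that contain this term are ever touched.
        for doc_id, weight in index.get(term, {}).items():
            partial[doc_id] += weight
    # Highest score first; ties broken by doc id for a stable order.
    return sorted(partial.items(), key=lambda item: (-item[1], item[0]))

index = {
    "bass": {1: 2, 2: 1},
    "fishing": {1: 1, 3: 1},
    "guitar": {2: 1, 4: 1},
}
results = score(index, ["bass", "fishing"])
```

Document 4 contains none of the query terms, so it never receives a partial score and never appears in the results.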

Query - II

Threshold
  If a document’s partial score is too low for it to be part of the N-best, the document is ignored even before the search is complete
  Example
    Potential New Doc [0,0,0,0,0,0,i]
    Document ranked 14 [233,202,109,100,i]
    Potential New Doc is ignored
Small loss of recall greatly increases speed of search
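The threshold idea can be sketched as follows. This is one possible pruning rule written for illustration, not the algorithm from the Cutting paper: it assumes non-negative weights, a sum-of-weights score, and a per-term upper bound (`max_weight`) on what the remaining terms could still add.

```python
import heapq

def top_n(index, query_terms, n):
    """index: {term: {doc_id: weight}}. Returns (top-n list, ignored doc ids)."""
    # Largest weight each term can contribute to any single document.
    max_weight = {t: max(index.get(t, {}).values(), default=0) for t in query_terms}
    # Visit high-impact terms first so the threshold rises quickly.
    terms = sorted(query_terms, key=lambda t: -max_weight[t])

    partial = {}     # doc_id -> partial score so far
    ignored = set()  # docs dropped before the search finished
    for i, term in enumerate(terms):
        remaining = sum(max_weight[t] for t in terms[i + 1:])
        # Current N-th best partial score, or 0 if fewer than N candidates.
        best = heapq.nlargest(n, partial.values())
        threshold = best[-1] if len(best) == n else 0
        for doc_id, weight in index.get(term, {}).items():
            if doc_id in ignored:
                continue
            if doc_id in partial:
                partial[doc_id] += weight
            elif weight + remaining > threshold:
                partial[doc_id] = weight   # still a potential N-best document
            else:
                ignored.add(doc_id)        # cannot catch up, so skip it
    ranked = sorted(partial.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n], ignored

index = {
    "bass": {1: 233, 2: 202, 3: 109},
    "fishing": {1: 100, 2: 90, 4: 5},
    "lake": {5: 1},
}
ranked, ignored = top_n(index, ["bass", "fishing", "lake"], n=2)
```

Documents 4 and 5 first show up with scores far below the current second-best, so they are dropped without being scored further. This rule is conservative; a more aggressive threshold would skip more documents and trade some recall for speed, as the slide notes.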

Evaluation of Lucene

Quantitative Evaluation of Passage Retrieval Algorithms for Question Answering
  Tellex et al, MIT AI Lab 2003
Compared Prise to Lucene for question and answer tasks
  Question & Answer
    <Who is the president?> <George W. Bush .76>

Evaluation - II

Prise
  An IR system developed by NIST that, according to the paper, uses “modern” search engine techniques
Findings
  Found Prise was better than Lucene, since “Boolean” query engines are considered old school and its answers to questions were better

Eval - III

Lucene
  Found that although Prise gave more correct answers, Lucene found more documents containing relevant information

Eval - Conclusion

External Knowledge Sources for Question Answering
  http://people.csail.mit.edu/gremio/publications/TREC2005.ps
  Katz et al, MIT Lab 2005
MIT used Lucene, not Prise, in their 2005 TREC submission

Users

Lucene is used widely
  TREC
  Document Retrieval Enterprise Systems
  Part of Database/Web engine
  Part of Nutch
Used by academics for large projects
  MIT, AI Lab
  Know-It-All Project (UW)

Conclusions

Lucene is a good set of classes
  Designed to allow customization without having to “reinvent the wheel”
Robust
Fast
Large development groups
Used widely in academia and industry

Questions?

Feel free to ask questions, make comments, tell jokes.

That’s ALL Folks!!!!!