Search Me: Using Lucene.Net in Your Apps

Upload: gramana

Post on 07-Jul-2015


DESCRIPTION

May 2012 JaxDUG presentation by Zachary Gramana on using the Lucene.NET library to add search functionality to .NET applications. Contains an overview of search/information retrieval concepts and highlights some common use-cases.

TRANSCRIPT

Page 1: Search Me: Using Lucene.Net

SEARCH ME

Using Lucene.Net In Your Apps

Page 2: Search Me: Using Lucene.Net

About Me

Zachary Johnson Gramana

Engineer at Potts Consulting Group

Proud new father of Rex

Page 3: Search Me: Using Lucene.Net

Search is...

A vague term that encompasses multiple problems.

A better term is “information retrieval” (IR), or an “IR system”.

Interdisciplinary, drawing from:

computer science (parsing, data structures)

psychology (query grammar, human/computer interaction)

linguistics (textual analysis)

information science (scoring/relevancy)

maths (document retrieval strategy)

Page 5: Search Me: Using Lucene.Net

Problems Solved

Information Overload

Find the information that users want,

not just the information they asked for.

Transparently handle all kinds of data:

structured (hierarchical)

semi-structured (markup)

un-structured data (plain text)

Single portal to multiple data types and

sources.

Do it fast!

Page 6: Search Me: Using Lucene.Net

Basic IR System Capabilities

Collection (importing, crawling):

Anonymous web page crawling (Google)

User-uploaded photographs (Flickr)

Publisher upload of .mp3 files (iTunes)

Indexing:

Analysis

Modify index data structure

Querying:

Input parsing

Query generation & execution

Collecting the results

Filtering the results (optional)
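The three capabilities above can be sketched end to end in a few lines. This is a minimal illustration in plain Python, not Lucene.Net code; the sample records and the function names (collect, analyze, build_index, search) are assumptions made up for the example.

```python
from collections import defaultdict

def collect():
    # Collection: stand-in for crawling or uploads (hypothetical records).
    return {1: "Lucene is a search library", 2: "Search is information retrieval"}

def analyze(text):
    # Analysis: lowercase and split on whitespace (deliberately simplistic).
    return text.lower().split()

def build_index(records):
    # Indexing: update the index data structure (term -> set of record ids).
    index = defaultdict(set)
    for record_id, text in records.items():
        for token in analyze(text):
            index[token].add(record_id)
    return index

def search(index, user_input, keep=lambda record_id: True):
    # Querying: parse the input, execute, collect results, optionally filter.
    terms = analyze(user_input)
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(r for r in hits if keep(r))

index = build_index(collect())
print(search(index, "search"))                         # every record containing "search"
print(search(index, "search", keep=lambda r: r > 1))   # same query with a result filter
```

All query terms must match here (an implicit AND); a real IR system would also score and rank the hits.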

Page 7: Search Me: Using Lucene.Net

What is Lucene.Net?

Port of the Apache Software Foundation’s Lucene libraries from Java to C#.

It’s a search library.

Lucene was created by Doug Cutting.

Named after his wife’s middle name.

First released in 2000 on SourceForge.

Migrated to the Apache Software Foundation in September 2001.

Page 8: Search Me: Using Lucene.Net

Used By

StackOverflow

JIRA

IBM

Akamai

Apple

Autodesk

Orchard

RavenDB

CouchDB

Page 9: Search Me: Using Lucene.Net

What Lucene.Net Isn’t

Not a complete information retrieval system. Check out Google Search Appliance instead:

http://www.google.com/enterprise/search/

Not a web crawler. Check out Arachnode instead:

http://arachnode.net

Not a query service. Check out Solr instead:

http://lucene.apache.org/solr

Not hard. Check out Windows Search SDK instead:

http://bit.ly/ImRtMk

Page 10: Search Me: Using Lucene.Net

Concept and Overview

Page 11: Search Me: Using Lucene.Net

What’s In an Index?

Stores a collection of Documents, each of which represents a source record.

Documents contain:

Metadata about the source record.

(optionally) actual data from the source record.

(optionally) derived analytical products.

Documents store a collection of token/frequency pairs (optionally with positions), plus a document identifier.
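As a rough picture of what one such document holds, the sketch below records token/frequency pairs, optional positions, and an identifier for a single source record. It is plain Python for illustration only; the IndexedDocument class and index_record function are hypothetical names, not Lucene.Net types.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class IndexedDocument:
    doc_id: int                                         # document identifier
    metadata: dict                                      # metadata about the source record
    term_freqs: Counter = field(default_factory=Counter)  # token -> frequency
    positions: dict = field(default_factory=dict)       # token -> list of positions (optional)

def index_record(doc_id, text, metadata):
    # Tokenize naively on whitespace and count each token and where it occurs.
    doc = IndexedDocument(doc_id, metadata)
    for pos, token in enumerate(text.lower().split()):
        doc.term_freqs[token] += 1
        doc.positions.setdefault(token, []).append(pos)
    return doc

doc = index_record(7, "to be or not to be", {"source": "hamlet.txt"})
print(doc.term_freqs.most_common(2))
print(doc.positions["be"])
```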

Page 12: Search Me: Using Lucene.Net

Lucene’s Index Structure

Documents store a collection of fields.

Fields are collections of terms, plus an identifier and optional term vectors.

Terms are key-value pairs of a field name and a string value.

Lucene provides special classes to deal with tricky data, like the NumericField class.

Term vectors are terms, along with their frequency counts and positions.

Fields can be indexed, stored, or both. Storing allows a field’s value to be retrieved after indexing.

Indexing adds the term value to Lucene’s inverted index.
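The difference between indexed and stored fields can be shown with a toy index. This is a sketch under stated assumptions: the TinyIndex class and its add/search methods are invented for the example and only mimic the idea, not Lucene.Net's actual Field API.

```python
from collections import defaultdict

class TinyIndex:
    def __init__(self):
        self.postings = defaultdict(set)   # (field name, term) -> doc ids
        self.stored = defaultdict(dict)    # doc id -> {field name: original value}

    def add(self, doc_id, name, value, indexed=True, stored=False):
        if indexed:
            # Indexing: each (field, token) pair becomes a searchable term.
            for token in value.lower().split():
                self.postings[(name, token)].add(doc_id)
        if stored:
            # Storing: keep the original value so it can be returned with hits.
            self.stored[doc_id][name] = value

    def search(self, name, term):
        return sorted(self.postings.get((name, term.lower()), set()))

idx = TinyIndex()
idx.add(1, "title", "Search Me", indexed=True, stored=True)
idx.add(1, "body", "Using Lucene.Net in your apps", indexed=True, stored=False)
idx.add(1, "path", "/slides/search-me.pptx", indexed=False, stored=True)

print(idx.search("body", "apps"))   # searchable, but the body text cannot be retrieved
print(idx.stored[1])                # only the stored fields come back
```

An indexed-only field saves space when the original text lives elsewhere; a stored-only field suits values such as paths that are displayed but never searched.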

Page 13: Search Me: Using Lucene.Net

The Inverted Index

(taken from Apple’s excellent “Search Basics” article at http://bit.ly/JO59kH )

Page 14: Search Me: Using Lucene.Net

Lucene’s Index Structure

What is an ‘inverted index’?

“verted” (forward) index: a document points to a collection of terms

inverted index: a term points to a collection of documents

One or more segments

Each is a self-contained, independent partition of the entire index.

Stores: field names, field values, term dictionary, term frequencies, term proximities, normalization factors, term vectors, and an (optional) deleted-record lookup table.
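The forward-versus-inverted distinction is easy to see by converting one into the other. The sketch below is plain Python with made-up sample data; the invert function is a hypothetical helper, and real segments hold far more than this term-to-documents map.

```python
from collections import defaultdict

# Forward index: document id -> terms that occur in it (sample data).
forward = {
    1: ["lucene", "search", "library"],
    2: ["search", "engine"],
}

def invert(forward_index):
    # Inverted index: term -> ascending list of document ids containing it.
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):
        # dict.fromkeys drops repeated terms while keeping their order.
        for term in dict.fromkeys(forward_index[doc_id]):
            inverted[term].append(doc_id)
    return dict(inverted)

inverted = invert(forward)
print(inverted["search"])
```

Answering "which documents contain this term?" is now a single dictionary lookup instead of a scan over every document.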

Page 15: Search Me: Using Lucene.Net

Analysis

(taken from Apple’s excellent “Search Basics” article at http://bit.ly/JO59kH )

Page 16: Search Me: Using Lucene.Net

Tokenization

(taken from Thomas Koch’s presentation “Search Basics and Lucene” at http://bit.ly/JTOrnH )

Page 17: Search Me: Using Lucene.Net

Tokenization

Normalization: “Gramåna” > “gramana”

Stemming: “preschooling” > “school”
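Both steps can be approximated with the Python standard library. This is a sketch only: normalize strips diacritics via Unicode decomposition, and naive_stem uses small hypothetical prefix and suffix lists chosen to reproduce the slide's example. Suffix-stripping stemmers such as Porter's remove suffixes only, so the prefix handling here is an assumption, not how the Snowball contrib behaves.

```python
import unicodedata

def normalize(token):
    # Decompose accented characters, drop the combining marks, then lowercase.
    decomposed = unicodedata.normalize("NFKD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

PREFIXES = ("pre",)             # hypothetical list, for illustration only
SUFFIXES = ("ing", "ed", "s")   # hypothetical list, for illustration only

def naive_stem(token):
    # Strip at most one prefix and one suffix, keeping at least 3 characters.
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= 3:
            token = token[len(prefix):]
            break
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[: -len(suffix)]
            break
    return token

print(normalize("Gram\u00e5na"))    # "\u00e5" is the accented "a" from the slide
print(naive_stem("preschooling"))
```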

Page 18: Search Me: Using Lucene.Net

Norms

(taken from Thomas Koch’s presentation “Search Basics and Lucene” at http://bit.ly/JTOrnH )

Page 19: Search Me: Using Lucene.Net

Time to Look at Some Code

Page 20: Search Me: Using Lucene.Net

Getting a Query

Two options:

Parse a search string using a QueryParser class.

Programmatically build a query.

QueryParser can build very complex queries very quickly, but requires the user to provide a query string.

Programmatic building of a query requires less overhead for simple queries.
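The two routes lead to the same kind of query object, which the following sketch tries to make concrete. It is plain Python, not Lucene.Net: the Term and And classes and the parse function are hypothetical, and the grammar (field:value words joined by AND) is only loosely modeled on query-string syntax.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Term:
    field: str
    text: str

@dataclass(frozen=True)
class And:
    clauses: tuple

def parse(query_string, default_field="body"):
    # Option 1: turn a user-supplied string into a query object.
    clauses = []
    for word in query_string.split():
        if word.upper() == "AND":
            continue
        name, sep, text = word.partition(":")
        if not sep:
            # No "field:" prefix, so fall back to the default field.
            name, text = default_field, word
        clauses.append(Term(name, text.lower()))
    return clauses[0] if len(clauses) == 1 else And(tuple(clauses))

# Option 2: build the same query directly in code, with no string to parse.
built = And((Term("title", "lucene"), Term("body", "search")))

parsed = parse("title:lucene AND search")
print(parsed == built)
```

Parsing pays off when users type their own queries; building the object directly avoids the parsing step when the application already knows the fields and values it wants.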

Page 21: Search Me: Using Lucene.Net

General Query Types

(taken from the Wikipedia entry “Information Retrieval” at http://bitly.com/T1Qbw)

Page 22: Search Me: Using Lucene.Net

Some Lucene Query Types

TermQuery (general purpose)

BooleanQuery

MultiPhraseQuery

SpanQuery

WildcardQuery

FilteredQuery

MoreLikeThisQuery

BoostingQuery

FuzzyQuery

ConstantScoreRangeQuery
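To give a feel for how a few of these query types differ, the sketch below evaluates term, wildcard, fuzzy, and boolean lookups against a small hand-made term dictionary. It is plain Python for illustration: the postings data and all function names are assumptions, and the matching rules (shell-style wildcards, edit distance of at most one) are simplifications rather than Lucene.Net's implementations.

```python
from fnmatch import fnmatchcase

# Sample term dictionary: term -> ids of documents containing it.
POSTINGS = {
    "lucene": {1, 2},
    "lucent": {3},
    "search": {1, 3},
    "solr": {2},
}

def term_query(term):
    # Exact match on a single term.
    return set(POSTINGS.get(term, set()))

def wildcard_query(pattern):
    # "*" matches any run of characters, "?" matches exactly one.
    hits = set()
    for term, docs in POSTINGS.items():
        if fnmatchcase(term, pattern):
            hits |= docs
    return hits

def edit_distance(a, b):
    # Classic Levenshtein distance, computed one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def fuzzy_query(term, max_edits=1):
    # Match every indexed term within max_edits of the requested term.
    hits = set()
    for candidate, docs in POSTINGS.items():
        if edit_distance(term, candidate) <= max_edits:
            hits |= docs
    return hits

def boolean_query(must=(), must_not=()):
    # Intersect the required result sets, then remove the excluded ones.
    hits = set.intersection(*must) if must else set()
    for excluded in must_not:
        hits -= excluded
    return hits

print(wildcard_query("luc*"))
print(fuzzy_query("lucine"))
print(boolean_query(must=[term_query("search")], must_not=[term_query("lucene")]))
```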

Page 23: Search Me: Using Lucene.Net

Time to Look at More Code

Page 24: Search Me: Using Lucene.Net

Lucene.Net Contribs

Spatial (geo-spatial search)

Similarity

SimpleFacetedSearch

Highlighter

SpellChecker

WordNet (synonyms)

Snowball (stemming library)

RegEx

Page 25: Search Me: Using Lucene.Net

Thanks for your time and attention.

twitter: @zgramana

blog: http://www.excitabyte.com/

Email: zgramanaATgee mail dot com

That’s All!