Introduction to Full-text search

Upload: cristian-vat

Post on 21-Jan-2015

TRANSCRIPT

Page 1: Introduction to Full-Text Search

Introduction to Full-text search

Page 2: Introduction to Full-Text Search

About me
Full-time (mostly) Java developer
Part-time general technical/sysadmin/geeky guy
Interested in: hard problems, search, performance, parallelism, scalability

Page 3: Introduction to Full-Text Search

Why should you care?

Page 4: Introduction to Full-Text Search

Because every application needs search

Page 5: Introduction to Full-Text Search

We live in an era of big, complex and connected applications.

Page 6: Introduction to Full-Text Search

That means a lot of data

Page 7: Introduction to Full-Text Search

But it's no use if you can't find anything!

Page 8: Introduction to Full-Text Search

But it's no use if you can't quickly find something relevant

Page 9: Introduction to Full-Text Search

Quick

Page 10: Introduction to Full-Text Search

Relevant

Page 11: Introduction to Full-Text Search

Customized Experience

Page 12: Introduction to Full-Text Search

You can't win by being generic, but you can be the best for your specific type of content.

Deathy's Tip

Page 13: Introduction to Full-Text Search

So back to our full-text search...

Page 14: Introduction to Full-Text Search

Some core ideas
"index" (or "inverted index")
"document"

Page 15: Introduction to Full-Text Search

Don't be too quick in deciding what a "document" is. Put some thought into it or you'll regret it (speaking from a lot of experience)

Deathy’s Tip

Page 16: Introduction to Full-Text Search

First we need some documents, more specifically some text samples

Page 17: Introduction to Full-Text Search

Documents
Doc1: "The cow says moo"
Doc2: "The dog says woof"
Doc3: "The cow-dog says moof"

"Stolen" from http://www.slideshare.net/tomdyson/being-google

Page 18: Introduction to Full-Text Search

Important: individual words are the basis for the index

Page 19: Introduction to Full-Text Search

Individual words

index = [
    "cow", "dog", "moo", "moof", "The", "says", "woof"
]

Page 20: Introduction to Full-Text Search

For each word we have a list of documents to which it belongs

Page 21: Introduction to Full-Text Search

Words, with appearances

index = {
    "cow": ["Doc1", "Doc3"],
    "dog": ["Doc2", "Doc3"],
    "moo": ["Doc1"],
    "moof": ["Doc3"],
    "The": ["Doc1", "Doc2", "Doc3"],
    "says": ["Doc1", "Doc2", "Doc3"],
    "woof": ["Doc2"]
}
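A minimal sketch of how an index like this could be built in Python. The `build_index` helper is an illustrative name, not something from the slides, and it assumes words are split on whitespace and hyphens (so "cow-dog" counts as both "cow" and "dog").

```python
from collections import defaultdict

documents = {
    "Doc1": "The cow says moo",
    "Doc2": "The dog says woof",
    "Doc3": "The cow-dog says moof",
}

def build_index(docs):
    """Map each word to the sorted list of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: hyphens separate words, just like spaces do.
        for word in text.replace("-", " ").split():
            index[word].add(doc_id)
    # Sorted lists keep the output stable and easy to read.
    return {word: sorted(doc_ids) for word, doc_ids in index.items()}

index = build_index(documents)
print(index["cow"])
```

Looking up a word is then a single dictionary access, which is why this structure is quick to query.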

Page 22: Introduction to Full-Text Search

Q1: Find documents which contain "moo"
A1: index["moo"]

Page 23: Introduction to Full-Text Search

Q2: Find documents which contain "The" and "dog"
A2: set(index["The"]) & set(index["dog"])

Page 24: Introduction to Full-Text Search

Try to think of search as unions/intersections or other filters on sets.
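To make the set idea concrete, here is a small sketch using Python's built-in set operators on a cut-down version of the cow/dog index. Storing the document lists as sets is an assumption made here for brevity.

```python
# Postings stored as sets so the set operators can be used directly.
index = {
    "cow": {"Doc1", "Doc3"},
    "dog": {"Doc2", "Doc3"},
    "moo": {"Doc1"},
    "The": {"Doc1", "Doc2", "Doc3"},
}

both = index["cow"] & index["dog"]     # AND: intersection
either = index["moo"] | index["dog"]   # OR: union
without = index["The"] - index["cow"]  # NOT: difference

print(sorted(both), sorted(either), sorted(without))
```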

Page 25: Introduction to Full-Text Search

Most searches are using simple terms and "boolean" operators.

Page 26: Introduction to Full-Text Search

"boolean"
"word" - word MAY/SHOULD appear in document
"+word" - word MUST appear in document
"-word" - word MUST NOT appear in document
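One possible way to evaluate these three operators on the cow/dog index, sketched in Python. The `boolean_search` function is hypothetical, and it makes one assumption worth calling out: when plain terms are given, at least one of them has to match. Real engines may instead treat plain terms as optional and use them only for ranking.

```python
index = {
    "cow": ["Doc1", "Doc3"],
    "dog": ["Doc2", "Doc3"],
    "moo": ["Doc1"],
    "moof": ["Doc3"],
    "The": ["Doc1", "Doc2", "Doc3"],
    "says": ["Doc1", "Doc2", "Doc3"],
    "woof": ["Doc2"],
}
ALL_DOCS = {"Doc1", "Doc2", "Doc3"}

def boolean_search(index, all_docs, must=(), should=(), must_not=()):
    """Return the sorted ids of documents matching the three term groups."""
    result = set(all_docs)
    # "+word": every MUST term narrows the result (intersection).
    for term in must:
        result &= set(index.get(term, ()))
    # "word": assumption, at least one SHOULD term has to match (union).
    if should:
        any_should = set()
        for term in should:
            any_should |= set(index.get(term, ()))
        result &= any_should
    # "-word": every MUST NOT term removes documents (difference).
    for term in must_not:
        result -= set(index.get(term, ()))
    return sorted(result)

# Equivalent of the query: +The cow dog -moof
hits = boolean_search(index, ALL_DOCS,
                      must=["The"], should=["cow", "dog"], must_not=["moof"])
print(hits)
```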

Page 27: Introduction to Full-Text Search

Example Query: “+type:book content:java content:python -content:ruby”

Find books, with "java" or "python" in content but which don't contain "ruby" in content.

Page 28: Introduction to Full-Text Search

Err...wait...what the hell does "content:java" mean?

Page 29: Introduction to Full-Text Search

Reviewing the "document" concept

Page 30: Introduction to Full-Text Search

An index consists of one or more documents

Page 31: Introduction to Full-Text Search

Each document consists of one or more "field"s. Each field has a name and content.

Page 32: Introduction to Full-Text Search

Field examples
content
title
author
publication date
etc.

Page 33: Introduction to Full-Text Search

So how are fields handled internally?

In most cases, very simply: a word belongs to a specific field, so the field name can be stored directly in the term.

Page 34: Introduction to Full-Text Search

New index example

index = {
    "content:cow": ["Doc1", "Doc3"],
    "content:dog": ["Doc2", "Doc3"],
    "content:moo": ["Doc1"],
    "content:moof": ["Doc3"],
    "content:The": ["Doc1", "Doc2", "Doc3"],
    "content:says": ["Doc1", "Doc2", "Doc3"],
    "content:woof": ["Doc2"],
    "type:example_documents": ["Doc1", "Doc2", "Doc3"]
}
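A rough sketch of building such a field-aware index in Python, by prefixing every word with its field name. The `build_fielded_index` helper and the shape of the `documents` dictionary are assumptions made for illustration, as is splitting on whitespace and hyphens.

```python
from collections import defaultdict

documents = {
    "Doc1": {"type": "example_documents", "content": "The cow says moo"},
    "Doc2": {"type": "example_documents", "content": "The dog says woof"},
    "Doc3": {"type": "example_documents", "content": "The cow-dog says moof"},
}

def build_fielded_index(docs):
    """Index every word under a "field:word" term."""
    index = defaultdict(list)
    for doc_id, fields in docs.items():
        for field_name, text in fields.items():
            # set() avoids listing a document twice for a repeated word.
            for word in set(text.replace("-", " ").split()):
                index[f"{field_name}:{word}"].append(doc_id)
    return dict(index)

index = build_fielded_index(documents)
print(index["content:cow"])
```

A query term such as "content:cow" is then just another key in the same dictionary.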

Page 35: Introduction to Full-Text Search

But enough of that

Page 36: Introduction to Full-Text Search

We missed the most important thing!

Page 37: Introduction to Full-Text Search

We saved the most important thing for last!

Page 38: Introduction to Full-Text Search

Analysis

Page 39: Introduction to Full-Text Search

or for mortals: how you get from a long text to small tokens/words/terms

Page 40: Introduction to Full-Text Search

…borrowing from Lucene naming/API...

Page 41: Introduction to Full-Text Search

(One) Tokenizer

Page 42: Introduction to Full-Text Search

and zero or more Filters

Page 43: Introduction to Full-Text Search

First...

Page 44: Introduction to Full-Text Search

Some more interesting documents
Doc1: "The quick brown fox jumps over the lazy dog"
Doc2: "All Daleks: Exterminate! Exterminate! EXTERMINATE!! EXTERMINATE!!!"
Doc3: "And the final score is: no TARDIS, no screwdriver, two minutes to spare. Who da man?!"

Page 45: Introduction to Full-Text Search

Tokenizer: Breaks up a single string into smaller tokens.

Page 46: Introduction to Full-Text Search

You define what splitting rules are best for you.

Page 47: Introduction to Full-Text Search

Whitespace Tokenizer
Just break into tokens wherever there is some space. So we get something like:

Page 48: Introduction to Full-Text Search

Doc1: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

Doc2: ["All", "Daleks:", "Exterminate!", "Exterminate!", "EXTERMINATE!!", "EXTERMINATE!!!"]

Doc3: ["And", "the", "final", "score", "is:", "no", "TARDIS,", "no", "screwdriver,", "two", "minutes", "to", "spare.", "Who", "da", "man?!"]
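In Python, a whitespace tokenizer along these lines can be sketched with `str.split()`. The `whitespace_tokenize` name is just illustrative and this is not Lucene's implementation.

```python
def whitespace_tokenize(text):
    """Break a string into tokens wherever there is whitespace."""
    # str.split() with no arguments splits on any run of whitespace.
    return text.split()

doc2 = "All Daleks: Exterminate! Exterminate! EXTERMINATE!! EXTERMINATE!!!"
tokens = whitespace_tokenize(doc2)
print(tokens)
```

Punctuation and case are left untouched, which is exactly the problem the next slides address.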

Page 49: Introduction to Full-Text Search

But wait, that doesn't look right...

Page 50: Introduction to Full-Text Search

So we apply Filters

Page 51: Introduction to Full-Text Search

A filter transforms a single token into another single token, multiple tokens, or no token at all. You can apply several of them, in a specific order.

Page 52: Introduction to Full-Text Search

Filter 1: lower-case (since we don't want the search to be case-sensitive)

Page 53: Introduction to Full-Text Search

Result

Doc1: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

Doc2: ["all", "daleks:", "exterminate!", "exterminate!", "exterminate!!", "exterminate!!!"]

Doc3: ["and", "the", "final", "score", "is:", "no", "tardis,", "no", "screwdriver,", "two", "minutes", "to", "spare.", "who", "da", "man?!"]

Page 54: Introduction to Full-Text Search

Filter 2: remove punctuation

Page 55: Introduction to Full-Text Search

Result

Doc1: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

Doc2: ["all", "daleks", "exterminate", "exterminate", "exterminate", "exterminate"]

Doc3: ["and", "the", "final", "score", "is", "no", "tardis", "no", "screwdriver", "two", "minutes", "to", "spare", "who", "da", "man"]
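The tokenizer plus the two filters can be sketched as a small pipeline in Python. The function names are made up for this example, and the punctuation filter assumes it is enough to strip ASCII punctuation from the start and end of each token.

```python
import string

def lowercase_filter(tokens):
    """Filter 1: make every token lower-case."""
    return [token.lower() for token in tokens]

def punctuation_filter(tokens):
    """Filter 2: strip leading/trailing punctuation, drop empty tokens."""
    stripped = (token.strip(string.punctuation) for token in tokens)
    return [token for token in stripped if token]

def analyze(text, filters=(lowercase_filter, punctuation_filter)):
    """One tokenizer (whitespace), then zero or more filters, in order."""
    tokens = text.split()
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens

doc3 = ("And the final score is: no TARDIS, no screwdriver, "
        "two minutes to spare. Who da man?!")
print(analyze(doc3))
```

Passing a different `filters` tuple is all it takes to experiment with another analysis chain.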

Page 56: Introduction to Full-Text Search

Add more filter seasoning until it tastes just right.

Page 57: Introduction to Full-Text Search

Lots of things you can do with filters
case normalization
removing unwanted/unneeded characters
transliteration/normalization of special characters
stopwords
synonyms
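As an illustration of filters that produce no token or several tokens, here is a sketch of a stopword filter and a synonym filter in Python. The word lists are invented for this example; real ones depend entirely on your content.

```python
# Hypothetical word lists, chosen only for this example.
STOPWORDS = {"the", "over"}
SYNONYMS = {"quick": ["quick", "fast"]}

def stopword_filter(tokens):
    """One token in, no token out: drop very common words."""
    return [token for token in tokens if token not in STOPWORDS]

def synonym_filter(tokens):
    """One token in, several tokens out: add known synonyms."""
    expanded = []
    for token in tokens:
        expanded.extend(SYNONYMS.get(token, [token]))
    return expanded

tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
result = synonym_filter(stopword_filter(tokens))
print(result)
```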

Page 58: Introduction to Full-Text Search

Possibilities are endless, enjoy experimenting with them!

Page 59: Introduction to Full-Text Search

Just one warning…

Page 60: Introduction to Full-Text Search

Always use the same analysis rules when indexing and when parsing search text entered by the user!
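A small Python sketch of why this matters. The index is built with one `analyze` function (lower-case, strip punctuation), and the same query is then run once through that analyzer and once through a plain whitespace split. All names here are illustrative, and the search simply requires every query term to match.

```python
import string
from collections import defaultdict

def analyze(text):
    """Whitespace tokenizer + lower-case + strip surrounding punctuation."""
    tokens = (t.lower().strip(string.punctuation) for t in text.split())
    return [t for t in tokens if t]

documents = {
    "Doc1": "The quick brown fox jumps over the lazy dog",
    "Doc2": "All Daleks: Exterminate! Exterminate! EXTERMINATE!! EXTERMINATE!!!",
    "Doc3": "And the final score is: no TARDIS, no screwdriver, "
            "two minutes to spare. Who da man?!",
}

# Index time: every document goes through analyze().
index = defaultdict(set)
for doc_id, text in documents.items():
    for term in analyze(text):
        index[term].add(doc_id)

def search(query, analyzer):
    """Return documents containing every term the analyzer produces."""
    terms = analyzer(query)
    if not terms:
        return []
    return sorted(set.intersection(*(index.get(t, set()) for t in terms)))

same_rules = search("EXTERMINATE!", analyze)         # query term: "exterminate"
different_rules = search("EXTERMINATE!", str.split)  # query term: "EXTERMINATE!"
print(same_rules, different_rules)
```

With mismatched rules the query term never appears in the index, so the search silently returns nothing.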

Page 61: Introduction to Full-Text Search

I bet you want to start working with this

Page 62: Introduction to Full-Text Search

Implementations

Lucene (Java main, .NET, Python, C)
    SOLR if using from other languages
Xapian
Sphinx
OpenFTS
MySQL Full-Text Search (kind of…)

Page 63: Introduction to Full-Text Search

Related Books

Page 64: Introduction to Full-Text Search

The theory
Introduction to Information Retrieval
http://nlp.stanford.edu/IR-book/information-retrieval-book.html
Warning: contains a lot of math.

Page 65: Introduction to Full-Text Search

The practice (for Lucene at least):
Lucene in Action, second edition:
http://www.manning.com/hatcher3/
Warning: contains a lot of Java

Page 66: Introduction to Full-Text Search

Questions?

Page 67: Introduction to Full-Text Search

Contact me (with interesting problems involving lots of data)

@[email protected]
blog.deathy.info/ (yeah…I know…)

Page 68: Introduction to Full-Text Search

Fin.

Page 69: Introduction to Full-Text Search

So where’s the Halloween Party?

Happy Halloween !