Introduction to Information Retrieval
Hinrich Schütze and Christina Lioma
Lecture 3: Dictionaries and tolerant retrieval
Outline
❶ Recap
❷ Dictionaries
❸ Wildcard queries
❹ Edit distance
❺ Spelling correction
❻ Soundex
Dictionary data structures for inverted indexes
• The dictionary data structure stores the term vocabulary, document frequency, and pointers to each postings list … in what data structure?
Sec. 3.1
A naïve dictionary
• An array of structs:
  char[20]   int        Postings *
  20 bytes   4/8 bytes  4/8 bytes
• How do we store a dictionary in memory efficiently?
• How do we quickly look up elements at query time?
Sec. 3.1
Dictionary data structures
• Two main choices:
  – Hash table
  – Tree
• Some IR systems use hash tables, some use trees
Sec. 3.1
Hash tables
• Each vocabulary term is hashed to an integer
  – (We assume you’ve seen hash tables before)
• Pros:
  – Lookup is faster than for a tree: O(1)
• Cons:
  – No easy way to find minor variants:
    • judgment/judgement
  – No prefix search [tolerant retrieval]
  – If the vocabulary keeps growing, we occasionally need to do the expensive operation of rehashing everything
Sec. 3.1
Trees
• Simplest: binary tree
• More usual: B-trees
• Trees require a standard ordering of characters and hence of strings … but we standardly have one
• Pros:
  – Solves the prefix problem (terms starting with hyp)
• Cons:
  – Slower lookup: O(log M) [and this requires a balanced tree]
  – Rebalancing binary trees is expensive
    • But B-trees mitigate the rebalancing problem
Sec. 3.1
Binary tree
[figure: an example binary tree over dictionary terms]
Tree: B-tree
• Definition: Every internal node has a number of children in the interval [a, b], where a and b are appropriate natural numbers, e.g., [2, 4].
[figure: a B-tree whose children cover the ranges a–hu, hy–m, and n–z]
Sec. 3.1
Wild-card queries: *
• mon*: find all docs containing any word beginning with “mon”.
• Easy with a binary tree (or B-tree) lexicon: retrieve all words in the range mon ≤ w < moo
• *mon: find words ending in “mon”: harder
  – Maintain an additional B-tree for terms written backwards. Then retrieve all words in the range nom ≤ w < non.
• Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent?
Sec. 3.2
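The mon ≤ w < moo range trick can be sketched with a sorted list standing in for the B-tree’s ordered keys (the lexicon below is invented for illustration):

```python
import bisect

# Toy lexicon standing in for the sorted keys of a B-tree (made-up terms).
lexicon = sorted(["moat", "mon", "monday", "money", "monk", "moo", "moon", "torch"])

def prefix_range(prefix):
    """Answer prefix* by retrieving all words w with prefix <= w < successor,
    e.g. mon <= w < moo for the query mon*."""
    lo = bisect.bisect_left(lexicon, prefix)
    # Smallest string greater than every string starting with `prefix`:
    # bump the last character (sufficient for plain ASCII prefixes).
    successor = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    hi = bisect.bisect_left(lexicon, successor)
    return lexicon[lo:hi]

print(prefix_range("mon"))  # ['mon', 'monday', 'money', 'monk']
```

The backwards B-tree for *mon works the same way: reverse every term at index time, reverse the query suffix, and do the same range lookup.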
How to handle * in the middle of a term
• Example: m*nchen
• We could look up m* and *nchen in the B-tree and intersect the two term sets
  – Expensive
• Alternative: permuterm index
• Basic idea: rotate every wildcard query so that the * occurs at the end
• Store each of these rotations in the dictionary, say, in a B-tree
B-trees handle *’s at the end of a query term
• How can we handle *’s in the middle of a query term?
  – co*tion
• We could look up co* AND *tion in a B-tree and intersect the two term sets
  – Expensive
• The solution: transform wild-card queries so that the *’s occur at the end
• This gives rise to the permuterm index.
Sec. 3.2
Permuterm index
• For term HELLO: add hello$, ello$h, llo$he, lo$hel, and o$hell to the B-tree, where $ is a special symbol
Permuterm → term mapping
[figure: each rotation of hello$ maps back to the term hello]
Permuterm index
• For HELLO, we’ve stored: hello$, ello$h, llo$he, lo$hel, and o$hell
• Queries:
  – For X, look up X$
  – For X*, look up $X*
  – For *X, look up X$*
  – For *X*, look up X*
  – For X*Y, look up Y$X*
• Example: For hel*o, look up o$hel*
• “Permuterm index” would more accurately be called a permuterm tree, but permuterm index is the more common name.
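The rotation rules above can be sketched in a few lines. The plain dict and the linear prefix scan below are simplifying assumptions; a real implementation would store the rotations in a B-tree and use its range search:

```python
def rotations(term):
    """All rotations of term+'$' (hello -> hello$, ello$h, llo$he, ...)."""
    aug = term + "$"
    return [aug[i:] + aug[:i] for i in range(len(aug))]

def build_permuterm(vocab):
    # Maps each rotation back to its source term (dict stands in for the B-tree).
    return {rot: t for t in vocab for rot in rotations(t)}

def lookup(index, query):
    """Single-* lookup: rotate A*B into the prefix B$A, then prefix-match."""
    a, b = query.split("*")            # assumes exactly one '*'
    prefix = b + "$" + a
    return {term for rot, term in index.items() if rot.startswith(prefix)}

idx = build_permuterm(["hello", "help", "shell"])
print(lookup(idx, "hel*o"))   # {'hello'}  (looks up prefix o$hel)
print(lookup(idx, "*ell"))    # {'shell'}  (looks up prefix ell$)
print(lookup(idx, "hel*"))    # {'hello', 'help'}  (looks up prefix $hel)
```

Note how each slide rule falls out of the single rotation A*B → B$A: for X* we get $X, for *X we get X$, for X*Y we get Y$X.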
Processing a lookup in the permuterm index
• Rotate the query wildcard to the right
• Use B-tree lookup as before
• Problem: the permuterm index more than quadruples the size of the dictionary compared to a regular B-tree (an empirical number).
k-gram indexes
• More space-efficient than the permuterm index
• Enumerate all character k-grams (sequences of k characters) occurring in a term
• 2-grams are called bigrams.
• Example: from “April is the cruelest month” we get the bigrams:
  $a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ue el le es st t$ $m mo on nt th h$
• $ is a special word-boundary symbol, as before.
• Maintain an inverted index from bigrams to the terms that contain the bigram
Postings list in a 3-gram inverted index
[figure: a 3-gram pointing to the sorted list of terms that contain it]
k-gram (bigram, trigram, …) indexes
• Note that we now have two different types of inverted indexes:
  – The term–document inverted index for finding documents based on a query consisting of terms
  – The k-gram index for finding terms based on a query consisting of k-grams
Processing wildcarded terms in a bigram index
• Query mon* can now be run as: $m AND mo AND on
• This gets us all terms with the prefix mon …
• … but also many “false positives” like MOON.
• We must postfilter these terms against the query.
• Surviving terms are then looked up in the term–document inverted index.
• k-gram index vs. permuterm index:
  – The k-gram index is more space-efficient.
  – The permuterm index doesn’t require postfiltering.
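The mon* example can be sketched end-to-end (the vocabulary is invented; note how moon survives the bigram intersection and is removed only by the postfilter):

```python
from collections import defaultdict

def bigrams(term):
    """Bigrams of $term$ with the boundary symbol $, as on the slides."""
    aug = "$" + term + "$"
    return {aug[i:i + 2] for i in range(len(aug) - 1)}

def build_index(vocab):
    idx = defaultdict(set)
    for t in vocab:
        for g in bigrams(t):
            idx[g].add(t)
    return idx

def prefix_query(idx, query):
    """Run e.g. mon* as $m AND mo AND on, then postfilter false positives."""
    prefix = query.rstrip("*")
    aug = "$" + prefix                       # open-ended: no trailing $
    grams = {aug[i:i + 2] for i in range(len(aug) - 1)}
    candidates = set.intersection(*(idx[g] for g in grams))
    return {t for t in candidates if t.startswith(prefix)}   # postfilter

idx = build_index(["month", "monday", "moon", "melon"])
print(prefix_query(idx, "mon*"))  # {'month', 'monday'}; 'moon' is filtered out
```

Without the final `startswith` check, moon would be returned: it contains $m, mo, and on, just not contiguously as a prefix.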
Processing wild-card queries
• As before, we must execute a Boolean query for each enumerated, filtered term.
• Wild-cards can result in expensive query execution (very large disjunctions…)
  – pyth* AND prog*
• If you encourage “laziness”, people will respond!
• Which web search engines allow wildcard queries?
[mock search box: “Type your search terms, use ‘*’ if you need to. E.g., Alex* will match Alexander.”]
Sec. 3.2.2
Distance between misspelled word and “correct” word
• We will study several alternatives:
  – Weighted edit distance
  – Edit distance and Levenshtein distance
  – k-gram overlap
Weighted edit distance
• As above, but the weight of an operation depends on the characters involved.
• Meant to capture keyboard errors, e.g., m is more likely to be mistyped as n than as q.
• Therefore, replacing m by n is a smaller edit distance than replacing m by q.
Edit distance
• The edit distance between string s1 and string s2 is the minimum number of basic operations that convert s1 to s2.
• Levenshtein distance: the admissible basic operations are insert, delete, and replace
• Levenshtein distance dog–do: 1
• Levenshtein distance cat–cart: 1
• Levenshtein distance cat–cut: 1
• Levenshtein distance cat–act: 2
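These examples can be checked with the standard dynamic-programming algorithm: each cell holds the minimum over copy/replace (diagonal), delete (above), and insert (left). A minimal sketch:

```python
def levenshtein(s1, s2):
    """Minimum number of inserts, deletes, and replaces turning s1 into s2."""
    m, n = len(s1), len(s2)
    # dist[i][j] = distance between s1[:i] and s2[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                      # delete all of s1[:i]
    for j in range(n + 1):
        dist[0][j] = j                      # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # delete
                             dist[i][j - 1] + 1,        # insert
                             dist[i - 1][j - 1] + sub)  # copy or replace
    return dist[m][n]

for a, b in [("dog", "do"), ("cat", "cart"), ("cat", "cut"), ("cat", "act")]:
    print(a, b, levenshtein(a, b))  # 1, 1, 1, 2, as on the slide
```

Keeping the full matrix (rather than only the previous row) is what later allows reading the editing operations back out by tracing the cheapest path from the bottom-right cell.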
Levenshtein distance: Algorithm
[sequence of figures building up the dynamic-programming matrix step by step]
Each cell of the Levenshtein matrix
• Cost of getting here from my upper-left neighbor (copy or replace)
• Cost of getting here from my upper neighbor (delete)
• Cost of getting here from my left neighbor (insert)
• Each cell holds the minimum of the three possible “movements”: the cheapest way of getting here
Levenshtein distance: Example
[figure: a fully computed Levenshtein matrix]
Exercise
❶ Compute the Levenshtein distance matrix for OSLO – SNOW
❷ What are the Levenshtein editing operations that transform cat into catcat?
How do I read out the editing operations that transform OSLO into SNOW?
[figures: tracing the cheapest path back through the matrix; each step corresponds to one insert, delete, replace, or copy]
Spell correction
• Two main flavors:
  – Isolated word
    • Check each word on its own for misspelling
    • Will not catch typos resulting in correctly spelled words
    • e.g., typing form instead of from
  – Context-sensitive
    • Look at surrounding words
    • e.g., I flew form Heathrow to Narita.
Sec. 3.3
Correcting queries
• First: isolated-word spelling correction
• Premise 1: There is a list of “correct words” from which the correct spellings come.
• Premise 2: We have a way of computing the distance between a misspelled word and a correct word.
• Simple spelling correction algorithm: return the “correct” word that has the smallest distance to the misspelled word.
• Example: informaton → information
• For the list of correct words, we can use the vocabulary of all words that occur in our collection.
• Why is this problematic?
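A minimal sketch of this simple algorithm (the vocabulary is invented, ties are broken arbitrarily, and a real system would narrow the candidate set — e.g. with a k-gram index — rather than scan the whole vocabulary):

```python
def edit_distance(a, b):
    """Compact Levenshtein DP keeping only the previous row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete
                           cur[j - 1] + 1,              # insert
                           prev[j - 1] + (ca != cb)))   # copy or replace
        prev = cur
    return prev[-1]

def correct(word, vocab):
    """Return the vocabulary word with the smallest edit distance to `word`."""
    return min(vocab, key=lambda v: edit_distance(word, v))

vocab = ["information", "retrieval", "informant", "formation"]
print(correct("informaton", vocab))  # information (distance 1)
```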
Alternatives to using the term vocabulary
• A standard dictionary (Webster’s, OED, etc.)
• An industry-specific dictionary (for specialized IR systems)
k-gram indexes for spelling correction
• Enumerate all k-grams in the query term
• Example: bigram index, misspelled word bordroom
  – Bigrams: bo, or, rd, dr, ro, oo, om
• Use the k-gram index to retrieve “correct” words that match query-term k-grams
• Threshold by number of matching k-grams
k-gram indexes for spelling correction: bordroom
[figure: the postings lists for the bigrams of bordroom, intersected to find candidate corrections]
Jaccard coefficient
• A commonly used measure of overlap
• Let X and Y be two sets; then the Jaccard coefficient is
  J(X, Y) = |X ∩ Y| / |X ∪ Y|
• Equals 1 when X and Y have the same elements and 0 when they are disjoint
• X and Y don’t have to be the same size
Sec. 3.3.4
Introduction to Information RetrievalIntroduction to Information Retrieval
Example : bigrams match• Query A = bordroom reaches term B = boardroom• A = bo or rd dr ro oo om• B = bo oa ar rd dr ro oo om
• = 2 / (7 + 8 -6) = 6/9
60
BABA /
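The bordroom/boardroom numbers can be reproduced directly (plain bigrams without the $ boundary symbol, matching the sets on this slide):

```python
def bigrams(term):
    # Plain bigrams without boundary markers, matching the slide's sets.
    return {term[i:i + 2] for i in range(len(term) - 1)}

def jaccard(x, y):
    """|X intersect Y| / |X union Y|."""
    return len(x & y) / len(x | y)

a, b = bigrams("bordroom"), bigrams("boardroom")
print(len(a), len(b), len(a & b))   # 7 8 6
print(round(jaccard(a, b), 2))      # 0.67, i.e. 6/9
```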
Context-sensitive spell correction
• Text: I flew from Heathrow to Narita.
• Consider the phrase query “flew form Heathrow”
• We’d like to respond: Did you mean “flew from Heathrow”? — because no docs matched the query phrase.
Sec. 3.3.5
Context-sensitive correction
• Need surrounding context to catch this.
• First idea: retrieve dictionary terms close (in weighted edit distance) to each query term
• Now try all possible resulting phrases with one word “fixed” at a time
  – flew from heathrow
  – fled form heathrow
  – flea form heathrow
• Hit-based spelling correction: suggest the alternative that has lots of hits.
Sec. 3.3.5
Soundex
• Soundex is the basis for finding phonetic (as opposed to orthographic) alternatives.
• Example: chebyshev / tchebyscheff
• Algorithm:
  – Turn every token to be indexed into a 4-character reduced form
  – Do the same with query terms
  – Build and search an index on the reduced forms
Soundex algorithm
❶ Retain the first letter of the term.
❷ Change all occurrences of the following letters to ’0’ (zero): A, E, I, O, U, H, W, Y
❸ Change letters to digits as follows:
  – B, F, P, V to 1
  – C, G, J, K, Q, S, X, Z to 2
  – D, T to 3
  – L to 4
  – M, N to 5
  – R to 6
❹ Repeatedly remove one out of each pair of consecutive identical digits
❺ Remove all zeros from the resulting string; pad the resulting string with trailing zeros and return the first four positions, which will consist of a letter followed by three digits
Example: Soundex of HERMAN
• Retain H
• ERMAN → 0RM0N
• 0RM0N → 06505
• 06505 → 06505 (no consecutive identical digits to remove)
• 06505 → 655
• Return H655
• Note: HERMANN will generate the same code
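The five steps can be sketched directly from the slide. Note this follows the slide’s simplified variant; production Soundex implementations differ in small details (e.g. how a repeated code across the first letter is handled):

```python
def soundex(term):
    """Slide's Soundex: keep the first letter, map the rest to digits,
    collapse runs of identical digits, drop zeros, pad to letter + 3 digits."""
    codes = {}
    for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"),
                           ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in letters:
            codes[ch] = digit
    term = term.upper()
    digits = "".join(codes[ch] for ch in term[1:] if ch in codes)
    # Step 4: repeatedly remove one of each pair of consecutive identical digits.
    collapsed = "".join(d for i, d in enumerate(digits)
                        if i == 0 or d != digits[i - 1])
    # Step 5: drop zeros, pad with trailing zeros, keep four positions.
    return (term[0] + collapsed.replace("0", "") + "000")[:4]

print(soundex("HERMAN"))   # H655
print(soundex("HERMANN"))  # H655 -- same code, as the slide notes
```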
How useful is Soundex?
• Not very – for information retrieval
• OK for similar-sounding terms.
• The idea owes its origins to work in international police departments seeking to match names of wanted criminals, despite the names being spelled differently in different countries.