![Page 1: Dictionaries and Tolerant retrieval. This lecture Dictionary data structures “Tolerant” retrieval Wild-card queries Spelling correction Soundex](https://reader035.vdocument.in/reader035/viewer/2022081513/56649e765503460f94b77359/html5/thumbnails/1.jpg)
Dictionaries and Tolerant retrieval
This lecture
- Dictionary data structures
- “Tolerant” retrieval
  - Wild-card queries
  - Spelling correction
  - Soundex
Dictionary data structures for inverted indexes
The dictionary data structure stores the term vocabulary, document frequency, pointers to each postings list … in what data structure?
A naïve dictionary
An array of struct:
- term: char[20], 20 bytes
- document frequency: int, 4/8 bytes
- pointer to postings list: Postings *, 4/8 bytes
Questions:
- How do we store a dictionary in memory efficiently?
- How do we quickly look up elements at query time?
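
The fixed-width layout above can be sketched with Python's `struct` module. This is only an illustration: the choice of an 8-byte postings offset, the ASCII encoding, and the names `RECORD`, `pack_entry`, and `unpack_entry` are assumptions, not part of the lecture.

```python
import struct

# One fixed-width dictionary entry: char[20] term, 4-byte int document
# frequency, 8-byte postings offset (assumed; the slide allows 4 or 8 bytes).
# "<" selects standard sizes with no padding between fields.
RECORD = struct.Struct("<20siQ")


def pack_entry(term, doc_freq, postings_offset):
    """Serialize one entry; terms longer than 20 bytes do not fit."""
    raw = term.encode("ascii")
    if len(raw) > 20:
        raise ValueError("term does not fit in char[20]")
    # struct pads short byte strings with NUL bytes up to 20.
    return RECORD.pack(raw, doc_freq, postings_offset)


def unpack_entry(buf):
    """Inverse of pack_entry: strip the NUL padding from the term."""
    raw, doc_freq, postings_offset = RECORD.unpack(buf)
    return raw.rstrip(b"\x00").decode("ascii"), doc_freq, postings_offset


entry = pack_entry("aardvark", 5, 1024)
print(RECORD.size, unpack_entry(entry))
```

Every entry costs the full record size even for a one-letter term, which is the inefficiency the questions on this slide point at.
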
Dictionary data structures
- Two main choices:
  - Hash table
  - Tree
- Some IR systems use hashes, some trees
Hashes
- Each vocabulary term is hashed to an integer
  - (We assume you’ve seen hashtables before)
- Pros:
  - Lookup is faster than for a tree: O(1)
- Cons:
  - No easy way to find minor variants: judgment/judgement
  - No prefix search [tolerant retrieval]
  - If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything
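
A minimal sketch of the trade-off, using a Python `dict` as the hash table. The toy terms, document frequencies, and postings, and the helper names `lookup` and `prefix_scan`, are made up for illustration.

```python
# Toy hash-based dictionary: term -> (document frequency, postings list).
dictionary = {
    "judgment": (3, [1, 4, 7]),
    "hyperbole": (1, [2]),
    "hypothesis": (2, [2, 9]),
    "sickle": (1, [5]),
}


def lookup(term):
    """Exact-match lookup: expected O(1), independent of vocabulary size."""
    return dictionary.get(term)


def prefix_scan(prefix):
    """The hash gives no ordering, so a prefix search must scan every key."""
    return sorted(t for t in dictionary if t.startswith(prefix))


print(lookup("judgment"))    # found
print(lookup("judgement"))   # the minor variant hashes elsewhere: None
print(prefix_scan("hyp"))    # works, but costs O(M) for M terms
```
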
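Tree: binary tree
[Figure: binary tree over the dictionary. The root splits into a-m and n-z; a-m splits into a-hu and hy-m; n-z splits into n-sh and si-z. Leaf terms shown: aardvark, huygens, sickle, zygot.]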
Tree: B-tree
Definition: Every internal node has a number of children in the interval [a,b] where a, b are appropriate natural numbers, e.g., [2,4].
[Figure: B-tree node with three children covering a-hu, hy-m, and n-z.]
Trees
- Simplest: binary tree
- More usual: B+-trees
- Trees require a standard ordering of characters and hence strings … but we standardly have one
- Pros:
  - Solves the prefix problem (terms starting with hyp)
- Cons:
  - Slower: O(log M) [and this requires balanced tree]
  - Rebalancing binary trees is expensive
    - But B-trees mitigate the rebalancing problem
Wild-card queries
Wild-card queries: *
- mon*: find all docs containing any word beginning “mon”.
  - Easy with binary tree (or B-tree) lexicon: retrieve all words in range: mon ≤ w < moo
- *mon: find words ending in “mon”: harder
  - Maintain an additional B-tree for terms backwards.
  - Can retrieve all words in range: nom ≤ w < non.
- Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent ?
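
The two range lookups can be sketched with sorted lists and `bisect`, which stand in here for the forward and backward B-trees (an assumption made to keep the example short). The toy lexicon and the function names are illustrative.

```python
from bisect import bisect_left


def prefix_range(sorted_terms, prefix):
    """All terms w with prefix <= w < successor(prefix), e.g. mon <= w < moo."""
    if not prefix:
        return list(sorted_terms)
    # Successor of the prefix: bump its last character ("mon" -> "moo").
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect_left(sorted_terms, prefix)
    hi = bisect_left(sorted_terms, upper)
    return sorted_terms[lo:hi]


def trailing_wildcard(sorted_terms, prefix):
    """mon*: range lookup in the forward lexicon."""
    return prefix_range(sorted_terms, prefix)


def leading_wildcard(sorted_reversed_terms, suffix):
    """*mon: range lookup for 'nom' in the lexicon of reversed terms."""
    hits = prefix_range(sorted_reversed_terms, suffix[::-1])
    return sorted(r[::-1] for r in hits)


terms = sorted(["demon", "lemon", "monday", "money", "month", "moon", "salmon"])
reversed_terms = sorted(t[::-1] for t in terms)

print(trailing_wildcard(terms, "mon"))
print(leading_wildcard(reversed_terms, "mon"))
```
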
Query processing
At this point, we have an enumeration of all terms in the dictionary that match the wild-card query.
We still have to look up the postings for each enumerated term.
E.g., consider the query:
se*ate AND fil*er
This may result in the execution of many Boolean AND queries.
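
A small sketch of why the number of AND queries grows: each wild-card expands to a set of terms, and every pair of expansions is a Boolean AND of its own. The toy postings are invented, and the brute-force `fnmatchcase` expansion is only a stand-in for the tree-based enumeration described above.

```python
from fnmatch import fnmatchcase
from itertools import product

# Toy term -> sorted docID postings (invented for illustration).
POSTINGS = {
    "senate": [1, 3],
    "sedate": [2],
    "separate": [3, 5],
    "filter": [3, 4],
    "filler": [2, 5],
    "file": [1],
}


def expand(pattern, postings):
    """Enumerate dictionary terms matching a wild-card pattern."""
    return sorted(t for t in postings if fnmatchcase(t, pattern))


def wildcard_and(left_pattern, right_pattern, postings):
    """Run one AND query per pair of expanded terms and union the results."""
    docs = set()
    num_and_queries = 0
    for t1, t2 in product(expand(left_pattern, postings),
                          expand(right_pattern, postings)):
        docs |= set(postings[t1]) & set(postings[t2])
        num_and_queries += 1
    return sorted(docs), num_and_queries


print(wildcard_and("se*ate", "fil*er", POSTINGS))
```

With three expansions on the left and two on the right, this toy query already runs six AND queries.
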
B-trees handle *’s at the end of a query term
- How can we handle *’s in the middle of a query term?
  - co*tion
- We could look up co* AND *tion in a B-tree and intersect the two term sets
  - Expensive
- The solution: transform wild-card queries so that the *’s occur at the end
This gives rise to the Permuterm Index.
Permuterm index
- For term hello, index under: hello$, ello$h, llo$he, lo$hel, o$hell, $hello where $ is a special symbol.
- Queries:
  - X: lookup on X$
  - X*: lookup on $X*
  - *X: lookup on X$*
  - *X*: lookup on X*
  - X*Y: lookup on Y$X*
- Example: Query = hel*o, so X = hel, Y = o
  - Lookup o$hel*
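
A minimal permuterm sketch. It indexes every rotation of term$ (including $term, so that a trailing wild-card X* can be looked up as $X*), handles queries with exactly one *, and uses a sorted list with `bisect` in place of a B-tree. All function names are illustrative.

```python
from bisect import bisect_left


def rotations(term):
    """All rotations of term + '$', e.g. hello$ -> ello$h -> ... -> $hello."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]


def build_permuterm(terms):
    """Sorted list of (rotation, original term) pairs."""
    return sorted((rot, t) for t in terms for rot in rotations(t))


def rotate_query(query):
    """Rotate a single-* query so the * is at the end; return the prefix."""
    if query.count("*") != 1:
        raise ValueError("this sketch handles exactly one '*'")
    s = query + "$"
    star = s.index("*")
    return s[star + 1:] + s[:star]   # X*Y -> Y$X (the trailing * is implied)


def lookup(index, query):
    """Prefix lookup of the rotated query in the sorted rotation list."""
    prefix = rotate_query(query)
    keys = [rot for rot, _ in index]
    i = bisect_left(keys, prefix)
    found = set()
    while i < len(index) and index[i][0].startswith(prefix):
        found.add(index[i][1])
        i += 1
    return sorted(found)


index = build_permuterm(["hello", "help", "halo", "mono", "lemon"])
print(rotate_query("hel*o"), lookup(index, "hel*o"))
```
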
Permuterm query processing
- Rotate query wild-card to the right
- Now use B-tree lookup as before.
- Permuterm problem: ≈ quadruples lexicon size
  - Empirical observation for English.
Bigram (k-gram) indexes
- Enumerate all k-grams (sequence of k chars) occurring in any term
- e.g., from text “April is the cruelest month” we get the 2-grams (bigrams)
  - $a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru,ue,el,le,es,st,t$,$m,mo,on,nt,th,h$
  - $ is a special word boundary symbol
- Maintain a second inverted index from bigrams to dictionary terms that match each bigram.
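
The enumeration can be sketched in a few lines. Lowercasing and splitting on whitespace are simplifying assumptions, and `kgrams` and `text_kgrams` are illustrative names.

```python
def kgrams(term, k=2):
    """k-grams of a single term, with '$' marking both word boundaries."""
    padded = "$" + term.lower() + "$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]


def text_kgrams(text, k=2):
    """k-grams of every whitespace-separated word in the text, in order."""
    grams = []
    for word in text.split():
        grams.extend(kgrams(word, k))
    return grams


print(",".join(text_kgrams("April is the cruelest month")))
```
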
Bigram index example
The k-gram index finds terms based on a query consisting of k-grams
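[Figure: fragment of a bigram index. The k-grams $m, mo, and on each point to a postings list of dictionary terms; terms shown in the figure: mace, madden, among, amortize, around.]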
Processing n-gram wild-cards
- Query mon* can now be run as: $m AND mo AND on
- Gets terms that match AND version of our wildcard query.
- But we’d enumerate moon.
- Must post-filter these terms against query.
- Surviving enumerated terms are then looked up in the term-document inverted index.
- Fast, space efficient (compared to permuterm).
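
A sketch of the whole pipeline under simple assumptions: an in-memory bigram index built from a toy vocabulary, an AND over the query's bigrams, then a post-filter. `fnmatchcase` is used here as a convenient post-filter; the vocabulary and function names are illustrative.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def kgrams(term, k=2):
    """k-grams of a term padded with the boundary symbol '$'."""
    padded = "$" + term + "$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]


def build_kgram_index(terms, k=2):
    """Second inverted index: k-gram -> set of dictionary terms containing it."""
    index = defaultdict(set)
    for term in terms:
        for gram in kgrams(term, k):
            index[gram].add(term)
    return index


def query_kgrams(pattern, k=2):
    """k-grams of '$pattern$' that do not span a wild-card."""
    grams = []
    for piece in ("$" + pattern + "$").split("*"):
        grams.extend(piece[i:i + k] for i in range(len(piece) - k + 1))
    return grams


def wildcard_terms(index, pattern, k=2):
    """Return (candidates from the AND of k-grams, terms surviving the filter)."""
    grams = query_kgrams(pattern, k)
    if not grams:
        raise ValueError("pattern yields no k-grams to intersect")
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    matches = sorted(t for t in candidates if fnmatchcase(t, pattern))
    return sorted(candidates), matches


index = build_kgram_index(["month", "moon", "monday", "salmon", "among"])
print(query_kgrams("mon*"))
print(wildcard_terms(index, "mon*"))
```

Here moon passes the AND of $m, mo, and on but is dropped by the post-filter.
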
Processing wild-card queries
- As before, we must execute a Boolean query for each enumerated, filtered term.
- Wild-cards can result in expensive query execution
- If you encourage “laziness” people will respond!
  - [Figure: search box labeled “Search” with the hint: Type your search terms, use ‘*’ if you need to. E.g., Alex* will match Alexander.]
- Does Google allow wildcard queries?
B+-tree
Records must be ordered over an attribute
Queries: exact match and range queries over the indexed attribute: “find the name of the student with ID=087-34-7892” or “find all students with gpa between 3.00 and 3.5”
B+-tree: properties
Insert/delete at log_{F/2}(N/B) cost; keep tree height-balanced. (F = fanout)
Two types of nodes: index nodes and data nodes; each node is 1 page (disk based method)
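
To get a feel for the logarithmic cost, the expression can be evaluated for sample numbers. The slide does not define N and B; this sketch assumes N is the number of records and B the number of records per data page, and all numbers below are invented.

```python
import math


def btree_cost(num_records, records_per_page, fanout):
    """Evaluate log base (F/2) of (N/B), the cost term quoted on the slide.

    Assumption: N = number of records, B = records per data page.
    """
    return math.log(num_records / records_per_page, fanout / 2)


# One million records, 100 per page, fanout 100 (hypothetical values).
cost = btree_cost(1_000_000, 100, 100)
print(round(cost, 2), math.ceil(cost))
```

Under these assumptions about three index levels suffice, which is why a disk-based B+-tree needs only a handful of page reads per operation.
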
Index node
[Figure: index node with keys 57, 81, 95 and four child pointers: to keys k < 57, to keys 57 ≤ k < 81, to keys 81 ≤ k < 95, and to keys k ≥ 95.]
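
Choosing which child pointer to follow in such an index node is a search over its keys. A minimal sketch with `bisect_right`; `child_index` is an illustrative name.

```python
from bisect import bisect_right

INDEX_KEYS = [57, 81, 95]


def child_index(keys, k):
    """Position of the child pointer to follow for search key k.

    Child 0 covers k < keys[0]; child i covers keys[i-1] <= k < keys[i];
    the last child covers k >= keys[-1].
    """
    return bisect_right(keys, k)


for k in (10, 57, 80, 81, 95, 200):
    print(k, "-> child", child_index(INDEX_KEYS, k))
```
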
Data node
[Figure: data (leaf) node holding keys 57, 81, 95. It is reached by a pointer from a non-leaf node; each key carries a pointer to the record with that key (57, 81, 95), and a last pointer leads to the next leaf in sequence.]
EX: B+ Tree of order 3.
(a) Initial tree
[Figure: index level: root (60) with two index nodes, (20, 40) and (80). Data level: leaves (5, 10), (20), (40, 50), (60), (80, 100).]
Query Example
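[Figure: B+-tree with root (100); index nodes (30) and (120, 150, 180); leaves (3, 5, 11), (30, 35), (100, 101, 110), (120, 130), (150, 156, 179), (180, 200), linked in sequence.]
Range[32, 160]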
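
A sketch of how this range query runs over the leaf level: find the leaf that could hold the lower bound, then follow the leaf chain until a key exceeds the upper bound. The descent through the index levels is collapsed into one `bisect_right` over the leaves' first keys, which is a simplification; `range_query` is an illustrative name.

```python
from bisect import bisect_right

# Leaf (data) nodes of the example tree, in left-to-right chain order.
LEAVES = [
    [3, 5, 11],
    [30, 35],
    [100, 101, 110],
    [120, 130],
    [150, 156, 179],
    [180, 200],
]


def range_query(leaves, lo, hi):
    """Return all keys k with lo <= k <= hi by scanning linked leaves."""
    # Stand-in for the root-to-leaf descent: separators are the first key
    # of every leaf except the leftmost one.
    separators = [leaf[0] for leaf in leaves[1:]]
    i = bisect_right(separators, lo)
    result = []
    while i < len(leaves):
        for key in leaves[i]:
            if key > hi:
                return result        # past the range: stop the scan
            if key >= lo:
                result.append(key)
        i += 1                       # follow the pointer to the next leaf
    return result


print(range_query(LEAVES, 32, 160))
```
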