Ranking with Index
Rong Jin
Inverted Index
Find plays of Shakespeare related to Brutus and Calpurnia?
$\mathrm{sim}(q, d) = \mathrm{tf}(d, \text{Brutus}) \cdot \mathrm{idf}(\text{Brutus}) + \mathrm{tf}(d, \text{Calpurnia}) \cdot \mathrm{idf}(\text{Calpurnia})$
Inverted Index
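• A simple approach: linear scan, computing a score for each document
• Assume idf(Brutus) = idf(Calpurnia) = 1
• Scores for the six documents: 1, 2, 0, 1, 0, 0
• Slow for large collections
$\mathrm{sim}(q, d) = \mathrm{tf}(d, \text{Brutus}) \cdot \mathrm{idf}(\text{Brutus}) + \mathrm{tf}(d, \text{Calpurnia}) \cdot \mathrm{idf}(\text{Calpurnia})$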
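A minimal sketch of the linear scan may make the cost concrete. The term-frequency table below is an assumption (the slide's figure is not in the transcript); its values were chosen only so that the scores come out as 1, 2, 0, 1, 0, 0, and the name `linear_scan` is hypothetical.

```python
# Assumed term frequencies for six documents (not taken from the slides).
TF = {
    "brutus":    [1, 1, 0, 1, 0, 0],
    "calpurnia": [0, 1, 0, 0, 0, 0],
}
# The slide assumes idf(Brutus) = idf(Calpurnia) = 1.
IDF = {"brutus": 1, "calpurnia": 1}


def linear_scan(query_terms, tf, idf, num_docs):
    """Score every document, whether or not it contains a query term."""
    scores = []
    for doc in range(num_docs):
        # sim(q, d) = sum over query terms of tf(d, t) * idf(t)
        scores.append(sum(tf[t][doc] * idf[t] for t in query_terms if t in tf))
    return scores


scores = linear_scan(["brutus", "calpurnia"], TF, IDF, num_docs=6)
print(scores)
```

Every document is visited, so the work grows with the collection size even when most documents score zero.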
Inverted Index
• Only three plays of Shakespeare contain “Brutus” or “Calpurnia”
• Inverted index: quickly find the list of documents that contain any of the query words
$\mathrm{sim}(q, d) = \mathrm{tf}(d, \text{Brutus}) \cdot \mathrm{idf}(\text{Brutus}) + \mathrm{tf}(d, \text{Calpurnia}) \cdot \mathrm{idf}(\text{Calpurnia})$
Inverted Index
• For each term t, we store a list of all documents that contain t.
[Figure: dictionary and postings]
Inverted Index
• Query: Brutus and Calpurnia
• Substantially reduces the number of candidate documents (is this true in general?)
• Merge the postings of the query terms: only compute scores for documents 1, 2, 4, 11, 31, 45, 54, 173, 174, 101
[Figure: dictionary and postings]
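The merge step can be sketched as a union of two document-ordered postings lists. The two lists below are assumptions (the figure with the actual postings is not in the transcript), chosen so that their union is the set of candidate documents listed above; `merge_postings` is a hypothetical name.

```python
def merge_postings(a, b):
    """Union of two sorted document-number lists, sorted and without duplicates."""
    merged, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:          # document contains both terms
            merged.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    merged.extend(a[i:])          # at most one of these two tails is non-empty
    merged.extend(b[j:])
    return merged


# Assumed postings lists for the two query terms.
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
candidates = merge_postings(brutus, calpurnia)
print(candidates)
```

Only the documents in `candidates` need a score; the rest of the collection is never touched.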
Indexes
• Indexes are data structures designed to make search faster and to support efficient updates
• Text search has unique requirements, which lead to unique data structures
• The most common data structure is the inverted index
    • a general name for a class of structures
    • “inverted” because documents are associated with words, rather than words with documents
Inverted Index
• Each index term is associated with an inverted list
    • Contains lists of documents, or lists of word occurrences in documents, and other information
    • Each entry is called a posting
    • The part of the posting that refers to a specific document or location is called a pointer
    • Each document in the collection is given a unique number
    • Lists are usually document-ordered (sorted by document number)
Inverted Index
[Figure: postings]
Example “Collection”
Simple Inverted Index
Query: tropical fish
$\mathrm{sim}(q, d) = \mathrm{tf}(d, \text{tropical}) \cdot \mathrm{idf}(\text{tropical}) + \mathrm{tf}(d, \text{fish}) \cdot \mathrm{idf}(\text{fish})$
Inverted Index with counts
Query: tropical fish
$\mathrm{sim}(q, d) = \mathrm{tf}(d, \text{tropical}) \cdot \mathrm{idf}(\text{tropical}) + \mathrm{tf}(d, \text{fish}) \cdot \mathrm{idf}(\text{fish})$
Inverted Index with positions
Query: tropical fish
$\mathrm{sim}(q, d) = \mathrm{tf}(d, \text{tropical}) \cdot \mathrm{idf}(\text{tropical}) + \mathrm{tf}(d, \text{fish}) \cdot \mathrm{idf}(\text{fish})$
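The three index variants (document numbers only, with counts, with positions) can be sketched from one positional index. The three-document collection and the function name `build_positional_index` are hypothetical; the example collection from the slides is not in the transcript.

```python
from collections import defaultdict

# Hypothetical collection, used only for illustration.
DOCS = {
    1: "tropical fish include fish found in tropical environments",
    2: "fish live in a sea",
    3: "tropical fish are popular aquarium fish",
}


def build_positional_index(docs):
    """Map each term to {doc_id: [1-based word positions]}."""
    index = defaultdict(dict)
    for doc_id in sorted(docs):                      # keeps lists document-ordered
        words = docs[doc_id].lower().split()
        for pos, term in enumerate(words, start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index


positions = build_positional_index(DOCS)
# Index with counts: keep only tf(d, t) in each posting.
counts = {t: {d: len(p) for d, p in plist.items()} for t, plist in positions.items()}
# Simple index: keep only the document numbers.
simple = {t: sorted(plist) for t, plist in positions.items()}

print(simple["tropical"], counts["fish"], positions["fish"][1])
```

The counts are what the tf terms in sim(q, d) need; the positions are only required for phrase and proximity queries.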
Proximity Matches
• Matching phrases or words within a window
    • e.g., "tropical fish", or “find tropical within 5 words of fish”
• Word positions in inverted lists make these types of query features efficient
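As a rough sketch of how word positions support these queries, the two helpers below check a phrase and a window match inside a single document. The function names and the sample position lists are assumptions for illustration.

```python
def phrase_match(pos_first, pos_second):
    """True if the second term occurs immediately after the first."""
    following = set(pos_second)
    return any(p + 1 in following for p in pos_first)


def within_window(pos_a, pos_b, window):
    """True if some occurrence of one term is within `window` words of the other.

    Both arguments are sorted word positions of the two terms in one document.
    """
    i = j = 0
    while i < len(pos_a) and j < len(pos_b):
        if abs(pos_a[i] - pos_b[j]) <= window:
            return True
        # Advance the pointer at the smaller position; this is the only
        # move that can bring the two positions closer together.
        if pos_a[i] < pos_b[j]:
            i += 1
        else:
            j += 1
    return False


# Hypothetical positions of "tropical" and "fish" in one document.
tropical, fish = [1, 7], [2, 4]
print(phrase_match(tropical, fish), within_window(tropical, fish, 5))
```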
Fields and Extents
• Document structure is useful in search
    • field restrictions, e.g., date, from:, etc.
    • some fields are more important, e.g., title
• Options:
    • separate inverted lists for each field type
    • add information about fields to postings
    • use extent lists
Extent Lists
• An extent is a contiguous region of a document
    • represent extents using word positions
    • the inverted list records all extents for a given field type
    • e.g., find “fish” in title
[Figure: extent list]
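A small sketch of the “fish” in title lookup, assuming extents are stored as half-open word-position ranges (start, end). That representation, the sample postings, and the name `term_in_extent` are all assumptions, since the slide's figure is not in the transcript.

```python
def term_in_extent(term_positions, extents):
    """True if any term position falls inside any (start, end) extent.

    Extents are assumed to be half-open ranges: start <= position < end.
    """
    return any(start <= p < end for p in term_positions for start, end in extents)


# Hypothetical postings: doc_id -> positions of "fish", and doc_id -> title extents.
fish_positions = {1: [2, 4], 2: [7], 3: [1]}
title_extents = {1: [(1, 3)], 2: [(1, 5)], 3: [(1, 2)]}

matches = [
    doc_id
    for doc_id in sorted(fish_positions)
    if term_in_extent(fish_positions[doc_id], title_extents.get(doc_id, []))
]
print(matches)
```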
Other Issues
• Precomputed scores in inverted list
    • e.g., list for “fish” [(1:3.6), (3:2.2)], where 3.6 is the total feature value for document 1
    • improves speed but reduces flexibility
• Score-ordered lists
    • very efficient for single-word queries
    • only retrieve the top part of each inverted list, reducing disk access
$\mathrm{sim}(q, d_k) = w_{k,1}\, q_1\, d_{k,1} + w_{k,2}\, q_2\, d_{k,2} + \cdots + w_{k,n}\, q_n\, d_{k,n}$
Compression
• Inverted lists are very large
    • e.g., 25-50% of collection for TREC collections using Indri search engine
    • Much higher if n-grams are indexed
• Compression of indexes saves disk and/or memory space
    • Typically have to decompress lists to use them
    • Trade-off between compression ratios and computational cost
• Lossless compression: no information lost
Compression
• Basic idea: common data elements use short codes, while uncommon data elements use longer codes
• Example: coding numbers, where one possible encoding of the number sequence takes 14 bits
• But ‘0’ is more frequent than ‘1’, ‘2’, and ‘3’
• A better coding scheme: 0: 0, 1: 01, 2: 11, 3: 10
Compression
• Encoding 0 with a single 0 bit takes only 10 bits, but the same bits can also be decoded as 0 0 3 3 0 2 0
Compression Example
• Ambiguous encoding: not clear how to decode
• Use an unambiguous code instead, which gives a 13-bit encoding
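The ambiguity can be checked mechanically. In the sketch below, the number sequence, the 2-bit code, and the prefix-free code are assumptions (the slide images are not in the transcript); they were picked because they reproduce the 14-bit, 10-bit, and 13-bit totals and the alternative decoding 0 0 3 3 0 2 0. Only the `short` code is taken from the slide text.

```python
def encode(seq, code):
    """Concatenate the codeword of each number."""
    return "".join(code[n] for n in seq)


def all_decodings(bits, code):
    """Every way to split `bits` into codewords of `code`."""
    if not bits:
        return [[]]
    results = []
    for number, word in code.items():
        if bits.startswith(word):
            for rest in all_decodings(bits[len(word):], code):
                results.append([number] + rest)
    return results


sequence = [0, 1, 0, 3, 0, 2, 0]                        # assumed number sequence
fixed = {0: "00", 1: "01", 2: "11", 3: "10"}            # assumed 2-bit code
short = {0: "0", 1: "01", 2: "11", 3: "10"}             # scheme given on the slide
prefix_free = {0: "0", 1: "101", 2: "110", 3: "111"}    # assumed unambiguous code

for name, code in [("fixed", fixed), ("short", short), ("prefix-free", prefix_free)]:
    bits = encode(sequence, code)
    print(name, len(bits), "bits,", len(all_decodings(bits, code)), "decoding(s)")
```

A code in which no codeword is a prefix of another, such as the assumed `prefix_free` one, always has exactly one decoding.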
Delta Encoding
• Word count data is a good candidate for compression
    • many small numbers and few larger numbers
    • encode small numbers with small codes
• Document numbers are less predictable
    • but differences between numbers in an ordered list are smaller and more predictable
• Inverted list of (document ID, count) pairs: (1, 1) (102, 3) (1100, 2) (100,010, 10)
Delta Encoding
• Delta encoding: encoding differences between document numbers (d-gaps)
Delta Encoding
• Inverted list (without counts)
• Differences between adjacent numbers
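A minimal sketch of d-gap encoding and decoding. The document numbers are hypothetical, since the slide's own lists are not in the transcript, and the names `to_gaps` and `from_gaps` are illustrative.

```python
def to_gaps(doc_ids):
    """Replace each document number by its difference from the previous one."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_gaps(gaps):
    """Recover document numbers as the running sum of the gaps."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


doc_ids = [1, 5, 9, 18, 23]          # hypothetical inverted list without counts
gaps = to_gaps(doc_ids)
print(gaps, from_gaps(gaps))
```

The first gap is the first document number itself, so decoding needs no extra state.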
Bit-Aligned Codes
• Breaks between encoded numbers can occur after any bit position
• Unary code
    • Encode k by k 1s followed by a 0
    • The 0 at the end makes the code unambiguous
Bit-Aligned Codes
Example
• Encoding: 1 4 4 9 5 → 10 11110 11110 1111111110 111110
• Decoding: 10 1101110 101110 → 1 2 3 1 3
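The unary code is short enough to sketch directly; the function names are illustrative, and bit strings are represented as Python strings of "0" and "1" for readability rather than as packed bits.

```python
def unary_encode(numbers):
    """Encode each k as k ones followed by a terminating zero."""
    return "".join("1" * k + "0" for k in numbers)


def unary_decode(bits):
    """Count the ones before each zero."""
    numbers, run = [], 0
    for bit in bits:
        if bit == "1":
            run += 1
        else:
            numbers.append(run)
            run = 0
    return numbers


print(unary_encode([1, 4, 4, 9, 5]))
print(unary_decode("101101110101110"))
```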
Unary and Binary Codes
• Unary is very efficient for small numbers such as 0 and 1, but quickly becomes very expensive
    • 1023 can be represented in 10 binary bits, but requires 1024 bits in unary
• Binary is more efficient for large numbers, but it may be ambiguous
Elias-γ Code
• To encode a number k, compute $k_d = \lfloor \log_2 k \rfloor$ and $k_r = k - 2^{\lfloor \log_2 k \rfloor}$
• Important property: the number of bits for coding $k_r$ is no more than $k_d$
Elias-γ Code
• Use the unary code for $k_d$ and the binary code for $k_r$
• $k_d$ is the number of bits used to encode $k_r$
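A sketch of Elias-γ coding, assuming the usual definitions k_d = floor(log2 k) and k_r = k - 2^k_d, with k_d written in the unary code (k_d ones and a zero) and k_r written in exactly k_d binary bits. Function names are illustrative and bits are kept as strings for readability.

```python
def gamma_encode(k):
    """Elias-gamma code of a positive integer as a string of bits."""
    if k < 1:
        raise ValueError("Elias-gamma codes positive integers only")
    k_d = k.bit_length() - 1              # floor(log2 k)
    k_r = k - (1 << k_d)                  # remainder, always below 2**k_d
    unary = "1" * k_d + "0"
    binary = format(k_r, "b").zfill(k_d) if k_d else ""
    return unary + binary


def gamma_decode(bits):
    """Decode a concatenation of well-formed Elias-gamma codewords."""
    numbers, i = [], 0
    while i < len(bits):
        k_d = 0
        while bits[i] == "1":             # read the unary part
            k_d += 1
            i += 1
        i += 1                            # skip the terminating 0
        k_r = int(bits[i:i + k_d], 2) if k_d else 0
        i += k_d
        numbers.append((1 << k_d) + k_r)
    return numbers


print(gamma_encode(6), len(gamma_encode(1023)))
```

Each codeword has k_d + 1 unary bits and k_d binary bits, which is where the 2·floor(log2 k) + 1 total comes from.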
Elias-γ Code
• Elias-γ code uses no more bits than unary, many fewer for k > 2
    • 1023 takes 19 bits instead of 1024 bits using unary
• In general, takes $2\lfloor \log_2 k \rfloor + 1$ bits
• No more than twice the number of bits of the binary code
• Example bit stream: 100111010011010111100111
Decoded: 2 12 6 23
Byte-Aligned Codes
• Variable-length bit encodings can be a problem on processors that process bytes
• v-byte is a popular byte-aligned code
    • Similar to Unicode UTF-8
• Shortest v-byte code is 1 byte
• Numbers are 1 to 4 bytes, with high bit 1 in the last byte, 0 otherwise
V-Byte Encoding
Number    Binary      V-byte
6         110         10000110
128       10000000    00000001 10000000
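A sketch of v-byte coding under the rule stated earlier: seven data bits per byte, with the high bit set only in the last byte of each number. Placing the most significant 7-bit group first is an assumption that matches the byte pattern 00000001 10000000 shown above; the function names are illustrative.

```python
def vbyte_encode(n):
    """Encode a non-negative integer as v-byte."""
    if n < 0:
        raise ValueError("v-byte encodes non-negative integers only")
    groups = [n & 0x7F]
    n >>= 7
    while n:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()            # most significant 7-bit group first
    groups[-1] |= 0x80          # high bit marks the last byte of the number
    return bytes(groups)


def vbyte_decode(data):
    """Decode a concatenation of v-byte codes."""
    numbers, value = [], 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:         # last byte of this number
            numbers.append(value)
            value = 0
    return numbers


print(vbyte_encode(6).hex(), vbyte_encode(128).hex())
```

Because every number ends on a byte boundary, decoding needs only shifts and masks, at the cost of spending at least 8 bits even on very small gaps.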
Compression Example
• Consider an inverted list with positions
• Delta encode document numbers and positions: 1 2 1 6 1 3 6 11 180 1 1 1
• Compress using v-byte
Results of Compression
• Reuters dataset: 800K documents
Skipping
• Search involves comparison of inverted lists of different lengths, which can be expensive
• Find documents with both “galago” and “animal”
    • Number of documents with “animal”: 1,000,000,000
    • Number of documents with “galago”: 922,000
    • Number of documents with both: 89,700
Skipping
• Search involves comparison of inverted lists of different lengths
    • Can be very inefficient
    • “Skipping” ahead to check document numbers is much better
    • Compression makes this difficult: variable size, only d-gaps stored
• Skip pointers are an additional data structure to support skipping
Skip Pointers
• A skip pointer (d, p) contains a document number d and a byte (or bit) position p
    • Means there is an inverted list posting that starts at position p, and the posting before it was for document d
[Figure: inverted list with skip pointers]
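The definition can be sketched over a d-gap list. For simplicity the position p below is an index into the list of gaps rather than a byte or bit offset; that simplification, the sample document numbers, and the names `build_skips` and `contains` are assumptions.

```python
def build_skips(doc_ids, every):
    """Return (gaps, skips) for a sorted list of document numbers.

    Each skip pointer (d, p) says: a posting starts at index p of `gaps`,
    and the posting before it was for document d.
    """
    gaps = [b - a for a, b in zip([0] + doc_ids[:-1], doc_ids)]
    skips = [(doc_ids[p - 1], p) for p in range(every, len(doc_ids), every)]
    return gaps, skips


def contains(gaps, skips, target):
    """Check whether `target` is in the list, decoding gaps only after the
    last skip pointer whose document number is below the target."""
    doc, pos = 0, 0
    for d, p in skips:
        if d < target:
            doc, pos = d, p
        else:
            break
    while pos < len(gaps):
        doc += gaps[pos]          # d-gaps must be summed to recover doc numbers
        pos += 1
        if doc >= target:
            return doc == target
    return False


doc_ids = [5, 11, 17, 21, 26, 34, 36, 37, 45, 48, 51, 52, 57, 80, 89, 91]  # hypothetical
gaps, skips = build_skips(doc_ids, every=3)
print(skips, contains(gaps, skips, 80))
```

Storing d in the pointer is what makes the jump possible: without it, a gap at position p could not be turned back into a document number.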
Auxiliary Structures
• Inverted lists are usually stored together in a single file for efficiency
    • Inverted file
• Term statistics stored at start of inverted lists
• Collection statistics stored in separate file
Auxiliary Structures
• Vocabulary or lexicon
    • Contains a lookup table from index terms to the byte offset of the inverted list in the inverted file
    • Either hash table in memory or B-tree for larger vocabularies
[Figure: looking up “Brutus” in a hash table or B-tree]
Index Construction
Simple in-memory indexer
Merging
• Merging addresses the limited memory problem
    • Build the inverted list structure until memory runs out
    • Then write the partial index to disk and start making a new one
    • At the end of this process, the disk is filled with many partial indexes, which are merged
• Partial lists must be designed so they can be merged in small pieces
    • e.g., storing in alphabetical order
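A sketch of merging alphabetically sorted partial indexes. Here each partial index is an in-memory list of (term, postings) pairs; a real indexer would stream them from disk files. The sample terms and the name `merge_partial_indexes` are hypothetical, and the code assumes partial indexes are passed in the order they were written, so later ones hold larger document numbers.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_partial_indexes(partials):
    """Merge partial indexes, each a list of (term, postings) sorted by term."""
    # Tag every entry with the number of its partial index so that entries
    # for the same term keep their write order and postings lists are never
    # compared with each other.
    tagged = (
        [(term, n, postings) for term, postings in partial]
        for n, partial in enumerate(partials)
    )
    merged = []
    for term, group in groupby(heapq.merge(*tagged), key=itemgetter(0)):
        postings = []
        for _, _, part in group:
            postings.extend(part)
        merged.append((term, postings))
    return merged


partial_1 = [("aardvark", [2, 3, 4, 5]), ("apple", [2, 4])]
partial_2 = [("aardvark", [6, 9]), ("actor", [15, 42, 68])]
print(merge_partial_indexes([partial_1, partial_2]))
```

Because both inputs are in alphabetical order, the merge only ever looks at the current entry of each partial index, which is what lets it run in small pieces.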
Distributed Indexing
• Distributed processing driven by need to index and analyze huge amounts of data (i.e., the Web)
• Large numbers of inexpensive servers used rather than larger, more expensive machines
• MapReduce is a distributed programming tool designed for indexing and analysis tasks
MapReduce
• Basic process
    • Map stage transforms data records into pairs, each with a key and a value (e.g., a (word, doc-id) pair)
    • Shuffle uses a hash function so that all pairs with the same key end up next to each other and on the same machine (e.g., all pairs for the same word are sent to the same machine)
    • Reduce stage processes records in batches, where all pairs with the same key are processed at the same time (e.g., creating the inverted list for each word)
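The three stages can be imitated in a single process. This is only a sketch of the data flow, not of a real MapReduce framework: the "machines" are plain dictionaries, and the documents and function names are hypothetical.

```python
from collections import defaultdict


def map_stage(doc_id, text):
    """Emit one (word, doc_id) pair per word occurrence."""
    return [(word, doc_id) for word in text.lower().split()]


def shuffle(pairs, num_machines):
    """Route each pair to a machine chosen by hashing its key, grouped by key."""
    machines = [defaultdict(list) for _ in range(num_machines)]
    for word, doc_id in pairs:
        machines[hash(word) % num_machines][word].append(doc_id)
    return machines


def reduce_stage(word, doc_ids):
    """Build the inverted list for one word."""
    return word, sorted(set(doc_ids))


docs = {1: "tropical fish", 2: "fish tank", 3: "tropical tank fish"}  # hypothetical
pairs = [pair for doc_id, text in docs.items() for pair in map_stage(doc_id, text)]
index = dict(
    reduce_stage(word, ids)
    for machine in shuffle(pairs, num_machines=2)
    for word, ids in machine.items()
)
print(index)
```

Which machine a word lands on depends on the hash, but all pairs for that word land together, so the resulting index does not depend on the assignment.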
Indexing Example
Update Index
• Index merging is a good strategy for handling updates when they come in large batches
• For small updates this is very inefficient
    • instead, create a separate index for new documents and merge results from both searches
    • could be in-memory, fast to update and search
• How to deal with deleted documents?
Update Index
• Deletions handled using a delete list
• Modifications done by putting the old version on the delete list and adding the new version to the new documents index
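A sketch of searching a main index together with a small new-documents index and a delete list. The dictionary-based indexes, the sample data, and the choice to give a modified document a fresh document number are all assumptions made for illustration.

```python
def search(term, main_index, new_index, deleted):
    """Merge results from both indexes, dropping documents on the delete list."""
    hits = set(main_index.get(term, [])) | set(new_index.get(term, []))
    return sorted(hits - deleted)


def modify(old_doc_id, new_doc_id, terms, new_index, deleted):
    """Put the old version on the delete list and index the new version."""
    deleted.add(old_doc_id)
    for term in terms:
        new_index.setdefault(term, []).append(new_doc_id)


main_index = {"fish": [1, 2, 3]}      # large index, rebuilt only by merging
new_index = {"fish": [4]}             # small in-memory index for new documents
deleted = {2}                         # delete list

before = search("fish", main_index, new_index, deleted)
modify(3, 5, ["tank"], new_index, deleted)
after = search("fish", main_index, new_index, deleted)
print(before, after, search("tank", main_index, new_index, deleted))
```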
Query Processing
• Document-at-a-time
    • Calculates complete scores for documents by processing all term lists, one document at a time
• Query: Salt Water Tropic
Query Processing
• Term-at-a-time
    • Accumulates scores for documents by processing term lists one at a time
• Query: Salt Water Tropic
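The two strategies can be compared on a toy index. The inverted lists below are hypothetical (the slide figures are not in the transcript), and the score is simply the sum of term counts, which assumes idf = 1 for every term.

```python
from collections import defaultdict

# Hypothetical inverted lists with counts: term -> [(doc_id, tf)], document-ordered.
INDEX = {
    "salt":   [(1, 1), (4, 1)],
    "water":  [(1, 1), (2, 1), (4, 1)],
    "tropic": [(1, 2), (2, 2), (3, 1)],
}


def term_at_a_time(query, index):
    """Process one whole inverted list after another, keeping partial scores."""
    accumulators = defaultdict(int)
    for term in query:
        for doc_id, tf in index.get(term, []):
            accumulators[doc_id] += tf
    return dict(accumulators)


def document_at_a_time(query, index):
    """Walk all lists in parallel, finishing one document's score at a time."""
    lists = [index.get(term, []) for term in query]
    cursors = [0] * len(lists)
    scores = {}
    while True:
        current = [lst[c][0] for lst, c in zip(lists, cursors) if c < len(lst)]
        if not current:
            return scores
        doc_id = min(current)             # next document in document order
        score = 0
        for i, lst in enumerate(lists):
            if cursors[i] < len(lst) and lst[cursors[i]][0] == doc_id:
                score += lst[cursors[i]][1]
                cursors[i] += 1
        scores[doc_id] = score            # complete score; no accumulator needed


query = ["salt", "water", "tropic"]
print(term_at_a_time(query, INDEX))
print(document_at_a_time(query, INDEX))
```

Both functions return the same scores; they differ in the order in which postings are read and in whether partial scores have to be held in memory.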
Optimization Techniques
• Term-at-a-time uses more memory for storing inverted lists, but fewer disk accesses
• Two classes of optimization
    • Read less data from inverted lists
        • e.g., using skip lists to read part of an inverted list
    • Calculate scores for fewer documents
        • e.g., conjunctive processing to reduce the number of candidate documents
Distributed Evaluation
• Basic process
    • All queries sent to a director machine
    • Director then sends messages to many index servers
    • Each index server does some portion of the query processing
    • Director organizes the results and returns them to the user
• Two main approaches
    • Document distribution: by far the most popular
    • Term distribution
Distributed Evaluation
• Document distribution
    • each index server acts as a search engine for a small fraction of the total collection
    • director sends a copy of the query to each of the index servers, each of which returns the top-k results
    • results are merged into a single ranked list by the director
• Collection statistics should be shared for effective ranking
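The director's merge step can be sketched with the standard library. Each server's result list is assumed to be already sorted by descending score, and the scores are assumed to be comparable across servers (which is why shared collection statistics matter). The sample results and the name `director_merge` are hypothetical.

```python
import heapq
from itertools import islice


def director_merge(server_results, k):
    """Merge per-server lists of (score, doc_id), each sorted by descending
    score, into one global top-k list."""
    merged = heapq.merge(*server_results, key=lambda r: r[0], reverse=True)
    return list(islice(merged, k))


# Hypothetical top results returned by three index servers.
server_1 = [(0.9, "d3"), (0.4, "d7")]
server_2 = [(0.8, "d12"), (0.7, "d15")]
server_3 = [(0.5, "d21")]

top = director_merge([server_1, server_2, server_3], k=3)
print(top)
```

Since each server returns only its own top-k, the director handles at most k results per server regardless of collection size.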
Document Distribution
[Figure: the director forwards the query to index servers 1 to 100, each searching its own document collection; the returned document lists are merged into the results]
Distributed Evaluation
• Term distribution
    • Single index is built for the whole cluster of machines
    • Each inverted list in that index is then assigned to one index server
        • in most cases the data to process a query is not stored on a single machine
    • One of the index servers is chosen to process the query
        • usually the one holding the longest inverted list
    • Other index servers send information to that server
    • Final results sent to director
Term Distribution
[Figure: the director forwards the query to index servers 1 to 100, which hold the inverted lists for words a-b, c-d, ..., y-z; document lists are sent to one server, which returns the final results]
Caching
• Query distributions are similar to Zipf
    • About 50% are unique, but some are very popular
• Caching can significantly improve efficiency
    • Cache popular query results
    • Cache common inverted lists
• Cache must be refreshed to prevent stale data