1 inf 2914 information retrieval and web search lecture 6: index construction these slides are...
Post on 21-Dec-2015
217 views
TRANSCRIPT
![Page 1: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/1.jpg)
1
INF 2914Information Retrieval and Web
Search
Lecture 6: Index ConstructionThese slides are adapted from
Stanford’s class CS276 / LING 286Information Retrieval and Web
Mining
![Page 2: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/2.jpg)
2
(Offline) Search Engine Data Flow
- Parse- Tokenize- Per page analysis
tokenizedweb pages
duptable
Parse & Tokenize Global Analysis
2
invertedtext index
1
Crawler
web page - Scan tokenized web pages, anchor text, etc- Generate text index
Index Build
-Dup detection -Static rank - Anchor text -Spam analysis-- …
3 4
ranktable
anchortext
in background
spam table
![Page 3: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/3.jpg)
3
Inverted index
For each term T, we must store a list of all documents that contain T.
Brutus
Calpurnia
Caesar
2 4 8 16 32 64 128
2 3 5 8 13 21 34
13 16
1
Dictionary Postings lists
Sorted by docID (more later on why).
Posting
![Page 4: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/4.jpg)
4
Inverted index construction
Tokenizer
Token stream. Friends Romans Countrymen
Linguistic modules
Modified tokens. friend roman countryman
Indexer
Inverted index.
friend
roman
countryman
2 4
2
13 16
1
Documents tobe indexed.
Friends, Romans, countrymen.
![Page 5: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/5.jpg)
5
Sequence of (Modified token, Document ID) pairs.
I did enact JuliusCaesar I was killed
i' the Capitol; Brutus killed me.
Doc 1
So let it be withCaesar. The noble
Brutus hath told youCaesar was ambitious
Doc 2
Term Doc #I 1did 1enact 1julius 1caesar 1I 1was 1killed 1i' 1the 1capitol 1brutus 1killed 1me 1so 2let 2it 2be 2with 2caesar 2the 2noble 2brutus 2hath 2told 2you 2
caesar 2was 2ambitious 2
Indexer steps
![Page 6: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/6.jpg)
6
Sort by terms. Term Doc #ambitious 2be 2brutus 1brutus 2capitol 1caesar 1caesar 2caesar 2did 1enact 1hath 1I 1I 1i' 1it 2julius 1killed 1killed 1let 2me 1noble 2so 2the 1the 2told 2you 2was 1was 2with 2
Term Doc #I 1did 1enact 1julius 1caesar 1I 1was 1killed 1i' 1the 1capitol 1brutus 1killed 1me 1so 2let 2it 2be 2with 2caesar 2the 2noble 2brutus 2hath 2told 2you 2caesar 2was 2ambitious 2
Core indexing step.
![Page 7: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/7.jpg)
7
Multiple term entries in a single document are merged.
Frequency information is added.
Term Doc # Term freqambitious 2 1be 2 1brutus 1 1brutus 2 1capitol 1 1caesar 1 1caesar 2 2did 1 1enact 1 1hath 2 1I 1 2i' 1 1it 2 1julius 1 1killed 1 2let 2 1me 1 1noble 2 1so 2 1the 1 1the 2 1told 2 1you 2 1was 1 1was 2 1with 2 1
Term Doc #ambitious 2be 2brutus 1brutus 2capitol 1caesar 1caesar 2caesar 2did 1enact 1hath 1I 1I 1i' 1it 2julius 1killed 1killed 1let 2me 1noble 2so 2the 1the 2told 2you 2was 1was 2with 2
Why frequency?Will discuss later.
![Page 8: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/8.jpg)
8
The result is split into a Dictionary file and a Postings file.
Doc # Freq2 12 11 12 11 11 12 21 11 12 11 21 12 11 11 22 11 12 12 11 12 12 12 11 12 12 1
Term N docs Coll freqambitious 1 1be 1 1brutus 2 2capitol 1 1caesar 2 3did 1 1enact 1 1hath 1 1I 1 2i' 1 1it 1 1julius 1 1killed 1 2let 1 1me 1 1noble 1 1so 1 1the 2 2told 1 1you 1 1was 2 2with 1 1
Term Doc # Freqambitious 2 1be 2 1brutus 1 1brutus 2 1capitol 1 1caesar 1 1caesar 2 2did 1 1enact 1 1hath 2 1I 1 2i' 1 1it 2 1julius 1 1killed 1 2let 2 1me 1 1noble 2 1so 2 1the 1 1the 2 1told 2 1you 2 1was 1 1was 2 1with 2 1
![Page 9: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/9.jpg)
9
The index we just built
How do we process a query?
![Page 10: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/10.jpg)
10
Query processing: AND
Consider processing the query:Brutus AND Caesar Locate Brutus in the Dictionary;
Retrieve its postings. Locate Caesar in the Dictionary;
Retrieve its postings. “Merge” the two postings:
128
34
2 4 8 16 32 64
1 2 3 5 8 13
21
Brutus
Caesar
![Page 11: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/11.jpg)
11
34
1282 4 8 16 32 64
1 2 3 5 8 13 21
The merge
Walk through the two postings simultaneously, in time linear in the total number of postings entries
128
34
2 4 8 16 32 64
1 2 3 5 8 13 21
Brutus
Caesar2 8
If the list lengths are x and y, the merge takes O(x+y)operations.Crucial: postings sorted by docID.
![Page 12: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/12.jpg)
12
Index construction
How do we construct an index? What strategies can we use with limited
main memory?
![Page 13: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/13.jpg)
13
Our corpus for this lecture
Number of docs = n = 1M Each doc has 1K terms
Number of distinct terms = m = 500K 667 million postings entries
![Page 14: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/14.jpg)
14
./ ln~ / m/J
/
1JmnJHnJinJ
Jm
i
How many postings?
Number of 1’s in the i th block = nJ/i Summing this over m/J blocks, we have
For our numbers, this should be about 667 million postings.
![Page 15: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/15.jpg)
15
I did enact JuliusCaesar I was killed i' the Capitol; Brutus killed me.
Doc 1
So let it be withCaesar. The nobleBrutus hath told youCaesar was ambitious
Doc 2
Term Doc #I 1did 1enact 1julius 1caesar 1I 1was 1killed 1i' 1the 1capitol 1brutus 1killed 1me 1so 2let 2it 2be 2with 2caesar 2the 2noble 2brutus 2hath 2told 2you 2caesar 2was 2ambitious 2
Recall index construction
Documents are processed to extract words and these are saved with the Document ID.
![Page 16: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/16.jpg)
16
Term Doc #I 1did 1enact 1julius 1caesar 1I 1was 1killed 1i' 1the 1capitol 1brutus 1killed 1me 1so 2let 2it 2be 2with 2caesar 2the 2noble 2brutus 2hath 2told 2you 2caesar 2was 2ambitious 2
Term Doc #ambitious 2be 2brutus 1brutus 2capitol 1caesar 1caesar 2caesar 2did 1enact 1hath 1I 1I 1i' 1it 2julius 1killed 1killed 1let 2me 1noble 2so 2the 1the 2told 2you 2was 1was 2with 2
We focus on this sort step.We have 667M items to sort.
Key step
After all documents have been processed the inverted file is sorted by terms.
![Page 17: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/17.jpg)
17
Index construction
At 10-12 bytes per postings entry, demands several temporary gigabytes
![Page 18: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/18.jpg)
18
System parameters for design
Disk seek ~ 10 milliseconds Block transfer from disk ~ 1 microsecond
per byte (following a seek) All other ops ~ 1 microsecond
E.g., compare two postings entries and decide their merge order
![Page 19: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/19.jpg)
19
If every comparison took 2 disk seeks, and N items could besorted with N log2N comparisons, how long would this take?
Bottleneck
Build postings entries one doc at a time Now sort postings entries by term (then by
doc within each term) Doing this with random disk seeks would
be too slow – must sort N=667M records
![Page 20: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/20.jpg)
2020
If every comparison took 2 disk seeks, and N items could besorted with N log2N comparisons, how long would this take?
12.4 years!!!
Disk-based sorting
Build postings entries one doc at a time Now sort postings entries by term Doing this with random disk seeks would
be too slow – must sort N=667M records
![Page 21: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/21.jpg)
2121
Sorting with fewer disk seeks
12-byte (4+4+4) records (term, doc, freq). These are generated as we process docs. Must now sort 667M such 12-byte records
by term. Define a Block ~ 10M such records
can “easily” fit a couple into memory. Will have 64 such blocks to start with.
Will sort within blocks first, then merge the blocks into one long sorted order.
![Page 22: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/22.jpg)
2222
Sorting 64 blocks of 10M records
First, read each block and sort within: Quicksort takes 2N ln N expected steps In our case 2 x (10M ln 10M) steps Time to Quicksort each block = 320 seconds
Total time to read each block from disk and write it back
120M x 2 x 10-6 = 240 seconds 64 times this estimate - gives us 64 sorted runs of
10M records each Total Quicksort time = 5.6 hours Total read+write time = 4.2 hours Total for this phase ~ 10 hours
Need 2 copies of data on disk, throughout
![Page 23: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/23.jpg)
2323Disk
1
3 4
22
1
4
3
Runs beingmerged.
Merged run.
Merging 64 sorted runs
Merge tree of log264= 6 layers. During each layer, read into memory runs
in blocks of 10M, merge, write back.
![Page 24: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/24.jpg)
2424
…
…
Sorted runs.
1 2 6463
32 runs, 20M/run
16 runs, 40M/run8 runs, 80M/run4 runs … ?2 runs … ?1 run … ?
Bottom levelof tree.
Merge tree
![Page 25: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/25.jpg)
2525
Merging 64 runs
Time estimate for disk transfer: 6 x Time to read+write 64 blocks = 6 x 4.2 hours ~ 25 hours
Time estimate for the merge operation: 6 x 640M x 10-6 = 1 hour
Time estimate for the overall algorithm: Sort time + Merge time ~ 10 + 26 ~ 36 hours
Lower bound (main memory sort): Time to read+write = 4.2 hours Time to sort in memory = 10.7 hours Total time ~ 15 hours
![Page 26: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/26.jpg)
2626
Some indexing numbers
0
500
1000
1500
2000
2500
3000
3500
4000
4500
Trevi Lucy Melnik et. al. Long andSuel
Lucene
K d
oc
um
en
ts p
er
ho
ur
![Page 27: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/27.jpg)
2727
How to improve indexing time?
Compression of the sorted runs Multi-way merge
Heap merge all runs Radix sort (linear time sorting) Pipelining reading, sorting, and writing
phases
![Page 28: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/28.jpg)
28
Multi-way merge
28
…Sorted runs
1 2 6463
Heap
![Page 29: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/29.jpg)
2929
Indexing improvements
Radix sort Linear time sorting Flexibility in defining the sort criteria Bigger sort buffers increase performance (contradicting
previous literature) (see VLDB paper on the references) Pipelining read and sort + write phases
B1
B2
B1
B2
B1
B2
B1
B2
time
Read
Sort + Write
![Page 30: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/30.jpg)
3030
Positional indexing
Given documents:D1: This is a testD2: Is this a testD3: This is not a test
Reorganize by term:
TERM DOC LOC DATA(caps)this 1 0 1is 1 1 0a 1 2 0test 1 3 0is 2 0 1this 2 1 0a 2 2 0test 2 3 0this 3 0 1is 3 1 0not 3 2 0a 3 3 0test 3 4 0
![Page 31: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/31.jpg)
3131
Positional indexing
In “postings list” format:
a (1,2,0),(2,2,0),(3,3,0)is (1,1,0),(2,0,1),(3,1,0)not (3,2,0)test (1,3,0),(2,3,0),(3,4,0)this (1,0,1),(2,1,0),(3,0,1)
Sort by <term, doc, loc>:
TERM DOC LOC DATA(caps)a 1 2 0a 2 2 0a 3 3 0is 1 1 0is 2 0 1is 3 1 0not 3 2 0test 1 3 0test 2 3 0test 3 4 0this 1 0 1this 2 1 0this 3 0 1
![Page 32: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/32.jpg)
3232
Positional indexing with radix sort
Radix key = Token hash = 8 bytes Document ID = 8 bytes Location = 4 bytes, but no need to sort by
location since Radix sort is stable!
![Page 33: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/33.jpg)
3333
Distributed indexing
Maintain a master machine directing the indexing job – considered “safe”
Break up indexing into sets of (parallel) tasks
Master machine assigns each task to an idle machine from a pool
![Page 34: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/34.jpg)
3434
Parallel tasks
We will use two sets of parallel tasks Parsers Inverters
Break the input document corpus into splits Each split is a subset of documents
Master assigns a split to an idle parser machine
Parser reads a document at a time and emits (term, doc) pairs
![Page 35: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/35.jpg)
3535
Parallel tasks
Parser writes pairs into j partitions Each for a range of terms’ first letters
(e.g., a-f, g-p, q-z) – here j=3. Now to complete the index inversion
![Page 36: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/36.jpg)
3636
splits
Parser
Parser
Parser
Master
a-fg-p q-z
a-fg-p q-z
a-fg-p q-z
Inverter
Inverter
Inverter
Postings
a-f
g-p
q-z
assign assign
Data flow
![Page 37: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/37.jpg)
3737
Above process flow a special case of MapReduce.
Inverters
Collect all (term, doc) pairs for a partition Sorts and writes to postings list Each partition contains a set of postings
![Page 38: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/38.jpg)
3838
MapReduce
Model for processing large data sets. Contains Map and Reduce functions. Runs on a large cluster of machines. A lot of MapReduce programs are executed
on Google’s cluster everyday.
![Page 39: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/39.jpg)
3939
Motivation
Input data is large The whole Web, billions of Pages
Lots of machines Use them efficiently
![Page 40: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/40.jpg)
4040
A real example
Term frequencies through the whole Web repository.
Count of URL access frequency. Reverse web-link graph ….
![Page 41: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/41.jpg)
4141
Programming model Input & Output: each a set of key/value pairs Programmer specifies two functions: map (in_key, in_value) -> list(out_key,
intermediate_value) Processes input key/value pair Produces set of intermediate pairs reduce (out_key, list(intermediate_value)) ->
list(out_value) Combines all intermediate values for a particular key Produces a set of merged output values (usually just
one)
![Page 42: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/42.jpg)
4242
Example
Page 1: the weather is good Page 2: today is good Page 3: good weather is good.
![Page 43: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/43.jpg)
4343
Example: Count word occurrences
map(String input_key, String input_value): // input_key: document name // input_value: document contents for each word w in input_value: EmitIntermediate(w, "1");
reduce(String output_key, Iterator intermediate_values):
// output_key: a word // output_values: a list of counts int result = 0; for each v in intermediate_values: result += ParseInt(v); Emit(AsString(result));
![Page 44: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/44.jpg)
4444
Map output
Worker 1: (the 1), (weather 1), (is 1), (good 1).
Worker 2: (today 1), (is 1), (good 1).
Worker 3: (good 1), (weather 1), (is 1), (good 1).
![Page 45: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/45.jpg)
4545
Reduce Input
Worker 1: (the 1)
Worker 2: (is 1), (is 1), (is 1)
Worker 3: (weather 1), (weather 1)
Worker 4: (today 1)
Worker 5: (good 1), (good 1), (good 1), (good 1)
![Page 46: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/46.jpg)
4646
Reduce Output
Worker 1: (the 1)
Worker 2: (is 3)
Worker 3: (weather 2)
Worker 4: (today 1)
Worker 5: (good 4)
![Page 47: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/47.jpg)
4747
![Page 48: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/48.jpg)
4848
![Page 49: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/49.jpg)
4949
Fault tolerance
Typical cluster: 100s/1000s of 2-CPU x86 machines, 2-4 GB
of memory Storage is on local IDE disks GFS: distributed file system manages data
(SOSP'03) Job scheduling system: jobs made up of
tasks, scheduler assigns tasks to machines
Implementation is a C++ library linked into user programs)
![Page 50: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/50.jpg)
5050
Fault tolerance
On worker failure: Detect failure via periodic heartbeats Re-execute completed and in-progress map
tasks Re-execute in progress reduce tasks Task completion committed through master
Master failure: Could handle, but don't yet (master failure
unlikely)
![Page 51: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/51.jpg)
5151
Performance
Scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records): 150 seconds.
Sort 10^10 100-byte records (modeled after TeraSort benchmark): 839 seconds.
![Page 52: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/52.jpg)
5252
More and more MapReduce
![Page 53: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/53.jpg)
5353
Experience: Rewrite of Production Indexing System
Rewrote Google's production indexing system using MapReduce Set of 24 MapReduce operations New code is simpler, easier to understand MapReduce takes care of failures, slow
machines Easy to make indexing faster by adding
more machines
![Page 54: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/54.jpg)
5454
MapReduce Overview
MapReduce has proven to be a useful abstraction
Greatly simplifies large-scale computations at Google
Fun to use: focus on problem, let library deal w/ messy details
![Page 55: 1 INF 2914 Information Retrieval and Web Search Lecture 6: Index Construction These slides are adapted from Stanford’s class CS276 / LING 286 Information](https://reader033.vdocument.in/reader033/viewer/2022042615/56649d5f5503460f94a403ad/html5/thumbnails/55.jpg)
5555
Resources
MG Chapter 5 MapReduce: Simplified Data Processing on
Large Clusters, Jeffrey Dean and Sanjay Ghemawat
Indexing Shared Content in Information Retrieval Systems, A. Broder et. al., EDBT2006
High Performance Index Build Algorithms for Intranet Search Engines, M. F. Fontoura et. al., VLDB 2004