modern information retrieval
![Page 1: Modern Information Retrieval](https://reader035.vdocument.in/reader035/viewer/2022081520/56814a50550346895db771b6/html5/thumbnails/1.jpg)
Modern Information Retrieval
Chapter 9: Parallel and Distributed IR
Section 9.1: Introduction
Section 9.2.2: MIMD Architectures
Inverted Files
November 5, 1999
![Page 2: Modern Information Retrieval](https://reader035.vdocument.in/reader035/viewer/2022081520/56814a50550346895db771b6/html5/thumbnails/2.jpg)
Summary
- Introduction
- Review of parallel computing and parallel program performance measures
- Exploration of techniques for implementing an inverted file on a MIMD parallel architecture
- Conclusion
![Page 3: Modern Information Retrieval](https://reader035.vdocument.in/reader035/viewer/2022081520/56814a50550346895db771b6/html5/thumbnails/3.jpg)
Introduction
- The volume of electronic text available online today is staggering. The WWW contains over 800 million pages of text, comprising nearly 6 terabytes of data (Nature, Vol. 400, 8 July 1999, www.nature.com).
- As document collections grow larger, they become more expensive to manage with an information retrieval system.
- To support the demanding requirements of modern search environments, we must turn to alternative architectures and algorithms.
![Page 4: Modern Information Retrieval](https://reader035.vdocument.in/reader035/viewer/2022081520/56814a50550346895db771b6/html5/thumbnails/4.jpg)
Parallel Computing
- Parallel computing is the simultaneous application of multiple processors to solve a single problem.
- Flynn’s Taxonomy:
  - SISD: single instruction, single data
  - SIMD: single instruction, multiple data
  - MISD: multiple instruction, single data
  - MIMD: multiple instruction, multiple data
![Page 5: Modern Information Retrieval](https://reader035.vdocument.in/reader035/viewer/2022081520/56814a50550346895db771b6/html5/thumbnails/5.jpg)
Parallel Program Performance Measures
Speedup

$$S = \frac{\text{Running time of best available sequential algorithm}}{\text{Running time of parallel algorithm}}$$

Amdahl’s Law

$$S \le \frac{1}{f + (1 - f)/N} \le \frac{1}{f}$$

where f is the fraction of the problem that must be computed sequentially;
N is the number of processors.
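A quick numeric check of Amdahl's Law can make the bound concrete. The sketch below is illustrative only: the function name `amdahl_speedup_bound` and the sample value f = 0.1 are assumptions, not part of the slides. It evaluates 1 / (f + (1 - f)/N) for a growing number of processors and shows the bound approaching 1/f.

```python
def amdahl_speedup_bound(f: float, n: int) -> float:
    """Upper bound on speedup from Amdahl's Law: 1 / (f + (1 - f) / n).

    f is the fraction of the problem that must run sequentially,
    n is the number of processors.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / (f + (1.0 - f) / n)


if __name__ == "__main__":
    # With 10% sequential work the bound can never exceed 1 / 0.1 = 10,
    # no matter how many processors are added.
    for n in (1, 4, 16, 1024):
        print(n, round(amdahl_speedup_bound(0.1, n), 2))
```

Even a small sequential fraction caps the achievable speedup, which is why the partitioning schemes in the rest of the deck try to keep the broker's sequential work small.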
![Page 6: Modern Information Retrieval](https://reader035.vdocument.in/reader035/viewer/2022081520/56814a50550346895db771b6/html5/thumbnails/6.jpg)
Parallel Program Performance Measures
Efficiency

$$\text{Efficiency} = \frac{S}{N}$$

where S is speedup;
N is the number of processors.
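As a small worked example of the two measures, the following sketch computes speedup from two running times and then divides by the processor count. The function names and the timings (120 s sequential, 20 s on 8 processors) are made up for illustration.

```python
def speedup(t_sequential: float, t_parallel: float) -> float:
    """Speedup S: best sequential running time over parallel running time."""
    return t_sequential / t_parallel


def efficiency(s: float, n: int) -> float:
    """Efficiency: speedup S divided by the number of processors N."""
    return s / n


# Hypothetical measurements: 120 s sequentially, 20 s on 8 processors.
s = speedup(120.0, 20.0)   # 6.0
e = efficiency(s, 8)       # 0.75
```

An efficiency of 1 would mean every processor is fully used; values below 1 reflect sequential work and coordination overhead.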
![Page 7: Modern Information Retrieval](https://reader035.vdocument.in/reader035/viewer/2022081520/56814a50550346895db771b6/html5/thumbnails/7.jpg)
MIMD Architectures
- MIMD architectures offer a great deal of flexibility in how parallelism is defined and exploited to solve a problem.
- There are two ways in which a retrieval system can exploit a MIMD machine:
  - Parallel multitasking;
  - Partitioned parallel processing.
![Page 8: Modern Information Retrieval](https://reader035.vdocument.in/reader035/viewer/2022081520/56814a50550346895db771b6/html5/thumbnails/8.jpg)
MIMD Architectures
[Figure: Parallel multitasking on a MIMD machine. Elements shown: user queries and results, a broker, and five search engines.]
![Page 9: Modern Information Retrieval](https://reader035.vdocument.in/reader035/viewer/2022081520/56814a50550346895db771b6/html5/thumbnails/9.jpg)
MIMD Architectures
[Figure: Partitioned parallel processing on a MIMD machine. Elements shown: a user query and result, a broker, subqueries/results, and five search processes.]
![Page 10: Modern Information Retrieval](https://reader035.vdocument.in/reader035/viewer/2022081520/56814a50550346895db771b6/html5/thumbnails/10.jpg)
MIMD Architectures
Basic data elements processed by a search algorithm (columns are indexing items, rows are documents):

|     | k1   | k2   | ... | ki   | ... | kt   |
|-----|------|------|-----|------|-----|------|
| d1  | w1,1 | w2,1 | ... | wi,1 | ... | wt,1 |
| d2  | w1,2 | w2,2 | ... | wi,2 | ... | wt,2 |
| ... | ...  | ...  | ... | ...  | ... | ...  |
| dj  | w1,j | w2,j | ... | wi,j | ... | wt,j |
| ... | ...  | ...  | ... | ...  | ... | ...  |
| dN  | w1,N | w2,N | ... | wi,N | ... | wt,N |
![Page 11: Modern Information Retrieval](https://reader035.vdocument.in/reader035/viewer/2022081520/56814a50550346895db771b6/html5/thumbnails/11.jpg)
MIMD Architectures
- There are two possible methods for partitioning the data:
  - Document partitioning: the N documents are distributed across the P processors; each parallel process evaluates the query on the subcollection of N/P documents assigned to it;
  - Term partitioning: the t indexing items are distributed across the P processors; the evaluation process for each document is spread over multiple processors.
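The two methods amount to slicing the document-by-term weight matrix along different axes. The sketch below is a toy illustration, assuming a dense list-of-lists matrix and contiguous, nearly equal blocks; the helper names (`split_range`, `document_partition`, `term_partition`) are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def split_range(n: int, p: int) -> List[range]:
    """Split indices 0..n-1 into p contiguous, nearly equal ranges."""
    return [range(k * n // p, (k + 1) * n // p) for k in range(p)]


def document_partition(weights: Matrix, p: int) -> List[Matrix]:
    """Give each processor a block of rows (documents) with all t columns."""
    return [[weights[j] for j in rows]
            for rows in split_range(len(weights), p)]


def term_partition(weights: Matrix, p: int) -> List[Matrix]:
    """Give each processor a block of columns (indexing items) for all rows."""
    t = len(weights[0]) if weights else 0
    return [[[row[i] for i in cols] for row in weights]
            for cols in split_range(t, p)]


# Toy matrix: N = 4 documents (rows), t = 6 indexing items (columns).
weights = [[float(10 * j + i) for i in range(6)] for j in range(4)]
by_doc = document_partition(weights, 2)   # 2 blocks of 2 documents x 6 terms
by_term = term_partition(weights, 2)      # 2 blocks of 4 documents x 3 terms
```

With document partitioning each block can score its documents independently; with term partitioning every block holds only part of each document's weights, so partial scores must be combined afterwards.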
![Page 12: Modern Information Retrieval](https://reader035.vdocument.in/reader035/viewer/2022081520/56814a50550346895db771b6/html5/thumbnails/12.jpg)
Inverted Files: Logical Document Partitioning
Data Partitioning
- The data partitioning is done logically, using essentially the same basic underlying inverted file index as in the original sequential algorithm;
- The inverted file is extended to give each parallel process direct access to that portion of the index related to the processor’s subcollection of documents.
![Page 13: Modern Information Retrieval](https://reader035.vdocument.in/reader035/viewer/2022081520/56814a50550346895db771b6/html5/thumbnails/13.jpg)
Inverted Files: Logical Document Partitioning
[Figure: Extended dictionary entry for document partitioning. The dictionary entry for term i holds pointers P1, P2, P3, and P4 into the inverted list of term i.]
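One way to picture the extended dictionary entry is as a list of offsets into a single inverted list, one per processor. The sketch below assumes that each inverted list is sorted by document id and that each processor owns a contiguous range of document ids; the function names and the boundary values are illustrative assumptions, not taken from the slides.

```python
from bisect import bisect_left
from typing import List, Tuple

Posting = Tuple[int, int]  # (document id, within-document frequency)


def extended_entry(postings: List[Posting], first_doc_ids: List[int]) -> List[int]:
    """Return one start offset per processor, plus a final end sentinel.

    first_doc_ids[k] is the first document id owned by processor k
    (ascending); postings must be sorted by document id.
    """
    doc_ids = [doc_id for doc_id, _ in postings]
    offsets = [bisect_left(doc_ids, first) for first in first_doc_ids]
    offsets.append(len(postings))
    return offsets


def sublist(postings: List[Posting], offsets: List[int], k: int) -> List[Posting]:
    """Portion of the inverted list that belongs to processor k."""
    return postings[offsets[k]:offsets[k + 1]]


# Six documents split over three processors: documents 1-2, 3-4, 5-6.
pease = [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 1)]
offsets = extended_entry(pease, [1, 3, 5])   # [0, 2, 4, 6]
```

Each search process can then jump straight to its own slice of the list without scanning postings that belong to other processors.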
![Page 14: Modern Information Retrieval](https://reader035.vdocument.in/reader035/viewer/2022081520/56814a50550346895db771b6/html5/thumbnails/14.jpg)
Inverted Files: Logical Document Partitioning
Query Evaluation
- The broker initiates P parallel processes to evaluate the query;
- Each process executes the same document scoring algorithm on its document subcollection;
- The search processes record document scores in a single shared array of document score accumulators;
- The broker produces the final ranked list of documents.
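The evaluation steps above can be sketched with threads standing in for the P parallel processes. This is a simplified illustration under stated assumptions: the score is just the sum of within-document term frequencies, the per-term offsets are precomputed as in an extended dictionary entry, and the names `evaluate`, `index`, and `offsets` are hypothetical.

```python
import threading
from typing import Dict, List, Tuple

Posting = Tuple[int, int]  # (document id, within-document frequency)


def evaluate(index: Dict[str, List[Posting]],
             offsets: Dict[str, List[int]],
             query_terms: List[str],
             num_docs: int,
             p: int) -> List[Tuple[int, int]]:
    """Score a query with p workers sharing one accumulator array."""
    scores = [0] * (num_docs + 1)  # shared accumulators, indexed by doc id

    def worker(k: int) -> None:
        # Worker k only touches accumulators of its own documents, so the
        # workers never write to the same slot and no lock is needed.
        for term in query_terms:
            if term not in index:
                continue
            lo, hi = offsets[term][k], offsets[term][k + 1]
            for doc_id, freq in index[term][lo:hi]:
                scores[doc_id] += freq

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Broker step: rank documents by score (ties broken by document id).
    hits = [(d, scores[d]) for d in range(1, num_docs + 1) if scores[d] > 0]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Two terms from a six-document collection, processors own docs 1-2, 3-4, 5-6.
index = {"cold": [(2, 1), (4, 1), (5, 1)],
         "hot": [(1, 1), (4, 1), (5, 1), (6, 1)]}
offsets = {"cold": [0, 1, 2, 3], "hot": [0, 1, 2, 4]}
ranking = evaluate(index, offsets, ["hot", "cold"], num_docs=6, p=3)
```

Because the accumulators are shared, the broker only has to sort the array at the end instead of merging separate hit-lists.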
![Page 15: Modern Information Retrieval](https://reader035.vdocument.in/reader035/viewer/2022081520/56814a50550346895db771b6/html5/thumbnails/15.jpg)
Inverted Files: Logical Document Partitioning
Inverted File Construction
- The indexer partitions the documents among the processors;
- Each indexing process generates a batch of inverted lists, sorted by indexing item;
- A merge step is performed to create the final inverted file.
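The construction steps can be sketched as a batch-then-merge pipeline. The code below is a minimal illustration, assuming a trivial lowercase tokenizer and using a thread pool in place of separate indexing processes; `index_batch` and `merge_batches` are hypothetical names.

```python
import heapq
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby
from typing import Dict, List, Tuple

Entry = Tuple[str, int, int]  # (term, document id, frequency)


def index_batch(docs: List[Tuple[int, str]]) -> List[Entry]:
    """Index one partition; return entries sorted by term, then doc id."""
    entries: List[Entry] = []
    for doc_id, text in docs:
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        entries.extend((term, doc_id, freq) for term, freq in counts.items())
    return sorted(entries)


def merge_batches(batches: List[List[Entry]]) -> Dict[str, List[Tuple[int, int]]]:
    """Merge the sorted batches into one inverted file."""
    merged = heapq.merge(*batches)  # streams entries in (term, doc id) order
    return {term: [(doc_id, freq) for _, doc_id, freq in group]
            for term, group in groupby(merged, key=lambda e: e[0])}


partitions = [
    [(1, "Pease porridge hot"), (2, "Pease porridge cold")],
    [(3, "Pease porridge in the pot"),
     (4, "Pease porridge hot, pease porridge not cold")],
    [(5, "Pease porridge cold, pease porridge not hot"),
     (6, "Pease porridge hot in the pot")],
]
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    batches = list(pool.map(index_batch, partitions))
inverted_file = merge_batches(batches)
```

Since every batch is already sorted by indexing item, the merge is a single linear pass over the batches.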
![Page 16: Modern Information Retrieval](https://reader035.vdocument.in/reader035/viewer/2022081520/56814a50550346895db771b6/html5/thumbnails/16.jpg)
Inverted Files: Physical Document Partitioning
Data Partitioning
- The documents are physically partitioned into separate subcollections, one for each parallel processor;
- Each subcollection has its own inverted file.
![Page 17: Modern Information Retrieval](https://reader035.vdocument.in/reader035/viewer/2022081520/56814a50550346895db771b6/html5/thumbnails/17.jpg)
Inverted Files: Physical Document Partitioning
Query Evaluation
- The broker distributes the query to all of the parallel search processes;
- Each parallel search process evaluates the query on its portion of the document collection, producing an intermediate hit-list;
- The broker collects the intermediate hit-lists from all of the parallel search processes and merges them into a final hit-list.
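A rough sketch of this scatter-and-merge flow follows. It assumes each partition returns its hit-list already sorted by descending score, so the broker can do a simple k-way merge; the scoring (summed term frequencies) and the names `search_partition`, `rank_key`, and `broker` are illustrative assumptions. The partitions are evaluated in a plain loop here, where a real system would run them in parallel.

```python
import heapq
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Posting = Tuple[int, int]   # (document id, within-document frequency)
Hit = Tuple[int, int]       # (document id, score)
InvertedFile = Dict[str, List[Posting]]


def rank_key(hit: Hit) -> Tuple[int, int]:
    """Order hits by descending score, then ascending document id."""
    return (-hit[1], hit[0])


def search_partition(inverted_file: InvertedFile, query_terms: List[str]) -> List[Hit]:
    """Evaluate the query on one subcollection; return a sorted hit-list."""
    scores: Dict[int, int] = defaultdict(int)
    for term in query_terms:
        for doc_id, freq in inverted_file.get(term, []):
            scores[doc_id] += freq
    return sorted(scores.items(), key=rank_key)


def broker(partitions: List[InvertedFile], query_terms: List[str],
           top_k: Optional[int] = None) -> List[Hit]:
    """Send the query to every partition and merge the intermediate hit-lists."""
    hit_lists = [search_partition(p, query_terms) for p in partitions]
    merged = list(heapq.merge(*hit_lists, key=rank_key))
    return merged[:top_k]


# Local inverted files for documents 1-2, 3-4, and 5-6 (two terms only).
p1 = {"cold": [(2, 1)], "hot": [(1, 1)]}
p2 = {"cold": [(4, 1)], "hot": [(4, 1)]}
p3 = {"cold": [(5, 1)], "hot": [(5, 1), (6, 1)]}
final_hits = broker([p1, p2, p3], ["hot", "cold"])
```

Passing `top_k` lets the broker truncate the merged list when only the best few documents are needed.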
![Page 18: Modern Information Retrieval](https://reader035.vdocument.in/reader035/viewer/2022081520/56814a50550346895db771b6/html5/thumbnails/18.jpg)
Inverted Files: Physical Document Partitioning
Inverted File Construction
- Each processor creates, in parallel, its own complete index corresponding to its document partition;
- A merge step is performed to accumulate the global statistics for all of the partitions and distribute them to each of the partition dictionaries.
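The slides do not say which global statistics are accumulated. As one plausible example, the sketch below assumes the statistic is the document frequency of each term (the number of documents containing it), which a partition cannot know from its local index alone. All names here (`local_doc_freqs`, `merge_statistics`, `distribute`, `global_df`) are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Posting = Tuple[int, int]  # (document id, within-document frequency)
InvertedFile = Dict[str, List[Posting]]


def local_doc_freqs(inverted_file: InvertedFile) -> Counter:
    """Document frequency of each term within one partition."""
    return Counter({term: len(postings)
                    for term, postings in inverted_file.items()})


def merge_statistics(partitions: List[InvertedFile]) -> Dict[str, int]:
    """Merge step: sum the local document frequencies over all partitions."""
    total: Counter = Counter()
    for inverted_file in partitions:
        total.update(local_doc_freqs(inverted_file))
    return dict(total)


def distribute(partitions: List[InvertedFile],
               global_df: Dict[str, int]) -> List[Dict[str, dict]]:
    """Attach the global statistic to each partition's dictionary entries."""
    return [{term: {"global_df": global_df[term], "postings": postings}
             for term, postings in inverted_file.items()}
            for inverted_file in partitions]


p1 = {"cold": [(2, 1)], "hot": [(1, 1)]}
p2 = {"cold": [(4, 1)], "hot": [(4, 1)]}
p3 = {"cold": [(5, 1)], "hot": [(5, 1), (6, 1)]}
global_df = merge_statistics([p1, p2, p3])
dictionaries = distribute([p1, p2, p3], global_df)
```

With the global values stored locally, each partition can compute collection-wide term weights without contacting the others at query time.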
![Page 19: Modern Information Retrieval](https://reader035.vdocument.in/reader035/viewer/2022081520/56814a50550346895db771b6/html5/thumbnails/19.jpg)
Inverted Files: Term Partitioning
Data Partitioning
- Inverted lists are spread across the processors.
![Page 20: Modern Information Retrieval](https://reader035.vdocument.in/reader035/viewer/2022081520/56814a50550346895db771b6/html5/thumbnails/20.jpg)
Inverted Files: Term Partitioning
Query Evaluation
- The query is decomposed into indexing items, and each indexing item is sent to the processor that holds the corresponding inverted list;
- The processors create hit-lists with partial document scores and return them to the broker;
- The broker combines the hit-lists.
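The routing and combining steps might look like the sketch below. It assumes an explicit term-to-processor map, partial scores that are plain sums of term frequencies, and a broker that adds partial scores per document; the names `process_terms`, `broker`, and `assignment`, and the particular assignment of terms to processors, are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Posting = Tuple[int, int]  # (document id, within-document frequency)


def process_terms(local_lists: Dict[str, List[Posting]],
                  terms: List[str]) -> Dict[int, int]:
    """One processor: build a hit-list of partial scores for its terms."""
    partial: Dict[int, int] = defaultdict(int)
    for term in terms:
        for doc_id, freq in local_lists.get(term, []):
            partial[doc_id] += freq
    return dict(partial)


def broker(processors: List[Dict[str, List[Posting]]],
           assignment: Dict[str, int],
           query_terms: List[str]) -> List[Tuple[int, int]]:
    """Route each query term to its owner and combine the partial scores."""
    routed: Dict[int, List[str]] = defaultdict(list)
    for term in query_terms:
        if term in assignment:           # unknown terms match nothing
            routed[assignment[term]].append(term)

    totals: Dict[int, int] = defaultdict(int)
    for proc_id, terms in routed.items():
        for doc_id, score in process_terms(processors[proc_id], terms).items():
            totals[doc_id] += score
    return sorted(totals.items(), key=lambda hit: (-hit[1], hit[0]))


# Hypothetical assignment: processor 0 holds "cold" and "hot", processor 1 "not".
processors = [
    {"cold": [(2, 1), (4, 1), (5, 1)],
     "hot": [(1, 1), (4, 1), (5, 1), (6, 1)]},
    {"not": [(4, 1), (5, 1)]},
]
assignment = {"cold": 0, "hot": 0, "not": 1}
hits = broker(processors, assignment, ["hot", "not"])
```

Only processors that own a query term do any work, so a short query may leave most processors idle while a popular term concentrates load on one of them.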
![Page 21: Modern Information Retrieval](https://reader035.vdocument.in/reader035/viewer/2022081520/56814a50550346895db771b6/html5/thumbnails/21.jpg)
Inverted Files: Term Partitioning
Inverted File Construction
- The inverted file is created using the parallel construction technique described for logical document partitioning.
![Page 22: Modern Information Retrieval](https://reader035.vdocument.in/reader035/viewer/2022081520/56814a50550346895db771b6/html5/thumbnails/22.jpg)
Example
Document collection

| Document | Text |
|----------|------|
| 1 | Pease porridge hot |
| 2 | Pease porridge cold |
| 3 | Pease porridge in the pot |
| 4 | Pease porridge hot, pease porridge not cold |
| 5 | Pease porridge cold, pease porridge not hot |
| 6 | Pease porridge hot in the pot |
![Page 23: Modern Information Retrieval](https://reader035.vdocument.in/reader035/viewer/2022081520/56814a50550346895db771b6/html5/thumbnails/23.jpg)
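Example: Inverted File
Each entry <d,f> gives a document number d and the number of occurrences f of the term in that document.

| Dictionary | Inverted Lists |
|------------|----------------|
| cold | <2,1> <4,1> <5,1> |
| hot | <1,1> <4,1> <5,1> <6,1> |
| in | <3,1> <6,1> |
| not | <4,1> <5,1> |
| pease | <1,1> <2,1> <3,1> <4,2> <5,2> <6,1> |
| porridge | <1,1> <2,1> <3,1> <4,2> <5,2> <6,1> |
| pot | <3,1> <6,1> |
| the | <3,1> <6,1> |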
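The inverted file for the six-document example can be reproduced with a few lines of code. This is a minimal sequential sketch, assuming a tokenizer that lowercases the text and keeps alphabetic runs only; `build_inverted_file` is a hypothetical helper name.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

DOCUMENTS = {
    1: "Pease porridge hot",
    2: "Pease porridge cold",
    3: "Pease porridge in the pot",
    4: "Pease porridge hot, pease porridge not cold",
    5: "Pease porridge cold, pease porridge not hot",
    6: "Pease porridge hot in the pot",
}


def build_inverted_file(documents: Dict[int, str]) -> Dict[str, List[Tuple[int, int]]]:
    """Map each term to its list of (document id, frequency) pairs."""
    inverted: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for doc_id in sorted(documents):          # keeps each list sorted by doc id
        counts = Counter(re.findall(r"[a-z]+", documents[doc_id].lower()))
        for term, freq in counts.items():
            inverted[term].append((doc_id, freq))
    return dict(sorted(inverted.items()))     # dictionary in alphabetical order


inverted_file = build_inverted_file(DOCUMENTS)
for term, postings in inverted_file.items():
    print(term, " ".join(f"<{d},{f}>" for d, f in postings))
```

The dictionary keys correspond to the left column of the example and the posting lists to the right column.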
![Page 24: Modern Information Retrieval](https://reader035.vdocument.in/reader035/viewer/2022081520/56814a50550346895db771b6/html5/thumbnails/24.jpg)
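Example: Logical Document Partitioning
[Figure: The dictionary lists the terms cold, hot, in, not, pease, porridge, pot, and the. The entry for “pease” holds pointers P1, P2, and P3 into the inverted list of the term “pease”: <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>.]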
![Page 25: Modern Information Retrieval](https://reader035.vdocument.in/reader035/viewer/2022081520/56814a50550346895db771b6/html5/thumbnails/25.jpg)
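Example: Physical Document Partitioning
Each processor holds its own inverted file, whose dictionary contains only the terms that occur in its subcollection (a dash marks a term absent from that processor's dictionary).

| Term | P1 | P2 | P3 |
|------|----|----|----|
| cold | <2,1> | <4,1> | <5,1> |
| hot | <1,1> | <4,1> | <5,1> <6,1> |
| in | - | <3,1> | <6,1> |
| not | - | <4,1> | <5,1> |
| pease | <1,1> <2,1> | <3,1> <4,2> | <5,2> <6,1> |
| porridge | <1,1> <2,1> | <3,1> <4,2> | <5,2> <6,1> |
| pot | - | <3,1> | <6,1> |
| the | - | <3,1> | <6,1> |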
![Page 26: Modern Information Retrieval](https://reader035.vdocument.in/reader035/viewer/2022081520/56814a50550346895db771b6/html5/thumbnails/26.jpg)
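Example: Term Partitioning
The inverted lists below are spread across the processors P1, P2, and P3; each processor holds the complete lists for its share of the terms.

| Dictionary | Inverted Lists |
|------------|----------------|
| cold | <2,1> <4,1> <5,1> |
| hot | <1,1> <4,1> <5,1> <6,1> |
| in | <3,1> <6,1> |
| not | <4,1> <5,1> |
| pease | <1,1> <2,1> <3,1> <4,2> <5,2> <6,1> |
| porridge | <1,1> <2,1> <3,1> <4,2> <5,2> <6,1> |
| pot | <3,1> <6,1> |
| the | <3,1> <6,1> |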
![Page 27: Modern Information Retrieval](https://reader035.vdocument.in/reader035/viewer/2022081520/56814a50550346895db771b6/html5/thumbnails/27.jpg)
Conclusion
- The task of indexing and searching in very large text collections is costly;
- Faster indexing and searching algorithms are always desirable, and the use of parallel hardware is an obvious alternative;
- We discussed two possible organizations for the document collection index on a MIMD parallel architecture:
  - Document partitioning;
  - Term partitioning.
![Page 28: Modern Information Retrieval](https://reader035.vdocument.in/reader035/viewer/2022081520/56814a50550346895db771b6/html5/thumbnails/28.jpg)
Conclusion
- Document partitioning affords simpler inverted index construction and maintenance than term partitioning;
- When term distributions in the documents and queries are more skewed, document partitioning performs better;
- When terms are uniformly distributed in user queries, term partitioning performs better.
![Page 29: Modern Information Retrieval](https://reader035.vdocument.in/reader035/viewer/2022081520/56814a50550346895db771b6/html5/thumbnails/29.jpg)
Additional References
- Lawrence, S., Giles, C.L. 1999. Accessibility of Information on the Web. Nature. Vol. 400. pp. 107-109.
- Ribeiro-Neto, B.A., Barbosa, R.A. 1998. Query Performance for Tightly Coupled Distributed Digital Libraries. Digital Libraries 98. pp. 182-190.
- Ribeiro-Neto, B.A., Moura, E.S., Neubert, M.S., Ziviani, N. 1999. Efficient Distributed Algorithms to Build Inverted Files. SIGIR’99. pp. 105-112.