
Modern Information Retrieval

Chapter 9: Parallel and Distributed IR

Section 9.1: Introduction

Section 9.2.2: MIMD Architectures

Inverted Files

November 5, 1999

Summary

Introduction

Review of parallel computing and parallel program performance measures

Exploration of techniques for implementing inverted files on MIMD parallel architectures

Conclusion

Introduction

The volume of electronic text available online today is staggering. The WWW contains over 800 million pages of text, comprising nearly 6 terabytes of data (Nature, Vol. 400, 8 July 1999, www.nature.com).

As document collections grow larger, they become more expensive to manage with an information retrieval system.

To support the demanding requirements of modern search environments, we must turn to alternative architectures and algorithms.

Parallel Computing

Parallel computing is the simultaneous application of multiple processors to solve a single problem.

Flynn’s Taxonomy:

SISD: single instruction, single data

SIMD: single instruction, multiple data

MISD: multiple instruction, single data

MIMD: multiple instruction, multiple data

Parallel Program Performance Measures

Speedup

$$S = \frac{\text{Running time of best available sequential algorithm}}{\text{Running time of parallel algorithm}}$$

Amdahl’s Law

$$S \le \frac{1}{f + (1 - f)/N} \le \frac{1}{f}$$

where f is the fraction of the problem that must be computed sequentially;

N is the number of processors.
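A minimal sketch of the Amdahl bound, assuming only the symbols f and N from the slide. The function name `amdahl_speedup_bound` is introduced here for illustration and is not from the source.

```python
def amdahl_speedup_bound(f: float, n: int) -> float:
    """Upper bound on speedup, 1 / (f + (1 - f) / n).

    f: fraction of the problem that must be computed sequentially.
    n: number of processors.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / (f + (1.0 - f) / n)


# With 10% sequential work the bound stays below 1/f = 10,
# no matter how many processors are added.
for n in (1, 4, 16, 1024):
    print(n, round(amdahl_speedup_bound(0.1, n), 2))
```

The loop illustrates why the sequential fraction dominates: adding processors only shrinks the (1 - f)/N term.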

Efficiency

$$\text{Efficiency} = \frac{S}{N}$$

where S is speedup;

N is the number of processors.
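As a small worked illustration, speedup and efficiency can be computed from two measured running times. The timings used below are made-up values, and the function names are assumptions for this sketch.

```python
def speedup(t_sequential: float, t_parallel: float) -> float:
    """Running time of the best sequential algorithm over the parallel running time."""
    return t_sequential / t_parallel


def efficiency(s: float, n: int) -> float:
    """Speedup divided by the number of processors."""
    return s / n


# Hypothetical measurements: 120 s sequentially, 20 s on 8 processors.
s = speedup(120.0, 20.0)
print(s, efficiency(s, 8))
```

An efficiency of 1 would mean every processor is fully used; values below 1 reflect sequential work and coordination overhead.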

MIMD Architectures

MIMD architectures offer a great deal of flexibility in how parallelism is defined and exploited to solve a problem.

There are two ways in which a retrieval system can exploit a MIMD machine:

Parallel multitasking;

Partitioned parallel processing.

[Figure: Parallel multitasking on a MIMD machine. Labels: User Query, Result, Broker, Search Engine (several instances).]
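Parallel multitasking can be sketched as a broker that hands each incoming query, whole, to one of several independent search engines. This is only an illustration under stated assumptions: threads stand in for processors, and the toy collection `DOCS`, the Boolean AND matching, and the names `search_engine` and `broker` are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy collection shared by every search engine.
DOCS = {
    1: "pease porridge hot",
    2: "pease porridge cold",
    3: "pease porridge in the pot",
}


def search_engine(query: str) -> list[int]:
    """Evaluate one whole query sequentially: ids of documents containing every term."""
    terms = query.lower().split()
    return sorted(
        doc_id
        for doc_id, text in DOCS.items()
        if all(term in text.split() for term in terms)
    )


def broker(queries: list[str], n_engines: int = 2) -> dict[str, list[int]]:
    """Run several independent queries at once, one search engine per query."""
    with ThreadPoolExecutor(max_workers=n_engines) as pool:
        results = pool.map(search_engine, queries)
        return dict(zip(queries, results))


print(broker(["hot", "porridge pot", "cold"]))
```

Each query still takes as long as it would sequentially; what improves is the number of queries served per unit of time.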

[Figure: Partitioned parallel processing on a MIMD machine. Labels: User Query, Result, Broker, Subquery/Results, Search Process (several instances).]

Basic data elements processed by a search algorithm

Rows are documents; columns are indexing items.

        k1      k2      ...   ki      ...   kt
d1      w1,1    w2,1    ...   wi,1    ...   wt,1
d2      w1,2    w2,2    ...   wi,2    ...   wt,2
...     ...     ...     ...   ...     ...   ...
dj      w1,j    w2,j    ...   wi,j    ...   wt,j
...     ...     ...     ...   ...     ...   ...
dN      w1,N    w2,N    ...   wi,N    ...   wt,N

There are two possible methods for partitioning the data:

Document partitioning: the N documents are distributed across the P processors; each parallel process evaluates the query on the subcollection of N/P documents assigned to it;

Term partitioning: the t indexing items are distributed across the P processors; the evaluation process for each document is spread over multiple processors.
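The two methods correspond to slicing the document-by-term weight matrix by rows or by columns. The sketch below assumes contiguous, near-equal chunks, which the slides do not prescribe; `split_evenly`, `document_partition`, and `term_partition` are illustrative names.

```python
def split_evenly(items: list, p: int) -> list[list]:
    """Split items into p contiguous chunks whose sizes differ by at most one."""
    q, r = divmod(len(items), p)
    chunks, start = [], 0
    for i in range(p):
        end = start + q + (1 if i < r else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def document_partition(weights: list[list], p: int) -> list[list[list]]:
    """Each processor receives about N/P complete rows (documents)."""
    return split_evenly(weights, p)


def term_partition(weights: list[list], p: int) -> list[list[list]]:
    """Each processor receives about t/P columns (indexing items) of every row."""
    column_chunks = split_evenly(list(range(len(weights[0]))), p)
    return [[[row[i] for i in cols] for row in weights] for cols in column_chunks]


# 4 documents by 4 indexing items, split over P = 2 processors.
W = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(document_partition(W, 2))
print(term_partition(W, 2))
```

With row slices, a processor can score its documents completely on its own; with column slices, every document's score has to be assembled from several processors.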

Inverted Files: Logical Document Partitioning

Data Partitioning

The data partitioning is done logically, using essentially the same underlying inverted file index as in the original sequential algorithm;

The inverted file is extended to give each parallel process direct access to that portion of the index related to the processor’s subcollection of documents.

[Figure: Extended dictionary entry for document partitioning. The dictionary entry for item i holds pointers P1, P2, P3, and P4 into the inverted list for term i.]
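One way to picture the extended dictionary entry is as a set of per-processor slices into a single inverted list. The sketch assumes that the list is sorted by document identifier and that each processor owns a contiguous range of identifiers; both are assumptions made for illustration, as is the name `extended_entry`.

```python
from bisect import bisect_left


def extended_entry(postings: list[tuple[int, int]],
                   first_doc_ids: list[int]) -> list[tuple[int, int]]:
    """Return one (start, end) slice of the inverted list per processor.

    postings: (doc_id, term_frequency) pairs sorted by doc_id.
    first_doc_ids: first doc_id of each processor's subcollection, ascending.
    """
    doc_ids = [doc_id for doc_id, _ in postings]
    starts = [bisect_left(doc_ids, first) for first in first_doc_ids]
    ends = starts[1:] + [len(postings)]
    return list(zip(starts, ends))


# Inverted list for one term over six documents; three processors
# own documents 1-2, 3-4, and 5-6 respectively.
postings = [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 1)]
for p, (start, end) in enumerate(extended_entry(postings, [1, 3, 5]), start=1):
    print(f"P{p}", postings[start:end])
```

Each process can then jump straight to its own slice without scanning entries that belong to other subcollections.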

Query Evaluation

The broker initiates P parallel processes to evaluate the query;

Each process executes the same document scoring algorithm on its document subcollection;

The search processes record document scores in a single shared array of document score accumulators;

The broker produces the final ranked list of documents.
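The four steps above can be sketched as follows. Threads stand in for the P processes, the score is a plain sum of term frequencies, and the two-term `INDEX` with its per-processor slices is a hand-built example; all of these are assumptions for illustration rather than the method of the source.

```python
from concurrent.futures import ThreadPoolExecutor

# term -> (inverted list sorted by doc id, one (start, end) slice per processor).
# Processors own documents 1-2, 3-4, and 5-6 respectively.
INDEX = {
    "hot": ([(1, 1), (4, 1), (5, 1), (6, 1)], [(0, 1), (1, 2), (2, 4)]),
    "pot": ([(3, 1), (6, 1)], [(0, 0), (0, 1), (1, 2)]),
}


def search_process(p: int, query_terms: list[str], accumulators: list[int]) -> None:
    """Score only the documents in processor p's slice of each inverted list."""
    for term in query_terms:
        if term not in INDEX:
            continue
        postings, slices = INDEX[term]
        start, end = slices[p]
        for doc_id, tf in postings[start:end]:
            # No two processes share a document, so writes never collide.
            accumulators[doc_id] += tf


def broker(query_terms: list[str], n_docs: int = 6, n_procs: int = 3) -> list[tuple[int, int]]:
    """Start the search processes, then rank documents from the shared accumulators."""
    accumulators = [0] * (n_docs + 1)  # index 0 is unused; doc ids start at 1
    with ThreadPoolExecutor(max_workers=n_procs) as pool:
        futures = [pool.submit(search_process, p, query_terms, accumulators)
                   for p in range(n_procs)]
        for future in futures:
            future.result()  # re-raise any worker exception
    scored = [(doc_id, score) for doc_id, score in enumerate(accumulators) if score > 0]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


print(broker(["hot", "pot"]))
```

Because the document sets are disjoint, the shared accumulator array needs no locking in this sketch.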


Inverted File Construction

The indexer partitions the documents among the processors;

Each indexing process generates a batch of inverted lists, sorted by indexing item;

A merge step is performed to create the final inverted file.
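A possible sketch of this construction: each indexing process emits a batch of entries sorted by indexing item, and a k-way merge joins the batches into one inverted file. The regular-expression tokenizer, the use of threads, and the function names are assumptions for illustration.

```python
import heapq
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby


def index_batch(docs: dict[int, str]) -> list[tuple[str, int, int]]:
    """Index one partition: (term, doc_id, tf) entries sorted by term, then doc_id."""
    entries = []
    for doc_id, text in docs.items():
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        entries.extend((term, doc_id, tf) for term, tf in counts.items())
    return sorted(entries)


def build_inverted_file(partitions: list[dict[int, str]]) -> dict[str, list[tuple[int, int]]]:
    """Generate the batches in parallel, then merge them into the final inverted file."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        batches = list(pool.map(index_batch, partitions))
    merged = heapq.merge(*batches)  # k-way merge of already sorted batches
    return {
        term: [(doc_id, tf) for _, doc_id, tf in group]
        for term, group in groupby(merged, key=lambda entry: entry[0])
    }


partitions = [
    {1: "Pease porridge hot", 2: "Pease porridge cold"},
    {3: "Pease porridge in the pot", 4: "Pease porridge hot, pease porridge not cold"},
]
print(build_inverted_file(partitions)["pease"])
```

Since every batch is already sorted, the merge step is a single linear pass over the batches.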

Inverted Files: Physical Document Partitioning

Data Partitioning

The documents are physically partitioned into separate subcollections, one for each parallel processor;

Each subcollection has its own inverted file.

Query Evaluation

The broker distributes the query to all of the parallel search processes;

Each parallel search process evaluates the query on its portion of the document collection, producing an intermediate hit-list;

The broker collects the intermediate hit-lists from all of the parallel search processes and merges them into a final hit-list.
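The broker's role here can be sketched as a fan-out followed by a merge of sorted hit-lists. The three abridged per-partition inverted files, the sum-of-term-frequency score, and the names below are assumptions for illustration.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

# One independent inverted file per partition (abridged to three terms).
PARTITIONS = [
    {"cold": [(2, 1)], "hot": [(1, 1)]},
    {"cold": [(4, 1)], "hot": [(4, 1)], "not": [(4, 1)]},
    {"cold": [(5, 1)], "hot": [(5, 1), (6, 1)], "not": [(5, 1)]},
]


def search_process(index: dict, query_terms: list[str]) -> list[tuple[int, int]]:
    """Return an intermediate hit-list of (-score, doc_id), best hit first."""
    scores: dict[int, int] = {}
    for term in query_terms:
        for doc_id, tf in index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0) + tf
    return sorted((-score, doc_id) for doc_id, score in scores.items())


def broker(query_terms: list[str], k: int = 3) -> list[tuple[int, int]]:
    """Send the query to every partition and merge the hit-lists into the top k."""
    with ThreadPoolExecutor(max_workers=len(PARTITIONS)) as pool:
        hit_lists = list(pool.map(lambda index: search_process(index, query_terms),
                                  PARTITIONS))
    merged = heapq.merge(*hit_lists)  # each hit-list is already sorted
    return [(doc_id, -neg_score) for neg_score, doc_id in islice(merged, k)]


print(broker(["hot", "not"]))
```

Storing negated scores lets `heapq.merge`, which merges in ascending order, produce hits in descending score order.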


Inverted File Construction

Each processor creates, in parallel, its own complete index corresponding to its document partition;

A merge step is performed to accumulate the global statistics for all of the partitions and distribute them to each of the partition dictionaries.
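The slide does not say which global statistics are meant. As one plausible instance, the sketch below assumes the statistic is document frequency (the number of documents containing a term): local counts are summed, and the totals are written back into every partition dictionary. The function names are illustrative.

```python
from collections import Counter


def accumulate_global_df(local_dfs: list[dict[str, int]]) -> Counter:
    """Sum per-partition document frequencies into collection-wide totals."""
    total: Counter = Counter()
    for local in local_dfs:
        total.update(local)
    return total


def distribute(local_dfs: list[dict[str, int]], global_df: Counter) -> list[dict[str, int]]:
    """Give each partition dictionary the global value for the terms it holds."""
    return [{term: global_df[term] for term in local} for local in local_dfs]


# Local document frequencies for three partitions (abridged).
local_dfs = [
    {"cold": 1, "hot": 1, "pease": 2},
    {"cold": 1, "hot": 1, "not": 1, "pease": 2},
    {"cold": 1, "hot": 2, "not": 1, "pease": 2},
]
global_df = accumulate_global_df(local_dfs)
print(distribute(local_dfs, global_df)[0])
```

With global values in place, every partition weights a term the same way, so scores from different partitions stay comparable when the broker merges hit-lists.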


Inverted Files: Term Partitioning

Data Partitioning

Inverted lists are spread across the processors.
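How the lists are assigned to processors is not specified on the slide. The sketch below assumes a simple round-robin assignment over the sorted dictionary, purely for illustration; `assign_terms` and `partition_inverted_file` are hypothetical names.

```python
def assign_terms(terms, p: int) -> dict[str, int]:
    """Round-robin over the sorted terms (an assumed policy, not from the source)."""
    return {term: i % p for i, term in enumerate(sorted(terms))}


def partition_inverted_file(inverted_file: dict, p: int) -> list[dict]:
    """Move each complete inverted list to the processor that owns its term."""
    owner = assign_terms(inverted_file, p)
    local = [{} for _ in range(p)]
    for term, postings in inverted_file.items():
        local[owner[term]][term] = postings
    return local


inverted_file = {
    "cold": [(2, 1), (4, 1), (5, 1)],
    "hot": [(1, 1), (4, 1), (5, 1), (6, 1)],
    "in": [(3, 1), (6, 1)],
    "not": [(4, 1), (5, 1)],
}
for p, local in enumerate(partition_inverted_file(inverted_file, 3), start=1):
    print(f"P{p}", sorted(local))
```

Unlike document partitioning, each list stays whole, so the load on a processor depends on how long and how frequently queried its terms are.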

Query Evaluation

The query is decomposed into indexing items, and each indexing item is sent to the processor that holds the corresponding inverted list;

The processors create hit-lists with partial document scores and return them to the broker;

The broker combines the hit-lists.
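These steps can be sketched as routing each query term to its owning processor and summing the partial scores that come back. The term-to-processor layout in `PROCESSORS`, the sum-of-term-frequency score, and the function names are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical layout: each processor holds complete inverted lists for its terms.
PROCESSORS = [
    {"cold": [(2, 1), (4, 1), (5, 1)], "not": [(4, 1), (5, 1)]},
    {"hot": [(1, 1), (4, 1), (5, 1), (6, 1)]},
    {"in": [(3, 1), (6, 1)]},
]


def locate(term: str):
    """Index of the processor holding the term's inverted list, or None."""
    for p, index in enumerate(PROCESSORS):
        if term in index:
            return p
    return None


def partial_scores(p: int, terms: list[str]) -> dict[int, int]:
    """Processor p's hit-list of partial document scores for its share of the query."""
    scores: dict[int, int] = {}
    for term in terms:
        for doc_id, tf in PROCESSORS[p][term]:
            scores[doc_id] = scores.get(doc_id, 0) + tf
    return scores


def broker(query_terms: list[str]) -> list[tuple[int, int]]:
    """Decompose the query, collect partial hit-lists, and combine them."""
    by_processor: dict[int, list[str]] = {}
    for term in query_terms:
        p = locate(term)
        if p is not None:
            by_processor.setdefault(p, []).append(term)
    with ThreadPoolExecutor(max_workers=len(PROCESSORS)) as pool:
        futures = [pool.submit(partial_scores, p, terms)
                   for p, terms in by_processor.items()]
        partials = [future.result() for future in futures]
    total: dict[int, int] = {}
    for partial in partials:
        for doc_id, score in partial.items():
            total[doc_id] = total.get(doc_id, 0) + score
    return sorted(total.items(), key=lambda pair: (-pair[1], pair[0]))


print(broker(["hot", "not"]))
```

The broker does more work here than under document partitioning, because a document's final score is only known after all partial scores have been added.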


Inverted File Construction

The inverted file is created using the parallel construction technique described for logical document partitioning.


Example

Document collection

Document   Text
1          Pease porridge hot
2          Pease porridge cold
3          Pease porridge in the pot
4          Pease porridge hot, pease porridge not cold
5          Pease porridge cold, pease porridge not hot
6          Pease porridge hot in the pot

Example: Inverted File

Dictionary   Inverted Lists
cold         <2,1> <4,1> <5,1>
hot          <1,1> <4,1> <5,1> <6,1>
in           <3,1> <6,1>
not          <4,1> <5,1>
pease        <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>
porridge     <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>
pot          <3,1> <6,1>
the          <3,1> <6,1>
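For reference, the inverted file of this example can be reproduced with a short sequential sketch. Each entry pairs a document number with the term's frequency in that document; the lowercase, letters-only tokenizer is an assumption.

```python
import re
from collections import Counter, defaultdict

COLLECTION = {
    1: "Pease porridge hot",
    2: "Pease porridge cold",
    3: "Pease porridge in the pot",
    4: "Pease porridge hot, pease porridge not cold",
    5: "Pease porridge cold, pease porridge not hot",
    6: "Pease porridge hot in the pot",
}


def build_inverted_file(collection: dict[int, str]) -> dict[str, list[tuple[int, int]]]:
    """Map each term to its inverted list of (doc_id, term_frequency) pairs."""
    inverted = defaultdict(list)
    for doc_id in sorted(collection):
        counts = Counter(re.findall(r"[a-z]+", collection[doc_id].lower()))
        for term, tf in counts.items():
            inverted[term].append((doc_id, tf))
    return dict(sorted(inverted.items()))


for term, postings in build_inverted_file(COLLECTION).items():
    print(f"{term:<10}", " ".join(f"<{d},{tf}>" for d, tf in postings))
```

Processing documents in ascending order keeps every inverted list sorted by document number without a separate sort.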

Example: Logical Document Partitioning

Dictionary: cold, hot, in, not, pease, porridge, pot, the

The dictionary entry for term “pease” holds pointers for processors P1, P2, and P3 into its inverted list:

P1   <1,1> <2,1>
P2   <3,1> <4,2>
P3   <5,2> <6,1>

Example: Physical Document Partitioning

P1

Dictionary   Inverted Lists
cold         <2,1>
hot          <1,1>
pease        <1,1> <2,1>
porridge     <1,1> <2,1>

P2

Dictionary   Inverted Lists
cold         <4,1>
hot          <4,1>
in           <3,1>
not          <4,1>
pease        <3,1> <4,2>
porridge     <3,1> <4,2>
pot          <3,1>
the          <3,1>

P3

Dictionary   Inverted Lists
cold         <5,1>
hot          <5,1> <6,1>
in           <6,1>
not          <5,1>
pease        <5,2> <6,1>
porridge     <5,2> <6,1>
pot          <6,1>
the          <6,1>

Example: Term Partitioning

The inverted lists are distributed across processors P1, P2, and P3.

Dictionary   Inverted Lists
cold         <2,1> <4,1> <5,1>
hot          <1,1> <4,1> <5,1> <6,1>
in           <3,1> <6,1>
not          <4,1> <5,1>
pease        <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>
porridge     <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>
pot          <3,1> <6,1>
the          <3,1> <6,1>

Conclusion

The task of indexing and searching in very large text collections is costly;

Faster indexing and searching algorithms are always desirable, and the use of parallel hardware is an obvious alternative;

We discussed two possible organizations for the document collection index on a MIMD parallel architecture:

Document partitioning;

Term partitioning.

Document partitioning affords simpler inverted index construction and maintenance than term partitioning;

When term distributions in the documents and queries are more skewed, document partitioning performs better;

When terms are uniformly distributed in user queries, term partitioning performs better.

Additional References

Lawrence, S., Giles, C.L. 1999. Accessibility of Information on the Web. Nature, Vol. 400, pp. 107-109.

Ribeiro-Neto, B.A., Barbosa, R.A. 1998. Query Performance for Tightly Coupled Distributed Digital Libraries. Digital Libraries 98, pp. 182-190.

Ribeiro-Neto, B.A., Moura, E.S., Neubert, M.S., Ziviani, N. 1999. Efficient Distributed Algorithms to Build Inverted Files. SIGIR’99, pp. 105-112.
