on large-scale retrieval tasks with ivory and mapreduce

Tamer ElsayedQatar University

On Large-Scale Retrieval Taskswith Ivory and MapReduce

Nov 7th, 2012

My Field …

Information Retrieval (IR) is …Finding material (usually documents)

of an unstructured nature (usually text) that satisfies an information need from within large collections

Quite effective (at some things) Highly visible (mostly) Commercially successful (some of them)

IR is not just “Document Retrieval” Clustering and Classification Question answering Filtering, tracking, routing Recommender systems Leveraging XML and other Metadata Text mining Novelty identification Meta-search (multi-collection searching) Summarization Cross-language mechanisms Evaluation techniques Multimedia retrieval Social media analysis …

My Research …

Large-ScaleProcessing

emails

+ web pages

CLuEWebIdentity

Resolution

WebSearch

~500,000

~1,000,000,000

User Application

Back in 2009 … Before 2009, small text collections are available● Largest: ~ 1M documents

ClueWeb09● Crawled by CMU in 2009● ~ 1B documents !● need to move to cluster environments

MapReduce/Hadoop seems like promising framework

E2E Search Toolkit using MapReduce Completely designed for the Hadoop environment Experimental Platform for research Supports common text collections

● + ClueWeb09 Open source release Implements state-of-the-art retrieval models

http://ivory.ccIvory

MapReduce Framework

reduce

output

Shuffling

group values by: [keys]

(a) Map (b) Shuffle (c) Reduce

(k2, [v2])(k1, v1)

[(k3, v3)][k2, v2]

Framework handles “everything else” !

The IR Black Box

DocumentsQuery

Inside the IR Black Box

DocumentsQuery

RepresentationFunction

Query Representation Document Representation

ComparisonFunction Index

offlineonline

Indexing

ClintonCheney

ClintonObamaClinton

ClintonBarackObama

CCheney

Barack

ClintonA, 2

C, 1B, 1

A, 1C, 1

Collection Inverted IndexDocuments, IDs Terms, Posting Lists

Indexing

ClintonRomney

ClintonObamaClinton

ClintonBarackObama

CRomney

Barack

ClintonA, 2

C, 1B, 1

A, 1C, 1

Collection Inverted IndexDocuments, IDs Terms, Posting Lists

Indexing(a) Map (b) Shuffle (c) Reduce

Clinton

Romney

Clinton

Barack

Clinton

Romney

Barack

Romney

Barack

Clinton

ClintonRomney

ClintonBarackObama

ClintonObamaClinton

Shuffl

reducemap

mapreduce

reduce

ClintonObamaClinton

ClintonRomney

ClintonBarackObama

Retrieval Directly from HDFS!

Cute hack: use Hadoop to launch partition servers● Embed an HTTP server inside each mapper● Mappers start up, initialize servers, enter into infinite service loop!

Why do this?● Unified Hadoop ecosystem● Simplifies data management issues

PartitionServer

RetrievalBroker

SearchClient

HDFSdatanode

HDFSnamenode

PartitionServer

Local Disk

TREC’10

TREC’09

RoadmapIndexing

& Retrieval

• Batch Retrieval• Approx. Pos. Indexes

Pairwise Similarity

• Monolingual• Cross-Lingual

Pseudo Test

Collection• Training L2R

Iterative Process

• iHadoop

SIGIR 2011

CIKM 2011

ACL 2008

TREC 2009TREC 2010

CloudCom 2011

RoadmapIndexing

& Retrieval

Pairwise Similarity

Pseudo Test

Iterative Process

• iHadoop

SIGIR 2011ACL 2008

Abstract Problem

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

0.200.300.540.210.000.340.340.130.74

Applications: Clustering Coreference resolution “more-like-that” queries

Decomposition

reduce

Each term contributes only if appears in

Pairwise Similarity(a) Generate pairs (b) Group pairs (c) Sum pairs

Clinton

Barack

Romney

Terms: Zipfian Distribution

term rank

each term t contributes o(dft2) partial results

very few terms dominate the computations

most frequent term (“said”) 3%most frequent 10 terms 15%

most frequent 100 terms 57%most frequent 1000 terms 95%

~0.1% of total terms(99.9% df-cut)

Efficiency (disk space)

0 10 20 30 40 50 60 70 80 90 100Corpus Size (%)

no df-cutdf-cut at 99.999%df-cut at 99.99%df-cut at 99.9%df-cut at 99%

8 trillionintermediate pairs

0.5 trillion intermediate pairs

Hadoop, 19 PCs, each w/: 2 single-core processors, 4GB memory, 100GB disk

Aquaint-2 Collection, ~ 906k doc

EffectivenessEffect of df-cut on effectiveness

Medline04 - 909k abstracts- Ad-hoc retrieval

50556065707580859095

99.00 99.10 99.20 99.30 99.40 99.50 99.60 99.70 99.80 99.90 100.00df-cut (%)

Drop 0.1% of terms“Near-Linear” Growth

Fit on diskCost 2% in Effectiveness

Hadoop, 19 PCs, each w/: 2 single-core processors, 4GB memory, 100GB disk

ACL’08

Cross-Lingual Pairwise Similarity Find similar document pairs in different languages

Multilingual text mining, Machine Translation

Application: automatic generation of potential “interwiki” language links

More difficult than monolingual!

Vocabulary Space Matching

MTDoc A

MT translate

doc vector A

German English

English

doc vector B

CLIR project

doc vector A

German

English

doc vector B

doc vector ACLIR

fdfefpedf

ftfefpetf

)()|()(

Locality-Sensitive Hashing (LSH) Cosine score is a good similarity measure but expensive! LSH is a method for effectively reducing the search

space when looking for similar pairs Each vector is converted into a compact representation,

called a signature

A sliding window-based algorithm uses these signatures to search for similar articles in the collection

Vectors close to each other are likely to have similar signatures

Solution Overview

CLIRprojection

Nf German articles

Englisharticles

Preprocess

English document

vectors

SignaturesSignature

generation

Sliding window

algorithm

on large-scale retrieval tasks with ivory and mapreduce

Documents

mapreduce and hadoop file...

tamer elsayed qatar university on large-scale retrieval...

ee324 distributed systems fall 2015 mapreduce. overview 2 ...

mapreduce · 2020. 7. 22. · hadoop is an implementation...

syringa reticulata ‘ivory silk’: ‘ivory silk’...

powerpoint presentation€¦ · shine tower 1st floor...

agony and ivory: ivory trade is back

ivory travertine - patara...

hadoop/mapreduce - 123seminarsonly.comhadoop mapreduce •...

mapreduce with scalding · java mapreduce word count...

ivory/nude, ivory, white - marieesdenantes.fr · me136 |...

pipelined-mapreduce an improved mapreduce

identification guide for ivory and ivory substitutes

elephant ivory trafficking in california,...

google’s mapreduce programming model —...

media retrieval information retrieval image retrieval video...

1. introduction to mapreduce -...

data integration and information retrieval: moving from the...

processing with what is mapreduce? hadoop/mapreduce

mapreduce vs pig | mapreduce pig integration