
G.Skobeltsyn | Query-Driven Indexing for P2P Text Retrieval

Query-Driven Indexing for P2P Text Retrieval

The Future of Web Search, 19.07.2007, Bertinoro, Italy

Gleb Skobeltsyn, EPFL, Switzerland

June 19, 2007

Joint work with:
• Toan Luu
• Ivana Podnar Žarko
• Martin Rajman
• Karl Aberer


Goal

• Our goal is to achieve scalable full-text retrieval with structured P2P networks (DHTs)

Each peer:
• Provides resources (bandwidth, storage)
• Searches the whole network
• Publishes its own documents

Naïve (single-term) approach

... is to distribute the global inverted index in a DHT using term partitioning:

Query: “epfl & gleb”

h(“epfl”)-{d1,d2}

h(“gleb”)-{d2,d3}

h(t’)-{d4,d5}

This slide was borrowed from B. T. Loo, J. M. Hellerstein, R. Huebsch, S. Shenker, I. Stoica presentation: Enhancing P2P File-Sharing with an Internet-Scale Query Processor

{d1,d2}
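A minimal sketch of the naïve approach, assuming an in-memory stand-in for the DHT. The class `ToyDHT`, the function `conjunctive_query` and the choice of SHA-1 as h() are illustrative assumptions, not part of the original system.

```python
import hashlib


class ToyDHT:
    """In-memory stand-in for a DHT: each term is hashed onto one of the peers."""

    def __init__(self, num_peers):
        self.peers = [dict() for _ in range(num_peers)]

    def peer_for(self, term):
        # h(term): hash the term and map it to the responsible peer
        digest = hashlib.sha1(term.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.peers)

    def publish(self, term, doc_id):
        self.peers[self.peer_for(term)].setdefault(term, set()).add(doc_id)

    def lookup(self, term):
        return self.peers[self.peer_for(term)].get(term, set())


def conjunctive_query(dht, terms):
    """Fetch the full posting list of every term and intersect them."""
    lists = [dht.lookup(t) for t in terms]
    return set.intersection(*lists) if lists else set()


dht = ToyDHT(num_peers=4)
for term, docs in {"epfl": ["d1", "d2"], "gleb": ["d2", "d3"]}.items():
    for doc in docs:
        dht.publish(term, doc)

result = conjunctive_query(dht, ["epfl", "gleb"])
print(sorted(result))
```

Note that whole posting lists travel between peers before the intersection, which is the bandwidth cost the multi-term approach tries to avoid.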


Single-term vs. multi-term P2P indexing

[Figure: Single-term indexing keeps M terms (term 1 ... term M), each with a long posting list. Multi-term indexing keeps multi-term keys (key 11 ... key 1i on PEER 1, ..., key N1 ... key Nj on PEER N), each with a short posting list. Both indexes are distributed over PEER 1 ... PEER N.]

• With multi-term keys the vocabulary size could grow exponentially!
• How to choose keys to keep a satisfactory retrieval quality?

Multi-term indexing: framework

• Each peer is responsible for a set of keys assigned by the underlying DHT using the standard hashing mechanism
• Each key corresponds to a term or a set of terms
• Each key is associated with a truncated posting list (TPL) that stores at most DFmax top-ranked document references
• The distributed index contains {key, TPL} pairs
• The indexing load is handled by an optimized DHT layer:
  F. Klemm, J.-Y. Le Boudec, D. Kostic, K. Aberer, Improving the Throughput of Distributed Hash Tables Using Congestion-Aware Routing, in IPTPS'07
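The {key, TPL} pairs can be sketched as follows. This is only an illustration under assumptions: the class name `TruncatedPostingList`, its `truncated` flag and the use of a min-heap are hypothetical choices, and the scores are made-up numbers.

```python
import heapq


class TruncatedPostingList:
    """Keeps at most df_max (score, doc_id) entries, dropping the lowest-scored ones."""

    def __init__(self, df_max):
        self.df_max = df_max
        self._heap = []         # min-heap: the weakest entry sits on top
        self.truncated = False  # becomes True once an entry had to be dropped

    def add(self, doc_id, score):
        if len(self._heap) < self.df_max:
            heapq.heappush(self._heap, (score, doc_id))
        else:
            self.truncated = True
            # push the new entry, then evict the lowest-scored one
            heapq.heappushpop(self._heap, (score, doc_id))

    def top(self):
        """Document references, best score first."""
        return [doc_id for _, doc_id in sorted(self._heap, reverse=True)]


tpl = TruncatedPostingList(df_max=2)
for doc_id, score in [("d1", 0.5), ("d2", 0.9), ("d3", 0.7)]:
    tpl.add(doc_id, score)
print(tpl.top(), tpl.truncated)
```

Tracking whether a list was truncated is convenient because key generation depends on it.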


Multi-term indexing techniques

• Indexing with Highly Discriminative Keys (HDKs), based on:
  – Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys, I. Podnar, M. Rajman, T. Luu, F. Klemm, K. Aberer, in ICDE’07
  – Beyond term indexing: A P2P framework for Web information retrieval, I. Podnar, M. Rajman, T. Luu, F. Klemm, K. Aberer, Informatica, vol. 30, no. 2, 2006
• Query-Driven Indexing (QDI), based on:
  – Web Text Retrieval with a P2P Query-Driven Index, G. Skobeltsyn, T. Luu, I. Podnar Žarko, M. Rajman, K. Aberer, in SIGIR’07
  – Query-Driven Indexing for Scalable Peer-to-Peer Text Retrieval, G. Skobeltsyn, T. Luu, I. Podnar Žarko, M. Rajman, K. Aberer, in Infoscale’07

Indexing with HDK

• Data-driven key generation:
  – Each time a new document is indexed, the posting list for some key k can reach the maximum size DFmax
  – This triggers the generation of new keys (k + other frequent keys)
• Use a number of filters to reduce the number of keys, e.g.:
  – Proximity filter: a document qualifies for a key t1&t2 if t1 is close to t2 (as specified by a window size w)
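A small sketch of the proximity filter. The slides do not define the window precisely, so this assumes that "close" means both terms fit into a window of w consecutive tokens; the function name `passes_proximity_filter` is hypothetical.

```python
def passes_proximity_filter(doc_terms, t1, t2, w):
    """True if some occurrence of t1 and some occurrence of t2 fit into
    a window of w consecutive tokens (assumed definition of 'close')."""
    pos1 = [i for i, t in enumerate(doc_terms) if t == t1]
    pos2 = [i for i, t in enumerate(doc_terms) if t == t2]
    return any(abs(i - j) < w for i in pos1 for j in pos2)


doc = "gleb works at epfl in lausanne".split()
print(passes_proximity_filter(doc, "gleb", "epfl", w=4))
print(passes_proximity_filter(doc, "gleb", "epfl", w=3))
```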

Indexing with HDK

• Pros:
  – The ICDE’07 paper proves that the number of keys grows linearly
  – Elegant key generation mechanism
  – Low bandwidth during query processing (PLs of limited [size])
• Cons:
  – In practice the number of keys is LARGE: 68M for 0.6M docs
  – High bandwidth consumption at indexing
• Problem:
  – Too many keys are superfluous (almost never used)

Query-Driven Indexing

Let's index only what is queried!

Contents

• Introduction
• Single-term vs. multi-term indexing
• HDK approach for indexing
• Query-driven approach for indexing/retrieval
  – Indexing structure
  – Example
  – Scalability
  – Evaluation
• Conclusion

Query-Driven Index (QDI)

• The Query-Driven Indexing strategy solves the “Too-Many-Keys” problem:
  – Avoids maintenance of superfluous keys
  – Generates only those keys that are requested by users
  – Utilizes the query log to discover such keys
• Problems:
  – Indexing a new key requires a bandwidth-efficient mechanism to obtain the top-k posting list associated with the key
    → Smart Broadcast (ONM), or conventional intersection like TA, but less frequent
  – An incomplete index degrades the quality of query results
    → Show that the degradation is low

Which keys to index?

• Each single term found in the document collection has to be indexed.
  – We call the set of all single-term keys the basic single-term index.
  – The posting lists are truncated at DFmax.
• A key k is non-superfluous and can be activated iff:
  – k is popular: QF(k) ≥ QFmin, where QF(k) is the popularity of the key k derived from the available query log and QFmin is a parameter of our model (popularity filter).
  – k contains from 2 to smax terms: 2 ≤ |k| ≤ smax, where smax is a parameter of our model (size filter).
  – all immediate sub-keys of k (of size |k|−1) are indexed and their associated posting lists are truncated (redundancy filter).
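The three filters can be combined into one predicate. This is a sketch under assumptions: keys are modeled as frozensets of terms, `qf` is a plain dictionary of query frequencies, and `index_truncated` maps every indexed key to a flag saying whether its posting list is truncated. All names are hypothetical.

```python
from itertools import combinations


def immediate_subkeys(key):
    """All sub-keys with exactly one term fewer than key."""
    return [frozenset(c) for c in combinations(sorted(key), len(key) - 1)]


def can_activate(key, qf, index_truncated, qf_min, s_max):
    if not (2 <= len(key) <= s_max):   # size filter
        return False
    if qf.get(key, 0) < qf_min:        # popularity filter
        return False
    # redundancy filter: every immediate sub-key is indexed AND truncated
    return all(index_truncated.get(sub, False) for sub in immediate_subkeys(key))


a, b, c = frozenset("a"), frozenset("b"), frozenset("c")
index_truncated = {a: True, b: False, c: True}   # b's posting list is short
qf = {frozenset("ac"): 3, frozenset("ab"): 5}

print(can_activate(frozenset("ac"), qf, index_truncated, qf_min=3, s_max=3))
print(can_activate(frozenset("ab"), qf, index_truncated, qf_min=3, s_max=3))
```

In this example ab is popular but fails the redundancy filter, because the posting list of b already holds all documents containing b.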

QDI: Retrieval

• The single-term index is generated
• Process abc:
  1) Probe Pabc: nothing
  2) Probe Pab, Pbc and Pac: nothing
  3) Probe Pa, Pb and Pc
  4) Obtain top-DFmax results for a, b and c (ranked w.r.t. a, b and c respectively)
  5) Contact peers in the list, re-rank the obtained results w.r.t. abc
  6) Output top-10
• Increment the QF for ab, bc and ac (+1 each)
• Activate (index) ac once it becomes popular

[Figure: a peer issues the query abc against the key lattice, with the two-term keys ab, bc and ac.]

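QDI: Retrieval 2

• Assume the frequency of b is below DFmax
• Note how the redundancy filter would simplify the lattice in such a case (grayed nodes cannot be activated): the posting list of b is then not truncated, so ab, bc and abc cannot be activated

[Figure: key lattice for the query abc, with the two-term keys ab, bc and ac.]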

QDI: Retrieval 3

• The single-term index is generated and ac is indexed
• Process abc:
  1) Probe Pabc: nothing
  2) Probe Pab, Pbc and Pac: obtain the result for ac (nothing for ab and bc)
  3) Probe Pb and obtain the result for b
  4) Contact all peers in the list to re-rank the obtained results w.r.t. abc
  5) Output top-10
• Increment the QF for ab, bc and ac (+1 each)

[Figure: a peer issues the query abc against the key lattice, with the two-term keys ab, bc and ac.]
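The retrieval walk-through can be sketched as a top-down traversal of the key lattice. This is an illustrative reading, not the system's code: `retrieve`, `index`, `qf` and `rank` are hypothetical names, the index is a local dictionary instead of a DHT, and the rule "count QF only for multi-term keys whose immediate sub-keys are all indexed" is an assumption chosen because it reproduces the +1 updates for ab, bc and ac in the example.

```python
from itertools import combinations


def retrieve(query_terms, index, qf, s_max, rank, top_k=10):
    """index: dict frozenset(terms) -> list of doc ids (a truncated posting list).
    qf: dict frozenset(terms) -> int, updated in place.
    rank: callable(doc_id) -> score w.r.t. the full query (the re-ranking step)."""
    terms = sorted(set(query_terms))
    found = []          # keys whose posting lists were obtained
    candidates = set()  # union of the obtained document references
    for size in range(min(len(terms), s_max), 0, -1):
        for combo in combinations(terms, size):
            key = frozenset(combo)
            if any(key < f for f in found):
                continue  # covered by a larger key that was already obtained
            subkeys = [frozenset(c) for c in combinations(combo, size - 1)]
            if size >= 2 and all(s in index for s in subkeys):
                qf[key] = qf.get(key, 0) + 1  # assumed QF update rule
            if key in index:
                found.append(key)
                candidates.update(index[key])
    ranked = sorted(candidates, key=rank, reverse=True)
    return ranked[:top_k], found


# The situation of the third retrieval example: single terms plus ac are indexed.
index = {
    frozenset("a"): ["d1", "d2"],
    frozenset("b"): ["d2", "d3"],
    frozenset("c"): ["d4"],
    frozenset("ac"): ["d1", "d4"],
}
scores = {"d1": 0.9, "d2": 0.8, "d3": 0.1, "d4": 0.5}  # made-up scores for abc
qf = {}
top, used = retrieve(["a", "b", "c"], index, qf, s_max=3, rank=scores.get)
print(top, used)
```

Only the posting lists of ac and b are fetched here; a and c are skipped because ac already covers them.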

Scalability

• The retrieval traffic is bounded by a constant due to truncated posting lists (it depends on DFmax and the query size)
• The indexing traffic depends on the number of keys to be activated:
  – The number of keys in the HDK approach (UPPER BOUND) is proven to grow linearly with the number of peers, if each peer provides a limited number of documents
  – The number of QDI keys does not depend on the document collection size but only on the size of the query log
  – We can use the QFmin parameter to adjust the tradeoff: indexing traffic <-> retrieval quality
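To see why the retrieval traffic is bounded, a loose upper bound can be computed. It assumes that a query probes at most every term combination of size 1 to smax and that each probe returns at most DFmax references. The function name `max_postings_per_query` is hypothetical, and the bound is a back-of-the-envelope estimate, not a figure from the papers.

```python
from math import comb


def max_postings_per_query(query_size, s_max, df_max):
    """Loose upper bound on document references transferred for one query."""
    largest = min(query_size, s_max)
    probes = sum(comb(query_size, s) for s in range(1, largest + 1))
    return probes * df_max


print(max_postings_per_query(query_size=3, s_max=3, df_max=100))
print(max_postings_per_query(query_size=5, s_max=3, df_max=500))
```

The bound depends only on the query size, smax and DFmax, not on the number of documents or peers.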


AOL logs

• 17M queries from March, April and May 2006 (92 days)
• 650K anonymous user sessions
• Extracted all unique queries from each user session:

…
2006-05-31 23:50:30 wearthbow.com native.cheyenne origin.
2006-05-31 23:50:30 l6 screensaver
2006-05-31 23:50:30 horses for sale in tn ky
2006-05-31 23:50:30 bank of america.com
2006-05-31 23:50:30 ask
2006-05-31 23:50:29 del rosa lanes
2006-05-31 23:50:28 www.spirit airlines.com
2006-05-31 23:50:28 find holy women of the bible
2006-05-31 23:50:27 trains
2006-05-31 23:50:27 todaysmiricles
2006-05-31 23:50:27 constition
2006-05-31 23:50:26 german grocceries in las vegas nv
2006-05-31 23:50:25 porn
2006-05-31 23:50:25 northwest indiana
2006-05-31 23:50:24 united.eprize.net
2006-05-31 23:50:24 jessica laguna
…

<-0.7Gb


Distribution of combinations in the AOL logs

TREC Experiment

• WT10G collection (~1.69M docs)
• 100 TREC queries (from TREC Web Track 9 & 10)
• Query statistics generated from 17M AOL queries
• Using the Okapi BM25 weighting scheme to compute the ranking score
• QFmin = 1, 3, 5, ∞
• DFmax = 100, 500
• smax = 3

TREC: Precision at Top Ranked Pages (table)

         DFmax=100                            DFmax=500                            ST-BM25
         QFmin=∞  QFmin=5  QFmin=3  QFmin=1   QFmin=∞  QFmin=5  QFmin=3  QFmin=1
P@1      0.408    0.449    0.449    0.449     0.429    0.439    0.439    0.439     0.439
P@2      0.388    0.439    0.434    0.434     0.418    0.429    0.429    0.429     0.429
P@3      0.347    0.412    0.412    0.408     0.391    0.395    0.395    0.395     0.395
P@4      0.324    0.370    0.372    0.370     0.367    0.362    0.362    0.362     0.360
P@5      0.306    0.345    0.347    0.341     0.345    0.343    0.343    0.343     0.337
P@10     0.266    0.299    0.295    0.294     0.307    0.302    0.303    0.302     0.298
P@15     0.237    0.267    0.267    0.267     0.276    0.279    0.280    0.278     0.278
P@20     0.212    0.243    0.243    0.246     0.254    0.259    0.259    0.259     0.257
P@30     0.174    0.206    0.209    0.212     0.214    0.221    0.221    0.224     0.226
P@50     0.139    0.169    0.171    0.174     0.175    0.181    0.181    0.183     0.186
P@100    0.097    0.126    0.127    0.130     0.128    0.135    0.135    0.136     0.140

Precision is similar to centralized indexing
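For reference, the P@k measure reported in the table can be computed as below. This is the usual textbook definition (relevant documents among the top k, divided by k); the function name and the sample data are hypothetical and unrelated to the TREC runs.

```python
def precision_at_k(ranked_doc_ids, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    top = ranked_doc_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / k


ranked = ["d3", "d1", "d7", "d2", "d9"]   # made-up ranking
relevant = {"d1", "d2", "d5"}             # made-up relevance judgments
print(precision_at_k(ranked, relevant, 5))
print(precision_at_k(ranked, relevant, 2))
```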

Overlap experiment

• Use the query log to build the index (days 1..91)
• Randomly choose 2K test queries from day 92
• Answer each test query with Google and compare the result to the union of the top-DFmax Google results for each of its combinations that are indexed according to the logs
• This mimics our P2PIR system if Google’s ranking is used
• Example:

[Figure: the results of the original query compared with the results of its non-superfluous (indexed) combinations; overlap@5 = 3/5 = 60%]
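The overlap measure can be sketched as follows, assuming overlap@k is the fraction of the reference top-k that also appears in the union of the top-DFmax results of the indexed combinations. The function name `overlap_at_k` and the result identifiers are hypothetical.

```python
def overlap_at_k(reference_top, key_results, k, df_max):
    """reference_top: ranked results for the original query.
    key_results: one ranked result list per indexed term combination."""
    pool = set()
    for docs in key_results:
        pool.update(docs[:df_max])  # each key contributes at most df_max results
    top = reference_top[:k]
    return sum(1 for doc in top if doc in pool) / k


reference = ["u1", "u2", "u3", "u4", "u5"]          # made-up result ids
per_key = [["u1", "u9"], ["u3", "u5", "u8"]]        # results of two indexed keys
print(overlap_at_k(reference, per_key, k=5, df_max=100))
```

With these made-up lists three of the five reference results are covered, giving the same 3/5 ratio as the slide example.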

Overlap example

• Cut-n-paste from the simulation log:

>id=481, q=“what did babe ruth do in the 1920”
   “1920 babe ruth”, qf=0 ----> Ov@100= 100%
   “1920 babe”, qf=0 ---------> Ov@100= 9%
+++“1920 ruth”, qf=1 ---------> Ov@100= 33%
+++“babe ruth”, qf=495 -------> Ov@100= 69%
---“1920”, qf=716 ------------> Ov@100= 1%
---“babe”, qf=3196 -----------> Ov@100= 2%
---“ruth”, qf=1653 -----------> Ov@100= 7%
Size: 192, Keys used: 2, Overlap@100: 94%

Google experiment: impact of smax, DFmax

[Figure: impact of smax for all possible combinations (QFmin=0)]
[Figure: impact of DFmax with QFmin=1, smax=3]

Google experiment: impact of QFmin

[Figure: impact of QFmin (DFmax=600)]
[Figure: number of keys for different QFmin]

• Does not depend on the document collection size
• HDK approach would require ~65M keys for 650K documents


• >30% of badly performing queries are misspellings => real quality is higher


Google experiment: impact of the log size

[Figure: impact of the log size (QFmin=1, DFmax=600)]

Conclusions

• We presented the query-driven indexing strategy for scalable web text retrieval with structured P2P networks:
  – Stores posting lists in a DHT for terms and term combinations
  – Stores at most DFmax top document references in a posting list
  – Efficiently collects the query statistics in a distributed fashion
  – Based on these statistics, activates (indexes) only popular keys
  – Computes the result of a multi-term query based only on the index entries available at the moment: no costly intersections
• We also showed that:
  – With real query logs our approach achieves good retrieval quality
  – The QFmin parameter adjusts the traffic/quality tradeoff

Last slide

Thank you for your attention! Questions?

AlvisP2P - to appear in July at http://globalcomputing.epfl.ch/alvis/
