Query-Driven Indexing for Peer-to-Peer Text Retrieval** (WWW 2007, Banff, Canada)



Contact: Gleb Skobeltsyn, gleb.skobeltsyn@epfl.ch, http://lsirpeople.epfl.ch/skobelts

* I.Podnar is currently affiliated with University of Zagreb, Croatia

** The work presented in this paper was (partly) carried out in the framework of the EPFL Center for Global Computing and supported by the Swiss National Funding Agency OFES as part of the European projects BRICKS (507457) and ALVIS (002068).


G. Skobeltsyn, T. Luu, I. Podnar*, M. Rajman, K. Aberer

Experiments: retrieval quality of the query-driven index when compared to Google

[Figure, panel c): x-axis QFmin / 3 months, from 0 to ∞.]

Our goal: Scalable full-text web retrieval in a structured P2P network.

Features:
- Low bandwidth during retrieval, as posting lists of bounded size are transmitted,
- The content of the index adapts to the current query popularity distribution,
- Tradeoff between retrieval quality and index size (i.e., indexing cost).
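To make the features more concrete, here is a minimal sketch of what a single index item might look like. It assumes the activation rule stated elsewhere on the poster (a key is indexed if it is contained in more than QFmin queries) and a posting list capped at DFmax entries. The class name `IndexItem`, its fields, and the default values are hypothetical and are not taken from the authors' implementation.

```python
from dataclasses import dataclass, field

QF_MIN = 1    # assumed activation threshold (QFmin on the poster)
DF_MAX = 600  # assumed bound on posting list length (DFmax on the poster)


@dataclass
class IndexItem:
    """Hypothetical index item: a popularity counter plus an optional truncated posting list."""
    key: frozenset
    query_count: int = 0
    postings: list = field(default_factory=list)

    @property
    def active(self) -> bool:
        # Rule as worded on the poster: indexed if contained in more than QFmin queries.
        return self.query_count > QF_MIN

    def record_query(self) -> None:
        # Called whenever a query containing this key is observed.
        self.query_count += 1

    def store_postings(self, ranked_docs) -> None:
        # Only active items hold a posting list, and it never exceeds DF_MAX entries.
        if self.active:
            self.postings = list(ranked_docs)[:DF_MAX]
```

An item that has only a counter corresponds to a candidate index item, and one that also holds a truncated posting list corresponds to an active one. Deactivating items whose popularity drops is not modeled here.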

Processing the query abc with a query-driven index

More details in:
• Skobeltsyn et al.: “Query-Driven Indexing for Scalable Peer-to-Peer Text Retrieval”, in Infoscale’07, Suzhou, China, 2007
• Skobeltsyn et al.: “Web Text Retrieval with a P2P Query-Driven Index”, in SIGIR’07, Amsterdam, The Netherlands, 2007
• Alvis project web site: http://globalcomputing.epfl.ch/alvis

[Figure, panels a) and b): y-axis "Quality of answer (%)", from 0% to 100%; legend: overlap ranges 0%, (0%,25%), [25%,50%), [50%,75%), [75%,100%), 100%, and Avg. overlap. Panel a) x-axis: log history size (days), 0 to 90. Panel b) x-axis: DFmax (documents), 20 to 600.]

a) Overlap achieved for different sizes of the query log, measured in number of days, with QFmin=1, DFmax=600

b) Overlap achieved for different values of DFmax with QFmin=1

c) Overlap achieved for different values of QFmin/3 months with DFmax=600

Top-20 overlap measure:
• Use Google to answer a query and compare it to the union of top-DFmax Google results for each of its indexed keys,
• Keys are indexed if contained in more than QFmin queries in the global query history.

Example of resolving a query:

>id=481, q=“what did babe ruth do in the 1920”
   “1920 babe ruth”, qf=0 ----> Ov@100= 100%
   “1920 babe”, qf=0 ---------> Ov@100= 9%
 + “1920 ruth”, qf=1 ---------> Ov@100= 33%
 + “babe ruth”, qf=495 -------> Ov@100= 69%
 - “1920”, qf=716 ------------> Ov@100= 1%
 - “babe”, qf=3196 -----------> Ov@100= 2%
 - “ruth”, qf=1653 -----------> Ov@100= 7%
Size: 192, Keys used: 2, Overlap@100: 94%
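The overlap measure can be illustrated with a short sketch. The poster does not give an explicit formula, so this assumes that overlap is the fraction of the reference top-k results that also appear in the union of the result lists of the indexed keys, each list already truncated to DFmax. The function name `overlap_at_k` and its parameters are hypothetical.

```python
def overlap_at_k(reference, candidate_lists, k=20):
    """Fraction of the top-k reference results found in the union of the candidate lists.

    reference:       ranked result list for the full query (assumed to come from the
                     reference engine)
    candidate_lists: one result list per indexed key, assumed already cut to DFmax
    """
    top = list(reference)[:k]
    if not top:
        return 0.0
    pool = set()
    for docs in candidate_lists:
        pool.update(docs)
    return sum(1 for doc in top if doc in pool) / len(top)


# Toy data, not taken from the poster's experiments.
reference = ["d1", "d2", "d3", "d4"]
per_key = [["d1", "d9"], ["d3", "d4"]]
print(overlap_at_k(reference, per_key, k=4))  # 0.75
```

In the example above, the two keys marked "+" would supply the candidate lists, and k would be 100 for the Ov@100 figures.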

The naïve approach:

• Distributed single-term index: maintains global posting lists for each single term in a DHT.

• To process a multi-term query abc, it intersects the full posting lists of a, b and c.

• Intersections lead to unscalable retrieval traffic.
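A minimal sketch of the naïve approach shows where the traffic comes from. The DHT is modeled as a plain dictionary from term to posting list, and every posting of every query term is counted as transferred. This is a simplification for illustration, not a description of a real protocol. The function name `intersect_full_lists` is hypothetical.

```python
def intersect_full_lists(index, terms):
    """Intersect the complete posting lists of all query terms.

    Returns the matching documents and the number of postings that had to be
    shipped, which grows with the length of the full lists.
    """
    lists = [set(index.get(term, ())) for term in terms]
    transferred = sum(len(postings) for postings in lists)
    result = set.intersection(*lists) if lists else set()
    return result, transferred


# Toy single-term index, assumed for illustration only.
index = {"a": [1, 2, 3, 4], "b": [2, 3, 5], "c": [3, 6]}
docs, cost = intersect_full_lists(index, ["a", "b", "c"])
print(docs, cost)  # {3} 9
```

A single matching document costs nine postings here. With web-scale posting lists the same ratio becomes the unscalable traffic that the last bullet refers to.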

[Figure: the querying peer and the peers Pa, Pb and Pc.]

[Figure: three copies of the key lattice for the query abc (a, b, c; ab, ac, bc; abc), showing which combinations are probed:
a) if the posting lists for b and c are truncated (only),
b) if the posting list for a is also truncated,
c) if the key bc is also indexed.]

Legend:
- probed combination
- skipped combination
- popularity counter
- truncated posting list
- posting list is used to answer the query

Index items:
- no index item for the key
- candidate index item (only stat.)
- active index item (stat.+TPL)
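The lattice exploration pictured above might be sketched as follows. The figure itself did not survive extraction, so this relies on two assumptions inferred from the legend and from the worked example: term combinations are probed from largest to smallest, and every subset of a key that has already been used is skipped. The effect of truncated versus complete posting lists (panels a to c) is not modeled. The function name `explore_lattice` is hypothetical.

```python
from itertools import combinations


def explore_lattice(terms, indexed_keys):
    """Probe term combinations top-down; skip proper subsets of keys already used."""
    used, probed, skipped = [], [], []
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            key = frozenset(combo)
            if any(key < bigger for bigger in used):
                skipped.append(key)  # covered by a larger key that was used
                continue
            probed.append(key)
            if key in indexed_keys:
                used.append(key)  # its posting list contributes to the answer
    return used, probed, skipped


# Keys assumed to be indexed for the "babe ruth" example on the poster.
indexed = {
    frozenset({"1920", "ruth"}), frozenset({"babe", "ruth"}),
    frozenset({"1920"}), frozenset({"babe"}), frozenset({"ruth"}),
}
used, probed, skipped = explore_lattice(["1920", "babe", "ruth"], indexed)
print(len(used), len(probed), len(skipped))  # 2 4 3
```

Under these assumptions the sketch uses two keys and skips the three single terms, which matches the "+" and "-" markers and the "Keys used: 2" line of the example.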