routing of structured queries in large-scale distributed systems

Routing of Structured Queries in Large-Scale Distributed Systems. Workshop on Large-Scale Distributed Systems for Information Retrieval (LSDS_IR'08) @ ACM 17th CIKM 2008, Napa Valley, California, USA, Oct 2008. Judith Winter, Institute for Informatics / Telematics Group, Goethe-University / Frankfurt am Main, Germany

Upload: kipling

Post on 08-Jan-2016



TRANSCRIPT

Page 1: Routing of Structured Queries  in Large-Scale Distributed Systems

Routing of Structured Queries in Large-Scale Distributed Systems

Workshop on Large-Scale Distributed Systems for Information Retrieval (LSDS_IR'08)

@ ACM 17th CIKM 2008,

Napa Valley, California, USA, Oct 2008.

Judith Winter

Institute for Informatics / Telematics Group

Goethe-University / Frankfurt am Main, Germany

Page 2: Routing of Structured Queries  in Large-Scale Distributed Systems

Judith Winter: Routing of Structured Queries in Large-Scale Distr. Systems

Routing of Structured Queries in Large-Scale Distributed Systems

Overview

1. Introduction

2. Concept & Architecture

3. Routing

4. Evaluation

5. Questions and Discussion

1. Introduction

Page 3: Routing of Structured Queries  in Large-Scale Distributed Systems


• XML Information Retrieval in P2P systems

• Investigate the impact of using structural information when retrieving XML documents in a P2P network

• Challenge: not all information is accessible; scalability issues

Proposed research:

How can query routing in a large-scale P2P system be performed and improved by using structural information?

1.Introduction 2. Concept 3. Routing 4. Evaluation

Page 4: Routing of Structured Queries  in Large-Scale Distributed Systems


XML Information Retrieval in Peer-to-Peer Systems:

XML-Retrieval:
• structured documents
• more precise search
• based on c/s architectures

Peer-to-Peer:
• distributed
• autonomous peers
• growing amount of XML documents

Information Retrieval:
• vague queries
• relevance ranking

Challenges:
• no central index
• only selected information available
• bandwidth consumption / communication overhead
• efficiency vs. effectiveness

Page 5: Routing of Structured Queries  in Large-Scale Distributed Systems


2. Concept & Architecture

Page 6: Routing of Structured Queries  in Large-Scale Distributed Systems


Concept for a P2P-search engine:

• Queries: content-and-structure (CAS)

• Indexing: include structure

• Hybrid indexing: globally or locally (distributing summaries), depending on peer status; index with posting lists (document level) & peer lists (peer level)

• Distributing global information into DHT

• Ranking: extended vector space model (using structure)

• Results/Retrieval units: document or element retrieval

Page 7: Routing of Structured Queries  in Large-Scale Distributed Systems


Concept for a P2P-search engine:

• Routing:

  • Use peer lists and posting lists

  • Use of pre-computed posting lists for popular term combinations: highly discriminative keys (HDKs)

  • Use of pruned posting lists by considering structural information

  • Ordering of posting lists by a query-independent score (evidence from document-, element-, collection-, and peer-level)

  • Select top k results according to pre-ranking regarding structural similarity between CAS query and posting key

Page 8: Routing of Structured Queries  in Large-Scale Distributed Systems


[Figure: architecture diagram with three layers (APPLICATION, INFORMATION RETRIEVAL, PEER-TO-PEER). Labelled parts: GUI (indexing, querying & result presentation); DL (local documents); querying component; retrieval component; ranking component; routing component; similarity calculator; weighting calculator; source selector; index storage component (inverted index, statistics index, document index, retrieval unit index, frequent XTerm index, HDK index); P2P component (SpirixDHT, SimulationDHT, ChordPeer, metrics calculator); P2P network.]

Page 9: Routing of Structured Queries  in Large-Scale Distributed Systems


3. Routing

Page 10: Routing of Structured Queries  in Large-Scale Distributed Systems


Example:

q = {apple, \book}

1. Peer P0 looks for books about apples

2. Id i0 = hash(apple, \book) = hash(apple) is calculated

3. Peer P5 assigned to i0 is located in log(n) hops

4. Query q is sent to P5

5. P5 selects top k=2 postings for q; these relate to dok1 and dok2

6. Id i1 = hash(dok1) and Id i2 = hash(dok2) are calculated, their peers located

7. q is sent to P2 and P6 assigned to i1 and i2

8. P2 and P6 calculate relevance for dok1 and dok2 plus their retrieval units (RUs)

9. P2 and P6 send back results to P0

[Figure: peers P0 to P7; q travels from P0 to P5, then to P2 and P6.]

Index at P5 (assigned to hash(apple)):

apple, \book: dok1(4.8), dok2(4.1), dok3(3.7), …
apple, \novel: dok2(12.9)
apple, \article\p\sec: ----

Document vectors: Dok1=(0,1,5,1,3,…), Dok2=(1,4,0,0,3,…)

Results sent back to P0:

Result = {(dok2,12.4), (dok2/chap, 11.2)}
Result = {(dok1/sec,5.4)}

Merged result list:

1. (dok2,12.4)
2. (dok2/chap, 11.2)
3. (dok1/sec,5.4)
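The routing steps of this example can be sketched in Python as follows. This is a minimal simulation under stated assumptions, not SPIRIX code: `responsible_peer` (a SHA-1 hash taken modulo the number of peers, standing in for a Chord lookup), `route`, and the in-memory dictionaries are hypothetical; only the posting scores and result values are taken from the slide. Because the hash is a stand-in, the peer ids it yields will generally differ from P5, P2 and P6.

```python
import hashlib
from heapq import nlargest

N_PEERS = 8  # P0..P7 as in the example

def responsible_peer(key: str) -> int:
    """Hypothetical stand-in for a DHT lookup: hash the key onto a peer id."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % N_PEERS

# Posting list for the key (apple, \book), scores as on the slide.
posting_lists = {
    ("apple", "\\book"): [("dok1", 4.8), ("dok2", 4.1), ("dok3", 3.7)],
}

# Relevance values the document peers return for q, as on the slide.
local_results = {
    "dok1": [("dok1/sec", 5.4)],
    "dok2": [("dok2", 12.4), ("dok2/chap", 11.2)],
}

def route(query, k=2):
    term, _structure = query
    index_peer = responsible_peer(term)        # steps 2-3: only the term is hashed
    postings = posting_lists.get(query, [])    # step 5: posting list at that peer
    top_k = nlargest(k, postings, key=lambda p: p[1])
    hops = [index_peer]
    results = []
    for doc, _score in top_k:                  # steps 6-8: visit the document peers
        hops.append(responsible_peer(doc))
        results.extend(local_results.get(doc, []))
    # Step 9: the querying peer merges the returned results by relevance.
    results.sort(key=lambda r: r[1], reverse=True)
    return hops, results

hops, ranking = route(("apple", "\\book"))
print(ranking)
```

With k=2 only dok1 and dok2 are followed up, so dok3 never causes a message to a third document peer; that is the saving the top-k selection is meant to provide.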

Page 11: Routing of Structured Queries  in Large-Scale Distributed Systems


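Routing process:

[Figure: query routing by message passing. The retrieval component of the querying peer pq receives q = {k1, k2, k3} from the GUI and sends (pq, q, k1, k2, k3) to the routing component of peer pi1 (responsible for k1). pi1 forwards (pq, q, k2, k3, pl(k1)) to the routing component of pi2 (responsible for k2), which forwards (pq, q, k3, pl(k1, k2)) to the routing component of pi3 (responsible for k3). Pre-ranking selects documents d1, d2, …, dk from pl(k1, k2, k3). For each selected document dn, (pq, q, dn) is sent to the ranking component of peer pdn, which performs the final ranking and returns the results for the retrieval units of dn. These form the results for q at pq.]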
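The message chain of the routing process can be sketched as below. The slide does not say how a routing peer combines its own posting list with the one carried in the message, so the `merge` rule used here (adding scores per document) is an assumption, and `index`, `route_query` and all scores are hypothetical illustration values.

```python
from collections import defaultdict

# Hypothetical posting lists held by the peers responsible for k1, k2, k3.
index = {
    "k1": {"d1": 2.0, "d2": 1.0},
    "k2": {"d2": 3.0, "d3": 0.5},
    "k3": {"d1": 1.5},
}

def merge(carried, own):
    """Assumed merge rule: add scores document by document."""
    merged = defaultdict(float, carried)
    for doc, score in own.items():
        merged[doc] += score
    return dict(merged)

def route_query(pq, query_keys):
    """Forward (pq, remaining keys, pl(processed keys)) from peer to peer."""
    remaining = list(query_keys)
    carried = {}
    trace = []
    while remaining:
        key = remaining.pop(0)  # the peer responsible for this key handles the message
        carried = merge(carried, index.get(key, {}))
        trace.append((pq, tuple(remaining), dict(carried)))
    return carried, trace

pl, trace = route_query("pq", ["k1", "k2", "k3"])
print(pl)
```

Each entry of `trace` corresponds to one forwarded message: the list of keys still to be processed shrinks while the carried posting list grows.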

Page 12: Routing of Structured Queries  in Large-Scale Distributed Systems


Weighting of postings (query independent at indexing):

• Entries sorted by score_t(d_i); choose k best entries for XTerm t

• Considers document d_i, best retrieval unit ru_best, and peer p_i

• Weighting function w: BM25e-based

• PeerScore: high for peers with good collections regarding t and with good performance metrics

[Formula: score_t(d_i), combining w(tf_t(d_i), icf_t), w(tf_t(ru_best), icf_t), and pscore_t(p_i).]
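Keeping only the k best entries of a posting list, as described on this slide, can be sketched as below. The helper name `top_k_postings` is hypothetical; the scores are the ones shown for the XTerm (chips, \book\c1\section) on the following slide, and the computation of score_t(d_i) itself is left out.

```python
import heapq

def top_k_postings(postings, k):
    """Return the k postings with the highest score, best first."""
    return heapq.nlargest(k, postings, key=lambda entry: entry[1])

# (document, query-independent score) pairs in arbitrary order
postings = [("dok1", 2.3), ("dok4", 18.4), ("dok3", 1.5), ("dok2", 3.1)]
best = top_k_postings(postings, k=2)
print(best)  # [('dok4', 18.4), ('dok2', 3.1)]
```

Because the score does not depend on the query, this truncation can be done once at indexing time rather than per query.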

Page 13: Routing of Structured Queries  in Large-Scale Distributed Systems


Selection of Postings (query dependent reordering):

$$\mathrm{score}_q(d_i) = \sum_{t \in q} \begin{cases} \mathrm{score}_t(d_i) \cdot \mathrm{sim}(s_q, s_t) & \text{if } (*) \\ 0 & \text{else} \end{cases}$$

(*) if $\mathrm{sim}(s_q, s_t) > thr_{sim}$ and $d_i \in \mathrm{postinglist}(t)$, for each XTerm t

Example:

q = { (apple, \book\chapter), (chips, \section) }

Term    Structure          Postings                                       sim
apple   \book\chapter      dok1(12.8), dok2(12.4)                         1
apple   \article\p         dok2(25.3), dok3(12.7), dok4(10.7)             0
chips   \book\c1\section   dok4(18.4), dok2(3.1), dok1(2.3), dok3(1.5)    0.7

Final Posting list = {dok2(12.4*1+3.1*0.7=14.6), dok1(12.8*1+2.3*0.7=14.4), dok4(18.4*0.7=12.9), dok3(1.5*0.7=1.1) }
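The arithmetic of this example can be reproduced with a short sketch. The posting scores and similarity values are the ones on the slide; the function name `rerank` and the threshold value of 0 are assumptions, since the slide gives no concrete threshold.

```python
from collections import defaultdict

# (term, structure) -> postings with query-independent scores, as on the slide
index = {
    ("apple", "\\book\\chapter"): {"dok1": 12.8, "dok2": 12.4},
    ("apple", "\\article\\p"): {"dok2": 25.3, "dok3": 12.7, "dok4": 10.7},
    ("chips", "\\book\\c1\\section"): {"dok4": 18.4, "dok2": 3.1, "dok1": 2.3, "dok3": 1.5},
}

# Structural similarity between the query structure and each posting key (from the slide)
sim = {
    ("apple", "\\book\\chapter"): 1.0,
    ("apple", "\\article\\p"): 0.0,
    ("chips", "\\book\\c1\\section"): 0.7,
}

THRESHOLD = 0.0  # assumed value; keys whose similarity does not exceed it are skipped

def rerank(index, sim, threshold=THRESHOLD):
    """Sum score * similarity per document over all sufficiently similar keys."""
    scores = defaultdict(float)
    for key, postings in index.items():
        s = sim[key]
        if s <= threshold:
            continue  # the whole posting list of this key is ignored
        for doc, score in postings.items():
            scores[doc] += score * s
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

final = rerank(index, sim)
for doc, score in final:
    print(doc, round(score, 2))
```

The resulting order is dok2, dok1, dok4, dok3, as in the final posting list of the slide. The high scores stored under (apple, \article\p) do not contribute because that key has similarity 0 to the query structure.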

Page 14: Routing of Structured Queries  in Large-Scale Distributed Systems


4. Evaluation

Page 15: Routing of Structured Queries  in Large-Scale Distributed Systems


Implementation:

• Implementation of SPIRIX: Search Engine for P2P Information Retrieval in XML-Documents

• P2P-complex:
  • Based on OpenChord
  • Collects peer characteristics
  • Adapted to special requirements of XML IR

• Preliminary evaluation with INEX-Collection

Page 16: Routing of Structured Queries  in Large-Scale Distributed Systems


Evaluation:

• Evaluation with INEX-Collection of 2007:
  • Wikipedia collection: 660,000 documents (4.6 GB)
  • 80 CAS queries (out of 123 topics)
  • run on 1 peer with SimulationDHT (measurement of #postings)
  • retrieval of best 1500 results per query
  • PLmax set to unlimited (all HDKs are single XTerms)
  • different structural similarity functions
  • simple version of the proposed formulas (document-based)

• Goal: show the effect of using structural hints for routing
  • efficiency (#postings: 100, 500, 2000 postings)
  • effectiveness (precision at different recall levels)

Page 17: Routing of Structured Queries  in Large-Scale Distributed Systems

[Figure: precision (y-axis, 0.00 to 0.35) per measure (x-axis: iP[0.00], iP[0.01], iP[0.05], iP[0.10], ircl_prn.0.15, ircl_prn.0.20, ircl_prn.0.30, ircl_prn.0.40, ircl_prn.0.50, ircl_prn.0.75, ircl_prn.1.00) for the runs noSim100, noSim500, noSim2000, Path_Sim100, Path_Sim500, Path_Sim2000.]

Page 18: Routing of Structured Queries  in Large-Scale Distributed Systems

[Figure: MAiP (y-axis labelled precision, -0.12 to 0.18) against #postings (x-axis: 100, 500, 2000) for noSim (BASELINE) and Path_Sim, together with the improvement of Path_Sim over the baseline. MAiP value pairs (noSim, Path_Sim, improvement): (0.107429, 0.11563126, +7.64%), (0.128581689, 0.11563126, -10.07%), (0.150062, 0.132392573, -11.77%).]
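The percentage labels of this chart are consistent with the relative change of Path_Sim against the noSim baseline. The short check below assumes that the MAiP values pair up as listed in `pairs`; this pairing is inferred from the ratios and is not stated on the slide, and `relative_change` is a hypothetical helper name.

```python
def relative_change(baseline, value):
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0

# (noSim baseline, Path_Sim) MAiP pairs; the pairing is an assumption
pairs = [
    (0.107429, 0.11563126),
    (0.128581689, 0.11563126),
    (0.150062, 0.132392573),
]

changes = [relative_change(baseline, value) for baseline, value in pairs]
for change in changes:
    print(f"{change:+.2f}%")
```

The three results match the labels 7,64%, -10,07% and -11,77% of the chart to two decimal places.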

Page 19: Routing of Structured Queries  in Large-Scale Distributed Systems

[Figure: iP[0.01] (y-axis, 0.31 to 0.36) against #postings (x-axis: 100, 500, 2000) for noSim (BASELINE) and Path_Sim. Improvement labels: +7.2%, +8.7%, +5.5%.]

Page 20: Routing of Structured Queries  in Large-Scale Distributed Systems

[Figure: precision (y-axis, 0.285 to 0.385) for the early precision measures (x-axis: iP[0.00], iP[0.01], iP[0.05], iP[0.10]) for the runs noSim100, noSim500, noSim2000, Path_Sim100, Path_Sim500, Path_Sim2000.]

Page 21: Routing of Structured Queries  in Large-Scale Distributed Systems


Conclusion:

• Propose to take advantage of XML structure when routing in highly distributed environments such as P2P systems

• Provide an infrastructure for investigating the proposed techniques, which perform routing based on evidence from document, element, collection, and peer level

• For 80 CAS topics of INEX 2007, efficiency and effectiveness could be improved

• Future work to verify the observed improvement:
  • evaluate formulas in full version
  • runs with multimedia topics INEX 2007; INEX 2008
  • measure bandwidth consumption (incl. #messages, message sizes)
  • run on different peers; split collection

Page 22: Routing of Structured Queries  in Large-Scale Distributed Systems


5. Questions and Discussion