Routing of Structured Queries in Large-Scale Distributed Systems. Workshop on Large-Scale Distributed Systems for Information Retrieval (LSDS_IR'08) @ ACM 17th CIKM 2008, Napa Valley, California, USA, Oct 2008. Judith Winter
Routing of Structured Queries in Large-Scale Distributed Systems
Workshop on Large-Scale Distributed Systems for Information Retrieval (LSDS_IR'08)
@ ACM 17th CIKM 2008,
Napa Valley, California, USA, Oct 2008.
Judith Winter
Institute for Informatics / Telematics Group
Goethe-University / Frankfurt am Main, Germany
Judith Winter: Routing of Structured Queries in Large-Scale Distr. Systems
Routing of Structured Queries in Large-Scale Distributed Systems
Overview
1. Introduction
2. Concept & Architecture
3. Routing
4. Evaluation
5. Questions and Discussion
1. Introduction
• XML Information Retrieval in P2P systems
• Investigate the impact of using structural information when retrieving XML-documents in a P2P network
• Challenge: not all information accessible / scalability issues
Proposed research: How to perform & improve query routing in a large-scale P2P system by using structural information?
XML Information Retrieval in Peer-to-Peer Systems:
XML-Retrieval:
• structured documents
• more precise search
• based on client/server (c/s) architectures
Peer-to-Peer:
• distributed
• autonomous peers
• growing amount of XML-documents
Information Retrieval:
• vague queries
• relevance-ranking
Challenges:
• no central index
• only selected information available
• bandwidth consumption / communication overhead
• efficiency vs effectiveness
2. Concept & Architecture
Concept for a P2P-search engine:
• Queries: content-and-structure (CAS)
• Indexing: include structure
• Hybrid indexing: globally or locally (distributing summaries), depending on peer status; index with posting lists (document level) & peer lists (peer level)
• Distributing global information into DHT
• Ranking: extended vector space model (using structure)
• Results/Retrieval units: document or element retrieval
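As a rough illustration of the content-and-structure idea, a CAS query can be modelled as a set of (term, path) pairs, which the slides call XTerms. The sketch below is only one possible Python model: the class name `XTerm`, the helper `parse_xterm`, and the way the backslash path is split are assumptions made for illustration, not part of SPIRIX.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class XTerm:
    """A content term plus a structural hint (hypothetical model)."""
    term: str
    path: Tuple[str, ...]  # element names, outermost first


def parse_xterm(term: str, structure: str) -> XTerm:
    # Split a path such as r"\book\chapter" into ("book", "chapter").
    steps = tuple(step for step in structure.split("\\") if step)
    return XTerm(term=term, path=steps)


# A CAS query asking for books about apples: q = {apple, \book}.
query = {parse_xterm("apple", r"\book")}
```

Because the dataclass is frozen, XTerms are hashable and could serve directly as index keys in a dictionary-based inverted index.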
Concept for a P2P-search engine:
• Routing:
  • Use peer lists and posting lists
  • Use of pre-computed posting lists for popular term combinations (highly discriminative keys, HDKs)
  • Use of pruned posting lists by considering structural information
  • Ordering of posting lists by a query-independent score (evidence from document-, element-, collection-, and peer-level)
  • Select top k results according to pre-ranking regarding structural similarity between CAS query and posting key
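The slides do not define the structural similarity function, so the following is only a hypothetical stand-in that shows what "similarity between the structure of a CAS query and the structure of a posting key" could mean in code. It scores two element paths by their longest common subsequence; the function name `path_similarity` and the normalisation are assumptions, and this is not the Path_Sim function used in the evaluation.

```python
from typing import Sequence


def path_similarity(query_path: Sequence[str], key_path: Sequence[str]) -> float:
    """Hypothetical similarity in [0, 1] based on the longest common
    subsequence (LCS) of element names, normalised by both path lengths."""
    if not query_path or not key_path:
        return 0.0
    # Classic LCS dynamic programme, keeping only the previous row.
    prev = [0] * (len(key_path) + 1)
    for q_step in query_path:
        cur = [0]
        for j, k_step in enumerate(key_path, start=1):
            if q_step == k_step:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return 2.0 * prev[-1] / (len(query_path) + len(key_path))
```

Identical paths score 1.0 and paths without a shared element score 0.0; everything in between depends on the chosen normalisation, which is exactly the kind of design decision the evaluation varies with "different structural similarity functions".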
Architecture (diagram):
Layers: APPLICATION, INFORMATION RETRIEVAL, PEER-TO-PEER
Components shown:
• GUI
• Indexing
• Querying & result presentation
• Querying component
• Retrieval component
• Ranking component
• Routing component
• Similarity calculator
• Weighting calculator
• Source selector
• Index storage component: Inverted index, Statistics index, Document index, Retrieval unit index, Frequent XTerm index, HDK index
• DL Local documents
• P2P component
• SpirixDHT
• SimulationDHT
• ChordPeer
• Metrics calculator
• P2P network
3. Routing
Example:
q = {apple, \book}
1. Peer P0 looks for books about apples
2. Id i0 = hash(apple, \book) = hash(apple) is calculated
3. Peer P5 assigned to i0 is located in log(n) hops
4. Query q is sent to P5
5. P5 selects top k=2 postings for q; these relate to dok1 and dok2
6. Id i1 = hash(dok1) and Id i2 = hash(dok2) are calculated, their peers located
7. q is sent to P2 and P6 assigned to i1 and i2
8. P2 and P6 calculate relevance for dok1 and dok2 plus their RUs
9. P2 and P6 send back results to P0
Peers in the diagram: P0, P1, P2, P3, P4, P5, P6, P7
Posting lists at P5 (assigned to hash(apple)):
  apple, \book         dok1(4.8), dok2(4.1), dok3(3.7)…
  apple, \novel        dok2(12.9)
  apple, \article\p\sec ----
Document vectors: Dok1=(0,1,5,1,3,…), Dok2=(1,4,0,0,3,…)
Result = {(dok2,12.4), (dok2/chap, 11.2)}
Result = {(dok1/sec,5.4)}
Ranked results:
1. (dok2,12.4)
2. (dok2/chap, 11.2)
3. (dok1/sec,5.4)
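Steps 2 to 5 of the example can be sketched in a few lines of Python. This is a simplified sketch under stated assumptions: it resolves the responsible peer with global knowledge of all peer ids instead of Chord's finger-table lookup (which is what gives the log(n) hops), and the names `ring_id`, `responsible_peer`, and `top_k` as well as the 16-bit id space are hypothetical choices.

```python
import hashlib
from bisect import bisect_left
from typing import List, Sequence, Tuple


def ring_id(key: str, bits: int = 16) -> int:
    """Map a key to an identifier on a ring of size 2**bits."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


def responsible_peer(key_id: int, peer_ids: Sequence[int]) -> int:
    """Return the first peer id at or after key_id, wrapping around."""
    ring = sorted(peer_ids)
    return ring[bisect_left(ring, key_id) % len(ring)]


def top_k(postings: List[Tuple[str, float]], k: int) -> List[Tuple[str, float]]:
    """Keep the k postings with the highest score."""
    return sorted(postings, key=lambda p: p[1], reverse=True)[:k]


# Posting list for (apple, \book) as shown in the example.
apple_book = [("dok1", 4.8), ("dok2", 4.1), ("dok3", 3.7)]
selected = top_k(apple_book, k=2)  # the postings for dok1 and dok2
```

Hashing only the term (`ring_id("apple")`) rather than the full (term, structure) pair mirrors step 2, where all posting lists for apple end up on the same peer regardless of structure.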
Routing process:
q = {k1, k2, k3}
1. GUI / retrieval component at the querying peer pq: sends (pq, q, k1, k2, k3)
2. Pre-ranking by the routing components:
   • pi1 (k1 ∈ q): forwards (pq, q, k2, k3, pl(k1))
   • pi2 (k2 ∈ q): forwards (pq, q, k3, pl(k1, k2))
   • pi3 (k3 ∈ q): pl(k1, k2, k3) with entries d1, d2, …, dk
3. Final ranking by the ranking component at pdn: receives (pq, q, dn), returns results for retrieval units (dn)
4. pq receives the results for q
Arrow types in the diagram: send message, query routing
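The message chain in the routing process, where each peer strips its own key from the message and passes on a growing merged posting list, might be simulated as follows. The sketch is an illustration only: the slide does not say how posting lists are merged, so summing scores per document is an assumption, and `RoutingMessage`, `route_step`, and `route_query` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RoutingMessage:
    origin: str                 # querying peer pq
    query: List[str]            # all keys of q
    remaining: List[str]        # keys not yet processed
    merged: Dict[str, float] = field(default_factory=dict)  # pl(k1, ...)


def route_step(msg: RoutingMessage, postings: Dict[str, float]) -> RoutingMessage:
    """Work done by the peer responsible for msg.remaining[0]:
    merge its posting list (assumed: sum of scores) and drop its key."""
    merged = dict(msg.merged)
    for doc, score in postings.items():
        merged[doc] = merged.get(doc, 0.0) + score
    return RoutingMessage(msg.origin, msg.query, msg.remaining[1:], merged)


def route_query(origin: str, query: List[str],
                index: Dict[str, Dict[str, float]]) -> RoutingMessage:
    """Pass the message from peer to peer until no key remains.
    index maps each key to the posting list held by its peer."""
    msg = RoutingMessage(origin, list(query), list(query))
    while msg.remaining:
        msg = route_step(msg, index.get(msg.remaining[0], {}))
    return msg
```

After the last step, `merged` plays the role of pl(k1, k2, k3); selecting the top entries d1 … dk from it would determine which document peers receive (pq, q, dn).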
Weighting of postings (query independent at indexing):
$$score_t(d_i) = w(tf_t(d_i), icf_t) \cdot w(tf_t(ru_{best}), icf_t) \cdot pscore_t(p_i)$$
• Entries sorted by score_t(d_i); choose k best entries for XTerm t
• Considers document d_i, best retrieval unit ru_best, and peer p_i
• Weighting function w: BM25e-based
• PeerScore: high for peers with good collections regarding t and with good performance metrics
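A minimal sketch of the indexing-time weighting follows, assuming that the document weight, the weight of the best retrieval unit, and the peer score are combined as a product. The weight function here is a simplified BM25-style saturation and not the BM25e variant named on the slide; `bm25_style_weight`, `posting_score`, `k_best`, and the parameter `k1` are hypothetical.

```python
import heapq
from typing import Iterable, List, Tuple


def bm25_style_weight(tf: float, icf: float, k1: float = 1.2) -> float:
    """Simplified stand-in for w(tf, icf): term frequency saturates,
    scaled by the inverse collection frequency."""
    return icf * tf * (k1 + 1.0) / (tf + k1)


def posting_score(tf_doc: float, tf_ru_best: float, icf: float,
                  peer_score: float) -> float:
    """Query-independent score of one posting (assumed product form)."""
    return (bm25_style_weight(tf_doc, icf)
            * bm25_style_weight(tf_ru_best, icf)
            * peer_score)


def k_best(entries: Iterable[Tuple[str, float]], k: int) -> List[Tuple[str, float]]:
    """Keep the k highest-scoring (document, score) entries for an XTerm."""
    return heapq.nlargest(k, entries, key=lambda e: e[1])
```

Because the score does not depend on the query, it can be computed once at indexing time and the posting list stored already sorted and pruned to its k best entries.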
Selection of postings (query-dependent reordering):
$$score(d_i, q) = \sum_{t \in XTerm} h_t, \qquad h_t = \begin{cases} score_t(d_i) \cdot sim(s_q, s_t) & \text{if } (*) \\ 0 & \text{else} \end{cases}$$
(*) sim(s_q, s_t) > thr and d_i ∈ postinglist(t)
Example:
q = { (apple, \book\chapter), (chips, \section) }
Term    Structure          Postings                                       sim
apple   \book\chapter      dok1(12.8), dok2(12.4)                         1
        \article\p         dok2(25.3), dok3(12.7), dok4(10.7)             0
chips   \book\c1\section   dok4(18.4), dok2(3.1), dok1(2.3), dok3(1.5)    0.7
Final Posting list = {dok2(12.4*1+3.1*0.7=14.6), dok1(12.8*1+2.3*0.7=14.4), dok4(18.4*0.7=12.9), dok3(1.5*0.7=1.1) }
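The arithmetic of this example can be checked with a short Python sketch. The posting scores and similarity values are taken from the slide; the function name `merge_postings` and the threshold of 0.0 are assumptions, since the slide does not state the threshold value.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (similarity to the query structure, posting list) per index key.
posting_lists: List[Tuple[float, Dict[str, float]]] = [
    (1.0, {"dok1": 12.8, "dok2": 12.4}),                            # apple, \book\chapter
    (0.0, {"dok2": 25.3, "dok3": 12.7, "dok4": 10.7}),              # \article\p
    (0.7, {"dok4": 18.4, "dok2": 3.1, "dok1": 2.3, "dok3": 1.5}),   # chips, \book\c1\section
]


def merge_postings(lists: List[Tuple[float, Dict[str, float]]],
                   threshold: float = 0.0) -> List[Tuple[str, float]]:
    """Sum score * sim per document over all lists whose similarity
    exceeds the (assumed) threshold, then sort by descending total."""
    totals: Dict[str, float] = defaultdict(float)
    for sim, postings in lists:
        if sim <= threshold:
            continue  # structurally dissimilar list is skipped entirely
        for doc, score in postings.items():
            totals[doc] += score * sim
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


final = merge_postings(posting_lists)
```

The unrounded totals are 14.57, 14.41, 12.88, and 1.05 for dok2, dok1, dok4, and dok3, which the slide shows rounded to one decimal. Note how the \article\p list, despite containing the highest raw score, contributes nothing because its similarity is 0.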
4. Evaluation
Implementation:
• Implementation of SPIRIX: Search Engine for P2P Information Retrieval in XML-Documents
• P2P-complex:
  • Based on OpenChord
  • Collects peer characteristics
  • Adapted to special requirements of XML IR
• Preliminary evaluation with INEX-Collection
Evaluation:
• Evaluation with INEX-Collection of 2007:
  • Wikipedia-collection: 660,000 documents (4.6 GB)
  • 80 CAS queries (out of 123 topics)
  • run on 1 peer with simulationDHT (measurement of #postings)
  • retrieval of best 1500 results per query
  • PLmax set to unlimited (all HDKs are single XTerms)
  • different structural similarity functions
  • simple version of the proposed formulas (document-based)
• Goal: show the effect of using structural hints for routing
  • efficiency (#postings: 100, 500, 2000 postings)
  • effectiveness (precision at different recall levels)
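To make "precision at different recall levels" concrete, the sketch below computes interpolated precision iP[x] and its average over 101 recall levels for a single ranked list. It is a document-level, binary-relevance simplification written for illustration; the official INEX evaluation may differ in detail, and the function names are hypothetical.

```python
from typing import List, Set


def interpolated_precision(ranked: List[str], relevant: Set[str],
                           recall_level: float) -> float:
    """iP[x]: the highest precision reached at any rank whose recall
    is at least recall_level (0.0 if that recall is never reached)."""
    if not relevant:
        return 0.0
    best, hits = 0.0, 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        recall = hits / len(relevant)
        precision = hits / rank
        if recall >= recall_level:
            best = max(best, precision)
    return best


def average_interpolated_precision(ranked: List[str], relevant: Set[str]) -> float:
    """AiP: mean of iP[x] over the recall levels 0.00, 0.01, ..., 1.00.
    Averaging AiP over all queries would give MAiP."""
    levels = [i / 100 for i in range(101)]
    return sum(interpolated_precision(ranked, relevant, x) for x in levels) / len(levels)
```

For example, with the ranking a, x, b, y and relevant documents {a, b}, iP[0.00] is 1.0 (the first result is relevant) while iP[1.00] is 2/3 (full recall is first reached at rank 3).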
[Figure: precision (axis from 0.00 to 0.35) of the runs noSim100, noSim500, noSim2000, Path_Sim100, Path_Sim500, and Path_Sim2000 at the measures iP[0.00], iP[0.01], iP[0.05], iP[0.10], ircl_prn.0.15, ircl_prn.0.20, ircl_prn.0.30, ircl_prn.0.40, ircl_prn.0.50, ircl_prn.0.75, and ircl_prn.1.00]
[Figure: MAiP by #postings (100, 500, 2000) for noSim (BASELINE) and Path_Sim, together with the improvement. Value labels: 0.107429, 0.132392573, 0.128581689, 0.150062, 0.11563126, 0.11563126. Improvement labels: -10.07%, -11.77%, 7.64%]
[Figure: iP[0.01] (axis from 0.31 to 0.36) by #postings (100, 500, 2000) for noSim (BASELINE) and Path_Sim. Improvement labels: +7.2%, +8.7%, +5.5%]
[Figure: early precision measures iP[0.00], iP[0.01], iP[0.05], and iP[0.10] (precision axis from 0.285 to 0.385) for the runs noSim100, noSim500, noSim2000, Path_Sim100, Path_Sim500, and Path_Sim2000]
Conclusion:
• Propose to take advantage of XML structure when routing in highly distributed environments such as P2P systems
• Provide an infrastructure for investigation of proposed techniques to perform routing based on evidence from document-, element-, collection-, and peer-level
• For 80 CAS topics of INEX2007, efficiency and effectiveness could be improved
• Future work to verify the observed improvement:
  • evaluate formulas in full version
  • runs with multimedia topics INEX 2007; INEX2008
  • measure bandwidth consumption (incl. #messages, message sizes)
  • run on different peers; split collection
5. Questions and Discussion