LCA-Based Selection for XML Document Collections
Georgia Koloniari, joint work with Evaggelia Pitoura
Department of Computer Science, University of Ioannina, Greece
http://dmod.cs.uoi.gr
DMOD Laboratory, University of Ioannina HDMS 2010
What is the topic of this talk?
Fundamental question:
Given a query and many available data sources with large volumes of data, select the most relevant sources for the query and filter out the irrelevant ones
What is the topic of this talk?
More formally:
Source/Database Selection: Problem Definition
Given a query q and a set of data sources, rank the data sources according to the relevance (called goodness) of their data to q
Evaluate q against the most relevant (best) data sources
Source Selection Problem: Previous research
Database selection for:
relational databases: Sayyadian et al. [ICDE ‘07], Yu et al. [SIGMOD ‘07], Vu et al. [SIGMOD ‘08]
textual document collections: Callan et al. [SIGIR ‘95], Gravano et al. [ACM Trans. Database Syst. ‘99]
However, many data sources contain XML documents
XML Selection Problem: Definition
In this paper, we address the source selection problem for XML document collections:
Given a set of N distributed collections of XML documents and a query q, rank the collections based on their goodness (i.e., relevance) to q
Keyword queries, q = (w1, w2, …, wk)
OUTLINE
In the rest of this talk,
What is different with XML?
our LCA-based approach
Define goodness for a database of XML documents
How to compute goodness for a given query
using pre-computed summaries
Experimental evaluation
Keyword search for XML Documents: an example
Query: Atre RDF
[Figure: XML tree of a conf element (cname WWW, year 2010) with paper and demo children, each with a title and author name elements; one paper has title RDF and author Atre]
Search for nodes that contain the keywords (as their label, their content, or the label or value of their attributes)
Result: the subtrees whose nodes contain all the keywords
Keyword search for XML Documents: an example
Query: Atre RDF
[Figure: XML tree of a conf element (cname HDMS, year 2010) with paper and demo children, each with a title and author name elements; one paper has title RDF and author Atre]
The Lowest Common Ancestor (LCA) of a set of nodes V' = {v1, …, vk} (V' ⊆ V) is the deepest node v in a tree T that is an ancestor of all nodes in V'
Keyword search for XML Documents: LCA semantics
Keyword query q = (w1, w2, …, wk)
An unordered labeled XML tree T = (V, E) of an XML document d
An element (node) v ∈ V contains a keyword wi: contains(v, wi)
Si = {v | v ∈ V and contains(v, wi)}, 1 ≤ i ≤ k (the set of nodes that contain keyword wi)
Basic LCA approach: Result(q) ⊆ lca(S1, …, Sk), where lca(S1, …, Sk) is the set of nodes v such that v = lca(v1, …, vk) for some v1 ∈ S1, …, vk ∈ Sk (at least one occurrence of each keyword)
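The basic LCA semantics can be illustrated with a minimal sketch. The toy tree, the names `PARENT`, `TEXT`, `lca`, and `result` are hypothetical, and two simplifications are assumed: a node matches a keyword only through its text content, and a node counts as its own ancestor.

```python
from itertools import product

# Hypothetical toy document: child -> parent links, plus the keyword each leaf contains.
PARENT = {"conf": None, "paper1": "conf", "paper2": "conf",
          "title1": "paper1", "author1": "paper1",
          "title2": "paper2", "author2": "paper2"}
TEXT = {"title1": "facet", "author1": "Zwol", "title2": "RDF", "author2": "Atre"}

def ancestors_or_self(v):
    """Return the path from v up to the root, starting with v itself."""
    path = []
    while v is not None:
        path.append(v)
        v = PARENT[v]
    return path

def lca(nodes):
    """Deepest node that is an ancestor-or-self of every node in `nodes`."""
    common = set(ancestors_or_self(nodes[0]))
    for v in nodes[1:]:
        common &= set(ancestors_or_self(v))
    # Common ancestors form a chain; the deepest one has the longest path to the root.
    return max(common, key=lambda v: len(ancestors_or_self(v)))

def result(query):
    """lca(S1, ..., Sk): one LCA per combination of one node from each Si."""
    sets = [[v for v, w in TEXT.items() if w == kw] for kw in query]
    if any(not s for s in sets):
        return set()  # some keyword does not occur in the document
    return {lca(list(combo)) for combo in product(*sets)}
```

Enumerating every combination is exponential in the number of keyword occurrences, so this is only meant to make the definition concrete, not to suggest an evaluation strategy.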
Keyword search for XML Documents: LCA semantics
Query: paper van Zwol
[Figure: XML tree of a conf element (cname WWW, year 2010) with two paper subtrees and one demo subtree, each with a title and author name elements; the SLCA result node is marked]
Keyword search for XML Documents: LCA semantics
Query: paper van Zwol
[Figure: the same XML tree of the WWW 2010 conf element; two ELCA result nodes are marked]
Lowest Common Ancestor
Many variations:
Only structural (Smallest LCA, Exclusive LCA, etc.)
Using the schema of the documents (Meaningful LCA, Valuable LCA, based also on node/element types)
Using IR-based statistics in addition
We do not propose yet another one; instead, we use the basic LCA (the Result(q) set)
Most others can be implemented by filtering our results (details in the paper)
Experimental evaluation on ELCA
Keyword search for XML Documents: Ranking
Query: paper van Zwol
[Figure: the same XML tree of the WWW 2010 conf element; two ELCA result nodes are marked]
Structure is used to improve the quality of the result: rank results based on the distance of the keywords from their LCA
Keyword search for XML Documents: Ranking
$h(v) = \max_{i} \mathit{dist}(v, v_i)$: the height of the LCA node v ∈ Result(q), i.e., the maximum distance of any of the keywords of q in the XML tree to their LCA node
Query: o, b
[Figure: example tree with several occurrences of the keywords o and b; two LCA nodes are annotated with Height: 2 and Height: 1]
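The height of an LCA node, read as the maximum distance from the LCA down to the keyword nodes, can be sketched as follows. The tree in `PARENT` and the function names are hypothetical; distance is assumed to be the number of edges on the path.

```python
# Hypothetical toy tree: child -> parent links.
PARENT = {"root": None, "x": "root", "f": "x", "b": "f", "o": "f",
          "m": "root", "o2": "m"}

def dist(ancestor, v):
    """Number of edges on the path from v up to `ancestor`."""
    steps = 0
    while v != ancestor:
        v = PARENT[v]
        if v is None:
            raise ValueError("first argument is not an ancestor of the second")
        steps += 1
    return steps

def height(lca_node, keyword_nodes):
    """h(v): the maximum distance from the LCA node to any keyword node."""
    return max(dist(lca_node, v) for v in keyword_nodes)
```

In this toy tree the pair (b, o) under `f` gives height 1, while the pair (b, o2) meets only at the root and gives height 3, so the first result would rank higher.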
Keyword search for XML Documents: Ranking
Query: paper van Zwol
[Figure: the same XML tree of the WWW 2010 conf element; the two ELCA result nodes are annotated with Height: 4 and Height: 3]
Keyword search for XML Documents: Relevance
Query: demo RDF Pound
[Figure: XML tree of a conf element (name WWW, year 2010) with two paper subtrees and one demo subtree; the keywords demo, RDF, and Pound occur in different subtrees]
Not all trees that contain the keywords are relevant
Exclude some of the results as not relevant based on height
Database Selection: Document relevance
A user is interested in d as a result for q iff the distance (height) of a result in d is lower than or equal to a threshold l
Boolean Problem:
$$sim(q, d, l) = \begin{cases} 1, & \text{if } \min_{v \in Result(q)} h(v) \le l \\ 0, & \text{otherwise} \end{cases}$$
Weighted Problem:
$$sim(q, d, l) = \begin{cases} F\left(\min_{v \in Result(q)} h(v)\right), & \text{if } \min_{v \in Result(q)} h(v) \le l \\ 0, & \text{otherwise} \end{cases}$$
F(h(v)): a function F of the height h of a result node v such that the similarity of d to q is greater when h(v) is small
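The two relevance models can be sketched in a few lines. The input is assumed to be the list of heights of the result nodes of q in d, and the decreasing function `F(h) = 1 / (1 + h)` is an illustrative assumption only; the talk does not fix a particular F.

```python
def sim_boolean(result_heights, l):
    """1 if the lowest result node is within the threshold l, else 0."""
    return 1 if result_heights and min(result_heights) <= l else 0

def F(h):
    """Hypothetical decreasing weight: smaller heights give larger similarity."""
    return 1.0 / (1 + h)

def sim_weighted(result_heights, l, weight=F):
    """weight(min height) if that height is within the threshold l, else 0."""
    if not result_heights:
        return 0.0  # q has no result in d
    h = min(result_heights)
    return weight(h) if h <= l else 0.0
```

For example, a document whose result nodes have heights 4 and 3 matches under the Boolean model for l = 3 but not for l = 2.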
Database Selection
A database D is ranked based on its goodness to q, obtained by aggregating the relevance of its documents:
$$Goodness(q, D, l) = \sum_{d \in D} sim(q, d, l)$$
The goodness measure ranks highly collections that either have a large number of documents with relatively small similarity scores, or have fewer documents but with higher similarity scores
The threshold limits the tendency to favor large collections over more relevant ones
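A small sketch of how the threshold affects the ranking under the Boolean model. The collections and their per-document minimum result heights are made-up values, and `goodness` and `rank` are hypothetical helper names; `None` marks a document with no result for the query.

```python
# Hypothetical data: for each collection, the minimum result height per document.
COLLECTIONS = {
    "A": [1, 5, 6, None, 7, 6],  # large collection, mostly distant matches
    "B": [1, 2, 2],              # small collection, close matches
}

def goodness(doc_min_heights, l):
    """Boolean goodness: number of documents whose lowest result is within l."""
    return sum(1 for h in doc_min_heights if h is not None and h <= l)

def rank(collections, l):
    """Collection names ordered by decreasing goodness."""
    return sorted(collections, key=lambda name: goodness(collections[name], l),
                  reverse=True)
```

With a tight threshold (l = 2) the small but more relevant collection B comes first; with a loose one (l = 7) the larger collection A overtakes it.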
OUTLINE
In the rest of this talk,
What is different with XML?
our LCA-based approach
Define goodness for a database of XML documents
How to compute goodness for a given query
using pre-computed summaries
Experimental evaluation
Goodness Estimation
To estimate the goodness of a collection D for a keyword query q, the straightforward approach is:
1. For each document d ∈ D, evaluate q against d:
Find all the LCA nodes in d of the k keywords that appear in q (Result(q))
Select v ∈ Result(q) with the minimum height
If h(v) ≤ l, the Boolean model returns a match and the weighted model computes the similarity based on function F
2. Aggregate over all d ∈ D
Computing the LCA nodes online for each query is expensive
Goodness Estimation
To avoid this cost at execution time: pre-compute the LCA nodes for all possible combinations of keywords that appear in each d and maintain their heights
Number of computed LCA nodes for an XML document with n keywords:
$$\sum_{i=2}^{n} \binom{n}{i} = O(2^n)$$
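As a worked instance of this count (the value n = 20 is an illustrative assumption, not a figure from the talk), the binomial theorem gives a closed form:

```latex
\sum_{i=2}^{n} \binom{n}{i} = 2^{n} - \binom{n}{0} - \binom{n}{1} = 2^{n} - n - 1,
\qquad n = 20:\; 2^{20} - 21 = 1\,048\,555 .
```

By comparison, the same document has only $\binom{20}{2} = 190$ distinct keyword pairs.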
Pair-Wise Goodness Estimation
OUR APPROACH
We maintain information for the height only for pairs of keywords and use this to estimate the height of the LCA for more than 2 keywords
For each distinct pair of keywords (wi, wj) in a document d, we maintain
the height hmin(i, j) of the LCA node v ∈ lca(Si, Sj) with h(v) ≤ h(u), ∀ u ∈ lca(Si, Sj) (the lowest LCA) and
the height hmax(i, j) of the LCA node v ∈ lca(Si, Sj) with h(v) ≥ h(u), ∀u ∈ lca(Si, Sj) (the highest LCA)
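One way the per-document pair table could be built is sketched below. The toy tree, the names `PARENT`, `TEXT`, `lca_height`, and `pair_table` are hypothetical, and keywords are assumed to match node content only.

```python
from itertools import combinations, product

# Hypothetical toy document: child -> parent links and keyword-bearing nodes.
PARENT = {"root": None, "f1": "root", "b1": "f1", "o1": "f1",
          "x": "root", "f2": "x", "o2": "f2", "a1": "root"}
TEXT = {"b1": "b", "o1": "o", "o2": "o", "a1": "a"}

def path_to_root(v):
    """Nodes from v up to the root, starting with v itself."""
    path = []
    while v is not None:
        path.append(v)
        v = PARENT[v]
    return path

def lca_height(u, v):
    """Height of lca(u, v): the larger of the two distances down from the LCA."""
    pu, pv = path_to_root(u), path_to_root(v)
    common = set(pu) & set(pv)
    du = next(i for i, node in enumerate(pu) if node in common)  # first shared node
    dv = pv.index(pu[du])
    return max(du, dv)

def pair_table():
    """Map each keyword pair (wi, wj) to (hmin, hmax) over all node pairs."""
    table = {}
    for wi, wj in combinations(sorted(set(TEXT.values())), 2):
        si = [n for n, w in TEXT.items() if w == wi]
        sj = [n for n, w in TEXT.items() if w == wj]
        heights = [lca_height(u, v) for u, v in product(si, sj)]
        table[(wi, wj)] = (min(heights), max(heights))
    return table
```

The table holds one entry per keyword pair, so its size grows quadratically in the number of distinct keywords rather than exponentially.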
Pairwise-based Height Estimation
Proposition. Let G(V, E) be an acyclic directed graph, and V' = {v1, …, vM} any subset of M nodes in G, V' ⊆ V. Then,
$$h(lca(v_1, \ldots, v_M)) = \max_{v_i, v_j \in V'} h(lca(v_i, v_j))$$
If the keywords are distinct (just a single LCA), then it is easy to see that the height is equal to the maximum
Else, we get estimations
Pair-based Height Estimation
Hmin(d, q): the maximum value of the minimum LCA height values for any pair of keywords in q
Hmax(d, q): the maximum value of the maximum LCA height values for any pair of keywords in q
Theorem. Given a keyword query q and a document d, the height of any v ∈ Result(q) is such that: Hmin(d, q) ≤ h(v) ≤ Hmax(d, q)
Query: o, b, a
[Figure: example tree with several occurrences of the keywords o, b, and a]
Pairwise heights (hmin-hmax): (o, b) → 1-3, (o, a) → 2-3, (b, a) → 1-3
Hmin(d, q): 2, Hmax(d, q): 3
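The bounds for the example query can be recomputed from the three pairwise entries given on the slide. The sketch below assumes the pair heights are already available in a dictionary; `PAIRS` and `height_bounds` are hypothetical names, and queries are assumed to have at least two keywords.

```python
from itertools import combinations

# Pairwise (hmin, hmax) values for the example query o, b, a, as listed on the slide.
PAIRS = {
    frozenset({"o", "b"}): (1, 3),
    frozenset({"o", "a"}): (2, 3),
    frozenset({"b", "a"}): (1, 3),
}

def height_bounds(query, pairs):
    """Return (Hmin, Hmax) for a query of two or more keywords, or None."""
    mins, maxs = [], []
    for wi, wj in combinations(query, 2):
        entry = pairs.get(frozenset({wi, wj}))
        if entry is None:
            return None  # this pair never co-occurs in the document
        mins.append(entry[0])
        maxs.append(entry[1])
    # Both bounds take the maximum over all keyword pairs of the query.
    return max(mins), max(maxs)
```

Calling `height_bounds(["o", "b", "a"], PAIRS)` yields the pair (2, 3), matching the Hmin and Hmax values of the example.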
Boolean Goodness Estimation
If Hmin(d, q) > l, then d is not relevant (no false negatives)
If Hmin(d, q) ≤ l and Hmax(d, q) ≤ l, then d is relevant
If Hmin(d, q) ≤ l and Hmax(d, q) > l, then d is considered relevant, but false positives are possible
For the weighted problem and the goodness estimation bounds, see the details in the paper
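The three cases can be written as a small decision function. This is a sketch only; the function name and the returned labels are hypothetical, and the inputs are assumed to be precomputed values of Hmin(d, q) and Hmax(d, q).

```python
def boolean_estimate(h_min, h_max, l):
    """Classify a document from its height bounds and the threshold l."""
    if h_min > l:
        return "not relevant"       # every result is higher than l
    if h_max <= l:
        return "relevant"           # every result is within l
    return "possibly relevant"      # bounds straddle l: may be a false positive
```

With bounds (2, 3), a threshold of 1 rejects the document, a threshold of 3 accepts it, and a threshold of 2 leaves it in the uncertain case.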
Even with the optimizations, the information to be maintained may remain large =>
summaries to reduce its size
Our summaries are based on Bloom filters
Summarizing the matrices
Bloom-based Summaries
Bloom Filters
Compact data structures for a probabilistic representation of a set
Used to answer membership queries
Bit vector of m bits, initially set to 0, and l hash functions, each mapping an element to a position in 0 … m - 1
Insert x in the Bloom filter: apply the l hash functions and set the corresponding bits to 1
Test whether y is in the set (look up y): apply the same hash functions to y and check that all the corresponding bits are 1
Tunable probability of false positives: the probability of incorrectly identifying an element as a match
Example: bit vector v with m = 10 bits and four hash functions, h1(x) = 4, h2(x) = 2, h3(x) = 5, h4(x) = 8
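A minimal Bloom filter sketch in Python. The class name and the way the hash functions are derived (salting SHA-256 with the function index) are assumptions made for illustration; the talk does not say which hash functions were used.

```python
import hashlib

class BloomFilter:
    """Bit vector of m bits with num_hashes hash functions (illustrative sketch)."""

    def __init__(self, m=10, num_hashes=4):
        self.m = m
        self.num_hashes = num_hashes
        self.bits = [0] * m

    def _positions(self, key):
        # Assumption: derive each hash function by prefixing the key with its index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, key):
        """Set the bit at every hashed position to 1."""
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        """True if all hashed positions are 1 (may be a false positive)."""
        return all(self.bits[pos] for pos in self._positions(key))
```

An inserted key is always found again, so there are no false negatives; a key that was never inserted can still be reported as present when other insertions happen to set all of its positions.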
Bloom-based Summaries for the Boolean Problem
For each d in D, maintain two Bloom filters: BFmin(d) for the hmin(i, j) values and BFmax(d) for the hmax(i, j) values of each distinct keyword pair (wi, wj) in d
Given a similarity threshold l, for all (wi, wj) in d:
if hmin(i, j) ≤ l, then (wi, wj) is hashed as one key and inserted into BFmin(d)
if hmax(i, j) ≤ l, then (wi, wj) is also inserted into BFmax(d)
Bloom-based Summaries for the Boolean Problem
Similarity Evaluation of d to q:
1. Every pair of keywords of q is looked up in BFmin(d); if any pair is not found, d is not relevant
2. Otherwise, the pairs are also looked up in BFmax(d): if all are found, d is definitely relevant; else, d is relevant but with a false positive probability
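The construction and lookup of the two filters can be sketched end to end. Everything named here (`BloomFilter`, `pair_key`, `summarize`, `evaluate`, the SHA-256 based hashing, and the returned labels) is a hypothetical illustration; the default filter size of 996 bits and 4 hash functions follows the experimental settings listed in the talk.

```python
import hashlib
from itertools import combinations

class BloomFilter:
    def __init__(self, m=996, num_hashes=4):
        self.m, self.num_hashes, self.bits = m, num_hashes, [0] * m

    def _positions(self, key):
        # Assumption: hash functions derived by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

def pair_key(wi, wj):
    """Hash a keyword pair as one key, independent of keyword order."""
    return "|".join(sorted((wi, wj)))

def summarize(pair_heights, l):
    """Build (BFmin, BFmax) for one document and threshold l."""
    bf_min, bf_max = BloomFilter(), BloomFilter()
    for (wi, wj), (h_min, h_max) in pair_heights.items():
        if h_min <= l:
            bf_min.insert(pair_key(wi, wj))
        if h_max <= l:
            bf_max.insert(pair_key(wi, wj))
    return bf_min, bf_max

def evaluate(query, bf_min, bf_max):
    """Boolean similarity of a document to a query of two or more keywords."""
    keys = [pair_key(wi, wj) for wi, wj in combinations(query, 2)]
    if not all(k in bf_min for k in keys):
        return "not relevant"
    if all(k in bf_max for k in keys):
        return "relevant"
    return "possibly relevant"
```

Because the threshold is applied when the filters are built, a summary built this way answers queries for one fixed l only.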
Bloom-based Summaries for the Weighted Problem
Group the keyword pairs according to their hmin(i, j) (hmax(i, j) ) value and use a separate Bloom filter for each such group - distance
Compute the similarity by applying F on the number of the highest level for which there was a hit for any of the keyword pairs of the query
[Figure: example XML tree with repeated occurrences of the keywords a, b, o, and h]
OUTLINE
In the rest of this talk,
What is different with XML?
our LCA-based approach
Define goodness for a database of XML documents
How to compute goodness for a given query
using pre-computed summaries
Experimental evaluation
Experimental Evaluation
We consider four approaches for goodness evaluation:
keyword: ignores structure, based solely on the appearance of the keywords
tree: exact evaluation based on ELCA semantics
pair: pairwise estimation
bloom: pairwise estimation + Bloom-based summaries
Experiments on both synthetic and real datasets:
goodness estimation of a single collection
accuracy of the ranking based on goodness
Goodness Estimation (Single Collection)
Using Bloom filters increases the estimation error but also reduces the storage overhead to 8% of the pair-based one
Due to false positives, Bloom filters derive more optimistic lower bounds
[Figure: two plots, one for the Weighted and one for the Boolean problem]
Similarity Threshold (Single Collection)
[Figure: two plots, one for the Weighted and one for the Boolean problem]
For low threshold values, the goodness estimations and the lower bounds are more accurate, while the estimation error increases as the threshold increases
When the threshold value is close to the tree depth of the documents, the accuracy of the estimations improves again
Document & Query Structure (single collection)
Absolute estimation error (distance from ELCA)
Overall acceptable estimations (below 20%)
Our approaches behave worse for queries of "medium" length (4-5) and a small number of repeating elements
Achieved ranking
Optimal Ranking (Ranking achieved through the actual ELCA computation) and Pair-wise Ranking (with and without Blooms)
Spearman Footrule distance between two ranked lists: the absolute difference of their pairwise elements normalized by dividing by 1/2(S), where S is the number of elements in the lists
Mean Average Precision (MAP) for a set of different queries: the average of the precision value (percentage of relevant documents) attained after each query, divided by the number of queries
three different collections (same size, different size, random)
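A sketch of a normalized Spearman footrule distance between two rankings of the same items. The normalization used here, dividing by the maximum possible footrule value ⌊S²/2⌋ so that the result lies in [0, 1], is a common choice and an assumption on my part; it may differ from the exact normalization used in the talk. The function name is hypothetical.

```python
def footrule(ranking_a, ranking_b):
    """Normalized Spearman footrule distance between two rankings of the same items."""
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    # Sum of absolute position differences of each item across the two lists.
    total = sum(abs(i - pos_b[item]) for i, item in enumerate(ranking_a))
    s = len(ranking_a)
    max_total = (s * s) // 2  # assumed normalizer: largest possible footrule value
    return total / max_total if max_total else 0.0
```

Under this normalization identical rankings give 0 and a fully reversed ranking gives 1.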
Ranking (Spearman)
[Figure: three plots: Equal Size Collections, Different Size Collections, Random Collections]
The keyword-based approach ignores the document structure and ranks the collections according to their size
Our approaches behave well, with maximum distance to the actual ranking at 0.3 in the worst case
The Bloom-based approach sometimes outperforms the pair-based one due to the more optimistic estimations
Ranking (MAP)
[Figure: three plots: Equal Size Collections, Different Size Collections, Random Collections]
Our approaches behave well, with a MAP around 0.75 to 0.85
The Bloom-based approach is less precise because of the false positives
Real Data
We split the DBLP bibliographic data collection into two sets of collections, grouped by:
1. year of publication (e.g., collections "2009", "2008", etc.)
2. conference name (e.g., collections "WWW", "VLDB", etc.)
Queries with author names as keywords
With λ equal to 1, we retrieve publications co-written by two authors
Query "Omar Benjelloun and Serge Abiteboul" & collections by year. Correct order (by counting common publications): 2004 2002 2003 2005
Query "Alon Y. Halevy and Zachary G. Ives" & collections by conference. Correct order (by counting common publications): SIGMOD, WebDb, WWW, ...

| Query | Pair-based | Bloom-based | Keyword-based |
| --- | --- | --- | --- |
| Benjelloun and Abiteboul (by year) | SF distance: 0, Precision: 1 | SF distance: 0.2, Precision: 5/6 | SF distance: 0.46, Precision: 1/2 |
| Halevy and Ives (by conference) | SF distance: 0, Precision: 1 | SF distance: 0.75, Precision: 6/8 | SF distance: 0.85, Precision: 1/3 |
Summary
Consider the problem of source selection for XML documents: given a set of XML databases and a keyword query, rank the databases based on their goodness
(LCA-based ranking) Maintain information about the height of the LCA node between keywords
Propose a pair-wise approach: the actual height for a combination of keywords is estimated using the pair-wise heights
Introduce Bloom-based summaries for maintaining heights
Both a Boolean and a Weighted version for document similarity
Evaluation of the quality of the goodness estimation per collection and the actual ranking, as well as usefulness for real data
Future Work
Other definitions of document relevance (including schema based and IR techniques)
Alternative definitions of database goodness + user study for their evaluation
Other types of summaries
Thank you
Related Work

| LCA-type | Description | Definition |
| --- | --- | --- |
| Smallest LCA (SLCA) (Xu & Papakonstantinou, SIGMOD '05) | v is an SLCA if all keywords of q appear in the subtree rooted at v and none of its descendants has such a subtree containing all keywords. | v ∈ slca(S1, S2, …, Sk) if v ∈ lca(S1, …, Sk) and ∀ u ∈ lca(S1, S2, …, Sk), v is not an ancestor of u. |
| Exclusive LCA (ELCA) (Xu & Papakonstantinou, EDBT '08) | v is an ELCA if it contains at least one occurrence of each keyword in the subtree rooted at v, excluding the occurrences of the keywords in subtrees of its descendants already containing all the keywords. | v ∈ elca(S1, S2, …, Sk) iff ∃ v1 ∈ S1, …, vk ∈ Sk: v = lca(v1, …, vk) and ∀ vi (1 ≤ i ≤ k) the child of v in the path from v to vi is not an LCA of S1, …, Sk itself or an ancestor of such an LCA. |
| Meaningful LCA (MLCA) (Li et al., VLDB '04) | v is an MLCA if in the subtree rooted at v, the nodes containing the keywords are pairwise meaningfully related. | v is not an MLCA if all pairs of nodes (vi, vj) in the subtree rooted at v that contain the keywords of q are such that ∃ v'i, v'j containing the same keywords such that lca(vi, vj) is an ancestor of lca(v'i, v'j). |
| Valuable LCA (VLCA) (Li et al., CIKM '07) | v is a VLCA iff for the nodes vi, vj, containing keywords (wi, wj), in the subtree rooted at v, there are no other two nodes of the same label/tag except vi, vj. | For v = lca(v1, …, vk), v is the VLCA of v1, …, vk iff ∀ vi, vj there are no other two nodes of the same label/tag. |

For all the variations of the LCA, for any query q and document d, the set of the LCA nodes of the keywords in q (basic LCA nodes) is a superset of any type of LCA nodes, i.e., SLCA, ELCA, MLCA, VLCA
Experimental Evaluation
| Parameter | Range | Default |
| --- | --- | --- |
| # of documents per collection (\|D\|) | 20-200 | 100 |
| # of elements per document (n) | - | 50000 |
| depth of XML tree (depth) | 4-20 | 12 |
| % of repeating element names (r) | 0-0.6 | 0.3 |
| query elements appearing in documents | - | 90% |
| query length (k) | 1-6 | 4 |
| similarity threshold (l) | 1-12 | 4 |
| number of collections (N) | - | 12 |
| Number of Bloom filter hash functions | - | 4 |
| Size of Bloom filter | - | 996 bits |