
LCA-Based Selection for XML Document Collections

Georgia Koloniari, joint work with Evaggelia Pitoura

Department of Computer Science, University of Ioannina, Greece

http://dmod.cs.uoi.gr

DMOD Laboratory, University of Ioannina HDMS 2010

2

Fundamental question:

Given a query and many available data sources with large volumes of data,

select the most relevant sources for the query/filter out the irrelevant ones

What is the topic of this talk?


3

More formally:

Source/Database Selection: Problem Definition

Given a query q and a set of data sources, rank the data sources according to the relevance (called goodness) of their data to q

Evaluate q against the most relevant (best) data sources

What is the topic of this talk?

Database selection for:

relational databases: Sayyadian et al [ICDE ‘07], Yu et al [SIGMOD ‘07], Vu et al [SIGMOD ‘08]

textual document collections: Callan et al [SIGIR ‘95], Gravano et al [ACM Trans. Database Syst. ‘99]

Source Selection Problem: Previous research

However, many data sources contain XML documents


4

In this paper,

The source selection problem for XML Document Collections

Given a set of N distributed collections of XML documents and a query q, rank the collections based on their goodness (i.e., relevance) to q

Keyword queries, q = (w1, w2, …, wk)

XML Selection Problem: Definition


5

OUTLINE

In the rest of this talk,

What is different with XML?

our LCA-based approach

Define goodness for a database of XML documents

How to compute goodness for a given query

using pre-computed summaries

Experimental evaluation


6

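[Figure: XML tree of a conference document. The root conf has cname "WWW" and year "2010", two paper elements (titles "facet" and "RDF") and a demo element (title "Top-k"), each with author/name children; the author names shown are van Zwol, Sigurbjörnsson, Atre, Soliman and Chaoij.]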

Query: Atre RDF

Keyword search for XML Documents: an example

Search for nodes that contain the keywords (as their label or content, or as the label or value of their attributes)

Result: the subtrees whose nodes contain all the keywords


7

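[Figure: XML tree of a conference document. The root conf has cname "HDMS" and year "2010", two paper elements (titles "XPath" and "RDF") and a demo element (title "Top-k"), each with author/name children; the author names shown are Georgiadis, Vassalos, Atre, Soliman and Chaoij.]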

Query: Atre RDF

Keyword search for XML Documents: an example

The Lowest Common Ancestor (LCA) of a set of nodes V' = {v1, . . . , vk} (V' ⊆ V) is the deepest node v in a tree T that is an ancestor of all nodes in V'


8

Keyword query q = (w1, w2, …, wk)

An unordered labeled XML tree T = (V, E) of an XML document d

An element (node) v ∈ V contains a keyword wi - contains(v, wi)

Si = {v|v ∈ V and contains(v, wi)}, 1 ≤ i ≤ k

(set of nodes that contain keyword wi)

Result(q) ⊆ lca(S1, . . . , Sk) (basic LCA approach)

lca(S1, . . . , Sk) evaluates to the set of LCA nodes v such that v = lca(v1, . . . , vk) with v1 ∈ S1, . . . , vk ∈ Sk

(at least one occurrence of each keyword)

Keyword search for XML Documents: LCA semantics
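
The basic LCA semantics can be illustrated with a minimal sketch. It assumes the tree is stored as a child-to-parent map and that the sets S_i are given as a keyword-to-nodes map; the node names, the `PARENT` and `CONTAINS` tables and the helper functions are hypothetical and only serve the example.

```python
from itertools import product

# Hypothetical toy tree as a child -> parent map (the root maps to None).
PARENT = {
    "conf": None,
    "paper1": "conf", "paper2": "conf",
    "title1": "paper1", "author1": "paper1",
    "title2": "paper2", "author2": "paper2",
}

# Hypothetical keyword occurrences: keyword w_i -> set S_i of nodes containing it.
CONTAINS = {
    "RDF": {"title2"},
    "Atre": {"author2"},
    "facet": {"title1"},
}

def ancestors_or_self(node):
    """Return the path from node up to the root, node first."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def lca(nodes):
    """Deepest node that is an ancestor (or the node itself) of every given node."""
    common = set(ancestors_or_self(nodes[0]))
    for n in nodes[1:]:
        common &= set(ancestors_or_self(n))
    # Walking up from the first node, the first common node is the deepest one.
    for candidate in ancestors_or_self(nodes[0]):
        if candidate in common:
            return candidate

def result(query):
    """Basic LCA set: one LCA per combination that takes one node from each S_i."""
    sets = [CONTAINS.get(w, set()) for w in query]
    return {lca(list(combo)) for combo in product(*sets)}
```

For the toy data, `result(["RDF", "Atre"])` yields the second paper element, while `result(["facet", "Atre"])` only meets at the root. Enumerating every combination is exponential in the number of keywords, which is acceptable for a sketch but not for real documents.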


9

Query: paper van Zwol


[Figure: XML tree for the conference "WWW" (year 2010) with two paper elements titled "facet" and a demo element titled "RDF"; the author names shown are van Zwol, Lin, Yan and Sigurbjörnsson. The SLCA node for the query is marked.]


10



[Figure: the same XML tree for the conference "WWW" (year 2010), with two nodes marked as ELCA.]


11

Lowest Common Ancestor

Many variations

Only structural (Smallest LCA, Exclusive LCA, etc)

Schema of the documents (Meaningful LCA, Valuable LCA, based also on node/element types)

In addition, IR-based statistics

We do not propose yet another variation; instead, we use the basic LCA (the Result(q) set)

Most of the others can be implemented by filtering our results (details in the paper)

Experimental evaluation on ELCA


12


Keyword search for XML Documents: Ranking

[Figure: the same XML tree for the conference "WWW" (year 2010), with two nodes marked as ELCA.]

Structure is used to improve the quality of the result -> rank results based on the distance of the keywords from their LCA


13

$$h(v) = \max_{i} \mathit{dist}(v, v_i)$$

the height of the LCA node v ∈ Result(q):

the maximum distance of any of the keywords of q in the XML tree to their LCA node

Query: o, b

[Figure: example XML tree with several occurrences of the keywords o and b; two LCA nodes are annotated "Height: 2" and "Height: 1".]
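
As a small sketch of the height measure, assuming h(v) is the largest number of edges between an LCA node v and the keyword nodes it covers, the code below walks up a hypothetical child-to-parent map. The tree and the node names are invented for the illustration.

```python
# Hypothetical toy tree as a child -> parent map (the root maps to None).
PARENT = {
    "r": None,
    "x": "r", "y": "r",
    "o1": "x", "b1": "x",
    "z": "y", "b2": "z",
}

def dist_to_ancestor(ancestor, node):
    """Number of edges on the path from node up to ancestor."""
    steps = 0
    while node != ancestor:
        node = PARENT[node]
        if node is None:
            raise ValueError("the first argument is not an ancestor of the node")
        steps += 1
    return steps

def height(lca_node, keyword_nodes):
    """h(v): maximum distance from the LCA node to any of its keyword nodes."""
    return max(dist_to_ancestor(lca_node, n) for n in keyword_nodes)
```

Here `height("x", ["o1", "b1"])` is 1 because both keyword nodes are children of `x`, whereas `height("r", ["o1", "b2"])` is 3 because `b2` lies three edges below the root.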


14



[Figure: the same XML tree for the conference "WWW" (year 2010); the two ELCA nodes are annotated "Height: 4" and "Height: 3".]


15

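[Figure: XML tree of a conference document. The root conf has name "WWW" and year "2010", two paper elements (titles "object" and "RDF") and a demo element (title "Top-k"), each with author/name children; the author names shown are Zaragoza, Pound, Atre, Soliman and Chaoij.]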

Query: demo RDF Pound

Keyword search for XML Documents: Relevance

Not all trees that contain the keywords are relevant

Exclude some of the results as not relevant based on height


16

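A user is interested in d as a result for q iff the distance (height) of a result in d is lower than or equal to a threshold λ

Boolean Problem:

$$sim(q, d, \lambda) = \begin{cases} 1, & \text{if } \min_{v \in Result(q)} h(v) \le \lambda \\ 0, & \text{otherwise} \end{cases}$$

Weighted Problem:

$$sim(q, d, \lambda) = \begin{cases} F\left(\min_{v \in Result(q)} h(v)\right), & \text{if } \min_{v \in Result(q)} h(v) \le \lambda \\ 0, & \text{otherwise} \end{cases}$$

F(h(v)): a function F of the height h of a result node v such that the similarity of d to q is greater when h(v) is small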

Database Selection: Document relevance
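
The two similarity variants can be sketched as follows. The threshold is called `lam` here, a document is represented only by the heights of its result nodes, and the decreasing function `F(h) = 1 / (1 + h)` is an assumption chosen for illustration since the slides do not fix F. Returning 0 for a document without any result node is also an assumption.

```python
def sim_boolean(result_heights, lam):
    """1 if the lowest result node has height at most lam, else 0."""
    if not result_heights:
        return 0  # assumption: no result node means no match
    return 1 if min(result_heights) <= lam else 0

def sim_weighted(result_heights, lam, F=lambda h: 1.0 / (1 + h)):
    """F(min height) if the lowest result node is within lam, else 0.

    F is a hypothetical decreasing function of the height.
    """
    if not result_heights:
        return 0.0
    lowest = min(result_heights)
    return F(lowest) if lowest <= lam else 0.0
```

With heights `[1, 3]` and `lam = 2`, the Boolean variant returns 1 and the weighted variant returns `F(1) = 0.5`; with heights `[3]` both return 0.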


17


A database D is ranked based on its goodness to q, computed by aggregating the relevance of its documents

Database Selection

$$Goodness(q, D, \lambda) = \sum_{d \in D} sim(q, d, \lambda)$$

The goodness measure ranks highly collections that:

have a large number of documents with relatively small similarity scores

have fewer documents but with higher similarity scores

The threshold limits the tendency to favor large collections in contrast to more relevant ones
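
A short sketch of the aggregation, assuming goodness is the sum of the per-document similarities and that each document is summarized by the minimum height of its result nodes (`None` when it has none). The threshold name `lam`, the function `F` and the sample collections are hypothetical.

```python
def sim(min_height, lam, weighted=False, F=lambda h: 1.0 / (1 + h)):
    """Similarity of one document, given the minimum height of its result nodes."""
    if min_height is None or min_height > lam:
        return 0.0
    return F(min_height) if weighted else 1.0

def goodness(min_heights, lam, weighted=False):
    """Sum of the document similarities of one collection."""
    return sum(sim(h, lam, weighted) for h in min_heights)

def rank(collections, lam, weighted=False):
    """Collection names ordered by decreasing goodness."""
    return sorted(
        collections,
        key=lambda name: goodness(collections[name], lam, weighted),
        reverse=True,
    )

# Hypothetical collections: one large with mostly distant matches, one small with close matches.
COLLECTIONS = {"large": [5] * 10 + [2], "small": [1, 1, 2]}
```

With `lam = 2` the small collection ranks first, because only one document of the large collection passes the threshold. Raising the threshold to 5 lets all eleven documents of the large collection count, and the order flips.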


18

OUTLINE





How to compute goodness for a given query




19

Goodness Estimation

To estimate the goodness of a collection D for a keyword query q, the straightforward approach is:

1. For each document d ∈ D:

Evaluate q against d
Find all the LCA nodes in d of the k keywords that appear in q (Result(q))
Select v ∈ Result(q) with the minimum height
If h(v) ≤ λ:
the Boolean model returns a match
the weighted model computes the similarity based on function F

2. Aggregate over all d ∈ D

Computing LCA online for each query is expensive


20

Goodness Estimation

To avoid computing LCAs at execution time: pre-compute the LCA nodes for all possible combinations of keywords that appear in each d and maintain their heights

Number of computed LCA nodes for an XML document with n keywords:

$$\sum_{i=2}^{n} \binom{n}{i} = O(2^n)$$
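
To get a feeling for the growth, the following sketch counts the keyword combinations of size two or more, i.e. the sum of C(n, i) for i from 2 to n, which equals 2^n - n - 1. The function name is hypothetical.

```python
from math import comb

def lca_combinations(n):
    """Number of keyword subsets of size at least two among n keywords."""
    return sum(comb(n, i) for i in range(2, n + 1))
```

For example, 4 keywords give 11 combinations and 10 keywords already give 1013, compared with 6 and 45 keyword pairs respectively.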


21

Pair-Wise Goodness Estimation

OUR APPROACH

We maintain information for the height only for pairs of keywords and use this to estimate the height of the LCA for more than 2 keywords

For each distinct pair of keywords (wi, wj) in a document d, we maintain

the height hmin(i, j) of the LCA node v ∈ lca(Si, Sj) with h(v) ≤ h(u), ∀ u ∈ lca(Si, Sj) (the lowest LCA) and

the height hmax(i, j) of the LCA node v ∈ lca(Si, Sj) with h(v) ≥ h(u), ∀u ∈ lca(Si, Sj) (the highest LCA)
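
A minimal sketch of how the pair table could be pre-computed for one document. It assumes a child-to-parent map and a keyword-to-nodes map, and it takes the height of an LCA as the larger of the two distances to the keyword nodes; the tree, the node names and the function names are hypothetical.

```python
from itertools import combinations, product

# Hypothetical toy tree as a child -> parent map (the root maps to None).
PARENT = {
    "r": None,
    "x": "r", "y": "r",
    "n1": "x", "n2": "x",
    "z": "y", "n3": "z",
}

# Hypothetical keyword occurrences: keyword -> nodes that contain it.
CONTAINS = {"o": {"n1"}, "b": {"n2", "n3"}}

def path_up(node):
    """Path from node up to the root, node first."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def lca_height(u, v):
    """Height of lca(u, v): the larger of the two distances to the LCA."""
    up_u, up_v = path_up(u), path_up(v)
    in_v = set(up_v)
    for steps_u, node in enumerate(up_u):
        if node in in_v:
            return max(steps_u, up_v.index(node))

def pair_heights(contains):
    """Map each keyword pair to (hmin, hmax) over all of its LCA heights."""
    table = {}
    for wi, wj in combinations(sorted(contains), 2):
        heights = [lca_height(u, v) for u, v in product(contains[wi], contains[wj])]
        if heights:
            table[(wi, wj)] = (min(heights), max(heights))
    return table
```

In the toy tree the pair (b, o) has one LCA at height 1 (both nodes under `x`) and one at height 3 (meeting only at the root), so the table stores (1, 3) for it.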


22

Proposition. Let G(V, E) be an acyclic directed graph, and V' = {v1, . . . , vM} any subset of M nodes in G, V' ⊆ V. Then,

$$h(lca(v_1, \ldots, v_M)) = \max_{v_i, v_j \in V'} h(lca(v_i, v_j))$$

Pairwise-based Height Estimation

If the keywords are distinct (just a single LCA), then it is easy to see that the height is equal to the maximum

Else, we get estimations


23

Pair-based Height Estimation

Hmin(d, q): the maximum value of the minimum LCA height values for any pair of keywords in q

Hmax(d, q): the maximum value of the maximum LCA height values for any pair of keywords in q

(o, b) → 1-3
(o, a) → 2-3
(b, a) → 1-3

Hmin(d, q): 2
Hmax(d, q): 3

Theorem. Given a keyword query q and a document d, the height of any v ∈ Result(q) is such that: Hmin(d, q) ≤ h(v) ≤ Hmax(d, q)

Query: o, b, a

[Figure: example XML tree containing several occurrences of the keywords o, b and a.]
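
The slide's example for the query (o, b, a) can be reproduced with a few lines. The sketch assumes the per-pair (hmin, hmax) values are already available in a table keyed by unordered keyword pairs; the table layout and the function name are hypothetical.

```python
from itertools import combinations

# Pair heights from the example: pair -> (hmin, hmax).
PAIR_HEIGHTS = {
    frozenset(("o", "b")): (1, 3),
    frozenset(("o", "a")): (2, 3),
    frozenset(("b", "a")): (1, 3),
}

def height_bounds(query, pair_heights):
    """Return (Hmin, Hmax) for the query, or None if some pair is unknown."""
    pairs = [frozenset(p) for p in combinations(query, 2)]
    if not pairs or any(p not in pair_heights for p in pairs):
        return None
    h_min = max(pair_heights[p][0] for p in pairs)  # max of the minimum heights
    h_max = max(pair_heights[p][1] for p in pairs)  # max of the maximum heights
    return h_min, h_max
```

`height_bounds(["o", "b", "a"], PAIR_HEIGHTS)` returns `(2, 3)`, matching Hmin(d, q) = 2 and Hmax(d, q) = 3 in the example.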


24

Boolean Goodness Estimation

If Hmin(d, q) > λ → not relevant (no false negatives)

If Hmin(d, q) ≤ λ and Hmax(d, q) ≤ λ, then relevant

If Hmin(d, q) ≤ λ and Hmax(d, q) > λ, relevant but false positives are possible

For the weighted problem and the goodness estimation bounds, see the details in the paper
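
The three Boolean cases amount to a comparison of the two bounds with the threshold. In the sketch below the threshold is called `lam` and the returned labels are hypothetical names for the three outcomes.

```python
def classify(h_min, h_max, lam):
    """Three-way Boolean decision from the pairwise height bounds."""
    if h_min > lam:
        return "not relevant"       # even the lower bound exceeds the threshold
    if h_max <= lam:
        return "relevant"           # even the upper bound is within the threshold
    return "possibly relevant"      # threshold falls between the bounds
```

With the bounds (2, 3) from the earlier example, a threshold of 1 rules the document out, a threshold of 3 accepts it, and a threshold of 2 leaves it as a possible false positive.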


25

Even with the optimizations, the information to be maintained may remain large =>

summaries to reduce its size

Our summaries are based on Bloom filters

Summarizing the matrices


26

Bloom-based Summaries

Bloom Filters

Compact data structures for a probabilistic representation of a set

Used to answer membership queries

Bit vector v of m bits, initially all set to 0, and l hash functions, each mapping an element to a position in 0 ... m - 1

Insert x into the Bloom filter: apply the l hash functions and set the corresponding bits to 1

Test whether y is in the set (look up y): apply the same hash functions again and check that all the corresponding bits are set to 1

Tunable probability of false positives: the probability of incorrectly identifying an element as a match

[Figure: bit vector v with m = 10 bits; inserting x with h1(x) = 4, h2(x) = 2, h3(x) = 5, h4(x) = 8 sets four bits to 1.]
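
A compact Bloom filter sketch in Python. Deriving the hash functions by salting SHA-256 with the function index is an assumption made for the example; the slides do not say which hash functions are used. The class and method names are hypothetical.

```python
import hashlib

class BloomFilter:
    """Bit vector of m bits with a fixed number of hash functions."""

    def __init__(self, m, num_hashes):
        self.m = m
        self.num_hashes = num_hashes
        self.bits = [0] * m  # all bits start at 0

    def _positions(self, key):
        # One position per hash function, each in the range 0 .. m - 1.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, key):
        """Set the bit at every hashed position to 1."""
        for pos in self._positions(key):
            self.bits[pos] = 1

    def lookup(self, key):
        """True if all hashed positions are 1 (may be a false positive)."""
        return all(self.bits[pos] for pos in self._positions(key))
```

A lookup never misses an inserted key, so there are no false negatives; a key that was never inserted can still be reported as present when other insertions happen to have set all of its positions.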


27

Bloom-based Summaries for the Boolean Problem

For each d in D maintain two Bloom filters:

BFmin(d) for the hmin(i, j) and BFmax(d) for the hmax(i, j) values

of each distinct keyword pair (wi, wj) in d

Given a similarity threshold λ, for all (wi, wj) in d

if hmin(i, j) ≤ λ, then (wi, wj) is hashed as one key and inserted into BFmin(d)

if hmax(i, j) ≤ λ, then (wi, wj) is also inserted into BFmax(d)


28

Bloom-based Summaries for the Boolean Problem

Similarity Evaluation of d to q:

1. every pair of keywords of q is looked up in BFmin(d); if any pair is not found, d is not relevant

2. otherwise, the pairs are also looked up in BFmax(d); if all are found, d is definitely relevant, else d is considered relevant but with a false positive probability
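
The construction and the two-step lookup can be sketched end to end. The small Bloom filter class, the SHA-256 based hashing, the way a pair is turned into a single key and all function names are assumptions for the illustration; the pair heights reuse the (o, b, a) example and the threshold is called `lam`.

```python
import hashlib
from itertools import combinations

class BloomFilter:
    def __init__(self, m=996, num_hashes=4):
        self.m, self.num_hashes, self.bits = m, num_hashes, [0] * m

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def lookup(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

def pair_key(wi, wj):
    """Hash a keyword pair as one key, independent of the order of the keywords."""
    return "|".join(sorted((wi, wj)))

def build_summaries(pair_heights, lam):
    """Build (BFmin, BFmax) for one document from its pair -> (hmin, hmax) table."""
    bf_min, bf_max = BloomFilter(), BloomFilter()
    for (wi, wj), (h_min, h_max) in pair_heights.items():
        if h_min <= lam:
            bf_min.insert(pair_key(wi, wj))
        if h_max <= lam:
            bf_max.insert(pair_key(wi, wj))
    return bf_min, bf_max

def evaluate(query, bf_min, bf_max):
    """Two-step Boolean evaluation of a query against the document summaries."""
    keys = [pair_key(a, b) for a, b in combinations(query, 2)]
    if not all(bf_min.lookup(k) for k in keys):
        return "not relevant"
    if all(bf_max.lookup(k) for k in keys):
        return "relevant"
    return "possibly relevant"

# Pair heights of the (o, b, a) example: pair -> (hmin, hmax).
PAIRS = {("o", "b"): (1, 3), ("o", "a"): (2, 3), ("b", "a"): (1, 3)}
```

With `lam = 3` every pair enters both filters and the query (o, b, a) is reported as relevant; with `lam = 2` the pairs only enter BFmin, so the document is kept as possibly relevant. Note that a Bloom filter hit can itself be a false positive, so the answers are as reliable as the filter configuration allows.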


29

Bloom-based Summaries for the Weighted Problem

Group the keyword pairs according to their hmin(i, j) (hmax(i, j) ) value and use a separate Bloom filter for each such group - distance

Compute the similarity by applying F on the number of the highest level for which there was a hit for any of the keyword pairs of the query

[Figure: example XML tree with nodes labeled f, v, a, d, m, o, h, e, x and b.]


30

OUTLINE





How to compute goodness for a given query




31

Experimental Evaluation

We consider four approaches for goodness evaluation:

keyword: ignores structure, based solely on the appearance of the keywords
tree: exact evaluation based on ELCA semantics
pair: pairwise estimation
bloom: pairwise + Bloom-based summaries

Experiments on both synthetic and real datasets

goodness estimation of a single collection

accuracy of the ranking based on goodness


32

Goodness Estimation (Single Collection)

Using Bloom filters increases the estimation error but also reduces the storage overhead to 8% of the pair-based one

Due to false positives, Bloom filters derive more optimistic lower bounds

[Figure panels: Weighted, Boolean]


33

[Figure panels: Weighted, Boolean]

For low threshold values, the goodness estimations and the lower bounds are more accurate, while the estimation errors grow as the threshold increases

When the threshold value is close to the tree depth of the documents, the accuracy of the estimations improves again

Similarity Threshold (Single Collection)


34

Document & Query Structure (single collection)

Absolute estimation error (distance from ELCA)

Overall acceptable estimations (below 20%)

Our approaches behave worse for queries of "medium" length (4-5) and a small number of repeating elements


35

Achieved ranking

Optimal Ranking (Ranking achieved through the actual ELCA computation) and Pair-wise Ranking (with and without Blooms)

Spearman Footrule distance between two ranked lists: the sum of the absolute differences between the positions of each element in the two lists, normalized by dividing by 1/2(S), where S is the number of elements in the lists

Mean Average Precision (MAP) for a set of different queries: the average of the precision value (percentage of relevant documents) attained after each query, divided by the number of queries

three different collections (same size, different size, random)


36

Ranking (Spearman)

[Figure panels: Equal Size Collections, Different Size Collections, Random Collections]

The keyword-based approach ignores the document structure and ranks the collections according to their size

Our approaches behave well, with a maximum distance to the actual ranking of 0.3 in the worst case

The Bloom-based approach sometimes outperforms the pair-based one due to its more optimistic estimations


37

[Figure panels: Equal Size Collections, Different Size Collections, Random Collections]

Our approaches behave well, with a MAP around 0.75 to 0.85

The Bloom-based approach is less precise because of the false positives

Ranking (MAP)


38

We split the DBLP bibliographic data collection:

Two sets of collections grouped by:
1. year of publication (e.g., collections "2009", "2008", etc.)
2. conference name (e.g., collections "WWW", "VLDB", etc.)

Queries with author names as keywords

With λ equal to 1, we retrieve publications co-written by two authors

Pair-based Bloom-based Keyword-based

"Omar Benjelloun and Serge Abiteboul" & collections by year
Correct order (by counting common publications): 2004 2002 2003 2005

SF distance: 0
Precision: 1

SF distance: 0.2
Precision: 5/6

"Alon Y. Halevy and Zachary G. Ives" & collections by conference
Correct order (by counting common publications): SIGMOD, WebDb, WWW, ...

SF distance: 0
Precision: 1



Real Data


39

Summary

(LCA-based ranking) Maintain information about the height of the LCA node between keywords

Propose a pair-wise approach: the actual height for a combination of keywords is estimated using the pair-wise heights

Introduce Bloom-based summaries for maintaining heights

Both a Boolean and a Weighted version for document similarity

Evaluation of the quality of the goodness estimation per collection and the actual ranking, as well as usefulness for real data

Consider the problem of source selection for XML documents: given a set of XML databases and a keyword query, rank the databases based on their goodness


40

Future Work

Other definitions of document relevance (including schema based and IR techniques)

Alternative definitions of database goodness + user study for their evaluation

Other types of summaries


41

Thank you


42

Related Work

LCA type: Smallest LCA (SLCA) (Yu & Papakonstantinou, SIGMOD ’05)
Description: v is an SLCA if all keywords of q appear in the subtree rooted at v and none of its descendants has such a subtree containing all keywords.
Definition: v ∈ slca(S1, S2, . . ., Sk) if v ∈ lca(S1, . . ., Sk) and ∀u ∈ lca(S1, S2, . . ., Sk), v is not an ancestor of u.

LCA type: Exclusive LCA (ELCA) (Yu & Papakonstantinou, EDBT ’08)
Description: v is an ELCA if it contains at least one occurrence of each keyword in the subtree rooted at v, excluding the occurrences of the keywords in subtrees of its descendants already containing all the keywords.
Definition: v ∈ elca(S1, S2, . . ., Sk) iff ∃ v1 ∈ S1, . . ., vk ∈ Sk: v = lca(v1, . . ., vk) and ∀vi (1 ≤ i ≤ k) the child of v in the path from v to vi is not an LCA of S1, . . ., Sk itself or an ancestor of such an LCA.

LCA type: Meaningful LCA (MLCA) (Li et al, VLDB ’04)
Description: v is an MLCA if in the subtree rooted at v, the nodes containing the keywords are pairwise meaningfully related.
Definition: v is not an MLCA if all pairs of nodes (vi, vj) in the subtree rooted at v that contain the keywords of q are such that ∃ v’i, v’j containing the same keywords such that lca(vi, vj) is an ancestor of lca(v’i, v’j).

LCA type: Valuable LCA (VLCA) (Li et al, CIKM ’07)
Description: v is a VLCA iff for the nodes vi, vj, containing keywords (wi, wj), in the subtree rooted at v, there are no other two nodes of the same label/tag except vi, vj.
Definition: For v = lca(v1, . . ., vk), v is the VLCA of v1, . . ., vk iff for all vi, vj there are no other two nodes of the same label/tag.

For all the variations of the LCA, for any query q and document d, the set of the LCA nodes of the keywords in q (basic LCA nodes) is a superset of any type of LCA nodes, i.e., SLCA, ELCA, MLCA, VLCA


43

Experimental Evaluation

Parameter Range Default

# of documents per collection (|D|) 20-200 100

# of elements per document (n) - 50000

depth of XML tree (depth) 4-20 12

% of repeating element names (r) 0-0.6 0.3

query elements appearing in documents - 90%

query length (k) 1-6 4

similarity threshold (λ) 1-12 4

number of collections (N) - 12

Number of Bloom filter hash functions - 4

Size of Bloom filter - 996 bits
