Clustering XML Documents for Query Performance Enhancement Wang Lian

Post on 03-Jan-2016

Page 1: Clustering XML Documents for Query Performance Enhancement Wang Lian

Clustering XML Documents for Query Performance Enhancement

Wang Lian

Outline

Related Work
Motivation
Our Approach
  S-Graph
  Distance Function
  Clustering Algorithm
Experimental Results

Related Work

Besides storing XML documents in their native format, storing them in an RDBMS is an established trend.

There are two main approaches to storing XML documents in an RDBMS:
  Schema mapping
  Structure mapping

Related Work (cont.)

In schema mapping, a database schema is derived from the DTD of the XML documents, so different DTDs generate different database schemas.

In structure mapping, the database schema is fixed by defining a set of generic tables.

Motivation

With either schema mapping or structure mapping, documents must be cut into pieces and inserted into tables. To answer a query, the tables must be joined.

As the tables grow larger, the join cost may become very high.

Our observation: if a collection contains documents of different structures, then clustering the documents by structure may reduce the join cost.

Motivation (cont.)

An example DTD:

<!ELEMENT conference (name, author)*>

<!ELEMENT journal (name, author, publisher)*>

<!ELEMENT name (#PCDATA)*>

<!ELEMENT author (#PCDATA)*>

<!ELEMENT publisher (#PCDATA)*>

Motivation (cont.)

Documents in 3 clusters

Motivation (cont.)

Unpartitioned schema

Motivation (cont.)

Partitioned schema

Motivation (cont.)

Suppose we want to answer the XPath query /conference/author/text().

Using the unpartitioned schema, table conference (2 tuples) and table author (9 tuples) must be joined.

Using the partitioned schema, table conference1 (2 tuples) and table author1 (3 tuples) must be joined.
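
To make the saving concrete, here is a rough sketch that counts tuple comparisons for the two joins. It assumes a plain nested-loop join, which is only an illustration: a real RDBMS may pick a different join method. The row contents are placeholders; only the table sizes (2, 9, and 3 tuples) come from the example above.

```python
def nested_loop_join(parents, children):
    """Join (parent_id, value) child rows to parent ids, counting comparisons."""
    comparisons = 0
    result = []
    for pid in parents:
        for child_parent_id, value in children:
            comparisons += 1
            if child_parent_id == pid:
                result.append((pid, value))
    return result, comparisons


# Placeholder rows: only the tuple counts match the slide.
conference = [1, 2]                                        # 2 tuples
author = [(i % 5, f"author{i}") for i in range(9)]         # 9 tuples
author1 = [(1, "a"), (2, "b"), (1, "c")]                   # 3 tuples

_, unpartitioned_cost = nested_loop_join(conference, author)
_, partitioned_cost = nested_loop_join(conference, author1)
print(unpartitioned_cost, partitioned_cost)
```

Under this simple cost model the partitioned schema needs 2 x 3 comparisons instead of 2 x 9.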

Our Approach

An XML document is a mixture of structure information and data values. In our context, only the structure information is used for clustering.

We need a proper distance function before applying any clustering algorithm.

S-Graph

Given a set of XML documents C, the structure graph (s-graph) of C, sg(C) = (N, E), is a directed graph such that N is the set of all elements and attributes in the documents in C, and (a, b) ∈ E if and only if a is a parent element of b in some document in C (b can be an element or an attribute).

An s-graph does not capture all the structure information of the documents, but it does capture the parent-child relationships, which are valuable for evaluating path expressions.
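
As a minimal sketch of this definition, the following builds the node and edge sets of an s-graph from XML strings with Python's standard `xml.etree.ElementTree`. The function name `s_graph`, the `@` prefix for attribute nodes, and the `id` attribute in the sample document are assumptions made for illustration; they are not part of the slides.

```python
import xml.etree.ElementTree as ET


def s_graph(documents):
    """Return (nodes, edges) of the s-graph for an iterable of XML strings."""
    nodes, edges = set(), set()
    for text in documents:
        root = ET.fromstring(text)
        nodes.add(root.tag)
        stack = [root]
        while stack:
            parent = stack.pop()
            # Attributes become child nodes; the "@" prefix is an assumed convention.
            for name in parent.attrib:
                attr = "@" + name
                nodes.add(attr)
                edges.add((parent.tag, attr))
            # Child elements contribute one parent-child edge each.
            for child in parent:
                nodes.add(child.tag)
                edges.add((parent.tag, child.tag))
                stack.append(child)
    return nodes, edges


docs = [
    "<conference><name>VLDB</name><author>A</author></conference>",
    "<journal id='1'><name>TODS</name><author>B</author>"
    "<publisher>ACM</publisher></journal>",
]
nodes, edges = s_graph(docs)
print(sorted(edges))
```

Because nodes and edges are sets, repeated elements and repeated documents with the same structure add nothing new, which is why many documents can share one s-graph.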

Distance Function

For two sets of XML documents, C1 and C2, the distance between them is

$$\mathrm{dist}(C_1, C_2) = 1 - \frac{|sg(C_1) \cap sg(C_2)|}{\max\{|sg(C_1)|, |sg(C_2)|\}}$$

where |sg(Ci)| is the number of edges in sg(Ci), i = 1, 2, and sg(C1) ∩ sg(C2) is the set of common edges of sg(C1) and sg(C2).
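
A small sketch of this distance, assuming each s-graph is represented as a Python set of (parent, child) edge tuples: the distance is one minus the number of common edges divided by the edge count of the larger s-graph. The handling of two empty s-graphs is an assumption, since the slides do not cover that case, and the sample edge sets are made up for illustration.

```python
def dist(edges1, edges2):
    """S-graph distance between two edge sets."""
    larger = max(len(edges1), len(edges2))
    if larger == 0:
        return 0.0  # assumption: two empty s-graphs are treated as identical
    return 1 - len(edges1 & edges2) / larger


# Hypothetical edge sets, not the doc1/doc2/doc3 of the slides.
sg_a = {("conference", "name"), ("conference", "author")}
sg_b = {("journal", "name"), ("journal", "author"), ("journal", "publisher")}
sg_c = {("journal", "name"), ("journal", "author")}

print(dist(sg_a, sg_b))  # no common edges
print(dist(sg_b, sg_c))  # two of three edges shared
```

The value is 0 for identical s-graphs and 1 for s-graphs with no common edge.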

Distance Function (cont.)

Dist({doc1},{doc2}) = 1 and Dist({doc2},{doc3}) = 0.25

Tree-dist({doc1},{doc2}) = Tree-dist({doc2},{doc3}) = 1

Clustering Algorithm

Input: X, the set of XML documents

Input: k, the number of clusters specified by the user

1. SG = pre-clustering(X)

2. While (number of remaining clusters > k): merge the clusters Ci and Cj that maximize a predefined goodness function
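
The two steps can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: pre-clustering is taken to mean grouping documents with identical s-graphs, and the goodness function, which the slides leave unspecified, is assumed here to be the negated s-graph distance, so the closest pair is merged. The names `pre_cluster` and `cluster` are hypothetical, and the naive pair search below does not achieve the complexity bounds quoted in the slides.

```python
from itertools import combinations


def dist(e1, e2):
    """S-graph distance between two edge sets."""
    larger = max(len(e1), len(e2))
    return 0.0 if larger == 0 else 1 - len(e1 & e2) / larger


def pre_cluster(doc_edges):
    """Group document ids whose s-graphs (edge sets) are identical."""
    groups = {}
    for doc_id, edges in doc_edges.items():
        groups.setdefault(frozenset(edges), []).append(doc_id)
    return [(set(sg), ids) for sg, ids in groups.items()]


def cluster(doc_edges, k):
    """Merge clusters until k remain; goodness is ASSUMED to be -dist."""
    clusters = pre_cluster(doc_edges)
    while len(clusters) > max(k, 1):
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: dist(clusters[p[0]][0], clusters[p[1]][0]),
        )
        # The s-graph of a merged cluster is the union of the two edge sets.
        merged = (clusters[i][0] | clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters


# Hypothetical input: document id -> edge set of its s-graph.
doc_edges = {
    "d1": {("conference", "name"), ("conference", "author")},
    "d2": {("conference", "name"), ("conference", "author")},
    "d3": {("journal", "name"), ("journal", "author"), ("journal", "publisher")},
    "d4": {("journal", "name"), ("journal", "author")},
}
result = cluster(doc_edges, 2)
print(sorted(sorted(ids) for _, ids in result))
```

Merging works on the distinct s-graphs rather than on individual documents, so its cost depends on the number of s-graphs left after pre-clustering, not on the number of documents.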

Clustering Algorithm (cont.)

Complexity, with n = |X| and m = |SG|:

Time complexity
  Pre-clustering: the upper bound is O(nm); in general, it can be reduced to O(n).
  Iterative merging: O(m^2 log m)

Space complexity: O(m^2)

Experimental Results

The clustering algorithm is tested on real data, the DBLP XML records, which contain about 200,000 documents composed of 36 elements.

Pre-clustering is effective: after scanning the 200,000 documents, only 233 distinct s-graphs remain, so the subsequent clustering takes less than 2 seconds.

Experimental Results (cont.)

After setting the number of clusters to 3, we get three clusters containing about 193,000 documents: one for article and the other two for inproceedings.

Interestingly, in the two inproceedings clusters, one s-graph is a subgraph of the other.

Experimental Results (cont.)

We use Oracle 8.1.5 to store all the documents in 4 versions:
  Version 1: unpartitioned schema mapping
  Version 2: partitioned schema mapping
  Version 3: unpartitioned structure mapping
  Version 4: partitioned structure mapping

Experimental Results

Query types

Q1: /A1/A2/…/Ak; all possible absolute XPaths in the documents.

Q2: /A1/A2/…/Ak[text()="value"]/text(); absolute XPaths in which Ai, i=1,…,k, are randomly picked and "value" is the value of Ak in some documents.

Q3: /A1/A2/…/Ak[contains(.,"substring")]/text(); same as Q2 except that the condition tested is "Ak contains substring".
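
As an illustration of how a Q1 workload could be derived, the sketch below enumerates every absolute element path occurring in a set of documents. The helper name `absolute_paths` and the restriction to element-only paths are assumptions; the slides do not say how the query set was generated.

```python
import xml.etree.ElementTree as ET


def absolute_paths(documents):
    """Collect every absolute element path found in the given XML strings."""
    paths = set()
    for text in documents:
        root = ET.fromstring(text)
        stack = [(root, "/" + root.tag)]
        while stack:
            node, path = stack.pop()
            paths.add(path)
            for child in node:
                stack.append((child, path + "/" + child.tag))
    return paths


docs = ["<conference><name>VLDB</name><author>A</author></conference>"]
paths = absolute_paths(docs)
print(sorted(paths))
```

Q2 and Q3 queries could then be formed by attaching a value or substring predicate to one of these paths.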

Experimental Results

Questions?