clustering xml documents for query performance enhancement wang lian

Clustering XML Documents for Query Performance Enhancement

Wang Lian

Outline
Related Work
Motivation
Our Approach
    S-Graph
    Distance Function
    Clustering Algorithm
Experimental Results

Related Work
Besides storing XML documents in their native format, storing them in an RDBMS is an established trend.
There are two main approaches for storing XML documents in an RDBMS:
    Schema mapping
    Structure mapping

Related Work (cont)
In schema mapping, a database schema is derived from the DTD of the XML documents, so different DTDs generate different database schemas.
In structure mapping, the database schema is fixed by defining a set of generic tables.
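To make the idea of a fixed, generic schema concrete, here is a minimal sketch of structure mapping. The single `node` table, its columns, and the sample documents are assumptions made for illustration; the slides do not specify which generic tables are used.

```python
import sqlite3
import xml.etree.ElementTree as ET


def shred(conn, doc_id, xml_text):
    """Insert every element of one document into the generic `node` table."""
    root = ET.fromstring(xml_text)
    next_id = [0]  # node ids are unique within a document only

    def visit(elem, parent_id):
        node_id = next_id[0]
        next_id[0] += 1
        conn.execute(
            "INSERT INTO node VALUES (?, ?, ?, ?, ?)",
            (doc_id, node_id, parent_id, elem.tag, (elem.text or "").strip()),
        )
        for child in elem:
            visit(child, node_id)

    visit(root, None)


conn = sqlite3.connect(":memory:")
# Hypothetical generic table: the same schema is used for every document type.
conn.execute(
    "CREATE TABLE node (doc_id INTEGER, node_id INTEGER, "
    "parent_id INTEGER, tag TEXT, text TEXT)"
)
# Made-up sample documents.
shred(conn, 1, "<conference><name>VLDB</name><author>A. Author</author></conference>")
shred(conn, 2, "<journal><name>TODS</name><author>B. Author</author>"
               "<publisher>ACM</publisher></journal>")

# A path query such as /conference/author/text() becomes a self-join.
rows = conn.execute(
    """
    SELECT c.text
    FROM node p JOIN node c
      ON c.doc_id = p.doc_id AND c.parent_id = p.node_id
    WHERE p.parent_id IS NULL AND p.tag = 'conference' AND c.tag = 'author'
    """
).fetchall()
print(rows)
```

Because every document lands in the same table, the self-join has to scan rows from all document types. This is the cost that partitioning by structure aims to reduce.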

Motivation
With either schema mapping or structure mapping, documents must be cut into pieces and inserted into tables. To answer a query, these tables must be joined.
As the tables grow larger, the join cost may become very high.
Our observation: if a collection contains documents of different structures, then clustering the documents by structure may reduce the join cost.

Motivation (cont)
An example. The DTD is:

<!ELEMENT conference (name, author)*>

<!ELEMENT journal (name, author, publisher)*>

<!ELEMENT name (#PCDATA)*>

<!ELEMENT author (#PCDATA)*>

<!ELEMENT publisher (#PCDATA)*>

Motivation(cont) Documents in 3 clusters

Motivation(cont) Unpartitioned schema

Motivation(cont) Partitioned schema

Motivation (cont)
Suppose we want to answer the XPath query /conference/author/text().
Using the unpartitioned schema, table conference (2 tuples) and table author (9 tuples) must be joined.
Using the partitioned schema, table conference1 (2 tuples) and table author1 (3 tuples) must be joined.
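As a rough illustration of the saving, the sketch below counts tuple comparisons for the two joins using the tuple counts quoted above. It assumes a naive nested-loop join. A real RDBMS may choose a different join strategy, so treat the numbers as indicative only.

```python
def nested_loop_comparisons(outer_tuples, inner_tuples):
    """Tuple pairs examined by a naive nested-loop join (assumed cost model)."""
    return outer_tuples * inner_tuples


# Tuple counts taken from the example: conference x author.
unpartitioned = nested_loop_comparisons(2, 9)
# conference1 x author1 in the partitioned schema.
partitioned = nested_loop_comparisons(2, 3)

print(unpartitioned, partitioned)
```

Under this cost model the partitioned schema examines 6 pairs instead of 18.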

Our Approach
An XML document is a mixture of structure information and data values. In our context, only the structure information is used for clustering.
We need a proper distance function before we can apply any clustering algorithm.

S-Graph
Given a set of XML documents C, the structure graph (s-graph) of C, sg(C) = (N, E), is a directed graph such that N is the set of all the elements and attributes in the documents in C, and (a, b) ∈ E if and only if a is a parent element of b in some document in C (b can be an element or an attribute).
The s-graph does not capture all the structure information of the documents, but it does capture the parent-child relationships, which are valuable for evaluating path expressions.
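The definition can be sketched in a few lines of Python. The function name `s_graph`, the `@` prefix used to mark attribute nodes, and the sample documents (including the `id` attribute) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def s_graph(documents):
    """Return (nodes, edges) of the s-graph for an iterable of XML strings."""
    nodes, edges = set(), set()
    for xml_text in documents:
        root = ET.fromstring(xml_text)
        for parent in root.iter():  # visits every element, root included
            nodes.add(parent.tag)
            for attr in parent.attrib:
                # Attributes become nodes too; "@" is an assumed naming convention.
                nodes.add("@" + attr)
                edges.add((parent.tag, "@" + attr))
            for child in parent:
                edges.add((parent.tag, child.tag))
    return nodes, edges


# Made-up sample documents.
docs = [
    "<conference><name/><author/></conference>",
    "<journal id='1'><name/><author/><publisher/></journal>",
]
nodes, edges = s_graph(docs)
print(sorted(edges))
```

Since nodes and edges are sets, repeated elements and repeated documents with the same structure do not change the result.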

Distance Function
For two sets of XML documents, C1 and C2, the distance between them is

$$\mathrm{dist}(C_1, C_2) = 1 - \frac{|sg(C_1) \cap sg(C_2)|}{\max\{|sg(C_1)|, |sg(C_2)|\}}$$

where |sg(Ci)| is the number of edges in sg(Ci), i = 1, 2, and sg(C1) ∩ sg(C2) is the set of common edges of sg(C1) and sg(C2).
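A minimal sketch of this distance, assuming each s-graph is represented simply as a Python set of (parent, child) edges. The edge sets below are made up for illustration and are not the documents from the slides. Returning 0 for two empty graphs is also an assumption, since the definition does not cover that case.

```python
def dist(edges1, edges2):
    """1 - |common edges| / max(|edges1|, |edges2|)."""
    larger = max(len(edges1), len(edges2))
    if larger == 0:
        return 0.0  # assumption: two empty s-graphs are treated as identical
    return 1 - len(edges1 & edges2) / larger


# Made-up edge sets.
conference = {("conference", "name"), ("conference", "author")}
journal_full = {
    ("journal", "name"), ("journal", "author"),
    ("journal", "publisher"), ("journal", "year"),
}
journal_partial = journal_full - {("journal", "year")}

d_disjoint = dist(conference, journal_full)     # no common edges
d_close = dist(journal_full, journal_partial)   # 3 of 4 edges shared
print(d_disjoint, d_close)
```

Graphs with no common edge are at distance 1, and the distance shrinks toward 0 as the overlap grows.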

Distance Function (cont)

Dist({doc1}, {doc2}) = 1 and Dist({doc2}, {doc3}) = 0.25
Tree-dist({doc1}, {doc2}) = Tree-dist({doc2}, {doc3}) = 1

Clustering Algorithm
Input: X, the set of XML documents
Input: k, the number of clusters specified by the user
1. SG = pre-clustering(X)
2. While (remaining number of clusters > k): merge the clusters Ci and Cj that maximize a predefined goodness function
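The two steps can be sketched as follows. Several parts are assumptions: pre-clustering is taken to mean grouping documents with identical s-graphs, the goodness function is taken to be the negated s-graph distance (the slides leave it unspecified), and s-graphs are plain sets of edges. The pair search is a naive scan for clarity and is not tuned to any particular complexity bound.

```python
from itertools import combinations


def dist(e1, e2):
    larger = max(len(e1), len(e2))
    return 0.0 if larger == 0 else 1 - len(e1 & e2) / larger


def pre_cluster(doc_edges):
    """Group documents whose s-graphs are identical (assumed meaning of pre-clustering)."""
    groups = {}
    for doc_id, edges in doc_edges.items():
        groups.setdefault(frozenset(edges), []).append(doc_id)
    return [(set(edges), ids) for edges, ids in groups.items()]


def cluster(doc_edges, k, goodness=lambda a, b: -dist(a, b)):
    """Merge the best-scoring pair of clusters until only k clusters remain."""
    clusters = pre_cluster(doc_edges)
    while len(clusters) > max(k, 1):
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: goodness(clusters[p[0]][0], clusters[p[1]][0]),
        )
        # The s-graph of a merged cluster is the union of the two edge sets.
        merged = (clusters[i][0] | clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters


# Made-up input: document id -> s-graph edges.
doc_edges = {
    "d1": {("conference", "name"), ("conference", "author")},
    "d2": {("conference", "name"), ("conference", "author")},
    "d3": {("journal", "name"), ("journal", "author")},
    "d4": {("journal", "name"), ("journal", "author"), ("journal", "publisher")},
}
result = cluster(doc_edges, k=2)
print(sorted(sorted(ids) for _, ids in result))
```

Pre-clustering reduces the four documents to three distinct s-graphs, and one merge step then joins the two journal graphs.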

Clustering Algorithm (cont)
Complexity, with n = |X| and m = |SG|:
Time complexity
    Pre-clustering: the upper bound is O(nm); in general, it can be reduced to O(n).
    Iterative merging: O(m^2 log m)
Space complexity: O(m^2)

Experimental Results
The clustering algorithm is tested on real data, the DBLP XML records, which contain about 200,000 documents composed of 36 elements.
Pre-clustering is effective: after scanning the 200,000 documents, only 233 distinct s-graphs remain, so the subsequent clustering takes less than 2 seconds.

Experimental Results (cont)
After setting the number of clusters to 3, we get three clusters containing about 193,000 documents: one for article and the other two for inproceedings.
Interestingly, for the two inproceedings clusters, the s-graph of one is a subgraph of the other.

Experimental Results (cont)
We use Oracle 8.1.5 to store all the documents in 4 versions:
    Version 1: unpartitioned schema mapping
    Version 2: partitioned schema mapping
    Version 3: unpartitioned structure mapping
    Version 4: partitioned structure mapping

Experimental Results
Query types
Q1: /A1/A2/…/Ak; all possible absolute XPaths in the documents.
Q2: /A1/A2/…/Ak[text()="value"]/text(); absolute XPaths in which the Ai, i = 1, …, k, are randomly picked and "value" is the value of Ak in some documents.
Q3: /A1/A2/…/Ak[contains(., "substring")]/text(); same as Q2 except that the condition tested is "Ak contains a substring".
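For Q1, the set of absolute paths can be collected by walking each document. The sketch below is one way this might be done; the function name `absolute_paths`, the helper `q2`, and the sample document are assumptions, not the generator used in the experiments.

```python
import xml.etree.ElementTree as ET


def absolute_paths(xml_text):
    """Return every absolute element path /A1/A2/.../Ak found in one document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(elem, prefix):
        path = prefix + "/" + elem.tag
        paths.add(path)
        for child in elem:
            walk(child, path)

    walk(root, "")
    return paths


def q2(path, value):
    """Build a Q2-style query string from a path and a value (hypothetical helper)."""
    return f'{path}[text()="{value}"]/text()'


# Made-up sample document.
paths = absolute_paths("<journal><name/><author/><publisher/></journal>")
print(sorted(paths))
print(q2("/journal/publisher", "ACM"))
```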

Experimental Results

Questions?
