Clustering XML Documents for Query Performance Enhancement
Wang Lian
Outline
  Related Work
  Motivation
  Our Approach
    S-Graph
    Distance Function
    Clustering Algorithm
  Experimental Results
Related Work
Besides storing XML documents in their native format, storing them in an RDBMS is an established trend.
There are two main approaches for storing XML documents in an RDBMS:
  Schema mapping
  Structure mapping
Related Work(cont)
In schema mapping, a database schema is derived from the DTD of the XML documents, so different DTDs generate different database schemas.
In structure mapping, the database schema is fixed by defining a set of generic tables.
Motivation
With either schema mapping or structure mapping, documents must be cut into pieces and inserted into tables. To answer a query, these tables must be joined.
As the tables grow larger, the join cost may become very high.
Our observation: if a collection contains documents of different structures, then clustering the documents by structure may reduce the join cost.
Motivation(cont)
An example DTD:
<!ELEMENT conference (name, author)*>
<!ELEMENT journal (name, author, publisher)*>
<!ELEMENT name (#PCDATA)*>
<!ELEMENT author (#PCDATA)*>
<!ELEMENT publisher (#PCDATA)*>
Motivation(cont) Documents in 3 clusters
Motivation(cont) Unpartitioned schema
Motivation(cont) Partitioned schema
Motivation(cont)
Suppose we want to answer the XPath query /conference/author/text().
Using the unpartitioned schema, tables conference (2 tuples) and author (9 tuples) must be joined.
Using the partitioned schema, tables conference1 (2 tuples) and author1 (3 tuples) must be joined.
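To make the saving concrete, here is a small worked calculation using the tuple counts from the slide. It assumes a naive nested-loop join, where cost is the number of tuple pairs examined; this cost model and the function name `nested_loop_comparisons` are illustrative assumptions, not something the slides specify, and a real RDBMS may choose a different join strategy.

```python
def nested_loop_comparisons(outer_rows: int, inner_rows: int) -> int:
    """Tuple pairs examined by a naive nested-loop join (assumed cost model)."""
    return outer_rows * inner_rows


# Tuple counts taken from the slide.
unpartitioned = nested_loop_comparisons(2, 9)  # conference joined with author
partitioned = nested_loop_comparisons(2, 3)    # conference1 joined with author1

print(unpartitioned, partitioned)
```

Under this assumed model the partitioned schema examines a third of the pairs, because author1 holds only the authors that belong to conference documents.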
Our Approach
An XML document is a mixture of structure information and data values. In our context, only the structure information is used for clustering.
We need a proper distance function before we can apply any clustering algorithm.
S-Graph
Given a set of XML documents C, the structure graph (s-graph) of C, sg(C) = (N, E), is a directed graph such that N is the set of all the elements and attributes in the documents in C, and (a, b) ∈ E if and only if a is a parent element of b in some document in C (b can be an element or an attribute).
An s-graph does not capture all the structure information of the documents, but it does capture the parent-child relationship, which is valuable for evaluating path expressions.
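The s-graph definition can be sketched in a few lines of Python using the standard library's `xml.etree.ElementTree`. This is a minimal illustration, not the authors' implementation: the function name `s_graph`, the "@" prefix used to tell attribute nodes apart from elements, and the two sample documents are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def s_graph(documents):
    """Return (nodes, edges) of the s-graph for a collection of XML strings.

    Attribute nodes are written as "@name"; that prefix is an assumption
    of this sketch, used only to keep them distinct from element names.
    """
    nodes, edges = set(), set()
    for text in documents:
        root = ET.fromstring(text)
        for parent in root.iter():          # every element, root included
            nodes.add(parent.tag)
            for child in parent:            # direct child elements only
                edges.add((parent.tag, child.tag))
            for attr in parent.attrib:      # attributes count as children too
                nodes.add("@" + attr)
                edges.add((parent.tag, "@" + attr))
    return nodes, edges


# Hypothetical documents that follow the example DTD.
docs = [
    "<conference><name>n1</name><author>a1</author><author>a2</author></conference>",
    "<conference><name>n2</name><author>a3</author></conference>",
]
nodes, edges = s_graph(docs)
print(sorted(edges))
```

Repeated children and repeated documents add nothing new, since edges are stored in a set; only the distinct parent-child pairs survive.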
Distance Function
For two sets of XML documents, C1 and C2, the distance between them is

\[ dist(C_1, C_2) = 1 - \frac{|sg(C_1) \cap sg(C_2)|}{\max\{|sg(C_1)|, |sg(C_2)|\}} \]

where |sg(Ci)| is the number of edges in sg(Ci), i = 1, 2, and sg(C1) ∩ sg(C2) is the set of common edges of sg(C1) and sg(C2).
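As a rough sketch, the distance can be computed directly on edge sets: one minus the number of common edges divided by the size of the larger edge set. The edge sets below are hypothetical and are not the documents from the slides; returning 0 for two empty s-graphs is also an assumption, since the slides do not cover that case.

```python
def dist(sg1: set, sg2: set) -> float:
    """Distance between two s-graphs, each given as a set of (parent, child) edges."""
    largest = max(len(sg1), len(sg2))
    if largest == 0:
        return 0.0  # assumption: two empty s-graphs are treated as identical
    return 1 - len(sg1 & sg2) / largest


# Hypothetical edge sets for illustration.
conference = {("conference", "name"), ("conference", "author")}
journal = {("journal", "name"), ("journal", "author"), ("journal", "publisher")}
four_edges = {("a", "b"), ("a", "c"), ("b", "d"), ("c", "e")}
three_edges = {("a", "b"), ("a", "c"), ("b", "d")}

print(dist(conference, journal))       # no common edges
print(dist(four_edges, three_edges))   # 3 common edges out of at most 4
```

Graphs with no edge in common are at distance 1, and a graph compared with one of its subgraphs is penalized only for the missing edges.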
Distance Function(cont)
Dist({doc1}, {doc2}) = 1 and Dist({doc2}, {doc3}) = 0.25
Tree-dist({doc1}, {doc2}) = Tree-dist({doc2}, {doc3}) = 1
Clustering Algorithm
Input: X, the set of XML documents
Input: k, the number of clusters specified by the user
1. SG = pre-clustering(X)
2. While (number of remaining clusters > k)
     Merge the clusters Ci and Cj that maximize a predefined goodness function
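The two steps above can be sketched as follows. Several things here are assumptions, since the slides do not spell them out: pre-clustering is taken to mean grouping documents whose s-graphs are identical, the goodness function is taken to be the negated s-graph distance (so the closest pair is merged first), and the names `pre_cluster` and `cluster` are hypothetical. The pair search is a naive scan and does not achieve the complexity quoted in the talk.

```python
from itertools import combinations


def dist(sg1, sg2):
    """1 minus common edges over the larger edge count (0 if both are empty)."""
    largest = max(len(sg1), len(sg2))
    return 0.0 if largest == 0 else 1 - len(sg1 & sg2) / largest


def pre_cluster(doc_graphs):
    """Assumed pre-clustering: group document ids that share an identical s-graph."""
    groups = {}
    for doc_id, edges in doc_graphs.items():
        groups.setdefault(frozenset(edges), []).append(doc_id)
    return [(set(edges), ids) for edges, ids in groups.items()]


def cluster(doc_graphs, k, goodness=lambda a, b: -dist(a, b)):
    """Merge the best-scoring pair of clusters until at most k clusters remain."""
    clusters = pre_cluster(doc_graphs)
    while len(clusters) > max(k, 1):
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: goodness(clusters[p[0]][0], clusters[p[1]][0]),
        )
        # The s-graph of a merged cluster is the union of the two edge sets.
        merged = (clusters[i][0] | clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters


# Hypothetical input: document id -> s-graph edge set.
doc_graphs = {
    "d1": {("conference", "name"), ("conference", "author")},
    "d2": {("conference", "name"), ("conference", "author")},
    "d3": {("journal", "name"), ("journal", "author"), ("journal", "publisher")},
    "d4": {("journal", "name"), ("journal", "author")},
}
result = cluster(doc_graphs, k=2)
print([sorted(ids) for _, ids in result])
```

Working on distinct s-graphs rather than on individual documents is what keeps the merging loop small: its cost depends on the number of distinct s-graphs, not on the number of documents.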
Clustering Algorithm(cont)
Complexity, with n = |X| and m = |SG|
Time complexity
  Pre-clustering: the upper bound is O(nm); in general, it can be reduced to O(n).
  Iterative merging: O(m^2 log m)
Space complexity: O(m^2)
Experimental Results
The clustering algorithm is tested on real data, the DBLP XML records, which contain about 200,000 documents composed of 36 elements.
Pre-clustering is effective: after scanning the 200,000 documents, only 233 distinct s-graphs remain, so the subsequent clustering takes less than 2 seconds.
Experimental Results(cont)
After setting the number of clusters to 3, we get three clusters containing about 193,000 documents: one for article and the other two for inproceedings.
Interestingly, in the two inproceedings clusters, one s-graph is a subgraph of the other.
Experimental Results(cont)
We use Oracle 8.1.5 to store all the documents in 4 versions:
  Version 1: unpartitioned schema mapping
  Version 2: partitioned schema mapping
  Version 3: unpartitioned structure mapping
  Version 4: partitioned structure mapping
Experimental Results
Query types
  Q1: /A1/A2/…/Ak; all possible absolute XPaths in the documents.
  Q2: /A1/A2/…/Ak[text()="value"]/text(); absolute XPaths in which Ai, i = 1,…,k, are randomly picked and "value" is the value of Ak in some documents.
  Q3: /A1/A2/…/Ak[contains(.,"substring")]/text(); same as Q2 except that the condition tested is "Ak contains a substring".
Experimental Results
Questions?