KAIST 2002 SIGDB Tutorial 1
Indexing Methods for Efficient XML Query Processing
Jun-Ki Min KAIST
http://islab.kaist.ac.kr/~jkmin/
KAIST 2002 SIGDB Tutorial 2
XML
  eXtensible Markup Language
  The de facto standard for data representation and exchange on the Web
XML Data
  An instance of semistructured data
  Self-describing
  Irregularly structured
KAIST 2002 SIGDB Tutorial 3
XML Data
  Comprises hierarchically nested collections of elements
  An element can contain
    An atomic data value
    A sequence of subelements
    Attributes composed of name-value pairs
  ID-IDREF relationships
  Tree or graph representation
KAIST 2002 SIGDB Tutorial 4
XML Example
<libraryDB>
  <book editor="1">
    <title> title1 </title>
    <author> author1 </author>
    <chapter> … </chapter>
  </book>
  <paper>
    <title> title2 </title>
    <author id="1"> author2 </author>
    <author> author3 </author>
    <section> … </section>
  </paper>
  …
</libraryDB>
[Figure: graph representation of the libraryDB example, with nodes numbered 1-10 and edges labelled libraryDB, book, paper, title, author, editor, chapter and section]
KAIST 2002 SIGDB Tutorial 5
XML Query
  XML query languages
    XSLT, XML-QL, XPath, XQuery
    Use path expressions to traverse the irregularly structured data
      ex) /libraryDB/book/title or //title
    Naive evaluation searches the whole XML data => inefficiency
  Structural Summary & Path Index
    Restrict the search to only the relevant portions of the XML data
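To make the cost of naive evaluation concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The document is a cut-down, hypothetical version of the libraryDB example, and `count_visited` is an illustrative helper that is not part of any tool discussed in this tutorial.

```python
import xml.etree.ElementTree as ET

# Hypothetical document modelled on the libraryDB example (elided parts left out).
DOC = """
<libraryDB>
  <book editor="1">
    <title>title1</title>
    <author>author1</author>
  </book>
  <paper>
    <title>title2</title>
    <author id="1">author2</author>
    <author>author3</author>
  </paper>
</libraryDB>
"""

root = ET.fromstring(DOC)

# /libraryDB/book/title: root is the libraryDB element, so the
# remaining steps are evaluated relative to it.
book_titles = [t.text for t in root.findall("./book/title")]

# //title: every title element at any depth.
all_titles = [t.text for t in root.iter("title")]


def count_visited(elem):
    """Number of elements a full traversal below (and including) elem touches."""
    return 1 + sum(count_visited(child) for child in elem)


# Without an index, //title has to look at every element in the document.
visited = count_visited(root)
print(book_titles, all_titles, visited)
```

Only two of the visited elements are titles, which is the gap a path index tries to close.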
KAIST 2002 SIGDB Tutorial 6
Schemas for XML
  DTD, XML Schema
    Specify the constraints of XML data
      <!ELEMENT book (title, author+, chapter*)>
    Are not mandatory => XML data may lack an external schema
  Structural Summary
    Summary of label paths
  Path Index
    Structural Summary + Extents
KAIST 2002 SIGDB Tutorial 7
Schemas for XML
  Applications
    User interface
    XML data design, editing
    Query formulation
    Query validation
    Query optimization
    Path index
KAIST 2002 SIGDB Tutorial 8
Structural Summary
  DTD Extraction
    XTRACT
    Based on element information
  Structural Summary
    Representative Objects
    Based on path information
KAIST 2002 SIGDB Tutorial 9
XTRACT [Garofalakis, Gionis, Rastogi, Seshadri, Shim: SIGMOD 00]
  Infers a concise and accurate DTD
  Chooses a DTD from candidate DTDs
    (a b), (b a) => (a|b)* or (a b)|(b a)
  Based on the Minimum Description Length (MDL) principle
    Ranks each candidate DTD by the number of bits required to describe the subelement sequences in terms of that candidate DTD
    (a|b)*: 6 (for DTD) + 3 + 3 = 12
    (a b)|(b a): 9 (for DTD) + 1 + 1 = 11
KAIST 2002 SIGDB Tutorial 10
Representative Objects(RO) [Nestorov, Ullman, Wiener, Chawathe: ICDE 97]
  Provide a concise representation of the inherent schema of semistructured hierarchical data
  Full-RO
    Describes all simple paths
  K-RO
    Guarantees that its paths of length k+1 exist in the data
  1-RO
    Simplest and very compact representation
KAIST 2002 SIGDB Tutorial 11
Representative Objects(RO)
[Figure: the libraryDB XML data, the graph representation of its 1-RO, and the graph representation of its 2-RO (= Full-RO)]
KAIST 2002 SIGDB Tutorial 12
Path Index
  Access Support Relations
  Deterministic
    Strong DataGuide
    Index Fabric
    ToXin
    APEX
  Non-Deterministic
    1-Index
    A(k) Index
    F&B Index
KAIST 2002 SIGDB Tutorial 13
Access Support Relations [Kemper, Moerkotte: IS 92]
  Originated in OODBMSs
    select Name from Mercedes.Manufactures.Composition.Division
  Support joins along arbitrary reference chains
  Generalization of the Join Index [Valduriez 87]
  Based on the paths in the schema
  Materialize access paths of arbitrary length
  Support only predefined subsets of paths
KAIST 2002 SIGDB Tutorial 14
DataGuides [Goldman, Widom: VLDB 97]
  An implementation of the Full-RO
  Summary of label paths from the root (= simple paths)
  Concise: describes every unique simple path exactly once, regardless of the number of times it appears
  Accurate: does not contain label paths that do not appear in the data
  Convenient: can be stored and accessed using techniques similar to those available for processing semistructured data
KAIST 2002 SIGDB Tutorial 15
DataGuides
  The construction algorithm emulates the conversion of a non-deterministic finite automaton (NFA) to a deterministic finite automaton (DFA)
  Intuitively, a simple path is represented as a node in the DataGuide
  One XML data source may have multiple DataGuides
[Figure: an XML data source with edges labelled A, B and C, and various DataGuides for it]
KAIST 2002 SIGDB Tutorial 16
Strong DataGuide
  If two simple paths reach the same set of nodes, they are represented by a single node
  Linear time and linear space for tree-structured data
  Exponential time and exponential space (worst case) for graph-structured data
[Figure: two examples, each pairing a source graph (nodes 1-6, edges labelled A, B and C) with its Strong DataGuide, whose nodes carry target sets such as {1}, {2,4}, {3,5} and {6}]
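The NFA-to-DFA analogy can be sketched as a subset construction in which every DataGuide node is the target set of one label path. This is an illustrative sketch, not the authors' implementation. The edge list `EDGES` is a hypothetical source graph chosen to yield target sets like those in the figure, and `strong_dataguide` and `targets` are names introduced here.

```python
from collections import defaultdict, deque


def strong_dataguide(edges, root):
    """Subset construction over a list of (source, label, target) edges.

    Returns the start state and a map: target set -> {label: target set}.
    """
    out = defaultdict(lambda: defaultdict(set))  # node -> label -> children
    for src, label, dst in edges:
        out[src][label].add(dst)

    start = frozenset([root])
    guide = {}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current in guide:
            continue  # this target set already has a DataGuide node
        by_label = defaultdict(set)
        for node in current:
            for label, children in out.get(node, {}).items():
                by_label[label] |= children
        guide[current] = {lbl: frozenset(t) for lbl, t in by_label.items()}
        queue.extend(guide[current].values())
    return start, guide


def targets(start, guide, path):
    """Follow a label path in the DataGuide; the state reached is its target set."""
    current = start
    for label in path:
        current = guide[current].get(label)
        if current is None:
            return frozenset()
    return current


# Hypothetical source graph: two A children of the root, each with a C child.
EDGES = [(1, "A", 2), (1, "A", 4), (2, "C", 3), (4, "C", 5), (1, "B", 6)]
START, GUIDE = strong_dataguide(EDGES, root=1)
print(sorted(targets(START, GUIDE, ["A", "C"])))
```

Each label path is followed once in the DataGuide regardless of how many source nodes it reaches. On graph-structured data the number of distinct target sets can grow exponentially, which is the cost noted above.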
KAIST 2002 SIGDB Tutorial 17
1/2/T-Index [Milo and Suciu: ICDT 99]
  1-Index
    Summarizes all label paths starting from the root
    Supports queries of the form q = P x, where P = /l1/l2/…/ln
    Non-deterministic
    Based on backward bisimulation, which originated in graph verification
    Extents are disjoint
    More compact than Strong DataGuides
KAIST 2002 SIGDB Tutorial 18
1-Index
  Equivalence relation (≡)
    v ≡ u iff L_v = L_u, where L_x = {w | w is a simple path from the root to x}
    The collection of all equivalence classes
    Exponential construction cost
  Backward bisimulation (≈b)
    1. If x ≈b y and x is the root, then y is the root.
    2. Conversely, if x ≈b y and y is the root, then x is the root.
    3. If x ≈b y and (x', l, x) is an edge, then there exists an edge (y', l, y) such that x' ≈b y'.
    4. Conversely, if x ≈b y and (y', l, y) is an edge, then there exists an edge (x', l, x) such that x' ≈b y'.
KAIST 2002 SIGDB Tutorial 19
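≡ vs ≈b
  X ≡ Y since L_X = L_Y = {a.b.d, a.c.d}, but X and Y are not related by ≈b
  v ≈b u => v ≡ u (the converse does not hold)
  O(m log m) construction cost for ≈b [Paige and Tarjan 87]
[Figure: example graph with edges labelled a, b, c and d, containing the nodes X and Y]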
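The difference between the two relations can be checked on a small graph. The sketch below rests on assumptions: the edge list is one plausible reading of the figure (X has two parents, one reached by a.b and one by a.c, while Y has a single parent reached by both), and the refinement loop is a naive fixpoint computation rather than the Paige-Tarjan algorithm. All function and node names are illustrative.

```python
# Assumed reconstruction of the example: edges are (source, label, target).
EDGES = [
    ("r", "a", "a1"), ("r", "a", "a2"), ("r", "a", "a3"), ("r", "a", "a4"),
    ("a1", "b", "p1"), ("a2", "c", "p2"),   # X has two distinct parents
    ("p1", "d", "X"), ("p2", "d", "X"),
    ("a3", "b", "q"), ("a4", "c", "q"),     # Y has one parent with two inputs
    ("q", "d", "Y"),
]
NODES = {"r"} | {n for s, _, d in EDGES for n in (s, d)}


def label_paths(edges, root, node):
    """All label paths from root to node (the graph is assumed acyclic)."""
    paths = {()} if node == root else set()
    for src, label, dst in edges:
        if dst == node:
            paths |= {p + (label,) for p in label_paths(edges, root, src)}
    return paths


def backward_bisimulation(edges, root, nodes):
    """Coarsest partition satisfying rules 1-4, by naive refinement."""
    block = {n: int(n == root) for n in nodes}  # rules 1-2: root stands alone
    while True:
        # Rules 3-4: same block so far and the same set of
        # (label, parent block) pairs on incoming edges.
        signature = {
            n: (block[n], frozenset((label, block[src])
                                    for src, label, dst in edges if dst == n))
            for n in nodes
        }
        ids = {}
        refined = {n: ids.setdefault(signature[n], len(ids)) for n in nodes}
        if len(ids) == len(set(block.values())):
            return block  # no block was split: the partition is stable
        block = refined


BLOCKS = backward_bisimulation(EDGES, "r", NODES)
same_paths = label_paths(EDGES, "r", "X") == label_paths(EDGES, "r", "Y")
bisimilar = BLOCKS["X"] == BLOCKS["Y"]
print(same_paths, bisimilar)
```

Under these assumptions X and Y share the same path set but end up in different blocks, so the 1-Index built on ≈b may be finer than the one built on ≡ while being far cheaper to construct.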
KAIST 2002 SIGDB Tutorial 20
1-Index vs Strong DataGuide
In tree-structured data, the Strong DataGuide and the 1-Index coincide
[Figure: the libraryDB XML data (nodes 1-10), its Strong DataGuide and its 1-Index]
KAIST 2002 SIGDB Tutorial 21
2/T-Index
  2-Index
    Supports queries of the form x1 P x2
      ex) //title
    Equivalence relation (≡)
      (v, u) ≡ (v', u') iff L(v,u) = L(v',u'), where L(x,y) = {w | w is a label path from x to y}
    Summary of path information between two arbitrary nodes
  T-Index
    Generalization of the 1/2-Index
      (v1,…,vn) ≡ (u1,…,un) iff L(v1,…,vn) = L(u1,…,un)
    Conceptually similar to Access Support Relations
    Supports only predefined paths
KAIST 2002 SIGDB Tutorial 22
Index Fabric [Cooper, Sample, Franklin, Hjaltason, Shadmon: VLDB 01]
  Tree-structured data
  Conceptually similar to the Strong DataGuide
  Layered structure
  Uses a Patricia trie to index a large number of search keys
  The simple path of an element that has a data value is encoded as a special character sequence
  Keeps keys that combine the encoded sequence and the data value
KAIST 2002 SIGDB Tutorial 23
Index Fabric
  Keeps only the information of elements that have data values
  Patricia trie: lossy compression
[Figure: XML data and the corresponding Patricia trie over keys such as "LBAauthor1" and "LBTtitle1"]
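The key encoding can be illustrated as follows. This sketch only shows how path and value are combined into one search key. It swaps the layered Patricia trie for a sorted list searched with `bisect`, and the designator mapping is an assumption inferred from the example keys in the figure.

```python
import bisect

# Assumed designators: one character per element label.
DESIGNATORS = {"libraryDB": "L", "book": "B", "paper": "P",
               "title": "T", "author": "A", "chapter": "C"}


def encode(path, value):
    """Encode a root-to-element label path plus its data value as one key."""
    return "".join(DESIGNATORS[label] for label in path) + value


# Keys for the elements of the libraryDB example that carry data values.
KEYS = sorted([
    encode(["libraryDB", "book", "title"], "title1"),
    encode(["libraryDB", "book", "author"], "author1"),
    encode(["libraryDB", "paper", "title"], "title2"),
    encode(["libraryDB", "paper", "author"], "author2"),
    encode(["libraryDB", "paper", "author"], "author3"),
])


def prefix_search(sorted_keys, prefix):
    """Return all keys starting with prefix (stand-in for a trie descent)."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[lo:hi]


# /libraryDB/paper/author becomes a search for the prefix "LPA".
print(prefix_search(KEYS, "LPA"))
```

A path query with a value predicate becomes a single key lookup, and a path query without one becomes a prefix search. The sketch does not handle values that begin with a designator character, which is why the slide speaks of a special character sequence.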
KAIST 2002 SIGDB Tutorial 24
ToXin [Rizzolo, Mendelzon: WebDB 01]
  Tree-structured data
  Conceptually similar to the Strong DataGuide (not a minimal DataGuide)
  Supports both forward and backward navigation
  Path Tree (= Strong DataGuide)
  Each node of the Path Tree has an Index Table or a Value Table
    Index Table (IT): parent-child relationships
    Value Table (VT): owner-value relationships
KAIST 2002 SIGDB Tutorial 25
ToXin
  Since ToXin keeps parent-child relationships, it supports path expressions with value predicates
    ex) /libraryDB/book[author = author1]
  Path Tree
    LibraryDB:IT
      book:IT
        title:VT
        author:VT
        chapter
      paper:IT
        title:VT
        author:VT
        section
  Index Tables (path: parent, child)
    LibraryDB: null, 1
    LibraryDB.book: 1, 2
    LibraryDB.paper: 1, 6
  Value Tables (path: parent, value)
    LibraryDB.book.author: 2, author1
    …
[Figure: XML data]
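A value predicate can be answered by combining the two kinds of table, as in the hedged sketch below. The table rows follow the example above, with the rows for title and for paper authors added as assumptions based on the libraryDB document. The dictionaries and `select_by_child_value` are illustrative and do not reflect ToXin's actual storage layout.

```python
# Index Tables: path -> list of (parent id, child id).
INDEX_TABLES = {
    "libraryDB": [(None, 1)],
    "libraryDB.book": [(1, 2)],
    "libraryDB.paper": [(1, 6)],
}

# Value Tables: path -> list of (owner id, value). Rows other than
# (2, "author1") are assumptions derived from the example document.
VALUE_TABLES = {
    "libraryDB.book.title": [(2, "title1")],
    "libraryDB.book.author": [(2, "author1")],
    "libraryDB.paper.author": [(6, "author2"), (6, "author3")],
}


def select_by_child_value(path, child_label, value):
    """Evaluate path[child_label = value] and return the matching node ids."""
    # Backward step: find the owners of the matching values.
    rows = VALUE_TABLES.get(f"{path}.{child_label}", [])
    owners = {owner for owner, v in rows if v == value}
    # Keep only owners that are instances of the requested path.
    return sorted(child for _, child in INDEX_TABLES.get(path, [])
                  if child in owners)


# /libraryDB/book[author = author1]
print(select_by_child_value("libraryDB.book", "author", "author1"))
```

The predicate is resolved from the value side first and then navigated backward to the owning element, which is the traversal direction that a summary without parent-child tables cannot offer.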
KAIST 2002 SIGDB Tutorial 26
A(k)-Index [Kaushik, Shenoy, Bohannon, Gudes: ICDE 02]
  Strong DataGuide and 1-Index record all simple paths
    Larger index size => larger search space
  Approximation of the 1-Index
  Non-deterministic
  Utilizes local similarity (= degree k)
    Reduces the size of the index graph
KAIST 2002 SIGDB Tutorial 27
A(k)-Index
  k-bisimulation (≈k)
    For any two nodes v and u, v ≈0 u iff u and v have the same label
    v ≈k u iff v ≈k-1 u and, for every parent v' of v, there is a parent u' of u such that v' ≈k-1 u', and vice versa
[Figure: XML data with labels A, B, C, D and E, and its A(0)-Index, A(1)-Index and A(2)-Index (= 1-Index)]
KAIST 2002 SIGDB Tutorial 28
A(k)-Index
  Building cost = O(km)
  In general k < log m (cf. O(m log m) for the 1-Index)
  Query processing
    Label path expressions of length ≤ k+1: precise
    Label path expressions of length > k+1: safe, but may include false results
      Validation => requires a scan of the data
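The effect of k on index size can be sketched with k rounds of refinement over a node-labelled graph. The data below is an assumed reading of the A(k) figure (a root A with children B and C, a D under each, and an E under each D). `ak_partition` and `index_size` are names introduced for this example.

```python
# Assumed example data: node id -> label, and node id -> list of parents.
LABELS = {1: "A", 2: "B", 3: "C", 4: "D", 5: "D", 6: "E", 7: "E"}
PARENTS = {2: [1], 3: [1], 4: [2], 5: [3], 6: [4], 7: [5]}


def ak_partition(labels, parents, k):
    """Map each node to a value identifying its k-bisimulation class."""
    block = dict(labels)  # k = 0: nodes are equivalent iff labels match
    for _ in range(k):
        # One refinement round: same class so far and the same set of
        # parent classes, which covers both directions of the definition.
        block = {
            n: (block[n], frozenset(block[p] for p in parents.get(n, ())))
            for n in labels
        }
    return block


def index_size(block):
    """Number of index nodes, i.e. distinct equivalence classes."""
    return len(set(block.values()))


SIZES = [index_size(ak_partition(LABELS, PARENTS, k)) for k in range(4)]
print(SIZES)
```

On this data the partition stops changing at k = 2, which matches the figure's note that the A(2)-Index equals the 1-Index. With a smaller k the two D nodes or the two E nodes share an index node, so long paths through them may return false results that need validation.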
KAIST 2002 SIGDB Tutorial 29
APEX: Adaptive Path indEx for XML Data [Chung, Min, Shim: SIGMOD 02]
  Strong DataGuide and 1-Index keep all simple paths
  Users often issue partial matching path queries
    //book/title
  Exhaustive navigation of the index structure for partial matching path queries may degrade performance
KAIST 2002 SIGDB Tutorial 30
APEX
  Deterministic
  Approximation of DataGuides
  Efficient processing of partial matching path queries
  Workload-aware
    Self-tuning strategies [Chaudhuri et al. 00]
    Utilizes the query workload
  Built from both the XML data and the frequently used paths
    Sequential pattern mining [Agrawal and Srikant 95]
KAIST 2002 SIGDB Tutorial 31
APEX
  Hash Tree
    Keeps frequently used paths
    Prevents exhaustive search
  Graph Structure
    Structural summary + extents
  Example: APEX for frequently used paths = {book.title}
    Hash tree, head node (label: xnode)
      xroot: &0, libraryDB: &1, book: &2, paper: &3, title: (next), author: &4, chapter: &5, section: &6, editor: &7
    Hash tree, node reached through next of title (label: xnode)
      book: &8, remainder: &9
    Extents
      &0: {<null,0>}
      &1: {<0,1>}
      &2: {<1,2>}
      &3: {<1,6>}
      &4: {<2,4>, <6,8>, <6,9>}
      &5: {<2,5>}
      &6: {<6,10>}
      &7: {<2,8>}
      &8: {<2,3>}
      &9: {<6,7>}
[Figure: the XML data and the APEX graph structure with nodes &0 to &9]
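The role of the hash tree can be sketched as a lookup that reads a label path from its last label backwards. This is a simplified illustration built from the example entries above, not the data structure of the APEX paper. The nested dictionary, `xnodes_for` and `edges_for` are assumptions made for this sketch. For paths that are not among the frequently used ones, the result is only a candidate set whose extents may still need checking against the graph structure.

```python
# Hash tree from the example: a string is an xnode, a nested dict is the
# hash node reached through the "next" pointer of an entry.
HASH_TREE = {
    "xroot": "&0", "libraryDB": "&1", "book": "&2", "paper": "&3",
    "title": {"book": "&8", "remainder": "&9"},
    "author": "&4", "chapter": "&5", "section": "&6", "editor": "&7",
}

# Extents from the example: xnode -> list of (parent id, child id) edges.
EXTENTS = {
    "&0": [(None, 0)], "&1": [(0, 1)], "&2": [(1, 2)], "&3": [(1, 6)],
    "&4": [(2, 4), (6, 8), (6, 9)], "&5": [(2, 5)], "&6": [(6, 10)],
    "&7": [(2, 8)], "&8": [(2, 3)], "&9": [(6, 7)],
}


def leaves(entry):
    """All xnodes stored at or below a hash tree entry."""
    if isinstance(entry, str):
        return [entry]
    return sorted(x for child in entry.values() for x in leaves(child))


def xnodes_for(labels):
    """Candidate xnodes for a partial matching path such as //book/title."""
    node = HASH_TREE
    for label in reversed(labels):  # read the path from its last label
        entry = node.get(label, node.get("remainder"))
        if entry is None:
            return []
        if isinstance(entry, str):
            return [entry]
        node = entry
    return leaves(node)  # path ended inside a hash node: take every entry


def edges_for(labels):
    """Concatenate the extents of the candidate xnodes."""
    return [edge for x in xnodes_for(labels) for edge in EXTENTS[x]]


print(xnodes_for(["book", "title"]), edges_for(["title"]))
```

The frequently used path //book/title resolves to a single xnode without navigating the graph from the root, which is the exhaustive search the hash tree is meant to avoid.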
KAIST 2002 SIGDB Tutorial 32
F&B Index [Kaushik, Bohannon, Naughton, Korth: SIGMOD 02]
  Supports twig path expressions
    /A/B[C]
  Basic idea
    For every edge e labelled l from v to u, add an (inverse) edge e^-1 with label l^-1 from u to v
    Then compute the 1-Index on this modified graph
  Very large index space
    Apply some heuristics
      Exploiting local similarity: k-bisimulation
[Figure: the edge sequences A, B, C and A, B, C^-1]
KAIST 2002 SIGDB Tutorial 33
Discussion
  Path Index
    Improves query performance by restricting the search space
    Can be applied to various applications
      Selectivity estimation
      QBE (Query By Example)
  Future Work
    Support for twig queries
    Query optimization
      Cost formula of the path index
KAIST 2002 SIGDB Tutorial 34
Thank You! Any Questions?
http://islab.kaist.ac.kr/~jkmin
KAIST 2002 SIGDB Tutorial 35
References
1. C. Chung, J. Min and K. Shim, “APEX: An Adaptive Path Index for XML Data,” SIGMOD 02
2. B. Cooper, N. Sample, M. Franklin, G. Hjaltason and M. Shadmon, “A Fast Index for Semistructured Data,” VLDB 01
3. M. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri and K. Shim, “XTRACT: A System for Extracting Document Type Descriptors from XML Documents,” SIGMOD 00
4. R. Goldman and J. Widom, “DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases,” VLDB 97
5. R. Kaushik, P. Bohannon, J. Naughton and H. Korth, “Covering Indexes for Branching Path Queries,” SIGMOD 02
6. R. Kaushik, P. Shenoy, P. Bohannon and E. Gudes, “Exploiting Local Similarity for Indexing Paths in Graph-Structured Data,” ICDE 02
7. A. Kemper and G. Moerkotte, “Access Support Relations: An Indexing Method for Object Bases,” Information Systems 92
8. T. Milo and D. Suciu, “Index Structures for Path Expressions,” ICDT 99
9. S. Nestorov, J. Ullman, J. Wiener and S. Chawathe, “Representative Objects: Concise Representations of Semistructured, Hierarchical Data,” ICDE 97
10. F. Rizzolo and A. Mendelzon, “Indexing XML Data with ToXin,” WebDB 01
11. R. Paige and R. Tarjan, “Three Partition Refinement Algorithms,” SIAM Journal on Computing 87
12. P. Valduriez, “Join Indices,” TODS 87