

KAIST 2002 SIGDB Tutorial 1

Indexing Methods for Efficient XML Query Processing

Jun-Ki Min KAIST

http://islab.kaist.ac.kr/~jkmin/


XML
  eXtensible Markup Language
  The de facto standard for data representation and exchange on the Web

XML Data
  An instance of semistructured data
    self-describing
    irregularly structured


XML Data
  Comprises hierarchically nested collections of elements
  An element can contain
    An atomic data value
    A sequence of subelements
    Attributes composed of name-value pairs
  ID-IDREF relationships
  Tree or graph representation


XML Example

<libraryDB>
  <book editor="1">
    <title> title1 </title>
    <author> author1 </author>
    <chapter> … </chapter>
  </book>
  <paper>
    <title> title2 </title>
    <author id="1"> author2 </author>
    <author> author3 </author>
    <section> … </section>
  </paper>
  …
</libraryDB>

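[Figure: graph representation of the XML example. Nodes are numbered 1 to 10; edges are labelled libraryDB, book, paper, title, author, chapter, section and editor. The figure is annotated with ToXin, Index Fabric and APEX.]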


XML Query
  XML query languages
    XSLT, XML-QL, XPath, XQuery
    Use path expressions to traverse the irregularly structured data
      ex) /libraryDB/book/title or //title
    Searching the whole XML data => inefficiency
  Structural Summary & Path Index
    Restrict the search to only the relevant portions of the XML data
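The cost of evaluating path expressions without an index can be sketched as follows. This is a minimal illustration, assuming the library example is stored as an edge-labelled graph; the node ids, the virtual root 0 and the function names are assumptions made for this sketch, not part of the tutorial.

```python
from collections import defaultdict

# Edge-labelled graph for the libraryDB example (ids are illustrative;
# node 0 is an assumed virtual root above libraryDB).
EDGES = [
    (0, "libraryDB", 1),
    (1, "book", 2), (2, "title", 3), (2, "author", 4), (2, "chapter", 5),
    (1, "paper", 6), (6, "title", 7), (6, "author", 8), (6, "author", 9),
    (6, "section", 10),
    (2, "editor", 8),  # ID-IDREF edge from the book to its editor
]


def build_children(edges):
    """Map each node to its outgoing (label, target) pairs."""
    children = defaultdict(list)
    for src, label, dst in edges:
        children[src].append((label, dst))
    return children


def eval_simple_path(children, root, labels):
    """Evaluate /l1/l2/.../ln by following matching edges from the root."""
    frontier = {root}
    for label in labels:
        frontier = {dst for node in frontier
                    for (l, dst) in children[node] if l == label}
    return frontier


def eval_descendant(children, root, label):
    """Evaluate //label with no index: every reachable edge is inspected."""
    seen, stack, result = {root}, [root], set()
    inspected = 0
    while stack:
        node = stack.pop()
        for l, dst in children[node]:
            inspected += 1
            if l == label:
                result.add(dst)
            if dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return result, inspected
```

Here `eval_descendant` has to inspect every edge of the data to answer //title, which is the inefficiency a path index is meant to avoid.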


Schemas for XML
  DTD, XML Schema
    Specify the constraints of XML data
      <!ELEMENT book (title, author+, chapter*)>
    Are not mandatory => lack of an external schema
  Structural Summary
    Summary of label paths
  Path Index
    Structural Summary + Extents


Schemas for XML
  Applications
    User interface
    XML data design, editing
    Query formulation
    Query validation
    Query optimization
    Path index


Structural Summary
  DTD extraction
    XTRACT
    Based on element information
  Structural summary
    Representative Objects
    Based on path information


XTRACT
  [Garofalakis, Gionis, Rastogi, Seshadri, Shim: SIGMOD 00]
  Infers a concise and accurate DTD
  Chooses a DTD from the candidate DTDs
    (a b), (b a) => (a|b)* or (a b)|(b a)
  Based on the Minimum Description Length (MDL) principle
    Ranks each candidate DTD by the number of bits required to describe the subelement sequences in terms of the candidate DTD
      (a|b)*: 6 (for DTD) + 3 + 3 = 12
      (a b)|(b a): 9 (for DTD) + 1 + 1 = 11
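The MDL ranking above amounts to a small sum per candidate. The sketch below only reproduces the arithmetic from the slide; the per-DTD and per-sequence costs are copied from the slide rather than computed, and the function names are assumptions. It is not XTRACT's actual encoding scheme.

```python
def mdl_cost(dtd_cost, sequence_costs):
    """Description length: cost of the DTD plus the cost of each sequence."""
    return dtd_cost + sum(sequence_costs)


# Costs taken from the slide for the input sequences (a b) and (b a).
CANDIDATES = {
    "(a|b)*": (6, [3, 3]),
    "(a b)|(b a)": (9, [1, 1]),
}


def best_candidate(candidates):
    """Pick the candidate DTD with the smallest total description length."""
    return min(candidates, key=lambda dtd: mdl_cost(*candidates[dtd]))
```

With these costs the more specific DTD (a b)|(b a) wins, because its cheaper sequence encodings outweigh its longer definition.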


Representative Objects (RO)
  [Nestorov, Ullman, Wiener, Chawathe: ICDE 97]
  Provide a concise representation of the inherent schema of semistructured hierarchical data
  Full-RO
    Describes all simple paths
  k-RO
    Guarantees that its paths of length k+1 exist in the data
  1-RO
    Simplest and very compact representation
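A 1-RO can be pictured as the set of label pairs that occur consecutively in the data. The sketch below assumes an edge-labelled graph for the library example plus a hypothetical name edge under one author; the ids and function names are assumptions. It shows that every path of length 2 accepted by the 1-RO exists in the data, while a longer accepted path may not.

```python
from collections import defaultdict

EDGES = [
    (0, "libraryDB", 1),
    (1, "book", 2), (2, "title", 3), (2, "author", 4), (2, "chapter", 5),
    (1, "paper", 6), (6, "title", 7), (6, "author", 8), (6, "author", 9),
    (6, "section", 10),
    (2, "editor", 8),   # ID-IDREF edge
    (8, "name", 11),    # hypothetical edge added for this illustration
]


def one_ro(edges):
    """Return all label pairs (l1, l2) such that the path l1.l2 occurs."""
    incoming = defaultdict(set)  # node -> labels of the edges entering it
    for _, label, dst in edges:
        incoming[dst].add(label)
    pairs = set()
    for src, label, _ in edges:
        for prev in incoming[src]:
            pairs.add((prev, label))
    return pairs


def accepts(pairs, path):
    """A label path is accepted if each consecutive pair is in the 1-RO."""
    return all((a, b) in pairs for a, b in zip(path, path[1:]))


def occurs(edges, path):
    """Check whether the label path occurs somewhere in the data graph."""
    frontier = None  # None means "any node may start the path"
    for label in path:
        nxt = set()
        for src, l, dst in edges:
            if l == label and (frontier is None or src in frontier):
                nxt.add(dst)
        frontier = nxt
    return bool(frontier)
```

For instance, book.author.name is accepted by this 1-RO although only paper.author.name and book.editor.name occur in the assumed data, which is the price of the very compact representation.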


Representative Objects(RO)

[Figure: three panels. XML Data; Graph Representation of 1-RO; Graph Representation of 2-RO (= Full-RO). Edge labels: libraryDB, book, paper, title, author, editor, chapter, section, name.]


Path Index
  Access Support Relations
  Deterministic
    Strong DataGuide
    Index Fabric
    ToXin
    APEX
  Non-Deterministic
    1-Index
    A(k)-Index
    F&B Index


Access Support Relations
  [Kemper, Moerkotte: IS 92]
  Originated from OODBMS
    select Name from Mercedes.Manufactures.Composition.Division
  Support joins along arbitrary reference chains
  Generalization of the Join Index [Valduriez 87]
  Based on the paths in the schema
  Materialize access paths of arbitrary length
  Support only predefined subsets of paths


DataGuides
  [Goldman, Widom: VLDB 97]
  An implementation of Full-RO
  Summary of label paths from the root (= simple paths)
  Concise: describes every unique simple path exactly once, regardless of the number of times it appears
  Accurate: does not contain label paths that do not appear in the data
  Convenient: can be stored and accessed using techniques similar to those available for processing semistructured data


DataGuides
  The construction algorithm emulates the conversion of a non-deterministic finite automaton (NFA) into a deterministic finite automaton (DFA)
  Intuitively, a simple path is represented as a node in a DataGuide
  One XML data source may have multiple DataGuides

[Figure: an XML data graph with edges labelled A, B and C, and various DataGuides for it.]


Strong DataGuide
  If the sets of nodes reachable by two simple paths are equal, then the simple paths are represented as a single node
  Linear time and linear space for tree-structured data
  Exponential time and exponential space for graph-structured data

[Figure: two source graphs (nodes 1 to 6, edges labelled A, B and C), each shown next to its Strong DataGuide. The DataGuide nodes carry target sets such as 1; 2,4; 3,5; 6.]
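The NFA-to-DFA style construction of a Strong DataGuide can be sketched as a subset construction over target sets. This is a minimal sketch under stated assumptions: the data is the library example as an edge-labelled graph with an assumed virtual root 0, and the function and variable names are illustrative rather than taken from the DataGuides paper.

```python
from collections import defaultdict

EDGES = [
    (0, "libraryDB", 1),
    (1, "book", 2), (2, "title", 3), (2, "author", 4), (2, "chapter", 5),
    (1, "paper", 6), (6, "title", 7), (6, "author", 8), (6, "author", 9),
    (6, "section", 10),
    (2, "editor", 8),  # ID-IDREF edge, makes the data a graph
]


def strong_dataguide(edges, root):
    """Return {target_set: {label: target_set}} built by subset construction."""
    children = defaultdict(lambda: defaultdict(set))
    for src, label, dst in edges:
        children[src][label].add(dst)

    guide = {}
    work = [frozenset([root])]
    while work:
        targets = work.pop()
        if targets in guide:
            continue  # equal target sets share a single DataGuide node
        guide[targets] = {}
        # Group everything reachable from this target set by edge label.
        by_label = defaultdict(set)
        for node in targets:
            for label, dsts in children[node].items():
                by_label[label] |= dsts
        for label, dsts in by_label.items():
            nxt = frozenset(dsts)
            guide[targets][label] = nxt
            work.append(nxt)
    return guide
```

In this example node 8 appears in two target sets, {8} for libraryDB.book.editor and {8, 9} for libraryDB.paper.author, so extents of a Strong DataGuide may overlap on graph-structured data.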


1/2/T-Index
  [Milo and Suciu: ICDT 99]
  1-Index
    Summarizes all label paths starting from the root
    Supports queries of the form q = Px where P = /l1/l2/…/ln
    Non-deterministic
    Based on backward bisimulation, which originated in graph verification
    Extents are disjoint
    More compact than Strong DataGuides


1-Index
  Equivalence relation (≡)
    v ≡ u iff Lv = Lu, where Lx = {w | w is a simple path from the root to x}
    The 1-Index is the collection of all equivalence classes
    Exponential construction cost
  Backward bisimulation (≈b)
    1. If x ≈b y and x is the root, then y is the root.
    2. Conversely, if x ≈b y and y is the root, then x is the root.
    3. If x ≈b y and (x', l, x) is an edge, then there exists an edge (y', l, y) such that x' ≈b y'.
    4. Conversely, if x ≈b y and (y', l, y) is an edge, then there exists an edge (x', l, x) such that x' ≈b y'.
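The four conditions above can be turned into a simple partition refinement. The sketch below is a naive fixpoint iteration, not the O(m log m) Paige and Tarjan algorithm; the example graph, the virtual root 0 and the function name are assumptions for illustration.

```python
from collections import defaultdict

EDGES = [
    (0, "libraryDB", 1),
    (1, "book", 2), (2, "title", 3), (2, "author", 4), (2, "chapter", 5),
    (1, "paper", 6), (6, "title", 7), (6, "author", 8), (6, "author", 9),
    (6, "section", 10),
    (2, "editor", 8),  # ID-IDREF edge
]


def one_index(edges, root):
    """Return the extents (classes of backward-bisimilar nodes), sorted."""
    nodes = {root} | {n for src, _, dst in edges for n in (src, dst)}
    parents = defaultdict(set)
    for src, label, dst in edges:
        parents[dst].add((label, src))

    block = {n: 0 for n in nodes}  # start with a single block
    while True:
        # Two nodes stay together only if they were together before, agree
        # on being the root, and have the same (label, parent block) pairs.
        sig = {n: (block[n], n == root,
                   frozenset((l, block[p]) for l, p in parents[n]))
               for n in nodes}
        ids = {}
        new_block = {n: ids.setdefault(sig[n], len(ids)) for n in nodes}
        if len(ids) == len(set(block.values())):
            break  # no block was split: the partition is stable
        block = new_block

    extents = defaultdict(set)
    for n, b in block.items():
        extents[b].add(n)
    return sorted(sorted(e) for e in extents.values())
```

On the example, nodes 8 and 9 end up in different extents because 8 also has an incoming editor edge, while in a graph with two identical branches the corresponding nodes share one extent. Every node belongs to exactly one extent, which is why 1-Index extents are disjoint.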


≡ vs ≈b
  X ≡ Y since LX = LY = {a.b.d, a.c.d}
  v ≈b u implies v ≡ u
  ≈b has O(m log m) construction cost [Paige and Tarjan 87]

[Figure: graphs with edges labelled a, b, c and d whose d nodes X and Y are compared under ≡ and ≈b.]


1-Index vs Strong DataGuide

On tree-structured data, the Strong DataGuide and the 1-Index coincide

[Figure: the example XML data (nodes 1 to 10), its Strong DataGuide (which includes a node for the target set 8,9) and its 1-Index.]


2/T-Index
  2-Index
    Supports queries of the form x1 P x2
      ex) //title
    Equivalence relation (≡)
      (v, u) ≡ (v', u') iff L(v,u) = L(v',u'), where L(x,y) = {w | w is a label path from x to y}
    Summary of path information between two arbitrary nodes
  T-Index
    Generalization of the 1/2-Index
      (v1,…,vn) ≡ (u1,…,un) iff L(v1,…,vn) = L(u1,…,un)
    Conceptually similar to Access Support Relations
    Supports only predefined paths


Index Fabric
  [Cooper, Sample, Franklin, Hjaltason, Shadmon: VLDB 01]
  Tree-structured data
  Conceptually similar to the Strong DataGuide
  Layered structure
  Uses a Patricia trie to index a large number of search keys
  The simple path of an element that has a data value is encoded as a special character sequence
  Keeps the key, which is the combination of the encoded sequence and the data value


Index Fabric
  Keeps only the information of elements that have data values
  Patricia trie: lossy compression

[Figure: the XML data and its Patricia trie, with keys such as LBAauthor1 and LBTtitle1 built from the designators L, B, P, T, A and C.]
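The key encoding can be sketched as follows. The designator letters follow the figure (L, B, P, T, A), but the sorted list with binary search is only a stand-in assumed here for the Patricia trie, and the function names are illustrative. The sketch also assumes that data values never collide with designator characters.

```python
import bisect

# One designator character per element label, as in the figure.
DESIGNATORS = {"libraryDB": "L", "book": "B", "paper": "P",
               "title": "T", "author": "A"}

# (simple path, data value) pairs of the elements that carry a value.
VALUES = [
    (("libraryDB", "book", "title"), "title1"),
    (("libraryDB", "book", "author"), "author1"),
    (("libraryDB", "paper", "title"), "title2"),
    (("libraryDB", "paper", "author"), "author2"),
    (("libraryDB", "paper", "author"), "author3"),
]


def encode(path, value=""):
    """Encode a simple path as designators and append the data value."""
    return "".join(DESIGNATORS[label] for label in path) + value


# A sorted key list stands in for the Patricia trie in this sketch.
KEYS = sorted(encode(path, value) for path, value in VALUES)


def lookup_prefix(keys, prefix):
    """Return all keys starting with the given encoded prefix."""
    start = bisect.bisect_left(keys, prefix)
    matches = []
    for key in keys[start:]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches
```

A query such as /libraryDB/paper/author then becomes a prefix search for the encoded string LPA, and a query with a value predicate becomes an exact key search, for example for LBAauthor1.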


ToXin
  [Rizzolo, Mendelzon: WebDB 01]
  Tree-structured data
  Conceptually similar to the Strong DataGuide (not the minimal DataGuide)
  Supports both forward and backward navigation
  Path Tree (= Strong DataGuide)
  A node of the Path Tree has an Index Table or Value Tables
    Index Table (IT): parent-child relationships
    Value Table (VT): owner-value relationships


ToXin
  Since ToXin keeps parent-child relationships, it supports path expressions with value predicates
    ex) /libraryDB/book[author = author1]

Path Tree
  LibraryDB:IT
    book:IT
      title:VT
      author:VT
      chapter
    paper:IT
      title:VT
      author:VT
      section

Index Tables
  LibraryDB:        (parent, child) = (null, 1)
  LibraryDB.book:   (parent, child) = (1, 2)
  LibraryDB.paper:  (parent, child) = (1, 6)

Value Tables
  LibraryDB.book.author: (parent, value) = (2, author1), …

[Figure: the XML data.]
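How index tables and value tables answer a path expression with a value predicate can be sketched as below. The table contents follow the example data, but the dictionary layout and the function names are assumptions for this illustration and are not ToXin's actual storage format.

```python
# Index tables: simple path -> list of (parent id, child id).
INDEX_TABLES = {
    ("libraryDB",): [(None, 1)],
    ("libraryDB", "book"): [(1, 2)],
    ("libraryDB", "paper"): [(1, 6)],
}

# Value tables: simple path -> list of (owner id, data value).
VALUE_TABLES = {
    ("libraryDB", "book", "title"): [(2, "title1")],
    ("libraryDB", "book", "author"): [(2, "author1")],
    ("libraryDB", "paper", "title"): [(6, "title2")],
    ("libraryDB", "paper", "author"): [(6, "author2"), (6, "author3")],
}


def select_with_value(path, child_label, value):
    """Evaluate path[child_label = value] and return the matching node ids."""
    candidates = {child for _, child in INDEX_TABLES[path]}
    owners = {owner for owner, v in VALUE_TABLES[path + (child_label,)]
              if v == value}
    return candidates & owners


def parent_of(path, node):
    """Backward navigation: find the parent of a node stored under path."""
    for parent, child in INDEX_TABLES[path]:
        if child == node:
            return parent
    return None
```

For /libraryDB/book[author = author1] the value table yields the owner 2, which is then checked against the children recorded in the LibraryDB.book index table; the same index table also supports the backward step from node 2 to its parent 1.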


A(k)-Index
  [Kaushik, Shenoy, Bohannon, Gudes: ICDE 02]
  The Strong DataGuide and the 1-Index record all simple paths
    Increased index size => increased search space
  Approximation of the 1-Index
  Non-deterministic
  Utilizes local similarity (= degree k)
    Reduces the size of the index graph


A(k)-Index
  k-bisimulation (≈k)
    For any two nodes v and u, v ≈0 u iff v and u have the same label
    v ≈k u iff v ≈k-1 u and for every parent v' of v there is a parent u' of u such that v' ≈k-1 u', and vice versa

[Figure: an XML data graph with node labels A, B, C, D and E, shown with its A(0)-Index, A(1)-Index and A(2)-Index (= 1-Index).]
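The k-bisimulation definition translates into k rounds of refinement starting from the partition by label. The sketch below assumes a node-labelled graph shaped like the figure (A above B and C, one D under each of B and C, one E under each D); the node ids and the function name are assumptions.

```python
from collections import defaultdict

# Node-labelled data graph assumed to match the figure.
LABELS = {1: "A", 2: "B", 3: "C", 4: "D", 5: "D", 6: "E", 7: "E"}
EDGES = [(1, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 7)]  # (parent, child)


def ak_index(labels, edges, k):
    """Return the extents of the A(k)-Index as sorted lists of node ids."""
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)

    block = dict(labels)  # A(0): nodes are grouped by label only
    for _ in range(k):
        # v and u stay together iff they were together in the previous round
        # and their parents cover the same set of previous-round blocks.
        block = {n: (block[n], frozenset(block[p] for p in parents[n]))
                 for n in labels}

    extents = defaultdict(set)
    for n, b in block.items():
        extents[b].add(n)
    return sorted(sorted(e) for e in extents.values())
```

With this data the two D nodes are separated at k = 1 and the two E nodes at k = 2, where the result coincides with the 1-Index, so a larger k trades index size for precision on longer paths.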


A(k)-Index
  Building cost = O(km)
  In general, for the 1-Index, k < log m
  Query processing
    Label path expressions of length ≤ k+1
      precise
    Label path expressions of length > k+1
      safe: may include false results
      validation => requires a scan of the data


APEX: Adaptive Path indEx for XML Data
  [Chung, Min, Shim: SIGMOD 02]
  The Strong DataGuide and the 1-Index keep all simple paths
  Users use partial matching path queries
    //book/title
  Exhaustive navigation of the index structure for partial matching path queries may degrade performance


APEX
  Deterministic
  Approximation of DataGuides
  Efficient processing of partial matching path queries
  Workload-aware
    Self-tuning strategies [Chaudhuri et al. 00]
    Utilizes the query workload
  Builds APEX from both the XML data and the frequently used paths
    Sequential pattern mining [Agrawal and Srikant 95]


APEX
  Hash Tree
    keeps frequently used paths
    prevents exhaustive search
  Graph Structure
    structural summary + extents

APEX with frequently used paths = {book.title}

Hash tree, first table (label, xnode, next)
  xroot      &0
  libraryDB  &1
  book       &2
  paper      &3
  title      (no xnode; next points to the second table)
  author     &4
  chapter    &5
  section    &6
  editor     &7

Hash tree, second table (label, count, xnode, next)
  book       &8
  remainder  &9

Extents
  &0: {<null,0>}
  &1: {<0,1>}
  &2: {<1,2>}
  &3: {<1,6>}
  &4: {<2,4>, <6,8>, <6,9>}
  &5: {<2,5>}
  &6: {<6,10>}
  &7: {<2,8>}
  &8: {<2,3>}
  &9: {<6,7>}

[Figure: the XML data and the APEX graph structure with nodes &0 to &9 and edges labelled libraryDB, book, paper, title, author, chapter, section and editor.]
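A lookup in the hash tree for a partial matching path query can be sketched as below. The tables and extents are copied from the example with frequently used paths = {book.title}; the nested-dictionary layout, the reverse-order traversal as written here and the function names are simplifying assumptions for this sketch, not the published APEX data structure.

```python
# Extents of the graph structure: xnode -> set of (parent id, node id).
EXTENT = {
    "&0": {(None, 0)}, "&1": {(0, 1)}, "&2": {(1, 2)}, "&3": {(1, 6)},
    "&4": {(2, 4), (6, 8), (6, 9)}, "&5": {(2, 5)}, "&6": {(6, 10)},
    "&7": {(2, 8)}, "&8": {(2, 3)}, "&9": {(6, 7)},
}

# Hash tree: a label maps to an xnode, or to a subtable keyed by the
# preceding label, with "remainder" covering all other prefixes.
HASH_TREE = {
    "xroot": "&0", "libraryDB": "&1", "book": "&2", "paper": "&3",
    "title": {"book": "&8", "remainder": "&9"},
    "author": "&4", "chapter": "&5", "section": "&6", "editor": "&7",
}


def leaves(entry):
    """Collect every xnode stored at or below a hash tree entry."""
    if isinstance(entry, str):
        return [entry]
    found = []
    for child in entry.values():
        found.extend(leaves(child))
    return found


def lookup(path):
    """Look up //l1/.../ln reading the labels from the last one backwards.

    Returns (xnodes, exact). exact is False when part of the path is not
    covered by the hash tree and the result still has to be verified.
    """
    entry = HASH_TREE[path[-1]]
    remaining = list(path[:-1])
    while isinstance(entry, dict) and remaining:
        label = remaining.pop()
        if label not in entry:
            return leaves(entry["remainder"]), False
        entry = entry[label]
    return leaves(entry), not remaining


def extent_of(xnodes):
    """Union of the extents of the given xnodes."""
    result = set()
    for xnode in xnodes:
        result |= EXTENT[xnode]
    return result
```

The frequently used path //book/title is answered directly from xnode &8 without navigating the graph structure, whereas //title takes the union of &8 and &9.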


F&B Index
  [Kaushik, Bohannon, Naughton, Korth: SIGMOD 02]
  Supports twig path expressions
    /A/B[C]
  Basic idea
    For every edge e labelled l from v to u, add an (inverse) edge e^-1 with label l^-1 from u to v
    Then compute the 1-Index on this modified graph
  Very large index space
    Apply some heuristics
      - Exploiting local similarity: k-bisimulation

[Figure: the label sequences A B C and A B C^-1.]


Discussion
  Path Index
    Improves query performance by restricting the search space
    Can be applied to various applications
      Selectivity estimation
      QBE (Query By Example)
  Future Work
    Support for twig queries
    Query optimization
      cost formula of path index


Thank You! Any Questions?

http://islab.kaist.ac.kr/~jkmin


References
1. C. Chung, J. Min and K. Shim, “APEX: An Adaptive Path Index for XML Data,” SIGMOD 02
2. B. Cooper, N. Sample, M. Franklin, G. Hjaltason and M. Shadmon, “A Fast Index for Semistructured Data,” VLDB 01
3. M. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri and K. Shim, “XTRACT: A System for Extracting Document Type Descriptors from XML Documents,” SIGMOD 00
4. R. Goldman and J. Widom, “DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases,” VLDB 97
5. R. Kaushik, P. Bohannon, J. Naughton and H. Korth, “Covering Indexes for Branching Path Queries,” SIGMOD 02
6. R. Kaushik, P. Shenoy, P. Bohannon and E. Gudes, “Exploiting Local Similarity for Indexing Paths in Graph-Structured Data,” ICDE 02
7. A. Kemper and G. Moerkotte, “Access Support Relations: An Indexing Method for Object Bases,” Information Systems 92
8. T. Milo and D. Suciu, “Index Structures for Path Expressions,” ICDT 99
9. S. Nestorov, J. Ullman, J. Wiener and S. Chawathe, “Representative Objects: Concise Representations of Semistructured, Hierarchical Data,” ICDE 97
10. F. Rizzolo and A. Mendelzon, “Indexing XML Data with ToXin,” WebDB 01
11. R. Paige and R. Tarjan, “Three Partition Refinement Algorithms,” SIAM Journal on Computing 87
12. P. Valduriez, “Join Indices,” TODS 87