Holistic Twig Joins: Optimal XML Pattern Matching
Nicolas Bruno, Columbia University
Nick Koudas, Divesh Srivastava, AT&T Labs-Research
SIGMOD 2002
XML Query Processing
XML query languages are complex, with many features.
Natural and pervasive operation: matching XML data with a tree-structured pattern.
Previous attempts decompose the query into small pieces and solve them separately, which creates a complex optimization problem.
Data Model
XML database: forest of rooted, ordered, labeled trees.
Nodes represent elements or values.
Edges model direct containment properties.
Example tree (figure):
book
  title: XML
  allauthors
    author (fn: Jane, ln: Poe)
    author (fn: John, ln: Doe)
    author (fn: Jane, ln: Doe)
  year: 2000
  chapter
    title: XML
    section
      head: Origins
      ...
Query Model: Subset of XQuery
FOR $b IN document("books.xml")//book,
    $a IN $b//author
WHERE contains($b/title, 'XML') AND
      $a/fn = 'jane' AND
      $a/ln = 'doe'
RETURN <pubyear> $b/year </pubyear>
Find the year of publication of all books about “XML” written by “Jane Doe”.
Specific twig patterns can match relevant portions of the XML database.
Outline
Problem formulation.
PathStack: Path Queries.
TwigStack: Twig Queries.
XB-Trees: Sub-linear pattern matching.
Experimental evaluation.
Twig Pattern Matching
Exploit indexes over the XML document: document not needed in main memory.
Given a query twig pattern Q and an XML database D, compute the set of all matches for Q on D.
Indexing XML Documents
Element positions represented as tuples (DocID, Left:Right, Level), sorted by Left.
Child and descendant relationships between elements easily determined.
[Figure: one position list per element tag or text value (book, title, author, year, jane, XML). Sample entries from the lists:]
(1,1:150,1) (1,180:200,1) ...
(1,8:8,5) (1,43:43,5) ...
(1,6:20,3) (1,22:40,3) ...
(1,65:67,3) ...
(1,66:66,4) (2,140:140,6) ...
(1,61:63,2) (1,88:90,2) ...
Extension to classical IR inverted lists
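The child and descendant tests on this encoding reduce to a few integer comparisons. The following is a minimal Python sketch, assuming the (DocID, Left:Right, Level) layout from the slide. The `Pos` type, the function names and the variable names are illustrative, and the sample tuples are copied from the figure without assuming which tag each one belongs to.

```python
from typing import NamedTuple

class Pos(NamedTuple):
    doc: int
    left: int
    right: int
    level: int

def is_descendant(d: Pos, a: Pos) -> bool:
    """True if d lies strictly inside a's interval in the same document."""
    return d.doc == a.doc and a.left < d.left and d.right < a.right

def is_child(d: Pos, a: Pos) -> bool:
    """A child is a descendant that sits exactly one level deeper."""
    return is_descendant(d, a) and d.level == a.level + 1

# Sample positions taken from the figure.
outer = Pos(1, 1, 150, 1)
inner = Pos(1, 6, 20, 3)
leaf = Pos(1, 8, 8, 5)
other = Pos(1, 180, 200, 1)

print(is_descendant(inner, outer), is_child(inner, outer))
print(is_descendant(leaf, inner), is_descendant(leaf, other))
```

Because both tests only look at the two tuples being compared, they can run directly over the sorted position lists without touching the document itself.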
Previous Attempts
Based on binary joins [Zhang’01, Al-Khalifa’02].
Decompose query into binary relationships.
Solve binary joins against XML database.
Combine together “basic” matches.
Main drawbacks:
Optimization is required.
Intermediate results can be large.
- ((book title) XML) (year 2000)
- (((book year) 2000) title) XML
- many other possibilities…
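The intermediate-result drawback can be illustrated with a small Python sketch. It uses a plain nested-loop containment join as a stand-in, not the join algorithms of the cited work, and the intervals are invented for the example. Elements are (left, right) pairs in a single document.

```python
def contains(a, d):
    """True if interval d is strictly nested inside interval a."""
    return a[0] < d[0] and d[1] < a[1]

def structural_join(partials, stream):
    """Extend each partial match with every stream element nested in its last element."""
    return [p + (e,) for p in partials for e in stream if contains(p[-1], e)]

# Hypothetical data for the path query A//B//C.
A = [(1, 100), (2, 99), (3, 98)]
B = [(4, 5), (6, 7), (8, 9), (10, 20)]
C = [(11, 12)]

ab = structural_join([(a,) for a in A], B)   # intermediate result for A//B
abc = structural_join(ab, C)                 # final result for A//B//C

print(len(ab), len(abc))
```

Here the first binary join produces 12 pairs, yet only 3 of them survive the second join. A different join order would change the intermediate size, which is why the decomposition approach needs an optimizer.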
Our Approach: Holistic Joins
Solve the entire twig query in two phases:
1- Produce “guaranteed” partial results using one pass.
2- Combine (merge join) partial results.
Partial result smaller than final result.
Exploit indexes.
Skip irrelevant document fragments.
Use containment relationships between query nodes.
Data Structures
Each node q in the query has associated:
A stream Tq, with the positions of the elements corresponding to node q, in increasing “left” order.
A stack Sq, with a compact encoding of partial solutions (stacks are chained).
[Figure: example]
XML fragment (document order): A1, C1, A2, C2, B1, D1
Query: path A, C, D
Matches: [A1, C1, D1] [A1, C2, D1] [A2, C2, D1]
Stacks (bottom to top): SA = A1, A2; SC = C1, C2; SD = D1
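The chained-stack encoding can be sketched in a few lines of Python. The representation below is an assumption made for illustration: each stack entry stores an element name plus the index of the top of the parent stack at the time the entry was pushed, and a match may use that parent entry or any entry below it. The stack contents mirror the example on this slide.

```python
# Each entry: (element name, index of the parent-stack top when it was pushed).
# Entries in the root stack have no parent, so their pointer is -1.
S_A = [("A1", -1), ("A2", -1)]
S_C = [("C1", 0), ("C2", 1)]   # C1 was pushed above A1 only, C2 above A1 and A2
S_D = [("D1", 1)]              # D1 was pushed above C1 and C2
stacks = [S_A, S_C, S_D]

def expand(level, limit, suffix):
    """Enumerate matches that use stacks[level][0..limit] for query node `level`."""
    if level < 0:
        return [tuple(suffix)]
    out = []
    for name, ptr in stacks[level][:limit + 1]:
        out.extend(expand(level - 1, ptr, [name] + suffix))
    return out

matches = expand(len(stacks) - 1, len(S_D) - 1, [])
print(matches)
```

Five stack entries are enough to encode the three matches, and the pointer on C1 is what rules out the combination (A2, C1, D1).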
PathStack: Holistic Path Queries
Repeatedly constructs stack encodings of partial solutions by iterating through the streams Tq.
Stacks encode the set of partial solutions from the current element in Tq to the root of the XML tree.
WHILE (!eof)
  qN = “getMin(q)”
  clean stacks
  push TqN’s first element to SqN
  IF qN is a leaf node, expand solutions
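The loop above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it handles a single document, only ancestor/descendant edges, and elements given as (left, right, name) tuples. The function name `path_stack`, the pointer representation and the interval numbers in the sample data are all invented for illustration. The element names follow the PathStack example slide.

```python
def path_stack(streams):
    """Match a linear path query with descendant edges.

    streams[i] lists the elements for the i-th query node (root first) as
    (left, right, name) tuples sorted by left.
    """
    n = len(streams)
    cursor = [0] * n
    stacks = [[] for _ in range(n)]   # entries: (element, index of parent-stack top)
    matches = []

    def expand(level, limit, suffix):
        # Combine the suffix with every entry in stacks[level][0..limit].
        if level < 0:
            matches.append(tuple(suffix))
            return
        for elem, ptr in stacks[level][:limit + 1]:
            expand(level - 1, ptr, [elem[2]] + suffix)

    while cursor[-1] < len(streams[-1]):          # until the leaf stream is exhausted
        # getMin: the query node whose next element has the smallest left value.
        q = min((i for i in range(n) if cursor[i] < len(streams[i])),
                key=lambda i: streams[i][cursor[i]][0])
        elem = streams[q][cursor[q]]
        # Clean stacks: drop entries that end before the new element starts.
        for stack in stacks:
            while stack and stack[-1][0][1] < elem[0]:
                stack.pop()
        parent_top = len(stacks[q - 1]) - 1 if q > 0 else -1
        stacks[q].append((elem, parent_top))
        cursor[q] += 1
        if q == n - 1:                            # leaf node: expand solutions
            expand(n - 2, parent_top, [elem[2]])
            stacks[q].pop()
    return matches

# Hypothetical intervals for the elements A1..C4 and the query A//B//C.
A = [(1, 20, "A1"), (2, 11, "A2")]
B = [(4, 9, "B1"), (12, 19, "B2")]
C = [(3, 10, "C1"), (5, 6, "C2"), (13, 14, "C3"), (15, 16, "C4")]
result = path_stack([A, B, C])
print(result)
```

Each stream is read once and every stack entry is pushed and popped at most once, so the work outside `expand` is linear in the input, while `expand` does work proportional to the matches it emits.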
PathStack Example
XML fragment elements: A1, A2, C1, B1, C2, B2, C3, C4
Query: path A, B, C
[Figure: animation of the stacks SA, SB, SC and the matches output so far]
Initially: all stacks empty.
After A1: SA = A1
After A2: SA = A1 - A2
After C1: SA = A1 - A2; SC = C1
After B1: SA = A1 - A2; SB = B1; SC = C1
After C2: SA = A1 - A2; SB = B1; SC = C1 - C2; output: A1,B1,C2 and A2,B1,C2
After B2: SA = A1; SB = B2
After C3: SA = A1; SB = B2; SC = C3; output: A1,B2,C3
After C4: SA = A1; SB = B2; SC = C4; output: A1,B2,C4
Theorem: PathStack correctly returns all query matches with O(|input|+|output|) I/O and CPU complexity.
Twig Queries
Naïve adaptation of PathStack.
Solve each root-to-leaf path independently.
Merge-join each intermediate result.
Problem: Many intermediate results might not be part of the final answer.
[Figure: twig query with root A and two branches, B over C and D over E, shown against a document fragment of A, B, C, D and E elements.]
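The merge step of the naïve approach, and the waste it can cause, can be sketched in Python. The path matches below are hypothetical and would come from running a path algorithm once per root-to-leaf path. The sketch uses a hash lookup on the shared root element in place of a sorted merge, purely for brevity.

```python
from collections import defaultdict

def merge_paths(left_paths, right_paths):
    """Join two lists of root-to-leaf path matches on their shared root element."""
    by_root = defaultdict(list)
    for path in right_paths:
        by_root[path[0]].append(path)
    return [l + r[1:] for l in left_paths for r in by_root.get(l[0], [])]

# Hypothetical path matches for the two branches of the twig A[.//B//C][.//D//E].
abc = [("A1", "B1", "C1"), ("A2", "B2", "C2"), ("A3", "B3", "C3")]
ade = [("A3", "D1", "E1"), ("A4", "D2", "E2")]

twig = merge_paths(abc, ade)
used_roots = {t[0] for t in twig}
wasted = [p for p in abc + ade if p[0] not in used_roots]

print(twig)
print(len(wasted), "path matches never reach the final answer")
```

Only the A3 matches combine into a twig match. The other three path matches were computed and stored for nothing, which is the problem named on this slide.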
TwigStack
1) Compute only partial solutions that are guaranteed to extend to a final solution.
2) Merge partial solutions to obtain all matches.
WHILE (!eof)
  qN = “getNext(q)”
  clean stacks
  IF TqN’s first element is part of a solution, push it
  IF qN is a leaf node, expand solutions
getNext may advance the streams in subTree(q) past elements that are guaranteed not to be part of a solution.
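A possible shape for getNext is sketched below in Python. It is a reading of the description on this slide rather than a verified transcription of the authors' code: the `QNode` class, the (left, right) element format and the use of infinity as the position of an exhausted stream are assumptions, and the main TwigStack loop is left out.

```python
import math

class QNode:
    def __init__(self, name, stream, children=()):
        self.name = name
        self.stream = stream          # list of (left, right), sorted by left
        self.pos = 0                  # cursor into the stream
        self.children = list(children)

    def next_left(self):
        return self.stream[self.pos][0] if self.pos < len(self.stream) else math.inf

    def next_right(self):
        return self.stream[self.pos][1] if self.pos < len(self.stream) else math.inf

def get_next(q):
    """Return the query node whose stream head should be processed next."""
    if not q.children:
        return q
    for child in q.children:
        n = get_next(child)
        if n is not child:
            return n                  # something deeper in the subtree comes first
    n_min = min(q.children, key=QNode.next_left)
    n_max = max(q.children, key=QNode.next_left)
    # Skip elements of q that end before the latest child head starts:
    # they cannot contain a head from every child stream.
    while q.next_right() < n_max.next_left():
        q.pos += 1
    if q.next_left() < n_min.next_left():
        return q
    return n_min

# Hypothetical twig A with descendant branches B and C.
b = QNode("B", [(6, 7)])
c = QNode("C", [(8, 9)])
a = QNode("A", [(1, 4), (5, 20)], [b, c])
first = get_next(a)
print(first.name, a.stream[a.pos])
```

In this example the first A element, (1, 4), is skipped without ever being pushed, because it ends before the C head begins and so cannot cover both branches.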
Analysis of TwigStack
If getNext(q) = qN, then:
Sub-tree qN has a solution using the stream heads.
qN is “maximal”.
getNext returns nodes in topological order.
Stacks encode the set of partial solutions from the current element in getNext to the root of the XML tree.
Theorem: TwigStack correctly returns all query matches with O(|input|+|output|) I/O and CPU complexity for ancestor/descendant relationships.
XB-Trees: A Variant of B-Trees
Index positions of elements in the document.
Allows adaptive granularity for consuming streams: advance and drillDown.
TwigStack can be adapted to use XB-Trees with minimal changes.
[Figure: XB-Tree structure, with the Advance and DrillDown operations.]
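The two cursor operations can be sketched in Python on a toy two-level tree. Everything here is a simplified assumption for illustration: the `Page` and `Cursor` classes, the rule that an internal entry stores the smallest left and largest right value of its child page, and the `skip_to` helper are not taken from the slides.

```python
class Page:
    def __init__(self, entries, children=None):
        self.entries = entries      # list of (left, right)
        self.children = children    # child pages, or None for a leaf page
        self.parent = None          # (parent page, index of this page's entry)

def build_two_level(leaf_entry_lists):
    leaves = [Page(list(es)) for es in leaf_entry_lists]
    bounds = [(p.entries[0][0], max(r for _, r in p.entries)) for p in leaves]
    root = Page(bounds, leaves)
    for i, leaf in enumerate(leaves):
        leaf.parent = (root, i)
    return root

class Cursor:
    def __init__(self, root):
        self.page, self.index = root, 0

    def eof(self):
        return self.page is None

    def head(self):
        return self.page.entries[self.index]

    def on_leaf(self):
        return self.page.children is None

    def advance(self):
        """Move to the next entry, climbing to the parent at the end of a page."""
        if self.index + 1 < len(self.page.entries):
            self.index += 1
        elif self.page.parent is None:
            self.page = None        # ran off the end of the root page
        else:
            self.page, self.index = self.page.parent
            self.advance()

    def drill_down(self):
        """Descend to the first entry of the child page under the current entry."""
        if not self.on_leaf():
            self.page, self.index = self.page.children[self.index], 0

def skip_to(cursor, target_left):
    """Find the next leaf entry that does not end before target_left."""
    visited = 0
    while not cursor.eof():
        visited += 1
        _, right = cursor.head()
        if right < target_left:
            cursor.advance()        # the whole entry (and its subtree) is irrelevant
        elif cursor.on_leaf():
            return cursor.head(), visited
        else:
            cursor.drill_down()
    return None, visited

root = build_two_level([
    [(1, 2), (3, 4), (5, 6)],
    [(7, 8), (9, 10), (11, 12)],
    [(13, 40), (14, 15), (16, 17)],
    [(41, 42), (43, 44), (45, 46)],
])
cur = Cursor(root)
hit, visited = skip_to(cur, 20)
print(hit, visited)
```

Advancing at the root level skips the first two leaf pages without reading their entries, so the cursor reaches (13, 40) after inspecting 4 entries where a plain scan of the stream would inspect 7.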
Experimental Setting
Implemented all algorithms in C++ using the file system as a simple storage engine.
Synthetic and real databases:
Unfolded DBLP database.
XMach-1 and XMark benchmarks.
Random XML documents.
Techniques compared:
Binary Join techniques.
PathStack.
TwigStack.
PathStack vs. Binary Joins
XML database fragment: 1 million nodes.
Path Query: A1//A2//A3//A4//A5//A6
[Chart: execution time (seconds), axis from 0 to 60, for Binary Joins, PathStack and SS.]
PathStack vs. TwigStack
XB-Trees
XML database fragment: 1 million nodes.
Twig Query
Current and Future Work
Handle arbitrary projections and constrained ancestor/descendant relationships optimally.
Integrate TwigStack with value-based joins (id-refs, user defined predicates, etc.).
Incorporate remaining axes (following, etc.).
Summary and Conclusions
Developed holistic path join algorithms (PathStack and PathMPMJ) that are independent of size of intermediate results.
Developed TwigStack, which generalizes PathStack for twig queries.
Designed XB-Trees and integrated them into TwigStack.
Overflow Slides
PathMPMJ
Non-trivial adaptation of MPMGJN [Zhang’01].
Variant of merge-join that uses a stack of backtracking marks per query node.
PathStack vs. PathMPMJ
XML database fragment: 1 million nodes.
TwigStack: Parent/Child edges
Any algorithm that works over streams either gets deadlocked or results in suboptimal executions.
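[Figure: example]
Query: A with children B and C (parent/child edges).
Data: A1 with children B2, C2 and A2; A2 with children B1 and C1.
Matches: (A1, B2, C2) (A2, B1, C1)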
PathStack vs. PathMPMJ (2)
DBLP database
PathStack vs. PathMPMJ (3)
Benchmark database
PathStack vs. TwigStack (2)
DBLP database
PathStack vs. TwigStack (3)
PathStack vs. TwigStack (4)
XB-Trees(2)
DBLP database.
XB-Trees(3)
Benchmark database.