On Boosting Holism in XML Twig Pattern Matching Using
Structural Indexing Techniques
Ting Chen, Jiaheng Lu, Tok Wang Ling

Outline
• Background
  • XML Twig Pattern Query
  • Previous Twig Join algorithms
  • Limit of the original holistic method TwigStack
• Our holistic Twig Pattern Matching algorithms
  • Two Refined Indexing Schemes: Tag+Level and PPS
  • A generalized holistic matching theory
  • iTwigJoin: a generalized holistic matching algorithm
• Experiments
• Conclusion

Background: XML and Region Coding
• An XML document is modeled as a tree in our work.
• Region coding for the XML document tree: each element gets a <start, end, level> label.
• Containment property: a.start < b.start AND a.end > b.end if and only if a is an ancestor of b.
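
The region labels and the containment test can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Node` class, the depth-first numbering in `assign_regions`, and the `is_parent` helper (which uses the level field to test a parent-child edge) are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node; start/end/level are filled in by assign_regions."""
    tag: str
    children: list = field(default_factory=list)
    start: int = 0
    end: int = 0
    level: int = 0


def assign_regions(root):
    """Label every node with <start, end, level> using one depth-first walk."""
    counter = 0

    def visit(node, level):
        nonlocal counter
        counter += 1
        node.start, node.level = counter, level
        for child in node.children:
            visit(child, level + 1)
        counter += 1
        node.end = counter

    visit(root, 1)


def is_ancestor(a, b):
    """Containment property from the slide."""
    return a.start < b.start and a.end > b.end


def is_parent(a, b):
    """Assumed refinement: an ancestor exactly one level above is the parent."""
    return is_ancestor(a, b) and a.level + 1 == b.level


# Tiny document: a1 has children b1 and a2; a2 has child c1.
c1 = Node("c")
a2 = Node("a", [c1])
b1 = Node("b")
a1 = Node("a", [b1, a2])
assign_regions(a1)
```

With these labels `is_ancestor(a1, c1)` holds while `is_parent(a1, c1)` does not, because c1 sits two levels below a1.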
Background: XML twig pattern queries
An XML twig query is a small tree, whose edges include parent-child or ancestor-descendant relationships.
Given an XML document D, and an XML twig query Q, our problem is to find all occurrences of Q on D.
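
To make "all occurrences of Q on D" concrete, here is a brute-force matcher over region-coded elements. It is only a reference sketch for the problem statement, not one of the join algorithms discussed in this talk; the `Elem` tuple, the `(tag, [(axis, subquery), ...])` query encoding, and the function names are assumptions for illustration.

```python
from collections import namedtuple

# Hypothetical region-coded element; `name` is only a readable identifier.
Elem = namedtuple("Elem", "name tag start end level")


def related(anc, desc, axis):
    """Check one query edge: '//' is ancestor-descendant, '/' is parent-child."""
    contains = anc.start < desc.start and anc.end > desc.end
    if axis == "//":
        return contains
    return contains and anc.level + 1 == desc.level


def subtree_matches(qnode, elem, elems):
    """All matches of the query subtree rooted at qnode with its root bound to elem."""
    _tag, edges = qnode
    partial = [(elem.name,)]
    for axis, child in edges:
        child_results = []
        for cand in elems:
            if cand.tag == child[0] and related(elem, cand, axis):
                child_results.extend(subtree_matches(child, cand, elems))
        # Combine every partial match with every match of this child branch.
        partial = [p + c for p in partial for c in child_results]
    return partial


def twig_matches(query, elems):
    """Enumerate every occurrence of the twig query, as tuples in query preorder."""
    results = []
    for e in elems:
        if e.tag == query[0]:
            results.extend(subtree_matches(query, e, elems))
    return results


# Document: a1 has children b1 and a2; a2 has child c1.
DOC = [
    Elem("a1", "a", 1, 8, 1),
    Elem("b1", "b", 2, 3, 2),
    Elem("a2", "a", 4, 7, 2),
    Elem("c1", "c", 5, 6, 3),
]

AD_QUERY = ("a", [("//", ("b", [])), ("//", ("c", []))])  # A with descendants B and C
PC_QUERY = ("a", [("/", ("b", [])), ("/", ("c", []))])    # A with children B and C
```

On this document the ancestor-descendant query has the single match (a1, b1, c1), while the parent-child query has none, since no A element has both a B child and a C child.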

Previous XML Twig Join Algorithms
Techniques:
• Edge based
  • Binary Structural Join [Al-Khalifa et al ICDE02]
  • Join Order Selection [Wu et al ICDE03]
• Path based
  • BLAS [Chen et al SIGMOD04]
• Tree (holistic) based
  • TwigStack [Bruno et al SIGMOD02]
  • TwigStackList [Lu et al CIKM04]
• Index based
  • B tree [Chien et al VLDB02]
  • XR tree [Jiang et al ICDE02]
  • TSGeneric+ [Jiang et al VLDB03]

Holistic Twig Matching
• TwigStack [Bruno et al SIGMOD02] is a holistic twig join algorithm.
  • E.g., for the query A[.//C]//B there may be many matches to A//B alone, but TwigStack outputs results only for A elements that have both B and C descendants.
  • No join order selection is required.
• TwigStack is optimal only for twig patterns that contain ancestor-descendant edges only.
• Reordering the elements in a stream does not help [Choi et al DEXA03].

Sub-optimality of TwigStack
• TwigStack is not optimal for twigs with parent-child edges.
[Figure: a document tree rooted at a1 (elements a1 … an, b1 … bn, c1 … cn) beside the query A with children B and C. Tag streams: a1 a2 … an; b1 b2 … bn; c1 c2 … cn.]

Two Refined Streaming Schemes (1)
• To enlarge the class of queries for which TwigStack is optimal, our paper proposes two refined streaming schemes.
• Tag+Level: elements with the same tag and level are grouped together.
[Figure: the same document and the query A with children B and C. Tag+Level streams shown: a1; a2; a3 … an; b1; b2 b3 … bn; c1 c2 …; cn.]
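
A minimal sketch of Tag+Level partitioning follows. The `Elem` tuple, the sample document, and the `pc_stream_pairs` helper are assumptions for illustration; the helper only shows one reason the level helps with parent-child edges, namely that a child can only come from a stream exactly one level below its parent's stream.

```python
from collections import defaultdict, namedtuple

# Hypothetical region-coded element; `name` is only a readable identifier.
Elem = namedtuple("Elem", "name tag start end level")


def tag_level_streams(elems):
    """Partition elements into streams keyed by (tag, level), each in document order."""
    streams = defaultdict(list)
    for e in sorted(elems, key=lambda e: e.start):
        streams[(e.tag, e.level)].append(e.name)
    return dict(streams)


def pc_stream_pairs(streams, parent_tag, child_tag):
    """Stream pairs that can possibly satisfy the parent-child edge parent_tag/child_tag."""
    return [((parent_tag, lvl), (child_tag, lvl + 1))
            for (tag, lvl) in sorted(streams)
            if tag == parent_tag and (child_tag, lvl + 1) in streams]


# Document: a1 has children b1, a2, e1; a2 has children b2, c1; e1 has child c2.
DOC = [
    Elem("a1", "a", 1, 14, 1),
    Elem("b1", "b", 2, 3, 2),
    Elem("a2", "a", 4, 9, 2),
    Elem("b2", "b", 5, 6, 3),
    Elem("c1", "c", 7, 8, 3),
    Elem("e1", "e", 10, 13, 2),
    Elem("c2", "c", 11, 12, 3),
]

STREAMS = tag_level_streams(DOC)
```

Here c1 and c2 share the stream `("c", 3)` even though they have different parents, which is the kind of grouping that the finer scheme on the following slides separates.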
• For this query, the Tag+Level streaming scheme can guarantee optimality.
• For a more complex query and document, however, Tag+Level cannot guarantee optimality. For example:
[Figure: a document tree with elements a1, a2, b1, b2, c1, c2, d1, d2, d3, e1 beside a query over the nodes A, D, B and C. Tag+Level streams shown: a1; a2; b1; b2; c1 c2; d1; d2, d3.]

Two Refined Streaming Schemes (2)
• Prefix Path Streaming (PPS): elements with the same root-to-node path are grouped together.
[Figure: the document tree with elements a1, a2, b1, b2, c1, c2, d1, d2, d3, e1 and its PPS streams.]
• In this example every element in the document is stored in its own stream.
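
Prefix-path partitioning can be sketched with a single pass over the region-coded elements in document order, keeping a stack of the currently open ancestors. The `Elem` tuple, the slash-joined path key, and the sample document are assumptions for illustration.

```python
from collections import defaultdict, namedtuple

# Hypothetical region-coded element; `name` is only a readable identifier.
Elem = namedtuple("Elem", "name tag start end level")


def prefix_path_streams(elems):
    """Partition elements into streams keyed by their root-to-node tag path."""
    streams = defaultdict(list)
    open_elems = []  # ancestors of the current element, outermost first
    for e in sorted(elems, key=lambda e: e.start):
        # Close every element that ended before e starts.
        while open_elems and open_elems[-1].end < e.start:
            open_elems.pop()
        open_elems.append(e)
        path = "/".join(x.tag for x in open_elems)
        streams[path].append(e.name)
    return dict(streams)


# Document: a1 has children b1, a2, e1; a2 has children b2, c1; e1 has child c2.
DOC = [
    Elem("a1", "a", 1, 14, 1),
    Elem("b1", "b", 2, 3, 2),
    Elem("a2", "a", 4, 9, 2),
    Elem("b2", "b", 5, 6, 3),
    Elem("c1", "c", 7, 8, 3),
    Elem("e1", "e", 10, 13, 2),
    Elem("c2", "c", 11, 12, 3),
]

PPS = prefix_path_streams(DOC)
```

In this sample c1 and c2 have the same tag and level but different prefix paths (`a/a/c` and `a/e/c`), so PPS places them in different streams.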
• PPS is optimal for the following example.
[Figure: the same document and the query over the nodes A, D, B and C, under PPS.]
• d1, d2, c1 and c2 are separated into different streams.
• A natural question: can PPS be guaranteed to be optimal for all queries and data?
• The answer is NO. For example:
[Figure: a document with elements a1 … a4, b1 … b5, c1, c2, d1, d2, e1, e2 beside a query over the nodes A, C, B, E and D; head elements are marked.]
• c1 and c2 are in the same stream. Similarly, e1 and e2 are in the same stream.

A General Algorithm: iTwigJoin
• We propose a general algorithm, called iTwigJoin, which can be used with various data streaming schemes.
• Our key idea is to classify each current head element into one of three classes:
  • Subtree-matching
  • Useless
  • Blocked

Classifying Head Elements
• Subtree-matching element: an element e of tag E is called a subtree-matching element for query Q if
  • e is in a match to QE (QE is the subtree of Q rooted at E); and
  • e is NOT in any future match to QP, where P is the parent of E in Q.
• Useless element: an element e is called a useless element if e is not in any future match to QE.
• Blocked element: an element that is neither subtree-matching nor useless.

Example: Classifying Head Elements (Tag+Level Streaming)
[Figure: document D with elements a1, a2, b1, b2, c1, c2, d1, d2, d3, e1 beside query Q1 over the nodes A, D, B and C. Stream contents shown: a1; d1 d2 d3 …; b1; a2; b2; c1 c2. Head elements are marked.]
Classification of the head elements for Q1:
• Subtree-matching: -
• Useless: -
• Blocked: a1, a2, b1, b2, c1, d1
Classification of the head elements for a second query Q2 over the same nodes A, D, B and C:
• Subtree-matching: d1
• Useless: a1, b2
• Blocked: a2, b1, c1
• A useless element can be discarded safely.
• A subtree-matching element is pushed onto the corresponding stack.
• A blocked element causes problems:
  • It CANNOT be discarded, because that may lose results.
  • It CANNOT simply be pushed onto the stack, because that may produce useless results.
• When all head elements are blocked, optimal holistic matching CANNOT be guaranteed.

iTwigJoin
• In our algorithm, in order to output all correct answers, we push blocked elements onto the stack, which may produce useless intermediate results in some cases.
[Figure: the document with elements a1, a2, b1, b2, c1, c2, d1, d2, d3, e1 and query Q1 over the nodes A, D, B and C, under Tag+Level streaming.]
• Since all head elements are blocked, we have to push a1 onto the stack and output one path solution (a1, d1).
• If there is no c2, then (a1, d1) is a useless path solution.

iTwigJoin: Two Main Components
• Stream Manager: controls the advance operation of the streams and sends elements to temporary storage.
• Temporary Storage: pushes elements onto stacks and outputs intermediate paths.
[Figure: the Stream Manager holding streams (a1; a2; b1; b2; c1 c2 c3 …) and feeding the Temporary Storage, which holds the stacks SA, SB and SC.]

Flowchart of iTwigJoin
1. Label the current head elements as either subtree-matching, useless or blocked.
2. If a useless element is found, discard the useless elements and return to step 1.
3. Otherwise, select a subtree-matching or blocked element e.
4. Pop some elements from the stack.
5. Push e onto the stack, and output intermediate paths if e is a leaf.
6. If not all streams have ended, return to step 1.
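
The selection step can be sketched using the rule given on the backup slide at the end of the deck: prefer the subtree-matching element with the smallest start position, unless some blocked element ends before it starts. The `Head` tuple and `select_next` are hypothetical names, and the behaviour when one of the two classes is empty is an assumption that the slides do not spell out.

```python
from collections import namedtuple

# Hypothetical head element carrying only the positions the rule needs.
Head = namedtuple("Head", "name start end")


def select_next(matching, blocked):
    """Pick the next head element to process, or None if there is none."""
    if not matching and not blocked:
        return None
    first_blocked = min(blocked, key=lambda h: h.start) if blocked else None
    if not matching:
        return first_blocked  # assumption: fall back to the earliest blocked element
    e1 = min(matching, key=lambda h: h.start)
    if not blocked:
        return e1  # assumption: nothing blocked, so take the matching element
    e2 = min(blocked, key=lambda h: h.end)
    # Rule from the backup slide: a blocked element that ends before e1 starts
    # must be handled first, beginning with the earliest-starting blocked element.
    return first_blocked if e2.end < e1.start else e1
```

For instance, with a matching head spanning (10, 11) and blocked heads spanning (3, 4) and (2, 9), the blocked head that starts at 2 is selected, because the blocked head ending at 4 finishes before the matching head starts.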

Optimal Classes of iTwigJoin for Three Streaming Schemes

Streaming scheme          Optimal class
Tag Streaming             A-D only
Tag+Level Streaming       A-D only or P-C only
Prefix Path Streaming     A-D only, P-C only, or 1-branch node

• The more refined the streaming scheme, the larger the optimal class.
[Figure: an example twig pattern A with children B and C for each class.]

Experiments
• Benchmarks
  • XMark: synthetic data
  • Treebank: real data from the Wall Street Journal

Experiments: I/O Performance
[Figure: bar chart of the number of elements scanned (axis 0 to 14,000,000) for queries Tree1 to Tree5; series: TwigStack, TwigStackList, Tag+Level, Prefix.]
• Tree1: A-D only
• Tree2: P-C only
• Tree3: P-C only
• Tree4: 1-branch node
• Tree5: 1-branch node
• By pruning irrelevant streams, PPS usually scans the fewest elements.

Experiments: Number of Intermediate Paths
[Figure: bar chart of the number of intermediate paths output (axis ticks from 1 to 100,000 in powers of ten) for queries Tree1 to Tree5; series: TwigStack, TwigStackList, Tag+Level, Prefix.]
• Tree1: A-D only
• Tree2: P-C only
• Tree3: P-C only
• Tree4: 1-branch node
• Tree5: 1-branch node
• For Treebank query 5 (Tree5) there are no matching results, so Tag+Level and PPS do not output any intermediate results.

Experiments: Running Time
[Figure: bar chart of execution time in seconds (axis 0 to 14) for queries XMark1 to XMark5; series: TwigStack, TwigStackList, Tag+Level, Prefix.]
• XMark1: path pattern
• XMark2: A-D only
• XMark3: P-C only
• XMark4: 1-branch node
• XMark5: non-optimal
• Tag+Level and PPS perform better than TwigStack and TwigStackList on the XMark data.

Experiments: Summary
• Both PPS and Tag+Level help to reduce I/O costs, and PPS saves more.
• PPS may result in too many streams for deep XML data; Tag+Level seems to be a good compromise.
• PPS and Tag+Level completely avoid the output of redundant intermediate paths in all cases we tested, though they cannot guarantee optimality in theory.

Conclusions
• We develop a general algorithm to perform holistic twig joins on the Tag+Level and PPS streaming schemes.
• We identify two I/O optimal classes for the Tag+Level and PPS streaming schemes.
• Since our experiments show that the Tag+Level streaming scheme produces very few useless intermediate results in most cases, we recommend using the Tag+Level scheme for efficient XML twig pattern matching.
END
Thank you! Q & A

Backup: iTwigJoin Algorithm
While (not all streams end)
  1. Label the current head elements as either Matching, Useless or Blocked.
  2. If any head element is Useless, discard it and continue.
  3. Let e1 be the matching element with the smallest startPos;
     let e2 be the blocked element with the smallest endPos.
  4. If e2.endPos < e1.startPos, let e be the blocked element with the smallest startPos; else let e be e1.
  5. Advance the stream e belongs to.
  6. Pop out elements from e’s stack whose endPos < e.startPos.
  7. Push e into its stack if e has a parent/ancestor in the temporary storage system.
  8. Output all paths involving e if the tag of e is a leaf node in Q.
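
Steps 6 to 8 can be sketched as below for a deliberately simplified setting: every query node has a distinct tag, all query edges are ancestor-descendant, and elements arrive in document order. The `Elem` tuple, the `stacks` and `parent_of` dictionaries, and both function names are assumptions for illustration, not the authors' code.

```python
from collections import namedtuple

# Hypothetical region-coded element; `name` is only a readable identifier.
Elem = namedtuple("Elem", "name tag start end")


def contains(a, b):
    """Region containment: a is an ancestor of b."""
    return a.start < b.start and a.end > b.end


def paths_ending_at(e, stacks, parent_of):
    """Enumerate root-to-e paths through ancestors currently held in the stacks."""
    parent_tag = parent_of.get(e.tag)
    if parent_tag is None:
        return [(e.name,)]
    paths = []
    for p in stacks[parent_tag]:
        if contains(p, e):
            paths.extend(prefix + (e.name,)
                         for prefix in paths_ending_at(p, stacks, parent_of))
    return paths


def process(e, stacks, parent_of, leaves):
    """Apply steps 6 to 8 to one selected element and return the paths output."""
    stack = stacks[e.tag]
    # Step 6: pop elements that ended before e starts.
    while stack and stack[-1].end < e.start:
        stack.pop()
    # Step 7: push e only if it is the query root or has an ancestor in storage.
    parent_tag = parent_of.get(e.tag)
    if parent_tag is not None and not any(contains(p, e) for p in stacks[parent_tag]):
        return []
    stack.append(e)
    # Step 8: output paths when e's tag is a leaf node of the query.
    return paths_ending_at(e, stacks, parent_of) if e.tag in leaves else []


# Query A//B over the document a1(b1, a2(b2)), fed in document order.
stacks = {"a": [], "b": []}
parent_of = {"b": "a"}
outputs = [process(e, stacks, parent_of, {"b"}) for e in (
    Elem("a1", "a", 1, 8), Elem("b1", "b", 2, 3),
    Elem("a2", "a", 4, 7), Elem("b2", "b", 5, 6))]
```

The last call outputs two paths for b2, one through each of the nested A elements still on the stack, which is how a single stack entry compactly stands for several partial solutions.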