Efficient Processing of XML Twig Patterns with Parent Child Edges: A Look-ahead Approach
1
Efficient Processing of XML Twig Patterns with Parent Child Edges: A Look-ahead Approach
Presenter: Qi He
2
Outline
☞ XML Twig Pattern Matching
  Problem definition
  State of the Art: TwigStack
  Sub-optimality of TwigStack
Our algorithm TwigStackList
Performance
Conclusion
3
XML Twig Pattern Matching
XML Data Model: An XML document is commonly modeled as a rooted, ordered, and labeled tree.
E.g. document D1, where identifiers (e.g. b1) are given to tree nodes for easy reference.
[Figure: document tree D1, rooted at book (b1), with nodes preface (pf1), chapter (c1, c2), section (s1 to s3), title (t1, t2), paragraph (p1 to p4) and figure (f1 to f3).]
4
XML Twig Pattern Matching: Regional Coding [1]
Node label: (startPos: endPos, LevelNum). startPos and endPos are calculated by performing a pre-order traversal of the document tree; LevelNum is the level of the node in the tree.
E.g. D1:
book (0:50, 1)
  preface (1:3, 2)
    paragraph (2:2, 3)
  chapter (4:22, 2)
    section (5:21, 3)
      title (6:6, 4)
      section (7:12, 4)
        title (8:8, 5)
        paragraph (9:11, 5)
          figure (10:10, 6)
      section (13:17, 4)
        paragraph (14:16, 5)
          figure (15:15, 6)
      paragraph (18:20, 4)
        figure (19:19, 5)
  chapter (23:45, 2)
1. M.P. Consens and T. Milo. Optimizing queries on files. In Proceedings of ACM SIGMOD, 1994.
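As a rough illustration of why regional coding is useful, the sketch below tests ancestor-descendant and parent-child relationships from the labels alone. The `Region` class and the function names are hypothetical, and the strict-containment test is an assumption that matches the sample D1 values on this slide (a parent's interval strictly encloses its child's).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical container for one (startPos: endPos, LevelNum) label."""
    tag: str
    start: int
    end: int
    level: int


def is_ancestor(a: Region, d: Region) -> bool:
    # Assumed test: a's interval strictly contains d's interval.
    return a.start < d.start and d.end < a.end


def is_parent(a: Region, d: Region) -> bool:
    # A parent is an ancestor exactly one level above.
    return is_ancestor(a, d) and d.level == a.level + 1


# Labels copied from the D1 example on the slide.
section = Region("section", 5, 21, 3)
paragraph = Region("paragraph", 9, 11, 5)
figure = Region("figure", 10, 10, 6)

print(is_ancestor(section, figure))  # True
print(is_parent(section, figure))    # False
print(is_parent(paragraph, figure))  # True
```

Both checks need only the two labels, so no tree traversal is required at query time.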
5
XML Twig Pattern Matching: What is a Twig Pattern?
A twig pattern is a small tree whose nodes are predicates (e.g. element type test) and edges are either Parent-Child (P-C) edges or Ancestor-Descendant (A-D) edges.
E.g. the XPath query Q1 selects Figure elements that are descendants of Paragraph elements, which in turn are children of Section elements having at least one child element Title.
[Figure: twig pattern for Q1. Section has children Title and Paragraph (P-C edges); Paragraph has descendant Figure (A-D edge).]
Q1: Section[Title]/Paragraph//Figure
6
XML Twig Pattern Matching: Twig Pattern Matching
Problem Statement: Given a query twig pattern Q and an XML database D that has index structures (e.g. a regional coding scheme) to identify database nodes that satisfy each of Q’s node predicates, compute ALL the answers to Q in D.
E.g. The matches for twig pattern Section[Title]/Paragraph//Figure in the document D1 are:
(s1, t1, p4, f3)
(s2, t2, p2, f1)
[Figure: document tree D1.]
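To make the problem statement concrete, here is a deliberately naive matcher (not TwigStack) that tries every combination of elements and keeps those satisfying all edges of Section[Title]/Paragraph//Figure. The region codes come from the D1 slide; pairing each identifier (s1, t1, ...) with a code is our inference from the two matches listed above, and all function names are hypothetical. Binding by tag works here only because each tag occurs once in the query.

```python
from itertools import product

# (id, tag, startPos, endPos, LevelNum); id-to-code pairing is an assumption.
ELEMENTS = [
    ("s1", "section", 5, 21, 3),
    ("t1", "title", 6, 6, 4),
    ("s2", "section", 7, 12, 4),
    ("t2", "title", 8, 8, 5),
    ("p2", "paragraph", 9, 11, 5),
    ("f1", "figure", 10, 10, 6),
    ("s3", "section", 13, 17, 4),
    ("p3", "paragraph", 14, 16, 5),
    ("f2", "figure", 15, 15, 6),
    ("p4", "paragraph", 18, 20, 4),
    ("f3", "figure", 19, 19, 5),
]

# Q1 as (parent tag, child tag, edge type) triples.
Q1_TAGS = ["section", "title", "paragraph", "figure"]
Q1_EDGES = [
    ("section", "title", "PC"),
    ("section", "paragraph", "PC"),
    ("paragraph", "figure", "AD"),
]


def related(anc, desc, edge):
    """Check one query edge using region codes only."""
    _, _, a_start, a_end, a_level = anc
    _, _, d_start, d_end, d_level = desc
    contains = a_start < d_start and d_end < a_end
    if edge == "PC":
        return contains and d_level == a_level + 1
    return contains


def naive_matches(elements, tags, edges):
    """Enumerate all bindings; exponential, for illustration only."""
    streams = [[e for e in elements if e[1] == t] for t in tags]
    results = []
    for combo in product(*streams):
        binding = dict(zip(tags, combo))
        if all(related(binding[p], binding[c], k) for p, c, k in edges):
            results.append(tuple(e[0] for e in combo))
    return results


print(naive_matches(ELEMENTS, Q1_TAGS, Q1_EDGES))
# [('s1', 't1', 'p4', 'f3'), ('s2', 't2', 'p2', 'f1')]
```

The output agrees with the two matches on the slide; holistic algorithms aim to reach the same answers without enumerating combinations.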
7
XML Twig Pattern Matching: TwigStack [2], a holistic approach
Tag Streaming: all elements of tag q are grouped in a stream Tq ordered by their startPos.
Optimal when all the edges in the twig pattern are A-D edges.
Two-phase algorithm:
Phase 1 (TwigJoin): a list of intermediate paths is output.
Phase 2 (Merge): the intermediate path list is merged to get the result.
2. N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002.
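Tag streaming can be pictured with a small sketch: elements are grouped by tag and each group is sorted by startPos. The function name `build_streams` and the tuple layout are hypothetical; the sample labels are taken from the D1 region codes.

```python
from collections import defaultdict


def build_streams(elements):
    """Group (tag, start, end, level) tuples into per-tag lists sorted by start."""
    streams = defaultdict(list)
    for tag, start, end, level in elements:
        streams[tag].append((start, end, level))
    for tag in streams:
        streams[tag].sort()  # tuples sort by startPos first
    return dict(streams)


# A few D1 elements in arbitrary order.
sample = [
    ("paragraph", 9, 11, 5),
    ("section", 13, 17, 4),
    ("figure", 10, 10, 6),
    ("section", 5, 21, 3),
    ("section", 7, 12, 4),
]
streams = build_streams(sample)
print(streams["section"])  # [(5, 21, 3), (7, 12, 4), (13, 17, 4)]
```

Because every stream is in document order, a join algorithm can advance through each one with a single forward scan.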
8
XML Twig Pattern Matching: TwigStack Review
A node q in a twig pattern Q is coupled with a stack Sq.
An element e is pushed into its stack if and only if e is in some match to Q.
E.g. only the color-highlighted elements are pushed into their stacks. Thus it is ensured that no redundant paths are output.
An element e is popped from its stack once all matches involving it have been reported.
Thus we ensure that the memory space used by the stacks is bounded.
Q: Section[//Title]//Paragraph//Figure
[Figure: stacks SSection, STitle, SParagraph and SFigure next to document tree D1, with the pushed elements highlighted.]
9
XML Twig Pattern Matching
Optimality of TwigStack for twig patterns with only A-D edges:
Each stream Tq is scanned only once, where q appears in the twig pattern.
No redundant intermediate results: all intermediate paths output in Phase 1 appear in the final result.
CPU and I/O cost: O(|Input| + |Output|)
Space complexity: O(|Longest Path in the XML tree|)
10
Sub-optimality of TwigStack
Unfortunately, TwigStack is sub-optimal for queries with any parent-child relationship.
For queries with parent-child relationships, TwigStack may output a large number of intermediate results that are not merge-joinable to final solutions.
11
Example for sub-optimality of TwigStack
[Figure: a twig pattern (Section with title and paragraph below it, figure below paragraph) next to a simple XML tree with elements s1, t1, p1, t2, f2.]
TwigStack outputs (s1,t1) as an intermediate result, since s1 has descendants t1 and p1, and p1 in turn has a descendant f2.
Observe that p1 has no child with tag figure, so there is no match in this XML tree. So (s1,t1) is a “useless” solution.
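The effect can be reproduced with a brute-force check. Everything about the tree shape below is an assumption, since the slide figure is not available: we place t1 and p1 directly under s1, t2 under p1, and f2 under t2, and we read the query as Section[//title]//paragraph/figure. Under a descendant-only reading of the edges s1 looks promising, while the exact query has no match.

```python
from itertools import product

# name -> (tag, startPos, endPos, LevelNum); this tree shape is assumed.
ELEMENTS = {
    "s1": ("section", 1, 8, 1),
    "t1": ("title", 2, 2, 2),
    "p1": ("paragraph", 3, 7, 2),
    "t2": ("title", 4, 6, 3),
    "f2": ("figure", 5, 5, 4),
}

TAGS = ["section", "title", "paragraph", "figure"]
EDGES = [
    ("section", "title", "AD"),
    ("section", "paragraph", "AD"),
    ("paragraph", "figure", "PC"),
]


def holds(anc, desc, edge):
    """Check one edge from region codes; P-C additionally needs adjacent levels."""
    _, a_start, a_end, a_level = anc
    _, d_start, d_end, d_level = desc
    inside = a_start < d_start and d_end < a_end
    return inside and (edge == "AD" or d_level == a_level + 1)


def count_matches(edges):
    """Count bindings of one element per query tag that satisfy every edge."""
    streams = [[e for e in ELEMENTS.values() if e[0] == t] for t in TAGS]
    total = 0
    for combo in product(*streams):
        binding = dict(zip(TAGS, combo))
        if all(holds(binding[p], binding[c], k) for p, c, k in edges):
            total += 1
    return total


relaxed = [(p, c, "AD") for p, c, _ in EDGES]
print(count_matches(relaxed))  # 2: s1 qualifies if every edge is treated as A-D
print(count_matches(EDGES))    # 0: the parent-child edge rules out every binding
```

The gap between the two counts is exactly the room for useless intermediate paths such as (s1,t1).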
12
Main problem and my experiment
As shown before, TwigStack might output some intermediate results that are not merge-joinable to final solutions for queries with parent-child edges.
To get a better understanding, we run TwigStack on a real dataset.
Data set: TreeBank [UW XML repository]
Queries:
Q1: VP[/DT]//PRP_DOLLAR_
Q2: S//NP[//PP/TO][/VP/_NONE_]/JJ
Q3: S[/JJ]/NP
All queries contain parent-child relationships.
13
Our experimental results
Query | Intermediate paths by TwigStack | Merge-joinable paths | Percentage of useless intermediate paths
Q1 | 10,663 | 5 | 99.9%
Q2 | 24,493 | 49 | 99.5%
Q3 | 70,967 | 10 | 99.9%
Most intermediate paths do not contribute to final answers due to parent-child edges!
It is a big challenge to improve TwigStack to answer queries with parent-child edges.
14
Our intuitive observation
We can improve TwigStack for queries like the one in the previous example.
[Figure: the twig pattern (Section, title, paragraph, figure) next to a simple XML tree with elements s1, t1, p1, t2, f1.]
Our intuitive observation: why not read more paragraph elements and cache them in main memory?
For example, in this XML tree, after we scan p1, we do not stop but continue to read the next element. We then find that there is only one paragraph element and that f1 is not a child of it. So we should not output any solution.
15
Outline
XML Twig Pattern Matching
  Problem definition
  State of the Art: TwigStack
  Sub-optimality of TwigStack
☞ Our algorithm TwigStackList
Experimental results
Conclusion
16
Our main idea
Main idea: we read more elements in the input stream and cache some of them in the main memory so that we can make a more accurate decision about whether an element can contribute to final answer.
One desideratum: we cannot cache too many elements in main memory. For each node q in the twig query, the number of elements with tag q cached in main memory should not be greater than the length of the longest path in the XML dataset.
17
Our caching strategy
What elements should be cached in main memory? Only those that may contribute to final answers.
[Figure: a twig pattern (Section with title and paragraph) next to a simple XML tree with elements s1, s2, s3, s4, t1, p1.]
We only need to cache s1, s2 and s4 in main memory. Why not s3? Because if s3 contributed to a final answer, there would be an element before p1 that is a child of s3. But p1 is the first element, so s3 is guaranteed not to contribute to any final answer.
18
Our criteria for pushing an element to stack
Whether an element can be pushed into a stack is very important for controlling intermediate results. Why?
Because once an element is pushed into a stack, it is ready to be output. So the fewer elements are pushed into the stacks, the fewer intermediate results are output.
Our criteria: given an element eq from stream Tq, before eq is pushed into stack Sq, we ensure that
(i) element eq has a descendant eq’ for each child q’ of q;
(ii) if (q, q’) is a parent-child relationship, eq’ has a parent with tag q in the path from eq to eqmax, where eqmax is the descendant of eq with the maximal start value; and
(iii) each q’ recursively satisfies the first two conditions.
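The three conditions can be read as a recursive test. The sketch below is a simplification under loud assumptions: it works on a small in-memory tree instead of streams and lists, it ignores the "path from eq to eqmax" restriction, and the names `Elem`, `QNode` and `may_push` are hypothetical. It is meant only to show the shape of the check, not the TwigStackList procedure itself.

```python
from dataclasses import dataclass, field


@dataclass
class Elem:
    name: str
    tag: str
    children: list = field(default_factory=list)


@dataclass
class QNode:
    tag: str
    # Each entry is (child query node, "PC" or "AD").
    children: list = field(default_factory=list)


def descendants_with_parent(e):
    """Yield (parent, descendant) pairs for every proper descendant of e."""
    for c in e.children:
        yield e, c
        yield from descendants_with_parent(c)


def may_push(e, q):
    """Simplified in-memory reading of criteria (i)-(iii)."""
    for q_child, edge in q.children:
        found = False
        for parent, d in descendants_with_parent(e):
            if d.tag != q_child.tag:
                continue  # (i) need a descendant with tag q'
            if edge == "PC" and parent.tag != q.tag:
                continue  # (ii) its parent must carry tag q
            if may_push(d, q_child):  # (iii) recurse on q'
                found = True
                break
        if not found:
            return False
    return True


# Query Section[Title]/Paragraph//Figure from the earlier slides.
query = QNode("section", [
    (QNode("title"), "PC"),
    (QNode("paragraph", [(QNode("figure"), "AD")]), "PC"),
])

# Assumed tree A: p1 is a child of s1.
tree_a = Elem("s1", "section", [
    Elem("t1", "title"),
    Elem("p1", "paragraph", [Elem("f1", "figure")]),
])

# Assumed tree B: p1 hangs under some other element o1.
tree_b = Elem("s1", "section", [
    Elem("t1", "title"),
    Elem("o1", "other", [Elem("p1", "paragraph", [Elem("f1", "figure")])]),
])

print(may_push(tree_a, query))  # True
print(may_push(tree_b, query))  # False
```

In tree B the paragraph is still a descendant of s1, but its parent does not carry the tag section, so condition (ii) fails.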
19
Examples
Let us look at two examples to understand the criteria.
[Figure: a twig pattern (Section, title, paragraph, figure) next to a simple XML tree with elements s1, s2, s3, t1, p1, f1.]
Element s1 can be pushed into the stack, but s2 and s3 cannot. Note that s1 can be pushed not just because t1, p1 and f1 are its descendants but, more importantly, because in the path from s1 to f1, elements t1, p1 and f1 can find their parents with tag section.
20
Examples
[Figure: a twig pattern (Section, title, paragraph, figure) next to a simple XML tree with elements s1, s2, o1, t1, p1, f1.]
In this example, s1 cannot be pushed into the stack. Although elements t1, p1 and f1 are still descendants of s1, in the path from s1 to f1 element p1 cannot find a parent with tag section. Observe that the parent of p1 is o1 (o1 stands for some other element).
In this example, we cache s1 and s2 in main memory, since they might be involved in query answers later.
21
TwigStackList
We propose a novel holistic twig algorithm, TwigStackList, to evaluate a twig query. Unlike the previous TwigStack, TwigStackList has the following features:
It considers the parent-child edges in the query and strengthens the criteria for elements to be pushed into the stacks.
It uses a list data structure to cache some elements that are likely to participate in final solutions. The number of elements in any list is strictly bounded by the length of the longest path in the dataset.
It has a broader class of optimal queries. TwigStackList can guarantee that each output intermediate solution contributes to final answers when the query contains only ancestor-descendant edges below branching nodes.
22
Example
TwigStackList is I/O optimal for the following query. In contrast, TwigStack is sub-optimal. Note that below the branching node section, all edges in the query are A-D relationships.
[Figure: a twig pattern (Section, title, paragraph, figure) next to a simple XML tree with elements s1, t1, p1, t2, f1.]
In this case, TwigStackList does not push s1 into the stack and thereby avoids outputting (s1,t1). But TwigStack pushes s1 into the stack and outputs (s1,t1). Observe that (s1,t1) is a useless intermediate solution.
23
Sub-optimality of TwigStackList
Although TwigStackList broadens the class of optimal queries compared to TwigStack, TwigStackList is still sub-optimal for queries with parent-child edges below branching nodes.
[Figure: a twig pattern (Section with title and paragraph) next to a simple XML tree with elements s1, s2, t1, f1.]
Observe that there is no matching solution for this dataset. But TwigStackList caches s1 and s2 in the list and pushes s1 into the stack. So (s1,t1) will be output as a useless solution.
24
Outline
XML Twig Pattern Matching
  Problem definition
  State of the Art: TwigStack
  Sub-optimality of TwigStack
Our algorithm TwigStackList
☞ Experimental results
Conclusion
25
Experimental Setting
Pentium 4 CPU, RAM 768MB, disk 2GB
TreeBank: maximal depth 36, 2.4 million nodes
DTD data: a → bc | cb | d, c → a; a and c are non-terminals, b and d are terminals
Random: seven tags (a, b, c, d, e, f, g), uniformly distributed; fan-out of elements varied 2-100, depth varied 10-100
26
Performance against TreeBank
Queries with XPath expressions:
Q1 S[//MD]//ADJ
Q2 S/VP/PP[/NP/VBN]/IN
Q3 S/VP//PP[//NP/VBN]//IN
Q4 VP[/DT]//PRP_DOLLAR_
Q5 S[//VP/IN]//NP
Q6 S[/JJ]/NP
Number of intermediate path solutions for TwigStackList vs. TwigStack
Query | TwigStack | TwigStackList | Reduction percentage | Useful paths
Q1 | 35 | 35 | 0% | 35
Q2 | 2957 | 143 | 95% | 92
Q3 | 25892 | 4612 | 82% | 4612
Q4 | 10663 | 11 | 99.9% | 5
Q5 | 702391 | 22565 | 96.8% | 22565
Q6 | 70988 | 30 | 99.9% | 10
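The reduction column can be recomputed from the raw counts. We assume here that it is defined as (TwigStack - TwigStackList) / TwigStack, which reproduces the table up to rounding or truncation; the helper names are hypothetical. The same data also shows for which queries TwigStackList output only useful paths.

```python
# Query -> (TwigStack paths, TwigStackList paths, useful paths), from the table.
rows = {
    "Q1": (35, 35, 35),
    "Q2": (2957, 143, 92),
    "Q3": (25892, 4612, 4612),
    "Q4": (10663, 11, 5),
    "Q5": (702391, 22565, 22565),
    "Q6": (70988, 30, 10),
}


def reduction(twigstack, twigstacklist):
    """Assumed definition: share of TwigStack's paths that TwigStackList avoids."""
    return 100.0 * (twigstack - twigstacklist) / twigstack


# Queries where every path output by TwigStackList is useful.
optimal = [q for q, (_, tsl, useful) in rows.items() if tsl == useful]

for q, (ts, tsl, useful) in rows.items():
    print(q, f"reduction={reduction(ts, tsl):.2f}%", f"useless left={tsl - useful}")
print("TwigStackList optimal on:", optimal)
```

On this data the list of optimal queries comes out as Q1, Q3 and Q5.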
27
Performance analysis
We have three observations:
(1) When queries contain only ancestor-descendant edges, the two algorithms have similar performance. See Q1.
(2) When edges below branching nodes contain only ancestor-descendant relationships, TwigStackList is optimal, but TwigStack is sub-optimal. See Q3 and Q5.
(3) When edges below branching nodes contain parent-child relationships, both TwigStack and TwigStackList are sub-optimal. But TwigStackList typically outputs far fewer “useless” intermediate solutions than TwigStack. See Q2, Q4 and Q6.
28
Performance against DTD data
There is no matching solution for query a[//b]//c/d in the DTD dataset. But TwigStack outputs many redundant path solutions. In contrast, TwigStackList is optimal and significantly outperforms TwigStack on this query.
[Figure: number of intermediate solutions (y-axis, 0 to 1,800,000) vs. fraction of the number of elements with tag d relative to the number of elements with tag b and c (x-axis, 10% to 90%), for TwigStack and TwigStackList.]
[Figure: execution time in seconds (y-axis, 0 to 35) vs. fraction of the number of elements with tag d relative to the number of elements with tag b and c (x-axis, 10% to 90%), for TwigStack and TwigStackList.]
29
Performance against random dataset
[Figure: twig queries (a) Q1, (b) Q2, (c) Q3, (d) Q4, (e) Q5 over the tags a to g.]
For all queries, TwigStackList is again more efficient than TwigStack in terms of the size of intermediate results:
Query | TwigStack | TwigStackList | Reduction percentage | Useful paths
Q1 | 9048 | 4354 | 52% | 2077
Q2 | 1098 | 467 | 57% | 100
Q3 | 25901 | 14476 | 44% | 14476
Q4 | 32875 | 16775 | 49% | 16775
Q5 | 3896 | 1320 | 66% | 566
30
Outline
XML Twig Pattern Matching
  Problem definition
  State of the Art: TwigStack
  Sub-optimality of TwigStack
Our algorithm TwigStackList
Experimental results
☞ Conclusion
31
Conclusion
The previous algorithm, TwigStack, is sub-optimal for queries with parent-child edges.
We propose a new algorithm, TwigStackList, to address this problem.
TwigStackList broadens the class of queries with I/O optimality.
Experiments show that TwigStackList typically outputs far fewer useless intermediate results when the query contains parent-child relationships.
We recommend using TwigStackList to evaluate queries with parent-child relationships.
32
Backup questions:
1. Turn back to the slide about “Performance against DTD data”. In the two figures, what is the X-axis?
The X-axis shows the ratio of the number of elements with tag d to the number of elements with tags b and c. This ratio is important because, according to the DTD a → bc | cb | d, c → a, for query a[//b]//c/d, as the ratio decreases, the “useless” intermediate results output by TwigStack increase. In contrast, TwigStackList is optimal in this case, so it is not affected by changes in the ratio. Therefore, we show the superiority of TwigStackList over TwigStack by varying the ratio.
33
Backup questions:
2. You say that TwigStackList is more efficient than TwigStack, since it outputs fewer intermediate results. So it is easy to understand that TwigStackList is better than TwigStack in terms of I/O cost, but how about CPU cost?
For evaluating queries with parent-child relationships, TwigStackList is more efficient than TwigStack in terms of not only intermediate result size but also execution time. Of course, TwigStackList needs to scan the elements cached in main memory, which slightly increases the CPU cost. But compared to the great benefit from the reduction of I/O cost, this cost is worthwhile.