Incremental Maintenance of XML Structural Indexes
Ke Yi (1), Hao He (1), Ioana Stanoi (2), and Jun Yang (1)
(1) Department of Computer Science, Duke University
(2) IBM T. J. Watson Research Center
Motivation
XML has gained tremendously in popularity in recent years
  Used to represent many kinds of data
Major DB vendors are rushing to incorporate solutions for native XML repositories and retrieval
  IBM DB2, Oracle, Microsoft SQL Server
  Tamino, Natix, X-Hive, …
Overview
[Figure: example data graph with nodes numbered 1 to 18 and labels paper, section, title, algorithm, proof, exp, uses, and about; the section titles are "intro", "1-index", "A(k)-index", and "experiments"]
Label Path Expressions
[Figure: the example data graph (nodes 1 to 18), illustrating the path expression below]
/paper/section/algorithm
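A minimal sketch of how a label path expression such as /paper/section/algorithm can be evaluated directly on a data graph, one label step at a time. The small graph below is a hypothetical fragment loosely modeled on the example figure, and `eval_path` is an illustrative helper, not something defined in the talk.

```python
from collections import defaultdict

def eval_path(labels, edges, root, path):
    """Return the set of nodes reached from `root` by following `path`.

    labels: dict node -> label; edges: iterable of (parent, child) pairs.
    path: a string such as "/paper/section/algorithm".
    """
    children = defaultdict(set)
    for u, v in edges:
        children[u].add(v)

    steps = [s for s in path.split("/") if s]
    # The first step must match the label of the root itself.
    if not steps or labels[root] != steps[0]:
        return set()
    frontier = {root}
    for step in steps[1:]:
        # Keep only the children whose label matches the next step.
        frontier = {c for n in frontier for c in children[n] if labels[c] == step}
    return frontier

# Hypothetical fragment: one paper with two sections (assumed edges).
labels = {1: "paper", 4: "section", 5: "title", 6: "algorithm",
          8: "section", 9: "title", 10: "algorithm"}
edges = [(1, 4), (4, 5), (4, 6), (1, 8), (8, 9), (8, 10)]

print(sorted(eval_path(labels, edges, 1, "/paper/section/algorithm")))  # [6, 10]
```

Every step touches all children of the current frontier, which is the cost a structural index is meant to reduce.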
Structural Indexes
Why do we need them?
  Speed up the evaluation of path expressions
  Provide a structural summary of the data graph
Structural indexes
  DataGuide [Goldman & Widom 97]
  1-index [Milo & Suciu 99]
  A(k)-index [Kaushik et al. 02], D(k)-index [Qun et al. 03], M(k)-index [He & Yang 04]
  Integration of structural indexes and inverted lists [Kaushik et al. 04]
Focus on maintenance
  Has a major effect on index efficiency
  Remains an overlooked issue
Outline
[Figure: the example data graph doubling as the talk outline, with sections "intro", "1-index", "A(k)-index", and "experiments"]
1-Index: Definition
Constructed by using bisimilarity
Definition based on stability
  Partition data nodes (dnodes) into index nodes (inodes): dnode v belongs to inode I[v]
  I[u] is v’s index parent if u is v’s parent
  An inode is stable if all of its dnodes have the same index parents
  In a 1-index, all inodes are stable
[Figure: dnode v inside inode I[v], and its parent u inside inode I[u]]
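The stability condition can be checked mechanically. The sketch below is illustrative only: it assumes a partition is given as a list of sets of dnodes and the data graph as (parent, child) pairs, and the names `index_parents` and `unstable_inodes` are hypothetical helpers.

```python
def index_parents(partition, edges):
    """Map each dnode to the set of inodes that contain one of its parents."""
    inode_of = {v: i for i, block in enumerate(partition) for v in block}
    result = {v: set() for v in inode_of}
    for u, v in edges:
        result[v].add(inode_of[u])
    return result

def unstable_inodes(partition, edges):
    """Return the positions of inodes whose dnodes disagree on index parents."""
    parents = index_parents(partition, edges)
    bad = []
    for i, block in enumerate(partition):
        if len({frozenset(parents[v]) for v in block}) > 1:
            bad.append(i)
    return bad

# Hypothetical graph: c1 has parent a only, c2 has parents a and b.
edges = [("r", "a"), ("r", "b"), ("a", "c1"), ("a", "c2"), ("b", "c2")]

coarse = [{"r"}, {"a"}, {"b"}, {"c1", "c2"}]
fine = [{"r"}, {"a"}, {"b"}, {"c1"}, {"c2"}]
print(unstable_inodes(coarse, edges))  # [3]
print(unstable_inodes(fine, edges))    # []
```

A partition is a valid 1-index under this definition exactly when `unstable_inodes` returns an empty list (and each inode holds dnodes of a single label, which the sketch does not check).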
1-Index: Example
[Figure: the example data graph (nodes 1 to 18) next to its 1-index; the inodes include {1} paper, {2,4,8,13} section, {3,5,9,14} title, {6,10} algorithm, and the groups {7}, {11}, {12}, {15,16}, and {17,18}]
/paper/section/algorithm
1-Index: Quality
Assigning bisimilar dnodes to different inodes does not affect correctness, but does affect efficiency
The quality of an index: $\left(\frac{\#\,\text{inodes}}{\#\,\text{inodes in the minimum 1-index}} - 1\right) \times 100\%$
[Figure: the example 1-index with the section inode {2,4,8,13} split into {2,4} and {8,13}, which is still correct but no longer minimum]
Ideal: quality = 0%
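As a small worked example of the quality measure, the sketch below computes it from two inode counts. The function name `index_quality` and the counts used are hypothetical, chosen only to illustrate the arithmetic.

```python
def index_quality(num_inodes, num_min_inodes):
    """Quality in percent; 0 means the index is as small as the minimum 1-index."""
    if num_min_inodes <= 0:
        raise ValueError("the minimum 1-index has at least one inode")
    return (num_inodes / num_min_inodes - 1) * 100

# Hypothetical counts: 105 inodes where the minimum 1-index needs 100.
print(round(index_quality(105, 100), 6))  # 5.0
print(index_quality(100, 100))            # 0.0
```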
Previous Results
Construction
  The PT algorithm [Paige & Tarjan 87] runs in O(m log n) time
  m: number of edges, n: number of nodes
Edge changes
  The propagate algorithm [Kaushik et al. 02]
  Quality of the 1-index after update
    No guarantee on the quality of the resulting index
    3% to 5% after 500 edge insertions in experiments
Subgraph addition
  Index reconstruction
Edge Insertion: An Example (1)
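[Figure: data graph with nodes R; A, B; C1, C2, C3; D1, D2, D3]
1-index before the update: R | A | B | {C1, C2} | C3 | {D1, D2} | D3
Split 1: {C1, C2} is split into C1 and C2; {D1, D2} is still one inode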
Edge Insertion: An Example (2)
Split 2: {D1, D2} is split into D1 and D2
Merge 1: C2 is merged with C3, giving {C2, C3}; C1 stays alone
Merge 2: D2 is merged with D3, giving {D2, D3}; D1 stays alone
The result is indeed the minimum 1-index for the data graph after the update. Not a coincidence!
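The end result of the example can be cross-checked by recomputing the minimum 1-index from scratch. The sketch below uses naive partition refinement, not the split/merge algorithm of the talk and not the PT algorithm. The edge list, and the choice of B to C2 as the inserted edge, are assumptions picked to be consistent with the partitions shown on the slides; the slides' figure itself is not recoverable here.

```python
from collections import defaultdict

def minimum_1_index(labels, edges):
    """Coarsest stable partition of the dnodes, by naive refinement."""
    parents = defaultdict(set)
    for u, v in edges:
        parents[v].add(u)

    block = dict(labels)  # start from the partition by label
    num_blocks = len(set(block.values()))
    while True:
        # Signature of a node: its current block plus the blocks of its parents.
        sig = {v: (block[v], frozenset(block[u] for u in parents[v]))
               for v in labels}
        ids = {}
        block = {v: ids.setdefault(sig[v], len(ids)) for v in labels}
        if len(ids) == num_blocks:  # nothing was split: the partition is stable
            break
        num_blocks = len(ids)

    groups = defaultdict(set)
    for v, b in block.items():
        groups[b].add(v)
    return {frozenset(g) for g in groups.values()}

# Assumed data graph (edges are a guess consistent with the slides).
labels = {"R": "R", "A": "A", "B": "B",
          "C1": "C", "C2": "C", "C3": "C",
          "D1": "D", "D2": "D", "D3": "D"}
edges = [("R", "A"), ("R", "B"), ("A", "C1"), ("A", "C2"), ("A", "C3"),
         ("B", "C3"), ("C1", "D1"), ("C2", "D2"), ("C3", "D3")]

before = minimum_1_index(labels, edges)
after = minimum_1_index(labels, edges + [("B", "C2")])  # assumed inserted edge
print(frozenset({"C1", "C2"}) in before, frozenset({"C2", "C3"}) in after)
```

Under these assumptions the recomputed partitions match the "1-index before the update" and the state after Merge 2, at the cost of touching the whole graph on every update.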
Minimum & Minimal Indexes
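Minimum: with the smallest number of inodes
Minimal: no two inodes can be merged
[Figure: three panels captioned "Data graph", "Minimum 1-index", and "Minimal 1-index" over nodes R, A1, A2, B1, B2; one index keeps A1, A2, B1, B2 as separate inodes, the other merges them into {A1, A2} and {B1, B2}]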
Quality Guarantee
Theorem: The split/merge algorithm always maintains a minimal 1-index
Lemma: For acyclic data graphs, there is a unique minimal 1-index
  The minimum 1-index is always maintained
For cyclic data graphs, there could be more than one minimal 1-index
  One of them is maintained
Outline
[Figure: the example data graph doubling as the talk outline, with sections "intro", "1-index", "A(k)-index", and "experiments"]
A(k)-Index: Definition
k-bisimilarity
Definition based on stability
  A(0)-index: partition by label
  …
  A(k)-index: an inode is stable if all of its dnodes have the same index parents in the A(k-1)-index
Only interested in paths of length ≤ k
Shown to be much smaller and more efficient than the 1-index [Kaushik et al. 02]
But no efficient maintenance algorithms are known!
A(k)-index: Example
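[Figure: data graph with nodes R; A, B; C1 to C6, shown next to its A(2), A(1), and A(0) indexes]
A(2) (= 1-index): R | A | B | {C1} | {C2, C3} | {C4} | {C5, C6}
A(1): R | A | B | {C1} | {C2, C3} | {C4, C5, C6}
A(0): R | A | B | {C1, C2, C3, C4, C5, C6}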
Maintenance of A(i)-index requires the information in A(i-1)-index
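The dependence of A(i) on A(i-1) is visible in a from-scratch construction: each round refines the previous level using index parents taken from that level. The sketch below is a naive illustration, not the maintenance algorithm of the talk. The edge list is an assumption chosen to be consistent with the A(0), A(1), and A(2) partitions of the example, and `a_k_index` is a hypothetical helper name.

```python
from collections import defaultdict

def a_k_index(labels, edges, k):
    """Partition of the dnodes into A(k)-index inodes, built level by level."""
    parents = defaultdict(set)
    for u, v in edges:
        parents[v].add(u)

    block = dict(labels)  # A(0): partition by label
    for _ in range(k):
        # Build A(i) from A(i-1): same A(i-1) block and same A(i-1) index parents.
        sig = {v: (block[v], frozenset(block[u] for u in parents[v]))
               for v in labels}
        ids = {}
        block = {v: ids.setdefault(sig[v], len(ids)) for v in labels}

    groups = defaultdict(set)
    for v, b in block.items():
        groups[b].add(v)
    return {frozenset(g) for g in groups.values()}

# Assumed data graph (edges are a guess consistent with the example).
labels = {"R": "R", "A": "A", "B": "B"}
labels.update({f"C{i}": "C" for i in range(1, 7)})
edges = [("R", "A"), ("R", "B"), ("A", "C1"), ("B", "C2"), ("B", "C3"),
         ("C1", "C4"), ("C2", "C5"), ("C3", "C6")]

for k in range(3):
    print(k, sorted(sorted(g) for g in a_k_index(labels, edges, k)))
```

For this assumed graph, A(2) already coincides with the 1-index, so further rounds would not split any inode.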
A(k)-index: Refinement Tree
[Figure: the data graph with A(2), A(1), and A(0) linked as a refinement tree; the C inodes are shown by label only: four in A(2), three in A(1), one in A(0)]
0.5% ~ 13% additional storage
1. Reduce storage cost
2. Reduce maintenance cost
Quality Guarantee
Theorem: The split/merge algorithm always maintains the minimum A(k)-index
Lemma: There is a unique minimal A(k)-index for any data graph, acyclic or cyclic

What is maintained:
            1-index    A(k)-index
Acyclic     minimum    minimum
Cyclic      minimal    minimum
Outline
[Figure: the example data graph doubling as the talk outline, with sections "intro", "1-index", "A(k)-index", and "experiments"]
Experiments on Edge Changes
Datasets
  Real-life: IMDB (272,000 nodes)
  Benchmark: XMark (198,000 nodes)
Setup
  First delete a portion of the existing ID-REF links
  Then do random mixed insertions/deletions
Compare with
  1-index: propagate (+ reconstruction)
  A(k)-index: recompute affected portion (+ reconstruction)
Experiment Results: 1-index
Experiment Results: A(k)-index
k    speedup
2    1.35
3    6.15
4    16.6
5    15.3
[Figure: running times]
Conclusions
The first solutions for the maintenance (edge and subgraph additions/deletions) of the 1-index and A(k)-index that are both effective and efficient
  Effective: quality guarantee on the resulting index
  Efficient: the algorithms themselves are fast
Thank you!
Graphical Illustration
[Figure: plot with axes labeled "size" and "index", marking the valid 1-indexes and the effect of split and merge steps]
The index can only grow in size due to splitting if merging is not enforced