Storing and Querying Multi-version XML Documents using Durable Node Numbers
Shu-Yao Chien
Dept. of CS
Vassilis J. Tsotras
Dept. of CS&E
Carlo Zaniolo
Dept. of CS
Donghui Zhang
Dept. of CS&E
Traditional applications migrating to the web:
– Software configuration management
– Cooperative work
– CAD
An array of web-based applications:
– Web content providers and trackers
– Link Permanence
– WebDAV
Document Version Management
An assortment of new and old applications seek from XML a shared technology and toolset to support their assorted requirements
Main requirements and research challenges:
– Efficient version retrieval
– Storage efficiency
– Complex query support
Problem Definition
The naive approach stores each version in its entirety: it minimizes retrieval cost, but storage is very inefficient.
RCS (Revision Control System):
– stores the latest version in its entirety, and
– represents old versions by deltas --- reverse edit scripts
– minimizes storage cost
– version retrieval cost grows linearly with version number
SCCS (Source Code Control System):
– objects are timestamped and stored by their document order
– version retrieval cost can be as high as reading the whole change history
These schemes are used by most current systems---but need improvements in storage management, retrieval, query, and support for complex objects.
Traditional Versioning Schemes
DBs for CAD and for semi-structured information paid much attention to version support
Temporal DBs: efficient support for transaction time by various indexing schemes: Snapshot Index, Multi-Version B+-Trees, etc.
But typical DBs do not support object ordering (since reconstruction of complete document is not a critical query)
Numbering schemes are proposed to represent document structure and enhance efficiency in evaluating regular path expressions.
Databases --- Temporal, OO, Semi-structured, XML DB, …
UBCC [WebDB200] enhances RCS with page management
Flexibility of trading off storage and retrieval costs
Using the concept of Page Usefulness
Captures the information on the order of the objects in the document in the (forward) edit script
Storage Level Enhancement
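[Figure: one page holding objects A, B, C, D, shown at three versions]
Version   Page Usefulness
T1        100%
T2        75% (one DEL)
T3        25% (two more DELs)
We set a minimum usefulness requirement Umin, e.g. 70% (0 < Umin <= 1).
A page is useful when its usefulness is above Umin and useless when it is below Umin.
With Umin = 70%, the page above is useful at T1 and T2 and useless at T3.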
Page Usefulness – by Example
VERSION 1 (stored in pages P1 and P2): Root, Ch A, Fig D, Sec E, Ch B, Sec F, Fig G, Fig H
VERSION 2 (edit script): INS(Sec J), DEL, INS(Fig M), DEL, DEL, INS(Ch K), INS(Sec L)
• STEP 1 : Determine page usefulness for copying.
U(P1) = 75%, U(P2) = 50% < Umin = 70%
• STEP 2 : Append new/copied objects into new pages by their logical order.
P3: Sec J, Ch B, Sec F, Fig M (COPY: Ch B, Sec F)
P4: Ch K, Sec L
U(P3) = 100%, U(P4) = 100%
Usefulness Based Copy Control (UBCC)
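The copy decision in STEP 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the assignment of four objects per page, and the choice of Sec E as the object deleted from P1 are assumptions (the slide only gives U(P1) = 75%). Treating usefulness exactly equal to Umin as useful is also an assumption.

```python
def objects_to_copy(pages, deleted, u_min):
    """Return surviving objects of pages whose usefulness fell below u_min.

    pages:   dict mapping page name -> list of objects in stored order
    deleted: set of objects removed by the new version
    """
    copied = []
    for objects in pages.values():
        survivors = [obj for obj in objects if obj not in deleted]
        usefulness = len(survivors) / len(objects)
        # Assumption: a page with usefulness == u_min still counts as useful.
        if usefulness < u_min:
            copied.extend(survivors)
    return copied


# Assumption: Version 1 fills P1 and P2 with four objects each.
version1 = {
    "P1": ["Root", "Ch A", "Fig D", "Sec E"],
    "P2": ["Ch B", "Sec F", "Fig G", "Fig H"],
}
# Assumption: Sec E is an arbitrary pick for the single deletion in P1.
deleted_in_v2 = {"Sec E", "Fig G", "Fig H"}

print(objects_to_copy(version1, deleted_in_v2, u_min=0.70))
```

P1 keeps 3 of 4 objects (75%) and is left alone; P2 keeps 2 of 4 (50%), so its survivors Ch B and Sec F are the ones copied forward into the new pages.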
New Support is Needed …
Complex Query Support:
Temporal Selection
Structural Projection
Content-Based Selection
Regular Path Expression
Query on Diff
UBCC is not efficient in supporting version queries.
A new scheme is needed …
The SPaR Versioning Scheme
SPaR numbering scheme
Version model
Complex query support
Usefulness-based storage strategy
SPaR Numbering Scheme
XML document structure is represented by a Durable Node Number (DNN) and a Range.
DNN is a sparse numbering scheme that preserves element order.
Range preserves parent-child relationships.
Documents can be decomposed and stored as separate elements, then reconstructed (maybe partially) when needed.
Indexes can be built upon DNN and Range for efficient XML query evaluation.
SPaR Numbering Scheme --- by Example
DNN is a sparse numbering scheme that preserves element order as pre-order traversal (the same as document order).
Range preserves parent-child containment relationship such that:
dnn(P) < dnn(C) < dnn(C)+range(C) < dnn(P)+range(P).
[Figure: example document tree labeled with DNN and range values]
Root: dnn=1, range=100
Ch A: dnn=5, range=25
Ch B: dnn=51, range=30
Sec E: dnn=11
Sec F: dnn=21
Fig G: dnn=61
Fig D, Fig H: dnn=55, dnn=71
Leaf ranges shown: 2, 2, 5, 10, 2
Durability upon Updates
Unused ranges are saved between consecutive elements for future insertions.
When a new element Y is inserted between two consecutive elements X and Z, an unused SPaR range is assigned to Y according to the structural relationship between X, Y, and Z.
Range overflow is handled by floating point numbers with variable length.
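One way to picture range assignment is to carve the new element's range out of the middle of a free gap, leaving unused space on both sides. The sketch below is a hypothetical policy, not the scheme's actual rule: the slides do not specify how the range is chosen. The function name and the "middle third" choice are assumptions, and Python's exact `fractions.Fraction` stands in for the variable-length floating point numbers mentioned above.

```python
from fractions import Fraction


def allocate(free_lo, free_hi):
    """Pick a (dnn, range) pair strictly inside the free gap (free_lo, free_hi).

    Hypothetical policy: take the middle third of the gap, so that unused
    numbers remain before and after the new element for later insertions.
    """
    lo, hi = Fraction(free_lo), Fraction(free_hi)
    if not lo < hi:
        raise ValueError("free gap must be non-empty")
    width = hi - lo
    dnn = lo + width / 3
    rng = width / 3
    return dnn, rng


# Gap between the end of Ch A (5 + 25 = 30) and the start of Ch B (51).
dnn, rng = allocate(30, 51)
print(dnn, rng)

# Inserting again before the new element yields non-integer numbers,
# which is where variable-length precision is needed.
dnn2, rng2 = allocate(30, dnn)
print(dnn2, rng2)
```

Existing elements keep their numbers in both calls, which is the durability property: only the newcomer receives a number.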
SPaR Version Model
Elements are stored by their DNN order along with:
– Lifespan --- (Tstart, Tend)
– SPaR range
Adding a new version, VN:
– Delete(E) – Set E.Tend to N-1 and free its SPaR range.
– Insert(E) – Set the lifespan of E to (N, now) and assign it an unused SPaR range.
– Update(E, new-value) – Delete(E) + Insert(new-value) using the same SPaR range.
New elements of VN are appended into data pages by their DNN order.
However, elements of VN may be scattered among low usefulness data pages …
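The lifespan bookkeeping described above can be sketched with a small in-memory store. This is an illustration under assumptions: the class and function names are hypothetical, `math.inf` stands in for "now", page placement is ignored, and the DNN values in the usage lines are made up for the example.

```python
from dataclasses import dataclass
import math

NOW = math.inf  # assumption: an open-ended lifespan is modeled as infinity


@dataclass
class Element:
    value: str
    dnn: float
    rng: float          # SPaR range
    t_start: int
    t_end: float = NOW

    def valid_at(self, version):
        return self.t_start <= version <= self.t_end


def insert(store, value, dnn, rng, n):
    """Insert at version n: lifespan becomes (n, now)."""
    elem = Element(value, dnn, rng, n)
    store.append(elem)
    return elem


def delete(elem, n):
    """Delete at version n: the element was last valid in version n-1."""
    elem.t_end = n - 1


def update(store, elem, new_value, n):
    """Update = delete + insert, reusing the same SPaR range."""
    delete(elem, n)
    return insert(store, new_value, elem.dnn, elem.rng, n)


def snapshot(store, version):
    """Values valid at `version`, in DNN (document) order."""
    alive = [e for e in store if e.valid_at(version)]
    return [e.value for e in sorted(alive, key=lambda e: e.dnn)]


store = []
insert(store, "Root", 1, 100, 1)
ch_a = insert(store, "Ch A", 5, 25, 1)
sec_e = insert(store, "Sec E", 11, 2, 1)
delete(sec_e, 2)                       # version 2 removes Sec E
insert(store, "Sec J", 15, 2, 2)       # and adds Sec J
update(store, ch_a, "Ch A (rev)", 3)   # version 3 rewrites Ch A

print(snapshot(store, 1))
print(snapshot(store, 2))
print(snapshot(store, 3))
```

Nothing is ever overwritten: every version remains reachable by filtering on lifespans and sorting on DNN.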
Version Reconstruction
To reconstruct version VN:
Step 1 --- Locate useful data pages using the Sparse Page Index.
Step 2 --- Order elements according to their DNN number.
Step 3 --- Reconstruct the ordered-tree structure of the document.
Step 1 --- Locate Useful Pages
Sparse Page Index
[Figure: lifespans of pages P1 to P8 plotted against version numbers 1 to 10]
P1 (1, now)
P2 (1, 6)
P3 (2, 5)
P4 (3, now)
P5 (4, 10)
P6 (7, 8)
P7 (8, 9)
P8 (9, now)
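Given the page lifespans above, Step 1 amounts to selecting the pages whose lifespan covers the requested version. The sketch below uses a plain dictionary and a linear scan purely for illustration; the real Sparse Page Index is a dedicated structure. The function name, the use of `math.inf` for "now", and the inclusive treatment of both lifespan ends are assumptions.

```python
import math

NOW = math.inf  # assumption: "now" is modeled as infinity

# Page lifespans as listed for the Sparse Page Index example.
PAGE_LIFESPANS = {
    "P1": (1, NOW),
    "P2": (1, 6),
    "P3": (2, 5),
    "P4": (3, NOW),
    "P5": (4, 10),
    "P6": (7, 8),
    "P7": (8, 9),
    "P8": (9, NOW),
}


def useful_pages(index, version):
    """Pages whose lifespan covers `version` (both ends assumed inclusive)."""
    return [page for page, (start, end) in index.items()
            if start <= version <= end]


print(useful_pages(PAGE_LIFESPANS, 8))
print(useful_pages(PAGE_LIFESPANS, 10))
```

Only the selected pages need to be read from disk, which is what keeps retrieval cost proportional to the size of the version rather than to the whole history.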
Step 2 and 3
Ordering elements by their DNN numbers ---
– Valid elements inserted at the same version are already sorted by their DNN number, for instance:
V3: Ch Sec Fig Sec Fig Fig …
V7: Sec Fig Sec Sec Fig …
V13: Ch Fig Sec Sec Fig Fig …
– Merge-sort these sorted lists.
Reconstructing ordered-tree structure ---
– Parent-child is determined by SPaR ranges.
– Sibling order is implied by the DNN order.
– Maintain a backward ancestor stack for back-tracking.
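Steps 2 and 3 can be sketched as a k-way merge followed by a single pass with an ancestor stack. This is a minimal illustration, not the authors' code: the function name, the (dnn, range, tag) tuple layout, and the dictionary-based tree nodes are assumptions, and the sample runs use made-up version groupings of the elements from the numbering example.

```python
import heapq


def reconstruct(sorted_runs):
    """Rebuild the ordered tree from per-version runs sorted by DNN.

    sorted_runs: lists of (dnn, range, tag) tuples, each already in DNN order.
    Returns the list of top-level nodes.
    """
    roots = []
    stack = []  # ancestors of the current position, outermost first
    for dnn, rng, tag in heapq.merge(*sorted_runs, key=lambda e: e[0]):
        node = {"tag": tag, "dnn": dnn, "range": rng, "children": []}
        # Back-track: drop ancestors whose SPaR range does not contain the node.
        while stack and not (
            stack[-1]["dnn"] < dnn
            and dnn + rng < stack[-1]["dnn"] + stack[-1]["range"]
        ):
            stack.pop()
        (stack[-1]["children"] if stack else roots).append(node)
        stack.append(node)
    return roots


# Hypothetical runs: elements inserted at two different versions.
run_a = [(1, 100, "Root"), (5, 25, "Ch A"), (11, 2, "Sec E")]
run_b = [(21, 5, "Sec F"), (51, 30, "Ch B"), (61, 2, "Fig G")]

tree = reconstruct([run_a, run_b])
root = tree[0]
print([child["tag"] for child in root["children"]])
print([child["tag"] for child in root["children"][0]["children"]])
```

Each element is pushed and popped at most once, so the tree is rebuilt in one pass over the merged stream.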
Regular Path Expression
Regular Path Query --- “For version 10, retrieve all figures contained by a chapter.”
doc[version=10]/Ch/*/Fig
Basic Ideas:
Traditional algorithms trace the tree structure to match the path pattern.
SPaR ranges make it possible to evaluate a path query simply by using a relational join operator.
We use the SPaR range of Ch elements to reduce the search space for Fig elements.
A Multi-version B+ Tree is built to help search based upon DNN numbers.
Dense Element Index
Multi-version B+ Tree (MVBT) keeps history for B+ Tree.
We use MVBT to build dense element indexes.
[Figure: two dense element indexes, Ch_MVBT and Fig_MVBT, with entries recording SPaR range, lifespan, and page location]
… SPaR: (200,300), Life: (1,now), Loc: Page P1
SPaR: (500,700), Life: (3,now), Loc: Page P1
…
SPaR: (400,410), Life: (1,now), Loc: Page P5
SPaR: (480,490), Life: (1,15), Loc: Page P1
…
SPaR: (250,260), Life: (2,10), Loc: Page P3
SPaR: (550,560), Life: (2,9), Loc: Page P1
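The "figures contained by a chapter" query can be sketched as a containment join over the index entries listed above. Several assumptions apply: the two wide entries are read as Ch_MVBT entries and the four narrow ones as Fig_MVBT entries, each SPaR pair is read as a (start, end) interval, lifespan ends are treated as inclusive, and a sorted list with `bisect` stands in for the MVBT. The join returns figures at any depth inside a chapter, matching the prose form of the query; enforcing the exact `/*/` step would need extra level information.

```python
from bisect import bisect_left, bisect_right
import math

NOW = math.inf  # assumption: "now" is modeled as infinity

# Entries are (start, end, t_start, t_end); grouping into Ch/Fig is assumed.
ch_index = [(200, 300, 1, NOW), (500, 700, 3, NOW)]
fig_index = [(400, 410, 1, NOW), (480, 490, 1, 15),
             (250, 260, 2, 10), (550, 560, 2, 9)]


def alive(entry, version):
    return entry[2] <= version <= entry[3]


def figs_in_chapters(chapters, figures, version):
    """Figures valid at `version` whose interval lies inside a valid chapter."""
    figures = sorted(figures)              # order by interval start
    starts = [fig[0] for fig in figures]
    hits = []
    for ch in chapters:
        if not alive(ch, version):
            continue
        # The chapter's range narrows the search to figures starting inside it.
        lo = bisect_right(starts, ch[0])
        hi = bisect_left(starts, ch[1])
        for fig in figures[lo:hi]:
            if fig[1] < ch[1] and alive(fig, version):
                hits.append(fig[:2])
    return hits


print(figs_in_chapters(ch_index, fig_index, 10))
print(figs_in_chapters(ch_index, fig_index, 5))
```

At version 10 only the figure at (250,260) qualifies: the figure at (550,560) sits inside a chapter but its lifespan ended at version 9, and the other two figures are not inside any chapter.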
Pages stored: size(RCS)/(1-Umin)
Retrieval of single version: size(Version)/Umin pages
UBCC uses a separate edit script pointing to the data
– to retrieve only useful pages
– in the right order!
SPaR scheme only needs SPaR ranges to reconstruct versions.
SPaR is slightly better than UBCC in storage cost and version reconstruction.
Performance
[Figure: Storage. Pages vs. Total Number of Versions, for RCS, SPaR 50%, UBCC 50%, and Snapshot]
[Figure: Version Retrieval Cost. Pages vs. Total Number of Versions, for RCS, SPaR 50%, UBCC 50%, and Snapshot]
Performance and Storage Cost (10% inserted, 10% deleted)
The web changes everything. XML unifies everything.
It’s time for a new technology that merges and overcomes the limitations of traditional versioning schemes and temporal databases.
Usefulness-based clustering is effective and versatile: we applied it to edit-script-based schemes (UBCC) and to the SPaR scheme.
The SPaR numbering scheme makes it possible to build a document structural index and efficiently evaluate complex version queries.
Emerging issues:
– Query language support for version queries.
– User interface for browsing versions and presenting query results
Conclusion and Future Work