
Storing and Querying Multi-version XML Documents using Durable Node Numbers

Shu-Yao Chien

Dept. of CS

[email protected]

Vassilis J. Tsotras

Dept. of CS&E

UC [email protected]

Carlo Zaniolo

Dept. of CS

[email protected]

Donghui Zhang

Dept. of CS&E

UC [email protected]

Traditional applications migrating to the web:

– Software configuration management

– Cooperative work

– CAD

An array of web-based applications:

– Web content providers and trackers

– Link Permanence

– WebDAV

Document Version Management

A range of new and old applications look to XML for a shared technology and toolset to support their varied requirements.

Main requirements and research challenges:

– Efficient version retrieval

– Storage efficiency

– Complex query support

Problem Definition

The naive approach stores each version in its entirety: it minimizes retrieval cost but uses storage very inefficiently.

RCS (Revision Control System):

– stores the latest version in its entirety

– represents old versions by deltas (reverse edit scripts)

– minimizes storage cost

– version retrieval cost grows linearly with version number

SCCS (Source Code Control System):

– objects are timestamped and stored by their document order

– version retrieval cost can be as high as reading the whole change history

These schemes are used by most current systems, but they need improvements in storage management, retrieval, query, and support for complex objects.

Traditional Versioning Schemes

DBs for CAD and for semi-structured information have paid much attention to version support.

Temporal DBs: efficient support for transaction time through various indexing schemes (Snapshot Index, Multi-Version B+-Trees, etc.).

But typical DBs do not support object ordering (since reconstruction of a complete document is not a critical query).

Numbering schemes have been proposed to represent document structure and to make the evaluation of regular path expressions more efficient.

Databases --- Temporal, OO, Semi-structured, XML DB, …

UBCC [WebDB200] enhances RCS with page management

Flexibility of trading off storage and retrieval costs

Using the concept of Page Usefulness

Captures the order of the objects in the document in the (forward) edit script

Storage Level Enhancement

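[Figure: one page holding objects A, B, C, D, shown at three versions. At T1 all four objects are valid (usefulness 100%). After one deletion at T2 the usefulness is 75%. After two more deletions at T3 it is 25%. The 100% and 75% states are marked useful and the 25% state useless.]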

We set a minimum usefulness requirement Umin, e.g. 70% (0 < Umin <= 1).

A page is useful when its usefulness is above Umin and useless when it is below Umin.
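The usefulness test can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes that all objects have the same size (so usefulness is a simple object count), that a usefulness exactly equal to Umin counts as useful (the slide does not say), and that B, C, and D are the objects deleted in the example (the slide does not name them). The names `StoredObject`, `usefulness`, and `is_useful` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StoredObject:
    name: str
    t_start: int                 # version in which the object was inserted
    t_end: Optional[int] = None  # last version in which it is valid; None = still valid

    def valid_at(self, version: int) -> bool:
        return self.t_start <= version and (self.t_end is None or version <= self.t_end)


def usefulness(page: List[StoredObject], version: int) -> float:
    """Fraction of the objects stored in the page that are valid at `version`."""
    if not page:
        return 0.0
    return sum(obj.valid_at(version) for obj in page) / len(page)


def is_useful(page: List[StoredObject], version: int, u_min: float = 0.7) -> bool:
    # Assumption: a usefulness exactly equal to u_min counts as useful.
    return usefulness(page, version) >= u_min


# Page with objects A, B, C, D: B is deleted in version 2, C and D in version 3
# (which objects are deleted is an assumption made for this example).
page = [
    StoredObject("A", 1),
    StoredObject("B", 1, t_end=1),
    StoredObject("C", 1, t_end=2),
    StoredObject("D", 1, t_end=2),
]

for v in (1, 2, 3):
    print(v, usefulness(page, v), is_useful(page, v))
```

With Umin = 70%, the page is useful for versions 1 and 2 and useless for version 3.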


Page Usefulness – by Example

Example (Umin = 70%):

VERSION 1: page P1 holds Root, Ch A, Fig D, Sec E; page P2 holds Ch B, Sec F, Fig G, Fig H.

VERSION 2 edit script: INS(Sec J), DEL, INS(Fig M), DEL, DEL, INS(Ch K), INS(Sec L).

• STEP 1 : Determine page usefulness for copying. U(P1) = 75%; U(P2) = 50% < Umin = 70%, so the objects of P2 that are still valid are copied.

• STEP 2 : Append new/copied objects into new pages by their logical order. P3 holds Sec J, Ch B and Sec F (copied from P2), and Fig M; P4 holds Ch K and Sec L. U(P3) = 100%, U(P4) = 100%.

Usefulness Based Copy Control (UBCC)
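The two UBCC steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the published algorithm: objects are identified by unique names, pages hold at most four equal-sized objects, and the new version is given directly as an ordered list rather than as an edit script. The function name `ubcc_new_pages`, the choice of Fig D as the object deleted from P1, and the position of Sec J in the new version are hypothetical.

```python
from typing import Dict, List


def ubcc_new_pages(pages: Dict[str, List[str]], new_version: List[str],
                   u_min: float = 0.7, capacity: int = 4) -> List[List[str]]:
    """Return the new pages to append for `new_version` (objects in document order)."""
    alive = set(new_version)

    # STEP 1: a page stays in use only if enough of its objects are still valid.
    kept = set()
    for objs in pages.values():
        u = sum(o in alive for o in objs) / len(objs)
        if u >= u_min:
            kept.update(o for o in objs if o in alive)

    # STEP 2: new objects and valid objects from useless pages are written,
    # in logical (document) order, into fresh pages.
    to_write = [o for o in new_version if o not in kept]
    return [to_write[i:i + capacity] for i in range(0, len(to_write), capacity)]


pages_v1 = {
    "P1": ["Root", "Ch A", "Fig D", "Sec E"],
    "P2": ["Ch B", "Sec F", "Fig G", "Fig H"],
}
# Assumed document order of version 2 (Fig D, Fig G, Fig H deleted).
version_2 = ["Root", "Ch A", "Sec E", "Sec J", "Ch B", "Sec F", "Fig M", "Ch K", "Sec L"]

new_pages = ubcc_new_pages(pages_v1, version_2)
print(new_pages)
```

Under these assumptions P1 (75%) is kept, P2 (50%) is treated as useless, and the two new pages match P3 and P4 of the example.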

New Support is Needed …

Complex Query Support:

Temporal Selection

Structural Projection

Content-Based Selection

Regular Path Expression

Query on Diff

UBCC is not efficient in supporting version queries.

A new scheme is needed …

The SPaR Versioning Scheme

SPaR numbering scheme

Version model

Complex query support

Usefulness-based storage strategy

SPaR Numbering Scheme

XML document structure is represented by a Durable Node Number (DNN) and a Range.

DNN is a sparse numbering scheme that preserves element order.

Range preserves parent-child relationships.

Documents can be decomposed and stored as separate elements, then reconstructed (maybe partially) when needed.

Indexes can be built upon DNN and Range for efficient XML query evaluation.

SPaR Numbering Scheme --- by Example

DNN is a sparse numbering scheme that preserves element order as pre-order traversal (the same as document order).

Range preserves parent-child containment relationship such that:

dnn(P) < dnn(C) < dnn(C)+range(C) < dnn(P)+range(P).
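The containment condition translates directly into a predicate. The sketch below is illustrative only: the `Node` type and the `contains` function are hypothetical names, the Root, Ch A, and Ch B values follow the example figure, and the range of 2 for Sec E is an assumption.

```python
from typing import NamedTuple


class Node(NamedTuple):
    name: str
    dnn: float   # durable node number
    rng: float   # SPaR range


def contains(p: Node, c: Node) -> bool:
    """True when c lies inside p: dnn(P) < dnn(C) < dnn(C)+range(C) < dnn(P)+range(P)."""
    return p.dnn < c.dnn < c.dnn + c.rng < p.dnn + p.rng


root = Node("Root", 1, 100)
ch_a = Node("Ch A", 5, 25)
ch_b = Node("Ch B", 51, 30)
sec_e = Node("Sec E", 11, 2)   # range value assumed for illustration

print(contains(root, ch_a), contains(ch_a, sec_e), contains(ch_b, sec_e))

# Sorting by DNN yields document (pre-order) order.
print([n.name for n in sorted([ch_b, sec_e, root, ch_a], key=lambda n: n.dnn)])
```

Because the test uses only four numbers, ancestor-descendant checks need no tree traversal.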

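[Figure: example document tree annotated with SPaR numbers. Root (dnn=1, range=100) has children Ch A (dnn=5, range=25) and Ch B (dnn=51, range=30). The lower-level elements shown are Sec E (dnn=11), Sec F (dnn=21), Fig G (dnn=61), and Fig D and Fig H (dnn=55 and dnn=71), with ranges 2, 2, 5, 10, and 2. Interval labels in the figure include (21, 25) and (55, 65).]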

Durability upon Updates

Unused ranges are saved between consecutive elements for future insertions.

When a new element Y is inserted between two consecutive elements X and Z, an unused SPaR range is assigned to Y according to the structural relationship between X, Y, and Z.

Range overflow is handled by floating point numbers with variable length.
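One possible way to assign a range to a new element is sketched below. The slides do not give the assignment rule, so everything here is an assumption: the sketch covers only the case where Y is inserted as a sibling between X and Z, it gives Y the middle third of the unused gap, and it uses Python's `Fraction` as a stand-in for the variable-length numbers mentioned above. `Node` and `insert_between` are hypothetical names, and the range of 5 for Sec F is assumed.

```python
from fractions import Fraction
from typing import NamedTuple


class Node(NamedTuple):
    name: str
    dnn: Fraction
    rng: Fraction


def insert_between(name: str, x: Node, z: Node) -> Node:
    """Assign Y the middle third of the unused gap between siblings X and Z."""
    gap_start = x.dnn + x.rng      # end of X's range
    gap = z.dnn - gap_start        # unused space before Z starts
    if gap <= 0:
        raise ValueError("no unused range between X and Z")
    third = Fraction(gap) / 3
    # Y starts one third into the gap and ends two thirds into it,
    # leaving unused space on both sides for later insertions.
    return Node(name, gap_start + third, third)


sec_e = Node("Sec E", Fraction(11), Fraction(2))
sec_f = Node("Sec F", Fraction(21), Fraction(5))   # range value assumed
sec_new = insert_between("Sec New", sec_e, sec_f)
print(sec_new)
```

Since exact fractions never run out of precision, repeated insertions at the same position keep succeeding, at the cost of longer numbers.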

SPaR Version Model

Elements are stored by their DNN order along with:

– Lifespan: (Tstart, Tend)

– SPaR range

Adding a new version, VN:

– Delete(E): set E.Tend to N-1 and free its SPaR range.

– Insert(E): set the lifespan of E to (N, now) and assign it an unused SPaR range.

– Update(E, new-value): Delete(E) + Insert(new-value) using the same SPaR range.

New elements of VN are appended into data pages by their DNN order.

However, elements of VN may be scattered among low usefulness data pages …
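The three operations can be sketched with an append-only list of elements. This is a minimal illustration under assumptions: pages and the bookkeeping of freed SPaR ranges are left out, `None` stands for "now", and the names `Element`, `VersionStore`, and `snapshot` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

NOW = None  # open-ended lifespan


@dataclass
class Element:
    value: str
    spar: Tuple[float, float]      # (dnn, range)
    t_start: int
    t_end: Optional[int] = NOW


class VersionStore:
    def __init__(self) -> None:
        self.elements: List[Element] = []   # append-only, as in the version model

    def insert(self, value: str, spar: Tuple[float, float], n: int) -> Element:
        e = Element(value, spar, n)          # lifespan (N, now)
        self.elements.append(e)
        return e

    def delete(self, e: Element, n: int) -> None:
        e.t_end = n - 1                      # last valid version is N-1

    def update(self, e: Element, new_value: str, n: int) -> Element:
        self.delete(e, n)
        return self.insert(new_value, e.spar, n)   # reuse the same SPaR range

    def snapshot(self, n: int) -> List[Element]:
        valid = [e for e in self.elements
                 if e.t_start <= n and (e.t_end is NOW or n <= e.t_end)]
        return sorted(valid, key=lambda e: e.spar[0])   # DNN order


store = VersionStore()
root = store.insert("Root", (1, 100), 1)
ch_a = store.insert("Ch A", (5, 25), 1)
sec_e = store.insert("Sec E", (11, 2), 1)
sec_e2 = store.update(sec_e, "Sec E (revised)", 2)
store.delete(sec_e2, 3)

for v in (1, 2, 3):
    print(v, [e.value for e in store.snapshot(v)])
```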

Version Reconstruction

To reconstruct version VN:

Step 1 --- Locate useful data pages using the Sparse Page Index.

Step 2 --- Order elements according to their DNN.

Step 3 --- Reconstruct the ordered-tree structure of the document.

Step 1 --- Locate Useful Pages

Sparse Page Index

[Figure: page lifespans plotted against version numbers 1 to 10: P1(1,now), P2(1,6), P3(2,5), P4(3,now), P5(4,10), P6(7,8), P7(8,9), P8(9,now).]
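The page lookup for Step 1 can be illustrated with the lifespans from the figure. This sketch makes two assumptions: a page lifespan (s, e) means the page is useful for versions s through e inclusive, and a linear scan stands in for the actual Sparse Page Index structure, which the slides do not detail. `useful_pages` is a hypothetical name.

```python
import math
from typing import Dict, List, Tuple

NOW = math.inf  # open-ended lifespan

# Page lifespans as listed in the Sparse Page Index figure.
page_lifespans: Dict[str, Tuple[float, float]] = {
    "P1": (1, NOW), "P2": (1, 6), "P3": (2, 5), "P4": (3, NOW),
    "P5": (4, 10), "P6": (7, 8), "P7": (8, 9), "P8": (9, NOW),
}


def useful_pages(index: Dict[str, Tuple[float, float]], version: int) -> List[str]:
    """Pages whose lifespan covers `version` (end points assumed inclusive)."""
    return [page for page, (start, end) in index.items() if start <= version <= end]


print(useful_pages(page_lifespans, 8))
```

Only the returned pages need to be read to reconstruct the requested version.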

Step 2 and 3

Ordering elements by their DNN:

– Valid elements inserted at the same version are already sorted by their DNN.

– Merge-sort these sorted lists.

Reconstructing the ordered-tree structure:

– Parent-child relationships are determined by SPaR ranges.

– Sibling order is implied by the DNN order.

– Maintain a backward ancestor stack for back-tracking.

[Figure: elements inserted at versions V3, V7, and V13, each list already sorted by DNN, for instance V3: Ch Sec Fig Sec Fig Fig; V7: Sec Fig Sec Sec Fig; V13: Ch Fig Sec Sec Fig Fig.]
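Steps 2 and 3 can be sketched together: a k-way merge of the per-version lists followed by a single pass with an ancestor stack. This is an illustration under assumptions, not the authors' implementation: elements are plain `(dnn, range, label)` tuples, the sample DNN values are invented, and `reconstruct` and `outline` are hypothetical names.

```python
import heapq
from typing import Dict, List, Tuple

Elem = Tuple[float, float, str]   # (dnn, range, label)


def reconstruct(sorted_lists: List[List[Elem]]) -> List[Dict]:
    """Merge per-version lists (each sorted by DNN) and rebuild the ordered tree."""
    roots: List[Dict] = []
    stack: List[Tuple[float, List[Dict]]] = []   # (end of range, children) per open ancestor
    for dnn, rng, label in heapq.merge(*sorted_lists):
        # Back-track: close every ancestor whose range ends at or before this DNN.
        while stack and dnn >= stack[-1][0]:
            stack.pop()
        node = {"label": label, "dnn": dnn, "children": []}
        (stack[-1][1] if stack else roots).append(node)
        stack.append((dnn + rng, node["children"]))
    return roots


def outline(nodes: List[Dict], depth: int = 0) -> List[str]:
    lines: List[str] = []
    for n in nodes:
        lines.append("  " * depth + n["label"])
        lines.extend(outline(n["children"], depth + 1))
    return lines


# Valid elements grouped by the version that inserted them (values are illustrative).
inserted_v1 = [(1, 100, "Root"), (5, 25, "Ch A"), (11, 2, "Sec E"), (51, 30, "Ch B")]
inserted_v2 = [(15, 2, "Sec J"), (55, 2, "Fig M")]

tree = reconstruct([inserted_v1, inserted_v2])
print("\n".join(outline(tree)))
```

Each element is pushed and popped at most once, so the tree is rebuilt in one pass over the merged stream.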

Regular Path Expression

Regular Path Query --- “For version 10, retrieve all figures contained by a chapter.”

doc[version=10]/Ch/*/Fig

Basic Ideas:

Traditional algorithms trace the tree structure to match the path pattern.

SPaR ranges make it possible to evaluate a path query using only relational join operators.

We use the SPaR range of Ch elements to reduce the search space for Fig elements.

A Multi-version B+ Tree is built to support searches based on DNN.
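The join idea can be sketched as a range lookup over Fig entries for each valid Ch entry. Several assumptions apply: a sorted list searched with `bisect` stands in for the Multi-version B+ Tree, the sketch evaluates only the containment join (any figure inside a chapter) and does not enforce the single intermediate level of `Ch/*/Fig`, and the entry values are illustrative. `Entry`, `valid`, and `figs_in_chapters` are hypothetical names.

```python
from bisect import bisect_left, bisect_right
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    dnn: float
    rng: float
    t_start: int
    t_end: Optional[int]   # None = now


def valid(e: Entry, version: int) -> bool:
    return e.t_start <= version and (e.t_end is None or version <= e.t_end)


def figs_in_chapters(chapters: List[Entry], figures: List[Entry], version: int) -> List[Entry]:
    """Containment join: figures valid at `version` that lie inside a valid chapter."""
    figs = sorted((f for f in figures if valid(f, version)), key=lambda f: f.dnn)
    keys = [f.dnn for f in figs]
    result: List[Entry] = []
    for ch in sorted((c for c in chapters if valid(c, version)), key=lambda c: c.dnn):
        # Only figures whose DNN falls inside the chapter's range are candidates.
        lo = bisect_right(keys, ch.dnn)
        hi = bisect_left(keys, ch.dnn + ch.rng)
        result.extend(f for f in figs[lo:hi] if f.dnn + f.rng < ch.dnn + ch.rng)
    return result


# Illustrative entries (dnn, range, lifespan start, lifespan end).
chapters = [Entry(200, 100, 1, None), Entry(500, 200, 3, None)]
figures = [Entry(400, 10, 1, None), Entry(480, 10, 1, 15),
           Entry(250, 10, 2, 10), Entry(550, 10, 2, 9)]

print([f.dnn for f in figs_in_chapters(chapters, figures, 10)])
```

The chapter's range bounds the portion of the figure list that has to be examined, which is the search-space reduction described above.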

Dense Element Index

A Multi-version B+ Tree (MVBT) keeps the history of a B+ Tree. We use MVBTs to build dense element indexes.

[Figure: two dense element indexes, Ch_MVBT and Fig_MVBT, with example entries:]

SPaR : (200,300)   Life : (1,now)   Loc : Page P1

SPaR : (500,700)   Life : (3,now)   Loc : Page P1

SPaR : (400,410)   Life : (1,now)   Loc : Page P5

SPaR : (480,490)   Life : (1,15)   Loc : Page P1

SPaR : (250,260)   Life : (2,10)   Loc : Page P3

SPaR : (550,560)   Life : (2,9)   Loc : Page P1

Pages stored : size(RCS)/(1-Umin)

Retrieval of single version : size(Version)/Umin pages
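A quick calculation shows how Umin trades storage against retrieval in these two formulas. The function names below are hypothetical, and sizes are measured in pages.

```python
def pages_stored(size_rcs_pages: float, u_min: float) -> float:
    """Pages stored: size(RCS) / (1 - Umin); requires 0 < Umin < 1."""
    return size_rcs_pages / (1 - u_min)


def pages_read_for_version(size_version_pages: float, u_min: float) -> float:
    """Pages read to retrieve one version: size(Version) / Umin."""
    return size_version_pages / u_min


# Example: an RCS store of 1000 pages and a version occupying 100 pages.
for u_min in (0.25, 0.5, 0.75):
    print(u_min, pages_stored(1000, u_min), pages_read_for_version(100, u_min))
```

Raising Umin makes retrieval cheaper and storage more expensive. At Umin = 50% both figures are twice their baseline.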

UBCC uses a separate edit script pointing to the data, to retrieve only useful pages in the right order.

SPaR scheme only needs SPaR ranges to reconstruct versions.

SPaR is slightly better than UBCC in storage cost and version reconstruction.

Performance

[Figure: Storage. Pages (0 to 3000) vs. Total Number of Versions, for RCS, SPaR 50%, UBCC 50%, and Snapshot.]

[Figure: Version Retrieval Cost. Pages (50 to 250) vs. Total Number of Versions, for RCS, SPaR 50%, UBCC 50%, and Snapshot.]

Performance and Storage Cost (10% inserted, 10% deleted)

The web changes everything; XML unifies everything. It’s time for a new technology that merges traditional versioning schemes and temporal databases and overcomes their limitations.

Usefulness-based clustering is effective and versatile: we applied it to edit-script-based schemes (UBCC) and to the SPaR scheme.

The SPaR numbering scheme makes it possible to build document structural indexes and efficiently evaluate complex version queries.

Emerging issues:

– Query language support for version queries.

– User interface for browsing versions and presenting query results

Conclusion and Future Work
