efficient discovery of xml data redundancies

Efficient Discovery of XML Data Redundancies

Cong Yu and H. V. JagadishUniversity of Michigan, Ann Arbor

-VLDB 2006, Seoul, Korea September 12th, 2006

2 / 42

Talk Outline•Motivating Example

•A Comprehensive Notion of XML FD

•XML Redundancy Discovery Algorithms

•Experimental Evaluation

•Conclusion

3 / 42

An Example XML Document

warehouse

state state

store

bookname

book

store

name

book

state

ISBN title au au

“Borders”“Borders”

“… 269”“DB” “R.R.”“J.G.”

store

name“Amazon”

ISBN title“… 269” “DB”

ISBN title au au“… 269” “DB” “R.R.”“J.G.”

price“$59.9” price

“$51.1”

price“$59.9”

… …

4 / 42

• An example constraint:For any two books, if they have the same ISBN, then they have the same title.

• Similar to Equality Generating Dependencies (EGDs) [BV84] and Nested EGDs [YP04]

Constraints on XML Data

TargetCondition

Element(s)Implication Element(s)

5 / 42

Data Redundancies•E.g., title is redundantly stored•Result of “non-optimal” design of the

database schema in the presence of constraints

•Lead to: Update anomalies Increased cost for data transfer and

manipulation

•Constraints are the properties of data May not be known at the design phase

6 / 42

GoalEfficiently Discover

Redundancies From the XML Database By Discovering

Satisfied Constraints

7 / 42

Main Contributions•A comprehensive notion of XML FD

Capturing a semantically richer set of XML constraints

Definition of XML data redundancy in terms of XML FDs and XML Keys

•Efficient algorithms for discovering FDs and data redundancies from an XML database


8 / 42





•Conclusion

10 / 42

Example XML Constraints• Hierarchical: condition and/or implication

elements can come from multiple hierarchies

state

store

bookname

book

store

name

book

state

ISBN title au au


“… 269”“DB” “R.R.”“J.G.”

store

name“Amazon”




“$51.1”

price“$59.9”

… …

11 / 42

• Set elements: condition and/or implication elements can involve set elements

Example XML Constraints, Cont’d

store

bookname

book

store

name

book

ISBN title au au


“… 269”“DB” “R.R.”“J.G.”

store

name“Amazon”




“$51.1”

price“$59.9”

… …

state state

12 / 42

Functional Dependencies (FDs)

•FDs are used to describe constraints in relational databases

•A similar notion of FD is needed for XML

•Challenges: Target is difficult to specify due to the

hierarchical structure Set elements introduce new semantics

XML FD needs richer semantics !

13 / 42

Previous Notions• Path Based Notion [LLL02,VLL04]

Example: {/warehouse/state/store/book/ISBN} /warehouse/state/store/book/title

Format: LHS RHS Semantics: for any two RHS nodes, same

(associated) LHS indicates same RHS

• Tree Tuple Based Notion [AL04] A tree tuple is a data tree, with exactly one data

node for each schema element Format: LHS RHS Semantics: for any two tree tuples, same LHS

indicates same RHS

14 / 42

• Both capture hierarchical constraints• Neither can capture set constraints• {/store/book/ISBN} /store/book/au

Violated in previous Satisfied if the two au nodes are a single set

• {/store/book/title,/store/book/au} /store/book/ISBN Undefined in previous Intuitive if au nodes are

a single set

Previous Notions, cont’d

store

bookname

ISBN title au au

“Borders”

“… 269”“DB” “R.R.”“J.G.”price

“$59.9”

15 / 42

A New Comprehensive Notion•Generalized Tree Tuple

A data tree constructed around a pivot data node (np)

Entire subtree rooted at np is kept

All ancestors of np and their “attributes” are kept

•Tuple Class CP

The set of all generalized tree tuples, whose pivot nodes share the same path P (called pivot path)

16 / 42

warehouse

state state

store

bookname

book

store

name

book

state

ISBN title au au


“… 269”“DB” “R.R.”“J.G.”

store

name“Amazon”




“$51.1”

price“$59.9”

… …

Example Generalized Tree Tuple

Pivot

17 / 42

warehouse

state state

store

bookname

book

store

name

book

state

ISBN title au au


“… 269”“DB” “R.R.”“J.G.”

store

name“Amazon”




“$51.1”

price“$59.9”

… …

Example Generalized Tree TuplePivot

18 / 42

XML FD•<CP, LHS, RHS>: LHS RHS w.r.t. CP

•Semantics:

for any two generalized tree tuple t1, t2 in CP, if they share the same LHS, they have the same RHS.

•E.g., {./title, ./au} ./ISBN, w.r.t. C/warehouse/state/store/book

19 / 42

Repeatable Elements Are Special

warehouse

state state

store

bookname

book

store

name

book

state

ISBN title au au


“… 269”“DB” “R.R.”“J.G.”

store

name“Amazon”




“$51.1”

price“$59.9”

… …

20 / 42

Essential Tuple Classes•Definition:

Tuple classes with pivot paths that correspond to repeatable schema elements

C/warehouse/state/store/book is essential

C/warehouse/state/store/name is not

•Express XML FDs that are expressible with non-essential tuple classes

•See paper for detailed proof

23 / 42

XML Key and Data Redundancy

• Let attribute @key uniquely identify each node in the entire data tree

• <CP, LHS> is an XML Key, when the database satisfies XML FD: LHS ./@key w.r.t. CP

• Similar to the relative key notion proposed in [BDF+01]

• Data redundancy exists if the database: Satisfies the XML FD <CP, LHS, RHS>,

But <CP, LHS> is not an XML key

RHS is redundantly stored.

24 / 42





•Conclusion

25 / 42

Strategy•Discover satisfied XML FDs and Keys

•Data redundancies can then be discovered based on the definition

•First, we need an efficient representation of the XML data

26 / 42

• Each essential tuple class a relation Similar to nested relations [OY87,MNE96] All relations together form a hierarchy Tree tuples can be reconstructed by joining @key

with parent

Hierarchical Representation of XML Data

R_state@key parent 2 root 3 root 18 root. . . . .

R_store@key parent name 4 3 Borders 12 3 Amazon 19 18 Borders

R_book@key parent ISBN title price 6 4 …269 DB $59.9 13 12 …269 DB $51.1 20 19 …269 DB $59.9

R_au@key parent @text 10 6 R.R. 11 6 J.G. 24 20 R.R. 25 20 J.G.

27 / 42

Intra-Relation FDs

state

store

bookname

book

store

name

book

state

ISBN title au au


“… 269”“DB” “R.R.”“J.G.”

store

name“Amazon”




“$51.1”

price“$59.9”

… …

• {./ISBN} ./title, w.r.t. C/warehouse/state/store/book

28 / 42

Present in R_book

Inter-Relation FDs

state

store

bookname

book

store

name

book

state

ISBN title au au


“… 269”“DB” “R.R.”“J.G.”

store

name“Amazon”




“$51.1”

price“$59.9”

… …

• {../name, ./ISBN} ./price, w.r.t. C/warehouse/state/store/book

Present in R_store

29 / 42

Overview of the Discovery Process•Only interested in minimal FDs•Bottom-Up•At each relation

Discover intra-relation FDs and Keys Discover inter-relation FDs and Keys

involving descendant relations Generate candidate inter-relation FDs and

Keys for examination at the parent level

•Attribute Partition as the basic data structure

30 / 42

Attribute Partition•Groups tuples

according to the attribute value

•∏{price} for Cbook = { {t6,t20}, {t13} }

∏{@key} for Cbook = { {t6}, {t20}, {t13} }

∏{price, @key} for Cbook = { {t6}, {t20}, {t13} }

•FD: LHS RHS w.r.t. CP is satisfied iff:

∏LHS∪RHS = ∏LHS


31 / 42

Set Attribute Partition • Generated through

refinement Initialize ∏{au} for R_book to be { {t6, t13, t20} }

∏{@text} for R_au = { {t10, t24}, {t11, t25} }

{ {t6, t20}, {t6, t20} }

∏au for R_book =

{ {t6, t20}, {t13} }

• ∏au can then be used as

a normal partition

R_au@key parent @text 10 6 R.R. 11 6 J.G. 24 20 R.R. 25 20 J.G.


Convert to parent

Refine ∏{au} using partitions in ∏{@text}

32 / 42

Discovery Algorithms•DiscoverFD:

Discover intra-relation FDs and Keys Similar to existing relational algorithms

•DiscoverXFD: Discover inter-relation FDs and Keys Key component:

Candidate inter-relation XML FD generation

33 / 42

Generating Candidate Inter-Relation FDs

• Let P' be a parent relation of P

• Parent satisfaction property For LHS∪X RHS w.r.t. CP to hold for any

attribute set X in relation P', LHS∪{./parent} RHS w.r.t. CP must hold

• Child implication property For LHS∪X RHS w.r.t. CP to be a non-trivial FD

for any attribute set X in relation P', LHS RHS w.r.t. CP must not hold

• An FD is a candidate inter-relation FD if it satisfies both properties

36 / 42





•Conclusion

37 / 42

Real Datasets

• DBLP contains a fair amount of redundancy, as noted earlier in [AL04] as well

• ~ 10% redundancies in PIR (measured as # of redundant elements over total # of elements), schema modification reported to PIR

38 / 42

Scalability on XMark

• Linear in terms of scale factor (# of elements) – even though exponential in theory

• Orders of magnitude faster than direct application of a state-of-the-art relational discovery algorithm The latter takes over 3 hours to run on XMark scale factor 1

39 / 42

Related Work•XML Integrity Constraints (FDs and

Keys) [BDF+01], [LLL02], [FS03]

•XML Normal Form [AL04], [VLL04]

•Nested Relation Normal Form [OY87], [MNE96]

•Relational FD discovery FUN, Dep-Miner, TANE, fdep, FastFDs

41 / 42

Conclusion•A comprehensive notion of XML FDs and

Keys, capturing set semantics

•A system for for detecting XML data redundancies through the discovery of FDs and Keys

•The system is practical for real datasets and out-performs direct application of the best available relational algorithm by orders of magnitude.

42 / 42

Questions ?

efficient discovery of xml data redundancies

Documents

example xml constraintsregular

terms of xml fds

xml keysefficient algorithms

xml dataan example constraint

data tree

data transfer

data redundanciese

data node