TAX: A Tree Algebra for XML Reference: Jagadish et al. DBPL 2001.


Page 1: TAX: A Tree Algebra for XML Reference: Jagadish et al. DBPL 2001

TAX: A Tree Algebra for XML

Reference: Jagadish et al. DBPL 2001.

Page 2

Overview
- Why an algebra for XML?
- Main challenges
- Data model
- Patterns & Witnesses
- Tree Value Functions
- Some Example Operators
- Translation Example – XQuery

Page 3

Overview (contd.)
- Main Results
- Optimization Examples
- Implementation
- Summary & Future Work

Page 4

Why an Algebra (for XML)? (aka Related Work)
- Bulk algebra for tree manipulation – efficient implementation of XML queries
- Algebra for manipulating trees (has been attempted before)
  - Feature algebras – linguistics; efficient implementation?
  - Grammar-based algebra for trees [Tompa+ 87, Gyssens+ 89]
  - Aqua project [Zdonik+ 95]

Page 5

Why XML algebra? [Related work] (contd.)
- GraphLog, Hy+ [Consens+ 90], GOOD [Paredaens+ 92] – cannot exploit special properties of trees (e.g., support for arbitrary recursion vs. descendants, order)
- Semistructured (SS) data – Lorel [Abiteboul+ 96], UnQL [Buneman+ 96].
- XML algebras – [Beech+ 99], [Fernandez+ 00] (mainly type system issues), [Christophides+ 00] (trees → tuples), [Ludascher+ 00] (nodes, not trees), SAL [Beeri+ 99] (ordered lists of nodes)

Page 6

Why? (contd.)
- be close to relational model, but direct support for (collections of) trees
- express at least RA + aggregation
- capture substantial fragment of XQuery
- admit efficient implementation and effective query optimization
  - e.g., satisfy “natural” identities.

Page 7

Main Challenges
- Capture rich variety of manipulations in a simple algebra
- Handle heterogeneity in tree collections
  - structure; “schema” of nodes of the same “type”
- Handle order (documents are ordered)
  - sometimes important (e.g., author list, whether anesthesia preceded incision)
  - sometimes not (e.g., publisher vs. authors)

Page 8

Data Model
- Data tree = rooted ordered tree
- Data in node = set of attr-val pairs
- Special attribute: pedigree – where did I come from?
  - Node representation = (docId, startPos:endPos, level)
  - preserved for (copies of) original nodes through manipulations.
  - plays an important role in grouping, sorting, etc.
  - null for new nodes.
- Collections (of trees) – unordered.
- IDREF(S) treated like other attributes.
  - Possible alternative: treat them as pointers.
  - One position: express pointer dereferencing as IDREF=ID join (but implement as you will).
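A minimal Python sketch of how the pedigree triple could support structural tests. It assumes the common interval convention that a descendant's startPos:endPos range lies strictly inside its ancestor's range; the slides give only the triple itself, so the class name `Pedigree`, the helper functions, and the sample positions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pedigree:
    """Hypothetical encoding of (docId, startPos:endPos, level)."""
    doc_id: str
    start: int
    end: int
    level: int


def is_descendant(d: Optional[Pedigree], a: Optional[Pedigree]) -> bool:
    """True if d lies strictly inside a's interval in the same document."""
    if d is None or a is None:  # new nodes carry a null pedigree
        return False
    return d.doc_id == a.doc_id and a.start < d.start and d.end < a.end


def is_child(d: Optional[Pedigree], a: Optional[Pedigree]) -> bool:
    """A child is a descendant exactly one level below (assumed convention)."""
    return is_descendant(d, a) and d.level == a.level + 1


# Positions below are invented for illustration only.
book = Pedigree("bib.xml", 2, 11, 1)
year = Pedigree("bib.xml", 3, 4, 2)
last = Pedigree("bib.xml", 7, 8, 4)
```

With these values `year` is a child of `book`, while `last` is a descendant but not a child. A node created by an operator has pedigree `None` and never matches.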

Page 9

Patterns & Witnesses
- first challenge: how do you get at nodes and/or attributes?
- Notion of selection parameter – considerably more complex
- our solution: patterns – enable specification of parameters for most operations
- only show parts of interest:
  - Need not know/care about entire structure of trees in collection
  - Analogy: in SchemaLog, you only specify what you care about.

Page 10

Patterns & Witnesses (contd.)
Example P1:
Structural part:
      $1
  pc /  \ ad
    $2    $3
Condition part: $1.tag = book & $2.tag = year & $2.content < 2000 & $3.tag = author
- pc = direct, ad = transitive
- Additional parameters possible: e.g., selection/projection lists, grouping, ordering, etc.

Page 11

Patterns & Witnesses (contd.)
What does a pattern do for you?
- generate witnesses against i/p collection
  - one for each matching of pattern against i/p
  - conditions must be respected
  - (sub)structure preserved in o/p
- e.g., witness trees for pattern P1 – one tree for each author of each book published before 2000, showing year & author
  - book-author link may be transitive in i/p but is necessarily direct in o/p
- source trees = trees witnesses “came from”
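The witness generation described above can be sketched in Python as follows. This is an illustrative toy, not the paper's formal definition: the `Node` and `PNode` classes, the function names, and the sample data (numeric years only) are assumptions made for the example.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Callable, List


@dataclass
class Node:
    tag: str
    content: str = ""
    children: List["Node"] = field(default_factory=list)


@dataclass
class PNode:
    """Pattern node: a label such as "$1", a predicate, and typed edges."""
    label: str
    pred: Callable[[Node], bool]
    edges: list = field(default_factory=list)  # (axis, PNode); axis is "pc" or "ad"


def descendants(n):
    for c in n.children:
        yield c
        yield from descendants(c)


def embeddings(p, n):
    """All bindings {label: data node} that map pattern p onto data node n."""
    if not p.pred(n):
        return []
    per_edge = []
    for axis, q in p.edges:
        pool = n.children if axis == "pc" else descendants(n)
        per_edge.append([b for m in pool for b in embeddings(q, m)])
    out = []
    for combo in product(*per_edge):  # one choice of match per pattern edge
        binding = {p.label: n}
        for b in combo:
            binding.update(b)
        out.append(binding)
    return out


def witness(p, binding):
    """Witness tree: matched nodes only; every pattern edge becomes direct."""
    n = binding[p.label]
    return Node(n.tag, n.content, [witness(q, binding) for _, q in p.edges])


def witnesses(p, tree):
    nodes = [tree, *descendants(tree)]
    return [witness(p, b) for n in nodes for b in embeddings(p, n)]


# Pattern P1 from the slides; the data tree is a simplified stand-in.
p1 = PNode("$1", lambda n: n.tag == "book", [
    ("pc", PNode("$2", lambda n: n.tag == "year" and int(n.content) < 2000)),
    ("ad", PNode("$3", lambda n: n.tag == "author")),
])
bib = Node("bib", children=[
    Node("book", children=[
        Node("year", "1910"),
        Node("title", "Principia Mathematica"),
        Node("author", children=[Node("name", "Whitehead")]),
        Node("author", children=[Node("name", "Russell")]),
    ]),
    Node("book", children=[Node("year", "2005"), Node("author")]),
])
w = witnesses(p1, bib)
```

Here `w` holds one witness per (book, year, author) match, and each witness contains only the matched nodes, so the author's `name` child is not carried along.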

Page 12

Example Database
(figure: one data tree; startPos shown in brackets where it can be read off)
bib
  book [2]
    year [3]: 1910
    title [4]: Principia Mathematica
    author [5]
      name
        first: Alfred
        mid: North
        last: Whitehead
      deg: Sc.D., FRS
    author [12]
      name
        first: Bertrand
        last: Russell
      deg: M.A., FRS
  book [19]
    author
      name: Panini
    title: Ashtadhyayi (First book on Sanskrit Grammar)
    year [22]: 560 BC
Only startPos shown.

Page 13

What should selection return?
- trees where a match occurred?
  - poor granularity when DB = one big document tree; e.g., select books authored by John Grisham → the whole bib tree!
- only distinguished nodes (as XPath)?
  - don’t get all info. that you want.
- witness trees – right level of abstraction and info. extraction.
  - may enhance: e.g., relatives of selected nodes might be of interest too. Descendants – most useful case.

Page 14

Example Operators – Selection
- Input: collection; parameters: pattern, selection list (pattern nodes)
- Example
  - pattern P1 and empty SL: same witness trees as before
  - pattern P1 with SL = {$1}: whole book subtrees (i.e. retain $1’s descendants)
- In general, each i/p tree yields zero or more o/p trees
- Could retain other “relatives” instead (e.g., siblings)

Page 15

Selection with P1 (empty SL)
(figure: three witness trees; startPos in brackets)
- book [2]: year [3] = 1910, author [5]
- book [2]: year [3] = 1910, author [12]
- book [19]: author [21], year [22] = 560 BC
Whole author subtree included when SL = {$3}.

Page 16

What should projection do?
- Unlike relational model, selection is not purely “horizontal” (so, can’t expect pure “vertical” for project).
- Can one op serve both roles?
  - Select finds match witnesses (localized)
  - Want project to retain all (named) nodes satisfying some predicates in a given source tree no matter how you match the pattern
- The two ops are still orthogonal

Page 17

Example operators – Projection
- Input: collection; parameters: pattern, projection list
- Example
  - Pattern P1 w/ PL = {$1, $2, $3}: one tree for each book published before 2000, showing year and author(s)
  - Pattern P1 w/ PL = {$3}: one tree for each author of aforementioned books
- ‘*’ in PL causes descendants to be retained
- Also a one-to-zero-or-more op (for reasons diff. from select)

Page 18

Projection: P1 w/ PL = {$1,$2,$3}
(figure: two output trees; startPos in brackets)
- book [2]: year [3] = 1910, author [5], author [12]
- book [19]: author [21], year [22] = 560 BC
With $3*, we can include whole author subtrees.
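To contrast projection with the per-match witnesses of selection, here is a hedged Python sketch: projection keeps every node bound to a listed pattern label by any match within a source tree, and drops everything else. The classes and function names are hypothetical, the `*` option of the projection list is not modeled, and the data is a simplified stand-in for the example database.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Callable, List


@dataclass
class Node:
    tag: str
    content: str = ""
    children: List["Node"] = field(default_factory=list)


@dataclass
class PNode:
    label: str
    pred: Callable[[Node], bool]
    edges: list = field(default_factory=list)  # (axis, PNode) pairs


def descendants(n):
    for c in n.children:
        yield c
        yield from descendants(c)


def embeddings(p, n):
    """All bindings {label: data node} matching pattern p at data node n."""
    if not p.pred(n):
        return []
    per_edge = [[b for m in (n.children if axis == "pc" else descendants(n))
                 for b in embeddings(q, m)] for axis, q in p.edges]
    return [{p.label: n, **{k: v for b in combo for k, v in b.items()}}
            for combo in product(*per_edge)]


def project(p, pl, tree):
    """Keep every node bound to a label in pl by any embedding; drop the rest."""
    keep = {id(b[label]) for n in [tree, *descendants(tree)]
            for b in embeddings(p, n) for label in pl}

    def prune(n):
        kids = [k for c in n.children for k in prune(c)]
        # A dropped node hands its surviving descendants to its parent.
        return [Node(n.tag, n.content, kids)] if id(n) in keep else kids

    return prune(tree)  # a list: zero, one, or several output trees


p1 = PNode("$1", lambda n: n.tag == "book", [
    ("pc", PNode("$2", lambda n: n.tag == "year" and int(n.content) < 2000)),
    ("ad", PNode("$3", lambda n: n.tag == "author")),
])
book = Node("book", children=[
    Node("year", "1910"),
    Node("title", "Principia Mathematica"),
    Node("author", children=[Node("name", "Whitehead")]),
    Node("author", children=[Node("name", "Russell")]),
])
whole = project(p1, ["$1", "$2", "$3"], book)   # one book tree with both authors
authors_only = project(p1, ["$3"], book)        # one tree per author
```

Unlike selection, which would return two witnesses for this book, `whole` is a single tree holding the year and both authors. That is the sense in which projection works per source tree "no matter how you match the pattern".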

Page 19

Selection vs. Projection Example
Selection:
FOR $b IN document("doc.xml")//book
FOR $y IN $b/year[data() < 2000]
FOR $a IN $b//author
RETURN <book> {$y} {$a} </book>
versus projection:
FOR $b IN document("doc.xml")//book[/year/data() < 2000 & author]
RETURN <book> {$b/year} {$b/author} </book>

Page 20

Tree Value Functions (TVF)
- What are they?
  - Primitive recursive functions on structure of source trees
  - Codomain must be ordered
- Where are they used?
  - grouping, ordering, aggregation, etc.
- Here is an example:
  - f: T → value of author, number of authors, tuple of authors, {author tuple, title}, etc.
- Complete example coming up …

Page 21

Example operators – grouping
- Input: collection; parameters: pattern, grouping TVF, ordering TVF.
- Example
  - input: collection of books
  - pattern:
        $1
    pc /  \ ad
      $2    $3
            | pc
            $4
    $1.tag = book & $2.tag = title & $3.tag = author & $4.tag = name
  - f_g(T) = “$4.content”
  - f_o(T) = “$2.content”

Page 22

Grouping (contd.)
Here is what the o/p looks like:
tax_group_root
  tax_group_basis
    author
      name
  tax_group_subroot
    book
    book
    …
-- books ordered by title in each group
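A small Python sketch of this grouping example, under simplifying assumptions: the pattern match is hard-coded (every author descendant with a name child), the grouping TVF is the name's content, and the ordering TVF is the title's content. The helper names and sample books are invented; sorting the groups by key is a choice made here only to get deterministic output, since collections are unordered.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    tag: str
    content: str = ""
    children: List["Node"] = field(default_factory=list)


def descendants(n):
    for c in n.children:
        yield c
        yield from descendants(c)


def child_content(n, tag):
    """Content of the first child with the given tag (assumed to exist)."""
    return next(c.content for c in n.children if c.tag == tag)


def group_books(books):
    """Group books by author name (f_g) and order each group by title (f_o)."""
    groups = {}
    for b in books:
        for a in descendants(b):
            if a.tag == "author":
                groups.setdefault(child_content(a, "name"), []).append(b)
    out = []
    for name in sorted(groups):
        members = sorted(groups[name], key=lambda b: child_content(b, "title"))
        basis = Node("tax_group_basis",
                     children=[Node("author", children=[Node("name", name)])])
        subroot = Node("tax_group_subroot", children=members)
        out.append(Node("tax_group_root", children=[basis, subroot]))
    return out


def make_book(title, *names):
    authors = [Node("author", children=[Node("name", n)]) for n in names]
    return Node("book", children=[Node("title", title), *authors])


books = [make_book("B-trees", "Ann", "Bob"), make_book("Algebra", "Ann")]
groups = group_books(books)
```

A book with two authors appears in two groups, so grouping here does not partition the input collection.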

Page 23

Other operators
- Derived operators – various joins.
- Set operations:
  - When are two data trees the “same”?
  - Equality (shallow/deep) vs. isomorphism (include pedigree or not?)
  - Multiset versions of operators
- Aggregation, Reordering, Renaming.
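One possible reading of deep equality that ignores pedigree, sketched in Python. The canonical-key approach, the `Node` fields, and the sample authors are assumptions made for illustration; the slides leave the exact notion of "same" open.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    tag: str
    content: str = ""
    pedigree: Optional[int] = None  # stand-in for the full pedigree triple
    children: List["Node"] = field(default_factory=list)


def value_key(n):
    """Canonical key for deep, order-sensitive equality; pedigree is ignored."""
    return (n.tag, n.content, tuple(value_key(c) for c in n.children))


def dedup_by_value(trees):
    """Keep the first tree of each group of value-equal trees."""
    seen, out = set(), []
    for t in trees:
        k = value_key(t)
        if k not in seen:
            seen.add(k)
            out.append(t)
    return out


authors = [
    Node("author", pedigree=5, children=[Node("name", "Ann", 6)]),
    Node("author", pedigree=12, children=[Node("name", "Ann", 13)]),
    Node("author", pedigree=21, children=[Node("name", "Bob", 22)]),
]
unique = dedup_by_value(authors)
```

The two "Ann" trees differ in pedigree but are equal by value, so only the first survives. Comparing pedigrees instead would keep all three.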

Page 24

Joins
E = SELECT:
  $1
  |
  $2
  $1.tag=book & $2.tag=publisher, SL=$2
F = SELECT:
  $3
  | ad
  $4
  $3.tag=book & $4.tag=author, SL=$4
G = (F x E)
H = SELECT:
    $5
   /  \
  $6    $7
  $5.tag=tax_prod_root & $6.tag=book & $7.tag=book & $6.pedigree=$7.pedigree, SL=$6, $7.
- we joined on pedigrees.
- could have joined on publisher city = author city instead, if desired.
- can express a variety of outerjoins easily.
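The product-then-select construction above can be sketched in Python. This is a toy model: pedigrees are reduced to a single integer, `tax_product` and `join_on_pedigree` are hypothetical names, and the sample trees are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    tag: str
    pedigree: Optional[int] = None  # stand-in for (docId, startPos:endPos, level)
    children: List["Node"] = field(default_factory=list)


def tax_product(left, right):
    """Pair every left tree with every right tree under a new root."""
    # The new root is not a copy of an input node, so its pedigree is null.
    return [Node("tax_prod_root", None, [l, r]) for l in left for r in right]


def join_on_pedigree(left, right):
    """Select product trees whose two book children share a pedigree."""
    return [t for t in tax_product(left, right)
            if t.children[0].pedigree is not None
            and t.children[0].pedigree == t.children[1].pedigree]


# E: books with their publisher; F: one tree per (book, author) match.
E = [Node("book", 2, [Node("publisher", 9)]),
     Node("book", 19, [Node("publisher", 23)])]
F = [Node("book", 2, [Node("author", 5)]),
     Node("book", 2, [Node("author", 12)]),
     Node("book", 19, [Node("author", 21)])]
H = join_on_pedigree(F, E)
```

Each tree in `F` pairs with exactly one tree in `E`, so `H` has three results out of six product trees. A real implementation would avoid materializing the full product.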

Page 25

XQuery Translation Example 1
FOR $b IN document("doc.xml")//book[//author/@hobby="tennis"]
RETURN
<sportydiveshbook>
  $b/title
  IF SOME $a IN $b//author SATISFIES $a/data() = "divesh" THEN $b//author
</sportydiveshbook>

Page 26

Example 1 (contd.)
outer pattern tree:
  $1
  | ad
  $2
  $1.tag=book & $2.tag=author & $2.hobby=tennis
inner pattern tree:
  $3
  | ad
  $4
  $3.tag=book & $4.tag=author & $4.content=divesh

Page 27

Example 1 (contd.)
- SELECT input DB w/ outer pattern and empty SL;
- Take Cartesian product with entire input DB;
- SELECT result w/ combined inner+outer pattern and join condition:
  (figure: pattern tree over nodes $5, $6, $7, $8, $9, $10)
  $5.tag=tax_prod_root & $6.tag=book & $7.tag=author & $8.tag=book & $8.pedigree=$6.pedigree & $9.tag=author & $9.content=divesh & $10.tag=title
What is wrong with this translation?

Page 28

Example 1 (contd.)
Pre-IF part E: select w/
  $1
  | ad
  $2
  $1.tag=book & $2.tag=author & $2.hobby=tennis
  SL = $1*
then project w/
  $3
  |
  $4
  $3.tag=book & $4.tag=title
  PL = $3, $4
Additional duplicate elimination needed if we don’t know title is unique per book.

Page 29

Example 1 (contd.)
IF part F: select w/
  $5
  | ad
  $6
  $5.tag=book & $6.tag=author & $6.content = divesh
  SL = $5*
then project w/
  $7
  | ad
  $8
  $7.tag=book & $8.tag=author
  PL = $7, $8

Page 30

Example 1 (contd.)
- Do a left outerjoin of E with F w/ the condition $3 = $7 (What does this really entail?)
        tax_prod_root
         /        \
      book        book . . .
       |         /  ...  \
     title   author    author
- Project w/ pattern $9, $9.tag != book, PL = $9
- Rename tax_prod_root → sportydiveshbook.

Page 31

Example 2
FOR $a IN distinct(document("bib.xml")//author)
RETURN
<authorpubs>
  {$a}
  {FOR $b IN document("bib.xml")//article
   WHERE $a = $b/author
   RETURN $b/title}
</authorpubs>

Page 32

Example 2 (contd.)
- select/project authors and dup-elim.
- join with books based on (pedigree-) equality of book nodes. (So, what should the selection pattern look like?)
- Group by author pedigree.
- Do a project, retaining only author and title.
- Do a final renaming, if needed.

Page 33

Main Results
- Duplicate elimination by value can be expressed in TAX.
- The operators in TAX are independent.
- TAX is complete for relational algebra w/ aggregation.
- TAX can capture the fragment of XQuery FLWR expressions w/o function calls, recursion, w/ all path expressions using only constants, wildcards, and / & //, when no new ancestor-descendant relationships are created.

Page 34

Optimization Examples
- Revisit translation example 1: E can be simplified to – project w/
      $1
     /  \
   $2    $3
  $1.tag=book & $2.tag=author & $2.hobby=tennis & $3.tag=title
  PL = $1, $3
- Similar simplification applies to F
- Self-join can sometimes be eliminated
- Associativity, commutativity issues

Page 35

Implementation
- TIMBER system at Univ. of Michigan
- Find pattern tree matches via
  - Index scans
  - Full scans
  - Twig joins
- Joins implemented on streams
- Pedigree – implemented as position of element within document
- Pedigrees similar to RID at impl. level
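The slides do not say which join algorithm TIMBER uses. As one illustration of a join over position streams, here is a Python sketch of a stack-based structural join over (startPos, endPos) intervals. It assumes both input lists are sorted by startPos, come from the same document, and are properly nested; the function name and sample intervals are hypothetical.

```python
def structural_join(ancestors, descendants):
    """Return (ancestor, descendant) pairs from two start-sorted interval streams."""
    out, stack, i = [], [], 0
    for d in descendants:
        # Consume ancestor candidates that begin before d.
        while i < len(ancestors) and ancestors[i][0] < d[0]:
            a = ancestors[i]
            while stack and stack[-1][1] < a[0]:
                stack.pop()  # ended before a starts, so it cannot contain a
            stack.append(a)
            i += 1
        while stack and stack[-1][1] < d[0]:
            stack.pop()      # ended before d starts, so it cannot contain d
        # Everything left on the stack is a nested chain of intervals containing d.
        out.extend((a, d) for a in stack)
    return out


# Invented positions: two books and three authors in document order.
books = [(2, 18), (19, 24)]
authors = [(5, 11), (12, 17), (20, 21)]
pairs = structural_join(books, authors)
```

Each input is scanned once, which is what makes this style of join attractive for streams of elements read in document order.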

Page 36

Summary & Future Work
- TAX – extension of RA for handling heterogeneous collections of ordered labeled trees
  - Simplicity; few more operators
  - Recognize selective importance of order and handle elegantly
- Bulk algebra for efficient implementation of XML querying
- Stay tuned for TIMBER release(s)
- Future
  - Arbitrary restructuring: copy-and-paste
  - Updates: principled via operators

Page 37

More translation examples – ex3
FOR $b IN document("mybib.xml")//book,
    $a IN $b/author
WHERE $b/year < 1990 AND $a/@hobby="tennis"
RETURN
<result>
  {$b//publisher}
  {$a/affiliation}
</result>
What’s a generic way to translate such queries into TAX?

Page 38

More translation examples – ex3 (contd.)
Components of the query, each with its own expression:
- Eforwhere: the FOR and WHERE clauses
- Ereturn1: {$b//publisher}
- Ereturn2: {$a/affiliation}
- Efinal: the complete <result>
Identify major components in query statement and associate expressions with each. Expressions developed in cascade. Each uses its own pattern (tree).

Page 39

More translation examples – ex3 (contd.)
Pattern Pforwhere, used for creating Eforwhere:
      $b
     /  \
   $a    $y
  $b.tag=book & $a.tag=author & $y.tag=year & $a.hobby="tennis" & $y.content<1990
E0 = SELECT_{Pforwhere, {}}(mybib.xml);
E1 = PROJECT_{P’forwhere, {$b,$a}}(E0);
Eforwhere = DE_{P’forwhere, {$b,$a}}(E1);
P’forwhere – same as Pforwhere, except $y is dropped.
Why need project? Why need DE?

Page 40

More translation examples – ex3 (contd.)
Pattern used for creating Ereturn1: a tree over nodes $tpr, $b, $a, $b’, $a’, $p (one edge is labeled ad), with conditions
$tpr.tag=tax_prod_root & $b.tag=$b’.tag=book & $a.tag=author & $p.tag=publisher & $b’.pedigree=$b.pedigree & $a’.tag=author & $a’.pedigree=$a.pedigree
Why did we impose pedigree equality?

Page 41

More translation examples – ex3 (contd.)
Ereturn1 is created via left outer-join, Project; DE; followed by GROUP-BY.

Page 42

More translation examples – ex3 (contd.)
E2 = LOJ_{P_LG, {$p}}(Eforwhere, mybib.xml);
E3 = PD_{P’_LG, {$b,$a,$p*}}(E2);
Ereturn1 = GP_{P’’_LG, {$b,$a}, {$p*}}(E3);

Page 43

More translation examples – ex3 (contd.)
Efinal = PJ_{P_PJ, {$r, $p*, $l*}}(Ereturn1, Ereturn2);

Page 44

General translation remarks
- LET clause handled as correlated subquery; E_LET left outer-joined with E_FORWHERE (just like E_RETURNi).
- Ordering by pedigree (i.e., as in original input) already captured.
- Ordering by other means doable.
- Aggregation – straightforward.
- Nested queries (with correlated subqueries) – handled by rewriting them so the query conforms to: (FOR LET)*RETURN where WHERE clause and ORDER-BY are implicit.