nov 2, 2001the storage and benchmarking of xml1 presenter: kevin see ([email protected]) ibm toronto...

53
Nov 2, 2001 The Storage and Benchmarking of XML 1 The Storage and Benchmarking of XML Presenter: Kevin See ([email protected]) IBM Toronto Lab. DB2 SQL /Catalog Development Date: Nov 2, 2001 <?XML?>

Upload: tracy-nicholes

Post on 11-Dec-2015

216 views

Category:

Documents


1 download

TRANSCRIPT

Nov 2, 2001The Storage and Benchmarking of

XML 1

The Storage and Benchmarking of XML

Presenter: Kevin See([email protected])

IBM Toronto Lab. DB2 SQL /Catalog Development

Date: Nov 2, 2001

<?XML?>

Nov 2, 2001The Storage and Benchmarking of

XML 2

Outline Introduction Text file / OODBMS/ native DB

approach Relational DB approaches

Categories 2 latest proposals

XML benchmarks Conclusion

Nov 2, 2001The Storage and Benchmarking of

XML 3

Introduction XML is emerging to become the

standard for data exchanging Demand for storage and

management of the XML documents is growing

There are a few ways to manage the XML document

Nov 2, 2001The Storage and Benchmarking of

XML 4

Text Approach File system A separate query engine will need to be

implemented Parsing not possible Index strategies : (parent_offset, tag),

(child_offset, parent_offset), (tagname, value), (attribute_name, attribute_value)

not good for update

Nov 2, 2001The Storage and Benchmarking of

XML 5

Object-oriented Database Management System Michael R. Olson and Byung S. Lee

(1997) OO model fit well Immature technology: Hard to

scale Conclude the experiment without

any great success.

Nov 2, 2001The Storage and Benchmarking of

XML 6

Native Database Approach Prototypes: Lore (Stanford

University), Xyleme (INRIA, France).

Immature technology No optimization capabilities

Nov 2, 2001The Storage and Benchmarking of

XML 7

Relational Database Technology Very mature technology Query optimization techniques and

the processing mechanisms in relational databases have been studied for a quarter of a century

A very large percentage of the data are currently stored in RDMS

Nov 2, 2001The Storage and Benchmarking of

XML 8

Storage and Retrieval of XML Using Relational Database

XML

Table1

Table1

Relational to XMLConversion

XML to Relational Mapping

XML Query to SQL SQL

XML

XMLQueryLanguagesuch asXPath,XQuery,Quilt, XQL

Nov 2, 2001The Storage and Benchmarking of

XML 9

Classifications of Various Mapping Methods Structure-mapping

approach The XML document’s

logical structures (or DTDs if available) are represented by the database schemas

1 DTD : 1 set of generated schemas

Model-mapping approach Constructs of XML

model are represented by the database schemas

1 set of generated schemas for all/any DTD

Nov 2, 2001The Storage and Benchmarking of

XML 10

Relational Schema Prototype Tree Mapping Method M. Yan and A. Fu @ The Chinese

University of Hong Kong (2001) Structure-mapping Global Schema Extraction

Algorithm DTD Splitting Schema Extraction

Algorithm

Nov 2, 2001The Storage and Benchmarking of

XML 11

Relational Schema Prototype Tree Mapping Method (Cont’d) Relational Databases for Querying XML

Documents: Limitations and Opportunities (J. Shanmugasundaram)

Basic steps:1. Simplify DTD2. Construct schema prototype tree3. Generate relational schema prototypes4. Detect functional dependencies and

candidate keys5. Normalize the relational schema

prototypes

Nov 2, 2001The Storage and Benchmarking of

XML 12

DTD Splitting Schema Extraction Algorithm Step 1: Simplify DTDp|p' p, p'p+ p*(p, p') p, p'..., p,..., p*,... p*p? p(p, p’)* p*, p’*..., p,..., p,... p*

Nov 2, 2001The Storage and Benchmarking of

XML 13

<!ENTITY %txt “(#PCDATA)”>

<!ELEMENT book (booktitle, price?, author, authority*)>

<!ELEMENT authority (authname, country)>

<!ELEMENT authname %txt>

<!ELEMENT country %txt>

<!ELEMENT booktitle %txt>

<!ELEMENT price %txt>

<!ELEMENT monograph (title, author, editor)>

<!ELEMENT title %txt>

<!ELEMENT editor (monograph+)>

<!ATTLIST editor name CDATA #REQUIRED>

<!ELEMENT author (name, address)>

<!ATTLIST author id ID>

<!ELEMENT name (firstname, lastname)>

<!ELEMENT firstname %txt>

<!ELEMENT lastname %txt>

<!ELEMENT address %txt>

An Book DTD

Nov 2, 2001The Storage and Benchmarking of

XML 14

<!ELEMENT book (booktitle, price, author, authority*)><!ELEMENT authority (authname, country)><!ELEMENT authname (#PCDATA)><!ELEMENT country (#PCDATA)><!ELEMENT booktitle (#PCDATA)><!ELEMENT price (#PCDATA)><!ELEMENT monograph (title, author, editor)><!ELEMENT title (#PCDATA)><!ELEMENT editor (monograph*)><!ATTLIST editor name CDATA><!ELEMENT author (name, address)><!ATTLIST author id ID><!ELEMENT name (firstname, lastname)><!ELEMENT firstname (#PCDATA)><!ELEMENT lastname (#PCDATA)><!ELEMENT address (#PCDATA)>

Transformed/ Simplified DTD

Nov 2, 2001The Storage and Benchmarking of

XML 15

Step 2: Construct Schema Prototypes Trees1. Only an element can become a root2. An element that is not nested inside other

elements can become the root3. A non-#PCDATA element that is nested in

more than 1 other element becomes the root4. If a non-#PCDATA element B is not the only

subelement of A and B only appears in A with a “*”, it becomes the root

5. One of the elements in the recursion is selected as root should recursion occurs in the DTD

Nov 2, 2001The Storage and Benchmarking of

XML 16

Roots for the Example DTD

Element book is selected as root – rule 2

Element author is selected as root – rule 3

Element authority is selected as root – rule 4

Element monograph is selected as root – rule 5

<!ELEMENT book (booktitle, price, author, authority*)><!ELEMENT authority (authname, country)><!ELEMENT authname (#PCDATA)><!ELEMENT country (#PCDATA)><!ELEMENT booktitle (#PCDATA)><!ELEMENT price (#PCDATA)><!ELEMENT monograph (title, author, editor)><!ELEMENT title (#PCDATA)><!ELEMENT editor (monograph*)><!ATTLIST editor name CDATA><!ELEMENT author (name, address)><!ATTLIST author id ID><!ELEMENT name (firstname, lastname)><!ELEMENT firstname (#PCDATA)><!ELEMENT lastname (#PCDATA)><!ELEMENT address (#PCDATA)>

Nov 2, 2001The Storage and Benchmarking of

XML 17

Step 2: Construct Schema Prototypes Trees (Cont’d) Tree construction:

Depth-first scan on DTD for all selected root(s) starting from the subelements of the root

New nodes for each visited elements and attributes

A mixed element (element containing both #PCDATA and other subelement) will be marked with a “#” in the tree

Recursion – a new leaf node with label <node name>.A

Nov 2, 2001The Storage and Benchmarking of

XML 18

Schema Prototype Trees

Nov 2, 2001The Storage and Benchmarking of

XML 19

Step 3: Generate Relational Schema Prototype All necessary descendants are

inlined starting from the root except key nodes or foreign key nodes.

Nov 2, 2001The Storage and Benchmarking of

XML 20

Relational Schema Prototype

Book (booktitle, price)Authority (country, authname)Author (address, id, firstname, lastname)Monograph (title, name)

Nov 2, 2001The Storage and Benchmarking of

XML 21

Step 4: Discover FDs and Candidate Keys Functional dependencies (FDs) and

the candidate keys discovery by analyzing the XML data

TANE algorithm (http://www.cs.helsinki.fi/research/fdk/datamining/tane/)

Nov 2, 2001The Storage and Benchmarking of

XML 22

Candidate Keys

Book {booktitle}Authority {country, authname}Monograph {title}Author {id}, {lastname, address}

Nov 2, 2001The Storage and Benchmarking of

XML 23

Relational Schema Prototype With Candidate Keys

Book (booktitle, price, author.id)Authority (country, authname, assigned, book.booktitle)Author (address, id, firstname, lastname)Monograph (title, name, author.id, monograph.title)Book (booktitle, price)

Authority (country, authname)Author (address, id, firstname, lastname)Monograph (title, name)

Nov 2, 2001The Storage and Benchmarking of

XML 24

Step 5: Normalize the Relational Schema Prototypes The last step. Normalize the schema to 3NF

(third normal form) if possible. Structure mapping methods does

not handle order but leave it to metadata or user to handle.

Nov 2, 2001The Storage and Benchmarking of

XML 25

X-Rel

Masatoshi Yoshikawa, Toshiyuki Amagasa, Takeyuki Shimura and Shunsuke Uemura @ Nara Institute of Science and Technology, Japan (2001)

Model-mapping Data model: XPath (root node,

element nodes, attribute nodes, and text nodes)

The concept of region

Nov 2, 2001The Storage and Benchmarking of

XML 26

Definition of Region

The region of: An element node or a text node is a

pair of numbers representing the start and end positions of the node in the XML document

An attribute node is a pair of identical numbers equal to the start position of the parent element node plus one

Nov 2, 2001The Storage and Benchmarking of

XML 27

Simple Path Expressions

Path – an unit of decomposition of XML trees Store simple path expression (denoted by

SimplePathExpr) from the root node. Why? Path is appear in XML queries frequently

Nov 2, 2001The Storage and Benchmarking of

XML 28

Why “#” Is Added?

Look for family descendants of issue.1. WHERE p1.pathexp LIKE

‘/issue%/family’2. WHERE p1.pathexp LIKE ‘#/issue#

%/family’ /issuelist/family (WRONG) is match

for the first but not the second.

Nov 2, 2001The Storage and Benchmarking of

XML 29

<Paper Title = “The Suffix-Signature Method for Searching Phrases in Text”><Authors>

<FN> Mei </FN><LN> Zhou </LN>

<Affiliation> Open Text Corporation </Affiliation> </Authors>

<Authors> <FN> Frank </FN><LN> Tompa </LN>

<Affiliation> University of Waterloo </Affiliation> </Authors></Paper>

Example XML Document

Nov 2, 2001The Storage and Benchmarking of

XML 30

ROOT

Paper

Element

attribute

abc

text

string-valueTitle

The Suffix-Signature Method forSearching Phrases in Text

Authors Authors

FN FNLN LN AffiliationAffiliation

Mei Zhou OpenTextCorporation

Frank Tompa UniversityOFWaterloo

1

2

3 4 11

5 7 9 12 14 16

6 8 10 13 15 1711

XML Tree

Nov 2, 2001The Storage and Benchmarking of

XML 31

Simple Path Expressions /Regions

Node 3 #/Paper#/@Title (1,1)

Node 9 #/Paper#/

Authors#/affiliation

(99, 145)

ROOT

Paper

Element

attribute

abc

text

string-valueTitle

The Suffix-Signature Method forSearching Phrases in Text

Authors Authors

FN FNLN LN AffiliationAffiliation

Mei Zhou OpenTextCorporation

Frank Tompa UniversityOFWaterloo

1

2

3 4 11

5 7 9 12 14 16

6 8 10 13 15 1711

Nov 2, 2001The Storage and Benchmarking of

XML 32

Mapping Idea

A relational table per node type Simple path expression are normalized docID is introduced Basic XRel schema Element (docID, pathID, start, end, index,

reindex) Attribute (docID, pathID, start, end, value) Text (docID, pathID, start, end, value) Path (pathID, pathexp)

Nov 2, 2001The Storage and Benchmarking of

XML 33

Table - Element

docID pathID start end index reindex

1 1 0 257 1 1

1 3 66 155 1 2

1 4 75 86 1 2

1 5 87 99 1 2

1 6 100 145 1 2

1 3 156 249 2 1

1 4 165 178 2 1

1 5 179 192 2 1

1 6 193 239 2 1

Nov 2, 2001The Storage and Benchmarking of

XML 34

Table - Attribute

docID pathID start end value

1 2 1 1 The Suffix-Signature Method for Searching Phrases in Text

Nov 2, 2001The Storage and Benchmarking of

XML 35

Table - Text

docID

pathID

start

end value

1 4 79 81 Mei

1 5 91 94 Zhou

1 6 113 131 Open Text Corporation

1 4 169 173 Frank

1 5 183 187 Tompa

1 6 206 225 University of Waterloo

Nov 2, 2001The Storage and Benchmarking of

XML 36

Table - Path

pathID pathexpr

1 #/Paper

2 #/Paper/@Title

3 #/Paper#/Authors

4 #/Paper#/Authors#/FN

5 #/Paper#/Authors#/LN

6 #/Paper#/Authors#/Affiliation

Nov 2, 2001The Storage and Benchmarking of

XML 37

XML Benchmarking Desiderata

Bulk loading Reconstruction Path traversals Casting Missing elements

Nov 2, 2001The Storage and Benchmarking of

XML 38

XML Benchmarking Desiderata (continued) Ordered access References Joins Construction of large results Containment, full-text search

Nov 2, 2001The Storage and Benchmarking of

XML 39

XML Benchmarks

3 XML benchmark proposals. XMach-1.

University of Leipzig, Germany. XML benchmark project.

CWI, the Netherlands. Kanda et al. Proposal (unpublished).

University of Michigan, IBM Toronto lab center for advanced studies.

Nov 2, 2001The Storage and Benchmarking of

XML 40

Conclusion

From the different mapping approaches and experiments, there are a few places where relational database enhancement can help in coping with XML model differences. Support for sets. Flexible comparisons operators. Multi-predicate merge join.

Nov 2, 2001The Storage and Benchmarking of

XML 41

Questions & Answers

Nov 2, 2001The Storage and Benchmarking of

XML 42

Appendix A

Enhancing Structural Mappings Based on Statistics

Nov 2, 2001The Storage and Benchmarking of

XML 43

Optimal Hybrid Database Algorithm M. Klettke, and H. Meyer XML and object-relational database

systems - enhancing structural mappings based on statistics (2000)

An algorithm that finds a type of optimal mapping based on the statistics and the DTD

Nov 2, 2001The Storage and Benchmarking of

XML 44

Optimal Hybrid Database Algorithm

1. Build a graph representing the hierarchy of the elements and attributes of the DTD.

2. For every element/attribute of the graph, a measure of significance, w, is determined.

3. Derive the resulting database design from the graph.

Nov 2, 2001The Storage and Benchmarking of

XML 45

Nov 2, 2001The Storage and Benchmarking of

XML 46

Graph for an Example DTD

Nov 2, 2001The Storage and Benchmarking of

XML 47

Calculate the Weight (Step 2)

W = 1/6 (SQ + SA + SH) + ¼ (DA/DG) + ¼ (QA/QG) where SQ - exploitation of quantifiers SA - exploitation of alternatives SH - position in the hierarchy DA -number of documents containing the element/attribute DG - absolute number of XML documents QA - number of queries containing the element/attribute QG - absolute number of queries

Nov 2, 2001The Storage and Benchmarking of

XML 48

The Graph With the Colored Weight

Nov 2, 2001The Storage and Benchmarking of

XML 49

Step 3 - Deriving Hybrid Databases From the Graph First, specify a limit on which

attributes and/or elements is represented as attributes of the databases and which attributes and/or element are represented as XML attributes

Nov 2, 2001The Storage and Benchmarking of

XML 50

Step 3 - Deriving Hybrid Databases From the Graph Then, search for all nodes of the

graph that satisfy the following conditions:

The node is not a leaf of the graph The node and all its descendants

are below the limit given No predecessor that satisfies the

first two conditions exists.

Nov 2, 2001The Storage and Benchmarking of

XML 51

Step 3 - Deriving Hybrid Databases From the Graph The selected nodes and its

descendents (the whole sub-graph) will be replaced by an XML attribute. (A BLOB like attribute)

All other elements and attributes will be mapped to relational database using mapping.

Nov 2, 2001The Storage and Benchmarking of

XML 52

Resulting XML Attributes for the Example DTD

Nov 2, 2001The Storage and Benchmarking of

XML 53

Referenceshttp://www.geocities.com/ysee.geo/ml.htmlhttp://db-www.aist-nara.ac.jp/members/Yoshikawa/paper/TOIT2001.pdfhttp://citeseer.nj.nec.com/454884.htmlhttp://dol.uni-leipzig.de/pub/2001-1/enhttp://www.cs.wisc.edu/niagara/papers/xmlstore.pdfhttp://www.cwi.nl/htbin/ins1/publications?request=papers