
Post on 20-Dec-2015


1

Managing XML and Semistructured Data

Part 1: Preliminaries, Motivation and Overview

Acknowledgement: Some of the material in this set of XML slides is taken from Prof. Dan Suciu’s course materials, used with his permission.

COMP630L Topics in DB Systems: Managing Web Data, Fall 2007

Dr Wilfred Ng

2

HTML, XML, SGML

XML: a W3C standard to complement HTML
Origins: structured text (SGML)
Motivation:
• HTML describes presentation
• XML describes content

http://www.w3.org/TR/2000/REC-xml-20001006 (XML 1.0, second edition, 10/2000)

[Diagram: SGML, XML, HTML 4.0]

3

HTML

<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i>
    Abiteboul, Hull, Vianu
    <br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
    Abiteboul, Buneman, Suciu
    <br> Morgan Kaufmann, 1999

HTML describes the presentation

4

XML

<bibliography>
  <book>
    <title> Foundations… </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
</bibliography>

XML describes the content

5

SGML and XML

eXtensible Markup Language: XML
XML 1.0: a recommendation from W3C, 1998
XML-related standards from W3C groups: XSL, XQuery, …
Roots: SGML [http://www.w3.org/MarkUp/SGML/]
• Markup = encoding: possibilities for expressing information
• Information modelling freedom, reusability, provability, validity
• But a very nasty language
• SGML is an international standard for device-independent, system-independent methods of representing texts in electronic form
After the roots: a format for sharing data (on the Web)

6

Basic XML Terminology

tags: book, title, author, …
start tag: <book>, end tag: </book>
elements: <book>…</book>, <author>…</author>
elements are nested
empty element: <red></red>, abbreviated <red/>
an XML document: a single root element
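The terminology can be checked against a parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree; the sample document is an assumption built from the bibliography example, with a <red/> element added only to illustrate the empty-element shorthand.

```python
import xml.etree.ElementTree as ET

# Sample document (an assumption for illustration): one root, nested elements,
# and an empty element written in its abbreviated form.
doc = """<bibliography>
  <book>
    <title>Foundations of Databases</title>
    <author>Abiteboul</author>
    <author>Hull</author>
    <author>Vianu</author>
    <red/>
  </book>
</bibliography>"""

root = ET.fromstring(doc)            # a well-formed document has exactly one root element
book = root.find("book")             # elements are nested: <book> is a child of the root
tags = [child.tag for child in book]
authors = [a.text for a in book.findall("author")]
empty = book.find("red")             # <red/> is shorthand for <red></red>
is_empty = empty.text is None and len(empty) == 0

print(root.tag, tags, authors, is_empty)
```

A document with two top-level elements would be rejected by ET.fromstring, which is the single-root rule in practice.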

7

The Role of XML Data

XML is designed for data exchange, not to replace relational or E/R data

Sources of XML data:
• Created manually with text editors: not really data
• Generated automatically from relational data
• Text files, replacing older data formats: Web server logs, scientific data (biological, astronomical)
• Stored/processed in native XML engines: very few applications need that today

8

XML Advantages for Web Data

Over SGML
• Supported by mainstream browsers such as IE and Netscape
• Standard stylesheet
• Standard linking

Over HTML
• Interchangeable
• Searchable
• Reusable
• Enables automation

9

Why XML is of Interest to Us

HTML fails to meet structure specification: tags for appearance only
XML is a language syntax for data
• Note: we have no language syntax for relational data
• But XML is not relational: semistructured

This is exciting because:
• Can translate any data to XML
• Can ship XML over the Web (HTTP)
• Can input XML into any application
• Thus: make data sharing and exchange on the Web possible!

40% annual growth and accelerating: publishers, government, education, …

10

Relational Data in XML

Relation person:

name    phone
John    3634
Sue     6343
Dick    6363

XML: person

<person>
  <row> <name>John</name> <phone>3634</phone> </row>
  <row> <name>Sue</name>  <phone>6343</phone> </row>
  <row> <name>Dick</name> <phone>6363</phone> </row>
</person>

[Tree: root person with three row children; each row has a name child (“John”, “Sue”, “Dick”) and a phone child (3634, 6343, 6363)]

11

XML Data: more expressive

Missing attributes:

<person> <name>John</name> <phone>1234</phone> </person>
<person> <name>Joe</name> </person>      no phone!

Could represent in a table with nulls:

name    phone
John    1234
Joe     -

12

XML Data: more expressive

Repeated attributes:

<person> <name>Mary</name> <phone>2345</phone> <phone>3456</phone> </person>      two phones!

Impossible in tables:

name    phone
Mary    2345    3456    ???

13

XML Data: more expressive

Attributes with different types in different objects:

<person>
  <name> <first>John</first> <last>Smith</last> </name>
  <phone>1234</phone>
</person>      structured name!

Nested collections (no 1NF)
Heterogeneous collections:
• <db> contains both <book>s and <publisher>s
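The three irregularities on these slides (a missing phone, repeated phones, a structured name) can all sit in one document. A minimal sketch of reading such data in Python follows; the <persons> wrapper element and the describe helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The four <person> elements are taken from the slides; the <persons> wrapper is assumed.
doc = """<persons>
  <person><name>John</name><phone>1234</phone></person>
  <person><name>Joe</name></person>
  <person><name>Mary</name><phone>2345</phone><phone>3456</phone></person>
  <person><name><first>John</first><last>Smith</last></name><phone>1234</phone></person>
</persons>"""

def describe(person):
    name = person.find("name")
    # A name is either plain text or a nested <first>/<last> structure.
    if len(name) == 0:
        full_name = name.text.strip()
    else:
        full_name = " ".join(part.text.strip() for part in name)
    # A person may have zero, one, or many phones.
    phones = [p.text.strip() for p in person.findall("phone")]
    return full_name, phones

people = [describe(p) for p in ET.fromstring(doc)]
print(people)
```

The reader has to branch on the shape of each element, which is the price of the flexibility: no fixed schema guarantees what a person looks like.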

14

XML Data Sharing and Exchange

[Diagram: relational, object-relational, and legacy data sources and applications exchange XML data over the Web (HTTP). Specific data management tasks: Transform, Integrate, Warehouse]

15

From Relational Data to XML Data

Relation persons:

name    phone
John    3634
Sue     6343
Dick    6363

XML: data tree and its document

<persons>
  <row> <name>John</name> <phone>3634</phone> </row>
  <row> <name>Sue</name>  <phone>6343</phone> </row>
  <row> <name>Dick</name> <phone>6363</phone> </row>
</persons>

[Tree: root persons with three row children; each row has a name child (“John”, “Sue”, “Dick”) and a phone child (3634, 6343, 6363)]

16

XML Data

XML is self-describing
Schema elements become part of the data
• Relational schema: persons(name, phone)
• In XML, <persons>, <name>, <phone> are part of the data, and are repeated many times!
Consequence: XML is much more flexible
XML data is semistructured data

17

XML from/to Relational Data

XML publishing:
• relational data → XML
XML storage:
• XML → relational data

Relational data is regular in its XML tree representation, but XML data can be non-relational!

[Diagram labels: Supplier, Supplier, Broker, “Common XML illusion”, “??”]

18

XML Publishing: ideas sketch

[Diagram: a Web application sends XPath/XQuery to the XML publishing layer and receives XML; the publishing layer sends SQL to the relational database and receives tuple streams]

19

XML Publishing

Exporting the data is relatively easy: we do this already for HTML
Translating XQuery → SQL is hard
XML publishing systems:
Research: XPERANTO (IBM/DB2), SilkRoute (AT&T Labs and UW; software previously unavailable, now downloadable)
• XQuery → SQL
Commercial: SQL Server, Oracle
• Only XPath → SQL, and with restrictions
A middleware approach

20

XML Publishing

Backend relational engine
Relational schema:
• Student(sid, name, address)
• Course(cid, title, room)
• Enroll(sid, cid, grade)
Schemas are often proprietary, but XML schemas are public

[Diagram: student, enroll, course]

Will follow the idea in SilkRoute [ACM TODS 27(4), 2002]

21

XML Publishing

<xmlview>
  <course>
    <title> Operating Systems </title>
    <room> MGH084 </room>
    <student> <name> John </name> <address> Seattle </address> <grade> 3.8 </grade> </student>
    <student> … </student>
    …
  </course>
  <course>
    <title> Database </title>
    <room> EE045 </room>
    <student> <name> Mary </name> <address> Shoreline </address> <grade> 3.9 </grade> </student>
    <student> … </student>
    …
  </course>
  …
</xmlview>

• Group by courses: redundant representation of students
• Other representations possible too: tags may not be the same as attributes in general

22

XML Publishing

First thing to do: design the DTD (public for users)

<!ELEMENT xmlview (course*)>
<!ELEMENT course (title,room,student*)>
<!ELEMENT student (name,address,grade)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT room (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT grade (#PCDATA)>

Second thing to do: develop a canonical XML view under the db root for each relation {t1,…,tk} with schema R(A1,…,An), where aij is the value of attribute Aj in tuple ti

<R>
  <row> <A1>a11</A1> … <An>a1n</An> </row>
  …
  <row> <A1>ak1</A1> … <An>akn</An> </row>
</R>
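A minimal sketch of building the canonical view in Python, assuming each relation is available as a list of tuples. The function name canonical_view and the sample cid value "c1" are hypothetical; the element layout follows the <R><row>…</row></R> pattern.

```python
import xml.etree.ElementTree as ET

def canonical_view(relation, attributes, tuples):
    """Build <R><row><A1>..</A1>...<An>..</An></row>...</R> for one relation."""
    rel = ET.Element(relation)
    for t in tuples:
        row = ET.SubElement(rel, "row")
        for attr, value in zip(attributes, t):
            ET.SubElement(row, attr).text = str(value)
    return rel

# All relations hang under a single db root; the tuple below is sample data.
db = ET.Element("db")
db.append(canonical_view("Course", ["cid", "title", "room"],
                         [("c1", "Operating Systems", "MGH084")]))
xml_text = ET.tostring(db, encoding="unicode")
print(xml_text)
```

The view is mechanical: it needs only the relation name, the attribute names, and the tuples, so it can be generated for any schema without design decisions.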

23

<xmlview>
{ FOR $x IN /db/Course/row
  RETURN
    <course>
      <title> { $x/title/text() } </title>
      <room> { $x/room/text() } </room>
      { FOR $y IN /db/Enroll/row[cid/text() = $x/cid/text()],
            $z IN /db/Student/row[sid/text() = $y/sid/text()]
        RETURN
          <student>
            <name> { $z/name/text() } </name>
            <address> { $z/address/text() } </address>
            <grade> { $y/grade/text() } </grade>
          </student>
      }
    </course>
}
</xmlview>

Now we write an XQuery to export relational data → XML.
Note: the result should conform to the DTD (see slide 20 for the relational schema; the generated result is shown in slide 21).
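The view query is a nested loop: one <course> per Course tuple, and inside it a join of Enroll and Student. A rough Python equivalent is sketched below; the sample tuples and the function name publish are assumptions, and a real publishing system would push this work into SQL rather than loop in the application.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample tuples for the Student/Course/Enroll schema.
students = [("s1", "John", "Seattle"), ("s2", "Mary", "Shoreline")]
courses = [("c1", "Operating Systems", "MGH084"), ("c2", "Database", "EE045")]
enroll = [("s1", "c1", "3.8"), ("s2", "c2", "3.9")]

def publish():
    view = ET.Element("xmlview")
    for cid, title, room in courses:               # FOR $x IN /db/Course/row
        course = ET.SubElement(view, "course")
        ET.SubElement(course, "title").text = title
        ET.SubElement(course, "room").text = room
        for e_sid, e_cid, grade in enroll:         # $y: Enroll rows with cid = $x/cid
            if e_cid != cid:
                continue
            for sid, name, address in students:    # $z: Student rows with sid = $y/sid
                if sid != e_sid:
                    continue
                student = ET.SubElement(course, "student")
                ET.SubElement(student, "name").text = name
                ET.SubElement(student, "address").text = address
                ET.SubElement(student, "grade").text = grade
    return view

view = publish()
names_by_course = {c.findtext("title"): [s.findtext("name") for s in c.findall("student")]
                   for c in view}
print(names_by_course)
```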

24

XML Publishing

Query: find Mary’s grade in Operating Systems

XQuery over public view:

FOR $x IN /xmlview/course[title/text()="Operating Systems"],
    $y IN $x/student[name/text()="Mary"]
RETURN <answer> { $y/grade/text() } </answer>

SQL over database schema:

SELECT Enroll.grade
FROM Student, Enroll, Course
WHERE Student.name = 'Mary' and Course.title = 'Operating Systems'
  and Student.sid = Enroll.sid and Enroll.cid = Course.cid

SilkRoute does this automatically
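To see the SQL side run, here is a small sketch with Python's sqlite3 module and an in-memory database. The rows, including Mary's Operating Systems grade of 3.5, are made-up sample data; only the schema and the shape of the query come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student(sid TEXT PRIMARY KEY, name TEXT, address TEXT);
CREATE TABLE Course(cid TEXT PRIMARY KEY, title TEXT, room TEXT);
CREATE TABLE Enroll(sid TEXT, cid TEXT, grade REAL);
INSERT INTO Student VALUES ('s1','John','Seattle'), ('s2','Mary','Shoreline');
INSERT INTO Course VALUES ('c1','Operating Systems','MGH084'), ('c2','Database','EE045');
INSERT INTO Enroll VALUES ('s1','c1',3.8), ('s2','c2',3.9), ('s2','c1',3.5);
""")

# The three-way join that the XQuery over the public view corresponds to.
rows = conn.execute("""
SELECT Enroll.grade
FROM Student, Enroll, Course
WHERE Student.name = ? AND Course.title = ?
  AND Student.sid = Enroll.sid AND Enroll.cid = Course.cid
""", ("Mary", "Operating Systems")).fetchall()
print(rows)
```

The user of the public view never sees Enroll or the sid/cid keys; the translation has to rediscover the join from the nesting of <student> inside <course>.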

25

XML Publishing

How do we choose the output structure?
• Determined by agreement with our partners, or dictated by committees in XML dialects (called applications), which generate DTDs
The DTD is selective wrt the underlying databases
XML data is often nested, irregular, etc.; that’s why we need the xmlview query in slide 23 to do the transformation
No agreed normal forms for XML, but there is some work on it… [M. Arenas and L. Libkin, A Normal Form for XML Documents]

26

XML Storage

Often the XML data is small and is parsed directly into the application (DOM API): file-based management (pros and cons?)
Sometimes it is big, and we need to store it in a database
Much harder than XML publishing (why?)
A fundamental XML storage problem:
• How do we choose the schema of the database?
Possible solutions:
• Schema derived from DTD: structure mapping
• Storing XML as a graph, such as the “Edge relation”: model mapping
• Native approach: native XML/SSD engine

27

XML Storage in a Relational DB

Use generic schema
• [Florescu, Kossmann 1999]
Use DTD to derive schema
• [Shanmugasundaram, et al. 1999]
Use data mining to derive schema
• [Deutsch, Fernandez, Suciu 1999]
Use the Path table
• [T. Amagasa, T. Shimura, S. Uemura 2001]

28

XML Storage: Edge Relation

[Florescu, Kossmann 1999]
Monet [Schmidt et al. WebDB 00]
Structure-based approach
Mapping the tree (DOM tree) to relations
Use generic relational schemas (independent of the XML schema):

Ref(source, label, dest)
Val(node, value)

29

XML Storage: Edge Relation

[Florescu, Kossmann 1999]

[Tree: &o1 has a paper edge to &o2; &o2 has edges title to &o3 (“The Calculus”), author to &o4 (“…”), author to &o5 (“…”), and year to &o6 (“1986”)]

Ref
Source  Label   Dest
&o1     paper   &o2
&o2     title   &o3
&o2     author  &o4
&o2     author  &o5
&o2     year    &o6

Val
Node    Value
&o3     The Calculus
&o4     …
&o5     …
&o6     1986
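A minimal sketch of shredding a document into the generic Ref and Val relations, assuming object ids are handed out in document order starting at &o1 for the root above <paper>. The function name shred is hypothetical, and the “…” author values are kept elided as on the slide.

```python
import itertools
import xml.etree.ElementTree as ET

def shred(xml_text):
    """Return (ref, val) rows for Ref(source, label, dest) and Val(node, value)."""
    counter = itertools.count(1)
    ref, val = [], []

    def visit(element, source):
        dest = f"&o{next(counter)}"
        ref.append((source, element.tag, dest))     # one Ref row per edge
        if len(element) == 0:                       # leaf: its text goes into Val
            val.append((dest, (element.text or "").strip()))
        for child in element:
            visit(child, dest)

    root_oid = f"&o{next(counter)}"                 # &o1: the node above <paper>
    visit(ET.fromstring(xml_text), root_oid)
    return ref, val

ref, val = shred("<paper><title>The Calculus</title><author>…</author>"
                 "<author>…</author><year>1986</year></paper>")
for row in ref:
    print(row)
```

The same two tables hold any document, which is what "generic" means here: nothing in the schema depends on the tags.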

30

XML Storage: Edge Relation

In practice may need more tables for reference links and nodes:

RefTag1(source, dest)
RefTag2(source, dest)
IntVal(node, intVal)
RealVal(node, realVal)

RefTag1
Source  Dest
&o2     &o7
&o5     &o8
&o6     &o8

[Tree: the paper tree of slide 29 (&o1 to &o6, with edges paper, title, author, author, year) extended with nodes &o7 and &o8]

31

XML Storage: DTD to Schema

[Christophides, Abiteboul, Cluet, Scholl 1994]
[Shanmugasundaram, Tufte, He, Zhang, DeWitt, Naughton 1999]

Basic idea: use the XML schema to derive the relational schema

DTD:

<!ELEMENT paper (title, author*, year?)>
<!ELEMENT author (firstName, lastName)>

Relational schemas:

Paper(pid, title, year)
Author(aid, pid, firstName, lastName)
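A sketch of how the derived schema might be loaded, using sqlite3: the optional year? becomes a nullable column and the repeated author* becomes a child table with a foreign key to Paper. The column types, the store helper, and the sample paper are assumptions, not part of the cited mapping algorithms.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Paper(pid INTEGER PRIMARY KEY, title TEXT, year TEXT);   -- year? is nullable
CREATE TABLE Author(aid INTEGER PRIMARY KEY,
                    pid INTEGER REFERENCES Paper(pid),                -- author* is a child table
                    firstName TEXT, lastName TEXT);
""")

def store(paper_xml):
    paper = ET.fromstring(paper_xml)
    # findtext returns None for a missing <year>, which is stored as NULL.
    cur = conn.execute("INSERT INTO Paper(title, year) VALUES (?, ?)",
                       (paper.findtext("title"), paper.findtext("year")))
    pid = cur.lastrowid
    for a in paper.findall("author"):
        conn.execute("INSERT INTO Author(pid, firstName, lastName) VALUES (?, ?, ?)",
                     (pid, a.findtext("firstName"), a.findtext("lastName")))
    return pid

pid = store("<paper><title>T</title>"
            "<author><firstName>A</firstName><lastName>B</lastName></author>"
            "<author><firstName>C</firstName><lastName>D</lastName></author></paper>")
paper_row = conn.execute("SELECT title, year FROM Paper WHERE pid = ?", (pid,)).fetchone()
author_count = conn.execute("SELECT COUNT(*) FROM Author WHERE pid = ?", (pid,)).fetchone()[0]
print(paper_row, author_count)
```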

32

XML Storage: DTD to Schema

Each element corresponds to a relation
Each attribute of an element corresponds to a column of the relation
Connect elements using foreign keys
But the problem: fragmentation!
Example: how many relations should be used for finding an address of an author?

<!ELEMENT paper (title, author*, year?)>
<!ELEMENT author (firstName, lastName, address)>
<!ELEMENT address (street, city, country)>
<!ELEMENT city (postcode?, cityname)>
<!ELEMENT street (streetno?, streetname)>

Paper(pid, title, year)
Author(aid, pid, firstName, lastName)
Address(addid, aid, country)
City(cid, addid, postcode, cityname)
Street(sid, addid, streetno, streetname)

33

XML Storage: Path Relations

[T. Amagasa, T. Shimura, S. Uemura 2001] ACM TOIT 1(1), 2001
Store paths as strings (model-based approach)
XPath expressions become SQL LIKE predicates
Additional information for parent/child and ancestor/descendant relationships
XRel: table schemas for paths, elements, and val
XParent: table schemas for elements, labelpath, and parent (alternative but similar)

34

XML Storage: Path Relations

Path
pathID  Pathexpr
1       #/bib
2       #/bib#/paper
3       #/bib#/paper#/author
4       #/bib#/paper#/title
5       #/bib#/paper#/year
6       #/bib#/book#/author
7       #/bib#/book#/title
8       #/bib#/book#/publisher

One entry for every path in the database: relatively small
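Because paths are stored as strings, a simple path expression can be answered with a string match on the Path table. The sketch below assumes the convention that a child step `/` is written `#/` and a descendant step `//` becomes `#%/`, so that `%` can absorb any number of intermediate steps; the helper name xpath_to_like is hypothetical and it handles only plain `/` and `//` steps.

```python
import sqlite3

def xpath_to_like(xpath):
    """Map a simple XPath (only / and // steps) to a LIKE pattern over Pathexpr."""
    # Rewrite '//' first so its slashes are not treated as two child steps.
    return xpath.replace("//", "\x00").replace("/", "#/").replace("\x00", "#%/")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Path(pathID INTEGER PRIMARY KEY, Pathexpr TEXT)")
conn.executemany("INSERT INTO Path VALUES (?, ?)", [
    (1, "#/bib"), (2, "#/bib#/paper"), (3, "#/bib#/paper#/author"),
    (4, "#/bib#/paper#/title"), (5, "#/bib#/paper#/year"),
    (6, "#/bib#/book#/author"), (7, "#/bib#/book#/title"), (8, "#/bib#/book#/publisher"),
])

pattern = xpath_to_like("/bib//author")
ids = [r[0] for r in conn.execute(
    "SELECT pathID FROM Path WHERE Pathexpr LIKE ? ORDER BY pathID", (pattern,))]
print(pattern, ids)
```

Since the Path table is small, this match is cheap; the matching pathIDs are then used to select rows from the much larger element table.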

35

XML Storage: Path Relations

Element (Start and End are positions in the document)
NodeID  PathID  Start   End     ParentID
1       1       0       1000    -
2       2       5       200     1
3       3       8       20      2
4       3       21      30      2
5       3       31      100     2
6       3       101     150     2
7       4       151     180     2
8       2       300     500     1
. . .

One entry for every element in the database: relatively large

Val
NodeID  Val
3       Smith
4       Vance
5       Tim
6       Wallace
7       The Best…
. . .

One entry for every leaf in the database: relatively large
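The Start and End positions let ancestor/descendant questions be answered by comparing intervals, without walking the tree. A minimal sketch over the eight Element rows listed on this slide follows; the function names are assumptions.

```python
# Rows of the Element table: (NodeID, PathID, Start, End, ParentID)
element = [
    (1, 1, 0, 1000, None), (2, 2, 5, 200, 1), (3, 3, 8, 20, 2), (4, 3, 21, 30, 2),
    (5, 3, 31, 100, 2), (6, 3, 101, 150, 2), (7, 4, 151, 180, 2), (8, 2, 300, 500, 1),
]

def descendants(node_id):
    """All nodes whose [Start, End] region lies strictly inside that of node_id."""
    _, _, start, end, _ = next(row for row in element if row[0] == node_id)
    return [nid for nid, _, s, e, _ in element if start < s and e < end]

def children(node_id):
    """Direct children, read from the ParentID column instead of the regions."""
    return [nid for nid, _, _, _, parent in element if parent == node_id]

print(descendants(2), children(1))
```

In SQL the descendant test becomes a self-join on Element with the two inequalities as the join condition.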

36

XRel Example: Path Table

Contains all path strings
• Path(pathID, pathexp)

Path Table
PathID  PathExp
0       #/PLAY
1       #/PLAY#/ACT
2       #/PLAY#/ACT#/SCENE
3       #/PLAY#/ACT#/SCENE#/@id
4       #/PLAY#/ACT#/SCENE#/TITLE
5       #/PLAY#/ACT#/SCENE#/SPEECH
…       …

[Tree, with node numbers: root (1), PLAY (2), ACT (3), SCENE (4), id (5) = “000”, TITLE (6) = “Intro”, SPEECH (7), SPEAKER (8) = “CURIO”, TEXT (9) = “This is …”; a further ACT and a further SCENE are shown without numbers]

37

XRel Example: Other Tables

Example document:

<PLAY><ACT><SCENE id="000"><TITLE> Intro </TITLE><SPEECH><SPEAKER>CURIO</SPEAKER><TEXT>This is …</TEXT></SPEECH></SCENE></ACT></PLAY>

[Position markers on the document: 0, 6, 11, 21, 27, 48, 49, 57, 90, 91]

Element Table
docID   PathID  start   end     index   reindex
1       0       0       …       1       1
1       1       6       …       1       1
1       2       11      …       1       1
1       4       27      48      1       1
…       …       …       …       …       …

Attribute Table
docID   PathID  start   end     value
1       3       21      21      “000”

Text Table
docID   PathID  start   end     value
1       4       34      40      “Intro”
…       …       …       …       …

Path Table
PathID  PathExp
0       #/PLAY
…       …

38

Storing XML as a Graph

Every XML instance is a tree
Hence we can store it as any graph, using an Edge table
In addition we need a Value table to store the data values (#PCDATA)

Edge relation summary:
Same relational schema for every XML document:
• Edge(Source, Tag, Dest)
• Value(Source, Val)
Generic: works for every XML instance
But inefficient:
• Repeat tags multiple times
• Need many joins to reconstruct data

39

Storing XML as a Graph

[Tree, nodes numbered 0 to 11: a db edge leads from node 0 to node 1, which has two book children and one publisher child; the leaves are labelled title, author, title, author, author, title, state with values “Complete Guide to DB2”, “Chamberlin”, “Transaction Processing”, “Bernstein”, “Newcomer”, “Morgan Kaufman”, “CA”]

Edge
Source  Tag     Dest
0       db      1
1       book    2
2       title   3
2       author  4
1       book    5
5       title   6
5       author  7
. . .   . . .   . . .

Value
Source  Val
3       Complete guide . . .
4       Chamberlin
6       . . .
. . .   . . .

40

Storing XML as a Graph

What happens to queries:

XQuery:

FOR $x IN /db/book[author/text()="Chamberlin"]
RETURN $x/title

SQL over the Edge and Value tables:

SELECT vtitle.val
FROM Edge xdb, Edge xbook, Edge xauthor, Edge xtitle, Value vauthor, Value vtitle
WHERE xdb.source = 0 and xdb.tag = 'db'
  and xdb.dest = xbook.source and xbook.tag = 'book'
  and xbook.dest = xauthor.source and xauthor.tag = 'author'
  and xbook.dest = xtitle.source and xtitle.tag = 'title'
  and xauthor.dest = vauthor.source and vauthor.val = 'Chamberlin'
  and xtitle.dest = vtitle.source
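A runnable sketch of the six-way join, using sqlite3 and the Edge(Source, Tag, Dest) and Value(Source, Val) schema. The rows are those listed for the example tree; the full title strings are read off the tree diagram, and the value for node 7 is an assumption based on the same diagram.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Edge(source INTEGER, tag TEXT, dest INTEGER);
CREATE TABLE Value(source INTEGER, val TEXT);
INSERT INTO Edge VALUES (0,'db',1), (1,'book',2), (2,'title',3), (2,'author',4),
                        (1,'book',5), (5,'title',6), (5,'author',7);
INSERT INTO Value VALUES (3,'Complete Guide to DB2'), (4,'Chamberlin'),
                         (6,'Transaction Processing'), (7,'Bernstein');
""")

# One Edge alias per path step, one Value alias per text() test or result.
titles = conn.execute("""
SELECT vtitle.val
FROM Edge xdb, Edge xbook, Edge xauthor, Edge xtitle, Value vauthor, Value vtitle
WHERE xdb.source = 0 AND xdb.tag = 'db'
  AND xdb.dest = xbook.source AND xbook.tag = 'book'
  AND xbook.dest = xauthor.source AND xauthor.tag = 'author'
  AND xbook.dest = xtitle.source AND xtitle.tag = 'title'
  AND xauthor.dest = vauthor.source AND vauthor.val = 'Chamberlin'
  AND xtitle.dest = vtitle.source
""").fetchall()
print(titles)
```

A two-step path with one predicate already needs four copies of Edge and two of Value, which illustrates the "many joins" cost of the generic schema.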

41

XML Storage: Data Mining to Schema

[Deutsch, Fernandez, Suciu 1999]
Given:
• One large XML data instance
• No schema/DTD
• Query workload
Problem: find a “good” relational schema for it
Notice: even when a DTD is present, it may be imprecise:
• E.g. when a person may have 1-3 phones: phone*

42

XML Storage: Data Mining to Schema

[Deutsch, Fernandez, Suciu 1999]

[Diagram: four paper elements with author and title children; some authors have fn and ln children, and one paper has a year. Two derived tables, labelled Paper1 and Paper2 in the figure, follow. Residual diagram labels: A, B, C, D]

author  title
X       X

fn1     ln1     fn2     ln2     title   year
X       X       X       X       X       -
X       X       -       -       X       X
X       X       -       -       X       -

43

Useful References

Data on the Web: From Relations to Semistructured Data and XML, Abiteboul, Buneman, Suciu
• For foundations
W3C homepage, www.w3.org
• For current standards
F. Tian et al., The Design and Performance Evaluation of Alternative XML Storage Strategies, SIGMOD Record, 2002
XParent: An Efficient RDBMS-Based XML Database System, In Proc. of ICDE, 2002
Tatarinov, Storing and Querying Ordered XML Using a Relational Database System, In Proc. of SIGMOD, 2002
Yoshikawa et al., XRel: A Path-Based Approach to Storage and Retrieval of XML Documents Using Relational Databases, ACM Trans. on Internet Technology, Vol. 1, No. 1, pp 110-141, 2001
D. Florescu, D. Kossmann, Storing and Querying XML Data using an RDBMS, IEEE Data Engineering Bulletin 22(3), 1999
J. Shanmugasundaram et al., Relational Databases for Querying XML Documents: Limitations and Opportunities, In Proc. of VLDB, 1999
C. Zhang et al., On Supporting Containment Queries in Relational Database Management Systems, In Proc. of SIGMOD, 2001