Managing XML and Semistructured Data
Part 1: Preliminaries, Motivation and Overview
Acknowledgement: Part of the material in this set of XML slides is extracted from Prof. Dan Suciu’s course materials. Thanks for his permission to use it.
COMP630L Topics in DB Systems: Managing Web Data, Fall 2007
Dr Wilfred Ng
2
HTML, XML, SGML
• XML: a W3C standard to complement HTML
• Origins: structured text (SGML)
• Motivation:
  • HTML describes presentation
  • XML describes content
• http://www.w3.org/TR/2000/REC-xml-20001006 (version 2, 10/2000)
[Figure: SGML, XML, HTML 4.0]
3
HTML
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i>
Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
Abiteboul, Buneman, Suciu
<br> Morgan Kaufmann, 1999
HTML describes the presentation
4
XML
<bibliography>
  <book> <title> Foundations… </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
  …
</bibliography>
XML describes the content
5
SGML and XML
• eXtensible Markup Language (XML)
• XML 1.0: a recommendation from W3C, 1998
• XML-related standards from W3C groups: XSL, XQuery, …
• Roots: SGML [http://www.w3.org/MarkUp/SGML/]
  • Markup = encoding: possibilities for expressing information
  • Information modelling freedom, reusability, provability, validity
  • But a very nasty language
  • SGML is an international standard for device-independent, system-independent methods of representing texts in electronic form
• After the roots: a format for sharing data (on the Web)
6
Basic XML Terminology
• Tags: book, title, author, …
• Start tag: <book>, end tag: </book>
• Elements: <book>…</book>, <author>…</author>
• Elements are nested
• Empty element: <red></red>, abbreviated <red/>
• An XML document has a single root element
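The terminology above can be tried out with a small sketch that uses Python's standard `xml.etree.ElementTree` parser. The document string is a hypothetical variant of the bibliography example, with a `<red/>` element added only to show the empty-element shorthand.

```python
import xml.etree.ElementTree as ET

# Hypothetical document modeled on the bibliography example in these slides.
doc = """<bibliography>
  <book>
    <title>Foundations of Databases</title>
    <author>Abiteboul</author>
    <author>Hull</author>
    <author>Vianu</author>
    <red/>
  </book>
</bibliography>"""

root = ET.fromstring(doc)                    # the single root element
book = root[0]                               # elements nest: <book> is a child of the root
child_tags = [child.tag for child in book]   # tag names of the nested elements
authors = [a.text for a in book.findall("author")]
empty = book.find("red")                     # <red/> is shorthand for <red></red>

print(root.tag)
print(child_tags)
print(authors)
print(empty.text, len(empty))                # no text and no children
```

The parser rejects input with more than one top-level element, which matches the single-root rule.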
7
The Role of XML Data
• XML is designed for data exchange, not to replace relational or E/R data
• Sources of XML data:
  • Created manually with text editors: not really data
  • Generated automatically from relational data
  • Text files, replacing older data formats: Web server logs, scientific data (biological, astronomical)
  • Stored/processed in native XML engines: very few applications need that today
8
XML Advantages for Web Data
• Over SGML:
  • Supported by mainstream browsers such as IE and Netscape
  • Standard stylesheet
  • Standard linking
• Over HTML:
  • Interchangeable
  • Searchable
  • Reusable
  • Enables automation
9
Why XML is of Interest to Us
• HTML fails to meet structure specification: its tags are for appearance only
• XML is a language syntax for data
  • Note: we have no language syntax for relational data
  • But XML is not relational: it is semistructured
• This is exciting because:
  • Can translate any data to XML
  • Can ship XML over the Web (HTTP)
  • Can input XML into any application
  • Thus: make data sharing and exchange on the Web possible!
• 40% annual growth and accelerating: publishers, government, education, …
10
Relational Data in XML
Relation person:
name  phone
John  3634
Sue   6343
Dick  6363
XML:
<person>
  <row> <name>John</name> <phone>3634</phone> </row>
  <row> <name>Sue</name> <phone>6343</phone> </row>
  <row> <name>Dick</name> <phone>6363</phone> </row>
</person>
[Figure: tree with root person and three row children, each with a name leaf ("John", "Sue", "Dick") and a phone leaf (3634, 6343, 6363)]
11
XML Data: more expressive
Missing attributes:
<person> <name>John</name> <phone>1234</phone> </person>
<person> <name>Joe</name> </person>    (no phone!)
Could represent in a table with nulls:
name  phone
John  1234
Joe   -
12
XML Data: more expressive
Repeated attributes:
<person> <name>Mary</name> <phone>2345</phone> <phone>3456</phone> </person>    (two phones!)
Impossible in tables:
name  phone
Mary  2345  3456  ???
13
XML Data: more expressive
• Attributes with different types in different objects:
<person> <name> <first>John</first> <last>Smith</last> </name> <phone>1234</phone> </person>    (structured name!)
• Nested collections (no 1NF)
• Heterogeneous collections:
  • <db> contains both <book>s and <publisher>s
14
XML Data Sharing and Exchange
[Figure: relational data, object-relational data and legacy data are shared as XML data over the Web (HTTP) with applications. Specific data management tasks: transform, integrate, warehouse]
15
From Relational Data to XML Data
Relation persons:
name  phone
John  3634
Sue   6343
Dick  6363
XML: data tree and its document
<persons>
  <row> <name>John</name> <phone>3634</phone> </row>
  <row> <name>Sue</name> <phone>6343</phone> </row>
  <row> <name>Dick</name> <phone>6363</phone> </row>
</persons>
[Figure: tree with root persons and three row children, each with a name leaf ("John", "Sue", "Dick") and a phone leaf (3634, 6343, 6363)]
16
XML Data
• XML is self-describing
• Schema elements become part of the data
  • Relational schema: persons(name, phone)
  • In XML, <persons>, <name>, <phone> are part of the data, and are repeated many times!
• Consequence: XML is much more flexible
• XML data is semistructured data
17
XML from/to Relational Data
• XML publishing:
  • relational data → XML
• XML storage:
  • XML → relational data
• Relational data is regular in its XML tree representation. But XML data can be non-relational!
[Figure labels: Supplier, Supplier, Broker, Common XML, Illusion, ??]
18
XML Publishing: ideas sketch
[Figure: a relational database sends tuple streams to the XML publishing layer, which delivers XML over the Web to the application. XPath/XQuery from the application side is turned into SQL for the database]
19
XML Publishing
• Exporting the data is relatively easy: we do this already for HTML
• Translating XQuery → SQL is hard
• XML publishing systems:
  • Research: XPERANTO (IBM/DB2), SilkRoute (AT&T Labs and UW; the software was unavailable, but SilkRoute is now downloadable)
    • XQuery → SQL
  • Commercial: SQL Server, Oracle
    • Only XPath → SQL, and with restrictions
• A middleware approach
20
XML Publishing
• Backend relational engine
• Relational schema:
  • Student(sid, name, address)
  • Course(cid, title, room)
  • Enroll(sid, cid, grade)
• Schemas are often proprietary, but XML schemas are public
[Figure: student, enroll, course]
• Will follow the idea in SilkRoute [ACM TODS 27(4), 2002]
21
XML Publishing
<xmlview>
  <course> <title> Operating Systems </title>
    <room> MGH084 </room>
    <student> <name> John </name>
      <address> Seattle </address>
      <grade> 3.8 </grade>
    </student>
    <student> … </student> …
  </course>
  <course> <title> Database </title>
    <room> EE045 </room>
    <student> <name> Mary </name>
      <address> Shoreline </address>
      <grade> 3.9 </grade>
    </student>
    <student> … </student> …
  </course>
  …
</xmlview>
• Group by courses: redundant representation of students
• Other representations are possible too: in general, tags need not be the same as the relational attributes
22
XML Publishing
First thing to do: design the DTD (public for users)
<!ELEMENT xmlview (course*)>
<!ELEMENT course (title,room,student*)>
<!ELEMENT student (name,address,grade)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT grade (#PCDATA)>
Second thing to do: develop a canonical XML view under the db root for each relation {t1,…,tk} with schema R(A1,…,An):
<R>
  <row> <A1>a11</A1> … <An>a1n</An> </row>
  …
  <row> <A1>ak1</A1> … <An>akn</An> </row>
</R>
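The canonical view can be generated mechanically from any relation. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `canonical_view` and the sample Course tuple are assumptions made for illustration, not part of SilkRoute.

```python
import xml.etree.ElementTree as ET

def canonical_view(relation, attributes, tuples):
    """Build <R><row><A1>..</A1>...<An>..</An></row>...</R> for one relation.

    `relation` is the relation name R, `attributes` the list A1..An,
    and `tuples` the rows t1..tk (hypothetical helper, for illustration).
    """
    rel = ET.Element(relation)
    for t in tuples:
        row = ET.SubElement(rel, "row")
        for attr, value in zip(attributes, t):
            ET.SubElement(row, attr).text = str(value)
    return rel

# One sample tuple for Course(cid, title, room); the cid value is made up.
course = canonical_view("Course", ["cid", "title", "room"],
                        [("c1", "Database", "EE045")])
xml_text = ET.tostring(course, encoding="unicode")
print(xml_text)
```

Placing the views of Student, Course and Enroll under a common `db` element gives the document that the publishing query navigates with paths such as /db/Course/row.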
23
Now we write an XQuery to export relational data → XML:
<xmlview>
{ FOR $x IN /db/Course/row
  RETURN
    <course>
      <title> { $x/title/text() } </title>
      <room> { $x/room/text() } </room>
      { FOR $y IN /db/Enroll/row[cid/text() = $x/cid/text()],
            $z IN /db/Student/row[sid/text() = $y/sid/text()]
        RETURN
          <student>
            <name> { $z/name/text() } </name>
            <address> { $z/address/text() } </address>
            <grade> { $y/grade/text() } </grade>
          </student> }
    </course> }
</xmlview>
Note: the result should conform to the DTD (see slide 20 for the relational schema; the generated result is shown in slide 21).
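The nesting performed by the view query can be imitated in plain Python to see what it computes. This is only a sketch of the logic under assumed sample rows (the sid and cid values are invented); it is not how SilkRoute evaluates the view.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample rows for Student(sid, name, address),
# Course(cid, title, room) and Enroll(sid, cid, grade).
STUDENTS = [("s1", "John", "Seattle"), ("s2", "Mary", "Shoreline")]
COURSES = [("c1", "Operating Systems", "MGH084"), ("c2", "Database", "EE045")]
ENROLL = [("s1", "c1", "3.8"), ("s2", "c2", "3.9")]

def build_xmlview(students, courses, enroll):
    """Nest the students of each course under that course, as the view query does."""
    by_sid = {sid: (name, address) for sid, name, address in students}
    view = ET.Element("xmlview")
    for cid, title, room in courses:              # FOR $x IN /db/Course/row
        course = ET.SubElement(view, "course")
        ET.SubElement(course, "title").text = title
        ET.SubElement(course, "room").text = room
        for sid, enrolled_cid, grade in enroll:   # $y: Enroll rows of this course
            if enrolled_cid != cid:
                continue
            name, address = by_sid[sid]           # $z: the matching Student row
            student = ET.SubElement(course, "student")
            ET.SubElement(student, "name").text = name
            ET.SubElement(student, "address").text = address
            ET.SubElement(student, "grade").text = grade
    return view

view = build_xmlview(STUDENTS, COURSES, ENROLL)
print(ET.tostring(view, encoding="unicode"))
```

Each course element comes out with the child order title, room, student*, which is what the DTD for xmlview requires.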
24
XML Publishing
Query: find Mary’s grade in Operating Systems

XQuery over public view:
FOR $x IN /xmlview/course[title/text()="Operating Systems"],
    $y IN $x/student[name/text()="Mary"]
RETURN <answer> { $y/grade/text() } </answer>

SQL over database schema:
SELECT Enroll.grade
FROM Student, Enroll, Course
WHERE Student.name = 'Mary' and Course.title = 'Operating Systems'
  and Student.sid = Enroll.sid and Enroll.cid = Course.cid

SilkRoute does this translation automatically
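The SQL side of this translation can be run against a small in-memory database. The sketch below uses Python's standard `sqlite3` module; all rows, including Mary's 3.5 in Operating Systems, are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student(sid TEXT PRIMARY KEY, name TEXT, address TEXT);
CREATE TABLE Course(cid TEXT PRIMARY KEY, title TEXT, room TEXT);
CREATE TABLE Enroll(sid TEXT, cid TEXT, grade REAL);
-- Hypothetical sample data.
INSERT INTO Student VALUES ('s1', 'John', 'Seattle'), ('s2', 'Mary', 'Shoreline');
INSERT INTO Course VALUES ('c1', 'Operating Systems', 'MGH084'), ('c2', 'Database', 'EE045');
INSERT INTO Enroll VALUES ('s1', 'c1', 3.8), ('s2', 'c1', 3.5), ('s2', 'c2', 3.9);
""")

# The three-way join that the XQuery over the public view corresponds to.
grades = [row[0] for row in conn.execute(
    """
    SELECT Enroll.grade
    FROM Student, Enroll, Course
    WHERE Student.name = ? AND Course.title = ?
      AND Student.sid = Enroll.sid AND Enroll.cid = Course.cid
    """,
    ("Mary", "Operating Systems"),
)]
print(grades)
```

The user of the public view never sees the Enroll table; the join through it only appears after translation to SQL.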
25
XML Publishing
• How do we choose the output structure?
• Determined by agreement with our partners, or dictated by committees that define XML dialects (called applications) and generate DTDs
• The DTD is selective with respect to the underlying databases
• XML data is often nested, irregular, etc.: that is why we need the xmlview query in slide 23 to do the transformation
• There are no agreed normal forms for XML, but some work addresses this… [M. Arenas and L. Libkin, A Normal Form for XML Documents]
26
XML Storage
• Often the XML data is small and is parsed directly into the application (DOM API): file-based management (pros and cons?)
• Sometimes it is big, and we need to store it in a database
• Much harder than XML publishing (why?)
• A fundamental XML storage problem:
  • How do we choose the schema of the database?
• Possible solutions:
  • Schema derived from DTD: structure mapping
  • Storing XML as a graph, such as the “Edge relation”: model mapping
  • Native approach: native XML/SSD engine
27
XML Storage in a Relational DB
• Use a generic schema
  • [Florescu, Kossmann 1999]
• Use the DTD to derive the schema
  • [Shanmugasundaram, et al. 1999]
• Use data mining to derive the schema
  • [Deutsch, Fernandez, Suciu 1999]
• Use the Path table
  • [T. Amagasa, T. Shimura, S. Uemura 2001]
28
XML Storage: Edge Relation
• [Florescu, Kossmann 1999], Monet [Schmidt et al. WebDB 00]
• Structure-based approach
• Mapping the (DOM) tree to relations
• Use generic relational schemas (independent of the XML schema):
  Ref(source, label, dest)
  Val(node, value)
29
XML Storage: Edge Relation
[Florescu, Kossmann 1999]
[Figure: tree with root &o1 and a paper edge to &o2; &o2 has edges title to &o3 (“The Calculus”), author to &o4 (“…”), author to &o5 (“…”) and year to &o6 (“1986”)]
Ref
Source  Label   Dest
&o1     paper   &o2
&o2     title   &o3
&o2     author  &o4
&o2     author  &o5
&o2     year    &o6
Val
Node  Value
&o3   The Calculus
&o4   …
&o5   …
&o6   1986
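A short sketch of how a document could be shredded into the Ref and Val tables follows. The function name `shred`, the `&oN` numbering in document order, and the author names are assumptions for illustration (the slide elides the author values); the original systems may assign identifiers differently.

```python
import xml.etree.ElementTree as ET
from itertools import count

def shred(xml_text):
    """Return (ref, val) lists in the spirit of Ref(source,label,dest), Val(node,value)."""
    ids = count(1)
    ref, val = [], []

    def visit(parent_oid, elem):
        oid = f"&o{next(ids)}"
        ref.append((parent_oid, elem.tag, oid))    # one Ref row per edge
        children = list(elem)
        if children:
            for child in children:
                visit(oid, child)
        else:
            val.append((oid, (elem.text or "").strip()))  # one Val row per leaf

    root_oid = f"&o{next(ids)}"                    # &o1: node above the root element
    visit(root_oid, ET.fromstring(xml_text))
    return ref, val

# Author names are placeholders; the slide shows them only as "…".
doc = ("<paper><title>The Calculus</title><author>A1</author>"
       "<author>A2</author><year>1986</year></paper>")
ref, val = shred(doc)
for row in ref + val:
    print(row)
```

The same two table schemas work for any input document, which is what makes the approach generic.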
30
XML Storage: Edge Relation
• In practice we may need more tables for reference links and nodes:
  RefTag1(source, dest)
  RefTag2(source, dest)
  …
  IntVal(node, intVal)
  RealVal(node, realVal)
  …
RefTag1
Source  Dest
&o2     &o7
&o5     &o8
&o6     &o8
[Figure: the paper tree (&o1 to &o6, with edges paper, title, author, author, year) extended with nodes &o7 and &o8]
31
XML Storage: DTD to Schema
[Christophides, Abiteboul, Cluet, Scholl 1994]
[Shanmugasundaram, Tufte, He, Zhang, DeWitt, Naughton 1999]
Basic idea: use the XML schema to derive the relational schema
DTD:
<!ELEMENT paper (title, author*, year?)>
<!ELEMENT author (firstName, lastName)>
Relational schemas:
Paper(pid, title, year)
Author(aid, pid, firstName, lastName)
32
XML Storage: DTD to Schema
• Each element corresponds to a relation
• Each attribute of an element corresponds to a column of the relation
• Connect elements using foreign keys
• But the problem: fragmentation!
• Example: how many relations should be used for finding an address of an author?
DTD:
<!ELEMENT paper (title, author*, year?)>
<!ELEMENT author (firstName, lastName, address)>
<!ELEMENT address (street, city, country)>
<!ELEMENT city (postcode?, cityname)>
<!ELEMENT street (streetno?, streetname)>
Relational schemas:
Paper(pid, title, year)
Author(aid, pid, firstName, lastName)
Address(addid, aid, country)
City(cid, addid, postcode, cityname)
Street(sid, addid, streetno, streetname)
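The fragmentation problem can be made concrete with a sketch in Python's standard `sqlite3` module. The column types, the foreign keys and the sample author are assumptions, and Street is given a streetno column to follow the DTD; the point is only that one author's address already touches four relations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Paper(pid INTEGER PRIMARY KEY, title TEXT, year INTEGER);
CREATE TABLE Author(aid INTEGER PRIMARY KEY, pid INTEGER REFERENCES Paper(pid),
                    firstName TEXT, lastName TEXT);
CREATE TABLE Address(addid INTEGER PRIMARY KEY, aid INTEGER REFERENCES Author(aid),
                     country TEXT);
CREATE TABLE City(cid INTEGER PRIMARY KEY, addid INTEGER REFERENCES Address(addid),
                  postcode TEXT, cityname TEXT);
CREATE TABLE Street(sid INTEGER PRIMARY KEY, addid INTEGER REFERENCES Address(addid),
                    streetno TEXT, streetname TEXT);
-- Hypothetical sample data: one paper with one author.
INSERT INTO Paper VALUES (1, 'A Paper', 2001);
INSERT INTO Author VALUES (1, 1, 'Jane', 'Doe');
INSERT INTO Address VALUES (1, 1, 'UK');
INSERT INTO City VALUES (1, 1, NULL, 'London');
INSERT INTO Street VALUES (1, 1, '12', 'Main Street');
""")

# Reassembling one address needs Author, Address, City and Street.
rows = conn.execute(
    """
    SELECT Street.streetno, Street.streetname, City.cityname, Address.country
    FROM Author, Address, City, Street
    WHERE Author.lastName = ?
      AND Address.aid = Author.aid
      AND City.addid = Address.addid
      AND Street.addid = Address.addid
    """,
    ("Doe",),
).fetchall()
print(rows)
```

Every additional nesting level in the DTD adds another relation and another join to queries of this kind.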
33
XML Storage: Path Relations
• [T. Amagasa, T. Shimura, S. Uemura 2001], ACM TOIT 1(1), 2001
• Store paths as strings (model-based approach)
• XPath expressions become the SQL LIKE operator
• Additional information for parent/child and ancestor/descendant relationships
• XRel: table schemas for paths, elements and val
• XParent: table schemas for elements, labelpath and parent (alternative but similar)
34
XML Storage: Path Relations
Path
pathID  Pathexpr
1       #/bib
2       #/bib#/paper
3       #/bib#/paper#/author
4       #/bib#/paper#/title
5       #/bib#/paper#/year
6       #/bib#/book#/author
7       #/bib#/book#/title
8       #/bib#/book#/publisher
One entry for every path in the database; relatively small
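The way a path expression turns into a LIKE pattern over the Path table can be sketched as follows, using Python's standard `sqlite3`. The helper names and the rewriting rule ('/' becomes '#/', '//' becomes '#%/') are a simplified assumption about the XRel encoding, and the sketch ignores predicates and tag names containing LIKE wildcards such as '_'.

```python
import sqlite3

# Path table rows copied from the slide above.
PATHS = [
    (1, "#/bib"),
    (2, "#/bib#/paper"),
    (3, "#/bib#/paper#/author"),
    (4, "#/bib#/paper#/title"),
    (5, "#/bib#/paper#/year"),
    (6, "#/bib#/book#/author"),
    (7, "#/bib#/book#/title"),
    (8, "#/bib#/book#/publisher"),
]

def xpath_to_like(xpath):
    """Rewrite a simple path: '/' becomes '#/', '//' becomes '#%/' (assumed rule)."""
    marker = "\0"  # temporary stand-in so '//' is not rewritten twice
    return xpath.replace("//", marker).replace("/", "#/").replace(marker, "#%/")

def matching_path_ids(conn, xpath):
    """Return the pathIDs whose stored path string matches the path expression."""
    rows = conn.execute(
        "SELECT pathID FROM Path WHERE pathexpr LIKE ? ORDER BY pathID",
        (xpath_to_like(xpath),),
    )
    return [r[0] for r in rows]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Path(pathID INTEGER PRIMARY KEY, pathexpr TEXT)")
conn.executemany("INSERT INTO Path VALUES (?, ?)", PATHS)

print(xpath_to_like("/bib//author"))
print(matching_path_ids(conn, "/bib//author"))
print(matching_path_ids(conn, "/bib/paper/title"))
```

The '#' in front of every step keeps the wildcard from matching across a tag name, so a pattern for author cannot match a longer tag that merely ends in "author".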
35
XML Storage: Path Relations
Element
NodeID  PathID  Start  End   ParentID
1       1       0      1000  -
2       2       5      200   1
3       3       8      20    2
4       3       21     30    2
5       3       31     100   2
6       3       101    150   2
7       4       151    180   2
8       2       300    500   1
. . .
Start and End are positions in the document
One entry for every element in the database; relatively large
Val
NodeID  Val
3       Smith
4       Vance
5       Tim
6       Wallace
7       The Best…
. . .
One entry for every leaf in the database; relatively large
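The Start and End positions are what make ancestor/descendant tests cheap: one element is a descendant of another exactly when its interval lies inside the other's. A minimal sketch over the Element rows from the slide (the function names are assumptions):

```python
# Element rows copied from the slide: (node_id, path_id, start, end, parent_id).
ELEMENTS = [
    (1, 1, 0, 1000, None),
    (2, 2, 5, 200, 1),
    (3, 3, 8, 20, 2),
    (4, 3, 21, 30, 2),
    (5, 3, 31, 100, 2),
    (6, 3, 101, 150, 2),
    (7, 4, 151, 180, 2),
    (8, 2, 300, 500, 1),
]

def descendants(node_id, elements=ELEMENTS):
    """Nodes whose [start, end] interval lies strictly inside that of node_id."""
    start, end = next((s, e) for n, _, s, e, _ in elements if n == node_id)
    return [n for n, _, s, e, _ in elements if start < s and e < end]

def children(node_id, elements=ELEMENTS):
    """Nodes whose ParentID column points at node_id."""
    return [n for n, _, _, _, parent in elements if parent == node_id]

print(descendants(2))
print(descendants(1))
print(children(1))
```

In SQL the same containment test is a pair of comparisons in the WHERE clause, so no recursive traversal of parent links is needed.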
36
XRel Example: Path Table
• Contains all path strings
  • Path(pathID, pathexp)
Path Table
PathID  PathExp
0       #/PLAY
1       #/PLAY#/ACT
2       #/PLAY#/ACT#/SCENE
3       #/PLAY#/ACT#/SCENE#/@id
4       #/PLAY#/ACT#/SCENE#/TITLE
5       #/PLAY#/ACT#/SCENE#/SPEECH
…       …
[Figure: document tree with numbered nodes: root (1), PLAY (2), ACT (3), SCENE (4), id (5) = “000”, TITLE (6) = “Intro”, SPEECH (7), SPEAKER (8) = “CURIO”, TEXT (9) = “This is …”; further ACT and SCENE nodes are elided]
37
XRel Example: Other Tables
Example document:
<PLAY><ACT><SCENE id="000"><TITLE> Intro </TITLE><SPEECH><SPEAKER>CURIO</SPEAKER><TEXT>This is …</TEXT></SPEECH></SCENE></ACT></PLAY>
[Figure: positions marked on the document: 0, 6, 11, 21, 27, 48, 49, 57, 90, 91]
Element Table
docID  PathID  start  end  index  reindex
1      0       0      …    1      1
1      1       6      …    1      1
1      2       11     …    1      1
1      4       27     48   1      1
…      …       …      …    …      …
Attribute Table
docID  PathID  start  end  value
1      3       21     21   “000”
Text Table
docID  PathID  start  end  value
1      4       34     40   “Intro”
…      …       …      …    …
Path Table
PathID  PathExp
0       #/PLAY
…       …
38
Storing XML as a Graph
• Every XML instance is a tree
• Hence we can store it as any graph, using an Edge table
• In addition we need a Value table to store the data values (#PCDATA)
Edge relation summary:
• Same relational schema for every XML document:
  • Edge(Source, Tag, Dest)
  • Value(Source, Val)
• Generic: works for every XML instance
• But inefficient:
  • Repeats tags multiple times
  • Needs many joins to reconstruct data
39
Storing XML as a Graph
[Figure: tree with nodes numbered 0 to 11; node 0 has a db edge to node 1; db has two book children and one publisher child. The first book has title “Complete Guide to DB2” and author “Chamberlin”; the second has title “Transaction Processing” and authors “Bernstein” and “Newcomer”; the publisher has title “Morgan Kaufman” and state “CA”]
Edge
Source  Tag     Dest
0       db      1
1       book    2
2       title   3
2       author  4
1       book    5
5       title   6
5       author  7
. . .   . . .   . . .
Value
Source  Val
3       Complete guide . . .
4       Chamberlin
6       . . .
. . .   . . .
40
Storing XML as a Graph
What happens to queries:

XQuery:
FOR $x IN /db/book[author/text()="Chamberlin"]
RETURN $x/title

SQL over the Edge and Value tables:
SELECT vtitle.val
FROM Edge xdb, Edge xbook, Edge xauthor, Edge xtitle, Value vauthor, Value vtitle
WHERE xdb.source = 0 and xdb.tag = 'db'
  and xdb.dest = xbook.source and xbook.tag = 'book'
  and xbook.dest = xauthor.source and xauthor.tag = 'author'
  and xbook.dest = xtitle.source and xtitle.tag = 'title'
  and xauthor.dest = vauthor.source and vauthor.val = 'Chamberlin'
  and xtitle.dest = vtitle.source
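The six-way join can be executed on a small instance with Python's standard `sqlite3`. The rows below follow the Edge and Value tables of slide 39 for the two book subtrees; the entries the slide elides (the second book's title value and node 8 for its second author) are filled in from the tree figure and should be read as assumptions.

```python
import sqlite3

# Edge(source, tag, dest) rows for the db and book subtrees.
EDGES = [
    (0, "db", 1),
    (1, "book", 2), (2, "title", 3), (2, "author", 4),
    (1, "book", 5), (5, "title", 6), (5, "author", 7), (5, "author", 8),
]
# Value(source, val) rows for the leaves.
VALUES = [
    (3, "Complete Guide to DB2"), (4, "Chamberlin"),
    (6, "Transaction Processing"), (7, "Bernstein"), (8, "Newcomer"),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Edge(source INTEGER, tag TEXT, dest INTEGER);
CREATE TABLE Value(source INTEGER, val TEXT);
""")
conn.executemany("INSERT INTO Edge VALUES (?, ?, ?)", EDGES)
conn.executemany("INSERT INTO Value VALUES (?, ?)", VALUES)

def titles_by(author):
    """Titles of books with the given author: four Edge aliases, two Value aliases."""
    rows = conn.execute(
        """
        SELECT vtitle.val
        FROM Edge xdb, Edge xbook, Edge xauthor, Edge xtitle,
             Value vauthor, Value vtitle
        WHERE xdb.source = 0 AND xdb.tag = 'db'
          AND xdb.dest = xbook.source AND xbook.tag = 'book'
          AND xbook.dest = xauthor.source AND xauthor.tag = 'author'
          AND xbook.dest = xtitle.source AND xtitle.tag = 'title'
          AND xauthor.dest = vauthor.source AND vauthor.val = ?
          AND xtitle.dest = vtitle.source
        """,
        (author,),
    )
    return [r[0] for r in rows]

print(titles_by("Chamberlin"))
```

A two-step path with one predicate already needs six table aliases here, which illustrates the "many joins" cost of the generic schema.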
41
XML Storage: Data Mining to Schema
[Deutsch, Fernandez, Suciu 1999]
• Given:
  • One large XML data instance
  • No schema/DTD
  • Query workload
• Problem: find a “good” relational schema for it
• Notice: even when a DTD is present, it may be imprecise:
  • E.g. when a person may have 1-3 phones: phone*
42
XML Storage: Data Mining to Schema
[Deutsch, Fernandez, Suciu 1999]
[Figure: four paper elements with varying children: author elements (each with fn and ln), title and, in one case, year; tree nodes are labeled A, B, C, D]
Derived tables (labeled Paper1 and Paper2 on the slide):
author  title
X       X

fn1  ln1  fn2  ln2  title  year
X    X    X    X    X      -
X    X    -    -    X      X
X    X    -    -    X      -
43
Useful References
• Abiteboul, Buneman, Suciu, Data on the Web: from Relations to Semistructured Data and XML
  • For foundations
• W3C homepage, www.w3.org
  • For current standards
• F. Tian et al., The Design and Performance Evaluation of Alternative XML Storage Strategies, SIGMOD Record, 2002
• XParent: An Efficient RDBMS-Based XML Database System, In Proc. of ICDE, 2002
• Tatarinov, Storing and Querying Ordered XML Using a Relational Database System, In Proc. of SIGMOD, 2002
• Yoshikawa et al., XRel: A Path-Based Approach to Storage and Retrieval of XML Documents Using Relational Databases, ACM Trans. on Internet Technology, Vol. 1, No. 1, pp 110-141, 2001
• D. Florescu, D. Kossmann, Storing and Querying XML Data using an RDBMS, IEEE Data Engineering Bulletin 22(3), 1999
• J. Shanmugasundaram et al., Relational Databases for Querying XML Documents: Limitations and Opportunities, In Proc. of VLDB, 1999
• C. Zhang et al., On Supporting Containment Queries in Relational Database Management Systems, In Proc. of SIGMOD, 2001