xml and beyond - university of texas at arlingtonlambda.uta.edu/cse6331/spring03/papers/xml1.pdffall...

59
Fall 2001 CSE330 1 XML and Beyond: Parts I and II http://db.cis.upenn.edu http://www.w3c.org

Upload: others

Post on 13-Feb-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 1

XML and Beyond:Parts I and II

http://db.cis.upenn.edu http://www.w3c.org

Page 2: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 2

Outline

• Background: documents (SGML/HTML) and databases (structured and semistructured data)

• XML Basics and Document Type Descriptors

• XML API’s: Document Object Model (DOM), SAX (not covered in this course)

• XML query languages: XML-QL, XSL, Quilt.

Page 3: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 3

Part I: Background

What’s the difference between the world of documents and information retrieval and

databases and query interfaces?

Page 4: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 4

Documents vs DatabasesDocument world

> plenty of small documents> usually static

> implicit structuresection, paragraph, toc,

> tagging

> human friendly

> contentform/layout, annotation

> Paradigms“Save as”

> meta-dataauthor name, date, subject

Database world> a few large databases> usually dynamic

> explicit structure (schema)

> records

> machine friendly

> contentschema, data, methods

> ParadigmsAtomicity, Concurrency, Isolation, Durability

> meta-dataschema description

Page 5: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 5

What to do with themDatabase

• updating

• cleaning

• querying

• composing/transforming

Documents

• editing

• printing

• spell-checking• counting words

• retrieving (IR)

• searching

Page 6: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 6

HTML• Lingua franca for publishing hypertext on the

World Wide Web• Designed to describe how a Web browser should

arrange text, images and push-buttons on a page.• Easy to learn, but does not convey structure.• Fixed tag set.

Text (PCDATA)

<HTML><HEAD><TITLE>Welcome to the XML course</TITLE></HEAD><BODY>

<H1>Introduction</H1><IMG SRC=”dragon.jpeg" WIDTH="200" HEIGHT="150” >

</BODY></HTML>

Closing tag “Bachelor” tagAttribute name Attribute value

Opening tag

Page 7: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 7

Thin red line• The line between the document world and the

database world is not clear.• In some cases, both approaches are legitimate.• An interesting middle ground is data formats -- of

which XML is an example

• Examples– Personal address book

Page 8: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 8

Personal address book over 20 years1977

N Achison, MalcolmF Dr. M.P. AchisonA Dept. of Computer ScienceA University of EdinburghA Kings BuildingsA Edinburgh E12 8QQA ScotlandT 031-123-8855 ext. 4359 (work)T 031-345-7570 (home)

N Albani, PaoloF Prof. Paolo AlbaniA Dip. Informatica e SistemisticaA Universita di Roma La Sapienza...

1990

N Achison, MalcolmF Prof. M.P. AchisonA Dept. of Computing ScienceA University of GlasgowA Lilybank GardensA Glasgow G12 8QQA ScotlandT 041-339-8855 ext. 4359T 041-357-3787 (private)T 031-667-7570 (home)X 041-339-0090C [email protected]

N Achison, MalcolmF Prof. M.P. AchisonA 34 Inverness PlaceA Edinburgh, EH3 8UV

1980

N Achison, MalcolmF Dr. M.P. AchisonA Dept. of Computer Science....

T 031-667-7570 (home)C [email protected]

1997

N Achison, MalcolmF Prof. M.P. AchisonA Department of Computing Science...T 031-667-7570 (home)X 041-339-0090C [email protected] http://www.dcs.gla.ac.uk/mpa

2000?

Page 9: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 9

The Structure of XML

• XML consists of tags and text

• Tags come in pairs <date> ...</date>

• They must be properly nested<date> <day> ... </day> ... </date> --- good<date> <day> ... </date>... </day> --- bad

(You can’t do <i> ... <b> ... </i> ...</b> in HTML)

Page 10: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 10

XML textXML has only one “basic” type -- text.

It is bounded by tags e.g.<title> The Big Sleep </title><year> 1935 </ year> --- 1935 is still text

XML text is called PCDATA (for parsedcharacter data). It uses a 16-bit encoding,e.g. \&\#x0152 for the Hebrew letter Mem

Page 11: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 11

XML structure

Nesting tags can be used to express various structures. E.g. A tuple (record) :

<person><name> Malcolm Atchison </name><tel> (215) 898 4321 </tel><email> [email protected] </email>

</person>

Page 12: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 12

XML structure• We can represent a list by using the same

tag repeatedly:

<addresses><person> ... </person><person> ... </person><person> ... </person>...

</addresses>

Page 13: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 13

Terminology

The segment of an XML document between an opening and a corresponding closing tag is called an element.

<person><name> Malcolm Atchison </name><tel> (215) 898 4321 </tel><tel> (215) 898 4321 </tel><email> [email protected] </email>

</person>

element

not an elementelement, a sub-elementof

Page 14: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 14

XML is tree-like

person

name emailtel tel

Malcolm [email protected]

(215) 898 4321(215) 898 4321

Semistructured data models typically put the labels on the edges

Page 15: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 15

Mixed Content

An element may contain a mixture of sub-elements and PCDATA

<airline><name> British Airways </name><motto> World’s <dubious> favorite</dubious> airline </motto>

</airline>

Data of this form is not typically generated from databases. It is needed for consistency with HTML

Page 16: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 16

A Complete XML Document<?xml version="1.0"?><person><name> Malcolm Atchison </name><tel> (215) 898 4321 </tel><email> [email protected] </email>

</person>

Page 17: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 17

Representing relational DBs:Two ways

projects:title budget managedBy

employees:name ssn age

Page 18: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 18

Project and Employee relations in XML

Projects and employees are intermixed

<db><project>

<title> Pattern recognition </title><budget> 10000 </budget><managedBy> Joe

</managedBy></project><employee>

<name> Joe </name><ssn> 344556 </ssn><age> 34 < /age>

</employee>

<employee><name> Sandra </name><ssn> 2234 </ssn><age> 35 </age>

</employee><project>

<title> Auto guided vehicle </title><budget> 70000 </budget><managedBy> Sandra </managedBy>

</project>:

</db>

Page 19: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 19

Project and Employee relations in XML (cont’d)

Employees follows projects<db>

<projects><project>

<title> Pattern recognition </title><budget> 10000 </budget><managedBy> Joe </managedBy>

</project><project>

<title> Auto guided vehicles </title><budget> 70000 </budget><managedBy> Sandra

</managedBy></project>

:</projects>

<employees><employee>

<name> Joe </name><ssn> 344556 </ssn><age> 34 </age>

</employee> <employee>

<name> Sandra </name><ssn> 2234 </ssn><age>35 </age>

</employee>:<employees>

</db>

Page 20: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 20

Project and Employee relations in XML (cont’d)

Or without “separator” tags …<db>

<projects> <title> Pattern recognition </title><budget> 10000 </budget><managedBy> Joe </managedBy><title> Auto guided vehicles

</title><budget> 70000 </budget><managedBy> Sandra

</managedBy>:

</projects>

<employees> <name> Joe </name><ssn> 344556 </ssn><age> 34 </age><name> Sandra </name><ssn> 2234 </ssn><age> 35 </age>:

</employees></db>

Page 21: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 21

AttributesAn (opening) tag may contain attributes. These are typically used to describe the content of an element

<entry><word language = “en”> cheese </word><word language = “fr”> fromage </word><word language = “ro”> branza </word><meaning> A food made … </meaning>

</entry>

Page 22: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 22

Attributes (cont’d)Another common use for attributes is to express dimension or type

<picture><height dim= “cm”> 2400 </height><width dim= “in”> 96 </width><data encoding = “gif” compression = “zip”>

M05-.+C$@02!G96YE<FEC ...</data>

</picture>

A document that obeys the “nested tags” rule and does not repeat an attribute within a tag is said to be well-formed .

Page 23: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 23

When to use attributesIt’s not always clear when to use attributes

<person ssno= “123 45 6789”><name> F. MacNiel </name> <email> [email protected] </email> ...

</person>

OR<person><ssno>123 45 6789</ssno><name> F. MacNiel </name><email> [email protected] </email> ...

</person>

Page 24: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 24

Using IDs<family>

<person id="jane" mother="mary" father="john"> <name> Jane Doe </name>

</person><person id="john" children="jane jack">

<name> John Doe </name> <mother/></person> <person id="mary" children="jane jack">

<name> Mary Doe </name></person>

<person id="jack" mother=”mary" father="john"> <name> Jack Doe </name>

</person></family>

Page 25: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 25

An object-oriented schema

class Movie

( extent Movies, key title ){

attribute string title;

attribute string director;

relationship set<Actor> casts

inverse Actor::acted_In;

attribute int budget;} ;

class Actor

( extent Actors, key name ){

attribute string name;

relationship set<Movie> acted_In

inverse Movie::casts;

attribute int age;

attribute set<string> directed;} ;

Page 26: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 26

<db><movie id=“m1”>

<title>Waking Ned Divine</title><director>Kirk Jones III</director><cast idrefs=“a1 a3”></cast><budget>100,000</budget>

</movie><movie id=“m2”>

<title>Dragonheart</title><director>Rob Cohen</director><cast idrefs=“a2 a9 a21”></cast><budget>110,000</budget>

</movie><movie id=“m3”>

<title>Moondance</title><director>Dagmar Hirtz</director><cast idrefs=“a1 a8”></cast><budget>90,000</budget>

</movie>:

An example

<actor id=“a1”><name>David Kelly</name><acted_In idrefs=“m1 m3 m78” ></acted_In>

</actor><actor id=“a2”>

<name>Sean Connery</name><acted_In idrefs=“m2 m9 m11”></acted_In><age>68</age>

</actor><actor id=“a3”>

<name>Ian Bannen</name><acted_In idrefs=“m1 m35”></acted_In>

</actor>:

</db>

Page 27: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 27

Part II: Document Type Descriptors

Imposing structure on XML documents

Page 28: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 28

Document Type Descriptors

• Document Type Descriptors (DTDs) impose structure on an XML document.

• There is some relationship between a DTD and a schema, but it is not close – there is still a need for additional “typing” systems.

• The DTD is a syntactic specification.

Page 29: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 29

Example: An Address Book<person>

<name> MacNiel, John </name>

<greet> Dr. John MacNiel </greet>

<addr>1234 Huron Street </addr>

<addr> Rome, OH 98765 </addr>

<tel> (321) 786 2543 </tel>

<fax> (321) 786 2543 </fax>

<tel> (321) 786 2543 </tel>

<email> [email protected] </email>

</person>

Exactly one nameAt most one greeting

As many address lines as needed (in order)

Mixed telephones and faxes

As manyas needed

Page 30: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 30

Specifying the structure

• name to specify a name element• greet? to specify an optional

(0 or 1) greet elements• name,greet? to specify a name followed by an

optional greet

Page 31: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 31

Specifying the structure (cont)

• addr* to specify 0 or more address lines

• tel | fax a tel or a fax element

• (tel | fax)* 0 or more repeats of tel or fax

• email* 0 or more email elements

Page 32: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 32

Specifying the structure (cont)

So the whole structure of a person entry is specified by

name, greet?, addr*, (tel | fax)*, email*

This is known as a regular expression. Why is it important?

Page 33: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 33

Regular Expressions

Each regular expression determines a corresponding finite state automaton. Let’s start with a simpler example:

name, addr*, email

This suggests a simple parsing program

addr

emailname

Page 34: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 34

Another example

name,address*,(tel | fax)*,email*

name

address

teltel

fax

fax

email

email

email

Adding in the optional greet furthercomplicates things

Page 35: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 35

A DTD for the address book

<!DOCTYPE addressbook [<!ELEMENT addressbook (person*)><!ELEMENT person

(name, greet?, address*, (fax | tel)*, email*)><!ELEMENT name (#PCDATA)><!ELEMENT greet (#PCDATA)><!ELEMENT address (#PCDATA)><!ELEMENT tel (#PCDATA)><!ELEMENT fax (#PCDATA)><!ELEMENT email (#PCDATA)>

]>

Page 36: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 36

Two DTDs for the relational DB<!DOCTYPE db [

<!ELEMENT db (projects,employees)><!ELEMENT projects (project*)><!ELEMENT employees (employee*)><!ELEMENT project (title, budget, managedBy)><!ELEMENT employee (name, ssn, age)>...

]>

<!DOCTYPE db [<!ELEMENT db (project | employee)*><!ELEMENT project (title, budget, managedBy)><!ELEMENT employee (name, ssn, age)>...

]>

Page 37: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 37

Recursive DTDs

<DOCTYPE genealogy [<!ELEMENT genealogy (person*)><!ELEMENT person (

name,dateOfBirth,person, -- motherperson )> -- father

... ]>

What is the problem with this?

Page 38: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 38

Recursive DTDs cont’d.

<DOCTYPE genealogy [<!ELEMENT genealogy (person*)><!ELEMENT person (

name,dateOfBirth,person?, -- motherperson? )> -- father

... ]>

What is now the problem with this?

Page 39: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 39

Some things are hard to specify

Each employee element is to contain name, age and ssn elements in some order.

<!ELEMENT employee( (name, age, ssn) | (age, ssn, name) |(ssn, name, age) | ...

)>

Suppose there were many more fields !

Page 40: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 40

Summary of XML regular expressions

• A The tag A occurs• e1,e2 The expression e1 followed by e2• e* 0 or more occurrences of e• e? Optional -- 0 or 1 occurrences• e+ 1 or more occurrences• e1 | e2 either e1 or e2• (e) grouping

Page 41: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 41

It’s easy to get confused…

<!ELEMENT PARTNER (NAME?, ONETIME?, PARTNRID?, PARTNRTYPE?, SYNCIND?, ACTIVE?, CURRENCY?, DESCRIPTN?, DUNSNUMBER?, GLENTITYS?, NAME*, PARENTID?, PARTNRIDX?, PARTNRRATG?, PARTNRROLE?, PAYMETHOD?, TAXEXEMPT?, TAXID?, TERMID?, USERAREA?, ADDRESS*, CONTACT*)>

Cited from oagis_segments.dtd (one of the files in the Novell Developer Kit http://developer.novell.com/ndk/indexexe.htm)

<PARTNER> <NAME> Ben Franklin </NAME> </PARTNER>Q. Which NAME is it?

Page 42: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 42

Specifying attributes in the DTD

<!ELEMENT height (#PCDATA)><!ATTLIST height

dimension CDATA #REQUIREDaccuracy CDATA #IMPLIED >

The dimension attribute is required; the accuracyattribute is optional.

CDATA is the “type” of the attribute -- it means string.

Page 43: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 43

Specifying ID and IDREF attributes

<!DOCTYPE family [<!ELEMENT family (person)*><!ELEMENT person (name)><!ELEMENT name (#PCDATA)><!ATTLIST person

id ID #REQUIREDmother IDREF #IMPLIEDfather IDREF #IMPLIEDchildren IDREFS #IMPLIED>

]>

Page 44: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 44

Some conforming data<family>

<person id="jane" mother="mary" father="john"><name> Jane Doe </name>

</person><person id="john" children="jane jack">

<name> John Doe </name></person> <person id="mary" children="jane jack">

<name> Mary Doe </name></person>

<person id="jack" mother=”mary" father="john"> <name> Jack Doe </name>

</person></family>

Page 45: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 45

Consistency of ID and IDREF attribute values

•If an attribute is declared as ID– the associated values must all be distinct (no

confusion)•If an attribute is declared as IDREF

– the associated value must exist as the value of some ID attribute (no dangling “pointers”)

•Similarly for all the values of an IDREFSattribute

•ID and IDREF attributes are not typed

Page 46: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 46

An alternative specification

<!DOCTYPE family [<!ELEMENT family (person)*><!ELEMENT person (mother?, father?, children, name)><!ATTLIST person id ID #REQUIRED><!ELEMENT name (#PCDATA)><!ELEMENT mother EMPTY><!ATTLIST mother idref IDREF #REQUIRED><!ELEMENT father EMPTY><!ATTLIST father idref IDREF #REQUIRED><!ELEMENT children EMPTY><!ATTLIST children idrefs IDREFS #REQUIRED>

]>

Page 47: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 47

The revised data

<family><person id = "jane”>

<name> Jane Doe </name><mother idref = "mary”></mother><father idref = "john"></father>

</person><person id = "john”>

<name> John Doe </name><children idrefs = "jane jack"> </children>

</person> ...

</family>

Page 48: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 48

A useful abbreviation

When an element has empty content we can use

<tag blahblahbla/> for <tag blahblahbla></tag>

For example:<family>

<person id = "jane”><name> Jane Doe </name><mother idref = "mary”/>

<father idref = "john”/></person>...

</family>

Page 49: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 49

Back to the object-oriented schema

class Movie

( extent Movies, key title ){

attribute string title;

attribute string director;

relationship set<Actor> casts

inverse Actor::acted_In;

attribute int budget;} ;

class Actor

( extent Actors, key name ){

attribute string name;

relationship set<Movie> acted_In

inverse Movie::casts;

attribute int age;

attribute set<string> directed;} ;

Page 50: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 50

Schema.dtd

<!DOCTYPE db [<!ELEMENT db (movie+, actor+)><!ELEMENT movie (title,director,casts,budget)><!ATTLIST movie id ID #REQUIRED><!ELEMENT title (#PCDATA)><!ELEMENT director (#PCDATA)><!ELEMENT casts EMPTY>

<!ATTLIST casts idrefs IDREFS #REQUIRED><!ELEMENT budget (#PCDATA)>

Page 51: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 51

Schema.dtd (cont’d)

<!ELEMENT actor (name, acted_In,age?, directed*)><!ATTLIST actor id ID #REQUIRED><!ELEMENT name (#PCDATA)><!ELEMENT acted_In EMPTY>

<!ATTLIST acted_In idrefs IDREFS #REQUIRED><!ELEMENT age (#PCDATA)><!ELEMENT directed (#PCDATA)>

]>

Page 52: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 52

More on ODL and DTD

• Earlier last year (May 2000), Object Data Management Group (ODMG) suggested OIFML, a XML document type of Object Interchange Format.

• http://www.odmg.org/library/readingroom/oifml.pdf

Page 53: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 53

Constraints on IDs and IDREFs

• ID stands for identifier. No two ID attributes with the same name may have the same value (of type CDATA)

• IDREF stands for identifier reference. Every value associated with an IDREF attribute must exist as an ID attribute value

• IDREFS specifies several (0 or more) identifiers

Page 54: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 54

Connecting the document with its DTD

In line:<?xml version="1.0"?><!DOCTYPE db [<!ELEMENT ...> … ]><db> ... </db>

Another file:<!DOCTYPE db SYSTEM "schema.dtd">

A URL:<!DOCTYPE db SYSTEM

"http://www.schemaauthority.com/schema.dtd">

Page 55: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 55

Well-formed and Valid Documents

• Well-formed applies to any document (with or without a DTD): proper nesting of tags and unique attributes

• Valid specifies that the document conforms to the DTD: conforms to regular expression grammar, types of attributes correct, and constraints on references satisfied

Page 56: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 56

DTDs v.s Schemas (or Types)• By database (or programming language) standards

DTDs are rather weak specifications. – Only one base type -- PCDATA– No useful “abstractions” e.g., sets– IDREFs are untyped. You point to something, but you

don’t know what!– No constraints e.g., child is inverse of parent– No methods– Tag definitions are global

• Some of the XML extensions impose something like a schema or type on an XML document. We’ll see these later

Page 57: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 57

Lots of possibilities for schemas

• XML Schema (under W3C’s spotlight)• XDR (Microsoft’s BizTalk)• SOX (Schema for Object-Oriented XML)• Schematron• DSD (AT&T Labs and BRICS)• and more.

Page 58: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 58

Some tools• XML Authority

http://www.extensibility.com/tibco/solutions/xml_authority/index.htm

• XML Spy http://www.xmlspy.com/download.html

Page 59: XML and Beyond - University of Texas at Arlingtonlambda.uta.edu/cse6331/spring03/papers/XML1.pdfFall 2001 CSE330 6. HTML • Lingua franca for publishing hypertext on the World Wide

Fall 2001CSE330 59

Summary

• XML is a new data format. Its main virtues are widespread acceptance and the (important) ability to handle semistructured data (data without schema).

• DTDs provide some useful syntactic constraints on documents. As schemas they are weak.

• Next slides: XML programming, XML querying.