Page 1: Intro to XML

Intro to XML

Hachim Haddouti

Al Akhawayn University

SSE

http://mail.alakhawayn.ma/~H.Haddouti

Page 2: Intro to XML

TOC

– Intro
– W3C
– Historical development
– Scenarios in XML and Data Management

Page 3: Intro to XML

1) Motivation

XML: eXtensible Markup Language

Markup language
– Markup: data plus information about that data, embedded within a document

Developed by the World Wide Web Consortium (W3C)

Human-readable

Mostly used as an exchange format

Page 4: Intro to XML

W3C

(World Wide Web Consortium)

Over 400 member organisations

Hosted by MIT, INRIA, and Keio University

50 full-time staff members

Founded by Tim Berners-Lee, the inventor of the WWW

Example standards:
– XML
– HTML
– DOM
– XPath
– XML Schema
– ...

Page 5: Intro to XML

Process at W3C

Note
– Suggestions (W3C takes no responsibility)

Working Draft
– Work in progress; has to be approved by all involved in W3C

Candidate Recommendation
– Approved for test implementation only

Recommendation

Page 6: Intro to XML

The XML Phenomenon

"XML is the ASCII of the 21st century."

"XML is the ASCII of the Web"

Henry Thompson (1999)

Why is it so popular?

Page 7: Intro to XML

What's Wrong with HTML?

<DT><IMG SRC="greenball.gif">&nbsp;<A NAME="object-fusion"></A>Y.Papakonstantinou, S.Abiteboul, H.Garcia-Molina. <A HREF="http://www-cse.ucsd.edu/~yannis/papers/fusion.ps">"Object Fusion in Mediator Systems".</A> In <I>VLDB 96.</I></DT>

Rendered: Y. Papakonstantinou, S. Abiteboul, H. Garcia-Molina. "Object Fusion in Mediator Systems". In VLDB 96.

HTML is markup for presentation only

Page 8: Intro to XML


...What’s Wrong with HTML...

<DT><IMG SRC= "greenball.gif" >&nbsp;<A NAME="object-fusion"></A>Y.Papakonstantinou, S.Abiteboul, H.Garcia-Molina. <A HREF="http://www-cse.ucsd.edu/~yannis/papers/fusion.ps">"ObjectFusion in Mediator Systems".</A> In <I>VLDB 96.</I></DT>

No Explicit Structure, Semantics, or Object-Orientation

[Figure callouts marking the implicit fields in the snippet: Author, Conference, Title]

Page 9: Intro to XML

... And Some Repercussions

Lack of schema/semantics when querying the Web (HTML):

– "find documents (books, papers, ...) where author = Michael Jackson" (... and learn how software engineering meets the moon walker ...)

– "create a list of M. Jackson's books and (if available) their prices"

=> HTML is inappropriate for
– data exchange
– automation of information management (retrieval, manipulation, integration)

Page 10: Intro to XML

XML is ...

Standardized for all applications

A worldwide exchange format (write once, read everywhere)

A metalanguage used to define other languages
– Examples: MathML, ChessML, XUL (User Interfaces), CellML, Gene Expression Markup Language, Chemical Markup Language, XML/EDI, UN/EDIFACT
– Nowadays over 300 such languages

Page 11: Intro to XML

So what is XML (all about)?

Executive summary:
• XML = HTML – idiosyncrasies (simplified syntax) + user-definable ("semantic") tags
• Separation of data and its presentation
• Tags such as FONT or CENTER are not necessary in XML. XML uses structural tags such as TITLE, CHAPTER, etc.; the document structure remains constant across different media.

=> simple, very flexible data exchange format: semistructured data model

=> new applications:
• Information exchange (B2B), sharing (diglib), integration ("mediation"), archival, ...
• Web site management (XML + XSL stylesheets), ...

Page 12: Intro to XML

"It takes ten minutes to understand (base) XML, and then ten months to understand the new technologies hung around it."

(Peter Chen)

So why an XML course at AUI?

Page 13: Intro to XML

Extensible Markup Language (XML)

De-facto standard adopted by W3C
– Describes content rather than presentation

Three major differences to HTML
– New tags may be defined at will
– Structures can be nested
– An XML document can contain an optional description of its grammar

Page 14: Intro to XML

Historical Development XML /1

[Figure: timeline from Neil Bradley, The XML Companion. Years shown: 1960, 1986, 1992, 1997. Labels shown: Internet, Generalized Markup, SGML, WWW, HTML, XML.]

Page 15: Intro to XML

Historical Development XML /2

[Figure: timeline 1997 to 2002 of XML-related technologies, with a legend distinguishing W3C recommendations, other proposals, and work in progress. Technologies shown: XML, DOM, XPath 1.0, XQL, XML-QL, Quilt, XUpdate, XML Schema, XQuery 1.0, XPath 2.0.]

Page 16: Intro to XML

2) Documents ...

For communication between humans
– Human – Human
• Natural (human) language is used; it contains complex and irregular structures

For communication involving computers
– Computer – Computer
• Data-oriented
– Human – Computer
• Document-oriented
– XML allows representation and transport of this information

Page 17: Intro to XML

XML Documents

Before the syntax, some examples of XML documents

Page 18: Intro to XML

Example: XML Document

<?xml version="1.0" encoding="UTF-8"?>
<invoice customerNo="k333063143">
  <monthsprice>0,00</monthsprice>
  <detailedinvoice>
    <call>
      <date>26.2.</date>
      <time>19:47</time>
      <number>200xxxx</number>
      <itemprice currency="Euro">0,66</itemprice>
    </call>
    <call>
      <date>27.2.</date>
      <time>19:06</time>
      <number>200xxxx</number>
      <itemprice currency="Euro">0,46</itemprice>
    </call>
    <call_charge_total currency="Euro">2.19</call_charge_total>
  </detailedinvoice>
</invoice>
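To see why such a regular document is easy to process, here is a minimal sketch that reads the call prices from a shortened, well-formed copy of the invoice. It uses Python's standard xml.etree.ElementTree module; the names INVOICE_XML and call_prices are illustrative assumptions, not part of the slides.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the invoice example (hypothetical variable name).
INVOICE_XML = """<invoice customerNo="k333063143">
  <detailedinvoice>
    <call><date>26.2.</date><time>19:47</time><number>200xxxx</number>
      <itemprice currency="Euro">0,66</itemprice></call>
    <call><date>27.2.</date><time>19:06</time><number>200xxxx</number>
      <itemprice currency="Euro">0,46</itemprice></call>
  </detailedinvoice>
</invoice>"""


def call_prices(xml_text):
    """Return the item price of every call as a float."""
    root = ET.fromstring(xml_text)
    prices = []
    for item in root.iter("itemprice"):
        # The example uses a decimal comma, so normalise it before converting.
        prices.append(float(item.text.replace(",", ".")))
    return prices


prices = call_prices(INVOICE_XML)
print(prices, round(sum(prices), 2))
```

Because every call has the same shape, the same few lines work for any number of calls, which is what makes the data a natural fit for a database table.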

Page 19: Intro to XML

XML Document - Features

XML documents contain both data and the structure of that data within a document (self-describing)

All documents have the same or a similar structure (regular)

Typed information in XML documents

For the previous example: the information could be stored in a DB.

Page 20: Intro to XML

Other XML Documents

XML documents can also be irregular
– Semi-structured information
– Document-centric information

Page 21: Intro to XML

Recall semi structured data

Features of semi-structured data
– Structure is irregular.
– Schema is implicitly included in the data.
– Structure of the data is incomplete.
– Schema is flexible.
– Schema is big.
– Schema is changing.

(Abiteboul, 1997)

Page 22: Intro to XML


Object Exchange Model (OEM) /1

See slides of first session

Page 23: Intro to XML

XML is Based on Markup

<bibliography>
  <paper ID="object-fusion">
    <authors>
      <author>Y.Papakonstantinou</author>
      <author>S.Abiteboul</author>
      <author>H.Garcia-Molina</author>
    </authors>
    <fullPaper source="fusion"/>
    <title>Object Fusion in Mediator Systems</title>
    <booktitle>VLDB 96</booktitle>
  </paper>
</bibliography>

Markup indicates structure and semantics

Decoupled from presentation
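Since the markup names the author and title explicitly, a query such as "find papers where author = X" needs no guessing. The following is a small sketch using Python's standard xml.etree.ElementTree; the helper name papers_by_author is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

BIB_XML = """<bibliography>
  <paper ID="object-fusion">
    <authors>
      <author>Y.Papakonstantinou</author>
      <author>S.Abiteboul</author>
      <author>H.Garcia-Molina</author>
    </authors>
    <fullPaper source="fusion"/>
    <title>Object Fusion in Mediator Systems</title>
    <booktitle>VLDB 96</booktitle>
  </paper>
</bibliography>"""


def papers_by_author(xml_text, author_name):
    """Return the titles of all papers that list author_name as an author."""
    root = ET.fromstring(xml_text)
    titles = []
    for paper in root.findall("paper"):
        authors = [a.text for a in paper.findall("authors/author")]
        if author_name in authors:
            titles.append(paper.findtext("title"))
    return titles


print(papers_by_author(BIB_XML, "S.Abiteboul"))
```

The same query against the HTML version would have to rely on layout conventions (position, italics, punctuation) instead of element names.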

Page 24: Intro to XML

XML Elements

Structure in an XML document is provided by markup, which consists of elements

An element consists of start/end tags
– An element can be empty

See the naming rules for elements
– XML processors are case-sensitive
– Names start with a letter or underscore; avoid the colon

Special element called the ROOT element
– Contains all other elements

Page 25: Intro to XML

Element Content

An XML element can:
– be empty, OR
– have content ANY (see note), OR
– be of mixed content type, e.g. (#PCDATA | car | train | plane)*, OR
– have a list of child elements

Note: elements with content type ANY are not checked by XML validators

Page 26: Intro to XML

Sample Elements and their Content

<bibliography>
  <paper ID="object-fusion">
    <authors>
      <author>Y.Papakonstantinou</author>
      <author>S.Abiteboul</author>
      <author>H.Garcia-Molina</author>
    </authors>
    <fullPaper source="fusion"/>
    <title>Object Fusion in Mediator Systems</title>
    <booktitle>VLDB 96</booktitle>
  </paper>
</bibliography>

[Figure callouts on the snippet: element, element name, content, empty element, nested element, character content (PCDATA)]

Page 27: Intro to XML

Markup and Character Data

XML docs are made up of markup and character data
– You can reference EXTERNAL binary data with entity references

Markup: start/end tags, entity references, character references, comments, CDATA section delimiters, document type declarations, and processing instructions

Character data: all text that is NOT markup

Page 28: Intro to XML

Example

<?xml version="1.0" encoding="UTF-8"?>
<DOCUMENT>
  <GREETING>
    This text is inside the &lt;GREETING&gt; element.
  </GREETING>
  <MESSAGE>
    Welcome to the wild and woolly world of XML.
  </MESSAGE>
</DOCUMENT>

The general entity references &lt; and &gt; turn into < and > in the parsed character data (PCDATA)

[Figure legend: markup, parsed character data, character data]
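A quick way to observe this is to parse the greeting and look at the resulting text. This sketch uses Python's standard xml.etree.ElementTree; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

doc = (
    "<DOCUMENT><GREETING>"
    "This text is inside the &lt;GREETING&gt; element."
    "</GREETING></DOCUMENT>"
)

greeting = ET.fromstring(doc).find("GREETING")

# The parser replaces the entity references with the characters they stand for.
print(greeting.text)

# The replaced text is character data, not markup: no child element is created.
print(len(greeting))
```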

Page 29: Intro to XML

XML Attributes

Be careful with the terminology: XML attributes are not relational attributes

Attributes are defined as name-value pairs
– Let you specify additional information in start tags and empty-element tags

Follow the same naming rules as tag names

Attribute values are text
– Enclose them in quotation marks (" ")

A given attribute may occur only once within a tag, but an element can be repeated

Special attribute types

Page 30: Intro to XML

Element Attributes

<bibliography>
  <paper ID="object-fusion">
    <authors>
      <author>Y.Papakonstantinou</author>
      <author>S.Abiteboul</author>
      <author>H.Garcia-Molina</author>
    </authors>
    <fullPaper source="fusion"/>
    <title>Object Fusion in Mediator Systems</title>
    <booktitle>VLDB 96</booktitle>
  </paper>
</bibliography>

[Figure callouts on ID="object-fusion": attribute name (ID), attribute value (object-fusion)]

Page 31: Intro to XML

Other XML Constructs

Prolog: XML declaration, comments, processing instructions, DTD.

XML declaration
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

Comments
<!-- this is a comment -->

Processing instruction
<?xml-stylesheet href="book.css" type="text/css"?>

CDATA
– Escape block containing characters that are not to be parsed (otherwise they would be recognized as markup). (Note: without a DTD, all attributes are by definition of type CDATA, which is a separate notion from CDATA sections.)
<![CDATA[<start>this is an incorrect element</end>]]>
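The effect of a CDATA section can be checked with a short sketch using Python's standard xml.etree.ElementTree. The wrapper element note is an assumption added so that the snippet has a root element.

```python
import xml.etree.ElementTree as ET

# Inside a CDATA section the mismatched tags are plain text.
cdata_doc = "<note><![CDATA[<start>this is an incorrect element</end>]]></note>"
note = ET.fromstring(cdata_doc)
print(note.text)

# Without the CDATA section the same characters are parsed as markup and rejected.
try:
    ET.fromstring("<note><start>this is an incorrect element</end></note>")
    parsed_without_cdata = True
except ET.ParseError:
    parsed_without_cdata = False
print(parsed_without_cdata)
```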

Page 32: Intro to XML

Other XML Constructs

Entities (like macros)
– &lt; for <
– &gt; for >
– &amp; for &
– &apos; for '
– &quot; for "
– Parameter entities in DTDs

Document Type Definition
– Defines the document's grammar
– See separate lecture

Page 33: Intro to XML

Well-Formed XML Documents

XML documents are subject to two specific constraints

– Well-formedness: a data object is an XML document if it is well-formed. A textual object is well-formed if:
• The document follows the document production (prolog, root element, optional misc. part, e.g., comments and/or processing instructions)
• It adheres to the syntax rules specified in the XML 1.0 recommendation (e.g., unique attribute names, properly nested tags)
• Each parsed entity is itself well-formed

– Validity: an XML document is valid if it obeys the document type definition (DTD) or XML schema that specifies the legal syntax of the document

Well-formedness ensures that the XML document parses into a labeled tree
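Well-formedness can be tested with any XML parser. Below is a minimal sketch using Python's standard xml.etree.ElementTree, which checks well-formedness only and does not validate against a DTD; the function name is_well_formed is an assumption.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<a><b>text</b></a>"))   # properly nested
print(is_well_formed("<a><b>text</a></b>"))   # tags overlap
print(is_well_formed('<a x="1" x="2"/>'))     # attribute name repeated in one tag
print(is_well_formed("<a/><b/>"))             # more than one root element
```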

Page 34: Intro to XML

XML and Semistructured Data

XML document:

<bibliography>
  <paper id="23" ...>
    <authors>
      <author>Yannis</author>
      <author>Serge</author>
      ...
    </authors>
    <title>Object Fusion</title>
    ...
  </paper>
</bibliography>

can be represented as an ssd-expression:

{bibliography:
  {paper:
    { id: 23,
      authors: {author: "Yannis", author: "Serge", … },
      title: "Object Fusion …"}}
}
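The mapping from an XML document to an ssd-expression can be sketched as a short recursive function. This is only an illustration under stated assumptions: attributes are turned into ordinary label/value pairs, mixed content is ignored, and the name to_ssd and the pair-list representation are not taken from the slides.

```python
import xml.etree.ElementTree as ET


def to_ssd(elem):
    """Return (label, value): value is text for a leaf, else a list of pairs."""
    # Attributes first, then child elements, each as a (label, value) pair.
    pairs = [(name, value) for name, value in elem.attrib.items()]
    pairs += [to_ssd(child) for child in elem]
    if not pairs:
        return (elem.tag, (elem.text or "").strip())
    return (elem.tag, pairs)


BIB = (
    '<bibliography><paper id="23"><authors>'
    "<author>Yannis</author><author>Serge</author>"
    "</authors><title>Object Fusion</title></paper></bibliography>"
)

ssd = to_ssd(ET.fromstring(BIB))
print(ssd)
```

A list of pairs is used instead of a dictionary because a label such as author may occur more than once under the same parent.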

Page 35: Intro to XML

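XML = Labeled Ordered Graph

[Figure: ordered tree for the bibliography example. Root node bibliography with paper child nodes; one paper has the attribute @id = 23 and the children authors (two author nodes: Yannis, Serge), title (Object Fusion …), and fullpaper.]

XML denotes graphs with labels on nodes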

Page 36: Intro to XML

Ssd-expression = Labeled Unordered Graph

[Figure: unordered graph for the same example. Edges labeled bibliography, paper, id (leading to 23), authors, author (leading to Yannis and Serge), title (leading to Object Fusion …), and fullpaper.]

Ssd-expression denotes graphs with labels on edges

Page 37: Intro to XML

XML Graphs

So far, we have only seen XML trees

Using references, we can create graphs

<state sid="s2">
  <scode>NV</scode>
  <sname>Nevada</sname>
</state>

The IDREF attribute type can be used; it holds the ID value of another element in the document

<city cid="c2">
  <ccode>CCN</ccode>
  <cname>Carson City</cname>
  <state-of state_ref="s2"/>
</city>

ID: special attribute type; validating XML processors make sure that no two elements in the same document have the same value for an attribute of type ID.

IDREF: holds the ID value of another element in the document.
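Following a reference turns the tree into a graph. Python's standard xml.etree.ElementTree does not know which attributes are of type ID or IDREF without a DTD, so this sketch resolves the reference by hand; the wrapper element geo and the function name state_of_city are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

GEO_XML = """<geo>
  <state sid="s2"><scode>NV</scode><sname>Nevada</sname></state>
  <city cid="c2"><ccode>CCN</ccode><cname>Carson City</cname>
    <state-of state_ref="s2"/></city>
</geo>"""


def state_of_city(xml_text, city_id):
    """Follow the state_ref of the given city and return the state's name."""
    root = ET.fromstring(xml_text)
    # Index all states by their ID-like attribute.
    states = {state.get("sid"): state for state in root.iter("state")}
    for city in root.iter("city"):
        if city.get("cid") == city_id:
            ref = city.find("state-of").get("state_ref")
            return states[ref].findtext("sname")
    return None


print(state_of_city(GEO_XML, "c2"))
```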

Page 38: Intro to XML

A Word About Order

Our semistructured data model is based on UNORDERED collections of tuples
– As are relations
– Unordered collections can be processed more efficiently (exploited by commercial DBMS)

XML is ORDERED
– Based on its origins in the information retrieval community
• Order is critical in documents
– However, attributes in XML are UNORDERED

Page 39: Intro to XML

Usage of Schema Description: DTD

Describes which elements can occur and how they are nested

Also: declaration of structure information

Advantages of DTD:
– Serves as documentation for XML documents
– Errors in XML documents can be detected
– Better quality of XML documents because of a structured, well-thought-out methodology

Page 40: Intro to XML

Definition of Elements in DTD

XML document:
<speaker>Ronald Bourret</speaker>

Corresponding DTD:
<!ELEMENT speaker (#PCDATA)>

XML document:
<speaker>
  <lname>Bourret</lname>
  <fname>Ronald</fname>
</speaker>

Corresponding DTD:
<!ELEMENT speaker (lname, fname)>
<!ELEMENT lname (#PCDATA)>
<!ELEMENT fname (#PCDATA)>

Page 41: Intro to XML

Definition of Elements in DTD cont.

Sequence (A, B)
– A and B must occur in the document in the given order

Alternative (A | B)
– Either A or B occurs in the document

Repetition
– A? : 0..1 times
– A+ : 1..n times
– A* : 0..n times

Mixed content (#PCDATA | A | B)*
– A, B, or other text occurs in the document

<!ELEMENT hotel (name, address)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT address (zip, city, ((street, number?) | BP))>
<!ELEMENT description (#PCDATA | equipment | gastronomy)*>

Page 42: Intro to XML

Example: Definition of Elements

<!ELEMENT hotel (name, address)>

<hotel>
  <name>Hotel Anwal</name>
  <address>...</address>
</hotel>

Page 43: Intro to XML

Example: Definition of Elements cont.

<!ELEMENT address (zip, city, ((street, number?) | BP))>

<address>
  <zip>53000</zip>
  <city>Ifrane</city>
  <street>Abdelkrim El Khattabi</street>
  <number>12</number>
</address>

<address>
  <zip>62000</zip>
  <city>Al Hoceima</city>
  <BP>12345</BP>
</address>
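A content model is essentially a regular expression over child element names. The sketch below makes that concrete by translating the address model into a Python regular expression by hand and matching it against the sequence of child tags. It is not how a validating parser is implemented and it does not read the DTD itself; ADDRESS_MODEL and matches_address_model are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated from: <!ELEMENT address (zip, city, ((street, number?) | BP))>
# Each child tag is written followed by one space, so "," becomes concatenation.
ADDRESS_MODEL = re.compile(r"zip city (street (number )?|BP )")


def matches_address_model(address_xml):
    """Check the order of the child elements of <address> against the model."""
    address = ET.fromstring(address_xml)
    child_sequence = "".join(child.tag + " " for child in address)
    return ADDRESS_MODEL.fullmatch(child_sequence) is not None


with_street = (
    "<address><zip>53000</zip><city>Ifrane</city>"
    "<street>Abdelkrim El Khattabi</street><number>12</number></address>"
)
with_bp = "<address><zip>62000</zip><city>Al Hoceima</city><BP>12345</BP></address>"
wrong_order = "<address><city>Ifrane</city><zip>53000</zip><BP>1</BP></address>"

print(matches_address_model(with_street))
print(matches_address_model(with_bp))
print(matches_address_model(wrong_order))
```

Note that element names are case-sensitive, so a child written as bp would not match the declared BP.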

Page 44: Intro to XML

Example: Definition of Elements DTD cont.

<!ELEMENT description (#PCDATA | equipment | gastronomy)*>

<description>The hotel Anwal is located in front of the City Hall, with a view of the Atlas mountains, and offers high-quality service.</description>

<description>Our hotel has a <equipment>sauna</equipment> and a <equipment>swimming pool</equipment>. The <gastronomy>hotel restaurant</gastronomy> offers regional cuisine and seafood.</description>

Page 45: Intro to XML

Attribute Syntax /1

Attributes are assigned to elements of an XML document:

<speaker tutorial='T1'>Ronald Bourret</speaker>

Corresponding DTD:

<!ELEMENT speaker (#PCDATA)>
<!ATTLIST speaker tutorial CDATA #REQUIRED>

[Figure callouts on the snippet: start tag, element content, end tag, attribute name (tutorial), attribute value (T1)]

Page 46: Intro to XML

Attribute Syntax /2

XML document

<coordinates x='200' y='300' z='150' />

DTD

<!ELEMENT coordinates EMPTY>
<!ATTLIST coordinates x CDATA #REQUIRED
                      y CDATA #REQUIRED
                      z CDATA #IMPLIED>

Page 47: Intro to XML

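Representation of XML Documents (incl. Elements and Attributes)

XML documents are trees!

Example:

<speaker tutorial='T1'>
  <lname>Bourret</lname>
  <fname>Ronald</fname>
</speaker>

[Figure: tree with the element node speaker as root; its children are the attribute node tutorial (value T1), the element node lname (text node Bourret), and the element node fname (text node Ronald). Legend: element nodes, text nodes, attribute nodes.]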

Page 48: Intro to XML

Declaration of Attributes in DTD

Attributes have
– A name
– A type (CDATA, ID, IDREF/IDREFS, ENTITY/ENTITIES, NMTOKEN/NMTOKENS, or an enumeration (value1|value2|...))
– A declaration of whether the attribute has to occur (#REQUIRED, #IMPLIED, or #FIXED)
– An optional default value (mandatory in the case of #FIXED)

<!ATTLIST price currency CDATA #REQUIRED>

<!ATTLIST project id ID #REQUIRED>

<!ATTLIST person project IDREF #REQUIRED>

<!ATTLIST zip xml-sqltype CDATA #FIXED 'INTEGER'>

Page 49: Intro to XML

ID / IDREF

Attributes can also be declared as ID / IDREF / IDREFS

ID values are unique within a document

<!ELEMENT project (title)>
<!ATTLIST project member IDREF #REQUIRED>
<!ELEMENT person (name)>
<!ATTLIST person id ID #REQUIRED>
...
<!ELEMENT department EMPTY>
<!ATTLIST department dep_id ID #REQUIRED>

<project member='p0001'>
  <title>...</title>
</project>
<project member='a0001'>
  <title>...</title>
</project>
...
<person id='p0001'>
  <name>Khattabi</name>
</person>
...
<department dep_id='a0001'/>

Page 50: Intro to XML

ID/IDREF-Overview

Values of IDREF attributes show which IDs are referenced. The referenced elements can be of different element types.

Global uniqueness of IDs is a MUST. (Compare to PK and FK in a DB?)

[Figure: graph of project, person, and dept elements in which the member attributes of the projects point to the id attributes of persons and the dept_id attributes of depts.]
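The two rules (IDs are globally unique, every IDREF must point to an existing ID) can be sketched as a manual check. Python's standard xml.etree.ElementTree does not read the attribute types from a DTD, so the tables ID_ATTRS and IDREF_ATTRS below restate the declarations by hand; they, the wrapper element org, and the function name check_ids are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Restated by hand from the ATTLIST declarations (element name -> attribute name).
ID_ATTRS = {"person": "id", "department": "dep_id"}
IDREF_ATTRS = {"project": "member"}


def check_ids(xml_text):
    """Return (duplicate ID values, IDREF values that match no ID)."""
    root = ET.fromstring(xml_text)
    seen, duplicates, refs = set(), [], []
    for elem in root.iter():
        id_attr = ID_ATTRS.get(elem.tag)
        if id_attr and elem.get(id_attr) is not None:
            value = elem.get(id_attr)
            if value in seen:
                duplicates.append(value)
            seen.add(value)
        ref_attr = IDREF_ATTRS.get(elem.tag)
        if ref_attr and elem.get(ref_attr) is not None:
            refs.append(elem.get(ref_attr))
    dangling = [ref for ref in refs if ref not in seen]
    return duplicates, dangling


GOOD = (
    '<org><project member="p0001"><title>A</title></project>'
    '<project member="a0001"><title>B</title></project>'
    '<person id="p0001"><name>Khattabi</name></person>'
    '<department dep_id="a0001"/></org>'
)
BAD = '<org><project member="x9"/><person id="p0001"/><person id="p0001"/></org>'

print(check_ids(GOOD))
print(check_ids(BAD))
```

All ID values share one namespace regardless of element type, which is why a single set is used; this differs from primary keys, which are unique only per table.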

Page 51: Intro to XML

Entities /1

General entity declaration:
– Defines reusable document fragments

<!ENTITY GermanDay 'German Day September 30, 2003'>
<!ENTITY Miteinander 'Dialog of Culture'>

<event>
  We are glad to announce the organization of &GermanDay;, entitled &Miteinander;.
</event>
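The expansion of general entities can be observed with a parser that reads the internal DTD subset. This sketch uses Python's standard xml.etree.ElementTree; putting the declarations into an internal subset of a DOCTYPE is an assumption made to keep the example in one string. Entity names are case-sensitive and each reference ends with a semicolon.

```python
import xml.etree.ElementTree as ET

EVENT_XML = """<!DOCTYPE event [
  <!ENTITY GermanDay "German Day September 30, 2003">
  <!ENTITY Miteinander "Dialog of Culture">
]>
<event>We are glad to announce &GermanDay;, entitled &Miteinander;.</event>"""

event = ET.fromstring(EVENT_XML)

# Both references have been replaced by their declared replacement text.
print(event.text)
```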

Page 52: Intro to XML

Entities /2

Predefined entities

Special symbols
– &lt; for <
– &gt; for >
– &amp; for &
– &apos; for '
– &quot; for "

These symbols cannot simply be written as-is because the parser treats them specially.

Page 53: Intro to XML

Entities /3

Character entities

Decimal numbers in the range 0..255
– Extended ASCII set (ISO 8859-1), also called Latin-1

Decimal numbers in the range 256..65535
– Unicode (ISO 10646)

Hexadecimal numbers, such as x23

Example:
– &#60; for <
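Decimal and hexadecimal character references can be compared directly in a parser. A small sketch with Python's standard xml.etree.ElementTree (the element name s is arbitrary):

```python
import xml.etree.ElementTree as ET

# &#60; and &#x3C; both denote "<"; &#35; and &#x23; both denote "#".
sample = ET.fromstring("<s>&#60; &#x3C; &#35; &#x23;</s>")
print(sample.text)
```

A reference always ends with a semicolon; the hexadecimal form is introduced by &#x.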

Page 54: Intro to XML

Entities /4

Unparsed entities incorporate other file formats
– Pictures and others

Example:

<!ELEMENT Hotel (#PCDATA)>
<!ATTLIST Hotel View ENTITY #IMPLIED>
<!ENTITY View_winter SYSTEM "view_winter.gif" NDATA GIF>
<!NOTATION GIF SYSTEM 'gifviewer.exe'>

<Hotel View="View_winter">
  Anwal Hotel
</Hotel>

Page 55: Intro to XML

Entities /5

Parameter entities
– To be used in DTDs

Example:
– <!ENTITY % addressdef '(city,zip,street,number)'>
– <!ELEMENT address %addressdef;>

Used within the declarations of a DTD

Objective: reuse ("modularization")

Page 56: Intro to XML

A Complete Example

<!-- Hotel DTD -->
<!ELEMENT hotel (name, kategory?, address, pathdescription, price*)>
<!ATTLIST hotel id  ID    #REQUIRED
                url CDATA #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT kategory (#PCDATA)>
<!ELEMENT address (zip, city, street, number, telefon, fax?, e-mail?)>
<!ELEMENT zip (#PCDATA)>
...
<!ELEMENT pathdescription (#PCDATA)>
<!ELEMENT price (#PCDATA | singleroom | doubleroom | appartment)*>
<!ATTLIST price currency CDATA #REQUIRED>
<!ELEMENT singleroom (#PCDATA)>
...

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE hotel SYSTEM "hotel_dt.dtd">
<hotel id="id001" url="http://www.anwal.ma">
  <name>Hotel Anwal</name>
  <address>
    <zip>53000</zip>
    <city>Ifrane</city>
    ...
  </address>
  <pathdescription>The hotel Anwal is located in front of the City Hall, with a view of the Atlas mountains, and offers high-quality service.</pathdescription>
  <price currency="DH">
    <singleroom>from 200,-</singleroom>
  </price>
  ...
</hotel>

Page 57: Intro to XML

Example of a Complete DTD

<!ELEMENT publications (book | article | conference)*>

<!ELEMENT book (front, body, references)>

<!ATTLIST book isbn CDATA #REQUIRED>

<!ELEMENT front (title, author+, edition?, publisher)>

<!ELEMENT title (#PCDATA)>

<!ELEMENT author (first, second, e-mail?)>

<!ELEMENT first (#PCDATA)>

...

<!ELEMENT edition (#PCDATA)>

<!ELEMENT publisher (#PCDATA)>

<!ELEMENT body (part+ | chapter+)>

<!ELEMENT part (ptitle, chapter+)>

<!ATTLIST part id ID #REQUIRED>

<!ELEMENT ptitle (#PCDATA)>

<!ELEMENT chapter (ctitle, section+)>

Page 58: Intro to XML


Graphical Presentation of DTD

Page 59: Intro to XML

Classification of XML Documents

Data-centric documents
– Structured, regular
– Examples: product catalogs, orders, invoices

<order>
  <customer>Meyer</customer>
  <position>
    <isbn>1-234-56789-0</isbn>
    <number>2</number>
    <price currency="Euro">30.00</price>
  </position>
</order>

Document-centric documents
– Unstructured, irregular
– Examples: scientific articles, books, emails, websites

<content>XML builds on the principles of two existing languages, <emph>HTML</emph> and <emph>SGML</emph> to create a simple mechanism .. The generalized markup concept ..</content>

Semi-structured documents
– Data- and document-centric
– Examples: publications, Amazon

<book>
  <author>Neil Bradley</author>
  <title>XML companion</title>
  <isbn>1-234-56789-0</isbn>
  <content>XML builds on the principles of two existing languages, <emph>HTML</emph> and .. </content>
</book>

Page 60: Intro to XML

Limitations of DTDs

– DTDs impose order
– No data types beyond simple #PCDATA or CDATA
– Lack of range specification

From a DB point of view, we conclude that the data typing offered by DTDs is inadequate.

Page 61: Intro to XML

Recommendation

To get familiar with XML
– Create an XML document including elements and attributes
– Test that it is well-formed
– Create a DTD
– Test the validity

Use a tool such as XML Spy, www.xmlspy.com