Post on 19-Jan-2016
Intro to XML
Hachim Haddouti, Al Akhawayn University
SSE
H.Haddouti@alakhawayn.ma
http://mail.alakhawayn.ma/~H.Haddouti
TOC
– Intro
– W3C
– Historical development
– Scenarios in XML and Data Management
1) Motivation
XML - EXtensible Markup Language
Markup language
– Markup: data and information about information (metadata) within a document
Developed by the World Wide Web Consortium (W3C)
Human-readable
Mostly used as an exchange format
W3C
(World Wide Web Consortium)
Over 400 member organisations
Hosted by MIT, INRIA, and Keio University
50 full-time staff members
Founded by Tim Berners-Lee, the inventor of the WWW
Examples of W3C standards:
– XML
– HTML
– DOM
– XPath
– XML Schema
– ...
Process at W3C
Note
– Suggestions (no commitment on the part of the W3C)
Working Draft
– Work in progress; has to be approved by everyone involved at the W3C
Candidate Recommendation
– Approved for test implementation only
Recommendation
The XML Phenomenon
"XML is the ASCII of the 21st century." "XML is the ASCII of the Web."
Henry Thompson (1999)
Why is it so popular?
What’s Wrong with HTML?
<DT><IMG SRC="greenball.gif" > <A NAME="object-fusion"></A>Y.Papakonstantinou, S.Abiteboul, H.Garcia-Molina. <A HREF="http://www-cse.ucsd.edu/~yannis/papers/fusion.ps">"ObjectFusion in Mediator Systems".</A> In <I>VLDB 96.</I></DT>
Y. Papakonstantinou, S. Abiteboul, H. Garcia-Molina. "Object Fusion in Mediator Systems". In VLDB 96.
HTML is markup for presentation only
...What’s Wrong with HTML...
<DT><IMG SRC= "greenball.gif" > <A NAME="object-fusion"></A>Y.Papakonstantinou, S.Abiteboul, H.Garcia-Molina. <A HREF="http://www-cse.ucsd.edu/~yannis/papers/fusion.ps">"ObjectFusion in Mediator Systems".</A> In <I>VLDB 96.</I></DT>
No Explicit Structure, Semantics, or Object-Orientation
Callouts on the slide mark the implicit fields: Author, Title, Conference
... And Some Repercussions
Lack of schema/semantics when querying the Web (HTML):
– "find documents (books, papers, ...)
where author = Michael Jackson" (... and learn how software engineering meets the moon walker ...)
– "create a list of M. Jackson's books and (if available) their prices"
=> HTML is inappropriate for
– data exchange
– automation of information management (retrieval, manipulation, integration)
XML is ...
Standardized for all applications
A worldwide exchange format (write once, read everywhere)
A metalanguage for defining other languages
– Examples: MathML, ChessML, XUL (user interfaces), CellML, Gene Expression Markup Language, Chemical Markup Language, XML/EDI, UN/EDIFACT
– Nowadays over 300
So what is XML (all about)?
Executive Summary:
• XML = HTML – idiosyncrasies (simplified syntax) + user-definable ("semantic") tags
• Separation of data and its presentation
• Tags such as FONT or CENTER are not necessary in XML. XML uses structural tags such as TITLE, CHAPTER, etc., so the document structure remains constant across different media.
=> simple, very flexible data exchange format: semistructured data model
=> new applications:
• Information exchange (B2B), sharing (diglib), integration ("mediation"), archival, ...
• Web site management (XML + XSL stylesheets), ...
…
"It takes ten minutes to understand (base) XML, and then ten months to understand the new technologies hung around it."
(Peter Chen)
So why an XML course at AUI?
Extensible Markup Language (XML)
De facto standard adopted by the W3C
– Describes content rather than presentation
Three major differences from HTML
– New tags may be defined at will
– Nested structures
– An XML document can contain an optional description of its grammar
Historical Development XML /1
[Figure, from Neil Bradley, The XML Companion: timeline with the years 1960, 1986, 1992, and 1997 and the labels Internet, Generalized Markup, SGML, WWW, HTML, and XML]
Historical Development XML /2
[Figure: timeline from 1997 to 2002 showing XML, DOM, XPath 1.0, XQL, XML-QL, Quilt, XUpdate, XML Schema, XQuery 1.0, and XPath 2.0, each marked as a W3C recommendation, another proposal, or in progress]
2) Documents ...
For communication between humans
– Human – Human
• Natural (human) language is used; it contains complex and irregular structures
For communication involving computers
– Computer – Computer
• Data-oriented
– Human – Computer
• Document-oriented
– XML allows the representation and transport of this information
XML Documents
Before the syntax, some examples of XML documents
Example: XML Document
<?xml version="1.0" encoding="UTF-8"?>
<invoice customerNo="k333063143">
  <monthsprice>0,00</monthsprice>
  <detailedinvoice>
    <call>
      <date>26.2.</date><time>19:47</time><number>200xxxx</number>
      <itemprice currency="Euro">0,66</itemprice>
    </call>
    <call>
      <date>27.2.</date><time>19:06</time><number>200xxxx</number>
      <itemprice currency="Euro">0,46</itemprice>
    </call>
    <call_charge_total currency="Euro">2.19</call_charge_total>
  </detailedinvoice>
</invoice>
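To show what such a data-oriented document gives a program, here is a minimal sketch that reads the invoice with Python's standard xml.etree.ElementTree module. The helper parse_price and the decision to sum the itemprice values are assumptions made for illustration; they are not part of the slides.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

INVOICE = """<?xml version="1.0" encoding="UTF-8"?>
<invoice customerNo="k333063143">
  <monthsprice>0,00</monthsprice>
  <detailedinvoice>
    <call>
      <date>26.2.</date><time>19:47</time><number>200xxxx</number>
      <itemprice currency="Euro">0,66</itemprice>
    </call>
    <call>
      <date>27.2.</date><time>19:06</time><number>200xxxx</number>
      <itemprice currency="Euro">0,46</itemprice>
    </call>
    <call_charge_total currency="Euro">2.19</call_charge_total>
  </detailedinvoice>
</invoice>"""


def parse_price(text):
    # Hypothetical helper: the sample writes prices with a decimal comma ("0,66").
    return Decimal(text.strip().replace(",", "."))


# Parse from bytes so that the encoding declaration is honoured by the parser.
root = ET.fromstring(INVOICE.encode("utf-8"))
calls = root.findall("./detailedinvoice/call")
item_total = sum((parse_price(c.findtext("itemprice")) for c in calls), Decimal("0"))

print(root.get("customerNo"), len(calls), item_total)
```

The two listed calls add up to 1.12, not to the 2.19 in call_charge_total; the slide apparently shows only an excerpt of the calls.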
XML Document - Features
XML documents contain both the data and the structure of the data within a document (self-describing)
All documents have the same or a similar structure (regular)
Information in XML documents is typed
For the previous example: the information could be stored in a database.
Other XML Documents
XML documents can also be irregular
Semi structured information
document-centric information
Recall: Semistructured Data
Features of semistructured data
– Structure is irregular.
– Schema is implicitly included in the data.
– Structure of the data is incomplete.
– Schema is flexible.
– Schema is big.
– Schema is changing.
(Abiteboul, 1997)
Object Exchange Model (OEM) /1
See slides of first session
XML is Based on Markup
<bibliography>
  <paper ID="object-fusion">
    <authors>
      <author>Y.Papakonstantinou</author>
      <author>S.Abiteboul</author>
      <author>H.Garcia-Molina</author>
    </authors>
    <fullPaper source="fusion"/>
    <title>Object Fusion in Mediator Systems</title>
    <booktitle>VLDB 96</booktitle>
  </paper>
</bibliography>
Markup indicates structure and semantics
Decoupled from presentation
XML Elements
Structure in an XML document is provided by markup, which consists of elements
An element consists of a start tag and an end tag
– An element can be empty
Naming rules for elements
– Names are case-sensitive
– Start with a letter or underscore; avoid colons
A special element is the ROOT element
– It contains all other elements
Element Content
An XML element can
– be empty, OR
– have content ANY [*], OR
– be of mixed content type, e.g. (#PCDATA | car | train | plane)*, OR
– have a list of child elements
[*] Elements with content type ANY are not checked by XML validators
Sample Elements and their Content
Callouts on the slide: element, element name, content, empty element, nested element, character content (PCDATA)
<bibliography>
  <paper ID="object-fusion">
    <authors>
      <author>Y.Papakonstantinou</author>
      <author>S.Abiteboul</author>
      <author>H.Garcia-Molina</author>
    </authors>
    <fullPaper source="fusion"/>
    <title>Object Fusion in Mediator Systems</title>
    <booktitle>VLDB 96</booktitle>
  </paper>
</bibliography>
Markup and Character Data
XML documents are made up of markup and character data
– EXTERNAL binary data can be referenced with entity references
Markup: start/end tags, entity references, character references, comments, CDATA section delimiters, document type declarations, and processing instructions
Character data: all text that is NOT markup
Example
<?xml version="1.0" encoding="UTF-8"?>
<DOCUMENT>
<GREETING>
This text is inside the &lt;GREETING&gt; element.
</GREETING>
<MESSAGE>
Welcome to the wild and woolly world of XML.
</MESSAGE>
</DOCUMENT>
The general entity references &lt; and &gt; turn into parsed character data (PCDATA)
Callouts on the slide: markup, parsed character data, character data
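A small sketch of how a parser treats the distinction between markup and parsed character data, using Python's standard xml.etree.ElementTree. The document is a shortened copy of the example; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

DOC = """<DOCUMENT>
  <GREETING>This text is inside the &lt;GREETING&gt; element.</GREETING>
  <MESSAGE>Welcome to the wild and woolly world of XML.</MESSAGE>
</DOCUMENT>"""

root = ET.fromstring(DOC)

# Start and end tags are markup: they become element nodes.
child_names = [child.tag for child in root]

# &lt; and &gt; are entity references: the parser replaces them with the
# characters < and >, which end up as character data, not as a nested element.
greeting = root.findtext("GREETING")
nested_in_greeting = len(root.find("GREETING"))

print(child_names)
print(greeting)
print(nested_in_greeting)
```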
XML Attributes
Be careful with the terminology: XML attributes are not relational attributes
Attributes are defined as name-value pairs
– They let you specify additional information in start tags and empty-element tags
Attribute names follow the same naming rules as tag names
Attribute values are text
– Enclose them in quotation marks (" " or ' ')
A given attribute may occur only once within a tag, whereas an element can be repeated
Special attribute types
Element Attributes
<bibliography>
  <paper ID="object-fusion">
    <authors>
      <author>Y.Papakonstantinou</author>
      <author>S.Abiteboul</author>
      <author>H.Garcia-Molina</author>
    </authors>
    <fullPaper source="fusion"/>
    <title>Object Fusion in Mediator Systems</title>
    <booktitle>VLDB 96</booktitle>
  </paper>
</bibliography>
Callouts on the slide: attribute name, attribute value
Other XML Constructs
Prolog: XML declaration, comments, processing instructions, DTD
XML declaration
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
Comments
<!-- this is a comment -->
Processing instruction
<?xml-stylesheet href="book.css" type="text/css"?>
CDATA
– Escape block containing characters that are not to be parsed (otherwise they would be recognized as markup). (Note: without a DTD, all attributes are by definition of type CDATA.)
<![CDATA[<start>this is an incorrect element</end>]]>
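The effect of a CDATA section can be checked with a short sketch based on Python's standard xml.etree.ElementTree. The wrapping note element is an assumption added so that the fragment has a root element.

```python
import xml.etree.ElementTree as ET

# Inside CDATA the mismatched tags are plain characters, so the document parses.
escaped = ET.fromstring(
    "<note><![CDATA[<start>this is an incorrect element</end>]]></note>"
)
print(escaped.text)
print(len(escaped))  # number of child elements

# Without the CDATA section the same characters are markup, and <start>...</end>
# is not properly nested, so the parser rejects the document.
try:
    ET.fromstring("<note><start>this is an incorrect element</end></note>")
    parsed_without_cdata = True
except ET.ParseError:
    parsed_without_cdata = False
print(parsed_without_cdata)
```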
Other XML Constructs
Entities (like macros)
– &lt; for <
– &gt; for >
– &amp; for &
– &apos; for '
– &quot; for "
– Parameter entities in DTDs
Document Type Definition
– Defines the document's grammar
– See separate lecture
Well-Formed XML Documents
XML documents are subject to two specific constraints
– Well-formedness: A data object is an XML document if it is well-formed. A textual object is well-formed if:
• The document matches the document production (prolog, root element, optional miscellaneous part, e.g., comments and/or processing instructions)
• It adheres to the syntax rules specified in the XML 1.0 Recommendation (e.g., unique attribute names, properly nested tags)
• Each parsed entity must itself be well-formed
– Validity: An XML document is valid, if it obeys the document type definition (DTD) or XML schema that you use to specify the legal syntax of the document
Well-formedness ensures that an XML document parses into a labeled tree
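Well-formedness can be tested with any non-validating parser. The sketch below uses Python's standard xml.etree.ElementTree; the function name is_well_formed and the sample strings are illustrative assumptions. Note that it checks well-formedness only, not validity against a DTD or schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as a well-formed document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


samples = {
    "properly nested": "<a><b>text</b></a>",
    "overlapping tags": "<a><b>text</a></b>",
    "duplicate attribute": '<a x="1" x="2"/>',
    "two root elements": "<a/><b/>",
}

results = {name: is_well_formed(text) for name, text in samples.items()}
for name, ok in results.items():
    print(name, ok)
```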
XML and Semistructured Data
XML document:
<bibliography>
  <paper id=23...>
    <authors>
      <author>Yannis</author>
      <author>Serge</author>
      ...
    </authors>
    <title>Object Fusion</title>
    ...
  </paper>
</bibliography>
It can be represented as an ssd-expression:
{bibliography:
  {paper:
    {id: 23,
     authors: {author: "Yannis", author: "Serge", …},
     title: "Object Fusion …"}}}
XML = Labeled Ordered Graph
[Figure: tree with the node labels bibliography, paper, paper, @id (23), authors, author (Yannis), author (Serge), title (Object Fusion …), and fullpaper]
XML denotes graphs with labels on nodes
Ssd-expression = Labeled Unordered Graph
[Figure: graph with the edge labels bibliography, paper, paper, id (23), authors, author (Yannis), author (Serge), title (Object Fusion …), and fullpaper]
An ssd-expression denotes graphs with labels on edges
XML Graphs
So far, we have only seen XML trees
Using references, we can create graphs
<state sid="s2">
<scode>NV</scode>
<sname>Nevada</sname>
</state>
<city cid="c2">
<ccode>CCN</ccode>
<cname>Carson City</cname>
<state-of state_ref="s2"/>
</city>
ID: special attribute type; validating XML processors make sure that no two elements in the same document have the same value for an attribute of type ID.
IDREF: attribute type that holds the ID value of another element in the document.
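A sketch of how an application might follow such a reference. Python's standard xml.etree.ElementTree does not read a DTD, so it does not know which attributes are of type ID or IDREF; the code therefore assumes that sid plays the ID role and state_ref the IDREF role. The wrapping geo element and the trimmed child elements are also assumptions for illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<geo>
  <state sid="s2"><sname>Nevada</sname></state>
  <city cid="c2">
    <cname>Carson City</cname>
    <state-of state_ref="s2"/>
  </city>
</geo>"""

root = ET.fromstring(DOC)

# Index every state by its ID-like attribute.
states_by_id = {state.get("sid"): state for state in root.iter("state")}

# Follow the IDREF-like attribute from each city to the state it points at.
resolved = []
for city in root.iter("city"):
    ref = city.find("state-of").get("state_ref")
    state = states_by_id[ref]
    resolved.append((city.findtext("cname"), state.findtext("sname")))

print(resolved)
```

With the reference resolved, the city node has an edge to a node outside its own subtree, which is what turns the tree into a graph.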
A Word About Order
Our semistructured data model is based on UNORDERED collections of tuples
– As are relations
– Unordered collections can be processed more efficiently (exploited by commercial DBMSs)
XML is ORDERED
– Based on its origins in the information retrieval community
• Order is critical in documents
– However, attributes in XML are UNORDERED
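The difference between ordered child elements and unordered attributes can be observed with a short sketch using Python's standard xml.etree.ElementTree. The element and attribute names are made up for the illustration.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring('<p x="1" y="2"><first/><second/></p>')
b = ET.fromstring('<p y="2" x="1"><second/><first/></p>')

# Attributes are exposed as a mapping, so their textual order does not matter.
same_attributes = a.attrib == b.attrib

# Child elements are exposed as a sequence, so their order is significant.
same_child_order = [c.tag for c in a] == [c.tag for c in b]

print(same_attributes, same_child_order)
```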
Usage of Schema Description : DTD
Specifies which elements can occur and how they are nested
Also: declaration of structure information
Advantages of a DTD:
– Serves as documentation for XML documents
– Errors in XML documents can be detected
– Better quality of XML documents because of a structured, well-thought-out methodology
Definition of Elements in DTD
XML document:
<speaker> Ronald Bourret </speaker>
Corresponding DTD:
<!ELEMENT speaker (#PCDATA)>
XML document:
<speaker>
<lname> Bourret </lname> <fname> Ronald </fname>
</speaker>
Corresponding DTD:
<!ELEMENT speaker (lname, fname)>
<!ELEMENT lname (#PCDATA)>
<!ELEMENT fname (#PCDATA)>
Definition of Elements in DTD cont.
Sequence (A, B)
– A and B must occur in the document in the given order
Alternative (A | B)
– Either A or B occurs in the document
Repetition
– A? : 0..1 times
– A+ : 1..n times
– A* : 0..n times
Mixed content (#PCDATA | A | B)*
– A, B, or other text occurs in the document
<!ELEMENT hotel (name, address)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT address (zip, city, ((street, number?) | BP))>
<!ELEMENT description (#PCDATA | equipment | gastronomy)*>
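Content models behave like regular expressions over the names of the child elements. The sketch below makes that analogy concrete for the address declaration by hand-translating it into a Python regular expression. This is an illustration of the operators only, not a DTD validator: the standard library does not validate against DTDs, and ADDRESS_MODEL and matches_address_model are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

# (zip, city, ((street, number?) | BP)) rewritten by hand as a regular
# expression over the child element names, each followed by one space.
ADDRESS_MODEL = re.compile(r"zip city (?:street (?:number )?|BP )")


def matches_address_model(xml_text):
    element = ET.fromstring(xml_text)
    child_names = "".join(child.tag + " " for child in element)
    return ADDRESS_MODEL.fullmatch(child_names) is not None


street_form = "<address><zip>53000</zip><city>Ifrane</city><street>Abdelkrim El Khattabi</street><number>12</number></address>"
bp_form = "<address><zip>62000</zip><city>Al Hoceima</city><BP>12345</BP></address>"
wrong_order = "<address><city>Ifrane</city><zip>53000</zip><BP>1</BP></address>"
both_forms = "<address><zip>1</zip><city>X</city><street>S</street><BP>2</BP></address>"

for sample in (street_form, bp_form, wrong_order, both_forms):
    print(matches_address_model(sample))
```

The sequence operator becomes concatenation, the alternative becomes |, and number? becomes an optional group, so the last two samples are rejected.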
Example: Definition of Elements
<!ELEMENT hotel (name, address)>
<hotel>
<name>Hotel Anwal</name>
<address>...</address>
</hotel>
Example: Definition of Elements cont.
<!ELEMENT address (zip, city, ((street, number?) | BP))>
<address>
<zip>53000</zip>
<city>Ifrane</city>
<street>Abdelkrim El Khattabi</street>
<number>12</number>
</address>
<address>
<zip>62000</zip>
<city>Al Hoceima</city>
<BP>12345</BP>
</address>
Example: Definition of Elements DTD cont.
<!ELEMENT description (#PCDATA | equipment | gastronomy)*>
<description>The hotel Anwal is located in front of the City Hall, with a view of the Atlas mountains and high-quality service.</description>
<description>Our hotel has a <equipment>sauna</equipment> and a <equipment>swimming pool</equipment>. The <gastronomy>hotel restaurant</gastronomy> offers regional cuisine and seafood.</description>
Attribute Syntax /1
Attributes are assigned to elements of an XML document:
<speaker tutorial='T1'> Ronald Bourret </speaker>
Corresponding DTD:
<!ELEMENT speaker (#PCDATA)>
<!ATTLIST speaker tutorial CDATA #REQUIRED>
Callouts on the slide: start tag, attribute name, attribute value, element content, end tag
Attribute Syntax / 2
XML Document
<coordinates x='200' y='300' z='150'/>
DTD
<!ELEMENT coordinates EMPTY>
<!ATTLIST coordinates x CDATA #REQUIRED
y CDATA #REQUIRED
z CDATA #IMPLIED>
Representation of XML Documents (incl. Elements and Attributes)
XML documents are trees!
Example:
<speaker tutorial='T1'>
<lname>Bourret</lname>
<fname>Ronald</fname>
</speaker>
[Figure: tree with the element node speaker at the root, an attribute node tutorial (T1), and the element nodes lname and fname with the text nodes Bourret and Ronald]
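The tree view can be reproduced with a short recursive walk over the parsed document. This is a sketch using Python's standard xml.etree.ElementTree; the function name describe and the output format are assumptions chosen for the illustration.

```python
import xml.etree.ElementTree as ET


def describe(element, depth=0):
    """Return one line per element, attribute, and text node, indented by depth."""
    lines = ["  " * depth + "element: " + element.tag]
    for name, value in element.attrib.items():
        lines.append("  " * (depth + 1) + "attribute: " + name + " = " + value)
    if element.text and element.text.strip():
        lines.append("  " * (depth + 1) + "text: " + element.text.strip())
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines


root = ET.fromstring(
    "<speaker tutorial='T1'><lname>Bourret</lname><fname>Ronald</fname></speaker>"
)
tree_lines = describe(root)
print("\n".join(tree_lines))
```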
Declaration of Attributes in DTD
Attributes have
– a name
– a type (CDATA, ID, IDREF/IDREFS, ENTITY/ENTITIES, NMTOKEN/NMTOKENS, or an enumeration (value1|value2|...))
– a predicate stating whether the attribute has to occur (#REQUIRED, #IMPLIED, or #FIXED), or
– an optional default value (required in the case of #FIXED)
<!ATTLIST price currency CDATA #REQUIRED>
<!ATTLIST project id ID #REQUIRED>
<!ATTLIST person project IDREF #REQUIRED>
<!ATTLIST zip xml-sqltype CDATA #FIXED 'INTEGER'>
ID / IDREF
Attributes can also be declared as ID, IDREF, or IDREFS
ID values must be unique within a document
<!ELEMENT project (title)>
<!ATTLIST project member IDREF #REQUIRED>
<!ELEMENT person (name)>
<!ATTLIST person id ID #REQUIRED>
...
<!ELEMENT department EMPTY>
<!ATTLIST department dep_id ID #REQUIRED>
<project member='p0001'>
<title>...</title>
</project>
<project member='a0001'>
<title>...</title>
</project>
...
<person id='p0001'>
<name>Khattabi</name>
</person>
...
<department dep_id='a0001'/>
ID/IDREF-Overview
The values of IDREF attributes show which IDs are referenced. The referenced elements can be of different element types.
Global uniqueness of IDs is a MUST (compare to primary and foreign keys in databases).
[Figure: project elements with title children and member attributes that reference person elements (id) and dept elements (dep_id)]
Entities /1
General entity declaration:
– Defines reusable sections of a document
<!ENTITY GermanDay 'German Day September 30, 2003'>
<!ENTITY Miteinander 'Dialog of Culture'>
<event>
We are glad to announce the organization of &GermanDay;, entitled &Miteinander;.
</event>
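A sketch of how the parser expands such general entities, using Python's standard xml.etree.ElementTree. It assumes that the two declarations are placed in an internal DTD subset of the same document, which the slide does not show; entity names are case-sensitive and every reference ends with a semicolon.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE event [
  <!ENTITY GermanDay 'German Day September 30, 2003'>
  <!ENTITY Miteinander 'Dialog of Culture'>
]>
<event>We are glad to announce the organization of &GermanDay;, entitled &Miteinander;.</event>"""

event = ET.fromstring(DOC)

# The references are replaced by the declared text while parsing.
print(event.text)
```

Because entity expansion happens inside the parser, documents from untrusted sources should be parsed with entity expansion restricted.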
Entities /2
Predefined entities for special symbols
– &lt; for <
– &gt; for >
– &amp; for &
– &apos; for '
– &quot; for "
These symbols cannot simply be typed as-is because they need specific processing.
Entities /3
Character entities
Decimal numbers in the range 0..255
– Extended ASCII set (ISO 8859-1), also called Latin-1
Decimal numbers in the range 256..65535
– Unicode (ISO 10646)
Hexadecimal numbers, such as x23
Example:
– &#60; for <
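Decimal and hexadecimal character references can be tried out with a short sketch using Python's standard xml.etree.ElementTree. The element name t and the chosen code points are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# &#60; and &#x3C; both denote code point 60 (<); &#233; is in the Latin-1
# range (e with acute accent); &#x20AC; is a code point above 255 (euro sign).
root = ET.fromstring("<t>&#60; &#x3C; &#233; &#x20AC;</t>")
code_points = [ord(ch) for ch in root.text.split()]

print(root.text)
print(code_points)
```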
Entities /4
Unparsed entities incorporate other file formats
– Pictures and others
Example:
<!ELEMENT Hotel (#PCDATA)>
<!ATTLIST Hotel view ENTITY #IMPLIED>
<!ENTITY view_winter SYSTEM "view_winter.gif" NDATA GIF>
<!NOTATION GIF SYSTEM 'gifviewer.exe'>
<Hotel view="view_winter">
Anwal Hotel
</Hotel>
Entities /5
Parameter entities
– To be used in DTDs
Example:
– <!ENTITY % addressdef '(city,zip,street,number)'>
– <!ELEMENT address %addressdef;>
Used inside DTD declarations
Objective: reuse ("modularization")
A Complete Example
<!-- Hotel DTD-->
<!ELEMENT hotel (name, kategory?, address,
pathdescription, price*)>
<!ATTLIST hotel id ID #REQUIRED
url CDATA #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT kategory (#PCDATA)>
<!ELEMENT address (zip, city,
street, number, telefon,
fax?, e-mail?)>
<!ELEMENT zip (#PCDATA)>
...
<!ELEMENT pathdescription (#PCDATA)>
<!ELEMENT price (#PCDATA | singleroom |
doubleroom | appartment)*>
<!ATTLIST price currency CDATA #REQUIRED>
<!ELEMENT singleroom (#PCDATA)>
...
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE hotel SYSTEM "hotel_dt.dtd">
<hotel id="id001" url="http://www.anwal.ma">
<name>Hotel Anwal</name>
<address>
<zip>53000</zip>
<city>Ifrane</city>
...
</address>
<pathdescription>The hotel Anwal is located in front of the City Hall, with a view of the Atlas mountains and high-quality service.</pathdescription>
<price currency="DH">
<singleroom>from 200,-</singleroom>
</price>
...
</hotel>
Example of complete DTD
<!ELEMENT publications (book | article | conference)*>
<!ELEMENT book (front, body, references)>
<!ATTLIST book isbn CDATA #REQUIRED>
<!ELEMENT front (title, author+, edition?, publisher)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (first, second, e-mail?)>
<!ELEMENT first (#PCDATA)>
...
<!ELEMENT edition (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT body (part+ | chapter+)>
<!ELEMENT part (ptitle, chapter+)>
<!ATTLIST part id ID #REQUIRED>
<!ELEMENT ptitle (#PCDATA)>
<!ELEMENT chapter (ctitle, section+)>
Graphical Presentation of DTD
Classification of XML Documents
Data-centric documents
– structured, regular
– Examples: product catalogs, orders, invoices
<order>
  <customer>Meyer</customer>
  <position>
    <isbn>1-234-56789-0</isbn>
    <number>2</number>
    <price currency="Euro">30.00</price>
  </position>
</order>
Document-centric documents
– unstructured, irregular
– Examples: scientific articles, books, emails, websites
<content>XML builds on the principles of two existing languages, <emph>HTML</emph> and <emph>SGML</emph> to create a simple mechanism .. The generalized markup concept ..</content>
Semistructured documents
– data- and document-centric
– Examples: publications, Amazon
<book>
  <author>Neil Bradley</author>
  <title>XML companion</title>
  <isbn>1-234-56789-0</isbn>
  <content>XML builds on the principles of two existing languages, <emph>HTML</emph> and .. </content>
</book>
Limitations of DTDs
DTDs impose order
No data types beyond simple #PCDATA or CDATA
Lack of range specification
From a DB point of view, we conclude that the data typing offered by DTDs is inadequate.
Recommendation
To get familiar with XML
– Create an XML document including elements and attributes
– Test that it is well-formed
– Create a DTD
– Test its validity
Use, e.g., XML Spy (www.xmlspy.com)