2 document instances and grammars

Upload: laurie

Post on 23-Jan-2016

Page 1: 2 Document Instances and Grammars

SDPL 2003 Notes 2: Document Instances and Grammars

1

2 Document Instances and Grammars

Foundations of hierarchical document structures, or Computer Scientist’s view of XML

2.1 XML and XML documents
2.2 Basics of document grammars
2.3 Basics of XML DTDs
2.4 XML Namespaces
2.5 XML Schemas

Page 2: 2 Document Instances and Grammars

2.1 XML and XML documents

XML - Extensible Markup Language, W3C Recommendation 10-Feb-1998
  – not an official standard, but a stable industry standard
  – Second Edition, 6-Oct-2000
    » an editorial revision, not a new version of XML 1.0
A simplified subset of SGML, Standard Generalized Markup Language, ISO 8879:1987
  – what is said later about valid XML documents applies to SGML documents, too

Page 3: 2 Document Instances and Grammars

What is XML?

Extensible Markup Language is not a markup language!
  – does not fix a tag set nor its semantics (as markup languages like HTML do)
XML is
  – a way to use markup to represent information
  – a metalanguage
    » supports definition of specific markup languages through XML DTDs or Schemas
    » e.g. XHTML, a reformulation of HTML using XML

Page 4: 2 Document Instances and Grammars

How does it look?

<?xml version='1.0' encoding="iso-8859-1" ?>
<invoice num="1234">
  <client clNum="00-01">
    <name>Pekka Kilpeläinen</name>
    <email>[email protected]</email>
  </client>
  <item price="60" unit="EUR">XML Handbook</item>
  <item price="350" unit="FIM">XSLT Programmer’s Ref</item>
</invoice>
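
As a rough illustration of how an application might read such a document, the following sketch parses an invoice with Python's standard `xml.etree.ElementTree` module. The helper name `summarize` is an assumption made for this example, and the optional email element is left out.

```python
import xml.etree.ElementTree as ET

INVOICE = """<?xml version='1.0' encoding='iso-8859-1'?>
<invoice num="1234">
  <client clNum="00-01">
    <name>Pekka Kilpeläinen</name>
  </client>
  <item price="60" unit="EUR">XML Handbook</item>
  <item price="350" unit="FIM">XSLT Programmer's Ref</item>
</invoice>
"""

# Encode to bytes so that the bytes really are in the declared encoding.
root = ET.fromstring(INVOICE.encode("iso-8859-1"))


def summarize(invoice):
    """Return (invoice number, client name, [(title, price, unit), ...])."""
    name = invoice.findtext("client/name")
    items = [
        (item.text, int(item.get("price")), item.get("unit"))
        for item in invoice.findall("item")
    ]
    return invoice.get("num"), name, items


print(summarize(root))
```

Attributes come back as strings; converting `price` to an integer is a choice of this sketch, not something the XML itself prescribes.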

Page 5: 2 Document Instances and Grammars

Essential Features of XML

We’ll have just an overview of the essentials of XML
  – many details skipped
    » some to be discussed in exercises or with other topics when the need arises
  – learn to consult the original sources (specifications, documentation etc.) for details
    » the XML spec is concise and readable

Page 6: 2 Document Instances and Grammars

XML document characters

XML documents are made of ISO-10646 (32-bit) characters; in practice of their 16-bit Unicode subset (used, e.g., in Java)
  – Unicode 2.0 defines almost 39000 distinct coded characters
Characters have three different aspects:
  – their identification as numeric code points
  – their representation by bytes
  – their visual presentation

Page 7: 2 Document Instances and Grammars

External Aspects of Characters

Documents are stored/transmitted as a sequence of bytes (of 8 bits). An encoding determines how characters are represented by bytes.
  – UTF-8 is the XML default encoding (it coincides with 7-bit ASCII for the ASCII characters)
  – for Latin-1 (the 8-bit Western European extension of ASCII) use encoding="iso-8859-1"
A font (collection of character images called glyphs) determines the visual presentation of characters
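
The difference between a character's code point and its byte representation can be seen directly in Python. This is a small sketch; the function name `aspects` is an assumption made for this example.

```python
def aspects(ch):
    """Return the code point of `ch` and its bytes under two encodings."""
    return {
        "code_point": ord(ch),                # numeric identification
        "utf-8": ch.encode("utf-8"),          # representation by bytes
        "iso-8859-1": ch.encode("iso-8859-1"),
    }


print(aspects("ä"))
print(aspects("A"))
```

The letter "ä" has one code point (U+00E4) but takes two bytes in UTF-8 and one byte in Latin-1, while an ASCII letter such as "A" is the same single byte in both.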

Page 8: 2 Document Instances and Grammars

XML Encoding of Structure 1

XML is, essentially, just a textual encoding scheme of labelled, ordered and attributed trees, in which
  – internal nodes are elements labelled by type names
  – leaves are text nodes labelled by string values, or empty element nodes
  – the left-to-right order of children of a node matters
  – element nodes may carry attributes (= name/string-value pairs)

Page 9: 2 Document Instances and Grammars

XML Encoding of Structure 2

XML encoding of a tree
  – corresponds to a pre-order walk
  – start of an element node with type name A denoted by a start tag <A>, and its end denoted by end tag </A>
  – possible attributes written within the start tag: <A attr_1="value_1" … attr_n="value_n">
  – text nodes written as their string value

Page 10: 2 Document Instances and Grammars

XML Encoding of Structure: Example

<S><W>Hello</W><E A='1'></E><W>world!</W></S>

[Figure: the corresponding tree. Root S has three children in left-to-right order: W with text child "Hello", the empty element E carrying the attribute A=1, and W with text child "world!".]
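
The pre-order encoding can be sketched in a few lines of Python. The `Node` class and the `serialize` function are names invented for this illustration, and escaping of markup characters in text and attribute values is deliberately omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Node:
    """An element node: type name, attributes, ordered children."""
    name: str
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List[Union["Node", str]] = field(default_factory=list)


def serialize(node):
    """Pre-order walk: start tag, then the children left to right, then end tag."""
    if isinstance(node, str):
        return node  # a text node is written as its string value
    attrs = "".join(f' {k}="{v}"' for k, v in node.attrs.items())
    body = "".join(serialize(child) for child in node.children)
    return f"<{node.name}{attrs}>{body}</{node.name}>"


tree = Node("S", children=[
    Node("W", children=["Hello"]),
    Node("E", {"A": "1"}),
    Node("W", children=["world!"]),
])
print(serialize(tree))
```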

Page 11: 2 Document Instances and Grammars

XML: Logical Document Structure

Elements
  – indicated by matching (case-sensitive!) tags <ElementTypeName> … </ElementTypeName>
  – can contain text and/or subelements
  – can be empty: <elem-type></elem-type> OR (e.g.) <br/>
  – unique root element, so the document is a single tree

Page 12: 2 Document Instances and Grammars

Logical document structure (2)

Attributes
  – name-value pairs attached to elements
  – "metadata", usually not treated as content
  – in start-tag after the element type name
      <div class="preface" date='990126'> …
Also:
  – <!-- comments outside other markup -->
  – <?note this would be passed to the application as a processing instruction named 'note'?>

Page 13: 2 Document Instances and Grammars

CDATA Sections

"CDATA Sections" to include XML markup characters as textual content

<![CDATA[
  Here we can easily include markup characters and, for example, code fragments:
  <example>if (Count < 5 && Count > 0) </example>
]]>
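
A quick way to see the effect of a CDATA section is to parse one. In this sketch (the wrapping element name `note` is an assumption for illustration) the markup characters arrive in the application as plain text, not as elements.

```python
import xml.etree.ElementTree as ET

DOC = "<note><![CDATA[<example>if (Count < 5 && Count > 0) </example>]]></note>"

note = ET.fromstring(DOC)
print(note.text)   # the characters between <![CDATA[ and ]]>, unchanged
print(len(note))   # number of child elements of <note>
```

The same characters written outside a CDATA section would have to be escaped, since a bare `<` or `&` in content is not well-formed.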

Page 14: 2 Document Instances and Grammars

Two levels of correctness

Well-formed documents
  – roughly: follows the syntax of XML, markup correct (elements properly nested, tag names match, attributes of an element have unique names, ...)
  – violation is a fatal error
Valid documents
  – (in addition to being well-formed) obey an associated grammar (DTD/Schema)
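
Well-formedness can be tested with any XML parser, because a violation is a fatal error. A minimal sketch using Python's standard library follows; the function name `is_well_formed` is an assumption for this example. Validity against a DTD is a separate check that this parser does not perform.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if `text` parses as XML, False on a fatal syntax error."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


for doc in ["<a><b></b></a>",       # properly nested
            "<a><b></a></b>",       # improper nesting
            "<a></A>",              # tag names are case-sensitive
            '<a x="1" x="2"/>']:    # duplicate attribute name
    print(doc, is_well_formed(doc))
```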

Page 15: 2 Document Instances and Grammars

An XML Processor (Parser)

Reads XML documents and reports their contents to an application
  – relieves the application from details of markup
  – XML Rec specifies partly the behaviour of XML processors:
    » recognition of characters as markup or data; what information to pass to applications; how to check the correctness of documents
  – validation based on comparing document against its grammar
Next: Basics of document grammars

Page 16: 2 Document Instances and Grammars

2.2 Basics of document grammars

Regular expressions describe regular languages:
  – relatively simple sets of strings over a finite alphabet of
    » characters, events, document elements, …
Many uses:
  – text searching patterns (e.g., emacs, ed, awk, grep, Perl)
  – as a part of grammatical formalisms (for programming languages, XML, in XML DTDs and schemas, …)
Relevance for document structures: what kind of structural content is allowed within various document components

Page 17: 2 Document Instances and Grammars

Regular Expressions: Syntax

A regular expression over an alphabet $\Sigma$ is either
  – $\emptyset$, an empty set,
  – $\lambda$ lambda (sometimes $\varepsilon$ epsilon),
  – $a$, any alphabet symbol $a \in \Sigma$,
  – $(R \mid S)$ choice,
  – $(R\,S)$ concatenation, or
  – $(R)^*$ Kleene closure or iteration,
where $R$ and $S$ are regular expressions
N.B: different syntaxes exist but the idea is the same

Page 18: 2 Document Instances and Grammars

Regular Expressions: Grouping

Conventions to simplify operator expressions by eliminating extra parentheses:
  – outermost parentheses may be eliminated:
    » $(E) = E$
  – binary operations are associative:
    » $(A \mid (B \mid C)) = ((A \mid B) \mid C) = (A \mid B \mid C)$
    » $(A\,(B\,C)) = ((A\,B)\,C) = (A\,B\,C)$
  – operations have priorities:
    » iteration first, concatenation next, choice last
    » for example, $(A \mid (B\,(C)^*)) = A \mid B\,C^*$

Page 19: 2 Document Instances and Grammars

Regular Expressions: Semantics

Regular expression $E$ denotes a language (set of strings) $L(E)$, defined inductively as follows:
  – $L(\emptyset) = \{\}$ (empty set)
  – $L(\lambda) = \{\lambda\}$ (singleton set of empty string $\lambda$ = '')
  – $L(a) = \{a\}$ ($a \in \Sigma$; singleton set of word "a")
  – $L((R \mid S)) = L(R) \cup L(S) = \{w \mid w \in L(R) \text{ or } w \in L(S)\}$
  – $L((R\,S)) = L(R)\,L(S) = \{xy \mid x \in L(R) \text{ and } y \in L(S)\}$
  – $L(R^*) = L(R)^* = \{x_1 \cdots x_n \mid x_k \in L(R),\ k = 1, \ldots, n;\ n \ge 0\}$

Page 20: 2 Document Instances and Grammars

Regular Expressions: Examples

$L(A \mid B\,(C\,D)^*)$ = ?
  $= L(A) \cup L(B\,(C\,D)^*)$
  $= \{A\} \cup L(B)\,L((C\,D)^*)$
  $= \{A\} \cup \{B\}\,L(C\,D)^*$
  $= \{A\} \cup \{B\}\,\{CD\}^*$
  $= \{A\} \cup \{B\}\,\{\lambda, CD, CDCD, CDCDCD, \ldots\}$
  $= \{A, B, BCD, BCDCD, BCDCDCD, \ldots\}$
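
The membership question for this language can be tried out with Python's `re` module, whose operator priorities (iteration, then concatenation, then choice) follow the same convention. This is only a sketch; the name `in_language` is an assumption for this example, and each alphabet symbol is written as a single character.

```python
import re

# A | B (C D)*  written in Python regex syntax
PATTERN = re.compile(r"A|B(CD)*")


def in_language(word):
    """True if the whole word belongs to L(A | B (C D)*)."""
    return PATTERN.fullmatch(word) is not None


for word in ["A", "B", "BCD", "BCDCD", "", "AB", "CD", "BC"]:
    print(repr(word), in_language(word))
```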

Page 21: 2 Document Instances and Grammars

Regular Expressions: Examples

Simplified top-level structure of a document:
  – $\Sigma = \{\mathit{title}, \mathit{auth}, \mathit{date}, \mathit{sect}\}$
  – title followed by an optional list of authors, followed by an optional date, followed by one or more sections:
      $\mathit{title}\ \mathit{auth}^*\ (\mathit{date} \mid \lambda)\ \mathit{sect}\ \mathit{sect}^*$
Commonly used abbreviations:
  – $E? = (E \mid \lambda)$; $E^+ = E\,E^*$
  – the above more compactly:
      $\mathit{title}\ \mathit{auth}^*\ \mathit{date}?\ \mathit{sect}^+$
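
Here the alphabet consists of element names, not characters. One simple way to test such an expression in Python is to join the names with spaces and match the resulting string; the helper name `matches_model` and the space-separated encoding are assumptions of this sketch.

```python
import re

# title auth* date? sect+  over space-separated element names
MODEL = re.compile(r"title( auth)*( date)?( sect)+")


def matches_model(names):
    """True if the sequence of element names is in the language of MODEL."""
    return MODEL.fullmatch(" ".join(names)) is not None


print(matches_model(["title", "auth", "auth", "date", "sect", "sect"]))
print(matches_model(["title", "sect"]))
print(matches_model(["title", "date", "auth", "sect"]))
```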

Page 22: 2 Document Instances and Grammars

Context-Free Grammars (CFGs)

Used widely for syntax specification (programming languages, XML, …) and for parser/compiler generation (e.g. YACC/GNU Bison)
A CFG $G$ is formally a 4-tuple $(V, \Sigma, P, S)$
  – $V$ is the alphabet of the grammar $G$
  – $\Sigma \subseteq V$ is a set of terminal symbols;
    » $N = V \setminus \Sigma$ is a set of nonterminal symbols
  – $P$ is a set of productions
  – $S \in N$ is the start symbol

Page 23: 2 Document Instances and Grammars

Productions and Derivations

Productions: $A \to \alpha$, where $A \in N$, $\alpha \in V^*$
  – E.g.: $A \to aBa$ (production 1)
Let $\gamma, \delta \in V^*$. String $\gamma$ derives $\delta$ directly, $\gamma \Rightarrow \delta$, if
  – $\gamma = \alpha A \beta$ and $\delta = \alpha \omega \beta$ for some $\alpha, \beta, \omega \in V^*$ and $A \to \omega \in P$
  – E.g. $AA \Rightarrow AaBa$ (assuming production 1 above)
NB: CFGs are often given simply by listing the productions (P)

Page 24: 2 Document Instances and Grammars

Language Generated by a CFG

$\gamma$ derives $\delta$, $\gamma \Rightarrow^* \delta$, if there’s a sequence of (0 or more) direct derivations that transforms $\gamma$ to $\delta$
The language generated by a CFG $G$:
  – $L(G) = \{w \in \Sigma^* \mid S \Rightarrow^* w\}$
NB: $L(G)$ is a set of strings;
  – to model document structures, we consider syntax trees
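
Direct derivation and derivation can be sketched as string rewriting. Strings of grammar symbols are represented as tuples here; the function names and the second production B -> b are hypothetical additions for this example, and only A -> aBa (production 1) comes from the notes.

```python
def derive_directly(string, position, production):
    """Rewrite the nonterminal at `position` using `production` (lhs, rhs)."""
    lhs, rhs = production
    if string[position] != lhs:
        raise ValueError(f"symbol at {position} is not {lhs!r}")
    return string[:position] + rhs + string[position + 1:]


def derive(string, steps):
    """Apply a sequence of (position, production) steps; zero steps is allowed."""
    for position, production in steps:
        string = derive_directly(string, position, production)
    return string


P1 = ("A", ("a", "B", "a"))   # production 1: A -> aBa
P2 = ("B", ("b",))            # hypothetical production: B -> b

print(derive_directly(("A", "A"), 1, P1))   # one direct derivation step
print(derive(("A",), [(0, P1), (1, P2)]))   # a derivation of two steps
```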

Page 25: 2 Document Instances and Grammars

Syntax Trees

Also called parse trees or derivation trees
Ordered trees
  – consist of nodes that may have child nodes which are ordered left-to-right
Nodes labelled by symbols of $V$:
  – internal nodes by nonterminals, root by start symbol
  – leaves by terminal symbols (or empty string $\lambda$)
A node with label $A$ can have children labelled by $X_1, \ldots, X_k$ only if $A \to X_1 \cdots X_k \in P$

Page 26: 2 Document Instances and Grammars

Syntax Trees: Example

CFG for simplified arithmetic expressions:
  $V = \{E, +, *, (, ), I\}$; $\Sigma = \{+, *, (, ), I\}$; $N = \{E\}$; $S = E$
  ($I$ stands for an arbitrary integer)
  $P = \{\, E \to E+E,\ E \to E*E,\ E \to I,\ E \to (E) \,\}$
Syntax tree for 2*(3+4)?

Page 27: 2 Document Instances and Grammars

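Syntax Trees: Example

[Figure: syntax tree for 2 * ( 3 + 4 ). The root E is expanded by $E \to E*E$. Its left child E derives I (the integer 2) by $E \to I$. Its right child E is expanded by $E \to (E)$; the inner E is expanded by $E \to E+E$, and each operand E derives I (the integers 3 and 4) by $E \to I$.]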
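
A syntax tree of this shape can be built by a small recursive-descent parser. The grammar of the notes is ambiguous, so this sketch assumes the usual priorities (`*` binds tighter than `+`, both left-associative); that choice, the nested-tuple tree format and all function names are assumptions of the example. Malformed input simply raises an exception.

```python
import re


def parse(text):
    """Return a syntax tree as nested tuples ("E", child, ...)."""
    tokens = re.findall(r"\d+|[+*()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():                      # E -> I  or  E -> ( E )
        tok = take()
        if tok == "(":
            inner = expr()
            if take() != ")":
                raise SyntaxError("expected ')'")
            return ("E", "(", inner, ")")
        return ("E", ("I", int(tok)))

    def term():                        # E -> E * E
        node = factor()
        while peek() == "*":
            take()
            node = ("E", node, "*", factor())
        return node

    def expr():                        # E -> E + E
        node = term()
        while peek() == "+":
            take()
            node = ("E", node, "+", term())
        return node

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return tree


def leaves(tree):
    """Left-to-right leaves of the tree (integers stand for I)."""
    if not isinstance(tree, tuple):
        return [tree]
    if tree[0] == "I":
        return [tree[1]]
    return [leaf for child in tree[1:] for leaf in leaves(child)]


print(parse("2*(3+4)"))
```

Reading the leaves from left to right gives back the input, which is the sense in which the tree is a syntax tree for the string.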

Page 28: 2 Document Instances and Grammars

CFGs for Document Structures

Nonterminals represent document elements
  – E.g. model for items (Ref) of a bibliography list:
      $\mathit{Ref} \to \mathit{AuthorList}\ \mathit{Title}\ \mathit{PublData}$
      $\mathit{AuthorList} \to \mathit{Author}\ \mathit{AuthorList}$
      $\mathit{AuthorList} \to \lambda$
Notice:
  – right-hand side of a CFG production is a fixed string of grammar symbols
  – repetition simulated using recursion
    » e.g. AuthorList above

Page 29: 2 Document Instances and Grammars

Example: List of Three Authors

[Figure: CFG syntax tree. Root Ref has children AuthorList, Title and PublData (the subtrees of Title and PublData are elided as ". . ."). The AuthorList nodes nest four levels deep: each of the first three has an Author child (Aho, Hopcroft, Ullman in turn) followed by a further AuthorList, and the fourth AuthorList is empty.]

Page 30: 2 Document Instances and Grammars

Problems

"Auxiliary nonterminals" (like AuthorList) obscure the model
  – the last Author is several levels apart from its intuitive parent element Ref
  – awkward to access and to count the Authors of a reference
  – avoided by extended context-free grammars

Page 31: 2 Document Instances and Grammars

Extended CFGs (ECFGs)

Like CFGs, but right-hand sides of productions are regular expressions over $V$
  – E.g.: $\mathit{Ref} \to \mathit{Author}^*\ \mathit{Title}\ \mathit{PublData}$
Let $\gamma, \delta \in V^*$. String $\gamma$ derives $\delta$ directly, $\gamma \Rightarrow \delta$, if
  – $\gamma = \alpha A \beta$ and $\delta = \alpha \omega \beta$ for some $\alpha, \beta, \omega \in V^*$ and $A \to E \in P$ such that $\omega \in L(E)$
  – E.g. $\mathit{Ref} \Rightarrow \mathit{Author}\ \mathit{Author}\ \mathit{Author}\ \mathit{Title}\ \mathit{PublData}$, since $\mathit{Author}\ \mathit{Author}\ \mathit{Author}\ \mathit{Title}\ \mathit{PublData} \in L(\mathit{Author}^*\ \mathit{Title}\ \mathit{PublData})$

Page 32: 2 Document Instances and Grammars

Language Generated by an ECFG

$L(G)$ defined similarly to CFGs:
  – $\gamma$ derives $\delta$, $\gamma \Rightarrow^* \delta$, if $\gamma \Rightarrow^n \delta$ (for some $n \ge 0$)
  – $L(G) = \{w \in \Sigma^* \mid S \Rightarrow^* w\}$
Theorem: Extended and ordinary CFGs generate the same (string) languages.
  – But syntax trees of ECFGs and CFGs differ! (Next)

Page 33: 2 Document Instances and Grammars

Syntax Trees of an ECFG

Similar to parse trees of an ordinary CFG, except that
  – a node with label $A$ can have children labelled $X_1, \ldots, X_k$ when $A \to E \in P$ such that $X_1 \cdots X_k \in L(E)$
  – an internal node may have arbitrarily many children (e.g., Authors below a Ref node)

Page 34: 2 Document Instances and Grammars

Example: Three Authors of a Ref

[Figure: ECFG syntax tree. Root Ref has five children in order: Author (Aho), Author (Hopcroft), Author (Ullman), Title (The Design and Analysis ...) and PublData (. . .).]

$\mathit{Ref} \to \mathit{Author}^*\ \mathit{Title}\ \mathit{PublData} \in P$, and
$\mathit{Author}\ \mathit{Author}\ \mathit{Author}\ \mathit{Title}\ \mathit{PublData} \in L(\mathit{Author}^*\ \mathit{Title}\ \mathit{PublData})$
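
The difference between the two kinds of syntax tree can be made concrete by measuring how deep the last author sits. In this sketch trees are nested tuples `(label, child, ...)`; the representation and the helper names `cfg_author_list` and `depth_of` are assumptions for illustration.

```python
def cfg_author_list(authors):
    """CFG tree for AuthorList -> Author AuthorList | (empty)."""
    if not authors:
        return ("AuthorList",)
    return ("AuthorList", ("Author", authors[0]), cfg_author_list(authors[1:]))


def depth_of(tree, author, depth=0):
    """Depth (root = 0) of the Author node for `author`, or None if absent."""
    if tree[0] == "Author" and tree[1] == author:
        return depth
    for child in tree[1:]:
        if isinstance(child, tuple):
            found = depth_of(child, author, depth + 1)
            if found is not None:
                return found
    return None


AUTHORS = ["Aho", "Hopcroft", "Ullman"]
TITLE = ("Title", "The Design and Analysis ...")
PUBL = ("PublData", "...")

cfg_tree = ("Ref", cfg_author_list(AUTHORS), TITLE, PUBL)
ecfg_tree = ("Ref", *[("Author", a) for a in AUTHORS], TITLE, PUBL)

print(depth_of(cfg_tree, "Ullman"))    # several levels below Ref
print(depth_of(ecfg_tree, "Ullman"))   # a direct child of Ref
```

In the ECFG tree, counting the authors of a reference is just a matter of counting the Author children of Ref.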

Page 35: 2 Document Instances and Grammars

Terminal Symbols in Practice

In (extended) CFGs:
  – leaves of parse trees are labelled by single terminal symbols ($\in \Sigma$)
Too granular for practice; instead terminal symbols which stand for all values of a type
  – XML DTDs: #PCDATA for variable length string content
  – in XML Schema definition language:
    » string, byte, integer, boolean, date, ...
  – explicit string literals rare in document grammars

Page 36: 2 Document Instances and Grammars

2.3 Basics of XML DTDs

A Document Type Declaration provides a grammar (document type definition, DTD) for a class of documents [Defined in XML Rec]
Syntax (in the prolog of a document instance):
    <!DOCTYPE rootElemType SYSTEM "ex.dtd"
      <!-- "external subset" in file ex.dtd -->
      [ <!-- "internal subset" may come here -->
      ]>
DTD is the union of the external and internal subset; internal subset has higher precedence
  – can override entity and attribute declarations (see next)

Page 37: 2 Document Instances and Grammars

Markup Declarations

DTD consists of markup declarations
  – element type declarations
    » similar to productions of ECFGs
  – attribute-list declarations
    » for declared element types
  – entity declarations (see later)
  – notation declarations
    » to pass information about external (binary) objects to the application

Page 38: 2 Document Instances and Grammars

What Do Declarations Look Like?

<!ELEMENT invoice (client, item+)>
<!ATTLIST invoice num NMTOKEN #REQUIRED>
<!ELEMENT client (name, email?)>
<!ATTLIST client clNum NMTOKEN #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT item (#PCDATA)>
<!ATTLIST item
    price NMTOKEN #REQUIRED
    unit (FIM | EUR) "EUR" >

Page 39: 2 Document Instances and Grammars

Element type declarations

The general form is
    <!ELEMENT elementTypeName (E)>
where E is a content model
  – a regular expression of element names
Content model operators:
    E | F : alternation     E, F : concatenation
    E?    : optional        E*   : zero or more
    E+    : one or more     (E)  : grouping
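
Because a content model is a regular expression over element names, it can be translated quite mechanically into an ordinary regex. The following sketch does so for element-only content models; it does not handle #PCDATA, EMPTY or ANY, and the function names are assumptions made for this example.

```python
import re


def content_model_to_regex(model):
    """Translate a model such as '(client, item+)' into a compiled regex.

    Each child element name is matched as the name followed by one space.
    """
    parts = []
    for tok in re.findall(r"[A-Za-z_][\w.\-]*|[()|,?*+]", model):
        if tok == ",":
            continue                     # concatenation is implicit in a regex
        elif tok == "(":
            parts.append("(?:")
        elif tok in ")|?*+":
            parts.append(tok)
        else:
            parts.append("(?:" + re.escape(tok) + " )")
    return re.compile("".join(parts))


def children_match(model, names):
    """True if the sequence of child element names satisfies the model."""
    encoded = "".join(name + " " for name in names)
    return content_model_to_regex(model).fullmatch(encoded) is not None


print(children_match("(client, item+)", ["client", "item", "item"]))
print(children_match("(client, item+)", ["client"]))
print(children_match("(name, email?)", ["name"]))
```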

Page 40: 2 Document Instances and Grammars

Attribute-List Declarations

Can declare attributes for elements:
  – name, data type and possible default value
Example:
    <!ATTLIST FIG
        id    ID          #IMPLIED
        descr CDATA       #REQUIRED
        class (a | b | c) "a">
Semantics mainly up to the application
  – processor checks that ID attributes are unique and that targets of IDREF attributes exist
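
The two checks mentioned for ID and IDREF attributes can be sketched as follows. A real validating processor learns the attribute types from the DTD; this sketch instead takes the attribute names as parameters, and the names `check_ids`, `ref` and `REF` are hypothetical.

```python
import xml.etree.ElementTree as ET


def check_ids(root, id_attr="id", idref_attr="ref"):
    """Return a list of problems: duplicate IDs and IDREFs without a target."""
    seen, problems = set(), []
    for elem in root.iter():
        value = elem.get(id_attr)
        if value is not None:
            if value in seen:
                problems.append(f"duplicate ID {value!r}")
            seen.add(value)
    for elem in root.iter():
        target = elem.get(idref_attr)
        if target is not None and target not in seen:
            problems.append(f"dangling IDREF {target!r}")
    return problems


doc = ET.fromstring(
    '<DOC><FIG id="f1" descr="a"/><FIG id="f1" descr="b"/><REF ref="f2"/></DOC>'
)
print(check_ids(doc))
```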

Page 41: 2 Document Instances and Grammars

Mixed, Empty and Arbitrary Content

Mixed content:
    <!ELEMENT P (#PCDATA | I | IMG)*>
  – may contain text (#PCDATA) and elements
Empty content:
    <!ELEMENT IMG EMPTY>
Arbitrary content:
    <!ELEMENT HTML ANY>
    (= <!ELEMENT HTML (#PCDATA | choice-of-all-declared-element-types)*>)

Page 42: 2 Document Instances and Grammars

Entities (1)

Physical storage units or named fragments of XML documents
Multiple uses:
  – character entities:
    » &lt; &#60; and &#x3C; all expand to '<' (treated as data, not as start-of-markup)
    » other predefined entities: &amp; &gt; &apos; &quot; expand to &, >, ' and "
  – general entities are shorthand notations:
      <!ENTITY UKU "University of Kuopio">
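
Entity expansion is done by the XML processor, so the application sees only the replacement text. A small sketch with Python's standard parser, which expands general entities declared in the internal DTD subset; the element name `p` and the function name `expanded_text` are assumptions for this example.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE p [
  <!ENTITY UKU "University of Kuopio">
]>
<p>&UKU;: 1 &lt; 2, 1 &#60; 2, 1 &#x3C; 2, &amp; &gt; &apos; &quot;</p>"""


def expanded_text(document):
    """Parse the document and return the text content of its root element."""
    return ET.fromstring(document).text


print(expanded_text(DOC))
```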

Page 43: 2 Document Instances and Grammars

Entities (2)

Physical storage units comprising a document
  – parsed entities
      <!ENTITY chap1 SYSTEM "http://myweb/ch1">
  – document entity is the starting point of processing
  – entities and elements must nest properly:
      <!DOCTYPE doc [
        <!ENTITY chap1 (… as above …) > ]>
      <doc>
        &chap1;
      </doc>
    where the entity chap1 contains
      <sec num="1"> … </sec>
      <sec num="2"> … </sec>

Page 44: 2 Document Instances and Grammars

Unparsed Entities

For connecting external binary objects to XML documents (an XML processor handles only text)
Declarations:
    <!NOTATION TIFF SYSTEM "bin/xv">
    <!ENTITY fig123 SYSTEM "figs/f123.tif" NDATA TIFF>
    <!ATTLIST IMG file ENTITY #REQUIRED>
Usage:
    <IMG file="fig123"/>
  – application receives information about the notation

Page 45: 2 Document Instances and Grammars

SDPL 2003 Notes 2: Document Instances and Grammars

45

Parameter entitiesParameter entities

Way to parameterize and modularize DTDs

<!ENTITY % table-dtd SYSTEM "dtds/tab.dtd">
%table-dtd; <!-- include external sub-dtd -->

<!ENTITY % stAttr "status (draft | ready) 'draft'">
<!ATTLIST CHAP %stAttr; >
<!ATTLIST SECT %stAttr; >

(The latter, parameter entities as a part of a markup declaration, is allowed only in the external, and not in the internal DTD subset)
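The effect of a parameter entity reference inside a declaration can be pictured as textual substitution. The following is a toy illustration only, not a DTD parser: the function name `expand_parameter_entities` is hypothetical, and a real XML processor applies further rules (for example, padding the replacement text with spaces).

```python
import re

def expand_parameter_entities(dtd_text, entities):
    """Replace every %name; reference by its replacement text (toy version)."""
    def substitute(match):
        return entities[match.group(1)]
    return re.sub(r"%([A-Za-z_][\w.-]*);", substitute, dtd_text)

# Replacement text of the stAttr entity from the slide.
entities = {"stAttr": "status (draft | ready) 'draft'"}

dtd = "<!ATTLIST CHAP %stAttr; >\n<!ATTLIST SECT %stAttr; >"
expanded = expand_parameter_entities(dtd, entities)
print(expanded)
```

Both attribute list declarations end up with the same `status` attribute, which is the point of sharing the definition.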


Speculations about XML Parsing

Parsing involves two things:
1. Pulling the entities together, and checking the syntactic correctness of the input
2. Building a parse tree for the input (à la DOM), or otherwise informing the application about document content and structure (e.g. à la SAX)

Task 2 is simple (thanks to the simplicity of XML markup; see next)

Slightly more difficult to implement are
– checking the well-formedness
– checking the validity w.r.t. the DTD (or a Schema)
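The two ways of delivering the parse result can be compared with Python's standard library, as a rough sketch: `xml.dom.minidom` builds a tree, `xml.sax` reports events, and a failed parse signals that the input is not well-formed. The sample document and the helper names `EventLogger` and `is_well_formed` are assumptions for illustration.

```python
import xml.dom.minidom
import xml.sax
from xml.parsers.expat import ExpatError

DOC = "<S><W>Hello</W><E/><W>world!</W></S>"

# DOM style: the whole document becomes a tree the application can walk.
dom = xml.dom.minidom.parseString(DOC)
child_names = [child.tagName for child in dom.documentElement.childNodes]

# SAX style: the application is told about structure as a stream of events.
class EventLogger(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("text", content))

logger = EventLogger()
xml.sax.parseString(DOC.encode("utf-8"), logger)

def is_well_formed(text):
    """Well-formedness check: the parser either accepts the input or raises."""
    try:
        xml.dom.minidom.parseString(text)
        return True
    except ExpatError:
        return False

print(child_names)
print(logger.events)
print(is_well_formed("<S><W>Hello</S>"))
```

Neither parser here checks validity against a DTD; both only check well-formedness.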


Building an XML Parse Tree

<S><W>Hello</W><E></E><W>world!</W></S>

(figure: the parse tree has root S with children W, E and W; the first W contains the text "Hello", the second W the text "world!")
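Why tree building is simple can be seen from a stack-based sketch: a start tag pushes a node, an end tag pops it, and text is attached to the node on top of the stack. This is a deliberately reduced illustration that assumes tags without attributes and no entities, comments, or DTD; `build_tree` and the tuple representation are hypothetical.

```python
import re

# A start tag, end tag, or empty-element tag without attributes, or a run of text.
TOKEN = re.compile(r"<(/?)([A-Za-z][\w.-]*)\s*(/?)>|([^<]+)")

def build_tree(text):
    """Return nested (name, children) tuples; children are trees or strings."""
    root = ("#document", [])
    stack = [root]          # currently open elements, innermost last
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        closing, name, empty, chars = match.groups()
        if chars is not None:
            stack[-1][1].append(chars)          # text belongs to the open element
        elif closing:
            if len(stack) == 1 or stack[-1][0] != name:
                raise ValueError(f"unexpected end tag </{name}>")
            stack.pop()                          # element is complete
        else:
            node = (name, [])
            stack[-1][1].append(node)
            if not empty:
                stack.append(node)               # element stays open
        pos = match.end()
    if len(stack) != 1 or not root[1]:
        raise ValueError("unclosed element or empty input")
    return root[1][0]

tree = build_tree("<S><W>Hello</W><E></E><W>world!</W></S>")
print(tree)
```

The stack mirrors the path from the root to the current position, so mismatched or missing end tags are detected as a side effect.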


Sketching the validation of XML

Treat the document as a tree d
Document is valid w.r.t. a grammar (DTD/Schema) G iff d is a syntax tree over G
How to check this?
– Check that the root is labelled by start symbol S of G
– For each element node n of the tree, check that
» the attributes of n match the attributes declared for the type of n
» the content of n matches the content model for the type of n.
That is, if n is of type A and its children are of types B1, …, Bn, check that
the word B1 … Bn ∈ L(E)   (1)
for some production A → E of G.
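Condition (1) can be sketched with regular expressions standing in for content models. This is an illustration under stated assumptions: the grammar `GRAMMAR` with start symbol `A` is hypothetical, content models are written as regular expressions over child type names (each name followed by a space), and attributes and text content are not checked.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical grammar G: one production A -> E per element type.
GRAMMAR = {
    "A": r"B (C )*D ",   # A -> B, C*, D
    "B": r"",            # B, C and D have empty content
    "C": r"",
    "D": r"",
}
START = "A"

def node_ok(node):
    """Check that the word of child types B1 ... Bn is in L(E) for the node's type."""
    model = GRAMMAR.get(node.tag)
    if model is None:
        return False                      # undeclared element type
    word = "".join(child.tag + " " for child in node)
    return re.fullmatch(model, word) is not None

def is_valid(root):
    """The root must carry the start symbol and every node must satisfy (1)."""
    return root.tag == START and all(node_ok(n) for n in root.iter())

print(is_valid(ET.fromstring("<A><B/><C/><C/><D/></A>")))
print(is_valid(ET.fromstring("<A><D/><B/></A>")))
```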


Sketching the validation… (2)

How to check condition (1), the matching of children with a content model?
– by an automaton built from content model E

Normally, validation proceeds in document order (= depth-first order of tree). For example, to validate the content of

<A><B>…</B><C>...</C><D/></A>

the content of B is checked before continuing to verify that BCD matches the content model for A
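Validation in document order can be sketched with one automaton state per open element, kept on a stack. The DFA tables below are hand-built for a hypothetical grammar (A → B C D, with B, C and D empty) rather than derived from a real DTD, and `validate_stream` is an illustrative name. The returned list shows the order in which element contents were completed.

```python
import io
import xml.etree.ElementTree as ET

# Per element type: (transition table, accepting states); the start state is 0.
DFAS = {
    "A": ({(0, "B"): 1, (1, "C"): 2, (2, "D"): 3}, {3}),
    "B": ({}, {0}),
    "C": ({}, {0}),
    "D": ({}, {0}),
}
START = "A"

def validate_stream(xml_text):
    """Validate in document order; return element types in order of completion."""
    stack = []       # one [element type, current DFA state] pair per open element
    completed = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if not stack and elem.tag != START:
                raise ValueError(f"root must be <{START}>")
            if stack:
                parent = stack[-1]
                transitions, _ = DFAS[parent[0]]
                key = (parent[1], elem.tag)
                if key not in transitions:
                    raise ValueError(f"<{elem.tag}> not allowed here in <{parent[0]}>")
                parent[1] = transitions[key]     # advance the parent's automaton
            stack.append([elem.tag, 0])
        else:
            name, state = stack.pop()
            if state not in DFAS[name][1]:
                raise ValueError(f"content of <{name}> is incomplete")
            completed.append(name)
    return completed

print(validate_stream("<A><B/><C/><D/></A>"))
```

The content of B is finished before the automaton for A moves on to C, and A itself is accepted last, when its end tag is seen.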


2.4 XML Namespaces

Sometimes needed to combine document parts processed by different applications (and/or defined in different grammars) in a single document instance
– for example, in XSLT scripts:

<xsl:template match="doc/title">
<H1>
<xsl:apply-templates />
</H1>
</xsl:template>

(figure: H1 is an HTML element; xsl:template and xsl:apply-templates are XSLT elements/instructions)

– How to manage multiple sets of names?


XML Namespaces (2/5)

Solution: XML Namespaces, W3C Rec. 14/1/1999, for separating possibly overlapping "vocabularies" (sets of element type and attribute names) within a single document
– separation by using a local name prefix, which is bound to a globally unique URI
» For example, the local prefix "xsl:" conventionally used in XSLT scripts


XML Namespaces briefly (3/5)

Namespace identified by a URI (through the associated local prefix), e.g. http://www.w3.org/1999/XSL/Transform for XSLT
– conventional but not required to use URLs
– the identifying URI has to be unique, but it does not have to be an existing address

Association is inherited by sub-elements
– see the next example (of an XSLT script)


XML Namespaces (4/5)

<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns="http://www.w3.org/TR/xhtml1/strict">
<!-- XHTML is the 'default namespace' -->

<xsl:template match="doc/title">
<H1>
<xsl:apply-templates />
</H1>
</xsl:template>

</xsl:stylesheet>
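How a namespace-aware parser resolves the prefixes can be checked with Python's `xml.etree.ElementTree`, which reports each element name as `{URI}local-name`. A small sketch using the stylesheet above (without the comment); the variable names are arbitrary.

```python
import xml.etree.ElementTree as ET

STYLESHEET = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/TR/xhtml1/strict">
  <xsl:template match="doc/title">
    <H1><xsl:apply-templates/></H1>
  </xsl:template>
</xsl:stylesheet>"""

root = ET.fromstring(STYLESHEET)

# Element names in document order, with prefixes replaced by namespace URIs.
names = [elem.tag for elem in root.iter()]
for name in names:
    print(name)
```

The unprefixed `H1` picks up the default namespace declared on the root element, which illustrates that the association is inherited by sub-elements.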


XML Namespaces briefly (5/5)

Namespaces built on top of basic XML features
– uses attribute syntax (xmlns:) to introduce namespaces
– does not affect validation
» namespace attributes have to be declared for DTD-validity
» all element type names have to be declared in the DTD (with their initial prefixes!)
=> Namespaces and DTD-validation do not mix well