The idea
A language consists of terminals: a, b, c
A set of productions, each beginning with a non-terminal: A, B, C
Productions are rules specifying how to generate sequences of terminals
Grammar
Can be used to parse a language efficiently
Basis of all modern programming-language parsing since Algol-60
The Java Language Specification gives its syntax entirely as an EBNF-style grammar
Grammar
XML: grammar-based syntax, adheres to EBNF
SGML: had a more complex language definition syntax
HTML is defined the SGML way
Regular expressions
Language for expressing patterns
Basic components
pattern elements
optional element = ?
repetition (1 or more) = +
repetition (0 or more) = *
choice = |
grouping = ( )
sequence = ,
Note
Regular expressions differ between applications: Perl, JavaScript, XML Schemas
DTDs only support ? + * | , ( )
EBNF
EBNF is a more compact version of BNF
It uses regular-expression operators to simplify grammar expression
A → aB and A → aBA turn into
A → aB(A)?
Only one production per non-terminal is allowed
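As a rough sketch of how a single EBNF production can drive a parser, the recognizer below follows A → aB(A)? directly. The slides give no production for B, so B → b is assumed here purely for illustration, and the function names are hypothetical.

```python
def parse_B(s, i):
    # Assumption for illustration only: B -> b
    return i + 1 if i < len(s) and s[i] == "b" else None


def parse_A(s, i=0):
    """Recognize A -> a B (A)? starting at index i.

    Returns the index just past the recognized text, or None on failure.
    """
    if i >= len(s) or s[i] != "a":
        return None
    j = parse_B(s, i + 1)
    if j is None:
        return None
    k = parse_A(s, j)  # the optional trailing (A)?
    return k if k is not None else j


def accepts(s):
    # The whole input must be consumed by a single A.
    return parse_A(s) == len(s)
```

Because the optional part is written once as (A)?, the recognizer needs only one function per non-terminal, which mirrors the "one production per non-terminal" rule.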
DTDs
Use EBNF to specify the structure of XML documents
Plus attributes and entities
Syntax is a holdover from SGML: ugly
DTD Syntax
<!ELEMENT element-name content_model>
Content model contains the RHS of the production rule
Example
<!ELEMENT name (firstName, lastName)>
Simple content models
#PCDATA: content can be any text
ANY: content can be anything at all (useful for debugging)
EMPTY: element has no content
Example
<grades>
  <grade>
    <student>Jane Doe</student>
    <assigned-grade>A</assigned-grade>
  </grade>
  <grade>
    <student>John Doe</student>
    <assigned-grade>A-</assigned-grade>
  </grade>
</grades>
Example
<grades>
  <grade>
    <student>Jane Doe</student>
    <assigned-grade>A</assigned-grade>
  </grade>
  <grade>
    <student>John Doe</student>
    <assigned-grade>A-</assigned-grade>
  </grade>
  <grade>
    <student>Wayne Doe</student>
    <assigned-grade>I</assigned-grade>
    <reason>Alien abduction</reason>
  </grade>
</grades>
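The two grade examples differ only in the extra reason element, which suggests a content model with an optional element. The sketch below assumes the model (student, assigned-grade, reason?), which is not given in the slides, and checks each grade's children against it by translating the model into a Python regular expression. The helper names are hypothetical, and the translation handles element-only models, not #PCDATA.

```python
import re
import xml.etree.ElementTree as ET

NAME = r"[A-Za-z_][\w.-]*"


def model_to_regex(model):
    """Translate an element-only content model into a Python regex."""
    # Each element name becomes a group that matches "name;".
    pattern = re.sub(NAME, lambda m: "(?:" + re.escape(m.group(0)) + ";)", model)
    # ',' is plain sequencing, so commas and whitespace are dropped;
    # ? + * | ( ) mean the same thing in Python's re module.
    return re.sub(r"[\s,]+", "", pattern)


def children_match(element, model):
    # Encode the child sequence as "name;name;..." and match it as a whole.
    names = "".join(child.tag + ";" for child in element)
    return re.fullmatch(model_to_regex(model), names) is not None


GRADE_MODEL = "(student, assigned-grade, reason?)"  # assumed, not from the slides

doc = ET.fromstring(
    "<grades>"
    "<grade><student>Jane Doe</student><assigned-grade>A</assigned-grade></grade>"
    "<grade><student>Wayne Doe</student><assigned-grade>I</assigned-grade>"
    "<reason>Alien abduction</reason></grade>"
    "</grades>"
)
results = [children_match(grade, GRADE_MODEL) for grade in doc]
```

Under this assumed model both grade elements are accepted, with and without the reason child.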
Mixed content
Legal to have a content model with text and element data
<story category="national" byline="Karen Wheatley">
  <headline>President Meets with Congress</headline>
  <![CDATA[ The President met with Congressional leaders today in an
  effort to jump-start faltering budget negotiations. Sources described the mood
  of the meeting as "cordial". ]]>
  <full_text ref="news801" />
  <image src="img2071.jpg" />
  <image src="img2072.jpg" />
  <image src="img2073.jpg" />
</story>
CDATA?
Forgot to mention last week
Content that appears in a CDATA section will not be parsed
Can include arbitrary text, including <, &, etc.
Only restriction: it cannot contain the termination sequence ]]>
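A minimal sketch of this behaviour, using Python's standard xml.etree.ElementTree parser. The note element and its text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Markup characters inside a CDATA section are delivered as plain text.
doc = '<note><![CDATA[if (a < b && b > 0) { print "<ok>"; }]]></note>'
root = ET.fromstring(doc)
text = root.text
child_count = len(root)  # "<ok>" was not parsed as an element

# The same kind of text outside a CDATA section is not well-formed.
try:
    ET.fromstring("<note>a < b</note>")
    unwrapped_ok = True
except ET.ParseError:
    unwrapped_ok = False
```

The application sees only the character data; it cannot tell afterwards whether the text came from a CDATA section or from escaped characters.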
Mixed content, cont'd
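<!ELEMENT story (#PCDATA | headline | full_text | image)*>
Mixed content must be declared as (#PCDATA | name | ...)*, so the DTD cannot constrain the order or number of the child elements
Mixed content makes handling XML complex, but it is necessary for many applications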
Recursion
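Unlike in grammars, a recursive formulation ≠ repetition
Difference between
<!ELEMENT students (student+)>
<!ELEMENT students (student, students?)>
The first allows a flat list of student elements; the second requires each further students element to be nested inside the previous one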
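To make the difference concrete, the sketch below parses one document shaped for each declaration and compares the resulting trees. The two sample documents are assumptions written to fit the declarations; they are not taken from the slides.

```python
import xml.etree.ElementTree as ET

# Fits <!ELEMENT students (student+)>: a flat list.
flat = ET.fromstring("<students><student/><student/><student/></students>")

# Fits <!ELEMENT students (student, students?)>: each list tail is nested.
nested = ET.fromstring(
    "<students><student/>"
    "<students><student/>"
    "<students><student/></students>"
    "</students></students>"
)


def depth(element):
    """Number of element levels from this element down to its deepest leaf."""
    return 1 + max((depth(child) for child in element), default=0)


flat_children = [child.tag for child in flat]
nested_children = [child.tag for child in nested]
```

Both documents hold three student elements, but the recursive form adds one level of nesting per student, so the document trees (and any code that walks them) differ.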
Restriction
The grammar cannot be ambiguous
A → (a, b) | (a, c)
Ambiguity makes the parser implementation difficult
Usually easy to make non-ambiguous
A → a, (b | c)
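The sketch below illustrates both points with single-letter element names standing in for a, b and c. It checks by brute force that the two formulations accept the same sequences, and counts how many alternatives of the ambiguous form are still possible after reading the first symbol. Python's re module backtracks, so it accepts either pattern; a DTD parser that decides with one symbol of lookahead cannot. The function names are hypothetical.

```python
import re
from itertools import product

ambiguous = re.compile(r"ab|ac")    # A -> (a, b) | (a, c)
factored = re.compile(r"a(?:b|c)")  # A -> a, (b | c)


def same_language(max_len=3, alphabet="abc"):
    """Compare both patterns on every string up to max_len symbols."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            s = "".join(letters)
            if bool(ambiguous.fullmatch(s)) != bool(factored.fullmatch(s)):
                return False
    return True


def candidate_branches(first_symbol):
    """How many alternatives of (a, b) | (a, c) can start with this symbol."""
    branches = (("a", "b"), ("a", "c"))
    return sum(1 for branch in branches if branch[0] == first_symbol)
```

After factoring out the common prefix, the first symbol no longer leaves a choice between alternatives, which is what the XML specification calls a deterministic content model.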
Attribute lists
Declared separately from elements
Can be anywhere in the DTD
Specification includes
name of the element
name of the attribute
attribute type
default
Attribute types
Character data: CDATA (different from an XML CDATA section!)
Enumerated: (yes|no)
ID: must be unique in the document
IDREF: must refer to an ID in the document
NMTOKEN: a restriction of CDATA to a single "word"
Also IDREFS and NMTOKENS
Default declaration
#REQUIRED: the attribute must be provided
#IMPLIED: means optional
Value: this becomes the default
#FIXED value: the attribute always has the value provided
Examples
<!ATTLIST img
src CDATA #REQUIRED
alt CDATA #REQUIRED
align (left|right|center) "left"
id ID #IMPLIED
>
<!ATTLIST timestamp
time-zone NMTOKEN #IMPLIED>
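A small sketch of how a declared default reaches the application, using the img declaration above placed in an internal DTD subset. It uses Python's xml.parsers.expat; the sample img element, the EMPTY element declaration and the function name are assumptions for illustration. Expat is a non-validating parser, so it supplies defaults but does not enforce #REQUIRED.

```python
import xml.parsers.expat

DOC = """<!DOCTYPE img [
  <!ELEMENT img EMPTY>
  <!ATTLIST img
    src   CDATA #REQUIRED
    alt   CDATA #REQUIRED
    align (left|right|center) "left"
    id    ID    #IMPLIED
  >
]>
<img src="logo.png" alt="Logo"/>"""


def attributes_seen(document):
    """Return {element name: attributes} as reported by the parser."""
    seen = {}
    parser = xml.parsers.expat.ParserCreate()

    def start_element(name, attrs):
        seen[name] = dict(attrs)

    parser.StartElementHandler = start_element
    parser.Parse(document, True)
    return seen


attrs = attributes_seen(DOC)["img"]
```

The align attribute is reported with its default even though the document never wrote it, while the #IMPLIED id attribute is simply absent.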
Entities
Like macros: content to be inserted, indicated with &name;
Predefined general entities: &amp; &lt; (an essential part of XML)
User-defined general entities: &disclaimer;
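A minimal sketch of the macro-like expansion, assuming a disclaimer entity declared with an <!ENTITY ...> declaration in the document's internal DTD subset. The book element and its text are made up for illustration; the parser is Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE book [
  <!ELEMENT book (#PCDATA)>
  <!ENTITY disclaimer "This is a work of fiction.">
]>
<book>&disclaimer; Tom &amp; Jerry &lt;1940&gt;</book>"""

# Both the user-defined entity and the predefined ones are replaced
# before the application sees the text.
text = ET.fromstring(DOC).text
```

The application receives only the expanded text and never sees the entity references themselves.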
Entities, cont'd
Parameter entities can also be used to simplify DTD creation or to combine DTDs
Indicated with a %
More on this next week
Defining general entities
<!ENTITY name content>
Example
<!ENTITY disclaimer
"This is a work of fiction. Any resemblance to persons living or dead is unintentional.">
Unparsed data
What about non-text data? Images, audio files
In XML we define a notation
Create a name and associate an application with it
A suggestion to the application for how to interpret the unparsed data
Not part of the parsing operation
Using Notation
<!NOTATION name SYSTEM url>
Example
<!NOTATION jpeg SYSTEM "IExplore.exe">
declares the jpeg notation
Example
<!ENTITY photo53 SYSTEM "photo53.jpg" NDATA jpeg>
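The sketch below shows what a parser hands to the application for these declarations: only the names and system identifiers, never the image data. It uses the declaration handlers of Python's xml.parsers.expat; the album element, its cover attribute and the function name are assumptions for illustration.

```python
import xml.parsers.expat

DOC = """<!DOCTYPE album [
  <!ELEMENT album EMPTY>
  <!ATTLIST album cover ENTITY #REQUIRED>
  <!NOTATION jpeg SYSTEM "IExplore.exe">
  <!ENTITY photo53 SYSTEM "photo53.jpg" NDATA jpeg>
]>
<album cover="photo53"/>"""


def collect_declarations(document):
    """Return ({notation: system id}, {unparsed entity: (system id, notation)})."""
    notations, unparsed = {}, {}
    parser = xml.parsers.expat.ParserCreate()

    def notation_decl(name, base, system_id, public_id):
        notations[name] = system_id

    def entity_decl(name, is_parameter, value, base,
                    system_id, public_id, notation_name):
        # Only unparsed (NDATA) entities carry a notation name.
        if notation_name is not None:
            unparsed[name] = (system_id, notation_name)

    parser.NotationDeclHandler = notation_decl
    parser.EntityDeclHandler = entity_decl
    parser.Parse(document, True)
    return notations, unparsed


notations, unparsed = collect_declarations(DOC)
```

The parser never opens photo53.jpg; it is left to the application to follow the system identifier and decide what to do with a jpeg.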
Notation, cont'd
Note that the entity is declared in the DTD, not in the document
The binary data is referenced, not embedded in the XML document
Not that useful in practice: more likely to use URLs
Typical Example
<story category="national" byline="Karen Wheatley">
...
<full_text ref="news801" />
<image src="img2071.jpg" />
<image src="img2072.jpg" />
<image src="img2073.jpg" />
</story>
Now it is up to the application to do something appropriate with the src attribute
DTD limitations
Not in XML: need a special parser for the DTD
No content type restrictions: #PCDATA can be anything
Element names must be globally unique
Cannot reuse a common term such as name in different places in the document, hence course-name, professor-name
DTD benefits
Relatively easy to write and understand
Wait until you see XML Schema!
Possible to modularize and combine DTDs
More next week