XML D Nathan Intro and formalism

Upload: norman-holt

Post on 05-Jan-2016

XML

D Nathan

Intro and formalism

Roots

A computer is not a typewriter
• electronic texts are more than sequences of characters
• they have structure, and context
• they also have multiple readings

Markup provides a means of making structure, context and readings explicit
• only that which is symbolically explicit can be digitally processed
• digital processing is about more than reproducing paper

Textual ontologies

As annotations, markup adds value to data

Markup facilitates multiple readings and multiple usages
• different contexts
• different formats
• different audiences
• different purposes

There’s more: texts can not only be read but also analysed and manipulated

What is markup, again?

A way of naming and identifying the parts of a document in a sharable and consistent way

A way of making explicit the distinctions we want a computer to make when it processes a sequence of characters

Making the document “machine readable” (computers can read and process it as if they understand it)

... and again?

“A set of codes that tell an agent how to interpret, process or display content”

Thus, it’s usually more useful to mark up what things really are than what they look like

Example

James Bond
1007 Fast Drive
Aston Martin
420HP DB5
01865 007 080
25/10/06

Dear Mr Khazakstanspy

It is with some regret that ....

What is “25/10/06”? How do we know? What does the software know?
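One way to make the slide's question concrete is sketched below in Python. The element and attribute names (`letter`, `phone`, `date`, `when`) are hypothetical, and reading 25/10/06 as 25 October 2006 is an assumption; the point is only that once the date is marked up, software no longer has to guess.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical markup: the tag says WHAT the string is, and the
# (assumed) "when" attribute gives an unambiguous ISO 8601 reading.
MARKED_UP = (
    "<letter>"
    "<phone>01865 007 080</phone>"
    '<date when="2006-10-25">25/10/06</date>'
    "</letter>"
)

letter = ET.fromstring(MARKED_UP)
date_elem = letter.find("date")

# The human-readable form is kept as content ...
as_written = date_elem.text
# ... while the program reads the explicit value instead of guessing.
sent = date.fromisoformat(date_elem.get("when"))

print(as_written, "->", sent.isoformat())
```

Without the tags, the same characters could be a date, part of a phone number, or a product code; the software would only know what we tell it.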

Design principles

XML came out of SGML - a system for incremental and collaborative “enrichment” of texts

XML design principles

1. XML shall be straightforwardly usable over the Internet.
2. XML shall support a wide variety of applications.
3. XML shall be compatible with SGML.
4. It shall be easy to write programs which process XML documents.
5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
6. XML documents should be human-legible and reasonably clear.
7. The XML design should be prepared quickly.
8. The design of XML shall be formal and concise.
9. XML documents shall be easy to create.
10. Terseness is of minimal importance.

XML

eXtensible Markup Language: a generic markup language

Simplifies the representation of structured data as linear character strings, i.e. can be thought of as:
• a stream of text, and/or
• a (tree) structure

XML looks like HTML, except that it:
• is extensible
• must be well-formed
• can be validated
• is application-, platform-, and vendor-independent
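The "stream of text and/or tree" idea can be sketched with Python's standard `xml.etree.ElementTree` module. The `sentence` and `noun` element names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# View 1: XML as a linear stream of characters.
stream = "<sentence>the <noun>dog</noun> chased the <noun>cat</noun></sentence>"

# View 2: the same data as a tree of elements.
tree = ET.fromstring(stream)
nouns = [noun.text for noun in tree.iter("noun")]

# The tree can be written back out as a character stream.
back_to_text = ET.tostring(tree, encoding="unicode")

print(nouns)
print(back_to_text)
```

Here the round trip reproduces the original string, which is the sense in which the two views are interchangeable.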

XML landscape

[Diagram; recoverable labels only]

Markup languages
• SGML, XML, HTML

XML languages
• XHTML, MathML, SMIL, IXF, SVG, CBML

Related technologies
• CSS, XSL-FO (layout)
• XPath, XQuery (navigate, query)
• XSLT (transform)
• XLink (link)

Grammars
• Schema, DTD

XML Formalism

Create explicit formal structures using only plain text

• structures are defined by tags in angle brackets, e.g. <noun>

• tags are usually in pairs, a start/open tag and an end/close tag:

the <noun> dog </noun> chased ...

• but can also be single and closed:

the dog <pause /> sat down
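A small sketch of how a parser sees these tags, using `XMLPullParser` from Python's standard library. The wrapping `<s>` element is an assumption added here, because a complete XML document needs a single root element.

```python
import xml.etree.ElementTree as ET

# Ask the parser to report every start tag and every end tag it meets.
parser = ET.XMLPullParser(events=("start", "end"))
parser.feed("<s>the <noun> dog </noun> chased <pause /> the cat</s>")
parser.close()

events = [(event, elem.tag) for event, elem in parser.read_events()]

for event, tag in events:
    print(event, tag)
```

The paired `<noun> ... </noun>` and the single `<pause />` both produce one start event and one end event; the single form is simply shorthand for an element with nothing inside it.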

Elements, tags and content

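Elements
• Tags (opening, closing, empty)
• Content

<a></a> is not an empty-element tag like <a />, but the element has no content; XML processors treat the two forms as the same empty element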
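As a quick check of how one parser treats the two spellings, here is a sketch using Python's `xml.etree.ElementTree`; other tools may display them differently, but the parsed result is the same.

```python
import xml.etree.ElementTree as ET

paired = ET.fromstring("<a></a>")   # start tag + end tag, nothing between
single = ET.fromstring("<a />")     # one empty-element tag

def summary(elem):
    # Tag name, text content, and number of child elements.
    return (elem.tag, elem.text, len(elem))

same = summary(paired) == summary(single)
print(summary(paired), summary(single), same)
```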

Attributes and values

Tags can have attributes with values:

the <noun num="1"> dog </noun> sat down

Attribute names within elements are unique

Order of attribute/value pairs insignificant:

the <noun num="1" cl="anim"> dog </noun> sat

the <noun cl="anim" num="1"> dog </noun> sat

Often attribute values have to be drawn from a closed set, e.g. consider:

<dog breed="corgi" color="noun"> Fifi </dog> ?
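The three points on this slide can be tried out with a short Python sketch. The set of allowed colours is invented for illustration; in practice a closed set like this would usually be declared in a DTD or schema rather than in program code.

```python
import xml.etree.ElementTree as ET

# 1. Order of attribute/value pairs is insignificant.
first = ET.fromstring('<noun num="1" cl="anim">dog</noun>')
second = ET.fromstring('<noun cl="anim" num="1">dog</noun>')
same_attributes = first.attrib == second.attrib

# 2. Attribute names within an element must be unique.
def is_well_formed(text):
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

duplicate_ok = is_well_formed('<noun num="1" num="2">dog</noun>')

# 3. Well-formed is not the same as sensible: color="noun" parses fine,
#    so a closed set of values has to be checked separately.
ALLOWED_COLORS = {"black", "brown", "tan"}  # hypothetical closed set
dog = ET.fromstring('<dog breed="corgi" color="noun">Fifi</dog>')
color_ok = dog.get("color") in ALLOWED_COLORS

print(same_attributes, duplicate_ok, color_ok)
```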

Names

You can name your elements, attributes or values (almost) anything, but ...

Names must begin with a letter or “_” (not a digit, “-” or “.”), and must not contain spaces

Characters

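XML text is Unicode (UTF-8 or UTF-16 unless declared otherwise; plain ASCII is a subset of UTF-8)

XML is case sensitive; in general use lower case

Reserved characters: <, >, &, " and '

• less-than (<) &lt;
• greater-than (>) &gt;
• ampersand (&) &amp;
• quote (") &quot;
• apostrophe (') &apos;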
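In practice these replacements are rarely typed by hand. The sketch below uses `escape` from Python's standard `xml.sax.saxutils`; the sample string and the `<code>` wrapper element are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

original = 'if a < b && "c" > d'

# escape() handles &, < and > by default; the extra mapping asks it
# to replace the double quote as well.
escaped = escape(original, {'"': "&quot;"})
print(escaped)

# A parser turns the references back into the original characters.
round_trip = ET.fromstring(f"<code>{escaped}</code>").text
print(round_trip)
```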

Character entity references

“Stand in” for reserved characters
• e.g. &lt;

Provide standardised references
• e.g. &t-pal;

Provide “short cuts” for strings
• e.g. &n;

Have to be declared (apart from predefined ones such as &lt; and &amp;), but can be created to purpose
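A minimal sketch of a declared "short cut" entity, parsed with Python's `xml.etree.ElementTree`. The `gloss` element and the replacement text "noun" for `&n;` are assumptions for illustration; the slide does not say what `&n;` stands for.

```python
import xml.etree.ElementTree as ET

# The entity is declared in the document type declaration,
# then used in the body like any predefined reference.
DOC = """<!DOCTYPE gloss [
  <!ENTITY n "noun">
]>
<gloss>dog is a &n;, tagged &lt;&n;&gt;</gloss>"""

root = ET.fromstring(DOC)
print(root.text)

# An undeclared entity is an error, not a silent pass-through.
try:
    ET.fromstring("<gloss>dog is a &v;</gloss>")
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
```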

Syntax

Nesting (hierarchy), but no overlap:

<a>the<b><c>cat</c> sat</b> on the mat</a> (well-formed: <c> closes inside <b>)

<a>the<b><c>cat</b> sat</c> on the mat</a> (not well-formed: <b> and <c> overlap)
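The two lines on this slide can be fed to a parser to see the nesting rule enforced. This is a sketch using Python's standard library; the exact wording of the error message depends on the parser.

```python
import xml.etree.ElementTree as ET

NESTED = "<a>the<b><c>cat</c> sat</b> on the mat</a>"
OVERLAPPING = "<a>the<b><c>cat</b> sat</c> on the mat</a>"

def parse_error(text):
    """Return None if text is well-formed, else the parser's message."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return str(err)
    return None

nested_error = parse_error(NESTED)
overlap_error = parse_error(OVERLAPPING)

print("nested:", nested_error)
print("overlapping:", overlap_error)
```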

More syntax

All elements must be closed

All attributes must have values; values must be enclosed in (plain) quotes, double or single

There are no size or number limits

The XML document

A plain text file

Main parts: prolog, body

Body has a single root node (= element)

Comments

<!-- this comment may be ignored -->

Processing instructions (PI)

The (optional) XML declaration looks like a special PI and, if present, must come first:

<?xml version="1.0" ?>
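The parts listed so far can be seen by loading a small document into a DOM and listing its top-level nodes. This sketch uses Python's standard `xml.dom.minidom`; the `render` processing instruction and the `letter` root element are hypothetical.

```python
from xml.dom import minidom, Node

DOC = """<?xml version="1.0" ?>
<!-- this comment may be ignored -->
<?render mode="draft" ?>
<letter><p>It is with some regret that ....</p></letter>"""

dom = minidom.parseString(DOC)

# Top-level nodes of the document, skipping any whitespace text nodes.
parts = [
    (node.nodeType, node.nodeName)
    for node in dom.childNodes
    if node.nodeType != Node.TEXT_NODE
]
root = dom.documentElement

for node_type, name in parts:
    print(node_type, name)
print("root element:", root.tagName)
```

The comment and the processing instruction sit in the prolog, outside the single root element that makes up the body.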

Document type declaration

<!DOCTYPE IXF SYSTEM "IXF.DTD" [<!ENTITY LEXFILE "..\DXF\PaakaDraft.xml">]>

XML document layout

Is unimportant!

... in most circumstances, but some applications might treat the white space differently

This is the same as ...

<panel n="3"> <panelDescription characters="cap" /> <caption> <paragraph> Before the hammer descends on cap, his shield <emphasis style="bold"> demolishes </emphasis> the evil mechanism! </paragraph> </caption> <soundEffect> KRAK! </soundEffect> </panel> <panel n="4"> <panelDescription characters="cap anon_man" /> <caption> <paragraph> The screaming suddenly <emphasis style="bold"> stops-- </emphasis> and, in the ensuing silence, <emphasis style="bold"> both </emphasis> men sink <emphasis style="bold"> slowly </emphasis> to the ground... </paragraph> </caption> </panel>

... this!

<panel n="3">
  <panelDescription characters="cap" />
  <caption>
    <paragraph>
      Before the hammer descends on cap, his shield
      <emphasis style="bold"> demolishes </emphasis>
      the evil mechanism!
    </paragraph>
  </caption>
  <soundEffect> KRAK! </soundEffect>
</panel>
<panel n="4">
  <panelDescription characters="cap anon_man" />
  <caption>
    <paragraph>
      The screaming suddenly
      <emphasis style="bold"> stops-- </emphasis>
      and, in the ensuing silence,
      <emphasis style="bold"> both </emphasis>
      men sink
      <emphasis style="bold"> slowly </emphasis>
      to the ground...
    </paragraph>
  </caption>
</panel>
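A sketch of what "the same" can mean to a program: parse a one-line layout and an indented layout of a shortened version of the first panel, then compare them after collapsing runs of white space. Treating all white space as collapsible is an assumption made here; as the previous slide notes, some applications treat it differently.

```python
import xml.etree.ElementTree as ET

def collapse(text):
    # Squeeze runs of spaces and line breaks; None becomes "".
    return " ".join((text or "").split())

def shape(elem):
    # Tag, attributes, text, and (recursively) children with their tails.
    return (
        elem.tag,
        sorted(elem.attrib.items()),
        collapse(elem.text),
        [(shape(child), collapse(child.tail)) for child in elem],
    )

ONE_LINE = (
    '<panel n="3"><caption><paragraph>his shield '
    '<emphasis style="bold">demolishes</emphasis> the evil mechanism!'
    "</paragraph></caption><soundEffect>KRAK!</soundEffect></panel>"
)

INDENTED = """<panel n="3">
  <caption>
    <paragraph>
      his shield
      <emphasis style="bold"> demolishes </emphasis>
      the evil mechanism!
    </paragraph>
  </caption>
  <soundEffect> KRAK! </soundEffect>
</panel>"""

same = shape(ET.fromstring(ONE_LINE)) == shape(ET.fromstring(INDENTED))
print(same)
```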

Putting it together

... in XML

<story>
<metaDataField>(The Guardian, </metaDataField>
<metaDataField>July 1, 1997, </metaDataField>
<metaDataField>Andrew Higgins in Hong Kong)</metaDataField>
<headLine>A last hurrah and an empire closes down</headLine>
<p>With a clenched-jaw nod from the Prince of Wales, a last rendition of <title>God Save the Queen</title>, and a wind machine to keep the Union flag flying for a final 16 minutes of indoor pomp...</p>
</story>
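Once the story is marked up like this, its parts can be pulled out by name rather than by position. A minimal Python sketch using the markup above:

```python
import xml.etree.ElementTree as ET

STORY = """<story>
<metaDataField>(The Guardian, </metaDataField>
<metaDataField>July 1, 1997, </metaDataField>
<metaDataField>Andrew Higgins in Hong Kong)</metaDataField>
<headLine>A last hurrah and an empire closes down</headLine>
<p>With a clenched-jaw nod from the Prince of Wales, a last rendition of <title>God Save the Queen</title>, and a wind machine to keep the Union flag flying for a final 16 minutes of indoor pomp...</p>
</story>"""

story = ET.fromstring(STORY)

headline = story.findtext("headLine")
meta = "".join(field.text for field in story.findall("metaDataField"))
titles = [title.text for title in story.iter("title")]

# The running text of the paragraph, with the inner tags stripped out.
plain = "".join(story.find("p").itertext())

print(headline)
print(meta)
print(titles)
```

The same document can therefore serve several readings: a headline index, a list of works mentioned, or plain text for display.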

XML capable software

(other than displaying “raw” XML)

most browsers
• including XML, CSS, XSLT

software using XML-based data formats
• e.g. Transcriber
• may keep XML hidden but you can often manipulate it

software that exports data in some XML format
• e.g. MS Excel, Toolbox, Filemaker Pro

dedicated XML editing software
• e.g. oXygen