Writing Your Last DTD?

Alex Brown

Griffin Brown Digital Publishing Ltd

Background

• By DTD I mean simply the formal declarations as allowed by XML 1.x

• A ‘last DTD’ doesn’t mean a last validation mechanism: the future is not merely well-formed

• This presentation is in two parts:
  – Modelling
  – DTD-specific features

DTDs on the Wane?

• Some say DTDs are on the way out, and have been saying so for a while

• Some evidence of shift, mostly driven by new tools and new XML implementers

• The pipelining model of validation (DSDL) is likely to rise, so DTDs need to cooperate with other technologies

• DTDs are not very complete instruments of validation

Part I - Modelling

Human-facing XML Models

• XML can be seen as ‘just’ a serialisation format, in which case the models need ‘just’ to work

• This presentation is also concerned with models that people experience (at some level)

• People often look at raw markup, and experience content models through tools (e.g. syntax-directed editors)

Machine-facing XML Models

• Desirable features:
  – Normalised
  – Machine efficient
  – Programmer efficient

• Techniques fairly easily borrowed from other disciplines (database schema design, type system design, etc.)

Machines vs People

• Also known as data vs documents?

• In reality few resources are at the extremes of this spectrum

• Many resources mix data-like and document-like features

• The challenge is in finding a balance and tolerating the mess

Data Normalisation

• I.e., each item of data appears only once

• A really good idea for some data

• E.g. link targets, database dumps
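
One hedged reading of the link-target case: a DTD can declare each target once with an ID attribute and refer to it with IDREF, so the target's title is never repeated. The vocabulary below (doc, section, xref, id, target) is invented for this sketch and is not from the talk.

```xml
<!DOCTYPE doc [
  <!-- Hypothetical vocabulary: each section is a link target declared once -->
  <!ELEMENT doc     (section+)>
  <!ELEMENT section (title, para*)>
  <!ATTLIST section id ID #REQUIRED>
  <!ELEMENT title   (#PCDATA)>
  <!ELEMENT para    (#PCDATA | xref)*>
  <!-- xref points at a target by its ID instead of repeating the title -->
  <!ELEMENT xref    EMPTY>
  <!ATTLIST xref target IDREF #REQUIRED>
]>
<doc>
  <section id="s1"><title>Background</title></section>
  <section id="s2">
    <title>Modelling</title>
    <para>See <xref target="s1"/> for context.</para>
  </section>
</doc>
```

A validating parser checks that every IDREF value matches some ID in the document, which is about as far as a DTD goes towards enforcing this kind of normalisation.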

Mixed Content

• Normalisation not a natural feature of human languages

<p>The cat sat on the mat</p>

not

<p>The cat <verb infinitive='to sit' tense='perfect'/> on the mat</p>

When natural language is suitable

• Don’t be afraid to model mixed content (‘diamonds in the mud’ approach)– e.g. bibliographic references

• Sometimes the precision of human language cannot be modelled precisely– e.g. addresses
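
A minimal sketch of the ‘diamonds in the mud’ idea for a bibliographic reference, assuming hypothetical element names (bibref, author, title, year) and an invented reference: only the parts worth finding are tagged, and everything else stays as plain text.

```xml
<!DOCTYPE bibref [
  <!-- Mixed content: tagged items may appear anywhere among the text -->
  <!ELEMENT bibref (#PCDATA | author | title | year)*>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT title  (#PCDATA)>
  <!ELEMENT year   (#PCDATA)>
]>
<!-- Punctuation and connecting words are left untagged -->
<bibref><author>Smith, J.</author> and others, <title>A Sample
Title</title>, 2nd edn (London, <year>1999</year>).</bibref>
```

Note that a mixed content model cannot constrain the order or number of the tagged items; that looseness is the price of keeping the natural-language form.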

Type Hierarchies (1)

Credit-card @type=‘…’: Expiry, Number, Name

Credit-card @type=‘SWITCH’: Expiry, Number, Name, IssueNumber?
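
The single-element approach might be declared roughly as follows. This is a sketch, not the author's DTD: the attribute values other than SWITCH are assumptions, and the leaf elements are given simple text content for illustration.

```xml
<!-- DTD fragment: one element for every card, with the variant in an attribute -->
<!ELEMENT Credit-card (Expiry, Number, Name, IssueNumber?)>
<!-- VISA is an assumed value; only SWITCH appears on the slide -->
<!ATTLIST Credit-card type (VISA | SWITCH) #REQUIRED>
<!ELEMENT Expiry      (#PCDATA)>
<!ELEMENT Number      (#PCDATA)>
<!ELEMENT Name        (#PCDATA)>
<!ELEMENT IssueNumber (#PCDATA)>
```

A DTD cannot say that IssueNumber is allowed only when type is SWITCH, so the element has to be optional for every card type.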

Type Hierarchies (2)

[Diagram] Credit-card, specialised as:

– visa-card (etc.): Expiry, Number, Name

– switch-card: Expiry, Number, Name, IssueNumber
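
A hedged sketch of the same data modelled with one element per card type. Treating Credit-card as a wrapper around a choice is an assumption about the diagram; the leaf content is simplified to text.

```xml
<!-- DTD fragment: each card type gets its own element and content model -->
<!ELEMENT Credit-card (visa-card | switch-card)>
<!ELEMENT visa-card   (Expiry, Number, Name)>
<!-- Only switch-card carries an issue number, and here it is mandatory -->
<!ELEMENT switch-card (Expiry, Number, Name, IssueNumber)>
<!ELEMENT Expiry      (#PCDATA)>
<!ELEMENT Number      (#PCDATA)>
<!ELEMENT Name        (#PCDATA)>
<!ELEMENT IssueNumber (#PCDATA)>
```

The constraint that was inexpressible with a type attribute is now enforced by the DTD, at the cost of a new element declaration for every card type.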

Optional Elements?

• ‘Optional’ often doesn’t mean optional; in practice it is used to mean ‘must exist’ or ‘must not exist’

• Consider making choice explicit: e.g., (issue-number|no-issue-number)

• Type-safe models are good for machine-facing data, but they require maintenance
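
The explicit choice can be sketched like this. The element names issue-number and no-issue-number come from the slide; the surrounding card element and its other children are assumptions.

```xml
<!-- DTD fragment: the author must state one way or the other -->
<!ELEMENT card (expiry, number, name, (issue-number | no-issue-number))>
<!ELEMENT expiry          (#PCDATA)>
<!ELEMENT number          (#PCDATA)>
<!ELEMENT name            (#PCDATA)>
<!ELEMENT issue-number    (#PCDATA)>
<!-- An empty marker element records that the absence is deliberate -->
<!ELEMENT no-issue-number EMPTY>
```

Compared with issue-number?, an omitted issue number can no longer be confused with a forgotten one.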

Mega Markup

• ‘Just Tag It’?

• Models should have a justification (often a business justification)

• Rich inline tagging in particular needs to be thought through (KM technologies are often better for enriching documents)

Part II - Practicalities

Documentation

• DTDs are comparatively easy to document: content models are terse but expressive (people like them) e.g.

• A DTD is not a .DTD – and documentation is costly!

• Don’t make the limits of the DTD the limits of your specification; DTDs ‘rough out’ content

• We need a graphical standard for representing models (not UML please)
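
As an illustration of documenting beyond what the DTD can express, a declaration can carry its specification in a comment. The element names and the rule in the comment are invented for this sketch.

```xml
<!-- section: a titled division of the document.
     Content: a title, then any mix of paragraphs, lists and tables,
     then any nested sections.
     Beyond the DTD: a title should not end in a full stop. The DTD
     cannot check this, so the rule is recorded here. -->
<!ELEMENT section (title, (para | list | table)*, section*)>
```

The content model itself reads almost like the prose above it, which is the terseness the first bullet refers to.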

Deployment

• Deploy a normalised version of your DTD via a web server

• Require that this authoritative version is used during data handovers

• Consider requiring the use of PUBLIC identifiers
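
A document handed over under these rules might start like this. The public identifier, the URL and the root element are all hypothetical placeholders.

```xml
<!-- PUBLIC identifier names the DTD; the system identifier points at
     the authoritative copy on the web server (both invented here) -->
<!DOCTYPE article PUBLIC "-//Example Org//DTD Article 1.0//EN"
  "http://www.example.org/dtd/article-1.0.dtd">
<article/>
```

The public identifier lets a recipient map the DTD to a local copy through a catalog, while the system identifier still resolves to the authoritative version.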

Parameterisation

• Parameter entities: macro-like features for use in DTDs

<!ENTITY % p.zz "(%p.el;)|(%p.tbl;)|(%p.lst.d;)|(%p.form;)" >

• More useful during development than in the mature phases of a DTD’s lifetime.
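
A small sketch of the macro-like behaviour, using hypothetical names. It is written as a fragment of an external DTD file, because XML 1.0 does not allow parameter-entity references inside declarations in the internal subset.

```xml
<!-- Each parameter entity names a reusable group of elements -->
<!ENTITY % inline "emph | code">
<!ENTITY % block  "para | list">

<!-- The references expand in place, like macros -->
<!ELEMENT section (title, (%block;)+)>
<!ELEMENT title   (#PCDATA | %inline;)*>
<!ELEMENT para    (#PCDATA | %inline;)*>
<!ELEMENT list    (item+)>
<!ELEMENT item    (%block;)+>
<!ELEMENT emph    (#PCDATA)>
<!ELEMENT code    (#PCDATA)>
```

Adding a new inline element means editing one entity rather than every content model, which is why this helps most while the model is still changing.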

Entities

• Entity declarations are a DTD-only feature. They are not in W3C XML Schema or RELAX NG (but maybe in DSDL)

• Good reason for sticking with DTDs – especially character entities.

• But they will make your data DTD-dependent

• In publishing, losing entities has not proved a problem (surprisingly)
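
The dependency can be seen in a minimal example. The entity name eacute is the conventional one for U+00E9, declared here in the internal subset so that the sketch is self-contained.

```xml
<!DOCTYPE para [
  <!ELEMENT para (#PCDATA)>
  <!-- Without this declaration the reference below is an error -->
  <!ENTITY eacute "&#xE9;">
]>
<para>caf&eacute; needs the declaration; caf&#xE9; does not.</para>
```

A numeric character reference (or the literal character in a suitable encoding) is understood by any XML parser, with or without a DTD.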

Namespaces

• DTDs and Namespaces are uneasy partners
  – Prefix inflexibility
  – Conventions and kludges, not standard
  – Buggy software (Microsoft parsers)

• Avoid using Namespaces with DTDs whenever possible

But if you must …

• Do not use #FIXED or default attributes in the DTD (tools will complain)

• Pre-pick your prefixes, and qualify the names of vocabularies within your DTD (e.g. m: for MathML)

• Make the xmlns attribute(s) #REQUIRED on your root elements, and use an external tool to enforce this

Example

<!ELEMENT root (…)>
<!ATTLIST root
  xmlns   CDATA #REQUIRED
  xmlns:m CDATA #REQUIRED>

<root xmlns='http://myorg.com/ns/'
      xmlns:m='http://www.w3.org/1998/Math/MathML'> …

But if you must (2)

• This works with tools, and means your namespaces work with/without the DTD being present

• Don’t get stressed: remember XSLT
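
Putting the namespace advice together, a complete document might look like the sketch below. The content models are heavily simplified assumptions (real MathML is far richer) and para is a hypothetical element.

```xml
<!DOCTYPE root [
  <!ELEMENT root (para | m:math)*>
  <!-- Namespace declarations are required, never defaulted or #FIXED -->
  <!ATTLIST root
    xmlns   CDATA #REQUIRED
    xmlns:m CDATA #REQUIRED>
  <!ELEMENT para   (#PCDATA)>
  <!-- The m: prefix is pre-picked and written into the element names -->
  <!ELEMENT m:math (m:mi)*>
  <!ELEMENT m:mi   (#PCDATA)>
]>
<root xmlns="http://myorg.com/ns/"
      xmlns:m="http://www.w3.org/1998/Math/MathML">
  <para>Let</para>
  <m:math><m:mi>x</m:mi></m:math>
</root>
```

To the DTD, m:math is just a name with a colon in it, so a document that binds MathML to another prefix will fail validation even though it is namespace-equivalent.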

Defaulting

• DTDs provide the means to add items to the infoset – default attribute values

• So do W3C XML Schemas; RELAX NG does not *

• Using defaulting makes your document depend on your DTD/Schema; do not use it (remember XSLT)

Example

<!ATTLIST para hide (yes|no) 'no'>

<!ATTLIST para hide (yes|no) #IMPLIED>

• Make the value inferable, and document it

• Again, remember XSLT
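
One way to read ‘remember XSLT’: when a consumer needs the inferred value spelled out, a small transform can add it, with no DTD involved. This is a sketch under that assumption, using the para and hide names from the example above.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity transform: copy everything unchanged by default -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- A para with no hide attribute is documented to mean hide="no" -->
  <xsl:template match="para[not(@hide)]">
    <xsl:copy>
      <xsl:attribute name="hide">no</xsl:attribute>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The documents stay the same whether or not a DTD is read, and the rule for the missing value lives in one documented place.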

Off-the-shelf standards

• For XML: MathML, SVG, CALS or Exchange Tables, XHTML, etc.

• Forget XLink: much pain, no gain

• Remember there are standards for many things: country, language, date and time, latitude/longitude. Good DTDs leverage standards.
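
For instance, an element might carry standard codes rather than home-grown ones. The element and its country and date attributes are hypothetical, and the values are purely illustrative.

```xml
<!DOCTYPE event [
  <!ELEMENT event (#PCDATA)>
  <!-- The DTD only roughs out the types; the documentation names the standards -->
  <!ATTLIST event
    xml:lang NMTOKEN #IMPLIED
    country  NMTOKEN #REQUIRED
    date     CDATA   #REQUIRED>
]>
<!-- xml:lang: language tag; country: ISO 3166-1 alpha-2; date: ISO 8601 -->
<event xml:lang="en" country="GB" date="2004-05-17">Conference</event>
```

A DTD cannot check that GB is a real country code or that the date is well formed, so this is another case where the specification has to go beyond the DTD.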

In Summary

• Pick good models

• Document your DTD and control its deployment

• Use Namespaces defensively

• Do not use entity (or notation) declarations

• Do not use attribute defaulting

• Use standards where possible

Thank You

Any Questions?

http://www.griffinbrown.co.uk/
