Writing Your Last DTD? Alex Brown, Griffin Brown Digital Publishing Ltd
Background
• By DTD I mean simply the formal declarations as allowed by XML 1.x
• A ‘last DTD’ doesn’t mean a last validation mechanism: the future is not well-formed
• This presentation is in two parts:
– Modelling
– DTD-specific features
DTDs on the Wane?
• Some say DTDs are on the way out, and have been saying so for a while
• Some evidence of shift, mostly driven by new tools and new XML implementers
• Rise of the pipelining model of validation (DSDL) likely. DTDs need to cooperate with other technologies
• DTDs are not very complete instruments of validation
Human-facing XML Models
• XML can be seen as ‘just’ a serialisation format, in which case the models need ‘just’ to work
• This presentation is also concerned with models that people experience (at some level)
• People often look at raw markup, and experience content models through tools (e.g. syntax-directed editors)
Machine-facing XML Models
• Desirable features:
– Normalised
– Machine efficient
– Programmer efficient
• Techniques fairly easily borrowed from other disciplines (database schema design, type system design, etc.)
Machines vs People
• Also known as data vs documents?
• In reality few resources are at the extremes of this spectrum
• Many resources mix data-like and document-like features
• The challenge is in finding a balance and tolerating the mess
Data Normalisation
• i.e., single items of data appear once
• A really good idea for some data
• E.g. link targets, database dumps
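As a rough sketch of normalised link targets, the following document stores a figure caption once and points at it by ID from wherever it is needed. All element and attribute names here (doc, figure, figref, target) are hypothetical and chosen only for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE doc [
  <!ELEMENT doc    (figure+, para+)>
  <!-- The target is declared once and carries a unique ID -->
  <!ELEMENT figure (#PCDATA)>
  <!ATTLIST figure id ID #REQUIRED>
  <!ELEMENT para   (#PCDATA | figref)*>
  <!-- References point at the target rather than repeating its content -->
  <!ELEMENT figref EMPTY>
  <!ATTLIST figref target IDREF #REQUIRED>
]>
<doc>
  <figure id="fig1">Caption, stored once</figure>
  <para>See <figref target="fig1"/> for details.</para>
  <para>As <figref target="fig1"/> shows, the caption is never repeated.</para>
</doc>
```

A validating parser checks that every IDREF value matches some ID in the document, which is about as much referential integrity as a DTD can offer.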
Mixed Content
• Normalisation not a natural feature of human languages
<p>The cat sat on the mat</p>
not
<p>The cat <verb infinitive='to sit' tense='perfect'/> on the mat</p>
When natural language is suitable
• Don’t be afraid to model mixed content (‘diamonds in the mud’ approach)
– e.g. bibliographic references
• Sometimes the precision of human language cannot be modelled precisely
– e.g. addresses
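A minimal sketch of the ‘diamonds in the mud’ idea for a bibliographic reference: the reference stays as running text, and only the pieces worth finding later are tagged. The element names (bibref, author, title, year) and the reference itself are hypothetical.

```xml
<?xml version="1.0"?>
<!DOCTYPE bibref [
  <!-- Mixed content: text with optional tagged items in any order -->
  <!ELEMENT bibref (#PCDATA | author | title | year)*>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT title  (#PCDATA)>
  <!ELEMENT year   (#PCDATA)>
]>
<bibref><author>A. N. Author</author>, <title>An Example Title</title>, 2nd edn (<year>2000</year>), pp. 10-12.</bibref>
```

The punctuation, edition statement and page range remain plain text; the model does not try to normalise them.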
Type Hierarchies (1)
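[Diagram: a single Credit-card element distinguished by its type attribute. With type='…' its children are Expiry, Number and Name; with type='SWITCH' they are Expiry, Number, Name and an optional Issue Number.]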
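One way the attribute-typed model might be written as DTD declarations is sketched below. The type attribute is declared as CDATA because the slide does not list the permitted values, and the leaf content models are assumed to be plain text.

```xml
<!-- One element for every card; the type attribute says which kind -->
<!ELEMENT Credit-card (Expiry, Number, Name, IssueNumber?)>
<!ATTLIST Credit-card type CDATA #REQUIRED>

<!-- Assumption: all leaf elements hold plain text -->
<!ELEMENT Expiry      (#PCDATA)>
<!ELEMENT Number      (#PCDATA)>
<!ELEMENT Name        (#PCDATA)>
<!ELEMENT IssueNumber (#PCDATA)>
```

A DTD cannot tie the presence of IssueNumber to the value of type, so the element has to be optional for every card and the real rule has to be checked elsewhere.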
Type Hierarchies (2)
[Diagram: Credit-card modelled as separate element types, visa-card (etc.) and switch-card, each with children Expiry, Number and Name, plus IssueNumber where the card type has one.]
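A sketch of the element-typed alternative, under two assumptions not confirmed by the slide: that Credit-card acts as a wrapper around exactly one card element, and that IssueNumber belongs to switch-card only.

```xml
<!-- The card type is now an element name, so each type gets its own content model -->
<!ELEMENT Credit-card (visa-card | switch-card)>
<!ELEMENT visa-card   (Expiry, Number, Name)>
<!ELEMENT switch-card (Expiry, Number, Name, IssueNumber)>

<!-- Assumption: all leaf elements hold plain text -->
<!ELEMENT Expiry      (#PCDATA)>
<!ELEMENT Number      (#PCDATA)>
<!ELEMENT Name        (#PCDATA)>
<!ELEMENT IssueNumber (#PCDATA)>
```

Here the DTD itself can insist on IssueNumber for one card type and forbid it for the other, at the cost of a new declaration for every card type added.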
Optional Elements?
• Optional often doesn’t mean ‘optional’; in practice it is used to mean ‘must exist’ or ‘must not exist’, depending on context
• Consider making choice explicit: e.g., (issue-number|no-issue-number)
• Type-safe models are good for machine facing data; but require maintenance
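The explicit choice could be declared along the following lines. Only issue-number and no-issue-number come from the slide; the card element and its other children are hypothetical.

```xml
<!-- Instead of issue-number?, the author must state which case applies -->
<!ELEMENT card (expiry, number, name, (issue-number | no-issue-number))>
<!ELEMENT expiry          (#PCDATA)>
<!ELEMENT number          (#PCDATA)>
<!ELEMENT name            (#PCDATA)>
<!ELEMENT issue-number    (#PCDATA)>
<!-- An empty marker element recording that the absence is deliberate -->
<!ELEMENT no-issue-number EMPTY>
```

A missing issue number can then no longer be confused with one that was simply forgotten.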
Mega Markup
• ‘Just Tag It’?
• Models should have a justification (often a business justification)
• Rich inline tagging in particular needs to be thought through (KM technologies are often better for enriching documents)
Documentation
• DTDs are comparatively easy to document: content models are terse but expressive (people like them) e.g.
• A DTD is not a .DTD – and documentation is costly!
• Don’t make the limits of the DTD the limits of your specification; DTDs ‘rough out’ content
• We need a graphical standard for representing models (not UML please)
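As an illustration of a terse content model carrying its documentation with it, here is a hypothetical declaration. The element names and the length rule are invented for the example and are not taken from any real DTD.

```xml
<!-- section: a titled division of the document.
     Content: one title, then one or more paragraphs or tables, in any order.
     Rule the DTD cannot express (hypothetical): titles stay under 80 characters. -->
<!ELEMENT section (title, (para | table)+)>
```

The comment records a constraint that lies outside what the DTD can enforce, which is the sense in which the DTD only ‘roughs out’ the content.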
Deployment
• Deploy a normalised version of your DTD via a web server
• Require that this authoritative version is used during data handovers
• Consider requiring the use of PUBLIC identifiers
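A document pointing at the authoritative copy might begin as follows. The public identifier and the URL are hypothetical placeholders, not real resources.

```xml
<?xml version="1.0"?>
<!-- PUBLIC identifier first, then the system identifier of the web-served copy -->
<!DOCTYPE article PUBLIC "-//Example Org//DTD Article 1.0//EN"
  "http://www.example.org/dtd/article-1.0.dtd">
<article>…</article>
```

The public identifier gives receiving systems a stable name to resolve locally (for example through a catalog), while the system identifier still names the copy on the web server.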
Parameterisation
• Parameter entities: macro-like features for use in DTDs
<!ENTITY % p.zz "(%p.el;)|(%p.tbl;)|(%p.lst.d;)|(%p.form;)" >
• More useful during development than in the mature phases of a DTD’s lifetime.
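A sketch of how a parameter entity such as p.zz might be built up and then used. The replacement texts (para, table, deflist, formula) and the section element are assumptions; the slide shows only the p.zz declaration itself.

```xml
<!-- Must live in an external DTD file: parameter entity references inside
     declarations are not allowed in the internal subset -->
<!ENTITY % p.el    "para" >
<!ENTITY % p.tbl   "table" >
<!ENTITY % p.lst.d "deflist" >
<!ENTITY % p.form  "formula" >

<!-- The declaration from the slide: one name for 'any paragraph-level item' -->
<!ENTITY % p.zz "(%p.el;)|(%p.tbl;)|(%p.lst.d;)|(%p.form;)" >

<!-- Use: a section is a title followed by paragraph-level items -->
<!ELEMENT section (title, (%p.zz;)+) >
```

After expansion the content model reads (title, ((para)|(table)|(deflist)|(formula))+), so changing what counts as paragraph-level means editing one entity rather than every content model.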
Entities
• Entity declarations are a DTD-only feature. They are not in W3C XML Schema or RELAX NG (but maybe in DSDL)
• Good reason for sticking with DTDs – especially character entities.
• But they will make your data DTD-dependent
• In publishing, losing entities has not proved a problem (surprisingly)
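The dependency can be seen in a small sketch. The para element is hypothetical; eacute is declared here for U+00E9.

```xml
<?xml version="1.0"?>
<!DOCTYPE para [
  <!ELEMENT para (#PCDATA)>
  <!-- Named character entity: only available because it is declared here -->
  <!ENTITY eacute "&#233;">
]>
<para>caf&eacute; and caf&#233; are the same text once parsed</para>
```

Remove the declaration and the reference to eacute can no longer be expanded, whereas the numeric character reference works with or without any DTD.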
Namespaces
• DTDs and Namespaces are uneasy partners
– Prefix inflexibility
– Conventions and kludges, not standard
– Buggy software (Microsoft parsers)
• Avoid using Namespaces with DTDs whenever possible
But if you must …
• Do not use #FIXED or default attributes in the DTD (tools will complain)
• Pre-pick your prefixes, and qualify the names of vocabularies within your DTD (e.g. m: for MathML)
• Declare the xmlns attribute(s) on your root elements as #REQUIRED, and use an external tool to enforce this
Example
<!ELEMENT root (…)>
<!ATTLIST root
  xmlns   CDATA #REQUIRED
  xmlns:m CDATA #REQUIRED>

<root xmlns='http://myorg.com/ns/'
      xmlns:m='http://www.w3.org/1998/Math/MathML'> …
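A fuller sketch of the same approach, showing a pre-picked m: prefix used in both the DTD and the instance. The content models for root and m:math are placeholder assumptions; real MathML content is far richer.

```xml
<?xml version="1.0"?>
<!DOCTYPE root [
  <!-- Placeholder content model: the slide elides the real one -->
  <!ELEMENT root (m:math)>
  <!-- Namespace declarations are required, never defaulted or #FIXED -->
  <!ATTLIST root
    xmlns   CDATA #REQUIRED
    xmlns:m CDATA #REQUIRED>
  <!-- To the DTD, m:math is just a name that happens to contain a colon -->
  <!ELEMENT m:math (#PCDATA)>
]>
<root xmlns="http://myorg.com/ns/"
      xmlns:m="http://www.w3.org/1998/Math/MathML">
  <m:math>x</m:math>
</root>
```

Because the declarations are written out in the instance, a namespace-aware processor sees the same names whether or not it reads the DTD. The DTD only checks that the attributes are present, not that they hold the right URIs, hence the external tool.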
But if you must (2)
• This works with tools, and means your namespaces work with/without the DTD being present
• Don’t get stressed: remember XSLT
Defaulting
• DTDs provide the means to add items to the infoset – default attribute values
• So do W3C XML Schemas; RELAX NG does not *
• Using defaulting makes your document depend on your DTD/Schema; do not use it (remember XSLT)
Example
<!ATTLIST para hide (yes|no) "no">
<!ATTLIST para hide (yes|no) #IMPLIED>
• Make the value inferable, and document it
• Again, remember XSLT
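One reading of ‘remember XSLT’ is that the inferred value can be made explicit at processing time rather than by the parser. The XSLT 1.0 sketch below assumes the documented rule is that a para with no hide attribute behaves as hide="no".

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything unchanged by default -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- A para without a hide attribute is given the documented value -->
  <xsl:template match="para[not(@hide)]">
    <xsl:copy>
      <xsl:attribute name="hide">no</xsl:attribute>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The result is the same with or without the DTD, which is the point of declaring the attribute #IMPLIED instead of giving it a default.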
Off-the-shelf standards
• For XML: MathML, SVG, CALS or Exchange Tables, XHTML, etc.
• Forget XLink: much pain, no gain
• Remember there are standards for many things: country, language, date and time, latitude/longitude. Good DTDs leverage standards.
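A small sketch of leaning on existing standards for attribute values: xml:lang for language tags, an ISO 3166 two-letter code for country and an ISO 8601 calendar date. The article element, the country and issued attribute names, and the values are all hypothetical.

```xml
<?xml version="1.0"?>
<!DOCTYPE article [
  <!ELEMENT article (#PCDATA)>
  <!ATTLIST article
    xml:lang NMTOKEN #IMPLIED
    country  NMTOKEN #IMPLIED
    issued   CDATA   #IMPLIED>
]>
<!-- Language tag, ISO 3166 country code, ISO 8601 date -->
<article xml:lang="en-GB" country="GB" issued="2004-01-31">…</article>
```

The DTD can only say that these are tokens or strings; the documentation should name the standard each value follows so that other tools can check it.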
In Summary
• Pick good models
• Document your DTD and control its deployment
• Use Namespaces defensively
• Do not use entity (or notation) declarations
• Do not use attribute defaulting
• Use standards where possible