

1

Digital preservation. Principles and potential role of XML

Giovanni Michetti

Urbino, 9th October 2002

2

Documents: form vs. content?

Traditional environment:

Form

Content

3

Documents: form vs. content?

Digital environment:

Form

Content

4

Documents: structure

Structure is unavoidably inside documents

As complexity grows, structure grows

Structure is (part of the) message

We deal with structure not only in the digital environment

5

Documents: structure and digital environment

Moving information onto new media

Need for functionality to manage the explosive growth of information

Need to make structure explicit

6

Markup

The proper description of an information resource requires:

identifying its logical components

making its structure explicit

Markup

7

Markup

Markup: any means of making the interpretation of a document explicit

8

From a record ...

University of Urbino

Faculty of Arts

Rome, 1st August 2002

Dr. Giovanni Michetti

Protocol n. 1234/AB

Subject: Teaching appointment

We inform you that you have been offered the teaching of Analysis and treatment of digital records by the Faculty of Arts Council, during the meeting of 30th July 2002. We remind you that for the stipulation of the contract we need, according to the legislative decree n. 80/1998, the authorization by the administration you belong to.

The Dean

Prof. Giorgio Cerboni Baiardi

Faculty of Arts

Piano S. Lucia 6 - 61029 Urbino

Tel: 0722.320125 Fax: 0722.322553 Email: preslet@lettere.uniurb.it

9

… to a marked record ...

<XML>
<letter>
<sender>University of Urbino
Faculty of Arts</sender>
<date>Rome, 1st August 2002</date>
<addressee>Dr. Giovanni Michetti</addressee>
<protocolnumb>Protocol n. 1234/AB</protocolnumb>
<subject>Subject: Teaching appointment</subject>
<text>We inform you that you have been offered the teaching of Analysis and treatment of digital records by the Faculty of Arts Council, during the meeting of 30th July 2002. We remind you that for the stipulation of the contract we need, according to the legislative decree n. 80/1998, the authorization by the administration you belong to</text>
<author>The Dean
Prof. Giorgio Cerboni Baiardi</author>
<heading>Faculty of Arts
Piano S. Lucia 6 - 61029 Urbino
Tel: 0722.320125 Fax: 0722.322553 Email: preslet@lettere.uniurb.it</heading>
</letter>
</XML>
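Once a record is marked up, software can address its logical components by name. The following is a minimal sketch in Python, assuming the standard library parser `xml.etree.ElementTree`; the letter is shortened and the outer wrapper element is left out so that `letter` is the root. The helper name `letter_fields` is hypothetical.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the marked record (assumption: <letter> is the root).
MARKED_LETTER = """\
<letter>
  <sender>University of Urbino Faculty of Arts</sender>
  <date>Rome, 1st August 2002</date>
  <addressee>Dr. Giovanni Michetti</addressee>
  <protocolnumb>Protocol n. 1234/AB</protocolnumb>
  <subject>Subject: Teaching appointment</subject>
</letter>"""


def letter_fields(xml_text: str) -> dict[str, str]:
    """Map each child element name of the root to its text content."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


fields = letter_fields(MARKED_LETTER)
print(fields["addressee"])
print(fields["protocolnumb"])
```

The plain-text record offers no such handle: the addressee can only be guessed from its position on the page.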

10

… to a DTD ...

<!ELEMENT letter (sender, date, addressee, protocolnumb, subject, text, author, heading)>
<!ELEMENT sender (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT addressee (#PCDATA)>
<!ELEMENT protocolnumb (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT text (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT heading (#PCDATA)>

11

… to a more precise DTD

<!ELEMENT letter (sender, date, addressee, precedent?, protocolnumb, classif?, subject, text, attachments?, author, heading)>
<!ELEMENT sender (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT addressee (#PCDATA)>
<!ELEMENT protocolnumb (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT text (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT precedent (#PCDATA)>
<!ELEMENT classif (#PCDATA)>
<!ELEMENT attachments (#PCDATA)>

12

Let’s refine the markup ...

<XML>
<letter>
<sender><body>University of Urbino</body>
<bureau>Faculty of Arts</bureau></sender>
<date><place>Rome,</place><time>1st August 2002</time></date>
<addressee>Dr. Giovanni Michetti</addressee>
<protocolnumb>Protocol n. 1234/AB</protocolnumb>
<subject>Subject: Teaching appointment</subject>
<text>We inform you that you have been offered the teaching of Analysis and treatment of digital records by the Faculty of Arts Council, during the meeting of 30th July 2002. We remind you that for the stipulation of the contract we need, according to the legislative decree n. 80/1998, the authorization by the administration you belong to</text>
<author><role>The Dean</role><name>Prof. Giorgio Cerboni Baiardi</name></author>
<heading>Faculty of Arts
Piano S. Lucia 6 - 61029 Urbino
Tel: 0722.320125 Fax: 0722.322553 Email: preslet@lettere.uniurb.it</heading>
</letter>
</XML>

13

... keeping on refining ...

<XML>
<letter>
<sender><body>University of Urbino</body>
<bureau>Faculty of Arts</bureau></sender>
<date><place>Rome,</place><time>1st August 2002</time></date>
<addressee>Dr. Giovanni Michetti</addressee>

[Protocolnumb + Subject + Text]

<author><role>The Dean</role><name><title>Prof.</title><propername>Giorgio</propername><surname>Cerboni Baiardi</surname></name></author>
<heading><bureau>Faculty of Arts</bureau>
<address>Piano S. Lucia 6 - 61029 Urbino</address>
<tel>Tel: 0722.320125</tel>
<fax>Fax: 0722.322553</fax>
<email>Email: preslet@lettere.uniurb.it</email></heading>
</letter>
</XML>

14

… and let’s refine the DTD

<!ELEMENT letter (sender, date, addressee, precedent?, protocolnumb, classif?, subject, text, attachment?, author, heading)>
<!ELEMENT sender (body, bureau)>
<!ELEMENT body (#PCDATA)>
<!ELEMENT bureau (#PCDATA)>
<!ELEMENT date (place, time)>
<!ELEMENT place (#PCDATA)>
<!ELEMENT time (#PCDATA)>
<!ELEMENT addressee (#PCDATA)>
<!ELEMENT precedent (#PCDATA)>
<!ELEMENT protocolnumb (#PCDATA)>
<!ELEMENT classif (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT text (#PCDATA)>
<!ELEMENT attachment (#PCDATA)>
<!ELEMENT author (role, name)>
<!ELEMENT role (#PCDATA)>
<!ELEMENT name (title, propername, surname)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT propername (#PCDATA)>
<!ELEMENT surname (#PCDATA)>
<!ELEMENT heading (bureau, address, tel, fax, email)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT tel (#PCDATA)>
<!ELEMENT fax (#PCDATA)>
<!ELEMENT email (#PCDATA)>

15

The final DTD

<!ELEMENT letter (sender, date, addressee+, precedent?, protocolnumb, classif?, subject, text, attachment?, author, heading?)>
<!ELEMENT sender (body?, bureau)>
<!ELEMENT body (#PCDATA)>
<!ELEMENT bureau (#PCDATA)>
<!ELEMENT date (place, time)>
<!ELEMENT place (#PCDATA)>
<!ELEMENT time (#PCDATA)>
<!ELEMENT addressee (#PCDATA)>
<!ELEMENT precedent (#PCDATA)>
<!ELEMENT protocolnumb (#PCDATA)>
<!ELEMENT classif (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT text (#PCDATA)>
<!ELEMENT attachment (#PCDATA)>
<!ELEMENT author (role?, name)>
<!ELEMENT role (#PCDATA)>
<!ELEMENT name (title?, propername?, surname)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT propername (#PCDATA)>
<!ELEMENT surname (#PCDATA)>
<!ELEMENT heading (bureau?, address?, tel?, fax?, email?)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT tel (#PCDATA)>
<!ELEMENT fax (#PCDATA)>
<!ELEMENT email (#PCDATA)>

16

XML declaration

Every XML document should start with an XML declaration, like

<?xml version="1.0"?>

The declaration must be right at the start of the document: there should be nothing before it (comments, instructions, whitespace, ...)

17

XML declaration

A parser uses the first 5 characters <?xml to detect which character encoding the document uses

The version attribute must have value 1.0

18

XML declaration

It is possible to specify the character encoding using the optional encoding attribute.

Example:

<?xml version="1.0" encoding="ISO-8859-1"?>
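The effect of the declaration can be observed with any parser. The sketch below assumes Python's standard `xml.etree.ElementTree` module; the element name `body` is borrowed from the letter example and the helper `parses` is hypothetical. It decodes a document stored as ISO-8859-1 bytes, then tries two variants that a conforming parser should reject: whitespace before the declaration, and the target written in upper case.

```python
import xml.etree.ElementTree as ET

# Bytes as they would sit in a file saved in ISO-8859-1 (Latin-1).
raw = '<?xml version="1.0" encoding="ISO-8859-1"?><body>Università di Urbino</body>'.encode(
    "iso-8859-1"
)

body = ET.fromstring(raw)  # the parser honours the declared encoding
print(body.text)


def parses(data: bytes) -> bool:
    """Return True when the parser accepts the document."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True


print(parses(b' <?xml version="1.0"?><a/>'))  # whitespace before the declaration
print(parses(b'<?XML version="1.0"?><a/>'))   # upper-case target
```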

19

Elements

Elements are the most important components of XML documents: they are the logical components through which you can identify the structure of documents. Example:

<author>Giovanni Michetti</author>

delimiters: < and >

tag name: author

start-tag: <author>

content: Giovanni Michetti

end-tag: </author>

element: the start-tag, the content and the end-tag together

20

Elements

Each start-tag must have a corresponding end-tag (starting with a forward slash)

Empty elements (like <img>, <br>, <hr> in HTML) are represented by a tag starting with a delimiter and ending with a forward slash before the closing bracket. Example: <image/>
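A short check of both rules, sketched with Python's standard `xml.etree.ElementTree` parser (an assumption; any XML parser should behave the same way). The empty-element tag and the start-tag/end-tag pair yield the same element, while an HTML-style tag left open is rejected.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring("<image/>")
long_form = ET.fromstring("<image></image>")

# Both spellings produce the same element: no text and no children.
print(short_form.tag, short_form.text, len(short_form))
print(long_form.tag, long_form.text, len(long_form))

# An HTML-style <br> without an end-tag is not accepted by an XML parser.
try:
    ET.fromstring("<p>first line<br>second line</p>")
    br_accepted = True
except ET.ParseError as error:
    br_accepted = False
    print("not well-formed:", error)
```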

21

Attributes

Attributes are expressed as name-value pairs associated with elements and appearing only in start-tags

Names are separated from their values by an equals sign (=). Values are wrapped in single or double quotes

Attributes must be associated with elements

The order of the attributes inside a start-tag does not matter
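These rules can be seen at work in a small sketch, assuming Python's standard `xml.etree.ElementTree` module. The attribute names `type` and `id` are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Same attributes, different quoting style and different order.
first = ET.fromstring('<letter type="official" id="1234/AB"/>')
second = ET.fromstring("<letter id='1234/AB' type='official'/>")

print(first.attrib)                   # name-value pairs of the start-tag
print(first.attrib == second.attrib)  # order and quote style make no difference
print(first.get("id"))
```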

22

XML tree

An XML document is a kind of hierarchical tree. It starts from a root (the root or document element) and develops from it into child elements, which can be siblings of one another

23

XML tree

Each element has one and only one parent (except the root)

Each element is completely wrapped inside another element (except the root)
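The tree can be walked programmatically. The sketch below assumes Python's standard `xml.etree.ElementTree` module, which stores only the children of each element, so the child-to-parent mapping (`parent_of`, a hypothetical name) is built by hand.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<letter><sender><body>University of Urbino</body>"
    "<bureau>Faculty of Arts</bureau></sender>"
    "<addressee>Dr. Giovanni Michetti</addressee></letter>"
)

# Every element reached as a child is recorded together with its parent.
parent_of = {child: parent for parent in doc.iter() for child in parent}

for element in doc.iter():
    parent = parent_of.get(element)
    print(element.tag, "<-", parent.tag if parent is not None else "(root)")
```

Here `body` and `bureau` are siblings because they share the parent `sender`, and `letter` is the only element without a parent.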

24

Entities

Example:

<author>Giovanni Michetti</author>

The string Giovanni Michetti (the element content) is also called character data. Character data can appear anywhere inside elements, or as values of attributes

25

Entities

There are special characters that are not allowed in text blocks: what if we want to use the less-than symbol < in a mathematical formula (a < b)?

Stratagem 1

Stratagem 2

26

Entities

1. CDATA sections: they start with the CDATA start marker

<![CDATA[

and end with the CDATA end marker

]]>

27

Entities

2. Entity references

Example:

&lt; stands for <

The parser recognizes the entity reference &lt; and substitutes it with the proper value <
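Both stratagems deliver the same character data to the application. A minimal sketch, assuming Python's standard `xml.etree.ElementTree` parser; the element name `formula` is hypothetical.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring("<formula><![CDATA[a < b]]></formula>")
with_entity = ET.fromstring("<formula>a &lt; b</formula>")

print(with_cdata.text)
print(with_entity.text)

# The unprotected symbol is taken as the start of a tag and rejected.
try:
    ET.fromstring("<formula>a < b</formula>")
    raw_accepted = True
except ET.ParseError:
    raw_accepted = False
print(raw_accepted)
```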

28

Entities

A parser is a piece of software able to read and interpret an XML document. A parser reads the XML document as plain text

Some parsers (validating parsers) are able to check the conformance of an XML document with a DTD

29

Entities

Standard (i.e. predefined) entities:

&lt; <
&gt; >
&amp; &
&apos; '
&quot; "

Any XML parser recognizes these entities and substitutes them with the proper values
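When text is written out rather than read, the substitution runs in the opposite direction. The sketch below assumes the `escape` and `unescape` helpers from Python's standard `xml.sax.saxutils` module, which handle `&`, `<` and `>` by default; the two quote characters are passed in explicitly. The wrapper names `to_xml_text` and `from_xml_text` are hypothetical.

```python
from xml.sax.saxutils import escape, unescape

# The two quote entities are not replaced by default, so they are listed here.
QUOTE_ENTITIES = {"'": "&apos;", '"': "&quot;"}
QUOTE_CHARACTERS = {"&apos;": "'", "&quot;": '"'}


def to_xml_text(raw: str) -> str:
    """Replace the five special characters with their predefined entities."""
    return escape(raw, QUOTE_ENTITIES)


def from_xml_text(escaped: str) -> str:
    """Substitute the predefined entities with their values, as a parser does."""
    return unescape(escaped, QUOTE_CHARACTERS)


sample = 'a < b & "c" > \'d\''
encoded = to_xml_text(sample)
print(encoded)
print(from_xml_text(encoded) == sample)
```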

30

Well-formed documents

Any XML document must be well-formed: it has to comply with some constraints, some of which are:

Each start-tag has a corresponding end-tag

Elements can’t overlap

There must be one and only one root element

Attribute values must be quoted

An element can’t contain different attributes with the same name
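Each of the constraints listed above can be tested against a parser. This is a sketch assuming Python's standard `xml.etree.ElementTree` module, which reports any violation as a `ParseError`; the function name `is_well_formed` and the sample fragments are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True when the parser accepts the text as an XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


checks = {
    "well-formed": "<a><b x='1'/></a>",
    "missing end-tag": "<a><b></a>",
    "overlapping elements": "<a><b><c></b></c></a>",
    "two root elements": "<a/><b/>",
    "unquoted attribute value": "<a x=1/>",
    "duplicate attribute": "<a x='1' x='2'/>",
}

for label, text in checks.items():
    print(f"{label}: {is_well_formed(text)}")
```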

31

Document Type Definition (DTD)

Once we have created a set of tags and attributes, we need to share it with other users so that everyone adopts the same syntax

We need a Document Type Definition (DTD)

32

Document Type Definition (DTD)

A DTD defines what markup can be used in a document that is supposed to conform to a specific structure, whose components are identified by tags

33

Document Type Definition (DTD)

For example, a DTD defines what elements a document can contain, their occurrences, their order, and so on

A DTD can set out which attributes an element can take and whether a value is required. It is also possible to define a set of predefined values for the attributes, and so on

34

Internal and external DTD

A DTD can be an external file or it can be included as part of the XML document. If it is an external file, the XML document must contain an explicit reference inside the Document Type Declaration:

<!DOCTYPE MyXMLDocs SYSTEM "file.dtd">

35

Internal and external DTD

A DTD can also be written inside the document type declaration. In this case we have an internal DTD, like:

<!DOCTYPE MyXMLDoc [
<!ELEMENT MyXMLDoc (#PCDATA)>
]>

In this case, all the constraints on the structure of the document are provided as declarations inside the square brackets
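A parser exposes the document type declaration to the application. The sketch below assumes Python's standard `xml.dom.minidom` module, which is a non-validating parser: it records the declaration but neither fetches `file.dtd` nor checks the document against the declarations.

```python
from xml.dom import minidom

# External DTD: the declaration only carries a reference to the file.
external = minidom.parseString(
    '<!DOCTYPE MyXMLDoc SYSTEM "file.dtd"><MyXMLDoc>text</MyXMLDoc>'
)
# Internal DTD: the declarations sit inside the square brackets.
internal = minidom.parseString(
    "<!DOCTYPE MyXMLDoc [<!ELEMENT MyXMLDoc (#PCDATA)>]><MyXMLDoc>text</MyXMLDoc>"
)

print(external.doctype.name, external.doctype.systemId)
print(internal.doctype.name, internal.doctype.systemId)
```

Checking conformance with the DTD requires a validating parser, which is not part of the Python standard library.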

36

Element declarations

A DTD is a set of declarations, the most important of which is the element declaration. Any DTD must have at least one element declaration (for the root element)

The syntax for a declaration is:

<!ELEMENT elementname (contentmodel)>

37

Element declarations

Example:

<!ELEMENT anthology (poem+)>
<!ELEMENT poem (title?, (stanza+|line+))>
<!ELEMENT title (#PCDATA)>
<!ELEMENT stanza (line+)>
<!ELEMENT line (#PCDATA)>

38

Cardinality suffixes

Cardinality suffixes are symbols used to specify how many times an element can occur at a certain point of the structure. Symbols used are:

? zero or one (0-1)
+ one or more (1-n)
* zero or more (0-n)
(none) exactly one (1)

39

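Connectors

Connectors are symbols used to specify order and relationships between components of a model

Symbols used are:

, (comma): sequence, the components must appear in the given order

| (vertical line): choice, one of the components must appear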
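Content models read much like regular expressions, and that analogy can be turned into a rough check. The sketch below is not a validating parser: it assumes Python's standard `re` and `xml.etree.ElementTree` modules, handles element-only models (no #PCDATA), and uses the poem model from the element declaration example. The function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def model_to_regex(content_model: str) -> str:
    """Translate an element-only content model into a regular expression."""
    compact = re.sub(r"\s+", "", content_model)
    # Each element name becomes a group matching that name followed by a space.
    with_names = re.sub(
        r"[A-Za-z_][\w.-]*",
        lambda match: f"(?:{re.escape(match.group(0))} )",
        compact,
    )
    # The comma means plain sequence; | ? + * mean the same in both notations.
    return with_names.replace(",", "")


def children_match(element: ET.Element, content_model: str) -> bool:
    """Check the names of the direct children against the content model."""
    child_names = "".join(child.tag + " " for child in element)
    return re.fullmatch(model_to_regex(content_model), child_names) is not None


POEM_MODEL = "(title?, (stanza+ | line+))"

with_stanzas = ET.fromstring("<poem><title>T</title><stanza/><stanza/></poem>")
mixed = ET.fromstring("<poem><stanza/><line>a</line></poem>")
print(children_match(with_stanzas, POEM_MODEL))
print(children_match(mixed, POEM_MODEL))
```

The second poem fails because the vertical line offers a choice between stanzas and lines, not a mixture of the two.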

40

Attribute declarations

An attribute declaration allows us to define the attributes associated with a given element

The syntax for a declaration is:

<!ATTLIST element_name attribute_definition*>

where an attribute definition is like:

attribute_name attribute_type default_declaration
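A concrete declaration may help. In the sketch below the attribute names `language` and `confidential`, their types and their default values are all hypothetical. It assumes Python's standard `xml.etree.ElementTree` module, whose underlying parser does not validate but does supply default values declared in an internal DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical attributes: one free-text (CDATA), one with enumerated values.
DOCUMENT = """\
<!DOCTYPE letter [
  <!ELEMENT letter (#PCDATA)>
  <!ATTLIST letter
            language     CDATA    "it"
            confidential (yes|no) "no">
]>
<letter language="en">Teaching appointment</letter>"""

letter = ET.fromstring(DOCUMENT)
# "language" is given in the document; "confidential" comes from the DTD default.
print(letter.get("language"))
print(letter.get("confidential"))
```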

41

Valid documents

Well-formed documents: XML documents conforming to the rules laid down in the XML 1.0 specifications

Valid documents: well-formed documents conforming to the rules laid down in a DTD

42

Stylesheets

So far the structure. But how can we render documents in the proper way?

Stylesheets

43

Stylesheets

Since content is separated from style, we no longer need to rewrite the whole document each time we want to change the layout: we simply need to change the “instructions” that modify rendering. In other words, we can modify representation without modifying content

XSL (eXtensible Stylesheet Language) is a style language based upon DSSSL (Document Style Semantics and Specification Language)
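The principle can be illustrated without XSL itself. The sketch below is only a stand-in written in Python with the standard `xml.etree.ElementTree` module: two hypothetical "stylesheets" (plain dictionaries mapping element names to output templates) render the same unchanged content in two different ways.

```python
import xml.etree.ElementTree as ET

LETTER = (
    "<letter><subject>Teaching appointment</subject>"
    "<date>1st August 2002</date></letter>"
)

# Hypothetical stylesheets: element name -> output template.
PLAIN_STYLE = {"subject": "Subject: {}", "date": "Date: {}"}
HTML_STYLE = {"subject": "<h1>{}</h1>", "date": "<p><em>{}</em></p>"}


def render(xml_text: str, style: dict[str, str]) -> str:
    """Apply the template of each child element, one output line per element."""
    root = ET.fromstring(xml_text)
    return "\n".join(style[child.tag].format(child.text) for child in root)


print(render(LETTER, PLAIN_STYLE))
print(render(LETTER, HTML_STYLE))
```

Changing the layout means editing a dictionary; the letter is never touched.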

44

So far the document …

… but a document is (generally) part of a file, which is in turn part of a series or a more complex archival collection

Archival bond

45

The object of analysis: from documents ...

46

… to files ...

47

.....

48

… to series

49

Archives: a complex system of relationships

File

Series

Archive

Document

50

Preserving, of course; but what?

Preserving

Original data

Context allowing data to be interpreted

Hardware

??

?

51

Preserving context

Preserving the context

Need to manage a network of metadata

52

XML technologies

XML Schema

Document Object Model (DOM)

Simple API for XML (SAX)

XSLT/XPath

XML Query

XLink

XPointer

XML Base

XForms

XML Fragment Interchange

XInclude

53

XML features

It’s a formal, non-proprietary standard: it is acceptable to a wide range of users

It’s a meta-language: it allows us to define DTDs and validate documents

It allows us to manage highly structured documents

It’s human-readable and self-descriptive: good chances to last

It uses Unicode text: no problems related to internationalization

54

XML features

It’s a family of technologies

It’s modular

It’s license-free and platform-independent

It can be transported across the Web using existing transport protocols: re-use of the communication and security structures already in place

55

XML features

It allows us to manage metadata easily

It provides a very good mechanism for representing the layout

It’s easy, powerful, but not too expensive

56

XML double-edged features

1. It’s a meta-language: it allows us to define DTDs, hence the danger of specialization (each user community with its own language)

Without a common language, XML is not so competitive with respect to other data interchange mechanisms

XSL does allow translation between different encodings, but this can be quite complex

RosettaNet and OASIS: trying to adopt common languages

57

XML double-edged features

2. It’s self-descriptive: you can create documents without using a DTD ...

58

XML double-edged features

3. It supports sophisticated searching by means of the tags embedded in the text, but a bad markup (not complete or not correct) highly reduces search effectiveness
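The dependence of searching on markup quality can be shown with the author element of the letter. This is a sketch assuming Python's standard `xml.etree.ElementTree` module; the function name `surnames` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The same information with coarse and with refined markup.
COARSE = "<letter><author>The Dean Prof. Giorgio Cerboni Baiardi</author></letter>"
REFINED = (
    "<letter><author><role>The Dean</role><name><title>Prof.</title>"
    "<propername>Giorgio</propername><surname>Cerboni Baiardi</surname>"
    "</name></author></letter>"
)


def surnames(xml_text: str) -> list[str]:
    """Collect the content of every <surname> element in the document."""
    return [node.text for node in ET.fromstring(xml_text).iter("surname")]


print(surnames(REFINED))
print(surnames(COARSE))
```

With the coarse markup the surname is present in the text but cannot be retrieved as such, because no tag identifies it.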

59

XML limitations

It’s a syntax: it contains no semantics, so you need to use other XML technologies such as XML Schema and RDF

It’s based upon text: the size of the markup can be much larger than the data itself
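The size of the overhead is easy to measure. The sketch below assumes Python's standard `xml.etree.ElementTree` module and a fragment without entity references or CDATA sections, so that every character that is not data counts as markup; the function name `markup_overhead` is hypothetical.

```python
import xml.etree.ElementTree as ET

RECORD = (
    "<author><role>The Dean</role><name><title>Prof.</title>"
    "<propername>Giorgio</propername><surname>Cerboni Baiardi</surname>"
    "</name></author>"
)


def markup_overhead(xml_text: str) -> tuple[int, int]:
    """Return (characters of data, characters of markup)."""
    data = "".join(ET.fromstring(xml_text).itertext())
    return len(data), len(xml_text) - len(data)


data_chars, markup_chars = markup_overhead(RECORD)
print(data_chars, markup_chars)
```

In this fragment the tags take up more characters than the information they describe.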

60

Preservation

Some considerations ...

61

Thanks to all

Giovanni Michetti

giovanni.michetti@uniroma1.it
