Page 1: Unit no. 4  Mark-up

Unit no. 4 Mark-up

Adolf Knoll

National Library of the Czech Republic

[email protected]@nkp.cz

Page 2: Unit no. 4  Mark-up

Learning objectives

After the completion of this unit the learner will be able to:

Understand what to do with the digital output for further use

Understand the basics of the mark-up languages, especially XML

Have a basic orientation in their application, to be able to make correct decisions when building a digitization project

Page 3: Unit no. 4  Mark-up

Production of a digital document

[Diagram with the labels: Original document, Digitization, Description, Data, Metadata, Digital document]

Page 4: Unit no. 4  Mark-up

What do we produce?

Data
    direct product of digitization: digital images, full text, video & audio files
    usually a set of files that represent the original document

Metadata
    added value through textual information
    they express:
        identification with the original
        structure and links to data files
        technical information about data
        accessibility
        administrative matters
        etc.

Page 5: Unit no. 4  Mark-up

Mark-up

Created because of a need to store additional (hidden) information in text in order to:

better format it when displayed and/or printed = prescriptive mark-up

classify parts of it as objects relevant to various rules of description, such as cataloguing rules, rules for providing technical parameters, various good practices, rules for associating them with their visual representation, etc. = descriptive mark-up

Page 6: Unit no. 4  Mark-up

Mark-up

For example, in MS Word the paragraph is marked with a ¶: ¶Paragraph¶

In HTML code the paragraph is marked with <p>paragraph</p>

In HTML, bold text and a line break are marked as follows:
This is an HTML <b>document</b>, which consists of<br>elements.

All this is procedural (prescriptive) mark-up. Mind the use of angle brackets <>: a marked-up element starts with <start> and ends with </start>. The line break <br> is an exception: it is an empty element, so it has no content and no closing tag.
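To make the split between mark-up and text visible, the following minimal sketch feeds the HTML sentence from this slide to Python's standard `html.parser` module and records what the parser reports. The class name `TagLister` and the function `list_markup` are illustrative names chosen for this example, not part of the original unit.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Collects start tags, end tags and text as separate events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Ignore whitespace-only text between tags.
        if data.strip():
            self.events.append(("text", data.strip()))


def list_markup(html):
    """Return the sequence of mark-up and text events found in html."""
    parser = TagLister()
    parser.feed(html)
    parser.close()
    return parser.events


events = list_markup(
    "<p>This is an HTML <b>document</b>, which consists of<br>elements.</p>"
)
for event in events:
    print(event)
```

The printed list shows paired start and end events for `p` and `b`, and only a start event for `br`, since the line break has no closing tag.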

Page 7: Unit no. 4  Mark-up

Objects

The markup marks: OBJECTS

Which objects? THOSE WHICH WE DEFINE AS OBJECTS

On which basis do we define them? On the basis of CERTAIN RULES

How are the rules established? On the basis of an agreement; they are usually a written (even published) document specifying the objects that should be followed and described. Examples: AACR2 Cataloguing Rules in libraries, ISBD rules, CDWA or AMICO description rules for museum objects, Data Dictionary for Still Digital Images, etc.

The description rules do not define how the objects are marked up; this is done via a formal mark-up language

The most sophisticated mark-up approach is SGML

Page 8: Unit no. 4  Mark-up

General markup language

SGML

Standard Generalized Markup Language (ISO standard from 1986) is the base for other derived approaches that may be called mark-up languages of the 2nd generation:

HTML (prescriptive)
TEI
…
XML (descriptive)

The markup language marks the object without assigning any kind of behaviour to it.

Its behaviour is prescribed by an independent rule.

Page 9: Unit no. 4  Mark-up

How does it work?

the main construction unit of an SGML-based mark-up approach is called ELEMENT

each element must be defined by an external content descriptive rule; e.g. a cataloguing rule (AACR2 or another one) defines the element Title; it may also define sub-elements such as Main Title, Parallel Title, or Sub-Title, etc.

as a result, there may be hierarchical relationships between elements (parents with children)
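The parent and child relationship can be sketched with Python's standard `xml.etree.ElementTree` module. The function `build_title` and the sample title strings are hypothetical and serve only to illustrate the hierarchy; the tag names follow the Title example above, written without spaces.

```python
import xml.etree.ElementTree as ET


def build_title(main_title, sub_title=None, parallel_titles=()):
    """Build a Title element (parent) with its sub-elements (children)."""
    title = ET.Element("Title")
    ET.SubElement(title, "MainTitle").text = main_title
    if sub_title is not None:
        ET.SubElement(title, "SubTitle").text = sub_title
    for parallel in parallel_titles:
        ET.SubElement(title, "ParallelTitle").text = parallel
    return title


# Placeholder values, not taken from any catalogue record.
title = build_title("Main title text", parallel_titles=["Parallel title text"])

# Serialize the small tree to see the nesting of children inside the parent.
print(ET.tostring(title, encoding="unicode"))
```

Iterating over `title` yields its children in document order, which is how a program would walk from a parent element down to its sub-elements.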

Page 10: Unit no. 4  Mark-up

How to define the metadata standard?

We need formal rules to express the content descriptive standards

In the SGML environment, this is done in the Document Type Definition (DTD)

A DTD can, among other things, do the following:
    List all the elements and set up their properties (mandatory, non-mandatory, repeatable etc.)
    Define relations between elements
    Refine their attributes, e.g. through a list of permitted values
    Point from them to external entities, i.e. other definitions or binary data, e.g. digital images

Page 11: Unit no. 4  Mark-up

If, for example, we need a description element author, then:

Content definition of the element author is given by description rules (e.g., AACR2)

Formal definition of the element author is given by rules for formal definition (e.g., DTD)

Formal rule for display of the element author is given by rules of transformation for display (e.g., XSLT for XML)

In this way, we work in XML

Page 12: Unit no. 4  Mark-up

XML: eXtensible Markup Language

XML file (*.xml)
    It contains the reference to the DTD (*.dtd) that controls it
    It can contain the reference to the transformation rule that formats it for display, e.g. an XSLT file (*.xslt)

A DTD for XML is still written in SGML syntax rather than in XML itself; therefore, the W3C XML Schema has been introduced to replace it. As a result, a document can be controlled either by a DTD (*.dtd) or by a Schema (*.xsd).
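Both references live in the prolog of the XML file, before the root element. As a rough sketch, the standard `xml.dom.minidom` module can read them back; the function name `control_files` is an assumption made for this example, and the file names are the ones used in the postcard example of this unit. The parser only reports the file names, it does not load or apply the DTD or the stylesheet.

```python
import re
from xml.dom import Node, minidom

SAMPLE = (
    '<!DOCTYPE PostcardDescription SYSTEM "Postcard.dtd">'
    '<?xml-stylesheet type="text/xsl" href="Postcard.xslt"?>'
    "<PostcardDescription/>"
)


def control_files(xml_text):
    """Return the DTD and stylesheet file names referenced in the prolog."""
    doc = minidom.parseString(xml_text)
    refs = {"dtd": None, "stylesheet": None}
    if doc.doctype is not None:
        # The system identifier of the DOCTYPE declaration names the DTD.
        refs["dtd"] = doc.doctype.systemId
    for node in doc.childNodes:
        is_pi = node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
        if is_pi and node.target == "xml-stylesheet":
            # The processing instruction carries href as a pseudo-attribute.
            match = re.search(r'href="([^"]*)"', node.data)
            if match:
                refs["stylesheet"] = match.group(1)
    return refs


refs = control_files(SAMPLE)
print(refs)
```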

Page 13: Unit no. 4  Mark-up

DTD = Document Type Definition

The basic construction piece is ELEMENT

ELEMENT can have a content or it can be EMPTY

ELEMENTS can consist of other elements

Page 14: Unit no. 4  Mark-up

Here the element Title consists of a group of three elements (MainTitle, SubTitle, and ParallelTitle). Only MainTitle is mandatory; SubTitle and ParallelTitle are optional, and ParallelTitle can also be repeated.

In a DTD it is written like this:

<!ELEMENT Title (MainTitle, SubTitle?, ParallelTitle*)>
<!ELEMENT MainTitle (#PCDATA)>
<!ELEMENT SubTitle (#PCDATA)>
<!ELEMENT ParallelTitle (#PCDATA)>

Page 15: Unit no. 4  Mark-up

The element PageRepresentation makes it possible to link a concrete page with the image or full text that represents it.

<!ELEMENT MonographPage (PageNumber+, Notes?, PageRepresentation+)>
<!ATTLIST MonographPage
    Type (Advertisement | BackCover | BackEndSheet | Blank | FlyLeaf | FrontCover | FrontEndSheet | Index | ListOfIllustrations | ListOfMaps | ListOfTables | NormalPage | Spine | Table | TableOfContents | TitlePage) "NormalPage">
<!ELEMENT PageNumber (#PCDATA)>
<!ELEMENT PageRepresentation ((PageImage | PageText), TechnicalDescription?)>
<!ELEMENT PageImage EMPTY>
<!ATTLIST PageImage
    href CDATA #REQUIRED
>
<!ELEMENT PageText EMPTY>
<!ATTLIST PageText
    href CDATA #REQUIRED
>

Note: we can also set up a list of attributes; here these are Type of the MonographPage and href, i.e. a reference to an external data entity.

Page 16: Unit no. 4  Mark-up

<!ELEMENT MonographPage (PageNumber+, Notes?, PageRepresentation+)>
<!ATTLIST MonographPage
    Type (Advertisement | BackCover | BackEndSheet | Blank | FlyLeaf | FrontCover | FrontEndSheet | Index | ListOfIllustrations | ListOfMaps | ListOfTables | NormalPage | Spine | Table | TableOfContents | TitlePage) "NormalPage">

The above part of a DTD means this: the element MonographPage consists of the elements PageNumber, Notes and PageRepresentation.

We classify the MonographPage according to its content into Types such as Advertisement, BackCover, …, TableOfContents, and TitlePage. We have set the default value to NormalPage, because we expect this to be the most frequent choice.

The meaning of the qualifying signs is as follows:

Element (no sign) = the element is mandatory and occurs exactly once
Element+ (the sign +) = the element is mandatory and can be repeated (one or more occurrences)
Element? (the sign ?) = the element is not mandatory and can occur at most once
Element* (the sign *) = the element is not mandatory and can be repeated (zero or more occurrences)
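The qualifying signs behave like the quantifiers of regular expressions, which gives a quick way to try them out. The sketch below translates a flat, comma-separated content model into a regular expression and tests a sequence of child element names against it. It is a teaching aid under stated assumptions, not a DTD validator: it ignores nested groups and the | choice, and the function names are invented for this example.

```python
import re


def model_to_regex(model):
    """Translate a flat content model such as '(A, B?, C*)' into a regex."""
    parts = []
    for item in model.strip("() ").split(","):
        item = item.strip()
        # A trailing +, ? or * is the occurrence indicator; none means exactly once.
        indicator = item[-1] if item[-1] in "+?*" else ""
        name = item.rstrip("+?*")
        # Each child name is followed by a space that acts as a separator.
        parts.append("(?:%s )%s" % (re.escape(name), indicator))
    return re.compile("".join(parts))


def children_match(model, child_names):
    """Check whether the child names, in order, satisfy the content model."""
    sequence = "".join(name + " " for name in child_names)
    return model_to_regex(model).fullmatch(sequence) is not None


PAGE_MODEL = "(PageNumber+, Notes?, PageRepresentation+)"
print(children_match(PAGE_MODEL, ["PageNumber", "PageRepresentation"]))
print(children_match(PAGE_MODEL, ["PageNumber", "Notes"]))
```

The first call satisfies the model; the second does not, because PageRepresentation carries the + sign and must occur at least once.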

Page 17: Unit no. 4  Mark-up

<!ELEMENT PageNumber (#PCDATA)>
<!ELEMENT PageRepresentation ((PageImage | PageText), TechnicalDescription?)>
<!ELEMENT PageImage EMPTY>
<!ATTLIST PageImage
    href CDATA #REQUIRED
>
<!ELEMENT PageText EMPTY>
<!ATTLIST PageText
    href CDATA #REQUIRED
>

Each element that does not consist of any further elements must be defined, too. The expression (#PCDATA), i.e. parsed character data, announces that in the XML files written on the basis of this DTD the element contains a text string that the parser analyzes; here, for example, a page number: <PageNumber>221</PageNumber>

The sign | in (PageImage | PageText) indicates that exactly one of the two elements is used in a concrete PageRepresentation. Under this DTD, a page that is represented both by an image and by text therefore gets a separate PageRepresentation for each of them.

The ATTLIST (list of attributes) declares a required href attribute whose value is a plain character string (CDATA) that the parser does not analyze further; it serves as a reference/navigation link to external data. The elements PageImage and PageText are empty, as they serve only to link the page to the image or full-text files.

<PageRepresentation>
    <PageImage href="http://digit.nkp.cz/Data/Image7.jpg"/>
</PageRepresentation>
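A program that consumes such records typically reads the href and relies on the "exactly one of PageImage or PageText" rule. The following sketch uses the standard `xml.etree.ElementTree` module; because that parser does not validate against a DTD, the rule is checked by hand. The function name `representation_link` is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

FRAGMENT = (
    "<PageRepresentation>"
    '<PageImage href="http://digit.nkp.cz/Data/Image7.jpg"/>'
    "</PageRepresentation>"
)


def representation_link(fragment):
    """Return (element name, href) for the single link in a PageRepresentation."""
    representation = ET.fromstring(fragment)
    links = [c for c in representation if c.tag in ("PageImage", "PageText")]
    # The content model (PageImage | PageText) allows exactly one of the two.
    if len(links) != 1:
        raise ValueError("expected exactly one PageImage or PageText")
    link = links[0]
    return link.tag, link.get("href")


kind, href = representation_link(FRAGMENT)
print(kind, href)
```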

Page 18: Unit no. 4  Mark-up

<MonographPage Type="FlyLeaf">
    <PageNumber>2</PageNumber>
    <Notes>List of publications of U. Eco at Bompiani</Notes>
    <PageRepresentation>
        <PageImage href="Data/Image4.gif"/>
    </PageRepresentation>
</MonographPage>

This is a concrete section from an XML file, where we can see that the reference is made to an image in GIF format located in the Data subdirectory. We can also see that it is page no. 2, of the Type FlyLeaf.
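The same section can be read by a program. The sketch below uses the standard `xml.etree.ElementTree` module, which does not read the DTD; the default value NormalPage for a missing Type attribute is therefore supplied in the code, as an assumption mirroring the ATTLIST shown earlier. The function name `describe_page` is invented for this example.

```python
import xml.etree.ElementTree as ET

DEFAULT_PAGE_TYPE = "NormalPage"  # mirrors the default declared in the ATTLIST

PAGE = """<MonographPage Type="FlyLeaf">
    <PageNumber>2</PageNumber>
    <Notes>List of publications of U. Eco at Bompiani</Notes>
    <PageRepresentation>
        <PageImage href="Data/Image4.gif"/>
    </PageRepresentation>
</MonographPage>"""


def describe_page(page_xml):
    """Collect the type, page numbers, notes and linked files of one page."""
    page = ET.fromstring(page_xml)
    files = [
        child.get("href")
        for representation in page.findall("PageRepresentation")
        for child in representation
        if child.tag in ("PageImage", "PageText")
    ]
    return {
        "type": page.get("Type", DEFAULT_PAGE_TYPE),
        "numbers": [number.text for number in page.findall("PageNumber")],
        "notes": page.findtext("Notes"),
        "files": files,
    }


summary = describe_page(PAGE)
print(summary)
```

A DTD-aware validating parser would fill in the default Type itself; with a non-validating parser the application has to do it, as here.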

For a better understanding, we will now carry out a simple project whose aim is to write a DTD for a document we may need when digitizing old postcards.

The steps are: analysis of the document, establishment of the needed elements and their relationships, setup of the element linking to digitized images, writing the DTD, writing an XML file based on the DTD, and its display.

The aim is to show how it is done, not to teach everything, as that requires a more thorough XML training course.

Page 19: Unit no. 4  Mark-up

How to write a simple DTD?

1. Analyze well the object you wish to describe and represent

2. Try to establish the necessary elements for description and their basic properties (mandatory yes/no, repeatable yes/no)

3. Try to define whether these elements will consist of other elements

4. Establish from which elements the image files will be referenced

Page 20: Unit no. 4  Mark-up

Digitized postcard

Root element: PostcardDescription

Elements of the 2nd level:
    author (consists of surname and name elements)
    title
    theme
    publisher (consists of PlaceOfPublication, NameOfPublisher, DateOfPublication)
    PhysicalDescription (consists of Size and Technique elements)
    TypeOfDocument
    VisualRepresentation (consists of ImageOfRectoPart and ImageOfVersoPart elements)
    language
    annotation

The necessary elements and hierarchies for a DTD of a Digitized Postcard
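Once the hierarchy is written down, the skeleton of the element declarations follows almost mechanically. The sketch below generates `<!ELEMENT ...>` lines from a small Python dictionary covering part of the list above. The function name, the `link_elements` parameter and the rule "elements without children hold text" are assumptions of this example; occurrence indicators (+, ?, *) and attribute lists still have to be added by hand.

```python
def element_declarations(hierarchy, link_elements=()):
    """Produce one <!ELEMENT ...> declaration per entry of the hierarchy."""
    lines = []
    for name, children in hierarchy.items():
        if name in link_elements:
            # Elements that only carry a link to a file have no content.
            content = "EMPTY"
        elif children:
            content = "(" + ", ".join(children) + ")"
        else:
            content = "(#PCDATA)"
        lines.append("<!ELEMENT %s %s>" % (name, content))
    return lines


# Part of the postcard hierarchy; an empty list means "no child elements".
hierarchy = {
    "PhysicalDescription": ["Size", "Technique"],
    "Size": [],
    "Technique": [],
    "VisualRepresentation": ["ImageOfRectoPart", "ImageOfVersoPart"],
    "ImageOfRectoPart": [],
    "ImageOfVersoPart": [],
}

declarations = element_declarations(
    hierarchy, link_elements=("ImageOfRectoPart", "ImageOfVersoPart")
)
print("\n".join(declarations))
```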

Page 21: Unit no. 4  Mark-up

They can be represented by this graph (the graph itself is not reproduced in this transcript).

Page 22: Unit no. 4  Mark-up

<?xml version="1.0" encoding="UTF-8"?>
<!-- edited with XMLSPY v5 rel. 3 U (http://www.xmlspy.com) by Adolf Knoll (National Library) -->
<!ELEMENT PostcardDescription (author*, title, theme+, publisher+, PhysicalDescription, TypeOfDocument, VisualRepresentation?, language, annotation)>
<!ELEMENT author (surname, name*)>
<!--If the author has a name that cannot be split into parts, this name is always written in the field marked as surname.-->
<!ELEMENT surname (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!--The title must be always entered; if missing, an artificial title will be created.-->
<!ELEMENT title (#PCDATA)>
<!ELEMENT theme (#PCDATA)>
<!ELEMENT publisher (PlaceOfPublication?, NameOfPublisher?, DateOfPublication)>
<!ELEMENT PlaceOfPublication (#PCDATA)>
<!ELEMENT NameOfPublisher (#PCDATA)>
<!ELEMENT DateOfPublication (#PCDATA)>
<!ELEMENT PhysicalDescription (Size, Technique)>
<!ELEMENT Size (#PCDATA)>
<!ELEMENT Technique (#PCDATA)>
<!ELEMENT TypeOfDocument (#PCDATA)>
<!--Here will be links to computer graphic files representing the postcard.-->
<!ELEMENT VisualRepresentation (ImageOfRectoPart*, ImageOfVersoPart*)>
<!ELEMENT ImageOfRectoPart EMPTY>
<!ATTLIST ImageOfRectoPart
    quality (preview | normal | excellent) #REQUIRED
    href CDATA #REQUIRED>
<!ELEMENT ImageOfVersoPart EMPTY>
<!ATTLIST ImageOfVersoPart
    quality (preview | normal | excellent) #REQUIRED
    href CDATA #REQUIRED>
<!ELEMENT language (#PCDATA)>
<!ELEMENT annotation (#PCDATA)>

Postcard.dtd

Page 23: Unit no. 4  Mark-up

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE PostcardDescription SYSTEM "Postcard.dtd">
<?xml-stylesheet type="text/xsl" href="Postcard.xslt"?>
<PostcardDescription>
    <author>
        <surname>Lyer</surname>
        <name>Antonín</name>
    </author>
    <title>Hronov</title>
    <theme>views of streets</theme>
    <theme>Nádražní ulice</theme>
    <theme>Dvorská ulice</theme>
    <theme>Jiráskova ulice</theme>
    <theme>Náměstí</theme>
    <publisher>
        <PlaceOfPublication>Hronov</PlaceOfPublication>
        <NameOfPublisher>Karel Šefelín</NameOfPublisher>
        <DateOfPublication>[1910]</DateOfPublication>
    </publisher>
    <PhysicalDescription>
        <Size>9x13 cm</Size>
        <Technique>colour printing</Technique>
    </PhysicalDescription>
    <TypeOfDocument>postcard</TypeOfDocument>
    <VisualRepresentation>
        <ImageOfRectoPart quality="normal" href="vzorky/pohled-b.jpg"/>
        <ImageOfRectoPart quality="excellent" href="vzorky/pohled-b.png"/>
        <ImageOfVersoPart quality="excellent" href="vzorky/pohled-b-2.png"/>
    </VisualRepresentation>
    <language>cz</language>
    <annotation>The postcard was sent by my great grand-mother to her husband, who was in military service in first years of the World War I.</annotation>
</PostcardDescription>

Postcard.xml

Callouts on the slide: the xml-stylesheet line is the reference to a formatting stylesheet; the href attributes of the ImageOfRectoPart and ImageOfVersoPart elements are the references to image files.
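An application such as a digital library viewer could use the quality attribute to pick which image file to show. The sketch below works on an abbreviated copy of the postcard record (only the image links are kept, so it would not validate against Postcard.dtd) and returns the best available image for each side. Treating the enumerated values preview, normal, excellent as a ranking is an assumption of this example, as is the function name `best_image`.

```python
import xml.etree.ElementTree as ET

# Assumed ranking, lowest to highest, of the values allowed by the DTD.
QUALITY_ORDER = ["preview", "normal", "excellent"]

POSTCARD = """<PostcardDescription>
    <VisualRepresentation>
        <ImageOfRectoPart quality="normal" href="vzorky/pohled-b.jpg"/>
        <ImageOfRectoPart quality="excellent" href="vzorky/pohled-b.png"/>
        <ImageOfVersoPart quality="excellent" href="vzorky/pohled-b-2.png"/>
    </VisualRepresentation>
</PostcardDescription>"""


def best_image(xml_text, side):
    """Return the href of the highest-quality image for the given side."""
    root = ET.fromstring(xml_text)
    candidates = root.findall("./VisualRepresentation/" + side)
    if not candidates:
        return None
    best = max(candidates, key=lambda image: QUALITY_ORDER.index(image.get("quality")))
    return best.get("href")


recto = best_image(POSTCARD, "ImageOfRectoPart")
verso = best_image(POSTCARD, "ImageOfVersoPart")
print(recto, verso)
```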

Page 24: Unit no. 4  Mark-up

How does it work in a web browser?

When we click on the xml file:
    The browser will look for the formatting file (the stylesheet, i.e. the *.xslt file) and will call it
    It will display the file following the prescribed rules
    We can click on the links leading to images that represent the postcard visually and we will be navigated to them

So, let's try it and click on the file Postcard.xml
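The stylesheet Postcard.xslt itself is not shown in this unit, and Python's standard library has no XSLT processor. As a stand-in, the sketch below does by hand the kind of work a display stylesheet performs: it turns the title into a heading and every image reference into a clickable link. The output layout, the function name `render_postcard` and the abbreviated sample record are assumptions of this example, not a reconstruction of the real stylesheet.

```python
import xml.etree.ElementTree as ET
from html import escape

# Abbreviated record: only the parts used for display are kept.
POSTCARD = """<PostcardDescription>
    <title>Hronov</title>
    <VisualRepresentation>
        <ImageOfRectoPart quality="normal" href="vzorky/pohled-b.jpg"/>
        <ImageOfVersoPart quality="excellent" href="vzorky/pohled-b-2.png"/>
    </VisualRepresentation>
</PostcardDescription>"""


def render_postcard(xml_text):
    """Render the title and image links of a postcard record as HTML."""
    root = ET.fromstring(xml_text)
    lines = ["<h1>%s</h1>" % escape(root.findtext("title", default=""))]
    visual = root.find("VisualRepresentation")
    for image in (visual if visual is not None else []):
        href = escape(image.get("href", ""), quote=True)
        label = escape("%s (%s)" % (image.tag, image.get("quality", "")))
        lines.append('<a href="%s">%s</a>' % (href, label))
    return "\n".join(lines)


page = render_postcard(POSTCARD)
print(page)
```

The description (the XML record) stays untouched; only the display rule decides how it looks, which is the separation the unit describes.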

Page 25: Unit no. 4  Mark-up

XML Conclusions

The language makes it possible to define and control any type of description

It can relate them to external data

It makes the structure of the digitized documents clear and readable for the long term

It ensures that the output of our work (production of XML files and digitized documents) corresponds with what we defined we wished to do

It means that, for example, our Digital Library can be fed with correct and standardized documents that enable, among other things, their long-term digital preservation

Page 26: Unit no. 4  Mark-up

Work with XML

From the user perspective, a good digitization project develops XML editors that:
    make the work easy (filling forms)
    check the validity against the applied DTD
    output only correct XML structures

If you wish to test your skills, download the free M-TOOL from the Manuscriptorium Digital Library free tools at http://manuscriptorium.com/Site/ENG/mtool_eng.asp and try to work with it
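The most basic of the editor checks listed above can be sketched with the standard library alone. Note the limit: `xml.etree.ElementTree` only tells whether a document is well-formed (tags properly nested and closed); checking validity against a DTD needs a validating parser, which the standard library does not provide. The function name `is_well_formed` is an assumption of this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


good = "<author><surname>Lyer</surname><name>Antonín</name></author>"
bad = "<author><surname>Lyer</name></author>"  # mismatched closing tag

print(is_well_formed(good))
print(is_well_formed(bad))
```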

Page 27: Unit no. 4  Mark-up

Where to find more?

General
    http://www.w3.org/XML/ (XML Home)
    http://www.xml.com/pub/a/98/10/guide0.html (Technical Introduction to XML)
    http://www.altova.com/ (XMLSpy editor)

Applied
    http://digit.nkp.cz/techstandards.html (several DTDs implemented in functioning digital libraries)
    http://www.loc.gov/standards/mets/ (METS format for containerization of XML-based digital documents)
    http://www.tei-c.org/ (TEI: Text Encoding Initiative)