Lecture 14: Metadata and Markup

Page 1: Lecture 14: Metadata and Markup

2003.10.09 - SLIDE 1 - IS 202 – FALL 2003

Prof. Ray Larson & Prof. Marc Davis

UC Berkeley SIMS

Tuesday and Thursday 10:30 am - 12:00 pm

Fall 2003

http://www.sims.berkeley.edu/academics/courses/is202/f03/

SIMS 202: Information Organization and Retrieval

Lecture 14: Metadata and Markup

Page 2: Lecture 14: Metadata and Markup

Lecture Overview

• Review
  – XML and Document Engineering

• Metadata and Markup
  – XML as a Metadata Lingua Franca
    • METS
  – SGML vs. XML DTD Construction
  – XML Schemas
  – XML for Protocols and Metadata Languages

• Readings/Discussion

Page 5: Lecture 14: Metadata and Markup

XML as a common syntax

• XML (and SGML) provide a way of expressing the structure of documents that can be verified and validated by document processing systems

• “Documents” can be metadata structures
  – Such as the description of a particular photograph in our Phone project

• XML thus provides a way of representing metadata descriptions as well as the content that they describe

Page 6: Lecture 14: Metadata and Markup

XML as a common syntax

• All XML documents follow some simple rules that make them interchangeable and usable across different systems
  – All data and markup is in Unicode
  – All elements are marked by begin and end tags (or a single empty-element tag)
  – All markup is case-sensitive
  – XML DTDs and/or Schemas define the valid structure (and sometimes content) of the documents

Page 7: Lecture 14: Metadata and Markup

Example – METS

• METS, the Metadata Encoding and Transmission Standard, is a new Schema intended to provide:
  – “a standard for encoding descriptive, administrative, and structural metadata regarding objects within a digital library, expressed using the XML schema language of the World Wide Web Consortium”

• METS can be used to “wrap” complex sets of data (the actual data, with rules for encoding binary forms), the metadata describing the parts of that data, and the sequence and conditions under which the data can or should be presented or displayed
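
To make the "wrapper" idea concrete, here is a minimal sketch of what a METS document might look like. It is an illustration, not taken from the lecture: the IDs, the example.org URL, and the choice of Dublin Core for the descriptive section are assumptions, and the administrative metadata section (amdSec) is left out for brevity.

```xml
<?xml version="1.0"?>
<mets xmlns="http://www.loc.gov/METS/"
      xmlns:xlink="http://www.w3.org/1999/xlink"
      xmlns:dc="http://purl.org/dc/elements/1.1/">

  <!-- Descriptive metadata, wrapped inline (here as Dublin Core) -->
  <dmdSec ID="DMD1">
    <mdWrap MDTYPE="DC">
      <xmlData>
        <dc:title>Water Conditions in California Report 2</dc:title>
      </xmlData>
    </mdWrap>
  </dmdSec>

  <!-- The files that make up the object (hypothetical URL) -->
  <fileSec>
    <fileGrp>
      <file ID="FILE1" MIMETYPE="text/html">
        <FLocat LOCTYPE="URL" xlink:href="http://example.org/documents/6/hyperocr.html"/>
      </file>
    </fileGrp>
  </fileSec>

  <!-- Structural map: how the parts are organized for presentation -->
  <structMap>
    <div TYPE="bulletin" DMDID="DMD1">
      <fptr FILEID="FILE1"/>
    </div>
  </structMap>
</mets>
```

The structural map ties the pieces together: the div points at its descriptive metadata through DMDID and at its content file through fptr.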

Page 8: Lecture 14: Metadata and Markup

Lecture Overview

• Review
  – XML and Document Engineering

• Metadata and Markup
  – XML as a Metadata Lingua Franca
    • METS
  – SGML vs. XML DTD Construction
  – XML Schemas
  – XML for Protocols and Metadata Languages

• Readings/Discussion

Page 9: Lecture 14: Metadata and Markup

SGML/XML Structure

• An SGML document consists of three parts:
  – The SGML Declaration
  – The Document Type Definition (DTD)
  – The Document Instance

• An XML document REQUIRES only the document instance, but for effective processing a DTD is very important

• XML Schema (later) provides an alternative to DTDs for XML applications

Page 10: Lecture 14: Metadata and Markup

Document Type Definitions

• The DTD describes the structural elements and "shorthand" markup for a particular document type and defines:
  – Names of "legal" elements
  – How many times elements can appear
  – The order of elements in a document
  – Whether markup can be omitted (SGML only)
  – Contents of elements (i.e., nested structures)
  – Attributes associated with elements
  – Names of "entities"
  – Short-hand conventions for element tags (SGML only)

Page 11: Lecture 14: Metadata and Markup

DTD Components

• The major components of a DTD are:
  – Entity Declarations
  – Element Declarations
  – Attribute Declarations

Page 12: Lecture 14: Metadata and Markup

Document Type Definitions

• Entity Declarations are a "macro" definition facility for both DTD and Document instance parts
  – General Internal Entity Definitions
    <!ENTITY name "substitute string">
    referenced by &name;
  – General External Entity Definitions
    <!ENTITY name SYSTEM "file path">
    referenced by &name;
  – Parameter Entity Definitions (used only inside DTDs)
    <!ENTITY % name "substitute string">
    or
    <!ENTITY % name SYSTEM "file path">
    referenced by %name; or %name (XML always requires the closing semicolon)
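
As a small illustration of the entity declarations above, the following XML document declares one general internal entity and one parameter entity in its internal DTD subset. The entity names (sims, qdecl) and the memo text are made up for this sketch. Note one XML-specific restriction: inside an internal subset a parameter entity reference may stand in for whole declarations, as it does here, but not for part of a declaration.

```xml
<?xml version="1.0"?>
<!DOCTYPE memo [
  <!-- General internal entity: a text macro for the document instance -->
  <!ENTITY sims "School of Information Management and Systems">

  <!-- Parameter entity: a macro usable only inside the DTD.
       Here it expands to one complete element declaration. -->
  <!ENTITY % qdecl "<!ELEMENT q (#PCDATA)>">
  %qdecl;

  <!ELEMENT memo (p+)>
  <!ELEMENT p (#PCDATA | q)*>
]>
<memo>
  <p>Posted by the &sims; office: <q>room change on Thursday</q>.</p>
</memo>
```

A parser that reads the DTD replaces the general entity reference in the paragraph with the declared string before the application sees the text.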

Page 13: Lecture 14: Metadata and Markup

Document Type Definitions

• SGML Element Declarations define the structural elements of a document and its associated markup
<!ELEMENT name - - content_model or declared_content +(include_list) -(exclude_list) >
  – Omitted tag minimization indicates whether start-tags or end-tags can be omitted in the markup; the (O) or (-) indicators are required in SGML but can NOT be used in XML

Page 14: Lecture 14: Metadata and Markup

Document Type Definitions

• Content model provides a nested structural description of the elements that make up this element, e.g.:
<!ELEMENT memo - - ((to & from), body, close?)>
<!ELEMENT body - O (p)* >
<!ELEMENT p - O (#PCDATA | q)*>
<!ELEMENT q - - (#PCDATA)>
...
  – ANY (in SGML) may be used to indicate a content model of any elements in the DTD, in any order

Page 15: Lecture 14: Metadata and Markup

Document Type Definitions

• Same content model in XML
<?xml version="1.0"?>
<!DOCTYPE memo [
<!ELEMENT memo ((to | from)+, body, close?)>
<!ELEMENT body (p)* >
<!ELEMENT p (#PCDATA | q)* >
<!ELEMENT q (#PCDATA)>
…
]>
  – Note the XML declaration (<?xml version="1.0"?>) that opens the prolog
  – Note that the & connector used on the previous slide is not legal in XML; (to | from)+ is a looser substitute that accepts to and from in either order, but also accepts repeats or only one of the two
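
For reference, a complete document using this DTD could look like the sketch below. The slide elides the declarations for to, from, and close, so the (#PCDATA) content models given for them here are assumptions, as is the memo text.

```xml
<?xml version="1.0"?>
<!DOCTYPE memo [
  <!ELEMENT memo ((to | from)+, body, close?)>
  <!-- Assumed declarations: the slide does not show these three -->
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT from (#PCDATA)>
  <!ELEMENT close (#PCDATA)>
  <!ELEMENT body (p)*>
  <!ELEMENT p (#PCDATA | q)*>
  <!ELEMENT q (#PCDATA)>
]>
<memo>
  <to>IS 202 students</to>
  <from>Teaching staff</from>
  <body>
    <p>One reading describes metadata as <q>data about data</q>.</p>
  </body>
  <close>See you Thursday.</close>
</memo>
```

Unlike the SGML version, every start-tag here needs its end-tag, since XML has no tag omission.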

Page 16: Lecture 14: Metadata and Markup

Document Type Definitions

• Declared content can be: PCDATA, CDATA, RCDATA, EMPTY

• Inclusion and Exclusion lists can be used to indicate elements that can occur or are forbidden to occur in any sub-elements of the content model (NOT in XML), e.g.:
<!ELEMENT memo - - ((to & from), body, close?) +(fn)>
  – Says that element fn can appear anyplace in the memo

Page 17: Lecture 14: Metadata and Markup

Document Type Definitions

• Attribute Declarations define attributes associated with (potentially) each element of a document and provide the acceptable values for those attributes

Page 18: Lecture 14: Metadata and Markup

Attributes Example

• <!ATTLIST associate_element attribute_name declared_value default_value >

• <!ATTLIST memo status (PUBLIC | CONFIDENTIAL) PUBLIC>
  – In markup of a document:
    <memo status="CONFIDENTIAL">
    also, because of the default set:
    <memo>
    would be the same as
    <memo status="PUBLIC">

There are a variety of special defaults and data types that can be given in attribute definitions
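
In XML the same declaration needs the default value in quotes. The following is a minimal sketch; giving memo plain text content is an assumption made only to keep the example short.

```xml
<?xml version="1.0"?>
<!DOCTYPE memo [
  <!ELEMENT memo (#PCDATA)>
  <!-- Enumerated attribute with a quoted default value -->
  <!ATTLIST memo status (PUBLIC | CONFIDENTIAL) "PUBLIC">
]>
<memo>No status attribute is written here, so a parser that reads the DTD supplies status="PUBLIC".</memo>
```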

Page 19: Lecture 14: Metadata and Markup

Sample SGML DTD

<!doctype ELIB-TEXTS [

<!-- This is a DTD for bibliographic records extracted from the elib/rfc1357 simple bibliographic format. -->

<!ELEMENT ELIB-TEXTS o o (ELIB-BIB*)>

<!-- We allow most elements to occur any number of times in any order -->
<!-- this is because there is little consistency in the actual usage. -->
<!ELEMENT ELIB-BIB - - (BIB-VERSION, ID, ENTRY?, DATE?, TITLE*, ORGANIZATION*,
  (SERIES | TYPE | REVISION | REVISION-DATE |
   AUTHOR-PERSONAL | AUTHOR-INSTITUTIONAL | AUTHOR-CONTRIBUTING-PERSONAL |
   AUTHOR-CONTRIBUTING-PERSONAL | AUTHOR-CONTRIBUTING-INSTITUTIONAL |
   CONTACTAUTHOR | PROJECT | PAGES | BIOREGION | CERES-BIOREGION | TEXTSOUP |
   LOCATION | ULTIMATE-CLIENT | URL | KEYWORDS | NOTES | ABSTRACT)*,
  (TEXT-REF | PAGED-REF)* )>

<!-- We won't make any assumptions about content... all PCDATA -->

<!ELEMENT ID - o (#PCDATA)>
<!ELEMENT ABSTRACT - o (#PCDATA)>
<!ELEMENT AUTHOR-CONTRIBUTING-INSTITUTIONAL - o (#PCDATA)>
<!ELEMENT AUTHOR-CONTRIBUTING-PERSONAL - o (#PCDATA)>
<!ELEMENT AUTHOR-PERSONAL-CONTRIBUTING - o (#PCDATA)>
… etc…
]>

Page 20: Lecture 14: Metadata and Markup

XML Version

<!DOCTYPE ELIB-TEXTS [

<!-- This is a DTD for bibliographic records extracted from the elib/rfc1357 simple bibliographic format. -->

<!ELEMENT ELIB-TEXTS (ELIB-BIB*)>

<!-- We allow most elements to occur any number of times in any order -->
<!-- this is because there is little consistency in the actual usage. -->
<!ELEMENT ELIB-BIB (BIB-VERSION, ID, ENTRY?, DATE?, TITLE*, ORGANIZATION*,
  (SERIES | TYPE | REVISION | REVISION-DATE |
   AUTHOR-PERSONAL | AUTHOR-INSTITUTIONAL | AUTHOR-CONTRIBUTING-PERSONAL |
   AUTHOR-CONTRIBUTING-PERSONAL | AUTHOR-CONTRIBUTING-INSTITUTIONAL |
   CONTACTAUTHOR | PROJECT | PAGES | BIOREGION | CERES-BIOREGION | TEXTSOUP |
   LOCATION | ULTIMATE-CLIENT | URL | KEYWORDS | NOTES | ABSTRACT)*,
  (TEXT-REF | PAGED-REF)* )>

<!-- We won't make any assumptions about content... all PCDATA -->

<!ELEMENT ID (#PCDATA)>
<!ELEMENT ABSTRACT (#PCDATA)>
<!ELEMENT AUTHOR-CONTRIBUTING-INSTITUTIONAL (#PCDATA)>
<!ELEMENT AUTHOR-CONTRIBUTING-PERSONAL (#PCDATA)>
<!ELEMENT AUTHOR-PERSONAL-CONTRIBUTING (#PCDATA)>
… etc…
]>

Page 21: Lecture 14: Metadata and Markup

Document Using That DTD

<ELIB-BIB>
<BIB-VERSION>ELIB-v1.0 </BIB-VERSION>
<ID>6</ID>
<ENTRY>February 13 1995</ENTRY>
<DATE>March 1, 1993</DATE>
<TITLE>Water Conditions in California Report 2</TITLE>
<ORGANIZATION>California Department of Water Resources</ORGANIZATION>
<SERIES>120-93</SERIES>
<TYPE>bulletin</TYPE>
<AUTHOR-INSTITUTIONAL>California Department of Water Resources </AUTHOR-INSTITUTIONAL>
<PAGES>17</PAGES>
<TEXT-REF>/elib/data/disk/disk5/documents/6/HYPEROCR/hyperocr.html </TEXT-REF>
<PAGED-REF>/elib/data/disk/disk5/documents/6/OCR-ASCII-NOZONE </PAGED-REF>
</ELIB-BIB>

Page 22: Lecture 14: Metadata and Markup

Dublin Core

• Review…

• Simple metadata for describing internet resources

• For “Document-Like Objects”

• 15 Elements

Page 23: Lecture 14: Metadata and Markup

Dublin Core Elements

• Title
• Creator
• Subject
• Description
• Publisher
• Other Contributors
• Date
• Resource Type
• Format
• Resource Identifier
• Source
• Language
• Relation
• Coverage
• Rights Management

Page 24: Lecture 14: Metadata and Markup

DC XML DTD Implementation

• There have been various versions

• This one is the one recommended (required) by the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)

• Uses XML Name Spaces

• Available at http://dublincore.org/documents/2001/09/20/dcmes-xml/

Page 25: Lecture 14: Metadata and Markup

DC Element and Attribute Definitions

<!-- The elements from DCMES 1.1 -->

<!-- The name given to the resource. --> <!ELEMENT dc:title (#PCDATA)> <!ATTLIST dc:title xml:lang CDATA #IMPLIED>

<!-- An entity primarily responsible for making the content of the resource. --> <!ELEMENT dc:creator (#PCDATA)> <!ATTLIST dc:creator xml:lang CDATA #IMPLIED>

<!-- The topic of the content of the resource. --> <!ELEMENT dc:subject (#PCDATA)> <!ATTLIST dc:subject xml:lang CDATA #IMPLIED>

<!-- An account of the content of the resource. --> <!ELEMENT dc:description (#PCDATA)> <!ATTLIST dc:description xml:lang CDATA #IMPLIED>

<!-- The entity responsible for making the resource available. --> <!ELEMENT dc:publisher (#PCDATA)> <!ATTLIST dc:publisher xml:lang CDATA #IMPLIED>

<!-- An entity responsible for making contributions to the content of the resource. --> <!ELEMENT dc:contributor (#PCDATA)> <!ATTLIST dc:contributor xml:lang CDATA #IMPLIED>

<!-- A date associated with an event in the life cycle of the resource. --> <!ELEMENT dc:date (#PCDATA)> <!ATTLIST dc:date xml:lang CDATA #IMPLIED>

Page 26: Lecture 14: Metadata and Markup

DC Element Definitions (cont.)

<!-- The nature or genre of the content of the resource. --> <!ELEMENT dc:type (#PCDATA)> <!ATTLIST dc:type xml:lang CDATA #IMPLIED>

<!-- The physical or digital manifestation of the resource. --> <!ELEMENT dc:format (#PCDATA)> <!ATTLIST dc:format xml:lang CDATA #IMPLIED>

<!-- An unambiguous reference to the resource within a given context. --> <!ELEMENT dc:identifier (#PCDATA)> <!ATTLIST dc:identifier xml:lang CDATA #IMPLIED> <!ATTLIST dc:identifier rdf:resource CDATA #IMPLIED>

<!-- A Reference to a resource from which the present resource is derived. --> <!ELEMENT dc:source (#PCDATA)> <!ATTLIST dc:source xml:lang CDATA #IMPLIED> <!ATTLIST dc:source rdf:resource CDATA #IMPLIED>

<!-- A language of the intellectual content of the resource. --> <!ELEMENT dc:language (#PCDATA)> <!ATTLIST dc:language xml:lang CDATA #IMPLIED>

<!-- A reference to a related resource. --> <!ELEMENT dc:relation (#PCDATA)> <!ATTLIST dc:relation xml:lang CDATA #IMPLIED> <!ATTLIST dc:relation rdf:resource CDATA #IMPLIED>

<!-- The extent or scope of the content of the resource. --> <!ELEMENT dc:coverage (#PCDATA)> <!ATTLIST dc:coverage xml:lang CDATA #IMPLIED>

<!-- Information about rights held in and over the resource. --> <!ELEMENT dc:rights (#PCDATA)> <!ATTLIST dc:rights xml:lang CDATA #IMPLIED>
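
A record built from these element declarations might look like the sketch below, which re-expresses part of the ELIB bulletin record shown earlier. The slides list only the dc: declarations, so the rdf:RDF and rdf:Description wrapper, the example.org identifier, and the mapping from ELIB fields to Dublin Core elements are all assumptions for illustration.

```xml
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- rdf:about names the resource being described (hypothetical URL) -->
  <rdf:Description rdf:about="http://example.org/documents/6">
    <dc:title xml:lang="en">Water Conditions in California Report 2</dc:title>
    <dc:creator>California Department of Water Resources</dc:creator>
    <dc:publisher>California Department of Water Resources</dc:publisher>
    <dc:date>1993-03-01</dc:date>
    <dc:type>bulletin</dc:type>
  </rdf:Description>
</rdf:RDF>
```

The dc: prefix is bound to the Dublin Core element namespace, which is what lets these elements be mixed into other vocabularies without name clashes.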

Page 27: Lecture 14: Metadata and Markup

A More Complex SGML DTD

<!DOCTYPE USMARC [
<!-- USMARC DTD. UCB-SLIS v.0.08 -->
<!-- By Jerome P. McDonough, April 1, 1994 -->
<!ELEMENT USMARC - - (Leader, Directry, VarFlds)>
<!ATTLIST USMARC Material (BK|AM|CF|MP|MU|VM|SE) "BK"
                 id CDATA #IMPLIED>
<!-- Author's Note: the id attribute for the USMARC element is intended to hold a unique record number for each MARC record in the local database. That is to say, it is intended ONLY as an aid in maintaining the local database of MARC records -->

<!ELEMENT Leader - O (LRL, RecStat, RecType, BibLevel, UCP, IndCount, SFCount, BaseAddr, EncLevel, DscCatFm, LinkRec, EntryMap)>
<!ELEMENT Directry - O (#PCDATA)>
<!ELEMENT VarFlds - O (VarCFlds, VarDFlds)>

<!-- Component parts of Leader -->
<!-- Logical Record Length -->
<!ELEMENT LRL - O (#PCDATA)>
…etc…

Page 28: Lecture 14: Metadata and Markup

More Complex DTD (cont.)

<!-- Variable Data Fields -->
<!ELEMENT VarDFlds - O (NumbCode, MainEnty?, Titles, EdImprnt?, PhysDesc?, Series?, Notes?, SubjAccs?, AddEnty?, LinkEnty?, SAddEnty?, HoldAltG?, Fld9XX?)>

<!-- Component Parts of Variable Data Fields -->
<!-- Numbers & Codes -->
<!ELEMENT NumbCode - O (Fld010?, Fld011?, Fld015?, Fld017*, Fld018?,
  Fld019*, Fld020*, Fld022*, Fld023*, Fld024*, Fld025*, Fld027*,
  Fld028*, Fld029*, Fld030*, Fld032*, Fld033*, Fld034*, Fld035*,
  Fld036?, Fld037*, Fld039*, Fld040?, Fld041?, Fld042?, Fld043?,
  Fld044?, Fld045?, Fld046?, Fld047?, Fld048*, Fld050*, Fld051*,
  Fld052*, Fld055*, Fld060*, Fld061*, Fld066?, Fld069*, Fld070*,
  Fld071*, Fld072*, Fld074*, Fld080?, Fld082*,
  Fld084*, Fld086*, Fld088*, Fld090*, Fld096*)>

<!-- Main Entries -->
<!ELEMENT MainEnty - O (Fld100?, Fld110?, Fld111?, Fld130?)>

<!-- Titles -->
<!ELEMENT Titles - O (Fld210?, Fld211*, Fld212*, Fld214*, Fld222*,
  Fld240?, Fld242*, Fld243?, Fld245, Fld246*, Fld247*)>

<!-- Edition, Imprint, etc. -->
<!ELEMENT EdImprnt - O (Fld250?, Fld254?, Fld255*, Fld256?, Fld257?, Fld260?, Fld261?, Fld262?, Fld263?, Fld265?)>

<!-- Physical Description, etc. -->
<!ELEMENT PhysDesc - O (Fld300*, Fld305*, Fld306?, Fld310?, Fld315?,
  Fld321*, Fld340*, Fld350?, Fld351*, Fld355*, Fld357*, Fld362*)>

…etc…

Page 29: Lecture 14: Metadata and Markup

Complex DTD (cont.)

<!-- Title Statement -->
<!ELEMENT Fld245 - O (Six?, (a|b|c|f|g|h|k|n|p|s)+)>
<!ATTLIST Fld245 AddEnty (No|Yes|Blank) #IMPLIED
                 NFChars (0|1|2|3|4|5|6|7|8|9|Blnk) #IMPLIED>

…etc…

<!-- Subfield Element Declarations -->
<!ELEMENT a - O (#PCDATA)>
<!ELEMENT b - O (#PCDATA)>
<!ELEMENT c - O (#PCDATA)>
<!ELEMENT d - O (#PCDATA)>
<!ELEMENT e - O (#PCDATA)>

Page 30: Lecture 14: Metadata and Markup

Document Markup

• All document markup is derived from the DTD for the particular document type

• In SGML the DTD should be referenced in the document using the DOCTYPE declaration:

<!DOCTYPE name SYSTEM "file_path" >
or
<!DOCTYPE name SYSTEM "file_path" [doctype_declaration_subset]>
or
<!DOCTYPE name [doctype_declaration_subset]>

The doctype_declaration_subset can be any combination of element, entity, and attribute declarations

Page 31: Lecture 14: Metadata and Markup

HTML

• HTML was not originally "real" SGML; the DTD was invented after the language

• It is often more concerned with the form of the output on the screen than with the structural contents of the HTML docs

• Relies on the application (such as Netscape) to implement interesting actions like hypertext linking

• XHTML is now a W3C “recommendation” that applies XML conventions to HTML, and provides a growing set of capabilities within an XML framework (our phones use XHTML)

Page 32: Lecture 14: Metadata and Markup

Lecture Overview

• Review
  – XML and Document Engineering

• Metadata and Markup
  – XML as a Metadata Lingua Franca
    • METS
  – SGML vs. XML DTD Construction
  – XML Schemas
  – XML for Protocols and Metadata Languages

• Readings/Discussion

Page 33: Lecture 14: Metadata and Markup

What are XML Schemas?

• An XML vocabulary for expressing your data's structure AND content types, and even the business rules involved in processing the data

• Written in XML themselves

• Support namespaces for combining multiple schemas in the same documents

– The slides in this section are based on an XML tutorial by Roger L. Costello

Page 34: Lecture 14: Metadata and Markup

Example

<location>
  <latitude>32.904237</latitude>
  <longitude>73.620290</longitude>
  <uncertainty units="meters">2</uncertainty>
</location>

Is this data valid? To be valid, it must meet these constraints (data business rules):
1. The location must consist of a latitude, followed by a longitude, followed by an indication of the uncertainty of the lat/lon measurements.
2. The latitude must be a decimal with a value between -90 and +90.
3. The longitude must be a decimal with a value between -180 and +180.
4. For both latitude and longitude, the number of digits to the right of the decimal point must be exactly six.
5. The value of uncertainty must be a non-negative integer.
6. The uncertainty units must be either meters or feet.

We can express all these data constraints using XML Schemas
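
One way the six rules could be written down is sketched below. This is not the schema from the original tutorial: the type names (latitudeType, longitudeType, unitsType) are invented here, and no target namespace is used so that the instance above validates unchanged. Because the fractionDigits facet only sets an upper bound, the "exactly six digits" rule is expressed with a pattern instead.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Rules 2 and 4: decimal in [-90, 90] with exactly six fraction digits -->
  <xsd:simpleType name="latitudeType">
    <xsd:restriction base="xsd:decimal">
      <xsd:minInclusive value="-90"/>
      <xsd:maxInclusive value="90"/>
      <xsd:pattern value="[+\-]?[0-9]+\.[0-9]{6}"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Rules 3 and 4: decimal in [-180, 180] with exactly six fraction digits -->
  <xsd:simpleType name="longitudeType">
    <xsd:restriction base="xsd:decimal">
      <xsd:minInclusive value="-180"/>
      <xsd:maxInclusive value="180"/>
      <xsd:pattern value="[+\-]?[0-9]+\.[0-9]{6}"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Rule 6: units must be meters or feet -->
  <xsd:simpleType name="unitsType">
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="meters"/>
      <xsd:enumeration value="feet"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Rule 1: latitude, then longitude, then uncertainty -->
  <xsd:element name="location">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="latitude" type="latitudeType"/>
        <xsd:element name="longitude" type="longitudeType"/>
        <!-- Rule 5: non-negative integer content, plus the units attribute -->
        <xsd:element name="uncertainty">
          <xsd:complexType>
            <xsd:simpleContent>
              <xsd:extension base="xsd:nonNegativeInteger">
                <xsd:attribute name="units" type="unitsType" use="required"/>
              </xsd:extension>
            </xsd:simpleContent>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

An instance would point at a schema like this with the xsi:noNamespaceSchemaLocation attribute, since no target namespace is declared.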

Page 35: Lecture 14: Metadata and Markup

Validating your data

[Diagram: the <location> instance document from the previous slide and an XML Schema are both fed to an XML Schema validator, which reports "Data is ok!". The schema encodes checks such as:]
- check that the latitude is between -90 and +90
- check that the longitude is between -180 and +180
- check that the fraction digits is 6 for lat and lon
- ...

Page 36: Lecture 14: Metadata and Markup

Purpose of XML Schemas

• Specify:
  – the structure of instance documents
    • "this element contains these elements, which contain these other elements, etc."
  – the datatype of each element/attribute
    • "this element shall hold an integer with the range 0 to 12,000" (DTDs don't do too well with specifying datatypes like this)

Page 37: Lecture 14: Metadata and Markup

Motivation for XML Schemas: Why Schemas?

• People are dissatisfied with DTDs
  – It's a different syntax
    • You write your XML (instance) document using one syntax and the DTD using another syntax --> bad, inconsistent
  – Limited datatype capability
    • DTDs support a very limited capability for specifying datatypes. You can't, for example, express "I want the <elevation> element to hold an integer with a range of 0 to 12,000"
  – Desire a set of datatypes compatible with those found in databases
    • DTD supports 10 datatypes; XML Schemas supports 44+ datatypes
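
The elevation constraint that a DTD cannot state takes only a few lines in XML Schema. The following is a minimal sketch with no target namespace, written for illustration rather than copied from the tutorial.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- elevation holds an integer between 0 and 12,000 inclusive -->
  <xsd:element name="elevation">
    <xsd:simpleType>
      <xsd:restriction base="xsd:integer">
        <xsd:minInclusive value="0"/>
        <xsd:maxInclusive value="12000"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>
</xsd:schema>
```

With this schema, <elevation>5280</elevation> is valid, while <elevation>12001</elevation> and <elevation>high</elevation> are not.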

Page 38: Lecture 14: Metadata and Markup

Highlights of XML Schemas

• XML Schemas are a tremendous advancement over DTDs:
  – Enhanced datatypes
    • 44+ versus 10
    • Can create your own datatypes
      – Example: "This is a new type based on the string type and elements of this type must follow this pattern: ddd-dddd, where 'd' represents a digit".
  – Written in the same syntax as instance documents
    • less syntax to remember
  – Object-oriented'ish
    • Can extend or restrict a type (derive new type definitions on the basis of old ones)
  – Can express sets, i.e., can define the child elements to occur in any order
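
Two of these highlights can be sketched in one small schema: a user-defined datatype for the ddd-dddd pattern, and a content model whose children may appear in any order. The names localPhoneType, contact, name, and phone are hypothetical, chosen only for this illustration.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- A new datatype derived from string: values must look like ddd-dddd -->
  <xsd:simpleType name="localPhoneType">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="\d{3}-\d{4}"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- xsd:all: each child appears once, in any order (a set) -->
  <xsd:element name="contact">
    <xsd:complexType>
      <xsd:all>
        <xsd:element name="name" type="xsd:string"/>
        <xsd:element name="phone" type="localPhoneType"/>
      </xsd:all>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

Both orderings of name and phone inside contact are accepted, which is the "set" behavior that XML DTDs lost when the SGML & connector was dropped.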

Page 39: Lecture 14: Metadata and Markup

Highlights of XML Schemas

• Can specify element content as being unique (keys on content) and uniqueness within a region

• Can define multiple elements with the same name but different content

• Can define elements with nil content

• Can define substitutable elements - e.g., the "Book" element is substitutable for the "Publication" element.

Page 40: Lecture 14: Metadata and Markup

BookStore.dtd

<!ELEMENT BookStore (Book)+>
<!ELEMENT Book (Title, Author, Date, ISBN, Publisher)>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Author (#PCDATA)>
<!ELEMENT Date (#PCDATA)>
<!ELEMENT ISBN (#PCDATA)>
<!ELEMENT Publisher (#PCDATA)>

Page 41: Lecture 14: Metadata and Markup

[Diagram: two sets of names. The DTD vocabulary: ELEMENT, ATTLIST, ID, #PCDATA, NMTOKEN, ENTITY, CDATA. The new vocabulary it defines: BookStore, Book, Title, Author, Date, ISBN, Publisher.]

This is the vocabulary that DTDs provide to define your new vocabulary

Page 42: Lecture 14: Metadata and Markup

[Diagram: two sets of names, each labeled with a namespace. The XML Schema vocabulary in http://www.w3.org/2001/XMLSchema: element, complexType, schema, sequence, string, integer, boolean. The new vocabulary in http://www.books.org (targetNamespace): BookStore, Book, Title, Author, Date, ISBN, Publisher.]

This is the vocabulary that XML Schemas provide to define your new vocabulary

One difference between XML Schemas and DTDs is that the XML Schema vocabulary is associated with a name (namespace). Likewise, the new vocabulary that you define must be associated with a name (namespace). With DTDs neither set of vocabulary is associated with a name (namespace) [DTDs pre-dated namespaces].

Page 43: Lecture 14: Metadata and Markup

BookStore.xsd

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://www.books.org"
            xmlns="http://www.books.org"
            elementFormDefault="qualified">
  <xsd:element name="BookStore">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="Book" minOccurs="1" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="Book">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="Title" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="Author" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="Date" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="ISBN" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="Publisher" minOccurs="1" maxOccurs="1"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="Title" type="xsd:string"/>
  <xsd:element name="Author" type="xsd:string"/>
  <xsd:element name="Date" type="xsd:string"/>
  <xsd:element name="ISBN" type="xsd:string"/>
  <xsd:element name="Publisher" type="xsd:string"/>
</xsd:schema>

xsd = Xml-Schema Definition

Page 44: Lecture 14: Metadata and Markup

[BookStore.xsd, as shown on slide 43, annotated with the equivalent DTD declarations]

The five xsd:element declarations of type xsd:string correspond to:
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Author (#PCDATA)>
<!ELEMENT Date (#PCDATA)>
<!ELEMENT ISBN (#PCDATA)>
<!ELEMENT Publisher (#PCDATA)>

The Book element declaration corresponds to:
<!ELEMENT Book (Title, Author, Date, ISBN, Publisher)>

The BookStore element declaration corresponds to:
<!ELEMENT BookStore (Book)+>

Page 45: Lecture 14: Metadata and Markup

[BookStore.xsd, as shown on slide 43, with a callout on the <xsd:schema> element]

All XML Schemas have "schema" as the root element.

Page 46: Lecture 14: Metadata and Markup

[BookStore.xsd, as shown on slide 43, with a callout on xmlns:xsd="http://www.w3.org/2001/XMLSchema"]

The elements and datatypes that are used to construct schemas
- schema
- element
- complexType
- sequence
- string
come from the http://…/XMLSchema namespace

Page 47: Lecture 14: Metadata and Markup

XMLSchema Namespace

[Diagram: the http://www.w3.org/2001/XMLSchema namespace contains element, complexType, schema, sequence, string, integer, and boolean]

Page 48: Lecture 14: Metadata and Markup

[BookStore.xsd, as shown on slide 43, with a callout on targetNamespace="http://www.books.org"]

Says that the elements defined by this schema
- BookStore
- Book
- Title
- Author
- Date
- ISBN
- Publisher
are to go in this namespace

Page 49: Lecture 14: Metadata and Markup

Book Namespace (targetNamespace)

[Diagram: the http://www.books.org namespace (the targetNamespace) contains BookStore, Book, Title, Author, Date, ISBN, and Publisher]

Page 50: Lecture 14: Metadata and Markup

[BookStore.xsd, as shown on slide 43, with callouts on ref="Book" and on xmlns="http://www.books.org"]

Callout on ref="Book": This is referencing a Book element declaration. The Book in what namespace? Since there is no namespace qualifier it is referencing the Book element in the default namespace, which is the targetNamespace! Thus, this is a reference to the Book element declaration in this schema.

Callout on xmlns="http://www.books.org": The default namespace is http://www.books.org, which is the targetNamespace!

Page 51: Lecture 14: Metadata and Markup

2003.10.09 - SLIDE 51IS 202 – FALL 2003

<?xml version="1.0"?><xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.books.org" xmlns="http://www.books.org" elementFormDefault="qualified"> <xsd:element name="BookStore"> <xsd:complexType> <xsd:sequence> <xsd:element ref="Book" minOccurs="1" maxOccurs="unbounded"/> </xsd:sequence> </xsd:complexType> </xsd:element> <xsd:element name="Book"> <xsd:complexType> <xsd:sequence> <xsd:element ref="Title" minOccurs="1" maxOccurs="1"/> <xsd:element ref="Author" minOccurs="1" maxOccurs="1"/> <xsd:element ref="Date" minOccurs="1" maxOccurs="1"/> <xsd:element ref="ISBN" minOccurs="1" maxOccurs="1"/> <xsd:element ref="Publisher" minOccurs="1" maxOccurs="1"/> </xsd:sequence> </xsd:complexType> </xsd:element> <xsd:element name="Title" type="xsd:string"/> <xsd:element name="Author" type="xsd:string"/> <xsd:element name="Date" type="xsd:string"/> <xsd:element name="ISBN" type="xsd:string"/> <xsd:element name="Publisher" type="xsd:string"/></xsd:schema>

elementFormDefault="qualified" is a directive to any instance documents which conform to this schema: any elements used by the instance document which were declared in this schema must be namespace qualified.
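As a rough illustration of what "namespace qualified" means in an instance document, the sketch below lists the elements that are not in the http://www.books.org namespace. It is not a schema validator; the function name `unqualified_elements` and the second, deliberately unqualified document are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

BOOKS_NS = "http://www.books.org"

# Every element inherits the default namespace, so all are qualified.
QUALIFIED = """<BookStore xmlns="http://www.books.org">
  <Book><Title>My Life and Times</Title></Book>
</BookStore>"""

# Hypothetical counter-example: only the root carries the prefix, so the
# children are in no namespace at all.
UNQUALIFIED = """<bk:BookStore xmlns:bk="http://www.books.org">
  <Book><Title>My Life and Times</Title></Book>
</bk:BookStore>"""


def unqualified_elements(xml_text, namespace):
    """Return the tags of all elements that are not in the given namespace."""
    root = ET.fromstring(xml_text)
    expected_prefix = "{" + namespace + "}"
    return [el.tag for el in root.iter() if not el.tag.startswith(expected_prefix)]


print(unqualified_elements(QUALIFIED, BOOKS_NS))
print(unqualified_elements(UNQUALIFIED, BOOKS_NS))
```

ElementTree reports qualified names as `{namespace}local`, so the first document yields an empty list while the second reports `Book` and `Title` as unqualified.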

Page 52: Lecture 14: Metadata and Markup

Referencing a schema in an XML instance document

<?xml version="1.0"?>
<BookStore xmlns="http://www.books.org"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.books.org BookStore.xsd">
    <Book>
        <Title>My Life and Times</Title>
        <Author>Paul McCartney</Author>
        <Date>July, 1998</Date>
        <ISBN>94303-12021-43892</ISBN>
        <Publisher>McMillin Publishing</Publisher>
    </Book>
    ...
</BookStore>

1. First, using a default namespace declaration, tell the schema-validator that all of the elements used in this instance document come from the http://www.books.org namespace.

2. Second, with schemaLocation tell the schema-validator that the http://www.books.org namespace is defined by BookStore.xsd (i.e., schemaLocation contains a pair of values).

3. Third, tell the schema-validator that the schemaLocation attribute we are using is the one in the XMLSchema-instance namespace.
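The "pair of values" in step 2 can be made concrete with a short Python sketch. It only reads the attribute; it does not fetch or apply the schema. The function name `schema_locations` is hypothetical, and the instance document is a shortened copy of the one above.

```python
import xml.etree.ElementTree as ET

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

# Shortened copy of the instance document from the slide.
INSTANCE = """<?xml version="1.0"?>
<BookStore xmlns="http://www.books.org"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.books.org BookStore.xsd">
  <Book>
    <Title>My Life and Times</Title>
  </Book>
</BookStore>
"""


def schema_locations(xml_text):
    """Map each namespace URI in xsi:schemaLocation to its schema document."""
    root = ET.fromstring(xml_text)
    # The attribute lives in the XMLSchema-instance namespace (step 3).
    tokens = root.get("{" + XSI_NS + "}schemaLocation", "").split()
    # Tokens alternate: namespace URI, then the location of its schema (step 2).
    return dict(zip(tokens[0::2], tokens[1::2]))


root = ET.fromstring(INSTANCE)
print(root.tag)
print(schema_locations(INSTANCE))
```

The root tag comes back as `{http://www.books.org}BookStore`, reflecting the default namespace from step 1, and the mapping pairs that same namespace with BookStore.xsd.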

Page 53: Lecture 14: Metadata and Markup

XMLSchema-instance Namespace
http://www.w3.org/2001/XMLSchema-instance

• schemaLocation
• type
• noNamespaceSchemaLocation
• nil

Page 54: Lecture 14: Metadata and Markup

Referencing a schema in an XML instance document

BookStore.xml
schemaLocation="http://www.books.org BookStore.xsd"
- uses elements from namespace http://www.books.org

BookStore.xsd
targetNamespace="http://www.books.org"
- defines elements in namespace http://www.books.org

A schema defines a new vocabulary. Instance documents use that new vocabulary.

Page 55: Lecture 14: Metadata and Markup

Note multiple levels of checking

BookStore.xml, BookStore.xsd, XMLSchema.xsd (schema-for-schemas)

• Validate that the XML document conforms to the rules described in BookStore.xsd

• Validate that BookStore.xsd is a valid schema document, i.e., it conforms to the rules described in the schema-for-schemas

Page 56: Lecture 14: Metadata and Markup

Default Value for minOccurs and maxOccurs

• The default value for minOccurs is "1"

• The default value for maxOccurs is "1"

<xsd:element ref="Title" minOccurs="1" maxOccurs="1"/>

<xsd:element ref="Title"/>

Equivalent!
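The equivalence can be checked with a small Python sketch that reads the two attributes and falls back to "1" when either is absent. The helper name `occurrence_bounds` is hypothetical, and the fragment simply gathers the declarations shown in this lecture into one sequence.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

FRAGMENT = """<xsd:sequence xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element ref="Title" minOccurs="1" maxOccurs="1"/>
  <xsd:element ref="Title"/>
  <xsd:element ref="Book" minOccurs="1" maxOccurs="unbounded"/>
</xsd:sequence>"""


def occurrence_bounds(element):
    """Return (minOccurs, maxOccurs), applying the default of "1" to each."""
    return element.get("minOccurs", "1"), element.get("maxOccurs", "1")


declarations = ET.fromstring(FRAGMENT).findall("{" + XSD_NS + "}element")
bounds = [occurrence_bounds(el) for el in declarations]
print(bounds)
```

The first two declarations produce the same bounds, ("1", "1"), while the Book reference keeps its explicit "unbounded" maximum.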

Page 57: Lecture 14: Metadata and Markup

Much More to XMLSchema!

• This was an overview of some basics

• There are many other features, such as:
  – The ability to import other schemas or parts of schemas
  – Ability to specify many data types
  – Etc.

• XMLSchema definitions are at W3C
  – http://www.w3.org/TR/xmlschema-0/ is a good place to start

Page 58: Lecture 14: Metadata and Markup

Lecture Overview

• Review
  – XML and Document Engineering

• Metadata And Markup
  – XML As A Metadata Lingua Franca
    • METS
  – SGML vs. XML DTD Construction
  – XML Schemas
  – XML For Protocols And Metadata Languages

• Readings/Discussion

Page 59: Lecture 14: Metadata and Markup

Other Protocols and Metadata Systems Using XML

• SOAP (Simple Object Access Protocol)
• DAV/DASL (Distributed Authoring and Versioning)
• SDLIP (Simple Digital Library Interoperability Protocol)
• RDF (Resource Description Framework)
• ADL Gazetteer Protocol
• OAI-PMH (already discussed)
• MPEG-7 (more next time)
• METS
• Also versions of MARC and other formats in XML

Page 60: Lecture 14: Metadata and Markup

SGML and XML Sources and Resources

• Books:
  – van Herwijnen, Eric. Practical SGML. (2nd Ed.) Boston: Kluwer Academic Publishers, 1994.
  – Goldfarb, Charles F. The SGML Handbook. Oxford: Clarendon Press, 1990. (and MANY XML books)

• Web Sites:
  – The W3C web site (all XML standards documents)
    • http://www.w3.org
  – Robin Cover’s SGML/XML Site
    • http://www.oasis-open.org/cover/sgml-xml.html

Page 61: Lecture 14: Metadata and Markup

Lecture Overview

• Review
  – XML and Document Engineering

• Metadata And Markup
  – XML As A Metadata Lingua Franca
    • METS
  – SGML vs. XML DTD Construction
  – XML Schemas
  – XML For Protocols And Metadata Languages

• Readings/Discussion

Page 62: Lecture 14: Metadata and Markup

Discussion – Vam Makam

• Kirk covers examples of DTDs for books and newspapers. Many individuals and corporations have been creating numerous DTDs for themselves and general purposes. What are some innovative and useful ideas for areas where designing DTDs might be useful? For ideas that may have already been thought of, how could they be improved or extended?

Page 63: Lecture 14: Metadata and Markup

Discussion – Vam Makam

• However recently XML DTDs have emerged, newer ideas such as XML schemas have presented themselves as a better option. Given the thought and work that have gone into designing existing DTDs, at what point is it worth converting an existing DTD to an XML schema?

• Now that you have learned how to design a DTD and have basic knowledge about XML, what are some existing technologies that, combined with XML, become more useful?

Page 64: Lecture 14: Metadata and Markup

Discussion – Annie Yeh

• Kirk addresses the advantages of using external DTDs: the reusability of public DTDs, the ability to focus on content rather than structure, easier management of multiple documents, and easier data error checking. What are some of the existing repositories in which we can store these DTDs? What are some of the ways with which we can facilitate this process? What are their pros and cons? What are some of the more ideal interfaces with which to facilitate this?

Page 65: Lecture 14: Metadata and Markup

Discussion – Annie Yeh

• What are the differences between DTDs and Schemas, and what are the pros and cons of each?

Page 66: Lecture 14: Metadata and Markup

Next Time

• Metadata for Motion Pictures: MPEG-7

• Readings/Discussion
  – MPEG-7 (Part 1) (J. M. Martinez, R. Koenen, F. Pereira)
  – MPEG-7 (Part 2) (J. Martinez)

Page 67: Lecture 14: Metadata and Markup
