Dublin Core and metadata: a tutorial
Lorcan Dempsey
Andy Powell
UKOLN, University of Bath
(with a little help from our friends)
http://www.ukoln.ac.uk/metadata
2 - Lux, 1-2 Dec 1997
Questions for you ...
• Metadata
• EAD, CIMI, TEI
• PICS, XML, RDF
• MARC
• 856
• Dublin Core
• you are
• geeks/people with sensible shoes
• goers/doers
Overview
• UKOLN and metadata
• Metadata landscape
• Dublin Core
• Metadata management
• Interoperability
• Harvesting
• Future
UKOLN and metadata
• ROADS
• subject gateways
• WHOIS++ templates
• BIBLINK
• CIP for electronic data
• Dublin Core (+ MARC)
• Desire
• WHOIS++, GILS, Dublin Core
• Z39.50/WHOIS++
• NewsAgent
• current awareness, Ariadne
• Dublin Core, DC-dot
• MODELS
• collection description??
• Agora
• PRIDE
• Initiatives
What is metadata …?
• It’s just cataloguing, isn’t it … ?
• Yes and no …
• Data which supports operations carried out on information objects …
– discover, buy, ...
• In the company of strangers (Brody)
• Relieve user of having to have full advance knowledge of characteristics of resources …
… variety
Metadata model: the library example
[Figure: Libraries. Semantics, syntax, content: MARC, ISO 2709, AACR2. Picture by Stu Weibel]
Variety of formal and informal metadata models
[Figure labels: Museums, Geospatial, Libraries, Internet Commons, Commerce, Scientific Data, Home Pages, Whatever... Picture by Stu Weibel]
Variety of operations ...
• Discovery
• Location
• Selection
• fit for use
• Acquire
• terms
• Manipulate
• Exploit
• IPR
• Document
• Contextualise
• Preserve
• Manage
• dates, people, structures, …
• Agent/client access
• ….
Variety of sectors ...
• Curatorial traditions
• ‘cataloguing’/documentation
• libraries, archives, text archives, museums, geospatial data, etc
• Network resource discovery
• directory services, search engines, etc
• influence from computer science
• Network information management
• web developments, W3C, database
• sitemap, time to live, ...
• pragmatic - market needs, vendor push
Variety of creation models ...
• Author/creator
• web pages?
• Repository/site manager
• effective disclosure
• better management
• Third party creator
• e.g. eLib subject gateways
• Library
Metadata ...
• Variety of metadata models
• syntax, semantics, content
• scope
• sectors/domains
• Variety of operations supported
• Variety of creation models
• Variety of architectures for disclosure/discovery
• Search and retrieve
• Disclosure/distribution
• Management
… complex
Some formats
Band One (full text indexes): proprietary formats, …
Band Two (simple structured generic formats) (syntax/semantics?): proprietary formats, Dublin Core, IAFA/WHOIS++ templates, RFC 1807, …
Band Three (more complex structure, domain specific): FGDC, ‘MARC’, GILS, …
Band Three (part of a larger semantic framework): TEI headers, ICPSR, EAD, CIMI
richer… semantics, structure, domain-specific, ...
Dublin Core
• Metadata model
• Simple element set
• focus on semantics - several target syntaxes
• Operations
• resource discovery on the web
• Explicitly cross sector/domain
• No constraint on creation model or application architecture
[Figure labels: FGDC, MARC, Museum, ..., Dublin Core]
… simple and intuitive
Dublin Core - why success?
• Simple
• Coincides with strategic needs in each of the sectors we identified
– Curatorial: semantic interoperability between richer metadata models
– Resource discovery: a simple format for descriptive metadata (DLOs)
– Web management: associate metadata with Web resources
• Inclusive (countries/domains/traditions)
• Stu Weibel
Dublin Core - elements
• Title • Subject • Description • Creator • Publisher • Contributor • Date • Type
• Format • Identifier • Source • Language • Relation • Coverage • Rights
• 15 element core metadata set
Dublin Core - HTML Example
<HTML><HEAD>
<TITLE>UKOLN Home Page</TITLE>
<META NAME="DC.Title" CONTENT="UKOLN: UK Office for Library and Information Networking">
<META NAME="DC.Subject" CONTENT="national centre, network information support, library community, awareness, research, information services, public library networking, bibliographic management, distributed library systems, metadata, resource discovery, conferences, lectures, workshops">
<META NAME="DC.Description" CONTENT="UKOLN is a national centre for support in network information management in the library and information communities. It provides awareness, research and information services">
<META NAME="DC.Creator" CONTENT="Isobel Stark">
</HEAD>
...
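Tags like those in the HTML example can be generated rather than typed by hand. The following is a minimal sketch in Python, not the tutorial's own tooling: the function name `dc_meta_tags` and the dictionary-based record are assumptions made for illustration, and only the 15 element names and the `DC.` prefix come from the slides.

```python
from html import escape

# The 15 unqualified Dublin Core element names listed on the elements slide.
DC_ELEMENTS = [
    "Title", "Subject", "Description", "Creator", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
]


def dc_meta_tags(record):
    """Return one <META> line per (element, value) pair.

    `record` (a hypothetical structure) maps an element name to a string
    or a list of strings; a repeated element yields repeated tags.
    """
    lines = []
    for element, values in record.items():
        if element not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {element}")
        if isinstance(values, str):
            values = [values]
        for value in values:
            # Escape quotes and angle brackets so the attribute stays well formed.
            content = escape(value, quote=True)
            lines.append(f'<META NAME="DC.{element}" CONTENT="{content}">')
    return lines


example = dc_meta_tags({
    "Title": "UKOLN: UK Office for Library and Information Networking",
    "Creator": "Isobel Stark",
})
print("\n".join(example))
```

Escaping the value is the main point of generating the tags: a stray quote character in hand-typed CONTENT is exactly the kind of error that is easy to make.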
Data creation
Practical issues of using Dublin Core for Internet resource description...
• UKOLN metadata system
• Requirements
• 3 models for metadata management
• Implementation at UKOLN
UKOLN metadata system requirements
• Easy to use
• Work with a variety of methods of creating HTML
• Simple migration to future metadata formats
• Separate metadata from resource
Managing Dublin Core (1) HTML Authoring tool
Embed by hand using HTML or text editor
Pros…
• Simple
• May be useful for training and familiarisation
Cons…
• May not be possible with all editors
• Maintenance problems
• Easy to make errors
DC-dot
• A Web based tool for creating Dublin Core <meta> tags
• Automatic generation of some tags based on content of the resource
• Forms based editing of tags
• Cut-and-paste output into HTML
• Conversion to other formats…
• SOIF, ROADS/WHOIS++, USMARC, GILS...
http://www.ukoln.ac.uk/metadata/dcdot/
Run demo
Managing Dublin Core (2) Web-site management tool
Use Web-site management tool, for example NetObjects Fusion
Pros…
• Use of Web-site management tools likely to increase
• Object-oriented database approach
Cons…
• Proprietary formats
• Early days - too early to evaluate use for metadata yet?
Managing Dublin Core (3) On the fly generation
Hold Dublin Core separately and embed on-the-fly using server-side include (SSI)
Pros…
• Separates metadata from resource
• Future migration fairly simple
Cons…
• Performance
• Lack of integration with HTML tools
• Server specific
UKOLN metadata system (1)
• Embed on-the-fly
• Apache SSI script
• Store metadata using SOIF records
• Use MS-Access as tool to create the records
• Associate metadata with resource by co-locating them in the Web server filestore
UKOLN metadata system (2)
[Diagram: the MS-Access database produces intro.html.soif; the HTML editor produces intro.html]
intro.html:
<html><head><title>…</title><!--#exec cmd="getmeta" --></head>...
intro.html.soif:
@FILE { http://www.ukoln.ac....
keywords{13}: xxx, yyy, zzz
description{14}: blah blah b
author{13}: Stark, Isobel
...
}
Apache syntax for calling server-side script:
<!--#exec cmd="getmeta" -->
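The slides name the server-side script `getmeta` but do not show it. The following Python sketch is an assumption about what such a script might do: read a SOIF record and print Dublin Core META tags. The attribute-to-element mapping `SOIF_TO_DC`, the sample URL and both function bodies are hypothetical. The parser relies on the SOIF convention that the number in braces is the length of the value, and it assumes single-byte characters.

```python
import re
from html import escape

# Hypothetical mapping from the SOIF attribute names on the slide to
# Dublin Core elements; the real UKOLN mapping is not given in the slides.
SOIF_TO_DC = {"keywords": "Subject", "description": "Description", "author": "Creator"}

HEADER = re.compile(r"@(\w+)\s*\{\s*(\S+)[^\n]*\n")
ATTR = re.compile(r"([\w-]+)\{(\d+)\}:[ \t]")


def parse_soif(text):
    """Parse one SOIF record into (template, url, {attribute: value})."""
    header = HEADER.match(text)
    if header is None:
        raise ValueError("missing SOIF header")
    pos, attrs = header.end(), {}
    while True:
        m = ATTR.match(text, pos)
        if m is None:
            break  # reached the closing brace or malformed input
        size = int(m.group(2))
        # The value is exactly `size` characters long (single-byte text assumed).
        attrs[m.group(1)] = text[m.end():m.end() + size]
        pos = m.end() + size
        while pos < len(text) and text[pos] in " \t\r\n":
            pos += 1
    return header.group(1), header.group(2), attrs


def getmeta(soif_text):
    """Render the mapped attributes as Dublin Core META tags."""
    _, _, attrs = parse_soif(soif_text)
    return "\n".join(
        f'<META NAME="DC.{SOIF_TO_DC[name]}" CONTENT="{escape(value, quote=True)}">'
        for name, value in attrs.items()
        if name in SOIF_TO_DC
    )


# Made-up record modelled on the slide; the URL is a placeholder.
SAMPLE = (
    "@FILE { http://www.example.org/intro.html\n"
    "keywords{13}:\txxx, yyy, zzz\n"
    "author{13}:\tStark, Isobel\n"
    "}\n"
)
print(getmeta(SAMPLE))
```

How the real script locates intro.html.soif for the page being served is not stated in the slides, so the sketch takes the record text as an argument.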
UKOLN metadata system (3)
MS-Access frontend...
Filename browser
Text boxes
Name choosers
UKOLN specific metadata
UKOLN metadata system (4)
[Diagram, steps numbered 1 to 6: Web robot, UKOLN Web server, intro.html, SSI script, intro.html.soif]
Issues
• Performance
• Interaction with Web caches
• Dublin Core vs Alta Vista style metadata
<META NAME="Description" CONTENT="blah, blah">
<META NAME="Keywords" CONTENT="xxx, yyy, zzz">
• Granularity
• Which pages should have metadata?
A short history: Dublin to Helsinki
We have borrowed some of this material from Stu Weibel, with permission
Dublin Core Workshop Series ..
• DC-1: OCLC/NCSA Metadata Workshop Mar, 1995
• Limited Scope: Discovery of document-like objects
• 13 element Dublin Core
• Interdisciplinary consensus
• DC-2: OCLC/UKOLN Warwick Workshop April, 1996
• Warwick Framework - modularity
• Syntax issues
.. Dublin Core Workshop Series
• DC-3: CNI/OCLC Image Metadata Workshop, Sep, 1996
• Images are in scope
• 15 element core; some element name changes
• DC-4: Canberra Metadata Workshop Mar, 1997
• Minimalists and Structuralists
• Canberra Qualifiers (additional information useful for interpretation of metadata)
Dublin Core - qualifiers
• Language of element value
• Scheme
• specifies a context for interpretation
<META NAME="DC.Subject" SCHEME="ddc.21" CONTENT="170.42">
• Sub-element
• specifies a facet - narrows
<META NAME="DC.Creator.Address" CONTENT="[email protected]">
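A consumer of qualified tags has to take the dotted name apart. The sketch below shows one way to do that in Python; the `QualifiedValue` type and `parse_meta` function are illustrative names, not part of any Dublin Core specification, and the e-mail address in the usage line is a made-up placeholder.

```python
from typing import NamedTuple, Optional


class QualifiedValue(NamedTuple):
    element: str                # e.g. "Creator"
    sub_element: Optional[str]  # e.g. "Address", or None if unqualified
    scheme: Optional[str]       # e.g. "ddc.21", or None
    content: str


def parse_meta(name, content, scheme=None):
    """Split a dotted META name such as 'DC.Creator.Address' into its parts."""
    parts = name.split(".")
    if len(parts) < 2 or parts[0].upper() != "DC":
        raise ValueError(f"not a Dublin Core META name: {name}")
    # Anything after the element name is treated as the sub-element.
    sub_element = ".".join(parts[2:]) or None
    return QualifiedValue(parts[1], sub_element, scheme, content)


subject = parse_meta("DC.Subject", "170.42", scheme="ddc.21")
address = parse_meta("DC.Creator.Address", "someone@example.org")
print(subject)
print(address)
```

An indexer that only understands unqualified Dublin Core can ignore `sub_element` and `scheme` and still index the value under the base element, which is the sense in which a sub-element "narrows".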
DC-5
• DC-5: National Library of Finland/OCLC Workshop, October 1997
– Formal Data Model (expressed in RDF)
– many other problems are hereby made simpler
– Resource Description Framework
– The return of modularity
– Finnish finish (of unqualified DC)
– minimalist DC is done and will not be changed
– Semantics for additional sub-structure
– a small number of sub-elements will be established
– Closer DC-W3C collaboration
Working groups
• Data Model
• date, relationship, source
• what is a resource?
• 1:1
• RDF
• Relationships
• Typology
• Sub-elements
• Date
RFCs in preparation
• Simple DC semantics (the minimalist position)
• Simple DC syntax for embedded HTML
• DC semantics with qualifiers
• DC syntax with qualifiers
• HTML 2.0
• HTML 4.0
• RDF
Projects
• 30 projects; 10 countries
http://purl.org/metadata/dublin_core/projects.html
• “Interdisciplinary and international recognition as the lingua franca for resource discovery metadata for electronic resources” Stu Weibel
• Support for use for non-digital objects
The HTML 2.0 “kludge”
• Convention for simple embedded metadata
• Bootstrapping early Dublin Core deployments
• META tags and standard HTML syntax
• Useful for simple metadata without qualifiers
• Can support Dublin Core qualifiers, but with risks for interoperability and indexing purity
<META NAME="DC.Subject" CONTENT="(SCHEME=LCSH)Information technology -- higher education">
HTML 4.0 - DC influences the web
• Richer <META> tag attributes
• LANG (language of the metadata)
• SCHEME (formal qualifier)
• SUB-ELEMENTS (dot syntax extensions)
• Allows syntactically “clean” implementation of metadata with qualifiers
<META NAME="DC.Subject" SCHEME="LCSH" CONTENT="Information technology -- higher education">
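A harvester may meet both encodings of a scheme: the HTML 2.0 convention, where it is packed into CONTENT, and the HTML 4.0 SCHEME attribute. The Python sketch below normalises the two; the function name `split_scheme` and the regular expression are assumptions based only on the two example tags shown on these slides.

```python
import re

# Matches the "(SCHEME=xxx)" prefix used in the HTML 2.0 style example.
KLUDGE = re.compile(r"^\(SCHEME=([^)]+)\)\s*(.*)$", re.IGNORECASE | re.DOTALL)


def split_scheme(content, scheme=None):
    """Return (scheme, value) for either encoding of a qualified value."""
    if scheme is not None:
        # HTML 4.0 style: the scheme arrived as its own attribute.
        return scheme, content
    m = KLUDGE.match(content)
    if m:
        # HTML 2.0 style: peel the scheme off the front of the content.
        return m.group(1), m.group(2)
    return None, content


old_style = split_scheme("(SCHEME=LCSH)Information technology -- higher education")
new_style = split_scheme("Information technology -- higher education", scheme="LCSH")
print(old_style == new_style)
```

An indexer without this step would index the literal text "(SCHEME=LCSH)" as part of the subject, which is the "indexing purity" risk noted for the HTML 2.0 convention.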
Some quick statistics
• UK (academic sites only)
• Total pages: ~1.5M (a guess!)
• Embedded DC: ‘a few hundred’
http://www.cs.ukc.ac.uk/people/staff/djb1/
Information provided by Dave Beckett
• Sweden
• Total pages: 1.4M
• Embedded DC: ‘a few dozen’
http://www.lub.lu.se/nwiPaper/
Information provided by Sigfrid Lundburg
Interoperability
• What do we mean by interoperability?
• Issues
• Z39.50 and Dublin Core
• Metadata registries
Interoperability?
• Unify access to data in different domains - Web, library, museums, archives, ...
• Issues
• Protocols - Z39.50, WHOIS++, …
– gateways
• Attribute names - author/creator/...
– Semantic interoperability - mapping tables
• Format of results
– format converters
In real life these can all get mixed up
Protocol Gateways - an example
• ZEXI - a Z39.50 to WHOIS++ gateway
• Based on CNIDR's Isite
• Accepts Z39.50 searches
• Converts them to WHOIS++
• Returns SUTRS records
http://roads.ukoln.ac.uk/cgi-bin/egwcgi/egwirtcl/targets.egw
Attribute names
• Different databases may use different ‘names’ for the same thing
• ‘creator’ vs ‘author’
• Need to be able to construct searches that ‘work’ against different databases irrespective of the ‘names’ in use
• Dublin Core provides a minimal set of agreed ‘names’ with which we can construct searches
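The idea of phrasing a search once in agreed names and translating it per database can be sketched with a mapping table. In the Python below the two database names and their local field names are invented for illustration; only the ‘creator’ vs ‘author’ contrast comes from the slide.

```python
# Hypothetical mapping tables: Dublin Core element -> local field name.
FIELD_NAMES = {
    "library_catalogue": {"Creator": "author", "Title": "title", "Subject": "subject"},
    "subject_gateway": {"Creator": "creator", "Title": "name", "Subject": "keywords"},
}


def translate_query(query, database):
    """Rewrite a query keyed by Dublin Core elements into local field names."""
    table = FIELD_NAMES[database]
    unknown = [element for element in query if element not in table]
    if unknown:
        raise KeyError(f"{database} has no mapping for: {unknown}")
    return {table[element]: term for element, term in query.items()}


query = {"Creator": "Weibel", "Subject": "metadata"}
for database in FIELD_NAMES:
    print(database, translate_query(query, database))
```

The user writes one query; each target sees the field names it expects.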
Format of results
• Different databases may return results in different formats
• USMARC, GRS-1, SUTRS, IAFA, ...
• Early stages of searching ideally need results to be returned in single ‘simple’ format
• Dublin Core provides a minimal set of agreed data elements with which we can construct results
Z39.50 and DC - searching
• Version 2
• Searches phrased in terms of single attribute set only
• Either need to
– add DC attributes to Bib-1
– map DC to Bib-1
• Version 3
• Multiple attribute sets allowed for searching
• New simple DC attribute set to be proposed
• Other attributes taken from Bib-1
http://cypress.dev.oclc.org:12345/~rrl/docs/dublincoreandz3950.html
Z39.50 and DC - retrieval
• To return Dublin Core ‘records’ using Z39.50…
• use GRS-1 (Generic Record Syntax)
• elements are assigned tags
• DC elements have been added to tagset-G
Format conversion - issues
• Simple to rich, e.g. DC to MARC
• May not generate valid rich record without manual enhancement
• Use of DC qualifiers required for decent MARC record
• Rich to simple, e.g. MARC to DC
• Loss of data
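Both directions of conversion can be illustrated with a small crosswalk. The Python sketch below is illustrative only: the handful of tag and subfield pairs is an assumption chosen for the example, not the mapping used by d2m, BIBLINK or any other tool named in this tutorial, and the sample record values are made up.

```python
# Illustrative crosswalk: Dublin Core element -> (MARC tag, subfield code).
DC_TO_MARC = {
    "Title": ("245", "a"),
    "Creator": ("720", "a"),
    "Subject": ("653", "a"),
    "Description": ("520", "a"),
    "Publisher": ("260", "b"),
    "Identifier": ("856", "u"),
}
MARC_TO_DC = {target: element for element, target in DC_TO_MARC.items()}


def dc_to_marc(record):
    """Simple to rich: map (element, value) pairs to (tag, code, value)."""
    return [(*DC_TO_MARC[element], value)
            for element, value in record if element in DC_TO_MARC]


def marc_to_dc(fields):
    """Rich to simple: return (kept pairs, subfields with no DC equivalent)."""
    kept, lost = [], []
    for tag, code, value in fields:
        element = MARC_TO_DC.get((tag, code))
        if element is None:
            lost.append((tag, code, value))
        else:
            kept.append((element, value))
    return kept, lost


rich = [
    ("245", "a", "UKOLN Home Page"),
    ("245", "c", "Isobel Stark"),        # statement of responsibility
    ("300", "a", "1 online resource"),   # physical description
    ("856", "u", "http://www.ukoln.ac.uk/"),
]
kept, lost = marc_to_dc(rich)
print("kept:", kept)
print("lost:", lost)
```

The `lost` list is the "loss of data" in the rich-to-simple direction. In the other direction `dc_to_marc` yields only bare subfields, which matches the point that the result may not be a valid MARC record without manual enhancement.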
Metadata registries
• Semantics
• Agreement on element meanings
• Agreement on enumerated lists
• Qualifiers
• Thesaurus naming
• Publishing existing metadata sets
• Re-use by others - prevent duplication of work
• e.g. Administrative metadata
Some pointers
• Mapping tables
http://www.ukoln.ac.uk/metadata/interoperability/
• Software
• General
http://www.ukoln.ac.uk/metadata/software-tools/
• d2m : Dublin Core to MARC converter
http://www.bibsys.no/meta/d2m/
• USEMARCON
http://www2.echo.lu/libraries/en/projects/usemarc.html
Harvesting Dublin Core
• General Issues
• Building a Web index
• Harvest and NWI
• Building a ‘local’ search engine
• Harvest, SWISH-E, Isite, Zebra
• DC as cataloguer’s aid
Harvesting - issues
• Mappings
• Multiple element values
• Multiple languages
• Complex data values
• e.g. DC.Date, DC.Coverage
• SCHEMES
Harvesting - issues
• Frames
• Harvesting non-embedded metadata
• HTML 3.2 vs HTML 4.0
• Hidden pages
• Controlling the robot
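Several of the harvesting issues (multiple element values, languages, SCHEMEs) show up as soon as a robot extracts embedded tags. The sketch below uses Python's standard `html.parser` module; the class name `DCMetaParser` and the sample page are assumptions for illustration, and this is not the Harvest or NWI summariser.

```python
from collections import defaultdict
from html.parser import HTMLParser


class DCMetaParser(HTMLParser):
    """Collect every DC.* META tag, keeping repeated elements as a list."""

    def __init__(self):
        super().__init__()
        self.elements = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        content = attributes.get("content")
        if name.lower().startswith("dc.") and content is not None:
            self.elements[name[3:]].append({
                "value": content,
                "scheme": attributes.get("scheme"),
                "lang": attributes.get("lang"),
            })


# Made-up page combining tags in the styles shown on earlier slides.
PAGE = """<HTML><HEAD>
<TITLE>UKOLN Home Page</TITLE>
<META NAME="DC.Subject" CONTENT="metadata">
<META NAME="DC.Subject" SCHEME="LCSH" CONTENT="Information technology -- higher education">
<META NAME="Keywords" CONTENT="xxx, yyy, zzz">
</HEAD></HTML>"""

parser = DCMetaParser()
parser.feed(PAGE)
for element, values in parser.elements.items():
    print(element, values)
```

Keeping a list per element, with scheme and language carried alongside each value, leaves the decision about how to map them into an index record to a later step.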
Harvest
• Resource discovery suite of tools - robot, summarisers, indexers
• SOIF records
• Supports a variety of indexers
• Supports database brokerage model
• CGI based user-interface
• UKOLN’s HTML summariser is Dublin Core aware
http://www.tardis.ed.ac.uk/harvest/
Nordic Web Index
• Custom robot - NWI/Combine
• Dublin Core aware
• GILS-II records
• Indexed using Zebra
• Searched using Z39.50
• User interface based on Europagate
http://nwi.ub2.lu.se/?lang=uk
Other software
• SWISH-E
• system for indexing local collections of Web pages or other text files
http://sunsite.berkeley.edu/SWISH-E/
• Isite
• text indexer (Isearch) and Z39.50
http://www.cnidr.org/ir/isite.html
• Zebra
• text indexer and Z39.50
http://www.indexdata.dk
DC as cataloguer’s aid
• ROADS
• Software to create, manage and search Internet resource descriptions
• WHOIS++
• Records created manually
• ‘Pump-prime’ metadata record with values based on embedded DC using robot
http://www.ukoln.ac.uk/roads/
DC as cataloguer’s aid
• BIBLINK
• Flow of information from publishers to National Bibliographic Agencies
• MARC based catalogues of electronic publications
• Initial MARC record based on DC description supplied by publisher using email
http://www.ukoln.ac.uk/metadata/BIBLINK/
Limits
• In development
• Syntax
• Simple
• Discovery
• Document like objects
• Weak model
• Administrative metadata
• Addressed in Helsinki
Syntax
• HTML 2, HTML 4, RDF, ...
• RDF - W3C (World Wide Web Consortium) initiative
• “RDF is the realization of the Warwick Framework for the Web”
• RDF will be the foundation for an architecture for metadata on the Web
Resource description, Electronic commerce, Site mapping, Third party rating, Digital signatures
RDF: Why is it important?
• RDF provides a coherent data model and syntactical framework for ‘plug-n-play’ metadata
• the semantics and structure of metadata packages will be determined by stakeholder communities via independently developed and maintained metadata element sets
• e.g.: MARC, DC, TEI, GILS, CIMI, Ratings….
• Political imperatives for deployment
• Software infrastructure will be ubiquitous (and come for free in browsers and servers)
Semantics
• Tension
• simple vs complex
• generic vs specific
• interoperability vs selfstanding
• Development
• relationship
• sub-elements
• scheme
Environment
• ‘Save the time of the user’
• Diverse resources
• Broker/middleware/gateway/trading place/…
• Variety of protocols and metadata models
• DC
• simple - volume
• ‘shallow’ - interop