
A Library Science Perspective on Digitization

Bryan Heidorn, University of Arizona

Library-Museum Parallels

• Intellectual Property Rights
• Physical/Digital Objects Sharing
• Descriptive Metadata Formats
• Preservation Metadata
• Transport Metadata Formats
• Communication Protocols (not so much)
• Similar Digitization Workflow
• OCR Challenges

Intellectual Property Rights

• Expanded to 75 years in the US from 25
• Academic publishing anomalies
• Attribution required (for data, not so much)
• Decoupling of data from text

Online Computer Library Center (OCLC)

• Collaborative automation of libraries, including copy cataloging
• Started 1967
• Catalogs 271 million items/year
• 72,000 libraries in 170 countries and territories use OCLC services to locate, acquire, catalog, lend and preserve library materials.

Descriptive Metadata Formats

• MARC(XML) 21 Standard
• METS
• Dublin Core (Interchange Format only)

Biodiversity Heritage Library Workflow

Courtesy: Martin Kalfatovic, Program Director, Biodiversity Heritage Library, Smithsonian Institution Libraries

MARC 21 Standard

• Formats: Bibliographic, Authority, Holdings, Classification, Community
• Bibliographic Material Types:
  – Books (BK)
  – Continuing resources (CR)
  – Computer files (CF)
  – Maps (MP)
  – Music (MU)
  – Visual materials (VM)
  – Mixed materials (MX)

http://www.loc.gov/marc/

MARC Fields

• 00X: Control Fields
• 01X-09X: Numbers and Code Fields
• Heading Fields - General Information
• 1XX: Main Entry Fields
• 20X-24X: Title and Title-Related Fields
• 25X-28X: Edition, Imprint, Etc. Fields
• 3XX: Physical Description, Etc. Fields
• 4XX: Series Statement Fields
• 5XX: Note Fields
• 6XX: Subject Access Fields
• 70X-75X: Added Entry Fields
• 76X-78X: Linking Entry Fields
• 80X-83X: Series Added Entry Fields
• 841-88X: Holdings, Location, Alternate Graphics, Etc. Fields
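The tag ranges above amount to a lookup table. A minimal Python sketch of that lookup follows; the table is copied from the list on this slide, while the name `marc_field_group` and the "Unassigned" fallback are illustrative assumptions, not part of the MARC 21 standard.

```python
# Tag ranges taken from the field list above (00X means 001-009, and so on).
MARC_FIELD_GROUPS = [
    (1, 9, "Control Fields"),
    (10, 99, "Numbers and Code Fields"),
    (100, 199, "Main Entry Fields"),
    (200, 249, "Title and Title-Related Fields"),
    (250, 289, "Edition, Imprint, Etc. Fields"),
    (300, 399, "Physical Description, Etc. Fields"),
    (400, 499, "Series Statement Fields"),
    (500, 599, "Note Fields"),
    (600, 699, "Subject Access Fields"),
    (700, 759, "Added Entry Fields"),
    (760, 789, "Linking Entry Fields"),
    (800, 839, "Series Added Entry Fields"),
    (841, 889, "Holdings, Location, Alternate Graphics, Etc. Fields"),
]


def marc_field_group(tag: str) -> str:
    """Return the field group name for a three-digit MARC tag such as '245'."""
    number = int(tag)
    for low, high, name in MARC_FIELD_GROUPS:
        if low <= number <= high:
            return name
    # Hypothetical fallback label for tags outside the listed ranges.
    return "Unassigned"


if __name__ == "__main__":
    for tag in ("008", "100", "245", "650"):
        print(tag, marc_field_group(tag))
```

Keeping the ranges in a list rather than a chain of `if` statements makes it easy to compare the table line by line against the slide.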

MARC Book Example

Leader/00-23 *****nam##22*****#a#4500
001 <control number>
003 <control number identifier>
005 19920331092212.7
007/00-01 ta
008/00-39 820305s1991####nyu###########001#0#eng##
020 ##$a0845348116 :$c$29.95 (£19.50 U.K.)
020 ##$a0845348205 (pbk.)
040 ##$a[organization code]$c[organization code]
050 14$aPN1992.8.S4$bT47 1991
082 04$a791.45/75/0973$219
100 1#$aTerrace, Vincent,$d1948-
245 10$aFifty years of television :$ba guide to series and pilots, 1937-1988 /$cVincent Terrace.
246 1#$a50 years of television
260 ##$aNew York :$bCornwall Books,$cc1991.
300 ##$a864 p. ;$c24 cm.
500 ##$aIncludes index.
650 #0$aTelevision pilot programs$zUnited States$vCatalogs.
650 #0$aTelevision serials$zUnited States$vCatalogs.
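To make the notation in this example concrete, here is a small Python sketch that splits one display line into tag, indicators, and subfields. It assumes the display convention used on this slide (`#` for a blank, `$` followed by a one-character subfield code); the function name `parse_marc_line` is hypothetical. It is not a parser for binary MARC records, and it cannot tell a literal dollar sign in a value (as in the price on the first 020 line) from a subfield delimiter.

```python
def parse_marc_line(line: str) -> dict:
    """Split one MARC display line into tag, indicators and subfields."""
    tag, _, rest = line.partition(" ")
    # The Leader and the 00X control fields carry no indicators or subfields.
    if tag.startswith(("Leader", "00")):
        return {"tag": tag, "indicators": None, "subfields": [], "data": rest}
    indicators, body = rest[:2], rest[2:]
    subfields = []
    # Each chunk after a '$' starts with the subfield code.
    for chunk in body.split("$")[1:]:
        if chunk:
            subfields.append((chunk[0], chunk[1:].strip()))
    return {"tag": tag, "indicators": indicators, "subfields": subfields, "data": None}


if __name__ == "__main__":
    title = parse_marc_line(
        "245 10$aFifty years of television :"
        "$ba guide to series and pilots, 1937-1988 /$cVincent Terrace."
    )
    print(title["tag"], title["indicators"])
    for code, value in title["subfields"]:
        print(f"  ${code} {value}")
```

For production work an established MARC library would be the safer choice; this sketch only illustrates how the tag, the two indicator positions, and the subfield codes are laid out.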

Difference between Museum and Library

• Full Darwin Core has parallels in MARC
• Many more commercial and custom products
• Larger installed base
• Library entries somewhat more detailed
• There is a MARC(XML) and MARC Lite
• MARC differentiates among material types

Digital Content Transport

• METS – Metadata Encoding and Transmission Standard

• The METS schema is a standard for encoding descriptive, administrative, and structural metadata regarding objects within a digital library, expressed using the XML schema language.

Courtesy: Martin Kalfatovic, Program Director, Biodiversity Heritage Library, Smithsonian Institution Libraries

METS Components

• METS Header
• Descriptive Metadata
• Administrative Metadata
• File Section - The file section lists all files containing content which comprise the electronic versions of the digital object. <file> elements may be grouped within <fileGrp> elements, to provide for subdividing the files by object version.
• Structural Map
• Structural Links
• Behavior
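As a rough illustration of how these components nest, the following Python sketch builds a bare METS skeleton for a scanned book with the standard library's `xml.etree.ElementTree`. The element names and the METS and XLink namespaces come from the METS schema; the function name, the ID scheme, the `USE="master"` label, and the page file names are assumptions made for the example, and the output has not been validated against the schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"


def build_mets_skeleton(page_files):
    """Build header, metadata stubs, a file section and a structural map."""
    ET.register_namespace("mets", METS_NS)
    ET.register_namespace("xlink", XLINK_NS)

    def q(name):
        # Qualify an element name with the METS namespace.
        return f"{{{METS_NS}}}{name}"

    mets = ET.Element(q("mets"))
    ET.SubElement(mets, q("metsHdr"))
    ET.SubElement(mets, q("dmdSec"), ID="DMD1")  # descriptive metadata stub
    ET.SubElement(mets, q("amdSec"), ID="AMD1")  # administrative metadata stub
    file_sec = ET.SubElement(mets, q("fileSec"))
    file_grp = ET.SubElement(file_sec, q("fileGrp"), USE="master")
    struct_map = ET.SubElement(mets, q("structMap"), TYPE="physical")
    book = ET.SubElement(struct_map, q("div"), TYPE="book")

    for order, href in enumerate(page_files, start=1):
        file_id = f"FILE{order:04d}"  # hypothetical ID scheme
        file_el = ET.SubElement(file_grp, q("file"), ID=file_id)
        ET.SubElement(file_el, q("FLocat"),
                      {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": href})
        # The structural map points back at the file through its ID.
        page = ET.SubElement(book, q("div"), TYPE="page", ORDER=str(order))
        ET.SubElement(page, q("fptr"), FILEID=file_id)
    return mets


if __name__ == "__main__":
    root = build_mets_skeleton(["page_0001.tif", "page_0002.tif"])
    print(ET.tostring(root, encoding="unicode"))
```

The point of the sketch is the cross-reference: each `<file>` in the file section gets an ID, and the structural map orders the pages by pointing at those IDs.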

I/O

• Submission Information Package (SIP), which is sent from the information producer to the archive;

• the Archival Information Package (AIP), which is the information package actually stored by the archive; and

• the Dissemination Information Package (DIP), which is the information package transferred from the archive in response to a request by a consumer.

Courtesy: Martin Kalfatovic, Program Director, Biodiversity Heritage Library, Smithsonian Institution Libraries

Open Archives Initiative Protocol for Metadata Harvesting

• The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a low-barrier mechanism for repository interoperability. Data Providers are repositories that expose structured metadata via OAI-PMH. Service Providers then make OAI-PMH service requests to harvest that metadata. OAI-PMH is a set of six verbs or services that are invoked within HTTP.

OAI Verbs

• GetRecord
• Identify
• ListIdentifiers
• ListMetadataFormats
• ListRecords
• ListSets
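Because each verb is just an HTTP query parameter, a request URL can be assembled with the standard library. The sketch below assumes the verb names from the OAI-PMH 2.0 specification (the record-retrieval verb is `GetRecord`); the helper name `oai_request_url` is hypothetical, and the function only builds the URL without sending a request.

```python
from urllib.parse import urlencode

OAI_VERBS = {
    "GetRecord",
    "Identify",
    "ListIdentifiers",
    "ListMetadataFormats",
    "ListRecords",
    "ListSets",
}


def oai_request_url(base_url: str, verb: str, **arguments: str) -> str:
    """Build an OAI-PMH request URL for one of the six verbs."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    # urlencode percent-encodes reserved characters such as ':' and '/'.
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


if __name__ == "__main__":
    print(oai_request_url("http://arXiv.org/oai2", "Identify"))
    print(oai_request_url(
        "http://arXiv.org/oai2",
        "GetRecord",
        identifier="oai:arXiv.org:cs/0112017",
        metadataPrefix="oai_dc",
    ))
```

The second call produces a percent-encoded form of the identifier; a server decodes it to the same `oai:arXiv.org:cs/0112017` value.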

GetRecord

• http://arXiv.org/oai2?verb=GetRecord&identifier=oai:arXiv.org:cs/0112017&metadataPrefix=oai_dc

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="GetRecord" identifier="oai:arXiv.org:cs/0112017" metadataPrefix="oai_dc">http://arXiv.org/oai2</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:arXiv.org:cs/0112017</identifier>
        <datestamp>2001-12-14</datestamp>
        <setSpec>cs</setSpec>
        <setSpec>math</setSpec>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/"
                   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                   xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
          <dc:title>Using Structural Metadata to Localize Experience of Digital Content</dc:title>
          <dc:creator>Dushay, Naomi</dc:creator>
          <dc:subject>Digital Libraries</dc:subject>
          <dc:description>With the increasing technical sophistication of both information consumers and providers, there is increasing demand for more meaningful experiences of digital information. We present a framework that separates digital object experience, or rendering, from digital object storage and manipulation, so the rendering can be tailored to particular communities of users.</dc:description>
          <dc:description>Comment: 23 pages including 2 appendices, 8 figures</dc:description>
          <dc:date>2001-12-14</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
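A harvester would typically pull a few fields out of a response like this one. The Python sketch below does so with `xml.etree.ElementTree` on an abridged copy of the response (descriptions and schema locations omitted); the function name `summarize_record` and the shape of the returned dictionary are illustrative choices, and error responses are not handled.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used in the find() paths below.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Abridged copy of the GetRecord response shown above.
SAMPLE_RESPONSE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header>
        <identifier>oai:arXiv.org:cs/0112017</identifier>
        <datestamp>2001-12-14</datestamp>
        <setSpec>cs</setSpec>
        <setSpec>math</setSpec>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Using Structural Metadata to Localize Experience of Digital Content</dc:title>
          <dc:creator>Dushay, Naomi</dc:creator>
          <dc:subject>Digital Libraries</dc:subject>
          <dc:date>2001-12-14</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
"""


def summarize_record(xml_text: str) -> dict:
    """Extract identifier, sets, title and creators from a GetRecord response."""
    root = ET.fromstring(xml_text)
    record = root.find("oai:GetRecord/oai:record", NS)
    header = record.find("oai:header", NS)
    dc = record.find("oai:metadata/oai_dc:dc", NS)
    return {
        "identifier": header.findtext("oai:identifier", namespaces=NS),
        "sets": [s.text for s in header.findall("oai:setSpec", NS)],
        "title": dc.findtext("dc:title", namespaces=NS),
        "creators": [c.text for c in dc.findall("dc:creator", NS)],
    }


if __name__ == "__main__":
    print(summarize_record(SAMPLE_RESPONSE))
```

Note that the header (identifier, datestamp, sets) belongs to OAI-PMH itself, while everything under `<metadata>` is Dublin Core, which is why two different namespaces appear in the lookup paths.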

Metadata Collection and Workflow (Macaw)

Physical/Digital Objects Sharing

• Books are both part of an edition and unique
• 20th century books have standard front matter
• LMS contained metadata only
• Journals indexed by article
• Most digital content is commercially owned and born digital
• In 2011, author-publishing exceeded commercial publishing
• Born-analog digitization (Google Books and BHL)

Governance

• Libraries pay for OCLC
• OCLC is participatory
• Close collaboration with Library of Congress on standards
• School system exists to train librarians
• Libraries are being cut in academic, public and school sectors