TECHNOLOGY SUPPORT FOR ESSSS

Progress, Issues, and Challenges

Marshall Breeding

Director for Innovative Technology and Research, Vanderbilt University Library

Founder and Publisher, Library Technology Guides

http://www.librarytechnology.org/

http://twitter.com/mbreeding

ESSSS Digital Archive Workshop

February 4, 2012

Turning Pages on Paper to Digital Images

Digitizing in the field involves many compromises compared to what can be done in more controlled settings

Access to archives may be of limited duration, and can be arbitrary and political

Materials deteriorating rapidly

Practices related to physical preservation tend to be minimal

Must be light, fast, and inexpensive

Achieve best results possible

Maximize quality and consistency

Handheld digital cameras: rapid advancement in capabilities; early images done at lower resolutions compared with what is possible today

Fixed camera stands: consistency in orientation and framing

Organization of images (folders / image names)
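A consistent folder and file naming scheme can be sketched as follows. This is a minimal illustration, not the project's actual convention: the function name `page_image_path`, the `volNNN` folder pattern, and the archive label are all hypothetical. The point it shows is that zero-padded volume and page numbers keep images in page order when sorted by name.

```python
from pathlib import PurePosixPath


def page_image_path(archive: str, volume: int, page: int, ext: str = "tif") -> PurePosixPath:
    """Build a predictable relative path for one page image.

    The layout (archive/volNNN/archive_volNNN_pNNNN.ext) is an assumed
    convention for illustration only.
    """
    if volume < 1 or page < 1:
        raise ValueError("volume and page numbers start at 1")
    folder = PurePosixPath(archive) / f"vol{volume:03d}"
    return folder / f"{archive}_vol{volume:03d}_p{page:04d}.{ext}"


# Zero padding means a plain alphabetical sort matches page order.
names = sorted(str(page_image_path("archive-a", 2, p)) for p in (10, 9, 100))
print(names)
```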

Image Standards

TIFF: Currently regarded as best image format for archiving images

RAW: Native proprietary format of a camera

JPEG: Compressed images for display on the Web

Data lost during compression: non-reversible

VU system creates multiple sizes of JPEG images

JPEG2000: offers lossless compression; not well supported on the Web
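Producing multiple JPEG sizes from one archival master starts with deciding the pixel dimensions of each derivative. The sketch below covers only that arithmetic. The size names and pixel limits in `DERIVATIVE_SIZES` are assumptions made for illustration, not the sizes the VU system uses, and the actual resizing and JPEG encoding would be done by an imaging library.

```python
def fit_within(width: int, height: int, max_side: int) -> tuple[int, int]:
    """Scale (width, height) so the longer side is at most max_side.

    Aspect ratio is preserved and images are never enlarged.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    # Integer arithmetic avoids floating-point rounding surprises.
    return (max(1, width * max_side // longest),
            max(1, height * max_side // longest))


# Hypothetical derivative sizes (longest side, in pixels).
DERIVATIVE_SIZES = {"thumbnail": 150, "medium": 800, "large": 1600}


def derivative_dimensions(width: int, height: int) -> dict[str, tuple[int, int]]:
    """Return the target dimensions of every derivative for one master image."""
    return {name: fit_within(width, height, side)
            for name, side in DERIVATIVE_SIZES.items()}


print(derivative_dimensions(4000, 3000))
```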

Bringing Images to the Web

Take advantage of infrastructure developed by the Vanderbilt University Library to manage images

Digital Library framework:

Presentation and functionality created in Perl-based interface

Data and metadata stored in MySQL relational tables

ODBC connectivity between presentation layer and MySQL

Microsoft Windows Server/IIS for Web server

Images reside on digital storage provided by the Vanderbilt University Library

Digital Preservation

Disaster Recovery: ability to restore files in the case of any hardware, software, or human error

Digital Preservation: commitment and processes in place to preserve digital information for the very long term

Multiple replications

Migration of data into future formats as current standards become obsolete
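Multiple replications are only useful if the copies stay identical. One common way to check this is to compare checksums. The slides do not describe such a check, so the sketch below is an assumption about how replicas might be verified, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_replicas(master: Path, replicas: list[Path]) -> list[Path]:
    """List replicas that are missing or differ from the master copy."""
    expected = sha256_of(master)
    return [replica for replica in replicas
            if not replica.is_file() or sha256_of(replica) != expected]
```

Any path returned by `mismatched_replicas` points at a copy that should be restored from a good replica.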

Building structure through Metadata

Metadata structure based on Dublin Core

Volume-level descriptive metadata

Courtney Campbell designed metadata structure and is analyzing volumes to populate metadata for each volume

EXIF Data extracted from images into the individual records for each page

Page-level structure: supports ability to select volumes and browse page images
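The two-level structure (volume-level descriptive metadata, page-level records carrying EXIF data) maps naturally onto two relational tables. The sketch below uses Python's built-in `sqlite3` as a stand-in for the MySQL tables. The table names, column names, and the choice of Dublin Core and EXIF fields are assumptions for illustration and do not reproduce the actual schema.

```python
import sqlite3

# Hypothetical schema: one row per volume, one row per page image.
SCHEMA = """
CREATE TABLE volume (
    id          INTEGER PRIMARY KEY,
    dc_title    TEXT NOT NULL,   -- Dublin Core: title
    dc_date     TEXT,            -- Dublin Core: date
    dc_language TEXT             -- Dublin Core: language
);
CREATE TABLE page (
    id            INTEGER PRIMARY KEY,
    volume_id     INTEGER NOT NULL REFERENCES volume(id),
    sequence      INTEGER NOT NULL,  -- page order within the volume
    image_file    TEXT NOT NULL,
    exif_datetime TEXT,              -- copied from the image's EXIF data
    exif_model    TEXT,
    UNIQUE (volume_id, sequence)
);
"""


def open_catalog() -> sqlite3.Connection:
    """Create an in-memory catalog with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_volume(conn, title, date=None, language=None) -> int:
    cur = conn.execute(
        "INSERT INTO volume (dc_title, dc_date, dc_language) VALUES (?, ?, ?)",
        (title, date, language))
    return cur.lastrowid


def add_page(conn, volume_id, sequence, image_file,
             exif_datetime=None, exif_model=None) -> None:
    conn.execute(
        "INSERT INTO page (volume_id, sequence, image_file, exif_datetime, exif_model)"
        " VALUES (?, ?, ?, ?, ?)",
        (volume_id, sequence, image_file, exif_datetime, exif_model))


def pages_for_volume(conn, volume_id) -> list[str]:
    """Return a volume's image files in page order, for browsing."""
    rows = conn.execute(
        "SELECT image_file FROM page WHERE volume_id = ? ORDER BY sequence",
        (volume_id,))
    return [row[0] for row in rows]
```

Selecting a volume and browsing its pages then reduces to one lookup in `volume` followed by `pages_for_volume`.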

Demonstration

Image management environment

Interface

Metadata

Page Images

Turning Pages into Data

The contents of the page images contain valuable data

Page images can be read by humans but do not support essential features: search, computer analysis, etc.

Full value of these collections can be realized through transcription

Challenges in transcription

Page characteristics:

Handwritten by many different hands

Many names and numbers

Spanish language

Varying contrast

Many defects: water damage, insects, etc.

Human transcription

Scholars who work with pages of interest can create transcriptions manually

Optical character recognition?

Highly accurate for typescript

Not effective for handwritten manuscripts

Crowdsourcing

Find ways to have large numbers of persons create transcript snippets

Google uses crowdsourcing to improve transcripts for the Google Books project.

Google ReCAPTCHA: “Digitizing books one word at a time”

Each transaction transcribes one or two words

Each word is transcribed many times

Results compared to determine correct version
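The idea of transcribing each word many times and comparing the results can be sketched as a simple majority vote. This is not ReCAPTCHA's actual algorithm. The function name `consensus` and the thresholds `min_votes` and `min_share` are assumptions chosen for illustration, and lowercasing the readings is a simplification that a real project handling proper names might not want.

```python
from collections import Counter
from typing import Optional


def consensus(readings: list[str], min_votes: int = 2,
              min_share: float = 0.5) -> Optional[str]:
    """Pick the most common reading of one word, or None if there is no clear winner.

    A reading wins only if it has at least min_votes votes and more than
    min_share of all non-empty readings.
    """
    normalized = [r.strip().lower() for r in readings if r.strip()]
    if not normalized:
        return None
    word, votes = Counter(normalized).most_common(1)[0]
    if votes >= min_votes and votes / len(normalized) > min_share:
        return word
    return None


print(consensus(["Maria", "maria ", "Marta"]))
```

Words that return `None` would stay in the queue to collect more readings.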

Crowdsourcing to Transcribe ESSSS

Scholars contribute any transcriptions created as they work with any given set of pages

Students assigned to create transcriptions: language, history, LIS

Collaboration with some organization with ReCAPTCHA-like infrastructure
