TECHNOLOGY SUPPORT FOR ESSSS
Progress, Issues, and Challenges
Marshall Breeding
Director for Innovative Technology and Research
Vanderbilt University Library
Founder and Publisher, Library Technology Guides
http://www.librarytechnology.org/
http://twitter.com/mbreeding
ESSSS Digital Archive Workshop
February 4, 2012
Turning Pages on Paper to Digital Images
Digitizing in the field involves many compromises compared to what can be done in more controlled settings
Access to archives may be of limited duration
  Arbitrary and political
Materials deteriorating rapidly
  Practices related to physical preservation tend to be minimal
Must be light, fast, and inexpensive
Achieve best results possible
  Maximize quality and consistency
Handheld digital cameras
  Rapid advancement in capabilities
  Early images done at lower resolutions compared with what is possible today
Fixed camera stands
  Consistency in orientation and framing
Organization of images (folders / image names)
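The folder and image naming noted above could follow a convention like the one sketched here. The layout (archive / volume / page), the zero-padding widths, and the name `page_image_path` are illustrative assumptions; the slides do not state the project's actual scheme.

```python
from pathlib import PurePosixPath


def page_image_path(archive: str, volume: int, page: int, ext: str = "tif") -> PurePosixPath:
    """Build a predictable relative path for one page image.

    Hypothetical convention: <archive>/vol_<NNNN>/<archive>_v<NNNN>_p<NNNN>.<ext>
    Zero-padding keeps files in page order when sorted by name.
    """
    if volume < 1 or page < 1:
        raise ValueError("volume and page numbers start at 1")
    folder = f"vol_{volume:04d}"
    name = f"{archive}_v{volume:04d}_p{page:04d}.{ext}"
    return PurePosixPath(archive) / folder / name


# "example" is a placeholder archive name.
print(page_image_path("example", 12, 7))
```

A fixed pattern of this kind lets the page order be recovered from file names alone, which helps when images are captured quickly in the field.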
Image Standards
TIFF: Currently regarded as best image format for archiving images
RAW: Native proprietary format of a camera
JPEG: Compressed images for display on the Web
  Data lost during compression: non-reversible
  VU system creates multiple sizes of JPEG images
JPEG2000
  Supports lossless compression
  Not well supported on the Web
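As a rough illustration of producing multiple JPEG sizes from one master image, the sketch below only computes the target pixel dimensions. The derivative names and sizes are assumptions (the slides do not list the sizes the VU system creates), and the actual resampling would be done by an imaging library.

```python
def fit_within(width: int, height: int, max_side: int) -> tuple[int, int]:
    """Scale (width, height) so the longer side equals max_side, keeping aspect ratio.

    Images whose longer side is already <= max_side are returned unchanged (no upscaling).
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    new_width = max(1, round(width * max_side / longest))
    new_height = max(1, round(height * max_side / longest))
    return new_width, new_height


# Hypothetical derivative sizes (longest side, in pixels).
DERIVATIVES = {"thumbnail": 160, "screen": 1024, "large": 2048}


def derivative_dimensions(width: int, height: int) -> dict[str, tuple[int, int]]:
    """Map each derivative name to the dimensions it would have for this master image."""
    return {name: fit_within(width, height, side) for name, side in DERIVATIVES.items()}


print(derivative_dimensions(4000, 3000))
```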
Bringing Images to the Web
Take advantage of infrastructure developed by the Vanderbilt University Library to manage images
Digital Library framework:
  Presentation and functionality created in Perl-based interface
  Data and metadata stored in MySQL relational tables
  ODBC connectivity between presentation layer and MySQL
  Microsoft Windows Server/IIS for Web server
  Images reside on digital storage provided by the Vanderbilt University Library
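To make the "relational tables" idea concrete, here is a minimal sketch of a volume table and a page table. The table and column names are hypothetical, not the project's schema, and in-memory SQLite stands in for MySQL only so the example runs without a database server.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; names below are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE volume (
        volume_id INTEGER PRIMARY KEY,
        title     TEXT NOT NULL
    );
    CREATE TABLE page (
        page_id    INTEGER PRIMARY KEY,
        volume_id  INTEGER NOT NULL REFERENCES volume(volume_id),
        sequence   INTEGER NOT NULL,
        image_file TEXT NOT NULL,
        UNIQUE (volume_id, sequence)
    );
    """
)
conn.execute("INSERT INTO volume (volume_id, title) VALUES (1, 'Example volume')")
conn.executemany(
    "INSERT INTO page (volume_id, sequence, image_file) VALUES (?, ?, ?)",
    [(1, 2, "p0002.jpg"), (1, 1, "p0001.jpg")],
)


def pages_in_order(connection: sqlite3.Connection, volume_id: int) -> list[str]:
    """Return image file names for one volume in page order (for browsing)."""
    rows = connection.execute(
        "SELECT image_file FROM page WHERE volume_id = ? ORDER BY sequence",
        (volume_id,),
    )
    return [row[0] for row in rows]


print(pages_in_order(conn, 1))
```

Keeping an explicit sequence number per page means browsing order does not depend on the order in which images were loaded.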
Digital Preservation
Disaster Recovery
  Ability to restore files in the case of any hardware, software, or human error
Digital Preservation
  Commitment and processes in place to preserve digital information for the very long term
  Multiple replications
  Migration of data into future formats as current standards become obsolete
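One common way to confirm that multiple replications still match is to compare checksums. The sketch below is an assumption about how such a check could look; the slides do not say which verification method, if any, the project uses. The function names are illustrative.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large image files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def damaged_copies(master: Path, replicas: list[Path]) -> list[Path]:
    """Return replicas that are missing or whose checksum differs from the master copy."""
    expected = sha256_of(master)
    return [r for r in replicas if not r.is_file() or sha256_of(r) != expected]


# Demonstration with throwaway files standing in for image replicas.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    master = root / "master.tif"
    good = root / "copy_a.tif"
    bad = root / "copy_b.tif"
    master.write_bytes(b"page image bytes")
    good.write_bytes(b"page image bytes")
    bad.write_bytes(b"page image bytez")
    print([p.name for p in damaged_copies(master, [good, bad])])
```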
Building structure through Metadata
Metadata structure based on Dublin Core
Volume-level descriptive metadata
  Courtney Campbell designed metadata structure and is analyzing volumes to populate metadata for each volume
EXIF data extracted from images into the individual records for each page
Page-level structure
  Supports ability to select volumes and browse page images
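For readers unfamiliar with Dublin Core, a volume-level record might be serialized roughly as below. Only the `dc:` element names and namespace come from the Dublin Core element set; the `<record>` wrapper, the function name, and every sample value are placeholders, not the structure designed for ESSSS.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def volume_record(fields: dict[str, str]) -> ET.Element:
    """Wrap simple Dublin Core elements in a <record> container (hypothetical wrapper)."""
    record = ET.Element("record")
    for name, value in fields.items():
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return record


# Placeholder values for illustration only.
sample = volume_record(
    {
        "title": "Example volume title",
        "language": "es",
        "type": "Text",
        "identifier": "example-vol-0012",
    }
)
print(ET.tostring(sample, encoding="unicode"))
```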
Demonstration
Image management environment
Interface
Metadata
Page Images
Turning Pages into Data
The contents of the page images contain valuable data
Page images can be read by humans but do not support essential features: search, computer analysis, etc.
Full value of these collections can be realized through transcription
Challenges in transcription
Page characteristics
  Handwritten by many different hands
  Many names and numbers
  Spanish language
  Varying contrast
  Many defects: water damage, insects, etc.
Human transcription
Scholars who work with pages of interest can create transcriptions manually
Optical character recognition?
  Highly accurate for typescript
  Not effective for handwritten manuscripts
Crowdsourcing
Find ways to have large numbers of persons create transcript snippets
Google uses crowdsourcing to improve transcripts for the Google Books project.
Google ReCAPTCHA:
  “Digitizing books one word at a time”
  Each transaction transcribes one or two words
  Each word is transcribed many times
  Results compared to determine correct version
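The idea of comparing many transcriptions of the same word can be sketched as a simple majority vote. This is not a description of how ReCAPTCHA resolves answers internally; the normalization rule, the 60% agreement threshold, and the sample names are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Optional


def normalize(text: str) -> str:
    """Ignore case and stray whitespace so trivially different entries agree."""
    return " ".join(text.strip().lower().split())


def consensus(transcriptions: list[str], min_agreement: float = 0.6) -> Optional[str]:
    """Return the most common normalized reading if enough contributors agree, else None."""
    if not transcriptions:
        return None
    counts = Counter(normalize(t) for t in transcriptions)
    reading, votes = counts.most_common(1)[0]
    return reading if votes / len(transcriptions) >= min_agreement else None


print(consensus(["Maria", "maria ", "Marta", "Maria"]))  # three of four agree
print(consensus(["Juan", "Joan"]))  # no clear majority, returns None
```

Snippets with no consensus could be routed to more contributors or to a scholar for review.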
Google ReCAPTCHA
Crowdsourcing to Transcribe ESSSS
Scholars contribute any transcriptions created as they work with any given set of pages
Students assigned to create transcriptions
  Language, history, LIS
Collaboration with some organization with ReCAPTCHA-like infrastructure