Down and Dirty Digitization: Everything you need to know about putting content online
Roy Tennant, California Digital Library
Outline
Project Planning
Selecting Material to Digitize
Digitization Purpose
Basic Imaging Principles
Capturing Images
Editing Images
Best Practices
Conversion to Text
Metadata
Access Systems
Skills Required of Staff
Preservation
Project Planning
Who will do the work?
What systems will be required?
What are the specifications for images and metadata?
How much will the project cost?
Who will own and manage the digital products that will be produced?
Steve Chapman, from Handbook for Digital Projects, NEDCC
Selecting Material to Digitize
Publishing rights
Available support/funding opportunity
Critical mass
Uniqueness
Reputation
Audience and potential use
Diversity of material type
Ability to stand on its own and fit in with other collections
What Do We Preserve?
The body or the soul?
The artifact
The intellectual content
How do we decide that the artifact has preservation value?
Who decides?
The Artifact
The “look and feel”
The experience of interacting with a specific object
Consequences:
  Choices for providing access are limited
  Time and money spent on recreating the artifact may be better spent on increasing access
  In some cases, preserving the look and feel actually harms other uses
Written Material
Handwritten texts (diaries, etc.), or those with handwritten notations (manuscript drafts, etc.), can easily be considered to have artifactual value
But how much artifactual value do printed texts have?
And born-digital texts?
What’s it worth to you?
“If the goal of preservation is persistent utility, then functionality rather than aesthetics should drive system design.”
- Stephen Chapman, “Content Follows Form: Preservation via Systems Design,” Microform & Imaging Review
Persistent Utility
Form must be allowed to be altered or destroyed to retain or enhance function
If function cannot be retained or enhanced, then form should be preserved
Considerations for Retaining Items in Original Format
Age
Evidential value
Aesthetic value
Scarcity
Associational value
Market value
Exhibition value
“The issue is not to evaluate the artifact per se to determine what survives and what does not…The issue is the need to agree on a method for interrogating the individual artifact, that would, in a climate of finite resources, help make a good decision about whether and how to preserve it.”
— Council on Library and Information Resources, The Evidence in Hand: the Report of the Task Force on the Artifact in Library Collections
How Do We Preserve It?
[Bar chart: preservation cost by method, vertical axis from $0 to $2,000. Methods shown: Bind/Box, Deacidify, Microfilm, Digitize Simple Book, Digitize Complex Book, Conserve]
Preservation costs by method calculated by the Library of Congress Preservation Directorate
Types of Materials
Printed text/simple line art
Manuscripts
Halftones
Continuous Tone
Mixed
From Anne Kenney, et al., Moving Theory into Practice
Benchmarking
The process whereby you determine your digitization requirements using the material you will digitize
Resolution
One pixel
The number of pixels in a given area defines the resolution of an image
1”
500 x 1,000 pixels
Dynamic Range (bit-depth)
1 bit 8 bit grayscale 8 bit color 24 bit color (GIF) (GIF) (JPEG)
1 bit = black or white
8 bits = 256 shades
16 bits = thousands
24 bits = millions
36 bits = billions
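The counts above follow from the fact that n bits can hold 2^n distinct values. A minimal sketch of that arithmetic, where the function name `color_count` is made up for illustration:

```python
def color_count(bits_per_pixel: int) -> int:
    """Return how many distinct values a pixel of the given bit depth can hold."""
    return 2 ** bits_per_pixel


# Print the exact counts behind "256 shades", "thousands", "millions", "billions".
for bits in (1, 8, 16, 24, 36):
    print(f"{bits:>2} bits = {color_count(bits):,} values")
```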
RGB Color Space
Red
Green
Blue
8 bits per channel = 24 bit color image
12 bits per channel = 36 bit color image
Color Channels
Image Compression
Lossless: the image is unchanged after compression (no image data is lost)
  Typical file size: 50% of original
  Example: LZW compression
Lossy: the image is altered after compression (image data is lost)
  Example: JPEG
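What "lossless" means can be shown with a round trip: compress, decompress, and compare. The sketch below uses Python's standard `zlib` module, which implements DEFLATE rather than the LZW scheme named above, and the byte pattern is an invented stand-in for real image data. The size reduction it prints depends entirely on that pattern and says nothing about the 50% typical figure.

```python
import zlib

# Stand-in for raw image data (an assumption, not a real scan):
# 50 rows of 900 white pixels followed by 100 black pixels, 8-bit grayscale.
raw = bytes([255] * 900 + [0] * 100) * 50

compressed = zlib.compress(raw)
restored = zlib.decompress(compressed)

print(len(raw), "bytes before,", len(compressed), "bytes after")
# Lossless: every byte comes back exactly as it went in.
print("identical after round trip:", restored == raw)
```

A lossy scheme such as JPEG would fail the final comparison by design, which is why master versions are usually kept in a lossless or uncompressed form.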
TIFF
Tagged Image File Format
Most often used to save “master versions” of images (unedited)
Can be compressed or uncompressed
CompuServe GIF
Graphics Interchange Format (GIF)
Maximum 8 bits/pixel: 256 colors (shades)
Good for:
  Text and line art
  Thumbnails
Not good for:
  Full-color pictures
  Anything that requires more than 256 colors
JPEG
Joint Photographic Experts Group
JPEG is actually a compression scheme; the image file format is JFIF (JPEG File Interchange Format)
Good for:
  Full-color pictures
  Anything that requires more than 256 colors
Not good for:
  Text or line art
New Image Formats
Portable Network Graphics (PNG) - from the W3C, to replace the CompuServe GIF format and provide more capabilities
JPEG2000 - an upgrade of the JPEG format
Flashpix - from a consortium of commercial companies, to provide much higher-resolution images in a way that allows speedy network delivery
MrSID - from LizardTech, good for large format materials (maps, panoramic photos, etc.)
Capturing Images
Technologies
  Digital Cameras
  Flatbed Scanners
  Film Scanners
  Kodak PhotoCD
Outsourcing
Standards and Best Practices
Digital Cameras
BetterLight Super6K: 6,000 x 8,000 pixels, 136MB (24 bit RGB), $16,990
Phase One PowerPhase FX: 10,500 x 12,600 pixels, 760MB (48 bit RGB)
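File sizes like these can be estimated from pixel dimensions and bit depth: width x height x bits per pixel, divided by 8 to get bytes. A rough sketch follows, assuming no compression and no file-header overhead. The function names are made up for illustration, and a megabyte is taken as 2^20 bytes. With the camera figures above, the results come out close to the quoted 136MB and 760MB.

```python
def uncompressed_size_bytes(width_px: int, height_px: int, bits_per_pixel: int) -> int:
    """Raw image size in bytes, ignoring headers and compression."""
    return width_px * height_px * bits_per_pixel // 8


def to_megabytes(size_bytes: int) -> float:
    """Convert bytes to megabytes, taking 1 MB as 2**20 bytes (an assumption)."""
    return size_bytes / 2**20


# Pixel dimensions and bit depths taken from the two cameras listed above.
for label, width, height, bits in (
    ("BetterLight Super6K", 6000, 8000, 24),
    ("Phase One PowerPhase FX", 10500, 12600, 48),
):
    size = uncompressed_size_bytes(width, height, bits)
    print(f"{label}: {to_megabytes(size):.1f} MB")
```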
Flatbed Scanners
Minimum requirements:
  600 x 1200 dpi optical resolution
  36-bit color
Not for slides or transparencies; best for 8 1/2” x 11” or 8 1/2” x 14” originals
Sheet feeder (often optional) helpful for digitizing text
Film Scanners
For 35mm slides and negatives; others available for larger formats
$600 - $3,000
Most around 2700-4000 dpi, 30-36 bit color
Kodak PhotoCD
Take pictures with a normal camera, but have your pictures “developed” onto a PhotoCD
A proprietary image format: ImagePAC, but very high resolution (4 different resolutions)
Outsourcing: Pros and Cons
Benefits:
  No ramp-up costs (both time and money)
  Probably higher quality, at least to begin with
  High volume capability
Drawbacks:
  May be more costly if you have underutilized staff time
  No internal capability or experience developed (that is, when the money runs out, so does your chance to do anything more)
  Rare items may require in-house digitization
Outsourcing: How
Write an RFQ (Request for Quote) outlining:
  Type and amount of material being digitized
  Quality requirements
  Volume per unit of time requirements
For RFQ guidance and samples, see RLG Tools for Digital Imaging: www.rlg.org/preserv/RLGtools.html
Digital Image Work Flow
[Workflow diagram: original TIFF or PCD (10-100+MB, RGB color space, stored offline); rotate, crop, retouch, adjust brightness/contrast; resize, sharpen; JPEG (100K, RGB color space) and GIF (10K, indexed color space), stored online]
Editing Images
Rotating
Cropping
Retouching
Adjusting
Resizing
Sharpening
Saving
Image Editing Demonstration
Conversion to Text
Optical Character Recognition (OCR) software is required (Caere OmniPage Pro, Xerox TextBridge, etc.)
Quality and typography of originals is key
When OCR accuracy is below 99.5%, it is less expensive to have the text re-keyed offshore
For some applications, uncorrected text is sufficient
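One way to apply the 99.5% rule of thumb is to proof a sample page by hand and compare it with the OCR output. The slides do not say how accuracy is measured. The sketch below assumes character-level accuracy based on edit distance, and every name in it (`edit_distance`, `character_accuracy`, `should_rekey`, `REKEY_THRESHOLD`) is made up for illustration.

```python
REKEY_THRESHOLD = 0.995  # the 99.5% figure from the slide


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Share of reference characters the OCR got right (assumed definition)."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1 - errors / len(reference))


def should_rekey(reference: str, ocr_output: str) -> bool:
    """True when sample accuracy falls below the re-keying threshold."""
    return character_accuracy(reference, ocr_output) < REKEY_THRESHOLD


proofed = "Down and Dirty Digitization"
scanned = "Down and Dirty Digitizaton"  # one dropped character
print(f"{character_accuracy(proofed, scanned):.3f}", should_rekey(proofed, scanned))
```

A sample this short is only for demonstration; a real check would use a full proofed page or more.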
Imaging Best Practices
General guidelines for archival versions:
  Photos, illustrations, maps, etc.: 300-600dpi, 24-36 bit color
  B/W text document: 300-600dpi, 8 bit grayscale
  Negatives and slides: 2000-4000 pixels in longest dimension; 24-36 bit color for color, 8 bit grayscale for B/W
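These guidelines mix two units: dpi for reflective originals and total pixels for film. A small sketch of converting between them, with hypothetical function names. The 35mm frame size of roughly 1.42 x 0.94 inches used in the second call is an assumption, not a figure from the slides.

```python
def scan_dimensions(width_in: float, height_in: float, dpi: int) -> tuple:
    """Pixel dimensions produced by scanning an original at a given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def dpi_for_longest_side(width_in: float, height_in: float, target_pixels: int) -> float:
    """Scanning resolution needed to reach a target pixel count on the longest side."""
    return target_pixels / max(width_in, height_in)


# A letter-size page at the low end of the text guideline.
print(scan_dimensions(8.5, 11, 300))

# A 35mm frame (assumed size) scanned to 4000 pixels on its longest side.
print(round(dpi_for_longest_side(1.42, 0.94, 4000)))
```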
Imaging Best Practices
“The key to image quality is not to capture at the highest resolution or bit depth possible, but to match the conversion process to the informational content of the original, and to scan at that level--no more, no less.” — Moving Theory Into Practice
Metadata: Types
Structured description of an object or collection of objects
Three basic types:
  descriptive - e.g., title, creator, subject - used for discovery
  administrative - e.g., resolution, bit depth - used for managing the collection
  structural - e.g., table of contents page, page 34, etc. - used for navigation
Metadata: Appropriate Level
Collection-level access:
  Discovery metadata describes the collection
  Example: Archival finding aid encoded in SGML; see http://www.oac.cdlib.org/
Item-level access:
  Discovery metadata describes the item
  Example: individual metadata records for each item; see http://jarda.cdlib.org/cgi-bin/imagesearch.pl
Collection Level Access
[Diagram labels: Search Interface (library catalog or dedicated); Individual Finding Aid; Images]
Item Level Access
[Diagram labels: Search Interface (dedicated); Individual Finding Aid; Images; Finding Aids]
jarda.cdlib.org/search.html
Metadata: Granularity
<name>William Randolph Hearst</name>
<name>
  <first>William</first><middle>Randolph</middle><last>Hearst</last>
</name>
Consider all uses for the metadata
Design for the most granular use
Store it in a machine-parseable format
Metadata: Qualification
<name role="creator">William Randolph Hearst</name>
<subject scheme="LCSH">Builder -- Castles -- Southern California</subject>
Metadata: Machine Parseability
The ability to pull apart and reconstruct metadata via software
For example, this:
<name>
  <first>William</first><middle>Randolph</middle><last>Hearst</last>
</name>
Can easily become this:
<DC.creator>Hearst, William Randolph</DC.creator>
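The conversion from the granular name to the Dublin Core form can be done in a few lines of software, which is the point of machine parseability. A minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `to_dc_creator` and the "Last, First Middle" ordering rule are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

granular = (
    "<name><first>William</first><middle>Randolph</middle>"
    "<last>Hearst</last></name>"
)


def to_dc_creator(name_xml: str) -> str:
    """Rebuild a granular <name> element as a single DC.creator element."""
    name = ET.fromstring(name_xml)
    first = name.findtext("first", default="").strip()
    middle = name.findtext("middle", default="").strip()
    last = name.findtext("last", default="").strip()

    # Assumed display rule: "Last, First Middle", skipping any missing parts.
    given = " ".join(part for part in (first, middle) if part)
    display = ", ".join(part for part in (last, given) if part)

    creator = ET.Element("DC.creator")
    creator.text = display
    return ET.tostring(creator, encoding="unicode")


print(to_dc_creator(granular))
```

Going the other way, from "Hearst, William Randolph" back to separate first, middle, and last elements, takes guesswork, which is why the granular form is the better one to store.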
Metadata: Standards
Collection Level:
  Encoded Archival Description (EAD) - lcweb.loc.gov/ead/
Item Level:
  MARC
  Dublin Core - purl.org/DC/
  MODS - www.loc.gov/standards/mods/
Harvesting:
  Open Archives Initiative - www.openarchives.org
Access Systems
Exhibit
Browse
Search
Access Systems: Exhibit
Goals:
  Inviting
  Easy to navigate
  Highlight selected parts of a collection
  Teach
Requirements:
  Great graphic design
  Informative and succinct commentary
  Interesting subject matter
Access Systems: Browse
Goals:
  Provide intriguing and interesting paths into and throughout a collection
  Give a broad sense of a collection, but not necessarily show everything
Requirements:
  Logical browse paths
  May have multiple paths to the same items (e.g., time, geography, subject)
Access Systems: Search
Goals:
  To provide post-coordinate access to all items in a collection relevant to a particular query
  To provide good methods to create a search as well as refine or alter the display as required
Requirements:
  Good search software (database or indexing software)
  Good metadata (minimum is probably a title or caption for each item)
  Good interface (options for navigation, search refinement, etc.)
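Post-coordinate access means terms are combined at search time rather than when the items are cataloged. A toy sketch of the idea, built as an inverted index over captions only (the minimum metadata mentioned above). The item IDs and captions are invented for illustration, and real search software does far more, such as ranking, stemming, and fielded search.

```python
import re
from collections import defaultdict

# Hypothetical item records: ID mapped to a caption.
items = {
    "img001": "Hearst Castle construction, 1925",
    "img002": "William Randolph Hearst portrait",
    "img003": "Castle gardens in Southern California",
}


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records: dict) -> dict:
    """Map each term to the set of item IDs whose caption contains it."""
    index = defaultdict(set)
    for item_id, caption in records.items():
        for term in tokenize(caption):
            index[term].add(item_id)
    return index


def search(index: dict, query: str) -> list:
    """Return IDs of items matching every query term (terms are ANDed)."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set(index.get(terms[0], set()))
    for term in terms[1:]:
        matches &= index.get(term, set())
    return sorted(matches)


index = build_index(items)
print(search(index, "castle"))
print(search(index, "hearst castle"))
```

Adding a second term narrows the result set, which is the simplest form of the search refinement listed under requirements.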
Skills Required of Staff
Imaging
OCR
Markup languages (HTML, XML)
Cataloging & metadata
Indexing and database technology
User interface design
Programming
Web technology
Project management
How Does Digital Data Die?
Let me count the ways…
New replaces old
Death of a sponsor
Sponsor loses interest
Lost functionality
Format rot
Media format obsolescence
Content format obsolescence
Disaster
Preserving Digital Content
No preservation format
Digital preservation techniques:
  Print (on acid free paper!)
  Store
  Refresh
  Encapsulate
  Emulate
  Proliferate (Lots Of Copies Keep Stuff Safe, or LOCKSS)
Preserving Digital Content
Institutional commitment
Consortial agreements
Cooperatively funded central repositories
Preservation Open Market
The Best Defense
What will ensure that material will not be preserved?
  Ignorance of its existence
  Ignorance of its worth
  Inability or unwillingness to pay for its preservation
Access helps with all of these problems