Down and Dirty Digitization: Everything you need to know about putting content online

Post on 06-Jan-2016

DESCRIPTION

Down and Dirty Digitization: Everything you need to know about putting content online. Roy Tennant, California Digital Library. Outline: Project Planning; Selecting Material to Digitize; Digitization Purpose; Basic Imaging Principles; Capturing Images; Editing Images; Best Practices.

TRANSCRIPT

Down and Dirty Digitization: Everything you need to know about putting content online

Roy Tennant
California Digital Library

Outline

Project Planning
Selecting Material to Digitize
Basic Imaging Principles
Capturing Images
Editing Images
Best Practices
Conversion to Text
Metadata
Access Systems
Skills Required of Staff
Preservation

Project Planning

Who will do the work?
What systems will be required?
What are the specifications for images and metadata?
How much will the project cost?
Who will own and manage the digital products that will be produced?

Steve Chapman, from Handbook for Digital Projects, NEDCC

Selecting Material to Digitize

Publishing rights
Available support/funding opportunity
Critical mass
Uniqueness
Reputation
Audience and potential use
Diversity of material type
Ability to stand on its own and fit in with other collections

What Do We Preserve?

The body or the soul?
  The artifact
  The intellectual content

How do we decide that the artifact has preservation value?

Who decides?

The Artifact

The “look and feel”
The experience of interacting with a specific object
Consequences:
  Choices for providing access are limited
  Time and money spent on recreating the artifact may be better spent on increasing access
  In some cases, preserving the look and feel actually harms other uses

Written Material

Handwritten texts (diaries, etc.), or those with handwritten notations (manuscript drafts, etc.), can easily be considered to have artifactual value
But how much artifactual value do printed texts have?
And born-digital texts?

What’s it worth to you?

“If the goal of preservation is persistent utility, then functionality rather than aesthetics should drive system design.”

- Stephen Chapman, “Content Follows Form: Preservation via Systems Design,” Microform & Imaging Review

Persistent Utility

Form must be allowed to be altered or destroyed to retain or enhance function
If function cannot be retained or enhanced, then form should be preserved

Considerations for Retaining Items in Original Format

Age
Evidential value
Aesthetic value
Scarcity
Associational value
Market value
Exhibition value

“The issue is not to evaluate the artifact per se to determine what survives and what does not…The issue is the need to agree on a method for interrogating the individual artifact, that would, in a climate of finite resources, help make a good decision about whether and how to preserve it.”

— Council on Library and Information Resources, The Evidence in Hand: the Report of the Task Force on the Artifact in Library Collections

How Do We Preserve It?

[Chart: preservation costs by method, on a scale from $0 to $2,000. Methods compared: Bind/Box, Deacidify, Microfilm, Digitize Simple Book, Digitize Complex Book, Conserve.]

Preservation costs by method calculated by the Library of Congress Preservation Directorate

Types of Materials

Printed text/simple line art
Manuscripts
Halftones
Continuous tone
Mixed

From Anne Kenney, et al., Moving Theory into Practice

Benchmarking

The process whereby you determine your digitization requirements using the material you will digitize

Resolution

The number of pixels in a given area defines the resolution of an image

[Figure labels: one pixel; 1”; 500 x 1,000 pixels]
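As a rough illustration of how resolution translates into pixel counts, the sketch below multiplies the physical size of an original by the scanning resolution. The function name `pixel_dimensions` and the 8.5 x 11 inch page are assumptions chosen for the example, not values taken from the slides.

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Return (width, height) in pixels for an original scanned at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


# Hypothetical example: a letter-size page scanned at 300 dpi.
letter_300 = pixel_dimensions(8.5, 11, 300)
print(letter_300)  # (2550, 3300)
```

Doubling the resolution doubles each dimension, so the total pixel count grows fourfold.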

Dynamic Range (bit-depth)

[Sample images: 1 bit; 8 bit grayscale; 8 bit color; 24 bit color. Format labels shown: GIF, GIF, JPEG]

1 bit = black or white
8 bits = 256 shades
16 bits = thousands
24 bits = millions
36 bits = billions
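The counts above follow from the fact that n bits can represent 2 to the power n distinct values. A minimal sketch, with `tonal_levels` as a name invented for this example:

```python
def tonal_levels(bits_per_pixel):
    """Number of distinct values that a pixel of the given bit depth can hold."""
    return 2 ** bits_per_pixel


for bits in (1, 8, 16, 24, 36):
    print(f"{bits:>2} bits = {tonal_levels(bits):,} values")
# 16 bits -> 65,536; 24 bits -> 16,777,216; 36 bits -> 68,719,476,736
```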

RGB Color Space

Red

Green

Blue

8 bits per channel = 24 bit color image

12 bits per channel = 36 bit color image

Color Channels

Image Compression

Lossless: the image is unchanged after compression (no image data is lost)
  Typical file size: 50% of original
  Example: LZW compression
Lossy: the image is altered after compression (image data is lost)
  Example: JPEG
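The defining property of lossless compression is that decompressing gives back exactly the original bytes. The sketch below shows that round trip using Python's standard `zlib` module, which implements DEFLATE rather than LZW; it is a stand-in chosen because it ships with Python. The sample data is synthetic, so the compression ratio it prints should not be read as typical.

```python
import zlib

# Stand-in for raw image data: 100 rows of a 256-level gray ramp (synthetic).
raw = bytes(range(256)) * 100

compressed = zlib.compress(raw)
restored = zlib.decompress(compressed)

print(len(raw), len(compressed))  # original size, compressed size
print(restored == raw)  # True: no data was lost
```

A lossy scheme such as JPEG would fail the final equality check, because the decoded pixels only approximate the originals.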

TIFF

Tagged Image File Format
Most often used to save “master versions” of images (unedited)
Can be compressed or uncompressed

CompuServe GIF

Graphics Interchange Format (GIF)
Maximum 8 bits/pixel: 256 colors (shades)
Good for:
  Text and line art
  Thumbnails
Not good for:
  Full-color pictures
  Anything that requires more than 256 colors

JPEG

Joint Photographic Experts Group
JPEG is actually a compression scheme; the image file format is JFIF (JPEG File Interchange Format)
Good for:
  Full-color pictures
  Anything that requires more than 256 colors
Not good for:
  Text or line art

New Image Formats

Portable Network Graphics (PNG) - from the W3C, to replace the CompuServe GIF format and provide more capabilities
JPEG2000 - an upgrade of the JPEG format
Flashpix - from a consortium of commercial companies, to provide much higher-resolution images in a way that allows speedy network delivery
MrSID - from LizardTech, good for large-format materials (maps, panoramic photos, etc.)

Capturing Images

Technologies
  Digital Cameras
  Flatbed Scanners
  Film Scanners
  Kodak PhotoCD
Outsourcing
Standards and Best Practices

Digital Cameras

BetterLight Super6K
  6,000 x 8,000 pixels, 136MB (24 bit RGB)
  $16,990
Phase One PowerPhase FX
  10,500 x 12,600 pixels, 760MB (48 bit RGB)
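The file sizes quoted for these cameras can be approximated from pixel dimensions and bit depth alone. The sketch below assumes uncompressed storage with no file-format overhead; `uncompressed_bytes` is a name invented for the example. The results land close to, but not exactly on, the figures above, presumably because of rounding in the quoted specifications.

```python
def uncompressed_bytes(width_px, height_px, bits_per_pixel):
    """Raw image size in bytes, ignoring headers and compression."""
    return width_px * height_px * bits_per_pixel // 8


MIB = 1024 * 1024

super6k = uncompressed_bytes(6000, 8000, 24)
powerphase = uncompressed_bytes(10500, 12600, 48)

print(f"{super6k / MIB:.0f} MiB")     # 137 MiB
print(f"{powerphase / MIB:.0f} MiB")  # 757 MiB
```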

Flatbed Scanners

Minimum requirements:
  600 x 1200 dpi optical resolution
  36-bit color
Not for slides or transparencies, best for 8 1/2” x 11” or 8 1/2” x 14” originals
Sheet feeder (often optional) helpful for digitizing text

Film Scanners

For 35mm slides and negatives; others available for larger formats
$600 - $3,000
Most around 2700-4000 dpi, 30-36 bit color

Kodak PhotoCD

Take pictures with a normal camera, but have your pictures “developed” onto a PhotoCD
Uses a proprietary image format (ImagePAC), but offers very high resolution (4 different resolutions)

Outsourcing: Pros and Cons

Benefits:
  No ramp-up costs (both time and money)
  Probably higher quality, at least to begin with
  High volume capability
Drawbacks:
  May be more costly if you have underutilized staff time
  No internal capability or experience developed (that is, when the money runs out, so does your chance to do anything more)
  Rare items may require in-house digitization

Outsourcing: How

Write an RFQ (Request for Quote) outlining:
  Type and amount of material being digitized
  Quality requirements
  Volume per unit of time requirements

For RFQ guidance and samples, see RLG Tools for Digital Imaging: www.rlg.org/preserv/RLGtools.html

Digital Image Work Flow

[Workflow diagram. Files: original TIFF or PCD (10-100+MB); JPEG (100K); GIF (10K). Color spaces: RGB; indexed. Editing steps: Rotate, Crop, Retouch, Brightness/Contrast; Resize, Sharpen. Storage: offline; online.]

Editing Images

Rotating
Cropping
Retouching
Adjusting
Resizing
Sharpening
Saving

Image Editing Demonstration

Conversion to Text

Optical Character Recognition (OCR) software is required (Caere OmniPage Pro, Xerox TextBridge, etc.)
Quality and typography of originals is key
If OCR accuracy is below 99.5%, it is less expensive to have the text re-keyed offshore
For some applications, uncorrected text is sufficient
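To get a feel for what an accuracy figure such as 99.5% means in practice, the sketch below estimates the number of wrong characters on a page. It assumes the percentage is measured per character, and the 2,000 characters per page is an invented figure used only for illustration; `expected_errors` is likewise a hypothetical helper.

```python
def expected_errors(characters, accuracy):
    """Expected number of wrong characters at a given per-character accuracy."""
    return characters * (1 - accuracy)


# Assumption: roughly 2,000 characters on a page of text.
print(round(expected_errors(2000, 0.995)))  # 10
```

Each of those errors has to be found and corrected by hand, which is why re-keying can become the cheaper option as accuracy drops.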

Imaging Best Practices

General guidelines for archival versions:
  Photos, illustrations, maps, etc.: 300-600 dpi, 24-36 bit color
  B/W text document: 300-600 dpi, 8 bit grayscale
  Negatives and slides: 2000-4000 pixels in longest dimension; 24-36 bit color for color, 8 bit grayscale for B/W

Imaging Best Practices

“The key to image quality is not to capture at the highest resolution or bit depth possible, but to match the conversion process to the informational content of the original, and to scan at that level--no more, no less.” — Moving Theory Into Practice

Metadata: Types

Structured description of an object or collection of objects

Three basic types:
  descriptive - e.g., title, creator, subject - used for discovery
  administrative - e.g., resolution, bit depth - used for managing the collection
  structural - e.g., table of contents page, page 34, etc. - used for navigation

Metadata: Appropriate Level

Collection-level access:
  Discovery metadata describes the collection
  Example: Archival finding aid encoded in SGML; see http://www.oac.cdlib.org/
Item-level access:
  Discovery metadata describes the item
  Example: individual metadata records for each item; see http://jarda.cdlib.org/cgi-bin/imagesearch.pl

Collection Level Access

[Diagram labels: Search Interface (library catalog or dedicated); Individual Finding Aid; Images]

Item Level Access

[Diagram labels: Search Interface (dedicated); Individual Finding Aid; Finding Aids; Images]

jarda.cdlib.org/search.html

Metadata: Granularity

<name>William Randolph Hearst</name>

<name>
  <first>William</first>
  <middle>Randolph</middle>
  <last>Hearst</last>
</name>

Consider all uses for the metadata
Design for the most granular use
Store it in a machine-parseable format

Metadata: Qualification

<name role="creator">William Randolph Hearst</name>
<subject scheme="LCSH">Builder -- Castles -- Southern California</subject>

Metadata: Machine Parseability

The ability to pull apart and reconstruct metadata via software

For example, this:

<name>
  <first>William</first>
  <middle>Randolph</middle>
  <last>Hearst</last>
</name>

Can easily become this:

<DC.creator>Hearst, William Randolph</DC.creator>
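The transformation above can be sketched in a few lines with Python's standard `xml.etree.ElementTree` module. The function name `to_dc_creator` is invented for this example, and the element names are taken from the slide rather than from any formal schema.

```python
import xml.etree.ElementTree as ET

granular = """
<name>
  <first>William</first>
  <middle>Randolph</middle>
  <last>Hearst</last>
</name>
"""


def to_dc_creator(name_xml):
    """Rebuild a granular <name> record as a single 'Last, First Middle' element."""
    name = ET.fromstring(name_xml.strip())
    first = name.findtext("first", default="")
    middle = name.findtext("middle", default="")
    last = name.findtext("last", default="")
    # Join whichever given-name parts are present.
    given = " ".join(part for part in (first, middle) if part)
    creator = ET.Element("DC.creator")
    creator.text = f"{last}, {given}" if given else last
    return ET.tostring(creator, encoding="unicode")


print(to_dc_creator(granular))
# <DC.creator>Hearst, William Randolph</DC.creator>
```

Going the other way, from the single string back to separate first, middle, and last elements, is much harder to do reliably, which is the argument for storing the most granular form.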

Metadata: Standards

Collection Level:
  Encoded Archival Description (EAD) - lcweb.loc.gov/ead/
Item Level:
  MARC
  Dublin Core - purl.org/DC/
  MODS - www.loc.gov/standards/mods/
Harvesting:
  Open Archives Initiative - www.openarchives.org

Access Systems

Exhibit
Browse
Search

Access Systems: Exhibit

Goals:
  Inviting
  Easy to navigate
  Highlight selected parts of a collection
  Teach
Requirements:
  Great graphic design
  Informative and succinct commentary
  Interesting subject matter

Access Systems: Browse

Goals:
  Provide intriguing and interesting paths into and throughout a collection
  Give a broad sense of a collection, but not necessarily show everything
Requirements:
  Logical browse paths
  May have multiple paths to the same items (e.g., time, geography, subject)

Access Systems: Search

Goals:
  To provide post-coordinate access to all items in a collection relevant to a particular query
  To provide good methods to create a search as well as refine or alter the display as required
Requirements:
  Good search software (database or indexing software)
  Good metadata (minimum is probably a title or caption for each item)
  Good interface (options for navigation, search refinement, etc.)
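Post-coordinate access means that terms are combined at search time rather than when the items are cataloged. A minimal sketch of that idea, using only a caption per item as the metadata, is shown below. The item identifiers and captions are invented for illustration, and real search software would add stemming, ranking, and much more.

```python
from collections import defaultdict

# Hypothetical item records: identifier -> caption.
items = {
    "img001": "Hearst Castle under construction",
    "img002": "Portrait of William Randolph Hearst",
    "img003": "Map of Southern California",
}


def build_index(records):
    """Map each lowercase caption word to the set of items that contain it."""
    index = defaultdict(set)
    for item_id, caption in records.items():
        for word in caption.lower().split():
            index[word].add(item_id)
    return index


def search(index, query):
    """Return ids of items whose caption contains every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


index = build_index(items)
print(sorted(search(index, "hearst")))         # ['img001', 'img002']
print(sorted(search(index, "hearst castle")))  # ['img001']
```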

Skills Required of Staff

Imaging
OCR
Markup languages (HTML, XML)
Cataloging & metadata
Indexing and database technology
User interface design
Programming
Web technology
Project management

How Does Digital Data Die?

Let me count the ways…
New replaces old
Death of a sponsor
Sponsor loses interest
Lost functionality
Format rot
Media format obsolescence
Content format obsolescence
Disaster

Preserving Digital Content

No preservation format
Digital preservation techniques:
  Print (on acid free paper!)
  Store
  Refresh
  Encapsulate
  Emulate
  Proliferate (Lots Of Copies Keep Stuff Safe or LOCKSS)

Preserving Digital Content

Institutional commitment
Consortial agreements
Cooperatively funded central repositories
Preservation Open Market

The Best Defense

What will ensure that material will not be preserved?
  Ignorance of its existence
  Ignorance of its worth
  Inability or unwillingness to pay for its preservation
Access helps with all of these problems
