VRA 2012, Cataloging Case Studies, Robocataloging

Post on 20-Jun-2015

DESCRIPTION

Presented by Joshua Polansky at the Annual Conference of the Visual Resources Association, April 18th - April 21st, 2012, in Albuquerque, New Mexico.

The Cataloguing Case Studies session will explore metadata migration, workflows, cloud computing, and tagging, and how they can be applied to digital collections. Mary Alexander of the University of Alabama will present on the second of two migrations that have taken place at the University of Alabama Libraries and the importance of metadata schema and workflows in that process. Joshua Polansky of the University of Washington will describe his automated workflow using optical character recognition (OCR), Apple Automator, and Microsoft Excel to speed the process of collecting metadata for 75,000 digital assets. Elizabeth Berenz of ARTstor will look at the advantages of cloud-based software for image management using Shared Shelf as a working example. And finally, Ian McDermott will demonstrate the advantages of expert tagging and annotation in improving metadata. His presentation will focus on two ARTstor collections that could benefit from the knowledge of the larger ARTstor community: the Gernsheim Photographic Corpus of Drawings and the Larry Qualls Archive of contemporary art exhibitions.

MODERATOR: Jeannine Keefer, University of Richmond, VA

PRESENTERS:
• Mary Alexander, University of Alabama
• Elizabeth Berenz, ARTstor
• Ian McDermott, ARTstor
• Joshua Polansky, University of Washington

TRANSCRIPT

Joshua Polansky, University of Washington

College of Built Environments, Visual Resources Collection

ROBOCATALOGING: Accelerated workflows using OCR and automation

Cataloging Case Studies, April 21, 2012

University of Washington College of Built Environments, Visual Resources Collection

Serves the departments of Architecture, Construction Management, Landscape Architecture and Urban Design & Planning

Analog collection:
• 130,000 35mm slides accessioned and cataloged since 1950s
• Typewritten records; no digital database or online component until 2002

Visual Resources Collection digital components:
• MS Access database catalog
• MDID2 for faculty / students

The big question:

Automated processes exist for batch digitizing analog photos.

Is it possible to batch digitize old cataloging data, too?

Good cataloging information here, researched and typed years ago.

More good data, including source and a unique accession number.

Paper records to the rescue

• Binders and binders of accession records
• Pristine label photocopies

Accession number: collection ID that appears on every label in this form

Architect

Building name

Location / Year

View

Source

Photocopied label edge that will interfere with OCR later

A closer look at the slide label

The big challenge:

• Digitize these typewritten pages
• Sort slide label text into distinct columns in Excel
• Identify each record with its accession number
• Do it all with common or affordable tools
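The slides do not show how the label text was split into columns, so what follows is only a minimal Python sketch of the idea, not the Automator and Excel workflow used in the talk. It assumes each OCR'd label begins with its accession number on its own line, followed by one field per line in the order of the label callouts. The accession-number pattern `ACCESSION_RE`, the helper names, and the sample text are all hypothetical.

```python
import re

# Hypothetical accession-number pattern (e.g. "72-1045").
# The real label format is not given in the slides.
ACCESSION_RE = re.compile(r"^\d{2}-\d{3,5}$")

# Field order taken from the label callouts; one field per line is an assumption.
FIELDS = ["architect", "building_name", "location_year", "view", "source"]


def split_records(ocr_text):
    """Group OCR'd lines into records, starting a new record at each accession number."""
    records = []
    current = None
    for raw in ocr_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if ACCESSION_RE.match(line):
            current = {"accession_number": line, "lines": []}
            records.append(current)
        elif current is not None:
            current["lines"].append(line)
    return records


def to_row(record):
    """Map a record's lines onto the label fields; missing fields become empty strings."""
    row = {"accession_number": record["accession_number"]}
    lines = record["lines"]
    for i, name in enumerate(FIELDS):
        row[name] = lines[i] if i < len(lines) else ""
    # Keep unexpected extra lines so nothing is silently dropped.
    row["extra"] = " | ".join(lines[len(FIELDS):])
    return row


# Invented sample text for illustration only.
sample = """72-1045
Aalto, Alvar
Finlandia Hall
Helsinki, 1971
Exterior from south
Author photo

72-1046
Kahn, Louis
Salk Institute
"""

rows = [to_row(r) for r in split_records(sample)]
for row in rows:
    print(row)
```

Real OCR output is rarely this tidy, so the `extra` column and the empty-string defaults are there to make short or overlong records visible for manual review.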

Photo: Alvaro Farfán via Flickr. 3392225359

Hardware

Apple iMac
• 2010 model
• OS 10.6

Any recent Mac will do (OS 10.4 or higher)

Epson Perfection V500 scanner
• With optional Automatic Document Feeder for stacks of 30 sheets at a time
• Standard transparency unit makes it useful for other scanning projects
• Retails for less than $300 with ADF

Photo: Zak Moreira via Flickr. 3425393424

Software

Adobe Photoshop CS4
• Resize and realign scanned page into a single-column tif with Actions

Adobe Acrobat Pro
• Create a pdf of each tif
• Analyze pdf with optical character recognition (OCR) and make pdf text selectable

Apple Automator, Automator Virtual Input
• Execute workflows to control multiple applications. Launch, copy, paste, manipulate, save, repeat.
• Create Folder Actions for Finder automation
• Virtual Input: Extend the functionality of Automator for even more control over apps, mouse, keyboard

Microsoft Excel 2008
• Receive text from Acrobat in columns
• After text manipulation and sorting, output in a cross-platform format like csv
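As a rough illustration of the cross-platform csv output, here is a short Python sketch that uses the standard `csv` module as a stand-in for Excel. The column names and sample rows are assumptions made for the example, not the collection's actual schema.

```python
import csv
import io

# Hypothetical column names; the real spreadsheet layout is not shown in the slides.
FIELDNAMES = ["accession_number", "architect", "building_name"]


def rows_to_csv(rows, fieldnames=FIELDNAMES):
    """Serialize row dicts to csv text; values containing commas are quoted."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented sample rows for illustration only.
rows = [
    {"accession_number": "72-1045", "architect": "Aalto, Alvar", "building_name": "Finlandia Hall"},
    {"accession_number": "72-1046", "architect": "Kahn, Louis"},
]

csv_text = rows_to_csv(rows)
print(csv_text)
```

Names in "Last, First" form contain commas, which is why the writer's quoting matters if the file is later imported into a database.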

Automator

• Comes standard with Mac OS X 10.4+

• Allows scripting and workflow creation via GUI

• Can perform operations within an application or across multiple applications

Document scanning: Automator, Folder Actions, Photoshop [video here in original presentation]

Text processing: Automator + Automator Virtual Input, Folder Actions, Acrobat, Excel [video here in original presentation]

Processed output in Excel

Sometimes it looks good...

Sometimes it doesn’t.

Sometimes it looks good...

Final result after text sorting and cleanup

Goal
• Produce nearly perfect metadata, clean enough to import into existing database

Actual outcome
• Produced pretty good metadata
• Spent lots of time on data cleanup to get there
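The talk does not detail the cleanup steps. One check that follows from each accession number being unique is sketched below in Python: flag rows that share an accession number or are missing a field. The field names, the `find_problems` helper, and the sample rows are hypothetical.

```python
from collections import Counter

# Hypothetical set of fields that every record is expected to have.
REQUIRED = ("accession_number", "architect", "building_name")


def find_problems(rows, required=REQUIRED):
    """Return (row index, reasons) pairs for rows that need manual cleanup."""
    counts = Counter(row.get("accession_number", "") for row in rows)
    problems = []
    for index, row in enumerate(rows):
        reasons = []
        if counts[row.get("accession_number", "")] > 1:
            reasons.append("duplicate accession number")
        for name in required:
            if not row.get(name, "").strip():
                reasons.append(f"missing {name}")
        if reasons:
            problems.append((index, reasons))
    return problems


# Invented sample rows: the first two deliberately share an accession number.
rows = [
    {"accession_number": "72-1045", "architect": "Aalto, Alvar", "building_name": "Finlandia Hall"},
    {"accession_number": "72-1045", "architect": "Kahn, Louis", "building_name": "Salk Institute"},
    {"accession_number": "72-1047", "architect": "", "building_name": "Villa Savoye"},
]

problems = find_problems(rows)
for index, reasons in problems:
    print(index, reasons)
```

A check like this only points at suspect rows; deciding what the correct value should be still means going back to the paper record.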

Goal
• Use tools on hand; any new tools should be cheap or useful for other projects

Actual outcome
• Used standard software, plus one new application ($25)
• iMac is a student workstation
• Epson scanner is in use for print and film scanning plus pdf creation

Goal
• Have 75,000 new records ready to pair with images and publish to MDID

Actual outcome
• Got 75,000 records!
• Created a searchable shelf list and archival finding aid
• With further data cleanup, the original goal of MDID use can be achieved

Photo: JF Sebastian via Flickr. 412874324

• Every Mac comes with Automator and it is easy to learn

• You probably have OCR tools on your computer right now

• Experimenting can produce great results

Photo credits
• Software icons and screenshots by Adobe, Apple, Microsoft and Singed Labcoat
• Kraftwerk images by Flickr users Zak Moreira, Alvaro Farfán and JF Sebastian
• Other photo and video by UW CBE VRC

Thank you

Rainer Metzger, University of Washington
