vra 2012, cataloging case studies, robocataloging

Joshua Polansky, University of Washington College of Built Environments Visual Resources Collection. ROBOCATALOGING: Accelerated workflows using OCR and automation. Cataloging Case Studies, April 21, 2012

Upload: visual-resources-association

Post on 20-Jun-2015

DESCRIPTION

Presented by Joshua Polansky at the Annual Conference of the Visual Resources Association, April 18th - April 21st, 2012, in Albuquerque, New Mexico.

The Cataloging Case Studies session will explore metadata migration, workflows, cloud computing, and tagging, and how they can be applied to digital collections. Mary Alexander of the University of Alabama will present on the second of two migrations that have taken place at the University of Alabama Libraries and the importance of metadata schema and workflows in that process. Joshua Polansky of the University of Washington will describe his automated workflow using optical character recognition (OCR), Apple Automator, and Microsoft Excel to speed the process of collecting metadata for 75,000 digital assets. Elizabeth Berenz of ARTstor will look at the advantages of cloud-based software for image management, using Shared Shelf as a working example. Finally, Ian McDermott will demonstrate the advantages of expert tagging and annotation in improving metadata. His presentation will focus on two ARTstor collections that could benefit from the knowledge of the larger ARTstor community: the Gernsheim Photographic Corpus of Drawings and the Larry Qualls Archive of contemporary art exhibitions.

MODERATOR: Jeannine Keefer, University of Richmond, VA

PRESENTERS:
Mary Alexander, University of Alabama
Elizabeth Berenz, ARTstor
Ian McDermott, ARTstor
Joshua Polansky, University of Washington

TRANSCRIPT

Page 1: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

Joshua Polansky
University of Washington
College of Built Environments
Visual Resources Collection

ROBOCATALOGING
Accelerated workflows using OCR and automation

Cataloging Case Studies, April 21, 2012

Page 2: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

University of Washington College of Built Environments
Visual Resources Collection

Serves the departments of Architecture, Construction Management, Landscape Architecture and Urban Design & Planning

Analog collection:
• 130,000 35mm slides accessioned and cataloged since 1950s
• Typewritten records; no digital database or online component until 2002

Page 3: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

Visual Resources Collection
Digital components:
• MS Access database catalog
• MDID2 for faculty / students

Page 4: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

The big question:

Automated processes exist for batch digitizing analog photos.

Page 5: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

Is it possible to batch digitize old cataloging data, too?

Good cataloging information here, researched and typed years ago.

More good data, including source and a unique accession number.

Page 6: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

Paper records to the rescue

• Binders and binders of accession records
• Pristine label photocopies

Page 7: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

A closer look at the slide label

• Accession number: collection ID that appears on every label in this form
• Architect
• Building name
• Location / Year
• View
• Source
• Photocopied label edge that will interfere with OCR later

Page 8: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

The big challenge:

• Digitize these typewritten pages
• Sort slide label text into distinct columns in Excel
• Identify each record with its accession number
• Do it all with common or affordable tools
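
The sorting step in the challenge above can be pictured with a small script. This is only a sketch in Python, not the presenter's method (the talk uses Automator and Excel). The accession-number pattern is a guess, since the slides show the real format only as an image, and the names ACCESSION_RE, FIELDS, split_records and to_row are hypothetical. It assumes the OCR text arrives as one label field per line, in the order shown on the label.

```python
import re

# ASSUMPTION: the real accession-number format is not given in the slides.
# "two digits, hyphen, four digits" (e.g. 87-0412) is a placeholder pattern.
ACCESSION_RE = re.compile(r"^\d{2}-\d{4}$")

# Field order assumed from the label anatomy: one field per OCR line.
FIELDS = ["architect", "building", "location_year", "view", "source"]


def split_records(ocr_text):
    """Group OCR lines into one record per label, keyed by accession number."""
    records = []
    current = None
    for raw in ocr_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if ACCESSION_RE.match(line):
            # An accession number starts a new record.
            current = {"accession": line, "lines": []}
            records.append(current)
        elif current is not None:
            current["lines"].append(line)
        # Lines seen before the first accession number are ignored.
    return records


def to_row(record):
    """Map a record's lines onto the assumed columns, padding short labels."""
    row = {"accession": record["accession"]}
    lines = record["lines"]
    for i, name in enumerate(FIELDS):
        row[name] = lines[i] if i < len(lines) else ""
    # Extra lines are kept in their own column rather than silently dropped.
    row["extra"] = " | ".join(lines[len(FIELDS):])
    return row
```

Each row produced by to_row corresponds to one spreadsheet line, with the accession number in the first column so records stay identifiable after sorting.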

Page 9: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

Photo: Alvaro Farfán via Flickr. 3392225359

Page 10: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

Hardware

Apple iMac
• 2010 model
• OS 10.6

Any recent Mac will do (OS 10.4 or higher)

Page 11: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

Hardware

Epson Perfection V500 scanner
• With optional Automatic Document Feeder for stacks of 30 sheets at a time
• Standard transparency unit makes it useful for other scanning projects
• Retails for less than $300 with ADF

Page 12: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

Photo: Zak Moreira via Flickr. 3425393424

Page 13: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

Software

Page 14: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

Adobe Photoshop CS4
• Resize and realign scanned page into a single-column tif with Actions

Adobe Acrobat Pro
• Create a pdf of each tif
• Analyze pdf with optical character recognition (OCR) and make pdf text selectable

Page 15: VRA 2012, Cataloging Case Studies, ROBOCATALOGING
Page 16: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

Apple Automator / Automator Virtual Input
• Execute workflows to control multiple applications. Launch, copy, paste, manipulate, save, repeat.
• Create Folder Actions for Finder automation
• Virtual Input: Extend the functionality of Automator for even more control over apps, mouse, keyboard

Microsoft Excel 2008
• Receive text from Acrobat in columns
• After text manipulation and sorting, output in a cross-platform format like csv
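
For readers without Excel, the final "output as csv" step might look roughly like the following. It is a minimal sketch using Python's standard csv module, not part of the original workflow; the column names in COLUMNS and the helper rows_to_csv are hypothetical, chosen to mirror the fields on the slide label.

```python
import csv
import io

# Hypothetical column names mirroring the label fields; not from the talk.
COLUMNS = ["accession", "architect", "building", "location_year", "view", "source"]


def rows_to_csv(rows):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    # extrasaction="ignore" skips keys that are not listed in COLUMNS.
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        # Missing fields become empty cells instead of raising an error.
        writer.writerow({name: row.get(name, "") for name in COLUMNS})
    return buffer.getvalue()
```

The csv module quotes values that contain commas (an inverted name such as "Wright, Frank Lloyd", for instance), so such a value stays in a single column when the file is opened in another application.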

Page 17: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

Automator

• Comes standard with Mac OS X 10.4+

• Allows scripting and workflow creation via GUI

• Can perform operations within an application or across multiple applications

Page 18: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

Document scanning: Automator, Folder Actions, Photoshop [video here in original presentation]

Page 19: VRA 2012, Cataloging Case Studies, ROBOCATALOGING
Page 20: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

Text processing: Automator + Automator Virtual Input, Folder Actions, Acrobat, Excel [video here in original presentation]

Page 21: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

Processed output in Excel

Page 22: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

Sometimes it looks good...

Page 23: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

Sometimes it doesn’t.

Page 24: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

Final result after text sorting and cleanup
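
Part of this cleanup could in principle be scripted. The sketch below is an assumption-laden illustration in Python, not the cleanup actually performed for the project: it drops lines that are mostly punctuation (the kind of noise a photocopied label edge tends to produce in OCR) and collapses runs of whitespace. The 0.5 threshold and the names is_noise and clean_lines are hypothetical.

```python
import re


def is_noise(line, min_alnum_ratio=0.5):
    """Return True for blank lines or lines that are mostly non-alphanumeric."""
    stripped = line.strip()
    if not stripped:
        return True
    alnum = sum(ch.isalnum() for ch in stripped)
    # ASSUMPTION: under 50% letters/digits is treated as scanner or edge noise.
    return alnum / len(stripped) < min_alnum_ratio


def clean_lines(ocr_text):
    """Drop noisy lines and normalize internal whitespace in the rest."""
    cleaned = []
    for line in ocr_text.splitlines():
        if is_noise(line):
            continue
        cleaned.append(re.sub(r"\s+", " ", line).strip())
    return cleaned
```

A filter like this only removes obvious debris; misread characters inside otherwise valid text still need manual review.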

Page 25: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

Goal
• Produce nearly perfect metadata, clean enough to import into existing database

Page 26: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

Actual outcome
• Produced pretty good metadata
• Spent lots of time on data cleanup to get there

Page 27: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

Goal
• Use tools on hand; any new tools should be cheap or useful for other projects

Page 28: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

Actual outcome
• Used standard software, plus one new application ($25)
• iMac is a student workstation
• Epson scanner is in use for print and film scanning plus pdf creation

Page 29: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

Goal
• Have 75,000 new records ready to pair with images and publish to MDID

Page 30: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

Actual outcome
• Got 75,000 records!
• Created a searchable shelf list and archival finding aid
• With further data cleanup, the original goal of MDID use can be achieved

Page 31: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

Photo: JF Sebastian via Flickr. 412874324

Page 32: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

• Every Mac comes with Automator and it is easy to learn

• You probably have OCR tools on your computer right now

• Experimenting can produce great results

Page 33: VRA 2012, Cataloging Case Studies, ROBOCATALOGING

Photo credits
• Software icons and screenshots by Adobe, Apple, Microsoft and Singed Labcoat
• Kraftwerk images by Flickr users Zak Moreira, Alvaro Farfán and JF Sebastian
• Other photo and video by UW CBE VRC

Thank you
Rainer Metzger, University of Washington
