A Standardized DigiTool Ingest Approach to Internet Archive Digitized Books Joseph Shubitowski (jshubitowski@getty.edu) IGeLU 2008, September 9, 2008


Page 1:

A Standardized DigiTool Ingest Approach to Internet Archive Digitized Books

Joseph Shubitowski (jshubitowski@getty.edu)
IGeLU 2008, September 9, 2008

Page 2:

Talking Points:

• Scope / Background
• Why?
• Major hurdles
• Manual / automated workflows
• Outcomes
• What can we share?
  – Results
  – Methodologies
  – Tools, etc.

IGeLU Conference 2008, September 9, 2008

Page 3:

Alfred P. Sloan Foundation

Getty Research Institute: Archaeology and antiquities

Boston Public Library: John Adams collection

Johns Hopkins: Anti-slavery materials

The Metropolitan Museum of Art: Museum Publications

Bancroft Library: Gold Rush and westward expansion

Page 4:

Scope of the Digitization Project

2,000,000 pages or approx. 5,000 books

Self-evident collection

Public domain
  – pre-1923 for works published in U.S.
  – pre-1909 for works published outside of U.S.
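The public-domain cutoffs on the scope slide amount to a simple selection rule. The sketch below is only an illustration of that rule and of the average book length implied by the project totals; the function name and arguments are hypothetical, and it is not a statement of copyright law.

```python
def is_in_scope(pub_year: int, published_in_us: bool) -> bool:
    """Apply the cutoffs from the slide: pre-1923 (U.S.), pre-1909 (elsewhere)."""
    cutoff = 1923 if published_in_us else 1909
    return pub_year < cutoff


# Project totals from the slide.
TOTAL_PAGES = 2_000_000
TOTAL_BOOKS = 5_000

# Average pages per book implied by those totals.
pages_per_book = TOTAL_PAGES // TOTAL_BOOKS
```

With these totals the average works out to 400 pages per book.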

Page 5:

Internet Archive Scribe Station

1 Pod = 10 Scribe Stations

Page 6:

Why Do It?

• Internet Archive issues
  – Response/search time
  – Metadata-only searching
  – No control

• Full-text searching

• Use in metasearch

• More control!

Page 7:

Major Hurdles

• Getting the files
• Disk space issues – for general storage and for DTL
• What/how to process all the files
• Abbyy OCR vs. ALTO OCR
• Thumbnail generation
• Handle configuration/synchronization

Page 8:

List of OCR’d books received from Internet Archive

Processed by GRI

URLs from Internet Archive

Link to URLs

Download files from Internet Archive

Zipped or tar files: *_orig_jp2 *_jp2 *_raw_jp2

high & low resolution PDFs
*abbyy.gz
*meta.xml
*marc.xml

Process downloaded files

Ready for DigiTool Ingest
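As a rough sketch of the "get the files" step, the snippet below builds candidate download URLs for one book from the file patterns listed on the slide. Everything beyond those patterns is an assumption: the base URL, the underscore before `abbyy.gz`, `meta.xml`, and `marc.xml`, the archive extension, and the example identifier are illustrative, and this is not the GRI download script. PDFs are left out because the slide gives no name pattern for them.

```python
from urllib.parse import quote

# Assumption: items are fetched from a per-identifier download directory.
BASE_URL = "https://archive.org/download"

# Image sets named on the slide; each arrives as a zip or tar archive.
IMAGE_SETS = ["_orig_jp2", "_jp2", "_raw_jp2"]
# Single files named on the slide (leading underscore is an assumption).
SINGLE_FILES = ["_abbyy.gz", "_meta.xml", "_marc.xml"]


def candidate_urls(identifier: str, archive_ext: str = ".zip") -> list[str]:
    """Return the URLs to try for one Internet Archive identifier."""
    names = [identifier + suffix + archive_ext for suffix in IMAGE_SETS]
    names += [identifier + suffix for suffix in SINGLE_FILES]
    return [f"{BASE_URL}/{quote(identifier)}/{quote(name)}" for name in names]


# Hypothetical identifier, for illustration only.
urls = candidate_urls("examplebook00test")
```

Not every item carries all three image sets, so a real script would have to tolerate missing files and pick between `.zip` and `.tar` per item.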


Page 12:

Disk Space Issues

• Each digitized book = 500 MB to 1.5 GB of raw files

• Further untarring and processing consume even more disk!

• DTL scratch/processing space, permanent storage space, and Oracle tablespace (including full-text indexing) consume even more disk space

• 3000 books in the queue will require 10-15 TB for this project alone.
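The figures on this slide can be checked with a little arithmetic. The sketch below takes only the numbers stated on the slide and derives the raw download volume and the overhead factor implied by the 10-15 TB estimate; the use of decimal units (1 TB = 1000 GB) is an assumption.

```python
BOOKS_IN_QUEUE = 3000
RAW_GB_PER_BOOK = (0.5, 1.5)   # 500 MB to 1.5 GB per book, from the slide
PROJECT_TB = (10, 15)          # total estimate, from the slide
GB_PER_TB = 1000               # assumption: decimal units

# Raw download volume for the whole queue, low and high bound, in TB.
raw_tb = tuple(BOOKS_IN_QUEUE * gb / GB_PER_TB for gb in RAW_GB_PER_BOOK)

# Implied multiplier for untarring, DTL scratch space, permanent storage,
# and Oracle tablespace on top of the raw files.
overhead_min = PROJECT_TB[0] / raw_tb[1]   # smallest total over largest raw
overhead_max = PROJECT_TB[1] / raw_tb[0]   # largest total over smallest raw
```

The raw files alone come to between 1.5 and 4.5 TB, so the project estimate implies that processing and storage multiply the raw volume by somewhere between roughly 2 and 10.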

Page 13:

• DTL ingest package =
  – Archive = raw jpeg2000 (renamed to *.j2k)
  – View = use copy jpeg2000 (*.jp2)
  – Index = ALTO files
  – Thumbnail = appropriate thumb of title page for display of the complex object
  – PDF = high res PDF as additional manifestation
  – MARCXML record for IE level metadata

• No TIF files from IA – everything is jpeg2000

• Mapping file same for every ingest
• CSV file is produced automatically
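The package layout above can be sketched as a small classifier that assigns each staged file a role and writes a CSV listing. This is an illustration only: the suffix rules for ALTO, thumbnail, and MARCXML files, the role labels, and the two-column CSV layout are all hypothetical and are not the DigiTool CSV or mapping-file format.

```python
import csv
import io
from pathlib import PurePosixPath

# Hypothetical suffix-to-role rules modelled on the ingest package list.
RULES = [
    (".j2k", "ARCHIVE"),         # raw jpeg2000, renamed
    (".jp2", "VIEW"),            # use-copy jpeg2000
    ("_alto.xml", "INDEX"),      # per-page ALTO (naming is an assumption)
    ("_thumb.jpg", "THUMBNAIL"), # chosen thumbnail (naming is an assumption)
    (".pdf", "PDF"),
    ("_marc.xml", "METADATA"),
]


def usage_type(filename: str) -> str:
    """Return the role of a staged file, or UNKNOWN if no rule matches."""
    lower = filename.lower()
    for suffix, role in RULES:
        if lower.endswith(suffix):
            return role
    return "UNKNOWN"


def archive_name(raw_jp2_name: str) -> str:
    """Rename a raw jpeg2000 file to *.j2k, as the slide describes."""
    return str(PurePosixPath(raw_jp2_name).with_suffix(".j2k"))


def build_csv(filenames: list[str]) -> str:
    """Write a filename/role listing (hypothetical layout) as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["filename", "usage_type"])
    for name in sorted(filenames):
        writer.writerow([name, usage_type(name)])
    return buf.getvalue()
```

Because the raw and use-copy images are both jpeg2000, renaming the raw set to `*.j2k` is what lets an extension-based rule tell the two apart.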

Page 14:

Abbyy to ALTO

• IA scanning produces one huge OCR file in Abbyy proprietary XML

• Discussions with / proposal from CCS
• Real need for an open-source approach
  – Abbyy XSD can morph in future
  – Desire to share

• Contract with Ex Libris to produce tool
  – Java based
  – Includes jar and class files
  – Free to share and redistribute

• Tool transforms the single Abbyy file into one ALTO XML file per page
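To make the one-file-to-many-files idea concrete, here is a minimal sketch of the splitting half of that job only. It is not the Ex Libris tool, which is Java based, and it does not map Abbyy elements to ALTO. It assumes the Abbyy document holds one `page` element per scanned page, matches that element by local name so the schema namespace does not matter, and streams the input so a very large file need not be held in memory. The sample namespace is a placeholder.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield each <page> element of an Abbyy-style XML stream as a string.

    `source` is a filename or a binary file-like object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) == "page":
            yield ET.tostring(elem, encoding="unicode")
            elem.clear()  # release the page's subtree once it is serialized


# Tiny stand-in document; the namespace URI is a placeholder.
sample = io.BytesIO(
    b'<document xmlns="http://example.org/abbyy">'
    b'<page width="10"><line>one</line></page>'
    b'<page width="20"><line>two</line></page>'
    b"</document>"
)
page_count = sum(1 for _ in iter_pages(sample))
```

A real converter would transform each yielded page into ALTO markup and write it to its own numbered file.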

Page 15:

Thumbnail Creation

• Initial ingest flow created complex object thumbnail from first page of PDF manifestation

• Boring!
• Ghostscript/PDF/ImageMagick problems
• Decision to go semi-manual with script/cgi that:
  – creates thumbnails for first 15 jpeg2000 page images
  – sends URL in email for each separate ingest
  – creates web page for page image viewing and thumbnail selection
  – adds chosen thumbnail to staging directory, cleans up, and sends confirmation email
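The first step of the semi-manual flow, making thumbnails for the first 15 page images, might look like the sketch below. It only builds the command lines and does not run them. The use of ImageMagick's `convert` with its `-thumbnail` option is an assumption about tooling (jpeg2000 support depends on how ImageMagick was built), and the thumbnail size, output naming, and directory are hypothetical.

```python
from pathlib import PurePosixPath


def thumbnail_commands(page_images, out_dir, limit=15, size="200x200"):
    """Build one ImageMagick command per candidate page image.

    Only the first `limit` images (in sorted order) are considered,
    matching the "first 15 jpeg2000 page images" on the slide.
    """
    commands = []
    for src in sorted(page_images)[:limit]:
        stem = PurePosixPath(src).stem
        dst = f"{out_dir}/{stem}_thumb.jpg"
        commands.append(["convert", src, "-thumbnail", size, dst])
    return commands


# Hypothetical page image names for a 20-page book.
pages = [f"book_{n:04d}.jp2" for n in range(1, 21)]
commands = thumbnail_commands(pages, "thumbs")
```

Each command list could then be passed to `subprocess.run`, after which the selection page would present the resulting images to a person.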


Page 17:

Handle Generation

• Setup per DTL docs
• Firewall tweaks
• Ingest flow tweaks
  – Handle for IE
  – Handles for all archive jpeg2000 images

• DTL errors with mass publication of Handles
  – Fixed in SP21
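A handle resolves through a proxy URL built from a prefix and a suffix, as in the example handle shown in this deck (prefix 10020, suffix 17473). The sketch below builds such URLs for an IE and its archive images. The per-image suffix scheme (IE suffix, a dot, then a page number) is purely hypothetical; the slides do not say how DigiTool assigns suffixes.

```python
RESOLVER = "http://hdl.handle.net"  # proxy used in the deck's example handle


def handle_url(prefix: str, suffix: str) -> str:
    """Return the resolver URL for one handle."""
    return f"{RESOLVER}/{prefix}/{suffix}"


def handles_for_object(prefix: str, ie_suffix: str, image_count: int) -> list[str]:
    """List handle URLs for an IE and each of its archive images.

    Hypothetical scheme: image n gets the suffix '<ie_suffix>.<n>'.
    """
    urls = [handle_url(prefix, ie_suffix)]
    for n in range(1, image_count + 1):
        urls.append(handle_url(prefix, f"{ie_suffix}.{n}"))
    return urls
```

With one handle per archive image, a 400-page book needs about 400 handles, which gives a sense of why mass publication put pressure on DTL.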

Page 18:

Ingest Summary

• Get/process/stage files
• Generate ALTO OCR files
• Web CGI for thumbnail selection
• Load.sh script moves all files to locations DTL expects
• Activate saved Ingest Flow from DTL Web Ingest client
• Wait.........
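The move step in this summary can be sketched as follows. This is not Load.sh, which is a shell script whose contents the slides do not show; it is a Python illustration of the same idea, and the target subdirectory names and suffix rules are hypothetical stand-ins for wherever DTL expects each file type.

```python
import shutil
from pathlib import Path

# Hypothetical subdirectories under the DTL load area, keyed by file suffix.
TARGETS = {".j2k": "archive", ".jp2": "view", ".pdf": "pdf", ".jpg": "thumbnail"}


def target_subdir(name: str):
    """Pick a subdirectory for a staged file, or None to leave it in place."""
    lower = name.lower()
    if lower.endswith("_marc.xml"):
        return "metadata"
    if lower.endswith(".xml"):
        return "index"  # assumption: remaining XML files are ALTO pages
    return TARGETS.get(Path(lower).suffix)


def stage_files(staging_dir, load_root):
    """Move every recognized file from staging_dir into load_root/<subdir>."""
    moved = []
    for src in sorted(Path(staging_dir).iterdir()):
        sub = target_subdir(src.name) if src.is_file() else None
        if sub is None:
            continue
        dest_dir = Path(load_root) / sub
        dest_dir.mkdir(parents=True, exist_ok=True)
        dest = dest_dir / src.name
        shutil.move(str(src), str(dest))
        moved.append(dest)
    return moved
```

Unrecognized files stay in the staging directory, so anything left behind after a run is a signal to check before activating the ingest flow.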

Page 19:

Outstanding Issues

• Ingest speed
  – Remedied somewhat in SP21
  – Digitized books are just darn big!
  – Low number of ingests per day

• Handles
  – Manual publishing process
  – Need to populate Voyager bib record

• METS viewer performance issues

Page 20:

Success Factors !!

• Code to share
  – Get/process/staging scripts
  – Abbyy/ALTO transform code
  – Web cgi thumbnail code
  – YMMV

• Handles provide true persistent IDs
  – http://hdl.handle.net/10020/17473

• Full-text multilingual searching
  – MetaLib QuickSet for metasearch of all local repositories

Page 21:

Demo and Thanks......

• http://archives.getty.edu

jshubitowski@getty.edu