A Standardized DigiTool Ingest Approach to Internet Archive Digitized Books

Joseph Shubitowski ([email protected])
IGeLU 2008, September 9, 2008

Talking Points:

• Scope / Background
• Why?
• Major hurdles
• Manual / automated workflows
• Outcomes
• What can we share?
  – Results
  – Methodologies
  – Tools, etc.

Alfred P. Sloan Foundation

• Getty Research Institute: Archaeology and antiquities
• Boston Public Library: John Adams collection
• Johns Hopkins: Anti-slavery materials
• The Metropolitan Museum of Art: Museum publications
• Bancroft Library: Gold Rush and westward expansion

Scope of the Digitization Project

• 2,000,000 pages, or approx. 5,000 books
• Self-evident collection
• Public domain
  – pre-1923 for works published in the U.S.
  – pre-1909 for works published outside of the U.S.

Internet Archive Scribe Station

1 Pod = 10 Scribe Stations

Why Do It?

• Internet Archive issues
  – Response/search time
  – Metadata-only searching
  – No control
• Full-text searching
• Use in metasearch
• More control!

Major Hurdles

• Getting the files
• Disk space issues (for general storage and for DTL)
• What/how to process all the files
• Abbyy OCR vs. ALTO OCR
• Thumbnail generation
• Handle configuration/synchronization

Workflow (flowchart steps, in order):

• List of OCR’d books received from Internet Archive
• Processed by GRI
• URLs from Internet Archive; link to URLs
• Download files from Internet Archive
  – Zipped or tar files: *_orig_jp2, *_jp2, *_raw_jp2
  – High and low resolution PDFs
  – *abbyy.gz
  – *meta.xml
  – *marc.xml
• Process downloaded files
• Ready for DigiTool ingest
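
The download step in this workflow could be scripted roughly as follows. This is a minimal sketch, not the GRI scripts. It assumes the `https://archive.org/download/<identifier>/<file>` URL pattern, and the full file names (`<identifier>_jp2.zip` and so on) are guesses built from the suffixes listed on the slide. The PDFs are left out because their file names are not given. `FILE_SUFFIXES`, `download_urls`, and `fetch_item` are hypothetical names.

```python
import urllib.request
from pathlib import Path

# Suffixes come from the slide; the complete file names are assumptions.
FILE_SUFFIXES = ["_jp2.zip", "_abbyy.gz", "_meta.xml", "_marc.xml"]

# Assumed download URL pattern for Internet Archive items.
BASE_URL = "https://archive.org/download"


def download_urls(identifier, suffixes=FILE_SUFFIXES):
    """Return the candidate download URL for each expected file of one item."""
    return [f"{BASE_URL}/{identifier}/{identifier}{s}" for s in suffixes]


def fetch_item(identifier, dest_dir):
    """Download every expected file for one item into dest_dir/identifier."""
    target = Path(dest_dir) / identifier
    target.mkdir(parents=True, exist_ok=True)
    for url in download_urls(identifier):
        filename = url.rsplit("/", 1)[-1]
        # urlretrieve raises an error if the file cannot be fetched.
        urllib.request.urlretrieve(url, str(target / filename))
    return target
```

Calling `fetch_item(identifier, "staging")` once per identifier in the list of OCR'd books would leave one directory per book, ready for the processing step.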

Disk Space Issues

• Each digitized book = 500 MB to 1.5 GB of raw files
• Further untarring and processing consume even more disk!
• DTL scratch/processing space, permanent storage space, and Oracle tablespace (including full-text indexing) consume even more disk space
• 3,000 books in the queue will require 10-15 TB for this project alone.
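
The figures on this slide can be checked with a little arithmetic. The sketch below assumes decimal units (1 TB = 1000 GB). The "overhead factor" is only an interpretation of the gap between the raw file sizes and the 10-15 TB project estimate; the slides do not state it.

```python
BOOKS_IN_QUEUE = 3000
RAW_GB_PER_BOOK = (0.5, 1.5)   # 500 MB to 1.5 GB of raw files per book
PROJECT_TB = (10, 15)          # total estimate quoted on the slide


def raw_terabytes(books, gb_per_book):
    """Raw storage in TB for a number of books (assumes 1 TB = 1000 GB)."""
    return books * gb_per_book / 1000


raw_low = raw_terabytes(BOOKS_IN_QUEUE, RAW_GB_PER_BOOK[0])
raw_high = raw_terabytes(BOOKS_IN_QUEUE, RAW_GB_PER_BOOK[1])

# Ratio of the project estimate to the largest raw footprint: the space
# that untarring, DTL storage, and indexing would add on top of raw files.
overhead_low = PROJECT_TB[0] / raw_high
overhead_high = PROJECT_TB[1] / raw_high

print(f"raw files: {raw_low:.1f} to {raw_high:.1f} TB")
print(f"implied overhead factor: {overhead_low:.1f}x to {overhead_high:.1f}x")
```

Under these assumptions the raw downloads alone account for 1.5 to 4.5 TB, so the project estimate implies at least a doubling once processing and DTL storage are included.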

• DTL ingest package =
  – Archive = raw jpeg2000 (renamed to *.j2k)
  – View = use copy jpeg2000 (*.jp2)
  – Index = ALTO files
  – Thumbnail = appropriate thumb of title page for display of the complex object
  – PDF = high res PDF as additional manifestation
  – MARCXML record for IE level metadata
• No TIF files from IA; everything is jpeg2000
• Mapping file same for every ingest
• CSV file is produced automatically
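
One way the automatic CSV step might look is sketched below. It is not the actual DigiTool CSV layout: the column names, the usage labels, and the file-name rules (for example, treating any `.xml` file other than `*_marc.xml` as ALTO, and `.jpg` as the thumbnail) are assumptions made for illustration.

```python
import csv
import io
from pathlib import PurePath


def usage_type(filename):
    """Map a staged file to a package role (file-name rules are assumed)."""
    name = filename.lower()
    suffix = PurePath(name).suffix
    if name.endswith("_marc.xml"):
        return "METADATA"    # MARCXML record for the IE
    if suffix == ".j2k":
        return "ARCHIVE"     # raw JPEG 2000, renamed to *.j2k
    if suffix == ".jp2":
        return "VIEW"        # use-copy JPEG 2000
    if suffix == ".xml":
        return "INDEX"       # per-page ALTO files
    if suffix == ".jpg":
        return "THUMBNAIL"
    if suffix == ".pdf":
        return "PDF"
    return None              # anything else is left out of the package


def build_csv(filenames):
    """Return CSV text with one row per recognized file, sorted by name."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["file_name", "usage_type"])  # hypothetical columns
    for name in sorted(filenames):
        usage = usage_type(name)
        if usage is not None:
            writer.writerow([name, usage])
    return out.getvalue()
```

Because the mapping is the same for every ingest, a fixed rule table like this is enough; only the list of file names changes from book to book.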

Abbyy to ALTO

• IA scanning produces one huge OCR file in Abbyy proprietary XML
• Discussions with / proposal from CCS
• Real need for an open source approach
  – Abbyy XSD can morph in future
  – Desire to share
• Contract with Ex Libris to produce tool
  – Java based
  – Includes jar and class files
  – Free to share and redistribute
• Tool transforms single ABBYY file to ALTO-file-per-page XML files
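
The core idea of the transform, one input document split into one output file per page, can be illustrated in a few lines. This is not the Ex Libris Java tool. It is a simplified sketch that assumes the ABBYY XML uses `page`, `line`, and `charParams` elements with `l`/`t`/`r`/`b` coordinates, emits a single ALTO `String` per line rather than per word, and omits the ALTO namespace and most required metadata.

```python
import xml.etree.ElementTree as ET


def local_name(element):
    """Return the tag name without any XML namespace prefix."""
    return element.tag.rsplit("}", 1)[-1]


def descendants_named(parent, name):
    """Yield every element under parent whose local tag name matches."""
    return (e for e in parent.iter() if local_name(e) == name)


def abbyy_to_alto_pages(abbyy_xml):
    """Split one ABBYY-style document into a list of ALTO-like page strings."""
    root = ET.fromstring(abbyy_xml)
    pages = []
    for number, page in enumerate(descendants_named(root, "page"), start=1):
        alto = ET.Element("alto")
        layout = ET.SubElement(alto, "Layout")
        alto_page = ET.SubElement(
            layout, "Page", ID=f"P{number}",
            WIDTH=page.get("width", "0"), HEIGHT=page.get("height", "0"))
        space = ET.SubElement(alto_page, "PrintSpace")
        block = ET.SubElement(space, "TextBlock", ID=f"P{number}_TB1")
        for line in descendants_named(page, "line"):
            # Each charParams element is assumed to hold one character.
            text = "".join(
                c.text or "" for c in descendants_named(line, "charParams"))
            left, top, right, bottom = (int(line.get(k, "0")) for k in "ltrb")
            box = {"HPOS": str(left), "VPOS": str(top),
                   "WIDTH": str(right - left), "HEIGHT": str(bottom - top)}
            text_line = ET.SubElement(block, "TextLine", box)
            ET.SubElement(text_line, "String", box, CONTENT=text)
        pages.append(ET.tostring(alto, encoding="unicode"))
    return pages
```

Each returned string would be written to its own file, named to match the corresponding page image so that the index files line up with the view copies.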

Thumbnail Creation

• Initial ingest flow created complex object thumbnail from first page of PDF manifestation
• Boring!
• Ghostscript/PDF/ImageMagick problems
• Decision to go semi-manual with script/cgi that:
  – creates thumbnails for first 15 jpeg2000 page images
  – sends URL in email for each separate ingest
  – creates web page for page image viewing and thumbnail selection
  – adds chosen thumbnail to staging directory, cleans up, and sends confirmation email
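
The first step of that script, picking the first 15 page images and producing a thumbnail for each, might be sketched as below. The slides do not say which tool generates the thumbnails; this sketch assumes ImageMagick's `convert` with its `-thumbnail` option and a build that can read JPEG 2000. The thumbnail size and output naming are also assumptions. The function only builds the command lines, so it can be inspected without running anything.

```python
from pathlib import Path

MAX_CANDIDATES = 15      # from the slide: first 15 page images
THUMB_SIZE = "200x200"   # assumed bounding box for thumbnails


def candidate_pages(filenames, limit=MAX_CANDIDATES):
    """Return the first `limit` JPEG 2000 page images in name order."""
    pages = sorted(f for f in filenames if f.lower().endswith(".jp2"))
    return pages[:limit]


def thumbnail_commands(filenames, out_dir):
    """Build one ImageMagick `convert` command per candidate page."""
    commands = []
    for name in candidate_pages(filenames):
        target = Path(out_dir) / (Path(name).stem + "_thumb.jpg")
        commands.append(
            ["convert", name, "-thumbnail", THUMB_SIZE, str(target)])
    return commands
```

Each command list can be passed to `subprocess.run(cmd, check=True)`. The resulting JPEGs are what the selection web page would display for the reviewer to choose from.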

Handle Generation

• Setup per DTL docs
• Firewall tweaks
• Ingest flow tweaks
  – Handle for IE
  – Handles for all archive jpeg2000 images
• DTL errors with mass publication of Handles
  – Fixed in SP21
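
Once published, a Handle is resolved through the public proxy at hdl.handle.net. The small sketch below shows how the resolver URL for a given prefix and suffix is formed; `handle_url` is a hypothetical helper, and how DTL assigns suffixes to the IE and to each archive image is not covered here.

```python
from urllib.parse import quote

HANDLE_PROXY = "http://hdl.handle.net"


def handle_url(prefix, suffix):
    """Return the proxy URL that resolves the handle prefix/suffix."""
    return f"{HANDLE_PROXY}/{quote(prefix)}/{quote(suffix)}"


print(handle_url("10020", "17473"))
```

The prefix and suffix used here are taken from the example Handle quoted later in the talk.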

Ingest Summary

• Get/process/stage files
• Generate ALTO OCR files
• Web CGI for thumbnail selection
• Load.sh script moves all files to locations DTL expects
• Activate saved Ingest Flow from DTL Web Ingest client
• Wait.........

Outstanding Issues

• Ingest speed
  – Remedied somewhat in SP21
  – Digitized books are just darn big!
  – Low number of ingests per day
• Handles
  – Manual publishing process
  – Need to populate Voyager bib record
• METS viewer performance issues

Success Factors!!

• Code to share
  – Get/process/staging scripts
  – Abbyy/ALTO transform code
  – Web CGI thumbnail code
  – YMMV
• Handles provide true persistent IDs
  – http://hdl.handle.net/10020/17473
• Full-text multilingual searching
  – MetaLib QuickSet for metasearch of all local repositories

Demo and Thanks......

• http://archives.getty.edu

[email protected]
