Where Preservation Meets Mass Digitization
John A. Kunze
California Digital Library
LAUC Fall Assembly, UC Merced, 16 November 2007
2
The UC Libraries’ Digital Preservation Program
UC-wide program: serves all 10 UC campuses
– 208,000 students
– 121,000 faculty and staff
– 10+ libraries
– Museums
Located at the CDL
4
Preservation challenges: case studies
With benefit of hindsight, what’s hard?
• Policy
• Making files small
• Fast data transfer
• Cheap, reliable storage
• Lots of annoying files
• Preserving the revenue stream
5
What’s digital preservation?
Storing digital objects while retaining a balance of usability and faithfulness (truthiness) to their creators’ original intentions
6
Policy Challenges
• How faithful
• How long
• How many replicas
• How much manipulation
• Right(s)mare
7
Fast data transfer challenges
Lots of files, lots of data
• Could take months to move and replicate
Explore data transfer / replication options
• Test with CDL and New York University
Survey tool performance and usability
Continuing conversations with the San Diego Supercomputer Center and the Library of Congress with goal of creating guidelines
8
Transfer tools tested
Ubiquitous, usual suspects: RSYNC, SCP, SFTP, FTP
• MogileFS (simple distributed filesystem, Perl scripts) http://www.danga.com/mogilefs/
• High Performance SSH (no system gaming) http://www.psc.edu/networking/projects/hpn-ssh/
But parallelism really works:
• GridFTP (high security, from Grid community) http://www.globus.org/grid_software/data/gridftp.php
• SRB (bundled Sget/Sput tools) http://www.sdsc.edu/srb/index.php/Main_Page
• BBFTP (easy installation and use) http://doc.in2p3.fr/bbftp/
• BBCP (easy installation and use) http://www.slac.stanford.edu/~abh/bbcp/
Practically, combine parallelism with common tools: 20 x SCP!
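The "20 x SCP" idea can be sketched roughly as follows. This is a minimal illustration, not the setup used in the CDL tests: the function names, the round-robin split, and the `runner` parameter are all assumptions made for the example. It deals the file list into batches and runs one ordinary `scp` process per batch at the same time.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def split_round_robin(paths, streams):
    """Deal paths into at most `streams` non-empty batches of similar size."""
    batches = [paths[i::streams] for i in range(streams)]
    return [batch for batch in batches if batch]


def scp_command(batch, destination):
    """Build one scp invocation: -q hides the progress meter, -B disables password prompts."""
    return ["scp", "-q", "-B", *batch, destination]


def parallel_copy(paths, destination, streams=20, runner=subprocess.run):
    """Run up to `streams` scp processes concurrently (hypothetical helper).

    `runner` defaults to subprocess.run; it is a parameter only so the
    batching logic can be exercised without a network.
    """
    batches = split_round_robin(list(paths), streams)
    if not batches:
        return []
    with ThreadPoolExecutor(max_workers=len(batches)) as pool:
        futures = [
            pool.submit(runner, scp_command(batch, destination), check=True)
            for batch in batches
        ]
        # result() re-raises any failure from the corresponding scp process
        return [future.result() for future in futures]
```

A call such as `parallel_copy(file_list, "user@host:/target/dir")` would assume key-based SSH authentication is already in place, since batch mode refuses to prompt. Splitting by file count rather than by bytes is the simplest choice; a few very large files would leave some streams idle.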
11
What is mass digitization?
Large-scale scanning of newspapers, books, videos, etc. from the world’s major libraries
– Millions of items/hours to digitize, e.g.,
12
Why mass digitization?
For better access and search
– Page images remotely accessible
– OCR (Optical Character Recognition) makes text visible to search engines
Mass digitization is, for us, not intended to replace the physical item
13
“Page Image Compression for Mass Digitization”
A study of page image tradeoffs with:
• National Library of France (BnF)
• Harvard University Libraries (HUL)
– With Google Book Search: G9 Libraries (Harvard, Michigan, Stanford, NYPL, Oxford, University of California, etc.)
• University of California Berkeley (UCB) and the California Digital Library (CDL)
– With Open Content Alliance: Internet Archive, Microsoft, University of Toronto, etc.
Presented at IS&T Archiving 2007, Arlington, May 2007
14
Mass book digitization tradeoffs
For our millions of volumes
• Need to strike balance between size of the files and quality of the reading experience
• Images need to work with OCR
• Possibility of re-printing books (print on demand), but this was not investigated formally
Recommendations common to all 3 groups:
• JPEG 2000 JP2 (ISO/IEC 15444-1) file format
• An all color, all lossy solution is feasible
21
Don’t forget audio/video
Case: Swedish National Archive of Sound and Moving Images is digitizing 6 million hours of material
– 50 different recording formats and catalogs, growing 10% per annum
– E.g., 500,000 hours of open-reel 4-track using 16 simultaneous players, 8 players per operator
– E.g., 220,000 hours of VHS using 12 simultaneous players
Digitizing and ingesting 42 TB/month
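To get a feel for these figures, here is a back-of-envelope calculation. It is only a sketch under stated assumptions that are not in the slides: decimal terabytes, a 30-day month, real-time playback, and players running around the clock.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month
BYTES_PER_TB = 10**12               # assumption: decimal terabytes
HOURS_PER_YEAR = 24 * 365           # assumption: players never stop


def sustained_rate_mb_per_s(tb_per_month):
    """Average ingest rate in MB/s needed to keep up with a monthly volume."""
    return tb_per_month * BYTES_PER_TB / SECONDS_PER_MONTH / 10**6


def years_of_playback(total_hours, players):
    """Calendar years needed if every player digitizes in real time, nonstop."""
    return total_hours / players / HOURS_PER_YEAR


print(round(sustained_rate_mb_per_s(42), 1))       # ingest rate, MB/s
print(round(years_of_playback(500_000, 16), 2))    # open-reel 4-track
print(round(years_of_playback(220_000, 12), 2))    # VHS
```

Under those assumptions, 42 TB/month works out to roughly 16 MB/s sustained, and the open-reel and VHS collections would take roughly 3.6 and 2.1 years of uninterrupted playback respectively. Real throughput would be lower once tape handling, operator shifts, and equipment downtime are counted.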
22
Cheap, reliable storage
OK, we can make files smaller and we can move lots of them quickly, but can we make disk cheaper and still reliable?
• RAID (Redundant Arrays of Inexpensive Disks) 1980s
• JBOD (Just a Bunch of Disks) 1990s
• MAID (Massive Arrays of Idle Disks) 2000s
23
Lots of annoying files, or “making files fewer”
Origin: web archiving
Solution: aggregate W/ARC file format
– Many “files” in one file for speed and ease
– Records are sort of peers of files
Generalization to mass digitization and other processing products
W/ARC File Anatomy
WARC = Web ARChive file format
[Diagram: a W/ARC file is a sequence of W/ARC records; each record is a text header followed by a content block]
• Text header: length, source URI, date, type, …
• Content block: e.g., HTTP response headers and length bytes of HTML, GIF, PDF, …
Append at will
WARC is fast track ISO work item
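The aggregate-record idea can be illustrated with a small sketch. This is a simplified, hypothetical layout written for this example and is not the actual WARC or ARC syntax: the header field names (`uri`, `date`, `type`, `length`) and the function names are assumptions. It shows only the two properties named on the slide, a text header giving the length of the content block that follows, and appending records at will.

```python
import io


def append_record(stream, uri, date, rtype, content):
    """Append one record: text header, blank line, then `content` bytes."""
    header = (
        f"uri: {uri}\r\n"
        f"date: {date}\r\n"
        f"type: {rtype}\r\n"
        f"length: {len(content)}\r\n"
        "\r\n"
    ).encode("utf-8")
    stream.seek(0, io.SEEK_END)  # appending never rewrites earlier records
    stream.write(header + content + b"\r\n")


def read_records(stream):
    """Yield (header dict, content bytes) for each record in the file."""
    stream.seek(0)
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        header = {}
        while line not in (b"\r\n", b""):
            key, _, value = line.decode("utf-8").rstrip("\r\n").partition(": ")
            header[key] = value
            line = stream.readline()
        # The length field lets the reader skip binary content without scanning it
        content = stream.read(int(header["length"]))
        stream.readline()  # consume the separator written after the content
        yield header, content
```

Because each header states the content length, a reader can jump from record to record, and the content block may hold arbitrary bytes (an image, a PDF) without any escaping. The same container could hold page images or OCR output from mass digitization just as well as web captures.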
25
Digitizing the Digital
Origin: preservation of revenue stream
Case of Data Desiccation, creating no-frills, sometimes feature-poor derivatives that retain most of the original scholarly value but are likely to be less perishable than original format (similar to “digital microfilm”)
Save desiccated derivatives along with original, just in case no one ever again
• Has the funds to touch files
• Has the expertise to convert them properly
26
Example
Photo of Mission San Luis de Tolosa [2]About the City [3]Visiting SLO [4]What’s New [5]City Government [6]Employment Opportunities [7]Bids & Proposals [8]Economic Development [9]FAQs [10]How are we doing?
City of San Luis Obispo About the City
[Choose a Destination....] [11]Search [12]Contact Us [13]City Home
A Brief History
Who we are and how we got started. The City of San Luis Obispo serves as the commercial, governmental and cultural hub of California’s Central Coast. One of California’s oldest communities, it began with the founding of Mission San Luis Obispo de Tolosa in 1772 by Father Junípero Serra as the fifth mission in the California chain of 21 missions. The mission was named after Saint Louis, a 13th Century Bishop of Toulouse, France. (San Luis Obispo is Spanish for "St. Louis, the Bishop".) It was first incorporated in 1856 as a General Law City, and became a Charter City in 1876.
Where we’re located. With a population of 44,000, the City is located eight miles from the Pacific Ocean and is midway between San Francisco and Los Angeles at the junction of Highway 101 and scenic Highway 1. San Luis Obispo is the County Seat, and a number of federal and state regional offices and facilities are located here, including Cal Poly State University, Cuesta Community College, Regional Water Quality Board and Caltrans District offices. The City’s ideal weather and natural beauty provide numerous opportunities for outdoor recreation at nearby City and State parks, lakes, beaches and wilderness areas.
Great place to live and visit. While San Luis Obispo grew relatively…
27
Example continued: endnotes
…[18]About the City | [19]Visiting SLO | [20]What’s New | [21]City Government | [22]Employment [23]Bids & Proposals | [24]Economic Development | [25]FAQs | [26]How are we doing? [27]©2006, City of San Luis Obispo
References
1. http://www.ci.san-luis-obispo.ca.us/briefhistory.asp#content
2. http://www.ci.san-luis-obispo.ca.us/about.asp
3. http://www.ci.san-luis-obispo.ca.us/visit.asp
4. http://www.ci.san-luis-obispo.ca.us/whatsnew.asp
5. http://www.ci.san-luis-obispo.ca.us/government.asp
6. http://www.ci.san-luis-obispo.ca.us/humanresources/index.asp
7. http://www.ci.san-luis-obispo.ca.us/finance/bids.asp
8. http://www.ci.san-luis-obispo.ca.us/economicdevelopment/index.asp
9. http://www.ci.san-luis-obispo.ca.us/faq.asp
10. http://www.ci.san-luis-obispo.ca.us/how.asp
11. http://www.ci.san-luis-obispo.ca.us/search2.asp
12. http://www.ci.san-luis-obispo.ca.us/contact.asp
13. http://www.ci.san-luis-obispo.ca.us/index.asp
14. http://www.ci.san-luis-obispo.ca.us/visit.asp…
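A desiccated derivative like the example above (plain text with numbered link markers and a list of references at the end) can be approximated with a few lines of code. The slides do not say which tool produced the example, so the following is only a minimal sketch using Python's standard HTML parser; the class and function names are invented for illustration.

```python
from html.parser import HTMLParser


class Desiccator(HTMLParser):
    """Collect visible text and replace each link with a numbered marker."""

    def __init__(self):
        super().__init__()
        self.parts = []   # text fragments in document order
        self.links = []   # link targets, numbered from 1

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
                self.parts.append(f"[{len(self.links)}]")

    def handle_data(self, data):
        self.parts.append(data)


def desiccate(html):
    """Return plain text with [n] link markers followed by a reference list."""
    parser = Desiccator()
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.parts).split())  # collapse whitespace
    if not parser.links:
        return text
    refs = "\n".join(f"{n}. {url}" for n, url in enumerate(parser.links, 1))
    return f"{text}\n\nReferences\n{refs}"
```

This sketch drops all layout, images, and styling and keeps only the words and link targets, which is the trade the slide describes: a feature-poor derivative that is likely to stay readable. It makes no attempt to skip script or style content, so a production tool would need more care.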
28
Desiccation and Mass Digitization?
How to make the OCR’d plain text version of a book as acceptable as possible?
Very difficult problem: cf. work of Project Gutenberg and Distributed Proofreaders
– Born-digital plain text prettier than OCR
– Page numbers, footnotes, sidebars
– Multiple columns and reading order
At the same time, page/section/chapter structural layout is a mass digitization feature frontier