The Million Book Project
The Mini-UL Digital Library Platform
Carnegie Mellon University
School of Computer Science
Raj Reddy
Eric Burns
What is the Million Book Project?
Free-to-read, open-platform digital library
  Worldwide distribution and mirroring
  Public domain works
  Out of print but in copyright
  Rare materials
Collaborative content acquisition
  India
    20 mini scanning centers, 3 mega scanning centers
    Over 80,000 books to date
  China
    Over 30,000 books to date
  USA / Carnegie Mellon (Hunt Library/SCS)
    1200 books, technology contributor
Truly multi-lingual corpus
  Several Indian languages
  Mandarin Chinese
  Most European languages
MBP offers unique systems challenges
Multiple deployments
  China
  India
  Partners in US
Human-intensive scanning process
  Error prone
    DC XML entered by hand
    Operator error on scanning devices
  Difficult to standardize
  Multiple QA passes required
Everyone wants autonomy and customization
  System-level solution must satisfy small and large data sets
  CMU must provide a framework for remote sites to extend
Equipment budget is limited
Developing nations’ networks are limited
  China, India output must be shipped to US
Core Problems
Multiple scanning centers, each with:
  Distinct values and goals
  Limited connectivity
  Varying IT infrastructure
Common base requirements
  Searching
  Browsing
  Viewing
  File-system compatibility
Basic standard for acquiring and storing scanned books
  Data preservation
  Quality assurance
  Flexibility
  Openness
Fault-tolerant storage at all sites
Data movement via physical shipment
Standardized OS and base software
Our Solution: Mini-UL Embedded
Digital library on a CD
  OS (Knoppix Linux), servers (Apache, PaperSight ImageServer), code (Perl) on single ISO
  Boots single systems or whole clusters
  Ensures standardization, eases upgrades
    To use new software, admins burn CD and reboot
Commodity PC and disk hardware spec
  Software RAID: use low-end PC as network-attached storage
  Sub-$1000 PC = 1 TB NASD
    Barebones economy PC
    250 GB OEM disk x 4
  Add storage PCs as needs grow
    1 processor per storage unit
CD + PC(s) = Embedded digital library
  “Black box” approach
    Dump MBP-format books into upload bucket
    Easily search, browse, view, and download all books added
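The 1 TB figure above matches the raw capacity of four 250 GB disks. The slides do not say which software RAID level is used, so the following is only a sketch of how usable capacity would vary under a few common levels. The function name and the set of supported levels are illustrative assumptions, not part of Mini-UL.

```python
def usable_capacity_gb(disk_count, disk_gb, raid_level):
    """Return usable capacity in GB for a few common RAID levels.

    Hypothetical helper: the source only states "software RAID" and
    "250 GB OEM disk x 4", not the RAID level actually deployed.
    """
    if raid_level == 0:
        # Striping only: all raw capacity is usable, no redundancy.
        return disk_count * disk_gb
    if raid_level == 1:
        # Every disk mirrors the same data.
        return disk_gb
    if raid_level == 5:
        if disk_count < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        # One disk's worth of space holds parity.
        return (disk_count - 1) * disk_gb
    if raid_level == 10:
        if disk_count < 4 or disk_count % 2:
            raise ValueError("RAID 10 needs an even number of disks, at least 4")
        # Mirrored pairs, striped together.
        return (disk_count // 2) * disk_gb
    raise ValueError(f"unsupported RAID level: {raid_level}")


if __name__ == "__main__":
    for level in (0, 5, 10):
        print(level, usable_capacity_gb(4, 250, level))
```

Under these assumptions, only striping without redundancy yields the full 1 TB; a redundant level trades some of that capacity for fault tolerance.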
The MBP Book Format
Directory with five subdirectories:
  OTIFF
    “Original TIFF”: exactly as scanned
    Eight-digit, zero-padded page numbers (00000123.tif)
    1-bit color at 600 DPI, lossless
  PTIFF
    “Processed TIFF”: current best batch image processing
    Eight-digit, zero-padded numbers match OTIFF
  TXT
    ASCII, UTF-8, or UTF-16 text
    Numbers match OTIFF/PTIFF
  HTML
    UTF-8 HTML with low-res JPEG images
    Numbers match OTIFF/PTIFF
  [MARC|DC]
    Binary MARC record
    Dublin Core XML
Flexible: other format directories can be added
Internal storage format:
  OTIFF/PTIFF -> multipage
  TXT/HTML -> zip
  500-page book = 2001 files
    Converted at addition time to 5 files
    Speeds copying
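The file counts above follow from the layout: four page-bearing directories with one file per page, plus one metadata record. A minimal sketch of the naming and the arithmetic is shown here. The `.txt` and `.html` extensions, the metadata path `DC/metadata.xml`, and page numbering starting at 1 are assumptions for illustration; only the `00000123.tif` pattern comes from the slides.

```python
# Page-bearing subdirectories named on the slide. Only ".tif" appears in
# the source; the other extensions are assumed for illustration.
PAGE_DIRS = {"OTIFF": ".tif", "PTIFF": ".tif", "TXT": ".txt", "HTML": ".html"}


def page_filename(page_number, extension):
    """Eight-digit, zero-padded page file name, e.g. 00000123.tif."""
    return f"{page_number:08d}{extension}"


def expected_files(page_count, metadata_file="DC/metadata.xml"):
    """List every file a complete book of page_count pages would contain.

    metadata_file is a hypothetical name; the slides only say the
    directory holds a binary MARC record or Dublin Core XML.
    """
    files = [
        f"{directory}/{page_filename(page, extension)}"
        for directory, extension in PAGE_DIRS.items()
        for page in range(1, page_count + 1)
    ]
    files.append(metadata_file)
    return files


if __name__ == "__main__":
    print(page_filename(123, ".tif"))   # 00000123.tif
    print(len(expected_files(500)))     # 2001 = 4 * 500 + 1
```

After conversion, each of the four page directories collapses to a single multipage or zip file, which together with the metadata record gives the 5 files mentioned above.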
High-level Cluster Architecture
[Diagram: web traffic reaches the head node (NASD 0), which connects over an internal network subnet to NASD 1, NASD 2, ..., NASD n. The NASDs are Network-Attached Storage Devices (SATA RAID PCs).]
Adding a Book
Head node has SMB share “Upload”
User moves one or more MBP-format books into Upload share
System automatically checks each book for completeness/correctness:
  All formats present
  Contiguous page numbers
  Metadata present and parseable
  Errors presented to user for correction
Converts to internal storage format
Assigns serial number
Moves to NASD node with most free space
Incremental search index
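The checks and the placement rule above could be sketched roughly as follows. This is not the Mini-UL code (which is Perl); the function names, the error messages, and the choice to parse only Dublin Core XML while merely checking that a MARC directory is non-empty are all assumptions made for illustration.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

PAGE_DIRS = ("OTIFF", "PTIFF", "TXT", "HTML")


def page_numbers(directory):
    """Sorted page numbers of files named with eight ASCII digits."""
    return sorted(
        int(f.stem)
        for f in directory.iterdir()
        if f.is_file() and len(f.stem) == 8 and f.stem.isascii() and f.stem.isdigit()
    )


def check_book(book_dir):
    """Return a list of problems found in one MBP-format book directory."""
    book = Path(book_dir)
    errors = []
    reference = None  # (directory name, page numbers) of the first format seen
    for name in PAGE_DIRS:
        directory = book / name
        if not directory.is_dir():
            errors.append(f"missing format directory: {name}")
            continue
        numbers = page_numbers(directory)
        if not numbers:
            errors.append(f"{name}: no page files")
            continue
        if numbers != list(range(numbers[0], numbers[0] + len(numbers))):
            errors.append(f"{name}: page numbers are not contiguous")
        if reference is None:
            reference = (name, numbers)
        elif numbers != reference[1]:
            errors.append(f"{name}: page numbers do not match {reference[0]}")
    dc, marc = book / "DC", book / "MARC"
    if dc.is_dir() and any(dc.glob("*.xml")):
        for xml_file in dc.glob("*.xml"):
            try:
                ET.parse(xml_file)
            except ET.ParseError as exc:
                errors.append(f"DC/{xml_file.name}: not parseable ({exc})")
    elif marc.is_dir() and any(f.stat().st_size for f in marc.iterdir() if f.is_file()):
        pass  # assumption: binary MARC is accepted without being parsed here
    else:
        errors.append("metadata missing: expected DC XML or a MARC record")
    return errors


def pick_node(free_bytes_by_node):
    """Choose the NASD node reporting the most free space."""
    return max(free_bytes_by_node, key=free_bytes_by_node.get)
```

A book with an empty error list would proceed to conversion; otherwise the messages would be shown to the user who uploaded it.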
Viewing a book
Users view original page images
  HTML, raw TXT as option
Intra-book searching
  Seeks to matching page
  Highlights token match
  Rapidly seek from one token match to the next
  Boolean queries, phrase matching
PaperSight ImageServer
  Converts 600 DPI 1-bit TIFF to ~96 DPI 8-bit GIF
  Real-time conversion performance is faster than human response
  Anti-aliased grey-scale image is ideal for monitor reading
  Significant reduction in bandwidth
  Conversion happens on hosting NASD node, not head
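To illustrate why a 1-bit scan becomes an anti-aliased grey-scale image when reduced, here is a minimal box-filter sketch. It is not PaperSight ImageServer's algorithm, which the slides do not describe. It assumes an integer reduction factor of 6 (600 DPI to 100 DPI, close to the ~96 DPI target) and a bitmap held as nested lists where 1 means black ink.

```python
def downsample_to_gray(bitmap, factor=6):
    """Reduce a 1-bit bitmap to 8-bit grey by averaging factor x factor blocks.

    bitmap: list of equal-length rows of 0/1 values, 1 = black ink.
    Returns rows of grey levels, 0 (black) to 255 (white).
    Edge pixels that do not fill a whole block are dropped.
    """
    height = len(bitmap) // factor
    width = len(bitmap[0]) // factor if bitmap else 0
    block_area = factor * factor
    output = []
    for block_y in range(height):
        row = []
        for block_x in range(width):
            # Count ink pixels inside this block.
            ink = sum(
                bitmap[block_y * factor + dy][block_x * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            )
            # More ink means a darker grey; partial coverage gives the
            # intermediate tones that smooth character edges.
            row.append(255 - (255 * ink) // block_area)
        output.append(row)
    return output
```

Each output pixel summarizes 36 input pixels, so partially covered blocks along glyph edges come out as intermediate greys rather than hard black or white steps.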
Browsing
Simple alphabetic browse
  Keep list sizes small
The Missing Piece: Search
Searching the full text of tens of thousands of books is computationally intensive
Solution: parallelize
  Each NASD node indexes and searches content it stores
  Results are unified and sorted at head node
NASD cluster architecture maintains parity between processors and storage
  Grow from n to 2n nodes? Search speed remains constant (assuming homogeneous corpus)
  Search too slow? Increase machine count and redistribute data.
Search features
Fast! 0.1 sec per-token response in most cases (AMD 1400+)
Joint bibliographic and full-text search with single query
Phrase matching, boolean queries, cross-page phrases
Context display for full-text matches
Rich scoring system:
  Metadata matches
  Token proximity scoring (multi-token queries only)
Direct-to-page matching
  Full text matches yield actual matching page, with highlighting
Full search API (Perl)
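The parallel design described above, where each node searches only what it stores and the head node unifies and sorts the results, can be sketched as a scatter-gather pattern. This is an in-memory illustration in Python, not the Perl search API. The `NodeIndex` class, the `cluster_search` function, and the ranking by raw token count are assumptions that stand in for the real index and scoring system.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
import heapq


class NodeIndex:
    """Stands in for one NASD node: indexes only the pages it stores."""

    def __init__(self):
        self.postings = {}  # token -> {(book_id, page): occurrence count}

    def add_page(self, book_id, page, text):
        for token in text.lower().split():
            hits = self.postings.setdefault(token, {})
            hits[(book_id, page)] = hits.get((book_id, page), 0) + 1

    def search(self, token):
        """Return (count, book_id, page) tuples, best match first."""
        hits = self.postings.get(token.lower(), {})
        return sorted(
            ((count, book_id, page) for (book_id, page), count in hits.items()),
            reverse=True,
        )


def cluster_search(nodes, token, limit=10):
    """Head-node role: query every node in parallel, then merge by score."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        partial_results = list(pool.map(lambda node: node.search(token), nodes))
    # Each partial list is already sorted descending, so a k-way merge
    # produces the unified ranking without re-sorting everything.
    return list(islice(heapq.merge(*partial_results, reverse=True), limit))


if __name__ == "__main__":
    node_a, node_b = NodeIndex(), NodeIndex()
    node_a.add_page("book-1", 1, "digital library digital")
    node_b.add_page("book-2", 7, "a digital world")
    print(cluster_search([node_a, node_b], "digital"))
```

Because each hit carries its page number, a result can link straight to the matching page. Per-node work depends only on the data that node stores, which is the property behind the claim that search speed holds steady as nodes and data grow together.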
Customization
APIs provided for all major components:
  Search
  Book Reader
  Metadata processing and conversion
All HTML lives in read-write space on head node
  Development sites can create rich HTML hierarchies
Scripting is not limited to CD contents
  cgi-bin and site_perl can be extended
CD/core upgrades leave extensions untouched
Future Directions
Search engine in wider distribution
  GPL
  Perl CPAN
“Phone Home” capability
  Individual Mini-UL systems with slow but persistent links relay manifests
    Metadata + text
  Master site to search all sites
IIIT Hyderabad contributions
  MySQL-based metadata search
Separate search and storage clusters
  9 TB hardware RAID servers
  Multiple diskless search nodes
Embedded Digital Library Uses
Gives MBP sites foundation on which to build
  Allows convergence on standards as sites contribute new extensions to main distribution
  Gives basic search, browse, view, and audit capability to any site, regardless of development staff
Uses extend beyond MBP deployments
  Any site with archives of multi-page text documents can benefit
  Only requirements are a scanner and a PC
  Virtually no administration required
Questions?