The Million Book Project

The Mini-UL Digital Library Platform

Carnegie Mellon University

School of Computer Science

Raj Reddy

Eric Burns

What is the Million Book Project?

Free-to-read, open-platform digital library
  Worldwide distribution and mirroring
  Public domain works
  Out of print but in copyright
  Rare materials

Collaborative content acquisition
  India: 20 mini scanning centers, 3 mega scanning centers; over 80,000 books to date
  China: over 30,000 books to date
  USA / Carnegie Mellon (Hunt Library/SCS): 1,200 books; technology contributor

Truly multi-lingual corpus
  Several Indian languages
  Mandarin Chinese
  Most European languages

MBP offers unique systems challenges

Multiple deployments
  China
  India
  Partners in US

Human-intensive scanning process
  Error prone
    DC XML entered by hand
    Operator error on scanning devices
  Difficult to standardize
  Multiple QA passes required

Everyone wants autonomy and customization
  System-level solution must satisfy small and large data sets
  CMU must provide a framework for remote sites to extend

Equipment budget is limited

Developing nations’ networks are limited
  China, India output must be shipped to US

Core Problems

Multiple scanning centers, each with:
  Distinct values and goals
  Limited connectivity
  Varying IT infrastructure

Common base requirements
  Searching
  Browsing
  Viewing
  File-system compatibility

Basic standard for acquiring and storing scanned books
  Data preservation
  Quality assurance
  Flexibility
  Openness

Fault-tolerant storage at all sites
Data movement via physical shipment
Standardized OS and base software

Our Solution: Mini-UL Embedded

Digital library on a CD
  OS (Knoppix Linux), servers (Apache, PaperSight ImageServer), code (Perl) on single ISO
  Boots single systems or whole clusters
  Ensures standardization, eases upgrades
    To use new software, admins burn CD and reboot

Commodity PC and disk hardware spec
  Software RAID: use low-end PC as network-attach storage
  Sub-$1000 PC = 1 TB NASD
    Barebones economy PC
    250 GB OEM disk x 4
  Add storage PCs as needs grow
    1 processor per storage unit

CD + PC(s) = Embedded digital library
  “Black box” approach
    Dump MBP-format books into upload bucket
    Easily search, browse, view, and download all books added

The MBP Book Format

Directory with five subdirectories:
  OTIFF
    “Original TIFF”: exactly as scanned
    Eight-digit, zero-padded page numbers (00000123.tif)
    1-bit color at 600 DPI, lossless
  PTIFF
    “Processed TIFF”: current best batch image processing
    Eight-digit, zero-padded numbers match OTIFF
  TXT
    ASCII, UTF-8, or UTF-16 text
    Numbers match OTIFF/PTIFF
  HTML
    UTF-8 HTML with low-res JPEG images
    Numbers match OTIFF/PTIFF
  [MARC|DC]
    Binary MARC record
    Dublin Core XML

Flexible: other format directories can be added

Internal storage format:
  OTIFF/PTIFF -> multipage
  TXT/HTML -> zip
  500-page book = 2001 files
  Converted at addition time to 5 files
  Speeds copying
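The naming rule and the file count above can be checked with a short sketch. It is written in Python for brevity (the project's own code is Perl) and rests on assumptions that the slides do not state: one file per page in each of the four page-level directories, a single metadata record, pages numbered from 1, and the illustrative names `txt`, `html`, and `metadata.xml`.

```python
# Assumed extensions per page-level directory; only ".tif" appears in the slides.
PAGE_DIRS = {"OTIFF": "tif", "PTIFF": "tif", "TXT": "txt", "HTML": "html"}


def page_filename(page_number, extension):
    """Eight-digit, zero-padded page file name, e.g. 00000123.tif."""
    return f"{page_number:08d}.{extension}"


def expected_files(page_count, metadata_dir="DC", metadata_name="metadata.xml"):
    """List every file a book of page_count pages would contain.

    metadata_name is a hypothetical file name used only for counting.
    """
    files = [
        f"{directory}/{page_filename(page, extension)}"
        for directory, extension in PAGE_DIRS.items()
        for page in range(1, page_count + 1)
    ]
    files.append(f"{metadata_dir}/{metadata_name}")
    return files


print(page_filename(123, "tif"))    # 00000123.tif
print(len(expected_files(500)))     # 2001 = 4 directories x 500 pages + 1 record
```

Under these assumptions the count matches the 2001 files quoted for a 500-page book, which is why packing each directory into one file (5 files in total) speeds copying.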

High-level Cluster Architecture

[Figure: web traffic reaches the head node (NASD 0); an internal network subnet connects it to NASD 1, NASD 2, …, NASD n. The NASD nodes are Network-Attach Storage Devices (SATA RAID PCs).]

Adding a Book

Head node has SMB share “Upload”
User moves one or more MBP-format books into Upload share
System automatically checks each book for completeness/correctness:
  All formats present
  Contiguous page numbers
  Metadata present and parseable
  Errors presented to user for correction
Converts to internal storage format
Assigns serial number
Moves to NASD node with most free space
Incremental search index
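The checks and the placement rule listed above might look roughly like the following. This is a Python sketch, not the Mini-UL Perl code: the function names `check_book` and `pick_node` are hypothetical, parseability is tested only for Dublin Core XML (a MARC record is only checked for presence), and free space is passed in as a plain dictionary rather than queried from real nodes.

```python
import os
import re
import tempfile
import xml.etree.ElementTree as ET

PAGE_DIRS = ("OTIFF", "PTIFF", "TXT", "HTML")
PAGE_NAME = re.compile(r"^(\d{8})\.\w+$")


def page_numbers(directory):
    """Sorted page numbers parsed from eight-digit file names."""
    matches = (PAGE_NAME.match(name) for name in os.listdir(directory))
    return sorted(int(m.group(1)) for m in matches if m)


def check_book(book_dir):
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    reference = None  # (directory name, page numbers) of the first format seen
    for sub in PAGE_DIRS:
        path = os.path.join(book_dir, sub)
        if not os.path.isdir(path):
            errors.append(f"{sub}: format directory missing")
            continue
        numbers = page_numbers(path)
        if not numbers:
            errors.append(f"{sub}: no page files")
            continue
        if numbers != list(range(numbers[0], numbers[-1] + 1)):
            errors.append(f"{sub}: page numbers are not contiguous")
        if reference is None:
            reference = (sub, numbers)
        elif numbers != reference[1]:
            errors.append(f"{sub}: page numbers do not match {reference[0]}")

    marc_dir = os.path.join(book_dir, "MARC")
    dc_dir = os.path.join(book_dir, "DC")
    has_marc = os.path.isdir(marc_dir) and bool(os.listdir(marc_dir))
    has_dc = os.path.isdir(dc_dir) and bool(os.listdir(dc_dir))
    if not (has_marc or has_dc):
        errors.append("metadata: no MARC or DC record found")
    if has_dc:
        for name in sorted(os.listdir(dc_dir)):
            try:
                ET.parse(os.path.join(dc_dir, name))
            except ET.ParseError as exc:
                errors.append(f"DC/{name}: not parseable XML ({exc})")
    return errors


def pick_node(free_bytes_by_node):
    """Name of the storage node reporting the most free space."""
    return max(free_bytes_by_node, key=free_bytes_by_node.get)


# Demo: a two-page book whose TXT directory skips page 2 and has no metadata.
with tempfile.TemporaryDirectory() as root:
    for sub, ext in (("OTIFF", "tif"), ("PTIFF", "tif"), ("HTML", "html")):
        os.makedirs(os.path.join(root, sub))
        for page in (1, 2):
            open(os.path.join(root, sub, f"{page:08d}.{ext}"), "w").close()
    os.makedirs(os.path.join(root, "TXT"))
    for page in (1, 3):
        open(os.path.join(root, "TXT", f"{page:08d}.txt"), "w").close()
    demo_errors = check_book(root)

for line in demo_errors:
    print(line)
print(pick_node({"nasd1": 100, "nasd2": 250, "nasd3": 50}))
```

Returning a list of messages rather than raising on the first problem fits the workflow described in the slides, where all errors are presented to the user for correction.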

Viewing a book

Users view original page images
  HTML, raw TXT as option

Intra-book searching
  Seeks to matching page
  Highlights token match
  Rapidly seek from one token match to the next
  Boolean queries, phrase matching

PaperSight ImageServer
  Converts 600 DPI 1-bit TIFF to ~96 DPI 8-bit GIF
  Real-time conversion performance is faster than human response
  Anti-aliased grey-scale image is ideal for monitor reading
  Significant reduction in bandwidth
  Conversion happens on hosting NASD node, not head
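To illustrate why a 1-bit scan turns into an anti-aliased grey-scale image when reduced, here is a minimal Python sketch of box-filter downsampling. It is not PaperSight ImageServer's algorithm, which the slides do not describe; the function name `downsample_1bit` and the integer reduction factor are assumptions (a factor of 6 would take 600 DPI to 100 DPI, close to the ~96 DPI target).

```python
def downsample_1bit(bitmap, factor):
    """Box-filter a 1-bit bitmap to 8-bit grey levels.

    bitmap is a list of rows of 0/1 values, where 1 means black ink.
    Each factor x factor block becomes one output pixel whose grey level
    reflects the fraction of ink in the block (255 = white, 0 = black).
    Rows and columns that do not fill a whole block are dropped.
    """
    height = len(bitmap) // factor
    width = len(bitmap[0]) // factor
    block_area = factor * factor
    output = []
    for block_y in range(height):
        row = []
        for block_x in range(width):
            ink = sum(
                bitmap[block_y * factor + dy][block_x * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            )
            row.append(255 - (255 * ink) // block_area)
        output.append(row)
    return output


# A 4x4 bitmap reduced by 2: full ink, no ink, half ink, quarter ink.
sample = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
]
print(downsample_1bit(sample, 2))   # [[0, 255], [128, 192]]
```

Blocks that are partly covered by ink come out as intermediate greys, which is what softens character edges at screen resolution.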

Browsing

Simple alphabetic browse
Keep list sizes small

The Missing Piece: Search

Searching the full text of tens of thousands of books is computationally intensive

Solution: parallelize
  Each NASD node indexes and searches content it stores
  Results are unified and sorted at head node

NASD cluster architecture maintains parity between processors and storage
  Grow from n to 2n nodes? Search speed remains constant (assuming homogeneous corpus)
  Search too slow? Increase machine count and redistribute data.

Search features

Fast! 0.1 sec per-token response in most cases (AMD 1400+)
Joint bibliographic and full-text search with single query
Phrase matching, boolean queries, cross-page phrases
Context display for full-text matches
Rich scoring system:
  Metadata matches
  Token proximity scoring (multi-token queries only)
Direct-to-page matching
  Full text matches yield actual matching page, with highlighting
Full search API (Perl)
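The parallel search pattern described above (each node searches only what it stores, the head node unifies and sorts) can be sketched as follows. This is a Python illustration, not the Perl search API: the `Node` class and `search_cluster` function are hypothetical, threads stand in for separate machines, and the score is a plain occurrence count rather than the richer scoring system listed in the slides.

```python
from concurrent.futures import ThreadPoolExecutor


class Node:
    """One storage node with an inverted index over the pages it stores."""

    def __init__(self, name):
        self.name = name
        self.index = {}  # token -> {(book_id, page): occurrence count}

    def add_page(self, book_id, page, text):
        """Index one page incrementally as a book is added to this node."""
        for token in text.lower().split():
            postings = self.index.setdefault(token, {})
            key = (book_id, page)
            postings[key] = postings.get(key, 0) + 1

    def search(self, token):
        """Return (score, book_id, page) hits for content on this node only."""
        postings = self.index.get(token.lower(), {})
        return [(count, book_id, page) for (book_id, page), count in postings.items()]


def search_cluster(nodes, token):
    """Head-node step: query every node in parallel, then merge and sort."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partial_results = list(pool.map(lambda node: node.search(token), nodes))
    hits = [hit for partial in partial_results for hit in partial]
    # Highest score first; ties broken by book id and page for stable output.
    return sorted(hits, key=lambda hit: (-hit[0], hit[1], hit[2]))


node1 = Node("nasd1")
node1.add_page("book-a", 1, "the quick brown fox")
node1.add_page("book-a", 2, "fox and fox")
node2 = Node("nasd2")
node2.add_page("book-b", 7, "a fox appears")

results = search_cluster([node1, node2], "fox")
print(results)   # [(2, 'book-a', 2), (1, 'book-a', 1), (1, 'book-b', 7)]
```

Because every hit carries its book and page, the merged list can point straight at the matching page, in the spirit of the direct-to-page matching listed above.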

Customization

APIs provided for all major components:
  Search
  Book Reader
  Metadata processing and conversion

All HTML lives in read-write space on head node
  Development sites can create rich HTML hierarchies

Scripting is not limited to CD contents
  cgi-bin and site_perl can be extended

CD/core upgrades leave extensions untouched

Future Directions

Search engine in wider distribution
  GPL
  Perl
  CPAN

“Phone Home” capability
  Individual Mini-UL systems with slow but persistent links relay manifests
    Metadata + text
  Master site to search all sites

IIIT Hyderabad contributions
  MySQL-based metadata search

Separate search and storage clusters
  9 TB hardware RAID servers
  Multiple diskless search nodes

Embedded Digital Library Uses

Gives MBP sites foundation on which to build
  Allows convergence on standards as sites contribute new extensions to main distribution
  Gives basic search, browse, view, and audit capability to any site, regardless of development staff

Uses extend beyond MBP deployments
  Any site with archives of multi-page text documents can benefit
  Only requirements are a scanner and a PC
  Virtually no administration required

Questions?
