Practices and Open Problems of Document Digitization For
Million Book Project
Xiaohui Zheng
Tsinghua Univ. Library
Background
THU joined the CADAL Project at the end of 2002 and finished 50,000 e-books and e-dissertations by July 2006.
The Digitization Center was founded in March 2003. It is affiliated with the Digital Library Research Division of THU.
Experiences
In house or out source
Planning and Source Material Selection
Digitization Process
Facility and Staff Management
In house or out source: In House
Pro:
1. Full control over all procedures, the handling of materials, and the quality of products.
2. No risk of working with a vendor who turns out to be incompetent.
3. Provides a foundation of experience that helps with creating policies, cost analyses, standards, and data transfer.
4. Keeping the production line in house lets other digitization projects move forward smoothly within a flexible organization.
Con:
1. Less experience in staffing and workflow management
2. Low productivity
3. Small Scale
In house or out source: Out Source
Pro:
1. Professional staff and developed workflow
2. High productivity. Large output in short time.
3. Large Scale
Our Choice
In house operation
10 staff are enough to finish 50,000 e-books in 3 years
Enough time to train staff and improve efficiency
Source Material Selection
Copyright was the place to start
Easy to handle
Good quality of materials (not fragile)
Quick action in submitting the title list for deduplication
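The slides do not say how submitted title lists are checked for duplicates. As a rough illustration only, a candidate list could be compared against titles already claimed elsewhere by normalizing each title to a comparison key. The function names and the normalization rule below are assumptions, not part of any CADAL tool.

```python
import unicodedata


def normalize_title(title: str) -> str:
    """Build a comparison key: Unicode NFKC form, case-folded, letters and digits only."""
    folded = unicodedata.normalize("NFKC", title).casefold()
    return "".join(ch for ch in folded if ch.isalnum())


def find_duplicates(candidates, already_claimed):
    """Split candidate titles into (new, duplicate) lists.

    A candidate counts as a duplicate if its key matches a claimed title
    or an earlier candidate in the same list.
    """
    seen = {normalize_title(t) for t in already_claimed}
    new, duplicate = [], []
    for title in candidates:
        key = normalize_title(title)
        if key in seen:
            duplicate.append(title)
        else:
            seen.add(key)
            new.append(title)
    return new, duplicate
```

Matching on title alone is deliberately crude. A real check would probably also compare author, edition, and publication year before rejecting a book.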
Digitization Process
Preparation (Selection, Identifier assignment)
Scanning
Image processing
Metadata creation and packaging
Quality control
Data storage and backup
Ancient book Scanning and Image processing (Double page upside down scanning)
De-speckling and Centering
CADAL production tool: image processing
Splitting into two pages (Batch processing)
Rotating (Batch processing)
De-skewing (batch processing)
TPI
Format conversion (Batch processing)
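The rotating and splitting steps above can be sketched in a few lines. This is a minimal illustration that assumes each double-page scan is a plain list of pixel rows and that "upside down" means a 180-degree rotation. It is not the CADAL tool's implementation, and all function names are hypothetical.

```python
def rotate_180(image):
    """Rotate an image (a list of pixel rows) by 180 degrees."""
    return [row[::-1] for row in image[::-1]]


def split_pages(image):
    """Split a double-page image down the middle into (left, right) halves."""
    mid = len(image[0]) // 2
    left = [row[:mid] for row in image]
    right = [row[mid:] for row in image]
    return left, right


def process_batch(scans):
    """Turn each upside-down double-page scan upright, then split it into two pages."""
    pages = []
    for scan in scans:
        upright = rotate_180(scan)
        left, right = split_pages(upright)
        # Assumption: pages are emitted left then right; books read
        # right-to-left would need the opposite order.
        pages.extend([left, right])
    return pages
```

In practice an imaging library would do the pixel work, and the split position would follow the detected gutter rather than the exact midpoint.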
Metadata creation and packaging
Facility and Staff Management
Facility:
Three AVISION AVA3 flatbed scanners
Two AVISION FB6000E flatbed scanners
Minolta PS 7000
High-speed AVISION AV3800
Staff:
1 manager, 1 technical supervisor, 11 temp. staff
Capacity: 5,000,000 pages/year
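To put the capacity figure in daily terms, the following back-of-the-envelope sketch breaks it down per day and per temporary staff member. The annual capacity and staff count come from the slide. The number of working days and the average book length are illustrative assumptions that the source does not state.

```python
PAGES_PER_YEAR = 5_000_000     # from the slide
TEMP_STAFF = 11                # from the slide
WORKING_DAYS_PER_YEAR = 250    # assumption, not stated in the source
PAGES_PER_BOOK = 300           # assumption, not stated in the source


def pages_per_day(pages_per_year, working_days):
    """Average number of pages the center must produce per working day."""
    return pages_per_year / working_days


def pages_per_staff_per_day(pages_per_year, working_days, staff):
    """Average daily page count per staff member."""
    return pages_per_day(pages_per_year, working_days) / staff


def books_per_year(pages_per_year, pages_per_book):
    """Annual book output implied by the page capacity and an average book length."""
    return pages_per_year / pages_per_book


if __name__ == "__main__":
    print(round(pages_per_day(PAGES_PER_YEAR, WORKING_DAYS_PER_YEAR)))
    print(round(pages_per_staff_per_day(PAGES_PER_YEAR, WORKING_DAYS_PER_YEAR, TEMP_STAFF)))
    print(round(books_per_year(PAGES_PER_YEAR, PAGES_PER_BOOK)))
```

Under these assumptions the center would handle about 20,000 pages per working day, or roughly 1,800 pages per temporary staff member. Changing either assumed constant changes the result proportionally.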
Network topology and data storage system
[Figure: network diagram. Labeled components: WAN, gateway, LAN, Gigabit Ethernet switch, NAS backup system, DAS Dell system, 4 flatbed scanners, high-speed scanner, face-up scanner, 9 manual processing PCs, 6 automatic processing PCs]
Related Software
Scanning: QuickScan…
Image processing: Bookshop, ACDSee, XnView, UltraEdit, Scanfix, DjVuerPro,…
Cataloging and Packaging: CADAL Cataloging Tool, OEBEditor, CMDL Cataloging Toolkit,…
Data transfer: DResManages
Open Problems And Considerations
Content Discovery
Metadata description is rough and inconsistent
Resource Selection
The coverage of the million books is neither clear nor systematic.
OCR Processing
OCR processing has not yet started. OCR technology for ancient books is underdeveloped.
Copyright Problem
Almost 400,000 dissertations and modern books in the CADAL collection lack a clear copyright disclaimer.
Organization Structure
My suggestion: more source collection providers, fewer digitization centers.
Thank you for your attention!