Mass digitisation?
Astrid Verheusen
Project manager
Research & Development Division
National Library of the Netherlands
LIBER-EBLIDA Workshop on Digitisation
Royal Library, Copenhagen, Denmark
25 October 2007
Koninklijke Bibliotheek – National Library of the Netherlands
LIBER-EBLIDA Workshop on Digitisation
What is mass digitisation?
• Millions of books rather than millions of pages
• No selection/no collections (digitise everything!)
• Mainly books
• Exclusion of special collections
• Low quality standards
• Ignore copyright issues
• Ignore long term preservation issues
Koninklijke Bibliotheek - Digitisation in the past
• Experience with digitisation since 1995
• Web exhibitions / highlights of collections
• Small-scale digitisation projects
• Mainly visually attractive images
• Emphasis on techniques / trial and error
• Exploration of possibilities
• Co-operation on a small scale
Koninklijke Bibliotheek - Digitisation 2000-2005
Shift in emphasis:
• From highlights to larger collections
• Project based
• (Inter)national co-operation
• Established methods and techniques
• Awareness of digital preservation
• More text material & audio/video
• Further exploration of possibilities
applications made with the digitised material
Memory of the Netherlands
Koninklijke Bibliotheek - present & future -1
• Strategic plan 2006-2009: “Development of a national
programme for the mass digitisation of sources for
research in the humanities”
• Target audience
  • Scientific research
  • Public at large
• Development of standards and services
• Particular attention to digital preservation
• Preservation imaging
• No commercial partners for funding
Koninklijke Bibliotheek - present & future -2
Text digitisation
• Until recently: on a small scale
• Printed and typed sources (not handwritten)
• Issues differ from images
  • Structure / navigation
  • Conversion to full text (OCR)
  • Scanning from microfilm
  • Search & Retrieval
Koninklijke Bibliotheek - Projects 2007-2011

Project                                   Number of pages   Budget (M€)
Dutch parliamentary papers 1814-1995            2,300,000          10.5
Dutch daily newspapers 1618-1995                8,000,000          12.5
Special collections – books before 1800         1,300,000           3.0
Radio news bulletins                            1,500,000           0.5
Metamorfoze - preservation imaging            28,000,000?          24
Atjeh                                             200,000           0.3
Memory of the Netherlands                         350,000           3.5
Total                                          42,150,150          54.3
Koninklijke Bibliotheek - Issues
• Costs of digitisation: € 1.3 per page
• Costs of exploitation: millions per year from 2011 onwards
• Technical infrastructure
  • Storage (1 PB needed)
  • Processing 2 million files per month
• Search & retrieval is not effective enough
• Organisational infrastructure is not efficient
• The process is too slow, we want to digitise faster and
more...
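The figures on this slide can be cross-checked against the project totals with some simple arithmetic. The sketch below is only a back-of-envelope illustration: the function and constant names are made up for this example, and it assumes one file per page, decimal petabytes, and that the full 1 PB is used for the pages in the project table. None of these assumptions is stated in the slides.

```python
# Back-of-envelope check of the figures on the "Projects 2007-2011" and
# "Issues" slides. All names and simplifications here are illustrative.

TOTAL_PAGES = 42_150_150       # total pages as stated in the project table
TOTAL_BUDGET_EUR = 54.3e6      # total budget: 54.3 million euro
FILES_PER_MONTH = 2_000_000    # processing rate from the "Issues" slide
STORAGE_BYTES = 1e15           # 1 PB, assuming decimal units (10**15 bytes)


def cost_per_page(budget_eur: float, pages: int) -> float:
    """Average budget per page in euro."""
    return budget_eur / pages


def months_to_process(pages: int, files_per_month: int) -> float:
    """Months of processing, assuming one file per page."""
    return pages / files_per_month


def megabytes_per_page(storage_bytes: float, pages: int) -> float:
    """Average storage per page in MB, assuming all storage holds page files."""
    return storage_bytes / pages / 1e6


if __name__ == "__main__":
    cost = cost_per_page(TOTAL_BUDGET_EUR, TOTAL_PAGES)
    months = months_to_process(TOTAL_PAGES, FILES_PER_MONTH)
    size_mb = megabytes_per_page(STORAGE_BYTES, TOTAL_PAGES)
    print(f"Cost per page:      EUR {cost:.2f}")
    print(f"Processing time:    {months:.1f} months")
    print(f"Storage per page:   {size_mb:.1f} MB")
```

With the slide totals this gives about € 1.29 per page, in line with the € 1.3 quoted above, and roughly 21 months of processing at 2 million files per month.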
We cannot slow down to make things perfect
The rising tide will lift all boats
Mass Digitization: Implications for Information Policy
Report from “Scholarship and Libraries in Transition: A Dialogue about the Impacts of Mass Digitization Projects”
Symposium held on March 10-11, 2006, University of Michigan, Ann Arbor
Content
Presentation
Search & Retrieval
Storage
Processing
Project management & Organization
Finance
Content: Selection & Preparation
Old approaches
• Much effort spent on selection
• Ignorance of copyright issues…
• Minute assessment of missing material
• Replacement of torn pages
Content: Selection & Preparation
New approaches
• Less effort on the selection process (integral
collections)
• Negotiation/co-operation with publishing
sector
• Limited effort on retrieving missing
pages/issues
• Limited effort on restoration
Content: Digital imaging & metadata
Old approaches
• Very high quality images
• Capture as much detail from the original as possible
• Minimize damage to the original
• Master & access images
• Lossless compression (TIFF)
• Experiment with our own scanners
Content: Digital imaging & metadata
New approaches
• One format for both access and preservation
• New formats to save storage (JPEG2000)
• Outsource all imaging activities
• Consider .txt as a master…
Processing: Quality assurance
Old approaches
• High standards for quality assurance (often
manual)
• Expensive Document Management System for
quality control
Processing: Quality assurance
New approaches
• Not realistic to check quality for all files
• We need automatic quality assurance tools
• OCR often not involved in quality assurance
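As an illustration of what a very basic automatic quality assurance tool could look like, the sketch below flags empty files and files whose first bytes do not match the TIFF or JP2 (JPEG 2000) file signature. It is a minimal example and not the tooling used by the Koninklijke Bibliotheek: the function names, the reliance on file extensions and the flat directory layout are all assumptions made for this example. Real quality assurance would also need to look at image content and OCR output.

```python
from pathlib import Path

# Standard file signatures ("magic bytes") for the two image formats.
TIFF_SIGNATURES = (b"II*\x00", b"MM\x00*")                  # little / big endian
JP2_SIGNATURE = bytes.fromhex("0000000C6A5020200D0A870A")   # JP2 signature box


def check_image_file(path: Path) -> list[str]:
    """Return a list of problems found for one file (empty list = passed)."""
    if path.stat().st_size == 0:
        return ["empty file"]

    with path.open("rb") as handle:
        header = handle.read(12)

    suffix = path.suffix.lower()
    if suffix in (".tif", ".tiff"):
        if not header.startswith(TIFF_SIGNATURES):
            return ["not a valid TIFF header"]
    elif suffix == ".jp2":
        if not header.startswith(JP2_SIGNATURE):
            return ["not a valid JP2 header"]
    else:
        return ["unexpected file type"]
    return []


def check_batch(directory: Path) -> dict[str, list[str]]:
    """Check every file in a (hypothetical) delivery directory.

    Only files with problems are reported, so staff review exceptions
    instead of opening every file by hand.
    """
    report: dict[str, list[str]] = {}
    for path in sorted(directory.iterdir()):
        if path.is_file():
            problems = check_image_file(path)
            if problems:
                report[path.name] = problems
    return report
```

Calling `check_batch(Path("delivery_batch"))` on a hypothetical delivery directory returns a dictionary that maps each suspect file name to the problems found, so only the exceptions need manual inspection.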
Search & Retrieval
Old approaches
• Find the best search engine
• Search in metadata
• Digitise text without OCR
• We decide what the user wants
Search & Retrieval
New approaches
• All text digitisation projects include OCR
• Search through millions of pages of text
• Experiment with tools for enhanced access &
textmining
• Growing awareness that we have to involve our
users
Storage
Old approaches
• Storage on CD-ROM and DVD
• Master files in e-Depot: 1 Petabyte needed
• Storage of all master files for the long term
• Access files are stored in a different system
Storage
New approaches
• Storage strategy which balances costs, access
and preservation
• Alternative file formats to minimize storage
costs & increase throughput for delivery and
transfer
• Use one file both as master and access file
Finance
• All costs are now specified
• Division of budget
  • 30% Staff
  • 10% Hard- & software
  • 10% Research & Development
  • 50% Digitisation, OCR & metadata
• Exploitation costs are becoming ‘dramatic’
• New business models
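To give a feeling for the amounts behind these percentages, the sketch below applies the division of budget to a total. Applying it to the € 54.3 million total of the 2007-2011 projects is an assumption made for illustration only, since the slides do not say which budget the percentages refer to. The function name is also made up for this example.

```python
# Budget shares as listed on the "Finance" slide.
BUDGET_SHARES = {
    "Staff": 0.30,
    "Hard- & software": 0.10,
    "Research & Development": 0.10,
    "Digitisation, OCR & metadata": 0.50,
}


def divide_budget(total_eur: float, shares: dict[str, float]) -> dict[str, float]:
    """Split a total budget over the categories according to their shares."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("budget shares must add up to 100%")
    return {category: total_eur * share for category, share in shares.items()}


if __name__ == "__main__":
    # Assumption: the shares apply to the EUR 54.3 million project total.
    for category, amount in divide_budget(54.3e6, BUDGET_SHARES).items():
        print(f"{category:<30} EUR {amount / 1e6:5.2f} million")
```

Under that assumption, half of the total (about € 27 million) would go to digitisation, OCR and metadata.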
Organisation
• All digitisation activities in R&D department
  • Involvement of other parts of the library is necessary
• Digitisation & digital preservation are separate
  activities
  • Integration is necessary
• Digitisation activities are all project based
  • Integration with standing organisation is necessary
Conclusion
‘Holding out for an ideal solution is often not feasible;
moreover, implementing less-than-perfect solutions can
enable us to be flexible, modular, and nimble so that we can
continue to refine our strategies as new options become
available.’
Preservation in the Age of Large-scale Digitization: A White Paper
By Oya Y. Rieger
Council on Library and Information Resources
Thank you
[email protected]