© January/2008 CCS Content Conversion Specialists GmbH
Weidestr. 134, 22083 Hamburg, Germany
consulting, technology, digitization services
docWORKS/METAe, a tool for converting documents into structured digital objects
Accessible e-books, Paris, January 28th, 2008
Claus Gravenhorst, Director Strategic Initiatives
Background
• Only a fraction of the world-wide cultural and scientific heritage is available in electronic form
• Limited access to digitised documents
• No common metadata standard for ingest and long-term preservation
• In-house digitisation: setting up and operating an efficient workflow from a patchwork of separate digitisation tools is complicated and expensive
• Manual work accounts for the largest share of these costs
• Outsourced digitisation: limited control and administration mechanisms regarding quality, quantity and adherence to schedules
• The cost per digitised page is still high
Challenges
• Increase the amount of accessible content in a reasonable timeframe: enable digitisation on a larger scale
• Enable quick and enhanced access through highly structured documents, for everyone
• Provide integrated digitisation/conversion technologies as well as efficient workflows to lower the total cost of content creation
• Provide a standardized output format
• Open up new dimensions for research, presentation and distribution of digital content
Goals of new technologies
• Automate the digitisation and conversion process to create more content at lower cost: less than 10 euro cents per book page
• Increase the added value of digital content through automated tagging and metadata extraction
• Provide effective workflow environments for integrating various state-of-the-art technologies
• Full integration into the given environment and workflow of content owners
• Scalability for enabling the setup of networked and distributed production environments
Workflow: Value Chain
Value chain stages: collect, capture, convert, preserve, present
Source: + Original
+ Microfilm
+ Digital Image
Items: + Books
+ Newspapers
+ Journals
+ Manuscripts
+ etc.
Scan from originals and microfilm
Import from digital image or PDF files
Image pre-processing
Layout analysis
OCR (text)
ISR (structure)
Metadata Extraction
Automated QA
QA Feedback: + Resolution
+ Deskew
+ OCR accuracy
+ Page sequence
Non-proprietary format
OS independent
METS/ALTO compliant
XML data compatible with multiple presentation systems
Improved navigation and search through structured information
Reverse migration
Automated workflow for the conversion of printed items into fully structured digital objects based on common open metadata standards
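The four QA feedback criteria named on this slide (resolution, deskew, OCR accuracy, page sequence) can be pictured as simple per-page checks. The following is a minimal sketch only: the `PageReport` fields, the `qa_feedback` function and all thresholds are hypothetical and are not taken from docWORKS.

```python
from dataclasses import dataclass

# All thresholds are illustrative assumptions, not docWORKS settings.
MIN_DPI = 300
MAX_SKEW_DEGREES = 1.0
MIN_OCR_CONFIDENCE = 0.90


@dataclass
class PageReport:
    page_number: int       # sequence number detected for the page
    dpi: int               # scan resolution
    skew_degrees: float    # residual rotation after deskewing
    ocr_confidence: float  # mean OCR confidence between 0.0 and 1.0


def qa_feedback(pages):
    """Return (page_number, problem) tuples for every failed check."""
    problems = []
    previous = None
    for page in pages:
        if page.dpi < MIN_DPI:
            problems.append((page.page_number, "resolution"))
        if abs(page.skew_degrees) > MAX_SKEW_DEGREES:
            problems.append((page.page_number, "deskew"))
        if page.ocr_confidence < MIN_OCR_CONFIDENCE:
            problems.append((page.page_number, "ocr accuracy"))
        # A gap or repeat in the numbering signals a page sequence problem.
        if previous is not None and page.page_number != previous + 1:
            problems.append((page.page_number, "page sequence"))
        previous = page.page_number
    return problems


pages = [
    PageReport(1, 400, 0.2, 0.98),
    PageReport(2, 200, 0.1, 0.97),  # resolution too low
    PageReport(4, 400, 2.5, 0.80),  # page 3 missing, skewed, weak OCR
]
issues = qa_feedback(pages)
print(issues)
# [(2, 'resolution'), (4, 'deskew'), (4, 'ocr accuracy'), (4, 'page sequence')]
```

In a real workflow, a page that fails one of these checks would presumably be routed back for re-scanning or manual correction instead of merely being listed.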
Traditional OCR
THE
AMERICAN MISSIONARY.
Vo.. XXXII JANUARY, 1878 No. 1
American Missionary Association
1877 - 1888
[placeholder body text: one long unstructured run of characters]
Physical and Logical Structure
Structure Analysis
FRONT
MAIN
BACK
Structure Analysis
Chapter 1
Chapter 2
Subchapter 1
Subchapter 2
Physical and Logical Structure
Title Page
Statement Page
Preface
Table of Contents
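The relation between physical structure (pages) and logical structure (front, main and back matter, chapters, subchapters) can be sketched as a tree of sections laid over page ranges. The labels below come from the slides. The `Section` class, the `path_for_page` helper, every page range and the nesting of the subchapters under Chapter 1 are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Section:
    label: str
    first_page: int  # first physical page (inclusive)
    last_page: int   # last physical page (inclusive)
    children: List["Section"] = field(default_factory=list)


def path_for_page(section, page):
    """Return the chain of section labels containing a physical page, or None."""
    if not (section.first_page <= page <= section.last_page):
        return None
    for child in section.children:
        sub_path = path_for_page(child, page)
        if sub_path is not None:
            return [section.label] + sub_path
    return [section.label]


# Page ranges are invented for the example.
book = Section("Book", 1, 120, [
    Section("FRONT", 1, 8, [
        Section("Title Page", 1, 1),
        Section("Statement Page", 2, 2),
        Section("Preface", 3, 5),
        Section("Table of Contents", 6, 8),
    ]),
    Section("MAIN", 9, 110, [
        Section("Chapter 1", 9, 60, [
            Section("Subchapter 1", 9, 30),
            Section("Subchapter 2", 31, 60),
        ]),
        Section("Chapter 2", 61, 110),
    ]),
    Section("BACK", 111, 120),
])

print(path_for_page(book, 40))
# ['Book', 'MAIN', 'Chapter 1', 'Subchapter 2']
```

With such a mapping, a full-text hit on a given page can be reported together with the chapter or subchapter it belongs to.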
METS/ALTO – open metadata standard
Advantages of fully tagged objects based on XML and open metadata standards:
• Supports open metadata standards such as METS, DC, MODS, NISO MIX and ALTO
• Full description of the original -> “Digital Original”
• With logical structures, search results are improved (chapter- and article-based) and more easily accessed (chapter titles, headlines, pictures with captions, footnotes, etc.)
• Enables data exchange with any XML-based third-party system
• Provides the source for transformation to other formats used for distribution (various e-book formats, up to the XML-based open e-book format EPUB of the International Digital Publishing Forum, IDPF), for navigation, and for adapted visual or audio-visual presentation (e.g. DAISY)
• Supports digital long-term preservation
• Enables migration to meet future standards
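As an illustration of the data-exchange point, a third-party system needs nothing more than a generic XML parser to read word-level text and coordinates from an ALTO page. The sketch below uses Python's standard library on a small hand-written sample. The sample is simplified and not schema-validated, the namespace shown is an assumption (it differs between ALTO versions, so the code matches elements by local name), and `words_from_alto` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written sample page; real ALTO files carry more metadata.
ALTO_PAGE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2480" HEIGHT="3508">
      <PrintSpace HPOS="200" VPOS="300" WIDTH="2000" HEIGHT="2900">
        <TextBlock ID="TB1" HPOS="200" VPOS="300" WIDTH="2000" HEIGHT="120">
          <TextLine HPOS="200" VPOS="300" WIDTH="2000" HEIGHT="120">
            <String CONTENT="AMERICAN" HPOS="200" VPOS="300" WIDTH="900" HEIGHT="120" WC="0.98"/>
            <SP HPOS="1100" VPOS="300" WIDTH="40"/>
            <String CONTENT="MISSIONARY." HPOS="1140" VPOS="300" WIDTH="1060" HEIGHT="120" WC="0.95"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip a '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def words_from_alto(alto_xml):
    """Return one dict per recognised word with its text and bounding box."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        if local_name(element.tag) == "String":
            words.append({
                "text": element.get("CONTENT"),
                "hpos": float(element.get("HPOS")),
                "vpos": float(element.get("VPOS")),
                "width": float(element.get("WIDTH")),
                "height": float(element.get("HEIGHT")),
            })
    return words


words = words_from_alto(ALTO_PAGE)
print(" ".join(word["text"] for word in words))
# AMERICAN MISSIONARY.
```

Because every word keeps its position on the page image, a presentation system can highlight search hits directly on the scan.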
13
© January/2008 CCS Content Conversion Specialists GmbH
Weidestr. 134, 22083 Hamburg, Germany
METS/ALTO – open metadata standard
METS: Metadata Encoding and Transmission Standard
ALTO: Analyzed Layout and Text Object
[Diagram: a METS document linked to TIFF images and ALTO files]
METS/ALTO XML object
A document processed in docWORKS is converted into one METS XML file. It reflects the whole physical and logical structure and manages all links to the image files and the related ALTO XML files. There is exactly one ALTO file for each image file.
ALTO is based on a standardized page description schema and contains all information about a page (print space, margins, coordinates, OCR results).
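The linking role of the METS file can be sketched with a small hand-written example: a file section lists image and ALTO files, and a physical structure map points each page at its files. The element names (`fileSec`, `fileGrp`, `file`, `FLocat`, `structMap`, `div`, `fptr`) are standard METS. The `USE` values, file IDs and paths are assumptions, and the actual docWORKS output may be organised differently.

```python
import xml.etree.ElementTree as ET

METS = "{http://www.loc.gov/METS/}"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hand-written sample; IDs, USE values and paths are hypothetical.
METS_SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="image">
      <mets:file ID="IMG00001"><mets:FLocat LOCTYPE="URL" xlink:href="images/00001.tif"/></mets:file>
      <mets:file ID="IMG00002"><mets:FLocat LOCTYPE="URL" xlink:href="images/00002.tif"/></mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="alto">
      <mets:file ID="ALTO00001"><mets:FLocat LOCTYPE="URL" xlink:href="alto/00001.xml"/></mets:file>
      <mets:file ID="ALTO00002"><mets:FLocat LOCTYPE="URL" xlink:href="alto/00002.xml"/></mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="book">
      <mets:div TYPE="page" ORDER="1"><mets:fptr FILEID="IMG00001"/><mets:fptr FILEID="ALTO00001"/></mets:div>
      <mets:div TYPE="page" ORDER="2"><mets:fptr FILEID="IMG00002"/><mets:fptr FILEID="ALTO00002"/></mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def page_files(mets_xml):
    """Return (order, image_paths, alto_paths) for each physical page."""
    root = ET.fromstring(mets_xml)
    files = {}  # file ID -> (USE of its group, location)
    for group in root.iter(METS + "fileGrp"):
        for file_el in group.iter(METS + "file"):
            location = file_el.find(METS + "FLocat").get(XLINK_HREF)
            files[file_el.get("ID")] = (group.get("USE"), location)
    pages = []
    for struct_map in root.iter(METS + "structMap"):
        if struct_map.get("TYPE") != "PHYSICAL":
            continue
        for div in struct_map.iter(METS + "div"):
            if div.get("TYPE") != "page":
                continue
            by_use = {}
            for fptr in div.findall(METS + "fptr"):
                use, location = files[fptr.get("FILEID")]
                by_use.setdefault(use, []).append(location)
            pages.append((int(div.get("ORDER")),
                          by_use.get("image", []), by_use.get("alto", [])))
    return pages


pages = page_files(METS_SAMPLE)
# Check the rule stated above: exactly one ALTO file per image file.
one_to_one = all(len(images) == 1 and len(altos) == 1 for _, images, altos in pages)
print(pages[0], one_to_one)
# (1, ['images/00001.tif'], ['alto/00001.xml']) True
```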
Workflow: Institution-based, integrated
[Workflow diagram; labels as shown on the slide]
Book Delivery
Document UID
Barcode
ISBN
Scanning
Image
Re-Scan
Conversion
Imaging
Layout Analysis
OCR
ISR
Post OCR Correction
Automated Quality Assurance
Reject Condition
QA + Correction offshore
Delivery QA random
Final Output
Metadata
Database
Repository
Metadata
Z 39.50
Selected Reference
CCS technology is in use at digitisation service providers as well as at leading cultural and scientific institutions around the world:
British Library
• Institution-based centre, fully operated by CCS staff
• Process aligned with the standards, databases and workflows in use at the library
• Employment of robotic scanning technology (colour)
• 1 million pages per month, 25 million in 2 years
• Full production since the beginning of September 2007
• Out-of-copyright books from the 19th century
• Output in METS/ALTO, JPEG2000 and PDF (e-book)
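The two volume figures in this list can be cross-checked with simple arithmetic. The sketch below also applies the per-page cost goal from the "Goals of new technologies" slide as a hypothetical ceiling. The deck does not state the actual cost of this project, so the last figure is an illustration and not a reported result.

```python
# Figures quoted on the slides.
pages_per_month = 1_000_000
stated_total_pages = 25_000_000
months = 24  # "2 years"

# Volume reached at exactly the stated monthly rate.
total_at_stated_rate = pages_per_month * months

# Average monthly rate needed to reach the stated two-year total.
required_rate = stated_total_pages / months

# Assumption: the goal of "less than 10 euro cents per page" is met.
# Integer cents avoid floating-point rounding in the multiplication.
goal_cents_per_page = 10
cost_ceiling_eur = stated_total_pages * goal_cents_per_page / 100

print(total_at_stated_rate)   # 24000000
print(round(required_rate))   # 1041667
print(cost_ceiling_eur)       # 2500000.0
```

The two figures agree to within about 4 percent, so "1 million pages per month" reads as a rounded rate.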
Conclusion and Perspective
• docWORKS/METAe has reached a high degree of automation and supports a broad range of document types
• Improved accessibility through highly structured digital objects for everyone
• docWORKS is in use at the in-house digitisation centres of various content owners around the world
• Growing demand from the publishing industry
• Scalable technology enables mass digitisation
• The level of automation is constantly increasing
• Major goal is lowering the cost of digitisation while assuring high quality standards
Contact
Claus Gravenhorst, Director Strategic Initiatives
CCS Content Conversion Specialists GmbH
information:accessible
Weidestr. 134, D-22083 Hamburg, Germany
Phone: +49 (0) 402 271316
Fax: +49 (0) 402 2713011
Mobile: +49 (0) 163 271316
Internet: www.content-conversion.com