enp belgrade ws olr @ ccs
June 14, 2013 | Page 1
Content Conversion Specialists | WS Refinement and Quality Assessment
Claus Gravenhorst | Director Strategic Initiatives
CCS Content Conversion Specialists
Europeana Newspapers | Workshop Refinement and Quality Assessment, Belgrade 14.6.2013
OLR at CCS
From unstructured to structured newspaper data and the role of content providers in the overall process
Claus Gravenhorst
Agenda
About CCS
General workflow for mass digitization of newspapers
OLR – Layout and structure analysis
ENP OLR workflow (involvement of content providers, CPs)
Quality assurance
Output - METS/ALTO package
Demo of first results
About CCS
CCS Content Conversion Specialists GmbH (Hamburg), as technical project partner, will provide its expertise and docWorks technology to set up and operate a mass digitisation workflow that creates high-quality structured content from 2 million scanned newspaper pages provided by 5 library partners.
Page volume:
BNF = 1,000 k, NLE = 500 k, SUB HH = 480 k, NLF = 90 k, SBB = 10 k
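As a quick cross-check, the per-partner figures can be summed and compared with the "2 million pages" stated above. The short sketch below only uses the numbers from this slide; the variable and function names are illustrative.

```python
# Page volumes per library partner, in thousands of pages (figures from the slide).
page_volume_k = {"BNF": 1000, "NLE": 500, "SUB HH": 480, "NLF": 90, "SBB": 10}

# Sum over all partners, still in thousands of pages.
total_k = sum(page_volume_k.values())


def share_percent(partner: str) -> float:
    """Return the partner's share of the total page volume in percent."""
    return 100.0 * page_volume_k[partner] / total_k


for partner, volume in page_volume_k.items():
    print(f"{partner:7s} {volume:5d} k  {share_percent(partner):5.1f} %")
print(f"Total   {total_k:5d} k")
```

The total comes to 2,080 k pages, which is consistent with the rounded figure of 2 million.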
The distributed OLR workflow enables the contribution of project partners (content providers) to the integrated quality assurance process
CCS will also contribute to the specification of the metadata model
General workflow for mass digitization
[Workflow diagram; only its labels survive in this transcript]
Scanning: scanner types (robot, book, document, microfilm); re-scan on reject condition
Processing: check in, imaging, layout analysis, OCR, ISR, conversion, QA + correction, check out
Quality assurance: automated QA; manual QA (in-house, near-shore, off-shore, multiple locations); a second manual QA box (in-house, near-shore); delivery; random QA
Data and tracking: image and metadata database / repository, document UID, barcode item tracking, Z39.50 metadata
Result: final output
Layout and structure analysis
Layout analysis based on a "bottom-up" approach
General rule system enables recognition of words, text lines, text blocks, columns and classification of text blocks, illustrations, advertisements, tables and the following page types:
- title page (the title page of an issue)
- content page (a page that consists of content/text only)
- illustration page (a page that has at least one illustration)
- advertisement page (a page that contains adverts only)
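The four page types listed above can be read as simple rules over the zone classes found on a page. The sketch below is an illustration only and not the docWorks rule system: the function name, the zone labels and the order in which the rules are tried are assumptions, and the slide does not say how mixed pages (for example text plus adverts) are handled, so they fall back to "content page" here.

```python
from typing import List


def classify_page(zone_types: List[str], is_first_page_of_issue: bool = False) -> str:
    """Assign one of the four page types named on the slide.

    zone_types holds the class of every zone on the page, e.g.
    ["text", "headline", "illustration"]. The labels are hypothetical.
    """
    if is_first_page_of_issue:
        # Title page: the title page of an issue.
        return "title page"
    kinds = set(zone_types)
    if kinds and kinds <= {"advertisement"}:
        # Advertisement page: the page contains adverts only.
        return "advertisement page"
    if "illustration" in kinds:
        # Illustration page: at least one illustration is present.
        return "illustration page"
    # Assumed fallback: everything else is treated as a content page.
    return "content page"


print(classify_page(["text", "headline"]))
print(classify_page(["text", "illustration"]))
print(classify_page(["advertisement", "advertisement"]))
```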
Structure analysis through classification of headlines and grouping of zones into articles (incl. article continuation)
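To make the "bottom-up" idea on this slide concrete, the sketch below shows its first step in a generic form: word boxes are merged into text lines by vertical proximity. Lines could then be merged into blocks and blocks into columns in the same way. This is a minimal illustration under stated assumptions, not the docWorks algorithm; the Box type, the function name and the tolerance value are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """A recognized word with its bounding box (hypothetical structure)."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def group_words_into_lines(words: List[Box], tolerance: float = 0.5) -> List[List[Box]]:
    """Group word boxes into text lines.

    A word joins an existing line when its vertical center lies within
    tolerance * height of the first word of that line; otherwise it
    starts a new line.
    """
    lines: List[List[Box]] = []
    for word in sorted(words, key=lambda b: (b.y, b.x)):
        center = word.y + word.h / 2
        for line in lines:
            ref = line[0]
            if abs(center - (ref.y + ref.h / 2)) <= tolerance * ref.h:
                line.append(word)
                break
        else:
            # No existing line is close enough: open a new one.
            lines.append([word])
    for line in lines:
        # Restore left-to-right reading order within each line.
        line.sort(key=lambda b: b.x)
    return lines


words = [
    Box("world", 60, 10, 40, 12),
    Box("Hello", 10, 11, 40, 12),
    Box("Next", 10, 30, 30, 12),
]
for line in group_words_into_lines(words):
    print(" ".join(b.text for b in line))
```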
ENP OLR workflow | Conversion without scanning
[Diagram; only its labels survive in this transcript]
Sites: material location, conversion facility
Steps: digital image and metadata delivery, doc delivery, inspection / automatic QA, reject, conversion / MD recording, digital object return
Possible conversion scenarios
A) Conversion at library (on-site)
B) Conversion off-shore at CCS data center, final QA at the library via internet transfer (remote QA solution)
C) Conversion off-shore at CCS, final QA at the library by backup shipment
Scenario B | Remote QA at library
[Diagram; only its labels survive in this transcript]
Offshore processing @ CCS: storage (IN, OUT, POOL), dW Share, Master
QA on-site @ library: storage (POOL), dW Share, RQA
Other labels: Internet, INPUT (HDD), OUTPUT (METS/ALTO)
Quality assurance
@ CCS | Automated markup and basic manual correction:
- headlines, illustrations, tables, captions, advertisements, etc.
- article segmentation and grouping of zones into articles (incl. continuation)
@ Content Provider (Library)
Recommended:
- Zoning: correct classification of blocks as "text" or "illustration"
- Article segmentation: correct identification of headlines/text blocks/captions
- Grouping: correct grouping of blocks (text, illustration) into articles
- Metadata: correct title, issue date and issue number
Optional:
- Page types: correct page types
- Page numbers: correct page sequence
- OCR: perform text correction of specific zones (e.g. headlines, captions)
Output | METS/ALTO package
METS/ALTO metadata schemas describe the structured digital output object
A newspaper issue processed in docWorks is converted into one METS XML file. It reflects the whole physical and logical structure, manages all links to the image files and the related ALTO XML files. ALTO is based on a standardized page description schema and contains all information of a page (print space, margins, coordinates, OCR results).
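As an illustration of what an ALTO file holds, the sketch below reads text blocks, their coordinates and their OCR text from a tiny hand-written ALTO fragment using Python's standard library. The sample XML is invented for this example and is not docWorks output; the namespace shown is the ALTO version 2 namespace, and real files may use a different ALTO version. The function name extract_blocks is hypothetical.

```python
import xml.etree.ElementTree as ET

# ALTO v2 namespace; an assumption, other ALTO versions use other URIs.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

# Minimal invented sample: one page, one text block, one line, two words.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace HPOS="100" VPOS="120" WIDTH="1800" HEIGHT="2700">
        <TextBlock ID="TB1" HPOS="100" VPOS="120" WIDTH="600" HEIGHT="40">
          <TextLine HPOS="100" VPOS="120" WIDTH="600" HEIGHT="40">
            <String CONTENT="Example" HPOS="100" VPOS="120" WIDTH="250" HEIGHT="40"/>
            <SP HPOS="350" VPOS="120" WIDTH="20"/>
            <String CONTENT="headline" HPOS="370" VPOS="120" WIDTH="330" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def extract_blocks(alto_xml: str) -> list:
    """Return one dict per TextBlock with its ID, position, size and OCR text."""
    root = ET.fromstring(alto_xml)
    blocks = []
    for block in root.iterfind(".//alto:TextBlock", ALTO_NS):
        # Each String element carries one recognized word in CONTENT.
        words = [s.get("CONTENT", "") for s in block.iterfind(".//alto:String", ALTO_NS)]
        blocks.append({
            "id": block.get("ID"),
            "hpos": float(block.get("HPOS", 0)),
            "vpos": float(block.get("VPOS", 0)),
            "width": float(block.get("WIDTH", 0)),
            "height": float(block.get("HEIGHT", 0)),
            "text": " ".join(words),
        })
    return blocks


for b in extract_blocks(SAMPLE_ALTO):
    print(b["id"], (b["hpos"], b["vpos"], b["width"], b["height"]), b["text"])
```

In a METS/ALTO package, the METS file would point to one such ALTO file per page, so a reader of the package would typically resolve those links first and then process each page as above.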
Benefits of structural markup:
- better browsing and more precise text search
- better access and display on tablet and mobile devices
- automated article classification and clustering through data/text mining and linguistic technologies
- user engagement for manual online text correction, article classification, annotation, building personal collections, etc.
- sharing articles via social media platforms like Facebook, Twitter, etc.
METS = Metadata Encoding and Transmission Standard
ALTO = Analyzed Layout and Text Object
Access and Presentation
Access through Europeana as well as content provider portals
Existing newspaper presentation systems at National Library of Australia (Trove), Library of Congress/NDNP (Chronicling America), Dutch National Library (DDD), National Library of Luxembourg (eLuxemburgensia), ...
Veridian demo:
Example of a newspaper presentation system to demonstrate access to already processed ENP newspaper issues
Questions + answers
Contact
Claus Gravenhorst
Director Strategic Initiatives
CCS Content Conversion Specialists GmbH
Weidestr. 134
22083 Hamburg
Germany
[email protected]
www.content-conversion.com