Refinement of digitised newspapers
Post on 15-Jun-2015
Europeana Newspapers Workshop:
Refinement
WP2 – Introduction to Refinement
Munich, 26 June 2013
Clemens Neudecker (@cneudecker)
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
Overview
• Objectives & Challenges
• Overview of Refinement Dataset
• Introduction to Refinement: Workflow & Technologies
• Questions & Answers
Objectives
• Analysis of the project partners' available digital newspaper collections and identification of subsets suitable for refinement
• Definition of requirements and minimum quality of digitised newspapers for refinement, to enable advanced services in Europeana
• Coordination of the scalable processing of 10 million digitised newspaper pages with several refinement technologies
• Recommendations on best practices for refining digitised newspaper collections with full text and ingesting them into Europeana
Challenges
• Processing quality vs. speed/throughput
• Volume of data requires focus on simple & standardised workflow with clear checkpoints
• Diverse partners supplying content with different digitisation & access policies
• Large variety of content in terms of file formats, fonts, languages, etc.
The data
Europeana Newspapers Dataset (1)
Europeana Newspapers Dataset (2)
Europeana Newspapers Dataset (3)
Europeana Newspapers Dataset (4)
Refinement Workflow steps
Tools (BCT)
• BCT = Binarisation and Colour Reduction Tool
• Purpose: Convert grey/colour scans to bitonal using highly optimised GPP method
• Background: Reduce total file size of master images to guarantee feasibility and timing of data transfers
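To illustrate why a bitonal conversion shrinks the master images, here is a minimal sketch of global thresholding. It uses Otsu's method as a stand-in, not the GPP method the BCT applies, and the function names are hypothetical. Pixels are assumed to be 8-bit grey values in a flat list.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0      # number of pixels at or below the candidate threshold
    sum_bg = 0    # sum of their grey values
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarise(pixels):
    """Map each 8-bit grey pixel to 0 (dark) or 1 (light)."""
    t = otsu_threshold(pixels)
    return [0 if p <= t else 1 for p in pixels]
```

Each output pixel needs one bit instead of eight (or 24 for colour), which is where the reduction in transfer volume comes from, before any compression is applied.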
Tools (FRT)
• FRT = File Rename Tool
• Purpose: Support content holders in preparing their data in the correct format
• Background: Ensure folder structure and file naming requirements for automated processing are met
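The slides do not state the actual naming rules, so the following is only a sketch of the kind of mapping a rename tool could produce. The zero-padded sequence scheme and the function name `rename_plan` are assumptions made for illustration.

```python
from pathlib import PurePath


def rename_plan(filenames, width=4):
    """Map each original file name to a zero-padded sequential name.

    Hypothetical convention: files are ordered alphabetically and renamed
    to 0001.tif, 0002.tif, ... with the extension lower-cased.
    """
    plan = {}
    for index, name in enumerate(sorted(filenames), start=1):
        suffix = PurePath(name).suffix.lower()
        plan[name] = f"{index:0{width}d}{suffix}"
    return plan
```

Returning a plan instead of renaming files directly lets a content holder review the mapping before anything on disk changes.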
Tools (FAT)
• FAT = File Analyzer Tool
• Purpose: Final quality check of data before refinement
• Background: Assure content and refinement partners that all preparation steps have been executed successfully
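A pre-refinement check of this kind could look roughly like the sketch below. The checks themselves (name pattern, empty files, gaps in the page sequence) and the pattern `PAGE_NAME` are assumptions, not the FAT's documented behaviour.

```python
import re

# Hypothetical convention: four-digit page number plus image extension.
PAGE_NAME = re.compile(r"^(\d{4})\.(tif|jp2)$")


def check_issue(files):
    """Return a list of problems for one newspaper issue.

    `files` is a list of (name, size_in_bytes) tuples.
    """
    problems = []
    numbers = []
    for name, size in files:
        match = PAGE_NAME.match(name)
        if not match:
            problems.append(f"{name}: unexpected file name")
            continue
        if size == 0:
            problems.append(f"{name}: empty file")
        numbers.append(int(match.group(1)))

    # Report page numbers missing between 1 and the highest page seen.
    expected = set(range(1, max(numbers, default=0) + 1))
    for missing in sorted(expected - set(numbers)):
        problems.append(f"{missing:04d}: page missing from sequence")
    return problems
```

An empty result means the issue passed every check in this sketch and can be handed over for refinement.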
Refinement: OCR@UIBK
• OCR = Optical Character Recognition
• Number of pages to be refined: 8 million
• Technologies: ABBYY FineReader SDK
• State-of-the-art OCR software, fully supports Fraktur/Latin/Cyrillic fonts
• Result: METS/ALTO package containing images, metadata & full text
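As an illustration of how the full text can be read back out of such a package, the sketch below collects the `CONTENT` attributes of ALTO `String` elements line by line. It ignores the XML namespace so that it works across ALTO versions; the sample document and the function name `alto_text` are made up for this example.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_text(xml_string):
    """Return the plain text of an ALTO document, one TextLine per line."""
    root = ET.fromstring(xml_string)
    lines = []
    for element in root.iter():
        if local(element.tag) == "TextLine":
            words = [child.get("CONTENT", "")
                     for child in element
                     if local(child.tag) == "String"]
            lines.append(" ".join(words))
    return "\n".join(lines)


SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="Europeana"/><SP/><String CONTENT="Newspapers"/></TextLine>
    <TextLine><String CONTENT="Munich"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(alto_text(SAMPLE))
```

Text extracted this way is what a full-text search index would typically be built from, while the coordinates stored alongside each `String` allow hits to be highlighted on the page image.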
OCR Full text search
http://www.europeana-newspapers.eu/building-a-content-browser-for-digital-newspapers/
Refinement: OLR@CCS
• OLR = Optical Layout Recognition
• Number of pages to be refined: 2 million
• Technologies: docWorks
• Separation of columns, articles, headlines, page classes
• Result: METS/ALTO package containing images, metadata & full text
OLR Article separation
Refinement: NER@KB
• NER = Named Entity Recognition
• Number of pages to be refined: 2 million
• Technologies: Stanford CRF-NER
• Languages supported: German, Dutch, English (+ French, Latvian)
• Open source: https://github.com/KBNLresearch/europeananp-ner
• Detection of Named entities: Person, Location, Organization
• Feedback cycle with a manual training step yields better results
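A tagger of this kind labels each token individually, so multi-word names have to be reassembled afterwards. The sketch below shows one way to do that, assuming the input is a list of (token, tag) pairs with `O` marking tokens outside any entity. It is not code from the europeananp-ner repository, and `group_entities` is a hypothetical name.

```python
def group_entities(tagged):
    """Merge runs of consecutive tokens with the same tag into entities.

    `tagged` is a list of (token, tag) pairs; tag "O" means no entity.
    Returns a list of (entity_text, tag) tuples in document order.
    """
    entities = []
    current_tokens, current_tag = [], None
    for token, tag in tagged:
        if tag == current_tag and tag != "O":
            current_tokens.append(token)
            continue
        # Tag changed: close the entity collected so far, if any.
        if current_tag not in (None, "O"):
            entities.append((" ".join(current_tokens), current_tag))
        current_tokens, current_tag = [token], tag
    if current_tag not in (None, "O"):
        entities.append((" ".join(current_tokens), current_tag))
    return entities


tokens = [("Clemens", "PERSON"), ("Neudecker", "PERSON"), ("spoke", "O"),
          ("in", "O"), ("Munich", "LOCATION")]
print(group_entities(tokens))
```

One limitation of this simple grouping is that two different entities of the same type standing directly next to each other are merged into one.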
NER Browse by names or places
Thank you for your attention!
clemens.neudecker@kb.nl