bratislava ws - schlarb - onb - technical tools_pdf

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

7 May 2010, Bratislava

The challenges of historical materials and an overview of the technical solutions in IMPACT

Sven Schlarb, Austrian National Library

Overview
• Challenges
• Technical solutions
• Integration
– Interoperability
– Modularisation

Challenges of historical materials
• Warped book pages (caused by thick spines)
• Skewed and distorted scans
• Curved text lines (caused by creased or humidity-warped paper)
• Colour blots and varying print intensities
• Show-through and bleed-through
• Gothic font
• Handwritten annotations
• Complex layouts (e.g. newspaper pages and the reading order of articles)
• Historical languages and period-specific words in the documents

Challenges
• Tables
– Curved cell borders

Challenges
• Extreme warping
• Gothic font
• Annotations
• Chapter numbers

Challenges
• Warpage due to humidity
• Distortion
• Crinkles
• Dots and blots
• Page/Chapter numbers

Challenges
• Complex layout
• Reading order

Challenges
• Skewed image
• Gothic font
• Bleed-through
• Page number

Challenges
• Gothic font
• Warping
• Page borders
• Curved text lines
• Page/Chapter numbers

Consortium (including new partners)

• 11 Libraries
– Koninklijke Bibliotheek (Netherlands)
– British Library
– Bibliothèque nationale de France
– Deutsche Nationalbibliothek
– Bayerische Staatsbibliothek
– Niedersächsische Staats- und Universitätsbibliothek Göttingen
– Österreichische Nationalbibliothek
– Universitätsbibliothek der Universität Innsbruck
– “St. Cyril and Methodius” National Library, Sofia
– National Library of the Czech Republic, Prague
– National Library, Madrid (Spain)

• 2 Industry partners
– IBM (Research Centre Haifa, Israel)
– ABBYY (Moscow)

• 13 Universities and Research Centres
– Instituut voor Nederlandse Lexicologie, Leiden (Netherlands)
– National Research Centre Demokritos, Athens
– University of Salford, Great Britain
– University of Munich, Centrum für Informations- und Sprachverarbeitung (CIS), Germany
– University of Innsbruck (InfMath), Austria
– University of Bath, Great Britain
– Institute for Parallel Processing, Bulgarian Academy of Sciences
– Jožef Stefan Institute, Ljubljana (Slovenia)
– Institute of the Czech National Corpus, Charles University Prague (Czech Republic)
– Analyse et Traitement Informatique de la Langue Française (ATILF), Nancy (France)
– Foundation Virtual Library Miguel de Cervantes, Alicante (Spain)
– Poznan Supercomputing and Networking Center, Poznan (Poland)
– University of Warsaw, Department of Formal Linguistics, Warsaw (Poland)

Tools: Border detection/removal

Tools: Geometric Dewarping

Tools: Geometric Deskewing

Tools: Binarisation
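
The slide names binarisation without saying which method the IMPACT tool uses. As a rough illustration of the general idea only, here is a minimal sketch of global Otsu thresholding in plain Python. The function names `otsu_threshold` and `binarise` and the list-of-lists image representation are assumptions made for this example; they are not part of any IMPACT tool.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Return the grey level (0-255) that maximises between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0      # number of pixels with value <= t
    sum_bg = 0    # sum of grey levels of those pixels
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarise(image: List[List[int]]) -> List[List[int]]:
    """Map a greyscale image (rows of 0-255 values) to black (0) and white (255)."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[255 if p > t else 0 for p in row] for row in image]
```

A single global threshold tends to struggle with exactly the defects listed on the challenges slides (blots, uneven print intensity, bleed-through), which is why locally adaptive methods are often preferred for historical scans.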

Tools: Historical Lexicons
• Lexicons available for German, Dutch, English, French, Spanish, Polish, Bulgarian and Czech
• Tools for building historical lexicons
• Interface to ABBYY FRE to integrate external lexicons

ABBYY provides the information on how the weighting parameters of word lists with word frequencies have to be created.

Procedural information disclosed, but results can be evaluated against each other, e.g. by comparing OCR runs without a dictionary, with a dictionary, or with different dictionaries.
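
The comparison of runs with and without dictionaries can be sketched as scoring each OCR output against a ground-truth transcription. The following is a minimal sketch under stated assumptions: the metric (share of ground-truth words matched in order, via Python's `difflib`), the function names and the run labels are all chosen for illustration and are not the evaluation procedure used in IMPACT.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def word_accuracy(ocr_text: str, ground_truth: str) -> float:
    """Share of ground-truth words that are matched, in order, in the OCR text."""
    ocr_words = ocr_text.split()
    gt_words = ground_truth.split()
    if not gt_words:
        return 0.0
    matcher = SequenceMatcher(a=gt_words, b=ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(gt_words)


def compare_runs(runs: Dict[str, str], ground_truth: str) -> List[Tuple[str, float]]:
    """Score every OCR run and return (run name, accuracy) pairs, best first."""
    scores = [(name, word_accuracy(text, ground_truth)) for name, text in runs.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Invented sample strings, for illustration only.
    truth = "vnd die stadt ward verbrandt"
    runs = {
        "no_dictionary": "vnd dic stadt warb verbrandt",
        "modern_lexicon": "und die stadt ward verbrannt",
        "historical_lexicon": "vnd die stadt ward verbrandt",
    }
    for name, score in compare_runs(runs, truth):
        print(f"{name}: {score:.2f}")
```

Because only the outputs are compared, this kind of evaluation works without knowing how the OCR engine uses the word lists internally.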

Tools: Named Entities Registry
• Named entities (= person names, geographic locations, organizations) and general
• Collaborative Named Entities Registry
• Named entities to be integrated into ABBYY FR as word lists

Tools: Linguistic Post-Correction
• OCR (ABBYY) and OCR analysis (CIS group, LMU)
• The colours indicate different types of analysis results, such as a word being found in the historical or hypothetical dictionary, or a suspected OCR error.
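
The colour coding can be read as a per-token classification. Below is a minimal sketch of that idea, assuming the two dictionaries are available as plain sets of lower-cased word forms and that any token found in neither is flagged as a suspected OCR error. The category names, the lookup order and the function `classify_tokens` are assumptions for illustration, not the CIS implementation.

```python
from enum import Enum
from typing import Iterable, List, Set, Tuple


class Category(Enum):
    HISTORICAL = "found in historical dictionary"
    HYPOTHETICAL = "found in hypothetical dictionary"
    SUSPECTED_ERROR = "suspected OCR error"


def classify_tokens(
    tokens: Iterable[str],
    historical: Set[str],
    hypothetical: Set[str],
) -> List[Tuple[str, Category]]:
    """Assign each OCR token one category; the historical dictionary is checked first."""
    result = []
    for token in tokens:
        key = token.lower()
        if key in historical:
            category = Category.HISTORICAL
        elif key in hypothetical:
            category = Category.HYPOTHETICAL
        else:
            category = Category.SUSPECTED_ERROR
        result.append((token, category))
    return result


if __name__ == "__main__":
    # Invented word forms, for illustration only.
    for token, category in classify_tokens(
        ["Vnd", "theil", "stabt"], historical={"vnd"}, hypothetical={"theil"}
    ):
        print(f"{token}: {category.value}")
```

In a user interface, each category would then be mapped to one highlight colour.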

Tools: Collaborative Correction
• Integrated web-based system for collaborative post-correction of OCR results
• Character/Word/Page modes
• Main purpose: collaboratively correct OCR errors and use the results to improve OCR

Tools: Functional Extension Parser
• Recognition of the structure of book pages
– Print space
– Standard font of the main text
– Page numbers
• Enrichment of OCR results with structural information

Tools: Word Spotting
• Alternative technique for indexing historical documents
• After word segmentation, relevant words are detected and highlighted
• Keywords can be person and location names (e.g. taken from the Named Entities Registry)

Interoperability
• ABBYY XML
• METS/ALTO
• AltoEx (IBM)
• PAGE XML
• (TEI)
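
To give an idea of what working with one of these formats involves, here is a minimal sketch that reads the recognised text out of an ALTO file using only the Python standard library. It relies on ALTO storing each word in the `CONTENT` attribute of a `String` element inside a `TextLine`. The function names and the sample document are invented for illustration, and the namespace is ignored so that different ALTO versions are handled alike.

```python
import xml.etree.ElementTree as ET
from typing import List


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{http://...}String' -> 'String'."""
    return tag.rsplit("}", 1)[-1]


def alto_text_lines(alto_xml: str) -> List[str]:
    """Return the text of every TextLine, with words joined by single spaces."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = [
            child.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return lines


# Invented sample document, for illustration only.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="Wiener"/><SP/><String CONTENT="Zeitung"/></TextLine>
    <TextLine><String CONTENT="1789"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

if __name__ == "__main__":
    for line in alto_text_lines(SAMPLE):
        print(line)
```

The other formats on the slide would need their own readers, which is the practical motivation for agreeing on common exchange formats between tools.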

Modularisation

http://www.impact-project.eu