IMPACT Final Conference - Ulrich Reffle
Post on 20-Nov-2014
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
Analysis and Post-Correction of OCR-processed historical documents
Ulrich Reffle
CIS, University of Munich
24.10.2011 Ulrich Reffle uli@cis.uni-muenchen.de 2
Overview
– Document specific analysis of OCR results of historical documents
– A system for interactive OCR post-correction
Document specific analysis of OCR results of historical documents
Why do we need special methods?
Problems specific to the processing of historical language in the context of mass digitization:
– High OCR error rates
– No standardized language
Special resources and methods are needed for OCR, post-processing and Information Retrieval
[Diagram: Digital image, OCR, OCR result, Post-Correction, IR; annotated "Problem of historical language variation"]
Why do we need special methods?
Diversity of input material makes document specific parameter settings important:
– Distribution of spelling variants
– Special vocabulary
– OCR channel model
Document specific language and error profiles
Language and error profiles provide document specific characteristics of the language and OCR errors.
Language profile: shares of foreign languages (such as Latin, French), frequencies for language modeling, important patterns of spelling variation (in English: e.g. o→ou, v→u)
Error profile: estimated error rate, important error patterns (like e→c, i→l), frequent erroneous words
Language and error profiles are computed fully automatically; no manual interaction or ground truth is needed.
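The slides do not describe how the profiles are computed, so the following is only a toy sketch of the general idea of estimating error patterns without ground truth. It assumes a plain word list as lexicon and only considers single-character substitutions; the function names `single_substitution` and `estimate_error_profile` are hypothetical and not part of the IMPACT software.

```python
from collections import Counter


def single_substitution(word, token):
    """Return (expected_char, ocr_char) if token differs from word by
    exactly one substituted character, otherwise None."""
    if len(word) != len(token):
        return None
    diffs = [(w, t) for w, t in zip(word, token) if w != t]
    return diffs[0] if len(diffs) == 1 else None


def estimate_error_profile(tokens, lexicon):
    """Toy error profile: share of tokens not found in the lexicon, plus
    counts of substitution patterns that would explain those tokens.

    Assumption: an unknown token one substitution away from a lexicon
    word is an OCR error of that word. Ambiguous tokens are counted once
    per matching lexicon word.
    """
    pattern_counts = Counter()
    unknown = 0
    for token in tokens:
        if token in lexicon:
            continue
        unknown += 1
        for word in lexicon:
            sub = single_substitution(word, token)
            if sub is not None:
                pattern_counts[f"{sub[0]}→{sub[1]}"] += 1
    unknown_rate = unknown / len(tokens) if tokens else 0.0
    return {"unknown_rate": unknown_rate,
            "patterns": pattern_counts.most_common()}


# Made-up OCR output and lexicon, for illustration only.
lexicon = {"the", "edition", "line", "letter"}
tokens = ["the", "cdition", "lettcr", "liue", "the"]
profile = estimate_error_profile(tokens, lexicon)
print(profile)
```

A real profiler would also have to separate OCR errors from historical spelling variants, which this sketch does not attempt.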
Global Profile of a document
Language profile

  Spelling variation pattern   Frequency
  t→th                         120
  i→y                          106
  ä→a                          38
  …                            …

  Lexicon        Share
  Modern         82%
  Historic       9%
  Place names    6%
  Latin          3%

Error profile

  OCR error pattern   Frequency
  e→c                 51
  n→u                 45
  t→i                 34
  …                   …

  Word class        Share
  Correct words     72%
  Erroneous words   20%
  Unknown words     8%
Local profile of all words of a document
Example OCR tokens: „theil“, „hatn“
Weighted set of interpretations/ correction suggestions for each word of the document.
Correction suggestion   Modern spelling   Probability
hath                    has               0.95
hat                     Hat               0.01
hate                    hate              0.04
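A minimal sketch of how such a weighted set of interpretations could be represented and queried. The class `Interpretation`, the function `best_interpretation` and the threshold parameter are hypothetical names chosen for illustration; pairing the table with the OCR token "hatn" is also an assumption, since the slide does not state which token the suggestions belong to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interpretation:
    suggestion: str      # proposed correction, possibly a historical spelling
    modern: str          # modern spelling of the suggestion
    probability: float   # weight assigned by the profile


def best_interpretation(candidates, min_probability=0.5):
    """Return the most probable interpretation, or None if the list is
    empty or no candidate reaches the (hypothetical) threshold."""
    if not candidates:
        return None
    top = max(candidates, key=lambda c: c.probability)
    return top if top.probability >= min_probability else None


# Values taken from the table above; the key "hatn" is assumed.
local_profile = {
    "hatn": [
        Interpretation("hath", "has", 0.95),
        Interpretation("hat", "Hat", 0.01),
        Interpretation("hate", "hate", 0.04),
    ],
}

best = best_interpretation(local_profile["hatn"])
print(best.suggestion, best.modern, best.probability)
```

Keeping the historical suggestion and its modern spelling side by side lets a correction tool restore the historical form while a retrieval system indexes the modern one.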
Summary
Document specific profiles …
– are computed in a fully automated way from OCR output
– provide characteristics of language and OCR error channel in order to adapt OCR and downstream processes.
System for interactive post-correction of OCR results
Post-correction system
A graphical user interface for fast and convenient post-correction specifically for OCRed historical documents
Novel possibilities for detection, presentation and correction of systematic OCR errors.
Post-correction system
[Screenshot of the interface with labelled areas: Special functionality, Image, OCR Editor]
Proper treatment of spelling variants
Historical spelling variants are identified with the help of historical lexica and language profiles.
Local profiles include non-modern words as correction suggestions.
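One simple way to picture this is to apply spelling variation patterns to modern lexicon entries and check whether an unknown token is one of the resulting variants. The sketch below assumes patterns of the form (modern, historical), such as t→th, and applies at most one pattern at one position; `historical_variants` and `modern_form` are hypothetical helper names, not functions of the system presented here.

```python
def historical_variants(modern_word, patterns):
    """All strings obtained by applying one (modern, historical) pattern
    at one position of modern_word."""
    variants = set()
    for modern, historical in patterns:
        start = modern_word.find(modern)
        while start != -1:
            variants.add(modern_word[:start] + historical
                         + modern_word[start + len(modern):])
            start = modern_word.find(modern, start + 1)
    return variants


def modern_form(token, modern_lexicon, patterns):
    """Return the modern word that explains token, either directly or as
    a single-pattern historical variant; None if nothing matches."""
    if token in modern_lexicon:
        return token
    for word in sorted(modern_lexicon):
        if token in historical_variants(word, patterns):
            return word
    return None


# Patterns in the style of a language profile; the lexicon is made up.
patterns = [("t", "th"), ("i", "y")]
modern_lexicon = {"teil", "haus"}

print(historical_variants("teil", patterns))
print(modern_form("theil", modern_lexicon, patterns))
```

With such a mapping, a historical spelling like "theil" can be accepted as a valid word instead of being flagged as an OCR error.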
Conventional correction methods
Correcting words in the text view
– Manual input
– Selection of a correction suggestion
Batch-Correction of systematic OCR errors
Systematic OCR errors are identified by the error profile.
Batches of errors can be corrected with just a few keystrokes.
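As an illustration of the batch idea, suspicious tokens can be grouped by the error pattern that explains them, so that one confirmation corrects every member of a group. The tuple layout (position, OCR token, suggestion, pattern) and the names `group_by_pattern` and `apply_batch` are assumptions made for this sketch; the slides do not specify how the tool organises its batches.

```python
from collections import defaultdict


def group_by_pattern(suspects):
    """Group (position, ocr_token, suggestion, pattern) tuples by pattern."""
    batches = defaultdict(list)
    for position, ocr_token, suggestion, pattern in suspects:
        batches[pattern].append((position, ocr_token, suggestion))
    return dict(batches)


def apply_batch(tokens, batch):
    """Return a corrected copy of tokens with every suggestion in the
    batch applied. A position is only changed if it still holds the
    expected OCR token, so earlier manual edits are not overwritten."""
    corrected = list(tokens)
    for position, ocr_token, suggestion in batch:
        if corrected[position] == ocr_token:
            corrected[position] = suggestion
    return corrected


# Made-up OCR output and suspects, for illustration only.
tokens = ["thc", "first", "cdition", "of", "thc", "liue"]
suspects = [
    (0, "thc", "the", "e→c"),
    (2, "cdition", "edition", "e→c"),
    (4, "thc", "the", "e→c"),
    (5, "liue", "line", "n→u"),
]

batches = group_by_pattern(suspects)
print(apply_batch(tokens, batches["e→c"]))
```

Grouping by pattern is what makes a few keystrokes sufficient: the user reviews one batch such as e→c rather than each occurrence separately.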
Evaluation
User experiment with 14 participants. Novel technology makes correction up to 2.7 times faster.
Availability
– The graphical interface is going to be distributed as open source.
– Document pre-processing to obtain language and error profiles is protected by a US patent application.
  – Pre-processing is offered as a web service, as of now free of charge.
Thank you!
http://ocr.cis.uni-muenchen.de
uli@cis.uni-muenchen.de