IMPACT Final Conference - Ulrich Reffle

Post on 20-Nov-2014

DESCRIPTION

Postcorrection in IMPACT with Ulrich Reffle from the University of Munich

TRANSCRIPT

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

Analysis and Post-Correction of OCR-processed historical documents

Ulrich Reffle

CIS, University of Munich

24.10.2011 Ulrich Reffle uli@cis.uni-muenchen.de

Overview

– Document specific analysis of OCR results of historical documents
– A system for interactive OCR post-correction

Document specific analysis of OCR results of historical documents

Why do we need special methods?

Problems specific to the processing of historical language in the context of mass digitization:
– High OCR error rates
– No standardized language

Special resources and methods are needed for OCR, post-processing and Information Retrieval

[Figure: processing pipeline Digital image → OCR result → OCR post-correction → IR, labelled "Problem of historical language variation"]

Why do we need special methods?

Diversity of input material makes document specific parameter settings important:
– Distribution of spelling variants
– Special vocabulary
– OCR channel model

Document specific language and error profiles

Language and error profiles provide document specific characteristics of the language and OCR errors.

Language profile: shares of foreign languages (such as Latin, French), frequencies for language modeling, important patterns of spelling variation (in English: e.g. o→ou, v→u)

Error profile: estimated error rate, important error patterns (like e→c, i→l), frequent erroneous words

Language and error profiles are computed fully automatically; no manual interaction or ground truth is needed.

Global Profile of a document

Language profile

Lexicon        %
Modern         82%
Historic       9%
Place names    6%
Latin          3%

Spelling variation pattern    Frequency
t→th                          120
i→y                           106
ä→a                           38
…                             …

Error profile

Word class         %
Correct words      72%
Erroneous words    20%
Unknown words      8%

OCR error pattern    Frequency
e→c                  51
n→u                  45
t→i                  34
…                    …
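
To make the shape of such an error profile concrete, here is a minimal sketch in Python. It is not the profiler presented in this talk: the talk's method works without ground truth, whereas this toy version assumes we already hold a guess at the intended word for every OCR token. The function names and the sample word pairs are hypothetical, and only single-character substitutions are counted.

```python
from collections import Counter


def substitution_pattern(intended, observed):
    """Return 'x→y' if the strings differ by exactly one substituted
    character, otherwise None (insertions/deletions are ignored here)."""
    if len(intended) != len(observed):
        return None
    diffs = [(a, b) for a, b in zip(intended, observed) if a != b]
    if len(diffs) != 1:
        return None
    a, b = diffs[0]
    return f"{a}→{b}"


def global_error_profile(pairs):
    """Summarise (intended, observed) pairs as an error rate plus a
    frequency-sorted list of substitution patterns."""
    patterns = Counter()
    erroneous = 0
    for intended, observed in pairs:
        if intended == observed:
            continue
        erroneous += 1
        pattern = substitution_pattern(intended, observed)
        if pattern is not None:
            patterns[pattern] += 1
    rate = erroneous / len(pairs) if pairs else 0.0
    return {"error_rate": rate, "patterns": patterns.most_common()}


# Hypothetical sample: (assumed intended word, OCR token)
sample_pairs = [
    ("the", "thc"),
    ("theil", "theil"),
    ("und", "uud"),
    ("sei", "sci"),
    ("hath", "hath"),
]
profile = global_error_profile(sample_pairs)
```

On this sample, `profile["patterns"]` lists e→c before n→u, mirroring the frequency-ordered error table above.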

Local profile of all words of a document

[Figure: OCR tokens from the document: „theil“ (shown four times) and „hatn“]

Weighted set of interpretations / correction suggestions for each word of the document.

Correction suggestion    Modern spelling    Probability
hath                     has                0.95
hat                      hat                0.01
hate                     hate               0.04
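
A local profile of this kind can be pictured as a mapping from each OCR token to a ranked list of weighted candidates. The Python sketch below is only an illustration: the `Candidate` class and `normalize` helper are hypothetical names, and attaching the three suggestions above to the token „hatn“ is an assumption based on the slide layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    suggestion: str   # proposed reading of the OCR token
    modern: str       # modern spelling of that reading
    weight: float     # probability assigned to this interpretation


def normalize(candidates):
    """Rescale weights to sum to 1 and sort best-first."""
    total = sum(c.weight for c in candidates)
    if total <= 0:
        raise ValueError("candidate weights must sum to a positive value")
    rescaled = [
        Candidate(c.suggestion, c.modern, c.weight / total) for c in candidates
    ]
    return sorted(rescaled, key=lambda c: c.weight, reverse=True)


# Assumed pairing of the OCR token with the suggestions from the table.
local_profile = {
    "hatn": normalize(
        [
            Candidate("hath", "has", 0.95),
            Candidate("hat", "hat", 0.01),
            Candidate("hate", "hate", 0.04),
        ]
    )
}

best = local_profile["hatn"][0]
```

Keeping the modern spelling next to each suggestion is what lets later steps, such as retrieval, work on normalized forms while the corrected text keeps the historical one.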

Summary

Document specific profiles …
– are computed in a fully automated way from OCR output
– provide characteristics of language and OCR error channel in order to adapt OCR and downstream processes.

System for interactive post-correction of OCR results

Post-correction system

A graphical user interface for fast and convenient post-correction specifically for OCRed historical documents

Novel possibilities for detection, presentation and correction of systematic OCR errors.

Post-correction system

[Screenshot of the post-correction interface with three labelled areas: Special functionality, Image, OCR Editor]

Proper treatment of spelling variants

Historical spelling variants are identified with the help of historical lexica and language profiles.

Local profiles include non-modern words as correction suggestions.
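
One simple way to picture how spelling-variation patterns connect historical forms to modern words is to apply each pattern to a modern lexicon and index the results. This Python sketch is an assumption-laden illustration, not the system's actual matching procedure: the tiny lexicon, the pattern list and both function names are hypothetical, and each variant applies only one pattern at one position.

```python
def variants(modern_word, patterns):
    """Yield spellings obtained by applying one (left, right) pattern
    at one position of the modern word."""
    for left, right in patterns:
        start = 0
        while True:
            i = modern_word.find(left, start)
            if i == -1:
                break
            yield modern_word[:i] + right + modern_word[i + len(left):]
            start = i + 1


def build_variant_index(lexicon, patterns):
    """Map each generated non-modern spelling to the modern words
    it could stand for."""
    index = {}
    for word in lexicon:
        for variant in variants(word, patterns):
            if variant not in lexicon:
                index.setdefault(variant, set()).add(word)
    return index


# Hypothetical modern lexicon and patterns in the style of the language profile.
modern_lexicon = {"teil", "bei", "rat"}
variation_patterns = [("t", "th"), ("i", "y")]

variant_index = build_variant_index(modern_lexicon, variation_patterns)
```

With this index, a token such as „theil“ is recognised as a variant of "teil" rather than flagged as an OCR error. The sketch also over-generates (for example "teyl"), which is why pattern frequencies from the language profile matter for weighting.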

Conventional correction methods

Correcting words in the text view
– Manual input
– Selection of a correction suggestion

Batch-Correction of systematic OCR errors

– Systematic OCR errors are identified by the error profile.
– Batches of errors can be corrected with just a few keystrokes.
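
The idea of batch correction can be sketched as grouping tokens by the error pattern that explains them and then fixing a whole group at once. The Python below is a hypothetical illustration, not the interface's implementation: the token list, the `suggestions` mapping (token index to proposed correction and pattern) and the function names are all assumptions.

```python
from collections import defaultdict


def batches(suggestions):
    """Group token indices by the error pattern behind their suggestion."""
    groups = defaultdict(list)
    for index, (_correction, pattern) in suggestions.items():
        groups[pattern].append(index)
    return dict(groups)


def apply_batch(tokens, suggestions, pattern):
    """Return a copy of tokens with every suggestion for `pattern` applied."""
    corrected = list(tokens)
    for index, (correction, suggested_pattern) in suggestions.items():
        if suggested_pattern == pattern:
            corrected[index] = correction
    return corrected


# Hypothetical OCR output and per-token suggestions.
ocr_tokens = ["thc", "housc", "aud", "of"]
suggestions = {
    0: ("the", "e→c"),
    1: ("house", "e→c"),
    2: ("and", "n→u"),
}

grouped = batches(suggestions)
fixed = apply_batch(ocr_tokens, suggestions, "e→c")
```

Here a single decision on the e→c batch corrects two tokens and leaves the n→u case for a separate review, which is the kind of saving the slide describes.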

Evaluation

User experiment with 14 participants. Novel technology makes correction up to 2.7 times faster.

Availability

Graphical interface is going to be distributed open source.
Document pre-processing to obtain language and error profiles is protected by US patent application.
– Pre-processing is offered as a web-service, as of now free of charge.

Thank you!

http://ocr.cis.uni-muenchen.de
uli@cis.uni-muenchen.de
