DATeCH 2014 - Session 4 - OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress

Uwe Springmann¹, Dietmar Najock², Hermann Morgenroth², Helmut Schmid¹, Annette Gotscharek¹ and Florian Fink¹. OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress. ¹ CIS, Ludwig-Maximilians-Universität München; ² Institute for Greek and Latin Languages and Literatures, Freie Universität Berlin

Upload: impact-centre-of-competence

Post on 22-Nov-2014

DESCRIPTION

Presentation of the paper OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress by Uwe Springmann, Dietmar Najock, Hermann Morgenroth, Helmut Schmid, Annette Gotscharek and Florian Fink in DATeCH 2014. #digidays

TRANSCRIPT

Page 1: Datech2014 - Session 4 - OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress

Uwe Springmann¹, Dietmar Najock², Hermann Morgenroth², Helmut Schmid¹, Annette Gotscharek¹ and Florian Fink¹

OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress

¹ CIS, Ludwig-Maximilians-Universität München

² Institute for Greek and Latin Languages and Literatures, Freie Universität Berlin

Page 2: Datech2014 - Session 4 - OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress

p. 2 (16) - OCR of historical printings of Latin texts - Springmann et al.

Overview

● Why Latin?
● Problems
● Prospects
● Progress

Page 3: Datech2014 - Session 4 - OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress

Why Latin?

● huge heritage: largest body of historical literary sources
● Latin publications dominate print production until about 1750
● many titles have never been reprinted
● either key or barrier to cultural heritage of the western world
● has been left out of the IMPACT project despite its importance

Page 4: Datech2014 - Session 4 - OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress

Some problems for OCR engines

historical fonts

long s (ſ)

historical ligatures: Æ, æ, Œ, œ, st,

polytonic Greek words

diacritics

abbreviations

historical spellings

Problems

Page 5: Datech2014 - Session 4 - OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress

Some problems for OCR engines (continued)

● historical typography and spelling are also a problem for early modern languages

● ambiguities of abbreviations (especially in incunabula) will not immediately lead to fully expanded, machine readable text

● but discretionary diacritics are helpful in POS/morphology disambiguation:
  – adverb/vocative: altè/alte
  – adverb/pronoun: quàm/quam
  – conjunction/preposition: cùm/cum
  – ablative/nominative: hastâ/hasta
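The four pairs above can be read as a small lookup table. The following is a minimal sketch under that assumption only: the table `DIACRITIC_HINTS` and the function `reading` are hypothetical names, the table holds just the slide's examples, and a real tagger would need context as well.

```python
import unicodedata
from typing import Optional

# Hypothetical table built only from the four example pairs on the slide.
# marked form -> (reading with the mark, unmarked form, reading without it)
DIACRITIC_HINTS = {
    "altè": ("adverb", "alte", "vocative"),
    "quàm": ("adverb", "quam", "pronoun"),
    "cùm": ("conjunction", "cum", "preposition"),
    "hastâ": ("ablative", "hasta", "nominative"),
}

def reading(token: str) -> Optional[str]:
    """Return the reading suggested by the presence or absence of the diacritic."""
    # OCR output may use combining marks; compose them before the lookup.
    t = unicodedata.normalize("NFC", token.lower())
    if t in DIACRITIC_HINTS:
        return DIACRITIC_HINTS[t][0]
    for with_mark, plain, without_mark in DIACRITIC_HINTS.values():
        if t == plain:
            return without_mark
    return None  # token is not covered by the slide's examples

for tok in ["quàm", "quam", "hastâ", "hasta"]:
    print(tok, "->", reading(tok))
```

The point of the sketch is that an OCR engine which drops the accent also drops the disambiguating information, so the marked and unmarked forms have to be kept apart in the output.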

Problems

Page 6: Datech2014 - Session 4 - OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress

State of the art – example pages

Prospects

[Figure: example pages from printings dated 1544, 1649 and 1779]

Page 7: Datech2014 - Session 4 - OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress

State of the art – results for example pages

Prospects

Year   Abbyy FR 11.1   Tesseract 3.03   OCRopus 0.7
1544   83,14           70,32            74,59
1649   88,07           84,87            78,98
1779   82,13           80,77            75,46

character accuracy in %

out-of-the-box performance, no language model (or default = English)

OCRopus hampered by bad image-text segmentation
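The slides report character accuracy without giving the formula. A minimal sketch, assuming the common definition (one minus the Levenshtein distance between ground truth and OCR output, divided by the ground-truth length); `edit_distance` and `char_accuracy` are illustrative names and not part of any of the three engines.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def char_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Character accuracy in percent, relative to the ground-truth length (assumed definition)."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return 100.0 * (1.0 - edit_distance(ground_truth, ocr_output) / len(ground_truth))

# A long s misread as f costs one substitution in eight characters.
print(char_accuracy("ſuperbia", "fuperbia"))
```

Under this definition a single confused long s in an eight-letter word already lowers the accuracy for that word to 87.5 %.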

Page 8: Datech2014 - Session 4 - OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress

Prospects

Overcoming the obstacles

● Training (Tesseract, OCRopus)
  – (a) generate pseudo-historical images from existing texts and historical-looking computer fonts (add some degradation to the image)
  – (b) transcribe some real pages and train on true historical fonts
● Lexical resources (Tesseract) in recognition
● Post-processing
  – correct OCR errors, not historical spelling (might be interesting itself)
  – add annotation: expand abbreviations, ligatures, normalize spelling
  – helpful: language model, lexicon of historical word forms
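The annotation idea in the post-processing step (keep the historical form, add a normalized one next to it) might be sketched as follows. The mapping `NORMALIZE` and the function `annotate` are hypothetical and cover only the long s and the four ligatures named earlier in the slides; abbreviation expansion is left out because it is ambiguous.

```python
# Hypothetical normalization table: long s and the ligatures listed on the problems slide.
NORMALIZE = {
    "ſ": "s",
    "æ": "ae",
    "Æ": "Ae",
    "œ": "oe",
    "Œ": "Oe",
}

def annotate(token: str) -> dict:
    """Keep the original OCR token and add a normalized form as annotation."""
    norm = "".join(NORMALIZE.get(ch, ch) for ch in token)
    return {"orig": token, "norm": norm}

for tok in ["Cæſar", "pœna", "eſt"]:
    print(annotate(tok))
```

Storing both forms means the historical spelling is not overwritten, which matches the slide's remark that it may be of interest in itself.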

Page 9: Datech2014 - Session 4 - OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress
Page 10: Datech2014 - Session 4 - OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress

Progress

Postcorrection: open-source tool PoCoTo (see paper of Vobl et al. - presentation by Christoph Ringlstetter)

Page 11: Datech2014 - Session 4 - OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress

Progress

Training on historical fonts (artificial images)
Example: Pontanus, Progymnasmata Latinitatis (1589)
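Artificial training images are rendered from text and then degraded. A toy sketch of the degradation step only, assuming a binarised image stored as nested lists and plain salt-and-pepper noise; the slides do not say which noise model was used, and `degrade` with its parameters is hypothetical. A real pipeline would render the historical-looking font with an imaging library first.

```python
import random

def degrade(bitmap, flip_prob=0.05, seed=0):
    """Flip each pixel (0 = paper, 1 = ink) with probability flip_prob."""
    rng = random.Random(seed)  # seeded so that a run can be repeated
    return [[1 - px if rng.random() < flip_prob else px for px in row]
            for row in bitmap]

# A crude 5x5 "glyph" standing in for a rendered character.
GLYPH = [
    [0, 1, 1, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
]

for row in degrade(GLYPH, flip_prob=0.1, seed=42):
    print("".join("#" if px else "." for px in row))
```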

Page 12: Datech2014 - Session 4 - OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress

Progress

Training on fonts, ideal lexicon
Example: Pontanus, Progymnasmata Latinitatis (1589)
character accuracy in %

Page   Abbyy FR 11.1   Tesseract 3.03   OCRopus 0.7   Tesseract (font)   Tesseract (font + lex.)   OCRopus (font)
15     87,79           80,88            80,70         91,02              93,90                     92,55
16     82,94           77,41            76,94         80,12              85,65                     80,47
17     85,25           75,98            86,07         85,41              91,56                     93,93
18     85,93           79,51            85,53         88,29              92,68                     89,67
19     87,94           80,09            79,09         86,06              90,15                     87,83

OCRopus: no language model!
red (on the original slide): accuracy better than Abbyy
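The figures for the Pontanus pages can be summarised per engine. The sketch below transcribes the slide's values (decimal commas written as points) and computes the mean accuracy and the pages on which each configuration beats Abbyy; the dictionary layout and function names are illustrative.

```python
PAGES = [15, 16, 17, 18, 19]

# Character accuracy in %, transcribed from the slide.
ACCURACY = {
    "Abbyy FR 11.1":           [87.79, 82.94, 85.25, 85.93, 87.94],
    "Tesseract 3.03":          [80.88, 77.41, 75.98, 79.51, 80.09],
    "OCRopus 0.7":             [80.70, 76.94, 86.07, 85.53, 79.09],
    "Tesseract (font)":        [91.02, 80.12, 85.41, 88.29, 86.06],
    "Tesseract (font + lex.)": [93.90, 85.65, 91.56, 92.68, 90.15],
    "OCRopus (font)":          [92.55, 80.47, 93.93, 89.67, 87.83],
}
BASELINE = "Abbyy FR 11.1"

def mean(values):
    return sum(values) / len(values)

def pages_better_than_baseline(engine):
    """Pages on which the engine's accuracy exceeds the Abbyy value."""
    return [p for p, acc, base in zip(PAGES, ACCURACY[engine], ACCURACY[BASELINE])
            if acc > base]

for engine, values in ACCURACY.items():
    print(f"{engine:24s} mean {mean(values):6.2f}  "
          f"better than Abbyy on pages {pages_better_than_baseline(engine)}")
```

On these five pages only the Tesseract configuration with both font training and lexicon exceeds Abbyy on every page; font training alone does so on three of five pages for either engine.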

Page 13: Datech2014 - Session 4 - OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress
Page 14: Datech2014 - Session 4 - OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress

Progress

Training on historical fonts (real images)
Example: Thanner, Petronij Arbitri Sathyra (1500)
character accuracy in %

Page   Tesseract 3.03   OCRopus 0.7   OCRopus (trained)
13     41,59            44,59         93,15
14     52,38            57,77         94,61
15     53,09            62,38         95,17
16     59,09            61,45         93,27

page 1-12: training set; page 13-16: test set
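The gain from training on real images is easier to see as a reduction of the character error rate (100 minus accuracy). A small sketch using the test-set values from the slide, with decimal commas written as points; `error_reduction` is an illustrative helper, not a metric reported by the authors.

```python
# page -> (Tesseract 3.03, OCRopus 0.7, OCRopus trained), character accuracy in %
RESULTS = {
    13: (41.59, 44.59, 93.15),
    14: (52.38, 57.77, 94.61),
    15: (53.09, 62.38, 95.17),
    16: (59.09, 61.45, 93.27),
}

def error_reduction(acc_before: float, acc_after: float) -> float:
    """Relative reduction of the character error rate, as a fraction between 0 and 1."""
    return 1.0 - (100.0 - acc_after) / (100.0 - acc_before)

mean_trained = sum(trained for _, _, trained in RESULTS.values()) / len(RESULTS)

for page, (_, untrained, trained) in RESULTS.items():
    print(f"page {page}: error rate reduced by "
          f"{100 * error_reduction(untrained, trained):.1f} %")
print(f"mean accuracy after training: {mean_trained:.2f} %")
```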

Page 15: Datech2014 - Session 4 - OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress

Progress

Summary

● very old printings are hard to OCR out of the box
● Tesseract and OCRopus can be trained to results above ABBYY
● applying lexica as well as font training helps a lot
● OCRopus can be trained to accuracies > 90%, but must at present be combined with good line segmentation in a preprocessing step
● postcorrection will do the rest
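How a lexicon can support postcorrection may be illustrated with Python's standard `difflib` module. This is a generic sketch, not the method used by PoCoTo or Tesseract; the three-word `LEXICON`, the cutoff of 0.8 and the function `suggest` are assumptions made for the example.

```python
import difflib
from typing import Optional

# Hypothetical miniature lexicon; a real one would hold historical word forms.
LEXICON = ["superbia", "superba", "imperium"]

def suggest(token: str, lexicon=LEXICON, cutoff: float = 0.8) -> Optional[str]:
    """Return the token itself if it is in the lexicon, else the closest entry, else None."""
    if token in lexicon:
        return token
    matches = difflib.get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for tok in ["fuperbia", "superbia", "xyz"]:
    print(tok, "->", suggest(tok))
```

A suggestion of this kind is only a candidate: as the slides note, the aim is to correct OCR errors and not historical spellings, so an interactive check remains necessary.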

Page 16: Datech2014 - Session 4 - OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress

Progress

Thank you for your interest!