datech2014 - session 4 - ocr of historical printings of latin texts: problems, prospects, progress

OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress

Uwe Springmann (1), Dietmar Najock (2), Hermann Morgenroth (2), Helmut Schmid (1), Annette Gotscharek (1) and Florian Fink (1)

(1) CIS, Ludwig-Maximilians-Universität München

(2) Institute for Greek and Latin Languages and Literatures, Freie Universität Berlin


Overview

● Why Latin?
● Problems
● Prospects
● Progress


Why Latin?

● huge heritage: largest body of historical literary sources
● Latin publications dominate print production until about 1750
● many titles have never been reprinted
● either key or barrier to cultural heritage of the western world
● has been left out of the IMPACT project despite its importance


Some problems for OCR engines

historical fonts

long s (ſ)

historical ligatures: Æ, æ, Œ, œ, st,

polytonic Greek words

diacritics

abbreviations

historical spellings

Problems


Some problems for OCR engines (continued)

● historical typography and spelling are also a problem for early modern languages

● ambiguities of abbreviations (especially in incunabula) will not immediately lead to fully expanded, machine readable text

● but discretionary diacritics are helpful in POS/morphology disambiguation:
– adverb/vocative: altè/alte
– adverb/pronoun: quàm/quam
– conjunction/preposition: cùm/cum
– ablative/nominative: hastâ/hasta
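The disambiguating role of these diacritics can be sketched as a simple lookup. This is only an illustration: the table holds just the four pairs listed above, and the names `PAIRS`, `READINGS` and `reading` are hypothetical, not part of any tool mentioned in this talk. In real printings the unmarked form may stay ambiguous, since the diacritics are discretionary.

```python
import unicodedata

# The four pairs from the list above:
# (marked form, its reading, unmarked form, its reading).
PAIRS = [
    ("altè", "adverb", "alte", "vocative"),
    ("quàm", "adverb", "quam", "pronoun"),
    ("cùm", "conjunction", "cum", "preposition"),
    ("hastâ", "ablative", "hasta", "nominative"),
]

# Build one lookup table; NFC normalization makes composed and
# decomposed encodings of the accented letters compare equal.
READINGS = {}
for marked, marked_reading, plain, plain_reading in PAIRS:
    READINGS[unicodedata.normalize("NFC", marked)] = marked_reading
    READINGS[plain] = plain_reading


def reading(token):
    """Return the reading suggested by the spelling, or None if unknown."""
    return READINGS.get(unicodedata.normalize("NFC", token.lower()))
```

For instance, `reading("cùm")` yields the conjunction reading, while a word outside the table yields `None`.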



State of the art – example pages

Prospects

[Images of example pages from printings dated 1544, 1649 and 1779]


State of the art – results for example pages


Year   Abbyy FR 11.1   Tesseract 3.03   OCRopus 0.7
1544   83.14           70.32            74.59
1649   88.07           84.87            78.98
1779   82.13           80.77            75.46

character accuracy in %

out-of-the-box performance, no language model (or default = English)

OCRopus hampered by bad image-text segmentation
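The slides do not say how character accuracy was computed. A common definition, assumed here, is one minus the Levenshtein (edit) distance between OCR output and ground truth, divided by the length of the ground truth. The function names below are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_accuracy(ground_truth, ocr_output):
    """Character accuracy in %, relative to the ground-truth length.

    Assumed definition; the value can drop below zero when the OCR
    output contains many spurious characters.
    """
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(ground_truth, ocr_output)
    return 100.0 * (1 - errors / len(ground_truth))
```

Under this definition, reading the long s in "ſunt" as "f" is one substitution in four characters, so `character_accuracy("ſunt", "funt")` gives 75.0.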



Overcoming the obstacles

● Training (Tesseract, OCRopus)
– (a) generate pseudo-historical images from existing texts and historical-looking computer fonts (add some degradation to the image)
– (b) transcribe some real pages and train on true historical fonts
● Lexical resources (Tesseract) in recognition
● Post-processing
– correct OCR errors, not historical spelling (which may be of interest in itself)
– add annotation: expand abbreviations, ligatures, normalize spelling
– helpful: language model, lexicon of historical word forms
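The annotation idea in the post-processing step (keep the historical spelling, attach a normalized reading) might look like the following sketch. The mapping `GLYPH_MAP` and the function `annotate` are hypothetical; the map covers only the long s and the ligatures named earlier in the talk, and expanding "Æ" to "Ae" is one possible convention among several.

```python
# Hypothetical glyph map: historical character -> modern expansion.
GLYPH_MAP = {
    "ſ": "s",
    "Æ": "Ae",
    "æ": "ae",
    "Œ": "Oe",
    "œ": "oe",
}


def annotate(token):
    """Keep the OCR token unchanged and attach a normalized form."""
    normalized = "".join(GLYPH_MAP.get(ch, ch) for ch in token)
    return {"ocr": token, "normalized": normalized}
```

Here `annotate("cæleſtis")` keeps "cæleſtis" as the OCR reading and adds "caelestis" as the normalized form, so the historical spelling is not lost.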


Progress

Postcorrection: open-source tool PoCoTo (see the paper by Vobl et al., presented by Christoph Ringlstetter)



Training on historical fonts (artificial images)
Example: Pontanus, Progymnasmata Latinitatis (1589)
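The slides do not specify which degradation was applied to the artificial images. As a stand-in, the sketch below flips random pixels of a binary image (salt-and-pepper noise). Rendering text with a historical-looking font is left out, and the function name `degrade` and its parameters are assumptions for illustration only.

```python
import random


def degrade(image, flip_probability=0.02, seed=0):
    """Return a noisy copy of a binary image.

    image: list of rows, each a list of pixels (0 = white, 1 = black).
    Each pixel is flipped independently with the given probability.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return [[1 - pixel if rng.random() < flip_probability else pixel
             for pixel in row]
            for row in image]
```

The input image is left untouched, so the same clean rendering can be degraded several times with different seeds to enlarge a training set.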



Training on fonts, ideal lexicon
Example: Pontanus, Progymnasmata Latinitatis (1589)
Character accuracy in %

Page   Abbyy FR 11.1   Tesseract 3.03   OCRopus 0.7   Tesseract (font)   Tesseract (font + lex.)   OCRopus (font)
15     87.79           80.88            80.70         91.02              93.90                     92.55
16     82.94           77.41            76.94         80.12              85.65                     80.47
17     85.25           75.98            86.07         85.41              91.56                     93.93
18     85.93           79.51            85.53         88.29              92.68                     89.67
19     87.94           80.09            79.09         86.06              90.15                     87.83

OCRopus: no language model!
On the original slide, accuracies better than Abbyy's were marked in red.
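Since the red marking is lost in this text version, the comparison with Abbyy can be recomputed from the figures in the table. The sketch below does so and also averages each column over the five pages; the averages are derived here and do not appear on the slide. All names are hypothetical.

```python
COLUMNS = ["Abbyy FR 11.1", "Tesseract 3.03", "OCRopus 0.7",
           "Tesseract (font)", "Tesseract (font + lex.)", "OCRopus (font)"]

# Character accuracy in % per page, copied from the table above.
ACCURACY = {
    15: [87.79, 80.88, 80.70, 91.02, 93.90, 92.55],
    16: [82.94, 77.41, 76.94, 80.12, 85.65, 80.47],
    17: [85.25, 75.98, 86.07, 85.41, 91.56, 93.93],
    18: [85.93, 79.51, 85.53, 88.29, 92.68, 89.67],
    19: [87.94, 80.09, 79.09, 86.06, 90.15, 87.83],
}


def better_than_abbyy(page):
    """Names of the setups that beat the Abbyy figure on this page."""
    row = ACCURACY[page]
    baseline = row[0]
    return [name for name, value in zip(COLUMNS[1:], row[1:])
            if value > baseline]


def mean_accuracy(column):
    """Average of one column over all pages in the table."""
    index = COLUMNS.index(column)
    return sum(row[index] for row in ACCURACY.values()) / len(ACCURACY)
```

On page 16, for example, only the Tesseract setup with font training and lexicon exceeds the Abbyy figure.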



Training on historical fonts (real images)
Example: Thanner, Petronij Arbitri Sathyra (1500)
Character accuracy in %

Page   Tesseract 3.03   OCRopus 0.7   OCRopus (trained)
13     41.59            44.59         93.15
14     52.38            57.77         94.61
15     53.09            62.38         95.17
16     59.09            61.45         93.27

pages 1-12: training set; pages 13-16: test set
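One way to read these figures is as a relative reduction in character errors on the test pages. The sketch below derives that quantity from the accuracies in the table; it is not reported on the slide, and the names `error_reduction` and `TEST_PAGES` are hypothetical.

```python
# Test pages: page -> (OCRopus 0.7 accuracy, trained OCRopus accuracy), in %.
TEST_PAGES = {
    13: (44.59, 93.15),
    14: (57.77, 94.61),
    15: (62.38, 95.17),
    16: (61.45, 93.27),
}


def error_reduction(before, after):
    """Fraction of character errors removed, given two accuracies in %."""
    error_before = 100.0 - before
    error_after = 100.0 - after
    if error_before <= 0:
        raise ValueError("no errors to reduce")
    return 1 - error_after / error_before


reductions = {page: error_reduction(untrained, trained)
              for page, (untrained, trained) in TEST_PAGES.items()}
```

Error reduction is used here because an accuracy difference in percentage points understates the change when the starting accuracy is low.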



Summary

● very old printings are hard to OCR out of the box
● Tesseract and OCRopus can be trained to results above ABBYY
● applying lexica as well as font training helps a lot
● OCRopus can be trained to accuracies > 90%, but must at present be combined with good line segmentation in a preprocessing step
● postcorrection will do the rest



Thank you for your interest!
