DATeCH 2014 - Session 4 - OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress
DESCRIPTION
Presentation of the paper "OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress" by Uwe Springmann, Dietmar Najock, Hermann Morgenroth, Helmut Schmid, Annette Gotscharek and Florian Fink at DATeCH 2014. #digidays
TRANSCRIPT
Uwe Springmann¹, Dietmar Najock², Hermann Morgenroth², Helmut Schmid¹, Annette Gotscharek¹ and Florian Fink¹
OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress
¹ CIS, Ludwig-Maximilians-Universität München
² Institute for Greek and Latin Languages and Literatures, Freie Universität Berlin
p. 2 (16) OCR of historical printings of Latin texts, Springmann et al.
Overview
● Why Latin?
● Problems
● Prospects
● Progress
Why Latin?
● huge heritage: largest body of historical literary sources
● Latin publications dominate print production until about 1750
● many titles have never been reprinted
● either key or barrier to cultural heritage of the western world
● has been left out of the IMPACT project despite its importance
Some problems for OCR engines
historical fonts
long s (ſ)
historical ligatures: Æ, æ, Œ, œ, st,
polytonic Greek words
diacritics
abbreviations
historical spellings
Problems
Some problems for OCR engines (continued)
● historical typography and spelling are also a problem for early modern languages
● ambiguous abbreviations (especially in incunabula) mean that OCR alone will not immediately yield fully expanded, machine-readable text
● but discretionary diacritics are helpful in POS/morphology disambiguation:
– adverb/vocative: altè/alte
– adverb/pronoun: quàm/quam
– conjunction/preposition: cùm/cum
– ablative/nominative: hastâ/hasta
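The disambiguating role of the diacritics can be illustrated with a minimal Python sketch. It uses only the four word pairs listed above; the two lookup tables and the function names are hypothetical, and it assumes the printer applied the diacritics consistently, which a real edition may not do.

```python
import unicodedata


def _nfc(word):
    # Normalise to precomposed form so that "u + combining grave" equals "ù".
    return unicodedata.normalize("NFC", word)


# Toy tables built from the slide's four examples (not a real lexicon).
MARKED = {_nfc(k): v for k, v in {
    "altè": "adverb", "quàm": "adverb",
    "cùm": "conjunction", "hastâ": "ablative",
}.items()}
UNMARKED = {
    "alte": "vocative", "quam": "pronoun",
    "cum": "preposition", "hasta": "nominative",
}


def strip_diacritics(word):
    """Remove combining marks, as an OCR engine without diacritics would."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def reading(token):
    """Return the reading suggested by the spelling, or None if unknown."""
    token = _nfc(token.lower())
    if token in MARKED:
        return MARKED[token]
    return UNMARKED.get(token)
```

If the OCR output drops the accent, `reading(strip_diacritics("cùm"))` falls back to the unmarked reading, which is why recognising the diacritics correctly matters for later morphological analysis.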
State of the art – example pages
Prospects
[Figure: example pages from printings of 1544, 1649 and 1779]
State of the art – results for example pages
Year  Abbyy FR 11.1  Tesseract 3.03  OCRopus 0.7
1544  83.14          70.32           74.59
1649  88.07          84.87           78.98
1779  82.13          80.77           75.46
character accuracy in %
out-of-the-box performance, no language model (or default = English)
OCRopus hampered by bad image-text segmentation
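The slides do not say how character accuracy was computed. A common definition, assumed here, is one minus the Levenshtein (edit) distance between OCR output and ground truth, divided by the length of the ground truth. The following Python sketch shows that definition; the function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth, ocr_output):
    """Character accuracy in percent, relative to the ground truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(ground_truth, ocr_output)
    return 100.0 * (1 - errors / len(ground_truth))
```

For example, reading "quam" as "quarn" (m misrecognised as rn) counts as two errors against four ground-truth characters. Under this definition the value can become negative when the OCR output contains many spurious insertions.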
Overcoming the obstacles
● Training (Tesseract, OCRopus)
– (a) generate pseudo-historical images from existing texts and historical-looking computer fonts (add some degradation to the image)
– (b) transcribe some real pages and train on true historical fonts
● Lexical resources (Tesseract) in recognition
● Post-processing
– correct OCR errors, not historical spelling (might be interesting itself)
– add annotation: expand abbreviations, ligatures, normalize spelling
– helpful: language model, lexicon of historical word forms
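The post-processing idea (correct OCR errors but leave historical spellings alone) can be sketched in Python with a lexicon of historical word forms. This is only an illustration using the standard library's `difflib`, not the method of the paper or of any particular tool; the tiny lexicon, the function name and the similarity cutoff are assumptions.

```python
import difflib

# Hypothetical mini-lexicon of historical word forms. "caussa" stands for a
# historical spelling (of "causa") that must be kept, not normalised.
HISTORICAL_LEXICON = ["quam", "cum", "alte", "hasta", "caussa"]


def correct_token(token, lexicon=HISTORICAL_LEXICON, cutoff=0.6):
    """Replace a token by its closest lexicon entry, if one is similar enough.

    Tokens already in the lexicon are returned unchanged, so attested
    historical spellings survive; unknown tokens with no close match are
    also left as they are.
    """
    if token in lexicon:
        return token
    matches = difflib.get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else token
```

With this lexicon, `correct_token("quarn")` yields `"quam"`, while `correct_token("caussa")` is left untouched. A realistic system would additionally weigh candidates with a language model, as the slide suggests.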
Progress
Postcorrection: open-source tool PoCoTo (see the paper by Vobl et al., presented by Christoph Ringlstetter)
Training on historical fonts (artificial images)
Example: Pontanus, Progymnasmata Latinitatis (1589)
Training on fonts, ideal lexicon
Example: Pontanus, Progymnasmata Latinitatis (1589)
character accuracy in %
Page  Abbyy FR 11.1  Tesseract 3.03  OCRopus 0.7  Tesseract (font)  Tesseract (font + lex.)  OCRopus (font)
15    87.79          80.88           80.70        91.02             93.90                    92.55
16    82.94          77.41           76.94        80.12             85.65                    80.47
17    85.25          75.98           86.07        85.41             91.56                    93.93
18    85.93          79.51           85.53        88.29             92.68                    89.67
19    87.94          80.09           79.09        86.06             90.15                    87.83
OCRopus: no language model!
red (in the original slides): accuracy better than Abbyy
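The red highlighting is lost in this transcript, but it can be recovered from the numbers. The short Python sketch below lists, for each page of the table above, the configurations whose accuracy exceeds Abbyy's; the variable and function names are illustrative.

```python
# Column order as in the table; the first column is the Abbyy baseline.
COLUMNS = ["Abbyy FR 11.1", "Tesseract 3.03", "OCRopus 0.7",
           "Tesseract (font)", "Tesseract (font + lex.)", "OCRopus (font)"]

# Character accuracy in % per page, copied from the slide.
ACCURACY = {
    15: [87.79, 80.88, 80.70, 91.02, 93.90, 92.55],
    16: [82.94, 77.41, 76.94, 80.12, 85.65, 80.47],
    17: [85.25, 75.98, 86.07, 85.41, 91.56, 93.93],
    18: [85.93, 79.51, 85.53, 88.29, 92.68, 89.67],
    19: [87.94, 80.09, 79.09, 86.06, 90.15, 87.83],
}


def better_than_abbyy(table):
    """Map each page to the configurations that beat the Abbyy baseline."""
    result = {}
    for page, row in table.items():
        abbyy = row[0]
        result[page] = [name for name, value in zip(COLUMNS[1:], row[1:])
                        if value > abbyy]
    return result


for page, winners in better_than_abbyy(ACCURACY).items():
    print(page, winners)
```

Tesseract with font training plus lexicon is the only configuration that beats Abbyy on all five pages; on pages 16 and 19 it is the only one that does.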
Training on historical fonts (real images)
Example: Thanner, Petronij Arbitri Sathyra (1500)
character accuracy in %
Page  Tesseract 3.03  OCRopus 0.7  OCRopus (trained)
13    41.59           44.59        93.15
14    52.38           57.77        94.61
15    53.09           62.38        95.17
16    59.09           61.45        93.27
pages 1-12: training set; pages 13-16: test set
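To summarise the held-out pages in a single figure per engine, the per-page values can be averaged. The Python sketch below does this for the four test pages; the unweighted mean over pages is an assumption (the slides report per-page values only, and a weighting by page length would need character counts that are not given).

```python
from statistics import mean

ENGINES = ["Tesseract 3.03", "OCRopus 0.7", "OCRopus (trained)"]

# Character accuracy in % on the test pages 13-16, copied from the slide.
TEST_PAGES = {
    13: (41.59, 44.59, 93.15),
    14: (52.38, 57.77, 94.61),
    15: (53.09, 62.38, 95.17),
    16: (59.09, 61.45, 93.27),
}


def mean_accuracy(pages):
    """Unweighted mean accuracy per engine over the given pages."""
    columns = zip(*pages.values())  # one tuple of values per engine
    return {engine: mean(col) for engine, col in zip(ENGINES, columns)}


for engine, value in mean_accuracy(TEST_PAGES).items():
    print(f"{engine}: {value:.2f}")
```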
Summary
● very old printings are hard to OCR out of the box
● Tesseract and OCRopus can be trained to results above ABBYY
● applying lexica as well as font training helps a lot
● OCRopus can be trained to accuracies > 90%, but must at present be combined with good line segmentation in a preprocessing step
● postcorrection will do the rest
Thank you for your interest!