
Page 1:

Natural Language Processing

Historical Document Transcription
Dan Klein — UC Berkeley

Joint work with Taylor Berg-Kirkpatrick and Greg Durrett [ACL 2013]

Pages 2–3:

Historical Document: Old Bailey Court Proceedings, 1775

Pages 4–5:

Transcription / Document Image

(The garbled passage below is Tesseract's actual output on the document image; its errors are the point of the slide.)

and Ch’: priftmer anhc bar. Jacob Lazarus and his

IHP1 uh: prifoner. were both together when!

rcccivcd lhczn. I fold eievén pair of than

for xiirce guincas, and dclivcrcd the rcll'l:.in-

d:r hack lo :11: prifuner. 1 fold ftvcn pairof

filk to Mark Simpcr : nncpuir of mixcd. and.

mo pair of Ifircad to lhz: foolnun, and on:

pair of zhrzad to lh: barber. '

Q: What is the foolmarfs name?

Fraum Mgfzr. I dun’: know.

Hairy Hzrvir. l was flandingar the Camp

Icr waizin far the thcrrilfs ufliceruo employ

in: : Mo 3‘: daughter came for me to 0 am!

take the prifoncr. 1 Wm! to |hc Old aailcy

Transcription (Google Tesseract)

Pages 6–13:

Pipelined Approach

Page 14:

Historical Document

Pages 15–19:

Unknown Fonts

long s glyph

Pages 20–23:

Wandering Baseline

Unsupervised Transcription of Historical Documents

Taylor Berg-Kirkpatrick, Greg Durrett, Dan Klein
Computer Science Division

University of California at Berkeley
{tberg,gdurrett,klein}@cs.berkeley.edu

Abstract

We present a generative probabilistic model, inspired by historical printing processes, for transcribing images of documents from the printing press era. By jointly modeling the text of the document and the noisy (but regular) process of rendering glyphs, our unsupervised system is able to decipher font structure and more accurately transcribe images into text. Overall, our system substantially outperforms state-of-the-art solutions for this task, achieving a 31% relative reduction in word error rate over the leading commercial system for historical transcription, and a 47% relative reduction over Tesseract, Google's open source OCR system.

1 Introduction

Standard techniques for transcribing modern documents do not work well on historical ones. For example, even state-of-the-art OCR systems produce word error rates of over 50% on the documents shown in Figure 1. Unsurprisingly, such error rates are too high for many research projects (Arlitsch and Herbert, 2004; Shoemaker, 2005; Holley, 2010). We present a new, generative model specialized to transcribing printing-press era documents. Our model is inspired by the underlying printing processes and is designed to capture the primary sources of variation and noise.
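For concreteness: word error rate (WER) is the word-level edit distance between the system output and a reference transcription, divided by the number of reference words (and usually reported as a percentage). A minimal Python sketch of this metric, not the paper's evaluation code:

    def word_error_rate(hyp: str, ref: str) -> float:
        """WER = word-level edit distance / number of reference words."""
        h, r = hyp.split(), ref.split()
        # d[i][j] = edits needed to turn the first i hypothesis words
        # into the first j reference words
        d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
        for i in range(len(h) + 1):
            d[i][0] = i
        for j in range(len(r) + 1):
            d[0][j] = j
        for i in range(1, len(h) + 1):
            for j in range(1, len(r) + 1):
                sub = d[i - 1][j - 1] + (h[i - 1] != r[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(h)][len(r)] / max(len(r), 1)

    # e.g. scoring a garbled OCR line against its reference
    print(word_error_rate("and Ch': priftmer anhc bar", "and the prisoner at the bar"))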

One key challenge is that the fonts used in historical documents are not standard (Shoemaker, 2005). For example, consider Figure 1a. The fonts are not irregular like handwriting: each occurrence of a given character type, e.g. a, will use the same underlying glyph. However, the exact glyphs are unknown. Some differences between fonts are minor, reflecting small variations in font design. Others are more severe, like the presence of the archaic long s character before 1804. To address the general problem of unknown fonts, our model learns the font in an unsupervised fashion. Font shape and character segmentation are tightly coupled, and so they are modeled jointly.

Figure 1: Portions of historical documents with (a) unknown font, (b) uneven baseline, and (c) over-inking.

A second challenge with historical data is that the early typesetting process was noisy. Hand-carved blocks were somewhat uneven and often failed to sit evenly on the mechanical baseline. Figure 1b shows an example of the text's baseline moving up and down, with varying gaps between characters. To deal with these phenomena, our model incorporates random variables that specifically describe variations in vertical offset and horizontal spacing.
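As a schematic illustration of such random variables (the parameterization below is invented for the sketch, not taken from the paper): the baseline offset drifts a little from glyph to glyph, and each glyph gets its own horizontal pad.

    import random

    def sample_layout(chars, max_offset=2, max_pad=3):
        """Schematic typesetting: per-glyph vertical offset (wandering
        baseline) and horizontal pad (uneven spacing). Hypothetical
        parameterization, for illustration only."""
        layout, offset = [], 0
        for c in chars:
            offset += random.randint(-1, 1)               # baseline drifts slowly
            offset = max(-max_offset, min(max_offset, offset))
            pad = random.randint(0, max_pad)              # gap before this glyph
            layout.append((c, offset, pad))
        return layout

    print(sample_layout("prisoner"))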

A third challenge is that the actual inking was also noisy. For example, in Figure 1c some characters are thick from over-inking while others are obscured by ink bleeds. To be robust to such rendering irregularities, our model captures both inking levels and pixel-level noise. Because the model is generative, we can also treat areas that are obscured by larger ink blotches as unobserved, and let the model predict the obscured text based on visual and linguistic context.
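A sketch of the inking and noise components under simple Bernoulli assumptions; the function names, the flip-noise form, and the parameter values here are illustrative, not the paper's:

    import numpy as np

    def render_noisy(template, ink=0.9, flip=0.05, seed=0):
        """template: binary glyph array (1 = ink). Returns a noisy rendering."""
        rng = np.random.default_rng(seed)
        printed = (template == 1) & (rng.random(template.shape) < ink)  # inking level
        flips = rng.random(template.shape) < flip                       # bleeds, speckle
        return printed ^ flips

    def masked_log_likelihood(image, ink_probs, observed):
        """Bernoulli pixel log-likelihood; pixels under large ink blotches
        (observed == False) are skipped, i.e. treated as unobserved."""
        p = np.where(image, ink_probs, 1.0 - ink_probs)
        return np.log(np.clip(p, 1e-9, 1.0))[observed].sum()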

Our system, which we call Ocular, operates by fitting the model to each document in an unsupervised fashion. The system outperforms state-of-the-art baselines, giving a 47% relative error reduction over Google's open source Tesseract system, and giving a 31% relative error reduction over ABBYY's commercial FineReader system, which has been used in large-scale historical transcription projects (Holley, 2010).
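Schematically, that unsupervised fitting alternates between decoding each line under the current font and re-estimating the glyph shapes from the decoded alignments, in the style of EM. The outline below uses placeholder callables and is a rough sketch, not the paper's actual inference procedure:

    def fit_unsupervised(lines, font, decode_line, reestimate_font, iterations=10):
        """EM-style outline. `decode_line` and `reestimate_font` are
        placeholders for the real E-step (transcription + segmentation
        under the current font) and M-step (font parameter update)."""
        for _ in range(iterations):
            alignments = [decode_line(x, font) for x in lines]  # E-step
            font = reestimate_font(alignments)                  # M-step
        return font, alignments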


Pages 24–27:

Uneven Inking

Page 28:

Various Historical Documents

Figure 6: Portions of several documents from our test set representing a range of difficulties ((a) Old Bailey, 1725; (b) Old Bailey, 1875; (c) Trove, 1883; (d) Trove, 1823). On document (a), which exhibits noisy typesetting, our system achieves a word error rate (WER) of 25.2. Document (b) is cleaner in comparison, and on it we achieve a WER of 15.4. On document (c), which is also relatively clean, we achieve a WER of 12.5. On document (d), which is severely degraded, we achieve a WER of 70.0.

5 Data

We perform experiments on two historical datasets consisting of images of documents printed between 1700 and 1900 in England and Australia. Examples from both datasets are displayed in Figure 6.

5.1 Old Bailey

The first dataset comes from a large set of images of the proceedings of the Old Bailey, a criminal court in London, England (Shoemaker, 2005). The Old Bailey curatorial effort, after deciding that current OCR systems do not adequately handle 18th century fonts, manually transcribed the documents into text. We will use these manual transcriptions to evaluate the output of our system. From the Old Bailey proceedings, we extracted a set of 20 images, each consisting of 30 lines of text, to use as our first test set. We picked 20 documents, printed in consecutive decades. The first document is from 1715 and the last is from 1905. We chose the first document in each of the corresponding years, chose a random page in the document, and extracted an image of the first 30 consecutive lines of text consisting of full sentences.[5]

The ten documents in the Old Bailey dataset that were printed before 1810 use the long s glyph, while the remaining ten do not.

5.2 Trove

Our second dataset is taken from a collection of digitized Australian newspapers that were printed between the years of 1803 and 1954. This collection is called Trove, and is maintained by the National Library of Australia (Holley, 2010). We extracted ten images from this collection in the same way that we extracted images from Old Bailey, but starting from the year 1803. We manually produced our own gold annotations for these ten images. Only the first document of Trove uses the long s glyph.

5.3 Pre-processing

Many of the images in historical collections are bitonal (binary) as a result of how they were captured on microfilm for storage in the 1980s (Arlitsch and Herbert, 2004). This is part of the reason our model is designed to work directly with binarized images. For consistency, we binarized the images in our test sets that were not already binary by thresholding pixel values.
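Thresholding to a bitonal image is a one-liner with NumPy and PIL; the fixed cutoff of 128 below is an assumption, since the paper does not state what threshold it used:

    import numpy as np
    from PIL import Image

    def binarize(path, threshold=128):
        """Load a page image as grayscale and threshold it to binary."""
        gray = np.asarray(Image.open(path).convert("L"))
        return (gray < threshold).astype(np.uint8)  # 1 = ink (dark pixel)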

Our model requires that the image be pre-segmented into lines of text. We automatically segment lines by training an HSMM over rows of pixels. After the lines are segmented, each line is resampled so that its vertical resolution is 30 pixels. The line extraction process also identifies pixels that are not located in central text regions and are part of large connected components of ink spanning multiple lines. The values of such pixels are treated as unobserved in the model since, more often than not, they are part of ink blotches.

[5] This ruled out portions of the document with extreme structural abnormalities, like title pages and lists. These might be interesting to model, but are not within the scope of this paper.
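The paper segments lines with an HSMM trained over pixel rows; the sketch below substitutes a much simpler projection-profile heuristic (rows containing little ink separate lines), then resamples each line to the 30-pixel vertical resolution described above. The heuristic and its threshold are stand-ins, not the paper's method:

    import numpy as np
    from PIL import Image

    def segment_lines(binary, min_ink=2, target_height=30):
        """Split a binary page array (1 = ink) into line images using a
        horizontal projection profile, then fix the vertical resolution."""
        ink_per_row = binary.sum(axis=1)
        lines, start = [], None
        for y, ink in enumerate(ink_per_row):
            if ink >= min_ink and start is None:
                start = y                       # entering a text line
            elif ink < min_ink and start is not None:
                lines.append(binary[start:y])   # leaving a text line
                start = None
        if start is not None:
            lines.append(binary[start:])
        # resample every line to a fixed height of 30 pixels
        return [np.asarray(Image.fromarray(l * 255).resize(
                    (l.shape[1], target_height))) > 127
                for l in lines]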


Pages 29–32:

Our Approach

Pages 33–36:

Generative Model

p r i s o n e r

Page 37:

Generative Model: Language Model

p r i s o n e r

Pages 38–51:

Generative Model: Typesetting Model

p r i s o n e r

Page 52: Generative Model

Rendering Model

p r i s o n e r

Page 56: Generative Model

Language Model

E: It appeared that the Prisoner was very
X: (line image; callouts: over-inked, wandering baseline, historical font)

Figure 2: An example image from a historical document (X) and its transcription (E).

2 Related Work

Relatively little prior work has built models specifically for transcribing historical documents. Some of the challenges involved have been addressed (Ho and Nagy, 2000; Huang et al., 2006; Kae and Learned-Miller, 2009), but not in a way targeted to documents from the printing press era. For example, some approaches have learned fonts in an unsupervised fashion but require pre-segmentation of the image into character or word regions (Ho and Nagy, 2000; Huang et al., 2006), which is not feasible for noisy historical documents. Kae and Learned-Miller (2009) jointly learn the font and image segmentation but do not outperform modern baselines.

Work that has directly addressed historical documents has done so using a pipelined approach, and without fully integrating a strong language model (Vamvakas et al., 2008; Kluzner et al., 2009; Kae et al., 2010; Kluzner et al., 2011). The most comparable work is that of Kopec and Lomelin (1996) and Kopec et al. (2001). They integrated typesetting models with language models, but did not model noise. In the NLP community, generative models have been developed specifically for correcting outputs of OCR systems (Kolak et al., 2003), but these do not deal directly with images.

A closely related area of work is automatic decipherment (Ravi and Knight, 2008; Snyder et al., 2010; Ravi and Knight, 2011; Berg-Kirkpatrick and Klein, 2011). The fundamental problem is similar to our own: we are presented with a sequence of symbols, and we need to learn a correspondence between symbols and letters. Our approach is also similar in that we use a strong language model (in conjunction with the constraint that the correspondence be regular) to learn the correct mapping. However, in decipherment problems the symbols are not noisy, whereas in our problem we face a grid of pixels for which the segmentation into symbols is unknown; decipherment typically deals only with discrete symbols.

3 Model

Most historical documents have unknown fonts, noisy typesetting layouts, and inconsistent ink levels, usually simultaneously. For example, the portion of the document shown in Figure 2 has all three of these problems. Our model must handle them jointly.

We take a generative modeling approach inspired by the overall structure of the historical printing process. Our model generates images of documents line by line; we present the generative process for the image of a single line. Our primary random variables are E (the text) and X (the pixels in an image of the line). Additionally, we have a random variable T that specifies the layout of the bounding boxes of the glyphs in the image, and a random variable R that specifies aspects of the inking and rendering process. The joint distribution is:

P(E, T, R, X) =
    P(E)               [Language model]
  · P(T | E)           [Typesetting model]
  · P(R)               [Inking model]
  · P(X | E, T, R)     [Noise model]

We let capital letters denote vectors of concatenated random variables, and we denote the individual random variables with lower-case letters. For example, E represents the entire sequence of text, while e_i represents the ith character in the sequence.

3.1 Language Model P(E)

Our language model, P(E), is a Kneser-Ney smoothed character n-gram model (Kneser and Ney, 1995). We generate printed lines of text (rather than sentences) independently, without generating an explicit stop character. This means that, formally, the model must separately generate the character length of each line. We choose not to bias the model towards longer or shorter character sequences and let the line length m be drawn uniformly at random from the positive integers less than some large constant M.¹ When i < 1, let e_i denote a line-initial null character. We can now write:

P(E) = P(m) · ∏_{i=1}^{m} P(e_i | e_{i-1}, ..., e_{i-n})

¹ In particular, we do not use the kind of "word bonus" common to statistical machine translation models.

E: p r i s o n e r
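The four factors above multiply, or add in log space. A toy sketch of that composition follows; the `lm`, `typesetting`, `inking`, and `noise` stand-ins here are hypothetical uniform place-holders, not the model's actual component distributions:

```python
import math

def log_joint(e, t, r, x, lm, typesetting, inking, noise):
    # P(E, T, R, X) = P(E) * P(T|E) * P(R) * P(X|E, T, R), in log space.
    return (math.log(lm(e))
            + math.log(typesetting(t, e))
            + math.log(inking(r))
            + math.log(noise(x, e, t, r)))

# Toy stand-in components, just to exercise the four-factor structure.
score = log_joint(
    e="prisoner", t=(3, 20, 2), r=0, x=[[0, 1], [1, 1]],
    lm=lambda e: 1.0 / 26 ** len(e),                   # uniform characters
    typesetting=lambda t, e: 1.0 / 30,                 # uniform layout
    inking=lambda r: 1.0 / 5,                          # uniform inking
    noise=lambda x, e, t, r: 0.5 ** sum(map(sum, x)),  # coin-flip pixels
)
print(score)  # a single log-probability
```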

Page 57: Generative Model

Language Model: E

Typesetting Model: T

p r i s o n e r

Page 58: Generative Model

Language Model: E

Typesetting Model: T

Rendering Model: X, with P(X | E, T)

p r i s o n e r

Page 60: Language Model

E

Page 63: Language Model

E: ... e_{i-1} e_i e_{i+1} ... (shown with e_{i-1} = r, e_i = a, e_{i+1} = t)

Kneser-Ney smoothed character 6-gram
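A minimal sketch of scoring a line under P(E): a uniform length prior times a character n-gram product. The model uses Kneser-Ney smoothing; the add-one-smoothed counts below are a simpler stand-in, and the value of M, the vocabulary size, and the training lines are all made up:

```python
import math
from collections import Counter

N = 6          # character 6-gram, as on the slide (context of 5 chars)
M = 1000       # large constant bounding line length (assumed value)
PAD = "\x00"   # line-initial null character e_i for i < 1

def train_counts(lines):
    ngrams, contexts = Counter(), Counter()
    for line in lines:
        padded = PAD * (N - 1) + line
        for i in range(N - 1, len(padded)):
            ctx, ch = padded[i - N + 1:i], padded[i]
            ngrams[(ctx, ch)] += 1
            contexts[ctx] += 1
    return ngrams, contexts

def log_p_line(line, ngrams, contexts, vocab_size):
    # log P(E) = log P(m) + sum_i log P(e_i | e_{i-1}, ..., e_{i-n})
    logp = math.log(1.0 / M)  # uniform length prior P(m)
    padded = PAD * (N - 1) + line
    for i in range(N - 1, len(padded)):
        ctx, ch = padded[i - N + 1:i], padded[i]
        # Add-one smoothing as a stand-in for Kneser-Ney.
        logp += math.log((ngrams[(ctx, ch)] + 1) /
                         (contexts[ctx] + vocab_size))
    return logp

ngrams, contexts = train_counts(["the prisoner was taken away",
                                 "the prisoner at the bar"])
print(log_p_line("the prisoner", ngrams, contexts, vocab_size=27))
```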

Page 64: Typesetting Model

T (for character e_i = a):

Left pad width l_i: 1 ... 5

Glyph box width g_i: 1 ... 30

Right pad width r_i: 1 ... 5

Vertical offset v_i

Inking level d_i
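A minimal sketch of drawing T given E, assuming uniform draws over the pad and width ranges shown on the slide; the model actually learns character-specific distributions over these values, and the number of vertical offsets and inking levels below is a guess:

```python
import random

def sample_typesetting(text, n_offsets=3, n_ink_levels=10):
    # Draw layout variables for each character of E. Uniform draws stand
    # in for the learned, character-specific distributions.
    T = []
    for ch in text:
        T.append({
            "char": ch,
            "left_pad": random.randint(1, 5),            # l_i
            "glyph_width": random.randint(1, 30),        # g_i
            "right_pad": random.randint(1, 5),           # r_i
            "v_offset": random.randrange(n_offsets),     # v_i
            "ink_level": random.randrange(n_ink_levels), # d_i
        })
    return T

print(sample_typesetting("prisoner")[:2])
```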

Page 73: Rendering Model

Glyph box: glyph box width g_i, vertical offset v_i, inking level d_i

Glyph shape parameters → Bernoulli pixel probs → sample pixels → X
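A minimal sketch of the final sampling step: every pixel in a glyph box is an independent Bernoulli draw from its pixel probability. The 5x4 probability grid below is invented for illustration, not a learned glyph:

```python
import random

theta = [  # Bernoulli pixel probs for one toy glyph box (rows x cols)
    [0.1, 0.8, 0.8, 0.1],
    [0.8, 0.1, 0.1, 0.8],
    [0.8, 0.8, 0.8, 0.8],
    [0.8, 0.1, 0.1, 0.8],
    [0.8, 0.1, 0.1, 0.8],
]

def sample_pixels(theta):
    # X[j][k] ~ Bernoulli(theta[j][k]), independently per pixel.
    return [[1 if random.random() < p else 0 for p in row]
            for row in theta]

for row in sample_pixels(theta):
    print("".join("#" if px else "." for px in row))
```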

Page 98: Log-linear Interpolation

Glyph shape parameters φ

Bernoulli pixel probs θ

Interpolation weights α

For each pixel j: dot product α_j^T φ, then apply logistic:

θ_j ∝ exp[α_j^T φ]
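A minimal sketch of this parameterization: for a binary pixel, θ_j ∝ exp[α_j^T φ] normalizes to the logistic of the dot product. The weight and parameter vectors below are made-up numbers, not learned values:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def pixel_prob(alpha_j, phi):
    # Dot product alpha_j^T phi, then apply logistic.
    return logistic(sum(a * p for a, p in zip(alpha_j, phi)))

phi = [0.5, -1.2, 2.0]      # glyph shape parameters (toy values)
alpha_j = [1.0, 0.0, 0.3]   # interpolation weights for pixel j (toy values)
print(pixel_prob(alpha_j, phi))  # theta_j, a value in (0, 1)
```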

Page 113: Learning and Inference

• Learn font parameters using EM

• Initialize font parameters with mixtures of modern fonts

• Semi-Markov DP to compute expectations (sketched below)

• Efficient inference using a coarse-to-fine approach
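A minimal semi-Markov Viterbi sketch of the kind of dynamic program involved: each step consumes a whole glyph box of 1 to 30 pixel columns. The real E-step computes expectations (forward-backward) rather than a single best path, and the toy seg_score below is an assumption, not the model's emission score:

```python
def semimarkov_viterbi(num_cols, chars, seg_score, max_width=30):
    # best[i][k]: best score covering columns [0, i) with k chars placed.
    NEG = float("-inf")
    best = [[NEG] * (len(chars) + 1) for _ in range(num_cols + 1)]
    best[0][0] = 0.0
    for i in range(num_cols + 1):
        for k in range(len(chars) + 1):
            if best[i][k] == NEG or k == len(chars):
                continue
            for w in range(1, min(max_width, num_cols - i) + 1):
                s = best[i][k] + seg_score(chars[k], i, i + w)
                if s > best[i + w][k + 1]:
                    best[i + w][k + 1] = s
    return best[num_cols][len(chars)]

# Toy scorer: prefer glyph boxes about 10 columns wide.
score = semimarkov_viterbi(
    num_cols=80, chars="prisoner",
    seg_score=lambda ch, lo, hi: -abs((hi - lo) - 10))
print(score)  # 0.0: every glyph gets its preferred 10-column width
```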

Page 120: System Output Example

how the murderers came to

Page 127: System Output Example

taken ill and taken away -- I remember

Page 133: Experiments

Test data

• Old Bailey (1715-1905): 20 images, 30 lines each

• Trove (1803-1893): 10 images, 30 lines each

Baselines

• Google Tesseract

• ABBYY FineReader 11

Language models

• New York Times: 34M words (NYT Gigaword)

• Old Bailey: 32M words, manually transcribed

Page 151: Natural Language Processing

Results: Old Bailey Court Proceedings (1715-1905)

Word error rate by system:

Google Tesseract: 54.8
ABBYY FineReader: 40.0
Ocular w/ NYT: 28.1
Ocular w/ OB: 24.1

[Berg-Kirkpatrick et al. 2013]
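The metric on the original chart is word error rate. As a reference for how such a number is computed, here is a minimal sketch using the standard Levenshtein formulation over word tokens (this is the conventional definition, not necessarily the exact evaluation script behind these slides):

def word_error_rate(reference, hypothesis):
    # Levenshtein distance over word tokens, divided by reference length.
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return 100.0 * dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution and one deletion against a 5-word reference:
print(word_error_rate("the prisoner at the bar", "the prifoner at bar"))  # 40.0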

Page 156: Natural Language Processing

Results: Trove Historical Newspapers (1803-1893)

Word error rate by system:

Google Tesseract: 59.3
ABBYY FineReader: 49.2
Ocular w/ NYT: 33.0
Ocular w/ NYT (2014 model): 25.6

[Berg-Kirkpatrick et al. 2014]

Page 158: Natural Language Processing

Transcription

Google Tesseract:

and Ch’: priftmer anhc bar. Jacob Lazarus and his
IHP1 uh: prifoner. were both together when!
rcccivcd lhczn. I fold eievén pair of than
for xiirce guincas, and dclivcrcd the rcll'l:.in-
d:r hack lo :11: prifuner. 1 fold ftvcn pairof
filk to Mark Simpcr : nncpuir of mixcd. and.
mo pair of Ifircad to lhz: foolnun, and on:
pair of zhrzad to lh: barber. '
Q: What is the foolmarfs name?
Fraum Mgfzr. I dun’: know.
Hairy Hzrvir. l was flandingar the Camp
Icr waizin far the thcrrilfs ufliceruo employ
in: : Mo 3‘: daughter came for me to 0 am!
take the prifoncr. 1 Wm! to |hc Old aailcy

Ocular:

the prisoner at the bar. Jacob Lazarus and his
wife, the prisoners were both together when I
received them. I sold eleven pair of them
for three guineas, and delivered the remain-
der back to the prisoner. I sold, seven pair of
silk to Mark Simpert one pair of mixed, and
two pair of thread to the footman, and one
pair of thread to the barber,
Ms. What in the footman's name?
Franco Asyut, I don't know-
Nearly Norris. I was standing at the Comp-
ter waiting for the sherrill's officers to employ
me a Moses's daughter came for me to go and
take the prisoner. I went to the Old Bailey

Page 162: Natural Language Processing

Learned Fonts

[Figure: glyph shapes (e.g. "g") from the initializer font and from fonts learned on documents dated 1700, 1740, 1780, 1820, 1860, and 1900]
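The slides show only the learned glyph images; the underlying estimation is unsupervised, with the font treated as parameters and the transcription as a hidden variable, learned by EM. Below is a deliberately toy, hypothetical version of that idea: hard EM over fixed-width Bernoulli glyph templates. All names and the fixed-width segmentation assumption are mine; the real model jointly infers segmentation, inking, and vertical offset with a semi-Markov dynamic program.

import numpy as np

H, W = 8, 5        # glyph height and width (toy values)
CHARS = "ab"       # toy alphabet

def e_step(lines, templates):
    # Hard assignment: give each fixed-width window to the template with
    # the highest Bernoulli log-likelihood. Assumes each line's width is
    # a multiple of W (the real model infers segmentation instead).
    assignments = []
    for img in lines:
        labels = []
        for x in range(0, img.shape[1], W):
            window = img[:, x:x + W]
            scores = {c: np.sum(window * np.log(t) + (1 - window) * np.log(1 - t))
                      for c, t in templates.items()}
            labels.append(max(scores, key=scores.get))
        assignments.append(labels)
    return assignments

def m_step(lines, assignments):
    # Re-estimate each template as the smoothed mean of its assigned
    # windows (pseudo-counts keep ink probabilities strictly in (0, 1)).
    sums = {c: np.full((H, W), 1.0) for c in CHARS}
    counts = {c: 2.0 for c in CHARS}
    for img, labels in zip(lines, assignments):
        for k, c in enumerate(labels):
            sums[c] += img[:, k * W:(k + 1) * W]
            counts[c] += 1
    return {c: sums[c] / counts[c] for c in CHARS}

def learn_font(lines, iters=10, seed=0):
    # Start from a noisy "initializer" font and alternate E- and M-steps.
    rng = np.random.default_rng(seed)
    templates = {c: rng.uniform(0.3, 0.7, (H, W)) for c in CHARS}
    for _ in range(iters):
        templates = m_step(lines, e_step(lines, templates))
    return templates

# Usage (hypothetical): lines is a list of (H, k*W) binary arrays cut from
# scanned text lines; learn_font(lines) recovers per-character templates
# without any transcriptions.

Starting from a generic initializer font and iterating these two steps is, at a very high level, what produces the progression from the initializer row to the era-specific fonts shown on the slide.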

Page 166: Natural Language Processing

Unobserved Pixels

[Figure: example document images illustrating unobserved pixels]

Page 170: Natural Language Processing

Conclusion

• Unsupervised font learning yields state-of-the-art results on documents where font is unknown

• Generatively modeling sources of noise specific to printing-press era documents is effective (a minimal sketch follows this list)

• Ocular available as a downloadable tool: nlp.cs.berkeley.edu/ocular.shtml
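To make the second bullet concrete, here is a minimal, hypothetical generative rendering step in the spirit of the model: a glyph template is perturbed by a random vertical offset and an inking level before emitting binarized pixels. Parameter names and values are illustrative assumptions, not Ocular's actual ones.

import numpy as np

def render_glyph(template, vertical_offset=0, inking=1.0, rng=None):
    # template: (H, W) array of ink probabilities in (0, 1)
    # vertical_offset: pixel shift (printing-press lines wander vertically;
    #                  np.roll wraps around, which a real model would not)
    # inking: multiplier on ink probability (over- or under-inked impressions)
    rng = rng or np.random.default_rng()
    shifted = np.roll(template, vertical_offset, axis=0)   # vertical wander
    probs = np.clip(shifted * inking, 0.0, 1.0)            # inking variation
    return (rng.random(template.shape) < probs).astype(np.uint8)

# Two tokens of the same character type can come out quite differently:
template = np.full((8, 5), 0.9)
light = render_glyph(template, vertical_offset=1, inking=0.6)
heavy = render_glyph(template, vertical_offset=-1, inking=1.3)

Because the model explains these distortions explicitly instead of treating them as errors, decoding stays accurate on degraded printing-press scans.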

Page 171: Natural Language Processing

Conclusion

Thanks!