
Page 1:

Natural Language Processing

Historical Document Transcription
Dan Klein — UC Berkeley

Joint work with Taylor Berg-Kirkpatrick and Greg Durrett [ACL 2013]

Pages 2–3:

Historical Document: Old Bailey Court Proceedings, 1775

Pages 4–5:

Transcription / Document Image

(The garbled passage below is Tesseract's actual output on the document image; its errors are the point of the slide.)

and Ch’: priftmer anhc bar. Jacob Lazarus and his

IHP1 uh: prifoner. were both together when!

rcccivcd lhczn. I fold eievén pair of than

for xiirce guincas, and dclivcrcd the rcll'l:.in-

d:r hack lo :11: prifuner. 1 fold ftvcn pairof

filk to Mark Simpcr : nncpuir of mixcd. and.

mo pair of Ifircad to lhz: foolnun, and on:

pair of zhrzad to lh: barber. '

Q: What is the foolmarfs name?

Fraum Mgfzr. I dun’: know.

Hairy Hzrvir. l was flandingar the Camp

Icr waizin far the thcrrilfs ufliceruo employ

in: : Mo 3‘: daughter came for me to 0 am!

take the prifoncr. 1 Wm! to |hc Old aailcy

Transcription (Google Tesseract)

Pages 6–13:

Pipelined Approach

Page 14:

Historical Document

Pages 15–19:

Unknown Fonts

long s glyph

Pages 20–23:

Wandering Baseline

Unsupervised Transcription of Historical Documents

Taylor Berg-Kirkpatrick, Greg Durrett, Dan Klein
Computer Science Division

University of California at Berkeley
{tberg,gdurrett,klein}@cs.berkeley.edu

Abstract

We present a generative probabilistic model, inspired by historical printing processes, for transcribing images of documents from the printing press era. By jointly modeling the text of the document and the noisy (but regular) process of rendering glyphs, our unsupervised system is able to decipher font structure and more accurately transcribe images into text. Overall, our system substantially outperforms state-of-the-art solutions for this task, achieving a 31% relative reduction in word error rate over the leading commercial system for historical transcription, and a 47% relative reduction over Tesseract, Google's open source OCR system.

1 Introduction

Standard techniques for transcribing modern documents do not work well on historical ones. For example, even state-of-the-art OCR systems produce word error rates of over 50% on the documents shown in Figure 1. Unsurprisingly, such error rates are too high for many research projects (Arlitsch and Herbert, 2004; Shoemaker, 2005; Holley, 2010). We present a new, generative model specialized to transcribing printing-press era documents. Our model is inspired by the underlying printing processes and is designed to capture the primary sources of variation and noise.
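For concreteness: word error rate (WER) is the word-level edit distance between the system output and a reference transcription, divided by the number of reference words (and usually reported as a percentage). A minimal Python sketch of this metric, not the paper's evaluation code:

    def word_error_rate(hyp: str, ref: str) -> float:
        """WER = word-level edit distance / number of reference words."""
        h, r = hyp.split(), ref.split()
        # d[i][j] = edits needed to turn the first i hypothesis words
        # into the first j reference words
        d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
        for i in range(len(h) + 1):
            d[i][0] = i
        for j in range(len(r) + 1):
            d[0][j] = j
        for i in range(1, len(h) + 1):
            for j in range(1, len(r) + 1):
                sub = d[i - 1][j - 1] + (h[i - 1] != r[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(h)][len(r)] / max(len(r), 1)

    # e.g. scoring a garbled OCR line against its reference
    print(word_error_rate("and Ch': priftmer anhc bar", "and the prisoner at the bar"))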

One key challenge is that the fonts used in historical documents are not standard (Shoemaker, 2005). For example, consider Figure 1a. The fonts are not irregular like handwriting: each occurrence of a given character type, e.g. a, will use the same underlying glyph. However, the exact glyphs are unknown. Some differences between fonts are minor, reflecting small variations in font design. Others are more severe, like the presence of the archaic long s character before 1804. To address the general problem of unknown fonts, our model learns the font in an unsupervised fashion. Font shape and character segmentation are tightly coupled, and so they are modeled jointly.

Figure 1: Portions of historical documents with (a) unknown font, (b) uneven baseline, and (c) over-inking.

A second challenge with historical data is that the early typesetting process was noisy. Hand-carved blocks were somewhat uneven and often failed to sit evenly on the mechanical baseline. Figure 1b shows an example of the text's baseline moving up and down, with varying gaps between characters. To deal with these phenomena, our model incorporates random variables that specifically describe variations in vertical offset and horizontal spacing.
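As a schematic illustration of such random variables (the parameterization below is invented for the sketch, not taken from the paper): the baseline offset drifts a little from glyph to glyph, and each glyph gets its own horizontal pad.

    import random

    def sample_layout(chars, max_offset=2, max_pad=3):
        """Schematic typesetting: per-glyph vertical offset (wandering
        baseline) and horizontal pad (uneven spacing). Hypothetical
        parameterization, for illustration only."""
        layout, offset = [], 0
        for c in chars:
            offset += random.randint(-1, 1)               # baseline drifts slowly
            offset = max(-max_offset, min(max_offset, offset))
            pad = random.randint(0, max_pad)              # gap before this glyph
            layout.append((c, offset, pad))
        return layout

    print(sample_layout("prisoner"))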

A third challenge is that the actual inking was also noisy. For example, in Figure 1c some characters are thick from over-inking while others are obscured by ink bleeds. To be robust to such rendering irregularities, our model captures both inking levels and pixel-level noise. Because the model is generative, we can also treat areas that are obscured by larger ink blotches as unobserved, and let the model predict the obscured text based on visual and linguistic context.
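A sketch of the inking and noise components under simple Bernoulli assumptions; the function names, the flip-noise form, and the parameter values here are illustrative, not the paper's:

    import numpy as np

    def render_noisy(template, ink=0.9, flip=0.05, seed=0):
        """template: binary glyph array (1 = ink). Returns a noisy rendering."""
        rng = np.random.default_rng(seed)
        printed = (template == 1) & (rng.random(template.shape) < ink)  # inking level
        flips = rng.random(template.shape) < flip                       # bleeds, speckle
        return printed ^ flips

    def masked_log_likelihood(image, ink_probs, observed):
        """Bernoulli pixel log-likelihood; pixels under large ink blotches
        (observed == False) are skipped, i.e. treated as unobserved."""
        p = np.where(image, ink_probs, 1.0 - ink_probs)
        return np.log(np.clip(p, 1e-9, 1.0))[observed].sum()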

Our system, which we call Ocular, operates by fitting the model to each document in an unsupervised fashion. The system outperforms state-of-the-art baselines, giving a 47% relative error reduction over Google's open source Tesseract system, and giving a 31% relative error reduction over ABBYY's commercial FineReader system, which has been used in large-scale historical transcription projects (Holley, 2010).
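Schematically, that unsupervised fitting alternates between decoding each line under the current font and re-estimating the glyph shapes from the decoded alignments, in the style of EM. The outline below uses placeholder callables and is a rough sketch, not the paper's actual inference procedure:

    def fit_unsupervised(lines, font, decode_line, reestimate_font, iterations=10):
        """EM-style outline. `decode_line` and `reestimate_font` are
        placeholders for the real E-step (transcription + segmentation
        under the current font) and M-step (font parameter update)."""
        for _ in range(iterations):
            alignments = [decode_line(x, font) for x in lines]  # E-step
            font = reestimate_font(alignments)                  # M-step
        return font, alignments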


Pages 24–27:

Uneven Inking

Page 28:

Various Historical Documents

Figure 6: Portions of several documents from our test set representing a range of difficulties ((a) Old Bailey, 1725; (b) Old Bailey, 1875; (c) Trove, 1883; (d) Trove, 1823). On document (a), which exhibits noisy typesetting, our system achieves a word error rate (WER) of 25.2. Document (b) is cleaner in comparison, and on it we achieve a WER of 15.4. On document (c), which is also relatively clean, we achieve a WER of 12.5. On document (d), which is severely degraded, we achieve a WER of 70.0.

5 Data

We perform experiments on two historical datasets consisting of images of documents printed between 1700 and 1900 in England and Australia. Examples from both datasets are displayed in Figure 6.

5.1 Old Bailey

The first dataset comes from a large set of images of the proceedings of the Old Bailey, a criminal court in London, England (Shoemaker, 2005). The Old Bailey curatorial effort, after deciding that current OCR systems do not adequately handle 18th century fonts, manually transcribed the documents into text. We will use these manual transcriptions to evaluate the output of our system. From the Old Bailey proceedings, we extracted a set of 20 images, each consisting of 30 lines of text, to use as our first test set. We picked 20 documents, printed in consecutive decades. The first document is from 1715 and the last is from 1905. We chose the first document in each of the corresponding years, chose a random page in the document, and extracted an image of the first 30 consecutive lines of text consisting of full sentences.[5]

The ten documents in the Old Bailey dataset that were printed before 1810 use the long s glyph, while the remaining ten do not.

5.2 Trove

Our second dataset is taken from a collection of digitized Australian newspapers that were printed between the years of 1803 and 1954. This collection is called Trove, and is maintained by the National Library of Australia (Holley, 2010). We extracted ten images from this collection in the same way that we extracted images from Old Bailey, but starting from the year 1803. We manually produced our own gold annotations for these ten images. Only the first document of Trove uses the long s glyph.

5.3 Pre-processing

Many of the images in historical collections are bitonal (binary) as a result of how they were captured on microfilm for storage in the 1980s (Arlitsch and Herbert, 2004). This is part of the reason our model is designed to work directly with binarized images. For consistency, we binarized the images in our test sets that were not already binary by thresholding pixel values.
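Thresholding to a bitonal image is a one-liner with NumPy and PIL; the fixed cutoff of 128 below is an assumption, since the paper does not state what threshold it used:

    import numpy as np
    from PIL import Image

    def binarize(path, threshold=128):
        """Load a page image as grayscale and threshold it to binary."""
        gray = np.asarray(Image.open(path).convert("L"))
        return (gray < threshold).astype(np.uint8)  # 1 = ink (dark pixel)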

Our model requires that the image be pre-segmented into lines of text. We automatically segment lines by training an HSMM over rows of pixels. After the lines are segmented, each line is resampled so that its vertical resolution is 30 pixels. The line extraction process also identifies pixels that are not located in central text regions and are part of large connected components of ink spanning multiple lines. The values of such pixels are treated as unobserved in the model since, more often than not, they are part of ink blotches.

[5] This ruled out portions of the document with extreme structural abnormalities, like title pages and lists. These might be interesting to model, but are not within the scope of this paper.
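The paper segments lines with an HSMM trained over pixel rows; the sketch below substitutes a much simpler projection-profile heuristic (rows containing little ink separate lines), then resamples each line to the 30-pixel vertical resolution described above. The heuristic and its threshold are stand-ins, not the paper's method:

    import numpy as np
    from PIL import Image

    def segment_lines(binary, min_ink=2, target_height=30):
        """Split a binary page array (1 = ink) into line images using a
        horizontal projection profile, then fix the vertical resolution."""
        ink_per_row = binary.sum(axis=1)
        lines, start = [], None
        for y, ink in enumerate(ink_per_row):
            if ink >= min_ink and start is None:
                start = y                       # entering a text line
            elif ink < min_ink and start is not None:
                lines.append(binary[start:y])   # leaving a text line
                start = None
        if start is not None:
            lines.append(binary[start:])
        # resample every line to a fixed height of 30 pixels
        return [np.asarray(Image.fromarray(l * 255).resize(
                    (l.shape[1], target_height))) > 127
                for l in lines]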


Pages 29–32:

Our Approach

Pages 33–36:

Generative Model

p r i s o n e r

Page 37:

Generative Model: Language Model

p r i s o n e r

Pages 38–51:

Generative Model: Typesetting Model

p r i s o n e r

Page 52: Generative Model

Rendering Model

p r i s o n e r

Page 56: Generative Model

Language Model

E: It appeared that the Prisoner was very
X: (line image; callouts: over-inked, wandering baseline, historical font)

Figure 2: An example image from a historical document (X) and its transcription (E).

2 Related Work

Relatively little prior work has built models specifically for transcribing historical documents. Some of the challenges involved have been addressed (Ho and Nagy, 2000; Huang et al., 2006; Kae and Learned-Miller, 2009), but not in a way targeted to documents from the printing press era. For example, some approaches have learned fonts in an unsupervised fashion but require pre-segmentation of the image into character or word regions (Ho and Nagy, 2000; Huang et al., 2006), which is not feasible for noisy historical documents. Kae and Learned-Miller (2009) jointly learn the font and image segmentation but do not outperform modern baselines.

Work that has directly addressed historical documents has done so using a pipelined approach, and without fully integrating a strong language model (Vamvakas et al., 2008; Kluzner et al., 2009; Kae et al., 2010; Kluzner et al., 2011). The most comparable work is that of Kopec and Lomelin (1996) and Kopec et al. (2001). They integrated typesetting models with language models, but did not model noise. In the NLP community, generative models have been developed specifically for correcting outputs of OCR systems (Kolak et al., 2003), but these do not deal directly with images.

A closely related area of work is automatic decipherment (Ravi and Knight, 2008; Snyder et al., 2010; Ravi and Knight, 2011; Berg-Kirkpatrick and Klein, 2011). The fundamental problem is similar to our own: we are presented with a sequence of symbols, and we need to learn a correspondence between symbols and letters. Our approach is also similar in that we use a strong language model (in conjunction with the constraint that the correspondence be regular) to learn the correct mapping. However, in decipherment problems the symbols are not noisy, whereas in our problem we face a grid of pixels for which the segmentation into symbols is unknown; decipherment typically deals only with discrete symbols.

3 Model

Most historical documents have unknown fonts, noisy typesetting layouts, and inconsistent ink levels, usually simultaneously. For example, the portion of the document shown in Figure 2 has all three of these problems. Our model must handle them jointly.

We take a generative modeling approach inspired by the overall structure of the historical printing process. Our model generates images of documents line by line; we present the generative process for the image of a single line. Our primary random variables are E (the text) and X (the pixels in an image of the line). Additionally, we have a random variable T that specifies the layout of the bounding boxes of the glyphs in the image, and a random variable R that specifies aspects of the inking and rendering process. The joint distribution is:

P(E, T, R, X) =
    P(E)               [Language model]
  · P(T | E)           [Typesetting model]
  · P(R)               [Inking model]
  · P(X | E, T, R)     [Noise model]

We let capital letters denote vectors of concatenated random variables, and we denote the individual random variables with lower-case letters. For example, E represents the entire sequence of text, while e_i represents the ith character in the sequence.

3.1 Language Model P(E)

Our language model, P(E), is a Kneser-Ney smoothed character n-gram model (Kneser and Ney, 1995). We generate printed lines of text (rather than sentences) independently, without generating an explicit stop character. This means that, formally, the model must separately generate the character length of each line. We choose not to bias the model towards longer or shorter character sequences and let the line length m be drawn uniformly at random from the positive integers less than some large constant M.¹ When i < 1, let e_i denote a line-initial null character. We can now write:

P(E) = P(m) · ∏_{i=1}^{m} P(e_i | e_{i-1}, ..., e_{i-n})

¹ In particular, we do not use the kind of "word bonus" common to statistical machine translation models.

E: p r i s o n e r
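The four factors above multiply, or add in log space. A toy sketch of that composition follows; the `lm`, `typesetting`, `inking`, and `noise` stand-ins here are hypothetical uniform place-holders, not the model's actual component distributions:

```python
import math

def log_joint(e, t, r, x, lm, typesetting, inking, noise):
    # P(E, T, R, X) = P(E) * P(T|E) * P(R) * P(X|E, T, R), in log space.
    return (math.log(lm(e))
            + math.log(typesetting(t, e))
            + math.log(inking(r))
            + math.log(noise(x, e, t, r)))

# Toy stand-in components, just to exercise the four-factor structure.
score = log_joint(
    e="prisoner", t=(3, 20, 2), r=0, x=[[0, 1], [1, 1]],
    lm=lambda e: 1.0 / 26 ** len(e),                   # uniform characters
    typesetting=lambda t, e: 1.0 / 30,                 # uniform layout
    inking=lambda r: 1.0 / 5,                          # uniform inking
    noise=lambda x, e, t, r: 0.5 ** sum(map(sum, x)),  # coin-flip pixels
)
print(score)  # a single log-probability
```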

Page 57: Generative Model

Language Model: E

Typesetting Model: T

p r i s o n e r

Page 58: Generative Model

Language Model: E

Typesetting Model: T

Rendering Model: X, with P(X | E, T)

p r i s o n e r

Page 60: Language Model

E

Page 63: Language Model

E: ... e_{i-1} e_i e_{i+1} ... (shown with e_{i-1} = r, e_i = a, e_{i+1} = t)

Kneser-Ney smoothed character 6-gram
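A minimal sketch of scoring a line under P(E): a uniform length prior times a character n-gram product. The model uses Kneser-Ney smoothing; the add-one-smoothed counts below are a simpler stand-in, and the value of M, the vocabulary size, and the training lines are all made up:

```python
import math
from collections import Counter

N = 6          # character 6-gram, as on the slide (context of 5 chars)
M = 1000       # large constant bounding line length (assumed value)
PAD = "\x00"   # line-initial null character e_i for i < 1

def train_counts(lines):
    ngrams, contexts = Counter(), Counter()
    for line in lines:
        padded = PAD * (N - 1) + line
        for i in range(N - 1, len(padded)):
            ctx, ch = padded[i - N + 1:i], padded[i]
            ngrams[(ctx, ch)] += 1
            contexts[ctx] += 1
    return ngrams, contexts

def log_p_line(line, ngrams, contexts, vocab_size):
    # log P(E) = log P(m) + sum_i log P(e_i | e_{i-1}, ..., e_{i-n})
    logp = math.log(1.0 / M)  # uniform length prior P(m)
    padded = PAD * (N - 1) + line
    for i in range(N - 1, len(padded)):
        ctx, ch = padded[i - N + 1:i], padded[i]
        # Add-one smoothing as a stand-in for Kneser-Ney.
        logp += math.log((ngrams[(ctx, ch)] + 1) /
                         (contexts[ctx] + vocab_size))
    return logp

ngrams, contexts = train_counts(["the prisoner was taken away",
                                 "the prisoner at the bar"])
print(log_p_line("the prisoner", ngrams, contexts, vocab_size=27))
```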

Page 64: Typesetting Model

T (for character e_i = a):

Left pad width l_i: 1 ... 5

Glyph box width g_i: 1 ... 30

Right pad width r_i: 1 ... 5

Vertical offset v_i

Inking level d_i
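A minimal sketch of drawing T given E, assuming uniform draws over the pad and width ranges shown on the slide; the model actually learns character-specific distributions over these values, and the number of vertical offsets and inking levels below is a guess:

```python
import random

def sample_typesetting(text, n_offsets=3, n_ink_levels=10):
    # Draw layout variables for each character of E. Uniform draws stand
    # in for the learned, character-specific distributions.
    T = []
    for ch in text:
        T.append({
            "char": ch,
            "left_pad": random.randint(1, 5),            # l_i
            "glyph_width": random.randint(1, 30),        # g_i
            "right_pad": random.randint(1, 5),           # r_i
            "v_offset": random.randrange(n_offsets),     # v_i
            "ink_level": random.randrange(n_ink_levels), # d_i
        })
    return T

print(sample_typesetting("prisoner")[:2])
```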

Page 73: Rendering Model

Glyph box: glyph box width g_i, vertical offset v_i, inking level d_i

Glyph shape parameters → Bernoulli pixel probs → sample pixels → X
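A minimal sketch of the final sampling step: every pixel in a glyph box is an independent Bernoulli draw from its pixel probability. The 5x4 probability grid below is invented for illustration, not a learned glyph:

```python
import random

theta = [  # Bernoulli pixel probs for one toy glyph box (rows x cols)
    [0.1, 0.8, 0.8, 0.1],
    [0.8, 0.1, 0.1, 0.8],
    [0.8, 0.8, 0.8, 0.8],
    [0.8, 0.1, 0.1, 0.8],
    [0.8, 0.1, 0.1, 0.8],
]

def sample_pixels(theta):
    # X[j][k] ~ Bernoulli(theta[j][k]), independently per pixel.
    return [[1 if random.random() < p else 0 for p in row]
            for row in theta]

for row in sample_pixels(theta):
    print("".join("#" if px else "." for px in row))
```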

Page 98: Log-linear Interpolation

Glyph shape parameters φ

Bernoulli pixel probs θ

Interpolation weights α

For each pixel j: dot product α_j^T φ, then apply logistic:

θ_j ∝ exp[α_j^T φ]
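A minimal sketch of this parameterization: for a binary pixel, θ_j ∝ exp[α_j^T φ] normalizes to the logistic of the dot product. The weight and parameter vectors below are made-up numbers, not learned values:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def pixel_prob(alpha_j, phi):
    # Dot product alpha_j^T phi, then apply logistic.
    return logistic(sum(a * p for a, p in zip(alpha_j, phi)))

phi = [0.5, -1.2, 2.0]      # glyph shape parameters (toy values)
alpha_j = [1.0, 0.0, 0.3]   # interpolation weights for pixel j (toy values)
print(pixel_prob(alpha_j, phi))  # theta_j, a value in (0, 1)
```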

Page 113: Learning and Inference

• Learn font parameters using EM

• Initialize font parameters with mixtures of modern fonts

• Semi-Markov DP to compute expectations (sketched below)

• Efficient inference using a coarse-to-fine approach
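A minimal semi-Markov Viterbi sketch of the kind of dynamic program involved: each step consumes a whole glyph box of 1 to 30 pixel columns. The real E-step computes expectations (forward-backward) rather than a single best path, and the toy seg_score below is an assumption, not the model's emission score:

```python
def semimarkov_viterbi(num_cols, chars, seg_score, max_width=30):
    # best[i][k]: best score covering columns [0, i) with k chars placed.
    NEG = float("-inf")
    best = [[NEG] * (len(chars) + 1) for _ in range(num_cols + 1)]
    best[0][0] = 0.0
    for i in range(num_cols + 1):
        for k in range(len(chars) + 1):
            if best[i][k] == NEG or k == len(chars):
                continue
            for w in range(1, min(max_width, num_cols - i) + 1):
                s = best[i][k] + seg_score(chars[k], i, i + w)
                if s > best[i + w][k + 1]:
                    best[i + w][k + 1] = s
    return best[num_cols][len(chars)]

# Toy scorer: prefer glyph boxes about 10 columns wide.
score = semimarkov_viterbi(
    num_cols=80, chars="prisoner",
    seg_score=lambda ch, lo, hi: -abs((hi - lo) - 10))
print(score)  # 0.0: every glyph gets its preferred 10-column width
```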

Page 120: System Output Example

how the murderers came to

Page 127: System Output Example

taken ill and taken away -- I remember

Page 133: Experiments

Test data

• Old Bailey (1715-1905): 20 images, 30 lines each

• Trove (1803-1893): 10 images, 30 lines each

Baselines

• Google Tesseract

• ABBYY FineReader 11

Language models

• New York Times: 34M words (NYT Gigaword)

• Old Bailey: 32M words, manually transcribed

Page 151: Natural Language Processing

Results: Old Bailey Court Proceedings (1715-1905)

Word error rate by system:

Google Tesseract: 54.8
ABBYY FineReader: 40.0
Ocular w/ NYT: 28.1
Ocular w/ OB: 24.1

[Berg-Kirkpatrick et al. 2013]
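The metric on the original chart is word error rate. As a reference for how such a number is computed, here is a minimal sketch using the standard Levenshtein formulation over word tokens (this is the conventional definition, not necessarily the exact evaluation script behind these slides):

def word_error_rate(reference, hypothesis):
    # Levenshtein distance over word tokens, divided by reference length.
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return 100.0 * dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution and one deletion against a 5-word reference:
print(word_error_rate("the prisoner at the bar", "the prifoner at bar"))  # 40.0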

Page 156: Natural Language Processing

Results: Trove Historical Newspapers (1803-1893)

Word error rate by system:

Google Tesseract: 59.3
ABBYY FineReader: 49.2
Ocular w/ NYT: 33.0
Ocular w/ NYT (2014 model): 25.6

[Berg-Kirkpatrick et al. 2014]

Page 158: Natural Language Processing

Transcription

Google Tesseract:

and Ch’: priftmer anhc bar. Jacob Lazarus and his
IHP1 uh: prifoner. were both together when!
rcccivcd lhczn. I fold eievén pair of than
for xiirce guincas, and dclivcrcd the rcll'l:.in-
d:r hack lo :11: prifuner. 1 fold ftvcn pairof
filk to Mark Simpcr : nncpuir of mixcd. and.
mo pair of Ifircad to lhz: foolnun, and on:
pair of zhrzad to lh: barber. '
Q: What is the foolmarfs name?
Fraum Mgfzr. I dun’: know.
Hairy Hzrvir. l was flandingar the Camp
Icr waizin far the thcrrilfs ufliceruo employ
in: : Mo 3‘: daughter came for me to 0 am!
take the prifoncr. 1 Wm! to |hc Old aailcy

Ocular:

the prisoner at the bar. Jacob Lazarus and his
wife, the prisoners were both together when I
received them. I sold eleven pair of them
for three guineas, and delivered the remain-
der back to the prisoner. I sold, seven pair of
silk to Mark Simpert one pair of mixed, and
two pair of thread to the footman, and one
pair of thread to the barber,
Ms. What in the footman's name?
Franco Asyut, I don't know-
Nearly Norris. I was standing at the Comp-
ter waiting for the sherrill's officers to employ
me a Moses's daughter came for me to go and
take the prisoner. I went to the Old Bailey

Page 162: Natural Language Processing

Learned Fonts

[Figure: glyph shapes (e.g. "g") from the initializer font and from fonts learned on documents dated 1700, 1740, 1780, 1820, 1860, and 1900]
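The slides show only the learned glyph images; the underlying estimation is unsupervised, with the font treated as parameters and the transcription as a hidden variable, learned by EM. Below is a deliberately toy, hypothetical version of that idea: hard EM over fixed-width Bernoulli glyph templates. All names and the fixed-width segmentation assumption are mine; the real model jointly infers segmentation, inking, and vertical offset with a semi-Markov dynamic program.

import numpy as np

H, W = 8, 5        # glyph height and width (toy values)
CHARS = "ab"       # toy alphabet

def e_step(lines, templates):
    # Hard assignment: give each fixed-width window to the template with
    # the highest Bernoulli log-likelihood. Assumes each line's width is
    # a multiple of W (the real model infers segmentation instead).
    assignments = []
    for img in lines:
        labels = []
        for x in range(0, img.shape[1], W):
            window = img[:, x:x + W]
            scores = {c: np.sum(window * np.log(t) + (1 - window) * np.log(1 - t))
                      for c, t in templates.items()}
            labels.append(max(scores, key=scores.get))
        assignments.append(labels)
    return assignments

def m_step(lines, assignments):
    # Re-estimate each template as the smoothed mean of its assigned
    # windows (pseudo-counts keep ink probabilities strictly in (0, 1)).
    sums = {c: np.full((H, W), 1.0) for c in CHARS}
    counts = {c: 2.0 for c in CHARS}
    for img, labels in zip(lines, assignments):
        for k, c in enumerate(labels):
            sums[c] += img[:, k * W:(k + 1) * W]
            counts[c] += 1
    return {c: sums[c] / counts[c] for c in CHARS}

def learn_font(lines, iters=10, seed=0):
    # Start from a noisy "initializer" font and alternate E- and M-steps.
    rng = np.random.default_rng(seed)
    templates = {c: rng.uniform(0.3, 0.7, (H, W)) for c in CHARS}
    for _ in range(iters):
        templates = m_step(lines, e_step(lines, templates))
    return templates

# Usage (hypothetical): lines is a list of (H, k*W) binary arrays cut from
# scanned text lines; learn_font(lines) recovers per-character templates
# without any transcriptions.

Starting from a generic initializer font and iterating these two steps is, at a very high level, what produces the progression from the initializer row to the era-specific fonts shown on the slide.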

Page 166: Natural Language Processing

Unobserved Pixels

[Figure: example document images illustrating unobserved pixels]

Page 170: Natural Language Processing

Conclusion

• Unsupervised font learning yields state-of-the-art results on documents where font is unknown

• Generatively modeling sources of noise specific to printing-press era documents is effective (a minimal sketch follows this list)

• Ocular available as a downloadable tool: nlp.cs.berkeley.edu/ocular.shtml
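To make the second bullet concrete, here is a minimal, hypothetical generative rendering step in the spirit of the model: a glyph template is perturbed by a random vertical offset and an inking level before emitting binarized pixels. Parameter names and values are illustrative assumptions, not Ocular's actual ones.

import numpy as np

def render_glyph(template, vertical_offset=0, inking=1.0, rng=None):
    # template: (H, W) array of ink probabilities in (0, 1)
    # vertical_offset: pixel shift (printing-press lines wander vertically;
    #                  np.roll wraps around, which a real model would not)
    # inking: multiplier on ink probability (over- or under-inked impressions)
    rng = rng or np.random.default_rng()
    shifted = np.roll(template, vertical_offset, axis=0)   # vertical wander
    probs = np.clip(shifted * inking, 0.0, 1.0)            # inking variation
    return (rng.random(template.shape) < probs).astype(np.uint8)

# Two tokens of the same character type can come out quite differently:
template = np.full((8, 5), 0.9)
light = render_glyph(template, vertical_offset=1, inking=0.6)
heavy = render_glyph(template, vertical_offset=-1, inking=1.3)

Because the model explains these distortions explicitly instead of treating them as errors, decoding stays accurate on degraded printing-press scans.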

Page 171: Natural Language Processing

Conclusion

Thanks!