DATeCH 2014 - Session 3 - Correcting Noisy OCR: Context Beats Confusion
DESCRIPTION
Presentation of the paper Correcting Noisy OCR: Context Beats Confusion by John Evershed and Kent Fitch at DATeCH 2014. #digidays
TRANSCRIPT
![Page 1: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/1.jpg)
Automatic OCR correction http://overproof.projectcomputing.com
Correcting noisy OCR
- Context beats Confusion
[ presentation viewable at http://goo.gl/n85gR6 ]
![Page 2: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/2.jpg)
who are we?
● Australian software company
● developers John and Kent
● we put theory into practice
![Page 3: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/3.jpg)
main target - newspapers
● the first draft of history
● popular if made available
● usually poorly digitized
● too extensive for full human correction
![Page 4: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/4.jpg)
goals
● run on commodity cloud server
● optimal for noisy text
● at least 1000 words/sec
● correct at least 50% of errors
![Page 5: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/5.jpg)
division of labour
[diagram: MANAGER, TRIAGE and CORE components, with "bad" and "good" models]
![Page 6: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/6.jpg)
snippets for the core
● prefer triaged good words at start/end
● column aware
● some easy corrections applied
● some suggestions supplied
● bag of topic words available
● surrounding noise level indicated
![Page 7: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/7.jpg)
error contexts
● spell: vowals or consonnants
● type: you jit teh wrng key
● OCR: roprcroiitativcs cf thc Coveriuient
● random: anygh<eg 0at7happen
![Page 8: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/8.jpg)
confusion cost matrix
93: w ← w
155: e ← e
3750: c ← e
4451: m ← rn
6652: rn ← m
11065: E ← m
![Page 9: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/9.jpg)
word cost (eg rnorniny|morning)
language cost
● lexicon frequency
● entity list
● rare word list
● character 4-gram
error cost
● edit sum
● visual correlation
● generator hint
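The slide splits the cost of a candidate word into a language part and an error part. As a rough sketch of how the two might be combined, the example below adds a frequency-based language cost to an error cost supplied by the caller. The lexicon counts, the add-one smoothing, the plain weighted sum, and the names `language_cost` and `word_cost` are all assumptions for illustration; the slide does not say how the components are actually weighted.

```python
import math

# Hypothetical lexicon counts; a real system would load these from a corpus.
LEXICON_COUNTS = {"morning": 9000, "mourning": 300, "moaning": 40}


def language_cost(word, counts=LEXICON_COUNTS):
    """Negative log probability of `word`, with add-one smoothing."""
    total = sum(counts.values()) + len(counts) + 1
    return -math.log((counts.get(word, 0) + 1) / total)


def word_cost(candidate, error_cost, error_weight=1.0):
    """Assumed combination: language cost plus weighted error cost."""
    return language_cost(candidate) + error_weight * error_cost


# Error costs here are made-up numbers standing in for an edit-based score
# of each candidate against the OCR string "rnorniny".
candidates = {"morning": 10.5, "mourning": 22.5, "moaning": 22.4}
best = min(candidates, key=lambda w: word_cost(w, candidates[w]))
print(best)
```

In this toy setup the frequent word with the cheapest error explanation wins; the entity list, rare word list and character 4-gram from the slide would be further terms in the language cost.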
![Page 10: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/10.jpg)
word character confusion
m o r n i n g
r n o r n i n y
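The alignment above pairs the intended word with its OCR reading: `m` was read as `rn` and `g` as `y`. A minimal sketch of scoring such an alignment with a confusion cost table follows. It is a dynamic-programming edit cost that allows chunks of up to two characters on either side. The cost values, the default cost, and the function name `error_cost` are assumptions for illustration, not the figures or code used in Overproof.

```python
from functools import lru_cache

# Hypothetical confusion costs: (intended, observed) -> cost.
# The numbers are illustrative only.
CONFUSION_COST = {
    ("m", "rn"): 4.0,
    ("g", "y"): 6.0,
    ("e", "c"): 3.5,
}
MATCH_COST = 0.1      # assumed small cost for a correctly read character
DEFAULT_COST = 12.0   # assumed cost for any confusion not in the table
MAX_CHUNK = 2         # longest character group considered on either side


def error_cost(intended: str, observed: str) -> float:
    """Cheapest cost of explaining `observed` as an OCR reading of `intended`."""

    @lru_cache(maxsize=None)
    def best(i: int, j: int) -> float:
        if i == len(intended) and j == len(observed):
            return 0.0
        costs = []
        for di in range(MAX_CHUNK + 1):
            for dj in range(MAX_CHUNK + 1):
                if di == 0 and dj == 0:
                    continue  # every step must consume something
                if i + di > len(intended) or j + dj > len(observed):
                    continue
                src = intended[i:i + di]
                dst = observed[j:j + dj]
                if di == 1 and dj == 1 and src == dst:
                    step = MATCH_COST
                else:
                    step = CONFUSION_COST.get((src, dst), DEFAULT_COST)
                costs.append(step + best(i + di, j + dj))
        return min(costs)

    return best(0, 0)


print(error_cost("morning", "rnorniny"))
print(error_cost("mourning", "rnorniny"))
```

With these made-up costs, "morning" explains "rnorniny" more cheaply than "mourning" does, because the second needs an extra unlisted edit for the dropped `u`.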
![Page 11: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/11.jpg)
visual correlation
![Page 12: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/12.jpg)
suggestion methods
● gift
● common, cached
● language
● entities
● split/join
● generated (magic)
![Page 13: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/13.jpg)
searching for gold (A*)
[figure: search tree for the OCR string "hcii"; node labels include h, li, b, n, c, e, r, o, i, 1, l, u]
purple nodes: working priority queue
red nodes: output priority queue
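The slide names A* and two priority queues, one for work in progress and one for results. The sketch below is one way such a search could look: it walks a lexicon trie, explains the observed characters one intended character at a time, and emits complete words cheapest first. The trie layout, the cost table, the heuristic (each unread observed character costs at least `MATCH_COST`), and the names `build_trie` and `suggest` are assumptions for illustration; the output queue is simplified to a list that fills in cost order.

```python
import heapq
from itertools import count

# Hypothetical costs: (intended character, observed characters) -> cost.
CONFUSION_COST = {("m", "rn"): 4.0, ("g", "y"): 6.0}
MATCH_COST = 0.1     # assumed cost of a correctly read character
DEFAULT_COST = 12.0  # assumed cost of any confusion not listed above
END = "$"            # trie marker meaning "a word ends here"


def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root


def suggest(observed, trie, limit=3):
    """Return up to `limit` (cost, word) pairs, cheapest first."""
    tie = count()  # tie-breaker so the heap never compares trie nodes

    def h(j):
        # Lower bound on remaining cost: no step is cheaper than
        # MATCH_COST per observed character it consumes.
        return MATCH_COST * (len(observed) - j)

    working = [(h(0), next(tie), 0.0, 0, trie)]  # (f, tie, g, position, node)
    output = []
    seen = set()
    while working and len(output) < limit:
        _, _, g, j, node = heapq.heappop(working)
        if (id(node), j) in seen:
            continue  # already expanded via a path at least as cheap
        seen.add((id(node), j))
        if j == len(observed) and END in node:
            output.append((g, node[END]))
        for ch, child in node.items():
            if ch == END:
                continue
            for width in (0, 1, 2):  # observed characters explained by `ch`
                if j + width > len(observed):
                    break
                chunk = observed[j:j + width]
                if chunk == ch:
                    step = MATCH_COST
                else:
                    step = CONFUSION_COST.get((ch, chunk), DEFAULT_COST)
                g2 = g + step
                heapq.heappush(
                    working, (g2 + h(j + width), next(tie), g2, j + width, child)
                )
    return output


lexicon = build_trie(["morning", "mourning", "moaning", "warning"])
for cost, word in suggest("rnorniny", lexicon, limit=2):
    print(f"{word}: {cost:.1f}")
```

Because the heuristic never overestimates the remaining cost, the first complete word popped from the working queue is the cheapest one under this toy cost model.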
![Page 14: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/14.jpg)
amazing generated suggestions
Parhumuitar} ← Parliamentary
I.iulwuvB ← Railways
Itegtniont ← Regiment
niltfltory ← adultery
uj.rccu.eut ← agreement
couniutfc.o ← committee
cnuipuii ← company
dctoimiuatJOu ← determination
uiidcrtkikcr’a ← undertaker’s
![Page 15: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/15.jpg)
selecting best combination
unsiejitlv unsightly unseemly unsettle unsteady Unsightly urgently
bohavlour behaviour behavour behavior Behaviour behaviours behaving
abonf about above along been
am am an a in as
unsiejitlv unsightly unseemly unsettle unsteady Unsightly urgently
disgrie disgrace disagree disguise desire degree disease
[NOTE: word joins and splits are also supported]
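Each row above lists an OCR word followed by its candidate corrections, and the task is to pick one candidate per word so that the whole sequence is cheapest. A minimal sketch of that selection as a Viterbi-style dynamic program over adjacent word pairs follows. The cost tables, the default pair cost, and the name `best_combination` are assumptions for illustration, and the chosen phrase is not claimed to be the true source text; the slides mention 5-grams, whereas this sketch looks only at neighbouring pairs and ignores joins and splits.

```python
# Hypothetical per-word costs (language plus error cost), lower is better.
WORD_COST = {
    "unsightly": 1.0, "unseemly": 2.0, "unsettle": 3.0,
    "behaviour": 1.0, "behavior": 2.0, "behaving": 3.0,
}
# Hypothetical context costs for adjacent pairs; unlisted pairs are expensive.
PAIR_COST = {("unseemly", "behaviour"): 1.0}
DEFAULT_PAIR_COST = 5.0


def best_combination(candidates):
    """Pick one word per slot, minimising word costs plus pair costs."""
    # best[w] = (total cost, path) for the cheapest sequence ending in w
    best = {w: (WORD_COST[w], [w]) for w in candidates[0]}
    for options in candidates[1:]:
        following = {}
        for w in options:
            cost, path = min(
                (
                    (c + PAIR_COST.get((prev, w), DEFAULT_PAIR_COST), p)
                    for prev, (c, p) in best.items()
                ),
                key=lambda item: item[0],
            )
            following[w] = (cost + WORD_COST[w], path + [w])
        best = following
    return min(best.values(), key=lambda item: item[0])


slots = [
    ["unsightly", "unseemly", "unsettle"],   # candidates for "unsiejitlv"
    ["behaviour", "behavior", "behaving"],   # candidates for "bohavlour"
]
total, words = best_combination(slots)
print(total, " ".join(words))
```

In this toy example "unsightly" is the cheapest candidate in isolation, yet the context cost makes "unseemly behaviour" the cheapest sequence, which is the sense in which context can outweigh the per-word confusion score.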
![Page 16: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/16.jpg)
training
● 5-grams - subset selection
● corpus 1,2,3-grams - statistical build
● extra word lists - easy
● error model - bootstrap or new pairs
![Page 17: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/17.jpg)
testing
● 65,000 words of ground truth, including foreign (US) newspapers
● all measures exceeded goal:
○ search errors (article word types)
○ read errors (article word tokens)
○ entropy weighted term errors
![Page 18: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/18.jpg)
SMH sample

| Measure | Before | After | Change |
|---|---|---|---|
| Recall | 83.8% | 94.1% | recall misses reduced 63.3% |
| Raw Error Rate | 18.5% | 5.5% | errors reduced 70.1% |
| Weighted Error Rate | 16.2% | 6.7% | weighted errors reduced 59.4% |
![Page 19: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/19.jpg)
questions?
Presentation viewable at http://goo.gl/n85gR6
![Page 20: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/20.jpg)
![Page 21: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/21.jpg)
National Library of Australia’s TROVE
● 1.4m distinct visitors/month
● 16m pageviews/month
● 80% of usage is old newspapers
○ 13m pages, over 600 titles
○ 85k lines corrected/day
![Page 22: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/22.jpg)
Even this massive volunteer effort cannot keep up
● < 2% of errors have been corrected
● % corrected is declining
● Hence searching is unreliable, OCR’ed text is hard to read and reuse
● Trove’s accuracy is “typical”
![Page 23: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/23.jpg)
![Page 24: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/24.jpg)
159 randomly selected news articles from The Sydney Morning Herald
47.4K words hand-corrected to ground truth
![Page 25: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/25.jpg)
SMH sample

| Measure | Before | After | Change |
|---|---|---|---|
| Recall | 83.8% | 94.1% | recall misses reduced 63.3% |
| False positive recall | 26.7% | 9.1% | false positives reduced 65.8% |
| Raw Error Rate | 18.5% | 5.5% | errors reduced 70.1% |
| Weighted Error Rate | 16.2% | 6.7% | weighted errors reduced 59.4% |
![Page 26: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/26.jpg)
![Page 27: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/27.jpg)
![Page 28: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/28.jpg)
![Page 29: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/29.jpg)
![Page 30: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/30.jpg)
49 randomly selected news articles from LoC Chronicling America
18.1K words hand-corrected to ground truth
![Page 31: Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confsusion](https://reader034.vdocument.in/reader034/viewer/2022051816/546fd43fb4af9f3a0b8b46f0/html5/thumbnails/31.jpg)
LOC sample

| Measure | Before | After | Change |
|---|---|---|---|
| Recall | 84.0% | 93.1% | recall misses reduced 56.6% |
| False positive recall | 23.6% | 8.8% | false positives reduced 62.8% |
| Raw Error Rate | 19.1% | 6.4% | errors reduced 66.7% |
| Weighted Error Rate | 16.0% | 7.7% | weighted errors reduced 51.8% |
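The "reduced" figures in the results read as relative reductions of each before/after pair. The short sketch below shows that arithmetic for two LOC rows. The helper name `relative_reduction` is an assumption, and because it starts from the rounded percentages on the slide, it lands within a few tenths of a point of the quoted 56.6% and 66.7% rather than matching them exactly, presumably because the slide worked from unrounded rates.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage by which a rate fell from `before` to `after`."""
    return 100.0 * (before - after) / before


# LOC sample, using the rounded percentages from the slide.
recall_before, recall_after = 84.0, 93.1
raw_before, raw_after = 19.1, 6.4

# Recall is a "higher is better" measure, so compare the misses (100 - recall).
miss_reduction = relative_reduction(100 - recall_before, 100 - recall_after)
raw_reduction = relative_reduction(raw_before, raw_after)

print(f"recall misses reduced {miss_reduction:.1f}%")
print(f"raw errors reduced {raw_reduction:.1f}%")
```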