OCR
Optical Character Recognition
Simon Tanner
Blog: simon-tanner.blogspot.co.uk
Twitter: @SimonTanner
www.slideshare.net/KDCS/
King’s Digital Consultancy Services
www.digitalconsultancy.net
Some OCR resources
By Simon Tanner:
Deciding whether Optical Character Recognition is feasible (PDF document), created for the Oxford University Digital Library: www.odl.ox.ac.uk/papers/OCRFeasibility_final.pdf
Measuring Mass Text Digitization Quality and Usefulness: Lessons Learned from Assessing the OCR Accuracy of the British Library's 19th Century Online Newspaper Archive: www.dlib.org/dlib/july09/munoz/07munoz.html
The IMPACT project: Improving Access to Text: www.impact-project.eu
OCR – How it works
Image optimisation
Document Image Analysis
Character recognition
Word identification/recognition
Correction
Formatting output
Assessing a resource for OCR
Scanning methods possible
Nature of original paper
Nature of printing: uniformity; language; text alignment; complexity of alignment; lines, graphics and pictures; handwriting
Nature of document
Nature of output requirements
OCR Accuracy
Evaluating OCR accuracy is about more than just character to character accuracy rates
Character accuracy rates are misleading (more later…)
It is also about assessing the functionality enabled through the OCR’s output
Search accuracy
Volume of hits returned
Ability to structure searches and results
Accuracy of result ranking
Amount of correction required to achieve the required performance
Character accuracy rates may mislead
Consider this scenario: 1,000 words with 5,000 characters (an average of 5 per word), excluding spaces
90% character accuracy means:
4,500 characters correct and 500 incorrect
Possibly a maximum of 900 words correct (90%), if the 500 errors are packed into 100 words
Possibly a minimum of 500 words correct (50%), if each error falls in a different word
Reality is somewhere in between
Depending on the number of “significant words” the search results could still be almost 100% or near zero
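The arithmetic in this scenario can be reproduced with a short sketch. The function name and its parameters are our own, chosen for illustration, and the sketch assumes every word has exactly the average length.

```python
import math


def word_accuracy_bounds(total_words, chars_per_word, char_accuracy):
    """Return (best case, worst case) counts of fully correct words.

    Assumes every word has exactly `chars_per_word` characters, which is a
    simplification of the "average of 5 per word" in the scenario.
    """
    total_chars = total_words * chars_per_word
    wrong_chars = round(total_chars * (1 - char_accuracy))
    # Best case: errors cluster, ruining as few words as possible.
    fewest_wrong_words = math.ceil(wrong_chars / chars_per_word)
    # Worst case: every error lands in a different word.
    most_wrong_words = min(wrong_chars, total_words)
    return total_words - fewest_wrong_words, total_words - most_wrong_words


best, worst = word_accuracy_bounds(1000, 5, 0.90)
print(best, worst)  # 900 500
```

The same 90% character accuracy is therefore compatible with anything from 50% to 90% word accuracy, which is why the character rate alone says little about search performance.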
OCR Accuracy: Balancing factors
Character accuracy vs. word accuracy
Significant word accuracy
Significant words with capital letter start accuracy
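As a rough illustration of how these four measures can diverge, the sketch below compares an OCR string with a reference transcription using Python's standard difflib module. The stopword list, the definition of a "significant word" as any non-stopword, and the reference transcription are all assumptions made for the example; they are not the definitions used in the British Library study.

```python
from difflib import SequenceMatcher

# Illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "at", "is", "be"}


def accuracy_report(truth_text, ocr_text):
    """Compare OCR output with a reference transcription.

    Returns the share of reference characters and words that the OCR
    reproduced, using difflib's alignment as an approximation.
    """
    truth_words = truth_text.split()
    ocr_words = ocr_text.split()

    # Align the two word sequences and record which reference words matched.
    word_matcher = SequenceMatcher(None, truth_words, ocr_words, autojunk=False)
    matched = set()
    for block in word_matcher.get_matching_blocks():
        matched.update(range(block.a, block.a + block.size))

    def share(indices):
        indices = list(indices)
        if not indices:
            return 1.0
        return sum(i in matched for i in indices) / len(indices)

    significant = [
        i for i, word in enumerate(truth_words)
        if word.strip(".,;:!?").lower() not in STOPWORDS
    ]
    capitalised = [i for i in significant if truth_words[i][0].isupper()]

    # Character accuracy: aligned characters over reference length.
    char_matcher = SequenceMatcher(None, truth_text, ocr_text, autojunk=False)
    matched_chars = sum(b.size for b in char_matcher.get_matching_blocks())

    return {
        "characters": matched_chars / len(truth_text) if truth_text else 1.0,
        "words": share(range(len(truth_words))),
        "significant words": share(significant),
        "capitalised significant words": share(capitalised),
    }


# The reference line is our guess at the intended reading of the OCR sample.
truth = "There must be more Love, and Charity, and Unanimity amongst Christians"
ocr = "There mult be more Love, and Charity, and Unanimity amongft Chriftians"
report = accuracy_report(truth, ocr)
for name, value in report.items():
    print(f"{name}: {value:.1%}")
```

Three wrong characters out of seventy leave the character rate high, yet three of the eleven words, all of them significant, would be missed by a search.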
Bit-depth is the number one factor that can improve OCR accuracy once a base level of 300+dpi resolution is achieved.
Bitonal scanning emphasises foxing and obscures characters in words such as "consequently", "clergy", "matrimonial" and "the" that would be captured accurately from the greyscale image.
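A toy example may help show why reducing bit-depth loses information. The pixel values and the threshold of 128 below are hypothetical, chosen only to illustrate the effect; real scanning software uses more sophisticated binarisation.

```python
def to_bitonal(grey_pixels, threshold=128):
    """Map 8-bit greyscale values (0 = black, 255 = white) to 1 (ink) or 0 (paper)."""
    return [1 if value < threshold else 0 for value in grey_pixels]


# Hypothetical values for one row of pixels: paper, a printed stroke,
# paper, a foxing stain, paper.
PAPER, INK, FOXING = 230, 40, 110
scan_line = [PAPER, INK, INK, PAPER, FOXING, FOXING, PAPER]

print(scan_line)              # greyscale: ink (40) and foxing (110) differ
print(to_bitonal(scan_line))  # [0, 1, 1, 0, 1, 1, 0]
```

In greyscale the stain is clearly lighter than the ink, so an OCR engine can tell them apart. After thresholding, both become solid black and the stain looks like part of the text.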
BL Newspaper Results: arranged by date
[Chart. Horizontal axis: 1801 to 1900. Vertical axis: 50 to 100. Series: characters, words, words with capital letter start, significant words, each with a polynomial (Poly.) trend line.]
OCR Quiz
Look at the examples on screen
Make a note of any features you think might affect OCR accuracy
Have a guess at what the accuracy might be, in % terms
OCR Results
I am petfood, God toil! uttedy-toverthroW, at feaft; $gy abafe Men's affections tp; and seal for all Party-making Notions amdngft CfiriftiansybefGieirie will raife his,Church to that prof-perous, flourilhing State prophefied of, and prOmifed in the Scrip* tures. There mult be more Love, and Charity, and Unanimity amongft Chriftians,.
Total number of characters = 2109
Total number of words = 379
OCR Engine    % characters correct    % words correct    No. of corrections
FineReader    91.1                    70.9               110
PrimeOCR      93.95                   79.1               79
OCR Results
A THEATRE erein be reprc-fented as wel the miferies & calamities tijat foiioto tht too*e^jr alfo the greate toyts andplefures tobtcf) tbe fatrfc faltooenio^An Argument both profitable anddele&able, to all that finccrclyloue the word of Codt'.*Deuifedby S. hhnv&n~ derlS^oodt.s 3^ Scene and allowed according to the order appointed., ^ Imprinted at London by Henry Bynncman*Anno Domini.CVM PHIT
Total number of characters = 411
Total number of words = 73
OCR Engine    % characters correct    % words correct    No. of corrections
FineReader    73.7                    57.5               31
PrimeOCR      75.9                    62.37              28
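The slides do not define the "No. of corrections" column. One reading, offered here only as an assumption, is that it counts the words that would need correcting. The sketch below checks that reading against the figures reported above, pairing the 379-word totals with the first results table and the 73-word totals with the second; that pairing is also our inference.

```python
def words_needing_correction(total_words, word_accuracy_pct):
    """Estimate how many words are wrong from the word accuracy percentage."""
    return total_words * (100 - word_accuracy_pct) / 100


# (engine, total words, % words correct, reported no. of corrections)
rows = [
    ("FineReader", 379, 70.9, 110),
    ("PrimeOCR", 379, 79.1, 79),
    ("FineReader", 73, 57.5, 31),
    ("PrimeOCR", 73, 62.37, 28),
]

estimates = []
for engine, total, pct, reported in rows:
    estimate = words_needing_correction(total, pct)
    estimates.append(estimate)
    print(f"{engine}: estimated {estimate:.1f} wrong words, reported {reported}")
```

Under these assumptions each estimate lands within one word of the reported figure, which is consistent with the column counting incorrect words, though it does not prove it.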