dh101 2013/2014 course 7 - ocr, printed text recognition, handwriting recognition, ornaments...
![Page 1: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/1.jpg)
Digital Humanities 101 - 2013/2014 - Course 7
Digital Humanities Laboratory
Andrea Mazzei and Frederic Kaplan
andrea.mazzei,[email protected]
![Page 2: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/2.jpg)
A Job offer
•Running an OCR transcription of 320 pages
•about 60 hours of work
•25 CHF / hour.
my header
Digital Humanities 101 - 2013/2014 - Course 7 | 2013 2o
![Page 3: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/3.jpg)
Results of the peer grading process
![Page 4: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/4.jpg)
Results of the peer grading process
![Page 5: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/5.jpg)
Results of the peer grading process
![Page 6: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/6.jpg)
Results of the peer grading process
![Page 7: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/7.jpg)
Results of the peer grading process
![Page 8: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/8.jpg)
New projects
![Page 9: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/9.jpg)
Venetian opera staging and machinery
•A project that seeks ways to better understand and visualize opera staging,
based on evidence found in historical sources (treatises, music prints, etc.)
•Rosand, E. 1990. Opera in Seventeenth-Century Venice : The Creation of a Genre.
Berkeley : University of California Press.
•Bjurström, P. 1962. Giacomo Torelli and Baroque Stage Design. Stockholm :
Almqvist and Wiksell.
•Leclerc, H. 1987. Venise et l’avènement de l’opéra public à l’âge baroque. Paris :
A. Colin.
•Larson, O. K. 1980. Giacomo Torelli, Sir Philip Skippon, and Stage Machinery for
the Venetian Opera, Theatre Journal, Vol. 32, No. 4, pp. 448-457.
www.jstor.org/stable/3207407
![Page 10: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/10.jpg)
Venetian storytelling in the Middle Ages
•Marin Sanudo was a historical writer. In contrast to other writers of the
period, he kept a diary noting all the events that happened in Venice. Of
course, it is not the only diary written in Venice. Imagine how this
personal information could be used.
![Page 11: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/11.jpg)
Looking at music printing typefaces
•A project that looks at the different music typefaces used in Venetian
prints. Typical questions are : the size of the typeface, when they were
used, for what repertoire, what printers used them, etc.
•Agee, R. 1998. The Gardano Music Printing Firms, 1569-1611.
Rochester, University of Rochester Press.
•Bernstein, J. 1998. Music Printing in Renaissance Venice. The Scotto
Press (1539-1572). Oxford, Oxford University Press.
•Bernstein, J. 2001. Print Culture and Music in Sixteenth-Century Venice.
Oxford, Oxford University Press.
![Page 12: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/12.jpg)
Music at San Marco
•A project that can look at how the cappella di San Marco evolved over
time : how many musicians, where they played in the Basilica, what they
played, etc.
•Selfridge-Field, E. 1994. Venetian instrumental music from Gabrieli to
Vivaldi. New York : Dover.
•Moretti, L. 2004. Jacopo Sansovino and Adrian Willaert at St Mark’s,
Early Music History, Vol. 23, pp. 153-184.
![Page 13: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/13.jpg)
Venetian music prints in libraries today
•A project that looks at the production of music prints in Venice and
where they are held today in libraries and archives around the world
•The Répertoire International des Sources Musicales, Series A/I on music
prints. http://www.rism.info [will be made available digitally for the
project]
![Page 14: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/14.jpg)
Semester 1 : Content of each course
• (1) 19.09 Introduction to the course / Live Tweeting and Collective note
taking
• (2) 25.09 Introduction to Digital Humanities / Wordpress / First assignment
• (3) 2.10 Introduction to the Venice Time Machine project / Zotero
•9.10 No course
• (4) 16.10 Digitization techniques / Deadline first assignment
• (5) 23.10 Datafication / Presentation of projects
• (6) 30.10 Semantic modelling / RDF / Deadline peer-reviewing of first
assignment
![Page 15: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/15.jpg)
Semester 1 : Content of each course
• (7) 6.11 Pattern recognition / OCR / Semantic disambiguation
• (8) 13.11 Historical Geographical Information Systems, Procedural modelling
/ City Engine / Deadline Project selection
• (9) 20.11 Crowdsourcing / Wikipedia / OpenStreetMap
• (10) 27.11 Cultural heritage interfaces and visualisation / Museographic
experiences
•4.12 Group work on the projects
•11.12 Oral exam / Presentation of projects / Deadline Project blog
•18.12 Oral exam / Presentation of projects
![Page 16: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/16.jpg)
Today’s course
•Printed Text Recognition
•Handwriting Recognition
•Ornament Recognition
•Text Mining and semantic disambiguation : Extracting named entities
(people, places, etc.) in a text using Wikipedia
![Page 17: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/17.jpg)
Part I : Printed Text Recognition
![Page 18: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/18.jpg)
OCR : Optical Character Recognition
A system that recognizes all the printed characters on a document
simply by scanning it.
![Page 19: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/19.jpg)
Mori et al. (1992). Historical review of OCR R&D
•1940 : The first version of OCR
•1950 : The first OCR machines appear
•1960 - 1965 : First generation OCR : NOF, Farrington 360, IBM 1418.
They all used a special font
•1965 - 1975 : Second generation OCR : IBM 1287, NEC, Toshiba. They
could also recognize constrained hand-printed alpha-numerals.
•1975 - 1985 : Third generation OCR : IBM 1975, Poor print quality or
handwritten characters. 275 fonts. Handwriting recognition.
•1986 - Today : OCR to the people
Eikvil, L. (1993). Optical Character Recognition
![Page 20: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/20.jpg)
OCR capabilities
The recognition performance depends on the type and number of fonts
recognized.
•Fixed font : the system can recognize only one font
•Multi font : the system can recognize multiple fonts
•Omni font : the system can recognize most nonstylized fonts without
having to maintain huge databases of specific font information
![Page 21: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/21.jpg)
Omni-font OCR Overview Of Processing
![Page 22: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/22.jpg)
Preprocessing : Text Lines Straightening
Zhang, Z., & Tan, C. L. (2002, June). Straightening warped text lines using polynomial regression. In Image Processing. 2002. Proceedings. 2002 International Conference on (Vol. 3, pp. 977-980). IEEE.
![Page 23: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/23.jpg)
Preprocessing : Character Detection
• Image binarization using local adaptive thresholding
•Character detection using region growing-based methods. PROBLEM !
Eikvil, L. (1993). Optical Character Recognition
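Local adaptive thresholding can be illustrated with a minimal sketch. The function name `adaptive_threshold` and the parameters `radius` and `offset` are assumptions made for this example and do not come from the lecture; the rule used here (compare each pixel with the mean of its neighbourhood) is only one of several possible local rules.

```python
def adaptive_threshold(gray, radius=1, offset=10):
    """Binarize a grayscale image given as a list of rows (0 = black, 255 = white).

    A pixel is marked as ink (1) when it is darker than the mean of its
    (2*radius+1) x (2*radius+1) neighbourhood minus `offset`.
    """
    height, width = len(gray), len(gray[0])
    binary = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Collect the neighbourhood, clipped at the image borders.
            window = [
                gray[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            if gray[y][x] < local_mean - offset:
                binary[y][x] = 1
    return binary


# Two dark pixels on a background whose brightness drifts from left to right.
page = [
    [200, 180, 160, 140, 120],
    [200,  90, 160,  50, 120],
    [200, 180, 160, 140, 120],
]
ink = adaptive_threshold(page)
```

Because the threshold follows the local background, the two dark pixels are detected while the darker right-hand side of the background is still treated as paper.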
![Page 24: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/24.jpg)
Segmentation Problems : Touching and fragmented characters
•Joints will occur if the document is a dark photocopy or if it is scanned
at a low threshold.
•Joints are common if the fonts are serifed.
•The characters may be split if the document stems from a light
photocopy or is scanned at a high threshold
![Page 25: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/25.jpg)
Segmentation Problems : Distinguishing noise from text
Dots and accents may be mistaken for noise, and vice versa.
![Page 26: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/26.jpg)
Segmentation Problems : Mistaking graphics for text
This leads to non-text being sent or text not being sent to recognition
![Page 27: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/27.jpg)
Feature Extraction
From each character several features can be extracted :
•Rasterized pixels
•Geometric moment invariants
•Morphological features
![Page 28: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/28.jpg)
Feature Extraction : Zoning
The image of the character is divided into M x N zones, and the average
gray level of each zone is used as a feature.
Trier, Ø. D., Jain, A. K., & Taxt, T. (1996). Feature extraction methods
for character recognition - a survey. Pattern Recognition, 29(4), 641-662
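Zoning can be sketched in a few lines. The function name `zoning_features` and the toy glyph are assumptions made for illustration; the sketch also assumes the image has at least M rows and N columns, so that no zone is empty.

```python
def zoning_features(char_img, m, n):
    """Split a grayscale character image (list of rows) into m x n zones
    (m rows of zones, n columns of zones) and return the average gray
    level of each zone, listed row by row."""
    height, width = len(char_img), len(char_img[0])
    features = []
    for zi in range(m):
        y0, y1 = zi * height // m, (zi + 1) * height // m
        for zj in range(n):
            x0, x1 = zj * width // n, (zj + 1) * width // n
            pixels = [char_img[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            features.append(sum(pixels) / len(pixels))
    return features


# Toy 4 x 4 glyph: a vertical bar filling the left half (255 = ink in this example).
glyph = [
    [255, 255, 0, 0],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
]
features = zoning_features(glyph, 2, 2)
```

With a 2 x 2 grid the two left zones average 255 and the two right zones average 0, so the feature vector already separates a left-hand bar from, say, a horizontal bar.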
![Page 29: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/29.jpg)
Feature Extraction : Projection Profile
![Page 30: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/30.jpg)
Feature Extraction : Structural Analysis
Strokes, bays, end-points, intersections between lines and loops.
High tolerance to noise and style variations.
![Page 31: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/31.jpg)
Classification
The principal approaches to decision-theoretic recognition are minimum
distance classifiers, statistical classifiers and neural networks.
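As an illustration of the first approach, here is a minimal sketch of a minimum distance classifier. The prototype vectors, the labels and the function name `nearest_class` are invented for the example; Euclidean distance is assumed, although other metrics are possible.

```python
import math


def nearest_class(feature_vector, prototypes):
    """Return the label whose prototype vector is closest (Euclidean distance)."""
    return min(
        prototypes,
        key=lambda label: math.dist(feature_vector, prototypes[label]),
    )


# One hypothetical prototype per class, e.g. mean feature vectors from training data.
prototypes = {
    "I": [255.0, 0.0, 255.0, 0.0],
    "-": [255.0, 255.0, 0.0, 0.0],
}
label = nearest_class([240.0, 10.0, 250.0, 5.0], prototypes)
```

The unknown vector is much closer to the prototype of "I" than to that of "-", so it is assigned to "I".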
![Page 32: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/32.jpg)
Matching
![Page 33: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/33.jpg)
Optimum statistical classifiers.
•Bayesian classifier. Given an unknown symbol described by its feature
vector, the probability that the symbol belongs to the class c is computed
for all classes c = 1...N . The symbol is then assigned the class which
gives the maximum probability.
• ...
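The Bayesian decision rule above can be sketched as follows. The slide does not say how the class-conditional probabilities are modelled; this example assumes independent Gaussian features per class (a "naive" Bayes model), and all priors, means and variances are invented for illustration.

```python
import math


def gaussian_log_likelihood(x, mean, var):
    """Log density of a 1-D Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def posteriors(features, classes):
    """classes maps label -> (prior, means, variances).
    Returns label -> P(class | features) under the naive Gaussian assumption."""
    log_joint = {}
    for label, (prior, means, variances) in classes.items():
        log_joint[label] = math.log(prior) + sum(
            gaussian_log_likelihood(x, m, v)
            for x, m, v in zip(features, means, variances)
        )
    # Normalize in a numerically stable way (log-sum-exp).
    top = max(log_joint.values())
    total = sum(math.exp(v - top) for v in log_joint.values())
    return {label: math.exp(v - top) / total for label, v in log_joint.items()}


# Hypothetical two-feature model for two character classes.
classes = {
    "o": (0.5, [0.9, 0.1], [0.01, 0.01]),
    "l": (0.5, [0.2, 0.8], [0.01, 0.01]),
}
post = posteriors([0.85, 0.15], classes)
best = max(post, key=post.get)  # class with the maximum posterior probability
```

The symbol is assigned to the class stored in `best`; the dictionary `post` keeps the probabilities of all classes, which is useful when a later stage (for example a dictionary check) needs the runner-up candidates.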
![Page 34: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/34.jpg)
Post Processing : Grouping
From symbols to strings, using the proximity of symbols
Eikvil, L. (1993). Optical Character Recognition
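A minimal sketch of proximity-based grouping, assuming the recognized symbols of one text line come with horizontal bounding-box coordinates. The tuple layout, the function name `group_into_words` and the `max_gap` parameter are assumptions made for this example.

```python
def group_into_words(symbols, max_gap):
    """symbols: list of (char, x_left, x_right) for one text line.
    Neighbouring symbols whose horizontal gap is at most max_gap are
    merged into the same string."""
    words = []
    current, last_right = "", None
    for char, left, right in sorted(symbols, key=lambda s: s[1]):
        if last_right is not None and left - last_right > max_gap:
            # The gap is too wide: close the current word and start a new one.
            words.append(current)
            current = ""
        current += char
        last_right = right
    if current:
        words.append(current)
    return words


symbols = [("O", 0, 8), ("C", 10, 18), ("R", 20, 28), ("i", 40, 43), ("s", 45, 52)]
words = group_into_words(symbols, max_gap=5)
```

In practice the gap threshold would depend on the font size, for instance as a fraction of the average character width on the line.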
![Page 35: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/35.jpg)
Post Processing : Error Detection and Correction
•Use of rules defining the syntax of the word. Ex. In English the k never
appears after the h.
•Use of dictionaries. If the word is not in the dictionary, an error has been
detected, and may be corrected by changing the word into the most
similar word.
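The dictionary-based correction can be sketched with Python's standard `difflib` module. Using the `difflib` similarity ratio as the notion of "most similar word" is a choice made for this example, not the method of any particular OCR system; the word list and the function name `correct_word` are likewise invented.

```python
import difflib


def correct_word(word, dictionary):
    """Return the word unchanged if it is in the dictionary; otherwise
    return the most similar dictionary word, or the original word when
    nothing is similar enough."""
    if word in dictionary:
        return word
    matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=0.6)
    return matches[0] if matches else word


dictionary = ["venice", "opera", "music", "printer"]
fixed = correct_word("rnusic", dictionary)  # "m" misread as "rn"
```

The `cutoff` value controls how aggressive the correction is: a low cutoff repairs more errors but also risks replacing rare, correctly recognized words such as proper names.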
![Page 36: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/36.jpg)
Self-learning
Modern OCR systems enlarge their database of characters when new fonts
are encountered. Character recognition relies on this built-in database,
which contains the important features of the characters that are already
known. The database must therefore be able to expand as more and more
new characters are met, so that the recognition ability keeps
increasing.
![Page 37: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/37.jpg)
Handwriting Recognition (HWR)
![Page 38: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/38.jpg)
Offline HWR : Many difficult problems
•Stroke ordering
•Broken lines
•Merged blobs
![Page 39: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/39.jpg)
From Offline to Simulated Online
It is not reliable
•What order were the strokes written in ?
•Doubled-up line segments ?
• Ink blobs ?
•Spurious joins between letters ?
•Missing joins ?
![Page 40: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/40.jpg)
Segmentation : Strokes Extraction
![Page 41: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/41.jpg)
Segmentation : Segments Fitting
Robustly cut letters into segments
Match multiple segments to detect letters
Easier than matching whole letter
Hutchison L. Handwriting Recognition for Genealogical Records
![Page 42: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/42.jpg)
Analytical Approach
It treats a word as a collection of simpler sub-units such as characters
•Segmentation of the word into these units
• Identification of the units
•Word-level interpretation using a predefined lexicon
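The last step, word-level interpretation with a lexicon, can be sketched as follows. The sketch assumes that segmentation and unit identification have already produced, for each unit, a set of candidate letters with probabilities; the scores, the lexicon and the function name `best_lexicon_word` are invented for illustration.

```python
import math


def best_lexicon_word(letter_scores, lexicon):
    """letter_scores: one dict per segmented unit, mapping candidate letters
    to probabilities. Returns the lexicon word with the highest joint score,
    or None if no word has the right number of letters."""
    best_word, best_score = None, -math.inf
    for word in lexicon:
        if len(word) != len(letter_scores):
            continue
        # Letters that were not proposed get a small floor probability.
        score = sum(
            math.log(scores.get(ch, 1e-6))
            for ch, scores in zip(word, letter_scores)
        )
        if score > best_score:
            best_word, best_score = word, score
    return best_word


letter_scores = [
    {"c": 0.6, "e": 0.4},
    {"l": 0.5, "a": 0.3, "i": 0.2},
    {"t": 0.7, "l": 0.3},
]
lexicon = ["cat", "eat", "car", "cot"]
word = best_lexicon_word(letter_scores, lexicon)
```

Taking the best letter at each position would give "clt", which is not a word; constraining the search to the lexicon recovers "cat" even though "a" was only the second-best reading of the middle unit.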
![Page 43: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/43.jpg)
Problems with the Analytical Approach
• segmentation ambiguity : deciding where to segment the word image
•variability of segment shape : determining the identity of each segment
![Page 44: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/44.jpg)
Holistic Matching
Treats the word as a single, indivisible entity and attempts to recognize it
using features of the word as a whole.
![Page 45: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/45.jpg)
Advantages of Holistic Matching
Coarticulation effect, i.e., the changes in the appearance of a character
as a function of the shapes of neighboring characters
![Page 46: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/46.jpg)
Advantages of Holistic Matching
Orthogonality of holistic features : holistic features carry information
about the word that is clearly orthogonal to the knowledge of the characters
in it, and it stands to reason that introducing this knowledge should
improve recognition.
![Page 47: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/47.jpg)
Advantages of Holistic Matching
Evidence from psychological studies : psychological studies of
reading point to the fact that humans do not, in general, read words
letter by letter.
![Page 48: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/48.jpg)
Dynamic Global Search
Assemble word spelling from possible letter readings
![Page 49: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/49.jpg)
Result 1
![Page 50: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/50.jpg)
Result 2
![Page 51: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/51.jpg)
Result 3
![Page 52: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/52.jpg)
ABBYY FineReader : A Case Study
![Page 53: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/53.jpg)
Scanned Document
![Page 54: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/54.jpg)
Image Rotation Adjustment
![Page 55: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/55.jpg)
Image Rotation Adjustment
![Page 56: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/56.jpg)
First Extraction
![Page 57: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/57.jpg)
Synthesizing the Table
![Page 58: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/58.jpg)
Second Extraction
![Page 59: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/59.jpg)
Retrieval of the ornaments from the Hand-Press Period
![Page 60: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/60.jpg)
Problem Statement
For millions of intact books and tens of millions of loose pages, the
provenance of the manuscripts may be in doubt or completely unknown
![Page 61: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/61.jpg)
Manual Solution
Human experts can recover the provenance by examining
linguistic, cultural and/or stylistic clues.
However, such experts are rare and this investigation is a time-consuming
process.
![Page 62: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/62.jpg)
Automatic Solution
By comparing the initial letters in the manuscript to annotated initial
letters whose origin is known, the provenance can be determined.
This process can be automated.
![Page 63: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/63.jpg)
What are the Challenges ?
![Page 64: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/64.jpg)
Ornament Segmentation
Ornament(s) detection and localization with respect to the page reference system.
Baudrier, E., Busson, S., Corsini, S., Delalandre, M., Landré, J., &
Morain-Nicolier, F. (2009, July). Retrieval of the ornaments from the hand-press
period : an overview. In Document Analysis and Recognition, 2009. ICDAR’09. 10th
International Conference on (pp. 496-500). IEEE.
![Page 65: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/65.jpg)
A Compression-Based Distance Measure for Texture
The distance between a window W and an annotated initial letter IL is
defined as:
$$\mathrm{dist}_{CK1}(W, IL) = \frac{\mathrm{mpegSize}(W, IL) + \mathrm{mpegSize}(IL, W)}{\mathrm{mpegSize}(W, W) + \mathrm{mpegSize}(IL, IL)} - 1$$
The first image supplied to mpegSize is assigned as an I frame
and the second becomes a P frame.
Campana, B. J., & Keogh, E. J. (2010). A compression-based
distance measure for texture. Statistical Analysis and Data
Mining, 3(6), 381-398
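The formula can be tried out without a video encoder. The sketch below keeps the CK1 ratio exactly as stated but swaps in zlib as a stand-in for mpegSize, so it illustrates the structure of the measure rather than the MPEG-based method itself. The names `zlib_pair_size` and `ck1_distance` and the toy byte strings are assumptions made for illustration.

```python
import random
import zlib


def zlib_pair_size(first, second):
    """Stand-in for mpegSize: compressed size of `first` followed by `second`.

    This is NOT an MPEG encoder. The second input can reuse patterns from the
    first, which loosely mimics a P frame predicted from an I frame.
    """
    return len(zlib.compress(first + second, 9))


def ck1_distance(w, il, pair_size=zlib_pair_size):
    """CK1 ratio: cross-compression cost over self-compression cost, minus 1."""
    cross = pair_size(w, il) + pair_size(il, w)
    own = pair_size(w, w) + pair_size(il, il)
    return cross / own - 1


# Toy "textures" as raw bytes (hypothetical data): a regular pattern and noise.
texture_a = bytes(range(256)) * 4
rng = random.Random(0)
texture_b = bytes(rng.getrandbits(8) for _ in range(1024))

print(ck1_distance(texture_a, texture_a))  # identical inputs give 0.0
print(ck1_distance(texture_a, texture_b))  # dissimilar inputs give a positive value
```

Passing a different `pair_size` function, for instance one that wraps a real MPEG-1 encoder, would leave `ck1_distance` unchanged.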
![Page 66: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/66.jpg)
Properties of the CK1 Distance Measure
Efficient, robust and parameter-free texture similarity measure.
Rotation, Colour and Illumination Invariant.
![Page 67: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/67.jpg)
Gabor Filters
Images are convolved with each filter.
The mean and standard deviation of each filter response are collected into a feature vector of length 48.
Feature vectors are compared using the Euclidean distance.
Wang, X., Ding, X., & Liu, C. (2005). Gabor filters-based feature extraction for
character recognition. Pattern recognition, 38(3), 369-379
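A minimal pure-Python sketch of this feature extraction follows. The slide only fixes the vector length (48), so the filter bank used here (4 wavelengths × 6 orientations = 24 filters, two statistics each), the kernel size and the sigma rule are assumptions, as are all function names.

```python
import math


def gabor_kernel(size, theta, wavelength, sigma):
    """Real (cosine) Gabor kernel as a size x size list of lists (size odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the sinusoid runs along direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + yr * yr) / (2 * sigma * sigma))
            row.append(envelope * math.cos(2 * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel


def filter_response(image, kernel):
    """'Valid' sliding-window response, flattened to a list.

    The cosine kernel is point-symmetric, so correlation equals convolution.
    """
    k = len(kernel)
    h, w = len(image), len(image[0])
    return [
        sum(kernel[a][b] * image[i + a][j + b] for a in range(k) for b in range(k))
        for i in range(h - k + 1)
        for j in range(w - k + 1)
    ]


def gabor_features(image, wavelengths=(2, 3, 4, 6), n_orientations=6, size=7):
    """Mean and standard deviation of each filter response (24 filters -> 48 values)."""
    features = []
    for wl in wavelengths:
        for o in range(n_orientations):
            theta = o * math.pi / n_orientations
            resp = filter_response(image, gabor_kernel(size, theta, wl, 0.5 * wl))
            mean = sum(resp) / len(resp)
            std = math.sqrt(sum((r - mean) ** 2 for r in resp) / len(resp))
            features += [mean, std]
    return features


def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy 16x16 textures (hypothetical data): horizontal and vertical stripes.
horizontal = [[1.0 if (i // 2) % 2 == 0 else 0.0 for j in range(16)] for i in range(16)]
vertical = [[1.0 if (j // 2) % 2 == 0 else 0.0 for j in range(16)] for i in range(16)]
f_h, f_v = gabor_features(horizontal), gabor_features(vertical)
print(len(f_h), euclidean(f_h, f_v))
```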
![Page 68: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/68.jpg)
Data Sets
![Page 69: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/69.jpg)
Experimental Results
![Page 70: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/70.jpg)
Part II : Text mining and semantic disambiguation
![Page 71: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/71.jpg)
Case study : Extracting named entities (people, places, etc.) from a text using Wikipedia
![Page 72: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/72.jpg)
Using Wikipedia
•A Unique ID : A Wikipedia article is identified by a unique name, which is
the article title itself. The URL of a Wikipedia article can be
created by joining the words of the article title with underscores and appending the result
to the URL root of Wikipedia.
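As a small illustration of this title-to-URL rule, the sketch below joins the words of a title with underscores and appends the result to a root URL. The English Wikipedia root and the function name `article_url` are assumptions; percent-encoding of non-ASCII characters is handled by the standard library.

```python
from urllib.parse import quote

WIKI_ROOT = "https://en.wikipedia.org/wiki/"  # assumed root: English Wikipedia


def article_url(title):
    """Build an article URL by joining the title's words with underscores."""
    slug = "_".join(title.split())
    # Keep parentheses and commas readable; other special characters are encoded.
    return WIKI_ROOT + quote(slug, safe="(),")


print(article_url("Paris (mythology)"))
print(article_url("University of California, Los Angeles"))
```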
![Page 73: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/73.jpg)
Using Wikipedia
•Redirections : Some entities can have multiple names. To address
this issue, Wikipedia has some article titles that do not have a
substantive article and only redirect to a different Wikipedia article
with another title. This mechanism is called redirection. Redirections are
also used for other purposes, such as spelling resolution (e.g. the article title
Oranges is redirected to Orange) and abbreviation resolution (e.g. the
article title UCLA is redirected to University of California, Los Angeles).
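Once redirections have been extracted into a lookup table, resolving a name to its target article is a short loop. The sketch below assumes such a table already exists as a dictionary; the `REDIRECTS` content is taken from the two examples above, and the function name `resolve` and the loop guard are illustrative choices.

```python
# Redirect table built from the two examples on the slide.
REDIRECTS = {
    "Oranges": "Orange",
    "UCLA": "University of California, Los Angeles",
}


def resolve(title, redirects=REDIRECTS):
    """Follow redirections until a title with a substantive article is reached."""
    seen = set()
    # `seen` guards against redirect loops in a malformed table.
    while title in redirects and title not in seen:
        seen.add(title)
        title = redirects[title]
    return title


print(resolve("UCLA"))
print(resolve("Orange"))
```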
![Page 74: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/74.jpg)
Using Wikipedia
•Disambiguation pages : A disambiguation page is created for an ambiguous
entity name and enumerates all the possible articles for that name. For
example, the disambiguation page for Paris enumerates 25 places called
Paris (in America, Canada and Europe), 33 people having Paris as a name
or surname, 10 television series and films whose title contains the word
Paris, etc.
![Page 75: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/75.jpg)
Using Wikipedia
•Outgoing links : In the body text of the Wikipedia article there are
references (links) to other articles. The references are within pairs of
double square brackets.
• Infobox : An infobox is a fixed-format table designed to be added to the
top right-hand corner of articles to consistently present a summary of
some unifying aspect that the articles share and sometimes to improve
navigation to other interrelated articles.
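Because outgoing links sit inside double square brackets, they can be pulled out of the article source with a regular expression. The sketch below is one possible way to do it: it keeps the link target, drops any display label after `|` and any section anchor after `#`, and skips a few non-article namespaces. The function name and the skipped prefixes are assumptions.

```python
import re

# [[Target]], [[Target|label]] or [[Target#Section]]; group 1 is the target title.
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")
SKIPPED_PREFIXES = ("File:", "Image:", "Category:")  # assumed non-article namespaces


def outgoing_links(wikitext):
    """Return the distinct article titles linked from `wikitext`, in order."""
    links = []
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        if target.startswith(SKIPPED_PREFIXES) or target in links:
            continue
        links.append(target)
    return links


sample = (
    "'''Paris''' is the capital of [[France]], on the [[Seine|river Seine]]. "
    "See [[France#History]] and [[Category:Capitals]]."
)
print(outgoing_links(sample))
```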
![Page 76: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/76.jpg)
3 steps
•Data extraction : A (sequence of) word(s) is extracted from a "Le
Temps" article (e.g. Le Paris). The right boundaries are then set on the extracted
data (e.g. "Paris" is retrieved from "Le Paris").
•Disambiguation : Retrieve all the Wikipedia articles whose title contains
the word "Paris" (e.g. Paris (France), Paris (Texas), Paris Hilton, Paris
(mythology), etc.). Find the Wikipedia article that maximizes the
agreement between the content extracted from Wikipedia and the
context of the "Le Temps" article.
•Entity classification : Classify the entity as place, person, company, etc.,
based on the chosen Wikipedia article.
![Page 77: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/77.jpg)
Disambiguation strategy
![Page 78: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/78.jpg)
(1) Data extraction
•The first step is the extraction of possible named entities. This step is
based on the fact that named entities consist of capitalized words.
The rules that we apply for the extraction of possible named mentions in
the text are the following :
•Retrieve all the capitalized words (e.g. England)
•Recursively retrieve terms T0 of the form T1 Particle T2, where Particle is a particle such as a possessive
pronoun, and the terms T1 and T2 are capitalized words or sequences of capitalized words
(e.g. University of Edinburgh, European Society of Athletic Therapy and Training)
•In French, some entities can contain non-capitalized words after some specific words.
Therefore, we retrieve non-capitalized words if they are preceded by a word that is contained
in a predefined set of words (e.g. Union, Bibliothèque, etc.). For example, Union
soviétique is considered an entity.
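The three rules above can be sketched as a single pass over the tokens of a text. This is an illustrative reading, not the authors' implementation: the particle list is an assumption inferred from the examples ("of", "and") plus a few French equivalents, the trigger set contains only the two words named above, and sentence-initial capitalized words are retrieved like any other, since boundary correction is a separate step.

```python
import re

PARTICLES = {"of", "and", "de", "du", "des", "et"}  # assumed particle list
TRIGGER_WORDS = {"Union", "Bibliothèque"}  # examples given on the slide


def extract_mentions(text):
    """Return candidate named mentions found in `text`, in reading order."""
    tokens = re.findall(r"\w+|[^\w\s]", text)  # words and punctuation marks
    mentions, current = [], []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok[0].isupper():
            # Rule 1: capitalized words start or extend a mention.
            current.append(tok)
        elif current and tok in PARTICLES and nxt[:1].isupper():
            # Rule 2: T1 Particle T2, applied repeatedly as the loop advances.
            current.append(tok)
        elif current and current[-1] in TRIGGER_WORDS and tok.isalpha():
            # Rule 3: a lower-case word directly after a trigger word.
            current.append(tok)
        else:
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions


print(extract_mentions("the University of Edinburgh is in Scotland."))
print(extract_mentions("un accord avec l'Union soviétique a été signé"))
```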
![Page 79: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/79.jpg)
(2) Disambiguation
•The disambiguation process employs a vector space model, in which a
vectorial representation of the processed article is compared with the
vectorial representations of the Wikipedia entities.
•The vectorial representation of the processed article (article vector) is a
vector containing all the candidate entities obtained for that article
during the previous step, while the vectorial representation of a Wikipedia
article (Wikipedia vector) is a vector containing all the outgoing links in the
body text of the article.
•Once a Wikipedia article is identified as the most similar to the processed
article, the article vector is updated by adopting the features of the
chosen Wikipedia vector.
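A toy version of this comparison is sketched below, with both vectors represented as sets of entity names. The slide does not name the similarity measure, so cosine similarity over binary vectors is an assumption, as is modelling the update step as a set union. The candidate data is made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sets viewed as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def disambiguate(article_vector, candidates):
    """Pick the candidate whose outgoing links best match the article vector.

    `candidates` maps a Wikipedia title to the set of its outgoing links.
    Returns the chosen title and the article vector enriched with its links.
    """
    best = max(candidates, key=lambda title: cosine(article_vector, candidates[title]))
    return best, article_vector | candidates[best]


# Hypothetical data: entities found in a newspaper article mentioning "Paris".
article_vector = {"Paris", "France", "Louvre"}
candidates = {
    "Paris": {"France", "Seine", "Louvre"},
    "Paris, Texas": {"Texas", "United States"},
    "Paris Hilton": {"Hilton Hotels", "New York City"},
}
best_title, updated_vector = disambiguate(article_vector, candidates)
print(best_title, sorted(updated_vector))
```

Enriching the article vector after each decision lets earlier choices inform later ones: once "Seine" is in the vector, a later ambiguous mention is more likely to resolve to a French entity.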
![Page 80: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/80.jpg)
(3) Entity classification
•The last step is to classify the entities into persons, places, companies,
etc.
•Ex : Is the entity a place ? If the Wikipedia article contains an infobox,
then we retrieve it and search it for specific tags that can classify
the entity as a place.
• If the Wikipedia article does not have an infobox, then we use the first
sentence of the body text.
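The place test can be sketched as follows, with the infobox check first and the first-sentence fallback second. The slide does not list the tags or keywords actually used, so `PLACE_TAGS` and `PLACE_CUES` are hypothetical, as are the function name and the sample wikitext.

```python
import re

PLACE_TAGS = {"coordinates", "population_total", "area_total_km2"}  # hypothetical tags
PLACE_CUES = {"city", "town", "village", "capital", "river", "country"}  # hypothetical cues


def is_place(wikitext, first_sentence):
    """Decide whether an article describes a place."""
    if re.search(r"\{\{\s*Infobox", wikitext, re.IGNORECASE):
        # Infobox present: collect field names written as "| name = value" lines.
        fields = re.findall(r"^\s*\|\s*([\w ]+?)\s*=", wikitext, re.MULTILINE)
        return any(field.lower() in PLACE_TAGS for field in fields)
    # No infobox: fall back to cue words in the first sentence of the body text.
    words = set(re.findall(r"\w+", first_sentence.lower()))
    return bool(words & PLACE_CUES)


city_source = (
    "{{Infobox settlement\n"
    "| name = Paris\n"
    "| coordinates = {{coord|48.85|2.35}}\n"
    "| population_total = 2100000\n"
    "}}\n"
)
person_source = "{{Infobox person\n| name = Paris Hilton\n| birth_date = 1981\n}}\n"
print(is_place(city_source, ""))
print(is_place(person_source, ""))
print(is_place("", "Paris is a city in Lamar County, Texas."))
```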
![Page 81: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/81.jpg)
Partial results
•We have implemented the algorithm and tested it on a subset of the
database.
•Our current estimate of the proportion of entities retrieved is 85 %.
•Main issue : Some entities are not in Wikipedia.
![Page 82: DH101 2013/2014 course 7 - OCR, Printed text recognition, Handwriting recognition, Ornaments classification, Named entities disambiguation](https://reader033.vdocument.in/reader033/viewer/2022060107/5549f968b4c905e56c8b489e/html5/thumbnails/82.jpg)
From Wikipedia to Wikipast
•The first principle of Wikipedia is that it is an encyclopedia. Not all
entities are allowed. Sourcing is important but secondary.
•Ongoing discussion with Wikimedia to create an alternative to
Wikipedia, allowing a page on any person, place, etc. from the past as long
as it is clearly sourced.