

International Journal of Pattern Recognition and Artificial Intelligence
© World Scientific Publishing Company

Multimodal Computer-Assisted Transcription of Text Images at Character-level Interaction

Daniel Martín-Albo, Verónica Romero, Alejandro H. Toselli and Enrique Vidal

Instituto Tecnológico de Informática, Universidad Politécnica de Valencia
Camino de Vera s/n, 46071 Valencia, Spain
{dmartinalbo, vromero, ahector, evidal}@iti.upv.es

http://prhlt.iti.upv.es

Currently, automatic handwriting recognition systems are ineffectual on unconstrained handwritten documents, so heavy human intervention is required to validate and correct their results. Because this process is inefficient and uncomfortable, a multimodal interactive approach was proposed, in which the user interacts with the system by means of an e-pen and/or more traditional keyboard and mouse operations. On the one hand, this user feedback allows the system to improve its accuracy; on the other, multimodality increases system ergonomics and user acceptability. In this work, multimodal interaction at character level is studied. Until now, multimodal operation had been studied only at the whole-word level. However, it is argued that character-level pen-stroke interactions may lead to more effective interactivity, since it is faster and easier to write a single character than a whole word. Here we study this kind of fine-grained multimodal interaction and present developments that allow taking advantage of interaction-derived context to significantly improve feedback decoding accuracy. Empirical tests on three cursive handwriting tasks suggest that, despite losing the deterministic accuracy of traditional peripherals, this approach can save significant amounts of user effort with respect to both fully manual transcription and non-interactive post-editing correction.

Keywords: Multimodal interactive pattern recognition; computer assisted transcription; handwritten text recognition; character-level interaction.

1. Introduction

In today's digital age, much of the human documentary corpus is still in handwritten, analog form. However, a growing number of on-line digital libraries have the mission of scanning and publishing these documents in digital format. Most of these scanned documents are still waiting to be transcribed into an electronic format. The advantages of having an electronic transcription are countless; for example, it enables new ways of indexing, consulting and querying these documents.

The small number of transcribed documents is mainly due to the laborious process involved. These transcriptions are carried out by experts in paleography, who are specialized in reading ancient scripts, characterized, among other things, by different calligraphy/print styles from diverse places and time periods. How long it takes an expert to transcribe one of these documents depends on many different


factors: the skills and experience of the expert, the size of the document, etc. To give an idea of the hardness of this work, an expert would typically take more than half an hour to transcribe a single page of one of these documents.

Moreover, paper is still used in everyday life because of its convenience and immediacy; a clerk, for example, uses an average of 10,000 pages per year. However, digital documents have many advantages over their analog counterparts in terms of use, management and preservation.

In both cases we can use computers to make the process faster. Handwritten text recognition (HTR) systems are very useful and accurate in applications involving form-constrained handwriting and/or a fairly limited vocabulary (postal addresses or bank check legal amounts), achieving relatively high recognition accuracy in this kind of task 3,15. However, on unconstrained handwritten documents such as those we refer to here, e.g. old manuscripts or spontaneous text, state-of-the-art HTR systems can by no means substitute for the experts, since the results achieved are far from being directly acceptable in practice. Therefore, heavy human intervention is necessary in order to verify and correct the mistakes made by the system. Given the high error rates involved, such a post-editing solution is quite inefficient and uncomfortable for the human corrector.

In previous works 18,19, a more effective, interactive approach was presented. This approach, called "computer assisted transcription of text images" (CATTI), combines the accuracy provided by the human transcriber with the efficiency of the automatic HTR system to achieve the final perfect transcription. It follows ideas similar to those previously applied to computer assisted translation 1,2 and speech transcription 21, shifting from the idea of full automation towards systems where the decision process is affected by human feedback. Experiments have shown that this kind of system can save significant amounts of overall human effort.

As will be discussed later, human feedback signals in interactive systems rarely belong to the same domain as the main data stream, thereby entailing some sort of multimodality. This is indeed the case in CATTI, where the main data are text images whereas the feedback consists of keystrokes and/or pointer positioning actions. Nevertheless, at the expense of losing the deterministic accuracy of the traditional keyboard and mouse, more ergonomic multimodal interfaces are possible. It is worth noting that the potential increase in user-friendliness comes at the cost of accepting new possible errors coming from the decoding of the feedback signals. Therefore, solving the multimodal interaction problem amounts to achieving a modality synergy where the main and feedback data streams help each other to optimize overall accuracy. These ideas have recently been explored in the context of computer assisted translation, using speech signals as feedback 21.

Among the many possible feedback modalities for CATTI, we use here e-pen operations, perhaps the most natural and fastest way to provide feedback to the transcription system. We will use the shorthand "MM-CATTI" to denote this kind of multimodal CATTI processing. These interactive systems


require constant human intervention; therefore, it is necessary to make this process as comfortable and ergonomic as possible. Until now, CATTI has been studied at both the whole-word and character levels, but MM-CATTI has only been studied at the whole-word level. In this work, we focus on MM-CATTI with character-level pen-stroke interactions.

The paper is organized as follows: Sect. 2 briefly describes the CATTI framework. In Sect. 3, character-level MM-CATTI is presented. A general description of the off- and on-line text processing subsystems is given in Sect. 4. We then present a comprehensive series of experiments to assess the capabilities of the different proposed scenarios in Sect. 5. Finally, Sect. 6 summarizes the work presented and draws directions for future research.

2. CATTI Overview

In the original CATTI framework, the human transcriber (called the user from now on) is directly involved in the transcription process, since he is responsible for validating and/or correcting the HTR output. The process starts when the HTR system proposes a full transcription of a feature vector sequence x, extracted from a handwritten text line image. The user validates an initial part of this transcription, p′, which is error-free, and introduces a correct word, w, thereby producing a correct transcription prefix, p = p′w. Then, the HTR system takes this information into account to suggest a new suitable continuation suffix, s. This process is repeated until a full correct transcription of x is accepted by the user 18.

At each step of this process, both the image representation, x, and a correct transcription prefix, p, are available, and the HTR system should try to complete this prefix by searching for the most likely suffix s:

$$\hat{s} = \arg\max_{s} P(s \mid x, p) = \arg\max_{s} P(x \mid p, s) \cdot P(s \mid p) \qquad (1)$$

Since the concatenation of p and s constitutes a full transcription hypothesis, P(x | p, s) can be approximated by concatenated character Hidden Markov Models (HMMs) 6,11 as in conventional HTR. On the other hand, P(s | p) is usually approximated by dynamically modifying an n-gram in order to cope with the increasingly consolidated prefixes 18. Let $p = p_1^k$ be a consolidated prefix and $s = s_1^l$ a possible suffix:

$$P(s \mid p) \simeq \prod_{j=1}^{n-1} P(s_j \mid p_{k-n+1+j}^{k},\, s_1^{j-1}) \cdot \prod_{j=n}^{l} P(s_j \mid s_{j-n+1}^{j-1}) \qquad (2)$$

We can explicitly rely on Eqs. (1) and (2) to implement a one-step decoding process as in conventional HTR systems. The decoder should be forced to match the previously validated prefix p, and then continue searching for a suffix s according to the constraints of Eq. (2). This can be achieved by building a special language model which can be seen as the "concatenation" of a linear model that strictly accounts for the successive words in p and the "suffix language model" of Eq. (2) 18. A direct adaptation of the Viterbi algorithm to implement these techniques leads to a computational cost that grows quadratically with the number of words of each sentence. Nevertheless, using word-graph techniques similar to those described in 13, very efficient, linear-cost search can be achieved.

Fig. 1. Example of the character-level CATTI transcription process. The user accepts the prefix p = "opposed the governm" by typing the character m on the keyboard. In this validated prefix, p′′ (the fragment of the prefix formed by complete words) is "opposed the" and u (the last incomplete word of the prefix) is "governm". Once the prefix is available, the system proposes a new suffix s, where v (the first part of the suffix, which completes the incomplete last word of the prefix) is "ent" and s′′ (the rest of the suffix) is "Bill which brought".
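To make the suffix language model of Eq. (2) concrete, the following minimal sketch (our illustration, not the authors' implementation) assumes the n-gram is stored as a dictionary from context tuples to next-word distributions; the names ngram_prob and suffix_probability are ours:

```python
from typing import Dict, List, Tuple

NGram = Dict[Tuple[str, ...], Dict[str, float]]

def ngram_prob(model: NGram, context: Tuple[str, ...], word: str, n: int) -> float:
    """P(word | last n-1 tokens of context); unsmoothed for brevity."""
    ctx = context[-(n - 1):] if n > 1 else ()
    return model.get(ctx, {}).get(word, 0.0)

def suffix_probability(model: NGram, prefix: List[str], suffix: List[str], n: int) -> float:
    """P(s | p) as in Eq. (2). For the first n-1 suffix words the n-gram
    context still reaches into the validated prefix p; afterwards it lies
    entirely within the suffix, exactly as the two products express."""
    prob = 1.0
    for j in range(len(suffix)):
        context = tuple(prefix) + tuple(suffix[:j])
        prob *= ngram_prob(model, context, suffix[j], n)
    return prob
```

In practice, the search of Eq. (1) maximizes this quantity jointly with the HMM image likelihood over a word graph, rather than scoring candidate suffixes one by one.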

In order to make the system more ergonomic and friendlier to the user, interaction based on characters instead of full words has been studied in 14 with encouraging results. In this approach, as soon as the user types a new keystroke (character), the system proposes a suitable continuation following the same process described above. Since the user now operates at the character level, the last word of the prefix may be incomplete. In order to autocomplete this last word, the prefix p is assumed to be divided into two parts: the fragment of the prefix formed by complete words (p′′) and the last, incomplete word of the prefix (u). In this case the HTR decoder has to take into account x, p′′ and u in order to search for a transcription suffix s whose first part is the continuation of u:

$$\hat{s} = \arg\max_{s} P(s \mid x, p'', u) = \arg\max_{s} P(x \mid p'', u, s) \cdot P(s \mid p'', u) \qquad (3)$$

The concatenation of p′′, u and s constitutes, again, a full transcription hypothesis, and P(x | p′′, u, s) can be modeled with HMMs. On the other hand, to model P(s | p′′, u) we assume that the suffix s is divided into two fragments: v and s′′, where v is the first part of the suffix, corresponding to the final part of the incomplete word of the prefix, and s′′ is the rest of the suffix (Fig. 1). Thus, the search must be performed over all possible suffixes s of p, and the language model probability P(v, s′′ | p′′, u) must ensure that uv = w, where w is an existing word in the task vocabulary (V)^a. This probability can be decomposed into two terms:

$$P(v, s'' \mid p'', u) = P(s'' \mid p'', u, v) \cdot P(v \mid p'', u) \qquad (4)$$

The first term accounts for the probability of all the whole words in the suffix, and can be modeled directly by Eq. (2). The second term should ensure that the first part of the suffix (usually a word-ending part), v, is a possible suffix of the incomplete word u, and can be stated as:

$$P(v \mid p'', u) = \begin{cases} \dfrac{P(u, v \mid p'')}{\sum_{v':\, uv' \in V} P(u, v' \mid p'')} & \text{if } uv \in V \\[2ex] 0 & \text{otherwise} \end{cases} \qquad (5)$$

^a The use of a "closed vocabulary" is inherited from automatic speech recognition to ease results reproducibility 6.

Fig. 2. Example of the character-level multimodal CATTI transcription process. The user has accepted the prefix p′ = "opposed the govern" and has corrected the first erroneous character by writing a letter m with some pen strokes. In this validated prefix, p′′ (the fragment of the prefix formed by complete words) is "opposed the", u′ (the last incomplete word of the prefix) is "govern" and d (the optimal decoding) is "m". Once the prefix is available, the system generates a new suffix, where v (the first part of the suffix, which completes the incomplete last word of the prefix) is "ent" and s′′ (the rest of the suffix) is "Bill which brought".
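As a hedged illustration of Eq. (5), the sketch below approximates the p′′-conditioned probabilities P(u, v | p′′) by a simple word unigram word_prob; that simplification and the helper name are our assumptions, made only for brevity:

```python
from typing import Dict

def completion_prob(u: str, v: str, word_prob: Dict[str, float]) -> float:
    """P(v | p'', u) of Eq. (5): zero unless uv is a vocabulary word,
    otherwise the mass of uv among all vocabulary completions uv' of u."""
    if u + v not in word_prob:
        return 0.0
    # Normalize over every valid continuation v' with uv' in the vocabulary.
    norm = sum(p for w, p in word_prob.items() if w.startswith(u))
    return word_prob[u + v] / norm if norm > 0.0 else 0.0
```

For example, with word_prob = {"government": 0.7, "governments": 0.3}, the call completion_prob("governm", "ent", word_prob) returns 0.7, and any v for which "governm" + v is not in the vocabulary gets probability 0.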

3. Multimodal CATTI at Character Level

In CATTI applications the user is constantly interacting with the system. Hence, the quality and ergonomics of the interaction process are crucial for the success of the system. Traditional peripherals like the keyboard and mouse are used in CATTI to unambiguously provide the feedback associated with validating and correcting the successive system predictions. Nevertheless, using more ergonomic multimodal interfaces should result in an easier and more comfortable human-computer interaction, at the expense of the feedback being less deterministic to the system. This is the idea underlying MM-CATTI, which focuses on touchscreen communication, perhaps the most natural modality for providing the required feedback in CATTI systems. This entails introducing an on-line HTR decoding subsystem to deal with this non-deterministic interaction.

More formally, let x be the representation of the input image and p′ a user-validated, error-free prefix of the transcription. Let t be the on-line pen strokes provided by the user. These data are related to the suffix suggested by the system in the previous interaction step, s′, and are typically aimed at accepting or correcting parts of this suffix. Using this information, the system has to find a new suffix, s, as a continuation of the previous prefix p′, considering all possible decodings, d, of the on-line data t and some information from the previous suffix s′. That is, the problem is to find s given x, the feedback information t, the prefix p′ and the previous suffix s′, considering all possible decodings of the on-line data:


$$\hat{s} = \arg\max_{s} P(s \mid x, s', p', t) = \arg\max_{s} \sum_{d} P(s, d \mid x, p', s', t)$$
$$\approx \arg\max_{s} \max_{d}\; P(t \mid d) \cdot P(d \mid p', s') \cdot P(x \mid s, p', d) \cdot P(s \mid p', d) \qquad (6)$$

Eq. (6) can be solved with a two-step approach. In the first step, the user produces some (possibly null) on-line e-pen data t, and the system has to decode t into a character d using:

$$\hat{d} = \arg\max_{d} P(t \mid d) \cdot P(d \mid p', s') \qquad (7)$$

Once the user feedback is available, a new consolidated prefix p is produced (based on the validated prefix p′, the decoding d and, possibly, some amendment keystrokes κ). This prefix leads to a formulation identical to Eq. (1):

$$\hat{s} \approx \arg\max_{s} \max_{d}\; P(x \mid s, p', d) \cdot P(s \mid p', d) = \arg\max_{s} P(x \mid p, s) \cdot P(s \mid p) \qquad (8)$$

After that, the system proposes a new suffix s given this new prefix p and the image x. These two steps are repeated until p is accepted by the user as a full, correct transcription of x.
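The two-step procedure can be summarized with a small simulation in the spirit of the paper's experiments, where the reference transcription plays the user's role. Here best_suffix stands for the prefix-constrained search of Eq. (8) and online_decode for the feedback decoder of Eq. (7); both are placeholders of our own, and deletions for over-long hypotheses are ignored:

```python
from typing import Callable

def simulate_mm_catti(reference: str,
                      best_suffix: Callable[[str], str],
                      online_decode: Callable[[str, str], str]) -> int:
    """Count user interactions needed to reach the reference transcription."""
    prefix, interactions = "", 0
    while prefix != reference:
        hyp = prefix + best_suffix(prefix)            # Step 2: Eq. (8)
        k = len(prefix)                               # validate error-free part
        while k < min(len(hyp), len(reference)) and hyp[k] == reference[k]:
            k += 1
        if k == len(reference):
            break                                     # hypothesis accepted ("#")
        prefix = reference[:k]
        truth = reference[k]                          # character the user writes
        d = online_decode(truth, prefix)              # Step 1: Eq. (7)
        interactions += 1                             # one pen-stroke interaction
        if d != truth:
            interactions += 1                         # keyboard amendment kappa
        prefix += truth
    return interactions
```

Dividing the returned interaction count by the number of reference characters yields a KSR-like figure, as used in the experiments of Sect. 5.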

This multimodal interaction method has previously been studied at the whole-word level 18. In this work, multimodal character-level pen-stroke interaction is studied. An example of this process is shown in Fig. 3. We assume that the user always prefers the handwriting interaction modality.

Since Eq. (8) has already been explained, we focus now on Eq. (7). As in Sect. 2, the prefix p′ is divided into two parts: p′′ (the fragment of p′ consisting of whole words) and u′ (the last, incomplete word of p′). Therefore, the first step of the optimization (6) can be rewritten as:

$$\hat{d} = \arg\max_{d} P(t \mid d) \cdot P(d \mid p'', u', s') \qquad (9)$$

where P(t | d) is provided by a morphological (HMM) model of the character d, and P(d | p′′, u′, s′) can be approximated by a language model dynamically constrained by information derived from the validated prefix (both p′′ and u′) and the previous suffix s′ (Fig. 2).

Equation (9) may lead to several scenarios depending on the assumptions and constraints adopted for P(d | p′′, u′, s′). The first and simplest scenario corresponds to a naive approach where all interaction-derived information is ignored; that is, P(d | p′′, u′, s′) ≡ P(d). This scenario will be used as the baseline.

In a slightly more restricted scenario, we take into account just the information from the previous off-line HTR prediction s′. The user provides t in order to correct the first wrong character of s′, e, which follows the validated prefix p′. Clearly, the erroneous character e should be prevented from being decoded again. This error-conditioned model can be written as P(d | p′′, u′, s′) ≡ P(d | e).


Fig. 3. Example of character-level multimodal CATTI interaction to transcribe an image of the text sentence "opposed the Government Bill which brought". The process starts when the HTR system proposes a full transcription of the handwritten text image x. Then, each interaction consists of two steps. In the first step, the user handwrites pen strokes to amend the suffix proposed by the system in the previous step. This defines a correct prefix p′, which can be used by the on-line HTR subsystem to obtain a more accurate decoding of t. After observing this decoding, d, the user may type additional keystrokes, κ, to correct possible errors in d. In the second step, a new prefix is built from the previous correct prefix p′ concatenated with the decoded on-line handwritten text d or the typed text κ. Using this information, the system proposes a new potential suffix. The process ends when the user enters the special character "#". (The figure traces the successive hypotheses, from the initial "opposite this Comment Bill in that thought" to the final accepted transcription p ≡ T = "opposed the Government Bill which brought".)

Another, more restrictive scenario, using the information derived from the validated prefix p′, arises when we also consider the portion of the word already validated (u′); i.e., P(d | p′′, u′, s′) ≡ P(d | u′, e). In this case the decoding can be more accurate, since we know beforehand that the character must be a valid continuation of the part of the word accepted so far.

The last scenario arises when, in addition, the set of complete words p′′ is also taken into account. In this case the possible decodings are constrained to be a suitable continuation of the whole prefix accepted so far. This scenario can be written as P(d | p′′, u′, s′) ≡ P(d | p′′, u′, e).


3.1. Dynamic Language Modeling

Language model restrictions are implemented on the basis of n-grams. As mentioned earlier, the baseline scenario does not take into account any information derived from the interaction. In this case, only a character unigram can be used for P(d), since only one character is to be recognized.

The second scenario, P(d | e), only considers the first wrong character. The language model probability (Eq. (10)) uses a unigram model like the previous one, but in this case it avoids recognizing the wrong character again by modifying the probability of that character:

$$P(d \mid e) = \begin{cases} 0 & \text{if } d = e \\[1ex] \dfrac{P(d)}{1 - P(e)} & \text{if } d \neq e \end{cases} \qquad (10)$$
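This renormalization is straightforward to implement; the dictionary representation below is our own illustration:

```python
from typing import Dict

def error_conditioned_unigram(unigram: Dict[str, float], e: str) -> Dict[str, float]:
    """Eq. (10): forbid the already-rejected character e and renormalize."""
    z = 1.0 - unigram.get(e, 0.0)
    return {d: (0.0 if d == e else p / z) for d, p in unigram.items()}

# Example: with P('a') = 0.6, P('b') = 0.2, P('c') = 0.2 and e = 'b',
# the result is {'a': 0.75, 'b': 0.0, 'c': 0.25}.
```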

In the next scenario, given by P(d | u′, e), the on-line HTR subsystem counts not only on the first wrong character but also on the last incomplete word of the validated prefix, u′. This scenario can be approached in two different ways: using a character language model or a word language model.

The first approach uses a modified character n-gram model conditioned by the chunk of word (u′) and the erroneous character (e):

$$P(d \mid u', e) = \begin{cases} 0 & \text{if } d = e \\[1ex] \dfrac{P(d \mid u'^{\,k}_{k-n+2})}{1 - P(e \mid u'^{\,k}_{k-n+2})} & \text{if } d \neq e \end{cases} \qquad (11)$$

In the second approach, Eq. (12), we use a word language model to generate a more refined character language model. This can be written as:

$$P(d \mid u', e) = \begin{cases} 0 & \text{if } d = e \\[1ex] \dfrac{P(d \mid u')}{1 - P(e \mid u')} & \text{if } d \neq e \end{cases}$$

where:

$$P(d \mid u') = \frac{P(u', d)}{\sum_{d'} P(u', d')} = \frac{\sum_{v:\, u'dv \in V} P(u', d, v)}{\sum_{d'} \sum_{v:\, u'd'v \in V} P(u', d', v)} \qquad (12)$$

u′dv being an existing word of V.

Finally, the last scenario uses all the available information; i.e., the erroneous character e, the last incomplete word of the prefix u′ and the rest of the prefix p′′. This can be written as:

$$P(d \mid p'', u', e) = \begin{cases} 0 & \text{if } d = e \\[1ex] \dfrac{P(d \mid u', p''^{\,k}_{k-n+2})}{1 - P(e \mid u', p''^{\,k}_{k-n+2})} & \text{if } d \neq e \end{cases} \qquad (13)$$

Using a word n-gram model, we can generate a more refined character language model as in the previous case:

$$P(d \mid u', p''^{\,k}_{k-n+2}) = \frac{P(d, u', p''^{\,k}_{k-n+2})}{\sum_{d'} P(d', u', p''^{\,k}_{k-n+2})} = \frac{\sum_{v:\, u'dv \in V} P(u', d, v \mid p''^{\,k}_{k-n+2})}{\sum_{d'} \sum_{v:\, u'd'v \in V} P(u', d', v \mid p''^{\,k}_{k-n+2})} \qquad (14)$$
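The following sketch illustrates the word-model-derived character distributions of Eqs. (12)-(14), with a word unigram word_prob standing in for the (possibly p′′-conditioned) word model; for Eq. (14) one would supply probabilities conditioned on the preceding prefix words instead. The function name and the unigram simplification are our assumptions:

```python
from collections import defaultdict
from typing import Dict

def next_char_dist(u: str, e: str, word_prob: Dict[str, float]) -> Dict[str, float]:
    """P(d | u', e): sum word masses over all vocabulary words u'dv,
    then drop the rejected character e and renormalize, as in Eq. (12)."""
    mass = defaultdict(float)
    for w, p in word_prob.items():
        if w.startswith(u) and len(w) > len(u):
            mass[w[len(u)]] += p          # marginalize over the endings v
    mass.pop(e, None)                     # forbid d = e
    z = sum(mass.values())
    return {d: p / z for d, p in mass.items()} if z > 0.0 else {}
```

Dropping e before renormalizing is equivalent to dividing P(d | u′) by 1 − P(e | u′), which is exactly the error-conditioned form of Eq. (12).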

4. Off- and On-line HTR Baseline Systems

In MM-CATTI, besides the main off-line recognition system, an additional subsystem is needed to cope with the feedback provided by the user in the form of on-line data (pen-stroke interactions, etc.). In this section a general overview of the off- and on-line HTR systems is presented.

Both systems follow the classic pattern recognition architecture with three main blocks: preprocessing, feature extraction and recognition (see Fig. 4). The first two entail different techniques depending on the data type, but the last one is the same for both subsystems.

Off-line HTR preprocessing is aimed at correcting image degradations and geometric distortions: skew and slant correction and size normalization 16. On the other hand, on-line handwriting preprocessing involves only two steps: duplicated point removal and noise reduction 5.

Feature extraction in the off-line case transforms a preprocessed text line image into a sequence of feature vectors representing gray levels and gradients 16. In the on-line case, the feature extraction module transforms the preprocessed trajectories into a new temporal sequence of seven-dimensional real-valued feature vectors representing pen positions and writing speeds 17.

The recognition process is based on HMMs. Characters are modeled by continuous-density left-to-right HMMs, with a Gaussian mixture modeling the emission of feature vectors in each HMM state. In the on-line case, between 8 and 64 Gaussians per state were employed; this value was empirically tuned for each corpus and scenario.
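For illustration, the transition structure of such a left-to-right character model can be sketched as follows; the number of states and the uniform 0.5 initialization are placeholders of ours, not the empirically tuned values used in the paper, and the Gaussian-mixture emissions are omitted:

```python
import numpy as np

def left_to_right_transitions(n_states: int, p_stay: float = 0.5) -> np.ndarray:
    """Transition matrix of a left-to-right character HMM: each state
    either loops on itself or advances to the next state."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = p_stay              # self-loop (absorbs variable duration)
        A[i, i + 1] = 1.0 - p_stay    # advance to the next state
    A[-1, -1] = 1.0                   # exit is handled by model concatenation
    return A
```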

More detailed information about the off- and on-line HTR systems is given in 16 and 17, respectively.

5. Experimental Framework

The experimental framework developed to measure the effectiveness of character-level MM-CATTI is described in the following subsections. We first describe the corpora used in the experiments, and then the performance measures used.

5.1. Off-line corpora

Three corpora have been used in the experiments: ODEC-M3 20, IAMDB 9,8 and CS 12. The first two contain handwritten text in modern Spanish and English, respectively. In addition, IAMDB is publicly available, so it can be used as a reference to compare the obtained results. The third corpus consists of cursive handwritten page images in old Spanish.

Fig. 4. Overview of the character-level MM-CATTI system. (Block diagram: parallel off-line and on-line branches, each with preprocessing, feature extraction and HMM-based recognition, share a language model trained on text corpora; the user's keystroke and pen-stroke interactions feed the on-line branch, whose transcribed feedback constrains the off-line HTR context.)

Sentence-segmented images are used for both ODEC-M3 and IAMDB, while only line-segmented images are available for CS. The IAMDB dataset is provided at different levels: sentences, lines and isolated words. Here, in order to obtain results comparable with our previous works 18,13, the sentence partition has been used 10,22. Each sentence or line image is accompanied by its ground-truth transcription as the corresponding sequence of words. To train the HMMs of the different corpora, we consider all the elements appearing in each handwritten text image of the training set, such as lowercase and uppercase letters, symbols, abbreviations, spacing between words and characters, crossed-out words, etc. However, to better focus on the essential issues of the considered problems, some restrictions were imposed when training the n-gram models: uppercase characters were converted to lowercase, and punctuation signs and diacritics were eliminated.

Table 1. Basic statistics of the ODEC-M3, IAMDB and CS corpora and their standard partitions.

Corpus     Number of    Training   Test     Total     Lexicon
ODEC-M3    Sentences    676        237      913       –
           Words        10,994     3,465    14,459    2,118
           Characters   64,666     21,533   86,199    80
IAMDB      Sentences    2,124      200      2,324     –
           Words        42,832     3,957    46,789    8,938
           Characters   216,774    20,726   237,500   78
CS         Lines        681        491      1,172     –
           Words        6,432      4,479    10,911    3,408
           Characters   36,729     25,487   62,216    78

These transcriptions are used to train the bi-gram language models for ODEC-M3 and CS. IAMDB, on the other hand, consists of hand-copied sentences from the much larger electronic LOB text corpus 7, which contains about one million running words. Therefore, in this case, the whole LOB corpus (after removing the test sentences) was used for bi-gram training. Finally, the lexicon of each task is defined as the set of words found in the training and test transcriptions.

Table 1 summarizes the relevant information of the three off-line HTR corpora. More details on these corpora are given in 10,12,20,22.

For n-gram training, the average ratio of running word instances per vocabulary word is 5.2, 112 and 1.9 for ODEC-M3, IAMDB and CS, respectively. It is important to remark that if this ratio is too small, as in CS, the resulting language model will certainly be undertrained, which clearly increases the difficulty of the recognition task and prevents MM-CATTI from taking much advantage of prefix-derived constraints.

5.2. On-line HTR data

To train the on-line HTR feedback subsystem and test the character-level MM-CATTI approach, the UNIPEN on-line handwriting corpus 4 was chosen, which is also publicly available. It is organized in several categories: lowercase and uppercase letters, digits, symbols, isolated words and full sentences. For our experiments, three categories were used: 1a (digits), 1c (lowercase letters) and 1d (symbols). Some character examples from these categories are shown in Fig. 5.

To train the on-line HTR feedback system, 17 different writers were used. On the other hand, to test the system, we consider the character instances that would have to be handwritten by the user in the MM-CATTI interaction process with the ODEC-M3, IAMDB or CS text images. Three arbitrary writers were chosen, taking care that all the characters required to simulate the MM-CATTI process were available for each writer. The selected writers are identified by their name initials: BS, BH and BR. The distribution of data between the training and test partitions can be seen in Table 2.

Table 2. Statistics of the UNIPEN training and test datasets used in the experiments.

                    Train    Test    Lexicon
Writers             17       3       –
Lowercase letters   12,298   2,771   26
Symbols             2,431    541     21
Digits              1,301    234     10
Total               16,030   3,546   57

Fig. 5. Three samples of each category from the UNIPEN corpus.


5.3. Assessment measures

Different measures have been adopted to assess the performance of character-level transcription. On the one hand, the human effort needed to obtain the correct transcription using a system without any user interactivity can be estimated by the well-known Character Error Rate (CER). It is defined as the minimum number of characters that need to be substituted, deleted or inserted to convert the sentences recognized by the system into the reference transcriptions, divided by the total number of characters in these transcriptions. However, in order to make the post-editing process more accurately comparable to character-level CATTI, we introduce a post-editing autocompletion approach: when the user enters a character to correct some incorrect word, the system automatically completes the word with the most probable word in the task vocabulary. Hence we define the Post-editing Key Stroke Ratio (PKSR) as the number of keystrokes that the user must enter to achieve the reference transcription, divided by the total number of reference characters.
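For reference, a standard Levenshtein-distance implementation of this measure for a single hypothesis-reference pair (ordinary textbook code, not taken from the paper):

```python
def cer(hyp: str, ref: str) -> float:
    """Minimum substitutions/insertions/deletions over the reference length."""
    m, n = len(hyp), len(ref)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                               # delete all of hyp[:i]
    for j in range(n + 1):
        D[0][j] = j                               # insert all of ref[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + cost) # substitution or match
    return D[m][n] / n if n else 0.0
```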

On the other hand, the effort needed by a human transcriber to produce correct transcriptions using character-level CATTI is estimated by the Key Stroke Ratio


(KSR), which can also be computed using the reference transcriptions. The KSR is defined as the number of (character-level) user interactions necessary to achieve the reference transcription of the text image considered, divided by the total number of reference characters.

These definitions make PKSR and KSR comparable in a fair way. In addition, the relative difference between them gives a good estimate of the reduction in human effort achievable by using character-level CATTI with respect to using a conventional HTR system followed by human autocompletion-assisted post-editing. This estimated effort reduction will be denoted as "EFR".
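Although the text leaves the formula implicit, the EFR values reported in Table 5 are consistent with the relative PKSR-KSR difference:

$$\mathrm{EFR} = 100 \cdot \frac{\mathrm{PKSR} - \mathrm{KSR}}{\mathrm{PKSR}}\,; \quad \text{e.g., for ODEC-M3 and CATTI: } 100 \cdot \frac{9.0 - 7.5}{9.0} \approx 16.7.$$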

Finally, we also consider the conventional classification Error Rate (ER) to assess the accuracy of the on-line HTR feedback subsystem under the different constraints entailed by the MM-CATTI interaction process.

5.4. Results

The objective of these experiments is to measure the effectiveness of character-level MM-CATTI. Therefore, one of the concerns here is how much the accuracy of the on-line HTR system can be boosted by taking into account information derived from the context and the interaction process. We also aim at assessing what degree of synergy can actually be expected from combining multimodality and interactivity. As mentioned earlier, introducing multimodal interactivity leads, on the one hand, to a more ergonomic and easier way of working but, on the other hand, to a situation where the system has to deal with non-deterministic feedback signals. Thus, our main goal is to minimize the loss of precision with respect to CATTI.

Table 3 shows the basic statistics of the data used in the experiments. These data correspond to the characters that the user must introduce during the MM-CATTI process for each database (ODEC-M3, IAMDB, CS). For every mistranscribed character, we chose three random samples (one per test writer) of the mistaken character from the UNIPEN corpus.

Table 3. Statistics of the test set for each corpus.

           Number of characters
Corpus     Lowercase letters   Numbers   Symbols   Total
ODEC-M3    1,053               64        0         1,117
IAMDB      1,581               41        5         1,627
CS         1,868               88        95        2,051

First of all, Table 4 shows the writer-averaged feedback decoding error rates for the three corpora and five types of language models, which embody the different restrictions previously discussed. The first one corresponds to a plain unigram


estimation of P(d), which is used here as the baseline. The second is a character error-conditioned unigram estimate of P(d | e) (Eq. (10)). The third corresponds to a prefix-and-error-conditioned character bigram, P(d | u′, e) (Eq. (11)). The fourth is a prefix-and-error-conditioned word unigram (Eq. (12)). The last one is a whole-prefix-and-error-conditioned word bigram, P(d | p′′, u′, e) (Eq. (13)).

Table 4. Writer-averaged MM-CATTI feedback decoding error rates for the different corpora and five language models: character unigram (CU), character error-conditioned unigram (CUe), prefix-and-error-conditioned character bigram (CBe), prefix-and-error-conditioned word unigram (WUe) and whole-prefix-and-error-conditioned word bigram (WBe). The relative accuracy improvements for WUe and WBe are shown in the last two columns. All results are percentages.

                       Error Rate                 Rel. Improv.
Corpus     CU     CUe    CBe    WUe    WBe       WUe     WBe
ODEC-M3    7.1    7.0    6.6    6.3    5.5       11.3    22.5
IAMDB      7.0    6.9    6.4    5.0    4.3       28.6    38.6
CS         10.6   10.1   9.6    4.9    6.2       53.8    41.5

As expected, the more information available, the higher the feedback decoding accuracy. The unexpected behavior of the WBe scenario on the CS corpus may be due, as noted before, to the lack of robustness of the bi-gram estimation. This corpus suffers from a segmentation into relatively short, syntactically meaningless lines, which further hinders the ability of the bi-gram language model to capture relevant contextual information.

When the system fails to classify the e-pen strokes made by the user, two options arise: the user can either make a new e-pen interaction to provide the amendment, or introduce the character using the keyboard. In order to simulate the first action in our experiments, we chose another sample of the same character by the same writer from the UNIPEN corpus, thereby simulating the new interaction made by the user to correct the wrongly recognized character. Figure 6 shows the feedback decoding error for the best scenario of each corpus (WBe for ODEC-M3 and IAMDB, WUe for CS) when the character is reintroduced after the on-line subsystem fails to classify the user's strokes, as a function of the number of pen-stroke attempts the user makes to correct the erroneous character. Whenever a misclassification occurs, the language model is conditioned to avoid recognizing that character again.

As can be seen, after just one additional attempt the improvement is quite significant. Moreover, the cognitive cost of making this kind of correction is fairly small, since the main difficulty lies in finding the wrong character within the phrase and not in the correction process itself.

As a final overview, Table 5 compares the results of CATTI and MM-CATTI at character level.


Fig. 6. Error Rate (ER) and KSR (both in %) as a function of the number of correction attempts, for the WBe scenario of IAMDB (top left), the WBe scenario of ODEC-M3 (top right) and the WUe scenario of CS (bottom).

The second column shows the PKSR achieved with the post-editing autocompletion approach. The third shows the KSR of character-level CATTI. The fourth column shows the KSR for the best feedback decoding approach (WBe in Table 4); this value is calculated under the assumption that if the system fails to recognize a character, the user proceeds to enter it again with the keyboard (thereby combining two corrections).

Finally, the last column shows the overall estimated effort reductions (EFR) of CATTI and MM-CATTI with respect to post-editing with autocompletion.

Table 5. From left to right: PKSR obtained with the post-editing autocompletion approach; KSR achieved with CATTI and with MM-CATTI at character level; EFR of CATTI and of MM-CATTI with respect to the PKSR. All results are percentages.

                     KSR                  Overall EFR
Corpus     PKSR      CATTI   MM-CATTI   CATTI   MM-CATTI
ODEC-M3    9.0       7.5     7.9        16.7    12.2
IAMDB      13.5      9.6     10.0       28.9    25.9
CS         12.9      10.5    11.2       18.6    13.2


According to these results, the expected user effort for the more ergonomic and user-preferred touchscreen-based MM-CATTI at character level is only slightly higher than that of character-level CATTI. Clearly, this is thanks to the improved on-line feedback decoding accuracy achieved by means of interaction-derived constraints.

6. Summary and conclusions

In this work, we have studied character-level interaction in the MM-CATTI system, using pen strokes as a complementary means to introduce the required CATTI correction feedback. This feedback is used as part of a prefix to improve the transcription given by the computer. The system then proposes a new suffix that the user can accept as the final transcription or modify iteratively until a full and correct transcription is produced.

The empirical tests presented in this work support the benefits of using this approach rather than traditional HTR followed by human post-editing. From the results, we observe that the use of the more ergonomic feedback modality comes at the cost of only a reasonably small number of additional interaction steps needed to correct the few feedback decoding errors. The number of these extra steps is kept very small thanks to the system's ability to use interaction-derived constraints to considerably improve the on-line HTR feedback decoding accuracy. Clearly, this would not have been possible if a conventional off-the-shelf on-line HTR decoder were trivially used for the correction step.

Acknowledgements

This work was supported by the Spanish Government (MICINN and "Plan E") under the MITTRAL (TIN2009-14633-C03-01) research project and under the research programme Consolider Ingenio 2010: MIPRCV (CSD2007-00018), by the Spanish MITyC under the erudito.com (TSI-020110-2009-439) project, and by the Generalitat Valenciana under grant Prometeo/2009/014.

References

1. S. Barrachina, O. Bender, F. Casacuberta, J. Civera, E. Cubel, S. Khadivi, A. Lagarda, H. Ney, J. Tomás, and E. Vidal. Statistical approaches to computer-assisted translation. Computational Linguistics, in press, 2008.
2. J. Civera, J. M. Vilar, E. Cubel, A. L. Lagarda, S. Barrachina, F. Casacuberta, E. Vidal, D. Picó, and J. González. A syntactic pattern recognition approach to computer assisted translation. In A. Fred, T. Caelli, A. Campilho, R. P. W. Duin, and D. de Ridder, editors, Advances in Statistical, Structural and Syntactical Pattern Recognition, Lecture Notes in Computer Science. Springer-Verlag, 2004.
3. G. Dimauro, S. Impedovo, R. Modugno, and G. Pirlo. A new database for research on bank-check processing. In 8th Int. Workshop on Frontiers in Handwriting Recognition, pages 524–528, 2002.
4. I. Guyon, L. Schomaker, R. Plamondon, M. Liberman, and S. Janet. UNIPEN project of on-line data exchange and recognizer benchmarks. In Proc. of the 14th International Conference on Pattern Recognition, pages 29–33, Jerusalem (Israel), 1994.
5. B. Q. Huang, Y. B. Zhang, and M. T. Kechadi. Preprocessing techniques for online handwriting recognition. In ISDA '07: Proceedings of the Seventh International Conference on Intelligent Systems Design and Applications, pages 793–800, Washington, DC, USA, 2007. IEEE Computer Society.
6. F. Jelinek. Statistical Methods for Speech Recognition. MIT Press, 1998.
7. S. Johansson, E. Atwell, R. Garside, and G. Leech. The Tagged LOB Corpus: User's Manual. Norwegian Computing Centre for the Humanities, 1996.
8. U.-V. Marti and H. Bunke. The IAM-database: an English sentence database for off-line handwriting recognition. Int. Journal on Document Analysis and Recognition, 5(1):39–46, 2002.
9. U.-V. Marti and H. Bunke. A full English sentence database for off-line handwriting recognition. In Proc. of ICDAR'99, pages 705–708, Bangalore (India), 1999.
10. U.-V. Marti and H. Bunke. Using a statistical language model to improve the performance of an HMM-based cursive handwriting recognition system. Int. Journal of Pattern Recognition and Artificial Intelligence, 15(1):65–90, 2001.
11. L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77(2):257–286, 1989.
12. V. Romero, A. H. Toselli, L. Rodríguez, and E. Vidal. Computer assisted transcription for ancient text images. In ICIAR 2007, volume 4633 of LNCS, pages 1182–1193. Springer-Verlag, Montreal (Canada), 2007.
13. V. Romero, A. H. Toselli, and E. Vidal. Using mouse feedback in computer assisted transcription of handwritten text images. In Proc. 10th ICDAR, 2009.
14. V. Romero, A. H. Toselli, and E. Vidal. Character-level interaction in computer-assisted transcription of text images. In Proc. ICFHR 2010, pages 539–544, Kolkata, India, November 2010.
15. S. N. Srihari and E. J. Keubert. Integration of handwritten address interpretation technology into the United States Postal Service Remote Computer Reader system. In Fourth International Conf. on Document Analysis and Recognition, volume 2, pages 892–896, Ulm, Germany, August 1997.
16. A. H. Toselli et al. Integrated handwriting recognition and interpretation using finite-state models. Int. Journal of Pattern Recognition and Artificial Intelligence, 18(4):519–539, June 2004.
17. A. H. Toselli, M. Pastor, and E. Vidal. On-line handwriting recognition system for Tamil handwritten characters. In Proc. IbPRIA'07, volume 4477 of LNCS, pages 370–377. Springer-Verlag, 2007.
18. A. H. Toselli, V. Romero, M. Pastor, and E. Vidal. Multimodal interactive transcription of text images. Pattern Recognition, 43(5):1814–1825, 2010.
19. A. H. Toselli, V. Romero, L. Rodríguez, and E. Vidal. Computer assisted transcription of handwritten text. In Proc. 9th ICDAR, pages 944–948, Curitiba, Paraná (Brazil), 2007.
20. A. H. Toselli, A. Juan, and E. Vidal. Spontaneous handwriting recognition and classification. In Proceedings of the 17th International Conference on Pattern Recognition, volume 1, pages 433–436, Cambridge, United Kingdom, August 2004.
21. E. Vidal, L. Rodríguez, F. Casacuberta, and I. García-Varea. Interactive pattern recognition. In Proceedings of the 4th Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms, volume 4892 of LNCS, pages 60–71, Brno, Czech Republic, June 2007.
22. M. Zimmermann, J.-C. Chappelier, and H. Bunke. Offline grammar-based recognition of handwritten sentences. IEEE TPAMI, 28(5):818–821, May 2006.