

Toward Integrated Scene Text Reading

Jerod J. Weinman, Member, IEEE, Zachary Butler, Dugan Knoll, and Jacqueline Feild

Abstract—The growth in digital camera usage combined with a worldly abundance of text has translated to a rich new era for a classic problem of pattern recognition: reading. While traditional document processing often faces challenges such as unusual fonts, noise, and unconstrained lexicons, scene text reading amplifies these challenges and introduces new ones such as motion blur, curved layouts, perspective projection, and occlusion, among others. Reading scene text is a complex problem involving many details that must be handled effectively for robust, accurate results. In this work, we describe and evaluate a reading system that combines several pieces, using probabilistic methods for coarsely binarizing a given text region, identifying baselines, and jointly performing word and character segmentation during the recognition process. By using scene context to recognize several words together in a line of text, our system gives state-of-the-art performance on three difficult benchmark data sets.

Index Terms—Scene text recognition, cropped word recognition, character recognition, discriminative semi-Markov model, image binarization, skew detection, baseline estimation, text guidelines, word normalization, word segmentation

1 INTRODUCTION

TEXT is everywhere. Anthropologist Sir John Goody characterizes it as "the most important technological development in the history of humanity" [1]. While humans have been writing for nearly six millennia, machines have made great strides in reading over the last century. In traditional optical character recognition (OCR), a printed document is scanned to an image and translated into some machine-readable text format. Although researchers have made significant progress, machines have yet to match human reading performance.

Now the widespread availability of consumer cameras and the worldly abundance of text have created a new era for machine reading. Scene text recognition (STR) involves finding and reading ambient text in the environment captured by a camera. While traditional document processing often faces challenges such as imaging defects, novel or rare fonts, noise, and unconstrained lexicons, STR amplifies these challenges and introduces new ones such as motion blur, curved layouts, perspective projection, and occlusion, among others.

Making a robust, accurate scene text reader is a complex problem involving many details that must be handled effectively (see Fig. 1). In this work, we describe and evaluate a reading system that integrates many of these pieces: a simple region-grouping algorithm (Section 3) and probabilistic models for coarsely binarizing a given text region (Section 4), identifying baselines (Section 5), and jointly performing word and character segmentation during the recognition process (Section 6).

• Jerod Weinman is with the Department of Computer Science at Grinnell College. Email: [email protected].

• Jacqueline Feild is with the Department of Computer Science at the University of Massachusetts Amherst. Email: [email protected].

• Messrs. Weinman, Butler, and Knoll gratefully acknowledge financial support from Grinnell College. Ms. Feild was supported by an NSF Graduate Research Fellowship.

Fig. 1. Detected text regions are collected into lines, coarsely binarized, and fit with quadratic text guides. The lines are normalized and passed on for recognition.

By recognizing several words together in a line of text, our system gives state-of-the-art performance on three difficult benchmark data sets (Section 7).

Our work makes several contributions to scene text reading. Unlike documents, scene text lines may have just a few words. Still, utilizing collinear text words facilitates improved appearance normalization, an important contribution that significantly improves accuracy. Another contribution is the use of the discriminative semi-Markov model, which integrates learning from several information sources such as character appearance, geometry, and language. Finally, we explicitly incorporate word segmentation for STR, a task made challenging by highly irregular and unconstrained character spacing. Because nearly all modules of the system are probabilistic, we pass forward multiple hypotheses to subsequent modules, delaying final decisions until top-down information has been incorporated.

2 RELATED WORK

Many authors have studied the primary task of finding text in images [2], [3], [4], [5], [6]. The 2003 ICDAR Robust Reading competition did much to spur interest in this area [7]. Although other STR work appeared [8], [2], [3], [9], [10] and a follow-up 2005 contest was organized [11], six years passed before anyone would benchmark the open-vocabulary word recognition task of this difficult data set [12]. Others have since followed [13], [14], [15], [16], [17]. In 2011, the data set was revised to reduce (though not eliminate) annotation errors, give tighter bounding boxes, and increase certain forms of variability [18]; the word recognition contest received just three entries, the best performing with a 59% word error rate, which this work reduces to 42%.

Wang and Belongie introduced a new task for their Street View Text (SVT) data set [13], [19]. Words in each SVT image are to be recognized from a small lexicon of about fifty words. Though the images are more intrinsically challenging due to resolution, mosaicing errors, and perspective, the word spotting task is much simpler than the open-vocabulary ICDAR benchmark.

Our prior work assumed character and word boundaries for recognition [20]. To find word boundaries, Neumann and Matas [14] use heuristics of gaps on binarized character regions, while Shivakumara et al. [21] use weak image gradients. Before distances between characters can be measured, characters must be binarized correctly, a significant challenge when noise and low resolution cause both broken and touching characters. Even if characters could be binarized and isolated, word boundaries are not easily predicted because scene text is often less constrained by character kerning and tracking conventions; for example, the intra-word gaps in one sign may be larger than the inter-word space in another.

To resolve these ambiguities, we integrate word and character segmentation with character recognition [22], [23], giving bottom-up and top-down information flows influence so that low-level segmentation commitments are not made too early and high-level recognition processes need not examine unsupported hypotheses. Several others have since applied similar weighted finite state transducer (FST) variants to the character segmentation and recognition problem. Saidane et al. [12] use a convolutional neural network to predict character boundaries, building a recognition graph with segmentation and letter scores from a convolutional neural network as edge weights. With no language model included, only one letter hypothesis per edge is needed to find the maximal score. Elagouni et al. [16] follow by incorporating a trigram language model. Yamazoe et al. [15] similarly use a weighted FST, with lexicon-constrained paths. Such lexicon constraints are suited to the word spotting task, where Wang and Belongie [13] use a pictorial structures model, a type of weighted FST with quadratic edge scores. Mishra et al. [17] take an intermediate approach by using positional bigram statistics, an approach whose power devolves as the lexicon grows and is only applicable when word segmentations are assumed.

Many of these informal approaches have their roots in the Markov models of speech and handwriting recognition, where segmentation must be integrated. In this work, we use the discriminative semi-Markov model [24], a globally normalized, non-generative cousin of the variable duration hidden Markov model proposed by Ferguson for speech recognition [25]. Because our probabilistic model has a global normalization factor, all the features, from appearance to language to geometry, and their relative importance can be learned in an integrated fashion, as in the non-probabilistic setting of LeCun et al. [26].

Other machine perception problems, especially handwriting recognition and speech understanding, are closely related to STR and share many techniques. Integrating lexical recognition or word segmentation requires tracking many hypotheses, which excessively consumes memory. Speech recognition systems commonly reduce this large search space heuristically with Viterbi beam search [27]. In handwriting recognition, Liu et al. [28] combine the three most common beam pruning methods: absolute threshold, relative cost difference, and maximum number of active hypotheses.

Many approaches require characters to be isolated in the image before recognition. Some detect character candidates with sliding windows [13], [17], while others classify connected components [14]. We take a more holistic view by fitting guidelines to coarsely binarized groups of collinear text regions, which may then be normalized for an integrated segmentation and recognition process. We first describe these three important preprocessing steps (Sections 3-5) before detailing our probabilistic recognition model (Section 6).

3 REGION GROUPING

Both texture and connected component-based text detectors can break up a single line of text into multiple detections. While separated regions often occur across word boundaries, weak bottom-up signals due to shadows or blurring within words can also cause separation and undetected characters. Grouping these independent detections together can help later stages in at least two ways. First, fitting guide curves to the text becomes more robust with increased data. Second, the problem of undetected characters can be resolved by postponing the character/no-character/space decision to a later stage, when more context is available.

Several authors propose techniques for grouping connected components into text lines for recognition. Models that incorporate local and pairwise component features in an energy-minimization framework are common for handwriting [29] as well as scene text [30], [31]. Others use simpler heuristics and incremental pairwise feature comparisons [32], [14].

Our simple but effective approach is motivated by our benchmarks' input data format: axis-aligned word bounding boxes. While most any of the sophisticated methods above might be used, we simply iteratively link the closest pair of available boxes, subject to a few simple constraints:

1) The horizontal overlap may not exceed 15% of thesmaller box height.


2) The smaller box height may be no less than 45% of the larger box height.

3) The average distance between corresponding points may not exceed 120% of the average box height.

The first constraint prevents unrelated but coincident boxes from being linked. The second constraint allows a box with both ascenders and descenders to be linked with a neighbor that may have neither, but otherwise prevents linking collinear boxes of different font sizes. While some minor false positives do accrue, this constraint helps to filter superscript marks such as ® and ™. The final constraint allows some flexibility for grouping curved or rotated text and word boxes separated by a reasonable amount of space, but it otherwise prevents linking boxes too distant to belong to the same line. The parameters and rules were identified and tuned with the ICDAR scenes training data.

Subject to these constraints, we iteratively link word boxes in ascending order of the average distance between their adjacent, corresponding points (in a left-to-right orthography). Thus, each box is followed by its closest unlinked successor.
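To make the procedure concrete, the following is a minimal sketch of the greedy linking loop under the three constraints above; the box representation, helper names, and distance measure are our own illustrative choices rather than the authors' code.

```python
def group_word_boxes(boxes):
    """Greedily link word bounding boxes into text lines (a sketch of Section 3).

    Each box is (x0, y0, x1, y1); thresholds follow the text (15%, 45%, 120%)."""
    def height(b):
        return b[3] - b[1]

    def link_distance(left, right):
        # Average distance between the right-edge corners of `left`
        # and the corresponding left-edge corners of `right`.
        d_top = ((left[2] - right[0]) ** 2 + (left[1] - right[1]) ** 2) ** 0.5
        d_bot = ((left[2] - right[0]) ** 2 + (left[3] - right[3]) ** 2) ** 0.5
        return (d_top + d_bot) / 2.0

    def compatible(left, right):
        h_small, h_large = sorted((height(left), height(right)))
        overlap = left[2] - right[0]                            # horizontal overlap
        return (overlap <= 0.15 * h_small                       # constraint 1
                and h_small >= 0.45 * h_large                   # constraint 2
                and link_distance(left, right)                  # constraint 3
                    <= 1.20 * (height(left) + height(right)) / 2.0)

    # Candidate left-to-right pairs, considered closest first.
    pairs = sorted(((link_distance(a, b), i, j)
                    for i, a in enumerate(boxes)
                    for j, b in enumerate(boxes)
                    if i != j and a[0] <= b[0]),
                   key=lambda p: p[0])

    successor, has_predecessor = {}, set()
    for _, i, j in pairs:
        if i in successor or j in has_predecessor:
            continue                                            # one link per side of each box
        if compatible(boxes[i], boxes[j]):
            successor[i] = j
            has_predecessor.add(j)
    return successor                                            # follow links to read off each line
```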

Fig. 2. Word box line grouping examples.

TABLE 1
Grouping in ICDAR 2011 test scenes (1189 words): collinear word rectangle pairs.

                       Predicted Positive   Predicted Negative
Correct Positive              429                   31
Correct Negative               11                 7535

As the examples above show, overlapping regions, rotated text, and long lines can be handled reasonably well. Despite the algorithmic simplicity, quantitative evaluation demonstrates relatively few false positives (0.145%), which are more problematic than the false negatives (6.74%). To hedge against false positives, we apply the subsequent steps (image normalization and recognition) both with and without grouping, keeping the best-scoring final interpretation for each word.

4 SEGMENTATION AND BINARIZATION

The 3-D orientation of scene text implies some form of normalization is critical for recognition. Normalization typically requires inferring character geometry from binarized text. Unfortunately, environmental factors such as lighting, low resolution, and noise make accurate bottom-up text binarization difficult. However, a crude binarization often suffices for a geometric analysis, even if it does not segment letters perfectly or has other artifacts ill-suited for recognition.

Given a candidate image region from a text detector, we perform a variety of segmentations and choose the most text-like binarization. After discussing related work, we first describe how to generate candidate region segmentations and then how we select a binarization from among them.

4.1 Related work

Many approaches to scene text character binarization begin with a clustering segmentation, followed by a classification step to identify text segments. While Kita and Wakahara [33] score all binarizations of a k-means model (i.e., 2^5 = 32 for k = 5) to find the best for a single character, Zeng et al. [34] use a simple heuristic for choosing which of the k = 3 modes for an entire word corresponds to text. Cho et al. [35] create a watershed segmentation of a word image, where all segments are jointly classified as part of a character by a CRF. Neumann and Matas independently classify extremal region binarizations as characters with an SVM [6]. Rather than using top-down features of characters to classify segments, Mishra et al. [36] use a bottom-up seeding approach to guess at text/non-text regions before estimating an MRF with latent Gaussian mixture models.

Rather than classify individual connected components or k-means mode combinations as characters, we take a global bottom-up approach to image segmentation followed by a more top-down text binarization.

4.2 Segmentation with regression mixtures

A common model for unsupervised color-based image segmentation is the simple Gaussian mixture,

$$p\left(\mathbf{c} \mid \boldsymbol{\mu}, \sigma, \boldsymbol{\pi}\right) = \sum_{k=1}^{K} \pi_k\, p\left(\mathbf{c} \mid \boldsymbol{\mu}, \sigma, z = k\right), \qquad (1)$$

where c ∈ R^3 is a color column vector at some pixel location, which is modeled by a set of K Gaussian densities p(c | μ, σ, z = k) = G(c − μ_k; σ^2) with mixture parameters π_k = p(z = k) and scalar variances σ^2. The posterior probability γ_k = p(z = k | c, μ, σ, π) induces a segmentation from the most likely component for each pixel, z = argmax_k γ_k.

This simple model can fail in scene images where lighting changes across the image region or designers employ color gradients; an accurate segmentation may require many components, large variances in the Gaussian components, or both. A more compact and reliable approach is to model change directly. We follow the approach of Tu and Zhu [37], who use a cubic Bezier surface to model smooth color changes over a region. Rather than a mixture of simple Gaussian distributions (1), we instead have a mixture of regressors [38], with the mean of each component dependent on the pixel location,

$$p\left(\mathbf{c} \mid \mathbf{x}, \mathbf{B}, \sigma, z = k\right) = \mathcal{G}\left(\mathbf{c} - \mathbf{B}_k^{\top} \phi\left(\mathbf{x}\right); \sigma^2\right), \qquad (2)$$

where B_k is a 3 × d parameter matrix representing a regression over a higher-dimensional representation φ : R^2 → R^d of the pixel coordinates x = (x, y)^⊤. While Tu and Zhu [37] use a cubic expansion, making B_k a 3 × 16 parameter matrix, we find a quadratic expansion kernel φ(x) = (x^2, y^2, xy, x, y, 1)^⊤ superior to a linear kernel φ(x) = (x, y, 1)^⊤, and both an improvement over a simple mixture model. Because our Gaussian model is diagonal, we use the YCbCr color space, whose components are designed to be more independent than those of RGB.

An expectation-maximization (EM) framework finds a local maximum likelihood solution of B and π with fixed σ = 3/255. Given B and π, the traditional E-step computes updated values of the posteriors γ. Given those posteriors in the M-step, the mixing parameters π are updated in the usual way, while each B_k may be found as a weighted least-squares solution to C = B_k^⊤ Φ(X), where C is a stacked version of the colors c for all pixels, and Φ(X) is the stacked version of corresponding "lifted" pixel locations. The posterior probability γ_k becomes the weight for that pixel location's contribution to the least-squares problem.

To insure against bad local maxima of EM, we learn the constant offset parameters with a standard mixture model (1) before fitting the full regression model parameters. As the figure below demonstrates, a simple Gaussian mixture model is not well-suited to lighting changes, while coupling the color values through a latent linear model dependent on pixel location is often sufficient to separate the image components.
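As an illustration of this EM procedure, the following minimal sketch (assuming NumPy, with initialization simplified and all variable names our own) alternates the posterior computation with the weighted least-squares update for each B_k; it is an outline of the description above, not the authors' implementation.

```python
import numpy as np

def fit_regression_mixture(C, X, K=3, iters=50, sigma=3.0 / 255.0):
    """EM for a mixture of regressors: colors C (N x 3), pixel coordinates X (N x 2)."""
    # Quadratic expansion phi(x) = (x^2, y^2, xy, x, y, 1).
    x, y = X[:, 0], X[:, 1]
    Phi = np.stack([x**2, y**2, x * y, x, y, np.ones_like(x)], axis=1)  # N x d
    N, d = Phi.shape

    pi = np.full(K, 1.0 / K)
    B = np.random.randn(K, d, 3) * 0.01          # one d x 3 regressor per component
    B[:, -1, :] = C[np.random.choice(N, K)]      # constant offsets from random pixel colors

    for _ in range(iters):
        # E-step: posterior responsibility of each component for each pixel.
        resid = C[None, :, :] - Phi @ B           # K x N x 3 residuals
        logp = -0.5 * np.sum(resid**2, axis=2) / sigma**2 + np.log(pi)[:, None]
        logp -= logp.max(axis=0, keepdims=True)
        gamma = np.exp(logp)
        gamma /= gamma.sum(axis=0, keepdims=True)  # K x N

        # M-step: mixing weights and a weighted least-squares fit for each B_k.
        pi = gamma.mean(axis=1)
        for k in range(K):
            w = gamma[k]                           # per-pixel weights
            A = Phi.T @ (w[:, None] * Phi)         # d x d
            b = Phi.T @ (w[:, None] * C)           # d x 3
            B[k] = np.linalg.solve(A + 1e-8 * np.eye(d), b)

    labels = np.argmax(gamma, axis=0)              # segmentation: most likely component
    return B, pi, labels
```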

Fig. 3. Component posterior probabilities with K = 3 using a simple Gaussian mixture model (top) and a mixture of linear regression models (bottom).

4.3 Binarization

Inspired by Kita and Wakahara [33], we choose among several candidate segmentations given by the regression mixtures to yield a final binarization. We fit one mixture of K = 3 components and another with K = 4. In many cases, the components in the K = 3 model correspond to text, background, and mixed pixels. The primary question is how to identify which component(s) correspond to the text. We pose this as a probabilistic classification problem, where a logistic regression scores several features of all binary images induced by the mixture model. The candidate binarization with the highest probability of being text is forwarded to the subsequent stage (guideline fitting). Next we describe how candidate binary images are generated, the features used to make the classification, and how the classifier is trained.

4.3.1 Connected component and binary image features

For each mixture component, we create a binary image where a pixel is "on" if the given component is the most probable under the model (2). We also consider the union of pairs of binary images, which allows us to handle strong shadows, dual-color characters, etc. Because the background is sometimes segmented more uniformly than the text, we include the binary complement of each candidate image for consideration. In total, there are six unique images for K = 3, plus twenty more candidate binarizations for K = 4.

We use two classes of features to represent each binary image: statistics of connected component features and global statistics of the binary image. We measure fourteen features for each connected component: normalized area [34], hole/area ratio, compactness, solidity (three features used by Neumann and Matas [6]), eccentricity, normalized major and minor axis lengths, aspect ratio, and Hu's seven invariant moments [39]. Because the number of components varies, we represent the distribution of each component feature with a fixed set of eleven feature statistics: mean, variance, skew, kurtosis, quantiles at the 2.5, 25, 50, 75, and 97.5 percent levels, and sums of the absolute deviation from the mean and median.

We compute five global features of the binary image: normalized total area, fraction of "on" border pixels, Euler number, and normalized horizontal and vertical range. Finally, we compute the same eleven statistics described above on five functions of the pixels: stroke width (distance from every skeleton point to the nearest "off" pixel), normalized row and column coordinates, and normalized marginals (e.g., column and row sums divided by height and width). All features together form a 225-dimensional representation of a binary image.
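For concreteness, one way to realize the eleven fixed-length statistics of a variable-length feature distribution is sketched below (using NumPy/SciPy; the function name is ours).

```python
import numpy as np
from scipy import stats

def feature_statistics(values):
    """Summarize per-component feature values with the eleven statistics of Sec. 4.3.1."""
    v = np.asarray(values, dtype=float)
    quantiles = np.percentile(v, [2.5, 25, 50, 75, 97.5])
    return np.concatenate([
        [v.mean(), v.var(), stats.skew(v), stats.kurtosis(v)],
        quantiles,
        [np.abs(v - v.mean()).sum(), np.abs(v - np.median(v)).sum()],
    ])  # 4 + 5 + 2 = 11 statistics
```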

4.3.2 Text image classifier

For training data, we labeled several word image segmentations from the K = 3 regression mixture model (2). Additional positive instances came from automatically binarized images of words from the ICDAR 2003 scenes training data provided by Mishra et al. [36].

We use a logistic regression classifier to score each of the 26 candidate images given by the K = 3 and K = 4 regression mixture models, identifying the image with the highest probability of being text. If none of these images yields a positive classification, we run another mixture with K = 5. Considering all components and pairs of components along with the complements yields 30 candidate binarizations for the K = 5 model. The binarization with the highest classification score (now among all 56 candidates) is finally chosen as the binarized text image. Trying the more restricted models first prevents a spurious but high-scoring K = 5 over-segmentation from being chosen when a simpler model suffices. On the SVT test data, the percentages of binarizations drawn from the K = 3, 4, and 5 models are 74%, 20%, and 6%, respectively. Of those using the K = 5 model, 70% are still classified as non-text.

Instead of standard logistic regression, which induces a linear decision boundary in the feature space, we "raise" the features into a quadratic decision space by including all product pairs of the feature vector elements for a total of 25,425 features. To prevent overfitting in this high-dimensional space, we use cross-validation to choose the weight of a sparsity-inducing Laplace prior (ℓ1 regularization) on the weights [40], which pruned 30% of the features.
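A minimal sketch of the quadratic feature raising, with scikit-learn assumed as a stand-in for the authors' ℓ1-regularized classifier and an illustrative regularization value:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def raise_quadratic(f):
    """Append all distinct product pairs of the 225-dim feature vector f.

    225 original features + 225*224/2 = 25,200 pairs gives 25,425 dimensions."""
    f = np.asarray(f, dtype=float)
    i, j = np.triu_indices(len(f), k=1)
    return np.concatenate([f, f[i] * f[j]])

# Hypothetical training call; C controls the strength of the l1 (Laplace) penalty
# and would be chosen by cross-validation as described in the text.
# clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
# clf.fit(np.stack([raise_quadratic(f) for f in features]), labels)
```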

Fig. 4. Binarizations of Street View Text words. Inferred text is black on a white background.

The figure above illustrates examples from the SVT data. Common failure modes (cf. bottom row) include thin, under-segmented characters and text-like blobs, in addition to polarity inversion. However, unlike the method of Mishra et al. [36], ours is not sensitive to an initial, bottom-up seeding process to guess which regions are text. Instead, it waits until more geometric features are available to classify the final segments as text, the advantages of which can be seen in Fig. 5 below.

Fig. 5. ICDAR word binarization comparison: original image, the method of Mishra et al. [36], and this work.

Table 2 quantitatively evaluates our technique on the KAIST scene text database English subset with rectangular bounding boxes [41]. Though each method uses the same binarization classifier, segmentations are provided by a simple Gaussian mixture (1) or regression mixtures (2). A paired, two-sided Wilcoxon signed rank test indicates the recall and F1 of the regression mixtures are both significantly better than those of the simple mixture model; precision differences are not significant.

Finally, we re-emphasize that the purpose of this stage is not to achieve a binarization worthy of accurate recognition. Rather, we simply want to enable a geometric analysis for word normalization prior to a recognition based on the full color image. Thus any sufficiently advanced binarization algorithm should suffice. Next we describe how these coarse binarizations can improve recognition by facilitating word image normalization.

TABLE 2
Average binarization results for KAIST/English data [41] (395 images).

Method                          Prec.   Recall   F1
Simple Gaussian Mixture         70.50   86.29    74.30
Linear Regression Mixture       68.53   89.71    75.25
Quadratic Regression Mixture    69.95   90.59    76.76

5 TEXT LINE NORMALIZATION

Text in the wild is captured from arbitrary views and exhibits rotation, perspective projection from non-planar surfaces, and even intrinsically non-linear baselines (the conventional term for the curve on which the letters rest, whether it is truly linear or not). While it is possible to train a character classifier with examples of these variations, the result is often inherently less discriminative unless invariant features are used or the variations are explicitly modeled rather than being treated as noise. We hope to sharpen performance of our character recognition system by using the binarized image to normalize the text to a rectilinear pose, minimizing the effect of viewpoint variations, as outlined in Fig. 6 below.

Fig. 6. Word normalization process: original image, initial fit, pruned fit and extrema, guides and control points, and the normalized image.

5.1 Related work

Whether in documents, handwriting, or scenes, normalizing text typically improves recognition accuracy. In document processing, this task usually consists of "skew detection," which simply means inferring the global page rotation (see Hull [42] for a survey). In handwriting recognition, Bengio and LeCun [43] fit four tied quadratic curves to individual words. They use local extrema of online pen trajectories as control points for a robust least squares regression built on a formal probabilistic model. For offline handwriting, Caesar et al. [44] fit four tied straight lines to groups of words, using simple vertical extrema of the binary image as control points. In scene text, Shivakumara et al. [21] perform a simple rotation of a binarized detection. Neumann and Matas [14] assume a planar text surface with two linear guides to estimate foreshortening parameters for normalization. In later work, they follow Caesar et al., fitting four tied lines to binary character vertical extrema with a least median squares regression [45].

Unlike handwritten text, typefaces exhibit substantial regularity in the locations of the capline (tops of the upper case or capital letters), meanline (tops of the lower case letters), and descender line for a given font size and baseline. Thus for many recognition tasks, identifying only a pair of the guides is sufficient for normalization; a character appearance model can easily handle the small variation remaining. Our algorithm infers two text guides from the binarized image, one for the baseline and one for character tops, which may be either the meanline or the capline.

Several important requirements must be satisfied in order for any algorithm to perform robustly. First, because binarization is imperfect, the guide curve inference must be robust to noise, outliers, and false-positive and missing characters. Second, because text can appear in practically any orientation, the algorithm must be flexible enough to handle rotated, curved, and non-parallel guides while preventing overparameterized fits to noise or small amounts of text data. To satisfy the first requirement, we use a probabilistic mixture explicitly modeling outliers. For the second requirement, we model the guides as two loosely coupled curves. A second-degree polynomial is sufficient to handle the most common variations found in scene text, while purely linear baselines are surprisingly rare due to lens distortion. The loose coupling means that while the guides tend to vary together (i.e., for rotated text), our model can still handle perspective convergence and independent upper and lower guides.

5.2 Probabilistic regression formulation

In our formulation, we have only the binary image and must infer which points correspond to the extrema that the guides intersect. Unfortunately, these extrema are defined in a coordinate system relative to the guide curves themselves. Consider the character "o." Under any rotation, the image coordinate system gives an arbitrary point on the boundary as a lower extremum. Forcing a guide through this arbitrary point makes the character rest below the baseline. The converse happens for upper extrema. Prior work is prone to these biases for rotated text. Because we need the guides to infer the extrema and the extrema to infer the guides, we take an approximate, iterative refinement approach to discovering both.

Following Caesar et al. [44], we first find the least-squares quadratic fit to all the points (pixels) in the binarized image. This fit tends to give a good approximation of the text guide shape because most of the pixels are text. However, a simple least-squares regression is not robust, and non-text outliers can significantly bias the results. We therefore discard any connected components this initial curve does not touch and perform another regression on what remains. This secondary fit is usually much closer to the correct guide shape (assuming top and bottom are similar to one another), tends to traverse the characters, and gives us a reasonable, approximately correct coordinate system for finding extrema to which lower and upper guides may be fit.

To find candidate extrema, we consider a series of local coordinate systems aligned with the curve's normal and tangent. At each column, we search along the normal of the secondary fit to find the farthest point from the curve on each connected component intersecting that normal. We then retain only those that are extremal points of the binary image in this local coordinate system. To assess which points are extrema, we consider a neighborhood of 5 pixels to either side (in the tangent orientation) for comparison. While this neighborhood is not scale invariant, results are stable over a wide range of scales.

Following Bengio and LeCun [43], we define a probabilistic model for the coefficients β ∈ R^{d+1} of each degree-d polynomial that describes a text guide. Let X = {x_i} be the set of observed lower (upper, resp.) points x = (x, y). Because a given point may be an outlier due to a bad binarization or a descender (ascender, resp.), we model the likelihood of each point as a mixture between a Gaussian and a uniform density,

$$p\left(\mathbf{x}_i \mid \boldsymbol{\beta}, \epsilon, \upsilon, \rho\right) = \rho\, \mathcal{G}\left(y_i - \boldsymbol{\beta}^{\top}\phi\left(\mathbf{x}_i\right); \epsilon^2\right) + \left(1 - \rho\right)\upsilon, \qquad (3)$$

where φ(x) = (x^2, x, 1)^⊤ for a d = 2 quadratic guide, ε describes the error (standard deviation) expected for the fit, 0 < ρ ≤ 1 is the mixing prior for inliers, and υ is the (finite) inverse area of the uniform background model. To promote similarity to some preliminary curve, we incorporate a prior on the guide coefficients,

$$p\left(\boldsymbol{\beta} \mid \boldsymbol{\alpha}, \boldsymbol{\sigma}\right) = \mathcal{G}\left(\boldsymbol{\beta} - \boldsymbol{\alpha}; \boldsymbol{\sigma}^{\top}\mathbf{I}\boldsymbol{\sigma}\right), \qquad (4)$$

where α ∈ R^{d+1} describes the preferred coefficients, with a tolerance vector σ ∈ R^{d+1}. We find the most likely coefficients for the fit according to the posterior probability

$$p\left(\boldsymbol{\beta}, \rho \mid X, \boldsymbol{\alpha}, \boldsymbol{\sigma}, \epsilon, \upsilon\right) \propto p\left(\boldsymbol{\beta} \mid \boldsymbol{\alpha}, \boldsymbol{\sigma}\right) \prod_{i=1}^{N} p\left(\mathbf{x}_i \mid \boldsymbol{\beta}, \epsilon, \upsilon, \rho\right), \qquad (5)$$

fixing the uniform background likelihood at υ = 10^{-21}. While a value on the order of the image size would seem more appropriate, in the absence of prior information on ρ, an extreme number is necessary to prevent most all points from being assigned to the background model. The small υ therefore functions as a proxy for a strong bias toward inliers. Although not scale invariant, we fix the error term to just ε = 0.5 pixels to promote tight fits rather than guides that run between baseline and descender points. We discuss the deviation tolerances σ in the next section, where we describe a multi-step process for finding a good local maximum of this posterior (5) for the upper and lower guides.

5.3 Expectation-Maximization fitting algorithm

Because the log posterior (5) is not convex, we use an iterative EM approximation to estimate coefficients. As with the mixture of regressors in Section 4.2, calculating guide coefficients β in the M-step amounts to solving a weighted least squares problem, due to the Gaussian likelihoods and prior.
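As a rough illustration of the EM iterations for a single guide, the sketch below fits a quadratic under the outlier mixture (Eq. 3) and coefficient prior (Eq. 4); it assumes NumPy arrays, treats the prior covariance as diagonal, and uses our own names and initialization, so it should be read as an outline rather than the authors' procedure (the actual coupling of the two guides is described next).

```python
import numpy as np

def fit_guide(x, y, alpha, sigma, eps=0.5, upsilon=1e-21, rho=0.9, iters=30):
    """Robust quadratic guide fit: y ~ beta^T (x^2, x, 1) with an inlier/outlier
    mixture and a Gaussian prior on the coefficients."""
    Phi = np.stack([x**2, x, np.ones_like(x)], axis=1)      # N x 3 design matrix
    beta = alpha.copy()
    prior_prec = np.diag(1.0 / sigma**2)                    # inverse prior covariance

    for _ in range(iters):
        # E-step: posterior probability that each point is an inlier.
        resid = y - Phi @ beta
        g = np.exp(-0.5 * resid**2 / eps**2) / np.sqrt(2 * np.pi * eps**2)
        w = rho * g / (rho * g + (1 - rho) * upsilon)

        # M-step: weighted least squares with the coefficient prior,
        # plus an update of the inlier mixing weight rho.
        A = Phi.T @ (w[:, None] * Phi) / eps**2 + prior_prec
        b = Phi.T @ (w * y) / eps**2 + prior_prec @ alpha
        beta = np.linalg.solve(A, b)
        rho = np.clip(w.mean(), 1e-3, 1.0)

    return beta, rho
```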

Bengio and LeCun's model explicitly couples their four fitted guides. However, to avoid bad local maxima, we implicitly couple the inference of our two guides through a sequential process that first infers the baseline curve, using the result to bias the upper curve. To fit the baseline, we must set the hyperparameters α and σ. The pruned, secondary regression gives us the higher-order coefficients (e.g., α_2 and α_1). With these two terms fixed, α_0 is the least squares estimate on all the lower extrema points, even though some may be outliers. The corresponding deviation tolerance is σ = (10^{-5}, 10^{-3}, h), where h is the average text region rectangle height.

For stability, we then condition the upper curve's coefficients on the baseline estimate, but complications remain: characters may reach either the meanline or capline, and a noisy binarization has outliers. Because inliers are typically offset from the baseline by the same amount, we increase our chances of finding a good-fitting upper curve by first clustering the distances from the upper points to the baseline curve with a K = 4 Gaussian mixture model. We initialize the means to uniformly spaced percentiles of the distances, discarding the largest and smallest values for stability to outliers. The posterior (5) for the upper curve's coefficients is maximized with EM. We initialize the M-step with the cluster probabilities of the upper extrema points under each mixture component in turn. From these varying initial conditions, we keep whichever result has the highest posterior probability.

The prior coefficient means α_2 and α_1 use the values from the bottom curve, with α_0 an analogous least squares estimate on the upper extrema. In many circumstances, we expect the baseline curve to be more similar to the upper curve, so we can use much smaller tolerances, σ = (10^{-10}, 10^{-7}, 1) for the ICDAR scenes and a slightly more forgiving σ = (10^{-9}, 10^{-6}, 1) for the SVT data, which exhibits more variation.

To control model complexity, we use MacKay's "Occam factor" [46] to approximate the Bayesian evidence for linear (d = 1) and quadratic (d = 2) models. We select the model degree with the highest approximate posterior probability when the coefficients β are marginalized.

The figure below shows common failure modes: too few data points with descenders (or ascenders), ascender/descender drift, and broken characters, in addition to poor binarizations with too many distractors.

Fig. 7. Guide fitting failure modes.

Once we infer the pair of guide curves, we normalize the word images to horizontal, straight-line guides using a thin-plate spline and bicubic interpolation. For each of several evenly spaced points along the baseline curve, we find the baseline normal's intersection with the top curve. To cover ascenders and descenders, we extend two more points along the normal above and below the curves. These four points form a vertical line in the normalized image, with corresponding points aligning horizontally. Note that the control points are established using only the curve geometry, with no underlying image information. Fig. 8 gives further examples of fits and their normalized results.

Fig. 8. Guide fitting and normalization examples.

Sometimes the upper curve settles on the meanline and sometimes on the capline. Rather than attempt further bottom-up inferences at this stage, we assume either interpretation could be correct and recognize the words under both interpretations, keeping whichever scores higher. We could also enhance this more Bayesian approach by sampling multiple modes of the guide probabilities to create several normalized image candidates for recognition.

6 TEXT LINE RECOGNITION

Section 4 demonstrated the difficulties of binarizing low-resolution characters, which often results in a single connected mass or characters broken into multiple segments. Without more information about the characters, a single algorithm or parameter setting is unlikely to correctly binarize all inputs. This section explains our approach to integrating character segmentation with recognition. With the normalization described in Section 5, a two-dimensional curved or rotated text layout is transformed into a one-dimensional representation of the text string, allowing us to use standard linear sequence models for character segmentation and recognition. We extend the approach to find word boundaries, placing the entire process into a probabilistic framework.

If word boundaries are known, we can force the model to interpret a given word image as a lexicon string and return the highest scoring word [47], [13], [17]. However, this method does not apply when word boundaries are unknown. Therefore, in a manner analogous to examining the hypothesis of starting a character at every location, we consider starting a word at every location. The concomitant complexity requires us to borrow sparse beam search tools from speech recognition to eliminate unlikely word hypotheses.

Our model requires no binarization or prior character segmentation. By integrating recognition with segmentation at both word and character levels, we reduce the strict reliance on uninformed bottom-up techniques, avoiding unrecoverable errors. Furthermore, because the model seeks to explain an entire normalized text region, it is not overly sensitive to missed detections of isolated characters in earlier pipeline stages [13], [14], [17].

Like many contextual recognition models, ours employs compatibility functions using various information sources to relate label hypotheses to one another and to the image features. Although the conventional conditional random field (CRF) is a powerful tool for sequence modeling, the discriminative semi-Markov model [24] is more natural because it not only captures the dependencies between states but also their duration, automatically yielding segmentations. Unlike the generative duration models used in speech [25] and handwriting recognition [48], the discriminatively trained model need not make independence assumptions of the input, allowing us to use global image features.

In the following, we first establish the basic model formalism, then describe how several information sources score a segmentation and labeling, the dynamic programming algorithm for finding the optimal segmentation and labeling, and finally how the model is trained.

6.1 Semi-Markov model

A segmentation s induces a sequence of labels y and a corresponding set of discriminant functions {U_C}_{C ∈ C(s)}. Each segment s_i = (r_i, t_i) with r_i < t_i indicates the start and ending location of a character y_i ∈ Y, which takes on a label from some alphabet, i.e., Y ≡ [A–Z a–z 0–9 ⊔], where ⊔ is an inter-word space. Modeling spaces explicitly as a character allows seamless integration of word segmentation with character recognition and segmentation. The conditional probability over segmentations and labels is an exponential model,

$$p\left(\mathbf{s}, \mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}\right) \propto \exp \sum_{C \in \mathcal{C}(\mathbf{s})} U_C\left(\mathbf{s}_C, \mathbf{y}_C, \mathbf{x}; \boldsymbol{\theta}\right), \qquad (6)$$

where x represents the observed image, and C(s) is a set of learned discriminant functions that depends on the segmentation. For instance, each segment's unknown y_i will require a function corresponding to a character recognition discriminant. The notation y_C indicates the subset of the labels that are arguments to U_C.

6.2 Model features

In this section we describe how a segmentation s and labeling y are scored, given the image x and parameters θ. Compatibility functions representing appearance and language information aid recognition. Additional functions capturing layout information aid in simultaneously parsing the string into segments.

6.2.1 Character appearance

Each segment is scored for a particular character and segment width |s| by a learned linear discriminant

$$U^A\left(s, y, \mathbf{x}; \boldsymbol{\theta}^A\right) = |s|\, \boldsymbol{\theta}^A\left(|s|, y\right)^{\top} F\left(s, \mathbf{x}\right). \qquad (7)$$

The linear parameters θ^A of the inner product are dependent not only on the character identity y, but also on the size of the segment, |s| = t − r. Therefore U^A learns both the appearance and width of characters. The discriminative nature of the model allows us to use image features outside the segment without violating any independence assumptions. Greater image context could help disambiguate some characters.

The probability (6) could be biased by the number of segments. This potential bias is handled automatically by an appropriate learning process that can give different segment widths different score magnitudes. However, our simplified piecewise training strategy (described below) prevents this beneficial side-effect. Therefore we include the simple width multiplier |s| in the appearance energy (7), which effectively assigns the inner product to every index along the segment.

We resize the normalized image to a 25 px font height; the word box height is therefore either 12.5 px (assuming the upper guide is the meanline) or 18 px (if the capline). For efficiency we quantize the valid segment widths to be W = {4, 8, 12, 16, 20, 24, 32} pixels. Using a 32 × 32 window centered in the segment s, the features F(s, x) are complex magnitudes of steerable pyramid filters with 8 orientations and 1 scale (omitting the low and high pass bands) [49]. For RGB images we take the largest filter response among the three color channels at each pixel. Contrast normalization is critical to performance: we divide each filter response by the ℓ2 norm of all responses in an 8 × 16 window (efficiently computed with box filters), clip at 0.2, and renormalize. We also include the binary image as a feature because it increases accuracy by anchoring the phase-invariant steerable pyramid feature magnitudes. Thus, F(s, x) is a 32 × 32 × 9 = 9,216 element vector, plus an added bias dimension.
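The contrast normalization step can be illustrated as follows (a sketch with NumPy/SciPy; a generic multi-orientation filter bank stands in for the steerable pyramid, and all names are ours):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def contrast_normalize(responses, win=(8, 16), clip=0.2):
    """Locally normalize a stack of filter-response magnitudes.

    responses: array of shape (n_orientations, H, W) of non-negative magnitudes.
    Each response is divided by the local l2 norm of all responses in a win-sized
    window (computed with box filters), clipped, and renormalized (cf. Sec. 6.2.1)."""
    energy = np.sum(responses**2, axis=0)                                     # H x W
    local_norm = np.sqrt(uniform_filter(energy, size=win) * win[0] * win[1] + 1e-8)
    normalized = np.minimum(responses / local_norm[None, :, :], clip)         # clip at 0.2
    # Renormalize after clipping with the same local-window scheme.
    energy2 = np.sum(normalized**2, axis=0)
    local_norm2 = np.sqrt(uniform_filter(energy2, size=win) * win[0] * win[1] + 1e-8)
    return normalized / local_norm2[None, :, :]
```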

6.2.2 Language: Character bigrams and a lexicon

Linguistic properties provide helpful top-down cues for character recognition in challenging images. The character N-gram is a ubiquitous feature for OCR, STR, and handwriting recognition. In this model, each pair of neighboring character segments s′, s with labels y′, y has a bigram score,

$$U^B\left(s', s, y', y; \boldsymbol{\theta}^B\right) = \left(t - r'\right) \boldsymbol{\theta}^B\left(y', y\right), \qquad (8)$$

with the first character segment s′ = (r′, t′) and the subsequent segment s = (r, t). When bigram scores are not a completely integrated part of the model, using them can introduce problematic biases (i.e., for longer or shorter strings) that must somehow be rectified. Wang et al. [50] examine several possibilities. Just as in the character appearances, we could make the bigram scores conditional not only on the labels, but on the segment lengths, i.e., θ^B(s′, s, y′, y) [51]. Instead, we compensate for training approximations by factoring segment width into a multiplier, rather than a conditioning term, in effect tying the bigram parameters across all segment widths. Note that two bigram scores get counted on every segment except the first and last; our implementation corrects for this bias.

To facilitate a soft preference for character sequences that compose a lexicon word, we use a different energy,

$$U^L\left(s', s; \theta^L\right) = \left(t - r'\right) \theta^L, \qquad (9)$$

in lieu of the bigram score for segments belonging to a lexicon word. See Section 6.3.2 below for computational ramifications.

6.2.3 Geometry: Character gap or overlap

A pair of neighboring segments may either overlap or have a gap between them. We allow character bounding boxes to overlap (as in italics or an fi ligature), with a quadratic penalty:

$$U^O\left(s', s; \boldsymbol{\theta}^O\right) = \begin{cases} 0 & t' - r < \varepsilon \\ \infty & t' - r > \tau \\ \left(t' - r\right)^2 \theta^O_2 + \left(t' - r\right) \theta^O_1 & \text{otherwise.} \end{cases} \qquad (10)$$

Because character widths are quantized, we allow overlap that could arise from quantization error (ε = 2) without penalty, with a hard limit of up to half the smaller segment width, τ = min{½ min{|s′|, |s|}, 8}. Though we manually fixed (θ^O_2, θ^O_1) = (0.0234, 0.7416) in our experiments, these are easily learned.

When a pair of neighboring character segments have a gap between them, the gap is scored by a learned compatibility function,

$$U^G\left(s', s, \mathbf{x}; \boldsymbol{\theta}^G\right) = \sum_{i=t'+1}^{r-1} \left(\boldsymbol{\theta}^G\right)^{\top} F\left(i, \mathbf{x}\right), \qquad (11)$$

which is a sum of the scores for each index (column) i in the gap between segments s′ and s. The same image features of the character appearance model are multiplied by a different set of learned linear parameters. In practice, we limit the gap to 12 pixels.
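A compact sketch of these two geometry terms follows (our own reading of Eqs. 10 and 11; the feature extractor and parameter vector are placeholders):

```python
import math

def overlap_energy(t_prev, r, w_prev, w, theta2=0.0234, theta1=0.7416, eps=2):
    """Quadratic penalty for overlapping neighbor segments (cf. Eq. 10)."""
    overlap = t_prev - r                          # > 0 means the segments overlap
    tau = min(0.5 * min(w_prev, w), 8)            # hard overlap limit
    if overlap < eps:
        return 0.0                                # free: gap or quantization-sized overlap
    if overlap > tau:
        return math.inf                           # forbidden: overlap too large
    return theta2 * overlap**2 + theta1 * overlap

def gap_energy(t_prev, r, features, theta_gap):
    """Score the columns strictly between two segments (cf. Eq. 11).

    `features(i)` is a placeholder returning the column-i feature vector."""
    return sum(sum(a * b for a, b in zip(theta_gap, features(i)))
               for i in range(t_prev + 1, r))
```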

6.3 Model inference

The recognition task is to find the most probable segmentation and labeling, (s, y) = argmax_{s,y} p(s, y | x, θ). Equivalently, we minimize the total of the energies described in the previous section. This inference process may be done in a model with or without a lexicon. For simplicity, we begin by describing a lexicon-free model.

6.3.1 Lexicon-free parsing

We build up a two-dimensional dynamic programming table to find the optimal parse. Let S(t, y) be the optimal score for a segment ending at index t with character y. Letting t = 1 be the first column index of the sequence, we iteratively build the table via the recurrence relation

$$S\left(t, y\right) = \begin{cases} \max_{t', r, y'} S\left(t', y'\right) - P\left(t', r, t, y', y, 0\right) & t > 0 \\ 0 & t = 0 \\ -\infty & t < 0, \end{cases} \qquad (12)$$

where P(t′, r, t, y′, y, 0) represents the additional parse score for adding a segment s = (r, t) with character label y, while the previous character y′ ended at t′. The last argument, 0, indicates that the additional character is not forming part of a lexicon word.

Energy terms detailed in the previous section compose the additional parse score,

$$P\left(t', r, t, y', y, w\right) = U^A\left(s, y\right) + w\, U^L\left(s', s\right) + \left(1 - w\right) U^B\left(s', s, y', y\right) + U^O\left(s', s\right) + U^G\left(s', s\right), \qquad (13)$$

where s = (r, t) and s′ = (r′, t′). The total score for a given segmentation and labeling is the negated sum of all the P terms, which of course is the value inside the exponential of the corresponding probability model (6).

We trace back the optimal parse by storing the argmax values of t′, r, and y′ for each step t. Here we especially note that no separate word segmentation must be done; it is completely automatic because a space is among the characters to be recognized.
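A stripped-down sketch of the lexicon-free dynamic program (Eq. 12) is shown below; `parse_score` stands in for P of Eq. (13), widths are quantized as in Section 6.2.1, and all names are ours. For clarity, this sketch disallows the small segment overlaps the full model permits.

```python
import math

ALPHABET = list("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 ")
WIDTHS = [4, 8, 12, 16, 20, 24, 32]              # quantized segment widths

def parse_line(length, parse_score):
    """Lexicon-free parse of a normalized line with `length` columns (cf. Eq. 12).

    parse_score(t_prev, r, t, y_prev, y) returns the energy of appending
    character y on segment (r, t) after a character y_prev ending at t_prev."""
    S = {(0, None): 0.0}                         # best score of a parse ending at (t, y)
    back = {}
    for t in range(1, length + 1):
        for y in ALPHABET:
            best, arg = -math.inf, None
            for w in WIDTHS:
                r = t - w                        # segment (r, t) of width t - r = w
                if r < 0:
                    continue
                for (t_prev, y_prev), prev in S.items():
                    if t_prev > r:               # previous character must end by column r
                        continue
                    score = prev - parse_score(t_prev, r, t, y_prev, y)
                    if score > best:
                        best, arg = score, (t_prev, y_prev, r)
            if arg is not None:
                S[(t, y)] = best
                back[(t, y)] = arg
    # Trace back the best complete parse.
    finals = [(t, y) for (t, y) in S if t == length]
    if not finals:
        return ""
    t, y = max(finals, key=lambda k: S[k])
    chars = []
    while y is not None:
        t_prev, y_prev, _ = back[(t, y)]
        chars.append(y)
        t, y = t_prev, y_prev
    return "".join(reversed(chars))
```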

6.3.2 Lexicon-based parsing

When incorporating a lexicon (and thereby allowing a bias for strings to be recognized as lexicon words), the optimization becomes significantly more complex. If the text region is assumed to be a single word, we simply find the highest-scoring word among parses constrained to the lexicon. The dynamic programming can be interleaved with a trie traversal, reusing optimal parse information from the shared prefix substrings to speed processing [28], [52], [47].

In general, we cannot assume detected text lines contain just one word; words may start or end at any point in the line. To handle this generalization, we maintain two parallel dynamic programming tables, one for non-lexicon parses and another with lexicon restrictions.

As before, the table S(t, y) corresponds to the best parse for an arbitrary character y ending at point t. It is built by adding optimal segments and characters using the parse score (13). A new table W(t) corresponds to the best parse for any lexicon word ending at t. The connection between these two tables occurs when y = ⊔ is a space. The optimal character given by S(t, y) may have been preceded by a lexicon word; conversely, the optimal word calculated by W(t) may have been preceded by a non-lexical character. Mutually dependent recurrence relations link these two tables.

With a lexicon, the primary recurrence for S(t, y) now has two cases. When y ≠ ⊔, the parse score is the sum of the best previous score plus the score for adding the new character segment, as given by Equation (12) earlier. When the new y = ⊔ is a space, the end of a character string is signaled, and that string may either be a lexicon word or not. In this case, the dynamic programming must determine whether the optimal parse is to accept the previous lexicon word S_ℓ or to take the non-lexicon parse ending before the space S_ℓ̄:

$$S\left(t, \sqcup\right) = \max\left\{S_{\ell}\left(t\right), S_{\bar{\ell}}\left(t\right)\right\} \qquad (14)$$

$$S_{\ell}\left(t\right) = \max_{t', r}\; W\left(t'\right) - P\left(t', r, t, y_{t'}, \sqcup, 0\right) \qquad (15)$$

$$S_{\bar{\ell}}\left(t\right) = \max_{t', r, y'}\; S\left(t', y'\right) - P\left(t', r, t, y', \sqcup, 0\right), \qquad (16)$$

where y_{t′} is the last character of the word parse ending at t′ and all other terms are as before.

Building the table W(t) for the best score of a lexicon word ending at t requires a new recurrence C(t″, t′, y), which scores the best partial lexicon word y over the span from t″ to t′. The score for extending the subword by the character y, ending at index t, is

$$C\left(t'', t, \left[\mathbf{y} \mid y\right]\right) = \max_{t', r}\; C\left(t'', t', \mathbf{y}\right) - P\left(t', r, t, y', y, 1\right), \qquad (17)$$

where y′ is the last character of y and [y | y] forms a new partial lexicon word. As before, the term P is a score for a parse of a particular character including the appearance and geometry scores. However, the language model is altered because the character is now part of a (hypothesized) lexicon word. The score U^L for constituents of a lexicon word thus replaces the bigram score U^B entirely, as indicated by the last argument, w = 1. The lexical recurrence relation W(t) is similar to the original table S(t, y), but with a larger spatial scale:

$$W\left(t\right) = \max_{t''}\; S\left(t'', \sqcup\right) + \max_{\mathbf{y} \in \mathcal{L}}\; C\left(t'', t, \mathbf{y}\right), \qquad (18)$$

where L is the set of complete lexicon words. The score W(t) builds upon the previous scores S, but adds an additional term C that represents the optimal score for any lexicon word following a space.

6.3.3 Computational complexity and beam search

This approach begets a problem of scale. Although the complexity is linear in the number of lexicon words and the length of the image to be parsed, our proposed method maintains hypotheses for every possible word beginning at every possible location. This cross product of possibilities is impossibly slow in practice, so approximations must be introduced.

For the span from t″ to t, the function C(t″, t, y) ranks subwords y, most of which are extremely unlikely. At the risk of error, we eliminate initially low-scoring candidates for each span to speed processing. Every subword eliminated further eliminates words building upon that subword parse from consideration. With fewer calculations of the subword score (17) to be made, there are fewer complete words to score. Such pruning is a form of beam search common to dynamic programming algorithms for speech recognition. While several pruning strategies exist, our earlier work indicated that a simple N-best list was sufficient for this task [23]. Under this standard strategy for Viterbi beam search, we simply sort the scores C for a given region (t″ to t), keeping the N best subword parses. We use N = 50 for the small-vocabulary SVT data and N = 25 for the others.
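The N-best pruning step itself is simple; a sketch (names ours) of keeping only the top-scoring subword hypotheses for a span:

```python
import heapq

def prune_subwords(hypotheses, n_best=25):
    """Viterbi beam search pruning: keep only the N best partial lexicon-word
    parses for a span (t'', t). `hypotheses` maps a subword string to its best
    score C(t'', t, subword); larger scores are better."""
    return dict(heapq.nlargest(n_best, hypotheses.items(), key=lambda kv: kv[1]))
```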

With our unoptimized Java implementation, the average parse runtime for the ICDAR data on a standard desktop machine is 40 sec/word with a lexicon and 20 sec/word without. SVT word spotting is 25 sec/word. We note that significant opportunities for parallelism abound, e.g., by computing maxima over nearly 3000 values for each non-lexical parse score S(t, y) alone.

6.4 Model training

The discriminative semi-Markov model (6) has a learning process analogous to conventional probabilistic models: maximizing a regularized version of the convex conditional log likelihood [24]. Unlike the information extraction tasks for which the model was first proposed, such a learning scheme poses two major drawbacks on this task. First and foremost, such a scheme requires the training data be like the testing data in that both labeling and segmentation (i.e., context) are required. This restriction is problematic because tens or even hundreds of thousands of examples are required for effective character recognition [53], yet the largest STR data sets only have a few thousand characters in context. The second major drawback is time, because the model performs segmentation along with recognition over several iterations of gradient descent.

The many works that use weighted FSTs sidestep these problems by simply learning modules (e.g., character recognition, N-grams) separately, perhaps adjusting the relative module weights in an ad hoc fashion [13], [17], [16]. We formalize this approach under the piecewise pseudolikelihood interpretation of Sutton and McCallum [54]. Our two-stage training process first temporarily decouples the appearance and bigram modules to reduce training time and increase available training data. First, piecewise training maximizes a lower bound on the likelihood of the training data by learning the energy functions separately. Because all the energies are linear in the parameters θ, the original and piecewise log likelihoods are convex and optimized with a limited-memory second-order Newton method for gradient descent. In the second training stage we recouple all the energies for an important lower-dimensional scale tuning via the generalized perceptron loss [55].

First, the appearance and gap parameters, θA and θG, are trained on a set of rendered examples (see Weinman [23, Chapters 5, 6] for details) from 1866 fonts given random contrast, bias, noise, borders, and neighboring characters. To mimic the wide variety of scene text, we include versions of these fonts horizontally scaled at 2^{-1/2}, 2^{-1/4}, and 2^{1/2} for a total of 1866 × 62 × 4 = 462,768 characters plus 16,794 spaces and gaps. The quantized horizontal range of each raw character image is taken as the segment width for learning θA(|s|, y). For efficiency, only segmentations of the quantized widths centered in the training window are considered during learning.
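As an illustration only (the actual training set was produced with the authors' own rendering pipeline), a rough Python sketch of generating one character image with random contrast, bias, and additive noise, plus horizontal rescaling, might look like the following; the font path, parameter ranges, and helper names are assumptions.

```python
# Illustrative sketch of synthetic character rendering with random contrast,
# bias, and noise; font path and parameter ranges are assumptions.
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def render_character(char, font_path, size=64, pad=16, rng=np.random):
    font = ImageFont.truetype(font_path, size)
    canvas = Image.new("L", (size + 2 * pad, size + 2 * pad), color=0)
    ImageDraw.Draw(canvas).text((pad, pad), char, font=font, fill=255)
    img = np.asarray(canvas, dtype=np.float64) / 255.0
    contrast = rng.uniform(0.4, 1.0)            # random contrast
    bias = rng.uniform(0.0, 1.0 - contrast)     # random brightness bias
    img = contrast * img + bias
    img += rng.normal(0.0, 0.05, img.shape)     # additive noise
    return np.clip(img, 0.0, 1.0)

def rescale_width(img, factor):
    """Horizontal scaling (e.g., 2**-0.5, 2**-0.25, 2**0.5) to mimic
    condensed or expanded scene fonts."""
    h, w = img.shape
    pil = Image.fromarray(np.uint8(img * 255))
    return np.asarray(pil.resize((max(1, round(w * factor)), h))) / 255.0
```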

Bigram parameters θB are simultaneously trained in a piecewise fashion on a text corpus [20]. We collapse


Fig. 9. Street View Text word spotting. Evaluation is case insensitive, but recognized lettercase is shown.

TABLE 3
Word spotting results on the Street View Text data (647 words, 249 images) [13].

Method              Word Error (%)
ABBYY [19]          65
Wang et al. [19]    43
Mishra et al. [17]  26.74
Proposed Method     21.95

digits into a single class because they occur infrequently and exhibit no meaningful sequence patterns. Conversely, all-uppercase letter signs occur frequently, so intra-case transition parameters are tied across case, e.g., θB(T, H) = θB(t, h); inter-case transitions are preserved so that in general θB(T, h) ≠ θB(t, h) ≠ θB(t, H). Once θB is learned, we set θL to twice the median bigram weight for all bigrams over words in our lexicon L, a value which is tuned in the secondary stage below.
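The parameter tying above can be expressed as a mapping from character pairs to tied parameter keys. The sketch below is an illustrative encoding, not the training code itself.

```python
# Illustrative parameter tying: digits collapse to one class, same-case letter
# bigrams share a single lowercase key, and mixed-case bigrams stay distinct.
def bigram_key(a, b):
    a = "#" if a.isdigit() else a        # single class for all digits
    b = "#" if b.isdigit() else b
    if a.isalpha() and b.isalpha() and a.isupper() == b.isupper():
        return (a.lower(), b.lower())    # theta_B(T,H) == theta_B(t,h)
    return (a, b)                        # inter-case transitions preserved

assert bigram_key("T", "H") == bigram_key("t", "h")
assert bigram_key("T", "h") != bigram_key("t", "h")
```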

Both θA and θB are trained with a Laplace prior (ℓ1 regularization). We choose the weight of the ℓ1 penalty on θB with ten-fold cross-validation, optimizing the total likelihood of all held-out folds. The ℓ1 penalty on θA is instead chosen by minimizing recognition error on a training validation set after the entire model has been learned. Delayed validation helps compensate for piecewise training with fixed segments.

Although piecewise training maximizes a lower bound on the true likelihood, there is significant risk that the scales of the energy functions learned in this fashion are not commensurate. Therefore, we undertake a simple secondary training step for positive weights λ = (λA, λB, λL, λG, λO) to adjust the scale of all five energy components (appearance λA · UA, bigrams λB · UB, lexicon λL · UL, gap λG · UG, and overlap λO · UO) by using gradient descent to minimize the perceptron loss

E(\lambda) = \sum_i \left[ \min_{s} U(s, y_i, x_i; \lambda, \theta) - \min_{s', y'} U(s', y', x_i; \lambda, \theta) \right], \qquad (19)

where U is the total energy of the probability model (6) whose components are scaled by λ, y_i is a ground truth character labeling, and θ is fixed at the values learned by piecewise training. E(λ) has a lower bound at zero where the best interpretation y′ matches the ground truth [55]. In practice, pre-training θ prevents the perceptron functional from collapsing U to zero everywhere. Unlike the first piecewise training stage, we minimize the perceptron loss (19) on contextual training data that is like the testing data, where full segmentation optimization is required.
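A minimal sketch of evaluating the perceptron loss (19) for a candidate scale vector λ follows; the two energy-minimization routines are assumed interfaces standing in for the model's clamped and unconstrained parses, not the authors' implementation.

```python
# Sketch of the perceptron loss (19) used for scale tuning.
# `min_energy_clamped` minimizes U over segmentations s with the ground-truth
# labeling y_i fixed; `min_energy_free` minimizes over both s' and y'.
def perceptron_loss(lmbda, examples, min_energy_clamped, min_energy_free):
    loss = 0.0
    for x_i, y_i in examples:
        loss += min_energy_clamped(lmbda, x_i, y_i) - min_energy_free(lmbda, x_i)
    return loss  # lower-bounded by zero when the best parse matches y_i
```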

Fig. 10. ICDAR 2011 word recognition.

TABLE 4
Word recognition results on the ICDAR 2011 scenes data (6393 characters, 1189 words, 255 images) [18].

                        Error (%)
Method                  Character/Word   Character   Word
Neumann et al. [18]     36.14            -           66.89
Lee et al. [18]         26.78            -           64.4
Yang et al. [18]        14.82            -           58.8
Appearance + Geometry   44.00            38.59       60.72
+ Bigram                40.00            34.21       54.50
+ Lexicon               36.24            29.36       42.30

7 EXPERIMENTS

We evaluate our system on three data sets that each test slightly different and increasingly difficult problems: a small closed-vocabulary word spotting task, an open-vocabulary word recognition task, and two open-vocabulary word segmentation and recognition tasks, one with noisy detections, giving complete end-to-end performance.

7.1 Closed-vocabulary word spotting

Wang and Belongie's Street View Text data set features 647 test words in 249 images, each with an associated lexicon of about 50 words [19]. Annotated bounding boxes are drawn from this small, closed vocabulary. The evaluation is case insensitive and the lexicon is given in all uppercase letters. To perform this word spotting task, our recognition model simply ignores letter case, prohibiting spaces and non-lexical parses. Table 3 shows our method to be the state of the art.

7.2 Open-vocabulary word recognition

The 2011 ICDAR robust reading competition [18] features 1189 word bounding boxes over 255 scene images, including words from several languages. Though no lexicon is given, it is such a strong source of information that we create one (using SCOWL1, with British and American English) including proper names, abbreviations, and contractions (with all punctuation removed). Our case-sensitive model ignores word frequency, so we only include words up to the 50th frequency percentile and add uppercase and leading-caps versions of all words to the lexicon (along with the training words) so that |L| = 244,033 words, including 884 (74%) of the test words. Table 4 compares results from the recent competition. Character/word error is the average of word edit distance divided by word length. The standard character error divides the total edit distance (over all words) by the total number of characters in the data.

1. http://wordlist.sourceforge.net, revision 6
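For illustration, a sketch of this lexicon construction might look like the following; the SCOWL entry format (word plus a frequency-percentile level) and the helper names are assumptions, not the authors' code.

```python
# Illustrative lexicon construction: keep words up to the 50th frequency
# percentile, strip punctuation, and add all-uppercase and leading-caps
# variants along with the training words.  Entry format is an assumption.
import re

def build_lexicon(scowl_entries, training_words, max_percentile=50):
    """scowl_entries: iterable of (word, percentile) pairs."""
    lexicon = set(training_words)
    for word, percentile in scowl_entries:
        if percentile > max_percentile:
            continue
        word = re.sub(r"[^A-Za-z]", "", word)   # remove punctuation, e.g. apostrophes
        if word:
            lexicon.update({word, word.upper(), word.capitalize()})
    return lexicon
```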


Fig. 11. VIDI word segmentation and recognition.

TABLE 5
Word segmentation and recognition results on the VIDI data (1164 characters, 192 words, 86 images) [20].

                        Error (%)
Model Features          Character   Word
Appearance + Geometry   16.84       58.33
+ Bigrams               12.89       43.23
+ Lexicon               10.65       24.48

Case-sensitive evaluations (Table 4) show our method to be the state of the art for this task as well.

7.3 Word segmentation and recognition

To test our model's integrated word segmentation capabilities, we first use a punctuation-free subset of the VIDI data by Weinman et al. [20], which provides approximate text baselines and a pre-normalized font size of 100px for 1164 characters in 192 words from 86 images. Words must be segmented automatically because the problem input is a line of text. We measure word error (including all stop words) with the ISRI OCR performance toolkit program wordacc2, modified to include letter case sensitivity and number strings, which the original ignores. Table 5 shows quantitative results.

For a complete end-to-end test including detection, normalization, word segmentation, and recognition, we use text detection results of Yi and Tian [56] as submitted to the 2011 ICDAR detection competition [18], where it placed second. Fig. 12 shows some examples, while Table 6 reports detection results with the 2011 competition methodology [18] and recognition results using the case-sensitive methodology of Lucas et al. [7]. Many spurious detections give only one-character words, so we report results where these are dropped from the system output; ground truth is unaltered.

7.4 Discussion

We eliminate nearly 18% of errors on the SVT word spotting task. The task is greatly simplified by the small lexicon, and these results illustrate the relative difficulty of the remaining incorrect words. In the open-vocabulary word recognition setting, we significantly improve on the state-of-the-art ICDAR 2011 results. We note word recognition error rises 8% when regions are not grouped into lines for guide fitting, indicating this pre-processing step gives a significant performance boost.

The comparatively high character/word error on the ICDAR data indicates many failures happen earlier in

2. http://code.google.com/p/isri-ocr-evaluation-tools

Fig. 12. End-to-end examples: input detections (by Yi and Tian [56]) shown in magenta, correctly segmented/recognized words in green, and incorrect regions in dashed red. Incorrect words printed in bold, with incorrect characters in italicized red.

TABLE 6
Average end-to-end results for ICDAR 2011 scenes.

Method and Task                    Prec.    Recall   F1
Neumann & Matas [6] Detection      73.1     64.7     68.7
Neumann & Matas [6] Recognition    37.1     37.2     36.5
Yi & Tian [56] Detection           67.22    58.09    62.32
Proposed Method Recognition        41.10    36.45    33.71
Omitting One-Character Words       46.60    36.27    36.40

the recognition pipeline, leaving little hope of recognizing even a portion of the word. Our model must also infer whether the top guide is the meanline or the capline. If the objectively better interpretation were chosen (that with lesser error), the overall character error would be reduced by nearly a third and the word error by 14%. Resolving more of these common error cases correctly would bring modest improvement.

On the open-vocabulary VIDI data, we demonstrate the utility of joint word segmentation and recognition. Note the large intra-word gaps and small inter-word spaces correctly identified in Fig. 11. Overall error rates are lower than for ICDAR and SVT because the guidelines are given; yet some fonts and complex backgrounds make the remaining errors challenging.

On the end-to-end task, top-down word segmentation resolves cases where the bottom-up detector cannot reliably segment words. Despite using lower-quality text detections as input, we achieve results on par with the state of the art. A better initial text detection score would likely yield even better overall results (cf. Section 7.2).

The lexicon notably reduces word recognition errors, by 22% on the ICDAR word recognition task and 43% on the VIDI task, which includes word segmentation.


However, it can be a double-edged sword. A mosaicing error (Fig. 9, right) requires hallucinating the "I" character for word spotting, a strategy that can backfire in an open-vocabulary setting (Fig. 10, right).

8 CONCLUSION

We have presented a system that handles many stages of scene text reading in a probabilistic fashion, from binarization to appearance normalization to character segmentation and recognition. We do not rely on individual character detections, which are prone to false negatives. Only a coarse binarization is necessary for detecting guidelines that allow us to handle curved and rotated text. Our model uses a hybrid open/closed vocabulary approach to balance bottom-up signals with top-down priors. Most importantly, it fully integrates segmentation and recognition.

While these latter two stages are integrated, partially in training and fully in testing, we would like the entire system to be even more integrated: first in learning [26], but more so that later pipeline stages can provide feedback to earlier modules, which may refine grouping, binarization, or guideline hypotheses in an iterative fashion, as in human information processing [57]. Given the difficulty of bottom-up processing, further opportunities for passing multiple hypotheses from one stage to the next are worthy of exploration, though care must be taken to avoid a combinatorial rise in complexity.

The binarization, guideline fitting, and character classification modules remain somewhat rudimentary. It is the integration of these stages and the evaluation of multiple hypotheses that make the system perform well. However, further refinements to these key modules could bring even better performance.

Scene text recognition is difficult: the world's vivid colors, uncontrolled lighting, and unpredictable perspectives conspire to make general machine reading a "grand challenge"-worthy task. Like other recent challenges, we believe STR can be solved with additional concentrated engineering and scientific effort.

Acknowledgments

The authors thank N. Howe, J. Jonkman, E. Learned-Miller, and C. Sutton for helpful conversations, anonymous reviewers for feedback that improved the manuscript, A. Mishra for providing binarized word images, and C. Yi for the text detections.

REFERENCES

[1] R. Darnton, "The library in the new age," The New York Review of Books, vol. 55, June 2008.
[2] V. Wu, R. Manmatha, and E. M. Riseman, "Textfinder: An automatic system to detect and recognize text in images," IEEE Trans. on Pattern Anal. Mach. Intell., vol. 21, no. 11, pp. 1224–1229, 1999.
[3] X. Chen, J. Yang, J. Zhang, and A. Waibel, "Automatic detection and recognition of signs from natural scenes," IEEE Trans. on Image Process., vol. 13, no. 1, pp. 87–99, 2004.
[4] X. Chen and A. L. Yuille, "Detecting and reading text in natural scenes," in Proc. Conf. on Computer Vision and Pattern Recognition, pp. 366–373, 2004.
[5] B. Epshtein, E. Ofek, and Y. Wexler, "Detecting text in natural scenes with stroke width transform," in Proc. Conf. on Computer Vision and Pattern Recognition, pp. 2963–2970, 2010.
[6] L. Neumann and J. Matas, "Real-time scene text localization and recognition," in Proc. Conf. on Computer Vision and Pattern Recognition, pp. 3538–3545, 2012.
[7] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young, "ICDAR 2003 robust reading competitions," in Proc. Intl. Conf. on Document Analysis and Recognition, vol. 2, pp. 682–687, 2003.
[8] J. Ohya, A. Shio, and S. Akamatsu, "Recognizing characters in scene images," IEEE Trans. on Pattern Anal. Mach. Intell., vol. 16, no. 2, pp. 214–220, 1994.
[9] C. Thillou, S. Ferreira, and B. Gosselin, "An embedded application for degraded text recognition," Eurasip Journal on Applied Signal Processing, vol. 13, pp. 2127–2135, 2005.
[10] J. J. Weinman and E. Learned-Miller, "Improving recognition of novel input with similarity," in Proc. Conf. on Computer Vision and Pattern Recognition, pp. 308–315, June 2006.
[11] S. M. Lucas, "Text locating competition results," in Proc. Intl. Conf. on Document Analysis and Recognition, pp. 80–85, 2005.
[12] Z. Saidane, C. Garcia, and J. L. Dugelay, "The image text recognition graph (iTRG)," in Proc. Intl. Conf. on Multimedia and Expo, pp. 266–269, 2009.
[13] K. Wang and S. Belongie, "Word spotting in the wild," in Proc. European Conf. on Computer Vision, September 2010.
[14] L. Neumann and J. Matas, "A method for text localization and recognition in real-world images," in Proc. Asian Conf. on Computer Vision (R. Kimmel, R. Klette, and A. Sugimoto, eds.), vol. 6494 of Lecture Notes in Computer Science, pp. 770–783, 2010.
[15] T. Yamazoe, M. Etoh, T. Yoshimura, and K. Tsujino, "Hypothesis preservation approach to scene text recognition with weighted finite-state transducer," in Proc. Intl. Conf. on Document Analysis and Recognition, pp. 359–363, 2011.
[16] K. Elagouni, C. Garcia, F. Mamalet, and P. Sébillot, "Combining multi-scale character recognition and linguistic knowledge for natural scene text OCR," in Intl. Workshop on Document Analysis Systems, pp. 120–124, March 2012.
[17] A. Mishra, K. Alahari, and C. V. Jawahar, "Top-down and bottom-up cues for scene text recognition," in Proc. Conf. on Computer Vision and Pattern Recognition, pp. 2687–2694, 2012.
[18] A. Shahab, F. Shafait, and A. Dengel, "ICDAR 2011 robust reading competition challenge 2: Reading text in scene images," in Proc. Intl. Conf. on Document Analysis and Recognition, pp. 1491–1496, 2011.
[19] K. Wang, B. Babenko, and S. Belongie, "End-to-end scene text recognition," in Proc. Intl. Conf. on Computer Vision, pp. 1457–1464, 2011.
[20] J. J. Weinman, E. Learned-Miller, and A. Hanson, "Scene text recognition using similarity and a lexicon with sparse belief propagation," IEEE Trans. on Pattern Anal. Mach. Intell., vol. 31, pp. 1733–1746, Oct 2009.
[21] P. Shivakumara, S. Bhowmick, B. Su, C. L. Tan, and U. Pal, "A new gradient based character segmentation method for video text recognition," in Proc. Intl. Conf. on Document Analysis and Recognition, pp. 126–130, 2011.
[22] J. J. Weinman, E. Learned-Miller, and A. Hanson, "A discriminative semi-Markov model for robust scene text recognition," in Proc. Intl. Conf. on Pattern Recognition, Dec 2008.
[23] J. J. Weinman, Unified Detection and Recognition for Reading Text in Scene Images. PhD thesis, University of Massachusetts Amherst, 2008.
[24] S. Sarawagi and W. W. Cohen, "Semi-Markov conditional random fields for information extraction," in Advances in Neural Information Processing Systems (NIPS), pp. 1185–1192, 2005.
[25] J. D. Ferguson, "Variable duration models for speech," in Proc. of the Symp. on Application of Hidden Markov Models to Text and Speech (J. D. Ferguson, ed.), pp. 143–179, 1980.
[26] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, pp. 2278–2324, November 1998.
[27] X. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall PTR, 2001.
[28] C.-L. Liu, M. Koga, and H. Fujisawa, "Lexicon-driven segmentation and recognition of handwritten character strings for Japanese


address reading," IEEE Trans. on Pattern Anal. Mach. Intell., vol. 24, pp. 1425–1437, November 2002.
[29] H. I. Koo and N. I. Cho, "Text-line extraction in handwritten Chinese documents based on an energy minimization framework," IEEE Trans. on Image Process., vol. 21, pp. 1169–1175, March 2012.
[30] Y.-F. Pan, X. Hou, and C.-L. Liu, "Text localization in natural scene images based on conditional random field," in Proc. Intl. Conf. on Document Analysis and Recognition, pp. 6–10, 2009.
[31] H. Zhang, C. Liu, C. Yang, X. Ding, and K. Wang, "An improved scene text extraction method using conditional random field and optical character recognition," in Proc. Intl. Conf. on Document Analysis and Recognition, pp. 708–712, 2011.
[32] J. Zhang and R. Kasturi, "Character energy and link energy-based text extraction in scene images," in Proc. Asian Conf. on Computer Vision, vol. 6493 of Lecture Notes in Computer Science, pp. 308–320, 2010.
[33] K. Kita and T. Wakahara, "Binarization of color characters in scene images using k-means clustering and support vector machines," in Proc. Intl. Conf. on Pattern Recognition, pp. 3183–3186, 2010.
[34] C. Zeng, W. Jia, and X. He, "An algorithm for colour-based natural scene text segmentation," in Camera-Based Document Analysis and Recognition, vol. 7139 of Lecture Notes in Computer Science, pp. 58–68, 2012.
[35] M. S. Cho, J.-H. Seok, S. Lee, and J. H. Kim, "Scene text extraction by superpixel CRFs combining multiple character features," in Proc. Intl. Conf. on Document Analysis and Recognition, pp. 1034–1038, 2011.
[36] A. Mishra, K. Alahari, and C. Jawahar, "An MRF model for binarization of natural scene text," in Proc. Intl. Conf. on Document Analysis and Recognition, pp. 11–16, 2011.
[37] Z. Tu and S.-C. Zhu, "Image segmentation by data-driven Markov chain Monte Carlo," IEEE Trans. on Pattern Anal. Mach. Intell., vol. 24, pp. 657–673, May 2002.
[38] J. Feild and E. Learned-Miller, "Scene text recognition with bilateral regression," Tech. Rep. UM-CS-2012-021, Computer Science Research Center, University of Massachusetts Amherst, Amherst, MA 01003-4601, 2012.
[39] M.-K. Hu, "Visual pattern recognition by moment invariants," IRE Trans. on Inf. Theory, vol. 8, pp. 179–187, February 1962.
[40] P. M. Williams, "Bayesian regularization and pruning using a Laplace prior," Neural Computation, vol. 7, pp. 117–143, 1995.
[41] J. Jung, S. Lee, M. S. Cho, and J. H. Kim, "Touch TT: Scene text extractor using touchscreen interface," ETRI Journal, vol. 33, no. 1, pp. 78–88, 2011.
[42] J. J. Hull, Document Image Skew Detection: Survey and Annotated Bibliography, vol. Document Analysis Systems II of Series in Machine Perception and Artificial Intelligence, pp. 40–64. 1998.
[43] Y. Bengio and Y. LeCun, "Word normalization for on-line handwritten word recognition," in Proc. Intl. Conf. on Pattern Recognition, pp. 409–413, 1994.
[44] T. Caesar, J. M. Gloger, and E. Mandler, "Estimating the baseline for written material," in Proc. Intl. Conf. on Document Analysis and Recognition, vol. 1, pp. 382–385, August 1995.
[45] L. Neumann and J. Matas, "Text localization in real-world images using efficiently pruned exhaustive search," in Proc. Intl. Conf. on Document Analysis and Recognition, pp. 687–691, 2011.
[46] D. J. MacKay, "Bayesian interpolation," Neural Computation, vol. 4, pp. 415–448, 1992.
[47] C. Jacobs, P. Y. Simard, P. Viola, and J. Rinker, "Text recognition of low-resolution document images," in Proc. Intl. Conf. on Document Analysis and Recognition, pp. 695–699, 2005.
[48] M.-Y. Chen, A. Kundu, and S. N. Srihari, "Variable duration hidden Markov model and morphological segmentation for handwritten word recognition," IEEE Trans. on Image Process., vol. 4, pp. 1675–1688, December 1995.
[49] E. P. Simoncelli and W. T. Freeman, "The steerable pyramid: A flexible architecture for multi-scale derivative computation," in Proc. Intl. Conf. on Image Processing, vol. 3, pp. 444–447, 1995.
[50] Q.-F. Wang, F. Yin, and C.-L. Liu, "Handwritten Chinese text recognition by integrating multiple contexts," IEEE Trans. on Pattern Anal. Mach. Intell., vol. 34, no. 8, pp. 1469–1481, 2012.
[51] J. J. Weinman, "Typographical features for scene text recognition," in Proc. Intl. Conf. on Pattern Recognition, pp. 3987–3990, 2010.
[52] S. M. Lucas, G. Patoulas, and A. C. Downton, "Fast lexicon-based word recognition in noisy index card images," in Proc. Intl. Conf. on Document Analysis and Recognition, vol. 1, pp. 462–466, 2003.
[53] P. Y. Simard, D. Steinkraus, and J. C. Platt, "Best practices for convolutional neural networks applied to visual document analysis," in Proc. Intl. Conf. on Document Analysis and Recognition, vol. 2, pp. 958–963, 2003.
[54] C. Sutton and A. McCallum, "Piecewise training for structured prediction," Machine Learning, vol. 77, no. 2-3, pp. 165–194, 2009.
[55] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang, "A tutorial on energy-based learning," in Predicting Structured Data (G. Bakir, T. Hofman, B. Schölkopf, A. Smola, and B. Taskar, eds.), MIT Press, 2006.
[56] C. Yi and Y. Tian, "Text string detection from natural scenes by structure-based partition and grouping," IEEE Trans. on Image Process., vol. 20, no. 9, pp. 2594–2605, 2011.
[57] J. L. McClelland and D. E. Rumelhart, "An interactive activation model of context effects in letter perception: Part 1. An account of basic findings," Psychological Review, vol. 88, pp. 375–407, September 1981.

Jerod J. Weinman (M'08) received the B.S. degree in computer science and mathematics from Rose-Hulman Institute of Technology in 2001, and the M.Sc. and Ph.D. degrees in computer science from the University of Massachusetts Amherst in 2005 and 2008, respectively. He is an Assistant Professor at Grinnell College, with research interests in computer vision and machine learning. He is a member of the IEEE, ACM, ACM SIGCAS, and ACM SIGCSE.

Zachary Butler received the B.A. degree in computer science and mathematics from Grinnell College in 2013. He is pursuing the Ph.D. degree at the University of California, Irvine, with research interests at the intersection of machine learning and computer vision.

Dugan Knoll received the B.A. degree in computer science from Grinnell College in 2012. He is currently a Web Developer at Salem Health in Salem, Oregon, and plans to attend a graduate program in computer science in the future.

Jacqueline Feild is a Ph.D. candidate in the School of Computer Science at the University of Massachusetts Amherst. She works in the Computer Vision lab with Dr. Erik Learned-Miller. She received her M.S. degree from the University of Massachusetts Amherst in 2010 and her B.S. degree from Loyola University Maryland in 2007.