
Machine Learning Final Project: Handwritten Sanskrit Recognition using a Multi-class SVM with K-NN Guidance

Yichang Shih, [email protected]

Donglai Wei, [email protected]

Abstract

We develop an optical character recognition (OCR) engine for handwritten Sanskrit using a two-stage classifier. Within the standard OCR pipeline, we focus on the classification problem, assuming characters have been decently preprocessed. One challenge we face is that Sanskrit has about a hundred core characters: model-driven methods such as the Support Vector Machine (SVM) must search an exponentially growing combinatorial model space during training, while data-driven methods such as k-nearest neighbor (kNN) become computationally costly during testing. To address this challenge, we propose a two-stage multiclassifier, using a non-parametric method to reduce the model space to search, and parametric models to relieve the computational burden with better generalization. In the first stage, we apply kNN to coarsely assign the test data to a group of k possible classes; in the second stage, a multiclassifier over those k classes labels the sample. Our method is fully automatic, highly accurate, and computationally efficient.

1. Introduction

Sanskrit scholars have been endeavoring to integrate the gathered terabytes of Sanskrit manuscript images into a digital library [1] for worldwide reach.

Besides merely displaying the pages, it is desirable to translate the images into Unicode for further editing, and current commercial OCR software for Sanskrit is far from satisfactory, as shown in Fig. 2.

We find Sanskrit OCR an interesting problem for three reasons:

Firstly, the OCR problem itself lies between one-dimensional natural language processing (NLP) and the three-dimensional real world projected down to images in computer vision (CV).

Also, Sanskrit characters lie between an alphabetical language like English and a stroke-based language like Chinese.

Lastly, these Sanskrit manuscripts were written by well-trained ancient calligraphers, so there is high consistency in character shape between samples, as shown in Fig. 1.

With the above properties in mind, we design an OCR system that can automatically map Sanskrit to Unicode.

Our database contains about one hundred different Sanskrit characters, as shown in Fig. 3.

In the machine learning community, there are three typical approaches to multi-class problems: generative-model-based, SVM-based, and non-parametric. In OCR, the physical relationship between feature and label is unclear, so generative models are less practical. A flat multiclassifier over one hundred classes reaches only 71% accuracy, which does not satisfy our requirement, and is also computationally expensive in the training phase. Non-parametric models, such as k-nearest neighbor (kNN), are good at separating visually very different characters, but cannot separate samples with only minor differences, such as the images in Fig. 4.

We take advantage of both the SVM-based and the kNN methods, and propose a two-stage multiclassifier for about one hundred classes. In the first stage, we use kNN to assign an input query sample to a group containing k possible labels. Characters with labels in this group are visually similar and cannot be distinguished by the non-parametric method, as illustrated in Fig. 4. In the second stage, we use an SVM multiclassifier to classify among the labels in the group. Because our database is written by a well-trained calligrapher, we can easily design a set of robust features for the SVM. In our work, we use 26 features, ranging from image statistics to local features such as corners on characters, to train 65 multiclassifiers of k classes each. Our recognition rate is 85%, which is better than using SVMs only.

Our paper makes the following contributions:

1. The first system that works for handwritten Sanskrit.

2. We apply a two-stage classifier that combines a non-parametric method and SVMs. This outperforms using SVMs alone.

2. Related Work

2.1. OCR Pipeline

Optical character recognition is a combination of natural language processing (NLP) and computer vision. In our application, language modeling is less important since we have more clues from the image likelihood, and the vision clues are more robust since the input is intrinsically a 2D image.

Below is the pipeline of a typical OCR system [4]:

Figure 1: A page of a Sanskrit manuscript in our database used for our OCR. The documents are written by a well-trained calligrapher, so there is high consistency between instances of the same character, nearly as good as in a printed version.

Figure 3: In our database, nearly 70 different characters are used.

Figure 4: A group of 6 characters that kNN cannot classify. They have very similar visual shapes.

1. In low-level vision, the system finds characters by layout analysis and converts them to grayscale (preprocessing).

2. In mid-level vision, the article is separated into unconnected characters.

3. In high-level vision, test characters are classified based on the model from the training data. This is referred to as recognition.

4. In NLP, an n-gram HMM model is fitted to correct certain ambiguities.

Here, we focus on the classification problem, assuming all training and testing characters have already been correctly cropped out, and we leave out the NLP part, which could further improve performance.

2.2. Sanskrit Recognition

Some commercial software, such as Indsenz [2], has implemented Sanskrit OCR, but its performance on our dataset is not satisfactory, as shown in Fig. 2.

2.3. Classification

We review state-of-the-art classification algorithms in machine learning, and state the reason we propose a two-stage, jointly non-parametric and discriminative classifier.

Figure 2: Recognition result of Indsenz, a commercial Sanskrit OCR software. The upper black-and-white image is the input article, and the recognition results are marked in yellow. A question mark means the character cannot be recognized. We cannot test the recognition rate because the source code is not open.

2.3.1 Definition

We first review the general goal of machine learning algorithms: a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

For the classification problem, the task T is to tell different things apart, and the experience E is to learn a suitable representation of the true data distribution based on the performance measured by P on the training data. It can be viewed as an optimization over a hypothesis space $\mathcal{H}$. In the parametric setting, the objective function is given by $f(w;\theta)$, where $w$ is the model parameter and $\theta$ a given sample. An empirical objective $\hat{F}(w)$ can be written as
$$\hat{F}(w) = \hat{E}[f(w;\theta)] = \frac{1}{m}\sum_{i=1}^{m} f(w;\theta_i),$$
where $\theta_1, \theta_2, \ldots, \theta_m$ is a sequence of observed samples. It is normally assumed that these samples are drawn i.i.d. from some unknown distribution $\mathcal{D}$, and therefore the stochastic objective $F(w)$ is often more desirable:
$$F(w) = E_{\theta\sim\mathcal{D}}[f(w;\theta)].$$
For example, if we take $f(w;\theta) = \max\{0, 1 - y\langle w, x\rangle\} + \frac{\lambda}{2}\|w\|_2^2$ with $\theta = (x, y)$, we arrive at the famous SVM, with a weighted L2-norm regularizer.
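For concreteness, the following is a minimal sketch (not part of the original project code) of how the empirical objective above could be evaluated in Python with NumPy; the names hinge_objective, X, y, and lam are illustrative stand-ins for f, the observed samples, their labels, and the regularization weight.

```python
import numpy as np

def hinge_objective(w, X, y, lam):
    """Empirical SVM objective: mean hinge loss plus (lam/2)*||w||^2.

    X: (m, d) array of samples, y: (m,) array of labels in {-1, +1}.
    """
    margins = y * (X @ w)                     # y_i <w, x_i>
    hinge = np.maximum(0.0, 1.0 - margins)    # max{0, 1 - y_i <w, x_i>}
    return hinge.mean() + 0.5 * lam * np.dot(w, w)

# toy usage on random data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 26))                # e.g. 26-dimensional features
y = np.sign(rng.normal(size=100))
w = np.zeros(26)
print(hinge_objective(w, X, y, lam=1e-2))     # equals 1.0 for w = 0
```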

2.3.2 H: Parametric vs Nonparametric

In terms of modeling the hypothesis space, classification algorithms can be divided into parametric and nonparametric ones. In the parametric setting, where the hypothesis space is a family of parametrized functions, no matter whether Frequentist, Bayesian, or learning-theoretic, and no matter whether discriminative, generative, or algorithmic, we end up optimizing an objective function composed of a bonus for the likelihood of the data under the model and a penalty for the complexity of the model. For example, SVM trades off the hinge loss on the training data against an L2-norm penalty on the model weights.

In the nonparametric setting, however, we try to make use of the empirical probability distribution of the data itself.

2.3.3 H: Single vs Structured

Instead of searching for the single best function f in one subspace of H with model selection, researchers have been seeking efficient ways to combine different functions. We can consider the single-function approach as the 0th-order approximation in H, and the voting/bagging/boosting schemes that linearly combine weighted functions as the 1st-order approach. Second-order methods, like Decision Trees/Random Forests/feed-forward neural nets (multilayer perceptrons), try to come up with a 2D decomposition of H.

2.3.4 H: Binary-class vs Multi-class

In the practice of binary classification, boosting and SVM generally work well enough with a carefully chosen kernel/feature. However, so far there is no satisfactory methodology for multi-class problems. One natural approach is to reduce the multi-class problem to binary ones, but neither the one-vs-one nor the one-vs-all scheme gives strong discriminative power. Even if we try more tasks to fill out a coding matrix, introducing redundancy to reduce errors, the problem becomes how to efficiently choose a powerful set of tasks, which is proven to be NP-hard. Another approach is to jointly model all classes instead, as in hierarchical SVM [5] and boosting SVM [6].

2.3.5 Besides H: Feature Space

So far, we have been focusing on different constructions in the hypothesis space H, which takes in the data generically.

In different applications, however, feature design matters: it encodes meta-knowledge of the task through the selection and combination of features.

3. Algorithm

In our work, most characters are well-structured and well-separated, which is different from general vision recognition tasks where objects have much more variation. On the MNIST handwritten digit recognition task, people find, surprisingly, that some simple kNN matching algorithms perform much better than most crafted SVM/Boosting/Neural Net approaches. Thus quick, explicit nonparametric methods can work better than parametric ones that do costly optimization while implicitly trading off terms of a joint objective function. But in order to further improve the performance by resolving the deformation problem, we need to add some ingredients of SVM to learn from ambiguous cases. Thus, we here plan to use kNN as an initial guess and build ensembles of SVMs around sets of classes that are close to each other. We try to combine not only the robustness of non-parametric modeling and the expressiveness of parametric modeling, but also their computational efficiency during training and testing, respectively.

3.1. kNN

Though conceptually easy, designing an effective distance metric for kNN can be hard. One natural approach is to use explicit distance functions (like L1, L2, or correlation) in the feature space. The baseline kNN tested here uses the pixel values as the feature and flattens the two-dimensional image into a one-dimensional vector. Apparently such a metric does not capture the two-dimensional structure of the image and is consequently not robust to deformation of the characters: a simple affine transformation can already produce a huge distance. One workaround, as tangent distance [8] does, is to incorporate invariance into the metric on the feature space. But the increased complexity of the metric function still does not help capture non-affine deformation of the image, which is common in practice.
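A minimal sketch of the pixel-space baseline kNN mentioned above, assuming the images are stored as NumPy arrays of equal size; the function name and parameters are illustrative rather than the project's actual implementation.

```python
import numpy as np

def knn_predict(train_imgs, train_labels, test_img, k=5):
    """Baseline kNN: flatten images to vectors and use L2 distance.

    train_imgs: (n, H, W) array, train_labels: (n,) array, test_img: (H, W) array.
    Returns the majority label among the k nearest training images.
    """
    train_vecs = train_imgs.reshape(len(train_imgs), -1).astype(float)
    test_vec = test_img.reshape(-1).astype(float)
    dists = np.linalg.norm(train_vecs - test_vec, axis=1)   # L2 distances
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(train_labels[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```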

Recently, instead of searching for an explicit analytic form in the hypothesis space, people have been looking into metrics based on matching scores. [3] samples points on the image boundary and combines the bipartite matching score with other penalties into the final distance. However, the weight parameters in the distance have to be identified by cross-validation at huge computational cost. Surprisingly, [7] reports that simply matching local windows of two image contexts works extremely well on the digit recognition task. Below is a simple description of the algorithm. For each point x in the test image context, we search over a window of fixed size s1, say 5 by 5, around the corresponding point y (matched up in position) in the reference image. The local matching cost for this point is then defined as the minimum distance computed above with an explicit function f(x, y). To make it more robust to outliers, we replace the distance of two points with the sum of the distances of matched-up points within a small window around them. Now we have a local matching cost for each point in the test image context, and the distance between two image contexts is defined as the sum over all local matching costs. On the MNIST dataset, [7] uses the gradient field of the image, which captures local two-dimensional structure, as the image context. The simple choice of f(x, y) as the L2 norm produces the third-best test result so far.

In our experiments, we implement the algorithm of [7] in a slightly more efficient way by avoiding repetitive computation in the for loops. Instead of costly cross-validation, we set the parameters to be the same as those mentioned above.
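The following is a rough sketch of the local-window matching distance described above, written in Python for illustration rather than taken from our MATLAB code. Here warp plays the role of the fixed search window size s1 (warp=2 gives a 5-by-5 window), patch is the small window used to make the point distance robust, and the image contexts (e.g., gradient fields) are assumed to be precomputed as (H, W, C) arrays of equal shape.

```python
import numpy as np

def local_matching_distance(test_ctx, ref_ctx, warp=2, patch=1):
    """Sum of per-pixel best local matching costs between two image contexts.

    For each interior test pixel, search a (2*warp+1)^2 window in the
    reference context and keep the minimum squared-L2 patch distance.
    Border pixels are skipped for simplicity.
    """
    H, W, _ = test_ctx.shape
    total = 0.0
    for i in range(patch, H - patch):
        for j in range(patch, W - patch):
            a = test_ctx[i - patch:i + patch + 1, j - patch:j + patch + 1]
            best = np.inf
            for di in range(-warp, warp + 1):
                for dj in range(-warp, warp + 1):
                    ci, cj = i + di, j + dj
                    if not (patch <= ci < H - patch and patch <= cj < W - patch):
                        continue
                    b = ref_ctx[ci - patch:ci + patch + 1, cj - patch:cj + patch + 1]
                    best = min(best, float(np.sum((a - b) ** 2)))
            if np.isfinite(best):
                total += best
    return total
```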

Though kNN can achieve amazing performance with little knowledge representation of the task, it has a bottleneck in generalizability: it requires a huge amount of data to gain further improvement. This is where we need the SVM to infer the hidden discriminative rules and generalize beyond the limited training data after narrowing down the search range.

3.2. Multiclass SVM

Feature design. We design 26 features for classification in this phase. The 26 features can be roughly divided into 2 categories: global statistics features and local shape features. We have 9 global features, as described below:

1. Number of horizontal lines.

2. Number of vertical lines.

3. Euler number of the image.

4. Center of mass along x direction.

5. Center of mass along y direction.

6. Standard deviation along x direction.

7. Standard deviation along y direction.

8. The height-to-width ratio of the bounding box of the image.

9. Area of the written part in the image.

Figure 5: The Euler numbers of these 3 images are, from left to right, 0, 1, and -1.

To compute the number of horizontal lines, we project the input character image onto the y-axis to generate a one-dimensional array P. Then we binarize P by thresholding: for each element i of P, if P(i) is smaller than 0.6*(image width), then P(i) = 0, else P(i) = 1. Here 0.6*(image width) represents the minimal length of a horizontal line. We then count the number of nonzero elements in P and divide by a parameter w that represents the width of a stroke, to obtain the number of horizontal lines. Vertical lines are counted in the same way. In our work, w is chosen to be 15.
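A sketch of the horizontal-line count just described, assuming the character image is already binarized with ink pixels equal to 1; the parameter names frac and stroke_width stand for the 0.6 factor and the stroke width w = 15, and the integer truncation is one possible rounding choice.

```python
import numpy as np

def count_horizontal_lines(char_img, frac=0.6, stroke_width=15):
    """Count horizontal lines from the y-axis projection of a binary image.

    A row belongs to a horizontal line if its ink count reaches `frac` of
    the image width; contiguous such rows are attributed to one stroke of
    thickness `stroke_width`.
    """
    height, width = char_img.shape
    proj = char_img.sum(axis=1)          # projection P onto the y-axis
    mask = proj >= frac * width          # binary threshold on P
    return int(np.count_nonzero(mask) / stroke_width)
```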

The Euler number, e, is defined as follows:

e = # of objects − # of holes in the objects (1)

Fig. 5 shows some examples of Euler numbers from our data set. The Euler number is a topological measure, and so is invariant to handwriting deformation.
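Equation (1) can be computed directly from connected components; below is one possible sketch using SciPy, where holes are taken to be background components that do not touch the (padded) border. Connectivity conventions vary, so this is an approximation for illustration rather than our exact implementation.

```python
import numpy as np
from scipy import ndimage

def euler_number(char_img):
    """Euler number e = (# objects) - (# holes), cf. Eq. (1).

    char_img: 2-D binary array with foreground (ink) pixels set to 1.
    Padding keeps the outer background as a single connected component,
    so holes are all remaining background components.
    """
    img = np.pad(char_img.astype(bool), 1, constant_values=False)
    _, n_objects = ndimage.label(img)        # foreground components
    _, n_background = ndimage.label(~img)    # background components
    n_holes = n_background - 1               # all but the outer background
    return n_objects - n_holes
```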

Our system uses 16 local, shape-aware features. For each input character image, we segment the image into 4 non-overlapping sub-images: top-right, top-left, bottom-left, and bottom-right. For each sub-image, we compute the normalized cross-correlation (NCC) between the sub-image and each of the 4 templates in Fig. 6, which detect corners toward the 4 directions, respectively. NCC is defined as

$$\mathrm{NCC}_{s,t}(u, v) = \frac{\sum_{x,y}[s(x,y)-\bar{s}]\,[t(x-u,y-v)-\bar{t}]}{\left\{\sum_{x,y}[s(x,y)-\bar{s}]^2 \sum_{x,y}[t(x-u,y-v)-\bar{t}]^2\right\}^{0.5}}$$

where s is the sub-image and t is the template. If the maximal NCC is greater than the threshold, the corresponding feature is 1, otherwise it is 0. For example, if the maximal NCC between the s-th sub-image and the t-th template is greater than the threshold, then the (4s + t)-th feature is 1, which physically means that the s-th sub-image contains the t-th pattern. In our work, we set the threshold to 0.9, which works very well.

Figure 6: The four templates used for corner detection in our system.
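A sketch of the 16 local features: the maximal NCC of each corner template over each quadrant is thresholded at 0.9. The templates of Fig. 6 are assumed to be supplied as small 2-D arrays, and the brute-force NCC below (sliding the template over the quadrant) is for illustration only.

```python
import numpy as np

def max_ncc(sub, tmpl):
    """Maximal normalized cross-correlation of template `tmpl` over `sub`."""
    H, W = sub.shape
    h, w = tmpl.shape
    t = tmpl - tmpl.mean()
    t_norm = np.sqrt((t ** 2).sum())
    best = -1.0
    for u in range(H - h + 1):
        for v in range(W - w + 1):
            s = sub[u:u + h, v:v + w] - sub[u:u + h, v:v + w].mean()
            denom = np.sqrt((s ** 2).sum()) * t_norm
            if denom > 0:
                best = max(best, float((s * t).sum() / denom))
    return best

def corner_features(char_img, templates, thresh=0.9):
    """16 binary features: 1 iff quadrant s contains corner pattern t."""
    H, W = char_img.shape
    quads = [char_img[:H // 2, :W // 2], char_img[:H // 2, W // 2:],
             char_img[H // 2:, :W // 2], char_img[H // 2:, W // 2:]]
    feats = np.zeros(16, dtype=int)
    for s, q in enumerate(quads):
        for t, tmpl in enumerate(templates):
            feats[4 * s + t] = int(max_ncc(q, tmpl) > thresh)
    return feats
```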

SVM classifier. We use a multiclassifier of k classes to finely classify the input character image. We tried both one-against-all and one-against-one methods for classification, and obtain the same performance, about an 85% recognition rate. The first stage generates 65 groups, so we train 65 multiclassifiers in the second stage.

4. Experiment

4.1. Training Set

Starting from the raw Sanskrit manuscript images, we first do standard page layout analysis to find the positions and statistics of potential characters. Then we binarize the images, making characters stand out from the background, with a hand-crafted thresholding method based on the information from the analysis above. Taking advantage of the property of the manuscript that characters tend to be disjoint, we crop out the characters by finding connected components. We also threshold the size of the image patches for fewer false alarms, and segment an image patch when several characters are accidentally connected to each other. So far, we have the image bank of Sanskrit characters.

Unlike OCR for printed text or for handwritten English, which are much studied, there are no public datasets of Sanskrit characters with ground-truth labels. As can be imagined, there is no single trick to build up the whole dataset from scratch around human knowledge about discriminative patterns of shape and structure. We managed to do so by combining a top-down approach that decomposes the problem into smaller ones with a bottom-up method that groups characters conservatively. In the first stage, we manually pick discriminative features, like structural patterns, to roughly separate different groups of characters. Since none of the features are robust enough to help us recursively build noiseless training data, in the second stage we developed an interactive GUI for quick relabeling and merging of same-character groups with binary choices. After the training data is first built, we run unsupervised algorithms like K-means and hierarchical clustering for anomaly detection. Within a few steps of refinement, the training dataset generally has acceptable quality for conducting experiments.

In the end, we have around 6,000 image patches of characters, which can be grouped into around 200 groups. We then manually choose 64 of the character classes, with around 2,500 instances, as the training data, and create the test data with around 600 instances.

4.2. kNN training

Firstly, for each character class, we try to figure out its k confusion neighbours (the closest character classes) in order to train the SVMs in advance. One way is to do leave-one-out cross-validation on the training data and see which wrong labels may be returned by the rest of the training data. In practice, however, we find that the character classes are well grouped in nature, and we may simply use the mean of a character class to figure out its k confusion neighbours.
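A sketch of this class-mean shortcut: for each class, its k confusion neighbours are taken to be the classes whose mean feature vectors are nearest. Function and variable names are illustrative.

```python
import numpy as np

def confusion_neighbours(features, labels, k=5):
    """For each class, return its k closest classes by class-mean distance.

    features: (n, d) array, labels: (n,) array of class labels.
    Returns a dict mapping each class to a list of its k nearest classes.
    """
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    out = {}
    for i, c in enumerate(classes):
        d = np.linalg.norm(means - means[i], axis=1)
        d[i] = np.inf                    # exclude the class itself
        out[c] = list(classes[np.argsort(d)[:k]])
    return out
```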

Also, during testing, in order to obtain a coarse initial guess, we may not need to search over all the training data for each test image. Thus we need to select representative images inside each character class which cover most of the variation of the class. One way to do this is greedy stage-forward fitting: adding the image that minimizes the mutual information among the selected set of images.

4.3. SVM training

We tried both one-against-all and one-against-one methods for classification, and obtain the same performance. In the training phase, we use both polynomial and Gaussian kernel functions, and also obtain the same performance. We set the soft-margin parameter to 1000. The training takes about 1 minute for each classifier, using MATLAB on an 8-core machine, and so about 1 hour for the whole set of 70 classifiers.
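As an illustrative sketch (using scikit-learn rather than our MATLAB setup), one second-stage SVM per kNN group could be trained as follows, with a polynomial kernel and soft-margin parameter C = 1000 as above; the 26-D feature matrix, labels, and group definitions are assumed to come from the earlier stages.

```python
import numpy as np
from sklearn.svm import SVC

def train_group_svms(features, labels, groups, kernel="poly"):
    """Train one multiclass SVM per confusion group (illustrative sketch).

    features: (n, 26) feature matrix, labels: (n,) class labels,
    groups: iterable of sets of class labels produced by the kNN stage.
    """
    svms = {}
    for group in groups:
        mask = np.isin(labels, list(group))
        clf = SVC(kernel=kernel, C=1000)   # soft-margin parameter as above
        clf.fit(features[mask], labels[mask])
        svms[frozenset(group)] = clf
    return svms

def classify(feature, candidate_classes, svms):
    """Second stage: route the sample to the SVM of its kNN candidate group."""
    clf = svms[frozenset(candidate_classes)]
    return clf.predict(feature.reshape(1, -1))[0]
```

Note that scikit-learn's SVC uses a one-against-one scheme internally; a one-against-all variant could be obtained by wrapping it in sklearn.multiclass.OneVsRestClassifier.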

5. Results

Our training data contains 65 characters and about 2,500 training samples in total. The testing data contains 300 different characters. The correct recognition rate is 85.6% using the one-against-all multiclassifier and 84.57% using the one-against-one multiclassifier, respectively. Compared with using SVM only, which yields an accuracy of 72%, our method shows significant improvement. But for practical use, this is still not enough.

6. Discussion

The proposed method outperforms simply using SVM by about 10%. However, our method is still not good enough for practical use. Constrained by the difficulty of obtaining ground-truth data, we here only conduct experiments on a relatively small dataset. In future work, we hope to train this method with more data and test it further.

References

[1] http://sanskritlibrary.org/.

[2] http://www.indsenz.com/int/.

[3] S. Belongie, J. Malik, and J. Puzicha. Shape context: A new descriptor for shape matching and object recognition. Advances in Neural Information Processing Systems, pages 831–837, 2001.

[4] R. Casey and E. Lecolinet. A survey of methods and strategies in character segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(7):690–706, 1996.

[5] N. Cesa-Bianchi, C. Gentile, and L. Zaniboni. Hierarchical classification: combining Bayes with SVM. In Proceedings of the 23rd International Conference on Machine Learning, pages 177–184. ACM, 2006.

[6] T. Gao and D. Koller. Multiclass boosting with hinge loss based on output coding. 2011.

[7] D. Keysers, T. Deselaers, C. Gollan, and H. Ney. Deformation models for image recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(8):1422–1435, 2007.

[8] P. Simard, Y. LeCun, J. Denker, and B. Victorri. Transformation invariance in pattern recognition: tangent distance and tangent propagation. Neural Networks: Tricks of the Trade, pages 549–550, 1998.