

Multilingual Scene Character Recognition System using Sparse Auto-Encoder for Efficient Local Features Representation in Bag of Features

Maroua Tounsi*§, Ikram Moalla*†, Frank Lebourgeois‡, Adel M. Alimi*

*REsearch Groups in Intelligent Machines Laboratory, ENIS-Sfax, Tunisia

§Higher Institute of Computer Science and Communication Techniques, University of Sousse, Tunisia

†Al Baha University, Saudi Arabia

‡Laboratoire d'InfoRmatique en Images et Systèmes d'information (LIRIS), INSA of Lyon, France

Emails: [email protected], [email protected], [email protected], [email protected]

Abstract—The recognition of text in camera-captured images has become an important issue for a great deal of research during the past few decades. This gave birth to Scene Character Recognition (SCR), an important step in the scene text recognition pipeline. In this paper, we extend the Bag of Features (BoF)-based model using a stacked sparse auto-encoder technique for representing local features, targeting accurate SCR across different languages. In the feature coding step, a Sparse Auto-encoder (SAE)-based strategy is applied to enhance the representative and discriminative abilities of image features. This technique provides a more efficient feature representation and therefore a better recognition accuracy, compared to other feature coding techniques. The coding step is followed by Spatial Pyramid Matching (SPM) and max-pooling, which keep the spatial information and form the global image signature. Our system was evaluated extensively on six scene character datasets covering five different languages. The experimental results prove the efficiency of our system for multilingual SCR.

Index Terms—Scene Character Recognition, Multilingual Scene Text, Bag of Features, Features Learning, Sparse Auto-encoder.

1 INTRODUCTION

The displayed amounts of text around us, such as text on road signs, shop names and product advertisements, present information to people and constitute an essential tool for interacting with their environment. Hence, the need for automatic OCR in the wild has become mandatory, specifically for a growing number of vision applications, such as automatic text translators used by tourists in foreign countries, automatic sign recognition and vision-based navigation, among others. One may find oneself in a foreign-language context and want to quickly find out what exactly is written in a specific or urgent situation. Such a need is even greater in foreign countries or for the special use of blind people. Scene Character Recognition (SCR) is an important step in the text recognition pipeline.

To meet this need, a first reflection consisted of using a conventional recognition system (OCR), which typically achieves character recognition rates over 99% on clean documents. This attempt failed catastrophically in solving SCR problems. Therefore, a special focus was put on SCR issues and a great deal of research in this field has been proposed during the past few years [1-6]. However, unlike text recognition in scanned documents, SCR did not yield acceptable results when using a traditional OCR system. Although the final text recognition result could be improved by using a language prior or some lexicon-based post-processing, OCR systems are not suited to the same scale of problematic classes. The task of SCR has been found to be more difficult and more complex than that of recognizing text in scanned documents. Indeed, in such documents, characters are almost the same size within the same paragraph or title and do not show any stretch, rotation or contrast variation.

While many works were interested in recognizing Arabic script [7-10], Arabic script recognition has not yet been studied in the field of natural scene images, except in our early work presented in [11-13].

On the other hand, scene characters are very specific and have variable sizes, forms, styles and colors. They can be illuminated, stretched or distorted, with a non-planar surface and a changeable background (see figure 1). Scene characters may therefore consist of a very variable set of shapes, which induces a wide intra-class variability (for the same character) in addition to inter-class variability (between characters). Furthermore, in the absence of a specific context, the words to be recognized may belong to different languages and/or scripts. All these particularities are added as extra challenges in the SCR field of research (see figure 3). This is our main underlying motivation for looking for a robust system invariant to scale, rotation, stretch, brightness, and so on, which is effective for the SCR domain in a multilingual and multi-script context.

Fig. 1: Differences in challenges in recognizing text in (a) scanned documents, (b) artificial text images and (c) scene text images (huge variance in size, font, color, non-planar surface, perspective distortion).


Most recently published methods consider scene characters as a special category of objects and therefore share many of the techniques used to recognize objects. The BoF framework, in particular, has proven to be an efficient model for object recognition. The idea was thus to take advantage of the feature extraction and representation methods that perform well in object recognition tasks to achieve SCR. However, a major problem hinders the performance of such a pipeline: the information loss introduced by feature coding may result in a loss of discrimination between classes as well as a loss of spatial information.

To deal with these problems, in this paper we propose a novel deep feature learning architecture based on the Sparse Auto-Encoder (SAE) [14]. We benefit from the strengths of the BoF framework and the power of the SAE in feature representation. This technique provides a more concise visual dictionary, more efficient feature learning and a better recognition accuracy, compared to other feature representation techniques such as sparse coding. Specifically, we constructed a two-hidden-layer network using SAEs in order to fine-tune the visual dictionary and learn the discriminative feature codes for each class.

The contribution of this paper is threefold. First, we handled scene character recognition problems in five languages, namely Arabic, English, Bengali, Devanagari and Kannada. Indeed, we used a patch-based method which allowed us to be script-independent and enabled us to deal with scene characters of different languages. Second, we applied a stacked SAE for sparse and discriminative feature representation in a BoF framework. Finally, we provided a suitable benchmark to compare the performances of different methods for Arabic scene text reading.

The rest of the paper is organized as follows. Section 2 discusses the related work and section 3 introduces our methodology. Section 4 presents a description of our system. Our proposed dataset and the other datasets used for the system evaluation are described in section 5. The experimental results and discussion are then presented in section 6. Our concluding remarks and future work are finally presented in section 7.

2 RELATED WORK

Text recognition in the wild relies on two sets of approaches. The first category is based on segmentation and uses traditional Optical Character Recognition (OCR) methods [15], whereas the second is a segmentation-free approach, also called a feature-based approach, which uses object recognition-based methods [2], [5], [16-18]. Different methods based on segmentation have been proposed for scene text recognition in recent years. In [19], a variant of Niblack's adaptive binarization algorithm is proposed for this purpose. Recent works like [20] and [21] adopt more advanced techniques to obtain a better text segmentation performance. Extremal Regions are used in [20] to extract candidate text regions, and non-text regions are removed by a classifier trained with text-specific features. The segmentation-based approach is computationally efficient and has the advantage of using an open-source traditional OCR for the text recognition step. Its performance is still limited, however, since information is lost during the binarization and segmentation processes.

Using a segmentation-free approach, we can avoid the challenging segmentation step. Such an approach is robust to the variability of text in size and color and to camera-based image problems, namely uneven lighting, background complexity, non-planar surfaces, lack of resolution, perspective distortion, etc. Segmentation-based approaches, in contrast, are prone to error accumulation and loss of information during the binarization and segmentation processes.

Thus, a great deal of research has focused on object recognition techniques using local features and reached very competitive results [2], [16-18], since local descriptors represent an efficient tool for dealing with the variability of text in size and color and with camera-based image problems, namely uneven lighting, complexity of the background and small text size.

Fig. 2: Samples of scene characters of different languages: (a) English scene characters from Chars74K, (b) English scene characters from ICDAR2003, (c) Arabic scene characters from ARASTI, (d) Devanagari scene characters from DSIW-3K, (e) Kannada scene characters from Chars74K, and (f) Bengali scene characters from images captured in West Bengal, India.

Fig. 3: Some examples of characters with discriminative patches (in green) which have the same context and can be the origin of confusion without keeping spatial information. Examples of confusion in (a) English characters, (b) Arabic characters, (c) Kannada characters, (d) Devanagari characters and (e) Bengali characters.

For example, Zhang et al. [22] extract the histograms of sparse

codes (HSC) features for scene character representation. [23] suggests a stroke-based tree structure to capture the structural information of different characters by building, for each character class, a tree whose nodes represent parts of characters. The bag-of-keypoints approach proposed by [24] relies on densely extracted SIFT descriptors to recognize perspective scene texts of arbitrary orientations.

3 METHODOLOGY

In this paper, we use a hybrid architecture that merges the complementary strengths of the BoF framework and deep architectures. Furthermore, a patch-based method was applied, allowing us to be script-independent when dealing with scene characters of different languages. In this section, we introduce our work and review the related techniques.



3.1 Scene character recognition based on BoF

It is worth noting that the BoF framework has proven to be an efficient model for object recognition. The general idea of the BoF is to represent the image as a collection of important key points while totally disregarding the order in which the visual words appear. The BoF represents the image by the distribution of the visual words present in it, replacing the functional or vector representation of the image.

After learning local descriptors, such as the SIFT descriptor, a scene character image can be represented by a histogram of these visual codes, also called patches, using the corresponding coding and pooling methods. Given the histogram features of characters, various methods can then be used for recognition. Following these steps constitutes the BoF framework.
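For illustration, the classical hard-assignment variant of this histogram step can be sketched as follows (a minimal NumPy sketch; the function name and the hard-assignment rule are illustrative of the baseline BoF, not of the SAE-based coding introduced in section 4):

```python
import numpy as np

def bof_histogram(codes):
    # codes: (K, N) activations of N local patches over K visual codewords.
    # Hard-assignment BoF: count how often each codeword is the best match,
    # then normalize to obtain the image-level visual-word histogram.
    K, _ = codes.shape
    assignments = codes.argmax(axis=0)
    hist = np.bincount(assignments, minlength=K).astype(float)
    return hist / hist.sum()
```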

Meanwhile, in the BoF framework, the major problem is the information loss brought by feature coding, leading to a loss of discrimination between classes and a lack of spatial information. This is because many patches of a character may appear at another location of another character. Ignoring spatial information can thus produce character classification confusion. For example, among English characters, 'B' can be confused with 'R' (figure 3) if one ignores the context above their shared segment.

Therefore, models of the same character at different positions need to be discriminated from each other. It is thus desirable to give priority to the patches containing more discriminative features and to ignore those with a low discriminative power in the global classification.

How to alleviate the above problems in the BoF-based framework, enhancing the representative and discriminative abilities of image features while incorporating spatial information for the SCR problem, is challenging and becomes the focus of this paper.

3.2 Deep learning based character feature learning

So far, learning features directly from data through deep learning

methods has become increasingly popular and effective for visual

recognition.

• Deep Learning

A deep neural network architecture is obtained by stacking multiple

layers of simple learning blocks. In a deep learning architecture, the

output of each intermediate layer can be considered as a representation

of the original input data. Each level benefits from the representation

produced by a previous level as input, and achieves new representations

as output, which are then fed to higher levels [25].

When applied to character recognition, deep learning exploits the

unknown structure of a character in the input distribution in order to

discover good representations, often at multiple levels, with higher-level

learned features defined in terms of lower-level features.

To deal with the information loss problem, deep architectures have recently proven their effectiveness in reducing information loss by integrating unsupervised pre-training [26], a process that can be generalized over different situations.

3.3 Stroke-based models for scene character recognition

Certain characters or parts of characters have low information value for recognition, as they appear in multiple characters (see figure 3), whereas other characters or parts of characters are more discriminative, being unique to only one or a small number of scripts. Therefore, it is desirable to give priority to the patches containing more discriminative features and to ignore those with a low discriminative power in the global classification. Furthermore, to deal with multilingual SCR, we used a patch-based method which is script-independent when dealing with scene characters of different languages and which is also able to treat the loss of discrimination between classes and the lack of spatial information.

To deal with this problem, many works have exploited stroke-based models for scene characters in a diversity of languages and have demonstrated competitive results. They tried to improve previous results by responding to the need to keep spatial information when recognizing characters and by discovering the important spatial layout of each character class from the training data in a discriminative way.

In this context, several related works which deal with the SCR problem using BoF, deep learning or stroke-based models can be presented.

DeCampos et al. [2] proposed to apply a BoF-based model on scene

characters and compare the effectiveness of different local features.

Zheng et al. [22] proposed a local description-based method to measure

the similarity between two character images and used the Nearest

Neighbor-based method to build a system for Chinese character

recognition. In [27], the authors recognized Devanagari characters by

comparing 6 types of local feature descriptors namely, pixel density,

directional features, HOG, GIST, Shape Context and Dense SIFT.

While many works exploited the BoF in recognizing scene characters in a diversity of languages and demonstrated competitive results, this framework admits extensions that improve the results and respond to the need to keep spatial information when recognizing

characters. Different network structures have been studied in [15, 28,

29] to obtain a better recognition performance by reducing the

information loss through varying the number of layers, the activation

function and the sampling techniques. Although the above-mentioned works demonstrate competitive results, the methods used require millions of annotated training samples to perform well, and such annotated training images are not publicly available.

Our work is also related to stroke-based models, where regions which

contain the basic information are learned. We quote as an example the

work of Gao et al. [30] which proposes the stroke bank to consider the

spatial context at the scene character representation stage. Shi et al. [31]

extend [30] to a discriminative multiscale stroke detector- based

representation (DMSDR). Tian et al. [1] learn the spatial information by

using the co-occurrence of a histogram of oriented gradients (Co-HOG)

which captures the spatial distribution of neighboring orientation pairs.

Gao et al. [32] embed the spatial context in the stage of dictionary

learning for scene character representation. Chen-Yu Lee et al. [5]

propose a discriminative feature pooling method that automatically learns

the most informative strokes of each scene character within a multi-class

classification framework, where each sub-region seamlessly

integrates a set of low-level image features through integral images. In

[31], the authors introduce two existing stroke-based models to recognize characters by detecting elastic stroke-like parts; in order to use strokes of various scales, they propose to learn the discriminative multi-scale stroke detector-based representation (DMSDR) for characters. In [33], the authors propose a method of selecting co-occurrence activation descriptors that enhances the discriminative power of parts.

While many works used the BoF, deep learning-based methods or stroke-based models for SCR, no single attempt has exploited deep learning for feature representation in a BoF for SCR using a patch-based model.

4 LEARNING FEATURE REPRESENTATION IN A BOF

In this section, a detailed description of our architecture, based on an adapted BoF model using an SAE to encode local descriptors, is introduced. We use a hybrid architecture that merges the complementary strengths of the BoF framework and deep architectures.


Figure 4 shows our BoF framework using an SAE to learn visual codes and perform the feature coding. During training, the SAE was first used to perform unsupervised learning of the local features. We concurrently learnt a local classifier, while back-propagating the residual error to fine-tune the codebook. After training, the SAE was used as a feed-forward encoder to directly infer the coding of each local feature.

4.1 SAE for unsupervised feature representation in the BoF framework

After extracting local patches and learning local features via local descriptors, chosen for their invariance and their compact representation of images, we perform the unsupervised feature representation. The number of nodes of both the input and output layers is $d$, the descriptor dimension. In the BoF we have $K$ codewords in the codebook $W$; therefore, we also set the number of hidden nodes of the AE network to $K$, which means that the coded vectors of the AE network and of the other BoF models have the same dimensionality ($K$ can be set to 1024, 2048, etc.). Let $W \in \mathbb{R}^{d \times K}$ and $b^{(1)} \in \mathbb{R}^{K}$ be the weight and bias of the input-hidden layer. Figure 5 shows the architecture of the AE network.

4.1.1 Constructing a single layer

An auto-encoder network [34] is a variant of a three-layer MLP which models a set of inputs through unsupervised learning. The auto-encoder learns a latent representation that allows the reconstruction of the input vectors (see figure 5). This means that the input and output layers assume the same conceptual meaning. Given an input x, a set of weights W maps the input to the latent layer z. From this latent layer, the input x is reconstructed with another set of weights V. The feature layer has a dimension d corresponding to the size of the input vector (e.g. 128 for SIFT). The coding layer has K visual codewords used to encode the feature layer. The layers are connected via the weights W, our visual dictionary. Given a visual dictionary W, a feature x can be encoded into its visual code z. The learning is concurrently regularized to guide the representation to be sparse.
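Before formalizing the objective below, this forward pass can be written compactly (a minimal NumPy sketch; the function name is ours, and the biases $b^{(1)}$, $b^{(2)}$ are omitted, as in Eq. (2)):

```python
import numpy as np

def ae_forward(W, V, x):
    # W: (d, K) visual dictionary, V: (d, K) reconstruction weights,
    # x: (d,) local descriptor (e.g. a 128-dim SIFT vector).
    z = 1.0 / (1.0 + np.exp(-(W.T @ x)))   # visual code in the latent layer
    x_hat = V @ z                          # reconstruction of the input
    return z, x_hat
```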

Specifically, suppose $W \in \mathbb{R}^{d \times K}$, $b^{(1)} \in \mathbb{R}^{K}$ are the weight and bias of the input-hidden layer, and $V \in \mathbb{R}^{d \times K}$, $b^{(2)} \in \mathbb{R}^{d}$ are the weight and bias of the hidden-output layer. Suppose we have extracted $N$ descriptors $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{d \times N}$ from the training images around $N$ interest points by dense sampling, where $K$ is the number of neurons in the hidden layer, corresponding to the size of the visual dictionary $W$, and $\hat{x}_j \in \mathbb{R}^{d}$ denotes the reconstruction of $x_j$. Our objective is to learn a shared mid-level visual dictionary $W = [w_1, w_2, \ldots, w_K] \in \mathbb{R}^{d \times K}$ and a reconstruction weight $V = [v_1, v_2, \ldots, v_K] \in \mathbb{R}^{d \times K}$ making $\hat{x}_j$ as close to $x_j$ as possible, by minimizing the following objective function $E_1(W, V; X)$:

$$E_1(W, V; X) = \frac{1}{N} \sum_{j=1}^{N} \|x_j - \hat{x}_j\|^2 + \beta \sum_{k=1}^{K} \mathrm{KL}(\rho \,\|\, \hat{\rho}_k) \quad (1)$$

$$\hat{x}_j = V \left(1 + \exp(-W^{T} x_j)\right)^{-1} \quad (2)$$

$$\mathrm{KL}(\rho \,\|\, \hat{\rho}_k) = \rho \log \frac{\rho}{\hat{\rho}_k} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_k} \quad (3)$$

$$\hat{\rho}_k = \frac{1}{N} \sum_{j=1}^{N} \left(1 + \exp(-w_k^{T} x_j)\right)^{-1} \quad (4)$$

where $\|\cdot\|_2$ denotes the L2-norm, $\rho$ is a sparsity parameter, $W$ is the mid-level visual dictionary to be learned, subject to $\|w_k\|_2 = 1, \forall k = 1, 2, \ldots, K$, to prevent $V$ from being arbitrarily large, $V$ is a reconstruction weight which reconstructs the input layer from the hidden layer, $\beta$ is a parameter controlling the weight of the sparsity penalty term, $\hat{\rho}_k$ is the average activation of the $k$-th hidden node over the $N$ training data and $\rho$ is the target average activation of the hidden nodes. The Kullback-Leibler divergence $\mathrm{KL}(\cdot)$ [35] is a standard function for measuring how different two distributions are. It has the property that $\mathrm{KL}(\rho \,\|\, \hat{\rho}_k) = 0$ if $\hat{\rho}_k = \rho$; otherwise it increases monotonically as $\hat{\rho}_k$ diverges from $\rho$. Hence, the Kullback-Leibler divergence is used as the sparsity constraint.
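A minimal training sketch of this layer in PyTorch, using autograd for Eqs. (1)-(4); the choice of Adam, the number of steps and the unit-norm projection used to enforce $\|w_k\|_2 = 1$ are assumptions on our part, and the concurrent local-classifier fine-tuning described in section 4 is omitted:

```python
import torch

def train_sae_layer(X, K=1024, rho=0.05, beta=3.0, lr=1e-3, steps=200):
    # X: (d, N) float tensor of local descriptors (e.g. d = 128 for SIFT).
    d, N = X.shape
    W = torch.randn(d, K, requires_grad=True)   # visual dictionary W
    V = torch.randn(d, K, requires_grad=True)   # reconstruction weights V
    opt = torch.optim.Adam([W, V], lr=lr)
    for _ in range(steps):
        Z = torch.sigmoid(W.T @ X)                       # codes, Eq. (2)
        recon = ((X - V @ Z) ** 2).sum() / N             # first term of Eq. (1)
        rho_hat = Z.mean(dim=1)                          # Eq. (4)
        kl = (rho * torch.log(rho / rho_hat)
              + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()  # Eq. (3)
        loss = recon + beta * kl                         # Eq. (1)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():                            # project ||w_k||_2 = 1
            W /= W.norm(dim=0, keepdim=True)
    return W.detach(), V.detach()

def encode(W, X):
    # After training, the SAE acts as a feed-forward encoder (section 4).
    return torch.sigmoid(W.T @ X)
```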

4.1.2 Constructing a deep architecture

We extend the single-layer architecture to be hierarchical. In the case of deep fully-connected networks, we stack two SAEs: we greedily stack one additional auto-encoder, with a new set of parameters, that aggregates within a neighborhood of outputs from the first auto-encoder. This allows the representation to be transformed from a low-dimensional code into a single high-dimensional vector that describes the entire image, as illustrated in figure 6.
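Inference through the stacked encoder then reduces to two feed-forward passes (a minimal NumPy sketch with biases omitted; the layer sizes K1 = 1024 and K2 = 512 follow the pipeline of figure 4):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stacked_encode(W1, W2, X):
    # X: (d, N) local descriptors; W1: (d, K1); W2: (K1, K2).
    Z1 = sigmoid(W1.T @ X)    # first-layer codes, e.g. K1 = 1024
    Z2 = sigmoid(W2.T @ Z1)   # second-layer codes, e.g. K2 = 512
    return Z2
```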


4.2 Spatial pyramid matching and max pooling

Using the SPM technique, each image was segmented into fine sub-regions, and the histogram of the local features of each sub-region was calculated. We segmented the images at different scales and computed the BoF histogram in each segment, then concatenated all the histograms. With this strategy, the performance was considerably improved over the basic bag-of-features representation. We then applied a max-pooling technique in order to summarize all the local features representing the image. We concatenated the pooled features at the different levels of the pyramid to build the final image-level representation.
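A sketch of this pooling stage, assuming a three-level pyramid (1x1, 2x2 and 4x4 grids) and known patch positions; the function and parameter names are ours:

```python
import numpy as np

def spm_max_pool(codes, positions, image_size, grids=(1, 2, 4)):
    # codes:      (K, N) SAE codes of N local patches of one image
    # positions:  (N, 2) (x, y) patch centers in pixel coordinates
    # image_size: (width, height), e.g. (90, 90)
    # grids:      cells per side for pyramid levels 0, 1 and 2
    K, N = codes.shape
    w, h = image_size
    signature = []
    for g in grids:
        pooled = np.zeros((g, g, K))
        cx = np.minimum((positions[:, 0] * g // w).astype(int), g - 1)
        cy = np.minimum((positions[:, 1] * g // h).astype(int), g - 1)
        for i in range(N):  # max-pool each code dimension within its cell
            cell = pooled[cy[i], cx[i]]
            np.maximum(cell, codes[:, i], out=cell)
        signature.append(pooled.reshape(-1))
    return np.concatenate(signature)  # global image signature
```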

4.3 Scene character recognition

Scene character images segmented from a text in the wild can be recognized by training a suitable classifier. The linear Support Vector Machine (SVM) [36] is among the most efficient machine learning techniques for high-dimensional data, since it is fast and capable of providing state-of-the-art recognition rates. Therefore, in this work we chose the linear SVM with L2-regularization to classify the signatures into their categories.
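For illustration, a minimal sketch using scikit-learn's LinearSVC on stand-in data (the signature dimension and class count below are placeholders):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Stand-in data: 100 global image signatures, e.g. (1 + 4 + 16) * 512 dims
# from a three-level pyramid over 512 second-layer codes.
rng = np.random.default_rng(0)
X_train = rng.random((100, 21 * 512))
y_train = rng.integers(0, 62, size=100)   # e.g. 62 English character classes

clf = LinearSVC(penalty="l2", C=1.0)      # L2-regularized linear SVM
clf.fit(X_train, y_train)
print(clf.predict(X_train[:5]))
```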

5 MULTILINGUAL SCENE CHARACTER DATASETS

This section describes the scene character datasets used in the evaluation of this work: six datasets covering five different languages, including two sets in English, one in Arabic, one in Devanagari, one in Kannada and one in Bengali.

5.1 ARASTI: An Arabic scene character dataset

In spite of the diversity of the works interested in text recognition in natural scenes for a diversity of languages [1], [4], [27], Arabic text does not yet seem to be well addressed, due to the complex shape nature of Arabic characters, which depends on their position within a word (isolated, beginning, middle, and end), and due to the lack of an available database of Arabic script text images in the wild. Added to that, there is a scarcity of publicly available OCR datasets for scene text recognition, and most existing datasets are devoted to English texts. The most widely used in the community are the ICDAR 2003 Robust Reading dataset [37] and the Chars74K dataset proposed by [2], which addresses the recognition of English and Kannada characters in natural scene images; no Arabic dataset seems to exist for this purpose. Thus, the creation of an Arabic scene character dataset would have a significant impact on the advancement of scene text recognition research. Taking this into account, we have initiated the development of a real dataset of annotated images of text in diverse natural scenes, captured under varying weather, lighting and perspective conditions, for Arabic scene text recognition: the ARASTI (Arabic Scene Text Image) dataset [38]. To our knowledge, ARASTI is the first dataset for Arabic scene text recognition in the real world. Our dataset of Arabic text was collected from pictures of shopping areas, advertisements and hoardings in streets, signboards, roadside signs, etc.

Fig. 5: Auto-encoder Network.

Fig. 4: Our SAE-based BoF architecture for SCR. The arrows indicate the operations performed during the various phases: 1. local feature extraction, 2. coding using the sparse auto-encoder (first hidden layer, K=1024), 3. coding using the sparse auto-encoder (second hidden layer, K=512), 4. spatial pyramids and max-pooling, and classification using SVM.


Fig. 6: The BoF from the image pixels level to low-level descriptors, mid-level visual codes and to a high-level image signature via a two-layer

architecture for coding local features.

We photographed a set of 364 images, taken under real-life conditions without caring about the environmental conditions or refined camera settings. Therefore, in our image dataset, the lighting conditions may be more or less good, and the images might include blur or be taken from different distances, etc.

To validate a character recognition method, one needs a benchmark dataset covering all the varieties of characters. We manually segmented individual characters to obtain a total of 55 classes covering the 28 Arabic characters in their different positions within the word (start, middle, end and isolated). The ARASTI dataset contains an unequal number of samples per conjunct/character, as these are extracted from the text strings in the images. This is also in line with the fact that, in daily usage, some characters are more frequently used than others. Our ARASTI dataset is freely available at http://www.regim.org/publications/databases/arabic-scenetext-image-of-regim-lab2015/.

5.2 English scene characters datasets

We evaluated the proposed recognition system on two English scene character datasets, i.e., the Chars74K [2] and ICDAR03-CH [37] datasets. These consist of 62 classes (0-9, A-Z, a-z). Character samples are collected from a wide variety of camera-captured images, like book covers, road signs, brand logos and other texts randomly selected from various objects.

• The Chars74K dataset contains 12,505 images of characters extracted from scene images. Various training and test splits of the dataset are provided for the research community to serve as a benchmark and compare results.

• The ICDAR03-CH dataset [37] contains 11,482 images of characters, comprising 6,113 training and 5,369 test images.

In order to be able to compare our results to previously published findings, we concentrated on testing the Chars74K benchmark with 15 training images per class and the ICDAR03-CH benchmark with 15 training images per class.

5.3 Indic scene characters datasets

India is a multi-script country. In this paper, we evaluated our work on three different Indic scripts: Bengali, Devanagari and Kannada.

5.3.1 Bengali scene characters dataset

The Bengali alphabet has 50 basic characters, including 11 vowels and 39 consonants. Additionally, it has several diacritics and a large number of conjunct characters. In this paper, we used the dataset proposed in [1]. It comprises 15,250 Bengali character samples extracted from 260 scene images, plus another set of 4,280 manually created samples, for a total of 19,530 samples divided into 13,410 for training and 6,120 for testing. Some of these Bengali character samples are shown in figure 2f.

5.3.2 Devanagari scene characters dataset

In this paper, we used DSIW-3K [27] to evaluate the scene Devanagari recognition rate. The pictures were taken under real-life conditions without caring about refined camera settings or environmental conditions, which may cause poor resolution and distortion (see figure 2d). The dataset contains an unequal number of samples per conjunct/character, as these are extracted from the text strings in the images.

5.3.3 Kannada scene characters dataset

Kannada is an Indic script. We used the Kannada Chars74K dataset [39] (http://www.ee.surrey.ac.uk/CVSSP/demos/chars74k/) for our experiments (see figure 2e).

6 EXPERIMENTAL RESULTS

6.1 Experimental setting

The images were first resized to 90×90 pixels. Local descriptors were densely sampled from each image.



We randomly sampled 200,000 patches for the ICDAR03-CH and Chars74K English datasets, 80,000 patches for the ARASTI dataset, and 150,000 patches for the Chars74K Kannada, Bengali and Devanagari datasets. The trained dictionary was used to encode the local features.

We performed feature extraction with three local descriptors, namely SIFT (Scale Invariant Feature Transform) [40], HOG (Histogram of Oriented Gradients) [41] and SURF (Speeded Up Robust Features) [42].
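As an illustration of this setting, dense SIFT extraction could look as follows (a sketch assuming OpenCV; the grid step is a placeholder, and 16 pixels is the best-performing patch size in figure 7):

```python
import cv2

def dense_sift(image, step=8, patch_size=16):
    # image: HxW (or HxWx3) uint8 array. Resize to 90x90 as in the
    # experimental setting, then compute SIFT on a dense grid of centers.
    img = cv2.resize(image, (90, 90))
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) if img.ndim == 3 else img
    keypoints = [cv2.KeyPoint(float(x), float(y), float(patch_size))
                 for y in range(patch_size // 2, 90, step)
                 for x in range(patch_size // 2, 90, step)]
    _, descriptors = cv2.SIFT_create().compute(gray, keypoints)
    return descriptors   # (num_patches, 128)
```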

6.2 Evaluation of each part of the system and of the parameters

6.2.1 Evaluation of the local feature extraction

We report the recognition accuracy on the six datasets for SIFT [40], HOG [41] and SURF [42]. As shown in tables 2, 3 and 4, the best performance is achieved with the SIFT descriptor.

We also report the recognition accuracy on the ICDAR03-CH and Chars74K datasets when combining the responses of these three local features through the majority voting rule, which selects the label that receives more votes than any other. This rule was used in association with the SVM classifier to improve the accuracy of our classification problem.

As shown in tables 2, 3 and 4, the maximum recognition rate improved from 83.2% to 84.5% with majority voting over these three local features. Therefore, in the following evaluation of each part of the system, we combined the responses of the three local features through the majority voting rule in all the experiments.
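A minimal sketch of this vote over the three per-descriptor SVM predictions; since the paper does not specify a tie-breaking rule, falling back to the SIFT prediction (the strongest single descriptor in tables 2-4) is our assumption:

```python
import numpy as np

def majority_vote(pred_sift, pred_hog, pred_surf):
    # Each argument: (n_images,) predicted labels of one descriptor's SVM.
    # A label backed by at least two of the three classifiers wins; when
    # all three disagree, we fall back to the SIFT prediction (assumption).
    out = np.array(pred_sift)
    agree = np.asarray(pred_hog) == np.asarray(pred_surf)
    out[agree] = np.asarray(pred_hog)[agree]
    return out
```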

6.2.2 Evaluation of the patch size

As shown in figure 7, the patch size was varied. The best recognition accuracy was obtained with a patch size of 16 pixels. For too-small patch sizes, the extracted features can hardly capture any high-level statistics, whereas for too-large patches the extracted local features become highly diverse.


Fig. 7: Recognition rate when varying the patch size used for local feature extraction.

6.2.3 Evaluation of the dictionary size

Our system remains very competitive even when the codebook size is reduced four-fold, from 4096 to 1024. As shown in figure 8, with only 256 codewords we are still able to reach accuracies greater than 50%. It can be observed that a larger codebook has more capacity to capture the diversity of the features, but is also more likely to exhibit codeword redundancies. From our experiments, 1024 codewords give a good balance between diversity and conciseness.


Fig. 8: Recognition rate when varying the dictionary size.

6.2.4 Evaluation of the number of training samples

To evaluate whether the representative ability of our system depends on sufficient training samples, we report results with different numbers of training samples on the ICDAR dataset. From these results, we can see that using more training samples improves all the methods. Moreover, as shown in table 1, with only 75 training samples per class we could still achieve an accuracy of 88.5%.

The results prove that our system outperforms the CNN-based methods: ConvNet [44] achieves an accuracy of 86.5%, whereas our system reaches 88.5% with fewer training samples. This could be due to two reasons. Firstly, the training data is insufficient, which causes the CNN to suffer from overfitting. Secondly, the network parameters need to be further studied when applied to different text languages. In contrast, our proposed system shows its advantage under insufficient training data.

6.2.5 Evaluation of spatial pyramid matching

The results in figure 9 show that spatial pyramid matching improves the performance for the same total dictionary size. The results further indicate that spatial division is more effective for Arabic characters, which tend to have more complex structures. For all the scripts, the best performance is obtained with the finest spatial division (level 2).

6.3 Comparative study with state-of-the-art methods

6.3.1 Results on English scene character datasets

Table 2 shows the results obtained with training sets of 15 samples per class and test sets of 5 samples per class. As can be seen in table 2, the proposed system gives better results than the existing methods, except those based on



the CNN. This can be explained by the fact that we used a much smaller amount of training data. It can be concluded that our proposed system, using the Stacked Sparse Auto-encoder (SSAE) for feature learning, gives better results than the existing methods. For further details, the confusion matrix is displayed in figure 10, where we notice that upper-case and lower-case letters like 'X' and 'x', 'W' and 'w', 'O' and 'o', 'Z' and 'z' were confused. These characters are extremely difficult to distinguish, even for the human eye.

6.3.2 Results on ARASTI dataset

This subsection details the recognition results for Arabic characters. In our experiments, we compared our results with those obtained using a conventional OCR system, Tesseract OCR (https://code.google.com/p/tesseract-ocr/). Arabic scene character recognition is a very challenging task because some characters are distinguished from each other only by the number and position of dots around the basic shape of the character. To alleviate this problem, similar classes were merged into one: the samples from the 55 initial classes (denoted by 'ARASTI-55' in table 3) with similar structures were combined, yielding a total of 30 classes (denoted by 'ARASTI-30' in table 3) to be recognized; they are shown in the confusion matrix in figure 11. Table 3 shows the results obtained with training sets of 15 samples per class and test sets of 5 samples per class.

We can conclude that our system yields competitive results and that the SSAE is a good choice for Arabic character recognition. Its good performance can be explained by the fact that the SSAE is designed to learn the discriminative patches and to keep the spatial information without information loss. For further details, figure 11 shows the confusion matrix. The most obvious mistakes are mis-classifications between characters which are extremely difficult to distinguish, even for the human eye. Meanwhile, when recognizing Arabic scene text, such characters can be distinguished by exploiting context information.

6.3.3 Results on Indic script scene character datasets

Table 4 lists the results obtained on the Indic script datasets. The Tesseract OCR result is not available since it cannot deal with isolated Indic characters. As shown in table 4, our recognition accuracy outperforms the current state of the art on the Bengali, Kannada and Devanagari datasets, even compared with the CNN [29].


Fig. 9: Comparative results with different spatial pyramids (%).

Fig. 10: Confusion matrix resulting from ICDAR03-CH dataset.

TABLE 1: Comparative results of our system and other methods with different numbers of training samples (TS) on the ICDAR dataset (%).

Feature coding method | Dictionary size | 15 TS | 30 TS | 45 TS | 75 TS
K-means [43] | 512 | 59.5 | 61.5 | 68.2 | 70.9
K-means [43] | 1024 | 62.9 | 63.5 | 70.0 | 71.4
K-means [43] | 2048 | 62.7 | 63.0 | 70.4 | 70.6
K-means [43] | 4096 | 61.9 | 61.1 | 69.2 | 69.8
Sparse Coding [12] | 512 | 73.8 | 74.5 | 76.1 | 78.9
Sparse Coding [12] | 1024 | 75.9 | 77.5 | 79.7 | 81.7
Sparse Coding [12] | 2048 | 75.3 | 77.4 | 79.8 | 81.0
Sparse Coding [12] | 4096 | 74.6 | 75.8 | 78.7 | 80.2
Sparse Auto-encoder | 512 | 81.7 | 82.3 | 83.0 | 86.3
Sparse Auto-encoder | 1024 | 83.2 | 83.9 | 85.1 | 88.5
Sparse Auto-encoder | 2048 | 82.9 | 83.8 | 85.3 | 88.1
Sparse Auto-encoder | 4096 | 81.0 | 82.7 | 84.7 | 87.4


This is due to the insufficient training data, which makes the CNN suffer from overfitting. Our proposed system demonstrates its efficiency even with insufficient training data and across different languages. Some successful and failed cases are shown in figure 12.

Fig. 12: Successful and failed cases of our system. In each figure, the top row shows correctly recognized samples and the bottom row shows wrongly recognized ones. (a) Bengali, (b) Devanagari and (c) Kannada.

TABLE 4: The performance of our system on the Bengali, Kannada and Devanagari scene characters datasets compared to previously published results using other methods (%). SSAE: Stacked Sparse Auto-encoder.

Method | Bengali | Kannada | Devanagari
Jaderberg et al. (CNN) [29] | 84.5 | 76.2 | 62.7
Sheshadri et al. [39] | - | 54.13 | -
deCampos et al. [2] | - | 29.88 | -
Narang et al. [27] | - | - | 56.0
Tounsi et al. [12] | 84.2 | 72.5 | 48.4
Tounsi et al. [13] | 87.7 | 76.5 | 58.7
SURF + SSAE + SVM | 78.0 | 65.2 | 53.7
HOG + SSAE + SVM | 86.9 | 76.4 | 64.1
SIFT + SSAE + SVM | 89.2 | 77.1 | 64.2
Majority voting with SIFT+HOG+SURF + SSAE + SVM | 90.7 | 79.4 | 66.6

7 CONCLUSION

In this paper, a sparse deep auto-encoder based system in a BoF framework was proposed in order to fine-tune the visual dictionary and learn local feature codes, followed by multi-scale spatial max pooling and a linear SVM. This technique provides a more efficient feature representation and thus a higher recognition accuracy, compared to other feature representation techniques such as sparse coding. We exploited the modeling power of the local descriptors and spatial pooling of the BoF framework, together with the adaptability and representational power of deep learning.

The proposed system enabled us to tackle an important problem arising in text recognition in natural scene images: multilingual SCR. In fact, we handled scene character recognition in five languages, namely Arabic, English, Bengali, Devanagari and Kannada, towards the advancement of scene character recognition research for multiple scripts.

In addition, we provided benchmark results for comparing Arabic scene character recognition methods.

TABLE 3: The performance of our system on the Arabic scene characters dataset compared to previously published results using other methods (%). SSAE: Stacked Sparse Auto-encoder.

Method | ARASTI-55 | ARASTI-30
Tesseract OCR | 19.6 | 22.3
Tounsi et al. [12] | 57.5 | 60.4
Tounsi et al. [13] | 61.2 | 65.1
SURF + SSAE + SVM | 50.1 | 62.4
HOG + SSAE + SVM | 61.7 | 71.8
SIFT + SSAE + SVM | 62.9 | 72.5
Majority voting with SIFT+HOG+SURF + SSAE + SVM | 65.4 | 74.7

Fig. 11: Confusion matrix resulting from ARASTI dataset

TABLE 2: The performance of our system compared to previously published results using other methods on English script (%). SSAE: Stacked Sparse Auto-encoder.

Method | ICDAR03-CH | Chars74K
Convolutional Co-occurrence of HOG [1] | 81.7 | -
CoHOG + Linear SVM [45] | 79.4 | -
PHOG + Linear SVM [46] | 76.5 | -
MLFP [5] | 79 | 64
GHOG + SVM [3] | 76 | 62
LHOG + SVM [3] | 75 | 58
HOG + NN [16] | 52 | 58
MKL [2] | - | 55
NATIVE + FERNS [16] | 64 | 54
GB + SVM [2] | - | 53
GB + NN [2] | - | 47
SYNTH + FERNS [16] | 52 | 47
SC + SVM [2] | - | 35
SC + NN [2] | - | 34
SIFT + SVM [2] | - | 21.40
ABBYY [2] | 21 | 31
Foreground segmentation + HOG + SVM [47] | 81.34 | 72.03
Strokelets [48] | 69 | 62
Co-occurrence of strokelets [49] | 82.7 | 67.5
PCANet [50] | 75 | 64
HOG + Sparse Coding + SVM [51] | 72.27 | -
SIFT + Sparse Coding + SVM [12] | 75.3 | 73.1
SIFT + SAE + SVM [13] | 79.3 | 75.4
SIFT + Sparse RBM + SVM [13] | 78.1 | 73.9
SURF + SSAE + SVM | 69.3 | 50.4
HOG + SSAE + SVM | 80.5 | 74.7
SIFT + SSAE + SVM | 83.2 | 77.9
Majority voting with SIFT+HOG+SURF + SSAE + SVM | 84.5 | 80.1


Our future research will mainly include: first, adapting our framework to word-level text recognition in natural video scenes; second, exploring the use of GPUs to speed up the computation of the trained sparse feature codes; finally, further investigating the recognition of scene text in the wild in other languages.

ACKNOWLEDGMENT

This work was performed in the framework of a MOBIDOC thesis financed by the EU under the PASRI program. The authors would also like to acknowledge the partial financial support of this work by grants from the General Direction of Scientific Research (DGRT), Tunisia, under the ARUB program. The research leading to these results has received funding from the Ministry of Higher Education and Scientific Research of Tunisia under grant agreement number LR11ES48.

REFERENCES

[1] Shangxuan Tian, Ujjwal Bhattacharya, Shijian Lu, Bolan Su,

Qingqing Wang, Xiaohua Wei, Yue Lu, and Chew Lim Tan.

"Multilingual scene character recognition with co-occurrence of

histogram of oriented gradients". In: Pattern Recognition 51

(2016), pp. 125-134.

[2] Teofilo Emidio de Campos, Bodla Rakesh Babu, and Manik Varma.

"Character Recognition in Natural Images". In: (2009), pp. 273-

280.

[3] Chucai Yi, Xiaodong Yang, and Yingli Tian. "Feature

Representations for Scene Text Character Recognition: A

Comparative Study". In: ICDAR '13 (2013), pp. 907911.

[4] Qixiang Ye and David S. Doermann. "Text Detection and

Recognition in Imagery: A Survey". In: IEEE Trans. Pattern Anal.

Mach. Intell. 37.7 (2015), pp. 1480-1500.

[5] Chen-Yu Lee, Anurag Bhardwaj, Wei Di, Vignesh Jagadeesh, and

Robinson Piramuthu. "Region-Based Discriminative Feature

Pooling for Scene Text Recognition". In: Proceedings of the 2014

IEEE Conference on Computer Vision and Pattern Recognition.

CVPR '14. 2014, pp. 4050-4057.

Aymen Chaabouni, Houcine Boubaker, Monji Kherallah, Adel M.

Alimi, and Haikal El Abed. "Fractal and Multi-fractal for Arabic

Offline Writer Identification". In: 20th International Conference on

Pattern Recognition, ICPR 2010, Istanbul, Turkey, 23-26 August

2010. 2010, pp. 3793-3796.

[7] Najiba Tagougui, Monji Kherallah, and Adel M. Alimi. "Online Arabic

handwriting recognition: a survey". In: IJDAR 16.3 (2013), pp. 209-

226.

Abdelkarim Elbaati, Houcine Boubaker, Monji Kherallah, Abdel

Ennaji, Haikal El Abed, and Adel M. Alimi. "Arabic Handwriting

Recognition Using Restored Stroke Chronology". In: 10th

International Conference on Document Analysis and Recognition,

ICDAR 2009, Barcelona, Spain, 26-29 July 2009. 2009, pp. 411-

415.



[9] Haikal El Abed, Monji Kherallah, Volker Margner, and Adel M.

Alimi. “On-line Arabic handwriting recognition competition - ADAB

database and participating systems". In: IJDAR 14.1 (2011), pp.

15-23.

[10] Fouad Slimane, Slim Kanoun, Haikal El Abed, Adel M. Alimi, Rolf

Ingold, and Jean Hennebert. “ICDAR 2011 - Arabic Recognition

Competition: Multi-font Multi-size Digitally Represented Text". In:

2011 International Conference on Document Analysis and

Recognition, ICDAR 2011, Beijing, China, September 18-21,

2011. 2011, pp. 1449-1453.

[11] Maroua Tounsi, Ikram Moalla, Adel M. Alimi, and Franck

Lebourgeois. “A comparative study of local descriptors for Arabic

character recognition on mobile devices". In: Seventh

International Conference on Machine Vision, ICMV 2014, Milan,

Italy, 19-21 November 2014. 2014, 94450B.

M. Tounsi, I. Moalla, A. M. Alimi, and F. Lebourgeois. "Arabic

characters recognition in natural scenes using sparse coding for

feature representations". In: 2015 13th International Conference

on Document Analysis and Recognition (ICDAR). 2015, pp. 1036-

1040.

[13] Maroua Tounsi, Ikram Moalla, and Adel M. Alimi. “Supervised

dictionary learning in BoF framework for Scene Character

recognition". In: 23rd International Conference on Pattern

Recognition, ICPR 2016, Cancún, Mexico, December 4-8, 2016.

2016, pp. 3987-3992.

[14] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio,

and Pierre-Antoine Manzagol. “Stacked Denoising Autoencoders:

Learning Useful Representations in a Deep Network with a Local

Denoising Criterion". In: Journal of Machine Learning Research

11 (2010), pp. 3371-3408.

[15] A. Bissacco, M. Cummins, Y. Netzer, and H. Neven. “PhotoOCR:

Reading Text in Uncontrolled Conditions". In: (2013), pp. 785-792.

[16] Kai Wang, Boris Babenko, and Serge J. Belongie. “End- to-end

scene text recognition". In: IEEE International Conference on

Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13,

2011. 2011, pp. 1457-1464.

[17] Vibhor Goel, Anand Mishra, Karteek Alahari, and C. V. Jawahar.

“Whole is Greater than Sum of Parts: Recognizing Scene Text

Words." In: ICDAR. IEEE Computer Society, 2013, pp. 398-402.

[18] Cunzhao Shi, Chunheng Wang, Baihua Xiao, Yang Zhang, Song

Gao, and Zhong Zhang. “Scene Text Recognition Using Part-

Based Tree-Structured Character Detection". In: Proceedings of

the 2013 IEEE Conference on Computer Vision and Pattern

Recognition. CVPR '13. 2013, pp. 2961-2968.

[19] Xiangrong Chen and Alan L. Yuille. “Detecting and Reading Text

in Natural Scenes". In: 2004 IEEE Computer Society Conference

on Computer Vision and Pattern Recognition (CVPR 2004), with

CD-ROM, 27 June - 2 July 2004, Washington, DC, USA. 2004,

pp. 366-373.

[20] Lukas Neumann and Jiri Matas. “Real-Time Lexicon- Free Scene

Text Localization and Recognition". In: IEEE Trans. Pattern Anal.

Mach. Intell. 38.9 (2016), pp. 1872-1885.

[21] Jacqueline L. Feild and Erik G. Learned-Miller. “Improving Open-

Vocabulary Scene Text Recognition".

In: 12th International Conference on Document Analysis and

Recognition, ICDAR 2013, Washington, DC, USA, August 25-28,

2013. 2013, pp. 604-608.

[22] Qi Zheng, Kai Chen, Yi Zhou, Congcong Gu, and Haibing Guan.

“Text Localization and Recognition in Complex Scenes Using

Local Features". In: Computer Vision - ACCV 2010 - 10th Asian

Conference on Computer Vision, Queenstown, New Zealand,

November 8-12, 2010, Revised Selected Papers, Part III. 2010,

pp. 121-132.

[23] Cunzhao Shi, Chunheng Wang, Baihua Xiao, Song Gao, and

Jinlong Hu. “End-to-end scene text recognition using tree-

structured models". In: Pattern Recognition 47.9 (2014), pp. 2853-

2866.

[24] Trung Quy Phan, Palaiahnakote Shivakumara, Shangxuan Tian,

and Chew Lim Tan. “Recognizing Text with Perspective Distortion

in Natural Scenes". In: IEEE International Conference on

Computer Vision, ICCV 2013, Sydney, Australia, December 1-8,

2013. 2013, pp. 569-576.

[25] Yoshua Bengio. “Learning Deep Architectures for AI". In:

Foundations and Trends in Machine Learning 2.1 (2009), pp. 1-

127.

Hanlin Goh, Nicolas Thome, Matthieu Cord, and Joo-Hwee Lim.

“Learning Deep Hierarchical Visual Feature Coding". In: IEEE

Trans. Neural Netw. Learning Syst. 25.12 (2014), pp. 2212-2225.

[27] Vipin Narang, Sujoy Roy, O. V. R. Murthy, and M. Hanmandlu.

“Devanagari Character Recognition in Scene Images". In:

Proceedings of the 2013 12th International Conference on

Document Analysis and Recognition. ICDAR '13. 2013, pp. 902-

906.

[28] Ouais Alsharif and Joelle Pineau. “End-to-End Text Recognition

with Hybrid HMM Maxout Models". In: CoRR abs/1310.1811

(2013).

Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. "Deep

Features for Text Spotting". In: Computer Vision - ECCV 2014 -

13th European Conference, Zurich, Switzerland, September 6-12,

2014, Proceedings, Part IV. 2014, pp. 512-528.

[30] Song Gao, Chunheng Wang, Baihua Xiao, Cunzhao Shi, and

Zhong Zhang. “Stroke Bank: A High-Level Representation for

Scene Character Recognition". In: 22nd International Conference

on Pattern Recognition, ICPR 2014, Stockholm, Sweden, August

24-28, 2014. 2014, pp. 2909-2913.

Cunzhao Shi, Song Gao, Meng-Tao Liu, Cheng-Zuo Qi, Chun-Heng Wang, and Baihua Xiao. "Stroke Detector and Structure

Based Models for Character Recognition: A Comparative Study".

In: IEEE Trans. Image Processing 24.12 (2015), pp. 4952-4964.

[32] Song Gao, Chunheng Wang, Baihua Xiao, Cunzhao Shi, Wen

Zhou, and Zhong Zhang. “Scene Text Character Recognition

Using Spatiality Embedded Dictionary". In: IEICE Transactions

97-D.7 (2014), pp. 1942-1946.

[33] “Multi-order co-occurrence activations encoded with Fisher Vector

for scene character recognition". In: Pattern Recognition Letters

97 (2017), pp. 69 -76.

[34] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-

Antoine Manzagol. “Extracting and Composing Robust Features

with Denoising Autoencoders".


In: Proceedings of the 25th International Conference on Machine

Learning. ICML '08. Helsinki, Finland, 2008, pp. 1096-1103.

[35] S. Kullback and R. A. Leibler. “On Information and Sufficiency". In:

Ann. Math. Statist. 1 (), pp. 79-86.

[36] Luc T. Wille. “Review of "Learning with Kernels: Support Vector

Machines, Regularization Optimization". In: SIGACT News 35.3

(2004), pp. 13-17.

[37] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R.

Young. “ICDAR 2003 robust reading competitions". In: Seventh

International Conference on Document Analysis and Recognition,

2003. Proceedings. 2003, pp. 682-687.

[38] Maroua Tounsi, Ikram Moalla, and Adel M. Alimi. “ARASTI: A

database for Arabic scene text recognition". In: 1st International

Workshop on Arabic Script Analysis and Recognition, ASAR

2017, Nancy, France, April 3-5, 2017. 2017, pp. 140-144.

[39] Karthik Sheshadri and Santosh Divvala. “Exemplar Driven

Character Recognition in the Wild". In: Proceedings of the British

Machine Vision Conference. BMVA Press, 2012, pp. 13.1-13.10.

[40] David G. Lowe. “Distinctive Image Features from Scale-Invariant

Keypoints". In: Int. J. Comput. Vision 60.2 (2004), pp. 91-110.

[41] N. Dalal and B. Triggs. “Histograms of oriented gradients for

human detection". In: 2005 IEEE Computer Society Conference

on Computer Vision and Pattern Recognition (CVPR'05). Vol. 1.

2005, 886-893 vol. 1.

[42] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool.

“Speeded-Up Robust Features (SURF)". In: Comput. Vis. Image

Underst. 110.3 (2008).

[43] S. Lazebnik, C. Schmid, and J. Ponce. “Beyond Bags of Features:

Spatial Pyramid Matching for Recognizing Natural Scene

Categories". In: 2006 IEEE Computer Society Conference on

Computer Vision and Pattern Recognition (CVPR'06). Vol. 2.

2006, pp. 2169-2178.

[44] Michael Opitz, Markus Diem, Stefan Fiel, Florian Kle- ber, and

Robert Sablatnig. “End-to-End Text Recognition Using Local

Ternary Patterns, MSER and Deep Convolutional Nets". In: SBES

'13 (2013), pp. 186-190.

[45] Shangxuan Tian, Shijian Lu, Bolan Su, and Chew Lim Tan. “Scene

Text Recognition Using Co-occurrence of Histogram of Oriented

Gradients". In: 2013 12th International Conference on Document

Analysis and Recognition, Washington, DC, USA, August 25-28,

2013. 2013, pp. 912-916.

[46] Zhi Rong Tan, Shangxuan Tian, and Chew Lim Tan. “Using

pyramid of histogram of oriented gradients on natural scene text

recognition". In: 2014 IEEE International Conference on Image

Processing, ICIP 2014, Paris, France, October 27-30, 2014. 2014,

pp. 2629-2633.

[47] Muhammad Fraz, M. Saquib Sarfraz, and Eran A. Edirisinghe.

“Exploiting colour information for better scene text detection and

recognition". In: IJDAR 18.2 (2015), pp. 153-167.

[48] Xiang Bai, Cong Yao, and Wenyu Liu. “Strokelets: A Learned

Multi-Scale Mid-Level Representation for Scene Text

Recognition". In: IEEE Trans. Image Processing 25.6 (2016), pp.

2789-2802.

[49] S. Gao, C. Wang, B. Xiao, C. Shi, W. Zhou, and Z. Zhang.

“Learning co-occurrence strokes for scene character recognition

based on spatiality embedded dictionary". In: 2014 IEEE

International Conference on Image Processing (ICIP). 2014, pp.

5956-5960.

[50] Chongmu Chen, Da-Han Wang, and Hanzi Wang. “Scene

character recognition using PCANet". In: Proceedings of the 7th

International Conference on Internet Multimedia Computing and

Service, ICIMCS 2015, Zhangjiajie, Hunan, China, August 19-21,

2015. 2015, 65:1-65:4.

[51] Dong Zhang, Da-Han Wang, and Hanzi Wang. “Scene text

recognition using sparse coding based features". In: 2014 IEEE

International Conference on Image Processing, ICIP 2014, Paris,

France, October 27-30, 2014. 2014, pp. 1066-1070.

Maroua Tounsi received the engineering degree in computer science from the National Engineering School of Sfax (ENIS), Tunisia, in 2012.

She is currently preparing her PhD within the Research Groups in Intelligent Machines (REGIM-Lab) at the National Engineering School of Sfax, Tunisia. Her PhD project was carried out within the framework of a MOBIDOC PhD fellowship (PASRI program), funded by the EU and managed by the ANPR. She is an IEEE Student Member. Her main research activities focus on document analysis, pattern recognition, and segmentation methods for natural scene analysis.

Ikram Moalla is currently an Assistant Professor in the Department of Computer Science and Applied Mathematics at ENIS, University of Sfax. She is also a member of REGIM-Lab, the Research Groups in Intelligent Machines. Her main research activities focus on document analysis, pattern recognition, feature extraction, and segmentation-free methods for paleography and natural scene analysis; they have led to several papers published in refereed journals and in the proceedings of international conferences. She has been involved in various projects.

Frank Lebourgeois has been a senior researcher at the LIRIS laboratory of INSA Lyon (France) since 1992. He specializes in Document Image Analysis (DIA), and his work covers a wide range of applications in document imaging. With more than 25 years of experience in this field, he has published more than 70 papers in international conferences and journals and has served on numerous program committees of international and national conferences. He is a reviewer for international journals (PAMI, PR, CVIU, IJDAR, ...) and for the main international conferences of the domain (ICDAR, ICFHR, DAS, ...).


Adel M. Alimi obtained a PhD and then an HDR, both in Electrical and Computer Engineering, in 1995 and 2000 respectively. He has been a full Professor in Electrical Engineering at the University of Sfax, ENIS, since 2006. Prof. Alimi is the founder and director of REGIM-Lab in Intelligent Machines. He holds 15 Tunisian patents. He has also served as an expert evaluator for the European Agency for Research since 2009. He is the founder and chair of many IEEE chapters in the Tunisia Section: IEEE Sfax Subsection Chair (since 2011), IEEE Systems, Man, and Cybernetics Society Tunisia Chapter Chair (since 2011), IEEE Computer Society Tunisia Chapter Chair (since 2010), and IEEE ENIS Student Branch Counselor (since 2010). His research interests include iBrain (evolutionary computing, swarm intelligence, artificial immune systems, fuzzy sets, uncertainty analysis, fractals, support vector machines, artificial neural networks, case-based reasoning, wavelets, hybrid intelligent systems, nature-inspired computing techniques, machine learning, ambient intelligence, hardware implementations, multi-agent systems), iPerception (classification, classifier combination, feature extraction, high-dimension classification problems, handwriting recognition, handwriting modeling, motor control, perception, OCR, Arabic script, historical document analysis, iDocument, iVision, remote sensing, affective computing), iData (big data, iWeb, biometry, data mining, web mining, cloud computing), and iRobotics (robotics, multi-robot systems, autonomous robots, complex and distributed systems, parallel and distributed architectures, intelligent control, embedded systems). Prof. Alimi has served as associate editor and member of the editorial board of many international scientific journals (e.g., IEEE Transactions on Fuzzy Systems, Pattern Recognition Letters, NeuroComputing, Neural Processing Letters, International Journal of Image and Graphics, Neural Computing and Applications, International Journal of Robotics and Automation, and International Journal of Systems Science). He was guest editor of several special issues of international journals (e.g., Fuzzy Sets and Systems, Soft Computing, Journal of Decision Systems, Integrated Computer-Aided Engineering, and Systems Analysis Modelling and Simulation). He has organized many international conferences, including ICDAR 2015, SoCPaR 2014, ICALT 2014 and 2013, ICBR 2013, HIS 2013, IAS 2013, ISI 2012, NGNS 2011, LOGISTIQUA 2011, ACIDCA-ICMI 2005, SCS 2004, and ACIDCA 2000.