Multilingual Scene Character Recognition System using Sparse Auto-Encoder for Efficient
Local Features Representation in Bag of Features
Maroua Tounsi*§, Ikram Moalla*†, Frank Lebourgeois‡, Adel M. Alimi*
*REsearch Groups in Intelligent Machines Laboratory, ENIS-Sfax, Tunisia
§Higher Institute of Computer Science and Communication Techniques, University of Sousse, Tunisia
†Al Baha University, Saudi Arabia
‡Laboratoire d'InfoRmatique en Images et Systèmes d'information (LIRIS), INSA of Lyon, France
Emails: [email protected], [email protected], [email protected]
Abstract—The recognition of text in camera-captured images has attracted a great deal of research during the past few decades. This gave birth to Scene Character Recognition (SCR), an important step in the scene text recognition pipeline.
In this paper, we extend the Bag of Features (BoF)-based model using a stacked sparse auto-encoder technique for representing local features, for accurate SCR across different languages. In the feature coding step, a Sparse Auto-encoder (SAE)-based strategy is applied to enhance the representative and discriminative abilities of image features. This technique provides a more efficient feature representation, and therefore a better recognition accuracy, than other feature coding techniques. The coding step is followed by Spatial Pyramid Matching (SPM) and max-pooling to keep the spatial information and form the global image signature.
Our system was evaluated extensively on six scene character datasets covering five different languages. The experimental results demonstrate the efficiency of our system for multilingual SCR.
Index Terms—Scene Character Recognition, Multilingual Scene Text, Bag of Features, Features Learning, Sparse Auto-encoder.
---------------------------- ♦ ------------------------------
1 INTRODUCTION
The text displayed in our surroundings, such as text on road signs, shop names and product advertisements, provides information to people and constitutes an essential tool for interacting with their environment. Hence, automatic OCR in the wild has become mandatory, specifically for a growing number of vision applications, such as automatic text translators used by tourists in foreign countries, automatic sign recognition and vision-based navigation, among others. One may find oneself in a foreign-language context and want to quickly find out exactly what is written in a specific or urgent situation. Such a need is even greater in foreign countries or for the special use of blind people. Scene Character Recognition (SCR) is an important step in the text recognition pipeline.
To meet this need, a first attempt consisted in using a conventional recognition (OCR) system, which typically achieves character recognition rates over 99% on clean documents. This attempt failed catastrophically on SCR problems. Therefore, a special focus was put on SCR issues, and a great deal of research in this field has been produced during the past few years [1-6]. However, unlike text recognition in scanned documents, SCR did not yield acceptable results when using a traditional OCR system. Although the final text recognition result could be improved by using a language prior or some lexicon-based post-processing, OCR systems are not suited to problematic classes on the same scale. The task of SCR has been found to be more difficult and more complex than that of recognizing characters in scanned documents. Indeed, in such documents, characters are almost the same size within the same paragraph or title and do not show any stretch, rotation or contrast variation.
While many works have been interested in recognizing Arabic script [7-10], Arabic script recognition has not yet been studied in the field of natural scene images, except in our early work presented in [11-13].
On the other hand, scene characters are very specific and have variable sizes, forms, styles and colors. They can be unevenly illuminated, stretched or distorted, lie on a non-planar surface and appear against a changing background (see figure 1). Scene characters may therefore consist of a highly variable set of shapes, which induces a wide intra-class variability (for the same character) in addition to inter-class variability (between characters).
Furthermore, in the absence of a specific context, the words to be recognized may belong to different languages and/or scripts. All these particularities add extra challenges to the SCR field of research (see figure 3).
This is our main underlying motivation for seeking a robust system, invariant to scale, rotation, stretch, brightness and so on, that is effective for the SCR domain in a multilingual and multi-script context.
Fig. 1: Differences in challenges in recognizing text in (a) scanned documents, (b) artificial text images and (c) scene text images (huge variance in size, font, color, non-planar surface, perspective distortion).
Most recently published methods consider scene characters as a special category of objects and therefore share many of the techniques used to recognize objects. The BoF framework, in particular, has proven to be an efficient model for object recognition. So, the idea was to take advantage of the feature extraction and representation methods that perform well in object recognition tasks to achieve SCR. However, a major problem hinders the performance of such a pipeline: the information loss introduced by feature coding may reduce the discrimination between classes, and spatial information is discarded.
To deal with these problems, in this paper we propose a novel deep feature learning architecture based on the Sparse Auto-Encoder (SAE) [14]. We benefit from the strengths of the BoF framework and the power of the SAE in feature representation. This technique provides a more concise visual dictionary, more efficient feature learning and a better recognition accuracy than other feature representation techniques such as sparse coding. Specifically, we constructed a two-hidden-layer network using SAEs in order to fine-tune the visual dictionary and learn discriminative feature codes for each class.
The contribution of this paper is threefold. First, we handled scene character recognition problems in five languages, namely Arabic, English, Bengali, Devanagari and Kannada. Indeed, we used a patch-based method, which allowed us to be script independent and to deal with scene characters of different languages. Second, we applied a stacked SAE for sparse and discriminative feature representation in a BoF framework. Finally, we provided a suitable benchmark for comparing the performances of different methods for Arabic scene text reading.
The rest of the paper is organized as follows. Section 2 discusses the related work and section 3 introduces our methodology. Section 4 presents a description of our system. Our proposed dataset and the other datasets used for the system evaluation are described in section 5. The experimental results and discussion are then presented in section 6. Our concluding remarks and future work are finally presented in section 7.
2 RELATED WORK
Text recognition in the wild relies on two families of approaches. The first category is based on segmentation and uses traditional Optical Character Recognition (OCR) methods [15], whereas the second is a segmentation-free approach, also called a feature-based approach, which uses object recognition methods [2], [5], [16-18]. Different segmentation-based methods have been proposed for scene text recognition in recent years. In [19], a variant of Niblack's adaptive binarization algorithm is proposed for this purpose. Recent works like [20] and [21] adopt more advanced techniques to obtain better text segmentation performance. Extremal Regions are used in [20] to extract candidate text regions, and non-text regions are removed by a classifier trained with text-specific features. The segmentation-based approach is computationally efficient and has the advantage of using an open-source traditional OCR for the text recognition step. However, its performance is still limited, since information is lost during the binarization and segmentation processes.
Using a segmentation-free approach, we can avoid the challenging segmentation step. Such an approach is robust to the variability of text size and color and to camera-based image problems, namely uneven lighting, background complexity, non-planar surfaces, low resolution, perspective distortion, etc. Segmentation-based approaches, in contrast, are prone to error accumulation and loss of information during the binarization and segmentation processes.
Thus, a great deal of research has focused on object recognition using local features and reached very competitive results [2], [16-18], since local descriptors are an efficient tool for dealing with the variability of text size and color and with camera-based image problems, namely uneven lighting, background complexity and small text size.
Fig. 2: Samples of scene characters of different languages: (a) English scene characters from Chars74K (b) English scene characters from
ICDAR2003, (c) Arabic scene characters from ARASTI, (d) Devanagari scene characters from DSIW-3K, (e) Kannada scene characters from
Chars74K, and (f) Bengali scene characters from images captured in West Bengal, India.
Fig. 3: Examples of similar patches and discriminative patches (the latter in green): patches sharing the same context can be a source of confusion if spatial information is not kept. Examples of confusion in (a) English characters, (b) Arabic characters, (c) Kannada characters, (d) Devanagari characters and (e) Bengali characters.
For example, Zhang et al. [22] extract histograms of sparse codes (HSC) features for scene character representation. [23] suggests a stroke-based tree structure to capture the structure information of different characters, building a tree for each character class in which character parts are represented as nodes. The bag-of-keypoints approach proposed by [24] relies on densely extracted SIFT descriptors to recognize perspective scene texts of arbitrary orientations.
3 METHODOLOGY
In this paper, we use a hybrid architecture that merges the complementary strengths of the BoF framework and deep architectures. Furthermore, a patch-based method is applied, allowing us to be script independent when dealing with scene characters of different languages. In this section, we introduce our work and review the related techniques.
3.1 Scene character recognition based on BoF
The BoF framework has proven to be an efficient model for object recognition. The general idea of the BoF is to represent the image as a collection of important keypoints while totally disregarding the order in which they appear. The BoF represents the image by the distribution of the visual words it contains, replacing a functional or vector representation of the image.
After learning local descriptors, such as SIFT descriptors, a scene character image can be represented by a histogram of visual codes, also called patches, using corresponding coding and pooling methods. Given the histogram features of characters, various methods can then be used for recognition. Together, these steps constitute the BoF framework.
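As a concrete illustration, the following minimal sketch builds a BoF signature from a set of local descriptors and a pre-learned codebook. It is not the authors' implementation: the shapes are hypothetical, and hard nearest-codeword assignment stands in for the coding step discussed below.

```python
import numpy as np

def bof_histogram(descriptors, codebook):
    """descriptors: (N, d) local features; codebook: (K, d) visual words.
    Returns an L1-normalized histogram of hard codeword assignments."""
    # Squared Euclidean distance from every descriptor to every codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    assignments = d2.argmin(axis=1)     # index of the nearest visual word
    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)  # the histogram is the image signature

# Toy usage: 200 random 128-D "descriptors" against a 32-word codebook.
rng = np.random.default_rng(0)
h = bof_histogram(rng.normal(size=(200, 128)), rng.normal(size=(32, 128)))
assert h.shape == (32,)
```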
Meanwhile, the major problem in the BoF framework is the information loss brought by feature coding, leading to a loss of discrimination between classes and a lack of spatial information. This is because many patches of one character may appear at another location in another character. Ignoring spatial information can therefore produce classification confusion. For example, among English characters, 'B' can be confused with 'R' (figure 3) if one ignores the context above their shared segment.
Therefore, models of the same character in different positions need to be discriminable from each other. It is thus desirable to give priority to the patches containing more discriminative features and to ignore those with low discriminative power in the global classification.
How to alleviate the above problems in the BoF-based framework, enhancing the representative and discriminative abilities of image features while incorporating spatial information for the SCR problem, is challenging and is the focus of this paper.
3.2 Deep learning based character feature learning
Learning features directly from data through deep learning methods has become increasingly popular and effective for visual recognition.
• Deep Learning
A deep neural network architecture is obtained by stacking multiple layers of simple learning blocks. In a deep learning architecture, the output of each intermediate layer can be considered a representation of the original input data. Each level takes the representation produced by the previous level as input and produces new representations as output, which are then fed to higher levels [25].
When applied to character recognition, deep learning exploits the unknown structure of a character in the input distribution in order to discover good representations, often at multiple levels, with higher-level learned features defined in terms of lower-level ones.
Regarding the information loss problem, deep architectures have recently proven their effectiveness in reducing information loss by integrating unsupervised pre-training [26], a process that can be generalized to different situations.
3.3 Stroke-based models for scene character recognition
Certain characters or parts of characters have low information value for recognition because they appear in multiple characters (see figure 3), whereas other characters or parts are more discriminative, being unique to one or a small number of characters. Therefore, it is desirable to give priority to the patches containing more discriminative features and to ignore those with low discriminative power in the global classification. Furthermore, to deal with multilingual SCR, we used a patch-based method, which allows the system to be script independent when dealing with scene characters of different languages, and also to treat the loss of discrimination between classes and the lack of spatial information.
To deal with this problem, many works have exploited stroke-based models for scene characters in a diversity of languages and have demonstrated competitive results. They tried to improve previous results by responding to the need to keep the spatial information when recognizing characters and by discovering the important spatial layout of each character class from the training data in a discriminative way.
In this context, several related works deal with the SCR problem using BoF, deep learning or stroke-based models.
DeCampos et al. [2] proposed applying a BoF-based model to scene characters and compared the effectiveness of different local features. Zheng et al. [22] proposed a local description-based method to measure the similarity between two character images and used a nearest neighbor-based method to build a system for Chinese character recognition. In [27], the authors recognized Devanagari characters by comparing six types of local feature descriptors, namely pixel density, directional features, HOG, GIST, Shape Context and dense SIFT.
While many works have exploited the BoF for recognizing scene characters in a diversity of languages and demonstrated competitive results, this framework admits extensions that improve the results and respond to the need to keep spatial information when recognizing characters. Different network structures have been studied in [15, 28, 29] to obtain a better recognition performance by reducing the information loss, varying the number of layers, the activation function and the sampling techniques. Although the above-mentioned works demonstrate competitive results, the methods used require millions of annotated training samples to perform well, and such annotated training images are not publicly available.
Our work is also related to stroke-based models, in which regions containing the basic information are learned. As an example, Gao et al. [30] propose a stroke bank to consider the spatial context at the scene character representation stage. Shi et al. [31] extend [30] to a discriminative multiscale stroke detector-based representation (DMSDR). Tian et al. [1] learn the spatial information by using the co-occurrence of histograms of oriented gradients (Co-HOG), which captures the spatial distribution of neighboring orientation pairs. Gao et al. [32] embed the spatial context in the dictionary learning stage for scene character representation. Chen-Yu Lee et al. [5] propose a discriminative feature pooling method that automatically learns the most informative strokes of each scene character within a multi-class classification framework, where each sub-region seamlessly integrates a set of low-level image features through integral images. In [31], the authors introduce two existing stroke-based models to recognize characters by detecting elastic stroke-like parts; in order to use strokes of various scales, they propose to learn the discriminative multi-scale stroke detector-based representation (DMSDR) for characters. In [33], the authors propose a method of selecting co-occurrence activation descriptors that enhances the discriminative power of parts.
While many works have used the BoF, deep learning-based methods or stroke-based models for SCR, no previous attempt exploits deep learning for feature representation in a BoF framework for SCR using a patch-based model.
4 LEARNING FEATURE REPRESENTATION IN A BoF
In this section, we give a detailed description of our architecture: an adapted BoF model that uses an SAE to encode local descriptors. We use a hybrid architecture that merges the complementary strengths of the BoF framework and deep architectures.
Figure 4 shows our BoF framework using an SAE to learn visual codes and perform the feature coding. During training, the SAE was first used to perform unsupervised learning of the local features. We concurrently learned a local classifier while back-propagating the residual error to fine-tune the codebook. After training, the SAE was used as a feed-forward encoder to directly infer the coding of each local feature.
4.1 SAE for unsupervised feature representation in the BoF framework
After extracting local patches and learning local features via local descriptors, chosen for their invariance and their compact representation of images, we perform the unsupervised feature representation. The number of nodes in both the input and output layers is $d$, the descriptor dimension. In the BoF, we have $K$ codewords in the codebook $W$; we therefore also set the number of hidden nodes in the AE network to $K$, which means that the coded vectors of the AE network and of other BoF models have the same dimensionality ($K$ can be set to 1024, 2048, etc.). Let $W \in \mathbb{R}^{d \times K}$ and $b^{(1)} \in \mathbb{R}^{K}$ be the weight and bias of the input-hidden layer. Figure 5 shows the architecture of the AE network.
4.1.1 Constructing a single layer
An auto-encoder (AE) network [34] is a variant of a three-layer MLP that models a set of inputs through unsupervised learning. The auto-encoder learns a latent representation that allows the reconstruction of the input vectors (see figure 5), which means that the input and output layers carry the same conceptual meaning. Given an input $x$, a set of weights $W$ maps the input to the latent layer $z$. From this latent layer, the input $x$ is reconstructed with another set of weights $V$. The feature layer has dimension $d$, corresponding to the size of the input vector (e.g. 128 for SIFT).
The coding layer has $K$ visual codewords used to encode the feature layer. The layers are connected via the weights $W$, our visual dictionary. Given a visual dictionary $W$, a feature $x$ can be encoded into its visual code $z$. The learning is concurrently regularized to guide the representation to be sparse.
Specifically, let $W \in \mathbb{R}^{d \times K}$ and $b^{(1)} \in \mathbb{R}^{K}$ be the weight and bias of the input-hidden layer, and $V \in \mathbb{R}^{d \times K}$ and $b^{(2)} \in \mathbb{R}^{d}$ be the weight and bias of the hidden-output layer. Suppose we have extracted $N$ descriptors $X = [x_1, x_2, \dots, x_N] \in \mathbb{R}^{d \times N}$ from the training images around $N$ interest points by dense sampling, where $K$ is the number of neurons in the hidden layer, corresponding to the size of the visual dictionary $W$, and $\hat{x}_j \in \mathbb{R}^{d}$ denotes the reconstruction of $x_j$. Our objective is to learn a shared mid-level visual dictionary $W = [w_1, w_2, \dots, w_K] \in \mathbb{R}^{d \times K}$ and a reconstruction weight $V = [v_1, v_2, \dots, v_K] \in \mathbb{R}^{d \times K}$ that make $\hat{x}_j$ as close to $x_j$ as possible, by minimizing the following objective function $E_1(W, V; X)$:

$$E_1(W, V; X) = \frac{1}{N} \sum_{j=1}^{N} \|x_j - \hat{x}_j\|_2^2 + \beta \sum_{k=1}^{K} KL(\rho \,\|\, \hat{\rho}_k) \quad (1)$$

$$\hat{x}_j = V \left(1 + \exp(-W^{T} x_j)\right)^{-1} \quad (2)$$

$$KL(\rho \,\|\, \hat{\rho}_k) = \rho \log \frac{\rho}{\hat{\rho}_k} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_k} \quad (3)$$

$$\hat{\rho}_k = \frac{1}{N} \sum_{j=1}^{N} \left(1 + \exp(-w_k^{T} x_j)\right)^{-1} \quad (4)$$

where $\|\cdot\|_2$ indicates the L2-norm, $W$ is the mid-level visual dictionary to be learned, subject to $\|w_k\|_2 = 1, \forall k = 1, 2, \dots, K$, to prevent $V$ from becoming arbitrarily large, $V$ is a reconstruction weight that reconstructs the input layer from the hidden layer, $\beta$ is a parameter controlling the weight of the sparsity penalty term, $\hat{\rho}_k$ is the activation of the $k$-th hidden node averaged over the $N$ training data, and $\rho$ is the target average activation of the hidden nodes. The Kullback-Leibler divergence $KL(\cdot)$ [35] is a standard function for measuring how different two distributions are. It has the property that $KL(\rho \,\|\, \hat{\rho}_k) = 0$ if $\hat{\rho}_k = \rho$; otherwise it increases monotonically as $\hat{\rho}_k$ diverges from $\rho$. Hence, it is used as the sparsity constraint.
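For illustration, the following NumPy sketch evaluates the objective of Eqs. (1)-(4) for given parameters. It is a minimal sketch, not the paper's training code: biases are omitted for brevity, and the values of rho and beta are placeholder assumptions, not the authors' settings.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sae_objective(X, W, V, rho=0.05, beta=3.0):
    """X: (d, N) descriptors; W: (d, K) visual dictionary; V: (d, K)
    reconstruction weights. Biases are omitted for brevity."""
    Z = sigmoid(W.T @ X)                 # (K, N) hidden codes, as in Eq. (2)
    X_hat = V @ Z                        # (d, N) reconstructions
    recon = ((X - X_hat) ** 2).sum() / X.shape[1]      # first term of Eq. (1)
    rho_hat = Z.mean(axis=1)             # average activations, Eq. (4)
    kl = (rho * np.log(rho / rho_hat)
          + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))).sum()  # Eq. (3)
    return recon + beta * kl             # Eq. (1)

# Toy check: 128-D descriptors, K = 16 codewords, 100 samples.
rng = np.random.default_rng(0)
d, K, N = 128, 16, 100
print(sae_objective(rng.normal(size=(d, N)), rng.normal(size=(d, K)) * 0.01,
                    rng.normal(size=(d, K)) * 0.01))
```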
4.1.2 Constructing a deep architecture
We extend the single-layer architecture to a hierarchical one. For deep fully-connected networks, we stack two SAEs: we greedily stack one additional auto-encoder, with a new set of parameters, that aggregates neighborhoods of outputs from the first auto-encoder. This allows the representation to be transformed from low-dimensional codes to a single high-dimensional vector that describes the entire image, as illustrated in figure 6.
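A minimal sketch of the resulting two-layer (stacked) encoder follows; the weight matrices are assumed to have been pre-trained greedily as described above, and the layer sizes mirror the K=1024 and K=512 configuration of figure 4.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode_two_layers(X, W1, W2):
    """X: (d, N) descriptors; W1: (d, K1) first-layer dictionary;
    W2: (K1, K2) second-layer weights. Returns (K2, N) codes."""
    Z1 = sigmoid(W1.T @ X)   # first SAE: descriptor -> visual code (K1 = 1024)
    Z2 = sigmoid(W2.T @ Z1)  # second SAE: re-codes the first-layer outputs
    return Z2

# Toy usage with the paper's layer sizes.
rng = np.random.default_rng(0)
codes = encode_two_layers(rng.normal(size=(128, 10)),
                          rng.normal(size=(128, 1024)) * 0.01,
                          rng.normal(size=(1024, 512)) * 0.01)
assert codes.shape == (512, 10)
```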
4.2 Spatial pyramid matching and max pooling
Using the SPM technique, each image is partitioned into increasingly fine sub-regions. We segment the image at different scales, compute the BoF histogram of the local features in each sub-region, and then concatenate all the histograms. With this strategy, the performance is considerably improved over the basic bag-of-features representation. We then apply max pooling to summarize the local features representing each sub-region, and concatenate the pooled features of the different pyramid levels to build the final image-level representation.
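The following sketch illustrates this pooling scheme under stated assumptions (a three-level pyramid with 1×1, 2×2 and 4×4 grids, and patch centers given in pixel coordinates); it is not the authors' implementation.

```python
import numpy as np

def spm_max_pool(codes, xs, ys, width, height, grids=(1, 2, 4)):
    """codes: (N, K) feature codes; xs, ys: (N,) patch centers in pixels.
    Returns the concatenated signature of length K * sum(g*g for g in grids)."""
    parts = []
    for g in grids:                              # levels 0-2: 1x1, 2x2, 4x4
        cx = np.minimum(xs * g // width, g - 1)  # column cell of each patch
        cy = np.minimum(ys * g // height, g - 1) # row cell of each patch
        for i in range(g):
            for j in range(g):
                cell = codes[(cx == i) & (cy == j)]
                # Max-pool the codes falling in this cell (zeros if empty).
                parts.append(cell.max(axis=0) if len(cell)
                             else np.zeros(codes.shape[1]))
    return np.concatenate(parts)

# Toy usage: 50 patches with 32-D codes on a 90x90 image.
rng = np.random.default_rng(0)
sig = spm_max_pool(rng.random((50, 32)), rng.integers(0, 90, 50),
                   rng.integers(0, 90, 50), 90, 90)
assert sig.shape == (32 * (1 + 4 + 16),)
```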
4.3 Scene character recognition
Scene character images segmented from text in the wild can be recognized by training a suitable classifier. The linear Support Vector Machine (SVM) [36] is among the most efficient machine learning techniques for high-dimensional data, since it is fast and capable of providing state-of-the-art recognition rates. Therefore, in this work, we chose a linear SVM with L2-regularization to classify the signatures into their categories.
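As an illustration of this final step, the sketch below uses scikit-learn's LinearSVC, which is an L2-regularized linear SVM; the library choice and the toy data are assumptions, since the paper does not name an implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Stand-in data: 100 pooled image signatures of dimension 64, 5 classes.
rng = np.random.default_rng(0)
train_signatures = rng.random((100, 64))
train_labels = rng.integers(0, 5, size=100)

clf = LinearSVC(penalty="l2", C=1.0)   # L2-regularized linear SVM
clf.fit(train_signatures, train_labels)
predicted = clf.predict(rng.random((10, 64)))
```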
5 MULTILINGUAL SCENE CHARACTER DATASETS
This section describes the scene character datasets used in the evaluation of this work: six scene character datasets covering five different languages, including two sets in English, one in Arabic, one in Devanagari, one in Kannada and one in Bengali.
5.1 ARASTI: An Arabic scene character dataset
In spite of the diversity of works on text recognition in natural scenes for a variety of languages [1], [4], [27], the recognition of Arabic text in the wild remains largely unaddressed, due to the complex shape of Arabic characters, which depends on their position within a word (isolated, beginning, middle, end), and to the lack of an available database of Arabic scene text images. In addition, publicly available OCR datasets for scene text recognition are scarce, and most existing datasets target English text. The most widely used in the community are the ICDAR 2003 Robust Reading dataset [37] and the Chars74K dataset proposed by [2], which addresses the recognition of English and Kannada characters in natural scene images; no Arabic dataset seems to exist for the purpose. Thus, the creation of an Arabic scene character dataset would have a significant impact on the advancement of scene text recognition research. Taking this into account, we initiated the development of a real dataset of annotated images of text in diverse natural scenes, captured under varying weather, lighting and perspective conditions, for Arabic scene text recognition: the ARASTI (Arabic Scene Text Image) dataset [38]. To our knowledge, ARASTI is the first dataset for Arabic scene text recognition in the real world. The Arabic text was collected from pictures of shopping areas, advertisements and hoardings in streets, signboards, roadside signs, etc.
Fig. 4: Our SAE-based BoF architecture for SCR, from image pixels to local features, feature codes and the final image signature: (1) local feature extraction; (2) coding with the first SAE hidden layer (K=1024); (3) coding with the second SAE hidden layer (K=512); (4) spatial pyramids with max-pooling; and classification using an SVM. The arrows indicate the operations performed during the various phases.
Fig. 5: Auto-encoder network.
Fig. 6: The BoF pipeline from the image pixel level to low-level descriptors, mid-level visual codes and a high-level image signature, via a two-layer architecture for coding local features.
We photographed a set of 364 images, taken in real-life situations without controlling the environmental conditions or refining the camera settings. Therefore, in our image dataset, the lighting conditions vary, and the images may be blurred or taken from different distances.
To validate a character recognition method, one needs a benchmark dataset covering all the varieties of characters. We manually segmented the individual characters, obtaining 55 classes covering the 28 Arabic characters in their different positions within the word (start, middle, end and isolated). The ARASTI dataset contains an unequal number of samples per character, as these are extracted from the text strings in the images; this is in line with the fact that, in daily usage, some characters are used more frequently than others. Our ARASTI dataset is freely available¹.
5.2 English scene character datasets
We evaluated the proposed recognition system on two English scene character datasets, i.e., the Chars74K [2] and ICDAR03-CH [37] datasets. These consist of 62 classes (0-9, A-Z, a-z). Character samples are collected from a wide variety of camera-captured images, such as book covers, road signs, brand logos and other texts randomly selected from various objects.
• The Chars74K dataset² contains 12,505 images of characters extracted from scene images. Various training and test splits of the dataset are provided for the research community to serve as a benchmark and compare results.
• The ICDAR03-CH dataset [37] contains 11,482 images of characters, comprising 6,113 training and 5,369 test images.
In order to compare our results with previously published findings, we concentrated on testing the Chars74K benchmark with 15 training images per class and the ICDAR03-CH benchmark with 15 training images per class.
1. http://www.regim.org/publications/databases/arabic-scenetext-image-of-regim-lab2015/
5.3 Indic scene character datasets
India is a multi-script country. In this paper, we evaluated our work on three different Indic scripts: Bengali, Devanagari and Kannada.
5.3.1 Bengali scene character dataset
The Bengali alphabet has 50 basic characters, comprising 11 vowels and 39 consonants. Additionally, it has several diacritics and a large number of conjunct characters. In this paper, we used the dataset proposed in [1]. It contains 19,530 Bengali character samples: 15,250 samples extracted from 260 scene images and another 4,280 manually created samples. Some of these Bengali character samples are shown in figure 2f. The samples are divided into two sets, 13,410 for training and 6,120 for testing.
5.3.2 Devanagari scene character dataset
In this paper, we used DSIW-3K [27] to evaluate the scene Devanagari recognition rate. The pictures were taken in real-life situations without refined camera settings or controlled environmental conditions, which may cause poor resolution and distortion (see figure 2d). The dataset contains an unequal number of samples per conjunct/character, as these are extracted from the text strings in the images.
5.3.3 Kannada scene character dataset
Kannada is an Indic script. We used the Kannada Chars74K dataset² [39] for our experiments (see figure 2e).
6 EXPERIMENTAL RESULTS
6.1 Experimental setting
The images were first resized to 90×90 pixels. Local descriptors were densely sampled from each image. We randomly sampled 200,000 patches for the ICDAR03-CH and Chars74K English datasets, 80,000 patches for the ARASTI dataset, and 150,000 patches for the Chars74K Kannada, Bengali and Devanagari datasets. The trained dictionary was used to encode the local features.
2. http://www.ee.surrey.ac.uk/CVSSP/demos/chars74k/
We performed feature extraction with three local descriptors, namely SIFT (Scale Invariant Feature Transform) [40], HOG (Histograms of Oriented Gradients) [41] and SURF (Speeded Up Robust Features) [42].
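A minimal sketch of dense patch sampling on the resized 90×90 images follows; the stride value is an assumption, as the paper does not report it.

```python
import numpy as np

def dense_patches(image, patch=16, stride=8):
    """image: (H, W) grayscale array. Yields (patch, patch) crops on a grid."""
    H, W = image.shape
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            yield image[y:y + patch, x:x + patch]

# A 90x90 image with 16-pixel patches and stride 8 yields a 10x10 grid.
n = sum(1 for _ in dense_patches(np.zeros((90, 90))))
assert n == 100
```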
6.2 Evaluation of each part of the system and of the parameters
6.2.1 Evaluation of local feature extraction
We report the recognition accuracy on the six datasets using SIFT [40], HOG [41] and SURF [42]. As shown in tables 2, 3 and 4, the best performance is achieved with the SIFT descriptor.
We also report the recognition accuracy on the ICDAR03-CH and Chars74K datasets when combining the responses of these three local features through the majority voting rule, which selects the label that receives more votes than any other. This rule was used in association with the SVM classifier to improve the accuracy of our classification.
As shown in tables 2, 3 and 4, the maximum recognition rate improved from 83.2% to 84.5% when applying majority voting over the three local features. Therefore, in the following evaluation of each part of the system, we combined the responses of these three local features through majority voting in all the experiments.
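A small sketch of the majority-voting fusion described above, assuming the per-descriptor classifiers each output one label per test sample:

```python
from collections import Counter

def majority_vote(pred_sift, pred_hog, pred_surf):
    """Fuses three per-sample label sequences by majority vote."""
    fused = []
    for votes in zip(pred_sift, pred_hog, pred_surf):
        # The label with the most votes wins; with three voters, a label
        # needs at least two votes to beat the others.
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused

# Toy usage: the second sample is decided by the HOG/SURF agreement.
assert majority_vote(['A', 'B'], ['A', 'C'], ['D', 'C']) == ['A', 'C']
```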
6.2.2 Evaluation of the patch size
As shown in figure 7, we varied the patch size. The best recognition accuracy was obtained with a patch size of 16 pixels. With too small a patch size, the extracted features can hardly capture any high-level statistics; with too large a patch size, the extracted local features become highly diverse.
Fig. 7: Recognition rate when varying the patch size used for local feature extraction.
6.2.3 Evaluation of the dictionary size
Our system remains very competitive even when the codebook size is reduced by a factor of four, from 4096 to 1024. As shown in figure 8, with only 256 codewords we still obtain accuracies greater than 50%. A larger codebook has more capacity to capture the diversity of the features, but is also more likely to exhibit codeword redundancies. From our experiments, 1024 codewords give a good balance between diversity and conciseness.
Fig. 8: Recognition rate when varying the dictionary size.
6.2.4 Evaluation of the number of training samples
To evaluate whether the representative ability of our system depends on having sufficient training samples, we report results with different numbers of training samples on the ICDAR dataset. The results show that using more training samples improves all the methods. Moreover, as shown in table 1, with only 75 training samples per class, we could still achieve an accuracy of 88.5%.
The results show that our system outperforms the CNN-based methods: ConvNet [44] achieves an accuracy of 86.5%, whereas our system reaches 88.5% with fewer training samples. This could be due to two reasons. First, the training data is insufficient, which causes the CNN to overfit. Second, the network parameters need to be studied further when applied to different text languages. In contrast, our proposed system shows its advantage under insufficient training data.
6.2.5 Evaluation of spatial pyramid matching
The results in figure 9 show that spatial pyramid matching improves the performance for the same total dictionary size. The results further indicate that spatial division is more effective for Arabic characters, which tend to have more complex structures. For all the scripts, the best performance is obtained with the finer spatial division (level 2).
6.3 Comparative study with state-of-the-art methods
6.3.1 Results on English scene character datasets
Table 2 shows the results obtained with training sets of 15 samples per class and test sets of 5 samples per class. As can be seen in table 2, the proposed system gives better results than the existing methods, except those based on
the CNN, which can be explained by the fact that we used a much smaller amount of training data. Table 2 compares our proposed system with existing systems. It can be concluded that our proposed system, using a Stacked Sparse Auto-encoder (SSAE) for supervised feature learning, gives better results than the existing methods. For further details, the confusion matrix is displayed in figure 10, where we notice that upper-case and lower-case letters such as 'X' and 'x', 'W' and 'w', 'O' and 'o', 'Z' and 'z' were confused. These characters are extremely difficult to distinguish, even by human eyes.
6.3.2 Results on ARASTI dataset
This subsection details the recognition results on Arabic characters. In our experiments, we compared against the results obtained using a conventional OCR system, Tesseract OCR³. Arabic scene character recognition is very challenging because some characters are distinguished from each other only by the number and position of dots around the basic shape of the character. To alleviate this problem, similar classes were merged: samples from the 55 initial classes (denoted 'ARASTI-55' in table 3) with similar structures were combined, yielding a total of 30 classes (denoted 'ARASTI-30' in table 3) to be recognized; they are shown in the confusion matrix in figure 11. Table 3 shows the results obtained with training sets of 15 samples per class and test sets of 5 samples per class.
We can conclude that our system yields competitive results, and that the SSAE is a good choice for Arabic character recognition. Its good performance can be explained by the fact that the SSAE is designed to learn the discriminative patches and keep the spatial information without information loss. For further details, figure 11 presents the confusion matrix. As it shows, the most obvious mistakes are mis-classifications between characters that are extremely difficult to distinguish, even for the human eye. When recognizing Arabic scene text, however, such characters can be distinguished by exploiting context information.
3. https://code.google.com/p/tesseract-ocr/
6.3.3 Results on Indic script scene character datasets
Table 4 lists the results obtained on the Indic script datasets. The Tesseract OCR result is not available, since Tesseract cannot deal with isolated characters of these scripts. As shown in table 4, the recognition accuracy outperforms the current state-of-the-art on the Bengali, Kannada and Devanagari datasets, even compared with the CNN [29]. This is due to the insufficient training data, which makes the CNN suffer from overfitting. Our proposed system demonstrates its efficiency even with insufficient training data and across different languages. Some successful and failed cases are shown in figure 12.

Fig. 9: Comparative results with different spatial pyramids (%).

Fig. 10: Confusion matrix resulting from the ICDAR03-CH dataset.

TABLE 1: Comparative results of our system and other methods with different numbers of training samples (TS) on the ICDAR dataset (%).

Feature coding method   Dictionary size   15 TS   30 TS   45 TS   75 TS
K-means [43]            512               59.5    61.5    68.2    70.9
                        1024              62.9    63.5    70.0    71.4
                        2048              62.7    63.0    70.4    70.6
                        4096              61.9    61.1    69.2    69.8
Sparse Coding [12]      512               73.8    74.5    76.1    78.9
                        1024              75.9    77.5    79.7    81.7
                        2048              75.3    77.4    79.8    81.0
                        4096              74.6    75.8    78.7    80.2
Sparse Auto-encoder     512               81.7    82.3    83.0    86.3
                        1024              83.2    83.9    85.1    88.5
                        2048              82.9    83.8    85.3    88.1
                        4096              81.0    82.7    84.7    87.4

Fig. 12: Successful and failed cases of our system. In each figure, the top row shows correctly recognized samples and the bottom row shows wrongly recognized ones. (a) Bengali, (b) Devanagari and (c) Kannada.

TABLE 4: The performance of our system on the Bengali, Kannada and Devanagari scene character datasets compared to previously published results using other methods (%). Stacked Sparse Auto-encoder (SSAE).

Method                                          Bengali   Kannada   Devanagari
Jaderberg et al. (CNN) [29]                     84.5      76.2      62.7
Sheshadri et al. [39]                           -         54.13     -
deCampos et al. [2]                             -         29.88     -
Narang et al. [27]                              -         -         56.0
Tounsi et al. [12]                              84.2      72.5      48.4
Tounsi et al. [13]                              87.7      76.5      58.7
SURF + SSAE + SVM                               78.0      65.2      53.7
HOG + SSAE + SVM                                86.9      76.4      64.1
SIFT + SSAE + SVM                               89.2      77.1      64.2
Majority voting (SIFT+HOG+SURF) + SSAE + SVM    90.7      79.4      66.6
7 CONCLUSION
In this paper, a sparse deep auto-encoder-based BoF system was proposed to fine-tune the visual dictionary and learn local feature codes, followed by multi-scale spatial max pooling and a linear SVM. This technique provides a more efficient feature representation, and thus a better recognition accuracy, than other feature representation techniques such as sparse coding. We exploited the modeling power of local descriptors and the spatial pooling of the BoF framework, together with the adaptability and representational power of deep learning.
The proposed system enabled us to tackle an important problem in text recognition in natural scene images: multilingual SCR. We handled scene character recognition problems in five languages, namely Arabic, English, Bengali, Devanagari and Kannada, towards the advancement of scene character recognition research for multiple scripts. In addition, we provided benchmark results for comparing Arabic scene character recognition methods.
Our future research will mainly include: first, adapting our framework to word-level text recognition in natural scene videos; second, exploring the use of GPUs to speed up the computation of the trained sparse feature codes; and finally, further investigating the recognition of scene text in the wild in other languages.
TABLE 3: The performance of our system on the Arabic scene character dataset compared to previously published results using other methods (%). Stacked Sparse Auto-encoder (SSAE).

Method                                          ARASTI-55   ARASTI-30
Tesseract OCR                                   19.6        22.3
Tounsi et al. [12]                              57.5        60.4
Tounsi et al. [13]                              61.2        65.1
SURF + SSAE + SVM                               50.1        62.4
HOG + SSAE + SVM                                61.7        71.8
SIFT + SSAE + SVM                               62.9        72.5
Majority voting (SIFT+HOG+SURF) + SSAE + SVM    65.4        74.7
Fig. 11: Confusion matrix resulting from the ARASTI dataset.
TABLE 2: The performance of our system compared to previously published results using other methods on English script (%). Stacked Sparse Auto-encoder (SSAE).

Method                                          ICDAR03-CH   Chars74K
Convolutional Co-occurrence of HOG [1]          81.7         -
CoHOG + Linear SVM [45]                         79.4         -
PHOG + Linear SVM [46]                          76.5         -
MLFP [5]                                        79           64
GHOG + SVM [3]                                  76           62
LHOG + SVM [3]                                  75           58
HOG + NN [16]                                   52           58
MKL [2]                                         -            55
NATIVE + FERNS [16]                             64           54
GB + SVM [2]                                    -            53
GB + NN [2]                                     -            47
SYNTH + FERNS [16]                              52           47
SC + SVM [2]                                    -            35
SC + NN [2]                                     -            34
SIFT + SVM [2]                                  -            21.40
ABBYY [2]                                       21           31
Foreground segmentation + HOG + SVM [47]        81.34        72.03
Strokelets [48]                                 69           62
Co-occurrence of strokelets [49]                82.7         67.5
PCANet [50]                                     75           64
HOG + Sparse Coding + SVM [51]                  72.27        -
SIFT + Sparse Coding + SVM [12]                 75.3         73.1
SIFT + SAE + SVM [13]                           79.3         75.4
SIFT + Sparse RBM + SVM [13]                    78.1         73.9
SURF + SSAE + SVM                               69.3         50.4
HOG + SSAE + SVM                                80.5         74.7
SIFT + SSAE + SVM                               83.2         77.9
Majority voting (SIFT+HOG+SURF) + SSAE + SVM    84.5         80.1
ACKNOWLEDGMENT
This work was performed in the framework of a MOBIDOC thesis financed by the EU under the PASRI program. The authors would also like to acknowledge the partial financial support of this work by grants from the General Direction of Scientific Research (DGRT), Tunisia, under the ARUB program. The research leading to these results received funding from the Ministry of Higher Education and Scientific Research of Tunisia under grant agreement number LR11ES48.
REFERENCES
[1] Shangxuan Tian, Ujjwal Bhattacharya, Shijian Lu, Bolan Su,
Qingqing Wang, Xiaohua Wei, Yue Lu, and Chew Lim Tan.
"Multilingual scene character recognition with co-occurrence of
histogram of oriented gradients". In: Pattern Recognition 51
(2016), pp. 125-134.
[2] Teofilo Emidio de Campos, Bodla Rakesh Babu, and Manik Varma.
"Character Recognition in Natural Images". In: (2009), pp. 273-
280.
[3] Chucai Yi, Xiaodong Yang, and Yingli Tian. "Feature
Representations for Scene Text Character Recognition: A
Comparative Study". In: ICDAR '13 (2013), pp. 907-911.
[4] Qixiang Ye and David S. Doermann. "Text Detection and
Recognition in Imagery: A Survey". In: IEEE Trans. Pattern Anal.
Mach. Intell. 37.7 (2015), pp. 1480-1500.
[5] Chen-Yu Lee, Anurag Bhardwaj, Wei Di, Vignesh Jagadeesh, and
Robinson Piramuthu. "Region-Based Discriminative Feature
Pooling for Scene Text Recognition". In: Proceedings of the 2014
IEEE Conference on Computer Vision and Pattern Recognition.
CVPR '14. 2014, pp. 4050-4057.
[6] Aymen Chaabouni, Houcine Boubaker, Monji Kherallah, Adel M.
Alimi, and Haikal El Abed. "Fractal and Multi-fractal for Arabic
Offline Writer Identification". In: 20th International Conference on
Pattern Recognition, ICPR 2010, Istanbul, Turkey, 23-26 August
2010. 2010, pp. 3793-3796.
[7] Najiba Tagougui, Monji Kherallah, and Adel M. Alimi. "Online Arabic
handwriting recognition: a survey". In: IJDAR 16.3 (2013), pp. 209-
226.
[8] Abdelkarim Elbaati, Houcine Boubaker, Monji Kherallah, Abdel
Ennaji, Haikal El Abed, and Adel M. Alimi. "Arabic Handwriting
Recognition Using Restored Stroke Chronology". In: 10th
International Conference on Document Analysis and Recognition,
ICDAR 2009, Barcelona, Spain, 26-29 July 2009. 2009, pp. 411-
415.
[9] Haikal El Abed, Monji Kherallah, Volker Margner, and Adel M.
Alimi. “On-line Arabic handwriting recognition competition - ADAB
database and participating systems". In: IJDAR 14.1 (2011), pp.
15-23.
[10] Fouad Slimane, Slim Kanoun, Haikal El Abed, Adel M. Alimi, Rolf
Ingold, and Jean Hennebert. “ICDAR 2011 - Arabic Recognition
Competition: Multi-font Multi-size Digitally Represented Text". In:
2011 International Conference on Document Analysis and
Recognition, ICDAR 2011, Beijing, China, September 18-21,
2011. 2011, pp. 1449-1453.
[11] Maroua Tounsi, Ikram Moalla, Adel M. Alimi, and Franck
Lebourgeois. “A comparative study of local descriptors for Arabic
character recognition on mobile devices". In: Seventh
International Conference on Machine Vision, ICMV 2014, Milan,
Italy, 19-21 November 2014. 2014, 94450B.
[12] M. Tounsi, I. Moalla, A. M. Alimi, and F. Lebourgeois. “Arabic
characters recognition in natural scenes using sparse coding for
feature representations". In: 2015 13th International Conference
on Document Analysis and Recognition (ICDAR). 2015, pp. 1036-
1040.
[13] Maroua Tounsi, Ikram Moalla, and Adel M. Alimi. “Supervised
dictionary learning in BoF framework for Scene Character
recognition". In: 23rd International Conference on Pattern
Recognition, ICPR 2016, Cancún, Mexico, December 4-8, 2016.
2016, pp. 3987-3992.
[14] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio,
and Pierre-Antoine Manzagol. “Stacked Denoising Autoencoders:
Learning Useful Representations in a Deep Network with a Local
Denoising Criterion". In: Journal of Machine Learning Research
11 (2010), pp. 3371-3408.
[15] A. Bissacco, M. Cummins, Y. Netzer, and H. Neven. “PhotoOCR:
Reading Text in Uncontrolled Conditions". In: (2013), pp. 785-792.
[16] Kai Wang, Boris Babenko, and Serge J. Belongie. “End-to-end
scene text recognition". In: IEEE International Conference on
Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13,
2011. 2011, pp. 1457-1464.
[17] Vibhor Goel, Anand Mishra, Karteek Alahari, and C. V. Jawahar.
“Whole is Greater than Sum of Parts: Recognizing Scene Text
Words." In: ICDAR. IEEE Computer Society, 2013, pp. 398-402.
[18] Cunzhao Shi, Chunheng Wang, Baihua Xiao, Yang Zhang, Song
Gao, and Zhong Zhang. “Scene Text Recognition Using Part-
Based Tree-Structured Character Detection". In: Proceedings of
the 2013 IEEE Conference on Computer Vision and Pattern
Recognition. CVPR '13. 2013, pp. 2961-2968.
[19] Xiangrong Chen and Alan L. Yuille. “Detecting and Reading Text
in Natural Scenes". In: 2004 IEEE Computer Society Conference
on Computer Vision and Pattern Recognition (CVPR 2004), with
CD-ROM, 27 June - 2 July 2004, Washington, DC, USA. 2004,
pp. 366-373.
[20] Lukas Neumann and Jiri Matas. “Real-Time Lexicon-Free Scene
Text Localization and Recognition". In: IEEE Trans. Pattern Anal.
Mach. Intell. 38.9 (2016), pp. 1872-1885.
[21] Jacqueline L. Feild and Erik G. Learned-Miller. “Improving Open-
Vocabulary Scene Text Recognition".
In: 12th International Conference on Document Analysis and
Recognition, ICDAR 2013, Washington, DC, USA, August 25-28,
2013. 2013, pp. 604-608.
[22] Qi Zheng, Kai Chen, Yi Zhou, Congcong Gu, and Haibing Guan.
“Text Localization and Recognition in Complex Scenes Using
Local Features". In: Computer Vision - ACCV 2010 - 10th Asian
Conference on Computer Vision, Queenstown, New Zealand,
November 8-12, 2010, Revised Selected Papers, Part III. 2010,
pp. 121-132.
[23] Cunzhao Shi, Chunheng Wang, Baihua Xiao, Song Gao, and
Jinlong Hu. “End-to-end scene text recognition using tree-
structured models". In: Pattern Recognition 47.9 (2014), pp. 2853-
2866.
[24] Trung Quy Phan, Palaiahnakote Shivakumara, Shangxuan Tian,
and Chew Lim Tan. “Recognizing Text with Perspective Distortion
in Natural Scenes". In: IEEE International Conference on
Computer Vision, ICCV 2013, Sydney, Australia, December 1-8,
2013. 2013, pp. 569-576.
[25] Yoshua Bengio. “Learning Deep Architectures for AI". In:
Foundations and Trends in Machine Learning 2.1 (2009), pp. 1-
127.
[26] Hanlin Goh, Nicolas Thome, Matthieu Cord, and Joo-Hwee Lim.
“Learning Deep Hierarchical Visual Feature Coding". In: IEEE
Trans. Neural Netw. Learning Syst. 25.12 (2014), pp. 2212-2225.
[27] Vipin Narang, Sujoy Roy, O. V. R. Murthy, and M. Hanmandlu.
“Devanagari Character Recognition in Scene Images". In:
Proceedings of the 2013 12th International Conference on
Document Analysis and Recognition. ICDAR '13. 2013, pp. 902-
906.
[28] Ouais Alsharif and Joelle Pineau. “End-to-End Text Recognition
with Hybrid HMM Maxout Models". In: CoRR abs/1310.1811
(2013).
[29] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. “Deep
Features for Text Spotting". In: Computer Vision - ECCV 2014 -
13th European Conference, Zurich, Switzerland, September 6-12,
2014, Proceedings, Part IV. 2014, pp. 512-528.
[30] Song Gao, Chunheng Wang, Baihua Xiao, Cunzhao Shi, and
Zhong Zhang. “Stroke Bank: A High-Level Representation for
Scene Character Recognition". In: 22nd International Conference
on Pattern Recognition, ICPR 2014, Stockholm, Sweden, August
24-28, 2014. 2014, pp. 2909-2913.
[31] Cunzhao Shi, Song Gao, Meng-Tao Liu, Cheng-Zuo Qi, Chunheng Wang, and Baihua Xiao. “Stroke Detector and Structure
Based Models for Character Recognition: A Comparative Study".
In: IEEE Trans. Image Processing 24.12 (2015), pp. 4952-4964.
[32] Song Gao, Chunheng Wang, Baihua Xiao, Cunzhao Shi, Wen
Zhou, and Zhong Zhang. “Scene Text Character Recognition
Using Spatiality Embedded Dictionary". In: IEICE Transactions
97-D.7 (2014), pp. 1942-1946.
[33] “Multi-order co-occurrence activations encoded with Fisher Vector
for scene character recognition". In: Pattern Recognition Letters
97 (2017), pp. 69-76.
[34] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-
Antoine Manzagol. “Extracting and Composing Robust Features
with Denoising Autoencoders".
In: Proceedings of the 25th International Conference on Machine
Learning. ICML '08. Helsinki, Finland, 2008, pp. 1096-1103.
[35] S. Kullback and R. A. Leibler. “On Information and Sufficiency". In: Ann. Math. Statist. 22.1 (1951), pp. 79-86.
[36] Luc T. Wille. “Review of 'Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond'". In: SIGACT News 35.3 (2004), pp. 13-17.
[37] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R.
Young. “ICDAR 2003 robust reading competitions". In: Seventh
International Conference on Document Analysis and Recognition,
2003. Proceedings. 2003, pp. 682-687.
[38] Maroua Tounsi, Ikram Moalla, and Adel M. Alimi. “ARASTI: A
database for Arabic scene text recognition". In: 1st International
Workshop on Arabic Script Analysis and Recognition, ASAR
2017, Nancy, France, April 3-5, 2017. 2017, pp. 140-144.
[39] Karthik Sheshadri and Santosh Divvala. “Exemplar Driven
Character Recognition in the Wild". In: Proceedings of the British
Machine Vision Conference. BMVA Press, 2012, pp. 13.1-13.10.
[40] David G. Lowe. “Distinctive Image Features from Scale-Invariant
Keypoints". In: Int. J. Comput. Vision 60.2 (2004), pp. 91-110.
[41] N. Dalal and B. Triggs. “Histograms of oriented gradients for
human detection". In: 2005 IEEE Computer Society Conference
on Computer Vision and Pattern Recognition (CVPR'05). Vol. 1.
2005, 886-893 vol. 1.
[42] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool.
“Speeded-Up Robust Features (SURF)". In: Comput. Vis. Image
Underst. 110.3 (2008).
[43] S. Lazebnik, C. Schmid, and J. Ponce. “Beyond Bags of Features:
Spatial Pyramid Matching for Recognizing Natural Scene
Categories". In: 2006 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition (CVPR'06). Vol. 2.
2006, pp. 2169-2178.
[44] Michael Opitz, Markus Diem, Stefan Fiel, Florian Kleber, and
Robert Sablatnig. “End-to-End Text Recognition Using Local
Ternary Patterns, MSER and Deep Convolutional Nets". In: SBES
'13 (2013), pp. 186-190.
[45] Shangxuan Tian, Shijian Lu, Bolan Su, and Chew Lim Tan. “Scene
Text Recognition Using Co-occurrence of Histogram of Oriented
Gradients". In: 2013 12th International Conference on Document
Analysis and Recognition, Washington, DC, USA, August 25-28,
2013. 2013, pp. 912-916.
[46] Zhi Rong Tan, Shangxuan Tian, and Chew Lim Tan. “Using
pyramid of histogram of oriented gradients on natural scene text
recognition". In: 2014 IEEE International Conference on Image
Processing, ICIP 2014, Paris, France, October 27-30, 2014. 2014,
pp. 2629-2633.
[47] Muhammad Fraz, M. Saquib Sarfraz, and Eran A. Edirisinghe.
“Exploiting colour information for better scene text detection and
recognition". In: IJDAR 18.2 (2015), pp. 153-167.
[48] Xiang Bai, Cong Yao, and Wenyu Liu. “Strokelets: A Learned
Multi-Scale Mid-Level Representation for Scene Text
Recognition". In: IEEE Trans. Image Processing 25.6 (2016), pp.
2789-2802.
[49] S. Gao, C. Wang, B. Xiao, C. Shi, W. Zhou, and Z. Zhang.
“Learning co-occurrence strokes for scene character recognition
based on spatiality embedded dictionary". In: 2014 IEEE
International Conference on Image Processing (ICIP). 2014, pp.
5956-5960.
[50] Chongmu Chen, Da-Han Wang, and Hanzi Wang. “Scene
character recognition using PCANet". In: Proceedings of the 7th
International Conference on Internet Multimedia Computing and
Service, ICIMCS 2015, Zhangjiajie, Hunan, China, August 19-21,
2015. 2015, 65:1-65:4.
[51] Dong Zhang, Da-Han Wang, and Hanzi Wang. “Scene text
recognition using sparse coding based features". In: 2014 IEEE
International Conference on Image Processing, ICIP 2014, Paris,
France, October 27-30, 2014. 2014, pp. 1066-1070.
Maroua Tounsi received the engineering degree in computer science from the National Engineering School of Sfax (ENIS), Tunisia, in 2012.
She is now preparing her PhD within the REsearch Groups in Intelligent Machines at the National School of Engineers of Sfax, Tunisia. Her PhD project was carried out within the framework of a MOBIDOC PhD (PASRI program), funded by the EU and managed by the ANPR. She is an IEEE Student Member. Her main research activities focus on document analysis, pattern recognition and segmentation methods for natural scene analysis.
Ikram Moalla is currently an Assistant Professor in the department of computer science and applied mathematics at ENIS, University of Sfax. She is also a member of REGIM-Lab, the Research Group on Intelligent Machines. Her main research activities focus on document analysis, pattern recognition, feature extraction, segmentation-free methods for paleography and natural scene analysis, and have led to several published papers in refereed journals and proceedings of international conferences. She has been involved in different projects.
Frank Lebourgeois has been a senior researcher at the LIRIS laboratory of INSA Lyon (France) since 1992. He is specialized in the domain of Document Image Analysis (DIA) and his work covers a wide range of applications in the domain of document imaging. With more than 25 years of experience in this field, he has published more than 70 papers in international conferences and journals and has served on numerous program committees of international and national conferences. He is a reviewer for international journals (PAMI, PR, CVIU, IJDAR...) and international conferences of the domain (ICDAR, ICFHR, DAS,...).
Adel M. Alimi obtained a PhD and then an HDR, both in Electrical and Computer Engineering, in 1995 and 2000 respectively. He has been a full Professor in Electrical Engineering at the University of Sfax, ENIS, since 2006. Prof. Alimi is the founder and director of the REGIM-Lab. on intelligent Machines. He is the holder of 15 Tunisian patents. He has also served as an expert evaluator for the European Agency for Research since 2009. He is the founder and chair of many IEEE Chapters in the Tunisia section: IEEE Sfax Subsection Chair (since 2011), IEEE Systems, Man, and Cybernetics Society Tunisia Chapter Chair (since 2011), IEEE Computer Society Tunisia Chapter Chair (since 2010), and IEEE ENIS Student Branch Counselor (since 2010). His research interests include iBrain (evolutionary computing, swarm intelligence, artificial immune systems, fuzzy sets, uncertainty analysis, fractals, support vector machines, artificial neural networks, case-based reasoning, wavelets, hybrid intelligent systems, nature-inspired computing techniques, machine learning, ambient intelligence, hardware implementations, multi-agent systems), iPerception (classification, classifier combination, feature extraction, high-dimension classification problems, handwriting recognition, handwriting modeling, motor control, perception, OCR, Arabic script, historical document analysis, iDocument, iVision, remote sensing, affective computing), iData (big data, iWeb, biometry, data mining, web mining, cloud computing) and iRobotics (robotics, multi-robot and autonomous robots, complex and distributed systems, parallel and distributed architectures, intelligent control, embedded systems). Prof. Alimi has served as associate editor and member of the editorial board of many international scientific journals (e.g. IEEE Transactions on Fuzzy Systems, Pattern Recognition Letters, NeuroComputing, Neural Processing Letters, International Journal of Image and Graphics, Neural Computing and Applications, International Journal of Robotics and Automation, International Journal of Systems Science, etc.). He was guest editor of several special issues of international journals (e.g. Fuzzy Sets and Systems, Soft Computing, Journal of Decision Systems, Integrated Computer-Aided Engineering, Systems Analysis Modelling and Simulation). He organized many international conferences, including ICDAR 2015, SoCPaR 2014, ICALT 2014 and 2013, ICBR 2013, HIS 2013, IAS 2013, ISI 2012, NGNS 2011, LOGISTIQUA 2011, ACIDCA-ICMI 2005, SCS 2004 and ACIDCA 2000.