Challenging Issues in Devanagari Script Recognition
Rameshwar S. Mohite 1, Balaji R. Bombade 2
1 M. Tech. Student, CSE Dept., SGGS IE&T Nanded-431606, India. 2 Assistant Professor, CSE Dept., SGGS IE&T Nanded-431606, India.
1rameshwar.mohite@gmail.com
2b.r.bombade@gmail.com
Abstract
Extensive research has been carried out on optical character
recognition (OCR), and a large number of articles have been
published on this topic during the last few decades. OCR plays a
vital role in digital image processing and pattern recognition.
Numerous works have been reported for Roman, Chinese, Japanese and
Arabic scripts, whereas comparatively little work has been done on
Indian script recognition. In India, more than 300 million people
use the Devanagari script for documentation. Although several
efficient methodologies for Devanagari script recognition have been
proposed, the recognition accuracy for Devanagari is not yet
comparable to that achieved for foreign scripts. This is
predominantly due to the large variety of characters/symbols in the
Devanagari script and the close similarity among them. In this
paper, we discuss some challenging issues that arise during the
recognition of Devanagari script.
Keywords— OCR, image processing, peculiarities
of the Devanagari script, challenging issues in
Devanagari script.
1. Introduction
Machine simulation of human activities has been one of the most
challenging research areas since the advent of digital computers.
The main reason for such an effort was not only the challenge of
simulating human reading, but also the possibility of practical
applications in which the information present on paper documents has
to be transferred into machine-editable form. OCR is the process of
automatic computer recognition of characters and symbols in optically
scanned and digitized pages of text [1]. Automatic recognition of
information present on documents such as cheques, envelopes, forms,
and other manuscripts has numerous practical and commercial
applications in banks, post offices, libraries, publication houses,
language processing, and forensic investigation.
Presently there are many OCR systems available for handling printed
English documents with reasonable levels of accuracy. Such systems
are also available for many European languages as well as some
Asian languages such as Japanese and Chinese. However, not many
efforts have been reported on developing OCR systems for Indian
languages. India is a multilingual, multi-script country with
twenty-two officially recognized languages, written in eleven
scripts. Devanagari is an old script that evolved from the Brahmi
script. It is used to write many languages such as Hindi, Konkani,
Marathi, Nepali, Sanskrit, Bodo, Dogri and Maithili. Hindi is an
official language of India. Devanagari is reported to be the second
most widely used script in the Indian subcontinent and the third
most widely used in the world [2]. About 300 million people use the
Devanagari script for documentation in the central and northern
parts of India [3]. It also serves as an auxiliary script for other
languages such as Punjabi, Sindhi and Kashmiri.
The rest of this paper is organized as follows: Section 2 reviews
previous work. Section 3 describes the peculiarities of the
Devanagari script. Challenging issues are discussed in Section 4.
The conclusion is given in Section 5.
2. Previous Work
OCR work on the printed Devanagari script started in the early
1970s. A good survey of the work done on offline recognition of
Devanagari script is given in [4]. The first complete OCR systems
for printed Devanagari are perhaps due to Palit and Chaudhuri [5]
and to Pal and Chaudhuri [6]. Work on machine-printed Devanagari has
also been carried out by Bansal et al. [7]. A syntactic pattern
analysis system for Devanagari script recognition is presented in
Sinha's Ph.D. thesis [8]. OCR is classified into two types: offline
recognition and online recognition. In offline recognition the
source is an image or a scanned form of the document, whereas in
online recognition the successive points are represented as a
function of time and the order of strokes is also available. The
study scrutinises the direction of character recognition (CR)
research, analysing the limitations of the methodologies for such
systems, which can be classified based upon two major criteria:
Rameshwar S Mohite et al, Int.J.Computer Technology & Applications,Vol 5 (3),947-952
IJCTA | May-June 2014 Available online@www.ijcta.com
947
ISSN:2229-6093
The data acquisition process (on-line or off-line).
The text type (machine-printed or handwritten).
Regardless of the type to which the problem belongs, an OCR system
in general consists of the following major phases:
Pre-processing.
Segmentation.
Feature extraction.
Classification and recognition.
2.1 Pre-processing
Pre-processing consists of several sub-processes that clean the
document image and make it suitable for accurate recognition. The
main sub-processes are binarization, noise reduction, skew
correction and thinning. Binarization transforms a grayscale image
into a black-and-white image. Image binarization methods fall into
two main classes: global and local. In a global approach, threshold
selection results in a single threshold value for the entire image;
the most commonly used method is Otsu's method [9]. A local approach
uses local information to guide the threshold value pixel-wise in an
adaptive manner and is well suited for degraded documents [10].
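As an illustration of the global approach, the following is a minimal sketch of Otsu-style threshold selection in plain Python. The function names `otsu_threshold` and `binarize`, the toy image, and the convention that dark pixels are ink are assumptions made for this example; they are not taken from [9] or from any particular OCR system.

```python
def otsu_threshold(pixels):
    """Return the grey level t in 0..255 that maximizes between-class variance.

    `pixels` is an iterable of integer grey levels in 0..255. Pixels with
    value <= t form one class and the remaining pixels form the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of grey levels in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance is undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(image, threshold):
    """Map a grayscale image (list of rows) to 1 for ink (dark) and 0 for paper."""
    return [[1 if p <= threshold else 0 for p in row] for row in image]


# Toy 3x4 scan: dark ink (about 20-40) on light paper (about 200-230).
image = [
    [210, 30, 25, 220],
    [200, 40, 35, 230],
    [215, 20, 225, 205],
]
t = otsu_threshold(p for row in image for p in row)
binary = binarize(image, t)
```

A single threshold works here because the toy image has two well-separated grey-level groups; on unevenly lit or degraded pages a local method of the kind described in [10] would be the more natural choice.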
A histogram-based thresholding approach can also be used to convert
a grayscale image into a two-tone image. Digital images are
susceptible to several types of noise. Noise in a document image is
often due to a poor-quality optical scanning device. Salt-and-pepper
noise arises from the scanning process and the quality of the paper
being scanned, and corrupts individual pixels. A median filter is
used to remove salt-and-pepper noise [11], [12]. Wiener filtering
and morphological operations can also be used to remove noise [12].
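The effect of a median filter on salt-and-pepper noise can be sketched as follows. This is a minimal plain-Python illustration; the function name `median_filter_3x3`, the 3x3 window size, and the choice to copy border pixels unchanged are assumptions made to keep the example short.

```python
from statistics import median


def median_filter_3x3(image):
    """Replace each interior pixel by the median of its 3x3 neighbourhood.

    Border pixels are copied unchanged to keep the sketch short.
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            window = [image[y + dy][x + dx]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = median(window)
    return out


noisy = [
    [200, 200, 200, 200, 200],
    [200, 200,   0, 200, 200],   # isolated "pepper" pixel
    [200, 200, 200, 200, 200],
    [200, 255, 200, 200, 200],   # isolated "salt" pixel
    [200, 200, 200, 200, 200],
]
clean = median_filter_3x3(noisy)
```

Both isolated outliers are replaced by the surrounding paper value, because a single extreme value never reaches the middle of the sorted 3x3 window.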
When a document is scanned using an optical scanner, a small degree
of skew is unavoidable. The skew angle is the angle that the text
lines in the digital image make with the horizontal direction. Skew
estimation and correction are important pre-processing steps of
document layout analysis. In [13], a rule-based approach is proposed
that removes irrelevant data and corrects skew in scanned text
documents of Devanagari script. Thinning is a technique that reduces
strokes to single-pixel width so that characters can be recognized
more easily. It is applied iteratively, leaving only a pixel-wide
linear representation of each character. Thinning preserves the
shape information of the characters. Detailed information about
thinning algorithms is available in [14].
2.2 Segmentation
Segmentation is the process of splitting the document image into
lines, words and characters/symbols. It is a vital phase of an OCR
system because it affects the recognition rate. Segmentation can be
external or internal. External segmentation is the separation of
various text parts, such as paragraphs, sentences or words. In
internal segmentation, an image of a sequence of characters is
separated into sub-images of individual characters. The segmentation
process involves three steps, namely line segmentation, word
segmentation and character segmentation. Segmentation of lines and
words is done using the horizontal and vertical projection profiles
of the scanned document image [15]. Bansal and Sinha [16] proposed a
method for segmenting touching and fused Devanagari characters in
printed text. Their strategy uses a two-pass algorithm for the
segmentation and decomposition of Devanagari composite characters
into their constituent symbols. In the first pass, words are
segmented into easily separable characters or composite characters.
Statistical information about the height and width of each separated
box is used to hypothesize whether a character box is composite. In
the second pass, the hypothesized composite characters are segmented
further. The algorithm makes extensive use of the structural
properties of the script. In [16], characters are segmented from
each Devanagari word after removal of the shirorekha. In [18], the
touching characters are first identified and then segmented into
basic ones by a new fuzzy decision-making approach. This idea was
motivated by an examination of the complex ways in which characters
touch each other in the Devanagari and Bangla scripts.
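The projection-profile idea used for line and word segmentation can be sketched as below. This is a simplified illustration on a tiny binary image (1 = ink, 0 = paper), not the procedure of [15]; the helper names `ink_runs`, `segment_lines` and `segment_words` are hypothetical, and the sketch assumes that lines and words are separated by at least one completely blank row or column.

```python
def ink_runs(profile):
    """Return (start, end) pairs, end exclusive, for each run of non-zero values."""
    runs, start = [], None
    for i, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = i
        elif ink == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_lines(page):
    """Text lines are runs of rows whose horizontal projection is non-zero."""
    return ink_runs([sum(row) for row in page])


def segment_words(page, line):
    """Words are runs of columns whose vertical projection, taken inside one line, is non-zero."""
    top, bottom = line
    width = len(page[0])
    profile = [sum(page[y][x] for y in range(top, bottom)) for x in range(width)]
    return ink_runs(profile)


page = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 1, 1, 0],   # first text line: two words joined by their header lines
    [0, 1, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],   # second text line: one word
    [1, 0, 0, 1, 0, 0, 0, 0],
]
lines = segment_lines(page)
words_in_first_line = segment_words(page, lines[0])
```

Because the header line joins the characters of a Devanagari word, column runs in this sketch correspond to words rather than characters; overlapping lines or closely spaced words would violate the blank-gap assumption.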
2.3 Feature Extraction
Feature extraction is highly problem dependent. Good features are
those whose values are similar for objects belonging to the same
category and distinct for objects in different categories. A better
approach to recognition is to segment characters into basic symbols
and then recognize each symbol. The system described by Sinha and
Mahabala [19] for printed Devanagari characters stores structural
descriptions for each symbol of the script in terms of primitives
and their relationships. Bansal and Sinha [17] considered several
statistical classifying features such as horizontal zero crossings,
moments, vertex points, and pixel density in different zones for
Devanagari characters. They also considered word envelope
information comprising the number of character boxes, the number of
vertical bars, the number of upper modifier boxes, the number of
lower modifier boxes, a vector giving the positions of the vertical
bars, and a vector giving the type and locus of each character box.
Jawahar et al. [20] used principal component analysis (PCA) for
feature extraction of printed characters. A word-level matching
scheme for searching in printed document images is proposed by
Meshesha and Jawahar [21]. Their feature-extraction scheme extracts
local features by scanning vertical strips of the word image and
combines them automatically based on their discriminatory potential.
The features considered are word profiles, moments, and
transform-domain representations. In [22], a technique to identify
Kannada, Hindi, and English text lines in a printed document is
presented. The system is based on the upper and lower profiles of
isolated text lines of the input document image. The locations of
the connected components of the upper and lower profiles are
extracted, and the coefficients of variation of the upper and lower
profiles are then calculated. In [23], five different features are
considered: the lower profile, the upper profile, the ink-background
transitions, the number of black pixels, and the span of the
foreground pixels. The upper and lower profiles measure the distance
of the top and bottom foreground pixels from the respective
baselines. The ink-background transitions feature measures the
number of transitions from ink to background and back. The number of
black pixels provides information about the density of ink in the
vertical strip.
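The five strip-wise features listed above can be sketched for a binary word image (1 = ink, 0 = background) as follows. This is an illustrative reading of the description, not the implementation of [23]: the function names are hypothetical, each strip is assumed to be one pixel wide, and the top and bottom edges of the image are assumed to serve as the baselines.

```python
def column_features(word, x):
    """Compute the five profile features for the one-pixel-wide strip at column x."""
    column = [row[x] for row in word]
    height = len(column)
    ink_rows = [y for y, v in enumerate(column) if v == 1]
    if not ink_rows:
        # Empty strip: profiles are set to the full height by convention (an assumption).
        return {"upper": height, "lower": height,
                "transitions": 0, "black": 0, "span": 0}
    top, bottom = ink_rows[0], ink_rows[-1]
    # Count every change between ink and background while walking down the strip.
    transitions = sum(1 for a, b in zip(column, column[1:]) if a != b)
    return {
        "upper": top,                    # distance from the top edge to the first ink pixel
        "lower": height - 1 - bottom,    # distance from the bottom edge to the last ink pixel
        "transitions": transitions,
        "black": len(ink_rows),          # number of ink pixels in the strip
        "span": bottom - top + 1,        # extent covered by foreground pixels
    }


def word_features(word):
    """Feature sequence for a word image, one feature dictionary per column."""
    return [column_features(word, x) for x in range(len(word[0]))]


word = [
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
]
features = word_features(word)
```

The result is a left-to-right sequence of small feature vectors, which is the kind of input a sequence classifier can consume without prior character segmentation.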
2.4 Classification and Recognition
The extracted features are given as input to the decision-making
part of the recognition system. The performance of a classifier
relies on the quality of the features. Two main types of approaches
have been applied to character recognition [24]:
The holistic approach.
The analytical approach.
In the holistic approach, recognition is performed globally on the
whole image of a word and there is no attempt to identify characters
individually. The main advantage of holistic methods is that they
avoid segmenting words into characters [25]. Their main drawback is
that they are vulnerable when recognizing long words, for which
recognition accuracy is reduced. The analytical approach deals with
several levels of representation of the image, that is, sub-word or
letter recognition. The main advantage of analytical methods is that
they support an unlimited vocabulary and recognition accuracy is
high. Their main drawback is that they are vulnerable to
segmentation errors. Analytical methods require both external and
internal segmentation. Approaches used in existing systems to
classify the extracted features include neural networks, support
vector machines and combination classifiers.
A neural network is a computing architecture that involves massively
parallel interconnection of adaptive neural processors. The most
popular neural networks in OCR systems are multi-layer perceptrons
(MLP). MLPs are used as classifiers because of their universal
approximation property and good generalization ability [26], [30]. A
back-propagation neural network classifier is proposed by K. Y.
Rajput et al. [27]. In [23], a recurrent neural network known as
Bidirectional Long Short-Term Memory (BLSTM) is used. The support
vector machine (SVM) is based on statistical learning theory. A
classification task generally involves separating the data into two
sets, training and testing sets. Each instance in the training set
contains one target value and several attributes. Many researchers
have used SVMs successfully, viz. C. V. Jawahar et al. [20], Sandhya
Arora et al. [26] and Umapada Pal et al. [28]. Numerous
classification methods have been proposed for Devanagari script
recognition, and each method has specific strengths and weaknesses.
Hence, combination classifiers are often used to solve a given
classification problem. For Indian scripts, combinations of
classifiers such as SVM and ANN [26], K-Means and SVM [29], and MLP
and minimum edit distance [31] have been used.
3. Peculiarities of Devanagari Script
Devanagari script has 34 consonants (Vyanjana) and 13 vowels
(Swara). Basic characters are formed from vowels and consonants. A
vowel can be an independent letter or one of a variety of accent
symbols written above, below, to the left or to the right of the
consonant it belongs to. When vowels are written in this way they
are known as modifiers, and the characters so formed are called
conjuncts; different conjunct forms are shown in figure 1.
Occasionally, two or more consonants merge and take a new shape.
This shape is known as a composite character. Devanagari is written
from left to right. It has no upper and lower case characters. Every
character has a horizontal line at the top, called the shirorekha or
header line. The header lines of two or more basic or composite
characters join to form a word.
Figure 1. Different conjunct forms.
Devanagari words can normally be divided into three
discrete zones: top zone, core zone, and bottom zone.
The top zone and core zone are always separated by the header line,
whereas there is no analogous feature to separate the bottom zone
from the core zone. The top zone contains the top modifiers, and the
bottom zone contains the lower modifiers. The core zone contains the
vowels, consonants, conjunct forms and composite characters, as
shown in figure 2.
Figure 2. Three zones of a Devanagari script.
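The zone structure can be sketched on a binary word image (1 = ink). The sketch below assumes, as a heuristic not stated in the text, that the header line is the band of rows with the most ink; `find_header_line`, `split_zones` and the `ratio` parameter are hypothetical names. Since no analogous feature separates the bottom zone from the core zone, the sketch leaves those two zones together.

```python
def find_header_line(word, ratio=0.8):
    """Return (first_row, end_row) of the header band, end exclusive.

    The band is grown around the row with the largest horizontal projection
    and includes neighbouring rows whose ink count is at least `ratio` times
    that peak. Assumes the word image contains some ink.
    """
    profile = [sum(row) for row in word]
    peak = max(profile)
    peak_row = profile.index(peak)
    start = peak_row
    while start > 0 and profile[start - 1] >= ratio * peak:
        start -= 1
    end = peak_row + 1
    while end < len(profile) and profile[end] >= ratio * peak:
        end += 1
    return start, end


def split_zones(word):
    """Split a word image into the top zone, the header line and everything below it."""
    start, end = find_header_line(word)
    return {
        "top": word[:start],
        "header": word[start:end],
        "core_and_bottom": word[end:],
    }


word = [
    [0, 0, 1, 0, 0, 0],   # top modifier
    [0, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],   # header line
    [0, 1, 0, 0, 1, 0],   # core strokes
    [0, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
]
header = find_header_line(word)
zones = split_zones(word)
```

The heuristic depends on the header line being the densest row band, which fails in exactly the cases discussed in the next section, for example when the header line is missing or when font sizes vary within a word.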
4. Challenging issues in Devanagari script
Recognition of printed Devanagari script is a challenging problem
because the same character varies in appearance with font family,
font size, font orientation, etc. Differences in font family and
size make the recognition task difficult; under such conditions
pre-processing, feature extraction and recognition are not robust.
Sometimes the same font and size may also contain bold-face as well
as normal characters. Thus, the stroke width is also an issue that
hampers recognition. For example, in the four-character word image
shown in figure 3, the font size of the first character is larger
than that of the remaining characters. Therefore, the characters
within the word do not all fall under a single shirorekha or header
line.
Figure 3. Font variation in Devanagari script.
The header line plays a vital role in Devanagari script
recognition. It is used for identifying word boundaries and for skew
correction. If the header line is absent from a word, skew
correction and character segmentation of the printed word become
problematic. The presence of more than one header line may cause the
word to be confused with two text lines.
A Devanagari word carries information about the number of
characters, the number of vertical bars, the number of modifiers,
and the positions of the vertical bars. Based on this information a
meaningful word can be recognized. A Devanagari word contains a
complex mixture of some or all of the above elements; the recognizer
must identify the upper modifiers, the lower modifiers and the exact
position of each modifier, and a weak join between a modifier and
its character sometimes creates confusion. Moreover, when the
position of a modifier is missed, the word cannot be understood. The
situation becomes more difficult when there is a gap between the
character and the modifier, so that the modifier does not touch the
core character at all. The bottom modifier called "nukta" (a dot at
the bottom) usually does not touch the core character. Figure 4
illustrates one such example.
Figure 4. The gap between character and modifier.
Image degradation can arise from multiple sources such as poor ink
quality, small spacing between characters, document age, etc. If the
document is heavily degraded, any meaningful extraction of
information is very difficult. OCR for Devanagari script becomes
even more difficult when composite characters and modifiers occur
together in a noisy image. Upper modifiers such as the anuswar are
small and difficult to distinguish from noise. Figure 5a illustrates
an example: the small dot on top of the word is actually a valid
mark, but because of its small size it can be mistaken for noise.
There are several isolated marks that are vowel modifiers, namely
"Anuswar", "Visarga" and "Chandra Bindu", which add to the
misinterpretation. Possibly most errors in conjunct recognition are
due to confusion with vowels and the virama symbol. When a word
contains a special symbol such as omkar or the rupee sign,
recognition becomes very difficult. This is illustrated in figure
5b.
Figure 5. (a) Confused with modifier and noise.
(b) Word with special symbol.
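The confusion between a small modifier and noise can be made concrete with a naive size-based noise filter. The sketch below labels connected components in a binary word image (1 = ink) and deletes those below a minimum area; the function names, the 8-connectivity choice and the `min_area` value are assumptions for illustration only.

```python
from collections import deque


def connected_components(binary):
    """Return a list of components, each a list of (row, col) ink pixels (8-connectivity)."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] == 1 and not seen[y][x]:
                seen[y][x] = True
                queue, pixels = deque([(y, x)]), []
                while queue:
                    cy, cx = queue.popleft()
                    pixels.append((cy, cx))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = cy + dy, cx + dx
                            if (0 <= ny < height and 0 <= nx < width
                                    and binary[ny][nx] == 1 and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                components.append(pixels)
    return components


def remove_small_components(binary, min_area):
    """Erase every connected component with fewer than `min_area` pixels."""
    out = [row[:] for row in binary]
    for pixels in connected_components(binary):
        if len(pixels) < min_area:
            for y, x in pixels:
                out[y][x] = 0
    return out


word = [
    [0, 0, 0, 1, 0, 0, 0, 0],   # anuswar-like dot above the header line (valid mark)
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],   # header line
    [0, 1, 0, 0, 1, 0, 0, 1],   # stray speck at the far right (noise)
    [0, 1, 0, 0, 1, 0, 0, 0],
]
cleaned = remove_small_components(word, min_area=2)
```

In this toy image the valid dot and the noise speck have the same area, so the filter removes both; distinguishing them needs additional evidence such as the position of the mark relative to the header line.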
Devanagari words consist of modifiers, curved shapes, joined/fused
characters and composite characters, which calls for segmentation at
different levels. For these reasons, recognition of Devanagari
script is a difficult job compared with English. Line separation may
be ambiguous because of overlapping text lines. If two top modifiers
touch each other, they are segmented as one component. For instance,
"Chandra" and "Bindu" are two components that usually occur together
in many words, but because of image degradation they may overlap
each other and no longer appear as separate components. The problem
of separating the lower modifier from the consonant is greatest in
characters whose lower part contains a loop resembling a lower
modifier. The gap between words is an important factor for word
separation, because closely spaced words may not get segmented into
individual words. Devanagari words contain characters such as अ (a)
and भ (bha). If we remove the header line using vertical projection,
the shape of such characters is lost, so the characters may be
misinterpreted during recognition. Figure 6 illustrates this
consequence. Errors that arise in text line segmentation also create
problems in word segmentation and character segmentation.
Figure 6. (a) Word contains अ and भ.
(b) Loss of shape of अ and भ.
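The loss of shape after header line removal can be reproduced on a toy example. In the sketch below the first character consists of two strokes that are joined only through the header line; the image, the function names and the rule that the header line is the single row with the most ink are assumptions for illustration.

```python
def remove_header_line(word):
    """Blank out the row with the largest horizontal projection (assumed to be the header line)."""
    profile = [sum(row) for row in word]
    header_row = profile.index(max(profile))
    return [[0] * len(row) if y == header_row else row[:]
            for y, row in enumerate(word)]


def column_runs(binary):
    """Return (start, end) column ranges, end exclusive, that contain ink."""
    width = len(binary[0])
    profile = [sum(row[x] for row in binary) for x in range(width)]
    runs, start = [], None
    for x, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = x
        elif ink == 0 and start is not None:
            runs.append((start, x))
            start = None
    if start is not None:
        runs.append((start, width))
    return runs


# Two characters: columns 0-3 (two strokes joined only by the header line)
# and columns 6-8 (strokes also joined below the header line).
word = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1],   # header line
    [1, 0, 0, 1, 0, 0, 1, 0, 1],
    [1, 1, 0, 1, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0, 1],
]
before = column_runs(word)
after = column_runs(remove_header_line(word))
```

With the header line present the whole word is one run of columns; after removal the vertical projection yields three segments for two characters, because the first character falls apart into its two strokes.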
In the classification phase, some characters have similar features,
which creates confusion for the classifier. Certain character
classes are confused with one another because of their similarity,
as shown in figure 7a. In figure 7b, "na" (the second character in
the image) can be confused with "ta". This happens because the "na"
character in Devanagari has a hole at the beginning; when the hole
gets filled up, the character can be confused with "ta" [23].
Figure 7. (a) Similar features character.
(b) Ambiguity with na and ta.
A conjunct formed with ra and a dash look alike in the Devanagari
script but have two different meanings depending on the position in
the word. As shown in figure 8a, the conjunct symbol is interpreted
and pronounced as a combination of 'ra' + 'ya', whereas in figure 8b
the dash is used in place of the word "to". It is difficult for the
classifier to distinguish the conjunct symbol from the dash because
of these two different meanings.
Figure 8. Analogous feature symbol.
The quality of the training data also affects the performance of a
word recognizer. Poor-quality hardware leads to improperly generated
documents, in which important and critical parts of a word are lost
at the initial stage itself. This misleads word recognition,
particularly for Devanagari script. Recognition of Devanagari script
requires algorithms designed specifically for it. Better algorithms
and techniques for correct and efficient recognition are required
because of the problems discussed throughout this paper.
5. Conclusion
OCR techniques and algorithms vary from script to script. Different
approaches to Devanagari script recognition have been proposed in
various journal articles, but the recognition rate is not yet high
and improvements are marginal. Generally, recognition rates depend
upon pre-processing, segmentation and feature extraction. Existing
methodologies are not capable of segmenting and recognizing
documents in complex cases. Devanagari script recognition is a
difficult task. In this paper, the problems have been elaborated;
solving them in the future will make OCR systems more effective.
6. References
[1] U. Pal and B. B. Chaudhuri, "Indian script character recognition: A survey", Pattern Recognit., vol. 37, pp. 1887-1899, 2004.
[2] Raghuraj Singh, C. S. Yadav, Prabhat Verma, Vibhash Yadav, "Optical Character Recognition for Printed Devanagari Script Using Artificial Neural Network", IJCSC, Vol. 1, No. 1, pp. 91-95, Jan-June 2010.
[3] R. M. K. Sinha, "A journey from Indian scripts processing to Indian language processing", IEEE Ann. Hist. Comput., vol. 31, no. 1, pp. 8–31, Jan./Mar. 2009.
[4] R. Jayadevan, S. R. Kolhe, P. M. Patil and U. Pal, "Offline Recognition of Devanagari Script: A Survey", IEEE Transactions on Systems, Man and Cybernetics-Part C: Applications and Reviews, 2011.
[5] S. Palit, B. B. Chaudhuri, "A feature-based scheme for the machine recognition of printed Devanagari script", Pattern Recognition, Image Processing and Computer Vision, India, pp. 163-168, 1995.
[6] U. Pal and B. B. Chaudhuri, "Printed Devanagari script OCR system", Vivek, vol. 10, pp. 12–24, 1997.
[7] V. Bansal, "Integrating Knowledge Sources in Devanagari Text Recognition System", Ph.D. Thesis, 1996.
[8] R. M. K. Sinha, "A Syntactic pattern analysis system and its application to Devnagari script recognition", Ph.D. Thesis, Dept. Elect. Eng., Indian Institute of Technology, Kanpur, India, 1973.
[9] N. Otsu, "A threshold selection method from grey level histogram", IEEE Trans on SMC, Vol. 9, pp. 62-66, 1979.
[10] Y. Yang and H. Yan, "An adaptive logical method for binarization of degraded document images", Pattern Recognition (33), pp. 787-807, 2000.
[11] G. G. Rajput, Rajeswari Horakeri, "Shape Descriptors based Handwritten Character Recognition Engine with Application to Kannada Characters", International Conference on Computer & Communication Technology (ICCCT), pp. 135-141, 2011.
[12] P. Patidar, M. Gupta, S. Shrivastava and A. Nagawat, "Image De-noising by Various Filter for Different Noise", vol. 9, no. 4, pp. 45-50, Nov. 2010.
[13] Pramod Kumar Sharma, Kapil Dev Dhingra, Sudip Sanyal, "A Rule Based Approach for Skew Correction and Removal of Insignificant Data from Scanned Text Documents of Devanagari Script", SITIS, pp. 899-903, 2007.
[14] L. Lam, S. W. Lee, and C. Y. Suen, "Thinning Methodologies - A Comprehensive Survey", IEEE Trans. PAMI, vol. 14, pp. 869–885, Sept. 1992.
[15] B. B. Chaudhuri and U. Pal, "An OCR system to read two Indian language scripts: Bangla and Devanagari", in Proc. 4th Conf. Document Anal. Recognit., pp. 1011–1015b, 1997.
[16] V. Bansal, R. Sinha, "Segmentation of touching and fused Devanagari characters", Pattern Recogn. 35 (4), pp. 875–893, 2002.
[17] V. Bansal and R. M. K. Sinha, "Integrating knowledge sources in Devanagari text recognition", IEEE Trans. Syst. Man Cybern. A: Syst. Hum., vol. 30, no. 4, pp. 500–505, Jul. 2000.
[18] U. Garain, B. Chaudhuri, "Segmentation of touching characters in printed Devanagari and Bangla scripts using fuzzy multifactorial analysis", IEEE Trans. Syst. Man Cybern. Part C 32 (4), pp. 449–459, 2002.
[19] R. M. K. Sinha and H. Mahabala, "Machine recognition of Devnagari script", IEEE Trans. Syst. Man Cybern., vol. 9, no. 8, pp. 435–441, Aug. 1979.
[20] C. V. Jawahar, P. Kumar, and S. S. R. Kiran, "Bilingual OCR for Hindi-Telugu documents and its applications", in Proc. 7th Conf. Document Anal. Recognit., pp. 1–5, 2003.
[21] M. Meshesha and C. V. Jawahar, "Matching word images for content-based retrieval from printed document images", Int. J. Document Anal. Recognit., vol. 11, pp. 29–38, 2008.
[22] P. A. Vijaya and M. C. Padma, "Text line identification from a multilingual document", in Proc. Int. Conf. Digital Image Process., pp. 302–305, 2009.
[23] Naveen Sankaran and C. V. Jawahar, "Recognition of Printed Devanagari Text Using BLSTM Neural Network", ICPR, pp. 322-325, IEEE, 2012.
[24] J. Hull, T. K. Ho, J. Favata, V. Govindaraju, S. Srihari, "Combination of Segmentation-based and Wholistic Handwritten Word Recognition Algorithms", Elsevier Publ., pp. 261-272, 1992.
[25] J. Rocha and T. Pavlidis, "New method for word recognition without segmentation", in Proceedings of SPIE, volume 1906, page 76, 1993.
[26] Sandhya Arora et al., "Performance Comparison of SVM and ANN for Handwritten Devnagari Character Recognition", IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 3, May 2010.
[27] K. Y. Rajput and Sangeeta Mishra, "Recognition and Editing of Devnagari Handwriting Using Neural Network", Proceedings of SPIT-IEEE Colloquium and International Conference, Mumbai, India, Vol. 1, 66.
[28] Umapada Pal, Sukalpa Chanda, Tetsushi Wakabayashi, Fumitaka Kimura, "Accuracy Improvement of Devnagari Character Recognition Combining SVM and MQDF".
[29] Satish Kumar, "Evaluation of Orthogonal Directional Gradients on Hand-Printed Datasets", Intl. Journal of Information Technology and Knowledge Management, Volume 2, No. 1, pp. 203-207, Jan-Jun 2009.
[30] Anil K. Jain, Robert P. W. Duin, and Jianchang Mao, "Statistical Pattern Recognition: A Review", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 1, pp. 4-37, January 2000.
[31] Sandhya Arora, Debotosh Bhattacharjee, Mita Nasipuri, D. K. Basu, M. Kundu, "Recognition of Non-Compound Handwritten Devnagari Characters using a Combination of MLP and Minimum Edit Distance", International Journal of Computer Science and Security (IJCSS), Volume (4): Issue-1, pp. 107-120.