Chapter 5: Pre-processing and Segmentation Stages of Handwritten Arabic Text
74
CHAPTER 5
Pre-processing and Segmentation Stages of Off-line
Handwritten Arabic Text
This chapter discusses the pre-processing and segmentation stages proposed and
implemented in this thesis. Pre-processing techniques are applied to Arabic text to
produce standard data for the segmentation and recognition stages of the proposed
system. A new segmentation algorithm for off-line handwritten Arabic text is
presented in this chapter. Conclusions and references are given at the end of
this chapter.
5.1: Pre-processing Stage
The pre-processing stage is the first step in any OCR system. It produces data on
which OCR systems can operate accurately. The following techniques are
implemented in the pre-processing stage of the system to increase the performance of
the segmentation and recognition stages:
5.1.1: Noise Reduction
Filtering techniques such as the median filter and morphological operations,
together with noise modelling techniques such as skew [1], are applied to off-line
handwritten Arabic text to reduce noise.
5.1.1.1: Median Filter:
The median filter is a nonlinear digital filtering technique, often used to
reduce noise such as salt-and-pepper noise in an image [2]. It is
widely used in digital image processing. This technique replaces a pixel by the
median of all pixels in the neighbourhood:
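$$\hat{f}(x,y) = \operatorname*{median}_{(s,t) \in S_{xy}} \{\, g(s,t) \,\}$$
where $g$ is the input noisy image, $S_{xy}$ denotes the sub-image of $g$
centred at coordinates $(x,y)$, and $\hat{f}(x,y)$ represents the filter output at
$(x,y)$.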
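The median filter described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the thesis: the function name `median_filter`, the 3×3 default window, and the handling of border pixels (only neighbours inside the image are used) are all assumptions made for the example.

```python
from statistics import median_low

def median_filter(image, size=3):
    """Replace each pixel by the median of its size x size neighbourhood.

    `image` is a list of rows of grey levels. Border pixels use only the
    neighbours that fall inside the image (an assumption of this sketch).
    """
    half = size // 2
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for x in range(rows):
        for y in range(cols):
            window = [
                image[s][t]
                for s in range(max(0, x - half), min(rows, x + half + 1))
                for t in range(max(0, y - half), min(cols, y + half + 1))
            ]
            # median_low always returns a value that occurs in the window.
            out[x][y] = median_low(window)
    return out

noisy = [
    [10, 10, 10],
    [10, 255, 10],   # a single "salt" pixel
    [10, 10, 10],
]
cleaned = median_filter(noisy)
```

In this small example the isolated bright pixel is replaced by the grey level of its neighbours, which is the behaviour that makes the filter suitable for salt-and-pepper noise.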
5.1.1.2: Morphological Operations:
Mathematical morphology can be defined as a tool for extracting image
components that are useful in the segmentation and description of region
shape, such as boundaries and skeletons. Morphological operations such as
morphological filtering, thinning, and pruning play an important role in
pre-processing.
5.1.2: Normalization
Normalization is considered the most important step in the pre-processing stage; it
removes some variations in text images without affecting the identity of the text. The
basic methods for normalization are as follows:
5.1.2.1: Skew Normalization
Deskewing is the process of first detecting whether the handwritten word or text
line has been written on a slope, and then rotating the word or text line, if the
slope angle is too high, so that the baseline of the word is horizontal [3]. Brown
and Ganapathy [4] illustrate some techniques for correcting slope. Figure 5.1(a)
shows the vertical histogram of the sloped text line in Figure 5.1(b), and
Figure 5.1(c) shows the result of slope correction for that text line.
Figure 5.1: (a) The vertical histogram of the sloped text line in (b).
(c) The result of the slope correction process applied to (b)
The slope correction process is carried out as follows:
1. Find the bounding box of the word image.
2. Construct the vertical histogram of the word image.
3. Divide the vertical histogram vertically into two equal parts.
4. Calculate the number of pixels in each part of the vertical histogram.
5. Find the absolute difference between the pixel counts of the two parts.
6. If the absolute difference is small, there is no slope. Otherwise, if the number
of white pixels in the left part is greater than the number of white pixels in the
right part, the slope angle is negative, i.e. the image should be rotated
clockwise around its centre point; if it is smaller, the slope angle is
positive, i.e. the image should be rotated counter-clockwise around its
centre point.
7. Find the slope angle.
8. Rotate the word image according to the slope angle and direction.
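Steps 1 to 6 of the procedure above can be sketched as follows. This is only an illustration under stated assumptions: the image is a list of rows with 1 for ink and 0 for white, the vertical histogram is read as the number of white pixels per column inside the bounding box, and `tolerance` is a hypothetical threshold for "the absolute difference is small". Estimating the angle (step 7) and rotating (step 8) are not covered, since the text does not specify how the angle is computed.

```python
def slope_direction(word, tolerance=2):
    """Return "none", "clockwise" or "counter-clockwise" for a word image."""
    # Step 1: bounding box of the ink pixels (the image must contain ink).
    ink_rows = [r for r, row in enumerate(word) if any(row)]
    ink_cols = [c for c in range(len(word[0])) if any(row[c] for row in word)]
    box = [row[ink_cols[0]:ink_cols[-1] + 1]
           for row in word[ink_rows[0]:ink_rows[-1] + 1]]

    # Step 2: vertical histogram, read here as white pixels per column.
    height = len(box)
    white = [height - sum(column) for column in zip(*box)]

    # Steps 3-5: two equal halves (a middle column is ignored if the
    # width is odd) and the absolute difference of their totals.
    half = len(white) // 2
    white_left = sum(white[:half])
    white_right = sum(white[len(white) - half:])
    if abs(white_left - white_right) <= tolerance:
        return "none"

    # Step 6: more white on the left means a negative slope angle.
    return "clockwise" if white_left > white_right else "counter-clockwise"

rising = [
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
]
mirrored = [row[::-1] for row in rising]
flat = [[1, 1], [1, 1]]
```

Here `slope_direction(rising)` reports a clockwise correction, the mirrored image reports the opposite direction, and the uniform block reports no slope.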
5.1.2.2: Baseline Detection
The baseline is a median line in the Arabic word along which all successive characters
are connected to each other. The baseline corresponds to the row containing the
majority of foreground (black) pixels in the area of a text line or a word. It is
defined as follows:
$$\text{Baseline} = \arg\max_{i} \sum_{j} W(i,j)$$
where $W(i,j)$ is a binary image (foreground pixels taken as 1) representing either a
text line or a word, and $(i,j)$ index the rows and columns respectively.
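As a minimal sketch of this definition, the baseline can be located as the row with the largest horizontal projection. The function name `detect_baseline` and the convention that ink pixels are 1 are assumptions of the example; ties are resolved in favour of the topmost row.

```python
def detect_baseline(word):
    """Return the index of the row holding the most foreground pixels."""
    row_sums = [sum(row) for row in word]          # horizontal projection
    return max(range(len(row_sums)), key=row_sums.__getitem__)

word = [
    [0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],   # the connecting stroke
    [0, 1, 0, 0, 0, 0],
]
baseline_row = detect_baseline(word)
```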
Figure 5.2: The word image with its baseline
The baseline in Arabic text contains valuable information about the orientation of the
text and the location of connection points between characters [5].
5.1.2.3: Size Normalization
Before the features of a handwritten character separated from a word are extracted,
the original character image can be normalized and encoded into a standard form so
that different images of the same character are encoded similarly. Since the sizes of
Arabic characters vary greatly, size normalization is often used to scale characters to a
fixed size and to centre the character before recognition [6][7]. Size normalization is
an important pre-processing technique in any OCR system. In the size normalization
process implemented in this thesis, if a separated character (an output of the
segmentation stage) is a single component, it is normalized to 48×48 pixels. Otherwise,
the primary component and the secondary component are cropped from the segmented
character and normalized to 48×48 and 22×22 pixels respectively. The size
normalization process works as follows. For each separated primary component or
single-component character, the actual image is first cropped automatically. For example, the
original image of the Arabic character ‘Ha’a’, which is of size 40×40 as shown in
Figure 5.3(a), is cropped automatically into a new image by removing the white space in
the original image; its size becomes 30×30, as shown in Figure 5.3(b). The new image
in Figure 5.3(b) is then resized to the proposed size of 40×40 using the bilinear
interpolation algorithm; see Figure 5.3(c). Finally, the 40×40 image is placed at the
centre of a new white image of size 48×48, as shown in Figure 5.3(d). The size
48×48 was selected because the images in the IHACDB database used in this
research are of the same size.
The same size normalization technique is also applied to secondary components, with
a proposed size of 18×18 pixels and a normalized size of 22×22 pixels.
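The three steps of this process (crop, bilinear resize, centre on a white canvas) can be sketched as below. The sketch assumes grey-level images stored as lists of rows, with 255 as white background; the function names and the corner-aligned sampling used in `resize_bilinear` are choices made for the illustration and are not taken from the thesis.

```python
def crop_to_ink(img, white=255):
    """Remove the all-white rows and columns surrounding the character."""
    rows = [r for r, row in enumerate(img) if any(v != white for v in row)]
    cols = [c for c in range(len(img[0])) if any(row[c] != white for row in img)]
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]

def resize_bilinear(img, new_h, new_w):
    """Resize with bilinear interpolation (corner-aligned sampling)."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(new_h):
        # Map the output row back to a fractional source row.
        y = i * (h - 1) / (new_h - 1) if new_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        dy = y - y0
        row = []
        for j in range(new_w):
            x = j * (w - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            dx = x - x0
            top = img[y0][x0] * (1 - dx) + img[y0][x1] * dx
            bottom = img[y1][x0] * (1 - dx) + img[y1][x1] * dx
            row.append(round(top * (1 - dy) + bottom * dy))
        out.append(row)
    return out

def centre_on_canvas(img, size, white=255):
    """Place img at the centre of a white size x size image."""
    h, w = len(img), len(img[0])
    top, left = (size - h) // 2, (size - w) // 2
    canvas = [[white] * size for _ in range(size)]
    for r in range(h):
        canvas[top + r][left:left + w] = img[r]
    return canvas

def normalize_size(img, inner=40, outer=48):
    """Crop, resize to inner x inner, then centre on an outer x outer canvas."""
    return centre_on_canvas(resize_bilinear(crop_to_ink(img), inner, inner), outer)

# A 40x40 white image holding a 30x30 black block, as in the Ha'a example.
original = [[0 if 5 <= r < 35 and 5 <= c < 35 else 255 for c in range(40)]
            for r in range(40)]
normalized = normalize_size(original)
```

Calling `normalize_size(component, inner=18, outer=22)` would apply the same steps with the sizes quoted for secondary components.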
5.1.3: Compression in the amount of information to be retained
Thresholding and thinning are two popular compression techniques, described as
follows:
5.1.3.1: Thresholding
Thresholding can be used to create a binary image from a grayscale image [8].
Sezgin and Sankur [9] present a categorized survey of image thresholding methods.
Our system uses Otsu’s method for thresholding images [10].
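For readers unfamiliar with Otsu's method, the following sketch shows the core idea: choose the grey level that maximises the between-class variance of the two resulting pixel classes. It is a plain illustration over a flat list of 8-bit grey levels, not the code used in the thesis; the function name and the convention that pixels at or below the threshold become ink are assumptions.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t that maximises the between-class variance.

    Pixels with value <= t form one class and the remaining pixels the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_low, sum_low = 0, 0
    for t in range(levels):
        weight_low += hist[t]
        sum_low += t * hist[t]
        if weight_low == 0:
            continue                  # no pixels in the lower class yet
        weight_high = total - weight_low
        if weight_high == 0:
            break                     # every pixel is in the lower class
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        between = weight_low * weight_high * (mean_low - mean_high) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t

pixels = [20] * 50 + [30] * 50 + [200] * 50 + [210] * 50
t = otsu_threshold(pixels)
binary = [1 if p <= t else 0 for p in pixels]   # 1 = dark ink, 0 = background
```

With this bimodal sample the threshold falls between the dark and bright groups, so the 100 dark pixels are labelled as ink.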
Figure 5.3: Size normalization of the Arabic character Ha’a: (a) 40×40 pad,
(b) 30×30 cropped image, (c) 40×40 enlarged image, (d) 48×48 normalized image
5.1.3.2: Thinning
Thinning is a morphological operation that reduces the strokes in a binary image to
a uniform width of one pixel. Numerous thinning algorithms have been devised and
applied to a great variety of patterns for different purposes. In recent years,
"thinning" and "skeletonization" have become almost synonymous in the literature,
and the term "skeleton" is used to refer to the result, regardless of the shape of the
original pattern or the method employed [11].
Figure 5.4: An Arabic sentence image before and after thinning
5.2: Segmentation Stage
Character segmentation has long been a critical area of the OCR process [12]. This
thesis presents a new algorithm for segmenting off-line handwritten Arabic text.
The algorithm divides the segmentation stage into three levels: segmenting a text page
into lines, finding the word boundaries of each text line, and then segmenting each
word into characters. For each text line, the algorithm produces a novel matrix
consisting of segmentation points and their position status. The output of the
segmentation stage is the segmented characters with their position status, which are
fed character by character to the recognition stage. The segmentation algorithm has
been tested and has given a high correct segmentation rate.
5.2.1: Line Segmentation
Arabic script is written from right to left and top to bottom. A text line is separated
from the previous and following text lines by white space. In bad handwriting,
overlapping between text lines may occur.
Figure 5.5: Horizontal Projection Profile of an Arabic paragraph and its Line
Segmentations
To segment text lines, the horizontal projection profile (histogram) of the Arabic
document is computed by counting the number of black pixels in each row. A zero value
in a horizontal or vertical projection corresponds to white space (a gap). Arabic
text lines are separated by these horizontal gaps. The gap between two consecutive
peaks in the horizontal profile denotes the boundary between two text lines, and a
text line is found between two consecutive boundary lines. Figure 5.5 shows an Arabic
paragraph with its horizontal projection profile. In this figure, dotted horizontal lines
represent line segmentations.
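The projection-based line segmentation described above can be sketched as follows. The page is assumed to be a binary image stored as a list of rows with 1 for black pixels; the function names are illustrative. The sketch relies on rows with a zero count separating the lines, so it does not handle the overlapping lines mentioned earlier.

```python
def horizontal_profile(page):
    """Number of black pixels in each row of a binary page (1 = black)."""
    return [sum(row) for row in page]

def segment_lines(page):
    """Return (first_row, last_row) pairs, one per text line."""
    lines, start = [], None
    for r, count in enumerate(horizontal_profile(page)):
        if count > 0 and start is None:
            start = r                        # entering a text line
        elif count == 0 and start is not None:
            lines.append((start, r - 1))     # a zero row closes the line
            start = None
    if start is not None:                    # text reaches the bottom edge
        lines.append((start, len(page) - 1))
    return lines

page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
]
text_lines = segment_lines(page)
```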
5.2.2: Segmentation of a text line into words
In Arabic script, there is no connection between separate words, so word boundaries
are always represented by a space. Letters are connected at the same relative height.
Arabic words are separated from each other by a sufficiently large space. To
segment a text line into its words, the vertical projection profile (histogram) of the
Arabic text line is constructed by counting the number of black pixels in each column.
For word segmentation, a text line is scanned vertically [13]. The white space to the
right and left of a text line must be removed first.
In the phase that segments a text line into words, as proposed and implemented in
this thesis, for each vertical histogram the leftmost column and the rightmost
column of every gap of consecutive columns having zero black pixels are considered
segmentation points. These segmentation points are saved in the first row of a
two-row matrix named WordSegMat. The values one and two respectively are saved in
the second row; the couple (1:2) denotes that the two segmentation points lie between
two words or sub-words. The indexes of the first and last columns of the text line are
also added to WordSegMat, with the values (2) and (1) respectively in the second row.
WordSegMat is then arranged in ascending order. Figure 5.6 shows an Arabic text line
with its vertical projection profile. In this figure, dotted vertical lines represent
word boundaries.
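A minimal sketch of how WordSegMat could be built from a vertical projection profile is given below. The function name `build_word_seg_mat`, the 0-based column indexes, and the representation of the matrix as two Python lists are assumptions of the example; the text line is assumed to be trimmed already, so that its first and last columns contain ink.

```python
def build_word_seg_mat(profile):
    """Build the two-row WordSegMat from a vertical projection profile.

    profile[c] is the black-pixel count of column c of a trimmed text line.
    Row 1 holds the segmentation columns, row 2 the values 1 or 2.
    """
    points = [(0, 2), (len(profile) - 1, 1)]   # first and last columns
    gap_start = None
    for c, count in enumerate(profile):
        if count == 0 and gap_start is None:
            gap_start = c                      # a gap of zero columns begins
        elif count > 0 and gap_start is not None:
            points.append((gap_start, 1))      # leftmost column of the gap
            points.append((c - 1, 2))          # rightmost column of the gap
            gap_start = None
    points.sort()                              # ascending column order
    return [[c for c, _ in points], [v for _, v in points]]

profile = [3, 2, 0, 0, 4, 1, 0, 2]
word_seg_mat = build_word_seg_mat(profile)
```

Reading the second row from left to right, every (2:1) pair encloses a word or sub-word and every (1:2) pair encloses a gap; a single-column gap contributes the same column twice.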
Figure 5.6: Word boundary identification in an Arabic text line with the help of
vertical Projection
5.2.3: Segmentation of Cursive Arabic Words
Arabic script is cursive: characters within a word are normally joined, even in
machine print. Arabic words consist of one or more sub-words. For example, the
word (جامعة, University) consists of two sub-words: the first sub-word
consists of two characters while the second consists of three characters. Any
error in segmenting the basic shape of Arabic characters will produce a different
representation of the character component. Two techniques have been applied for
segmenting machine-printed and handwritten Arabic words into individual characters:
implicit and explicit segmentation [14]. The segmentation method of this research
belongs to implicit segmentation, which segments words directly into letters.
In the cursive word segmentation phase proposed and implemented in this thesis, a
cursive word is cropped from the text line for each pair of consecutive columns of
WordSegMat. To explain the method, suppose that the word image in Figure 5.7 is its
input. For each cropped word, the following steps are carried out in sequence to
determine the segmentation points of the word:
Figure 5.7: An example of a word image
1. Bounding box:
In this step an exact image is cropped from the whole image by removing the
surrounding empty lines. Noise removal is also applied to the cropped image. The
clean image is then thinned to be ready for the next step, as shown in Figure 5.8.
Figure 5.8: Cleaned and Thinned image
2. Baseline detection of the word:
Horizontal projection is a simple and common method for detecting the
baseline. Figure 5.9 shows the input word along with its baseline as follows:
Figure 5.9: Thinned word image with its horizontal histogram and baseline
3. Strokes (Diacritical marks and Dots) Detection and Removal
Once the baseline has been found, all isolated characters, strokes, and dots that lie
off the baseline are removed, as shown in Figure 5.10.
Figure 5.10: The word without strokes
4. Determination of Segmentation points
Here, the vertical histogram of the Arabic word is constructed. Since Arabic is
written from right to left, the vertical histogram is scanned in the same direction to
find segmentation points. If a run of more than four successive columns of the
vertical histogram each contain exactly one pixel, then the leftmost column of that
run is considered a segmentation point. These segmentation points are saved in the
first row of a two-row matrix named CharSegMat. Only the value zero is saved in the
second row; zero denotes that every segmentation point lies between two cursive
(connected) characters. Figure 5.11 shows the input word with its vertical histogram.
The dotted lines in the figure represent the segmentation points.
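The rule for locating segmentation points inside a thinned word can be sketched as follows. The function name `build_char_seg_mat`, the `min_run` parameter (set to four, following the text), and the 0-based column indexes are assumptions of this illustration; any later refinement of the points is not reproduced.

```python
def build_char_seg_mat(profile, min_run=4):
    """Return the two-row CharSegMat for a thinned, stroke-free word.

    A segmentation point is the leftmost column of every run of more than
    `min_run` successive columns that each hold exactly one pixel.
    """
    columns = []
    run_length = 0
    # Scan right to left, following the Arabic writing direction.
    for c in range(len(profile) - 1, -1, -1):
        if profile[c] == 1:
            run_length += 1
            continue
        if run_length > min_run:
            columns.append(c + 1)      # leftmost column of the finished run
        run_length = 0
    if run_length > min_run:           # a run touching the left edge
        columns.append(0)
    columns.sort()
    return [columns, [0] * len(columns)]

# Vertical histogram of a made-up thinned word.
profile = [2, 1, 1, 1, 1, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2]
char_seg_mat = build_char_seg_mat(profile)
```

In this profile the runs at columns 1 to 5 and 10 to 15 are long enough to yield points, while the two-column run at columns 7 to 8 is ignored.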
Figure 5.11: Vertical histogram for the thinned input image with segmentation points
5.2.4: The output of the segmentation stage
The output of the phase that segments each text line into words is a matrix called
WordSegMat. The output of the phase that segments words into characters is another
matrix called CharSegMat. WordSegMat is combined with CharSegMat to form a new
matrix, AllSegMat, as shown in equation 5.3.
AllSegMat = WordSegMat(2,N) ∪ CharSegMat(2,M)          (5.3)
N and M are the numbers of columns in WordSegMat and CharSegMat respectively.
AllSegMat contains all the segmentation points of a given text line or word.
AllSegMat is arranged in ascending order and then refined to produce accurate
segmentation points. Figure 5.12 shows an Arabic text line with its segmentation
points, which are represented as solid vertical lines. The numbers below the solid
lines are row 2 of the matrix AllSegMat.
Figure 5.12: The segmentation points of a text line
The boundaries of a character are found in columns i-1 and i of the first row of the
matrix, AllSegMat(1,i-1) and AllSegMat(1,i). The position of the character within a
word, i.e. beginning, middle, end, or isolated, is determined by the second row of the
matrix, AllSegMat(2,i-1) : AllSegMat(2,i). Five combinations are formed as follows:
(0:0) Middle position
(0:1) Beginning position
(2:0) End position
(2:1) Isolated character
(1:2) Space between two words
Finally, each segmented character with its position status is fed to the recognition
stage.
The segmentation points determine whether the segmented character belongs to the
beginning of a word (class 2), the middle of a word (class 3), or the end of a word
(class 4), or is isolated (class 1). Using the segmentation points therefore saves
time and effort. See chapter six for more information.
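The mapping from a pair of second-row values to a position status and class number can be written as a simple lookup table. The names `POSITION_BY_PAIR` and `position_status` are illustrative, and the sample second row is made up for the example.

```python
# (AllSegMat(2,i-1), AllSegMat(2,i)) -> (position, class number)
POSITION_BY_PAIR = {
    (0, 0): ("middle", 3),
    (0, 1): ("beginning", 2),
    (2, 0): ("end", 4),
    (2, 1): ("isolated", 1),
}

def position_status(left_label, right_label):
    """Return (position, class) for a pair of second-row values.

    Returns None for (1, 2), which marks the space between two words.
    """
    return POSITION_BY_PAIR.get((left_label, right_label))

labels = [2, 0, 0, 1, 2, 1]     # row 2 of a small, made-up AllSegMat
statuses = [position_status(a, b) for a, b in zip(labels, labels[1:])]
```

Read from left to right, this row describes a three-character word (end, middle, beginning, since Arabic runs right to left), a gap, and an isolated character.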
5.2.5: Ligature Processing
Arabic characters, like those of other cursive scripts, can be combined or changed in
shape depending on their context. A character’s appearance is affected by its relation
to other characters, the font used to render the character, and the application or
system environment [15]. An Arabic ligature is a compound of two or sometimes three
characters, such as Lam Alef (لا). Unfortunately, ligatures complicate the
segmentation task of any Arabic Optical Character Recognition (OCR) system. In our
system, ligatures are treated as isolated characters.
5.2.6: Algorithm of the Segmentation Stage
Algorithm: Segmentation Stage
Input: Handwritten Arabic page image.
Output: Segmented Arabic characters with their position status for the Recognition
Stage.
Method:
Find the text line boundaries of the input text image
For each text line
begin
    Find the word boundaries of the text line (WordSegMat(2,N))
    Find the character boundaries of each word (CharSegMat(2,M))
    Combine the word boundaries with the character boundaries in one matrix:
        AllSegMat = WordSegMat(2,N) ∪ CharSegMat(2,M)
    For i = length(AllSegMat) down to 2
    begin
        Crop the character:
            Ch = crop(text line, AllSegMat(1,i-1) : AllSegMat(1,i))
        Determine the position of the character:
            Position = AllSegMat(2,i-1) : AllSegMat(2,i)
        Recognition stage (Ch, Position)
    End i
End for text line
End Segmentation Algorithm
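The inner loop of the algorithm can be sketched in Python as follows. This is an illustration only: the two matrices are represented as pairs of lists with 0-based columns, the function names are hypothetical, the refinement of AllSegMat is not reproduced, and skipping the (1:2) pairs (gaps between words) is an assumption about how spaces are kept out of the recognition stage.

```python
def merge_seg_mats(word_seg_mat, char_seg_mat):
    """Union of the two matrices, arranged in ascending column order."""
    points = list(zip(*word_seg_mat)) + list(zip(*char_seg_mat))
    points.sort()
    return [[c for c, _ in points], [v for _, v in points]]

def iter_characters(all_seg_mat):
    """Yield (left_col, right_col, (label_left, label_right)) right to left."""
    columns, labels = all_seg_mat
    for i in range(len(columns) - 1, 0, -1):
        pair = (labels[i - 1], labels[i])
        if pair == (1, 2):
            continue                      # space between two words
        yield columns[i - 1], columns[i], pair

word_seg_mat = [[0, 8, 9, 14], [2, 1, 2, 1]]   # made-up word boundaries
char_seg_mat = [[3, 6], [0, 0]]                # made-up character boundaries
all_seg_mat = merge_seg_mats(word_seg_mat, char_seg_mat)
characters = list(iter_characters(all_seg_mat))
```

Each yielded tuple gives the column range to crop from the text line and the pair of values that determines the character's position status.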
5.2.7: Segmentation Drawback
A common drawback of our segmentation algorithm is confusion between the primary
character (joining group) of the characters Seen and Sheen and the primary character
(joining group) of the characters Ya, Ba, Ta, and Tha’a in classes 2, 3, and 4,
because the former has three peaks whereas the latter has only one. Figure 5.13
shows an example of this common segmentation error: the first character of the
handwritten word (Smer, سمير), the character Seen (س), was
segmented wrongly into three instances of the primary character (joining group) of
the characters Ya, Ba, Ta, and Tha’a.
Figure 5.13: Incorrect segmentation of the character “Seen” (original word and its
segments)
5.2.8: General Framework of the Segmentation Stage:
Figure 5.14: General Framework of the Segmentation Stage
5.3: Conclusion
A serious effort has been made to segment Arabic cursive words, but the
segmentation problem still exists. Segmenting Arabic ligatures remains a challenging
task in machine-printed as well as handwritten Arabic text. This chapter has presented
an overview of the pre-processing and segmentation stages for off-line handwritten
Arabic text. It discussed pre-processing methods such as noise reduction,
normalization, thresholding, and thinning. It also discussed the segmentation stage,
covering page, line, and word segmentation. Projection profiles, the bounding box,
and the baseline are also used in the design of the segmentation algorithm for
Arabic text. The drawback of the proposed segmentation algorithm was also described.
5.4: References
[1] T. Kanungo, R. Haralick and I. Phillips, “Nonlinear Local and Global
Document Degradation Models”, Int. Journal of Imaging Systems and
Technologies, vol. 5, no. 4, pp. 274-282, 1994.
[2] R. C. Gonzalez, R.E. Woods, S. L. Eddins, Digital image processing using
Matlab, India: Pearson Education, 2004.
[3] Myer Blumenstein, “Intelligent Techniques for Handwriting Recognition”,
Ph.D Thesis, School of Information Technology- Griffith University-Gold
Coast Campus, December 2000.
[4] M. K. Brown and S. Ganapathy, “Pre-processing Techniques for Cursive
Script Word Recognition”, Pattern Recognition, vol. 16, no.5, pp. 447-458,
1983.
[5] B. Al-Badr and S. A. Mahmoud, “Survey and bibliography of Arabic optical
text recognition”, Signal Processing, vol. 41, pp. 49-77, 1995.
[6] H. Mahdi, “Thinning and transforming the segmented Arabic characters into
meshes of unified small size”, Proc. 14th Internal. Conf on Statistics,
Computer Science and Demographic Research, Cairo, Egypt, pp. 263-216,
March 1989.
[7] A. Nurul-Ula and A. Nouh, “Automatic recognition of Arabic characters using
logic statements. Part I. System description and pre-processing”, Journal of
Engrg. Sci., King Saud Univ., vol. 14, no. 2, pp. 343-352, 1988.
[8] Linda G. Shapiro and George C. Stockman, Computer Vision, Prentice Hall,
ISBN 0-13-030796-3, 2001.
[9] Mehmet Sezgin and Bulent Sankur, “Survey over image thresholding
techniques and quantitative performance evaluation”, Journal of Electronic
Imaging, vol. 13, no. 1, pp. 146-165, January 2004.
[10] Nobuyuki Otsu, "A threshold selection method from gray-level histograms".
IEEE Trans. Sys., Man., Cyber. vol. 9, no. 1, pp. 62-66, 1979.
[11] Louisa Lam, Seong-Whan Lee and Ching Y. Suen, “Thinning
Methodologies: A Comprehensive Survey”, IEEE Transaction on Pattern
Analysis and Machine Intelligence, vol. 14, no. 9, Sept.1992.
[12] R. G. Casey and E. Lecolinet, “A survey of methods and strategies in
character segmentation”, IEEE Trans. Patt. Anal. Mac. Intel., vol. 18, no. 7,
pp. 690-706, 1996.
[13] S. Chanda and U. Pal, “English, Devnagari and Urdu Text Identification”,
Proceedings of the International Conference on Cognition and Recognition,
pp. 538-545, 2005.
[14] A. Amin, “Off-line Arabic character recognition: the state of the art”,
Pattern Recognition, vol. 31, pp. 517-530, 1998.
[15] R. J. Ramteke, “Recognition of Handwritten and Typed Based Document in
Marathi”, Ph.D Thesis, BAMU University: India, 2006.