
Chapter 5: Pre-processing and Segmentation Stages of Handwritten Arabic Text


CHAPTER 5

Pre-processing and Segmentation Stages of Off-line Handwritten Arabic Text

This chapter discusses the pre-processing and segmentation stages proposed and implemented in this thesis. Pre-processing techniques are applied to Arabic text to produce standardized data for the segmentation and recognition stages of the proposed system. A new segmentation algorithm for off-line handwritten Arabic text is then presented. Conclusions and references are given at the end of the chapter.

5.1: Preprocessing Stage

The pre-processing stage is the first step in any OCR system. It produces data on which the later stages of the system can operate accurately. The following techniques are implemented in the pre-processing stage to improve the performance of the segmentation and recognition stages:

5.1.1: Noise Reduction

Filtering techniques such as the median filter and morphological operations, together with noise modelling techniques such as skew [1], are applied to off-line handwritten Arabic text to reduce noise.

5.1.1.1: Median Filter:

The median filter is a nonlinear digital filtering technique that is often used to remove noise, in particular salt-and-pepper noise, from an image [2]. It is widely used in digital image processing. The technique replaces each pixel by the median of all pixels in its neighbourhood:

$$ \hat{f}(x,y) = \operatorname{median}_{(s,t) \in S_{xy}} \{ g(s,t) \} $$

where $g$ is the input noisy image and $S_{xy}$ denotes a sub-image of $g$. The sub-image is centred at coordinates $(x,y)$, and $\hat{f}(x,y)$ represents the filter output at $(x,y)$.
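As an illustration, the median filter can be sketched in a few lines of Python. This is a minimal sketch and not the implementation used in the thesis: the function name `median_filter`, the 3×3 default window, and the clipping of windows at the image border are all assumptions made for the example.

```python
from statistics import median_low

def median_filter(g, size=3):
    """Replace each pixel of g by the median of its size x size neighbourhood.

    g is a list of rows of grey levels. Windows that reach past the image
    border are clipped to the image (an assumption of this sketch).
    """
    rows, cols = len(g), len(g[0])
    half = size // 2
    out = [[0] * cols for _ in range(rows)]
    for x in range(rows):
        for y in range(cols):
            # Collect the pixels of the sub-image centred at (x, y).
            window = [
                g[s][t]
                for s in range(max(0, x - half), min(rows, x + half + 1))
                for t in range(max(0, y - half), min(cols, y + half + 1))
            ]
            # median_low always returns a value that occurs in the window.
            out[x][y] = median_low(window)
    return out

noisy = [
    [10, 10, 10],
    [10, 255, 10],   # a single "salt" pixel
    [10, 10, 10],
]
cleaned = median_filter(noisy)
```

In this small example the isolated bright pixel is replaced by the grey level of its neighbours, which is the behaviour that makes the filter suitable for salt-and-pepper noise.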

5.1.1.2: Morphological Operations:

Mathematical morphology can be defined as a tool for extracting image components, such as boundaries and skeletons, that are useful in the segmentation and description of region shape. Morphological operations such as morphological filtering, thinning, and pruning play an important role in pre-processing.

5.1.2: Normalization

Normalization is considered the most important step in the pre-processing stage. It removes some variations in text images without affecting the identity of the text. The basic normalization methods are as follows:

5.1.2.1: Skew Normalization

Deskewing is the process of first detecting whether a handwritten word or text line has been written on a slope, and then, if the slope angle is too high, rotating the word or text line so that its baseline is horizontal [3]. Brown and Ganapathy [4] illustrate some techniques for correcting slope. Figure 5.1(a) shows the vertical histogram of the sloped text line in (b), and (c) shows the result of slope correction for that text line.



Figure 5.1: (a) The vertical histogram of the sloped text line in (b). (c) The result of the slope correction process applied to (b).

The slope correction process is carried out as follows:

1. Find the bounding box of the word image.

2. Construct the vertical histogram of the word image.

3. Divide the vertical histogram vertically into two equal parts.

4. Count the number of pixels in each part of the vertical histogram.

5. Find the absolute difference between the pixel counts of the two parts.

6. If the absolute difference is small, there is no slope. Otherwise, if the left part contains more white pixels than the right part, the slope angle is negative, i.e. the image should be rotated clockwise around its centre point; if not, the slope angle is positive, i.e. the image should be rotated counter-clockwise around its centre point.

7. Find the slope angle.

8. Rotate the word image according to the slope angle and direction.
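Steps 1 to 6 of the procedure above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function name `slope_direction` and the `tolerance` threshold for a "small" difference are hypothetical, the two parts are taken to be the left and right halves of the bounding box, and the image is assumed to contain at least one black pixel. Steps 7 and 8 are not covered, because the text does not state how the angle is computed.

```python
def slope_direction(word, tolerance=2):
    """Return 'none', 'clockwise' or 'counter-clockwise' for a binary word image.

    word is a list of rows with 1 = black (ink) and 0 = white.
    tolerance is a hypothetical threshold for a "small" difference.
    """
    # Step 1: bounding box of the ink.
    ink_rows = [r for r, row in enumerate(word) if any(row)]
    ink_cols = [c for c in range(len(word[0])) if any(row[c] for row in word)]
    top, bottom = ink_rows[0], ink_rows[-1]
    left, right = ink_cols[0], ink_cols[-1]
    height = bottom - top + 1

    # Step 2: vertical histogram (black pixels per column) inside the box.
    hist = [sum(word[r][c] for r in range(top, bottom + 1))
            for c in range(left, right + 1)]

    # Steps 3-4: split into two equal halves and count the white pixels.
    # For an odd width the middle column belongs to neither half.
    mid = len(hist) // 2
    left_half, right_half = hist[:mid], hist[len(hist) - mid:]
    white_left = height * len(left_half) - sum(left_half)
    white_right = height * len(right_half) - sum(right_half)

    # Steps 5-6: compare the two halves.
    if abs(white_left - white_right) <= tolerance:
        return "none"
    return "clockwise" if white_left > white_right else "counter-clockwise"

sloped = [
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [1, 1, 1, 1, 1, 1],
]
direction = slope_direction(sloped)
```

Here the left half of the bounding box holds more white pixels than the right half, so the sketch reports that a clockwise rotation is needed, following the rule in step 6.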


5.1.2.2: Baseline Detection

The baseline is the middle line of an Arabic word along which successive characters are connected to each other. It corresponds to the row containing the largest number of foreground (black) pixels in the area of a text line or a word. The baseline is defined as follows:

$$ \text{Baseline} = \arg\max_{i} \sum_{j} W(i,j) $$

where $W(i,j)$ is a binary image representing either a text line or a word, with foreground pixels equal to 1, and $(i,j)$ index the rows and columns respectively.

Figure 5.2: The word image with its baseline

The baseline in Arabic text contains valuable information about the orientation of the

text and the location of connection points between characters [5].
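Baseline detection by horizontal projection can be sketched as below. The sketch assumes a binary image stored as a list of rows with 1 for black foreground pixels and 0 for white background; the function name `detect_baseline` is hypothetical, and ties between rows are resolved in favour of the topmost row.

```python
def detect_baseline(W):
    """Return the index of the row that contains the most foreground pixels.

    W is a list of rows with 1 = black foreground and 0 = white background.
    """
    profile = [sum(row) for row in W]   # horizontal projection profile
    return max(range(len(profile)), key=profile.__getitem__)

word = [
    [0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],   # the connecting stroke of the word
    [0, 0, 0, 1, 0, 0],
]
baseline_row = detect_baseline(word)
```

For this small example the projection profile is [1, 2, 6, 1], so the third row (index 2) is reported as the baseline.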

5.1.2.3: Size Normalization

Before the features of a handwritten character separated from a word are extracted, the original character image can be normalized and encoded into a standard form so that different images of the same character are encoded similarly. Since the sizes of Arabic characters vary greatly, size normalization is often used to scale characters to a fixed size and to centre the character before recognition [6][7]. Size normalization is an important pre-processing technique in any OCR system.

In the size normalization process implemented in this thesis, if a separated character (an output of the segmentation stage) is a single component, it is normalized to 48×48 pixels. Otherwise, the primary component and the secondary component are cropped from the segmented character and normalized to 48×48 and 22×22 pixels respectively.

The process works as follows for each separated primary component or single-component character. The image is first cropped automatically. For example, the original 40×40 image of the Arabic character ‘Ha’a’ shown in figure 5.3(a) is cropped by removing the surrounding white space, which gives the 30×30 image shown in figure 5.3(b). The cropped image is then resized to the proposed size of 40×40 using bilinear interpolation, as shown in figure 5.3(c). Finally, the 40×40 image is placed at the centre of a new white image of size 48×48, as shown in figure 5.3(d). The size 48×48 is selected because the images in the IHACDB database used in this research are of the same size.

The same size normalization technique is also applied to secondary components, with a proposed size of 18×18 pixels and a normalized size of 22×22 pixels.
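The crop, resize, and centre sequence described above can be sketched in plain Python. This is a sketch under assumptions, not the thesis code: the helper names (`crop_to_ink`, `resize_bilinear`, `normalize_size`) are hypothetical, images are grey-level lists of rows with 255 as white, the bilinear mapping aligns the corner pixels of source and target, and the aspect ratio of the cropped character is not preserved.

```python
WHITE = 255

def crop_to_ink(img):
    """Remove the all-white rows and columns that surround the character."""
    rows = [r for r, row in enumerate(img) if any(p != WHITE for p in row)]
    cols = [c for c in range(len(img[0])) if any(row[c] != WHITE for row in img)]
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]

def resize_bilinear(img, new_h, new_w):
    """Resize img to new_h x new_w with bilinear interpolation."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(new_h):
        # Map the target row back to a fractional source row.
        y = i * (h - 1) / (new_h - 1) if new_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        dy = y - y0
        row = []
        for j in range(new_w):
            x = j * (w - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            dx = x - x0
            # Interpolate along the columns, then along the rows.
            top = img[y0][x0] * (1 - dx) + img[y0][x1] * dx
            bottom = img[y1][x0] * (1 - dx) + img[y1][x1] * dx
            row.append(round(top * (1 - dy) + bottom * dy))
        out.append(row)
    return out

def normalize_size(img, inner=40, outer=48):
    """Crop, resize to inner x inner, and centre on a white outer x outer image."""
    resized = resize_bilinear(crop_to_ink(img), inner, inner)
    pad = (outer - inner) // 2
    canvas = [[WHITE] * outer for _ in range(outer)]
    for r, row in enumerate(resized):
        canvas[pad + r][pad:pad + inner] = row
    return canvas

# A 6x6 white image with a 2x2 black blob, normalized to 4x4 inside 6x6.
small = [[WHITE] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (3, 4):
        small[r][c] = 0
normalized = normalize_size(small, inner=4, outer=6)
```

With the sizes given in the text, a primary component would be passed through `normalize_size(img, inner=40, outer=48)` and a secondary component through `normalize_size(img, inner=18, outer=22)`.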

5.1.3: Compression in the amount of information to be retained

Thresholding and thinning are two popular techniques for reducing the amount of information to be retained:

5.1.3.1: Thresholding

Thresholding can be used to create a binary image from a grayscale image [8]. Sezgin and Sankur [9] present a categorized survey of image thresholding methods. Our system uses Otsu’s method for thresholding images [10].
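Otsu's method selects the grey level that maximizes the between-class variance of the two resulting pixel classes. The following is a minimal sketch of that idea, assuming 8-bit grey levels and treating pixels at or below the threshold as black text; the function names `otsu_threshold` and `binarize` are hypothetical and this is not the code used in the thesis.

```python
def otsu_threshold(pixels):
    """Return the grey level t that maximizes the between-class variance.

    pixels is a flat iterable of integers in 0..255. Pixels <= t form the
    dark class and pixels > t form the bright class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # sum of the grey levels in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue   # one class is empty, so the variance is undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t

def binarize(img):
    """Return a binary image with 1 for dark (text) pixels and 0 otherwise."""
    t = otsu_threshold(p for row in img for p in row)
    return [[1 if p <= t else 0 for p in row] for row in img]

grey = [
    [20, 30, 200],
    [25, 210, 220],
]
binary = binarize(grey)
```

In the example the dark pixels (20, 25, 30) and the bright pixels (200, 210, 220) are separated cleanly, because any threshold between the two groups gives the same, maximal between-class variance.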

Figure 5.3: Size normalization of the Arabic character Ha’a: (a) 40×40 pad, (b) 30×30 cropped image, (c) 40×40 enlarged image, (d) 48×48 normalized image



5.1.3.2: Thinning

Thinning is a morphological operation that reduces the strokes of a binary image to a width of a single pixel, so that the binarized image has a uniform stroke width of one pixel. Numerous thinning algorithms have been devised and applied to a great variety of patterns for different purposes. In recent years, "thinning" and "skeletonization" have become almost synonymous in the literature, and the term "skeleton" is used to refer to the result regardless of the shape of the original pattern or the method employed [11].

Figure 5.4: An Arabic sentence image before and after thinning

5.2: Segmentation Stage

Character segmentation has long been a critical area of the OCR process [12]. This thesis presents a new algorithm for segmenting off-line handwritten Arabic text. The algorithm divides the segmentation stage into three levels: segmenting a text page into lines, finding the word boundaries of each text line, and segmenting each word into characters. For each text line, the algorithm produces a novel matrix consisting of segmentation points and their position status. The output of the segmentation stage is the segmented characters with their position status, which are fed character by character to the recognition stage. The segmentation algorithm has been tested and has given a high correct segmentation rate.


5.2.1: Line Segmentation

Arabic script is written from right to left and from top to bottom. A text line is separated from the previous and following text lines by white space. In poor handwriting, text lines may overlap.

Figure 5.5: Horizontal Projection Profile of an Arabic paragraph and its Line Segmentations

To segment text lines, the horizontal projection profile of the Arabic document is constructed by counting the number of black pixels in each row. A zero value in a projection profile, whether horizontal or vertical, corresponds to white space (a gap). Arabic text lines are separated by these horizontal gaps: the gap between two consecutive peaks in the horizontal profile marks the boundary between two text lines, and a text line lies between two consecutive boundary lines. Figure 5.5 shows an Arabic paragraph with its horizontal projection profile. In this figure, dotted horizontal lines represent the line segmentations.
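Line segmentation from the horizontal projection profile can be sketched as follows. The sketch assumes a binary page stored as a list of rows with 1 for black pixels, and it assumes that lines do not overlap, so that every pair of lines is separated by at least one empty row; the function name `segment_lines` is hypothetical.

```python
def segment_lines(page):
    """Return the (top, bottom) row indices, inclusive, of each text line.

    page is a list of rows with 1 = black and 0 = white.
    """
    profile = [sum(row) for row in page]   # black pixels per row
    lines, start = [], None
    for r, count in enumerate(profile):
        if count > 0 and start is None:
            start = r                      # first inked row of a new line
        elif count == 0 and start is not None:
            lines.append((start, r - 1))   # a gap closes the current line
            start = None
    if start is not None:                  # the page ends inside a line
        lines.append((start, len(profile) - 1))
    return lines

page = [
    [0, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [0, 1, 1],
    [0, 0, 0],
]
line_bounds = segment_lines(page)
```

For this page the sketch finds two lines, one covering rows 1 to 2 and one covering row 4, with the empty rows between them acting as the boundaries.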

5.2.2: Segmentation of a text line into words

In Arabic script, there is no connection between separate words, so word boundaries are always represented by a space. Letters are connected at the same relative height. Arabic words are separated from each other by a sufficiently wide space. To segment a text line into its words, the vertical projection profile of the text line is constructed by counting the number of black pixels in each column. For word segmentation, a text line is scanned vertically [13]. The white space to the right and left of the text line must be removed.

In the word segmentation phase proposed and implemented in this thesis, the left-most and right-most columns of every gap (a run of consecutive columns containing no black pixels) in the vertical histogram are taken as segmentation points. These segmentation points are saved in the first row of a two-row matrix called WordSegMat. The values 1 and 2 respectively are saved in the second row; the pair (1:2) denotes that the two segmentation points lie between two words or sub-words. The indexes of the first and last columns of the text line are also added to WordSegMat, with the values 2 and 1 respectively in the second row. WordSegMat is then arranged in ascending order. Figure 5.6 shows an Arabic text line with its vertical projection profile. In this figure, dotted vertical lines represent word boundaries.
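The construction of WordSegMat can be sketched as below. The sketch assumes that the vertical histogram has already been trimmed, so that its first and last columns contain ink, and it uses 0-based column indexes; the function name `word_seg_mat` is hypothetical, while the codes 1 and 2 follow the description above.

```python
def word_seg_mat(hist):
    """Build a two-row matrix of word segmentation points and their codes.

    hist holds the number of black pixels in each column of a text line
    whose left and right white margins have already been removed.
    """
    # The first and last columns of the text line get the codes 2 and 1.
    points = [(0, 2), (len(hist) - 1, 1)]
    c = 0
    while c < len(hist):
        if hist[c] == 0:
            start = c
            while c < len(hist) and hist[c] == 0:
                c += 1
            points.append((start, 1))    # left-most column of the gap
            points.append((c - 1, 2))    # right-most column of the gap
        else:
            c += 1
    points.sort()                        # ascending order of column index
    return [[col for col, _ in points], [code for _, code in points]]

hist = [3, 2, 0, 0, 4, 1]
word_mat = word_seg_mat(hist)
```

For the histogram above the result is the column row [0, 2, 3, 5] with the code row [2, 1, 2, 1]: two words, each delimited by the pair (2:1), separated by a gap marked (1:2).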

Figure 5.6: Word boundary identification in an Arabic text line with the help of vertical Projection



5.2.3: Segmentation of Cursive Arabic Words

Arabic script is cursive. Characters within a word are normally joined, even in machine print. Arabic words consist of one or more sub-words. For example, the word (جامعة, University) consists of two sub-words: the first consists of two characters and the second of three characters. Any error in segmenting the basic shape of Arabic characters produces a different representation of the character component. Two techniques have been applied for segmenting machine-printed and handwritten Arabic words into individual characters: implicit and explicit segmentation [14]. The segmentation method of this research belongs to implicit segmentation, which segments words directly into letters.

In the cursive word segmentation phase proposed and implemented in this thesis, a cursive word is cropped from the text line for each pair of consecutive columns of WordSegMat. To explain the method, suppose that the word image in figure 5.7 is its input. For each cropped word, the following steps are carried out in sequence to determine the segmentation points of the word:

Figure 5.7: An example of a word image

1. Bounding box:

In this step, the exact word image is cropped from the whole image by removing the surrounding empty rows and columns. Noise removal is also applied to the cropped image. The clean image is then thinned to be ready for the next step, as shown in figure 5.8.

Figure 5.8: Cleaned and Thinned image



2. Baseline detection of the word:

Horizontal projection is a simple and common method for detecting the baseline. Figure 5.9 shows the input word along with its baseline.

Figure 5.9: Thinned word image with its horizontal histogram and baseline

3. Strokes (Diacritical marks and Dots) Detection and Removal

After the baseline is found, all isolated characters, strokes, and dots that lie outside the baseline are removed, as shown in figure 5.10.

Figure 5.10: The word without strokes

4. Determination of Segmentation points

Here, the vertical histogram of the Arabic word is constructed. Since Arabic is written from right to left, the vertical histogram is scanned in the same direction to find segmentation points. If more than four successive columns of the vertical histogram each contain exactly one pixel, the left-most column of that run is taken as a segmentation point. These segmentation points are saved in the first row of a two-row matrix called CharSegMat. Only the value zero is saved in the second row; zero denotes that every segmentation point lies between two cursive (connected) characters. Figure 5.11 shows the input word with its vertical histogram. The dotted lines in the figure represent the segmentation points.

Figure 5.11: Vertical histogram for the thinned input image with segmentation points
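The rule used in step 4 can be sketched as follows. The sketch assumes that the vertical histogram comes from a thinned word image with its strokes and dots already removed, and it uses 0-based column indexes; the function name `char_seg_mat` and the parameter `min_run` (set to 5 to express "greater than four") are hypothetical.

```python
def char_seg_mat(hist, min_run=5):
    """Build a two-row matrix of character segmentation points.

    hist holds the number of black pixels in each column of a thinned word
    image. A run of at least min_run successive columns that each contain
    exactly one pixel yields one segmentation point at its left-most column.
    """
    cols = []
    c = len(hist) - 1
    while c >= 0:                          # scan from right to left
        if hist[c] == 1:
            right = c
            while c >= 0 and hist[c] == 1:
                c -= 1
            left = c + 1                   # left-most column of the run
            if right - left + 1 >= min_run:
                cols.append(left)
        else:
            c -= 1
    cols.sort()
    # Every point lies between two connected characters, hence the code 0.
    return [cols, [0] * len(cols)]

hist = [2, 1, 1, 1, 1, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2]
char_mat = char_seg_mat(hist)
```

In this histogram the runs of single-pixel columns start at columns 1, 7, and 10; only the first and the last are longer than four columns, so the segmentation points are columns 1 and 10.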

5.2.4: The output of the segmentation stage

The output of the segmentation phase of each text line is a matrix called WordSegMat. The output of the word segmentation phase is another matrix called CharSegMat. WordSegMat is combined with CharSegMat to form a new matrix, AllSegMat, as shown in equation 5.3.

AllSegMat = WordSegMat(2,N) ∪ CharSegMat(2,M)    (5.3)

N and M are the numbers of columns in WordSegMat and CharSegMat respectively. AllSegMat contains all segmentation points of a given text line or word. It is arranged in ascending order and then refined to produce accurate segmentation points. Figure 5.12 shows an Arabic text line with its segmentation points, which are represented as solid vertical lines. The numbers below the solid lines are row 2 of AllSegMat.



Figure 5.12: The segmentation points of a text line

The boundaries of a character are given by columns i-1 and i of the first row of the matrix, AllSegMat(1,i-1) and AllSegMat(1,i). The position of the character within a word (beginning, middle, end, or isolated) is determined by the corresponding entries of the second row, AllSegMat(2,i-1) : AllSegMat(2,i). Five combinations are formed as follows:

(0:0) Middle position

(0:1) Beginning position

(2:0) End position

(2:1) Isolated character

(1:2) Space between two words

Finally, each segmented character with its position status is fed to the recognition stage.

The segmentation points determine whether the segmented character is isolated (class 1) or belongs to the beginning of a word (class 2), the middle of a word (class 3), or the end of a word (class 4). Using the segmentation points in this way therefore saves time and effort. See chapter six for more information.
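The merging of the two matrices and the lookup of the position status can be sketched as below. The sketch makes several assumptions: both matrices are two-row lists whose column indexes already refer to the same text line, the refinement step mentioned above is not reproduced, and the names `all_seg_mat`, `character_positions`, and `POSITIONS` are hypothetical.

```python
# Position status for each pair of codes (left boundary : right boundary).
POSITIONS = {
    (0, 0): "middle",
    (0, 1): "beginning",
    (2, 0): "end",
    (2, 1): "isolated",
    (1, 2): "space",
}

def all_seg_mat(word_mat, char_mat):
    """Merge two two-row matrices and sort them by column index."""
    pairs = sorted(zip(word_mat[0] + char_mat[0], word_mat[1] + char_mat[1]))
    return [[col for col, _ in pairs], [code for _, code in pairs]]

def character_positions(all_mat):
    """Return (left, right, status) for every character, right to left."""
    cols, codes = all_mat
    result = []
    for i in range(len(cols) - 1, 0, -1):
        status = POSITIONS.get((codes[i - 1], codes[i]))
        if status is not None and status != "space":
            result.append((cols[i - 1], cols[i], status))
    return result

word_mat = [[0, 20, 24, 40], [2, 1, 2, 1]]   # two words separated by a gap
char_mat = [[8, 14], [0, 0]]                 # two cuts inside the left word
merged = all_seg_mat(word_mat, char_mat)
characters = character_positions(merged)
```

Reading the merged matrix from right to left gives an isolated character between columns 24 and 40, followed by the beginning, middle, and end characters of the word that spans columns 0 to 20.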

5.2.5: Ligature Processing

Arabic characters, like those of other cursive scripts, can be combined or changed in shape depending on their context. A character’s appearance is affected by its relation to other characters, the font used to render the character, and the application or system environment [15]. An Arabic ligature is a compound of two or sometimes three characters, such as Lam Alef (لا). Ligatures unfortunately complicate the segmentation task of any Arabic Optical Character Recognition (OCR) system. In our system, ligatures are treated as isolated characters.



5.2.6: Algorithm of the Segmentation Stage

Algorithm: Segmentation Stage

Input: Handwritten Arabic page image.

Output: Segmented Arabic characters with their position status for the recognition stage.

Method:

Find the text line boundaries of the input text image
For each text line
begin
    Find the word boundaries of the text line (WordSegMat(2,N))
    Find the character boundaries of each word (CharSegMat(2,M))
    Combine the word boundaries with the character boundaries in one matrix:
        AllSegMat = WordSegMat(2,N) ∪ CharSegMat(2,M)
    For i = length(AllSegMat) down to 2
    begin
        Crop the character:
            Ch = crop(text line, AllSegMat(1,i-1) : AllSegMat(1,i))
        Determine the position of the character:
            Position = AllSegMat(2,i-1) : AllSegMat(2,i)
        Recognition stage (Ch, Position)
    end i
end for text line
End Segmentation Algorithm

5.2.7: Segmentation Drawback

A common drawback of our segmentation algorithm concerns the primary character (joining group) of the characters Seen and Sheen and the primary character (joining group) of the characters Ya, Ba, Ta, and Tha’a in classes 2, 3, and 4: the former has three peaks, whereas the latter has one peak. Figure 5.13 shows an example of this common segmentation error, in which the first character (Seen, س) of the handwritten word (Smer, سمير) was wrongly segmented into three primary characters of the joining group of Ya, Ba, Ta, and Tha’a.

Figure 5.13: Incorrect segmentation of the character “Seen” (panels: original word, segments)



5.2.8: General Framework of the Segmentation Stage:

Figure 5.14: General Framework of the Segmentation Stage



5.3: Conclusion

A serious effort has been made to segment Arabic cursive words, but the segmentation problem still exists. Segmenting Arabic ligatures remains a challenging task in machine-printed as well as handwritten Arabic text. This chapter presented an overview of the pre-processing and segmentation stages for off-line handwritten Arabic text. It discussed pre-processing methods such as noise reduction, normalization, thresholding, and thinning, as well as segmentation at the page, line, and word levels. Projection profiles, the bounding box, and the baseline were also used to design the segmentation algorithm for Arabic text. The drawback of the proposed segmentation algorithm was also described.

5.4: References

[1] T. Kanungo, R. Haralick and I. Phillips, “Nonlinear Local and Global Document Degradation Models”, Int. Journal of Imaging Systems and Technologies, vol. 5, no. 4, pp. 274-282, 1994.

[2] R. C. Gonzalez, R. E. Woods and S. L. Eddins, Digital Image Processing Using Matlab, India: Pearson Education, 2004.

[3] Myer Blumenstein, “Intelligent Techniques for Handwriting Recognition”, Ph.D. Thesis, School of Information Technology, Griffith University, Gold Coast Campus, December 2000.

[4] M. K. Brown and S. Ganapathy, “Pre-processing Techniques for Cursive Script Word Recognition”, Pattern Recognition, vol. 16, no. 5, pp. 447-458, 1983.

[5] B. Al-Badr and S. A. Mahmoud, “Survey and bibliography of Arabic optical text recognition”, Signal Processing, vol. 41, pp. 49-77, 1995.

[6] H. Mahdi, “Thinning and transforming the segmented Arabic characters into meshes of unified small size”, Proc. 14th Internat. Conf. on Statistics, Computer Science and Demographic Research, Cairo, Egypt, pp. 263-216, March 1989.

[7] A. Nurul-Ula and A. Nouh, “Automatic recognition of Arabic characters using logic statements. Part I. System description and pre-processing”, Journal of Engrg. Sci., King Saud Univ., vol. 14, no. 2, pp. 343-352, 1988.

[8] Linda G. Shapiro and George C. Stockman, Computer Vision, Prentice Hall, ISBN 0-13-030796-3, 2001.

[9] Mehmet Sezgin and Bulent Sankur, “Survey over image thresholding techniques and quantitative performance evaluation”, Journal of Electronic Imaging, vol. 13, no. 1, pp. 146-165, January 2004.

[10] Nobuyuki Otsu, “A threshold selection method from gray-level histograms”, IEEE Trans. Sys., Man., Cyber., vol. 9, no. 1, pp. 62-66, 1979.

[11] Louisa Lam, Seong-Whan Lee and Ching Y. Suen, “Thinning Methodologies: A Comprehensive Survey”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 9, Sept. 1992.

[12] R. G. Casey and E. Lecolinet, “A survey of methods and strategies in character segmentation”, IEEE Trans. Patt. Anal. Mac. Intel., vol. 18, no. 7, pp. 690-706, 1996.

[13] S. Chanda and U. Pal, “English, Devnagari and Urdu Text Identification”, Proceedings of the International Conference on Cognition and Recognition, pp. 538-545, 2005.

[14] A. Amin, “Off-line Arabic character recognition: the state of the art”, Pattern Recognition, vol. 31, pp. 517-530, 1998.

[15] R. J. Ramteke, “Recognition of Handwritten and Typed Based Document in Marathi”, Ph.D. Thesis, BAMU University: India, 2006.
