[ieee 2012 6th ieee international conference intelligent systems (is) - sofia, bulgaria...
TRANSCRIPT
Segmentation of Historical Lanna Handwritten
Manuscripts
Sakkayaphop Pravesjit1 and Arit Thammano
2
Computational Intelligence Laboratory
Faculty of Information Technology
King Mongkut’s Institute of Technology Ladkrabang
Bangkok, Thailand [email protected] and
Abstract— Lanna script is an archaic script not
commonly used in today’s world. People trying to read
these archaic Lanna manuscripts have to find some form
of translation help to understand what they said.
Unfortunately, few people nowadays know how to read or
write this language. Therefore, character recognition
system must be put to use in order to translate the Lanna
script to the commonly used script. The poor condition of
the manuscripts and the writing style of the script make
this problem very difficult to solve. The most difficult
cases of the writing style problem are the touching and
overlapping characters. Therefore, the first two stages of
the character recognition process, which are image
preprocessing and segmentation, need to be closely
watched over so that the recognition accuracy is high. In
this paper, two new techniques are proposed. The first
proposed technique emphasizes on converting a grayscale
image to a binary image. In this proposed technique, the
concepts of the multithresholding method and Otsu’s
method are combined together. The second proposed
technique emphasizes on the process of touching
character segmentation. In doing this, the bounding box
analysis is initially employed to segment the document
image into images of isolated characters and images of
touching characters. The thinning algorithm is applied to
extract the skeleton of the touching characters. Next, by
using the junction points as the separation points, the
skeleton of the touching characters is separated into
several pieces. Finally, the separated pieces of the
touching characters are put back to reconstruct two
isolated characters. The proposed algorithm achieves an
accuracy of 86.67%.
Keywords- character segmentation; touching character;
dissection method
I. INTRODUCTION
Lanna language was used in the 13th to 18
th centuries
in the Kingdom of Lanna. However, after the kingdom had been annexed by Siam (as Thailand was called until 1939) in 1774, Lanna script became obsolete and was replaced with Thai script. Few people nowadays know how to read or write this language. Lanna manuscripts were typically written about the people’s ways of life, believes, laws, folklore, herbal medicine ingredients,
history, astrological knowledge, and other general knowledge. Lanna manuscripts were generally inscribed on palm leaves (Fig. 1), or on the surface of stones. As time goes by, these ancient documents have been decayed, damaged, destroyed, or lost. Termites, insects, and general decay have left these old palm leaf manuscripts in poor condition (Fig. 2). In order to preserve valuable historical information inscribed on these documents, computerized systems must be put to use in order to translate the inscribed script into the current Thai script.
Figure 1. Palm leaf manuscripts.
Figure 2. Example of damaged palm leaf manuscripts.
Touching and overlapping of characters is the first problem encountered when attempting to recognize the handwritten Lanna characters. Touching of characters can emerge when two or more adjacent characters are written too close; therefore, some parts of characters are connected [1]. There are many kinds of touching characters commonly found in the written documents (Fig. 3). This is because characters of each language have different styles and characteristics. Therefore, the types of touching characters vary from language to language, which in turn require different methods for segmenting the touching characters in each language. For example, the handwritten cursive characters shown in Fig. 3(a) are a type of touching characters typically found in English handwritten manuscripts but not in Lanna handwritten manuscripts. Only the types shown in Fig. 3(b), Fig. 3(c), and Fig. 3(d) can be found in Lanna handwritten manuscripts. The purpose of this research is to separate the touching or overlapping Lanna characters, which have not been effectively solved using any other character segmentation methods.
Character segmentation is a process that seeks to decompose a sequence of characters into individual symbols. There have been substantial researches undertaken to solve character segmentation problem, mostly for numerals, English script, Chinese script, Arabic script, and Bangla script. Segmentation strategies can be divided into three main categories [2]: dissection methods, recognition-based methods, and holistic methods. Dissection methods decompose the image into a sequence of sub-images using general features, e.g., character height and width [3]. Recognition-based methods search the image for components that match classes in its alphabet. Holistic methods seek to recognize entire words as a whole, thus avoiding the need to segment the image into characters. Among the methods proposed for character segmentation, Tseng and Chen [4] proposed a three-stage Chinese character segmentation algorithm. Firstly, a bounding box is created around each stroke of a Chinese character. Secondly, the knowledge-based merging operations are used to merge the stroke bounding boxes together. Finally, a dynamic programming is used to find the optimal segmentation boundaries. The experimental results show that the proposed algorithm is a very effective segmentation algorithm. It works well with touching and/or overlapping characters. Xiao and Leedham [5] proposed a novel approach to English cursive script segmentation. In the proposed approach, connected components are split into sub-components based on their face-up or face-down background regions. Then the over-segmented sub-components are merged into characters according to the knowledge of character structures are their joining characteristics. Bhowmik, Roy, and Roy [6] proposed a segmentation scheme for handwritten Bangla words. The authors use the analysis of directional chaincode and the positional information to extract the features from the image, then employ multilayer perceptron neural network to determine the
segmentation points. The authors also point out that their segmentation result can be significantly improved if their proposed technique is combined with the recognition process in a holistic system.
This study focuses mainly on the dissection methods. Projection analysis, connected component processing, and bounding box analysis are three widely used dissection methods [7]. While projection analysis is very effective for segmenting good quality machine printed manuscripts [2], it has limited success when segmenting handwritten manuscripts. Connected component processing and bounding box analysis usually offer an efficient way to segment handwritten manuscripts. However, they might lead to incorrect segmentation when dealing with touching characters as shown in Fig. 4.
In this paper, the new dissection algorithm is proposed to segment touching Lanna characters. The performance of the proposed algorithm is measured by the accuracy, that is, the percentage of correct segmentations among all the segmentations made by the proposed algorithm.
Following this introduction, section 2 explains the proposed character segmentation algorithm. In section 3, the experimental results are presented and discussed. Finally, section 4 is the conclusion.
Figure 3. Four types of touching characters.
Figure 4. Segmentation results by bounding box analysis.
(a) (b)
(c) (d)
Bounding Box Analysis
Touching Characters
II. THE PROPOSED ALGORITHM
A description of the proposed segmentation algorithm is given below.
a) In the first step, the scanned RGB image (Fig.
5) is converted into its corresponding grayscale image
(Fig. 6). First, select the best quality manuscript image
and use it as a reference for the conversion of the other
manuscript images into the grayscale images. Then the
average intensities of the three fundamental colors of
the reference image (Rref, Gref , and Bref) are computed.
Since the human eye is not equally sensitive to different
colors [8], the average intensity of each color is
weighted according to (1), (2), and (3) to compensate
for the difference in sensitivity.
refg R2989.0R
(1)
refg G5870.0G
refg B1140.0B
Finally, all other manuscript images are transformed into grayscale images by using the following equations:
iii newnewnew BzGyRxGray
1
g
n
newii
R nx
R
n
inew
g
iG
nGy
1
n
inew
g
iB
nBz
1
where n is the total number of pixels in the image.
b) This step converts a grayscale image into a
binary image (Fig. 7). There are many different
approaches to convert a grayscale image to a binary
image. Among these approaches, two of the most
popular approaches are Otsu’s method [9] and
Papamarkos, Strouthopoulos and Andreads’s method
[10]. Otsu’s method uses a global thresholding
approach while Papamarkos, Strouthopoulos and
Andreads’s method uses a multi-thresholding approach.
The new approach in converting a grayscale image into
a binary image proposed in this paper uses both the
concepts of the global thresholding and the multi-
thresholding. Let Tk is the kth
local threshold value,
where k = 1, 2, ..., J-1 and T1 < T2 < T3 < ...< TJ-1. Tt is a
threshold value obtained from Otsu’s method. Tt is used
to separate a set of local threshold values into two
groups: the group which is less than Tt and the group
which is greater than Tt. Determine Tmin, Tmax, and Tnew
by using (8), (9) and (10). Then classify each pixel in
the image into two classes by using (11).
min:
arg min[ ]
k k t
t kT T T
T T T
max:
arg max[ ]
k k t
t kT T T
T T T
0.052
0.052
maxmax
min maxmax
T Tif T T T
new T Tif T T T
tt t
T
t t
1 ,
,0 ,
if f x y Tnewg x yif f x y Tnew
Figure 5. RGB image of the Lanna palm leaf manuscript.
Figure 6. Grayscale image of the Lanna palm leaf manuscript.
Figure 7. Binary image of the Lanna palm leaf manuscript.
Figure 8. Illustration of step b.
c) Use the bounding box analysis to segment the
entire manuscript image into subimages. Since the
bounding box analysis is only effective in segmenting
nontouching characters, the obtained subimages
therefore consist of both images of isolated characters
and images of touching characters. Next, detect
touching characters by looking at the aspect ratio
(width/height) of each subimage. From the facts that 1)
touching characters commonly have an aspect ratio
larger than single isolated characters and 2) the aspect
ratio of Lanna isolated characters is typically smaller
than 3/2, therefore, if the aspect ratio of the subimage is
larger than 3/2, it will be identified as the touching
characters. Then use the thinning algorithm to reduce
the thickness of the character image to its skeleton.
d) Search the touching character image for
junction points (Fig. 9(a)). Then, record all the junction
points found in set S. Next, the following process is
repeated six times: 1) the binary morphology is applied
to the character image, then 2) find the junction points
and record them in set S. Rank the junction points in set
S by the frequency of occurring in descending order.
Finally, all ranked junction points are placed into set O.
e) Use the first-ranked junction point from set O as
the partition point to break up the touching character
into several contours (Fig. 9(b)). For each extracted
contour, translate the contour so that the junction point
coincides with the origin (where x and y axes intersect).
C C Jix ix ixC C Jiy iy iy
T
where T is the translation operator. Cix and Ciy are x and y coordinate of the i
th contour. Jix and Jiy are the x and y
coordinate of the junction point of the ith contour.
f) At the origin, create the reference unit vectors
along the x axis ((1, 0) and (-1, 0)) and along the y axis
((0, 1) and (0, -1)).
Figure 9. Process of finding the junction point for segmentation.
g) For each translated contour, create a vector V1
that starts at the origin and ends one pixel away from
the starting point. Then, determine an angle between the
vector V1 and the x axis in a clockwise direction by
using (13).
n
nn
VU
VU 1cos
where U is a unit vector along the x axis. Vn is the nth
vector along the translated contour i.
Next, sequentially create the vectors V2, V3, …, Vn, …, VN. However this time instead of starting the vector at the origin, start the vector Vn at the point where the vector Vn-1 ends. For example, the starting point of the vector V2 is the ending point of the vector V1. After each vector Vn is created, determine the clockwise angle of the vector Vn relative to the x axis. If the angle of three consecutive vectors is the same, stop the creation of further vectors.
h) Use the linear regression method ((14), (15),
and (16)) to calculate the equation of the best-fit line for
a series of points, starting at the origin and ending at the
ending point of the third consecutive vector as shown in
Fig. 11. Then calculate the angle of the line from the
closest reference unit vector and assign it to represent
the angle of the whole contour.
bxay
2xx
yyxxb
bxya
Tmax
Tmin
Tt
Junction points
(a) All junction points found in
the touching character
(b) Selected junction point
for segmentation
After obtaining the angles of all contours (Fig. 12), compare the angle of each contour to that of other contours. According to the characteristic of Lanna characters whose trajectories typically do not abruptly change the direction, group the closest match, anglewise, together. Finally, the contours within the same group are combined to form each isolated character. Then check whether the following criteria are satisfied: 1) the touching character image is decomposed into two sub-images, and 2) the size of each sub-image is larger than one third of the original image. If both criteria are met, continue to the next touching character image. If not, go back to step E; however this time, the next-ranked junction point in set O is used as the partition point.
Figure 10. Illustration of step g.
Figure 11. The best-fit lines for the contours.
Figure 12. Measure the angles of all contours.
III. EXPERIMENTAL RESULTS
In this study, six Lanna handwritten manuscripts used in testing the performance of the proposed algorithm were obtained from temples in the northern provinces of Thailand. Samples of the tested palm leaf manuscripts are shown in Fig. 13. Each manuscript is written by a different handwriting script. Approximately ten pages from each manuscript were scanned and processed using the proposed segmentation algorithm. A total of 150 touching characters were found; all of them only consist of two characters. Similar to other languages, touching characters of three or more components are very uncommon in Lanna palm leaf manuscripts.
Of the 150 touching character images, 86.67% were correctly segmented with the proposed algorithm. Figs. 14 and 15 show some of correctly and incorrectly segmented Lanna characters.
Palm leaf
manuscript 1
Palm leaf
manuscript 2
Palm leaf
manuscript 3
Palm leaf
manuscript 4
Palm leaf
manuscript 5
Palm leaf
manuscript 6
Figure 13. Samples of the tested Lanna palm leaf manuscripts.
Θ1 Θ2
Θ3
Touching
Characters
Correctly Segmented by the
Proposed Algorithm
Figure 14. Samples of the correctly segmented characters.
Touching
Characters
Incorrectly
Segmented by
the Proposed
Algorithm
Expected
Segmentation
Results
Figure 15. Samples of the incorrectly segmented characters.
IV. CONCLUSIONS
A new segmentation algorithm for the historical Lanna handwritten manuscripts is proposed in this paper. To segment a handwritten document, touching characters is a major problem we have to deal with.
First, the partition point is identified. At the partition point, the touching character is separated into several contours. Then, by using the characteristic of Lanna characters, the separated contours are put back to reconstruct two isolated characters. From the experimental results, it is clear that the proposed algorithm is quite capable of segmenting touching Lanna characters.
REFERENCES
[1] T, Soba, G. Sulong, A. Rehman, “A Survey on Methods and Strategies on Touched Characters Segmentation,” International Journal of Research and Reviews in Computer Science, vol. 1, no. 2, 2010, pp. 103-114.
[2] R. G. Casey and E. Lecolinet, “ A Survey of Methods and Strategies in Character Segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 7,1996, pp. 690-706.
[3] T. V. Hoang, S. Tabbone and N. Pham, “Recognition-based Segmentation of Nom Characters from Body Text Regions of Stele Images Using Area Voronoi Diagram,” in Proceedings of the 13th International Conference on Computer Analysis of Images and Patterns, 2009.
[4] L. Y. Tseng and R. C. Chen, “Segmenting Handwritten Chinese Characters Based on Heuristic Merging of Stroke Bounding Boxes and Dynamic Programming,” Pattern Recognition Letter, vol. 19, 1998, pp. 963-973.
[5] X. Xiao and G. Leedham, “Knowledge-based English Cursive Script Segmentation,” Pattern Recognition Letters, vol. 21, 2000, pp. 945-954.
[6] T. K. Bhowmik , A. Roy and U. Roy, “Character Segmentation for Handwritten Bangla Words Using Artificial Neural Network,” In Proceedings of the International Workshop on Neural Networks and Learning in Document Analysis and Recognition, 2005.
[7] J. L. Chen, C. H. Wu and H. J. Lee, “Chinese Handwritten Character Segmentation in Form Documents,” Document Analysis Systems: Theory and Practice, LNCS 1655, 1998, pp. 348-362.
[8] G. Lenz and T. Moeller, .Net: A Complete Development Cycle, Addison-Wesley, 2004.
[9] N. Otsu, “A Threshold Selection Method from Gray-Level Histogram,” IEEE Trans. Systems Man, and Cybernetics, vol. 9, 1979, pp. 62-66.
[10] N. Papamarkos, C. Strouthopoulos and I. Andreads, “Multithresholding of color and gray-level images through a neural network technique,” Image and Vision Computing 18(2000) 213-222, 2000.