
Detection and Removal of Hand-drawn Underlines in a Document Image Using Approximate Digital Straightness

Sanjoy Pratihar†
Department of Computer Science and Engineering
Indian Institute of Technology Kharagpur, India
[email protected]

Partha Bhowmick
Department of Computer Science and Engineering
Indian Institute of Technology Kharagpur, India
[email protected]

Shamik Sural
School of Information Technology
Indian Institute of Technology Kharagpur, India
[email protected]

Jayanta Mukhopadhyay
Department of Computer Science and Engineering
Indian Institute of Technology Kharagpur, India
[email protected]

ABSTRACT

A novel algorithm for detection and removal of underlines present in a scanned document page is proposed. The underlines treated here are hand-drawn and of various patterns. One of their important features is that, being drawn by hand, they are almost horizontal. To locate these underlines, we detect the edges of their covers as a sequence of approximately straight segments, which are grown horizontally. The novelty of the algorithm lies in the detection of almost straight segments from the boundary edge map of the underline parts. After obtaining the exact cover of the underlines, an effective strategy is applied for underline removal. Experimental results are given to show the efficiency and robustness of the method.

1. INTRODUCTION

Underlines are frequently seen in many documents. The work reported in this paper deals with the problem of detection and removal of hand-drawn underlines in a printed document image for improving OCR performance. The images are captured by a flat-bed scanner. In the simplest case, an underline in a document page does not touch any text part, and is called an untouched underline. Otherwise, the underline is touched by some characters of the text lines, and such underlines are named touched underlines. Underlines may sometimes also possess some curvature, which is usually small.

∗The work is carried out under the DRD project, IIT Kharagpur, sponsored by MCIT (Govt. of India), Ref. 11(19)/2010 - HCC(TDIL), Dt. 28.12.2010.
†Corresponding author

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
DAR ’12, December 16, 2012, Mumbai, India
Copyright 2012 ACM 978-1-4503-1797-9/12/12 ...$15.00.

Not many works related to ours are reported in the literature. In the work reported in [2], a technique for removing underlines is presented, which is based on the height of the bounding boxes of the connected components. It works only for single text lines. Hence, to feed the system a single text line at every step, a text-line segmentation algorithm is needed if we want the system to work on a text paragraph as input. But proper segmentation of lines becomes more difficult when underlines are dense in the paragraph and the script is without headlines, e.g., English. Further, a hand-drawn underline may touch characters of two text lines, one text line above the underline and the other below it. In such cases, after segmentation of the text lines, a part of the underline may be wrongly considered as lying above one text line. As a result, these cases cannot be properly handled by this method, since bottom-edge analysis is carried out to detect the touched underlines. Hence, the problem of text-line segmentation becomes more difficult because of the presence of touched underlines.

In [5], underline removal is accomplished by separating text from overlapping strokes. It is based on the fact that underlines are usually longer than text strokes, such as words. The system first detects the smooth strokes in the page and then identifies probable underlines by measuring the length of each stroke. If the length is greater than a certain percentage of the average word length, the stroke is considered a non-text stroke and removed from the document. As the system works on thinned images, it takes a significant amount of time in preprocessing (thinning) before applying the actual algorithm.

Another approach reported in [6] removes the non-text strokes (underlines etc.) from a binarized image by applying morphological dilation and erosion operators [9]. Such a morphology-based method is also proposed in [3], which uses dynamic selection of the structuring element to avoid excessive erosion. In another work [1], the authors present line removal and restoration of handwritten strokes. There also exist a few works on the detection of staff-lines in musical scores. In [4], detection of staff-lines has been shown using the Hough


transform and mathematical morphology. In [10], the authors present staff-line detection based on subsection projection and a correlation algorithm.

As the features (thickness, horizontal nature, etc.) of headlines are very close to those of hand-drawn underlines, the challenge of the problem lies in designing a general framework for efficient detection and removal of underlines from documents written in scripts with or without headlines. We propose here a novel framework that can work on scripts without headlines (e.g., English, Tamil, Urdu, etc.), as well as on scripts having headlines (e.g., Bangla, Devnagari, etc.). In the presence of headlines, we first apply a method to recognize the region of interest, which is defined as the gap between two consecutive text lines, where the underlines are drawn by hand.

2. DETECTION OF UNDERLINE

The system designed by us works in two steps: the first is underline detection and the next, underline removal. Detection of underlines is normally done on the basis of height and width measurements of connected components. However, for touched underline segments, height and width measurements may not be effective in all cases. In particular, for hand-drawn underlines, the problem is quite severe. Though the overall trend of an underline is always horizontal, it may possess local bends or curves. The problem of underline detection becomes more challenging when we deal with documents written in scripts having headlines, since headlines are quite similar to underlines as far as their features are concerned. We first explain the general approach to solve the problem for scripts without headlines. For scripts with headlines, our system uses certain efficient rules, which are discussed next. The input to our system is a binarized text paragraph. Before explaining the methodology, we present the various parameters that we use in this paper.

2.1 Document Page Information

We assume that the input to our system is a text paragraph, which is binarized and denoted as I. An example is shown in Fig. 1(a). We compute the median width, w, and median height, h, of the connected components from their bounding boxes. To compute the bounding boxes, we use only the boundary edge pixels detected in the phase of boundary edge extraction. For documents with headlines, we estimate the median inter-line gap, g, which remains fixed for the entire paragraph. To estimate g, we compute the average vertical distance between two consecutive peaks of the horizontal projection profile of the given text paragraph, as shown in Fig. 1(b). It may be noted that for pages with headlines, the peaks of the horizontal projection profile lie along the headlines.

For scripts with headlines, for example, Bengali, Devnagari, Assamese, etc., the headlines themselves have features quite similar to those of underlines. The underlines, however, lie in between the text lines, i.e., in the intervening gaps of the text lines. So, using the horizontal projection profile, the headlines are detected. The inter-line gap (white space or containing an underline) between two text lines lies between two headlines, which are separated by the inter-line distance g. The height of this gap is approximately half of g. From the given document page, the inter-line gaps are extracted using the above observations, and then from these gaps we extract the underlines, if any, as shown in Fig. 1(c).
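As an illustration of the gap estimation described above, the following C sketch (not the authors' code) computes the horizontal projection profile of a binarized paragraph and returns the average distance between consecutive profile peaks as an estimate of g. The local-maximum peak picker and its threshold are our own assumptions; the paper does not specify how the peaks are selected.

#include <stdlib.h>

/* Horizontal projection profile and inter-line gap estimate.
 * I is a rows x cols binary image stored row-major, with 1 for
 * object (ink) pixels and 0 for background. */
double estimate_interline_gap(const unsigned char *I, int rows, int cols)
{
    int *profile = calloc(rows, sizeof(int));
    int maxv = 0, prev_peak = -1, gaps = 0;
    double sum = 0.0;

    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < cols; c++)
            profile[r] += I[r * cols + c];
        if (profile[r] > maxv)
            maxv = profile[r];
    }

    /* A row is taken as a profile peak if it is a local maximum above
     * half of the global maximum (an assumed, simple peak picker). */
    for (int r = 1; r + 1 < rows; r++) {
        if (profile[r] >= profile[r - 1] && profile[r] > profile[r + 1]
                && 2 * profile[r] > maxv) {
            if (prev_peak >= 0) {
                sum += r - prev_peak;   /* distance between consecutive peaks */
                gaps++;
            }
            prev_peak = r;
        }
    }
    free(profile);
    return gaps ? sum / gaps : 0.0;     /* estimate of g */
}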

2.2 Boundary Edge Extraction

Our algorithm extracts the boundary edge as a digital curve, C, which is defined as a sequence of points in 8-connectivity. In this 8-connectivity, (i, j) ∈ C and (i′, j′) ∈ C are neighbors of each other if max(|i − i′|, |j − j′|) = 1. The digital curve is represented by chain codes, which are integers in {0, 1, 2, . . . , 7} [7]. The curve C is said to be digitally straight if and only if it comprises at most two chain codes, which differ by ±1 (modulo 8), and for one of these, namely the singular code, the run-length must be 1. This is Property R1 of digital straightness, as explained in [7]. Further, if s and n are the respective singular and non-singular codes of a digital line, then the runs of n can have only two lengths, which are consecutive integers (Property R2 [7]). We use Property R1, verifying straightness only through the non-singular code. Property R2 is not used, so as to allow flexibility in identifying a curved underline as approximately straight. These points are discussed later in detail.
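The Property R1 check can be written compactly. The following C sketch is illustrative only (not the authors' implementation): it accepts a chain-code sequence if at most two codes occur, they differ by ±1 modulo 8, and the less frequent (singular) code never appears in a run longer than 1.

/* Property R1 check on a chain-code sequence code[0..n-1], each code
 * being an integer in {0,...,7}.  Ties in code frequency are broken
 * arbitrarily, an assumption made for this sketch. */
int satisfies_R1(const unsigned char *code, int n)
{
    int count[8] = {0};
    int a = -1, b = -1;

    for (int k = 0; k < n; k++)
        count[code[k] & 7]++;

    for (int c = 0; c < 8; c++) {
        if (!count[c]) continue;
        if (a < 0) a = c;
        else if (b < 0) b = c;
        else return 0;                       /* more than two distinct codes */
    }
    if (a < 0 || b < 0) return 1;            /* empty or single-code sequence */

    int d = (b - a + 8) % 8;
    if (d != 1 && d != 7) return 0;          /* codes must differ by 1 (mod 8) */

    int s = (count[a] <= count[b]) ? a : b;  /* singular (less frequent) code */
    for (int k = 1; k < n; k++)
        if ((code[k] & 7) == s && (code[k - 1] & 7) == s)
            return 0;                        /* singular run longer than 1 */
    return 1;
}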

The boundary edge is extracted from the binarized image using a structuring element E of size 3 × 3 [8]. The central element of E is 1, the four vertical and horizontal neighbors of the central element are all 1, and the four diagonal neighbors are all 0. We construct a temporary image matrix T from I using a morphological operation with the structuring element E, as follows:

    T = I ⊙ E, where
    T(i, j) ← I(i, j) ∧ I(i−1, j) ∧ I(i+1, j) ∧ I(i, j−1) ∧ I(i, j+1).

Here I(i, j) = 1 denotes an object pixel. Finally, the boundary edge map B is given by B = I − T.
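A direct transcription of this operation into C might look as follows. This is a sketch: the image is eroded with the 3 × 3 cross and subtracted from the original; treating image border pixels as background is our assumption, not stated in the paper.

/* Boundary edge extraction of Sec. 2.2: erosion of the binary image I
 * with the 3x3 cross structuring element, then subtraction (B = I - T). */
void boundary_edge(const unsigned char *I, unsigned char *B,
                   int rows, int cols)
{
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            unsigned char t = 0;
            if (i > 0 && i < rows - 1 && j > 0 && j < cols - 1)
                t = I[i * cols + j]
                    & I[(i - 1) * cols + j] & I[(i + 1) * cols + j]
                    & I[i * cols + (j - 1)] & I[i * cols + (j + 1)];
            /* Object pixels that survive the erosion are interior; the
             * remaining object pixels form the boundary edge map B. */
            B[i * cols + j] = I[i * cols + j] - t;
        }
    }
}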

2.3 Detecting the Underline Covers

During the extraction of its boundary edge, a component is selected for further processing if it is sufficiently long and hence expected to be an underline. The decision is taken on the width of the bounding box of the component: if it is greater than four times the median word width, i.e., 4·w, then the component is considered for the underline test. This estimation is seen to be robust enough to detect broken underlines as well. The underline may be touched or untouched by one or more characters from the text lines lying above or below it. Hence, our strategy is to detect a set of straight segments which collectively cover only the underline as much as possible, but not the characters touching the underline. For a long component suspected to be an underline, its cover is represented as a sequence of (approximately) straight segments, recognized by certain chain-code properties of digital straightness [7]. As mentioned in Sec. 2.2, the singular and non-singular elements of the chain-code sequence play an important role in recognizing straight pieces. An important observation is that, as an underline is drawn horizontally, the non-singular element in the sequence of chain codes describing its cover is either 0 or 4, since a run of 0 or 4 represents a horizontally straight edge. Endpoints of these straight edges play an important role in analyzing the shape of the underline and its points of contact/interference with the adjacent text lines. With this idea we extract the cover of the underlines. We have defined an edge of the underline cover to be (approximately) straight, as follows.

Figure 1: Extraction of inter-line gaps: (a) Input image. (b) Horizontal projection profile. (c) Extracted inter-line gaps.

Straight segment.
A sequence S of points from the underline cover is a straight segment if it contains a run of length at least 2 for the chain code 0 or 4, and |xmax − xmin| ≤ 2, where xmin and xmax denote the minimum and maximum x-coordinates over all the points in S.
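The definition above can be checked directly on a traced point sequence. The following C sketch is illustrative only (it follows the paper's convention that x is the row coordinate and y the column); restarting the run counter on a direction reversal is our own simplification.

/* A point of the boundary curve: x is the row (vertical) coordinate,
 * y the column. */
typedef struct { int x, y; } Point;

/* Returns 1 if the point sequence S[0..n-1] satisfies the
 * straight-segment definition: a run of at least 2 horizontal chain
 * codes (0 or 4) and a row span of at most 2. */
int is_straight_segment(const Point *S, int n)
{
    if (n < 2) return 0;
    int xmin = S[0].x, xmax = S[0].x;
    int run = 0, prev_dy = 0, has_run = 0;

    for (int k = 1; k < n; k++) {
        if (S[k].x < xmin) xmin = S[k].x;
        if (S[k].x > xmax) xmax = S[k].x;

        int dx = S[k].x - S[k - 1].x;
        int dy = S[k].y - S[k - 1].y;
        if (dx == 0 && (dy == 1 || dy == -1)) {
            /* horizontal move: chain code 0 or 4 */
            if (run > 0 && dy != prev_dy)
                run = 0;                /* direction reversed: start a new run */
            run++;
            prev_dy = dy;
            if (run >= 2) has_run = 1;  /* run of length at least 2 */
        } else {
            run = 0;
        }
    }
    return has_run && (xmax - xmin) <= 2;
}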

The steps for straight segment extraction are shown in Algorithm 1. The algorithm is applied to the sequence of points defining the boundary edge of the component, and maintains an order (in terms of connectedness) of the straight segments extracted by it. This ordered extraction of segments helps in capturing the smaller segments as well. When continuity breaks, the closeness of the endpoints is checked using their x-coordinates. Finally, on the basis of the closeness of these x-coordinates and y-coordinates, the exact cover of the underline (a subset of the connected component) is reported.

For any start direction d, if we get a non-singular code c which is neither 4 nor 0, then we should not continue in that direction d. Necessary discussions on the selection of the start point for a straight edge, and on the selection of the next point from the current point, are given below.

Start point.
The set of extracted straight edges (from the boundary edge, which covers the underline) may differ on changing the start point. Also, for any point p on the boundary curve, there are at least two neighbor pixels, since the curve is not minimally connected in 8-connectivity in our case, and hence it is reducible [7]. As the extracted boundary edge is a reducible curve, some points may have more than two neighbor points. For simplicity, we discuss here the case of two neighbors; the other cases can be handled in a similar way. If p (with two unvisited neighbors) is a start point, then our procedure of tracing the straight edges from p can start in two directions. Let the start direction be d1 for one neighbor and d2 for the other. Let the non-singular code in direction d1 be c1, and that in direction d2 be c2. Then, using d1 we get the segment s1, and using d2 we get the segment s2. Individually, s1 and s2 may each be considered a part of the cover if {c1} ∪ {c2} = {0, 4}. But the union of the two segments s1 and s2 may not be a valid straight segment (as per our definition of a straight segment), though individually they are valid straight segments. Finding the straight edge segment of maximum possible length thus becomes complex. Hence, to keep the tracing of straight segments simple, we select a point p as start point such that the non-singular code is either 4 or 0 in one direction, and neither 4 nor 0 in the other direction. If p(x, y) is a start point, then d(x, y) is formally the initial direction at p(x, y). Hence, the start point of the traversal will definitely be one of the two endpoints defining a maximally straight segment. The other endpoint of the straight segment will be the point q where the straight-edge-finding algorithm halts. The neighbor points of q are then checked for the start-point conditions to commence the next straight segment. If q fails to be a start point, then we search ahead for another start point.

Next point.
As mentioned earlier, since we extract the boundary pixels of a component by a morphological operation, the boundary edge is a closed reducible curve. After we start the segment extraction from some point p, we select the next point p′ from the boundary edge map. If there is only one neighboring point which is unvisited, then that point is selected as the next point p′. If there are multiple unvisited neighbors, then a priority-based scheme is used. The priorities are listed below in descending order; a sketch of this selection follows the list.

• Horizontal neighbor pixel (in the direction of the non-singular code)

• Diagonal neighbor pixel

• Vertical neighbor pixel
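A sketch of this priority scheme is given below (our own formulation, not the authors' code). The visited map and the ydir parameter, which records the column direction (+1 or −1) of the current non-singular code, are hypothetical bookkeeping introduced for the example.

/* A point of the boundary curve (x = row, y = column), as in the
 * earlier sketches. */
typedef struct { int x, y; } Point;

/* Next-point selection: B is the boundary edge map, visited marks the
 * points already traced.  Returns 1 and fills *next with the first
 * unvisited boundary neighbour found in priority order: horizontal
 * (in the traced direction), then diagonal, then vertical. */
int next_point(const unsigned char *B, const unsigned char *visited,
               int rows, int cols, Point p, int ydir, Point *next)
{
    int cand[7][2] = {
        { 0, ydir },                                /* horizontal, traced direction */
        { -1, -1 }, { -1, 1 }, { 1, -1 }, { 1, 1 }, /* diagonal neighbours */
        { -1, 0 }, { 1, 0 }                         /* vertical neighbours */
    };

    for (int k = 0; k < 7; k++) {
        int x = p.x + cand[k][0], y = p.y + cand[k][1];
        if (x < 0 || x >= rows || y < 0 || y >= cols)
            continue;
        if (B[x * cols + y] && !visited[x * cols + y]) {
            next->x = x;
            next->y = y;
            return 1;
        }
    }
    return 0;   /* every neighbour on the edge map is already visited */
}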

3. REMOVAL OF UNDERLINES

After detecting an underline using the set of straight segments, we apply the underline removal strategy. The cover of every detected underline is defined by an upper cover-line and a lower cover-line, as shown in Fig. 3. The upper cover-line is a set of straight segments, and so is the lower cover-line. The upper and the lower cover-lines come from the boundary edge of the component. They are selected in such a way that they cover the underline part of the component as much as possible.


Figure 2: Deletion of underline for a script with headline. (a) Input image (cropped). (b) Inter-line gap (text line shown in gray). (c) Boundary of probable underline. (d) Straight segments detected from the boundary. (e) Detected underline on the input image. (f) Output after removal of underline.

3.1 Finalization of Cover Lines

For any point p(x, y) on a cover line, we check the pixels I(x + 1, y) and I(x − 1, y), where I is the 2D array representing the binarized input image. As p(x, y) is a boundary point, I(x + 1, y) and I(x − 1, y) cannot both be object pixels. If I(x + 1, y) is an object pixel, then p(x, y) is a point of some segment belonging to the upper cover-line. Similarly, if I(x − 1, y) is an object pixel, then p(x, y) is a point of some segment belonging to the lower cover-line.
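This test translates directly into code; a minimal sketch (our own formulation, following the row/column convention above) is given below.

/* Cover-line test of Sec. 3.1 (x = row, y = column). */
enum CoverSide { COVER_NONE = 0, COVER_UPPER, COVER_LOWER };

enum CoverSide cover_side(const unsigned char *I, int rows, int cols,
                          int x, int y)
{
    int below = (x + 1 < rows) ? I[(x + 1) * cols + y] : 0;
    int above = (x - 1 >= 0) ? I[(x - 1) * cols + y] : 0;

    if (below && !above) return COVER_UPPER;   /* object pixel just below */
    if (above && !below) return COVER_LOWER;   /* object pixel just above */
    return COVER_NONE;   /* both cannot hold for a true boundary point */
}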

3.2 Estimation of Thickness

Figure 4: Cases of underline removal strategy.

For every connected component, we define a bounding box B. The bounding box parameters are xt, xb, yl, yr, which define the top, bottom, left, and right sides of B. Inside the bounding box B, for every vertical line, we find the x-value of the upper cover (xu) and the x-value of the lower cover (xl); if both of them exist, then we store the value |xl − xu|. Using all these |xl − xu| values, we do a histogram analysis. We take the histogram peak as the width of the line, wl. This value wl is used in the underline removal procedure to convert underline pixels to background pixels.
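A sketch of this histogram analysis is shown below (our own formulation). The arrays xu and xl are assumed to hold, for each column of the bounding box, the row of the upper and lower cover-line, with a negative value where that cover-line is absent; MAX_THICKNESS is a hypothetical bound on the histogram.

/* Modal underline thickness of Sec. 3.2. */
#define MAX_THICKNESS 64

int modal_width(const int *xu, const int *xl, int ncols)
{
    int hist[MAX_THICKNESS + 1] = {0};

    for (int y = 0; y < ncols; y++) {
        if (xu[y] < 0 || xl[y] < 0)
            continue;                  /* one of the cover-lines is missing */
        int d = xl[y] - xu[y];
        if (d < 0) d = -d;             /* |xl - xu| */
        if (d <= MAX_THICKNESS)
            hist[d]++;
    }

    int wl = 1;                        /* d = 0 (coincident lines) ignored */
    for (int d = 2; d <= MAX_THICKNESS; d++)
        if (hist[d] > hist[wl]) wl = d;
    return wl;                         /* histogram peak, taken as the width */
}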

3.3 Removal Strategy


Figure 3: Extraction of boundary edge and underline cover for English (a script without headline).

Algorithm 1: Detect-Straight-Edges

Input: E : Edge map
Output: S : Set of straight segments

ps(x, y) ← start point of the current segment
d ← initial direction at ps(x, y)
pc(x, y) ← ps(x, y)
S ← ∅
start: run ← 0
xmax ← xmin ← x
while |xmax − xmin| ≤ 2 do
    p′(x′, y′) is the next point after pc(x, y)
    if x′ = x + 1 then
        if x′ > xmax then
            xmax ← x′
        run ← 0
    if x′ = x − 1 then
        if x′ < xmin then
            xmin ← x′
        run ← 0
    if x′ = x then
        run ← run + 1
    pc(x, y) ← p′(x′, y′)
if run ≥ 2 then
    S ← S ∪ {ps(x, y), . . . , pc(x, y)}
New initial direction d ← d(pc, p′)
ps(x, y) ← p′(x′, y′)
if E is not finished then
    goto start
return S

The strategy taken for the removal of underline pixels comprises several cases (Fig. 4). As underlines are covered by a set of straight segments, some of them belong to the upper cover-line and others belong to the lower cover-line. We estimate the modal width wl of the underline from the gaps between the upper and the lower cover-lines, as discussed in Sec. 3.2. We then detect the ‘vertical strips’ along the underline covers where the gap between the upper cover-line and the lower cover-line is at least the modal width, wl. That is, for each vertical strip, the distance between a pixel p of the lower cover-line and the corresponding pixel (lying vertically above p) of the upper cover-line is at least wl. It may be noted that there may occur some strip for which only one cover-line exists, either the upper or the lower. An example of such a strip is shown in Fig. 3(c), where the lower cover-line does not exist at the contact point of the character ‘d’ with the underline lying above it. It may also happen that there is a vertical strip where neither the upper nor the lower cover-line exists. The width of the vertical strip is checked: it must be less than or equal to the average width of a character; otherwise, it is not considered as a point of contact/interference with a text character. For scripts with headlines, g/2 is used as the threshold for the same purpose, the reason being that the font size for such scripts is about g/2, and we take the font size as the threshold.

The cases are shown in Fig. 4. The shaded parts are deleted as underline, and the remaining parts are kept unaltered. The strategies taken for the different cases are explained below.

• Case 1. For a vertical strip where either the upper cover-line or the lower cover-line is present (but not both), the underline is touched by a vertical portion of a character in that area. Identification of such an overlapped area (text part and underline) is difficult. Hence, we do not delete any object pixel in that vertical strip.


Figure 5: Deletion shown for the English character ‘d’ at the touch point with the underline. (a) The character touching the underline (cropped) lying above it. (b) Upper and lower cover-lines (in black) of the underline. (c) Character retained (black) after removal of the underline (gray).

• Case 2. For a vertical strip where both the upper and the lower cover-lines are absent, no deletion is done. The reason is the same as in Case 1.

• Case 3. One instance of this case is when the drawn underline just touches an object like a ‘dot’ and passes by. We cannot decide the exact overlapped area, and hence the vertical strip is not considered as a part of the underline.

• Case 4. This situation appears when Case 3 occurs in combination with Case 1 or Case 2, i.e., when vertical strips of the two cases are detected one after another (close strips). We delete the object pixels in the original image as shown in Fig. 4.

The above cases explain the idea in general. The same strategy as given in Case 3 and Case 4 can be applied when these situations appear on the lower cover-line.
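To make the per-strip decision concrete, the following C sketch (our own formulation, covering only the unambiguous strips) deletes the object pixels of a column only where both cover-lines are present; columns falling under Cases 1 and 2, where one or both cover-lines are missing, are left untouched, and the strip-width test of Cases 3 and 4 is omitted here. The encoding of xu and xl is the same hypothetical one used in the thickness sketch.

/* Per-column removal over the detected cover, columns y0..y1. */
void remove_underline(unsigned char *I, int rows, int cols,
                      const int *xu, const int *xl, int y0, int y1)
{
    for (int y = y0; y <= y1; y++) {
        if (xu[y] < 0 || xl[y] < 0)
            continue;                  /* Cases 1 and 2: keep the pixels */
        for (int x = xu[y]; x <= xl[y] && x < rows; x++)
            I[x * cols + y] = 0;       /* underline pixel -> background */
    }
}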

4. EXPERIMENTS AND RESULTS

Experiments are conducted on many document images from various sources like newspapers, magazines, and books, which are manually underlined. The pages have been scanned at a resolution of 300 dpi. Final results are given here for a few of them. Most of the document pages considered for our experiments are written in Bangla, Devnagari, or English script. For scripts with headlines, we first removed the text lines and considered only the inter-line gaps for underline detection and removal. To do so, we used the horizontal projection profile, as explained earlier. It is assumed that there is no skew present in the scanned page. If a perceivable skew exists, then it should be corrected by a skew correction module, since a perceivable skew in the document page can affect the headline detection procedure.

The characteristic of our method is to detect the underline covers and then apply the removal strategy to remove the underline parts. Some cases of the removal strategy have been discussed in Sec. 3.3. After detection of the underline covers, and before applying the removal strategy, we search for the vertical strips. The vertical strips indicate the parts having uneven thickness (compared to the line cover as a whole), arising out of the absence of one of the two cover-lines (upper or lower) or both. After detection of these vertical strips, we apply the removal procedure to remove the object pixels accordingly.

Table 1: CPU time (in seconds) for underline detection and removal by the proposed method for some sample text paragraphs.

Image   Script type   Image size    Time (seconds)
01      Bangla        1222 × 1992   0.62
02      Bangla        1232 × 2002   0.62
03      Bangla        2700 × 2035   0.76
04      Bangla        1140 × 658    0.38
05      English       1352 × 2063   0.24

Table 2: Test results for images with underlines.

Script type         Underline      Underline pixels   Non-underline pixels
                    detected (%)   removed (%)        removed (%)
With headlines      100            98.12              0
Without headlines   100            98.74              0

The removal strategy ensures that no pixel from the text is ever deleted while removing an underline. By the proposed technique, we can also detect whether an uneven thickness of the underline is caused by a fluctuation of the upper cover-line or of the lower cover-line. If the thickness is more than wl (see Sec. 3.2), and the fluctuation occurs at both the upper and the lower cover-lines, then it is treated as a thick part of the underline, and we delete the object pixels bounded by the upper and the lower cover-lines.

In Fig. 5, an instance of deletion is shown for the overlapped area of an underline and the English character ‘d’. A larger example for a script with headline is shown in Fig. 2. For pages with scripts having headlines, we need to apply a process of reformation of the document page, as we apply the underline detection and removal procedure on the extracted inter-line-gap regions only. For the reformation of the document page, we use image subtraction and then a union operation, as shown in Fig. 2.

The method given in [2] cannot handle the cases (e.g., the instance of Fig. 2) where the underline touches some text characters in the upper segments of the text line. There may also be cases where the underline touches two consecutive text lines, one text line below the underline and the other above it; these cases also cannot be handled by the method discussed in [2]. On the contrary, the proposed method handles these cases quite efficiently. However, there is scope to improve the removal method proposed here, especially in the overlapped areas (between the underline and the text parts).

The algorithm is implemented in C in Linux 2.6.35.6-45.fc14.i686 (Fedora Release 14) and run on a PC with a 2.9 GHz CPU and 3.0 GB RAM. CPU times for a few images are shown in Table 1.

To verify the effectiveness of our approach, we have tested it on text paragraphs, both for scripts with headlines and without headlines. For all these paragraphs, the underline types are usually mixed (both touched and untouched), and there are roughly 5 to 10 underlines per paragraph. Table 2 shows experimental results on a data set of 50 images, 25 for scripts with headlines and 25 for scripts without headlines. Some representative final results are shown in Fig. 6 and Fig. 7.


Figure 6: Sample result for English script. (a) Input image. (b) Boundary edge extracted for all components. (c) Finalized upper and lower cover-lines of underline covers. (d) Output text after removal of underlines.

5. CONCLUSION

We have proposed a novel method for underline detection and removal, which works well for different kinds of underlines, whether they are untouched or touched by text characters, and whether they are curved or bent, as commonly seen when drawn by hand. The method is insensitive to script and works in the presence of headlines. To handle scripts with headlines, a preprocessing step is needed, as explained in this paper. To develop a more effective system for dealing with broken, short, and doubtful underlines, a few other thresholds can be set, which we will investigate in future work.

6. REFERENCES

[1] K. R. Arvind, J. Kumar, and A. G. Ramakrishnan. Line removal and restoration of handwritten strokes. Proc. Conf. Computational Intelligence and Multimedia Applications, 3:208–214, 2007.

[2] Zhen-Long Bai and Qiang Huo. Underline detection and removal in a document image using multiple strategies. Proc. 17th International Conference on Pattern Recognition, pages 578–581, 2004.

[3] G. Dimauro, S. Impedovo, G. Pirlo, and A. Salzo. Removing underlines from handwritten text: an experimental investigation. In Progress in Handwriting Recognition, A. C. Downton and S. Impedovo (eds.), pages 497–501, 1997.

[4] C. Genfang, Z. Liyin, Z. Wenjun, and W. Qiuqui. Detecting the staff-lines of musical score with Hough transform and mathematical morphology. Proc. ICMT, pages 1–4, 2010.

[5] V. Govindaraju and S. H. Srihari. Separating handwritten text from interfering strokes. In From Pixels to Features III: Frontiers in Handwriting Recognition, S. Impedovo and J. C. Simon (eds.), North-Holland, pages 17–28, 1992.

[6] D. Guillevic and C. Y. Suen. Cursive script recognition: a fast reader scheme. Proc. of ICDAR, Tsukuba Science City, pages 311–314, 1993.

[7] Reinhard Klette and Azriel Rosenfeld. Digital Geometry: Geometric Methods for Digital Picture Analysis. Morgan Kaufmann, San Francisco, 2004.

[8] R. C. Gonzalez and R. E. Woods. Digital Image Processing. Pearson Education, third edition, 2009.

[9] J. Serra. Image Analysis and Mathematical Morphology. Academic Press, London, 1982.

[10] Y. Yin-xian and Y. Ding-li. Staff line detection and revision algorithm based on subsection projection and correlation algorithm. Proc. of ICICA, pages 322–325, 2012.


Figure 7: Sample result for Bangla script. (a) Input image. (b) Underlines and non-underline (text) parts extracted from the inter-line gaps. (c) Finalized upper and lower cover-lines of underline covers. (d) Output text after removal of underlines.
