
Expert Systems with Applications 42 (2015) 7627–7640


A new Histogram Oriented Moments descriptor for multi-oriented moving text detection in video

http://dx.doi.org/10.1016/j.eswa.2015.06.002 · 0957-4174/© 2015 Elsevier Ltd. All rights reserved.

* Corresponding author at: C2A, Block-C, Residential College 12, University of Malaya, Kuala Lumpur, Malaysia. Tel.: +60 1112282697.

E-mail addresses: [email protected] (V. Khare), [email protected] (P. Shivakumara), [email protected] (P. Raveendran).

Vijeta Khare *, Palaiahnakote Shivakumara, Paramesran Raveendran
University of Malaya, Kuala Lumpur, Malaysia


Article history: Available online 6 June 2015

Keywords: Central moments; Histogram Oriented Moments; Histogram Oriented Gradients; Video text detection; Optical flow; Moving caption text detection

Developing an expert text detection system for video indexing and retrieval is a challenging task due to the low resolution, complex background, poor illumination and movement of text present in a video. Besides, text detection is vital for several real-time applications, such as license plate recognition, assisting blind persons and other surveillance applications. In this paper, we introduce a new descriptor called Histogram Oriented Moments (HOM) for text detection in video, which is invariant to rotation, scaling, font and font size variations. The HOM finds orientations with the second order geometrical moments for each sliding window (overlapped block) of the input frame. The proposed method performs a histogram operation on the orientations of each window to identify the dominant orientation (as a representative). Then, a new hypothesis is defined on the dominant orientations of a connected component: the number of orientations that point towards the centroid of the connected component is larger than the number of dominant orientations that point away from it. The components that satisfy the above hypothesis are considered text candidates, or else non-text candidates. Further, to detect moving text, we explore optical flow properties, such as the velocity of text candidates, to estimate the motion between temporal frames. The components which move with constant velocity and uniform direction are considered text candidates, otherwise non-text candidates. We demonstrate the proposed method's dominance over state-of-the-art methods by testing on a benchmark database, namely ICDAR 2013, and our own video datasets in terms of recall, precision and F-measure.

© 2015 Elsevier Ltd. All rights reserved.

1. Introduction

In the last few years, advances in new technologies in the field of information retrieval have been changing the day-to-day life of humans (Jung, Kim, & Jain, 2004; Sharma, Pal, Blumenstein, & Tan, 2012). It is evident from the official statistics of the popular video portal YouTube that almost 60 hours of video are uploaded every minute and more than 3 billion videos are watched per day on YouTube. Therefore, retrieval of videos on the World Wide Web (WWW) has become a very important and challenging task for researchers. Secondly, for such a huge database, conventional content based image retrieval methods may not give an efficient and accurate solution due to the gap between low level and high level features (Jung et al., 2004; Sharma, Pal, & Blumenstein, 2012). To overcome this problem, text detection in video or images has been introduced. It helps to find the meaning, which is close to the content of the video/images, with the help of Optical Character Recognition (OCR) engines (Fernandez-Caballero, Lopez, & Castillo, 2012; Grafmuller & Beyerer, 2013; Park & Kim, 2013). Thirdly, text detection and recognition in video can be used in several real-time applications, such as assisting blind persons, assisting intelligent driving, assisting tourists to spot places with the help of GPS, tracking vehicles based on license plate recognition and other surveillance applications. However, text detection and recognition from video or images are not as simple as text detection and recognition from scanned, plain-background document images, because video usually suffers from low resolution, complex backgrounds and variations in color, font, font size, orientation and text movement (Chen & Odobez, 2005; Jung et al., 2004; Liu, Wang, & Dai, 2005; Shivakumara, Phan, & Tan, 2010; Wei & Lin, 2012). Therefore, the traditional document analysis methods may not be suitable for text detection in video or natural scene images because these methods require the complete shape of the characters and high resolution images. Generally, video contains two types of text, namely, (i) graphics/superimposed text, which is inserted by an editor, and (ii) scene text, which is part of the image embedded in the background. Since graphics text is edited text, it has good clarity and visibility.



Graphics text is easy to process, while scene text occurs naturally and its characteristics are unpredictable, so it is hard to process (Chen & Odobez, 2005; Jung et al., 2004; Liu et al., 2005; Shivakumara et al., 2010; Wei & Lin, 2012). The presence of both graphics and scene text in video increases the complexity of the problem. As noted in Risnumawan, Shivakumara, Chan, and Tan (2014), text detection and recognition are not new problems for the document analysis field. The same document analysis based techniques have also been extended to solve the problem of text detection from natural scene images (Pan, Hou, & Liu, 2008; Pan, Hou, & Liu, 2011; Risnumawan et al., 2014). However, these methods require high resolution, still images rather than video, because natural scene images are generally captured with high resolution cameras while videos are captured with low resolution cameras. Therefore, these methods may not be used directly for text detection in video. We therefore propose a new descriptor called Histogram Oriented Moments (HOM) to overcome the above problems by detecting both static and moving text in video.

2. Related work

A large number of methods have been proposed in the literature for detecting text in video. These can be classified into two broad categories (Wang & Chen, 2006): (1) methods which do not use temporal information and (2) methods which use temporal information. The methods that fall under category 1 generally use the first frame or a key frame of the video for text detection. These methods either assume that key frames containing text are available or use existing methods for extracting key frames. The methods that fall under category 2 prefer to use temporal information for enhancing text of low resolution or reducing false positives, but not for tracking text or for the detection of moving text.

Category 1 can be classified further into three classes, namely connected component based, texture based and gradient based methods. Connected component based methods exploit characteristics of text components for text detection in video because the properties of text components help to separate the background from the text information. For example, several methods are discussed in the survey by Jung et al. (2004), where we can notice that the methods use geometrical features of text components for segmenting text in video as well as images. In addition, color has been used for detecting text in video based on the fact that character components in a text line have a uniform color. Chen and Odobez (2005) proposed a sequential Monte Carlo based method for text detection in video. This method uses Otsu thresholds to segment initial text regions and then uses the distribution of pixels of each segmented region for classification of text pixels from the background. Liu, Song, Zhang, and Meng (2013) proposed a method for multi-oriented Chinese text extraction in video. This method uses the combination of wavelet and color domains to obtain text candidates for the given video image. For each text candidate, the method extracts features at the component level for classifying the component as text or non-text. It is observed from the discussion on connected component based methods that these methods focus on caption or superimposed text but not scene text, because caption text has better quality and contrast compared to its background. As a result, the methods expect the shape to be preserved as in document analysis and hence use uniform color features and shape features. Therefore, these methods are sensitive to complex backgrounds because components in the background may produce text-like features. In addition, these methods are limited to high contrast text but not to scene text, which can vary in contrast.

To overcome the problems associated with connected component based methods, texture based methods have been proposed for text detection in video, which consider the appearance of a text pattern as a special texture. For example, Shivakumara et al. (2010) proposed a method based on the combination of wavelet and color features for detecting text candidates with the help of k-means clustering. Boundary growing has been proposed to extract text lines of different orientations in video. Wang and Chen (2006) proposed a spatial-temporal wavelet transform to enhance the video frames. For the enhanced frame, this method extracts a set of statistical features by sliding a window over the enhanced image. Then a classifier is used for classifying text and non-text pixels. Anthimopoulos and Gatos (2013) proposed a method for artificial and scene text detection in images and videos using a Random Forest classifier and a multi-level adaptive color edge local binary pattern. The multi-level adaptive color edge local binary pattern has been used to study the spatial distribution of color edges at multiple adaptive levels of contrast. In continuation, a gradient based algorithm has been applied to achieve text detection in video/images. It is noted from the review of texture based methods that most of them use a large number of features and a classifier with a large number of training samples. Therefore, these methods are computationally expensive, although they work well for complex backgrounds in contrast to connected component based methods. In addition, the scope of these methods is limited to specific languages because of the constraints of classifiers and training samples.

To alleviate this problem, gradient based methods have been proposed for text detection in video. These methods work based on the fact that text pixels exhibit high contrast compared to the image background, and the spatial relationship between strokes provides unique properties to differentiate text from non-text. For example, Liu et al. (2005) extract a set of statistical features from the edge images of different directions. Then k-means clustering is used for classifying text and non-text pixels. Geometrical properties are used for grouping text pixels and extracting text lines in video and images. Wei and Lin (2012) proposed a robust video text detection approach using an SVM. This method generates two downsized images for the input image and then computes the gradient difference for the three images, including the input image, which results in three gradient difference images. K-means clustering is applied to the difference images to separate the text cluster from non-text. Finally, the SVM classifier is used for classifying true text pixels from the text clusters. Shivakumara, Phan, and Tan (2009) derive rules using the different edge maps of the input image. The rules are used for segmenting the text region, and then the same rules are modified for extracting text information from the video images. Lienhart and Wernicke (2002) proposed a method based on the combination of gradient and RGB color space. This results in edge maps of different directions for the input image. Then a neural network classifier is applied for separating text and non-text pixels. Further refinement has been proposed for full text line extraction. Zhang and Kasturi (2014) proposed a text detection method based on character and link energies. The method explores stroke width distance to define link energies between character components, based on the fact that the stroke width distance of character components is generally almost the same. Then a maximum spanning tree is used for text line extraction from both images and videos. It can be inferred from the literature review on edge and gradient based methods that these methods are fast compared to texture based methods, but they are sensitive to the background because edge and gradient features are not robust to background variations. This results in more false positives.

Overall, it can be concluded from the above discussion that most of the existing methods utilize still images or individual frames extracted from video for text detection. Besides, the main objective of these methods is to detect static text in images and videos but not moving text in video. As a result, these methods do not explore the temporal information that is available in video.



Therefore, these methods are good for static caption text detection in images or video frames but not for moving caption or scene text detection. Since the proposed method aims to detect text in both still images and moving text in video, it falls under category 2 and explores temporal frame information to achieve the objective of moving text detection in video. The detailed literature survey for category 2 is discussed below.

Huang, Shivakumara, and Tan (2008) proposed a method for scrolling text detection in video using temporal frames. This method uses motion vector estimation for detecting text. However, it is limited to text scrolling either horizontally or vertically, but not in other directions. Wang and Chen's method (2006) uses a spatio-temporal wavelet transform to extract text objects in video documents. The method proposes a three-dimensional wavelet transform with one scaling function and seven wavelet functions over a sequence of video frames to extract a set of texture features. The final classification is then done by a Bayes classifier. Huang (2011) detected video scene text based on temporal redundancy in video. Video scene text in consecutive frames has arbitrary motion due to camera or object movement. Therefore, the method performs motion detection over 30 consecutive frames to synthesize a motion image. The synthesized motion image is used to filter out candidate text regions, and only the candidate text regions which have motion occurrence are kept as final scene text. Zhao et al. (2011) proposed an approach for text detection using corners in video. This method uses dense corners for identifying text candidates. With the corners, the method forms text regions using morphological operations. The method extracts features, such as area, aspect ratio and orientation, to classify a text region. Optical flow has been used for moving caption text detection. Mi, Xu, Lu, and Xue (2005) proposed a text extraction approach based on multiple frames. Edge features are explored with a similarity measure for identifying text candidates. Liu and Wang (2012) proposed a method for video caption text detection using stroke-like edges and spatio-temporal information. A color histogram is used for segmenting texts. Li, Li, Song, and Wang (2010) proposed a method for video text detection using multiple frame integration. This method uses edge information to extract text candidates. Morphological operations and heuristic rules are proposed to extract the final texts from video. Mosleh, Bouguila, and Hamza (2013) proposed an automatic inpainting scheme for video text detection and removal based on the stroke width transform to identify text objects. The motion patterns of the text objects of each frame are then analyzed to localize video texts. The detected text regions are finally removed, and the video is restored by an inpainting scheme. The objective of the method is to restore missing information due to overlapping caption texts. The method is proposed to detect horizontal texts but is not considered suitable for arbitrarily oriented text detection in video.



However, similar to our work, Wu, Shivakumara, Lu, and Tan (2014) recently proposed a method for detecting both caption and scene text in video using temporal information and Delaunay triangulation. However, the scope of this method is limited to text detection, not moving text detection. In addition, the method does not work well for text with multiple fonts or sizes in video. Optical flow based properties have been proposed by Shivakumara, Lubani, Wong, and Lu (2014) for dynamic curved text detection in video. The method performs well when text is moving while the background stays static, but has limitations when the background moves with high variation. Arbitrary text detection has also been proposed by Shivakumara, Phan, Shijian, and Tan (2013) through the use of gradient vector flow and grouping. Though the method finds a solution to the complex text detection problem, it does not explore temporal information for text detection in video. Therefore, it cannot be used for moving text detection. Gomez and Karatzas (2014) proposed Maximally Stable Extremal Regions (MSER) based real-time text detection and tracking. The method proposes MSER for identifying text candidates with the help of color similarity, stroke width similarity checks, etc. After filtering meaningful candidate regions, the method tracks the same candidate regions in successive frames. The search process results in a tree structure. For searching, the method uses invariant moments for finding candidate matches in successive frames. However, the method's performance degrades when blur exists in the video, as MSER is sensitive to blur, though invariant moments are robust to blur to some extent. Minemura, Shivakumara, and Wong (2014) proposed multi-oriented text detection for intra-frames in H.264/AVC video. The method utilizes AC coefficients in the compressed domain for extracting distinct features. Then k-means clustering is applied to the features to classify text pixels from non-text pixels. The dominant orientation of the text cluster regions is used to eliminate false positives. Despite the fact that the method extracts features in the compressed domain, it consumes more computational time to achieve results. In summation, it can be said that the method does not employ temporal frames and hence does not work for moving text detection in video. Recently, Khare, Shivakumara, and Raveendran (2014) proposed a method based on motion vectors for moving text detection in video. The method explores moments as features for searching text blocks in temporal frames, and then k-means clustering is used for classifying the text blocks. The method is not robust to background changes and arbitrary movements of the text, as it finds several mismatches during the search process in successive frames.

Fig. 1. Steps of the proposed method.



In light of the above discussion, the following are the major weaknesses of the existing methods in detecting multi-oriented moving text in video.

(1) Most of the current methods focus on detecting horizontal caption text in video because caption text has better visual quality and contrast compared to scene text. These constraints may not hold for moving scene text in complex backgrounds.

(2) The existing methods utilize temporal frame information for enhancing text detection performance, that is, for false positive removal, but not for detecting moving text in videos.


In other words, the existing methods do not use the fact that text in video usually moves unidirectionally with almost constant velocity. Therefore, the scope of these methods is to detect static text rather than moving text in video.

(3) Though the existing methods use temporal frames, they do not have proper criteria for determining the number of temporal frames. The methods generally fix the number of frames to be processed based on experimental results. A threshold fixed based on experiments may not work well for different datasets and situations.

(4) Most of the existing methods use classifiers and a large number of training samples for the classification of text and non-text at the pixel, component or text line level.

Fig. 2. Orientations of the HOM descriptor.



Therefore, these methods lose the ability of multi-lingual text detection in video, which limits their generality.

The above demerits of the existing methods have motivated us to propose a new method which is capable of detecting moving and static text in video accurately and efficiently, irrespective of text type, orientation and script. This paper proposes a new descriptor called Histogram Oriented Moments (HOM) for both static and moving text detection in video. Inspired by the work presented in Minetto, Thome, Cord, Leite, and Stolfi (2013), Tsai, Chen, and Fang (2009) and Yao, Nie, Liu, and Zhu (2014), where new descriptors were developed by referring to Histogram Oriented Gradients (HOG) for text detection and Histogram of Optical Flow (HOF) for detecting human action, we introduce our own new descriptor based on moments for text detection in video. The main reason to choose moments for deriving this descriptor is that moments consider both spatial information and pixel values for estimating orientation, in contrast to HOG, which uses only gradient information for text detection. In this way, the HOM descriptor differs from the existing descriptors. The HOM finds the dominant orientation for each overlapped block by performing a histogram operation on the moment orientations of each sliding window. The method derives a new hypothesis based on dominant orientations to classify text and non-text components: for a text component, the number of orientations that point towards the centroid of the connected component is larger than the number of orientations that point away from it. The components which satisfy the above hypothesis are considered text candidates, while others are considered non-text candidates. Geometrical characteristics of the text candidates are used for eliminating false text candidates. The method then explores optical flow properties of the text for detecting moving text.


The main contributions of the proposed method are as follows:

(1) Introducing a new descriptor called HOM which uses both spatial and intensity values for multi-oriented moving text detection in video.

(2) Introducing a new hypothesis based on the dominant orientation given by HOM for classifying text and non-text candidates, which works to some extent under distortion and blur. This improves the overall performance of the proposed method.

(3) Exploring optical flow properties to extract features, such as text moving with constant velocity in a single direction, for detecting moving text in video, and proposing a convergence criterion, based on the optical flow properties of text components, for stopping the search process, which helps in determining the number of temporal frames to be processed for text detection.

(4) Furthermore, the advantage of the proposed method is that it has the ability to detect both static and dynamic text without losing accuracy, irrespective of text type, script and orientation, because it does not involve any classifier or training samples for detection.

3. Proposed methodology

It is true that moments have been used successfully for text detection in the literature (Li, Doermann, & Kia, 2000) because moments have the ability to capture unique features, such as spatial information and the structure of components, which can distinguish text from non-text in the complex background of video. Inspired by this, we use second order moments to derive a new descriptor that estimates orientations representing groups of text pixels, in the way that HOG uses gradient orientations. This observation leads us to propose a new hypothesis to classify text and non-text candidates.

Fig. 3. Text candidates selected by the HOM descriptor.



Due to the complex background of video, non-text candidates may be misclassified as text candidates. Therefore, we extract structural features of text candidates, such as dense corners and edge density, to remove false text candidates. Since video provides temporal frames, we also use temporal frames to detect moving text. As noted in Shivakumara et al. (2014), text usually has constant velocity while non-text has non-uniform velocity. We therefore use the velocity determined by optical flow, which is constant for text candidates and non-uniform for non-text candidates, to classify them accurately. Finally, false positives are removed to improve the accuracy of the method. The flow diagram can be seen in Fig. 1.

3.1. HOM for text candidates selection

For a given video, we select a frame containing text as shown in Fig. 2(a), where we can see differently oriented text lines. The proposed method divides the whole frame into overlapped, equal sized sub-blocks of size 8 × 8. For each overlapped block (of size 8 × 8) in Fig. 2(a), we determine second order moments as defined in Eqs. (1)-(6).


Fig. 2(b) shows the moment image of the input frame computed through the second order moments. Fig. 2(c) represents the orientations for a highlighted block in Fig. 2(b). The method then obtains the dominant orientation for each block by performing a histogram operation. Before the histogram operation, all orientations are quantized to the nearest bin (where the bins range from 0 to 180 degrees with a bin size of 20 degrees). The histogram for the selected block is shown in Fig. 2(d), where we can see bars for different orientations. The orientation with the highest peak is considered the dominant orientation, as shown in Fig. 2(e), where this orientation represents the whole block. We believe these dominant orientations reflect the orientations of the group of pixels. The final dominant orientations for all the blocks can be seen in Fig. 2(f), where it is noted that the dominant orientations represent the edge pixels of the objects in the input frame in Fig. 2(a). This is the advantage of the orientation given by moments: it gives a high response for text pixels, where there is high contrast, and a low response for non-text pixels, where there is low contrast.

Fig. 4. Orientations of the HOG descriptor.



It is true that text pixels generally have high contrast compared to their background (Jung et al., 2004; Liu et al., 2005; Sharma et al., 2012; Shivakumara et al., 2010; Wei & Lin, 2012). Therefore, these orientations give a significant cue for classifying text and non-text pixels. This motivated us to study the orientations of both text and non-text pixels. It is observed that, for a text component, the majority of the orientations point towards the centroid of the component and only a few orientations point away from it. This is illustrated in Fig. 3, where (a) is the result of orientations in which text and non-text portions are marked by rectangles, (b) shows the HOM orientations for the non-text portion marked in Fig. 3(a), and (c) shows the orientations for the text portion marked in Fig. 3(a). It can be noticed from Fig. 3(b) and (c) that the orientations in Fig. 3(b) show random directions while the orientations in Fig. 3(c) show uniform directions.



This observation leads us to derive a new hypothesis: if the number of orientations of a component which point towards the component (inside count) is larger than the number of orientations which point away from the component (outside count), then the component is considered a text component; otherwise, it is a non-text component, as shown in Fig. 3(d), where one can see the difference in the inside and outside counts for text and non-text components. The method eliminates non-text components using this hypothesis. The effect can be seen in Fig. 3(e), where most of the non-text components have been removed. This output is called the text candidates. Note that the components are obtained with the help of the Sobel edge image of the input frame.
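To make the inside/outside-count rule concrete, the sketch below (a simplified Python illustration, not the authors' code) classifies one connected component given the dominant HOM orientation of every block that overlaps it. How an orientation is decided to "point towards" the centroid is not fully specified in the text; here it is taken as a positive projection of the orientation vector onto the vector from the block centre to the centroid, which is one plausible reading.

```python
import numpy as np

def is_text_component(block_centers, block_angles_deg, centroid, margin=0.0):
    """Hypothesis test sketch: a component is a text candidate if more
    dominant orientations point towards its centroid than away from it.

    block_centers    : (N, 2) array of (x, y) centres of the blocks overlapping the component
    block_angles_deg : (N,) dominant HOM orientation of each block, in degrees
    centroid         : (x, y) centroid of the connected component
    """
    inside, outside = 0, 0
    for (x, y), ang in zip(block_centers, block_angles_deg):
        # orientation as a unit direction vector (assumption: angle treated as a direction)
        d = np.array([np.cos(np.deg2rad(ang)), np.sin(np.deg2rad(ang))])
        to_c = np.array([centroid[0] - x, centroid[1] - y])
        n = np.linalg.norm(to_c)
        if n < 1e-6:
            continue
        # positive projection -> the orientation points towards the centroid
        if float(d @ (to_c / n)) > margin:
            inside += 1
        else:
            outside += 1
    return inside > outside  # True -> text candidate
```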

\theta(f) = \tfrac{1}{2}\,\arctan\!\left(\frac{2\mu'_{11}}{\mu'_{20} - \mu'_{02}}\right) \qquad (1)

Here \mu_{pq} is the central moment of the image f(x, y), given by

\mu_{pq} = \sum_{x}\sum_{y} (x - \bar{x})^{p} (y - \bar{y})^{q} f(x, y) \qquad (2)

In order to reduce the computational complexity, we use raw moments to generate the second order central moments rather than using the above equation.

Fig. 5. Effect of the HOG descriptor for text candidate selection.

Fig. 6. (a) Corners detected for the text candidates; (b) and (c) outputs when low dense corners and low edge density edges are rejected; (d) final text detection result.



The required second order central moments can be drawn directly from:

\mu'_{20} = M_{20}/M_{00} - \bar{x}^{2} \qquad (3)
\mu'_{02} = M_{02}/M_{00} - \bar{y}^{2} \qquad (4)
\mu'_{11} = M_{11}/M_{00} - \bar{x}\,\bar{y} \qquad (5)

where (\bar{x}, \bar{y}) is the centroid, given by


\bar{x} = M_{10}/M_{00}, \qquad \bar{y} = M_{01}/M_{00}

The raw image moments are defined as the weighted averages (moments) of the image pixel intensities:

M_{ij} = \sum_{x=1}^{N} \sum_{y=1}^{N} x^{i} y^{j} f(x, y) \qquad (6)
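As an illustration of Eqs. (1)-(6), the following Python sketch (a simplified reading, not the authors' implementation) computes the moment orientation of an 8 × 8 block from raw moments and then selects the dominant orientation of a window by histogramming block orientations into 20-degree bins over [0, 180]. The exact granularity at which orientations are collected per window is an assumption here.

```python
import numpy as np

def block_orientation(block):
    """Orientation of one block from second order moments (Eqs. (1)-(6))."""
    h, w = block.shape
    y, x = np.mgrid[1:h + 1, 1:w + 1].astype(float)
    f = block.astype(float)
    M00 = f.sum()
    if M00 == 0:
        return None
    M10, M01 = (x * f).sum(), (y * f).sum()
    M11, M20, M02 = (x * y * f).sum(), (x * x * f).sum(), (y * y * f).sum()
    xb, yb = M10 / M00, M01 / M00                      # centroid
    mu20 = M20 / M00 - xb ** 2                         # Eq. (3)
    mu02 = M02 / M00 - yb ** 2                         # Eq. (4)
    mu11 = M11 / M00 - xb * yb                         # Eq. (5)
    # Eq. (1); arctan2 is the numerically robust form of (1/2) arctan(2mu11 / (mu20 - mu02))
    return 0.5 * np.arctan2(2 * mu11, mu20 - mu02)

def dominant_orientation(window, block=8, step=4):
    """Dominant orientation of a window: quantise block orientations into
    20-degree bins over [0, 180] and keep the bin with the highest peak."""
    bins = np.arange(0, 181, 20)
    angles = []
    for r in range(0, window.shape[0] - block + 1, step):
        for c in range(0, window.shape[1] - block + 1, step):
            th = block_orientation(window[r:r + block, c:c + block])
            if th is not None:
                angles.append(np.degrees(th) % 180.0)
    if not angles:
        return None
    hist, edges = np.histogram(angles, bins=bins)
    k = int(hist.argmax())
    return 0.5 * (edges[k] + edges[k + 1])             # centre of the dominant bin
```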

Fig. 7. Optical flow vectors marked over video frames.

Fig. 8. (a) Detected text regions multiplied with (b) the optical flow intensity field to classify the moving text regions in (c).

Fig. 9. (a) Input frame; (b)-(d) intermediate results of moving text detection.


Table 1
Performance on ICDAR 2013 without temporal information.

Method               Recall   Precision   F-Score   APT (s) per frame
Proposed method      0.74     0.82        0.78      2
Zhao et al.          0.71     0.69        0.70      2.5
Shivakumara et al.   0.72     0.78        0.75      2.3
Liu et al.           0.68     0.67        0.67      2


When we look at the process and the steps of the orientation estimation by moments, there is a similarity between Histogram Oriented Gradients (HOG) and HOM, and the HOG concept has been used for text detection in the past. Therefore, to show the performance of HOM over HOG, we implement HOG with the same steps as those of HOM. Fig. 4 shows the orientations given by HOG, where (a) is the input frame, (b) is the gradient image with block division, (c) is the gradient direction for the selected block shown in Fig. 4(b), (d) is the histogram for selecting the highest peak, (e) is the dominant direction given by the highest peak of the histogram and (f) shows the dominant directions of the whole image in Fig. 4(a). We apply the same hypothesis to the directions to classify text and non-text pixels, as shown in Fig. 5, where (a) shows the text and non-text regions marked by rectangles, (b) shows directions for the non-text region, (c) shows directions for the text region, (d) shows the inside and outside counts for the text and non-text regions and (e) shows the final effect of the hypothesis on the whole input frame. When we compare the result of HOG in Fig. 5(e) with the result of HOM in Fig. 3(e), it can be noticed that HOM is more effective than HOG, because HOG leaves more non-text components than HOM. The main reason is that HOG considers only gradient directions while HOM considers both spatial information and pixel values for finding orientations.


It is true that spatial information, such as the proximity between pixels of a text component, indicates the presence of text. Hence, we name the proposed process a new descriptor called HOM in this work.

3.2. Text candidates verification

Fig. 3(e) shows that the HOM alone is not sufficient to remove false text candidates, due to variations in background and resolution. We propose two features based on the structure of text candidates: dense corners and edge density. We believe that corner density and edge density are high for text candidates compared to non-text candidates. To remove false text candidates, we use the Harris corner algorithm to detect corners, as shown in Fig. 6(a), where we can notice dense corners for text candidates and sparse corners for non-text candidates. We derive a rule based on this observation to remove false text candidates.

Fig. 10. Sample results of the proposed and existing methods when temporal information is not used.



The result is shown in Fig. 6(b), where a few non-text components are removed. However, Fig. 6(b) still contains a few more false text candidates. We propose a further rule based on edge density, as defined in Eq. (7), to remove the remaining false text candidates. The effect can be seen in Fig. 6(c), where almost all false text candidates have been removed. The final text detection results are shown in Fig. 6(d), where we can see bounding boxes for the text lines. The thresholds for the above rules are determined based on an experimental study.

Edge density is computed as follows. For a given window (here, 8 × 8), the edge density feature measures the average edge magnitude in a sub-region of the window. Let W(u, v) be a window and e(u, v) be the edge magnitude of the window. For a sub-region r with the top-left corner at (u1, v1) and the bottom-right corner at (u2, v2), the edge density feature is defined as

f = \frac{1}{a_r} \sum_{u=u_1}^{u_2} \sum_{v=v_1}^{v_2} e(u, v) \qquad (7)

where a_r is the area of the region, a_r = (u_2 - u_1 + 1)(v_2 - v_1 + 1). A block is rejected if its edge density is smaller than the fixed threshold.
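The two verification rules can be sketched as follows in Python with OpenCV. The thresholds, and the use of the Harris response to count corners, are illustrative assumptions, since the paper only states that the thresholds were fixed experimentally.

```python
import cv2
import numpy as np

def verify_candidate(gray, box, corner_thr=0.02, edge_thr=20.0):
    """Verification sketch: keep a candidate only if it has dense corners
    (rule 1) and high edge density (rule 2, Eq. (7)).
    `box` = (x, y, w, h) of a text candidate in the grayscale frame."""
    x, y, w, h = box
    roi = gray[y:y + h, x:x + w]
    if roi.size == 0:
        return False

    # Rule 1: corner density via the Harris response (corners per pixel).
    harris = cv2.cornerHarris(np.float32(roi), blockSize=2, ksize=3, k=0.04)
    peak = harris.max()
    corner_density = float((harris > 0.01 * peak).sum()) / roi.size if peak > 0 else 0.0

    # Rule 2: edge density, Eq. (7) -- mean Sobel edge magnitude over the region.
    gx = cv2.Sobel(roi, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(roi, cv2.CV_32F, 0, 1, ksize=3)
    edge_density = float(np.sqrt(gx ** 2 + gy ** 2).mean())

    return corner_density > corner_thr and edge_density > edge_thr
```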

3.3. Moving text detection

Generally, video contains both static and dynamic text. The previous section uses individual frames to detect text.


This section focuses on detecting moving text using temporal frames. For the frames in a video, the previous step is first used to detect text candidates. The text candidates are then used for estimating motion. We propose optical flow and its properties to determine moving text. It is valid that text in video, especially graphics text, moves at constant velocity and in a single direction. We exploit this observation using optical flow, because optical flow is good for estimating the motion of the text candidates between temporal frames according to the literature. Inspired by the work presented in Bruhn, Weickert, and Schnörr (2005), where optical flow is used both globally and locally for tracking objects with arbitrary movements, we use the same concept for tracing text in video in this work. Detecting moving text in video is hard because the same features may overlap with moving objects in the background. Therefore, we propose to use optical flow globally and locally to overcome the problem of moving background objects. Global optical flow helps in differentiating moving background objects from the graphics text, while local optical flow helps in estimating the constant velocity and direction of the text. More formally, the global and local optical flow can be derived as follows. We use the combined local-global (CLG) spatial approach presented in Bruhn et al. (2005), which tries to combine the advantages of the local Lucas-Kanade method and the global Horn-Schunck method. Let us first reformulate the previous approaches, where (u, v) is the displacement field called optical flow, using the notation:

Fig. 11. Sample results of the proposed and existing methods when temporal information is used.



w := (u, v, 1)^{T} \qquad (8)
|\nabla w|^{2} := |\nabla u|^{2} + |\nabla v|^{2} \qquad (9)
\nabla_{3} f := (f_x, f_y, f_t)^{T} \qquad (10)
J_{\rho}(\nabla_{3} f) := K_{\rho} * (\nabla_{3} f \, \nabla_{3} f^{T}) \qquad (11)

The Lucas–Kanade method minimizes the quadratic form

E_{LK}(w) = w^{T} J_{\rho}(\nabla_{3} f)\, w \qquad (12)

while the Horn–Schunck technique minimizes the functional

E_{HS}(w) = \int_{\Omega} \left( w^{T} J_{0}(\nabla_{3} f)\, w + \alpha\, |\nabla w|^{2} \right) dx\, dy \qquad (13)

This formulation suggests a natural way to extend the Horn–Schunck functional to the desired CLG functional: we simply replace the matrix J_{0}(\nabla_{3} f) by J_{\rho}(\nabla_{3} f) with some integration scale \rho > 0. Thus, the CLG method minimizes the functional

E_{CLG}(w) = \int_{\Omega} \left( w^{T} J_{\rho}(\nabla_{3} f)\, w + \alpha\, |\nabla w|^{2} \right) dx\, dy \qquad (14)

Its minimizing flow field (u, v) satisfies the Euler–Lagrange equations. This replacement is hardly more complicated than the original Horn–Schunck equations. More details about this algorithm can be found in Bruhn et al. (2005).

Table 2
Performance on ICDAR 2013 with temporal information.

Method            Recall   Precision   F-Score   APT (s) per video sequence
Proposed method   0.76     0.79        0.77      828 (13.8 min)
Zhao et al.       0.69     0.65        0.7       1008 (16.8 min)
Mi et al.         0.73     0.72        0.77      792 (13.2 min)
Huang             0.7      0.69        0.69      1296 (21.6 min)


We extract the optical flow features for every pair of consecutive frames in order to preserve the spatial-temporal information. Samples of optical flow computed over frames are shown in Fig. 7, where we can see the optical flow vectors derived from successive frames.

With the optical flow, we get motion features for the pixels in the frames. We combine the motion features with the text candidates detected in the previous section to detect moving text. Since the previous step gives text candidates for every frame, we use the motion features corresponding to the text candidates to identify the moving texts, as shown in Fig. 8, where (a) is the text detected by the previous step, (b) is the optical flow intensity given by the optical flow method and (c) is the final moving text detection result.

The velocity and direction features are calculated as follows. Let (u, v) be the optical flow vectors given by the optical flow method for a text candidate in the first and second frames; then the direction (θ) and velocity (vel) are defined as

\theta = \tan^{-1}(v/u)
vel = \sqrt{u^{2} + v^{2}}

When the velocity and direction change drastically, the method stops the process of estimating motion with the optical flow method. According to our experiments, in almost all cases the method uses fewer than 10 frames for moving text detection.

Fig. 12. Sample results of the proposed and existing methods without temporal frames.

Table 3
Performance on our own dataset without temporal information.

Method               Recall   Precision   F-Score   APT (s) per frame
Proposed method      0.8      0.84        0.82      2
Zhao et al.          0.74     0.8         0.76      2.5
Shivakumara et al.   0.71     0.82        0.76      2.2
Liu et al.           0.7      0.65        0.67      2.1



Fig. 9 illustrates one sample example of moving text detection, where (b)-(d) show that the method successfully detects the moving text in (a).
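The moving-text test can be sketched as follows in Python. The paper uses the CLG flow of Bruhn et al. (2005); the sketch below substitutes OpenCV's readily available Farneback flow purely as a stand-in, and the tolerances, minimum-velocity check, frame limit and the use of fixed per-video candidate boxes are illustrative assumptions rather than the authors' settings.

```python
import cv2
import numpy as np

def moving_text_boxes(frames, candidate_boxes, vel_tol=0.5, dir_tol=15.0, max_frames=10):
    """Keep a candidate box as moving text if its mean velocity and direction
    stay nearly constant over the temporal frames (constant, unidirectional motion)."""
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames[:max_frames]]
    history = {i: [] for i in range(len(candidate_boxes))}

    for prev, cur in zip(grays[:-1], grays[1:]):
        # Dense optical flow between consecutive frames (stand-in for CLG flow).
        flow = cv2.calcOpticalFlowFarneback(prev, cur, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        for i, (x, y, w, h) in enumerate(candidate_boxes):
            u = flow[y:y + h, x:x + w, 0].mean()
            v = flow[y:y + h, x:x + w, 1].mean()
            vel = np.hypot(u, v)                   # vel = sqrt(u^2 + v^2)
            theta = np.degrees(np.arctan2(v, u))   # theta = arctan(v / u)
            history[i].append((vel, theta))

    moving = []
    for i, samples in history.items():
        vels = np.array([s[0] for s in samples])
        dirs = np.array([s[1] for s in samples])   # note: angle wrap-around not handled
        if len(vels) and vels.std() < vel_tol and dirs.std() < dir_tol and vels.mean() > 0.1:
            moving.append(candidate_boxes[i])
    return moving
```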

4. Experimental results

To evaluate the performance of the proposed descriptor, we use the publicly available benchmark ICDAR 2013 video database (Karatzas et al., 2013). This data contains 15 videos captured at different rates and in different situations. In addition, the videos contain text of different types, such as different scripts, fonts, font sizes and orientations. In the same way, we also created our own dataset of 1000 videos at 30 frames per second. This data is collected from different sources, namely the CNN, CNBC and NDTV news channels and the ESPN, FOX, Ten Sports and Star Sports channels. We believe this data is good enough to cover all possible variations of texts and situations. In total, we consider 1015 videos for evaluating the proposed descriptor. We run the experiments on a Windows XP system with an Intel Core i5 and 6 GB RAM.

For evaluating the proposed descriptor, we follow the instructions given in the ICDAR 2013 reading competition (Karatzas et al., 2013). According to the ICDAR 2013 reading competition, the measures, namely recall, precision and F-measure, are referred to as the Wolf metric, from Wolf and Jolion (2006).


We implemented the same measures as in Wolf and Jolion (2006) and follow the same scheme to compute recall, precision and F-measure in this work. In addition to these measures, we also use the average processing time (APT) as a measure to evaluate the time efficiency of text detection. However, the ground truth provided in ICDAR 2013 is at the word level, while the proposed method requires ground truth for text lines. Therefore, we combine the ground truth of words into lines to calculate recall, precision and F-measure for both databases. The reason to use line level for counting is that segmenting words from video is hard and requires a powerful segmentation method for obtaining words from the text lines: segmenting words from horizontal text lines is easy, but segmenting words from arbitrarily oriented, moving text lines is hard. Since it is beyond the scope of this work, it is not considered here.
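For reference, recall and precision are combined into the F-measure in the usual way (stated here for clarity; the Wolf metric governs how the per-box matches that feed recall and precision are counted):

F = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}

For example, the proposed method's entry in Table 1 gives F = 2 × 0.82 × 0.74 / (0.82 + 0.74) ≈ 0.78.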

To evaluate the effectiveness of the proposed descriptor, we compare it with the latest and well known existing methods of Liu et al. (2005), Shivakumara et al. (2010) and Zhao et al. (2011), which work for text detection from a single frame of video. Liu et al. proposed a method for text detection in video using texture features and k-means clustering. These methods are good for caption text but not for combined graphics and scene text. Similarly, Shivakumara et al. proposed an improved method which detects both graphics and scene text in video based on the combination of wavelet and color features.

Fig. 13. Sample results of the proposed and existing methods for scene text when temporal information is used.


Table 4
Performance for scene text with temporal information.

Method            Recall   Precision   F-Score   APT (s) per video sequence
Proposed method   0.86     0.88        0.87      799.2 (13.32 min)
Zhao et al.       0.73     0.75        0.73      972 (16.2 min)
Mi et al.         0.81     0.8         0.80      864 (14.4 min)
Huang             0.78     0.77        0.77      1188 (19.8 min)


Zhao et al. use dense corners and other features to detect text in video as well as in a single frame. We also compare the proposed method with methods which use temporal frames for text detection: Mi et al. (2005), which uses multiple frames for text detection, and Huang et al. (2008), which uses motion vectors for detecting text in video. Since Liu et al. (2005), Shivakumara et al. (2010) and Zhao et al. (2011) detect text from a single frame and do not use temporal frames, we compare their detection results on single frames. However, since Zhao et al. (2011) detect text from both single and multiple frames, we use this method for comparison with both the static and the moving text detection results of the proposed descriptor.

4.1. Experiment on ICDAR 2013 Video

Sample qualitative results of the proposed and existing methods for text detection from single frames are shown in Fig. 10, where we can notice that the proposed method detects text lines well for multi-oriented text with different backgrounds. The methods proposed by Zhao et al. (2011), Liu et al. (2005) and Shivakumara et al. (2010) detect text lines with missing text and more false positives because these methods are sensitive to orientation. These existing methods focus mainly on horizontal text detection rather than multi-oriented text detection. Therefore, they give poor results for the input images in Fig. 10. On the other hand, the proposed descriptor is developed for both horizontal and non-horizontal text detection in video. Quantitative results of the proposed and existing methods are reported in Table 1, where it is found that the proposed method outperforms the existing methods in terms of recall, precision, F-measure and average processing time. We can observe from Table 1 that the recall of the proposed method is close to that of the existing methods, while the precision shows a significant difference compared to the existing methods. This shows that the proposed descriptor detects text well without many false positives.

Similarly, the proposed and existing methods are tested on temporal frames for text detection. The qualitative and quantitative results of the proposed and existing methods are shown in Fig. 11 and Table 2, respectively. Fig. 11 and Table 2 show that the proposed method utilizes temporal frames well for text detection, as it detects text lines better than the existing methods. When we look at Tables 1 and 2, recall improves slightly compared to the experiments without temporal frames, because the temporal information helps us to locate text properly in complex backgrounds. This shows that the proposed optical flow based method for moving text detection helps in improving the overall accuracy of the methods. Moreover, according to the F-measure, the proposed method gives consistent results both with and without temporal frames.

4.2. Experiment on our own dataset

Sample qualitative and quantitative results of the proposed and existing methods using single frames are shown in Fig. 12 and Table 3, respectively. It can be observed from Fig. 12 and Table 3 that the proposed method performs better for text with different orientations against background images, while the existing methods either miss text information or give more false positives. Compared to the ICDAR 2013 video dataset, our dataset is huge and has plenty of variation. For this dataset, the proposed descriptor gives better results than the existing methods in terms of recall, precision and F-measure, as well as average processing time. The main reason is that the proposed descriptor does not involve expensive operations such as the connected component analysis which is part of the existing methods to improve their accuracy. In the same way, we also conduct experiments on temporal frames to detect moving text in video, as shown in Fig. 13, and the results are reported in Table 4.

It is noticed that the results with temporal frames improve compared to the results with single frames. This is because the dataset is large and the proposed method utilizes the optical flow properties to improve text detection performance. In summary, the proposed new descriptor is good enough to handle both temporal frames and single frames, as it achieves better accuracy than the existing methods.

5. Conclusion

In this work, we have presented a new descriptor called Histogram of Oriented Moments (HOM) for both static and moving text detection in video. We explore second order geometric moments for deriving the HOM descriptor to exploit the strengths of moments, namely spatial information and pixel values. This yields a dominant orientation for each sliding window over an input frame. We introduced a new hypothesis based on the moment orientations to identify text candidates. False text candidates are removed by using the dense corners and edge density of the text candidates. Optical flow, with velocity and direction, is explored for moving text detection. Experimental results on the benchmark ICDAR 2013 dataset and our own data show that the proposed method outperforms the existing methods on all measures, i.e. recall, precision, F-score and average processing time. Besides, the experimental results reveal that the proposed method is independent of orientation, script, data, font and font size.

In summary, the following contributions can be claimed by the proposed method: it introduces a new descriptor called HOM which uses both spatial and intensity values for multi-oriented moving text detection in video. A new hypothesis based on the dominant orientation given by HOM is derived for classifying text and non-text candidates; this works to some extent under distortion and blur and improves the overall performance of the proposed method. The work also explores optical flow properties to extract features, such as text moving with constant velocity in a single direction, for detecting moving text in video. We have also proposed a convergence criterion for stopping the search process using the optical flow properties of text components; this helps in determining the number of temporal frames to be used for text detection. Furthermore, the advantage of the proposed method is that it has the ability to detect both static and dynamic text without losing accuracy, irrespective of text type, script and orientation, because it does not involve any classifier or training samples for detection.

However, there are some limitations, as follows. The performance of the proposed method may degrade when a video contains text with arbitrary movements, because the scope of the proposed method is limited to unidirectional text detection. In addition, when the background changes arbitrarily, it may not give the same results. Experimental results on the ICDAR 2013 videos show that the proposed method does not achieve results as good as in document analysis, where methods achieve more than 90% accuracy. Since our aim is to develop a new method to overcome the drawbacks of the existing methods, the proposed method is still considered too computationally expensive to use for real-time applications, where the method should be fast.



In order to find solutions to the above limitations, we propose the following to expand the scope of the current work in the future. We plan to extend the proposed method for arbitrary text detection in video involving arbitrary movements by adding new and more features to the current descriptor, such as our own fractional moments. In order to achieve accuracy as good as in document analysis, we plan to modify the features and the optical flow estimation so that the method works for arbitrary text movements, because the conventional optical flow is sensitive to direction. We have plans to develop a working model, like an expert system, for the real-time applications mentioned in the introduction by implementing the same idea efficiently on a user friendly platform. We will develop a method for tracking text in video in multi-lingual and multi-oriented environments, as the current scope of the work is text detection but not text tracking of online videos with the help of text detection. Further, we also plan to estimate the degree of sharpness or quality of the video based on text information to improve the performance of the text detection and tracking methods.

Acknowledgments

We acknowledge the University of Malaya for funding this work. The research has been carried out under HIR Grant (UM.C/625/1/HIR/MOHE/ENG/42).

References

Anthimopoulos, M., & Gatos, B. (2013). Detection of artificial and scene text in images and video frames. Pattern Analysis and Applications, 431–446.

Bruhn, A., Weickert, J., & Schnörr, C. (2005). Lucas/Kanade meets Horn/Schunck: Combining local and global optic flow methods. International Journal of Computer Vision, 211–231.

Chen, D., & Odobez, J. M. (2005). Video text recognition using sequential Monte Carlo and error voting methods. Pattern Recognition Letters, 1386–1403.

Fernandez-Caballero, A., Lopez, M.-T., & Castillo, J.-C. (2012). Display text segmentation after learning best-fitted OCR binarization parameters. Expert Systems with Applications, 4032–4043.

Gomez, L., & Karatzas, D. (2014). MSER-based real-time text detection and tracking. In Proceedings of ICPR (pp. 3110–3115).

Grafmuller, M., & Beyerer, J. (2013). Performance improvement of character recognition in industrial applications using prior knowledge for more reliable segmentation. Expert Systems with Applications, 6955–6963.

Huang, W., Shivakumara, P., & Tan, C.-L. (2008). Detecting moving text in video using temporal information. In Proceedings of the ICPR.

Huang, X. (2011). A novel approach to detecting scene text in video. In Proceedings of the CISP (pp. 469–473).

Jung, K., Kim, K.-I., & Jain, A.-K. (2004). Text information extraction in images and video: A survey. Pattern Recognition, 977–997.

Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., Gomez, L., Robles, S., et al. (2013). ICDAR 2013 robust reading competition. In Proceedings of the ICDAR (pp. 1115–1124).

Khare, V., Shivakumara, P., & Raveendran, P. (2014). Multi-oriented moving text detection. In Proceedings of Intelligent Signal Processing and Communication Systems (ISPACS) (pp. 347–352).

Li, H., Doermann, D., & Kia, O. (2000). Automatic text detection and tracking in digital video. IEEE Transactions on Image Processing, 147–156.

Li, L., Li, J., Song, Y., & Wang, L. (2010). A multiple frame integration and mathematical morphology based technique for video text extraction. In Proceedings of the ICCIA (pp. 434–437).

Lienhart, R., & Wernicke, A. (2002). Localizing and segmenting text in images and videos. IEEE Transactions on Circuits and Systems for Video Technology, 256–268.

Liu, C., Wang, C., & Dai, R. (2005). Text detection in images based on unsupervised classification of edge-based features. In Proceedings of the ICDAR (pp. 610–614).

Liu, X., & Wang, W. (2012). Robustly extracting captions in videos based on stroke-line edges and spatio-temporal analysis. IEEE Transactions on Multimedia, 482–489.

Liu, Y., Song, Y., Zhang, Y., & Meng, Q. (2013). A novel multi-oriented Chinese text extraction approach from videos. In Proceedings of ICDAR (pp. 1355–1359).

Mi, C., Xu, Y., Lu, H., & Xue, X. (2005). A novel video text extraction approach based on multiple frames. In Proceedings of the ICICSP (pp. 678–682).

Minemura, K., Shivakumara, P., & Wong, K. S. (2014). Multi-oriented text detection for intra-frame in H.264/AVC video. In Proceedings of ISPACS (pp. 330–335).

Minetto, R., Thome, N., Cord, M., Leite, N.-J., & Stolfi, J. (2013). T-HOG: An effective gradient-based descriptor for single line text regions. Pattern Recognition, 1078–1090.

Mosleh, A., Bouguila, N., & Hamza, A.-B. (2013). Automatic inpainting scheme for video text detection and removal. IEEE Transactions on Image Processing, 4460–4472.

Pan, Y.-F., Hou, X., & Liu, C.-L. (2008). A robust system to detect and localize texts in natural scene images. In Proceedings of the DAS (pp. 35–42).

Pan, Y.-F., Hou, X., & Liu, C.-L. (2011). A hybrid approach to detect and localize texts in natural scene images. IEEE Transactions on Image Processing, 800–813.

Park, J.-G., & Kim, K.-J. (2013). Design of a visual perception model with edge-adaptive Gabor filter and support vector machine for traffic sign detection. Expert Systems with Applications, 3679–3687.

Risnumawan, A., Shivakumara, P., Chan, C.-S., & Tan, C.-L. (2014). A robust arbitrary text detection system for natural scene images. Expert Systems with Applications, 8027–8048.

Sharma, N., Shivakumara, P., Pal, U., Blumenstein, M., & Tan, C.-L. (2012). A new method for arbitrarily-oriented text detection in video. In Proceedings of the DAS (pp. 74–78).

Shivakumara, P., Lubani, M., Wong, K.-S., & Lu, T. (2014). Optical flow based dynamic curved video text detection. In Proceedings of the ICIP.

Shivakumara, P., Phan, T. Q., & Tan, C. L. (2010). New wavelet and color features for text detection in video. In Proceedings of the ICPR (pp. 3996–3999).

Shivakumara, P., Phan, T. Q., Shijian, L., & Tan, C. L. (2013). Gradient vector flow and grouping based method for arbitrarily-oriented scene text detection in video images. IEEE Transactions on Circuits and Systems for Video Technology, 1729–1739.

Shivakumara, P., Phan, T.-Q., & Tan, C.-L. (2009). Video text detection based on filters and edge features. In Proceedings of the International Conference on Multimedia and Expo (pp. 1–4).

Tsai, T.-H., Chen, Y.-C., & Fang, C.-L. (2009). 2DVTE: A two-directional videotext extractor for rapid and elaborate design. Pattern Recognition, 1496–1510.

Wang, Y.-K., & Chen, J.-M. (2006). Detecting video texts using spatial-temporal wavelet transform. In Proceedings of the ICPR (pp. 754–757).

Wei, Y. C., & Lin, C. H. (2012). A robust video text detection approach using SVM. Expert Systems with Applications, 10832–10840.

Wolf, C., & Jolion, J.-M. (2006). Object count/area graphs for the evaluation of object detection and segmentation algorithms. International Journal of Document Analysis and Recognition, 280–296.

Wu, L., Shivakumara, P., Lu, T., & Tan, C.-L. (2014). Text detection using Delaunay triangulation in video sequence. In Proceedings of the DAS (pp. 41–45).

Yao, B.-Z., Nie, B.-X., Liu, Z., & Zhu, S.-C. (2014). Animated pose templates for modeling and detecting human actions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–17.

Zhang, J., & Kasturi, R. (2014). A novel text detection system based on character and link energies. IEEE Transactions on Image Processing, 4187–4198.

Zhao, X., Lin, K.-H., Fu, Y., Hu, Y., Liu, Y., & Huang, T.-S. (2011). Text from corners: A novel approach to detect text and caption in videos. IEEE Transactions on Image Processing, 790–799.