Devanagari and Bangla Text Extraction from Natural Scene Images

U. Bhattacharya, S. K. Parui and S. Mondal
Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata – 108, India
{ujjwal, swapan, srikanta_t}@isical.ac.in

Abstract

With the increasing popularity of digital cameras attached to various handheld devices, many new computational challenges have gained significance. One such problem is the extraction of text from natural scene images captured by such devices. The extracted text can be sent to an OCR or a text-to-speech engine for recognition. In this article, we propose a novel and effective scheme, based on the analysis of connected components, for extraction of Devanagari and Bangla texts from camera captured scene images. A common unique feature of these two scripts is the presence of the headline, and the proposed scheme uses mathematical morphology operations for its extraction. Additionally, we consider a few criteria for robust filtering of text components from such scene images. Moreover, we studied the problem of binarization of such scene images and observed that there are situations when repeated binarization by a well-known global thresholding approach is effective. We tested our algorithm on a repository of 100 scene images containing texts of Devanagari and/or Bangla.

1. Introduction

Digital cameras have now become very popular and are often attached to various handheld devices like mobile phones, PDAs etc. Manufacturers of these devices are nowadays looking to embed various useful technologies into such devices. Prospective technologies include recognition of text in scene images, text-to-speech conversion etc. Extraction and recognition of text in images of natural scenes are useful to the blind and to foreigners facing a language barrier. Furthermore, the ability to automatically detect text in scene images has potential applications in image retrieval, robotics and intelligent transport systems. However, developing a robust scheme for extraction and recognition of text from camera captured scenes is a great challenge due to several factors, which include variations of style, color, spacing, distribution and alignment of text, background complexity, influence of luminance, and so on.

A survey of existing methods for detection, localization and extraction of text embedded in images of natural scenes can be found in [1]. The two broad categories of available methods are connected component (CC) based and texture based algorithms. The first category of methods segments an image into a set of CCs, and then classifies each CC as either text or non-text. CC-based algorithms are relatively simple, but they often fail to be robust. On the other hand, texture-based methods assume that text in images has textural properties different from the background and other non-text regions. Although the algorithms of the latter category are more robust, they usually have higher computational complexities. Additionally, a few authors studied various combinations of the above two categories of methods.

Among early works, Zhong et al. [2] located text in images of compact discs, book covers, and traffic scenes in two steps. In the first step, approximate locations of text lines are obtained, and then text components in those lines are extracted using color segmentation. Wu et al. [3] proposed a texture segmentation method to generate candidate text regions. A set of feature components is computed for each pixel and these are clustered using the K-means algorithm.
Jung et al. [4] employed a multi-layer perceptron classifier to discriminate between text and non-text pixels. A sliding window scans the whole image and serves as the input to the neural network. A probability map is constructed, where high probability areas are regarded as candidate text regions.

In [5], Li et al. computed features from the wavelet decomposition of the grayscale image and used a neural network classifier for labeling small windows as text or non-text. Gllavata et al. [6] considered wavelet transform based texture analysis for text detection. They used the K-means algorithm to cluster text and non-text regions. Saoi et al. [7] used a similar but improved method for detection of text in natural scene images. In this attempt, the wavelet transform is applied to each of the R, G and B channels of the input color image separately.

Ezaki, Bulacu and Schomaker [8] studied morphological operations for detection of connected text components in images. They used a disk filter, obtaining the difference between the closing image and the opening image. The filtered images are binarized to extract connected components.

In a recent work, Liu et al. [9] used a Gaussian mixture distribution to model the occurrence of three neighbouring characters and proposed a scheme under the Bayes framework for discriminating text and non-text components. Pan et al. [10] used a sparse representation based method for the same purpose. Ye et al. [11] proposed a coarse-to-fine strategy using multiscale wavelet features to locate text lines in color images. The text segmentation method described in [12] uses a combination of a CC-based stage and a region filtering stage based on a texture measure.

Devanagari and Bangla are the two most popular Indian scripts, used by more than 500 and 200 million people respectively in the Indian subcontinent. A unique and common characteristic of these two scripts is the existence of certain headlines, as shown in Fig. 1. The focus of the present work is to exploit this fact for extraction of Devanagari and Bangla texts from images of natural scenes. The only assumption we make is that the characters are sufficiently large and/or thick so that a linear structuring element of a certain fixed length can capture their headlines. To the best of our knowledge, no existing work deals with this problem.

Figure 1. (a) A piece of text in Devanagari, (b) a piece of text in Bangla.

The present study is based on a set of 100 outdoor images of signboards, banners, hoardings and nameplates collected using two different cameras. Connected components (both black and white) are extracted from the binary image. Then, we use the morphological opening operation along with a set of criteria to extract headlines of Devanagari or Bangla texts. Next, we use several geometrical properties of the characters of these two scripts to locate the whole text parts in relation to the detected headlines.

The rest of this article is organized as follows. Section 2 describes the preprocessing operations. The proposed method is described in Section 3. Experimental results are provided in Section 4. Section 5 concludes the paper.

2. Preprocessing

The size of an input image varies depending upon the resolution of the digital camera. Usually, this resolution is 1 MP or more. Initially, we downsample the input image by an integral factor so that its size is reduced to the nearest of 0.25 MP. Next, it is converted to an 8-bit grayscale image using the formula Gray = 0.299*R + 0.587*G + 0.114*B. In fact, there is no absolute reference for the weight values of R, G and B. However, the above set of weights was standardized by the NTSC (National Television System Committee) and its usage is common in computer imaging.

A global binarization method like the well-known Otsu's technique is usually not suitable for camera captured images since the gray-value histogram of such an image is not bi-modal. Binarization of such an image using a single threshold value often leads to loss of textual information against the background. The texts in the images of Figs. 2(a) and 2(b) are lost during binarization by Otsu's method.

Figure 2. (a) and (b) Two scene images, (c) and (d) after binarization of (a) and (b) by Otsu's method.
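As a rough sketch of this preprocessing pipeline (downsampling, grayscale conversion and a first Otsu binarization), assuming OpenCV and NumPy are available; the function name and the way the integral downsampling factor is computed are our own choices, not taken from the paper.

```python
import cv2
import numpy as np

def preprocess(path, target_mp=0.25e6):
    """Downsample to roughly 0.25 MP, convert to 8-bit gray, binarize with Otsu."""
    bgr = cv2.imread(path)                       # color image, BGR channel order
    h, w = bgr.shape[:2]
    # Integral downsampling factor bringing the image closest to 0.25 MP.
    factor = max(1, int(round(np.sqrt(h * w / target_mp))))
    bgr = bgr[::factor, ::factor]
    # 8-bit grayscale with the NTSC weights 0.299 R + 0.587 G + 0.114 B
    # (these are the weights cv2.cvtColor uses for BGR -> GRAY).
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    # First-pass global Otsu threshold; often insufficient on its own for
    # scene images, hence the repeated binarization discussed below.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return gray, binary
```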

On the other hand, local binarization methods are generally window-based, and the choice of window size in such methods severely affects the result, producing broken characters if the characters are thicker than the window size. We implemented an adaptive thresholding technique which uses the simple average gray value in a 27×27 window around a pixel as the threshold for that pixel. In Fig. 3, we show the binarization results of the images of Fig. 2 by this adaptive method. However, the example in Fig. 3(b) has text components connected with the background, and similar situations occurred frequently with the scene images used during our experiments. Moreover, the later stages of the proposed method cannot recover from this error.

Figure 3. (a) & (b) After binarization of the images in Figs. 2(a) & 2(b) by the adaptive method.

On the other hand, we observed that applying Otsu's method a second time, separately on both the sets of foreground and background pixels of the binarized image, often recovers lost text efficiently. This second application of Otsu's method converts several pixels from foreground to background and also vice versa. The final results of applying Otsu's method twice on the input images of Fig. 2 are shown in Fig. 4.

Figure 4. Results of binarization by applying Otsu's method two times; (a) the binarized image of the sample in Fig. 2(a), (b) the binarized image of the sample in Fig. 2(b).
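The repeated binarization can be sketched as follows, under our reading of the description above: Otsu's threshold is recomputed separately over the foreground and the background pixel populations of the first binarization, and each population is re-thresholded, which may flip pixels in either direction. The function names and the use of OpenCV are our choices, not the authors'.

```python
import cv2
import numpy as np

def otsu_threshold(values):
    """Otsu threshold of a 1-D array of 8-bit gray values."""
    values = values.reshape(-1, 1).astype(np.uint8)
    t, _ = cv2.threshold(values, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return t

def double_otsu(gray):
    """First-pass Otsu, then Otsu again on foreground and background pixels separately."""
    _, binary1 = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    fg = gray[binary1 == 0]        # darker population after the first pass
    bg = gray[binary1 == 255]      # brighter population
    binary2 = binary1.copy()
    if fg.size:                    # re-threshold the dark population
        binary2[(binary1 == 0) & (gray > otsu_threshold(fg))] = 255
    if bg.size:                    # re-threshold the bright population
        binary2[(binary1 == 255) & (gray <= otsu_threshold(bg))] = 0
    return binary1, binary2
```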
3. Proposed approach for text extraction

Extraction of Devanagari and/or Bangla texts from binarized images is primarily based on the unique property of these two scripts that they have headlines, as in Fig. 1. The basic steps of our approach, summarized below, are executed separately on the resulting images of the first and the second binarization.

3.1. Algorithm

Step 1: Obtain connected components (C) from the binary image (B) corresponding to the gray image (A). These include both white and black components.
Step 2: Compute all horizontal or nearly horizontal line segments by applying the morphological opening operation (Section 3.2) on each C. See Fig. 5(a).
Step 3: Obtain connected sets of the above line segments. If multiple connected sets are obtained from the same C, then we consider only the largest one and call it the candidate headline component HC.
Step 4: Let E denote a component C that produces a candidate headline component HC. Replace E by subtracting HC from it. Thus, E may now get disconnected, consisting of several connected components.
Step 5: For each E, compute H1 and H2, which are respectively the heights of the parts of E that lie above and below HC.
Step 6: Obtain the height (h) of each connected component F of E that lies below HC. Compute p = the standard deviation of h divided by the mean of h, for each E.
Step 7: If both H1/H2 and p are less than two suitably selected threshold values, call the corresponding HC the true headline component, HT. Here, it should be noted that the characters of Devanagari and Bangla always have a part below the headline, and a possible part above the headline is always smaller than the part below it.
Step 8: Select all the components C corresponding to each true headline component HT.
Step 9: Revisit all the connected components which have not been selected above. For each such component we examine whether any other component in its immediate neighborhood has already been selected. If so, we compare the gray values of the two concerned components in image A, and if these values are very close, then we include the former component in the set of already selected components.

As an example, we consider the binarized image of Fig. 4(a). All the line segments produced by the morphological operations on each component are shown in Fig. 5(a). Points on horizontal line segments obtained from white components are represented in gray, while those from black components are represented in black. Candidate headlines obtained at the end of Step 3 are shown in Fig. 5(b). The result of subtracting candidate headline components from their respective parent components is shown in Fig. 5(c). True headline components obtained at the end of Step 7 are shown in Fig. 5(d). Text components selected by Step 8 are shown in Fig. 5(e). Finally, a few other possible text components are selected by the last step, and the final set of selected components is shown in Fig. 5(f).

In this particular example, all the text components have been selected. However, only one non-text component (at the bottom of the image) has also been selected.

Figure 5. Results of different stages of the algorithm based on the image of Fig. 2(a); (a) all line segments obtained by the morphological operation, (b) set of candidate headlines, (c) all the components minus the respective candidate headlines, (d) true headlines, (e) components selected corresponding to true headlines, (f) final set of selected components.
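The following is a condensed, interpretive sketch of Steps 2-7 for a single connected component, assuming OpenCV and NumPy; the threshold values T_RATIO and T_CV are illustrative placeholders, since the paper does not state the values it uses, and the height computations are our approximation of the description above.

```python
import cv2
import numpy as np

T_RATIO = 0.5   # illustrative threshold on H1/H2 (not given in the paper)
T_CV    = 0.8   # illustrative threshold on p = std(h)/mean(h) (not given in the paper)

def true_headline(comp, se_len=21):
    """Return (is_text, headline_mask) for one connected-component mask (uint8, 0/1)."""
    # Step 2: horizontal opening keeps only (nearly) horizontal line segments.
    se = cv2.getStructuringElement(cv2.MORPH_RECT, (se_len, 1))
    segments = cv2.morphologyEx(comp, cv2.MORPH_OPEN, se)
    # Step 3: keep the largest connected set of segments -> candidate headline HC.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(segments, connectivity=8)
    if n <= 1:
        return False, None
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
    hc = (labels == largest).astype(np.uint8)
    # Step 4: subtract HC from the component.
    residue = cv2.subtract(comp, hc)
    if not residue.any():
        return False, None
    # Step 5: heights of the residue above and below the headline rows.
    ys_hc = np.where(hc.any(axis=1))[0]
    top, bottom = ys_hc.min(), ys_hc.max()
    ys_res = np.where(residue.any(axis=1))[0]
    h1 = max(0, top - ys_res.min())          # part above the headline
    h2 = max(0, ys_res.max() - bottom)       # part below the headline
    if h2 == 0:
        return False, None
    # Step 6: coefficient of variation of the heights of the parts below HC.
    below = residue.copy()
    below[:bottom + 1, :] = 0
    _, _, stats_b, _ = cv2.connectedComponentsWithStats(below, connectivity=8)
    heights = stats_b[1:, cv2.CC_STAT_HEIGHT]
    p = heights.std() / heights.mean() if heights.size else np.inf
    # Step 7: HC is a true headline only if both criteria are small enough.
    return (h1 / h2 < T_RATIO) and (p < T_CV), hc
```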

Figure 6. (a) An object (A), (b) a structuring element (B), (c) the eroded object (C = A-B), (d) the object after opening (D = (A-B)+B).

3.2. Morphological operation

We apply mathematical morphology tools, namely erosion followed by dilation, on each connected component to extract possible horizontal line segments. For illustration, consider the object A and the structuring element B shown in Figs. 6(a) and 6(b) respectively.

The erosion of object A by the structuring element B, denoted by A-B, is defined as the set of all pixels P in A such that if B is placed on A with its center at P, B is entirely contained in A. For the object A and structuring element B above, the eroded object A-B is shown in Fig. 6(c).

The dilation operation is in some sense the dual of erosion. For each pixel P in the object A, consider the placement B(P) of the structuring element B with its center at P. Then the dilation of object A by the structuring element B, denoted by A+B, is defined as the union of such placements B(P) for all P in A. The opening of A by the element B is (A-B)+B, and it is shown in Fig. 6(d).

It is evident that opening an object A with a linear structuring element B can effectively identify the horizontal line segments present in a connected component. However, a suitable choice of the length of this structuring element is crucial for the later stages of processing, and we empirically selected its length as 21 for the present problem.
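To make the opening operation concrete, here is a small toy illustration (assuming OpenCV) with a horizontal linear structuring element of length 21, as used in this work: erosion followed by dilation removes horizontal runs shorter than the element and restores the longer ones unchanged.

```python
import cv2
import numpy as np

# Toy binary image: a 30-pixel horizontal run and a 10-pixel run.
img = np.zeros((9, 60), np.uint8)
img[3, 5:35] = 1    # long run: survives the opening
img[6, 40:50] = 1   # short run: removed by the opening

se = cv2.getStructuringElement(cv2.MORPH_RECT, (21, 1))  # horizontal line, length 21
eroded = cv2.erode(img, se)       # A - B in the paper's notation
opened = cv2.dilate(eroded, se)   # (A - B) + B, i.e. the opening of A by B

print(opened[3, 5:35].sum(), opened[6, 40:50].sum())  # prints: 30 0
```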


Figure 7. A few images on which our algorithm performed perfectly, with the respective outputs.

4. Experimental results

We obtained simulation results based on 100 test images acquired by (i) a Kodak DX7590 (5.0 MP) still camera and (ii) a SONY DCR-SR85E handycam used in still mode (1.0 MP). The resolutions of images captured by these two cameras are respectively 2576×1932 and 644×483 pixels. After downsampling, their sizes are reduced to 644×483 and 576×432 pixels respectively. The images are of highways, institutions, railway stations, festival grounds etc. They are focused on names of buildings, shops, railway stations and financial institutions, or on advertisement hoardings. They contain Devanagari and Bangla texts of various font styles, sizes, and directions. A few of the images on which the algorithm perfectly extracts all the Bangla and Devanagari text components are shown in Fig. 7. There are 58 such images, all of whose relevant text components could be extracted. On the other hand, two of the sample images on which the performance of our algorithm is extremely poor are shown in Fig. 8. Similarly poor performance occurred with 6 of our sample images. On the remaining 36 images, the algorithm either partially extracted the relevant text components or extracted text along with a few non-text components. In summary, the precision and recall values of our algorithm obtained on the present set of 100 images are respectively 68.8% and 71.2%.

Figure 8. Two sample images on which the performance of our algorithm is very poor.

5. Conclusions

The proposed algorithm works well even on slanted or curved text components of Devanagari and Bangla. One such situation is shown in Fig. 9. However, the proposed algorithm will fail whenever the size of such curved or slanted text is not sufficiently large.

Figure 9. Two images consisting of curved or slanted texts. Extracted components are shown to the right of each source image.

In future, we shall study the use of machine learning tools to improve the performance of the proposed algorithm.

References

[1] J. Liang, D. Doermann, H. Li, "Camera based analysis of text and documents: a survey", Int. Journal on Document Analysis and Recognition (IJDAR), vol. 7, pp. 84-104, 2005.
[2] Y. Zhong, K. Karu, A. K. Jain, "Locating text in complex color images", 3rd International Conference on Document Analysis and Recognition, vol. 1, 1995, pp. 146-149.
[3] V. Wu, R. Manmatha, E. M. Riseman, "TextFinder: an automatic system to detect and recognize text in images", IEEE Transactions on PAMI, vol. 21, pp. 1224-1228, 1999.
[4] K. Jung, K. I. Kim, T. Kurata, M. Kourogi, J. H. Han, "Text Scanner with Text Detection Technology on Image Sequences", Proceedings of 16th International Conference on Pattern Recognition (ICPR), vol. 3, 2002, pp. 473-476.
[5] H. Li, D. Doermann, O. Kia, "Automatic text detection and tracking in digital video", IEEE Trans. Image Processing, vol. 9, no. 1, pp. 147-167, 2000.
[6] J. Gllavata, R. Ewerth, B. Freisleben, "Text Detection in Images Based on Unsupervised Classification of High Frequency Wavelet Coefficients", Proc. of 17th Int. Conf. on Pattern Recognition (ICPR), vol. 1, 2004, pp. 425-428.
[7] T. Saoi, H. Goto, H. Kobayashi, "Text Detection in Color Scene Images Based on Unsupervised Clustering of Multichannel Wavelet Features", Proc. of 8th Int. Conf. on Document Analysis and Recognition (ICDAR), pp. 690-694, 2005.
[8] N. Ezaki, M. Bulacu, L. Schomaker, "Text detection from natural scene images: towards a system for visually impaired persons", Proc. of 17th Int. Conf. on Pattern Recognition, vol. II, pp. 683-686, 2004.
[9] X. Liu, H. Fu, Y. Jia, "Gaussian mixture modeling and learning of neighboring characters for multilingual text extraction in images", Pattern Recognition, vol. 41, pp. 484-493, 2008.
[10] W. Pan, T. D. Bui, C. Y. Suen, "Text Detection from Scene Images Using Sparse Representation", Proc. of the 19th International Conference on Pattern Recognition, 2008.
[11] Q. Ye, Q. Huang, W. Gao, D. Zhao, "Fast and robust text detection in images and video frames", Image and Vision Computing, vol. 23, pp. 565-576, 2005.
[12] C. Merino, M. Mirmehdi, "A framework towards real-time detection and tracking of text", 2nd Int. Workshop on Camera-Based Document Analysis and Recognition, pp. 10-17, 2007.