
Information Retrieval in Document Image Databases

Yue Lu and Chew Lim Tan, Senior Member, IEEE

Abstract—With the rising popularity and importance of document images as an information source, information retrieval in document image databases has become a growing and challenging problem. In this paper, we propose an approach with the capability of matching partial word images to address two issues in document image retrieval: word spotting and similarity measurement between documents. First, each word image is represented by a primitive string. Then, an inexact string matching technique is utilized to measure the similarity between the two primitive strings generated from two word images. Based on the similarity, we can estimate how a word image is relevant to the other and, thereby, decide whether one is a portion of the other. To deal with various character fonts, we use a primitive string which is tolerant to serif and font differences to represent a word image. Using this technique of inexact string matching, our method is able to successfully handle the problem of heavily touching characters. Experimental results on a variety of document image databases confirm the feasibility, validity, and efficiency of our proposed approach in document image retrieval.

Index Terms—Document image retrieval, partial word image matching, primitive string, word searching, document similarity measurement.

1 INTRODUCTION

Modern technology has made it possible to produce, process, store, and transmit document images efficiently. In an attempt to move toward the paperless office, large quantities of printed documents are digitized and stored as images in databases. The popularity and importance of document images as an information source are evident. As a matter of fact, many organizations currently use and are dependent on document image databases. However, such databases are often not equipped with adequate index information. This makes it much harder to retrieve user-relevant information from image data than from text data. Thus, the study of information retrieval in document image databases is an important subject in knowledge and data engineering.

1.1 Research Background

Millions of digital documents are constantly transmitted from one point to another over the Internet. The most common format of these digital documents is text, in which characters are represented by machine-readable codes. The majority of computer generated documents are in this format. On the other hand, to make billions of volumes of traditional and legacy documents available and accessible on the Internet, they are scanned and converted to digital images using digitization equipment.

Although the technology of Document Image Processing (DIP) may be utilized to automatically convert the digital images of these documents to the machine-readable text format using Optical Character Recognition (OCR) technology, it is typically not a cost-effective and practical way to process a huge number of paper documents. One reason is that the technique of layout analysis is still immature in handling documents with complicated layouts. Another reason is that OCR technology still suffers inherent weaknesses in its recognition ability, especially with document images of poor quality. Interactive manual correction/proofreading of OCR results is usually unavoidable in most DIP systems. As a result, storing traditional and legacy documents in image format has become the alternative in many cases. Nowadays, we can find on the Internet many digital documents in image format, including scanned journal/conference papers, student theses, handbooks, out-of-print books, etc. Moreover, many digital libraries and Web portals such as MEDLINE, ACM, IEEE, etc., keep scanned document images without their text equivalents.

Many tools have been developed for retrieving information from text format documents, but they are not applicable to image format documents. There is therefore a need to study strategies for retrieving information from document images. For example, a user facing a large number of imaged documents on the Internet has to download each one to see its contents before knowing whether the document is relevant to his/her interest, e.g., by looking for keywords. Obviously, it is of practical value to study a method that is capable of notifying the user, prior to downloading, whether a document image contains information of interest to him/her.

In recent years, there has been much interest in the research area of Document Image Retrieval (DIR) [1], [2]. DIR is related to document image processing (DIP), but there are some essential differences between them. A DIP system needs to analyze different text areas in a page document, understand the relationship among these text areas, and then convert them to a machine-readable version using OCR, in which each character object is assigned to a certain class. Many commercial OCR systems, which claim to achieve a recognition rate of above 99 percent for single characters, have been introduced in the market over the past few decades.

. Y. Lu is with the Department of Computer Science and Technology, East China Normal University, Shanghai 200062, China. E-mail: [email protected].

. C.L. Tan is with the Department of Computer Science, School of Computing, National University of Singapore, Kent Ridge, Singapore 117543. E-mail: [email protected].

Manuscript received 8 Sept. 2003; revised 1 Feb. 2004; accepted 25 Feb. 2004. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-0181-0903.

It is obvious that a document image processing system can be applied for the purpose of document image retrieval. For example, a DIP system is used to automatically convert a document image to machine-readable text codes first, and traditional information retrieval strategies [3], [4] are then applied. However, the performance of a DIP system relies heavily on the quality of the scanned images. It deteriorates severely if the images are of poor quality or complicated layout.

It is generally acknowledged that the recognition accuracy requirements of document image retrieval methods are considerably lower than those of many document processing applications [5]. Motivated by this observation, some retrieval methods with the ability to tolerate OCR recognition errors have recently been researched [6], [7], [8], [9]. Additionally, some methods are reported to improve retrieval performance by using OCR candidates [10], [11].

Such information retrieval methods based on the technology of document image processing are still the best so far, because a great deal of research has been devoted to the area and numerous algorithms have been proposed in the past decades, especially in OCR research. However, we argue that DIR, which bypasses OCR, still has its practical value today.

The main question that a DIR system seeks to answer is whether a document image contains words that are of interest to the user, while paying no attention to unrelated words. In other words, a DIR system provides an answer of "yes" or "no" with respect to the user's query, rather than the exact recognition of a character/word as in DIP. Moreover, words, rather than characters, are the basic units of meaning in information retrieval. Therefore, directly matching word images in a document image is an alternative way to retrieve information from the document.

In short, DIR and DIP address different needs and have different merits of their own. The former, which is tailored for directly retrieving information from document images, can achieve relatively higher performance in terms of recall, precision, and processing speed.

1.2 Previous Related Work

In recent years, a number of attempts have been made by researchers to avoid the use of character recognition for various document image retrieval applications. For example, Chen and Bloomberg [12], [13] described a method for automatically selecting sentences and key phrases to create a summary from an imaged document without any need for recognition of the characters in each word. They built word equivalence classes by using a rank blur hit-miss transform to compare word images and used a statistical classifier to determine the likelihood of each sentence being a summary sentence. Liu and Jain [14] presented an approach to image-based form document retrieval. They proposed a similarity measure for forms that is insensitive to translation, scaling, moderate skew, and image quality fluctuations, and developed a prototype form retrieval system based on their proposed similarity measure. Niyogi and Srihari [15] described an approach to retrieve information from document images stored in a digital library by means of knowledge-based layout analysis and logical structure derivation techniques, in which significant sections of documents, such as the title, are utilized. Tang et al. [16] proposed methods for automatic knowledge acquisition in document images by analyzing the geometric structure and logical structure of the images. In the domain of Chinese document image retrieval, He et al. [17] proposed an index and retrieval method based on the stroke density of Chinese characters.

Spitz described character shape codes for duplicate document detection [18], information retrieval [19], word recognition [20], and document reconstruction [21], without resorting to character recognition. Character shape codes encode whether or not the character in question fits between the baseline and the x-line or, if not, whether it has an ascender or descender, as well as the number and spatial distribution of the connected components. To get character shape codes, character cells must first be segmented. This method is therefore unsuitable for dealing with words with connected characters. Additionally, it is lexicon-dependent, and its performance is somewhat affected by the appropriateness of the lexicon to the document being processed.

Yu and Tan [22], [23] proposed a method for text retrieval from document images without the use of OCR. In their method, documents are segmented into character objects, whose image features are utilized to generate document vectors. Then, the text similarity between documents is measured by calculating the dot product of the document vectors. In the work of both Spitz and Tan et al., character segmentation is a necessary step before downstream processes can be performed. In many cases, especially with document images of poor quality, it is not easy to separate connected characters. Moreover, the word, rather than the character, is normally the basic unit of useful meaning in document image retrieval.

To avoid the difficulties of separating touching adjacent characters in a word image, segmentation-free methods have been researched in some document image retrieval systems, in which image matching at the word level is commonly utilized. A segmentation-free word image matching approach treats each word object as a single, indivisible entity and attempts to recognize it using features of the word as a whole. For example, Ho et al. [24] proposed a word recognition method based on word shape analysis without character segmentation and recognition. Every word is first partitioned into a fixed 4 × 10 grid. Features are extracted and their relative locations in the grid are recorded in a feature vector. The city-block distance is used to compare the feature vector of an input word with that of a word in a given lexicon. This method can only be applied to a fixed-vocabulary word recognition system and has no ability to match partial words.

Some approaches have been reported in past years for searching keywords in handwritten [25], [26], [27] and printed [28], [29], [30], [31], [32] documents. In most of these approaches, the Hidden Markov Model and the Viterbi dynamic programming algorithm are widely used to recognize keywords. However, these approaches have their inherent disadvantages. Although DeCurtins and Chen's approach [28] and Chen et al.'s approach [30], [31] are capable of searching for a word portion, they can only process predefined keywords, and a pretraining procedure is necessary. In our previous word searching system [32], a weighted Hausdorff distance is used to measure the dissimilarity between word images. Although the system can cope with any word the user specifies, it cannot deal with the problem of partial matching. For instance, the approach is unable to find the words "undeveloped" and "development" when the user keys in "develop" as the query word, because the three words are treated as different from the standpoint of word image classification. Obviously, such a classification is not practical from the viewpoint of document understanding by human beings.

In the compressed document image domain, Hull [33] proposed a method to detect equivalent document images by matching the pass mode codes extracted from CCITT compressed document images. He created a feature vector that counts the number of pass mode codes in each cell of a fixed grid in the image, and equivalent images are located by applying the Hausdorff distance to the feature vectors. In our previous paper [34], we also presented a method for measuring the similarity between two CCITT Group 4 compressed document images.

1.3 Current Work

In this paper, we propose an improved document image retrieval approach with the ability to match partial word images. First, word images are represented by feature strings. A feature string matching method based on dynamic programming is then utilized to evaluate the similarity between two feature strings or parts of them. By using the technique of inexact feature string matching, the approach can successfully handle not only kerning, but also words with heavily touching characters.

The underlying idea of the proposed partial word image matching technique is somewhat similar to that of word recognition based on Hidden Markov Models. However, inexact string matching has the advantage of not requiring any training prior to use.

We have tested our technique in two applications of document image retrieval. One application is to search for user-specified words and their variations in document images (word spotting). The user's query word and a word image object extracted from documents are first represented by two respective feature strings. Then, the similarity between the two feature strings is evaluated. Based on the evaluation, we can estimate how relevant the document word image is to the user-specified word. For example, when the user keys in the query word "string," our method detects words such as "strings" and "substring." Our experiments on real document images, including scanned books, student theses, journal/conference papers, etc., show promising performance by our proposed approach to word spotting in document images.

The other application is content-based similarity measurement between two document images. The technique of partial word image matching is applied to assign all word objects from two document images to corresponding classes, except potential stop words, which are excluded. Document vectors are built using the occurrence frequencies of the word object classes, and the pair-wise similarity of two document images is calculated by the scalar product of the document vectors. We have used nine groups of articles pertaining to different domains to test the validity of our approach. The results show that our proposed method based on partial word image matching outperforms the method based on entire word image matching, with an improved F1 rating.

The remainder of this paper is organized as follows: Section 2 presents the feature string description for word images. Section 3 describes inexact matching between two feature strings. Section 4 gives the experimental results that demonstrate the effectiveness of the proposed partial word image matching approach in word spotting. Section 5 presents the results demonstrating its effectiveness in document similarity measurement. Section 6 concludes the paper with a summary of the contributions of our proposed technique and an indication of directions for future work.

2 FEATURE STRING DESCRIPTION FOR WORD IMAGE

It is presumed that word objects are extracted from document images via some image processing such as skew estimation and correction, connected component labeling, word bounding, etc. The feature employed in our approach to represent word bitmap images is the Left-to-Right Primitive String (LRPS), which is a code string sequenced from the leftmost of a word to its rightmost. Line and traversal features are used to extract the primitives of a word image.

A word printed in a document can appear in various sizes, fonts, and spacings. When we extract features from word bitmaps, we have to take this fact into consideration. Generally speaking, it is easy to find a way to cope with different sizes. However, dealing with multiple fonts and with the touching characters caused by condensed spacing is still a challenging issue.

For a word image in printed text, two characters may be spaced apart by a few white columns caused by intercharacter spacing, as shown in Fig. 1a. But it is also common for one character to overlap another by a few columns due to kerning, as shown in Fig. 1b. Worse still, as shown in Fig. 1c, two or more adjacent characters may touch each other due to condensed spacing. We are thus faced with the challenge of separating such touching characters. We shall utilize inexact feature string matching to handle the problem.

2.1 LRPS Feature Representation

A word is explicitly segmented, from the leftmost to the rightmost, into discrete entities. Each entity, called a primitive here, is represented using definite attributes. A primitive p is described using a two-tuple (λ, ω), where λ is the Line-or-Traversal Attribute (LTA) of the primitive and ω is the Ascender-and-Descender Attribute (ADA). As a result, the word image is expressed as a sequence P of p_i's:

$$P = \langle p_1 p_2 \ldots p_n \rangle = \langle (\lambda_1, \omega_1)(\lambda_2, \omega_2) \ldots (\lambda_n, \omega_n) \rangle, \qquad (1)$$


Fig. 1. Different spacing: (a) separated adjacent characters, (b) overlapped adjacent characters, and (c) touching adjacent characters.


where the ADA of a primitive, ω, takes a value in {"x," "a," "A," "D," "Q"}, defined as follows:

. "x": The primitive is between the x-line (an imaginary line at the x-height running parallel with the baseline, as shown in Fig. 2) and the baseline.

. "a": The primitive is between the top boundary and the x-line.

. "A": The primitive is between the top boundary and the baseline.

. "D": The primitive is between the x-line and the bottom boundary.

. "Q": The primitive is between the top boundary and the bottom boundary.

The definitions of the x-line, baseline, top boundary, and bottom boundary may be found in Fig. 2. A word bitmap extracted from a document image already contains the information of the baseline and x-line, which is a by-product of the text line extraction in the previous stage.

2.2 Generating Line-or-Traversal Attribute

LTA generation consists of two steps. We extract the straight stroke line features from the word bitmap first, as shown in Fig. 2a. Note that only the vertical stroke lines and diagonal stroke lines are extracted at this stage. Then, the traversal features of the remainder part are computed. Finally, the features from the above two steps are aggregated to generate the LTAs of the corresponding primitives. In other words, the LTA of a primitive is represented by either a straight-line feature (if applicable) or a traversal feature.

2.2.1 Straight Stroke Line Feature

A run-length-based method is utilized to extract straight stroke lines from word images. We use R(a, θ) to represent a directional run, defined as the set of concatenated black pixels containing pixel a along the specified direction θ. |R(a, θ)| is the run length of R(a, θ), i.e., the number of black pixels in the run.

The straight stroke line detection algorithm is summarized as follows:

1. Along the middle line (between the x-line and baseline), detect the boundary pair [A_l, A_r] of each stroke, where A_l and A_r are the left and right boundary points, respectively.
2. Detect the midpoint A_m of the line segment A_l A_r.
3. Calculate R(A_m, θ) for different θs, from which we select θ_max as the stroke's run direction.
4. If |R(A_m, θ_max)| is near or larger than the x-height, the pixels containing A_m, between the boundary points A_l and A_r along the direction θ_max, are extracted as a stroke line.
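For illustration, steps 3 and 4 rest on a directional run probe. The following minimal Python sketch is ours, not the authors' implementation; the array layout, the three candidate directions, and all names are assumptions:

```python
import numpy as np

# Unit steps (dy, dx) for the three stroke directions of Section 2.2.1:
# vertical, left-down diagonal, and right-down diagonal.
DIRECTIONS = {'vertical': (1, 0), 'left_down': (1, -1), 'right_down': (1, 1)}

def run_length(img, start, step):
    """|R(a, theta)|: number of connected black pixels through `start`
    along direction `step`, in both senses. img: binary array, 1 = black."""
    y0, x0 = start
    if not img[y0, x0]:
        return 0
    h, w = img.shape
    count = 0
    for sign in (1, -1):
        y, x = y0, x0
        while 0 <= y < h and 0 <= x < w and img[y, x]:
            count += 1
            y, x = y + sign * step[0], x + sign * step[1]
    return count - 1  # the start pixel was visited twice

def theta_max(img, midpoint):
    """The direction whose run through the stroke midpoint is longest (step 3)."""
    return max(DIRECTIONS, key=lambda d: run_length(img, midpoint, DIRECTIONS[d]))
```

Step 4 then keeps the run as a stroke line whenever its length is close to or exceeds the x-height.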

In the example of Fig. 2, the stroke lines are extracted as in Fig. 2a, while the remainder is as in Fig. 2b. According to its direction, a line is categorized as one of three basic stroke lines: vertical stroke line, left-down diagonal stroke line, and right-down diagonal stroke line. According to the type of stroke line, three basic primitives are generated from the extracted stroke lines. Meanwhile, their ADAs are assigned categories based on their top-end and bottom-end positions. Their LTAs are respectively expressed as:

. "l": Vertical straight stroke line, such as in the characters "l," "d," "p," "q," "D," "P," etc. For a primitive whose ADA is "x" or "D," we further check whether there is a dot over the vertical stroke line. If the answer is "yes," the LTA of the primitive is reassigned as "i."

. "v": Right-down diagonal straight stroke line, such as in the characters "v," "w," "V," "W," etc.

. "w": Left-down diagonal straight stroke line, such as in the characters "v," "w," "z," etc. For a primitive whose ADA is "x" or "A," we further check whether there are two horizontal stroke lines connected with it at the top and bottom. If so, the LTA of the primitive is reassigned as "z."

Additionally, it is easy to detect primitives containing two or more straight stroke lines. They are:

. "x": One left-down diagonal straight stroke line crosses one right-down diagonal straight stroke line.

. "y": One left-down diagonal straight stroke line meets one right-down diagonal straight stroke line at its middle point.

. "Y": One left-down diagonal stroke line, one right-down diagonal stroke line, and one vertical stroke line cross at one point, like the character "Y."

. "k": One left-down diagonal stroke line, one right-down diagonal stroke line, and one vertical stroke line meet at one point, like the characters "k" and "K."

2.2.2 Traversal Feature

After the primitives based on the stroke line features are extracted as described above, the primitives of the remainder part of the word image are computed based on traversal features.

To extract traversal features, we scan the word image column by column, and the traversal number TN is recorded by counting the number of transitions from black pixel to white pixel, or vice versa, along each column. This process is not carried out on the part represented by the stroke line features described above. According to the value of TN, different feature codes are assigned as follows:


Fig. 2. Primitive string extraction: (a) straight stroke line features, (b) remainder part of (a), (c) traversal TN = 2, (d) traversal TN = 4, and (e) traversal TN = 6.


. "&": There is no image pixel in the column (TN = 0). It corresponds to the blank intercharacter space; we treat intercharacter space as a special primitive. In addition, the overlap of adjacent characters caused by kerning is easily detected by analyzing the relative positions of the adjacent connected components, and we can insert a spacing primitive between them in this case.

If TN = 2, two parameters are utilized to assign the column a feature code. One is the ratio of its black pixel count to the x-height, ρ. The other is its relative position with respect to the x-line and the baseline, γ = D_m / D_b, where D_m is the distance from the topmost stroke pixel in the column to the x-line and D_b is the distance from the bottommost stroke pixel to the baseline.

. "n": ρ < 0.2 and γ < 0.3,

. "u": ρ < 0.2 and γ > 3, and

. "c": ρ > 0.5 and 0.5 < γ < 1.5.

If TN ≥ 4, the feature code is assigned as:

. "o": TN = 4,

. "e": TN = 6, and

. "g": TN = 8.
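For illustration only, a column scan implementing these assignments might look as follows. This is our sketch, not the paper's code; the masking of stroke-line columns and the ρ/γ tests for TN = 2 are elided, and all names are ours:

```python
import numpy as np

def traversal_codes(img):
    """Assign a raw feature code per column of a binary word bitmap
    (1 = black). Columns already explained by stroke line primitives
    are assumed to have been masked out beforehand."""
    codes = []
    for col in img.T:                                     # one column at a time
        tn = int(np.count_nonzero(col[1:] != col[:-1]))   # traversal number TN
        if not col.any():
            codes.append('&')      # blank intercharacter space (TN = 0)
        elif tn == 2:
            codes.append(None)     # placeholder: needs the rho/gamma tests for n/u/c
        elif tn == 4:
            codes.append('o')
        elif tn == 6:
            codes.append('e')
        elif tn == 8:
            codes.append('g')
        else:
            codes.append(None)     # no eligible code; treated as noise
    return codes
```

Merging equal codes in consecutive columns, as described next, then yields one primitive per run of columns.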

Then, the same feature codes in consecutive columns are merged and represented by one primitive. Note that a few columns may have no resultant feature code because they do not meet the requirements of any of the eligible feature codes described above, which is usually the result of noise. Such columns are eliminated straightaway.

As illustrated in Fig. 2, the word image is decomposed into one part with stroke lines (as in Fig. 2a) and other parts with different traversal numbers (Fig. 2c for TN = 2, Fig. 2d for TN = 4, and Fig. 2e for TN = 6). The primitive string is composed by concatenating the above generated primitives from the leftmost to the rightmost. The resultant primitive string of the word image in Fig. 2 is: (n,x)(l,x)(u,x)(o,x)(l,x)(u,x)(o,x)(l,x)(o,x)(n,x)(o,x)(l,x)(u,x)(&,&)(o,A)(l,A)(o,x)(l,x)(n,x)(&,&)(c,x)(e,x)(o,x)(&,&)(o,x)(e,x)(l,x)(u,x)(o,A)(l,A)(&,&)(n,x)(l,A)(o,x)(o,A)(l,A)(o,x)(n,x)(o,x)(l,x)(u,x)(&,&)(y,D).

2.3 Postprocessing

To be able to deal with different fonts, primitives should be independent of typefaces. Among various fonts, the significant difference that impacts the extraction of an LRPS is the presence of serifs, especially in the parts expressed by traversal features. Fig. 3 gives some examples where serifs are present and others where they are not. Avoiding the effect of serifs in the LRPS representation is a basic necessity.

Our observation shows that some fonts, such as Times Roman, contain serifs, as opposed to serifless fonts such as Arial. Such serifs do not add any distinguishing features to the characters. Primitives produced by such serifs can thus be eliminated by analyzing their preceding and succeeding primitives. For instance, a primitive "(u,x)" in a primitive subsequence "(l,x)(u,x)(&,&)" is normally generated by a right-side serif of characters such as "a," "h," "m," "n," "u," etc. Therefore, we can remove the primitive "(u,x)" from the primitive subsequence "(l,x)(u,x)(&,&)." Similarly, a primitive "(o,x)" in a primitive subsequence "(n,x)(o,x)(l,x)" is normally generated by a serif of characters such as "h," "m," "n," etc. Therefore, we can directly eliminate the primitive "(o,x)" from the primitive subsequence "(n,x)(o,x)(l,x)."

More postprocessing rules can be used to eliminate the primitives caused by serifs. With this postprocessing, the primitive string of the word in Fig. 2 becomes: (l,x)(u,x)(l,x)(u,x)(o,x)(l,x)(n,x)(l,x)(&,&)(l,A)(o,x)(l,x)(&,&)(c,x)(e,x)(o,x)(&,&)(o,x)(e,x)(l,x)(l,A)(&,&)(n,x)(l,A)(o,x)(l,A)(n,x)(l,x)(&,&)(y,D).

Based on the feature extraction described above, we can give each character a standard primitive string token (PST). For example, the primitive string token of the character "b" is "(l,A)(o,x)(c,x)," and that of the character "p" is "(l,D)(o,x)(c,x)." Table 1 lists the primitive string tokens of all characters in the alphabet. The primitive string of a word can be generated by synthesizing the primitive string tokens of its characters and inserting the special primitive "(&,&)" between them to indicate a spacing gap.
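As a minimal sketch of this synthesis, with primitives as (LTA, ADA) tuples; only the two PST entries quoted above are filled in, the remainder would come from Table 1:

```python
# Illustrative PST table; only the entries given in the text are shown.
PST = {
    'b': [('l', 'A'), ('o', 'x'), ('c', 'x')],
    'p': [('l', 'D'), ('o', 'x'), ('c', 'x')],
    # ... one entry per character, per Table 1
}

SPACING = ('&', '&')  # special primitive marking the intercharacter gap

def word_primitive_string(word):
    """Concatenate the characters' PSTs, inserting (&,&) between adjacent ones."""
    primitives = []
    for i, ch in enumerate(word):
        if i > 0:
            primitives.append(SPACING)
        primitives.extend(PST[ch])
    return primitives
```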

Generally speaking, the resulting primitive string of a real image is not as perfect as one synthesized from the standard PSTs of the corresponding characters, due to a multitude of factors such as connections between adjacent characters, noise effects, etc. As shown in Fig. 2, the primitive substring for the character "h" is changed from "(l,A)(n,x)(l,x)" to "(l,A)(o,x)(l,x)" because of the effect of noise. The touching characters "al" and "th" have also affected the corresponding primitive substrings. In the next section, we shall consider the technique of inexact matching for robust matching.

Fig. 3. Different fonts.

TABLE 1. Primitive String Tokens of Characters

3 INEXACT FEATURE MATCHING

Based on the processing described above, each word image is described by a primitive string. The word searching problem can then be stated as finding a particular sequence/subsequence in the primitive string of a word. The procedure of matching word images then becomes a measurement of the similarity between the string A = [a_1, a_2, ..., a_n] representing the features of the query word and the string B = [b_1, b_2, ..., b_m] representing the features of a word image extracted from a document. Matching partial words becomes evaluating the similarity between the feature string A and a subsequence of the feature string B. For example, the problem of matching the word "health" with the word "unhealthy" in Fig. 2 is to find whether the subsequence A = (l,A)(n,x)(l,x)(&,&)(c,x)(e,x)(o,x)(&,&)(o,x)(e,x)(l,x)(&,&)(l,A)(&,&)(n,x)(l,A)(o,x)(&,&)(l,A)(n,x)(l,x) exists in the primitive sequence of the word "unhealthy."

In a word image, it is common for two or more adjacent characters to be connected with each other, possibly because of low scanning resolution or poor printing quality. This results in the omission of the feature "&" in the corresponding feature string, compared to the primitive string generated from standard PSTs. Moreover, noise effects can produce substitutions or insertions of features in the primitive string of a word image. Such deletions, insertions, and substitutions are very similar to the evolutionary mutations of DNA sequences in molecular biology [35].

Lopresti and Zhou applied the inexact string matching strategy to information retrieval [36] and duplicate document detection [37] to deal with the imprecise text data in OCR results. Drawing inspiration from the alignment problem of two DNA sequences and the progress achieved by Lopresti and Zhou, we apply the technique of inexact string matching to evaluate the similarity between two primitive strings, one from a template image and the other from a word image extracted from a document. From the standpoint of string alignment [38], two differing features that mismatch correspond to a substitution; a space contained in the first string corresponds to an insertion of an extra feature into the second string, and a space in the second string corresponds to a deletion of an extra feature from the first string. Insertion and deletion are the reverse of each other. A dash ("-") is therefore used to represent a space primitive inserted into the corresponding positions in the strings in the case of deletion. Notice that we use "spacing" to refer to the gap in a word image, whereas we use "space" to refer to the inserted feature in a feature string. The two concepts are completely different.

For a primitive string A of length n and a primitive string B of length m, V(i, j) is defined as the similarity value of the prefixes [a_1, a_2, ..., a_i] and [b_1, b_2, ..., b_j]. The similarity of A and B is precisely the value V(n, m).

The similarity of two strings A and B can be computed by dynamic programming with recurrences [39]. The base conditions are:

$$\forall i, j: \quad V(i, 0) = 0, \qquad V(0, j) = 0. \qquad (2)$$

The general recurrence relation, for 1 ≤ i ≤ n and 1 ≤ j ≤ m, is:

$$V(i, j) = \max \begin{cases} 0 \\ V(i-1, j-1) + \sigma(a_i, b_j) \\ V(i-1, j) + \sigma(a_i, -) \\ V(i, j-1) + \sigma(-, b_j). \end{cases} \qquad (3)$$

The zero in the above recurrence implements the operation of restarting the recurrence, which makes sure that unmatched prefixes are discarded from the computation.

In our experiments, the score of matching any primitive with the space primitive "-" is defined as:

$$\sigma(a_k, -) = \sigma(-, b_k) = -1 \quad \text{for } a_k \neq -,\; b_k \neq -, \qquad (4)$$

while the score of matching any primitive with the spacing primitive "&" is defined as:

$$\sigma(a_k, \&) = \sigma(\&, b_k) = -1 \quad \text{for } a_k \neq \&,\; b_k \neq \&, \qquad (5)$$

and the matching score between two "&" primitives is given by:

$$\sigma(\&, \&) = 2, \qquad (6)$$

while the matching score between two primitives a_i and b_j is defined as:

$$\sigma(a_i, b_j) = \sigma\big((\lambda_i^a, \omega_i^a), (\lambda_j^b, \omega_j^b)\big) = \varepsilon_1(\lambda_i^a, \lambda_j^b) + \varepsilon_2(\omega_i^a, \omega_j^b), \qquad (7)$$

where "1 is the function specifying the match value between

two elements x0 and y0 of LTA. It is defined as:

"1ðx0; y0Þ ¼ 1 if x0 ¼ y0

�1 else:

�ð8Þ

Similarly, "2 is the function specifying the match value

between two elements x00 and y00 of ADA. It is defined as:

"2ðx00; y00Þ ¼ 1 if x00 ¼ y00

�1 else:

�ð9Þ
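Rules (4)-(9) can be collected into a single scoring function. The following sketch is our illustration, treating a primitive as an (LTA, ADA) tuple, with '-' denoting the space primitive and ('&', '&') the spacing primitive:

```python
SPACE = '-'            # space primitive inserted by the alignment
SPACING = ('&', '&')   # spacing primitive marking an intercharacter gap

def sigma(a, b):
    """Match score between two primitives per (4)-(9)."""
    if a == SPACE or b == SPACE:
        return -1                                  # (4)
    if a == SPACING and b == SPACING:
        return 2                                   # (6)
    if a == SPACING or b == SPACING:
        return -1                                  # (5)
    e1 = 1 if a[0] == b[0] else -1                 # epsilon_1 on LTA, (8)
    e2 = 1 if a[1] == b[1] else -1                 # epsilon_2 on ADA, (9)
    return e1 + e2                                 # (7)
```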

Finally, the maximum score is normalized to the interval [0, 1], with 1 corresponding to a perfect match:

$$S_1 = \max_{i,j} V(i, j) \,/\, V_A^*(n, n), \qquad (10)$$

where V_A^*(n, n) is the matching score between the string A and itself. The maximum operation in (10) and the restarting operation in the recurrence together ensure the ability to match partial words.

If the normalized maximum score in (10) is greater than a predefined threshold δ, then we recognize that one word image matches the other (or a portion of it).



On the other hand, the similarity of matching two whole word images in their entirety (i.e., with no partial matching allowed) can be represented by:

$$S_2 = V(n, m) \,/\, \min\big(V_A^*(n, n),\, V_B^*(m, m)\big). \qquad (11)$$

The problem can be evaluated systematically using tabular computation. In this approach, V(i, j) is computed bottom-up: we first compute V(i, j) for the smallest possible values of i and j and then for increasing values of i and j. The computation is organized in a dynamic programming table of size (n+1) × (m+1), which holds the values of V(i, j) for all choices of i and j (see Table 2). The values in row zero and column zero are filled in directly from the base conditions for V(i, j). After that, the remaining n × m subtable is filled in one row at a time, in order of increasing i; within each row, the cells are filled in order of increasing j.

The algorithm can be summarized as follows.
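Below is a minimal Python rendering of the table-filling loop (ours, with illustrative names), assuming the scoring function sigma sketched above:

```python
def inexact_match(A, B, sigma):
    """Fill the (n+1) x (m+1) table of V(i, j) per (2) and (3).
    Returns the table and the best score max V(i, j)."""
    n, m = len(A), len(B)
    V = [[0] * (m + 1) for _ in range(n + 1)]   # row/column zero: base conditions (2)
    best = 0
    for i in range(1, n + 1):                   # rows in order of increasing i
        for j in range(1, m + 1):               # cells in order of increasing j
            V[i][j] = max(
                0,                                              # restart
                V[i - 1][j - 1] + sigma(A[i - 1], B[j - 1]),    # match/substitution
                V[i - 1][j] + sigma(A[i - 1], '-'),             # space in B (deletion)
                V[i][j - 1] + sigma('-', B[j - 1]),             # space in A (insertion)
            )
            best = max(best, V[i][j])
    return V, best
```

Dividing best by the self-match score inexact_match(A, A, sigma)[1] then yields S_1 of (10).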

The entire dynamic programming table for computing the similarity between a string of length n and a string of length m can be obtained in O(nm) time, since only three comparisons and arithmetic operations are needed per cell.

Table 2 illustrates the score table computed from the primitive string of the word image "unhealthy" extracted from a real document image and the word "health," whose primitive string is generated by aggregating the characters' primitive string tokens according to the character sequence of the word, with the special primitive "&" inserted between two adjacent PSTs. It can be seen that the maximum score achieved in the table corresponds to the match of the character sequence "health" in the word "unhealthy."

4 APPLICATION A: WORD SEARCHING

We now focus on the application of our proposed partial word image matching method to word spotting. Searching for and locating a user-specified keyword in image format documents has been a topic of interest for many years and has practical value for document information retrieval. For example, by using this technique, the user can locate a specified word in document images without any prior need for the images to be OCR-processed.

4.1 System Overview

The overall system structure is illustrated in Fig. 4. When a document image is presented to the system, it goes through preprocessing, as in many document image processing systems. It is presumed that processes such as skew estimation and correction [40], if applicable, and other image-quality-related processing are performed in the first module of the system. Then, all of the connected components in the image are calculated using an eight-connected-component analysis algorithm. The representation of each connected component includes the coordinates and dimensions of its bounding box, and so on.

Word objects are bounded based on a merger operation on the connected components. As a result, the left, top, right, and bottom coordinates of each word bitmap are obtained. Meanwhile, the baseline and x-line locations in each word are also available for subsequent processing. Extracted word bitmaps with baseline and x-line information are the basic units for the downstream process of word matching and are represented with primitive strings as described in Section 2.

When a user keys in a query word, the system generates its corresponding word primitive token (WPT) by aggregating the characters' primitive string tokens according to the character sequence of the word, with the special primitive "&" inserted between two adjacent PSTs. For example, the WPT of the word "health" is: (l,A)(n,x)(l,x)(&,&)(c,x)(e,x)(o,x)(&,&)(o,x)(e,x)(l,x)(&,&)(l,A)(&,&)(n,x)(l,A)(o,x)(&,&)(l,A)(n,x)(l,x).


TABLE 2. Score Table for Computing V(i, j)

Fig. 4. System diagram for searching a user-specified word in a document image.


This primitive string is matched with the primitive string of each word object in the document image to measure the similarity between the two. According to the similarity measurement, we can estimate how relevant the word in the document image is to the query word. For a given query word, a larger score always corresponds to a better match. If the normalized maximum score in (10) is greater than the predefined threshold δ, then we recognize that the word image (or a portion of it) matches the query word.

The match may be imprecise in the sense that certain primitives are missing. Some adjacent characters in an image word may touch each other for various reasons. This results in fewer character spacings being detected and, hence, fewer spacing primitives "(&,&)" in the primitive string. Looking at Table 2, we can obtain the best matching track by backtracking from the maximum score. Our investigation shows that the score along the best matching track decreases at each spacing primitive "(&,&)" of the query word's primitive string whose corresponding spacing primitive is missing in the real word. To remedy this problem, the normalized maximum score in (10) is modified as:

$$S_m = \Big( \max_{i,j} V(i, j) + \phi(N_g) \Big) \,/\, V_A^*(n, n), \qquad (12)$$

where N_g is the number of such mismatched spacing primitives and φ(N_g) = 2 N_g in our experiments.
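In code, with best and the self-match score from the sketch above, the compensation of (12) is a one-liner (the function name is ours):

```python
def compensated_score(best, n_gaps, self_score):
    """S_m of (12): credit back phi(N_g) = 2 * N_g for spacing primitives
    of the query lost to touching characters in the image word."""
    return (best + 2 * n_gaps) / self_score
```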

4.2 Data Preparation

The digital library of our university stores a huge number of old student theses and books, and it has been deemed desirable to make these documents accessible on the Internet. The original plan was to digitize all the documents and then transform the document images into text format using a commercial OCR product. However, the OCR results were not satisfactory, and manually correcting the errors was obviously expensive and impractical for such a huge data set. It was thus decided to retain the documents as PDF files in image format. We selected 3,250 pages from 113 such PDF files of scanned student theses and 328 pages from 15 PDF files of scanned books to generate a document image database called the NUSDL database.

For various reasons, documents available on the Internet are also often in image format. Examples of such documents are the journal and conference papers available as IEEE Xplore online publications and similar papers in other journal online publication systems such as CVGIP: Graphical Models and Image Processing. We obtained 39 such documents with 294 pages and named the collection the JPaper database.

To search for words in such PDF files containing document images, we developed a tool based on our proposed approach, using the Acrobat SDK. When a PDF file is opened in Adobe Acrobat, the plug-in tool is able to detect and locate the user's query words in the document images without any OCR process, like the "Find" function provided by Adobe Acrobat for word search in text format documents.

Besides PDF document images, binary images from the UW document image database are also used in our experiments.

4.3 Experimental Results

Fig. 5 shows the results of searching for words in a scanned student thesis selected from the NUSDL database. In this example, the words "cooler," "Cryocooler," "cryocooler," "microcooler," and "cooling" in the document image are successfully detected and located when the user inputs the word "cool" for searching.

Another example is given in Fig. 6. The document image is stored in a PDF file which is accessible at the Web site of IEEE Xplore online publications. The words "string," "strings," and "substring" in different fonts are correctly detected and located in the document when "string" is used as the query word.

Fig. 5. Results of searching for the word "cool."

Fig. 6. Results of searching for the word "string."

To evaluate the performance of our system, we select 150 words as queries and search for their corresponding words and variations in the document images. The system achieves a precision ranging from 89.06 to 99.36 percent and a recall¹ ranging from 85.67 to 99.19 percent, depending on the threshold δ, as shown in Table 3. A lower δ results in a higher recall, but at the expense of lower precision; the reverse is true for a higher δ. Thus, if the goal is to retrieve as many relevant words as possible, a lower δ should be set. On the other hand, if there is low tolerance for false matches, δ should be raised.

1. Precision is defined as the percentage of correctly searched words over the number of all searched words, while recall is the percentage of correctly searched words over the number of words that should have been found.

It is common practice to combine recall R and precision P in some way so that system performance can be compared in terms of a single rating. Our experiments adopt one of the commonly used measures, the F1 rating [4], which gives equal weight to R and P. It is defined as:

$$F_1 = \frac{2RP}{R + P}. \qquad (13)$$

Fig. 7 shows the average precision and recall, as well as the F1 rating. On average, the best F1 rating the system can achieve is 0.9703, where the precision and recall are 98.33 and 95.77 percent, respectively, and the threshold δ is set at 0.85. These results are also better than those of our previous system [32], whose precision and recall are 97.08 and 93.94 percent, respectively, with a best F1 measure of 0.9548.
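As a quick arithmetic check of (13) at the reported operating point:

```python
P, R = 0.9833, 0.9577
F1 = 2 * R * P / (R + P)
print(round(F1, 4))  # 0.9703, the best rating reported above
```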

4.4 Comparison with Commercial OCR

The majority of the document images used in our test are packed in PDF files. Adobe Acrobat provides an OCR tool, Page Capture, which converts a page of a document image in a PDF file to text format using document image processing, so it was selected for comparison with our proposed approach. In Page Capture, if a word object in a document image cannot be recognized, its original image is kept and inserted at the corresponding position in the text for display purposes. Fig. 8a shows the OCR result on part of a document image page selected from a student thesis, in which the words "TMS320C30," "passband," and "stopband" have not been successfully converted into text format. These words therefore still cannot be searched, even though the OCR tool has been applied to the page. In comparison, Fig. 8b shows the search result for the query word "band" using the word searching tool based on our proposed word matching approach. All of the words containing the string "band" are successfully detected.

To compare the word searching performance of Adobe Acrobat and our proposed approach, the NUSDL and JPaper databases are used. The UW database is not included in the comparison because Adobe Acrobat can work only on document images packed in PDF files. The precision and recall achieved by Adobe Acrobat are 99.93 and 94.17 percent, respectively. Our approach achieves a precision of 98.15 percent and a recall of 95.86 percent when the threshold δ is set at 0.85. It can be seen that the precision of Adobe Acrobat is very high while its recall is a little lower than that of our approach. The reason is that the OCR tool in Adobe Acrobat is lexicon dependent. A lexicon may be built into its recognition engine, so it can achieve very high precision; but it does not handle well words that are not included in the lexicon, especially certain technical terms in student theses and papers. Our proposed approach, on the other hand, does not use any language or lexical information at all. In addition, our experiments show that our proposed approach is approximately 2.6 times faster than the OCR tool.

TABLE 3. Performance versus Threshold δ

Fig. 7. Performance versus different thresholds.

5 APPLICATION B: DOCUMENT SIMILARITY MEASUREMENT

Measuring the similarity between documents has practical applications in document image retrieval. For instance, a user may use an entire document, rather than a keyword, to retrieve documents whose content is similar to the query document. As we have mentioned before, the word, rather than the character, is the basic unit of meaning in document image retrieval. Therefore, we use word objects in documents as our basic unit of information. In this section, we present another application of our proposed partial word matching to document image retrieval, i.e., similarity measurement between two document images. The system diagram is shown in Fig. 9.

5.1 Document Vector and Similarity

The word objects in the images are first bounded based on the analysis of the connected components. Some common words (such as "a," "the," "by," etc.), called stop words in linguistics, are eliminated according to their shapes, as they do not contribute much to the overall weight of the document. Although the width estimates of single words are unreliable for stop word identification, stop words tend to be short and are composed of few characters. Suppose that w and h represent the width and height of a word object, respectively. If the aspect ratio w/h of a word object is less than a certain threshold, it is excluded from the calculation of document vectors; in our experiments, this threshold is set at 2.0. This process also prevents a short word from being matched with a long word. For example, it prevents the word "the" and the word "weather" from being assigned to the same class, because the word "the" has already been eliminated.
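A sketch of this filter (the function name and the box representation are our assumptions):

```python
def content_word_boxes(word_boxes, min_aspect=2.0):
    """Keep word objects whose aspect ratio w/h reaches the threshold
    (2.0 in the experiments); shorter boxes are treated as stop words."""
    return [box for box, w, h in word_boxes if w / h >= min_aspect]
```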

The remaining word objects in each document are represented by their corresponding primitive feature strings. An unsupervised classifier is employed to place all of the word objects (except the stop words) coming from both document images into K object classes. The number of classes is initialized to zero. The first incoming word generates a new class automatically. Each subsequent incoming word is compared with the word stored in each existing class, using the word matching method described above to measure the similarity between two word objects. If the similarity score of the two word objects is greater than a predefined threshold δ, we assign them to the same class; otherwise, the incoming word is assigned to a new class.
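A minimal sketch of this incremental classifier, assuming a similarity(w1, w2) function implementing the word matching score of Section 3 (all names are ours):

```python
def cluster_words(words, similarity, delta):
    """Assign each word image to the first class whose stored word it
    matches with score > delta; otherwise it founds a new class."""
    classes = []   # one representative word per class
    labels = []    # class index of each input word
    for w in words:
        for k, rep in enumerate(classes):
            if similarity(w, rep) > delta:
                labels.append(k)
                break
        else:
            classes.append(w)               # w founds a new class
            labels.append(len(classes) - 1)
    return classes, labels
```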

Fig. 8. Comparison of Adobe Acrobat OCR and the proposed approach. (a) OCR results by Adobe Acrobat. (b) Results of searching for the word "band" by our approach.

Fig. 9. System diagram for measuring the similarity between two document images.

The respective occurrence frequencies of all classes in a document are used to build the document's vector. Each occurrence frequency is normalized by dividing it by the total number of word objects in the document. The similarity score between two document vectors is defined as their scalar product divided by their lengths, the same as in our previous work [34]. A scalar product is calculated by summing the products of the corresponding elements. This is equivalent to the cosine of the angle between the two document vectors seen from the origin. So, the similarity between the query document image Q and the archived document image Y is:

$$S(Q, Y) = \frac{\displaystyle\sum_{k=1}^{K} q_k\, y_k}{\sqrt{\displaystyle\sum_{k=1}^{K} q_k^2 \sum_{k=1}^{K} y_k^2}}, \qquad (14)$$

where Q = (q_1, q_2, ..., q_K)^T is the document vector of the image Q, Y = (y_1, y_2, ..., y_K)^T is the document vector of the image Y, and K is the dimension of the document vectors, which equals the number of classes generated by the unsupervised classifier.
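Equation (14) in code (a straightforward sketch; the function name is ours):

```python
import math

def document_similarity(q, y):
    """Cosine similarity (14) between two length-K class-frequency vectors."""
    dot = sum(qk * yk for qk, yk in zip(q, y))
    norm = math.sqrt(sum(qk * qk for qk in q) * sum(yk * yk for yk in y))
    return dot / norm if norm else 0.0
```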

5.2 Experimental Results

Our experiments compute the similarity measurement between document images selected from the NUSDL database. The articles address nine different topics and are student theses pertaining to electrical engineering, chemical engineering, mechanical engineering, medical engineering, etc. The topics and the number of documents for each used in the experiments are listed in Table 4.

The similarity measurement between any two document images in the corpus is calculated using the proposed word image matching approach with (10). Fig. 10 illustrates the similarity measurement of document C5 in group C against the first three documents taken from each group. If document C5 is chosen as a query document, the documents in group C (i.e., C1, C2, and C3) can be successfully retrieved from the corpus, subject to a suitable threshold.

The retrieval performance of each group is tabulated in Table 5, according to different similarity thresholds θ. The system achieves an average precision ranging from 73.79 to 97.78 percent and an average recall ranging from 68.97 to 98.72 percent (precision and recall are defined in footnote 2), depending on the similarity threshold. A lower θ allows more relevant documents to be retrieved (hence, the higher recall), but at the expense of false positives (hence, the lower precision); the reverse is true for a higher θ. Thus, if the goal is to retrieve as many relevant documents as possible, leaving irrelevant ones to be rejected by manual reading, then a lower θ should be set. On the other hand, if there is low tolerance to false positives, θ should be raised.

We compare the system performance based on the application of partial word image matching (the PWM method, using (10)) with that based on entire word image matching (the EWM method, using (11)). The maximum F1 ratings with respect to each group are illustrated in Fig. 11. From the results, we can see that the PWM method outperforms the EWM method in general: the overall F1 rating achieved by the PWM method improves to 0.8646, compared to the EWM method's 0.7862.
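For reference, the F1 rating combines the precision and recall defined in footnote 2; assuming the standard definition is used, it is the harmonic mean of the two:

```python
def f1_rating(precision, recall):
    """F1 = 2 * P * R / (P + R), assuming the usual definition applies."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```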

6 CONCLUSION AND FUTURE WORK

Document images have become a popular information source in our modern society, and information retrieval in document image databases is an important topic in knowledge and data engineering research.


TABLE 4. Corpus Used in the Experiments

2. Precision is defined as the percentage of the number of correctly retrieved articles over the number of all retrieved articles, while recall is the percentage of the number of correctly retrieved articles over the number of articles in the category.

Fig. 10. The similarity with document C5.


Document image retrieval without OCR has its practical value, but it is also a challenging problem. In this paper, we have proposed a word image matching approach with the ability of matching partial words. To measure the similarity of two word images, each word image is represented by a primitive string. Then, an inexact string matching technique is utilized to measure the similarity between the two primitive strings generated from the two word images. Based on the matching score, we can estimate how one word image is relevant to the other and decide whether one is a portion of the other. The proposed word image matching approach has been applied to two document image retrieval problems, i.e., word spotting and document similarity measurement. Experiments have been carried out on various document images such as scanned books, student theses, and journal/conference papers downloaded from the Internet, as well as UW document images. The test results confirm the validity and feasibility of the applications of the proposed word matching approach to document image retrieval.

The setting of matching scores between different primitive codes is beyond the scope of this paper and is left to future research. How to choose a suitable threshold for achieving the maximum F1 rating also requires further investigation.

ACKNOWLEDGMENTS

This research is jointly supported by the Agency for Science, Technology and Research, and the Ministry of Education of Singapore under research grant R-252-000-071-112/303. The authors would also like to thank Mr. Ng Kok Koon and Mrs. Khoo Yee Hoon of the digital library of the National University of Singapore for providing us with the document image data of student theses.

REFERENCES

[1] D. Doermann, "The Indexing and Retrieval of Document Images: A Survey," Computer Vision and Image Understanding, vol. 70, no. 3, pp. 287-298, 1998.

[2] M. Mitra and B.B. Chaudhuri, "Information Retrieval from Documents: A Survey," Information Retrieval, vol. 2, nos. 2/3, pp. 141-163, 2000.

[3] G. Salton, J. Allan, C. Buckley, and A. Singhal, "Automatic Analysis, Theme Generation, and Summarization of Machine-Readable Text," Science, vol. 264, pp. 1421-1426, 1994.

[4] Y. Yang and X. Liu, "A Re-Examination of Text Categorization Methods," Proc. 22nd Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 42-49, 1999.

[5] K. Taghva, J. Borsack, A. Condit, and S. Erva, "The Effects of Noisy Data on Text Retrieval," J. Am. Soc. for Information Science, vol. 45, no. 1, pp. 50-58, 1994.


TABLE 5. Retrieval Performance for Different Thresholds θ

Fig. 11. F1 rating comparison between PWM and EWM.


[6] Y. Ishitani, "Model-Based Information Extraction Method Tolerant of OCR Errors for Document Images," Proc. Sixth Int'l Conf. Document Analysis and Recognition, pp. 908-915, 2001.

[7] M. Ohta, A. Takasu, and J. Adachi, "Retrieval Methods for English Text with Misrecognized OCR Characters," Proc. Fourth Int'l Conf. Document Analysis and Recognition, pp. 950-956, 1997.

[8] S.M. Harding, W.B. Croft, and C. Weir, "Probabilistic Retrieval of OCR Degraded Text Using N-Grams," Proc. European Conf. Research and Advanced Technology for Digital Libraries (ECDL '97), pp. 345-359, 1997.

[9] P. Kantor and E. Voorhees, "Report on the TREC-5 Confusion Track," Online Proc. TREC-5, NIST Special Publication 500-238, pp. 65-74, 1997.

[10] T. Kameshiro, T. Hirano, Y. Okada, and F. Yoda, "A Document Image Retrieval Method Tolerating Recognition and Segmentation Errors of OCR Using Shape-Feature and Multiple Candidates," Proc. Fifth Int'l Conf. Document Analysis and Recognition, pp. 681-684, 1999.

[11] K. Katsuyama et al., "Highly Accurate Retrieval of Japanese Document Images through a Combination of Morphological Analysis and OCR," Proc. SPIE, Document Recognition and Retrieval, vol. 4670, pp. 57-67, 2002.

[12] F.R. Chen and D.S. Bloomberg, "Summarization of Imaged Documents without OCR," Computer Vision and Image Understanding, vol. 70, no. 3, pp. 307-319, 1998.

[13] D.S. Bloomberg and F.R. Chen, "Document Image Summarization without OCR," Proc. Int'l Conf. Image Processing, vol. 2, pp. 229-232, 1996.

[14] J. Liu and A.K. Jain, "Image-Based Form Document Retrieval," Pattern Recognition, vol. 33, no. 3, pp. 503-513, 2000.

[15] D. Niyogi and S. Srihari, "The Use of Document Structure Analysis to Retrieve Information from Documents in Digital Libraries," Proc. SPIE, Document Recognition IV, vol. 3027, pp. 207-218, 1997.

[16] Y.Y. Tang, C.D. Yan, and C.Y. Suen, "Document Processing for Automatic Knowledge Acquisition," IEEE Trans. Knowledge and Data Eng., vol. 6, no. 1, pp. 3-21, Feb. 1994.

[17] Y. He, Z. Jiang, B. Liu, and H. Zhao, "Content-Based Indexing and Retrieval Method of Chinese Document Images," Proc. Fifth Int'l Conf. Document Analysis and Recognition (ICDAR '99), pp. 685-688, 1999.

[18] A.L. Spitz, "Duplicate Document Detection," Proc. SPIE, Document Recognition IV, vol. 3027, pp. 88-94, 1997.

[19] A.F. Smeaton and A.L. Spitz, "Using Character Shape Coding for Information Retrieval," Proc. Fourth Int'l Conf. Document Analysis and Recognition, pp. 974-978, 1997.

[20] A.L. Spitz, "Shape-Based Word Recognition," Int'l J. Document Analysis and Recognition, vol. 1, no. 4, pp. 178-190, 1999.

[21] A.L. Spitz, "Progress in Document Reconstruction," Proc. 16th Int'l Conf. Pattern Recognition, vol. 1, pp. 464-467, 2002.

[22] Z. Yu and C.L. Tan, "Image-Based Document Vectors for Text Retrieval," Proc. 15th Int'l Conf. Pattern Recognition, vol. 4, pp. 393-396, 2000.

[23] C.L. Tan, W. Huang, Z. Yu, and Y. Xu, "Imaged Document Text Retrieval without OCR," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 6, pp. 838-844, June 2002.

[24] T.K. Ho, J.J. Hull, and S.N. Srihari, "A Word Shape Analysis Approach to Lexicon Based Word Recognition," Pattern Recognition Letters, vol. 13, pp. 821-826, 1992.

[25] T. Syeda-Mahmood, "Indexing of Handwritten Document Images," Proc. Workshop Document Image Analysis, pp. 66-73, 1997.

[26] A. Kolcz, J. Alspector, M. Augusteijn, R. Carlson, and G.V. Popescu, "A Line-Oriented Approach to Word Spotting in Handwritten Documents," Pattern Analysis and Applications, vol. 3, no. 2, pp. 153-168, 2000.

[27] R. Manmatha, C. Han, and E.M. Riseman, "Word Spotting: A New Approach to Indexing Handwriting," Proc. Int'l Conf. Computer Vision and Pattern Recognition, pp. 631-637, 1996.

[28] J. DeCurtins and E. Chen, "Keyword Spotting via Word Shape Recognition," Proc. SPIE, Document Recognition II, vol. 2422, pp. 270-277, 1995.

[29] S. Kuo and O.E. Agazzi, "Keyword Spotting in Poorly Printed Documents Using Pseudo 2-D Hidden Markov Models," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 16, no. 8, pp. 842-848, Aug. 1994.

[30] F.R. Chen, L.D. Wilcox, and D.S. Bloomberg, "Word Spotting in Scanned Images Using Hidden Markov Models," Proc. Int'l Conf. Acoustics, Speech, and Signal Processing, vol. 5, pp. 1-4, 1993.

[31] F.R. Chen, L.D. Wilcox, and D.S. Bloomberg, "Detecting and Locating Partially Specified Keywords in Scanned Images Using Hidden Markov Models," Proc. Int'l Conf. Document Analysis and Recognition, pp. 133-138, 1993.

[32] Y. Lu, C.L. Tan, W. Huang, and L. Fan, "An Approach to Word Image Matching Based on Weighted Hausdorff Distance," Proc. Sixth Int'l Conf. Document Analysis and Recognition, pp. 921-925, 2001.

[33] J.J. Hull, "Document Matching on CCITT Group 4 Compressed Images," Proc. SPIE, Document Recognition IV, vol. 3027, pp. 82-87, 1997.

[34] Y. Lu and C.L. Tan, "Document Retrieval from Compressed Images," Pattern Recognition, vol. 36, no. 4, pp. 987-996, 2003.

[35] A. Apostolico and R. Giancarlo, "Sequence Alignment in Molecular Biology," DIMACS Series in Discrete Math. and Theoretical Computer Science, vol. 47, pp. 85-115, 1999.

[36] D. Lopresti and J. Zhou, "Retrieval Strategies for Noisy Text," Proc. Fifth Ann. Symp. Document Analysis and Information Retrieval, pp. 255-269, 1996.

[37] D.P. Lopresti, "A Comparison of Text-Based Methods for Detecting Duplication in Scanned Document Databases," Information Retrieval, vol. 4, no. 2, pp. 153-173, 2001.

[38] D. Gusfield, Algorithms on Strings, Trees, and Sequences. Cambridge Univ. Press, 1997.

[39] R.A. Wagner and M.J. Fischer, "The String-to-String Correction Problem," J. ACM, vol. 21, pp. 168-173, 1974.

[40] Y. Lu and C.L. Tan, "A Nearest-Neighbor-Chain Based Approach to Skew Estimation in Document Images," Pattern Recognition Letters, vol. 24, pp. 2315-2323, 2003.

Yue Lu received the BE and ME degrees in telecommunications and electronic engineering from Zhejiang University, Zhejiang, China, in 1990 and 1993, respectively, and the PhD degree in pattern recognition and intelligent systems from Shanghai Jiao Tong University, Shanghai, China, in 2000. He is a professor in the Department of Computer Science and Technology, East China Normal University. From 2000 to 2004, he was a research fellow with the Department of Computer Science, National University of Singapore. From 1993 to 2000, he was an engineer at the Shanghai Research Institute of China State Post Bureau. For his outstanding contributions to the research and development of the automatic mail sorting system OVCS, he won the Science and Technology Progress Awards issued by the Ministry of Posts and Telecommunications in 1995, 1997, and 1998. His research interests include document image processing, document retrieval, character recognition, and intelligent systems.

Chew Lim Tan received the BSc (Hons) degree in physics in 1971 from the University of Singapore, the MSc degree in radiation studies in 1973 from the University of Surrey, UK, and the PhD degree in computer science in 1986 from the University of Virginia. He is an associate professor in the Department of Computer Science, School of Computing, National University of Singapore. His research interests include document image and text processing, neural networks, and genetic programming. He has published more than 170 research publications in these areas. He is an associate editor of Pattern Recognition and has served on the program committees of the International Conference on Pattern Recognition (ICPR) 2002, the Graphics Recognition Workshop (GREC) 2001 and 2003, the Web Document Analysis Workshop (WDA) 2001 and 2003, the Document Image Analysis and Retrieval Workshop (DIAR) 2003, the Document Image Analysis for Libraries Workshop (DIAL) 2004, the International Conference on Image Processing (ICIP) 2004, and the International Conference on Document Analysis and Recognition (ICDAR) 2005. He is a senior member of the IEEE and the IEEE Computer Society.
