
Harvesting Large-Scale Weakly-Tagged Image Databases from the Web

Jianping Fan1, Yi Shen1, Ning Zhou1, Yuli Gao2

1 Department of Computer Science, UNC-Charlotte, NC 28223, USA
2 Multimedia Interaction and Understanding, HP Labs, Palo Alto, CA 94304, USA

Abstract

To leverage large-scale weakly-tagged images for computer vision tasks (such as object detection and scene recognition), a novel cross-modal tag cleansing and junk image filtering algorithm is developed for cleansing the weakly-tagged images and their social tags (i.e., removing irrelevant images and finding the most relevant tags for each image) by integrating both the visual similarity contexts between the images and the semantic similarity contexts between their tags. Our algorithm can address the issues of spams, polysemes and synonyms more effectively and determine the relevance between the images and their social tags more precisely, thus allowing us to create large amounts of training images with more reliable labels by harvesting from large-scale weakly-tagged images, which can further be used to achieve more effective classifier training for many computer vision tasks.

1. Introduction

For many computer vision tasks, such as object detection and scene recognition, machine learning techniques are usually employed to learn the classifiers from a set of labeled training images [1]. The set of labeled training images must be large-scale because: (1) the number of object classes and scenes of interest could be very large; (2) the learning complexity for some object classes and scenes could be very high because of visual ambiguity; and (3) a small number of labeled training images is insufficient to interpret the diverse visual properties of large amounts of unseen test images. However, hiring professionals to label large amounts of training images is costly and poses a key limitation for the practical use of some advanced computer vision techniques. On the other hand, large-scale digital images and their associated text terms are available on the Internet, thus it is very attractive to leverage large-scale online images for computer vision tasks [2].

Some pioneering works have been done to leverage Internet images for computer vision tasks [2, 4-8]. Fergus et al. [4] and Li et al. [6] dealt with the precision problem by re-ranking the images downloaded from an image search engine. Recently, Schroff et al. [7] have developed a new algorithm for harvesting image databases from the web by combining text, meta-data and visual information. All these existing techniques make a hidden assumption, namely that image semantics have an explicit correspondence with the associated or nearby texts. Unfortunately, such an assumption may not always be true.

Collaborative image tagging systems, such as Flickr [3], are now a popular way to obtain large sets of labeled images easily by relying on the collaborative effort of a large population of Internet users. In a collaborative image tagging system, people can tag the images according to their social or cultural backgrounds, personal expertise and perception. We call the collaboratively-tagged images weakly-tagged images because their social tags may not have exact correspondences with the underlying image semantics. With the exponential growth of the weakly-tagged images, it is very attractive to develop new algorithms that can leverage large-scale weakly-tagged images for computer vision tasks (such as learning the classifiers for object detection and scene recognition). Without controlling the word vocabulary, many text terms for image tagging may be synonyms or polysemes or even spams. The appearance of synonyms, polysemes and spams may either return incomplete sets of the relevant images or result in large amounts of ambiguous or even junk images. Thus it is not a trivial task to leverage large-scale weakly-tagged images for computer vision tasks.

In this paper, we focus on collecting large-scale weakly-tagged images from collaborative image tagging systems such as Flickr by addressing the following crucial issues:

(a) Synonymous Tags: Different people may use different tags, which have the same or close meanings (synonyms), to tag their images. For example, car, auto, and automobile are a set of synonyms. The synonyms may result in incomplete returns of the relevant images in the image crawling process, and most tag clustering algorithms cannot incorporate the visual similarities between the relevant images to deal with the issue of synonyms more effectively.

(b) Polysemous Tags: Collaborative image tagging is an ambiguous process. Without controlling the vocabulary, different people may apply the same tag in different ways (i.e., the same tag may have different meanings under different contexts), which may result in large amounts of ambiguous images. For example, the text term “bank” can be used to tag “bank office”, “river bank” and “cloud bank”. Word sense disambiguation is one potential solution for addressing this ambiguity issue, but it cannot incorporate the visual properties of the relevant images to deal with the issue of polysemes more effectively [9-10].

(c) Spam Tags: Spam tags, which are used to drive traffic to certain images for fun or profit, are created by inserting text terms that are related to popular query terms rather than to the actual image content. Spam tags are problematic because the resulting junk images may mislead the underlying machine learning tools for classifier training. Junk image filtering is an attractive direction for dealing with the issue of spam tags, but it is worth noting that the scenario for junk image filtering in a collaborative image tagging space is significantly different.

In this paper, a novel cross-modal tag cleansing and junk image filtering algorithm is developed by integrating both the visual properties of the weakly-tagged images and their social tags to deal with the issues of spams, polysemes and synonyms more effectively, so that we can create large amounts of training images with more reliable labels for computer vision tasks by harvesting from large-scale weakly-tagged images. The paper is organized as follows. In Section 2, an automatic algorithm is introduced for image topic extraction. In Section 3, a mixture-of-kernels algorithm is introduced for image similarity characterization. In Section 4, a spam tag detection technique is introduced for junk image filtering. In Section 5, a cross-modal tag cleansing algorithm is introduced for addressing the issues of synonyms and polysemes. The algorithm evaluation results are given in Section 6. We conclude this paper in Section 7.

2. Image Topic Extraction

Each image in a collaborative tagging system is associated with the image holder's taggings of the underlying image content and other users' taggings or comments. It is worth noting that entity extraction can be done more effectively in a collaborative image tagging space. In this paper, we first focus on extracting the social tags which are strongly related to the most popular real-world objects and scenes or events. The social tags which are related to image capture time and place are also very attractive, but they are beyond the scope of this paper. Thus the image tags are first partitioned into two categories: noun phrases versus verb phrases. The noun phrases are further partitioned into two categories automatically: content-relevant tags (i.e., tags that are relevant to image objects and scenes) and content-irrelevant tags. The verb phrases are further partitioned into two categories automatically: event-relevant tags (i.e., tags that are relevant to image events) and event-irrelevant tags.

The occurrence frequency for each content-relevant tag and each event-relevant tag is counted automatically by using the number of relevant images. Misspelled tags tend to have low frequencies (i.e., different people make different typing mistakes), thus it is easy for us to correct such misspelled tags and add their images to the corresponding correct tags automatically. Two tags, which are used for tagging the same image, are considered to co-occur once without considering their order. A co-occurrence matrix is obtained by counting the frequencies of such pairwise tag co-occurrences.
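Since these statistics drive the rest of the pipeline, the counting itself is straightforward. The sketch below (hypothetical helper name; it assumes each image is given as its set of cleaned tags) illustrates how the occurrence counts and the order-independent co-occurrence counts can be accumulated in a single pass:

```python
from collections import Counter
from itertools import combinations

def count_tag_statistics(tagged_images):
    """tagged_images: iterable of tag sets, one set per image.
    Returns per-tag occurrence counts and unordered pairwise co-occurrence counts."""
    occurrence = Counter()
    co_occurrence = Counter()
    for tags in tagged_images:
        tags = set(tags)                       # ignore duplicated tags within one image
        occurrence.update(tags)
        for a, b in combinations(sorted(tags), 2):
            co_occurrence[(a, b)] += 1         # order-independent pair count
    return occurrence, co_occurrence

# Example with three weakly-tagged images
images = [{"beach", "sand", "ocean"}, {"beach", "ocean"}, {"rock", "beach"}]
occ, cooc = count_tag_statistics(images)
print(occ["beach"], cooc[("beach", "ocean")])   # 3, 2
```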

The content-relevant tags and the event-relevant tags are further partitioned into two categories according to their interestingness scores: interesting tags and uninteresting tags. In this paper, multiple information sources have been exploited for determining the interesting tags more accurately. For a given tag C, its interestingness score ω(C) depends on: (1) its occurrence frequency t(C) (e.g., higher occurrence frequency corresponds to higher interestingness score); and (2) its co-occurrence frequency ϑ(C) with any other tag in the vocabulary (e.g., higher co-occurrence frequency corresponds to higher interestingness score). The occurrence frequency t(C) for a given tag C is equal to the number of images that are tagged by the given tag C. The co-occurrence frequency ϑ(C) for the given tag C is equal to the number of images that are tagged jointly by the given tag C and any other tag in the vocabulary.

The interestingness score ω(C) for a given tag C is defined as:

ω(C) = ξ · log(t(C) + √(t²(C) + 1)) + ζ · log(ϑ(C) + √(ϑ²(C) + 1))    (1)

where ξ and ζ are the weighting factors, ξ + ζ = 1.

All the interesting tags, which have large values of ω(·) (i.e., the top 5000 tags in our current experiments), are treated as image topics. In this work, only the interesting tags, which are used to interpret the most popular real-world object classes and scenes or events, are treated as the image topics. It is worth noting that one single weakly-tagged image may be assigned to multiple image topics when the relevant tags are used for tagging the image jointly. Collecting large-scale training images for the most popular real-world object classes and scenes or events and learning their classifiers more accurately are crucial for many computer vision tasks.
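Eq. (1) can be evaluated directly from the counts above. The sketch below is a literal transcription (the weighting factors ξ and ζ are free parameters, and the function name is ours):

```python
import math

def interestingness(t_C, theta_C, xi=0.5, zeta=0.5):
    """Interestingness score of Eq. (1).
    t_C:     occurrence frequency t(C) of tag C
    theta_C: co-occurrence frequency v(C) of tag C with any other tag
    xi, zeta: weighting factors with xi + zeta = 1"""
    assert abs(xi + zeta - 1.0) < 1e-9
    return (xi * math.log(t_C + math.sqrt(t_C ** 2 + 1))
            + zeta * math.log(theta_C + math.sqrt(theta_C ** 2 + 1)))

# Tags are ranked by this score and the top-ranked ones are kept as image topics.
print(interestingness(t_C=1200, theta_C=4500))
```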

3. Image Similarity Characterization

To achieve more sufficient characterization of various visual properties of the images, both global and local visual features are extracted for image content representation. In our current experiments, the following visual features are extracted: (1) a 36-bin RGB color histogram to characterize the global color distributions of the images; (2) 48-dimensional texture features from Gabor filter banks to characterize the global visual properties (i.e., global structures) of the images; and (3) a number of interest points and their SIFT (scale invariant feature transform) features to characterize the local visual properties of the underlying salient image components.

By using high-dimensional visual features (color histogram, wavelet textures, and SIFT features) for image content representation, we are able to characterize various visual properties of the images more sufficiently. On the other hand, the statistical properties of the images in the high-dimensional feature space may be heterogeneous and sparse because different feature subsets are used to characterize different visual properties of the images. Therefore, it is hard to use only one single type of kernel to characterize the diverse visual similarity contexts between the images precisely.

Based on these observations, the high-dimensional visual features are first partitioned into multiple feature subsets, and each feature subset is used to characterize one certain type of visual property of the images; thus the underlying visual similarity contexts between the images are more homogeneous and can be approximated more precisely by using one particular type of kernel. For each feature subset, a suitable base kernel is designed for image similarity characterization. Because different base image kernels may play different roles in characterizing the diverse visual similarity contexts between the images, the optimal kernel for diverse image similarity context characterization can be approximated more accurately by using a linear combination of these base image kernels with different importance.

For a given image topic Cj in the vocabulary, different base image kernels may play different roles in characterizing the diverse visual similarity relationships between the images. Thus the diverse visual similarity contexts between the images are characterized more precisely by using a mixture-of-kernels [13-14]:

κ(x, y) = Σ_{l=1}^{τ} β_l κ_l(x, y),    Σ_{l=1}^{τ} β_l = 1    (2)

where τ is the number of feature subsets (i.e., the number of base image kernels), and β_l ≥ 0 is the importance factor for the l-th base image kernel κ_l(x, y). Combining multiple base kernels can allow us to achieve more precise characterization of the diverse visual similarity contexts between the weakly-tagged images.
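A mixture-of-kernels is simply a convex combination of base kernel matrices, one per feature subset. The following sketch assumes Gaussian (RBF) base kernels with illustrative bandwidths; the actual base kernels chosen for each feature subset are not specified here, so treat these choices as placeholders:

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """Gaussian (RBF) base kernel between the rows of X and Y."""
    d2 = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def mixture_of_kernels(subsets_x, subsets_y, betas, gammas):
    """Eq. (2): kappa(x, y) = sum_l beta_l * kappa_l(x, y), with sum_l beta_l = 1.
    subsets_x, subsets_y: lists of (n_samples, d_l) arrays, one per feature subset."""
    betas = np.asarray(betas, dtype=float)
    assert np.all(betas >= 0) and abs(betas.sum() - 1.0) < 1e-9
    return sum(b * rbf_kernel(Xl, Yl, g)
               for b, g, Xl, Yl in zip(betas, gammas, subsets_x, subsets_y))

# Example: 5 images described by a color subset (36-d) and a texture subset (48-d)
rng = np.random.default_rng(0)
color, texture = rng.random((5, 36)), rng.random((5, 48))
K = mixture_of_kernels([color, texture], [color, texture], betas=[0.6, 0.4], gammas=[0.5, 0.1])
print(K.shape)   # (5, 5)
```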

4. Spam Tag Detection

Some popular image topics in the vocabulary may consist of large amounts of junk images because of spam tagging, and incorporating the junk images for classifier training may seriously mislead the underlying machine learning tools. Obviously, the junk images, which are induced by spam tagging, may differ significantly in their visual properties from the relevant images. Thus the junk images can be filtered out effectively by performing visual-based image clustering and relevance analysis.

4.1 Image Clustering

A K-way min-max cut algorithm is developed to achieve more effective image clustering, where the cumulative inter-cluster visual similarity contexts are minimized while the cumulative intra-cluster visual similarity contexts (summation of pairwise image similarity contexts within a cluster) are maximized. These two criteria can be satisfied simultaneously with a simple K-way min-max cut function [11].

For a given image topic C, a graph is first constructed for organizing all its weakly-tagged images according to their visual similarity contexts [11-12], where each node on the graph is one weakly-tagged image for the given image topic C and an edge between two nodes is used to characterize the visual similarity context κ(·, ·) between two weakly-tagged images.

All the weakly-tagged images for the given image topic C are partitioned into K clusters automatically by minimizing the following objective function:

min Ψ(C, K, β) = Σ_{i=1}^{K} s(G_i, G/G_i) / s(G_i, G_i)    (3)

where G = {G_i | i = 1, ..., K} is used to represent the K image clusters, G/G_i is used to represent the other K − 1 image clusters in G except G_i, K is the total number of image clusters, and β is the set of the optimal kernel weights. The cumulative inter-cluster visual similarity context s(G_i, G/G_i) is defined as:

s(G_i, G/G_i) = Σ_{u∈G_i} Σ_{v∈G/G_i} κ(u, v)    (4)

The cumulative intra-cluster visual similarity context s(G_i, G_i) is defined as:

s(G_i, G_i) = Σ_{u∈G_i} Σ_{v∈G_i} κ(u, v)    (5)
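Given the full kernel matrix over the images of a topic and a candidate cluster assignment, the quantities in Eqs. (3)-(5) follow directly. A small sketch (the variable names are ours):

```python
import numpy as np

def cluster_similarities(W, labels, k):
    """W: (n, n) pairwise kernel matrix kappa(u, v); labels: cluster index per image.
    Returns s(G_i, G_i), s(G_i, G/G_i) and the min-max cut objective of Eq. (3)."""
    intra, inter = np.zeros(k), np.zeros(k)
    for i in range(k):
        in_i = labels == i
        intra[i] = W[np.ix_(in_i, in_i)].sum()     # Eq. (5)
        inter[i] = W[np.ix_(in_i, ~in_i)].sum()    # Eq. (4)
    objective = np.sum(inter / intra)              # Eq. (3)
    return intra, inter, objective

# Toy example with a Gaussian kernel matrix and 2 clusters
rng = np.random.default_rng(1)
X = rng.random((6, 3))
W = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1))
print(cluster_similarities(W, np.array([0, 0, 0, 1, 1, 1]), k=2))
```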

Figure 1: Image clustering for the image topic “beach”: (a) cluster correlation network; (b) filtered junk images.

We further define X = [X_1, ..., X_l, ..., X_K] as the cluster indicators, where the component X_l is a binary indicator for the appearance of the l-th cluster G_l:

X_l(u) = { 1, u ∈ G_l;  0, otherwise }    (6)

W is defined as an n × n symmetric matrix (i.e., n is the total number of web images), and its components are defined as:

W_{u,v} = κ(u, v)    (7)

D is defined as an n × n diagonal matrix, and its diagonal components are defined as:

D_{u,u} = Σ_{v=1}^{n} W_{u,v}    (8)

For the given image topic C, an optimal partition of its weakly-tagged images (i.e., image clustering) is achieved by:

min Ψ(C, K, β) = Σ_{l=1}^{K} [X_l^T (D − W) X_l] / [X_l^T W X_l]    (9)

Let W̃ = D^{−1/2} W D^{−1/2} and X̃_l = D^{1/2} X_l / ‖D^{1/2} X_l‖; the objective function for our K-way min-max cut algorithm can further be refined as:

min Ψ(C, K, β) = Σ_{l=1}^{K} 1 / (X̃_l^T W̃ X̃_l) − K    (10)

subject to:

X̃_l^T X̃_l = I,    X̃_l^T W̃ X̃_l > 0,    l ∈ [1, ..., K]

The optimal solution for Eq. (10) is finally achieved by solving multiple eigenvalue equations:

W̃ X̃_l = λ_l X̃_l,    l ∈ [1, ..., K]    (11)

The objective function for kernel weight determination is to maximize the inter-cluster separability and the intra-cluster compactness. For one specific cluster G_l, its inter-cluster separability µ(G_l) and its intra-cluster compactness σ(G_l) are defined as:

µ(G_l) = X_l^T (D − W) X_l,    σ(G_l) = X_l^T W X_l    (12)

For one specific cluster G_l, we can refine its cumulative intra-cluster pairwise image similarity context s(G_l, G_l) as W(G_l):

W(G_l) = Σ_{u∈G_l} Σ_{v∈G_l} κ(u, v) = Σ_{i=1}^{τ} β_i ω_i(G_l)    (13)

D(G_l) − W(G_l) = Σ_{i=1}^{τ} β_i [ε_i(G_l) − ω_i(G_l)]    (14)

where ω_i(G_l) and ε_i(G_l) are defined as:

ω_i(G_l) = Σ_{u∈G_l} Σ_{v∈G_l} κ_i(u, v),    ε_i(G_l) = Σ_{v=1}^{n_l} ω_i(G_l)    (15)

The optimal weight vector β = [β_1, ..., β_τ] for kernel combination is determined automatically by maximizing the inter-cluster separability and the intra-cluster compactness:

max_β  (1/K) Σ_{l=1}^{K} σ(G_l) / µ(G_l)    (16)

subject to: Σ_{i=1}^{τ} β_i = 1,  ∀i: β_i ≥ 0

The optimal kernel weights β = [β_1, ..., β_τ] are determined automatically by solving the following quadratic programming problem:

min_β  (1/2) β^T ( Σ_{l=1}^{K} Ω(G_l) Ω(G_l)^T ) β    (17)

subject to: Σ_{i=1}^{τ} β_i = 1,  ∀i: β_i ≥ 0


Figure 2: Image clustering for the image topic “rock”: (a) cluster correlation network; (b) filtered junk images.

Ω(G_l) is defined as:

Ω(G_l) = ω(G_l) / (ε(G_l) − ω(G_l))    (18)
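The quadratic program of Eq. (17) can be handed to any generic constrained solver. The sketch below uses SciPy's SLSQP routine, which is our choice rather than the paper's implementation, and reads Ω(G_l) component-wise over the base kernels from the statistics of Eq. (15):

```python
import numpy as np
from scipy.optimize import minimize

def solve_kernel_weights(omega, epsilon):
    """Solve the quadratic program of Eq. (17) with a generic solver (our choice).
    omega, epsilon: (K, tau) arrays holding omega_i(G_l) and epsilon_i(G_l) from Eq. (15)."""
    Omega = omega / (epsilon - omega)                    # Eq. (18), component-wise reading
    Q = sum(np.outer(row, row) for row in Omega)         # sum_l Omega(G_l) Omega(G_l)^T
    tau = omega.shape[1]
    objective = lambda b: 0.5 * b @ Q @ b
    constraints = ({"type": "eq", "fun": lambda b: b.sum() - 1.0},)
    bounds = [(0.0, 1.0)] * tau
    res = minimize(objective, x0=np.full(tau, 1.0 / tau),
                   bounds=bounds, constraints=constraints, method="SLSQP")
    return res.x

# Toy example: 3 clusters, 2 base kernels
omega = np.array([[4.0, 1.0], [3.0, 2.0], [5.0, 1.5]])
epsilon = omega + np.array([[2.0, 6.0], [2.5, 5.0], [1.5, 7.0]])
print(solve_kernel_weights(omega, epsilon))              # non-negative, sums to 1
```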

In summary, our K-way min-max cut algorithm takes the following steps iteratively for image clustering and kernel weight determination: (1) β is set equally for all the feature subsets at the first run of iterations. (2) Given the initial values of the kernel weights, our K-way min-max cut algorithm is performed to partition the weakly-tagged images into K clusters according to their pairwise visual similarity contexts. (3) Given an initial partition of the weakly-tagged images, our kernel weight determination algorithm is performed to estimate more suitable kernel weights, so that a more precise characterization of the diverse visual similarity contexts between the images can be achieved. (4) Go to step 2 and continue the loop iteratively until β converges. As shown in Fig. 1(a) and Fig. 2(a), our image clustering algorithm can achieve a good partition of large amounts of weakly-tagged images and determine their global distributions and inter-cluster correlations effectively. Unfortunately, such an image clustering process cannot directly identify the clusters of junk images.
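For reference, the relaxed solution of Eqs. (10)-(11) corresponds to the leading eigenvectors of the normalized kernel matrix W̃, from which a discrete K-way partition can be recovered. The sketch below is our simplified spectral reading of step (2); the row normalization and the use of scikit-learn's KMeans for discretization are assumptions, not the paper's exact procedure:

```python
import numpy as np
from sklearn.cluster import KMeans

def min_max_cut_partition(W, k):
    """Spectral relaxation of the K-way min-max cut (Eqs. (9)-(11)), simplified sketch.
    W: (n, n) symmetric kernel matrix; k: number of clusters."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    W_norm = D_inv_sqrt @ W @ D_inv_sqrt                  # W~ = D^(-1/2) W D^(-1/2)
    eigvals, eigvecs = np.linalg.eigh(W_norm)             # ascending eigenvalues
    top = eigvecs[:, -k:]                                 # eigenvectors of the k largest eigenvalues
    top = top / np.linalg.norm(top, axis=1, keepdims=True)  # row-normalize before discretization
    # Discretization step (assumed): k-means on the spectral embedding
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(top)

# Toy example with two well-separated groups of points
rng = np.random.default_rng(4)
pts = np.vstack([rng.normal(0, 0.3, (8, 2)), rng.normal(3, 0.3, (8, 2))])
W = np.exp(-((pts[:, None] - pts[None]) ** 2).sum(-1))
print(min_max_cut_partition(W, k=2))

# In the alternating loop of steps (1)-(4), this partition step is interleaved with the
# kernel-weight update (e.g., the solve_kernel_weights sketch above) until beta converges.
```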

4.2 Relevance Re-Ranking

For different users, their motivations for spam tagging are significantly different, and their images for spam tagging thus contain different content and have different visual properties. As a result, the clusters of junk images (which come from different users with different motivations) tend to be small. Based on this observation, it is reasonable for us to define the relevance score ρ(C, G_i) for a given image cluster G_i with the image topic C as:

ρ(C, G_i) = Σ_{x∈G_i} P(x, C) / Σ_{y∈C} P(y, C)    (19)

where x and y are used to represent particular weakly-tagged images for the image topic C, and P(x, C) and P(y, C) are the co-occurrence probabilities for the images x and y with the image topic C.

In order to leverage the inter-cluster correlations for achieving more effective relevance re-ranking, a random walk process is performed for automatic relevance score refinement [15]. For a given image topic C, our image clustering algorithm can automatically determine a cluster correlation network (i.e., K image clusters and their inter-cluster correlations), as shown in Fig. 1(a) and Fig. 2(a). We use ρ_l(G_i) to denote the relevance score for the i-th image cluster G_i at the l-th iteration. The relevance scores for all these K image clusters at the l-th iteration form a column vector ρ⃗_l ≡ [ρ_l(G_i)]_{K×1}. We further define Φ as a K × K transition matrix; its element φ_{G_i,G_j} is the probability of the transition from the image cluster G_i to its inter-related image cluster G_j. φ_{G_i,G_j} is defined as:

φ_{G_i,G_j} = s(G_i, G_j) / Σ_{G_h∈C} s(G_i, G_h)    (20)

where s(G_i, G_j) is the inter-cluster visual similarity context between two image clusters G_i and G_j as defined in Eq. (4).

The random walk process is then formulated as:

ρ_l(G_i) = θ Σ_{j∈Ω_i} ρ_{l−1}(G_j) φ_{G_i,G_j} + (1 − θ) ρ(C, G_i)    (21)

where Ω_i is the set of first-order nearest neighbors of the image cluster G_i on the cluster correlation network, ρ(C, G_i) is the initial relevance score for the image cluster G_i, and θ is a weight parameter. This random walk process promotes the image clusters which have many connections on the cluster correlation network, i.e., the image clusters which have close visual properties (stronger visual similarity contexts) with other image clusters. On the other hand, this random walk process weakens the isolated image clusters on the cluster correlation network, i.e., the image clusters which have weak visual correlations with other image clusters. This random walk process is terminated when the relevance scores converge.
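Eqs. (19)-(21) amount to iterating a damped linear update over the cluster correlation network until the scores stop changing. A compact sketch (function and variable names are ours; θ is the free damping parameter):

```python
import numpy as np

def rerank_clusters(S, initial_scores, theta=0.85, tol=1e-8, max_iter=1000):
    """Random-walk relevance refinement of Eq. (21).
    S: (K, K) inter-cluster similarity matrix s(G_i, G_j) with zero diagonal.
    initial_scores: rho(C, G_i) of Eq. (19), one score per cluster."""
    phi = S / S.sum(axis=1, keepdims=True)        # transition matrix of Eq. (20)
    base = np.asarray(initial_scores, dtype=float)
    rho = base.copy()
    for _ in range(max_iter):
        rho_new = theta * phi @ rho + (1.0 - theta) * base
        if np.abs(rho_new - rho).max() < tol:     # stop when the scores converge
            return rho_new
        rho = rho_new
    return rho

# Clusters are then ranked by the converged scores; the top-k clusters are kept and the
# rest (typically the small, weakly connected spam clusters) are filtered out.
S = np.array([[0, 3, 1], [3, 0, 2], [1, 2, 0]], dtype=float)
print(rerank_clusters(S, initial_scores=[0.5, 0.4, 0.1]))
```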


Figure 3: Different views of our topic network.

By performing a random walk over the cluster correlation network, our relevance score refinement algorithm can re-rank the relevance between the image clusters and the image topic C more precisely. The top-k image clusters, which have higher relevance scores with the image topic, are then selected as the most relevant image clusters for the given image topic C. By integrating the cluster correlation network and the random walk for relevance re-ranking, our spam tag detection algorithm can filter out the junk images effectively, as shown in Fig. 1(b) and Fig. 2(b). By filtering out the junk images, we can automatically create large-scale training images with more reliable labels to learn more accurate classifiers for object detection and scene recognition.

5. Cross-Modal Tag Cleansing

The appearance of synonyms may result in insufficient image collections, which may prevent the underlying machine learning techniques from learning reliable classifiers for the synonymous image topics. On the other hand, the appearance of polysemes may result in image sets with huge visual diversity, which may also prevent the underlying machine learning tools from learning precise classifiers for the polysemous image topics. To leverage large-scale weakly-tagged images for computer vision tasks, it is very attractive to develop cross-modal tag cleansing techniques for addressing the issues of synonyms and polysemes more effectively.

5.1 Combining Synonymous Topics

When people tag their images, they may use multiple text terms with similar meanings interchangeably. Thus the image tags are inter-related, and such inter-related tags and their relevant images should be considered jointly. Based on this observation, a topic network is constructed automatically for characterizing such inter-tag (inter-topic) similarity contexts more precisely. Our topic network consists of two key components: (a) a large number of image topics; and (b) their cross-modal inter-topic correlations. The cross-modal inter-topic correlations consist of two components: (1) inter-topic co-occurrence correlations; and (2) inter-topic visual similarity contexts.

For two given image topics C_i and C_j, their visual similarity context γ(C_i, C_j) is defined as:

γ(C_i, C_j) = (1 / (2|C_i||C_j|)) Σ_{u∈C_i} Σ_{v∈C_j} [κ(u, v) + κ̃(u, v)]    (22)

where |C_i| and |C_j| are the numbers of the weakly-tagged images for the image topics C_i and C_j, κ(u, v) is the kernel-based visual similarity context between two weakly-tagged images u and v obtained by using the kernel weights for the image topic C_i, and κ̃(u, v) is the kernel-based visual similarity context between u and v obtained by using the kernel weights for the image topic C_j.

The co-occurrence correlation β(C_i, C_j) between two image topics C_i and C_j is defined as:

β(C_i, C_j) = −P(C_i, C_j) · log [ P(C_i, C_j) / (P(C_i) + P(C_j)) ]    (23)

where P(C_i, C_j) is the co-occurrence probability for the two image topics C_i and C_j, and P(C_i) and P(C_j) are the occurrence probabilities for the image topics C_i and C_j.

The cross-modal inter-topic correlation between two image topics C_i and C_j is finally defined as:

ϕ(C_i, C_j) = α · γ(C_i, C_j) + (1 − α) · β(C_i, C_j)    (24)

where α is the weighting factor and it is determined through cross-validation. The topic network for our image collections is shown in Fig. 3, where each image topic is linked with its most relevant image topics, i.e., those with larger values of ϕ(·, ·).
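The three quantities of Eqs. (22)-(24) can be assembled as below. The sketch assumes the two topic-specific kernel matrices have already been computed (e.g., with the mixture-of-kernels weights learned for C_i and C_j, respectively) and treats the probabilities as simple frequency estimates:

```python
import numpy as np

def visual_similarity_context(K_i, K_j):
    """Eq. (22): K_i, K_j are (|C_i|, |C_j|) kernel matrices between the images of topics
    C_i and C_j, computed with the kernel weights of C_i and of C_j respectively."""
    n_i, n_j = K_i.shape
    return (K_i + K_j).sum() / (2.0 * n_i * n_j)

def cooccurrence_correlation(p_ij, p_i, p_j):
    """Eq. (23): negative weighted log-ratio of the joint to the summed marginal probabilities."""
    return -p_ij * np.log(p_ij / (p_i + p_j))

def inter_topic_correlation(gamma, beta, alpha=0.5):
    """Eq. (24): cross-modal combination; alpha is tuned by cross-validation."""
    return alpha * gamma + (1.0 - alpha) * beta

# Toy example
rng = np.random.default_rng(2)
K_i, K_j = rng.random((4, 6)), rng.random((4, 6))
gamma = visual_similarity_context(K_i, K_j)
beta = cooccurrence_correlation(p_ij=0.02, p_i=0.05, p_j=0.07)
print(inter_topic_correlation(gamma, beta, alpha=0.6))
```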

Our K-way min-max cut algorithm is further performed on the topic network for topic clustering, so that the synonymous topics are grouped into the same cluster and can be combined into one super-topic. The images for these synonymous topics may share similar visual properties and semantics, thus they are combined and assigned to the super-topic automatically, and a more comprehensive set of the relevant images can be obtained. The multiple tags for interpreting these synonymous topics are combined as one unified phrase for tagging the super-topic. Through combining the synonymous topics and their similar images, we can obtain a more sufficient set of images to achieve more reliable learning of the classifier for the corresponding super-topic.

5.2 Splitting Polysemous Topics

Some image topics may be polysemous, which may result in large amounts of ambiguous images with diverse visual properties. Using the ambiguous images for classifier training may result in classifiers with high variance and low generalization ability. To address the issue of polysemes, automatic image clustering is performed to split the polysemous topics by partitioning their ambiguous images into multiple clusters with more homogeneous visual properties. Thus our K-way min-max cut algorithm is used to partition the ambiguous images under the same polysemous topic into multiple groups automatically, and each group may correspond to one certain sub-topic with more homogeneous visual properties and a smaller semantic gap.

To address the issue of the polysemous topics more effectively, WordNet is first incorporated to identify the candidates of the polysemous topics. For a given candidate P of the polysemous topics, all its weakly-tagged images are first partitioned into multiple clusters according to their visual similarity contexts by using our K-way min-max cut algorithm. The visual diversity Ω(P) for the given candidate P is defined as:

Ω(P) = Σ_{G_i,G_j∈P} [ (µ(G_i) − µ(G_j)) / (σ(G_i) + σ(G_j)) ]²    (25)

where µ(G_i) and µ(G_j) are the means of the image clusters G_i and G_j, and σ(G_i) and σ(G_j) are the variances of the image clusters G_i and G_j.

The candidates with large visual diversity between their images are treated as the polysemous topics and are further partitioned into multiple sub-topics. For a given polysemous topic, all its ambiguous images are partitioned into multiple clusters automatically, and each cluster may correspond to one certain sub-topic. By assigning the ambiguous images for the polysemous topic into multiple sub-topics, we can obtain multiple image sets with more homogeneous visual properties, which may have better correspondences between the tags (i.e., sub-topics) and the image semantics (i.e., smaller semantic gaps). Through splitting the polysemous topics and their ambiguous images, we can obtain: (a) multiple sub-topics with smaller semantic gaps and visual diversity; and (b) more precise image collections (with smaller visual diversity) which can be used to achieve more accurate learning of the classifiers for multiple sub-topics with smaller semantic gaps.
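Eq. (25) only needs per-cluster statistics. In the sketch below we read µ(G) and σ(G) as the feature mean and the (averaged) feature variance of a cluster and use the norm of the mean gap; both readings and the splitting threshold are our assumptions:

```python
import numpy as np

def visual_diversity(cluster_features):
    """Eq. (25): sum over cluster pairs of the squared, variance-normalized mean gap.
    cluster_features: list of (n_i, d) arrays, one per image cluster of candidate P."""
    means = [f.mean(axis=0) for f in cluster_features]
    spreads = [f.var(axis=0).mean() for f in cluster_features]   # scalar spread per cluster
    diversity = 0.0
    for i in range(len(means)):
        for j in range(i + 1, len(means)):
            gap = np.linalg.norm(means[i] - means[j])
            diversity += (gap / (spreads[i] + spreads[j])) ** 2
    return diversity

def is_polysemous(cluster_features, threshold=5.0):
    """Candidates with large visual diversity are treated as polysemous and split (threshold assumed)."""
    return visual_diversity(cluster_features) > threshold

# Toy example: two visually distinct clusters under one candidate topic
rng = np.random.default_rng(3)
clusters = [rng.normal(0, 1, (20, 8)), rng.normal(4, 1, (20, 8))]
print(visual_diversity(clusters), is_polysemous(clusters))
```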

6. Algorithm Evaluation

We have carried out our experimental studies by using large-scale weakly-tagged Flickr images. We have downloaded more than 10 million Flickr images. Our algorithm evaluation work focuses on evaluating how well our techniques can address the issues of spams, polysemes and synonyms. To evaluate the performance of our algorithms on spam tag detection and cross-modal tag cleansing, we have designed an interactive system for searching and exploring large-scale collections of Flickr images. The benchmark metrics for algorithm evaluation are the precision ρ and the recall ϱ for image retrieval. They are defined as:

ρ = ϑ / (ϑ + ξ),    ϱ = ϑ / (ϑ + ν)    (26)

where ϑ is the set of images that are relevant to the given image topic and are returned correctly, ξ is the set of images that are irrelevant to the given image topic and are returned incorrectly, and ν is the set of images that are relevant to the given image topic but are not returned. In our experiments, only the top 200 images are used for calculating the precision and recall rates.

Figure 4: The comparison on the precision rates after and before performing spam tag detection.

Figure 5: The comparison on the recall rates after and before merging the synonymous topics.
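Eq. (26) over the top-200 returned images can be computed as below (hypothetical function name; the sets ϑ, ξ, ν are represented by their cardinalities):

```python
def precision_recall(returned, relevant, top_k=200):
    """Eq. (26): precision rho and recall over the top_k returned images.
    returned: ranked list of image ids; relevant: set of ground-truth relevant ids."""
    top = returned[:top_k]
    hits = sum(1 for img in top if img in relevant)          # |theta|: relevant and returned
    false_alarms = len(top) - hits                           # |xi|: irrelevant but returned
    misses = len(relevant) - hits                            # |nu|: relevant but not returned
    precision = hits / (hits + false_alarms) if top else 0.0
    recall = hits / (hits + misses) if relevant else 0.0
    return precision, recall

print(precision_recall(returned=["a", "b", "c", "d"], relevant={"a", "c", "e"}, top_k=200))
```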

The precision rate is used to characterize the accuracy of our system in finding the particular images of interest, thus it can be used to assess the effectiveness of our spam tag detection algorithm. As shown in Fig. 4, one can observe that our spam tag detection algorithm can filter out the junk images effectively and results in higher precision rates for image retrieval. On the other hand, the recall rate is used to characterize the completeness of our system in finding the particular images of interest, thus it can be used to assess the effectiveness of our cross-modal tag cleansing algorithm in addressing the issue of synonymous tags. As shown in Fig. 5, one can observe that our cross-modal tag cleansing algorithm can combine the synonymous topics and their similar images effectively and results in higher recall rates for image retrieval.

To evaluate the effectiveness of our cross-modal tag cleansing algorithm in dealing with the polysemous tags, we have compared the precision rates before and after separating the polysemous tags and their ambiguous images. Some results are shown in Fig. 6; one can observe that our cross-modal tag cleansing algorithm can tackle the issue of polysemous tags effectively. By splitting the polysemous topics and their ambiguous images into multiple sub-topics, our system can achieve higher precision rates for image retrieval.

Figure 6: The precision rates for some query terms before and after separating the polysemous topics and their ambiguous images.

Figure 7: The precision rates for 5000 query terms: (a) our system; (b) Flickr search.

We have also compared the precision and recall rates between our system (which provides techniques to deal with the critical issues of spam tags, synonymous tags, and polysemous tags) and the Flickr search system (which does not provide techniques to deal with these critical issues). As shown in Fig. 7 and Fig. 8, one can observe that our system can achieve higher precision and recall rates for all these 5000 queries (i.e., 5000 tags of interest in our experiments) by addressing the critical issues of spams, synonyms and polysemes effectively.

Figure 8: The recall rates for 5000 query terms: (a) our system; (b) Flickr search.

7. Conclusions

The objective of this work is to create large amounts of training images with more reliable labels for computer vision tasks by harvesting from large-scale weakly-tagged images. A novel cross-modal tag cleansing and junk image filtering algorithm is developed by integrating both the visual similarity contexts between the images and the semantic similarity contexts between their tags for cleansing the weakly-tagged images and their social tags. Our experiments on large-scale collections of weakly-tagged Flickr images have provided very positive results. We will also release our image sets with more reliable labels on our web site.

References

[1] A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta and R. Jain, “Content-based image retrieval at the end of the early years”, IEEE Trans. on PAMI, 2000.

[2] J. Fan, C. Yang, Y. Shen, N. Babaguchi, H. Luo, “Leveraging large-scale weakly-tagged images to train inter-related classifiers for multi-label annotation”, Proc. of first ACM workshop on Large-scale multimedia retrieval and mining, 2009.

[3] Flickr, http://www.flickr.com.

[4] R. Fergus, P. Perona, A. Zisserman, “A visual category filter for Google Images”, ECCV, 2004.

[5] T. Berg, D. Forsyth, “Animals on the Web”, IEEE CVPR, 2006.

[6] L. Li, G. Wang, L. Fei-Fei, “OPTIMOL: automatic online picture collection via incremental model learning”, IEEE CVPR, 2007.

[7] F. Schroff, A. Criminisi, A. Zisserman, “Harvesting image databases from the web”, IEEE ICCV, 2007.

[8] B.C. Russell, A. Torralba, R. Fergus, W.T. Freeman, “80 million tiny images: a large dataset for non-parametric object and scene recognition”, IEEE Trans. on PAMI, vol. 30, no. 11, 2008.

[9] K. Barnard, M. Johnson, “Word sense disambiguation with pictures”, Artificial Intelligence, vol. 167, pp. 13-30, 2005.

[10] J. Yuan, Y. Wu, M. Yang, “Discovery of collocation patterns: from visual words to visual phrases”, IEEE CVPR, 2007.

[11] C. Ding, X. He, H. Zha, M. Gu, H. Simon, “A Min-max Cut Algorithm for Graph Partitioning and Data Clustering”, ICDM, 2001.

[12] J. Shi, J. Malik, “Normalized cuts and image segmentation”, IEEE Trans. on PAMI, 2000.

[13] J. Zhang, M. Marszalek, S. Lazebnik, C. Schmid, “Local features and kernels for classification of texture and object categories: A comprehensive study”, Intl. Journal of Computer Vision, vol. 73, no. 2, 2007.

[14] J. Fan, Y. Gao, H. Luo, “Integrating concept ontology and multi-task learning to achieve more effective classifier training for multi-level image annotation”, IEEE Trans. on Image Processing, vol. 17, no. 3, pp. 407-426, 2008.

[15] W. Hsu, L. Kennedy, S.F. Chang, “Video search reranking through random walk over document-level context graph”, ACM Multimedia, 2007.