

IntentSearch: Capturing User Intention for One-Click Internet Image Search

Xiaoou Tang, Fellow, IEEE, Ke Liu, Jingyu Cui, Student Member, IEEE, Fang Wen, Member, IEEE, and Xiaogang Wang, Member, IEEE

Abstract: Web-scale image search engines (e.g., Google Image Search, Bing Image Search) mostly rely on surrounding text features. It is difficult for them to interpret users' search intention only by query keywords, and this leads to ambiguous and noisy search results which are far from satisfactory. It is important to use visual information in order to solve the ambiguity in text-based image retrieval. In this paper, we propose a novel Internet image search approach. It only requires the user to click on one query image with minimum effort, and images from a pool retrieved by text-based search are re-ranked based on both visual and textual content. Our key contribution is to capture the user's search intention from this one-click query image in four steps. (1) The query image is categorized into one of the predefined adaptive weight categories, which reflect the user's search intention at a coarse level. Inside each category, a specific weight schema is used to combine visual features adaptive to this kind of image to better re-rank the text-based search result. (2) Based on the visual content of the query image selected by the user and through image clustering, query keywords are expanded to capture user intention. (3) Expanded keywords are used to enlarge the image pool to contain more relevant images. (4) Expanded keywords are also used to expand the query image to multiple positive visual examples, from which new query-specific visual and textual similarity metrics are learned to further improve content-based image re-ranking. All these steps are automatic, without extra effort from the user. This is critically important for any commercial web-based image search engine, where the user interface has to be extremely simple. Besides this key contribution, a set of visual features which are both effective and efficient in Internet image search is designed. Experimental evaluation shows that our approach significantly improves the precision of top-ranked images and also the user experience.

Index Terms: Image search, intention, image re-ranking, adaptive similarity, keyword expansion

    1 INTRODUCTION

MANY commercial Internet-scale image search engines use only keywords as queries. Users type query keywords in the hope of finding a certain type of images. The search engine returns thousands of images ranked by the keywords extracted from the surrounding text. It is well known that text-based image search suffers from the ambiguity of query keywords. The keywords provided by users tend to be short. For example, the average query length of the top 1,000 queries of Picsearch is 1.368 words, and 97% of them contain only one or two words [1]. They cannot describe the content of images accurately. The search results are noisy and consist of images with quite different semantic meanings. Figure 1 shows the top-ranked images from Bing image search using "apple" as query. They belong to different categories, such as "green apple", "red apple", "apple logo", and "iphone", because of the ambiguity of the word "apple".

X. Tang and K. Liu are with the Department of Information Engineering, the Chinese University of Hong Kong, Hong Kong.

J. Cui is with the Department of Electrical Engineering, Stanford University, USA.

F. Wen is with Microsoft Research Asia, China.

X. Wang is with the Department of Electronic Engineering, the Chinese University of Hong Kong, Hong Kong.

Fig. 1. Top ranked images returned from Bing image search using "apple" as query.

The ambiguity issue occurs for several reasons. First, the query keywords' meanings may be richer than users' expectations. For example, the meanings of the word "apple" include apple fruit, apple computer, and apple iPod. Second, the user may not have enough knowledge of the textual description of target images. For example, if users do not know "gloomy bear" as the name of a cartoon character (shown in Figure 2(a)), they have to input "bear" as the query to search for images of gloomy bear. Lastly and most importantly, in many cases it is hard for users to accurately describe the visual content of target images using keywords.

In order to resolve the ambiguity, additional information has to be used to capture the user's search intention.

One way is text-based keyword expansion, which makes the textual description of the query more detailed. Existing linguistically related methods find either synonyms or other linguistically related words from a thesaurus, or find words frequently co-occurring with the query keywords.


Fig. 2. (a) Images of "gloomy bear". (b) Google Related Searches for the query "bear".

For example, Google image search provides the "Related Searches" feature to suggest likely keyword expansions. However, even with the same query keywords, the intentions of users can be highly diverse and cannot be accurately captured by these expansions. As shown in Figure 2(b), "gloomy bear" is not among the keyword expansions suggested by Google Related Searches.

Another way is content-based image retrieval with relevance feedback. Users label multiple positive and negative image examples. A query-specific visual similarity metric is learned from the selected examples and used to rank images. The requirement of more user effort makes it unsuitable for web-scale commercial systems like Bing image search and Google image search, in which users' feedback has to be minimized.

We do believe that adding visual information to image search is important. However, the interaction has to be as simple as possible. The absolute minimum is one click. In this paper, we propose a novel Internet image search approach. It requires the user to give only one click on a query image, and images from a pool retrieved by text-based search are re-ranked based on their visual and textual similarities to the query image. We believe that users will tolerate one-click interaction, which has been used by many popular text-based search engines. For example, Google requires a user to select a suggested textual query expansion with one click to get additional results. The key problem to be solved in this paper is how to capture user intention from this one-click query image. Four steps are proposed as follows.

(1) Adaptive similarity. We design a set of visual features to describe different aspects of images. How to integrate various visual features to compute the similarities between the query image and other images is an important problem. In this paper, an Adaptive Similarity is proposed, motivated by the idea that a user always has a specific intention when submitting a query image. For example, if the user submits a picture with a big face in the middle, most probably he/she wants images with similar faces, and using face-related features is more appropriate. In our approach, the query image is first categorized into one of the predefined adaptive weight categories, such as "portrait" and "scenery". Inside each category, a specific pre-trained weight schema is used to combine visual features adapted to this kind of image to better re-rank the text-based search result. This correspondence between the query image and its proper similarity measurement reflects the user intention. This initial re-ranking result is not good enough and will be improved by the following steps.

(2) Keyword expansion. Query keywords input by users tend to be short, and some important keywords may be missed because of users' lack of knowledge of the textual description of target images. In our approach, query keywords are expanded to capture the user's search intention, inferred from the visual content of the query image, which is not considered in traditional keyword expansion approaches. A word w is suggested as an expansion of the query if a cluster of images are visually similar to the query image and all contain the same word w (see footnote 1). The expanded keywords better capture the user's search intention, since the consistency of both visual content and textual description is ensured.

(3) Image pool expansion. The image pool retrieved by text-based search accommodates images with a large variety of semantic meanings, and the number of images related to the query image is small. In this case, re-ranking images in the pool is not very effective. Thus a more accurate keyword query is needed to narrow the intention and retrieve more relevant images. A naive way is to ask the user to click on one of the suggested keywords given by traditional approaches that only use text information and to expand the query results, as in Google Related Searches. This increases the user's burden. Moreover, the suggested keywords based on text information only are not accurate enough to describe the user's intention. Keyword expansions suggested by our approach, which uses both visual and textual information, better capture the user's intention. They are automatically added to the text query and enlarge the image pool to include more relevant images. Feedback from users is not required. Our experiments show that this significantly improves the precision of top-ranked images.

(4) Visual query expansion. One query image is not diverse enough to capture the user's intention. In Step (2), a cluster of images which all contain the same expanded keywords and are visually similar to the query image is found. They are selected as expanded positive examples to learn visual and textual similarity metrics, which are more robust and more specific to the query, for image re-ranking. Compared with the weight schema in Step (1), these similarity metrics reflect the user's intention at a finer level, since every query image has different metrics. Different from relevance feedback, this visual expansion does not require the user's feedback.

All these four steps are automatic, with only one click in the first step, without increasing the user's burden. This makes it possible for Internet-scale image search by both textual and visual content with a very simple user interface.

    1. The word w does not have to be contained by the query image.


Our one-click intention modeling in Step (1) has been proven successful in industrial applications [2], [3] and is now used in the Bing image search engine [4] (see footnote 2). This work extends the approach with Steps (2)-(4) to further improve the performance greatly.

    2 RELATED WORK

    2.1 Image Search and Visual Expansion

Many Internet-scale image search methods [5]-[9] are text-based and are limited by the fact that query keywords cannot describe image content accurately. Content-based image retrieval [10] uses visual features to evaluate image similarity. Many visual features [11]-[17] were developed for image search in recent years. Some are global features, such as GIST [11] and HoG [12]. Some quantize local features, such as SIFT [13], into visual words and represent images as bags of visual words (BoV) [14]. In order to preserve the geometry of visual words, spatial information was encoded into the BoV model in multiple ways. For example, Zhang et al. [17] proposed geometry-preserving visual phrases, which capture the local and long-range spatial layouts of visual words.

One of the major challenges of content-based image retrieval is to learn visual similarities which well reflect the semantic relevance of images. Image similarities can be learned from a large training set where the relevance of pairs of images is known [18]. Deng et al. [19] learned visual similarities from a hierarchical structure defined on semantic attributes of training images. Since web images are highly diversified, defining a set of attributes with hierarchical relationships for them is challenging. In general, learning a universal visual similarity metric for generic images is still an open problem.

Some visual features may be more effective for certain query images than others. In order to make the visual similarity metrics more specific to the query, relevance feedback [20]-[26] was widely used to expand visual examples. The user was asked to select multiple relevant and irrelevant image examples from the image pool. A query-specific similarity metric was learned from the selected examples. For example, in [20]-[22], [24], [25], discriminative models were learned from the examples labeled by users, using support vector machines or boosting, to classify relevant and irrelevant images. In [26] the weights for combining different types of features were adjusted according to users' feedback. Since the number of user-labeled images is small for supervised learning methods, Huang et al. [27] proposed probabilistic hypergraph ranking under the semi-supervised learning framework. It utilizes both labeled and unlabeled images in the learning procedure. Relevance feedback requires more user effort. For a web-scale commercial system, users' feedback has to be limited to the minimum, such as one-click feedback.

2. The "similar images" function of http://www.bing.com/images/.

In order to reduce the user's burden, pseudo-relevance feedback [28], [29] expands the query image by taking the top N images visually most similar to the query image as positive examples. However, due to the well-known semantic gap, the top N images may not all be semantically consistent with the query image. This may reduce the performance of pseudo-relevance feedback. Chum et al. [30] used RANSAC to verify the spatial configurations of local visual features and to purify the expanded image examples. However, it is only applicable to object retrieval. It requires users to draw the image region of the object to be retrieved and assumes that relevant images contain the same object. Under the framework of pseudo-relevance feedback, Ah-Pine et al. [31] proposed trans-media similarities, which combine both textual and visual features. Krapac et al. [32] proposed query-relative classifiers, which combine visual and textual information, to re-rank images retrieved by an initial text-only search. However, since users are not required to select query images, the user's intention cannot be accurately captured when the semantic meanings of the query keywords have large diversity.

We conducted the first study that combines text and image content for image search directly on the Internet in [33], where simple visual features and clustering algorithms were used to demonstrate the great potential of such an approach. Following our intent image search work in [2] and [3], a visual query suggestion method was developed in [34]. Its difference from [2] and [3] is that, instead of asking the user to click on a query image for re-ranking, the system asks users to click on a list of keyword-image pairs generated off-line using a dataset from Flickr and searches images on the web based on the selected keyword. The problem with this approach is that, on the one hand, the dataset from Flickr is too small compared with the entire Internet and thus cannot cover the unlimited possibilities of Internet images; on the other hand, the keyword-image suggestions for any input query are generated from the millions of images of the whole dataset, and thus are expensive to compute and may produce a large number of unrelated keyword-image pairs.

Besides visual query expansion, some approaches [35], [36] used concept-based query expansion, mapping textual query keywords or visual query examples to high-level semantic concepts. They need a pre-defined concept lexicon whose detectors are learned off-line from fixed training sets. These approaches are suitable for closed databases but not for web-based image search, since the limited number of concepts cannot cover the numerous images on the Internet. The idea of learning an example-specific visual similarity metric was explored in previous work [37], [38]. However, it requires training a specific visual similarity for every example in the image pool, which is assumed to be fixed. This is impractical in our application, where the image pool returned by text-based search constantly changes for different query keywords.


Moreover, text information, which can significantly improve visual similarity learning, was not considered in previous work.

    2.2 Keyword Expansion

In our approach, keyword expansion is used to expand the retrieved image pool and to expand the positive examples. Keyword expansion was mainly used in document retrieval. Thesaurus-based methods [39], [40] expand query keywords with their linguistically related words, such as synonyms and hypernyms. Corpus-based methods, such as the well-known term clustering [41] and Latent Semantic Indexing [42], measure the similarity of words based on their co-occurrences in documents. Words most similar to the query keywords are chosen as the textual query expansion. Some image search engines have the feature of expanded keyword suggestion. They mostly use surrounding text.

Some algorithms [43], [44] generate tag suggestions or annotations based on visual content for input images. Their goal is not to improve the performance of image re-ranking. Although they can be viewed as options for keyword expansion, some difficulties prevent them from being directly applied to our problem. Most of them assume fixed keyword sets, which are hard to obtain for image re-ranking in the open and dynamic web environment. Some annotation methods require supervised training, which is also difficult for our problem. Different from image annotation, our method provides extra image clusters during the procedure of keyword expansion, and such image clusters can be used as visual expansions to further improve the performance of image re-ranking.

    3 METHOD

    3.1 Overview

The flowchart of our approach is shown in Figure 3. The user first submits query keywords q. A pool of images is retrieved by text-based search (see footnote 3) (Figure 3(a)). Then the user is asked to select a query image from the image pool. The query image is classified into one of the predefined adaptive weight categories. Images in the pool are re-ranked (Figure 3(b)) based on their visual similarities to the query image, and the similarities are computed using the weight schema (Figure 3(c), described in Section 3.3) specified by the category to combine visual features (Section 3.2).

In the keyword expansion step (Figure 3(d), described in Section 3.4), words are extracted from the textual descriptions (such as image file names and surrounding text in the HTML pages) of the top k images most similar to the query image, and the tf-idf method [45] is used to rank these words. To save computational cost, only the top m words are reserved as candidates for further processing. However, because the initial image re-ranking result is still ambiguous and noisy, the top k images may have a large diversity of semantic meanings and cannot be used as the visual query expansion.

    3. In this paper, it is from Bing image search [4].

The word with the highest tf-idf score computed from the top k images is not reliable enough to be chosen as the keyword expansion either. In our approach, reliable keyword expansions are found through further image clustering. For each candidate word $w_i$, we find all the images containing $w_i$ and group them into different clusters $\{c_{i,1}, c_{i,2}, \ldots, c_{i,t_i}\}$ based on visual content. As shown in Figure 3(d), images with the same candidate word may have a large diversity in visual content. Images assigned to the same cluster have higher semantic consistency, since they have high visual similarity to one another and contain the same candidate word. Among all the clusters of different candidate words, the cluster $c_{i,j}$ with the largest visual similarity to the query image is selected as the visual query expansion (Figure 3(e), described in Section 3.5), and its corresponding word $w_i$ is selected to form the expanded keyword query $q' = q + w_i$.
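A minimal sketch of this cluster-based selection is given below, purely for illustration. The clustering algorithm (k-means here), the cosine similarity, and all variable names are assumptions, not the paper's actual implementation; only the overall logic (cluster images containing each candidate word, pick the cluster most similar to the query) follows the description above.

```python
# Illustrative sketch: select the expansion word and image cluster that are
# most similar to the clicked query image. Not the paper's implementation.
import numpy as np
from sklearn.cluster import KMeans

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def select_expansion(query_feat, pool_feats, pool_words, candidates, n_clusters=5):
    """Return (best_word, member_indices) of the cluster closest to the query.

    pool_feats : (N, D) array of visual features of the retrieved pool.
    pool_words : list of N sets of words from each image's textual description.
    candidates : candidate expansion words (e.g. the top tf-idf words).
    """
    best_word, best_members, best_score = None, [], -np.inf
    for w in candidates:
        idx = [i for i, words in enumerate(pool_words) if w in words]
        if len(idx) < n_clusters:          # not enough images containing w
            continue
        feats = pool_feats[idx]
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)
        for c in range(n_clusters):
            members = [idx[i] for i in np.where(labels == c)[0]]
            centroid = pool_feats[members].mean(axis=0)
            score = cosine_sim(query_feat, centroid)   # cluster-to-query similarity
            if score > best_score:
                best_word, best_members, best_score = w, members, score
    return best_word, best_members
```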

A query-specific visual similarity metric (Section 3.5) and a query-specific textual similarity metric (Section 3.7) are learned from both the query image and the visual query expansion. The image pool is enlarged by combining the original image pool retrieved by the query keywords q provided by the user and an additional image pool retrieved by the expanded keywords q' (Figure 3(f), described in Section 3.6). Images in the enlarged pool are re-ranked using the learned query-specific visual and textual similarity metrics (Figure 3(g)). The size of the image cluster selected as the visual query expansion and its similarity to the query image indicate the confidence that the expansion captures the user's search intention. If they are below certain thresholds, the expansion is not used in image re-ranking.

    3.2 Visual Feature Design

We design and adopt a set of features that are both effective in describing the visual content of images from different aspects and efficient in their computational and storage complexity. Some of them are existing features proposed in recent years. Some are new features first proposed by us or extensions of existing features. It takes an average of 0.01 ms to compute the similarity between two features on a machine with a 3.0 GHz CPU. The total space needed to store all features of an image is 12 KB. More advanced visual features developed in recent years or in the future can also be incorporated into our framework.

    3.2.1 Existing Features

Gist. Gist [11] characterizes the holistic appearance of an image, and works well for scenery images.

SIFT. We adopt the 128-dimensional SIFT [13] to describe regions around Harris interest points. SIFT descriptors are quantized according to a codebook of 450 words.

Daubechies Wavelet. We use the second-order moments of wavelet coefficients in various frequency bands (DWave) to characterize the texture properties of the image [46].


[Figure 3: (a) text-based search result of the query "apple"; (b) re-ranking without query expansion; (c) adaptive weighting schemes for the query image (General Object, Object with Simple Background, Scene, Portrait, People); (d) keyword expansion via image clusters and extracted candidate words (e.g. "green"); (e) visual query expansion; (f) image pool expansion (images retrieved by the original keywords "apple" and by the expanded keywords "green apple"); (g) re-ranking result with visual expansion.]

Fig. 3. An example to illustrate our algorithm. The details of steps (c)-(f) are given in Sections 3.3-3.6.

Histogram of Gradient (HoG). HoG [12] reflects the distributions of edges over different parts of an image, and is especially effective for images with strong, long edges.

    3.2.2 New Features

Attention Guided Color Signature. The color signature [47] describes the color composition of an image. After clustering the colors of pixels in the LAB color space, the cluster centers and their relative proportions are taken as the signature. We propose a new Attention Guided Color Signature (ASig), a color signature that accounts for the varying importance of different parts of an image. We use an attention detector [48] to compute a saliency map for the image, and then perform k-means clustering weighted by this map. The distance between two ASigs can be calculated efficiently using the Earth Mover's Distance algorithm [47].

Color Spatialet. We design a novel feature, Color Spatialet, to characterize the spatial distribution of colors. An image is divided into n x n patches. Within each patch, we calculate its main color as the largest cluster after k-means clustering. The image is characterized by the Color Spatialet (CSpa), a vector of $n^2$ color values. In our experiments, we take n = 9. We account for some spatial shifting and resizing of objects in the images when calculating the distance between two CSpas A and B:

$$d(A, B) = \sum_{i=1}^{n} \sum_{j=1}^{n} \min_{|p| \le 1,\, |q| \le 1} d\left(A_{i,j},\, B_{i+p,\, j+q}\right),$$

where $A_{i,j}$ denotes the main color of the (i, j)-th block of the image. The Color Spatialet describes the spatial configuration of colors. By capturing only the main color, it is robust to slight color changes due to lighting, white balance, and imaging noise. Since a local shift is allowed when calculating the distance between two Color Spatialets, the feature is shift-invariant to some extent and has resilience to misalignment (a brief code sketch of this distance is given after this feature list).

Multi-Layer Rotation Invariant EOH. The Edge Orientation Histogram (EOH) [49] describes the histogram of edge orientations. We incorporate rotation invariance when comparing two EOHs by rotating one of them to best match the other. This results in a Multi-Layer Rotation Invariant EOH (MRI-EOH). Also, when calculating the MRI-EOH, a threshold parameter is required to filter out weak edges. We use multiple thresholds to obtain multiple EOHs that characterize image edge distributions at different scales.

Facial Feature. Face existence and face appearance give clear semantic interpretations of the image. We apply a face detection algorithm [50] to each image, and obtain the number of faces, face sizes, and positions as features to describe the image from a facial perspective.
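The following is a small sketch of the shift-tolerant Color Spatialet distance defined above. The per-block distance (Euclidean distance between main colors, e.g. in LAB space) is an assumption; the formula only fixes the min over a one-block neighborhood.

```python
# Sketch of the Color Spatialet distance d(A, B): each block's main color in A
# is matched against the best block within a +/-1 shift in B. Euclidean
# per-block distance is an assumption of this sketch.
import numpy as np

def cspa_distance(A, B):
    """A, B: (n, n, 3) arrays of per-block main colors (e.g. LAB values)."""
    n = A.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(n):
            best = np.inf
            for p in (-1, 0, 1):            # allow a +/-1 block shift
                for q in (-1, 0, 1):
                    ii, jj = i + p, j + q
                    if 0 <= ii < n and 0 <= jj < n:
                        best = min(best, np.linalg.norm(A[i, j] - B[ii, jj]))
            total += best
    return total
```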

    3.3 Adaptive Weight Schema

Humans can easily categorize images into high-level semantic classes, such as scene, people, or object. We observed that images inside these categories usually agree on the relative importance of features for similarity calculations. Inspired by this observation, we assign query images to several typical categories, and adaptively adjust the feature weights within each category. Suppose an image i from query category $Q_q$ is characterized using F visual features. The adaptive similarity between images i and j is defined as

$$s^q(i, j) = \sum_{m=1}^{F} \alpha^q_m\, s_m(i, j),$$

where $s_m(i, j)$ is the similarity between images i and j on feature m, and $\alpha^q_m$ expresses the importance of feature m for measuring similarities for query images from category $Q_q$. We further constrain $\alpha^q_m \ge 0$ and $\sum_m \alpha^q_m = 1$.
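A minimal sketch of this weighted combination is shown below. The category names and the weight values are illustrative placeholders, not the trained weights; only the formula $s^q(i,j) = \sum_m \alpha^q_m s_m(i,j)$ is taken from the text.

```python
# Illustrative adaptive similarity: per-feature similarities are combined with
# the weight schema of the category the query image falls into.
import numpy as np

CATEGORY_WEIGHTS = {
    # non-negative weights summing to 1 (alpha_m^q); values are made up
    "portrait": {"face": 0.5, "asig": 0.2, "gist": 0.1, "hog": 0.1, "cspa": 0.1},
    "scene":    {"gist": 0.4, "cspa": 0.3, "asig": 0.2, "hog": 0.1, "face": 0.0},
}

def adaptive_similarity(feature_sims, category):
    """feature_sims: dict mapping feature name -> s_m(i, j) for one image pair."""
    weights = CATEGORY_WEIGHTS[category]
    return sum(weights[m] * feature_sims[m] for m in weights)

def rerank(pool_feature_sims, category):
    """pool_feature_sims: list of per-image feature-similarity dicts to the query."""
    scores = [adaptive_similarity(s, category) for s in pool_feature_sims]
    return np.argsort(scores)[::-1]          # pool indices, most similar first
```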

    3.3.1 Query Categorization

The query categories we consider are General Object, Object with Simple Background, Scenery Image, Portrait, and People. We use 500 manually labeled images, 100 for each category, to train a C4.5 decision tree for query categorization.


The features we use for query categorization are: the existence of faces, the number of faces in the image, the percentage of the image frame taken up by the face region, the coordinate of the face center relative to the center of the image, Directionality (the kurtosis of the Edge Orientation Histogram, Section 3.2), Color Spatial Homogeneousness (the variance of values in different blocks of the Color Spatialet, Section 3.2), the total energy of the edge map obtained from the Canny operator, and Edge Spatial Distribution (the variance of edge energy over a 3x3 regular block partition of the image, characterizing whether edge energy is mainly distributed at the image center).
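A rough, runnable sketch of this categorization step follows. The paper trains a C4.5 decision tree; here an entropy-criterion CART tree from scikit-learn stands in for C4.5, and the feature vectors are assumed to have been computed already from the measurements listed above. The toy data at the bottom only shows the call pattern.

```python
# Query categorization sketch: an entropy-based decision tree (C4.5 stand-in)
# trained on pre-computed query-categorization feature vectors.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

CATEGORIES = ["general object", "object with simple background",
              "scene", "portrait", "people"]

def train_categorizer(features, labels):
    """features: (n_samples, D) array of categorization features; labels: category indices."""
    clf = DecisionTreeClassifier(criterion="entropy")   # entropy split, C4.5-like
    return clf.fit(features, labels)

def categorize_query(clf, query_features):
    """Return the adaptive-weight category name for one query image."""
    return CATEGORIES[int(clf.predict(query_features.reshape(1, -1))[0])]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(500, 9)), rng.integers(0, 5, size=500)  # toy data only
    model = train_categorizer(X, y)
    print(categorize_query(model, rng.normal(size=9)))
```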

    3.3.2 Feature Fusion

In each query category $Q_q$, we pre-train a set of optimal weights $\alpha^q_m$ based on the RankBoost framework [51]. For a query image i, a real-valued feedback function $\Phi_i(j, k)$ is defined to denote the preference between images j and k. We set $\Phi_i(j, k) > 0$ if image k should be ranked above image j, and 0 otherwise. The feedback function $\Phi_i$ induces a distribution over all pairs of images (j, k):

$$D_i(j, k) = \frac{\Phi_i(j, k)}{\sum_{j,k} \Phi_i(j, k)},$$

and the ranking loss for query image i using the similarity measurement $s^q(i, \cdot)$ is

$$L_i = \Pr_{(j,k) \sim D_i}\left[s^q(i, k) \le s^q(i, j)\right].$$

We use an adaptation of the RankBoost framework (Algorithm 1) to minimize the loss function with respect to $\alpha^q_m$, which gives the optimal weight schema for category $Q_q$. Steps 2, 3, 4, and 8 differ from the original RankBoost algorithm to accommodate our application, and are detailed as follows:

Algorithm 1: Feature weight learning for a certain query category

1. Input: initial weights $D_i$ for all query images i in the current intention category $Q_q$; similarity matrices $s_m(i, \cdot)$ for all query images i and features m.
2. Initialize: set step t = 1; set $D^1_i = D_i$ for all i.
while not converged do
    for each query image $i \in Q_q$ do
        3. Select the best feature $m_t$ and the corresponding similarity $s_{m_t}(i, \cdot)$ for the current re-ranking problem under weight $D^t_i$.
        4. Calculate the ensemble weight $\alpha_t$ according to Equation (1).
        5. Adjust the weights: $D^{t+1}_i(j, k) \propto D^t_i(j, k) \exp\{\alpha_t [s_{m_t}(i, j) - s_{m_t}(i, k)]\}$.
        6. Normalize $D^{t+1}_i$ to make it a distribution.
        7. t++.
    end for
end while
8. Output: the final optimal similarity measure for the current intention category,

$$s^q(\cdot, \cdot) = \frac{\sum_t \alpha_t\, s_{m_t}(\cdot, \cdot)}{\sum_t \alpha_t},$$

and the weight for feature m,

$$\alpha^q_m = \frac{\sum_{t:\, m_t = m} \alpha_t}{\sum_t \alpha_t}.$$

Step 2: Initialization. The training images are categorized into the five main classes. Images within each main class are further categorized into sub-classes. Images in each sub-class are visually similar. Besides, a few images are labeled as "noise" (irrelevant images) or "neglect" (hard to judge relevancy). Given a query image i, we define four image sets: $S^1_i$ includes images within the same sub-class as i; $S^2_i$ includes images within the same main class as i, excluding those in $S^1_i$; $S^3_i$ includes images labeled as "neglect"; and $S^4_i$ includes images labeled as "noise". For any image $j \in S^1_i$ and any image $k \in S^2_i \cup S^4_i$, we set $\Phi_i(k, j) = 1$. In all other cases, we set $\Phi_i(k, j) = 0$.

Step 3: Select the best feature. We need to select the feature that performs best under the current weight $D^t_i$ for query image i. This step is very efficient, since we constrain our weak ranker to be one of the F similarity measurements $s_m(\cdot, \cdot)$. The best feature $m_t$ is found by enumerating all F features.

Step 4: Calculate the ensemble weight. It is proven in [51] that minimizing

$$Z_t = \frac{1 - r_t}{2} e^{\alpha_t} + \frac{1 + r_t}{2} e^{-\alpha_t}$$

in each step of boosting is approximately equivalent to minimizing the upper bound of the rank loss, where $r_t = \sum_{j,k} D^t_i(j, k)\,[s_{m_t}(i, k) - s_{m_t}(i, j)]$. Since we are looking for a single weighting scheme for each category, variations in $\alpha_t$ obtained for different query images are penalized by an additional smoothness term. The objective becomes

$$Z_t = \frac{1 - r_t}{2} e^{\alpha_t} + \frac{1 + r_t}{2} e^{-\alpha_t} + \frac{\mu}{2}\left(e^{\alpha_t - \alpha_{t-1}} + e^{\alpha_{t-1} - \alpha_t}\right),$$

where $\mu$ is a hyperparameter balancing the new term and the old terms. In our implementation, $\mu = 1$ is used. Note that the third term takes its minimum value if and only if $\alpha_t = \alpha_{t-1}$, so by imposing this new term we are looking for a common $\alpha^q_m$ for all query images i in the current intention category, while trying to reduce all the losses $L_i$, $i \in Q_q$. Setting $\partial Z_t / \partial \alpha_t = 0$, we find that $Z_t$ is minimized when

$$\alpha_t = \frac{1}{2} \ln \frac{1 + r_t + \mu e^{\alpha_{t-1}}}{1 - r_t + \mu e^{-\alpha_{t-1}}}. \qquad (1)$$

Step 8: Output the final weights for feature fusion. The final output of the new RankBoost algorithm is a linear combination of all the base rankers generated in each step. However, since there are actually only F base rankers, the output is equivalent to a weighted combination of the F similarity measurements.
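Below is a simplified, hedged sketch of this modified RankBoost loop. The data layout, the feature-selection rule (largest $r_t$), the convergence criterion (fixed iteration count), and the way the previous $\alpha_{t-1}$ is shared across query images are all assumptions of this sketch, not the paper's exact procedure; only the update rules and Equation (1) follow the text above.

```python
# Sketch of Algorithm 1: learn category-level feature-fusion weights alpha_m^q
# by boosting over pairwise preferences, with the smoothed ensemble weight of
# Equation (1).
import numpy as np

def learn_category_weights(sims, pair_weights, mu=1.0, n_iters=50):
    """sims: list over query images; sims[i] is an (F, N) array with s_m(i, j).
    pair_weights: list of (N, N) arrays Phi_i(j, k) (>0 if k should rank above j)."""
    F = sims[0].shape[0]
    alpha_sum = np.zeros(F)                  # accumulated alpha_t per selected feature
    alpha_total = 0.0
    D = [P / P.sum() for P in pair_weights]  # D_i(j, k): one distribution per query image
    alpha_prev = 0.0
    for _ in range(n_iters):
        for i, S in enumerate(sims):
            # diff[m, j, k] = s_m(i, k) - s_m(i, j)
            diff = S[:, None, :] - S[:, :, None]
            # r_t for every feature: sum_{j,k} D_i(j,k) [s_m(i,k) - s_m(i,j)]
            r = np.tensordot(diff, D[i], axes=([1, 2], [0, 1]))
            m_t = int(np.argmax(r))          # selection rule is an assumption
            r_t = r[m_t]
            # Equation (1), with the smoothness term toward the previous alpha
            alpha_t = 0.5 * np.log((1 + r_t + mu * np.exp(alpha_prev)) /
                                   (1 - r_t + mu * np.exp(-alpha_prev)))
            # Step 5: D^{t+1}(j,k) ~ D^t(j,k) exp{alpha_t [s(i,j) - s(i,k)]}
            upd = D[i] * np.exp(-alpha_t * diff[m_t])
            D[i] = upd / upd.sum()           # Step 6: renormalize
            alpha_sum[m_t] += alpha_t
            alpha_total += alpha_t
            alpha_prev = alpha_t
    return alpha_sum / alpha_total           # alpha_m^q, the per-feature fusion weights
```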

    3.4 Keyword Expansion

Once the top k images most similar to the query image are found according to the visual similarity metric introduced in Sections 3.2 and 3.3, words from their textual descriptions (see footnote 4) are extracted and ranked using the term frequency-inverse document frequency (tf-idf) method [45]. The top m (m = 5 in our experiments) words are reserved as candidates for query expansion.

    4. The file names and the surrounding texts of images.
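A minimal sketch of this candidate extraction is shown below. It assumes the idf statistics are computed over the descriptions of the whole retrieved pool and that candidate words are ranked by their summed tf-idf weight over the top k images; the paper's exact weighting may differ.

```python
# Illustrative tf-idf candidate extraction over image textual descriptions
# (file names plus surrounding text), using scikit-learn's vectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer

def candidate_words(pool_descriptions, top_k_indices, m=5):
    """pool_descriptions: one string per image in the retrieved pool."""
    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform(pool_descriptions)          # (N, vocab) sparse matrix
    scores = tfidf[top_k_indices].sum(axis=0).A1          # summed weight per word
    vocab = vec.get_feature_names_out()
    order = scores.argsort()[::-1]
    return [vocab[i] for i in order[:m]]

# Usage: top_k_indices would come from the adaptive-similarity re-ranking step,
# e.g. candidate_words(descriptions, top_k_indices=[3, 7, 12, 20, 41], m=5)
```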


Fig. 6. An example of image re-ranking using "Paris" as the query keyword and an image of the Eiffel Tower as the query image. Irrelevant images are marked by red rectangles.

... text-based image search result. If the query keywords do not capture the user's search intention accurately, there are only a small number of relevant images with the same semantic meanings as the query image in the image pool. This can significantly degrade the ranking performance. In Section 3.3, we re-rank the top N images retrieved by the original keyword query based on their visual similarities to the query image. We remove the N/2 images with the lowest ranks from the image pool. Using the expanded keywords as the query, the top N/2 retrieved images are added to the image pool. We believe that there are more relevant images in the image pool with the help of the expanded query keywords. The re-ranking result obtained by extending the image pool and the positive example images is shown in Figure 6(b), which is significantly improved compared with Figure 6(a).
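A small sketch of this pool-expansion rule follows: keep the better half of the original pool and append top results retrieved with the expanded keywords. The duplicate handling is an assumption of the sketch, not specified above.

```python
# Illustrative image-pool expansion: drop the N/2 lowest-ranked images of the
# original pool and add the top N/2 images retrieved with the expanded query q'.
def expand_image_pool(ranked_original_ids, expanded_query_ids):
    """ranked_original_ids: image ids sorted by visual similarity to the query.
    expanded_query_ids: ids returned by text search with the expanded keywords."""
    n = len(ranked_original_ids)
    kept = ranked_original_ids[: n // 2]                  # keep the better half
    added, seen = [], set(kept)
    for img_id in expanded_query_ids:                     # take up to N/2 new images
        if img_id not in seen:
            added.append(img_id)
            seen.add(img_id)
        if len(added) == n // 2:
            break
    return kept + added
```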

    3.7 Combining Visual and Textual Similarities

Learning a query-specific textual similarity metric from the positive examples $E = \{e_1, \ldots, e_j\}$ obtained by visual query expansion, and combining it with the query-specific visual similarity metric introduced in Section 3.5, can further improve the performance of image re-ranking. For a selected query image, a word probability model is trained from E and used to compute the textual distance $dist_T$. We adopt the approach in [31]. Let $\theta$ be the parameter of a discrete distribution of words over the dictionary. Each image i is regarded as a document $d_i$ whose words are extracted from its textual descriptions (see the definition in Section 3.4). $\theta$ is learned by maximizing the observed probability

$$\prod_{e_i \in E} \prod_{w \in d_i} \left(\lambda\, p(w \mid \theta) + (1 - \lambda)\, p(w \mid C)\right)^{n_{iw}},$$

where $\lambda$ is a fixed parameter set to 0.5, w is a word, and $n_{iw}$ is the frequency of w in $d_i$. $p(w \mid C)$ is the word probability built upon the whole repository C:

$$p(w \mid C) = \frac{\sum_{d_i} n_{iw}}{|C|}.$$

$\theta$ can be learned by the Expectation-Maximization algorithm. Once $\theta$ is learned, the textual distance of an image k to the positive examples is defined by the cross-entropy function:

$$dist_T(k) = -\sum_{w} p(w \mid d_k) \log p(w \mid \theta),$$

where $p(w \mid d_k) = n_{kw} / |d_k|$. Finally, this textual distance is combined with the visual similarity $sim_V$ obtained in Section 3.5 to re-rank images:

$$\beta\, sim_V(k) + (1 - \beta)\, dist_T(k),$$

where $\beta$ is a fixed parameter set to 0.5.
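A sketch of this textual model is given below, assuming a standard EM update for the fixed-weight mixture with the background model and the cross-entropy distance above. Whether $dist_T$ is negated (so that larger combined scores are better) before mixing with $sim_V$ is this sketch's assumption, not stated in the text.

```python
# Illustrative query-specific textual model: fit theta by EM on the positive
# examples E with a background language model p(w|C), then score images by
# cross-entropy and mix with the visual similarity.
import numpy as np
from collections import Counter

def learn_theta(example_docs, p_C, vocab, lam=0.5, n_iter=50):
    """example_docs: token lists of the positive examples E.
    p_C: background word probabilities aligned with vocab."""
    idx = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros(len(vocab))
    for doc in example_docs:
        for w, c in Counter(doc).items():
            if w in idx:
                counts[idx[w]] += c
    theta = np.full(len(vocab), 1.0 / len(vocab))
    for _ in range(n_iter):
        resp = lam * theta / (lam * theta + (1 - lam) * p_C + 1e-12)   # E-step
        theta = counts * resp                                           # M-step
        theta /= theta.sum() + 1e-12
    return theta

def dist_T(doc, theta, vocab):
    idx = {w: i for i, w in enumerate(vocab)}
    counts = Counter(w for w in doc if w in idx)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # cross-entropy of the image's word distribution against theta
    return -sum((c / total) * np.log(theta[idx[w]] + 1e-12) for w, c in counts.items())

def combined_score(sim_v, d_t, beta=0.5):
    # Negating the distance so that larger is better is this sketch's choice.
    return beta * sim_v + (1 - beta) * (-d_t)
```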

    3.8 Summary

The goal of the proposed framework is to capture user intention, and this is achieved in multiple steps. The user intention is first roughly captured by classifying the query image into one of the coarse semantic categories and choosing a proper weight schema accordingly. The adaptive visual similarity obtained from the selected weight schema is used in all the following steps. Then, according to the query keywords and the query image provided by the user, the user intention is further captured in two aspects: (1) finding more query keywords (called keyword expansion) describing the user intention more accurately; and (2) in the meanwhile, finding a cluster of images (called visual query expansion) which are both visually and semantically consistent with the query image. The keyword expansion frequently co-occurs with the query keywords, and the visual expansion is visually similar to the query image. Moreover, it is required that all the images in the cluster of the visual query expansion contain the same keyword expansion. Therefore, the keyword expansion and the visual expansion support each other and are obtained simultaneously. In the later steps, the keyword expansion is used to expand the image pool to include more images relevant to the user intention, and the visual query expansion is used to learn visual and textual similarity metrics which better reflect the user intention.

4 EXPERIMENTAL EVALUATION

In the first experiment, 300,000 web images are manually labeled into different classes (images that are semantically similar) as ground truth. The precisions of different approaches are compared. The semantic meanings of images are closely related to the user's intention. However, they are not exactly the same. Images of the same class (thus with similar semantic meanings) can be visually quite different. Thus, a user study is conducted in the second experiment to evaluate whether the search results capture the user's intention well. Running on a machine with a 3 GHz CPU and without optimizing the code, it takes less than half a second of computation for each query.

    4.1 Experiment One: Evaluation with Ground Truth

Fifty query keywords are chosen for evaluation. Using each keyword as a query, the top 1,000 images are crawled from Bing image search. These images are manually labeled into different classes. For example, for the query "apple", its images are labeled as "red apple", "apple ipod", "apple pie", etc. There are in total 700 classes for all fifty query keywords. Another 500 images are crawled from Bing image search using each keyword expansion as a query. These images are also manually labeled. There are in total around 300,000 images in our data set. A small portion of them are outliers and not assigned to any category (e.g., some images are irrelevant to the query keywords). The threshold (defined in Section 3.4) is chosen as 0.3 through cross-validation and is fixed in all the experiments. The performance is stable when it varies between 0.25 and 0.35.

    4.1.1 Precisions on Different Steps of Our Framework

The top m precision, i.e., the proportion of relevant images among the top m ranked images, is used to evaluate the performance of image re-ranking. Images are considered to be relevant if they are labeled as the same class. For each query keyword, image re-ranking is repeated many times by choosing different query images. Except for those outlier images not assigned to any category, every image returned by the keyword query has been chosen as the query image.
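A tiny helper for this metric, for illustration only:

```python
# Top m precision: fraction of relevant images among the top m of a ranked list.
def top_m_precision(ranked_labels, query_label, m):
    """ranked_labels: class labels of the re-ranked images, best first."""
    top = ranked_labels[:m]
    return sum(1 for lbl in top if lbl == query_label) / float(m)

# e.g. top_m_precision(["red apple", "apple pie", "red apple"], "red apple", 3) -> 2/3
```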

In order to evaluate the effectiveness of the different steps of our proposed image re-ranking framework, we compare the following approaches. From (1) to (8), more and more steps in Figure 3 are added.

(1) Text-based: text-based search from Bing. It is used as the baseline.

(2) GW: image re-ranking using global weights to combine visual features (Global Weight).

(3) AW: image re-ranking using the adaptive weight schema to combine visual features (Section 3.3).

[Figure 7: averaged top m precision curves for m = 10 to 90, comparing (1) Text-based, (2) GW, (3) AW, (4) ExtEg, (5) GW+ExtPool, (6) ExtPool, (7) ExtBoth(V), and (8) ExtBoth(V+T).]

Fig. 7. Comparison of averaged top m precisions on different steps.

(4) ExtEg: image re-ranking by extending positive examples only, from which the query-specific visual similarity metric is learned and used (Section 3.5).

(5) GW+ExtPool: image re-ranking by extending the image pool only (Section 3.6), while using global weights to combine visual features.

(6) ExtPool: similar to GW+ExtPool, but using the adaptive weight schema to combine visual features.

(7) ExtBoth(V): image re-ranking by extending both the image pool and the positive example images. Only the query-specific visual similarity metric is used.

(8) ExtBoth(V+T): similar to ExtBoth(V), but combining the query-specific visual and textual similarity metrics (Section 3.7). This is the complete approach proposed by us.

The averaged top m precisions are shown in Figure 7. Approaches (2)-(7) only use visual similarity. Approach (8) uses both visual and textual similarities. Approaches (2) and (3) are initial image re-ranking based on the text-based search results in (1). Their difference is the way visual features are combined. We can see that by using a single query image we can significantly improve the text-based image search result. The proposed adaptive weight schema, which reflects user intention at a coarse level, outperforms the global weight. After initial re-ranking using the adaptive weight, the top 50 precision of the text-based search is improved from 19.5% to 32.9%. Approaches (4), (6), (7) and (8) are based on the initial image re-ranking result in (3). We can clearly see the effectiveness of expanding the image pool and expanding the positive image examples through keyword expansion and image clustering. These steps capture user intention at a finer level, since for every query image the image pool and positive examples are expanded differently. Using the expansions, the top 50 precision of the initial re-ranking using the adaptive weight is improved from 32.9% to 51.9%.

Our keyword expansion step (Section 3.4) could be replaced by other equivalent methods, such as image annotation methods.


[Figure 8: averaged top m precision curves for m = 10 to 90, comparing (1) ExtBoth(V), (2) ExtPool, (3) ExtPoolByTfIdf, and (4) ExtPoolByWTfIdf.]

Fig. 8. Comparison of averaged top m precisions of keyword expansion through image clustering with the other two methods.

[Figure 9: averaged top m precision curves for m = 10 to 90, comparing (1) ExtBoth(V+T), (2) CrossMedia, (3) NPRF, and (4) PRF.]

Fig. 9. Comparison of averaged top m precisions with existing methods.

As discussed in Section 2.2, many existing image annotation methods cannot be directly applied to our problem. We compare with two image annotation approaches which do not require a fixed set of keywords, by replacing our keyword expansion step with them. One is to choose the word with the highest tf-idf score as the keyword expansion to extend the image pool (ExtPoolByTfIdf). The other uses the method proposed in [44], which weights the tf-idf score by the images' visual similarities, to extend the image pool (ExtPoolByWTfIdf). Figure 8 shows the result. Our ExtPool has better performance. Moreover, image annotation only aims at ranking words and cannot automatically provide visual query expansion as our method does. Therefore, our ExtBoth has an even better performance than the other two.

    4.1.2 Comparison with Other Methods

In this section, we compare with several existing approaches [29], [31], [53] which can be applied to image re-ranking with only one-click feedback, as discussed in Section 2.2.

(1) ExtBoth(V+T): our approach.

(2) CrossMedia: image re-ranking by the trans-media distances defined in [31]. It combined both visual and textual features under the pseudo-relevance feedback framework.

(3) NPRF: image re-ranking by the pseudo-relevance feedback approach proposed in [29]. It used top-ranked images as positive examples and bottom-ranked images as negative examples to train an SVM.

(4) PRF: image re-ranking by the pseudo-relevance feedback approach proposed in [53]. It used top-ranked images as positive examples to train a one-class SVM.

Figure 9 shows the result. Our algorithm outperforms the others, especially when m is large.

TABLE 1
Average Number of Irrelevant Images

                 No visual expansions   With visual expansions
Top 10 images    4.26                   2.33
Top 5 images     1.67                   0.91
Top 3 images     0.74                   0.43

Fig. 10. Percentages of cases in which users think the result with visual expansions is "much better" (40.88%), "somewhat better" (26.90%), "similar" (22.58%), "somewhat worse" (7.82%), or "much worse" (1.82%) than the result of only using the intention weight schema.

    4.2 Experiment Two: User Study

The purpose of the user study is to evaluate the effectiveness of visual expansions (expanding both the image pool and the positive visual examples) in capturing user intention. Forty users are invited. For each query keyword, the user is asked to browse the images and to randomly select an image of interest as the query example. We show them the initial image re-ranking results of using the adaptive weight schema (Section 3.3) and the results of extending both the image pool and the positive example images. The users are asked to do the following:

• Mark irrelevant images among the top 10 images.

• Compare the top 50 retrieved images of both results, and choose whether the final result with visual expansions is "much better", "somewhat better", "similar", "somewhat worse" or "much worse" than that of the initial re-ranking using the adaptive weight schema.


Each user is assigned 5 query keywords from the 50 keywords. Given each query keyword, the user is asked to choose 30 different query images and compare their re-ranking results. As shown in Table 1, visual expansions significantly reduce the average number of irrelevant images among the top 10 images. Figure 10 shows that in most cases (>67%) the users think visual expansions improve the result.

    4.3 Discussion

In our approach, the keyword expansion (Section 3.4), visual query expansion (Section 3.5), and image pool expansion (Section 3.6) are all affected by the quality of the initial image re-ranking result (Section 3.3). According to our experimental evaluation and user study, if the quality of the initial image re-ranking is reasonable, which means that there are a few relevant examples among the top-ranked images, the following expansion steps can significantly improve the re-ranking performance. Inappropriate expansions which significantly deteriorate the performance happen in the cases when the initial re-ranking result is very poor. The chance of this is lower than 2%, according to our user study in Figure 10.

In this paper, it is assumed that an image captures user intention when it is both semantically and visually similar to the query image. However, in some cases user intention cannot be well expressed by a single query image. For instance, the user may be interested in only part of the image. In those cases, more user interactions, such as labeling the regions which the user thinks are important, have to be allowed. However, this adds more user burden and is not considered in this paper.

    5 CONCLUSION

In this paper, we propose a novel Internet image search approach which only requires one-click user feedback. An intention-specific weight schema is proposed to combine visual features and to compute visual similarity adaptive to query images. Without additional human feedback, textual and visual expansions are integrated to capture user intention. Expanded keywords are used to extend positive example images and also to enlarge the image pool to include more relevant images. This framework makes it possible for industrial-scale image search by both text and visual content. The proposed new image re-ranking framework consists of multiple steps, which can be improved separately or replaced by other equally effective techniques. In future work, this framework can be further improved by making use of query log data, which provides valuable co-occurrence information of keywords, for keyword expansion. One shortcoming of the current system is that sometimes duplicate images show up as similar images to the query. This can be improved by including duplicate detection in future work. Finally, to further improve the quality of re-ranked images, we intend to combine this work with the photo quality assessment work in [54]-[56] to re-rank images not only by content similarity but also by the visual quality of the images.

    REFERENCES

[1] F. Jing, C. Wang, Y. Yao, K. Deng, L. Zhang, and W. Ma, "IGroup: Web image search results clustering," in Proc. ACM Multimedia, 2006.
[2] J. Cui, F. Wen, and X. Tang, "Real time Google and Live image search re-ranking," in Proc. ACM Multimedia, 2008.
[3] J. Cui, F. Wen, and X. Tang, "IntentSearch: Interactive on-line image search re-ranking," in Proc. ACM Multimedia, 2008.
[4] Bing image search, http://www.bing.com/images.
[5] N. Ben-Haim, B. Babenko, and S. Belongie, "Improving web-based image search via content based clustering," in Proc. Int'l Workshop on Semantic Learning Applications in Multimedia, 2006.
[6] R. Fergus, P. Perona, and A. Zisserman, "A visual category filter for Google images," in Proc. European Conf. Computer Vision, 2004.
[7] G. Park, Y. Baek, and H. Lee, "Majority based ranking approach in web image retrieval," in Proc. the 2nd International Conference on Image and Video Retrieval, 2003.
[8] Y. Jing and S. Baluja, "PageRank for product image search," in Proc. Int'l Conf. World Wide Web, 2008.
[9] W. H. Hsu, L. S. Kennedy, and S.-F. Chang, "Video search reranking via information bottleneck principle," in Proc. ACM Multimedia, 2006.
[10] R. Datta, D. Joshi, and J. Z. Wang, "Image retrieval: Ideas, influences, and trends of the new age," ACM Computing Surveys, vol. 40, pp. 1-60, 2007.
[11] A. Torralba, K. Murphy, W. Freeman, and M. Rubin, "Context-based vision system for place and object recognition," in Proc. Int'l Conf. Computer Vision, 2003.
[12] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2005.
[13] D. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[14] J. Sivic and A. Zisserman, "Video Google: a text retrieval approach to object matching in videos," in Proc. Int'l Conf. Computer Vision, 2003.
[15] Y. Cao, C. Wang, Z. Li, L. Zhang, and L. Zhang, "Spatial-bag-of-features," in Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2010.
[16] J. Philbin, M. Isard, J. Sivic, and A. Zisserman, "Descriptor learning for efficient retrieval," in Proc. European Conf. Computer Vision, 2010.
[17] Y. Zhang, Z. Jia, and T. Chen, "Image retrieval with geometry-preserving visual phrases," in Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2011.
[18] G. Chechik, V. Sharma, U. Shalit, and S. Bengio, "Large scale online learning of image similarity through ranking," Journal of Machine Learning Research, vol. 11, pp. 1109-1135, 2010.
[19] J. Deng, A. C. Berg, and L. Fei-Fei, "Hierarchical semantic indexing for large scale image retrieval," in Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2011.
[20] K. Tieu and P. Viola, "Boosting image retrieval," International Journal of Computer Vision, vol. 56, no. 1, pp. 17-36, 2004.
[21] S. Tong and E. Chang, "Support vector machine active learning for image retrieval," in Proc. ACM Multimedia, 2001.
[22] Y. Chen, X. Zhou, and T. Huang, "One-class SVM for learning in image retrieval," in Proc. IEEE Int'l Conf. Image Processing, 2001.
[23] Y. Lu, H. Zhang, L. Wenyin, and C. Hu, "Joint semantics and feature based image retrieval using relevance feedback," IEEE Trans. on Multimedia, vol. 5, no. 3, pp. 339-347, 2003.
[24] D. Tao and X. Tang, "Random sampling based SVM for relevance feedback image retrieval," in Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2004.
[25] D. Tao, X. Tang, X. Li, and X. Wu, "Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 28, pp. 1088-1099, 2006.
[26] T. Quack, U. Monich, L. Thiele, and B. Manjunath, "Cortina: a system for large-scale, content-based web image retrieval," in Proc. ACM Multimedia, 2004.


[27] Y. Huang, Q. Liu, S. Zhang, and D. N. Metaxas, "Image retrieval via probabilistic hypergraph ranking," in Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2011.
[28] R. Yan, E. Hauptmann, and R. Jin, "Multimedia search with pseudo-relevance feedback," in Proc. Int'l Conf. on Image and Video Retrieval, 2003.
[29] R. Yan, A. G. Hauptmann, and R. Jin, "Negative pseudo-relevance feedback in content-based video retrieval," in Proc. ACM Multimedia, 2003.
[30] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman, "Total recall: Automatic query expansion with a generative feature model for object retrieval," in Proc. Int'l Conf. Computer Vision, 2007.
[31] J. Ah-Pine, M. Bressan, S. Clinchant, G. Csurka, Y. Hoppenot, and J. Renders, "Crossing textual and visual content in different application scenarios," Multimedia Tools and Applications, vol. 42, pp. 31-56, 2009.
[32] J. Krapac, M. Allan, J. Verbeek, and F. Jurie, "Improving web image search results using query-relative classifiers," in Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2010.
[33] B. Luo, X. Wang, and X. Tang, "A world wide web based image search engine using text and image content features," in Proc. IS&T/SPIE Electronic Imaging, Internet Imaging IV, 2003.
[34] Z. Zha, L. Yang, T. Mei, M. Wang, and Z. Wang, "Visual query suggestion," in Proc. ACM Multimedia, 2009.
[35] A. Natsev, A. Haubold, J. Tesic, L. Xie, and R. Yan, "Semantic concept-based query expansion and re-ranking for multimedia retrieval," in Proc. ACM Multimedia, 2007.
[36] J. Smith, M. Naphade, and A. Natsev, "Multimedia semantic indexing using model vectors," in Proc. Int'l Conf. Multimedia and Expo, 2003.
[37] A. Frome, Y. Singer, F. Sha, and J. Malik, "Learning globally-consistent local distance functions for shape-based image retrieval and classification," in Proc. Int'l Conf. Computer Vision, 2007.
[38] Y. Lin, T. Liu, and C. Fuh, "Local ensemble kernel learning for object category recognition," in Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2007.
[39] S. Liu, F. Liu, C. Yu, and W. Meng, "An effective approach to document retrieval via utilizing WordNet and recognizing phrases," in Proc. ACM Special Interest Group on Information Retrieval, 2004.
[40] S. Kim, H. Seo, and H. Rim, "Information retrieval using word senses: root sense tagging approach," in Proc. ACM Special Interest Group on Information Retrieval, 2004.
[41] K. Sparck Jones, Automatic Keyword Classification for Information Retrieval. Archon Books, 1971.
[42] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407, 1990.
[43] L. Wu, L. Yang, N. Yu, and X. Hua, "Learning to tag," in Proc. Int'l Conf. World Wide Web, 2009.
[44] C. Wang, F. Jing, L. Zhang, and H. Zhang, "Scalable search-based image annotation of personal images," in Proc. the 8th ACM International Workshop on Multimedia Information Retrieval, 2006.
[45] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., 1999.
[46] M. Unser, "Texture classification and segmentation using wavelet frames," IEEE Trans. on Image Processing, vol. 4, no. 11, pp. 1549-1560, 1995.
[47] Y. Rubner, L. Guibas, and C. Tomasi, "The earth mover's distance, multi-dimensional scaling, and color-based image retrieval," in Proc. the ARPA Image Understanding Workshop, 1997.
[48] T. Liu, J. Sun, N. Zheng, X. Tang, and H. Shum, "Learning to detect a salient object," in Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2007.
[49] W. Freeman and M. Roth, "Orientation histograms for hand gesture recognition," in Proc. Int'l Workshop on Automatic Face and Gesture Recognition, 1995.
[50] R. Xiao, H. Zhu, H. Sun, and X. Tang, "Dynamic cascades for face detection," in Proc. Int'l Conf. Computer Vision, 2007.
[51] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer, "An efficient boosting algorithm for combining preferences," Journal of Machine Learning Research, vol. 4, pp. 933-969, 2003.
[52] X. S. Zhou and T. S. Huang, "Relevance feedback in image retrieval: A comprehensive review," Multimedia Systems, vol. 8, pp. 536-544, 2003.
[53] J. He, M. Li, Z. Li, H. Zhang, H. Tong, and C. Zhang, "Pseudo relevance feedback based on iterative probabilistic one-class SVMs in web image retrieval," in Proc. Pacific-Rim Conference on Multimedia, 2004.
[54] Y. Ke, X. Tang, and F. Jing, "The design of high-level features for photo quality assessment," in Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2006.
[55] Y. Luo and X. Tang, "Photo and video quality evaluation: Focusing on the subject," in Proc. European Conf. Computer Vision, 2008.
[56] W. Luo, X. Wang, and X. Tang, "Content-based photo quality assessment," in Proc. Int'l Conf. Computer Vision, 2011.

Xiaoou Tang (S'93-M'96-SM'02-F'09) received the B.S. degree from the University of Science and Technology of China, Hefei, in 1990, and the M.S. degree from the University of Rochester, Rochester, NY, in 1991. He received the Ph.D. degree from the Massachusetts Institute of Technology, Cambridge, in 1996.

He is a Professor in the Department of Information Engineering and Associate Dean (Research) of the Faculty of Engineering of the Chinese University of Hong Kong. He worked as the group manager of the Visual Computing Group at Microsoft Research Asia from 2005 to 2008. His research interests include computer vision, pattern recognition, and video processing.

Dr. Tang received the Best Paper Award at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2009. He is a program chair of the IEEE International Conference on Computer Vision (ICCV) 2009 and an Associate Editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) and the International Journal of Computer Vision (IJCV). He is a Fellow of the IEEE.

Ke Liu received the B.S. degree in Computer Science from Tsinghua University in 2009. He is currently an M.Phil. student in the Department of Information Engineering at the Chinese University of Hong Kong. His research interests include image search and computer vision.

Jingyu Cui received his M.Sc. degree in Electrical Engineering in 2010 from Stanford University, where he is currently working toward his Ph.D. degree. He received his M.Sc. and B.Eng. degrees in Automation from Tsinghua University in 2008 and 2005, respectively. His research interests include visual recognition and retrieval, machine learning, and parallel computing.


Fang Wen received the Ph.D. and M.S. degrees in Pattern Recognition and Intelligent Systems and the B.S. degree in Automation from Tsinghua University in 2003 and 1997, respectively. She is now a researcher in the Visual Computing Group at Microsoft Research Asia. Her research interests include computer vision, pattern recognition, and multimedia search.

Xiaogang Wang (S'03-M'10) received the B.S. degree in Electrical Engineering and Information Science from the University of Science and Technology of China in 2001, and the M.S. degree in Information Engineering from the Chinese University of Hong Kong in 2004. He received the Ph.D. degree in Computer Science from the Massachusetts Institute of Technology. He is currently an assistant professor in the Department of Electronic Engineering at the Chinese University of Hong Kong. His research interests include computer vision and machine learning.