
Corpus-Guided Sentence Generation of Natural Images

Yezhou Yang† and Ching Lik Teo† and Hal Daumé III and Yiannis Aloimonos
University of Maryland Institute for Advanced Computer Studies

College Park, Maryland 20742, USA
{yzyang, cteo, hal, yiannis}@umiacs.umd.edu

Abstract

We propose a sentence generation strategy that describes images by predicting the most likely nouns, verbs, scenes and prepositions that make up the core sentence structure. The input is initial noisy estimates of the objects and scenes detected in the image using state of the art trained detectors. As predicting actions from still images directly is unreliable, we use a language model trained from the English Gigaword corpus to obtain their estimates, together with probabilities of co-located nouns, scenes and prepositions. We use these estimates as parameters of an HMM that models the sentence generation process, with hidden nodes as sentence components and image detections as the emissions. Experimental results show that our strategy of combining vision and language produces readable and descriptive sentences compared to naive strategies that use vision alone.

1 Introduction

What happens when you see a picture? The most natural thing would be to describe it using words: using speech or text. This description of an image is the output of an extremely complex process that involves: 1) perception in the Visual space, 2) grounding to World Knowledge in the Language Space and 3) speech/text production (see Fig. 1). Each of these components is challenging in its own right and is still considered an open problem in the vision and linguistics fields. In this paper, we introduce a computational framework that attempts to integrate these components together.

† indicates equal contribution.

Figure 1: The processes involved for describing a scene.

Our hypothesis is based on the assumption that natural images accurately reflect common everyday scenarios which are captured in language. For example, knowing that boats usually occur over water enables us to constrain the possible scenes in which a boat can occur and exclude highly unlikely ones – street, highway. It also enables us to predict likely actions (Verbs) given the current object detections in the image: detecting a dog with a person will likely induce walk rather than swim, jump, fly. Key to our approach is the use of a large generic corpus such as the English Gigaword [Graff, 2003] as the semantic grounding to predict and correct the initial and often noisy visual detections of an image to produce a reasonable sentence that succinctly describes the image.

In order to get an idea of the difficulty of this task, it is important to first define what makes up a description of an image.


Figure 2: Illustration of various perceptual challenges for sentence generation for images. (a) Different images with semantically the same content. (b) Pose relates ambiguously to actions in real images. See text for details.

Based on our observations of annotated image data (see Fig. 4), a descriptive sentence for an image must contain at minimum: 1) the important objects (Nouns) that participate in the image, 2) some description of the actions (Verbs) associated with these objects, 3) the scene where this image was taken and 4) the preposition that relates the objects to the scene. That is, a quadruplet T = {n, v, s, p} (Noun-Verb-Scene-Preposition) that represents the core sentence structure. Generating a sentence from this quadruplet is obviously a simplification of state of the art generation work, but as we will show in the experimental results (sec. 4), it is sufficient to describe images. The key challenge is that detecting objects, actions and scenes directly from images is often noisy and unreliable. We illustrate this using example images from the Pascal-Visual Object Classes (VOC) 2008 challenge [Everingham et al., 2008]. First, Fig. 2(a) shows the variability of images in their raw image representations: pixels, edges and local features. This makes it difficult for state of the art object detectors [Felzenszwalb et al., 2010; Schwartz et al., 2009] to reliably detect important objects in the scene: boat, humans and water – the average precision reported in [Felzenszwalb et al., 2010] is around 42% for humans and only 11% for boat over a dataset of almost 5000 images in 20 object categories. Yet, these images are semantically similar in terms of their high level description. Second, cognitive studies [Urgesi et al., 2006; Kourtzi, 2004] have proposed that inferring the action from static images (known as an "implied action") is often achieved by detecting the pose of humans in the image: the position of the limbs with respect to one another, under the assumption that a unique pose occurs for a unique action. Clearly, this assumption is weak, as 1) similar actions may be represented by different poses due to the inherent dynamic nature of the action itself, e.g. walking a dog, and 2) different actions may have the same pose, e.g. walking a dog versus running (Fig. 2(b)). The missing component here is whether the key object (dog) under interaction is considered. Recent works [Yao and Fei-Fei, 2010; Yang et al., 2010] that used poses for recognition of actions achieved 70% and 61% accuracy respectively under extremely limited testing conditions with only 5–6 action classes each. Finally, state of the art scene detectors [Oliva and Torralba, 2001; Torralba et al., 2003] need to have enough representative training examples of scenes from pre-defined scene classes for a classification to be successful – with a reported average precision of 83.7% tested over a dataset of 2600 images.

Addressing all these visual challenges is clearly a formidable task which is beyond the scope of this paper. Our focus instead is to show that, with the addition of language to ground the noisy initial visual detections, we are able to improve the quality of the generated sentence as a faithful description of the image. In particular, we show that it is possible to avoid predicting actions directly from images – which is still unreliable – and to use the corpus instead to guide our predictions. Our proposed strategy is also generic, that is, we make no prior assumptions on the image domain considered. While other works (sec. 2) depend on strong annotations between images and text to ground their predictions (and to remove wrong sentences), we show that a large generic corpus is also able to provide the same grounding over larger domains of images. It represents a relatively new style of learning: distant supervision [Liang et al., 2009; Mann and McCallum, 2007]. Here, we do not require "labeled" data containing images and captions but only separate data from each side. Another contribution is a computationally feasible way, via dynamic programming, to determine the most likely quadruplet T ∗ = {n∗, v∗, s∗, p∗} that describes the image for generating possible sentences.


2 Related Work

Recently, several works from the Computer Vision domain have attempted to use language to aid image scene understanding. [Kojima et al., 2000] used predefined production rules to describe actions in videos. [Berg et al., 2004] processed news captions to discover names associated with faces in the images, and [Jie et al., 2009] extended this work to associate poses detected from images with the verbs in the captions. Both approaches use annotated examples from a limited news caption corpus to learn a joint image-text model so that one can annotate new unknown images with textual information easily. Neither of these works has been tested on complex everyday images, where the large variations of objects and poses make it nearly impossible to learn a more general model. In addition, no attempt was made to generate a descriptive sentence from the learned model. The work of [Farhadi et al., 2010] attempts to "generate" sentences by first learning from a set of human annotated examples, and producing the same sentence if both images and sentences share common properties in terms of their triplets: (Nouns-Verbs-Scenes). No attempt was made to generate novel sentences from images beyond what has been annotated by humans. [Yao et al., 2010] has recently introduced a framework for parsing images/videos to textual descriptions that requires significant annotated data, a requirement that our proposed approach avoids.

Natural language generation (NLG) is a long-standing problem. Classic approaches [Traum et al., 2003] are based on three steps: selection, planning and realization. A common challenge in generation problems is the question of: what is the input? Recently, approaches for generation have focused on formal specification inputs, such as the output of theorem provers [McKeown, 2009] or databases [Golland et al., 2010]. Most of the effort in those approaches has focused on selection and realization. We address a tangential problem that has not received much attention in the generation literature: how to deal with noisy inputs. In our case, the inputs themselves are often uncertain (due to misrecognitions by object/scene detectors), and the content selection and realization need to take this uncertainty into account.

3 Our Approach

Our approach is summarized in Fig. 3. The input is a test image where we detect objects and scenes using trained detection algorithms [Felzenszwalb et al., 2010; Torralba et al., 2003]. To keep the framework computationally tractable, we limit the elements of the quadruplet (Nouns-Verbs-Scenes-Prepositions) to come from finite sets of object N, action V, scene S and preposition P classes that are commonly encountered. They are summarized in Table 1. In addition, the sentence that is generated for each image is limited to at most two objects occurring in a unique scene.

Figure 3: Overview of our approach. (a) Detect objects and scenes from the input image. (b) Estimate the optimal sentence structure quadruplet T ∗. (c) Generate a sentence from T ∗.

Denoting the current test image as I, the initial visual processing first detects objects n ∈ N and scenes s ∈ S using these detectors to compute Pr(n|I) and Pr(s|I), the probabilities that object n and scene s exist in I. From the observation that an action can often be predicted by its key objects, Nk = {n1, n2, · · · , ni}, ni ∈ N, that participate in the action, we use a trained language model Lm to estimate Pr(v|Nk). Lm is also used to compute Pr(s|n, v), the scene predicted from the corpus given the object and verb, and Pr(p|s), the predicted preposition given the scene. This process is repeated over all n, v, s, p, where we use a modified HMM inference scheme to determine the most likely quadruplet T ∗ = {n∗, v∗, s∗, p∗} that makes up the core sentence structure. Using the contents and structure of T ∗, an appropriate sentence is then generated that describes the image. In the following sections, we first introduce the image dataset used for testing, followed by details of how these components are derived.
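To make the notation concrete, here is a minimal sketch (all names and the abridged class lists are illustrative, not the authors' code) of the finite word classes and the quadruplet T that the rest of the pipeline estimates:

```python
from dataclasses import dataclass
from typing import Optional

# Finite word classes (abridged; see Table 1 for the full sets).
OBJECTS = {"person", "dog", "boat", "bus", "bicycle", "car", "horse"}
VERBS = {"ride", "walk", "sit", "stand", "fly", "graze"}
SCENES = {"street", "field", "lake", "room", "sky", "track", "airport", "highway"}
PREPS = {"in", "at", "on", "under", "beside", "behind"}

@dataclass(frozen=True)
class Quadruplet:
    """Core sentence structure T = (n1, n2, v, s, p); n2 may be absent (NULL)."""
    n1: str
    n2: Optional[str]
    v: str
    s: str
    p: str

# Example: the structure behind "A person is riding a bicycle on the street."
T = Quadruplet(n1="person", n2="bicycle", v="ride", s="street", p="on")
```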


Objects n ∈ N: 'aeroplane' 'bicycle' 'bird' 'boat' 'bottle' 'bus' 'car' 'cat' 'chair' 'cow' 'table' 'dog' 'horse' 'motorbike' 'person' 'pottedplant' 'sheep' 'sofa' 'train' 'tvmonitor'

Actions v ∈ V (first 20): 'sit' 'stand' 'park' 'ride' 'hold' 'wear' 'pose' 'fly' 'lie' 'lay' 'smile' 'live' 'walk' 'graze' 'drive' 'play' 'eat' 'cover' 'train' 'close' ...

Scenes s ∈ S: 'airport' 'field' 'highway' 'lake' 'room' 'sky' 'street' 'track'

Preps p ∈ P: 'in' 'at' 'above' 'around' 'behind' 'below' 'beside' 'between' 'before' 'to' 'under' 'on'

Table 1: The set of objects, actions (first 20), scenes and preposition classes considered.

Figure 4: Samples of images with corresponding annotations from the UIUC scene description dataset.

3.1 Image Dataset

We use the UIUC Pascal Sentence dataset, first introduced in [Farhadi et al., 2010] and available online¹. It contains 1000 images taken from a subset of the Pascal-VOC 2008 challenge image dataset, hand annotated with sentences that describe the image by paid human annotators using Amazon Mechanical Turk. Fig. 4 shows some sample images with their annotations. There are 5 annotations per image, and each annotation is usually short – around 10 words long. We randomly selected 900 images (4500 sentences) as the learning corpus to construct the verb and scene sets {V, S} as described in sec. 3.3, and kept the remaining 100 images for testing and evaluation.
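A minimal sketch of the 900/100 split described above; the annotation container and file names are hypothetical, since the dataset's on-disk format is not reproduced here:

```python
import random

# Hypothetical container: image id -> list of 5 human-written captions.
annotations = {
    "pascal_0001.jpg": ["A person is riding a bicycle down the street.", "..."],
    # ... 1000 images in total
}

random.seed(0)                       # fixed seed so the split is reproducible
image_ids = sorted(annotations)
random.shuffle(image_ids)

train_ids = image_ids[:900]          # 900 images (4500 sentences) to learn V and S
test_ids = image_ids[900:]           # 100 held-out images for evaluation

learning_corpus = [sent for i in train_ids for sent in annotations[i]]
```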

3.2 Object and Scene Detections from Images

We use the Pascal-VOC 2008 trained object detectors [Felzenszwalb et al., 2008] for the 20 common everyday object classes defined in N. Each of the detectors is essentially an SVM classifier trained on a large number of the objects' image representations from a large variety of sources.

¹ http://vision.cs.uiuc.edu/pascal-sentences/


Figure 5: (a) [Top] The part-based object detector from [Felzenszwalb et al., 2010]. [Bottom] The graphical model representation of an object, e.g. a bike. (b) Examples of GIST gradients: (left) an outdoor scene vs. (right) an indoor scene [Torralba et al., 2003].

Although 20 classes may seem small, their existence in many natural images (e.g. humans, cars and plants) makes them particularly important for our task, since humans tend to describe these common objects as well. As the object representation, the part-based descriptor of [Felzenszwalb et al., 2010] is used. This representation decomposes any object, e.g. a cow, into its constituent parts – head, torso, legs – which are shared by other objects in a hierarchical manner. At each level, image gradient orientations are computed. The relationship between the parts is modeled probabilistically using graphical models where parts are the nodes and the edges are the conditional probabilities that relate their spatial compatibility (Fig. 5(a)). For example, in a cow, the probability of finding the torso near the head is higher than finding the legs near the head. The intuition of this model lies in the assumption that objects can be deformed but the relative positions of the constituent parts should remain the same. We convert the object detection scores to probabilities using Platt's method [Lin et al., 2007], which is numerically more stable, to obtain Pr(n|I). The parameters of Platt's method are obtained by estimating the number of positives and negatives from the UIUC annotated dataset, from which we determine the appropriate probabilistic threshold, which gives us approximately 50% recall and precision.
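As an illustration of the score-to-probability conversion, the sketch below applies Platt's sigmoid with per-class parameters that are assumed to have been fitted already; the parameter values and threshold shown are made up:

```python
import math

def platt_probability(svm_score: float, A: float, B: float) -> float:
    """Platt's sigmoid: maps a raw SVM margin to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(A * svm_score + B))

# Hypothetical fitted (A, B) parameters per object class and raw detector scores.
platt_params = {"person": (-1.7, 0.2), "dog": (-1.3, 0.1)}
raw_scores = {"person": 0.9, "dog": -0.4}

pr_n_given_I = {n: platt_probability(s, *platt_params[n]) for n, s in raw_scores.items()}

# Keep only detections above a probability threshold chosen for ~50% recall/precision.
THRESHOLD = 0.5
detected = {n: p for n, p in pr_n_given_I.items() if p >= THRESHOLD}
```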

For detecting scenes defined in S, we use the GIST-based scene descriptor of [Torralba et al., 2003]. GIST computes the windowed 2D Gabor filter responses of an input image. The responses of the Gabor filters (4 scales and 6 orientations) encode the texture gradients that describe the local properties of the image. Averaging these responses over larger spatial regions gives us a set of global image properties. These high dimensional responses are then reprojected to a low dimensional space via PCA, where the number of principal components is obtained empirically from training scenes. This representation forms the GIST descriptor of an image (Fig. 5(b)), which is used to train a set of SVM classifiers, one for each scene class in S. Again, Pr(s|I) is computed from the SVM scores using [Lin et al., 2007]. The set of common scenes defined in S is learned from the UIUC annotated data (sec. 3.3).
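A rough sketch of the scene classification stage, assuming GIST-style pooled Gabor responses are already available as fixed-length feature vectors; scikit-learn's PCA and probabilistic SVM stand in for the original implementation, and the dimensions and random data are placeholders:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Stand-ins for pooled Gabor responses: one fixed-length vector per training image.
rng = np.random.default_rng(0)
X_train = rng.random((200, 512))                    # placeholder GIST-style features
y_train = rng.integers(0, 8, size=200)              # labels indexing the 8 scene classes in S

# Project to a low-dimensional space with PCA, then train an SVM whose
# probability outputs are Platt-calibrated internally (probability=True).
scene_clf = make_pipeline(PCA(n_components=32), SVC(kernel="rbf", probability=True))
scene_clf.fit(X_train, y_train)

x_test = rng.random((1, 512))                       # descriptor of the test image I
pr_s_given_I = scene_clf.predict_proba(x_test)[0]   # Pr(s|I) for each scene class
```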

3.3 Corpus-Guided Predictions

Figure 6: (a) Selecting the ROOT verb from the dependency parse ride reveals its subject woman and direct object bicycle. (b) Selecting the head noun (PMOD) as the scene street reveals ADV as the preposition on.

Predicting Verbs: The key component of our approach is the trained language model Lm that predicts the most likely verb v associated with the objects Nk detected in the image. Since different verbs may be associated with varying numbers of object arguments, we limit ourselves to verbs that take on at most two objects (or more specifically two noun phrase arguments) as a simplifying assumption: Nk = {n1, n2}, where n2 can be NULL. That is, n1 and n2 are the subject and direct object associated with v ∈ V. Using this assumption, we can construct the set of verbs, V. To do this, we use the human labeled descriptions of the training images from the UIUC Pascal-VOC dataset (sec. 3.1) as a learning corpus that allows us to determine the appropriate target verb set that is amenable to our problem. We first apply the CLEAR parser [Choi and Palmer, 2010] to obtain a dependency parse of these annotations, which also performs stemming of all the verbs and nouns in the sentence. Next, we process all the parses to select verbs which are marked as ROOT and check the existence of a subject (DEP) and direct object (PMOD, OBJ) that are linked to the ROOT verb (see Fig. 6(a)). Finally, after removing common "stop" verbs such as {is, are, be}, we rank these verbs by their number of occurrences and select the top 50 verbs, which account for 87.5% of the sentences in the UIUC dataset, to be in V.
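The sketch below illustrates this verb-set construction under a simplified, hypothetical parse representation; the real CLEAR parser output format and its exact dependency labels are not reproduced here:

```python
from collections import Counter

# Hypothetical flattened parses: one list of (lemma, dependency label, head lemma) per sentence.
parses = [
    [("woman", "SBJ", "ride"), ("ride", "ROOT", None), ("bicycle", "OBJ", "ride")],
    [("man", "SBJ", "sit"), ("sit", "ROOT", None), ("chair", "PMOD", "sit")],
    # ... one parse per annotation sentence in the learning corpus
]

STOP_VERBS = {"is", "are", "be"}
verb_counts = Counter()

for parse in parses:
    roots = {lemma for lemma, label, _ in parse if label == "ROOT"}
    for root in roots:
        has_subject = any(label == "SBJ" and head == root for _, label, head in parse)
        has_object = any(label in {"OBJ", "PMOD"} and head == root for _, label, head in parse)
        if has_subject and has_object and root not in STOP_VERBS:
            verb_counts[root] += 1

# The 50 most frequent qualifying verbs form the verb set V.
V = [verb for verb, _ in verb_counts.most_common(50)]
```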

Object class n ∈ N    Synonyms 〈n〉
bus                   autobus charabanc double-decker jitney motorbus motorcoach omnibus passenger-vehicle schoolbus trolleybus streetcar ...
chair                 highchair chaise daybed throne rocker armchair wheelchair seat ladder-back lawn-chair fauteuil ...
bicycle               bike wheel cycle velocipede tandem mountain-bike ...

Table 2: Samples of synonyms for 3 object classes.

Next, we need to explain how n1 and n2 are selected from the 20 object classes defined previously in N. Just as the 20 object classes are defined visually over several different kinds of specific objects, we expand n1 and n2 in their textual descriptions using synonyms. For example, the object class n1 = aeroplane should include the synonyms {plane, jet, fighter jet, aircraft}, denoted as 〈n1〉. To do this, we expand each object class using its corresponding WordNet synsets up to at most three hyponym levels. Example synonyms for some of the classes are summarized in Table 2.
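A small sketch of the synonym expansion using NLTK's WordNet interface, collecting lemma names of each noun's synsets and of their hyponyms down to three levels; the depth limit and POS restriction follow the description above, while the exact synset selection in the paper may differ:

```python
from nltk.corpus import wordnet as wn   # requires nltk and its 'wordnet' data package

def expand_synonyms(word, max_depth=3):
    """Collect lemma names of a noun's synsets and of their hyponyms, up to max_depth levels."""
    names = set()
    frontier = wn.synsets(word, pos=wn.NOUN)
    for _ in range(max_depth + 1):          # level 0 = the synsets themselves
        next_frontier = []
        for synset in frontier:
            names.update(lemma.lower().replace("_", "-") for lemma in synset.lemma_names())
            next_frontier.extend(synset.hyponyms())
        frontier = next_frontier
    return names

# e.g. the expansion of 'bus' includes terms such as 'trolleybus' and 'school-bus'
synonyms = {n: expand_synonyms(n) for n in ("bus", "chair", "bicycle")}
```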

We can now compute from the Gigaword corpus [Graff, 2003] the probability that a verb exists given the detected nouns, Pr(v|n1, n2). We do this by computing the log-likelihood ratio [Dunning, 1993], λnvn, of trigrams (〈n1〉, v, 〈n2〉), computed from each sentence in the English Gigaword corpus [Graff, 2003]. This is done by extracting only the words in the corpus that are defined in N and V (including their synonyms). This forms a reduced corpus sequence from which we obtain our target trigrams. For example, the sentence:

    the large brown dog chases a small young cat around the messy room, forcing the cat to run away towards its owner.

will be reduced to the stemmed sequence dog chase cat cat run owner (stemming is done using [Choi and Palmer, 2010]), from which we obtain the target trigram relationships {dog chase cat} and {cat run owner}, as these trigrams respect the (n1, v, n2) ordering. The log-likelihood ratios, λnvn, computed for all possible (〈n1〉, v, 〈n2〉) are then normalized to obtain Pr(v|n1, n2). An example of ranked λnvn in Fig. 7(a) shows that λnvn predicts v that makes sense, with the most likely predictions near the top of the list.
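The following sketch shows the corpus reduction and (n1, v, n2) trigram extraction on the example sentence above; for brevity it scores verbs with simple normalized counts rather than Dunning's log-likelihood ratio, and the synonym map, verb set and stemmed corpus are tiny stand-ins:

```python
from collections import Counter, defaultdict

# Tiny stand-ins; in the paper these come from Table 2 and the stemmed Gigaword corpus.
noun_map = {"dog": "dog", "cat": "cat", "owner": "person"}   # synonym/stem -> object class
verb_set = {"chase", "run"}
corpus_sentences = [
    "the large brown dog chase a small young cat around the messy room "
    "force the cat to run away towards its owner".split(),
]

def reduce_sentence(tokens):
    """Keep only target nouns (mapped to their object class) and target verbs, in order."""
    reduced = []
    for tok in tokens:
        if tok in noun_map:
            reduced.append(("N", noun_map[tok]))
        elif tok in verb_set:
            reduced.append(("V", tok))
    return reduced

def nvn_trigrams(reduced):
    """Return (n1, v, n2) triples that respect the noun-verb-noun ordering."""
    return [(w1, w2, w3)
            for (t1, w1), (t2, w2), (t3, w3) in zip(reduced, reduced[1:], reduced[2:])
            if (t1, t2, t3) == ("N", "V", "N")]

trigram_counts = Counter()
for tokens in corpus_sentences:
    trigram_counts.update(nvn_trigrams(reduce_sentence(tokens)))

# Normalize per noun pair to get a count-based stand-in for Pr(v | n1, n2).
pair_totals = Counter()
for (n1, v, n2), c in trigram_counts.items():
    pair_totals[(n1, n2)] += c

pr_v_given_nn = defaultdict(dict)
for (n1, v, n2), c in trigram_counts.items():
    pr_v_given_nn[(n1, n2)][v] = c / pair_totals[(n1, n2)]

# From the example sentence this yields ('dog', 'cat') -> {'chase': 1.0}
# and ('cat', 'person') -> {'run': 1.0}.
```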

Predicting Scenes: Just as an action is strongly related to the objects that participate in it, a scene can be predicted from the objects and verbs that occur in the image. For example, detecting Nk = {boat, person} with v = {row} would have predicted the scene s = {coast}, since boats usually occur in water regions. To learn this relationship from the corpus, we use the UIUC dataset to discover the common scenes that should be included in S. We applied the CLEAR dependency parser [Choi and Palmer, 2010] on the UIUC data and extracted all the head nouns (PMOD) in the PP phrases for this purpose, excluding those nouns with prepositions (marked as ADV) such as {with, of} which do not co-occur with scenes in general (see Fig. 6(b)). We then ranked the remaining scenes by their frequency to select the top 8 scenes used in S.

To improve recall and generalization, we expand each of the 8 scene classes using their WordNet synsets 〈s〉 (up to a maximum of three hyponym levels). Similar to the procedure for predicting verbs described above, we compute the log-likelihood ratios of the ordered bigrams {n, 〈s〉} and {v, 〈s〉}, λns and λvs, by reducing the corpus sentences to the target nouns, verbs and scenes defined in N, V and S. The probabilities Pr(s|n) and Pr(s|v) are then obtained by normalizing λns and λvs. Under the assumption that the priors Pr(n) and Pr(v) are independent and applying Bayes rule, we can compute the probability that a scene co-occurs with the object and action, Pr(s|n, v), by:

    Pr(s|n, v) = Pr(n, v|s) Pr(s) / Pr(n, v)
               = Pr(n|s) Pr(v|s) Pr(s) / (Pr(n) Pr(v))
               ∝ Pr(s|n) × Pr(s|v)                                    (1)

where the constant of proportionality is justified under the assumption that Pr(s) is equiprobable for all s. (1) is computed for all nouns in Nk. As shown in Fig. 7(b), we are able to predict scenes that co-locate with reasonable correctness given the nouns and verbs.
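A toy sketch of eq. (1), combining hypothetical normalized association scores Pr(s|n) and Pr(s|v); combining the two detected nouns by multiplying their Pr(s|n) terms is one plausible reading of "computed for all nouns in Nk":

```python
# Hypothetical normalized association scores mined from the corpus.
pr_s_given_n = {
    "boat":   {"lake": 0.55, "street": 0.05, "field": 0.10, "sky": 0.30},
    "person": {"lake": 0.20, "street": 0.40, "field": 0.25, "sky": 0.15},
}
pr_s_given_v = {
    "row": {"lake": 0.70, "street": 0.05, "field": 0.15, "sky": 0.10},
}

def predict_scene_scores(nouns, verb):
    """Score each scene as in eq. (1): Pr(s|v) times Pr(s|n) for each detected noun."""
    scores = {}
    for s, p_sv in pr_s_given_v[verb].items():
        score = p_sv
        for n in nouns:
            score *= pr_s_given_n[n].get(s, 1e-6)   # small floor for unseen pairs
        scores[s] = score
    total = sum(scores.values())
    return {s: sc / total for s, sc in scores.items()}

print(predict_scene_scores(["boat", "person"], "row"))   # 'lake' should dominate
```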

Predicting Prepositions: It is straightforward to predict the appropriate prepositions associated with a given scene. When we construct S from the UIUC annotated data, we simply collect and rank all the associated prepositions (ADV) in the PP phrases of the dependency parses. We then select the top 12 prepositions to define P. Using P, we then compute the log-likelihood ratio λps of the ordered bigrams {p, 〈s〉} for prepositions that co-locate with the scene synonyms over the corpus. Normalizing λps yields Pr(p|s), the probability that a preposition co-locates with a scene. Examples of ranked λps are shown in Fig. 7(c). Again, we see that reasonable predictions of p can be found.

Figure 7: Example of how ranked log-likelihood values (in descending order) suggest a possible T: (a) λnvn for n1 = person, n2 = bus predicts v = ride. (b) λns and λvs for n = bus, v = ride then jointly predict s = street, and finally (c) λps with s = street predicts p = on.

3.4 Determining T ∗ using HMM inference

Given the computed conditional probabilities Pr(n|I) and Pr(s|I), which are observations from an input test image, together with the parameters of the trained language model Lm – Pr(v|n1, n2), Pr(s|n, v), Pr(p|s) – we seek to find the most likely sentence structure T ∗ by:

    T ∗ = argmax_{n,v,s,p} Pr(T |n, v, s, p)
        = argmax_{n,v,s,p} { Pr(n1|I) Pr(n2|I) Pr(s|I) × Pr(v|n1, n2) Pr(s|n, v) Pr(p|s) }    (2)

where the last equality holds by assuming independence between the visual detections and the corpus predictions. Obviously, a brute force approach that tries all possible combinations to maximize eq. (2) is not feasible due to the large number of possible combinations: (20 ∗ 21 ∗ 8) ∗ (50 ∗ 20 ∗ 20) ∗ (8 ∗ 20 ∗ 50) ∗ (12 ∗ 8) ≈ 5 × 10^13. A better solution is needed.

Figure 8: The HMM used for optimizing T. The relevant transition and emission probabilities are also shown. See text for more details.

Our proposed strategy is to pose the optimization of T as a dynamic programming problem, akin to a Hidden Markov Model (HMM), where the hidden states are related to the (simplified) sentence structure we seek, T = {n1, n2, s, v, p}, and the emissions are related to the observed detections {n1, n2, s} in the image if they exist. To simplify our notation, as we are concerned with object pairs, we write NN for the hidden states over all n1, n2 pairs and nn for the corresponding emissions (detections), and we write the object+verb pairs as hidden states NV. The hidden states are therefore denoted as {NN, NV, S, P}, with values taken from their respective word classes in Table 1. The emission states are {nn, s} with binary values: 1 if the detections occur and 0 otherwise. The full HMM is summarized in Fig. 8. The rationale for using an HMM is that we can reuse all previous computation of the probabilities at each level to compute the required probabilities at the current level. From START, we assume all object pair detections are equiprobable: Pr(NN|START) = 1 / (|N| ∗ (|N| + 1)), where we have added an additional NULL value for objects (at most 1). At each NN, the HMM emits a detection from the image and by independence we have Pr(nn|NN) = Pr(n1|I) Pr(n2|I). After NN, the HMM transits to the corresponding verb at state NV with Pr(NV|NN) = Pr(v|n1, n2) obtained from the corpus statistics³. As no action detections are performed on the image, NV has no emissions. The HMM then transits from NV to S with Pr(S|NV) = Pr(s|n, v) computed from the corpus, and S emits the scene detection score from the image: Pr(s|S) = Pr(s|I). From S, the HMM transits to P with Pr(P|S) = Pr(p|s) before reaching the END state.

Comparing the HMM with eq. (2), one can see that all the corpus and detection probabilities are accounted for in the transition and emission probabilities respectively. Optimizing T is then equivalent to finding the best (most likely) path through the HMM given the image observations using the Viterbi algorithm, which can be done in O(10^5) time – significantly faster than the naive approach. We show in Fig. 9 (right-upper) examples of the top Viterbi paths that produce T ∗ for four test images.
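The sketch below shows one way the stage-wise maximization could look for the chain NN → NV → S → P; all probability tables are hypothetical dictionary inputs, Pr(s|n, v) is keyed on the first noun and the verb for simplicity, and the uniform Pr(NN|START) term is dropped since it does not affect the argmax:

```python
from math import log

NEG_INF = float("-inf")

def lg(x):
    """Safe log: log 0 -> -inf so impossible combinations drop out of the maximization."""
    return log(x) if x > 0 else NEG_INF

def best_quadruplet(pr_n_I, pr_s_I, pr_v_nn, pr_s_nv, pr_p_s):
    """Return T* = (n1, n2, v, s, p) maximizing eq. (2) by stage-wise dynamic programming."""
    # Stage NN: object pairs weighted by their detection (emission) probabilities.
    nn_scores = {}
    for n1 in pr_n_I:
        for n2 in list(pr_n_I) + [None]:            # allow at most one NULL object
            nn_scores[(n1, n2)] = lg(pr_n_I[n1]) + (lg(pr_n_I[n2]) if n2 else 0.0)

    # Stage NV: extend every pair with a verb using the corpus transition Pr(v|n1, n2).
    nv_scores = {}
    for (n1, n2), sc in nn_scores.items():
        for v, p_v in pr_v_nn.get((n1, n2), {}).items():
            nv_scores[(n1, n2, v)] = sc + lg(p_v)

    # Stage S: add Pr(s|n, v) and the scene emission Pr(s|I); keep only the best
    # prefix per scene -- this maximization is what keeps the search tractable.
    best_per_scene = {}
    for (n1, n2, v), sc in nv_scores.items():
        for s, p_s in pr_s_nv.get((n1, v), {}).items():
            total = sc + lg(p_s) + lg(pr_s_I.get(s, 0.0))
            if s not in best_per_scene or total > best_per_scene[s][0]:
                best_per_scene[s] = (total, (n1, n2, v))

    # Stage P: add Pr(p|s) and read off the best complete path.
    best, best_score = None, NEG_INF
    for s, (sc, (n1, n2, v)) in best_per_scene.items():
        for p, p_p in pr_p_s.get(s, {}).items():
            if sc + lg(p_p) > best_score:
                best, best_score = (n1, n2, v, s, p), sc + lg(p_p)
    return best, best_score
```

With the probability tables from the previous subsections plugged in, the returned tuple corresponds to the most likely path through the HMM of Fig. 8, up to the simplifications noted above.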

Note that the proposed HMM is suitable for generating sentences that contain the core components defined in T, which produces a sentence of the form NP-VP-PP; we will show in sec. 4 that this is sufficient for the task of generating sentences that describe images. For more complex sentences with more components, such as adjectives or adverbs, the HMM can be easily extended with similar computations derived from the corpus.

3.5 Sentence Generation

Given the selected sentence structure T = {n1, n2, v, s, p}, we generate sentences using the following strategy for each component:

³ Each verb v in NV will have 2 entries with the same value, one for each noun.


Figure 9: Four test images (left) and results. (Right-upper): Sentence structure T ∗ predicted using Viterbi and (Right-lower): Generated sentences. Words marked in red are considered to be incorrect predictions. Complete results are available at http://www.umiacs.umd.edu/~yzyang/sentence_generateOut.html.

1) We add appropriate determiners and cardinals – the, an, a, CARD – based on the content of n1, n2 and s. For example, if n1 = n2, we use CARD=two and modify the nouns to be in the plural form. When several possible choices are available, a random choice is made that depends on the object detection scores: the is preferred when we are confident of the detections, while an, a is preferred otherwise.

2) We predict the most likely preposition to insert between the verb and noun, learned from the Gigaword corpus via Pr(p|v, n), during sentence generation. For example, our method will pick the preposition at between the verb sit and the noun table.

3) The verb v is converted to a form that agrees in number with the nouns detected. The present gerund form is preferred, such as eating, drinking, walking, as it conveys that an action is being performed in the image.

4) The sentence structure is therefore of the form NP-VP-PP, with variations when only one object or multiple detections of the same object are detected. A special case is when no objects are detected (below the predefined threshold); no verbs can be predicted either. In this case, we simply generate a sentence that describes the scene only, e.g. This is a coast, This is a field. Such sentences account for 20% of the entire UIUC testing dataset and are scored lower in our evaluation metrics (sec. 4.1) since they do not fully describe the image content in terms of the objects and actions.

Some examples of sentences generated using this strategy are shown in Fig. 9 (right-lower).
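A much-reduced sketch of the template realization step; it omits, for instance, the Pr(p|v, n) preposition between verb and object, uses a crude gerund rule, and its thresholds and templates are illustrative only:

```python
def realize(T, pr_n_given_I, confident=0.7):
    """Render T = (n1, n2, v, s, p) as a simple NP-VP-PP sentence."""
    n1, n2, v, s, p = T

    if n1 is None:                                   # no confident object detections
        return f"This is a {s}."

    def det(n):                                      # 'the' only for confident detections
        return "the" if pr_n_given_I.get(n, 0.0) >= confident else "a"

    gerund = v[:-1] + "ing" if v.endswith("e") else v + "ing"   # crude -ing form

    if n2 is None:
        return f"{det(n1).capitalize()} {n1} is {gerund} {p} the {s}."
    if n1 == n2:                                     # CARD for a repeated object class
        return f"Two {n1}s are {gerund} {p} the {s}."
    return (f"{det(n1).capitalize()} {n1} is {gerund} "
            f"{det(n2)} {n2} {p} the {s}.")

# realize(("person", "bicycle", "ride", "street", "on"),
#         {"person": 0.9, "bicycle": 0.8})
# -> "The person is riding the bicycle on the street."
```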

4 Experiments

We performed several experiments to evaluate our proposed approach. The different metrics used for evaluation and comparison are also presented, followed by a discussion of the experimental results.

4.1 Sentence Generation Results

Three experiments are performed to evaluate the effectiveness of our approach. As a baseline, we simply generate T ∗ directly from images without using the corpus. There are two variants of this baseline, where we seek to determine if listing all objects in the image is crucial for scene description. Tb1 is a baseline that uses all possible objects and the scene detected, Tb1 = {n1, n2, · · · , nm, s}, and its sentence is of the form {Object 1, object 2 and object 3 are IN the scene.}, where we simply selected IN as the only admissible preposition. For the second baseline, Tb2, we limit the number of objects to just any two, Tb2 = {n1, n2, s}, and the sentence generated is of the form {Object 1 and object 2 are IN the scene}. In the second experiment, we applied the HMM strategy described above but made all transition probabilities equiprobable, removing the effect of the corpus and producing a sentence structure which we denote as T ∗eq. The third experiment produces the full T ∗ with transition probabilities learned from the corpus. All experiments were performed on the 100 unseen testing images from the UIUC dataset, and we used only the most likely (top) sentence generated for all evaluations.

We use two evaluation metrics as a measure of the accuracy of the generated sentences: 1) ROUGE-1 [Lin and Hovy, 2003] precision scores and 2) Relevance and Readability of the generated sentences. ROUGE-1 is a recall based metric that is commonly used to measure the effectiveness of text summarization. In this work, the short descriptive sentence of an image can be viewed as summarizing the image content, and ROUGE-1 is able to capture how well this sentence can describe the image by comparing it with the human annotated ground truth of the UIUC dataset. Due to the short sentences generated, we did not consider the other ROUGE metrics (ROUGE-2, ROUGE-SU4), which capture fluency – not an issue here.
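For reference, a simple unigram-overlap sketch in the spirit of ROUGE-1; the exact variant and averaging used in the paper (e.g. the leave-one-out procedure for the human scores) may differ:

```python
from collections import Counter

def rouge_1(candidate, references):
    """Clipped unigram overlap against each reference, averaged over the reference set."""
    cand = Counter(candidate.lower().split())
    scores = []
    for ref in references:
        ref_counts = Counter(ref.lower().split())
        overlap = sum(min(c, ref_counts[w]) for w, c in cand.items())
        scores.append(overlap / max(1, sum(ref_counts.values())))
    return sum(scores) / len(scores)

generated = "A person is riding a bicycle on the street"
gold = ["A man rides his bike down the street",
        "A person is riding a bicycle on a city street"]
print(rouge_1(generated, gold))
```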

Experiment            R1 (length)   Relevance      Readability
Baseline 1, T ∗b1     0.35 (8.2)    2.84 ± 1.40    3.64 ± 1.20
Baseline 2, T ∗b2     0.39 (6.8)    2.14 ± 1.13    3.94 ± 0.91
HMM no corpus, T ∗eq  0.42 (6.5)    2.44 ± 1.25    3.88 ± 1.18
Full HMM, T ∗         0.44 (6.9)    2.51 ± 1.30    4.10 ± 1.03
Human Annotation      0.68 (10.1)   4.91 ± 0.29    4.77 ± 0.42

Table 3: Sentence generation evaluation results with the human gold standard. Human R1 scores are averaged over the 5 sentences using a leave-one-out procedure. Values in bold are the top scores.

A main shortcoming of using ROUGE-1 is that the generated sentences are compared only to a finite set of human labeled ground truth, which obviously does not capture all possible sentences that one can generate. In other words, ROUGE-1 does not take into account the fact that sentence generation is innately a creative process, and a better recall metric is to ask humans to judge these sentences. The second evaluation metric, Relevance and Readability, is therefore proposed as an empirical measure of how much the sentence: 1) conveys the image content (relevance) in terms of the objects, actions and scene predicted and 2) is grammatically correct (readability). We engaged the services of Amazon Mechanical Turk (AMT) workers to judge the generated sentences on a discrete scale ranging from 1–5 (low relevance/readability to high relevance/readability). The averaged ROUGE-1 (R1) scores and mean sentence lengths, together with the Relevance+Readability scores for all experiments, are summarized in Table 3. For comparison, we also asked the AMT workers to judge the ground truth sentences.

4.2 Discussion

The results reported in Table 3 reveal both the strengths and some shortcomings of the approach, which we briefly discuss here. Firstly, the R1 scores indicate that, from a purely summarization (unigram-overlap) point of view, the proposed approach of using the HMM to predict T ∗ achieves the best results compared to all other approaches, with R1 = 0.44. This means that our sentences are the closest in agreement with the human annotated ground truth, correctly predicting the sentence structure components. In addition, sentences generated by T ∗ are also succinct, with an average length of 6.9 words per sentence. However, we are still some way off the human gold standard, since we do not predict other parts of speech such as adjectives and adverbs. Given this fact, our proposed approach's performance is comparable to other state of the art summarization work in the literature [Bonnie and Dorr, 2004].

Next, we consider the Relevance+Readability metrics based on human judges. Interestingly, the first baseline, T ∗b1, is considered the most relevant description of the image and the least readable at the same time. This is most likely due to the fact that this recall oriented strategy will almost certainly describe some objects, but the lack of any verb description, together with longer sentences that average 8.2 words per sentence, makes it less readable. It is also possible that humans penalize irrelevant objects less than missing objects, and further evaluations are necessary to confirm this. Since T ∗b2 is limited to two objects just like the proposed HMM, it is a more suitable baseline for comparison. Clearly, the results show that adding the HMM to predict the optimal sentence structure increases the relevance of the produced sentence. Finally, in terms of readability, T ∗ generates the most readable sentences, and this is achieved by leveraging the corpus to guide our predictions of the most reasonable nouns, verbs, scenes and prepositions that agree with the detections in the image.

5 Future Work

In this work, we have introduced a computationally feasible framework that integrates visual perception with semantic grounding obtained from a large textual corpus for the purpose of generating a descriptive sentence of an image. Experimental results show that our approach produces sentences that are both relevant and readable. There are, however, instances where our strategy fails to predict the appropriate verbs or nouns (see Fig. 9). This is due to the fact that object/scene detections can be wrong, and noise from the corpus itself remains a problem. Compared to human gold standards, therefore, much work still remains in terms of detecting these objects and scenes with high precision. Currently, at most two object classes are used to generate simple sentences, which was shown in the results to have penalized the relevance score of our approach. This can be addressed by designing more complex HMMs to handle larger numbers of object and verb classes. Another interesting direction of future work would be to detect salient objects, learned from training image+corpus or eye-movement data, and to verify whether these objects help improve the descriptive sentences we generate.

Figure 10: Images retrieved from 3 verbal search terms: ride, sit, fly.

Another potential application of representing images using T ∗ is that we can easily sort and retrieve images that are similar in terms of their semantic content. This would enable us to retrieve, for example, more relevant images given a verbal search query such as {ride, sit, fly}, returning images where these verbs are found in T ∗. Some results of retrieved images based on their verbal components are shown in Fig. 10: many images with dissimilar visual content are correctly classified based on their semantic meaning.

6 Acknowledgement

This material is based upon work supported by the National Science Foundation under Grant No. 1035542. In addition, the support of the European Union under the Cognitive Systems program (project POETICON) and the National Science Foundation under the Cyberphysical Systems Program is gratefully acknowledged.


References

Berg, T. L., Berg, A. C., Edwards, J., and Forsyth, D. A. (2004). Who's in the picture? In NIPS.

Bonnie, D. Z. and Dorr, B. (2004). BBN/UMD at DUC-2004: Topiary. In Proceedings of the 2004 Document Understanding Conference (DUC 2004) at HLT/NAACL 2004, pages 112–119.

Choi, J. D. and Palmer, M. (2010). Robust constituent-to-dependency conversion for English. In Proceedings of the 9th International Workshop on Treebanks and Linguistic Theories, pages 55–66, Tartu, Estonia.

Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74.

Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. (2008). The PASCAL Visual Object Classes Challenge 2008 (VOC2008) Results.

Farhadi, A., Hejrati, S. M. M., Sadeghi, M. A., Young, P., Rashtchian, C., Hockenmaier, J., and Forsyth, D. A. (2010). Every picture tells a story: Generating sentences from images. In Daniilidis, K., Maragos, P., and Paragios, N., editors, ECCV (4), volume 6314 of Lecture Notes in Computer Science, pages 15–29. Springer.

Felzenszwalb, P. F., Girshick, R. B., and McAllester, D. (2008). Discriminatively trained deformable part models, release 4. http://people.cs.uchicago.edu/~pff/latent-release4/.

Felzenszwalb, P. F., Girshick, R. B., McAllester, D. A., and Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell., 32(9):1627–1645.

Golland, D., Liang, P., and Klein, D. (2010). A game-theoretic approach to generating spatial descriptions. In Proceedings of EMNLP.

Graff, D. (2003). English Gigaword. Linguistic Data Consortium, Philadelphia, PA.

Jie, L., Caputo, B., and Ferrari, V. (2009). Who's doing what: Joint modeling of names and verbs for simultaneous face and pose annotation. In Advances in Neural Information Processing Systems (NIPS).

Kojima, A., Izumi, M., Tamura, T., and Fukunaga, K. (2000). Generating natural language description of human behavior from video images. In Proceedings of the 15th International Conference on Pattern Recognition, volume 4, pages 728–731.

Kourtzi, Z. (2004). But still, it moves. Trends in Cognitive Sciences, 8(2):47–49.

Liang, P., Jordan, M. I., and Klein, D. (2009). Learning from measurements in exponential families. In International Conference on Machine Learning (ICML).

Lin, C. and Hovy, E. (2003). Automatic evaluation of summaries using n-gram co-occurrence statistics. In NAACL-HLT.

Lin, H.-T., Lin, C.-J., and Weng, R. C. (2007). A note on Platt's probabilistic outputs for support vector machines. Mach. Learn., 68:267–276.

Mann, G. S. and McCallum, A. (2007). Simple, robust, scalable semi-supervised learning via expectation regularization. In The 24th International Conference on Machine Learning.

McKeown, K. (2009). Query-focused summarization using text-to-text generation: When information comes from multilingual sources. In Proceedings of the 2009 Workshop on Language Generation and Summarisation (UCNLG+Sum 2009), page 3, Suntec, Singapore. Association for Computational Linguistics.

Oliva, A. and Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175.

Schwartz, W., Kembhavi, A., Harwood, D., and Davis, L. (2009). Human detection using partial least squares analysis. In International Conference on Computer Vision.

Torralba, A., Murphy, K. P., Freeman, W. T., and Rubin, M. A. (2003). Context-based vision system for place and object recognition. In ICCV, pages 273–280. IEEE Computer Society.

Traum, D., Fleischman, M., and Hovy, E. (2003). NL generation for virtual humans in a complex social environment. In Proceedings of the AAAI Spring Symposium on Natural Language Generation in Spoken and Written Dialogue, pages 151–158.

Urgesi, C., Moro, V., Candidi, M., and Aglioti, S. M. (2006). Mapping implied body actions in the human motor system. J Neurosci, 26(30):7942–9.

Yang, W., Wang, Y., and Mori, G. (2010). Recognizing human actions from still images with latent poses. In CVPR.

Yao, B. and Fei-Fei, L. (2010). Grouplet: A structured image representation for recognizing human and object interactions. In The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA.

Yao, B., Yang, X., Lin, L., Lee, M. W., and Zhu, S.-C. (2010). I2T: Image parsing to text description. Proceedings of the IEEE, 98(8):1485–1508.