

Distributional semantics from text and images

Elia Bruni
CIMeC, University of Trento
[email protected]

Giang Binh Tran
EMLCT, Free University of Bolzano & CIMeC, University of Trento
[email protected]

Marco Baroni
CIMeC, University of Trento
[email protected]

Abstract

We present a distributional semantic model combining text- and image-based features. We evaluate this multimodal semantic model on simulating similarity judgments, concept clustering and the BLESS benchmark. When integrated with the same core text-based model, image-based features are at least as good as further text-based features, and they capture different qualitative aspects of the tasks, suggesting that the two sources of information are complementary.

1 Introduction

Distributional semantic models use large text corpora to derive estimates of semantic similarities between words. These procedures rest on the hypothesis that semantically similar words tend to appear in similar contexts (Miller and Charles, 1991; Wittgenstein, 1953). For example, the meaning of spinach (primarily) becomes the result of statistical computations based on the association between spinach and words like plant, green, iron, Popeye, muscles. Alongside their applications in NLP areas such as information retrieval or word sense disambiguation (Turney and Pantel, 2010), a strong debate has arisen on whether distributional semantic models also reflect human cognitive processes (Griffiths et al., 2007; Baroni et al., 2010). Many cognitive scientists have however observed that these techniques relegate the process of meaning extraction solely to linguistic regularities, forgetting that humans can also rely on non-verbal experience, and that comprehension also involves the activation of non-linguistic representations (Barsalou et al., 2008; Glenberg, 1997; Zwaan, 2004). They argue that, without grounding words in bodily actions and perceptions in the environment, we can never get past defining a symbol by simply pointing to the covariation of amodal symbolic patterns (Harnad, 1990). Going back to our example, the meaning of spinach should come (at least partially) from our experience with spinach, its colors, smell and the occasions on which we tend to encounter it.

We can thus distinguish two different views of how meaning emerges: one holds that it emerges from associations between linguistic units, reflected by statistical computations on large bodies of text; the other holds that meaning is still the result of an association process, but one that concerns the association between words and perceptual information.

In our work, we try to make these two apparently mutually exclusive accounts communicate, in order to construct a richer and more human-like notion of meaning. In particular, we concentrate on perceptual information coming from images, and we create a multimodal distributional semantic model extracted from texts and images, putting techniques from NLP and computer vision side by side. In a nutshell, our technique uses a collection of labeled pictures to build vectors recording the co-occurrences of words with image-based features, exactly as we would do with textual co-occurrences. We then concatenate the image-based vector with a standard text-based distributional vector to obtain our multimodal representation. The preliminary results reported in this paper indicate that enriching a text-based model with image-based features is at least not damaging with respect to enlarging the purely textual component, and that it leads to qualitatively different results, indicating that the two sources of information are not redundant.

The rest of the paper is structured as follows. Section 2 reviews relevant work, including distributional semantic models, computer vision techniques suitable for our purpose, and systems combining text and image information, among them the only work we are aware of that attempts something similar to what we try here. We introduce our multimodal distributional semantic model in Section 3, and our experimental setup and procedure in Section 4. The results of our experiments are discussed in Section 5. Section 6 concludes by summarizing current achievements and discussing next directions.

2 Related Work

2.1 Text-based distributional semantic models

Traditional corpus-based models of semantic representation base their analysis on textual input alone (Turney and Pantel, 2010). Assuming the distributional hypothesis (Miller and Charles, 1991), they represent semantic similarity between words as a function of the degree of overlap among their linguistic contexts. Similarity is computed in a semantic space represented as a matrix, with words as rows and contextual elements as columns/dimensions. Thanks to the geometrical nature of the representation, words are compared using a distance metric, such as the cosine of the angle between their vectors (Landauer and Dumais, 1997).
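To make the matrix-and-cosine setup concrete, the toy sketch below builds a tiny word-by-context co-occurrence matrix and compares rows with the cosine measure. The context words and counts are invented purely for illustration; this is not any of the models cited above.

```python
# Minimal sketch of a text-based distributional space: a word-by-context
# co-occurrence matrix whose rows are compared with cosine similarity.
import numpy as np

contexts = ["plant", "green", "iron", "muscles", "engine"]  # toy context words
cooc = {
    "spinach":  np.array([12.0, 30.0, 8.0, 5.0, 0.0]),
    "broccoli": np.array([10.0, 25.0, 2.0, 1.0, 0.0]),
    "car":      np.array([0.0, 1.0, 4.0, 0.0, 20.0]),
}

def cosine(u, v):
    """Cosine of the angle between two distributional vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(cooc["spinach"], cooc["broccoli"]))  # high: similar contexts
print(cosine(cooc["spinach"], cooc["car"]))       # low: dissimilar contexts
```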

2.2 Bag of visual words

In NLP, “bag of words” (BoW) is a dictionary-based method in which a document is represented as a “bag” (i.e., word order is not considered) containing words from the dictionary. In computer vision, “bag of visual words” (BoVW) applies the same idea to image representation (Sivic and Zisserman, 2003; Csurka et al., 2004; Nister and Stewenius, 2006; Bosch et al., 2007; Yang et al., 2007).

Here, an image is treated as a document, and features from a dictionary of visual elements extracted from the image play the role of the “words” representing it. The following pipeline is typically adopted to group local interest points into types (visual words) within and across images, so that an image can then be represented by the number of occurrences of each visual word type it contains, analogously to BoW. From every image of a data set, keypoints are automatically detected and represented as vectors of various descriptors. The keypoint vectors are then projected into a common space and grouped into a number of clusters. Each cluster is treated as a discrete visual word (a technique generally known as vector quantization). With its keypoints mapped onto visual words, each image can then be represented as a BoVW feature vector recording the count of each visual word. In this way, we move from representing the image by a varying number of high-dimensional keypoint descriptor vectors to a representation in terms of a single sparse vector of fixed dimensionality across all images. What kind of image content a visual word captures depends on a number of factors, including the descriptors used to identify and represent local interest points, the quantization algorithm, and the number of target visual words selected. In general, local interest points assigned to the same visual word tend to be patches with similar low-level appearance, but these common types of local patterns need not be correlated with object-level parts present in the images. Figure 1 illustrates the procedure for forming bags of visual words. Importantly for our purposes, the BoVW representation, despite its unrelated origin in computer vision, is entirely analogous to the BoW representation, making the integration of text- and image-based features very straightforward.
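The quantization-and-histogram steps of this pipeline can be sketched as follows, with random vectors standing in for real SIFT-style keypoint descriptors and scikit-learn's KMeans standing in for the clustering step; the visual vocabulary size and all data are toy assumptions, not the setup used later in the paper.

```python
# Sketch of the bag-of-visual-words pipeline: quantize descriptors into a
# visual vocabulary, then represent each image as a histogram of visual words.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_visual_words = 8  # toy vocabulary size

# One descriptor matrix per image: (num_keypoints, 128), as with SIFT.
images = [rng.normal(size=(int(rng.integers(20, 40)), 128)) for _ in range(5)]

# Quantization: cluster all descriptors; each cluster index is a visual word.
kmeans = KMeans(n_clusters=n_visual_words, n_init=10, random_state=0)
kmeans.fit(np.vstack(images))

# Each image becomes a fixed-length histogram of visual-word counts (BoVW).
bovw = np.array([
    np.bincount(kmeans.predict(desc), minlength=n_visual_words)
    for desc in images
])
print(bovw.shape)  # (5 images, 8 visual words)
```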

2.3 Integrating textual and perceptual information

Louwerse (2011), contributing to the debate on symbol grounding in cognitive science, theorizes an interdependency account, which suggests a convergence of symbolic theories (such as distributional semantics) and perceptual theories of meaning, but lacks a concrete way to harvest perceptual information computationally. Andrews et al. (2009) complement text-based models with experiential information, combining corpus-based statistics with speaker-generated feature norms as a proxy of perceptual experience. However, the latter are an unsatisfactory proxy, since they are still verbally produced descriptions, and they are expensive to collect from subjects via elicitation techniques.


Figure 1: Illustration of the bag of visual words procedure: (a) detect local interest points and represent them as descriptor vectors; (b) quantize the vectors; (c) compute a histogram to form the BoVW vector for the image.


Taking inspiration from methods originally used in text processing, algorithms for image labeling, search and retrieval have been built upon the connection between text and visual features. Such models characterize the joint statistical distribution of observed visual features and verbal image tags (Hofmann, 2001; Hare et al., 2008). This line of research pursues the reverse of what we are interested in: it uses text to improve the semantic description of images, whereas we want to exploit images to improve our approximation of word meaning.

Feng and Lapata (2010) are the first to try to integrate authentic visual information into a text-based distributional model. Using a collection of BBC news stories with pictures as their corpus, they train a topic model in which text and visual words are represented in terms of the same shared latent dimensions (topics). In this framework, word meaning is modeled as a probability distribution over a set of latent multimodal topics, and the similarity between two words can be estimated by measuring the topics they have in common. They obtain a better correlation with semantic intuitions when the visual modality is taken into account than when the topic structure is estimated from text only.

Although Feng and Lapata’s work is very promising and the main inspiration for our own, their method requires the extraction of a single distributional model from the same mixed-media corpus. This has two important drawbacks. First, the textual model must be extracted from the same corpus the images are taken from, and the text context extraction methods must be compatible with the overall multimodal approach. Thus, image features cannot be added to a state-of-the-art text-based distributional model – e.g., a model computed on the whole Wikipedia or larger corpora using syntactic dependency information – to assess whether visual information helps even when purely textual features are already very good. Second, by training a joint model with latent dimensions that mix textual and visual information, it becomes hard to assess, quantitatively and qualitatively, the separate effect of image-based features on the overall performance. In order to overcome these issues, we propose a somewhat simpler approach, in which the text- and image-based models are independently constructed from different sources and then concatenated.

3 Proposed method

Figure 2 presents a diagram of our overall system. The main idea is to construct text-based and image-based co-occurrence models separately and then combine them. We first describe our procedure for building the text-based and image-based models, stressing the latter since it is the more novel part of the procedure. Then, we describe our simple combination technique for integrating both models into a multimodal distributional semantic space. Our implementation of the proposed method is open-source.1

1 https://github.com/s2m


[Figure 2: Overview of our system architecture. Image data passes through visual feature extraction, bag of visual words and tag modeling to yield an image-based distributional vector; a text corpus passes through text feature extraction to yield a text-based distributional vector; the two are normalized and concatenated into the multimodal distributional semantic vector.]

3.1 Text-based distributional model

Instead of proposing yet another model, we pick one that is publicly available off the shelf and has been shown to be at the state of the art on a number of benchmarks. The chosen model (DM)2 is encoded in a matrix in which each target word is represented by a row vector of weights recording its association with collocates in a corpus. See Section 4.1 for details about the text-based model.

2 http://clic.cimec.unitn.it/dm

3.2 Image-based distributional model

We assume image data where each image is associated with word labels (somehow related to the image) that we call tags.

Our primary approach to forming the image-based vector space is to use the BoVW method to represent images. Having represented each image in our data set in terms of the frequency of occurrence of each visual word in it, we construct the image-based distributional vector of each tag as follows. Each tag (textual word) is associated with the list of images that are tagged with it; we then sum visual word occurrences across that list of images to obtain the co-occurrence counts associated with each tag. For uniformity with the treatment of textual co-occurrences (see Section 4.1), the raw counts are transformed into Local Mutual Information scores computed between each tag and visual word. Local Mutual Information is an association measure that closely approximates the commonly used Log-Likelihood Ratio while being simpler to compute (Evert, 2005).

In this way, we obtain an image-based distributional semantic model, that is, a matrix in which each row corresponds to a tag vector summarizing the distributional history of the tag in the image collection in terms of its association with the visual words.


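A minimal sketch of this construction, with toy images, tags and BoVW counts, is given below. The Local Mutual Information weighting is written here as count * log(observed/expected), one common formulation, and the clipping of negative values is our own assumption rather than a detail stated above.

```python
# Sketch: sum per-image BoVW counts over all images carrying a tag, then
# reweight the resulting tag-by-visual-word matrix with Local Mutual
# Information (toy data throughout).
import numpy as np

visual_vocab_size = 6
bovw = {0: np.array([3, 0, 1, 0, 2, 0]),   # image id -> BoVW count vector
        1: np.array([1, 2, 0, 0, 4, 0]),
        2: np.array([0, 0, 5, 1, 0, 2])}
tags = {0: ["dog", "grass"], 1: ["dog"], 2: ["car"]}  # image id -> tags

# Sum visual-word occurrences across the images tagged with each tag.
tag_rows = {}
for img, tag_list in tags.items():
    for t in tag_list:
        tag_rows[t] = tag_rows.get(t, np.zeros(visual_vocab_size)) + bovw[img]

M = np.array([tag_rows[t] for t in sorted(tag_rows)], dtype=float)

# Local Mutual Information reweighting: O * log(O / E).
total = M.sum()
expected = np.outer(M.sum(axis=1), M.sum(axis=0)) / total
with np.errstate(divide="ignore", invalid="ignore"):
    lmi = M * np.log(M / expected)
lmi = np.nan_to_num(lmi)   # zero counts contribute 0
lmi[lmi < 0] = 0.0         # negative associations clipped (our assumption)
print(sorted(tag_rows), lmi.round(2))
```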

3.3 Integrating distributional models

We assemble the two distributional vectors to construct the multimodal semantic space. Given a word that is present both in the text-based model and (as a tag) in the image-based model, we separately normalize the two vectors representing the word to length 1 (so that the text and image components will have equal weight), and we concatenate them to obtain the multimodal distributional semantic vector representing the word. The matrix of concatenated text- and image-based vectors is our multimodal distributional semantic model. We leave it to future work to consider more sophisticated combination techniques (preliminary experiments on differential weighting of the text and image components did not lead to promising results).
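The combination step can be sketched as follows; the vectors are toy values, and only the L2 normalization and concatenation described above are implemented.

```python
# Minimal sketch of the combination step: L2-normalize each modality's vector
# so text and image get equal weight, then concatenate.
import numpy as np

def l2_normalize(v):
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

text_vec = np.array([4.0, 0.0, 3.0])        # text-based dimensions (toy)
image_vec = np.array([1.0, 2.0, 2.0, 0.0])  # image-based dimensions (toy)

multimodal = np.concatenate([l2_normalize(text_vec), l2_normalize(image_vec)])
print(multimodal, np.linalg.norm(multimodal))  # norm is sqrt(2) by construction
```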

4 Experimental setup

4.1 The DM text-based model

DM has been shown to be near or at the state of the art in a great variety of semantic tasks, ranging from modeling similarity judgments to concept categorization, predicting selectional preferences, relation classification and more.

The DM model is described in detail by Baroni and Lenci (2010), where it is referred to as TypeDM. In brief, the model is trained on a large corpus of about 2.8 billion tokens that includes Web documents, Wikipedia and the BNC. DM is a structured model, in which collocates are labeled with the link that connects them to the target words. The links are determined by a mixture of dependency parse information and lexico-syntactic patterns, resulting in distributional features (the dimensions of the semantic space) such as subject kill, with gun or as sharp as. The score of a target word with a feature is based not on the absolute number of times they co-occur in the corpus, but on the variety of different surface realizations of the feature the word co-occurs with. For example, for the word fat and the feature of animal, the raw score is 9 because fat co-occurs with 9 different forms of the feature (a fat of the animal, the fat of the animal, fats of animal. . . ). Refer to Baroni and Lenci (2010) for how the surface realizations of a feature are determined. Raw scores are then transformed into Local Mutual Information values.
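The type-based scoring idea can be illustrated with the small sketch below, on invented tuples: the raw score of a word-feature pair is the number of distinct surface realizations observed, not their token frequency.

```python
# Sketch of TypeDM-style raw scoring: count distinct surface realizations of
# each (word, feature) pair (toy observations).
from collections import defaultdict

observations = [
    ("fat", "of animal", "a fat of the animal"),
    ("fat", "of animal", "the fat of the animal"),
    ("fat", "of animal", "the fat of the animal"),  # repeated token, same type
    ("fat", "of animal", "fats of animal"),
]

surface_forms = defaultdict(set)
for word, feature, form in observations:
    surface_forms[(word, feature)].add(form)

raw_scores = {pair: len(forms) for pair, forms in surface_forms.items()}
print(raw_scores)  # {('fat', 'of animal'): 3}
```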

The DM semantic space is a matrix with 30K rows (target words) represented in a space of more than 700M dimensions. Since our visual dimension extraction algorithms produce at most 32K dimensions (see Section 4.2 below), we make the impact of text features on the combined model directly comparable to that of visual features by selecting only the top n DM dimensions (with n varying as explained below). The top dimensions are picked based on their cumulative Local Mutual Information mass. We show in the experiments below that trimming DM in this way does not have a negative impact on its performance, so that we are justified in claiming that we are adding visual information to a state-of-the-art text-based semantic space.
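A rough sketch of the trimming step is given below, under the assumption that "cumulative Local Mutual Information mass" amounts to ranking dimensions by their summed LMI scores and keeping the heaviest n; the matrix values are toy numbers.

```python
# Sketch: rank dimensions (columns) by total LMI mass and keep the top n.
import numpy as np

rng = np.random.default_rng(1)
dm = rng.random((5, 10))          # 5 words x 10 dimensions, toy LMI scores
n = 4                             # number of dimensions to keep

mass = dm.sum(axis=0)             # LMI mass per dimension
top = np.argsort(mass)[::-1][:n]  # indices of the n heaviest dimensions
dm_trimmed = dm[:, np.sort(top)]  # keep original column order
print(dm_trimmed.shape)           # (5, 4)
```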

4.2 Visual Information Extraction

For our experiments, we use the ESP-Game data set.3 It contains 50K images, labeled through the famous “game with a purpose” developed by Luis von Ahn (von Ahn and Dabbish, 2004). The tags of the images in the data set form a vocabulary of 11K distinct word types. Images carry 6.686 tags on average (s.d. 2.357). The ESP-Game corpus is an interesting data set from our point of view since, on the one hand, it is rather large and we know that the tags it contains are related to the images. On the other hand, it is not the product of experts labelling representative images, but of a noisy annotation process of often poor-quality or uninteresting images (e.g., logos) randomly downloaded from the Web. Thus, analogously to the case of a textual corpus, our algorithms must be able to exploit large-scale statistical information while being robust to noise.

3 http://www.espgame.org

Following what has become an increasingly standard procedure in computer vision, we use the Difference of Gaussians (DoG) detector to automatically detect keypoints in images and consequently map them to visual words (Lowe, 1999; Lowe, 2004). We use the Scale-Invariant Feature Transform (SIFT) to describe the keypoints in terms of a 128-dimensional real-valued descriptor vector. Color SIFT descriptors are extracted on a regular grid with five-pixel spacing, at four multiple scales (10, 15, 20, 25 pixel radii), zeroing the low-contrast ones. We chose SIFT for its invariance to image scale, orientation, noise and distortion, and its partial invariance to illumination changes. To map the descriptors to visual words, we cluster the keypoints in their 128-dimensional space using the K-means clustering algorithm, and encode each keypoint by the index of the cluster (visual word) to which it belongs. We varied the number of visual words between 250 and 2000 in steps of 250. We then computed a one-level 4x4 pyramid of spatial histograms (Grauman and Darrell, 2005), consequently increasing the feature dimensions 16 times, to a number that varies between 4K and 32K in steps of 4K. From the point of view of our distributional semantic model construction, the important point to keep in mind is that standard parameter choices such as the ones we adopted lead to distributional vectors with 4K, 8K, . . . , 32K dimensions, where a higher number of features corresponds, roughly, to a more granular analysis of an image. We used the VLFeat implementation for the entire pipeline (Vedaldi and Fulkerson, 2008). See the references in Section 2.2 above for technical details.
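The spatial-histogram step can be sketched as follows on toy data: keypoints already mapped to visual words are binned into a 4x4 grid of cells, and the per-cell histograms are concatenated, multiplying the dimensionality by 16 (so, e.g., 2000 visual words yield 32K features). This is an illustration of the idea, not the VLFeat implementation used in the paper.

```python
# Sketch of a one-level 4x4 pyramid of spatial histograms over visual words.
import numpy as np

n_visual_words = 10
grid = 4
width, height = 640, 480

rng = np.random.default_rng(2)
words = rng.integers(0, n_visual_words, size=100)  # visual word per keypoint
xs = rng.uniform(0, width, size=100)               # keypoint x coordinates
ys = rng.uniform(0, height, size=100)              # keypoint y coordinates

pyramid = np.zeros((grid, grid, n_visual_words))
for w, x, y in zip(words, xs, ys):
    col = min(int(x / width * grid), grid - 1)
    row = min(int(y / height * grid), grid - 1)
    pyramid[row, col, w] += 1

features = pyramid.reshape(-1)   # 16 * n_visual_words dimensions
print(features.shape)            # (160,)
```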

4.3 Model integration

We remarked above that the visual word extraction procedure naturally leads to 8 kinds of image-based vectors, with dimensionalities from 4K to 32K in steps of 4K. To balance text and image information, we use DM vectors made of the top n features, with n ranging from 4K to 32K in the same 4K steps. By combining these, we obtain 64 combined models (4K text and 4K image dimensions, 4K text and 8K image dimensions, etc.). Since in the experiments on WordSim (Section 5.1 below) we observe the best performance with 32K text-based features, we report here only experiments with (at least) 32K text dimensions. Similar patterns to the ones we report are observed when adding image-based dimensions to text-based vectors of different dimensionalities.

For a thoroughly fair comparison, if we add n visual features to the text-based model and we notice an improvement, we must ask whether the same improvement could also have been obtained by adding more text-based features. To control for this possibility, we also consider a set of purely text-based models that have the same number of dimensions as the combined models, that is, the top 32K DM features plus 8K, . . . , 32K further DM features (the next top features in the cumulative Local Mutual Information score ranking). In the experiments below, we refer to the purely textual model as text (always 32K dimensions), to the purely image-based model as image, to the combined models as combined, and to the control in which further text dimensions are added for comparability with combined as text+.

4.4 Evaluation benchmarks

We conduct our most extensive evaluation on the WordSim353 data set (Finkelstein et al., 2002), a widely used benchmark constructed by asking 16 subjects to rate a set of word pairs on a 10-point similarity scale and averaging the ratings (dollar/buck receives a high 9.22 average rating, professor/cucumber a low 0.31). We cover 260 WordSim (mostly noun/noun) pairs. We evaluate models in terms of the Spearman correlation between the cosines they produce for the WordSim pairs and the average human ratings for the same pairs (here and below, we do not report comparisons with the state of the art in the literature, because we have reduced coverage of the data sets, which makes the comparison not meaningful).
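The evaluation loop can be sketched as follows. The vectors and the last two ratings below are invented, the first two ratings are the WordSim examples quoted above, and scipy's spearmanr computes the rank correlation.

```python
# Sketch of WordSim-style evaluation: correlate model cosines with ratings.
import numpy as np
from scipy.stats import spearmanr

vectors = {"dollar": np.array([1.0, 0.1, 0.0]), "buck": np.array([0.9, 0.2, 0.0]),
           "professor": np.array([0.0, 0.2, 1.0]), "cucumber": np.array([0.1, 1.0, 0.2])}
pairs = [("dollar", "buck", 9.22), ("professor", "cucumber", 0.31),
         ("dollar", "professor", 1.0), ("buck", "cucumber", 0.5)]  # last two invented

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in pairs]
human_scores = [rating for _, _, rating in pairs]
rho, p = spearmanr(model_scores, human_scores)
print(rho)
```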

To verify if the conclusions reached on WordSim extend to different semantic tasks, we use two concept categorization benchmarks, where the goal is to cluster a set of (nominal) concepts into broader categories. The Almuhareb-Poesio (AP) concept set (Almuhareb, 2006), in the version we cover, contains 230 concepts to be clustered into 21 classes such as vehicle (airplane, car. . . ), time (aeon, future. . . ) or social unit (brigade, nation). The Battig set (Baroni et al., 2010), in the version we cover, contains 72 concepts to be clustered into 10 classes. Unlike AP, Battig only contains concrete basic-level concepts belonging to categories such as bird (eagle, owl. . . ), kitchenware (bowl, spoon. . . ) or vegetable (broccoli, potato. . . ). For both sets, following the original proponents and others, we cluster the words based on their pairwise cosines in the semantic space defined by a model, using the CLUTO toolkit (Karypis, 2003). We use CLUTO's built-in repeated bisections with global optimization method, accepting all of CLUTO's default values. Cluster quality is evaluated by percentage purity (Zhao and Karypis, 2003). If $n_r^i$ is the number of items from the $i$-th true (gold standard) class that were assigned to the $r$-th cluster, $n$ is the total number of items and $k$ the number of clusters, then:

$$\mathrm{Purity} = \frac{1}{n} \sum_{r=1}^{k} \max_i \left( n_r^i \right)$$

In the best case (perfect clusters), purity is 100%; as cluster quality deteriorates, purity approaches 0.
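A small sketch of the purity computation on a toy clustering:

```python
# Purity = (1/n) * sum over clusters of the size of the majority gold class.
from collections import Counter

def purity(gold_labels, cluster_labels):
    n = len(gold_labels)
    clusters = {}
    for g, c in zip(gold_labels, cluster_labels):
        clusters.setdefault(c, []).append(g)
    return sum(Counter(members).most_common(1)[0][1]
               for members in clusters.values()) / n

# Toy example: 6 concepts, 2 gold classes, 2 induced clusters.
gold = ["bird", "bird", "bird", "vegetable", "vegetable", "vegetable"]
clusters = [0, 0, 1, 1, 1, 1]
print(purity(gold, clusters))  # 0.833...
```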

Finally, we use the Baroni-Lenci Evaluation of Semantic Similarity (BLESS) data set made available by the GEMS 2011 organizers.4 In the version we cover, the data set contains 174 concrete nominal concepts, each paired with sets of words that instantiate the following relations: hypernymy (spear/weapon), coordination (tiger/coyote), meronymy (castle/hall), typical attribute (an adjective: grapefruit/tart) and typical event (a verb: cat/hiss). Concepts are moreover matched with 3 sets of randomly picked unrelated words (nouns, adjectives and verbs). For each true and random relation, the data set contains at least one word per concept, typically more. Following the GEMS guidelines, we apply a model to BLESS as follows. Given the similarity scores provided by the model for a concept with all the associated words within a relation, we pick the term with the highest score. We then z-standardize the 8 scores we obtain for each concept (one per relation), and we produce a boxplot summarizing the distribution of z scores per relation across the concepts (i.e., each box of the plot summarizes the distribution of the 174 scores picked for each relation, standardized as just described). Boxplots are produced accepting the default boxplotting options of the R statistical package5 (boxes extend from the first to the third quartile; the median is the horizontal line inside the box).

4 http://sites.google.com/site/geometricalmodels/shared-evaluation

5 http://www.r-project.org/
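The per-concept scoring can be sketched as follows, with invented cosines for a single concept: for each relation the maximum cosine is kept, and the resulting scores are z-standardized within the concept; these are the values the boxplots then summarize per relation.

```python
# Sketch of per-concept BLESS scoring: max cosine per relation, then z-scores.
import numpy as np

# concept -> relation -> cosines of the concept with the words in that relation
scores = {
    "spear": {"hyper": [0.61, 0.40], "coord": [0.72], "mero": [0.35, 0.20],
              "attri": [0.30], "event": [0.25], "ran.n": [0.10, 0.05],
              "ran.j": [0.08], "ran.v": [0.07]},
}

for concept, by_relation in scores.items():
    best = {rel: max(cos) for rel, cos in by_relation.items()}
    vals = np.array(list(best.values()))
    z = (vals - vals.mean()) / vals.std()   # z-standardize the 8 picked scores
    print(concept, dict(zip(best, np.round(z, 2))))
```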


5 Results

5.1 WordSim

The WordSim results for our models across dimensionalities, as well as for the full DM model, are summarized in Figure 3.

[Figure 3: Performance of distributional models on WordSim. Spearman correlation (%) is plotted against the number of features added to the top 32K DM dimensions (+4K to +32K), with curves for DM, combined, text, image and text+.]

The purely image-based model has the worst performance in all settings, although even the lowest image-based Spearman score (0.29) is significantly above chance (p < 0.05), suggesting that the model does capture some semantic information. Conversely, adding image-based dimensions to a textual model (combined) consistently reaches the best performance, which is also better – for all choices of dimensionality – than adding an equal number of text features (text+) or using the full DM matrix. Interestingly, the same overall result pattern is observed if we limit the evaluation to the WordSim subsets that Agirre et al. (2009) have identified as semantically similar (e.g., synonyms or coordinate terms) and semantically related (e.g., meronyms or topically related concepts).

Based on the results reported in Figure 3, further analyses will focus on the combined model with +20K image-based features, since the performance of combined does not seem to be greatly affected by the dimensionality parameter, and performance around this value looks quite stable (it is better only at the boundary +4K value, and with +28K, where, however, there is a dip for the image model). The text+ performance is essentially unaffected by the dimensionality parameter, and we pick the +20K version for maximum comparability with combined.

The difference between combined and text+, although consistent, is not statistically significant according to a two-tailed paired permutation test (Moore and McCabe, 2005) conducted on the results for the +20K versions of the models. Still, very interesting qualitative differences emerge. Table 1 reports the WordSim pairs (among those with above-median human-judged similarity) that have the highest and lowest combined-to-text+ cosine ratios, i.e., pairs that are correctly treated as similar by combined but not by text+, and vice versa. Strikingly, the pairs characterizing the image-feature-enriched combined model are all made of concrete, highly imageable concepts, whereas the text+ pairs refer to very abstract notions. We thus see here the first evidence of the complementary nature of visual and textual information.

combined         text+
tennis/racket    physics/proton
planet/sun       championship/tournament
closet/clothes   profit/loss
king/rook        registration/arrangement
cell/phone       mile/kilometer

Table 1: WordSim pairs with the highest (first column) and lowest (second column) combined-to-text+ cosine ratios.
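A sketch of this ranking on toy cosine values (the pairs are taken from Table 1, the numbers are invented):

```python
# Sketch: rank word pairs by the ratio of their combined cosine to their
# text+ cosine; high ratios are favoured by combined, low ratios by text+.
cosines_combined = {("tennis", "racket"): 0.71, ("physics", "proton"): 0.32,
                    ("planet", "sun"): 0.69}
cosines_textplus = {("tennis", "racket"): 0.35, ("physics", "proton"): 0.66,
                    ("planet", "sun"): 0.41}

ratios = {pair: cosines_combined[pair] / cosines_textplus[pair]
          for pair in cosines_combined}
ranked = sorted(ratios, key=ratios.get, reverse=True)
print("favoured by combined:", ranked[0])
print("favoured by text+:", ranked[-1])
```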

5.2 Concept categorization

Table 2 reports percentage purities in the AP and Battig clustering tasks for the full DM model and the representative models discussed above.

model      AP   Battig
DM         81   96
text       79   83
text+      80   86
image      25   36
combined   78   96

Table 2: Percentage AP and Battig purities of distributional models.

Once more, we see that the image model alone is not at the level of the text models, although both its AP and Battig purities are significantly above chance (p < 0.05, based on simulated distributions for random cluster assignment). Thus, even alone, image-based vectors do capture aspects of meaning. For AP, adding image features does not improve performance, although it does not significantly worsen it either (a two-tailed paired permutation test confirms that the difference between text+ and combined is far from significance). For Battig, adding visual features improves on the purely text-based models based on a comparable number of features (although the difference between text+ and combined is not significant), reaching the same performance obtained with the full DM model (which in these categorization tests is slightly above that of the trimmed models). Intriguingly, the Battig test is entirely composed of concrete concepts, so the difference in performance for combined might be related to its preference for concrete things that we already observed for WordSim.

5.3 BLESS

The BLESS distributions of the text-based models (including combined) are very similar, so we use the full DM model here as representative of the text-based set; its distribution is compared to that of the purely image-based model in Figure 4.

We see that the purely text-based DM cosines capture a reasonable scale of taxonomic similarity among nominal neighbours (coordinates, then hypernyms, then meronyms, then random nouns), whereas verbs and adjectives are uniformly very distant, whether they are related or not. This is not surprising, because the DM links mostly reflect syntactic patterns that will be disjoint across parts of speech (e.g., a feature like subject kill will only apply to nouns, save for parsing errors). Looking at the image-only model, we first observe that it can capture differences between related attributes/events and random adjectives/verbs (according to a Tukey HSD test for all pairwise comparisons, these differences are highly significant, whereas DM only significantly distinguishes attributes from random verbs). In this respect, image is arguably the “best” model on BLESS. However, perhaps more interestingly, the image model also shows a bias for nouns, capturing the same taxonomic hierarchy found for DM. This suggests that image analysis provides a decomposition of concepts into attributes shared by similar entities, capturing ontological similarity beyond mere syntagmatic co-occurrence in an image description.

To support this latter claim, we counted the average number of times that the related terms picked by the image model directly co-occur with the target concepts in an ESP-Game label. It turns out that this count is higher for both attributes (10.6) and hypernyms (7.5) than for coordinates (6.5). So, the higher similarity of coordinates in the image model demonstrates that its features do generalize across images, allowing us to capture “attributional” or “paradigmatic” similarity in visual space. More generally, we find that, among all the related terms picked by the image model that have an above-average cosine with the target concept, almost half (41%) never co-occur with the concept in the image set, again supporting the claim that, through our featural analysis, we are capturing visual properties of similar concepts beyond their co-occurrence as descriptions of the same image.

A final interesting point pertains to the specific instances of each (non-random) relation picked by the textual and visual models: of 870 related term pairs in total, almost half (418) differ between DM and image, suggesting that the boxplots in Figure 4 hide larger differences in what the models are doing. The randomly picked examples of mismatches in top attributes in Table 3 clearly illustrate the qualitative difference between the models and, once more, the tendency of image-based representations to favour (not surprisingly!) highly visual properties such as colours and shapes, versus the well-known tendency of text-based models to extract systemic or functional characteristics such as powerful or elegant (Baroni et al., 2010). By combining the two sources of information, we should be able to develop distributional models that provide more well-rounded characterizations of the concepts they describe.

6 Conclusion

We proposed a simple method to augment a state-of-the-art text-based distributional semantic model with information extracted from image analysis. The method is based on the standard bag-of-visual-words representation of images in computer vision. The image-based distributional profile of a word is encoded in a vector of co-occurrences with “visual words”, which we concatenate with a text-based co-occurrence vector.


[Figure 4: Distribution of z-normalized cosines of words instantiating various relations (COORD, HYPER, MERO, ATTRI, EVENT, RAN.N, RAN.J, RAN.V) across BLESS concepts; left panel: DM, right panel: Image.]

concept     DM          image     | concept     DM         image
ant         small       black     | potato      edible     red
axe         powerful    old       | rifle       short      black
cathedral   ancient     dark      | scooter     cheap      white
cottage     little      old       | shirt       fancy      black
dresser     new         square    | sparrow     wild       brown
fighter     fast        old       | squirrel    fluffy     brown
fork        dangerous   shiny     | sweater     elegant    old
goose       white       old       | truck       new        heavy
jet         fast        old       | villa       new        cosy
pistol      dangerous   black     | whale       large      gray

Table 3: Randomly selected cases where the nearest attributes picked by DM and image differ.

A cautious interpretation of our results is that adding image-based features is at least not damaging, when compared to adding further text-based features, and possibly beneficial. Importantly, in all experiments we find that image-based features lead to interesting qualitative differences in performance: models including image-based information are more oriented towards capturing similarities between concrete concepts, and focus on their more imageable properties, whereas text-based features are more geared towards abstract concepts and properties. Coming back to the discussion of symbol grounding at the beginning of the paper, we consider this (very!) preliminary evidence for an integrated view of semantics in which the more concrete aspects of meaning derive from perceptual experience, whereas verbal associations mostly account for abstraction.

In future work, we plan first of all to improve performance, by focusing on visual word extraction and on how the text- and image-based vectors are combined (possibly using supervision to optimize both feature extraction and integration with respect to semantic tasks). However, the most exciting direction we intend to follow next concerns evaluation, and in particular devising new benchmarks that directly address the special properties of image-enhanced models. For example, Baroni and Lenci (2008) observe that text-based distributional models are seriously lacking when it comes to characterizing the physical properties of concepts, such as their colors or parts. These are exactly the aspects of conceptual knowledge where image-based information should help most, and we will devise new test sets that focus specifically on verifying this hypothesis.

References

Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and WordNet-based approaches. In Proceedings of HLT-NAACL, pages 19–27, Boulder, CO.

Abdulrahman Almuhareb. 2006. Attributes in Lexical Acquisition. PhD thesis, University of Essex.

Mark Andrews, Gabriella Vigliocco, and David Vinson. 2009. Integrating experiential and distributional data to learn semantic representations. Psychological Review, 116(3):463–498.

Marco Baroni and Alessandro Lenci. 2008. Concepts and properties in word spaces. Italian Journal of Linguistics, 20(1):55–88.

Marco Baroni and Alessandro Lenci. 2010. Distributional Memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673–721.

Marco Baroni, Eduard Barbu, Brian Murphy, and Massimo Poesio. 2010. Strudel: A distributional semantic model based on properties and types. Cognitive Science, 34(2):222–254.

Lawrence Barsalou, Ava Santos, Kyle Simmons, and Christine Wilson. 2008. Language and Simulation in Conceptual Processing, chapter 13, pages 245–283. Oxford University Press, USA, 1st edition.

Anna Bosch, Andrew Zisserman, and Xavier Munoz. 2007. Image classification using random forests and ferns. In Proceedings of the IEEE 11th International Conference on Computer Vision (ICCV 2007), pages 1–8.

Gabriella Csurka, Christopher Dance, Lixin Fan, Jutta Willamowski, and Cedric Bray. 2004. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, pages 1–22.

Stefan Evert. 2005. The Statistics of Word Cooccurrences. Dissertation, Stuttgart University.

Yansong Feng and Mirella Lapata. 2010. Visual information in semantic representation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 91–99, Los Angeles, California. Association for Computational Linguistics.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2002. Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1):116–131.

Arthur Glenberg. 1997. What memory is for. Behavioral and Brain Sciences, 20(1), March.

Kristen Grauman and Trevor Darrell. 2005. The pyramid match kernel: Discriminative classification with sets of image features. In Proceedings of ICCV, pages 1458–1465.

Tom Griffiths, Mark Steyvers, and Josh Tenenbaum. 2007. Topics in semantic representation. Psychological Review, 114:211–244.

Jonathon Hare, Sina Samangooei, Paul Lewis, and Mark Nixon. 2008. Semantic spaces revisited: Investigating the performance of auto-annotation and semantic retrieval using semantic spaces. In Proceedings of the 2008 International Conference on Content-based Image and Video Retrieval, CIVR '08, pages 359–368, New York, NY, USA. ACM.

Stevan Harnad. 1990. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1-3):335–346, June.

Thomas Hofmann. 2001. Unsupervised learning by Probabilistic Latent Semantic Analysis. Machine Learning, 42(1-2):177–196, January.

George Karypis. 2003. CLUTO: A clustering toolkit. Technical Report 02-017, University of Minnesota Department of Computer Science.

Thomas Landauer and Susan Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211–240.

Max Louwerse. 2011. Symbol interdependency in symbolic and embodied cognition. Topics in Cognitive Science, 3:273–302.

David Lowe. 1999. Object recognition from local scale-invariant features. In Proceedings of the IEEE International Conference on Computer Vision, volume 2, pages 1150–1157, August.

David Lowe. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), November.

George Miller and Walter Charles. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28.

David Moore and George McCabe. 2005. Introduction to the Practice of Statistics. Freeman, New York, 5th edition.

David Nister and Henrik Stewenius. 2006. Scalable recognition with a vocabulary tree. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2, CVPR '06, pages 2161–2168.

Josef Sivic and Andrew Zisserman. 2003. Video Google: A text retrieval approach to object matching in videos. In Proceedings of the International Conference on Computer Vision, volume 2, pages 1470–1477, October.

Peter Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Andrea Vedaldi and Brian Fulkerson. 2008. VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/.

Luis von Ahn and Laura Dabbish. 2004. Labeling images with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '04, pages 319–326, New York, NY, USA. ACM.

Ludwig Wittgenstein. 1953. Philosophical Investigations. Blackwell, Oxford. Translated by G.E.M. Anscombe.

Jun Yang, Yu-Gang Jiang, Alexander G. Hauptmann, and Chong-Wah Ngo. 2007. Evaluating bag-of-visual-words representations in scene classification. In James Ze Wang, Nozha Boujemaa, Alberto Del Bimbo, and Jia Li, editors, Multimedia Information Retrieval, pages 197–206. ACM.

Ying Zhao and George Karypis. 2003. Criterion functions for document clustering: Experiments and analysis. Technical Report 01-40, University of Minnesota Department of Computer Science.

Rolf Zwaan. 2004. The immersed experiencer: Toward an embodied theory of language comprehension. Psychology of Learning and Motivation: Advances in Research and Theory, 44.