Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4760–4769, November 7–11, 2021. ©2021 Association for Computational Linguistics

Learning grounded word meaning representations on similarity graphs

Mariella Dimiccoli
Institut de Robòtica i Informàtica Industrial (CSIC-UPC), Barcelona, Spain
[email protected]

Herwig Wendt
CNRS, IRIT, Univ. of Toulouse, France
[email protected]

Pau Batlle ∗
California Institute of Technology, Pasadena, California

Abstract

This paper introduces a novel approach to learn visually grounded meaning representations of words as low-dimensional node embeddings on an underlying graph hierarchy. The lower level of the hierarchy models modality-specific word representations through dedicated but communicating graphs, while the higher level puts these representations together on a single graph to learn a representation jointly from both modalities. The topology of each graph models similarity relations among words, and is estimated jointly with the graph embedding. The assumption underlying this model is that words sharing similar meaning correspond to communities in an underlying similarity graph in a low-dimensional space. We named this model Hierarchical Multi-Modal Similarity Graph Embedding (HM-SGE). Experimental results validate the ability of HM-SGE to simulate human similarity judgements and concept categorization, outperforming the state of the art.1

1 Introduction

During the last decade, there has been an increasing interest in deriving semantic representations from text corpora in the form of word vector spaces, both for words in context and for words in isolation (Mikolov et al., 2013; Ling et al., 2015; McCann et al., 2017; Devlin et al., 2018; Peters et al., 2018). However, semantic representations are tied to sensory experience, at least for concrete nouns (Anderson et al., 2017), and deriving them by relying solely on text leads to a lack of grounding in the extra-linguistic modality. For this reason, several works have addressed the problem of grounding perceptual information, in the form of visual information approximated by feature norms elicited from humans (Andrews et al., 2009; Silberer and Lapata, 2012), extracted automatically from images (Kiela and Bottou, 2014; Lazaridou et al., 2015; Silberer et al., 2016), or a combination of them (Roller and Im Walde, 2013). Furthermore, several integration mechanisms for the linguistic and perceptual modalities have been proposed. They include methods employing transformation and dimension reduction on the concatenation of unimodal representations (Bruni et al., 2014; Hill and Korhonen, 2014); generative probabilistic models (Andrews et al., 2009; Roller and Im Walde, 2013; Feng and Lapata, 2010); and deep learning methods such as autoencoders and recursive neural networks (Silberer et al., 2016; Socher et al., 2013). While the approaches above take previously extracted unimodal representations as input, (Lazaridou et al., 2015; Hill and Korhonen, 2014; Zablocki et al., 2018) learn directly from raw data by integrating both modalities non-linearly in a low-dimensional space within a skip-gram model framework. These approaches typically lead to marginal improvements, since only a small part of the vocabulary is covered by corresponding visual information. Recent work has focused on generating pre-trainable generic representations for visual-linguistic tasks by using an instance-level contextualized approach (Su et al., 2020; Lu et al., 2019; Sun et al., 2019). Such models have proved effective for several natural language applications such as visual question answering and visual commonsense reasoning, but their ability to simulate human behavioural phenomena has not been assessed so far.

∗ Work done during an internship at the IRI (CSIC-UPC).
1 Code available: https://github.com/mdimiccoli/HM-SGE/.
Work partially funded by projects MINECO/ERDF RyC, PID2019-110977GA-I00, MDM-2016-0656.

In this paper, we propose a novel graph-based model for learning word representations across the vision and language modalities that can simulate human behaviour in many NLP tasks. Our assumption is that words sharing similar meaning correspond to communities in an underlying hierarchy of graphs in low-dimensional spaces. The lower level of the hierarchy models modality-specific word representations through dedicated but communicating graphs, while the higher level puts these representations together on a single graph to learn a single representation jointly from both modalities. At each level, the graph topology models similarity relations among words, and it is estimated jointly with the graph embedding. To the best of our knowledge, this is the first model that uses a graph-based approach to learn grounded word meaning representations. Technically, our method is a joint feature learning and clustering approach. Moreover, it is compatible with both associative and domain-general learning theories in experimental psychology, according to which the learning of a visual (linguistic) output is triggered and mediated by a linguistic (visual) input (Reijmers et al., 2007), and these mediated representations are then further encoded in higher-level cortices (Rogers et al., 2004). This work has applications in cognitive science, prominently in simulations of human behavior involving deep dyslexia, semantic priming and similarity judgments, among others.

The contributions of this work are as follows: 1) We propose a novel technical approach to learn grounded semantic representations as low-dimensional embeddings on a hierarchy of similarity graphs; 2) we achieve state-of-the-art performance with respect to several baselines for simulating human behaviour in the tasks of similarity rating and categorization; 3) we demonstrate the ability of our model to perform inductive inference; 4) we validate the proposed approach through an extensive ablation study, and provide insights about the learnt representations through qualitative results and visualizations; 5) our model is compatible with associative and domain-general learning theories in experimental psychology.

2 Related work

Distributional Semantic Models (DSMs) are based on the distributional hypothesis (Harris, 2013), according to which words that appear in similar linguistic contexts are likely to have related meanings. DSMs associate each word with a vector (a.k.a. word embedding) that encodes information about its co-occurrence with other words in corpora (Mikolov et al., 2013; Ling et al., 2015). Recently, instance-level contextualized word embeddings (Devlin et al., 2018) have emerged as a natural extension of type-level non-contextualized DSMs and have demonstrated their effectiveness with respect to their non-contextualized counterparts in a wide variety of common NLP tasks (McCann et al., 2017; Devlin et al., 2018; Peters et al., 2018).

However, humans learn the verbal description of objects by hearing words while looking at, listening to, or interacting with objects. Therefore, in recent years there has been an increasing interest in developing linguistic models augmented with perceptual information, commonly called grounded semantic spaces. Following the classification introduced by (Collell et al., 2017), we can distinguish two integration strategies: 1) a posteriori combination, where each modality is learnt separately and they are integrated afterwards, and 2) simultaneous learning, where a single representation is learnt from raw input data enriched with both modalities.

A posteriori combination. Several works aim at directly projecting vision and language into a common space. Among them, (Bruni et al., 2014) concatenate and project two independently constructed textual and visual spaces onto a lower-dimensional space using Singular Value Decomposition (SVD). Other approaches along the same line build on extensions of topic models such as Latent Dirichlet Allocation (LDA), where topic distributions are learnt from the observed variables (words and other perceptual units) (Andrews et al., 2009; Roller and Im Walde, 2013; Feng and Lapata, 2010). In (Kiela and Bottou, 2014) an empirical improvement is obtained by using state-of-the-art convolutional neural networks to extract visual features and the skip-gram model for textual features, which are simply concatenated. In (Silberer et al., 2016) a stacked autoencoder framework is used to learn a representation by means of an unsupervised criterion (the minimization of the reconstruction error of the attribute-based representation), which is then fine-tuned with a semi-supervised criterion (object classification of the input). In the approach proposed by (Wang et al., 2018) the weights of the unimodal feature concatenation are learnable parameters that allow representations from different modalities to be fused dynamically according to different types of words. Other works focus on learning bimodal representations that are task-specific, with the goal of reasoning about one modality given the other (Lazaridou et al., 2014; Socher et al., 2013). For example, the aim of image retrieval is to find a mapping between two modalities to tackle an image-based task such as zero-shot learning (Frome et al., 2013) or caption generation/retrieval and caption-based image retrieval (Socher et al., 2014). These models typically use a training criterion and an architecture suited to the task at hand.

Figure 1: HM-SGE model overview. The first layer learns unimodal representations conditioned on the other modality via its corresponding similarity graph. The concatenation of the enhanced representations is then used to jointly learn a common representation.

Simultaneous learning. Little work has explored the possibility of learning multimodal representations directly from raw input data, i.e., images and corpora, building on the skip-gram framework. (Hill and Korhonen, 2014) treat perceptual input as part of a word's linguistic context, which has proved effective in propagating visual knowledge to abstract words. (Lazaridou et al., 2015) modify the skip-gram objective function to predict both visual and linguistic features, which is especially good for zero-shot image classification. (Zablocki et al., 2018) contributed to this research line by leveraging the visual surroundings of objects to fulfill the distributional hypothesis for the visual modality. This class of approaches typically leads only to a small empirical improvement of the linguistic vectors, since the words from the raw text corpus associated with images (and hence with perceptual information) cover only a small portion of the training dataset. In the last few years, increasing efforts have been devoted to deriving generic pre-trainable representations for visual-linguistic tasks based on transformers (Sun et al., 2019; Lu et al., 2019; Su et al., 2020).

Graph-based word meaning representations. Despite the success of graph-based models for sentence meaning (Koller et al., 2019), their use for encoding word meaning representations has been little explored so far. Recently, Graph Convolutional Network (GCN) based approaches have been used to improve text-based word meaning representations by incorporating syntactic information (Vashishth et al., 2019; Tran et al., 2020; Ryabinin et al., 2020). To the best of our knowledge, graph-based models have never been exploited for grounded word meaning representations.

Figure 2: DGE model, with which our HM-SGE modules share the high-level structure. A similarity graph is estimated from the initial high-dimensional embeddings, which are then projected into a low-dimensional space. DGE alternates a graph embedding step with a graph update step.

3 Proposed approach

Model assumptions and psychological foundations. We assume that each modality, linguistic (X) and visual (Y), can be well represented by an unknown underlying graph in a low-dimensional space, in which nodes correspond to words and weighted edges to the respective word similarities. The associative nature of human memory, a widely accepted theory in experimental psychology (Anderson and Bower, 2014), suggests that beyond being complementary, there exists a certain degree of correlation between these modalities (Reijmers et al., 2007). Therefore, in a first step, our model aims at exploiting these correlations to enhance each unimodal representation individually. This simulates the fact that when humans face, for instance, the task of rating the visual similarity of a pair of images, their mental search necessarily includes both components, the visual and the semantic one.

In a second step, these enhanced unimodal representations are mapped into a common embedding space by inducing semantic representations that integrate both modalities. This agrees with neuroscientific findings showing that it is unlikely that humans have separate representations for different aspects of word meaning (Rogers et al., 2004). Overall, the proposed two-layer structure, where the first layer has two modality-specific branches that provide input to a single generic branch in the second layer, is compatible with theories of both associative memory (Reijmers et al., 2007) and domain-general learning (Rogers et al., 2004).

Algorithm 1 Hierarchical Multimodal SGE
N: number of words
Input: X ∈ R^{N×n}, Y ∈ R^{N×m}, textual and visual feature matrices
Output: X ∈ R^{N×d} (d ≪ n), Y ∈ R^{N×p} (p ≪ m), Z ∈ R^{N×(d+p)}, graph-embedded feature matrices

/* Initialize embeddings and graphs */
X = PCA_d(X) ∈ R^{N×d},  Y = PCA_p(Y) ∈ R^{N×p}
G^X_0 = G^X = S_l(X) ∈ R^{N×N},  G^Y_0 = G^Y = S_l(Y) ∈ R^{N×N}

/* Layer 1: Coupled SGE loop */
for i = 1 to K_1 do
  — graph structure update: semantic prior
    G^X_i ← g_µ(G^X_i, kmeans(X_i; N_C)),  G^Y_i ← g_µ(G^Y_i, kmeans(Y_i; N_C))
    (estimate communities and encode similarity in the graph)
  — graph embedding update
    for (A, B) ∈ {(X, Y), (Y, X)}:
      A_i = argmin_A (1−α_A) L(S_l(A), G^A_{i−1}) + α_A L(S_l(A), G^A_0) + β_A L(S_l(A), G^B_{i−1})
    (update embedded features given the current graphs)
end

/* Initialize embedding and graph */
Z_0 = (X, Y),  G^Z_0 = G^Z = S_l(Z) ∈ R^{N×N}

/* Layer 2: Joint SGE loop */
for i = 1 to K_2 do
  — graph structure update: semantic prior
    G^Z_i ← g_µ(G^Z_i, kmeans(Z_i; N_C))
    (estimate communities and encode similarity in the graph)
  — graph embedding update
    Z_i = argmin_Z (1−α_Z) L(S_l(Z), G^Z_{i−1}) + α_Z L(S_l(Z), G^Z_0)
    (update features given the current graph)
end

Architecture overview. Fig. 1 illustrates the proposed model. It takes as input linguistic (X) and visual (Y) vector representations. In the first layer, the initial embeddings of both modalities are enhanced individually by relying on the other modality. In the last layer, the conditional embeddings are concatenated and jointly optimized. For each modality, we first build a fully connected similarity graph G^X_0 from the initial features X ∈ R^{N×n}, where N is the number of samples and n is the feature dimension. G^X_0 subsequently serves to regularize the process of jointly learning the underlying graph and an embedding X ∈ R^{N×d} of dimension d ≪ n, which is achieved by alternating two steps: at iteration i, 1) update the node embeddings X_i by taking into account the current graph estimate G^X_{i−1} (reflected by edge weights) and that of the other modality, and 2) update the graph estimate G^X_i (the fully connected similarity graph of X_i) by taking into account semantic similarity priors.

Conceptually, the alternation of a graph embedding step and a graph structure update step in our SGE is similar to the recently introduced Dynamic Graph Embedding (DGE) (Dimiccoli and Wendt, 2020) (see Fig. 2), where a low-dimensional embedding is learnt from image sequences for the downstream task of temporal segmentation. However, we change the way each of these steps is formulated. Firstly, to learn unimodal (visual or textual) representations that take into account semantic communities in both graphs (textual and visual), we allow each modality to share its graph with the other modality. This is achieved by modifying the embedding loss during the embedding step. Secondly, since our data are not temporally linked, we do not model temporal constraints in the clustering step, but only semantic similarity among words. Finally, we propose a two-layer hierarchical architecture that is tailored to the learning of visually grounded meaning representations.
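To make the initialization stage concrete, the sketch below builds the PCA-initialized embeddings and the fully connected similarity graphs G^X_0 and G^Y_0. It is a non-authoritative illustration: the exact form of S_l, its normalization and the bandwidth handling are assumptions (here an exponential kernel on cosine distance), and the function names are ours.

```python
# Minimal sketch of the initialization of Algorithm 1 (assumed kernel form).
import numpy as np
from sklearn.decomposition import PCA

def similarity_graph(X, l=1.0):
    """Fully connected similarity graph S_l(X): N x N matrix of
    exp(-cosine_distance / l) between the rows of X."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    cos = Xn @ Xn.T                      # cosine similarity in [-1, 1]
    return np.exp(-(1.0 - cos) / l)      # exponential kernel on cosine distance

def init_embeddings(X_raw, Y_raw, d=15, p=15, l=1.0):
    """PCA initialization of the low-dimensional embeddings and of the
    initial similarity graphs G^X_0 and G^Y_0."""
    X = PCA(n_components=d).fit_transform(X_raw)
    Y = PCA(n_components=p).fit_transform(Y_raw)
    return X, Y, similarity_graph(X, l), similarity_graph(Y, l)
```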

Coupled similarity graph embedding update. Formally, the embedding update for X (and analogously for Y) is computed as

X_i = argmin_X (1−α_X) L(S_l(X), G^X_{i−1}) + α_X L(S_l(X), G^X_0) + β_X L(S_l(X), G^Y_{i−1}),    (1)

where L is a cross-entropy loss function that includes normalization of its arguments, and S_l stands for a cosine-distance based pairwise similarity function with an exponential kernel of bandwidth l. The first term in (1) controls the fit of the representation X with the learnt graph G^X_{i−1} in the low-dimensional embedding space, while the second term ensures that it also keeps aspects of the initial graph G^X_0; α ∈ [0, 1] controls the relative weight of these terms; the hyperparameters (β_X, β_Y) tune the respective weights of the graph of the other modality in the unimodal representations.
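A minimal sketch of the coupled embedding update of Eq. (1) is given below. It assumes that the cross-entropy L is taken between row-normalized similarity matrices and that the arg min is approximated by gradient descent; the optimizer, step count, learning rate, and the direction of the cross-entropy (which graph plays the role of the target) are our assumptions, since only the form of the loss is specified above.

```python
# Sketch of one coupled embedding update X_i = argmin_X of Eq. (1).
import torch

def sim_graph(X, l=1.0):
    Xn = torch.nn.functional.normalize(X, dim=1)
    return torch.exp(-(1.0 - Xn @ Xn.T) / l)

def row_normalize(G):
    return G / G.sum(dim=1, keepdim=True)

def cross_entropy_graphs(G_pred, G_target):
    # -sum_ij P_target(i, j) * log P_pred(i, j), with row-normalized graphs
    P, Q = row_normalize(G_target), row_normalize(G_pred)
    return -(P * torch.log(Q + 1e-12)).sum()

def coupled_update(X, G_prev, G0, G_other, alpha=0.1, beta=0.01,
                   l=1.0, steps=100, lr=1e-2):
    """Approximate the arg min of Eq. (1) by gradient descent on X."""
    X = X.clone().requires_grad_(True)
    opt = torch.optim.Adam([X], lr=lr)
    for _ in range(steps):
        S = sim_graph(X, l)
        loss = ((1 - alpha) * cross_entropy_graphs(S, G_prev)
                + alpha * cross_entropy_graphs(S, G0)
                + beta * cross_entropy_graphs(S, G_other))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return X.detach()
```

The same routine with β = 0 (i.e., dropping the last term) gives the joint update used in the second layer.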

Similarity graph update. To obtain an update for the graph at the i-th iteration, assuming that X_i is given, the model starts from an initial estimate G^X_i = S_l(X_i) and makes use of the model assumptions to modify G^X_i. In particular, the semantic prior assumes that the most similar nodes form communities in the graph, and leads to decreasing the edge weights between nodes of G^X_i that do not belong to the same community. Practically, this can be implemented by estimating communities using clustering, e.g., k-means with N_C classes, and multiplying the edge weights of node pairs belonging to different communities by a factor µ ∈ (0, 1); we denote this operation as G^X_i ← g_µ(G^X_i, kmeans(X_i; N_C)).
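The graph structure update g_µ(·, kmeans(·; N_C)) described above can be sketched as follows; the k-means implementation, the number of restarts and the seed handling are incidental choices, not the authors' exact setup.

```python
# Sketch of G^X_i <- g_mu(G^X_i, kmeans(X_i; N_C)): edges between nodes
# assigned to different k-means communities are shrunk by a factor mu.
import numpy as np
from sklearn.cluster import KMeans

def graph_update(G, X, n_clusters=25, mu=0.95, random_state=0):
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(X)
    same_community = labels[:, None] == labels[None, :]
    G_new = G.copy()
    G_new[~same_community] *= mu      # down-weight inter-community edges
    return G_new
```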

Joint similarity graph embedding update. After K_1 iterations, the learnt representations are concatenated, Z_0 = (X, Y), and input to the second layer of our model. It learns the joint representation Z that integrates both modalities as node embeddings on an underlying graph encoding visually grounded word meaning. The last term of Eq. (1) is omitted at this stage. A detailed description of our framework is given in Algorithm 1.

4 Experimental results

4.1 Experimental setting

Visual and textual representations. As visual and textual input feature vectors we used the attribute-based representations proposed by (Silberer et al., 2016). Specifically, the visual modality is encoded via 414-dimensional vectors of attributes obtained automatically from images by training an SVM-based classifier for each attribute on the VISA dataset (Silberer et al., 2016). More specifically, we used initial meaning representations of words for the McRae nouns (McRae et al., 2005) covered by the VISA dataset (Silberer et al., 2016), which consist exclusively of concrete nouns.

The textual modality was encoded in two different ways: through textual attributes and via word embeddings. Textual attributes were extracted by running Strudel (Baroni et al., 2010) on the WaCkypedia corpus (Baroni et al., 2009), and by retaining only the ten attributes with the highest log-likelihood ratio scores for each target word. The union of the selected attributes leads to 2,362-dimensional textual vectors. Word embeddings were obtained by training the skip-gram model (Mikolov et al., 2013) on the WaCkypedia corpus (Baroni et al., 2009), resulting in 500-dimensional embedding vectors. The attribute-based representations in both modalities were scaled to the [−1, 1] range.

Hyperparameter settings. We use a common embedding dimension d = p = 15 for both modalities. The SGE hyperparameters are set to (α_X, µ_X, N^X_C) = (0.1, 0.95, 25) and (α_Y, µ_Y, N^Y_C) = (0.3, 0.7, 5) for the first layer of our model. For the second layer of our model, we set (α_Z, µ_Z, N^Z_C) = (0.05, 0.7, 20) and (0.1, 0.7, 6) when using textual attributes or skip-gram as input, respectively. The number of SGE iterations is set to K_1 = 4 or 5 (first layer) and K_2 = 2 or 5 (second layer) when textual attributes or skip-gram representations are used for the textual modality, respectively. These hyperparameters have been optimized for each SGE individually using grid search; the cross-coupling parameters were also optimized separately and set to (β_X, β_Y) = (0.01, 0.1).
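For convenience, the settings listed above can be collected in a single configuration object; the sketch below only restates the reported values, with dictionary keys of our own choosing.

```python
# Hyperparameters from the text, gathered as a config dict (key names are ours).
HM_SGE_CONFIG = {
    "embedding_dim": {"d": 15, "p": 15},
    "layer1": {
        "X": {"alpha": 0.1, "mu": 0.95, "NC": 25},   # textual branch
        "Y": {"alpha": 0.3, "mu": 0.70, "NC": 5},    # visual branch
        "beta": {"X": 0.01, "Y": 0.1},               # cross-coupling weights
        "K1": {"tAttrib": 4, "skip-gram": 5},        # iterations per input type
    },
    "layer2": {
        "tAttrib":   {"alpha": 0.05, "mu": 0.7, "NC": 20, "K2": 2},
        "skip-gram": {"alpha": 0.10, "mu": 0.7, "NC": 6,  "K2": 5},
    },
}
```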

Performance measures. Similarly to previous work (Silberer et al., 2016), we evaluate our model on two different semantic tasks, namely word similarity rating and categorization. Specifically, we measure how well our model predictions of word similarity correlate with human semantic and visual similarity ratings using Spearman's correlation. Our similarity ratings are calculated as the cosine similarity of the learnt representation vectors. As human similarity ratings, we used those published in (Silberer et al., 2016). The semantic categories are induced by following a clustering-based approach, namely the Chinese Whispers algorithm (Biemann, 2006), and the quality of the produced clusters was evaluated using the F-score measure introduced in the SemEval 2007 task (Agirre and Soroa, 2007).
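As an illustration of the similarity-rating evaluation, the following sketch computes cosine similarities of the learnt vectors for a list of word pairs and correlates them with the human ratings via Spearman's ρ. How the word pairs and human scores are stored and loaded is assumed here; the evaluation data itself comes from (Silberer et al., 2016).

```python
# Sketch of the word-similarity evaluation protocol described above.
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(emb, pairs, human_ratings):
    """emb: dict word -> vector; pairs: list of (w1, w2) tuples;
    human_ratings: list of human similarity scores aligned with pairs."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
    model_scores = [cos(emb[w1], emb[w2]) for w1, w2 in pairs]
    rho, _ = spearmanr(model_scores, human_ratings)   # Spearman's correlation
    return rho
```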

Computation. The complexity of our approach is O(N²). Experiments were conducted on a 2018 Dell Precision T7920 workstation with 64 GB RAM and a single NVIDIA Titan XP GPU.

4.2 Comparative results

In Tab. 1, we compare our HM-SGE to the state of the art. Results are presented for two different sets of input features: attribute vectors for the visual modality (vAttrib) combined with either attribute vectors (tAttrib) or skip-gram encodings (skip-gram) for the textual modality (top and center part of Tab. 1, respectively). The bottom part of Tab. 1 presents comparisons with models trained on raw data (Lazaridou et al., 2015; Bruni et al., 2014) and with the pre-trained VL-BERT model (Su et al., 2020)2, for which we derived type-level representations.

2 https://github.com/jackroos/VL-BERT

Page 6: Learning grounded word meaning representations on

4765

Figure 3: Similarity matrices of (a) textual attributes (tAttrib), (b) visual attributes (vAttrib), (c) concatenated textual and visual attributes (tAttrib, vAttrib), and (d) HM-SGE representations, for the 10 categories food, animal, bird, tools, mammal, weapon, instrument, transportation, clothing, device (boundaries indicated by dashed lines). Image intensity values are individually clipped to cover the upper 3/8 of observed similarity values.

Models                          Semantic Similarity    Visual Similarity      Categorization
                                T     V     T+V        T     V     T+V        T     V     T+V
HM-SGE (tAttrib, vAttrib)       0.74  0.68  0.77       0.59  0.60  0.64       0.43  0.39  0.45
SAE (tAttrib, vAttrib)          0.67  0.61  0.72       0.55  0.60  0.65       0.36  0.35  0.43
GCN (tAttrib, vAttrib)          —     —     0.74       —     —     0.59       —     —     0.42
SVD (tAttrib, vAttrib)          —     —     0.70       —     —     0.59       —     —     0.39
CCA (tAttrib, vAttrib)          —     —     0.58       —     —     0.56       —     —     0.37
CONC (tAttrib, vAttrib)         0.63  0.62  0.71       0.49  0.57  0.60       0.35  0.37  0.33
CTL (tAttrib, vAttrib)          0.67  0.64  0.69       0.54  0.57  0.58       0.38  0.29  0.30
HM-SGE (skip-gram, vAttrib)     0.79  0.68  0.78       0.62  0.60  0.65       0.45  0.40  0.47
SAE (skip-gram, vAttrib)        0.74  0.61  0.77       0.59  0.60  0.66       0.44  0.35  0.48
GCN (skip-gram, vAttrib)        —     —     0.76       —     —     0.60       —     —     0.42
SVD (skip-gram, vAttrib)        —     —     0.75       —     —     0.63       —     —     0.43
CCA (skip-gram, vAttrib)        —     —     0.59       —     —     0.57       —     —     0.35
CONC (skip-gram, vAttrib)       0.71  0.62  0.75       0.56  0.57  0.62       0.37  0.37  0.45
CTL (skip-gram, vAttrib)        0.74  0.67  0.71       0.60  0.57  0.60       0.44  0.29  0.43
VL-BERT                         —     —     0.66       —     —     0.56       —     —     0.36
Lazaridou et al.                0.70  0.62  0.70       0.55  0.57  0.61       0.37  0.37  0.39
Bruni et al.                    —     —     0.50       —     —     0.44       —     —     0.34

Table 1: Comparative results in terms of Spearman's correlations between model predictions and human similarity ratings, and categorization, on two sets of input features (tAttrib, vAttrib) and (skip-gram, vAttrib). Here T, V, T+V denote textual, visual, and textual & visual. The bold scores are the best results per semantic task. Since the only stochastic part of our algorithm is the use of several random centroid seeds for k-means, it can effectively be considered deterministic and results vary extremely little.

Each column of Tab. 1 corresponds to a unimodal (textual (T) or visual (V)) or joint (T+V) representation.

As baseline methods trained on the same attribute-based input as in (Silberer et al., 2016), we considered SAE (Silberer et al., 2016), SVD (Bruni et al., 2014), CCA (Hill and Korhonen, 2014) and cross-transfer learning (CTL) (Both et al., 2017). The SVD and CCA models first concatenate normalized textual and visual vectors and then conduct SVD or CCA, respectively; CONC stands for the concatenation of normalized textual and visual vectors (Kiela and Bottou, 2014). CTL transfers information from the common space to the unimodal representations by using either a mask (estimated by correlation or multilinear regression) or a function generating artificial features (linear, or a neural net) estimated from multimodal examples. We implemented the mask approach and report the better of the results for correlation or multilinear regression.

Moreover, we provide a novel graph-based baseline that learns word embeddings via a two-layer Graph Convolutional Network (GCN) trained to classify words. The graph structure (edges) was created using the visual features, and the node embeddings were initialized with the textual features.
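Since only the high-level design of this GCN baseline is stated above (two layers, word-classification objective, visual-feature edges, textual node features), the sketch below is a generic two-layer GCN with symmetric adjacency normalization; the hidden size, normalization scheme and training details are standard choices rather than the exact baseline configuration.

```python
# Generic two-layer GCN sketch (Kipf & Welling-style normalization), used here
# only to illustrate the kind of baseline described in the text.
import torch
import torch.nn as nn

def normalize_adj(A):
    """Symmetrically normalize a dense adjacency matrix with self-loops."""
    A_hat = A + torch.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)
    return d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]

class TwoLayerGCN(nn.Module):
    def __init__(self, in_dim, hidden_dim, n_classes):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hidden_dim)
        self.w2 = nn.Linear(hidden_dim, n_classes)

    def forward(self, A_norm, X):
        H = torch.relu(A_norm @ self.w1(X))           # first graph convolution
        return A_norm @ self.w2(H)                    # class logits per word
```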

The unimodal representations of our HM-SGE (output by layer 1) always achieve state-of-the-art results, largely outperforming the SAE method (up to +7%, +4% on average). The unified representations of our HM-SGE (output by layer 2 following layer 1) achieve state-of-the-art results in most cases: for semantic similarity rating and categorization when using textual attributes, and for semantic similarity only when using the skip-gram model for the textual modality (up to +5%, +3% on average). In the other cases (visual similarity ratings; categorization for the skip-gram model) the unified representations also achieve results comparable to the best performing method (SAE), with at most a 1% difference. Overall, we improve the reported performance measures with respect to the SOTA by up to 7% across the 18 cases, in particular by 5% (semantic similarity), 1% (visual similarity) and 3% (categorization) on average.

Models                            Semantic Similarity    Visual Similarity      Categorization
                                  T     V     T+V        T     V     T+V        T     V     T+V
HM-SGE (tAttrib, vAttrib)         —     —     0.77       —     —     0.64       —     —     0.45
Layer2 only (tAttrib, vAttrib)    —     —     0.76       —     —     0.63       —     —     0.44
Layer1 only (tAttrib, vAttrib)    0.74  0.68  0.76       0.59  0.60  0.63       0.43  0.39  0.44
tAttrib + vAttrib                 0.63  0.62  0.71       0.49  0.57  0.60       0.35  0.37  0.33
HM-SGE (skip-gram, vAttrib)       —     —     0.78       —     —     0.65       —     —     0.47
Layer2 only (skip-gram, vAttrib)  —     —     0.77       —     —     0.64       —     —     0.46
Layer1 only (skip-gram, vAttrib)  0.79  0.68  0.78       0.62  0.60  0.64       0.45  0.40  0.46
skip-gram + vAttrib               0.71  0.62  0.75       0.56  0.57  0.62       0.37  0.37  0.45

Table 2: Ablation study. T, V, T+V denote textual, visual and textual & visual. The best results are in bold.

Word pairs 1–4         Word pairs 5–8           Word pairs 9–12
lettuce-spinach        cello-violin             cloak-robe
clarinet-trombone      leopard-tiger            cabbage-spinach
cabbage-lettuce        raspberry-strawberry     pants-shirt
airplane-helicopter    chapel-church            blouse-dress

Word clusters
catfish, cod, crab, eel, guppy, mackerel, minnow, octopus, perch, salmon, sardine, squid, trout, tuna
ambulance, bus, car, jeep, limousine, taxi, trailer, train, truck, van
bike, buggy, cart, dunebuggy, motorcycle, scooter, tractor, tricycle, unicycle, wagon
ant, beetle, butterfly, caterpillar, cockroach, grasshopper, hornet, housefly, moth, spider, wasp, worm
apartment, barn, brick, bridge, building, bungalow, cabin, cathedral, chapel, church, cottage, fence, gate, house, hut, inn, pier, shack, shed, skyscraper

Table 3: Top: Word pairs with the highest semantic and visual similarity according to the HM-SGE model. Pairs are ranked from highest to lowest similarity. Bottom: Examples of clusters produced by CW using the semantic representations obtained with our HM-SGE.

Our model also outperforms the embeddings obtained using the pretrained VL-BERT by a large margin. This is not surprising considering that recent studies (Mickus et al., 2021; Rogers et al., 2020) have raised concerns about the coherence of the BERT (text-based) embedding space. Further, as one would expect, tAttrib dominates vAttrib when modeling semantic similarity and categorization, while vAttrib dominates tAttrib only for visual similarity ratings. More importantly, the joint use of tAttrib and vAttrib improves all evaluation metrics, corroborating the fact that the model has learnt to leverage their redundancy and complementarity. Joint representations also improve performance when based on the skip-gram encoding, except for semantic similarity, which is strongly dominated by the skip-gram features. Examples of our model output are given in Tab. 3, showing the word pairs with the highest similarity ratings (top) and examples of word clusters (categories, bottom).

4.3 Model validation and illustration

Ablation study. To validate the proposed HM-SGE model, in Tab. 2 we report results obtained with HM-SGE (top rows) and with HM-SGE upon removal of one of its layers: removal of the two coupled SGEs (second rows, Layer2 only), removal of the final SGE (third rows, Layer1 only), and no HM-SGE at all (bottom rows); for rows 2 to 4, the joint representations (T+V) are obtained by concatenation of the individual visual and textual representations. It can be seen that Layer1 as well as Layer2 alone yield significant performance improvements when compared with the initial representations (T, V and T+V). This validates the independent capability of each individual component of our model to learn meaningful word representations. Yet, the best performance for the joint representation (T+V) is obtained only upon combination of the two layers, demonstrating the importance of both layers in our model.

Qualitative results. An illustration corresponding to rows 1 and 4 of Tab. 2 is provided in Fig. 3, which plots the affinity matrices for textual attributes (T), visual attributes (V), their concatenation (T+V) and HM-SGE for the 10 categories food, animal, bird, tools, mammal, weapon, instrument, transportation, clothing, device; category boundaries are indicated by dashed lines, and image intensity values are clipped to cover the upper 3/8 of observed similarity values for each representation individually. It is observed that T yields visually better results than V, and that the concatenation T+V inherits similarity from both attributes but is dominated by T. The large off-diagonal similarity values observed for T, V and T+V lead one to expect that categorization results based on these attributes are poor. The affinity obtained by HM-SGE is more structured, with large values essentially coinciding with within-category affinities (the 10 diagonal blocks) and a few off-diagonal blocks. Interestingly, these off-diagonal blocks correspond to the arguably meaningful two groups of categories animal, bird, mammal (rows/columns 2, 3, 5) and tools, weapon (rows/columns 4, 6).

Concept     Textual NN            Visual NN                                         HM-SGE NN
cabin       hut, tent             house, cottage, hut, bungalow                     hut, cottage, house, shack, bungalow
ox          cow                   bull, pony, cow, calf, camel, pig, sheep, lamb    bull, cow, pony, sheep
sardine     tuna                  trout                                             tuna, salmon, trout
bagpipe     accordion             clamp, accordion, tuba, faucet                    accordion, tuba
hamster     chipmunk, squirrel    rat, squirrel                                     squirrel, chipmunk, rat, groundhog
spoon       bowl                  ladle, whistle, hammer                            ladle

Table 4: Nearest neighbors (NN, words with similarity larger than 0.9 times that of the best pair) for textual, visual and HM-SGE vectors.

Finally, Tab. 4 exemplifies such results and provides a different view by showing the nearest neighbors (NN) for six words in terms of similarity computed on textual attributes, visual attributes, and our HM-SGE attributes; all neighbors with similarity values of at least 90% that of the closest neighbor are given. The examples illustrate that the unified word meaning representations learnt by HM-SGE lead to NN that are not a simple union or intersection of the visual and textual NN, but that HM-SGE is capable of removing pairs that make less sense (e.g. tent for cabin; calf, camel, pig for ox; clamp, faucet for bagpipe; bowl, whistle, hammer for spoon) and can identify and add new meaningful pairs (e.g. salmon for sardine; groundhog for hamster).

Inductive inference. It is possible to use our model to perform inductive inference when one of the two modalities for a concept, say modality A, is missing. To this end, it suffices to replace in Eq. (1) the corresponding row and column of the matrix G^A_{i−1} with those of G^B_{i−1} for the other modality B. For example, when only the visual component X for the concept bluejay is given, the textual representation Y learnt by HM-SGE outputs the nearest neighbors robin, stork, falcon, finch, which are all birds; to give another example, from the visual attribute for shelves HM-SGE predicts the textual nearest neighbors dresser, cabinet, cupboard, bureau, desk, closet. Analogously, HM-SGE predicts the visual nearest neighbors shelves, cabinet and dagger, knife, spear for cupboard and sword, respectively, when only the textual attribute is given for these two concepts.
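The modification of Eq. (1) used for inductive inference amounts to a simple row/column substitution in the graph of the missing modality, sketched below (function and variable names are ours).

```python
# Sketch of the inductive-inference modification: for concepts with a missing
# modality A, copy the corresponding rows and columns of G^B_{i-1} (observed
# modality) into G^A_{i-1} before running the embedding update of Eq. (1).
import numpy as np

def impute_missing_modality(G_A, G_B, missing_idx):
    G = G_A.copy()
    for k in missing_idx:
        G[k, :] = G_B[k, :]
        G[:, k] = G_B[:, k]
    return G
```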

5 Conclusion

This paper has proposed a novel approach, named HM-SGE, to learn grounded word meaning representations as low-dimensional node embeddings on a hierarchy of graphs. The first layer of the hierarchy encodes unimodal representations conditioned on the other modality, and the second layer integrates these enhanced unimodal representations into a single one. The proposed HM-SGE approach is compatible with theories of associative memory (Reijmers et al., 2007) and of domain-general learning (Rogers et al., 2004). Comparative results on word similarity simulation and word categorization show that our model outperforms baselines and related models trained on the same attribute-based input. Our evaluation reveals that HM-SGE is particularly good at learning enhanced unimodal representations, simulating how the response of our brain to semantic tasks involving a single modality is always triggered by the other modalities as well. Moreover, it succeeds in encoding these unimodal representations into a meaningful unified representation, compatible with the point of view of domain-general learning theory. The ablation study thoroughly validates the proposed hierarchical architecture. Besides the quantitative results, we give several insights on the learnt grounded semantic space through visualization of nearest neighbors, clusters, and most similar pairs. These additional results corroborate the quality of the learnt multimodal representations. Furthermore, the proposed approach is able to perform inductive inference for concepts for which only one modality is available.

References

Eneko Agirre and Aitor Soroa. 2007. SemEval-2007 task 02: Evaluating word sense induction and discrimination systems. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 7–12.

Andrew J. Anderson, Douwe Kiela, Stephen Clark, and Massimo Poesio. 2017. Visually grounded and textual semantic models differentially decode brain activity associated with concrete and abstract nouns. Transactions of the Association for Computational Linguistics, 5:17–30.

John R. Anderson and Gordon H. Bower. 2014. Human associative memory. Psychology Press.

Mark Andrews, Gabriella Vigliocco, and David Vinson. 2009. Integrating experiential and distributional data to learn semantic representations. Psychological Review, 116(3):463.

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.

Marco Baroni, Brian Murphy, Eduard Barbu, and Massimo Poesio. 2010. Strudel: A corpus-based semantic model based on properties and types. Cognitive Science, 34(2):222–254.

Chris Biemann. 2006. Chinese Whispers: an efficient graph clustering algorithm and its application to natural language processing problems. In Proceedings of the First Workshop on Graph Based Methods for Natural Language Processing, pages 73–80. Association for Computational Linguistics.

Fabian Both, Steffen Thoma, and Achim Rettinger. 2017. Cross-modal knowledge transfer: Improving the word embedding of apple by looking at oranges. In Proceedings of the Knowledge Capture Conference, pages 1–8.

Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49:1–47.

Guillem Collell, Ted Zhang, and Marie-Francine Moens. 2017. Imagined visual representations as multimodal embeddings. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Mariella Dimiccoli and Herwig Wendt. 2020. Learning event representations for the temporal segmentation of image sequences by dynamic graph embedding. IEEE Transactions on Image Processing.

Yansong Feng and Mirella Lapata. 2010. Visual information in semantic representation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 91–99.

Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. 2013. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pages 2121–2129.

Zellig S. Harris. 2013. Papers in structural and transformational linguistics. Springer.

Felix Hill and Anna Korhonen. 2014. Learning abstract concept embeddings from multi-modal data: Since you probably can't see what I mean. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 255–265.

Douwe Kiela and Léon Bottou. 2014. Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 36–45.

Alexander Koller, Stephan Oepen, and Weiwei Sun. 2019. Graph-based meaning representations: Design and processing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pages 6–11.

Angeliki Lazaridou, Elia Bruni, and Marco Baroni. 2014. Is this a wampimuk? Cross-modal mapping between distributional semantics and the visual world. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1403–1414.

Angeliki Lazaridou, Nghia The Pham, and Marco Baroni. 2015. Combining language and vision with a multimodal skip-gram model. In Proceedings of the 2015 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W. Black, and Isabel Trancoso. 2015. Finding function in form: Compositional character models for open vocabulary word representation.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In 33rd Conference on Neural Information Processing Systems (NeurIPS).

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In 31st Conference on Neural Information Processing Systems (NeurIPS).

Ken McRae, George S. Cree, Mark S. Seidenberg, and Chris McNorgan. 2005. Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods, 37(4):547–559.

Timothee Mickus, Denis Paperno, Mathieu Constant, and Kees van Deemter. 2021. What do you mean, BERT? Assessing BERT as a distributional semantics model. Proceedings of the Society for Computation in Linguistics, Vol. 3, Article 34.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26:3111–3119.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Leon G. Reijmers, Brian L. Perkins, Naoki Matsuo, and Mark Mayford. 2007. Localization of a stable neural correlate of associative memory. Science, 317(5842):1230–1233.

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866.

Timothy T. Rogers, Matthew A. Lambon Ralph, Peter Garrard, Sasha Bozeat, James L. McClelland, John R. Hodges, and Karalyn Patterson. 2004. Structure and deterioration of semantic memory: a neuropsychological and computational investigation. Psychological Review, 111(1):205.

Stephen Roller and Sabine Schulte Im Walde. 2013. A multimodal LDA model integrating textual, cognitive and visual modalities. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1146–1157.

Max Ryabinin, Sergei Popov, Liudmila Prokhorenkova, and Elena Voita. 2020. Embedding words in non-vector space with unsupervised graph learning. arXiv preprint arXiv:2010.02598.

Carina Silberer, Vittorio Ferrari, and Mirella Lapata. 2016. Visually grounded meaning representations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11):2284–2297.

Carina Silberer and Mirella Lapata. 2012. Grounded models of semantic representation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1423–1433.

Richard Socher, Milind Ganjoo, Christopher D. Manning, and Andrew Ng. 2013. Zero-shot learning through cross-modal transfer. Advances in Neural Information Processing Systems, 26:935–943.

Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, and Andrew Y. Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2:207–218.

Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2020. VL-BERT: Pre-training of generic visual-linguistic representations. In International Conference on Learning Representations (ICLR).

Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019. VideoBERT: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7464–7473.

Thy Thy Tran, Makoto Miwa, and Sophia Ananiadou. 2020. Syntactically-informed word representations from graph neural network. Neurocomputing, 413:431–443.

Shikhar Vashishth, Manik Bhandari, Prateek Yadav, Piyush Rai, Chiranjib Bhattacharyya, and Partha Talukdar. 2019. Incorporating syntactic and semantic information in word embeddings using graph convolutional networks.

Shaonan Wang, Jiajun Zhang, and Chengqing Zong. 2018. Learning multimodal word representation via dynamic fusion methods. arXiv preprint arXiv:1801.00532.

Eloi Zablocki, Benjamin Piwowarski, Laure Soulier, and Patrick Gallinari. 2018. Learning multi-modal word representation grounded in visual context. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.