
Proceedings of the 23rd Conference on Computational Natural Language Learning, pages 430–440, Hong Kong, China, November 3–4, 2019. ©2019 Association for Computational Linguistics


Leveraging Past References for Robust Language Grounding

Subhro Roy∗, Michael Noseworthy∗, Rohan Paul, Daehyung Park, Nicholas Roy
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
{subhro, mnosew, rohanp, daehyung, nickroy}@csail.mit.edu

Abstract

Grounding referring expressions to objects in an environment has traditionally been considered a one-off, ahistorical task. However, in realistic applications of grounding, multiple users will repeatedly refer to the same set of objects. As a result, past referring expressions for objects can provide strong signals for grounding subsequent referring expressions. We therefore reframe the grounding problem from the perspective of coreference detection and propose a neural network that detects when two expressions are referring to the same object. The network combines information from vision and past referring expressions to resolve which object is being referred to. Our experiments show that detecting referring expression coreference is an effective way to ground objects described by subtle visual properties, which standard visual grounding models have difficulty capturing. We also show that the ability to detect object coreference allows the grounding model to perform well even when it encounters object categories not seen in the training data.

1 Introduction

Grounding referring expressions to objects in anenvironment is a key Artificial Intelligence chal-lenge spanning computer vision and natural lan-guage processing. Past work in referring expres-sion grounding has focused on understanding theways a human might resolve ambiguity that ariseswhen multiple similar objects are in a scene –for example, by referring to visual properties orspatial relations (Nagaraja et al., 2016; Hu et al.,2017). However, most previous work treats it asan ahistorical task: a user is presented with an im-age and a referring expression and only featuresfrom the current referring expression are used to

∗Equal contribution.

Utterances from a User:
(1) The open text book.
(2) Page with a dark border and images of two people at the bottom.
(3) The open book with typed pages.
(4) Pages of handwritten notes under the open book.

Figure 1: An example where a single user repeatedly refers to objects in the same scene. The color identifies which bounding box is being referred to.

However, in many scenarios where a grounding system can be deployed, grounding is not an isolated one-off task. Instead, users will repeatedly refer to the same set of objects. In a household environment, a robot equipped with a grounding system will repeatedly be asked to retrieve objects by the people who live there. Similarly, during a cooperative assembly task, a robot will repeatedly be asked for various parts or tools. Even in non-embodied systems, such as conversational agents, a user will repeatedly refer to different relevant entities.

These scenarios present new challenges and opportunities for grounding referential expressions. When people in the same environment commonly interact with each other (as in a household or workplace), lexical entrainment will likely occur (Brennan and Clark, 1996). That is, users of the system will come to refer to objects in similar ways to how they have been referred to in the past.


Thus, it is important that the grounding system can adapt to the vocabulary used where it is deployed.

Even if each object is being repeatedly referred to by a set of people who do not interact with each other, the agent can still learn general information about the objects by remembering how they have been referred to in the past. This is useful as it provides a way for the model to learn properties of objects that are otherwise hard to detect, such as subtle visual properties which cannot easily be captured by standard visual grounding systems.

To include past referring expressions in a grounding model, we formulate grounding as a type of coreference resolution – are a new phrase and a past phrase (known to identify a specific object) referring to the same object? To compare a new referring expression with past referring expressions, we need to learn a type of compatibility metric that tells us whether two expressions can both describe the same object. Detecting object coreference involves distinguishing between mutually exclusive object attributes, recognizing taxonomic relations, and permitting unrelated but possibly co-existing properties.

Our contribution is to demonstrate that grounding accuracy can be improved by incorporating a module that has been trained to perform coreference resolution to previous referring expressions. We introduce a neural network module that learns to identify when two referring expressions describe the same object. By jointly training the system with a visual grounding module, we show how grounding can be improved using information from both linguistic and visual modalities.

We evaluate our model on a dataset where users repeatedly refer to objects in the same scene (see Figure 1). Given the same amount of training data, our coreference grounding model achieves an overall increase of 15% grounding accuracy when compared to a state-of-the-art visual grounding model (Hu et al., 2017). We show that the coreference grounding model can better generalize to object categories and their descriptions not seen during training – a common difficulty of visual grounding models. Finally, we show that jointly training the coreference model with a visual grounding model allows the joint model to use object properties not stated in previous referring expressions. As an example application, we demonstrate how the coreference grounding paradigm can be used with a robotic platform.1

2 Technical Approach

The task of grounding referring expressions is to identify which object is being described by a query referring expression, Q. The input to this problem is a set of N objects, O = {o_1, o_2, ..., o_N}.2 Each object, o_i, is represented by its visual features, v_i (i.e., pixels from the object's bounding box), and a referring expression, r_i, that was previously used to refer to that object (r_i may not always be available). The grounding problem can then be modelled as estimating the distribution over which object is being referred to: p(x | Q, v_1:N, r_1:N), where x is a random variable with domain O.
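As a minimal illustration of this setup (our own sketch, not the authors' code), each candidate object can be represented by its visual features and an optional past referring expression, and grounding reduces to taking the argmax of the model's distribution over candidates; the field and function names below are ours:

```python
from dataclasses import dataclass
from typing import Optional, Sequence

import numpy as np


@dataclass
class CandidateObject:
    """One candidate o_i: visual features v_i and an optional past expression r_i.
    Field names are illustrative, not taken from the authors' code."""
    visual_features: np.ndarray             # v_i, e.g. CNN features of the bounding box
    past_expression: Optional[str] = None   # r_i, if the object was referred to before


def pick_referent(probabilities: Sequence[float]) -> int:
    """Select the referent as argmax_i p(x = o_i | Q, v_1:N, r_1:N)."""
    return int(np.argmax(probabilities))
```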

There has been a lot of work on grounding referring expressions (Hu et al., 2017), much of which uses only visual features and no interaction history. Most of the proposed models have the form:

p(x = o_i | Q, v_1:N) = S(W_vis · f_vis(Q, v_1:N))_i
                      = exp(W_vis · f_vis(Q, v_i)) / Σ_{j=1}^{N} exp(W_vis · f_vis(Q, v_j))

where f_vis(Q, v_i) is a low-dimensional representation of the visual features v_i and the query Q, W_vis is a learned linear transformation, and S(·)_i is the i-th entry of the softmax function output.
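A minimal PyTorch sketch of this scoring head (our own illustration; the fused representations f_vis(Q, v_i) would come from a visual grounding network, and dimensions are assumptions):

```python
import torch
import torch.nn as nn


class SoftmaxGrounder(nn.Module):
    """Learned linear map W_vis turns each fused (query, object) representation
    into a scalar score; a softmax over the N candidates gives p(x = o_i | Q, v_1:N)."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.w_vis = nn.Linear(feat_dim, 1, bias=False)  # W_vis

    def forward(self, f_vis: torch.Tensor) -> torch.Tensor:
        # f_vis: (N, feat_dim), one row per candidate, i.e. f_vis(Q, v_i).
        scores = self.w_vis(f_vis).squeeze(-1)           # (N,)
        return torch.softmax(scores, dim=-1)             # categorical over objects
```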

We introduce a similar model for coreference grounding, which uses past referring expressions to decide which object o_i is being referred to:

p(x = o_i | Q, r_1:N) = S(W_coref · f_coref(Q, r_1:N))_i

where f_coref(Q, r_i) is an embedding of a past referring expression and the query expression, and W_coref is a learned linear transformation.

Finally, we introduce a joint model which fuses the representations f_vis and f_coref with a function g:

p(x = o_i | Q, v_1:N, r_1:N) = S(g(f_coref(Q, r_1:N), f_vis(Q, v_1:N)))_i    (1)

Note that f_vis can come from any visual grounding model that associates text to visual features extracted from the objects' bounding boxes.

1 The dataset, code, and demonstration videos can be found at https://mike-n-7.github.io/coreference-grounding.html.

2 Object proposal networks (e.g., Faster R-CNN) can be used to extract object bounding boxes from an image.


Figure 2: (a) A visual grounding model only takes in images to resolve the query expression and outputs a categorical distribution over the objects. (b) We propose a model that uses past referring expressions to resolve the new expression. (c) These models can be combined to fuse visual and linguistic information.

Our contributions are the coreference and joint grounding models. Both models learn to ground a new referring expression by computing compatibility with past referring expressions of candidate objects. We describe f_coref in detail in Section 2.1 and the joint model in Section 2.2. For a visualization of the model, see Figure 2.

2.1 Coreference Grounding Model

Given a query referring expression Q and an expression r_i for each object (each represented by a sequence of M words, {w_1, w_2, ..., w_M}), we define a joint representation of these two expressions, f_coref, as follows:

f_coref(Q, r_i) = f_enc(Q) ⊙ f_enc(r_i)

where f_enc produces a referring expression embedding of dimension l × 1 for the given phrase, and ⊙ is the elementwise multiplication operator. Note that the same encoder is used to embed both r_i and Q. In Section 4.3, we evaluate the various referring expression embedding methods described below.

LSTM Embeddings: The final output state of an LSTM (Hochreiter and Schmidhuber, 1997) is used as the referring expression embedding.

BiLSTM Embeddings: For the bidirectional LSTM, forward and backward LSTMs are run over the input sequence, and their outputs for each word are concatenated. An expression embedding is computed by taking the dimension-wise max across words (Collobert and Weston, 2008).

Attention Embeddings: Attention encoders (Lin et al., 2017) output attention scores that are used to compute a weighted average of the BiLSTM outputs as the referring expression representation.

InferSent Embeddings: Recently, InferSent (Conneau et al., 2017) was proposed as a general-purpose sentence embedding method. InferSent is similar to the BiLSTM model but was trained on the Natural Language Inference task, with the intuition that this task would require the sentence embeddings to contain semantically meaningful information. The authors showed that their sentence representation generalized well to other tasks. The pretrained encoder from the InferSent model is used to embed a referring expression.
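As an illustration of one of these encoders, a BiLSTM-max encoder and the elementwise combination of Section 2.1 could look like the following sketch (our own simplification; vocabulary handling and hyperparameters are assumptions):

```python
import torch
import torch.nn as nn


class BiLSTMMaxEncoder(nn.Module):
    """Sketch of a BiLSTM-max referring expression encoder f_enc."""

    def __init__(self, vocab_size: int, emb_dim: int = 300, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, M) word indices for a referring expression of M words.
        outputs, _ = self.bilstm(self.embed(tokens))   # (B, M, 2 * hidden_dim)
        return outputs.max(dim=1).values               # dimension-wise max over words


def f_coref(encoder: nn.Module, query: torch.Tensor, past: torch.Tensor) -> torch.Tensor:
    """f_coref(Q, r_i) = f_enc(Q) ⊙ f_enc(r_i), with the same encoder for both."""
    return encoder(query) * encoder(past)
```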

2.2 Integrating Vision and Coreference

The coreference model can learn the complexities of coreference with past expressions, but if certain properties are not mentioned in a previous expression, the model will have difficulty deciding which object is being referred to. If the referenced property could have been learned by a visual grounding model, we would like to include this representation in our model. In this section, we show how we can take an existing visual grounding model and combine it with a coreference grounding model. We generate representations f_vis from an existing grounding model and f_coref from our coreference grounding model. These representations are fused using a function g(·), which is used to compute the most likely referred object using Equation 1. We experiment with two choices of g, which we describe in the following subsections.


2.2.1 Addition

One approach is to take the sum of the scores of the component visual and coreference grounding models. The function g(·) can be written as (arguments of functions removed for brevity):

g = W_vis · f_vis(·) + W_coref · f_coref(·)

where W_vis, W_coref ∈ R^{1×l} are learned parameters which transform their respective feature vectors of size l into scalar scores.

2.2.2 Concatenation

Simple addition might not capture all interdependencies between the different modalities. We propose an approach where we concatenate the representations f_vis and f_coref and add a two-layer feed-forward network to output the final score:

g = W_c1 · ReLU(W_c2 · ReLU([f_vis; f_coref]))

For our visual grounding model, we use the Compositional Modular Network (Hu et al., 2017), which is one of the top-performing models on several benchmark referring expression datasets. We train the joint model end-to-end so that it can learn how to properly merge the visual and coreference information.
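The two fusion choices can be sketched as follows (our own reading of the equations above; hidden sizes are assumptions). The per-object scores they return are then passed through the softmax of Equation 1:

```python
import torch
import torch.nn as nn


class AdditiveFusion(nn.Module):
    """g as in Section 2.2.1: a sum of per-modality scalar scores."""

    def __init__(self, vis_dim: int, coref_dim: int):
        super().__init__()
        self.w_vis = nn.Linear(vis_dim, 1, bias=False)      # W_vis
        self.w_coref = nn.Linear(coref_dim, 1, bias=False)  # W_coref

    def forward(self, f_vis: torch.Tensor, f_coref: torch.Tensor) -> torch.Tensor:
        # Both inputs: (N, dim); output: (N,) unnormalized scores for Equation 1.
        return (self.w_vis(f_vis) + self.w_coref(f_coref)).squeeze(-1)


class ConcatFusion(nn.Module):
    """g as in Section 2.2.2: a two-layer feed-forward net over [f_vis; f_coref]."""

    def __init__(self, vis_dim: int, coref_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.ReLU(),                                   # ReLU([f_vis; f_coref])
            nn.Linear(vis_dim + coref_dim, hidden_dim),  # W_c2
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),                    # W_c1
        )

    def forward(self, f_vis: torch.Tensor, f_coref: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([f_vis, f_coref], dim=-1)).squeeze(-1)
```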

3 Dataset

Our goal is to ground a referring expression to an object in the environment using past referring expressions and visual features. Existing referring expression datasets do not contain at least two referring expressions for each object. Since this is a requirement for learning and evaluating models that can utilize past expressions, we create two new datasets for the task – a large Diagnostic dataset where past expressions are algorithmically assigned, and a smaller Episodic dataset created to capture more realistic interaction scenarios.

3.1 Diagnostic Dataset

In this dataset, artificial scenes are created by grouping similar objects, and each referring expression for an object is collected independently. This form of dataset allows us to easily scale up the amount of data used for training and capture more descriptive language by introducing category-level ambiguity into the scene. We use images from the MSCOCO dataset (Lin et al., 2014), which contains bounding boxes for each object in an image. We randomly group together four object images from the same category. We randomly label one object as the goal object and the remaining three as distractor objects. Annotators from the Figure Eight platform3 are shown these images with the goal object labeled by a red bounding box. They are asked to write an English expression to refer to the goal object so that it can be easily distinguished. Two expressions are collected for each object, each time with different distractor objects.

To ensure the model can distinguish between objects of different categories, we randomly select half the data and replace two distractor objects in the group with two objects from a different category. Since these objects are from a different category, we expect the referring expression to still be able to correctly identify the goal object.

Each instance, or dataset sample, now consists of four objects (represented by images) with a query expression referring to the goal object. To associate each object with a past expression, we use expressions that were used to reference that object in other instances. Each instance now has a set of objects, each associated with a past expression and an image. In addition, the goal object is labeled and a query referring expression is provided for the goal. We randomly split the data into train, development, and test sets in a 60/20/20 ratio. We refer to this split as STANDARD.

To evaluate the ability of grounding models to disambiguate objects from categories not seen during training, we create an alternative split of the Diagnostic dataset. This split ensures that no object category in the test set is present in the training or development splits. We refer to this split as HARD. More details of dataset construction and verification are present in the supplementary material.

3.2 Episodic Dataset

The Diagnostic dataset uses cropped images of objects from MSCOCO images and programmatically assigned past expressions. In order to capture the nuances of realistic interaction, we collect a smaller Episodic dataset where annotators are shown a full scene and repeatedly asked to refer to objects within that scene. We select scenes from the MSCOCO dataset which have three to six objects of the same category.

3 https://www.figure-eight.com/


We prune object bounding boxes with area less than 5% or over 50% of the image area. Extremely small objects are often not distinguishable, and large bounding boxes often correspond to a cluster of objects rather than a single object in cluttered scenes.

Each annotator is shown the same scene 10 times in a row. Each time, one of the ambiguous objects is marked with a red bounding box. The annotator is asked to provide an English referring expression to uniquely identify the marked object. Our interface does not allow the user to view a referring expression once it has been entered. As a result, if the same object is marked again in the series, the user has no way to look up what they had written before. This process simulates how a user will refer to objects in the same environment over a period of time. As the annotators do not communicate with each other, the dataset will not capture between-user entrainment. For each image, we collect two such series of 10 expressions from two different users. We create examples for the Episodic dataset in two ways:

SAME USER: For each scene and annotator, we order the expressions in the same order they were provided by the annotator. Each expression forms an example where it is the respective query expression and the candidate objects are all the objects in the scene. All previous referring expressions for this scene and annotator are assigned to the respective objects as past referring expressions. These examples capture the scenario where the interaction history is provided by a single user.

ACROSS USERS: Similar to the SAME USER setting, we create a new example each time the annotator refers to an object. However, the past expressions come from the other annotator who was shown the same scene. These examples represent cases where interaction history is acquired from people who do not interact with each other.

We create train and development sets with both types of examples. For testing, we create one set corresponding to each of the previously mentioned types. Validation details are in the supplementary material.

The statistics for both datasets are given in Table 1. We report a metric called Lexical Overlap to denote the extent of similarity between the training and test data. The Lexical Overlap is the fraction of word types in the test set that also appear in the training set. As seen in Table 1, the HARD split has a lower Lexical Overlap compared to STANDARD.

                         Diagnostic             Episodic
                         STANDARD   HARD        SAME USER   ACROSS USERS
# Train                  10133      9855        2656        2656
# Dev                    3889       4170        714         714
# Test                   3796       3793        1686        1686
Objects per Example      4.0        4.0         5.64        5.64
Expressions per Object   0.47       0.47        0.5         1.94
Lexical Overlap          0.71       0.63        0.47        0.47

Table 1: Statistics for the various datasets used. Unless otherwise mentioned, the numbers are reported from the test set. Low lexical overlap for HARD indicates more unseen words in the test set. The ACROSS USERS test set of the Episodic dataset has more past expressions for each object compared to the other splits.

This means that a greater number of novel words appear in the HARD dataset.
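A sketch of how this metric can be computed (our own illustration; simple whitespace tokenization is an assumption):

```python
def lexical_overlap(train_expressions, test_expressions):
    """Fraction of word types in the test set that also appear in the training set."""
    train_vocab = {w for expr in train_expressions for w in expr.lower().split()}
    test_vocab = {w for expr in test_expressions for w in expr.lower().split()}
    return len(test_vocab & train_vocab) / len(test_vocab) if test_vocab else 0.0
```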

4 Experiments

We run multiple experiments to evaluate the proposed grounding models and characterize the advantages of using coreference along with visual features for grounding. Specifically, we consider the following questions:

1. Which method performs best for coreference grounding and the joint model? (Section 4.3)

2. How does grounding with different types of information (visual, coreference) transfer to new object categories? (Section 4.4)

3. How does grounding performance vary when past expressions are acquired from the same (or entrained) users, as opposed to users unknown to each other? (Section 4.5)

All quantitative evaluation can be found in Table 2. As our datasets contain examples where the object being referred to may or may not have a previous referring expression, we report overall test set accuracy (All), as well as accuracy grouped by the number of past referring expressions belonging to the ground-truth object (0, 1, or 2). This allows us to more accurately evaluate the coreference models, as they are not expected to perform well without a previous referring expression.

We show how coreference grounding is used in practice by demonstrating its use on the Baxter robot. Namely, we show how a system can keep track of past referring expressions and associate them with objects.

4.1 Model Descriptions

We compare against the following unsupervised coreference baselines (i.e., not trained on the referring expression task).


All of these methods use a similarity score between the input expression and the past referring expression to make a prediction.

Word Overlap Baseline: Compute the Jaccard similarity between the two expressions.

Word Averaging Baseline: Each expression is represented by the average of the word vectors of its constituent words. Similarity is computed as the cosine similarity between the vectors.

Paragram Phrase: Uses the paraphrase model proposed by Wieting et al. (2016) to compute similarity between the past and input expressions.

InferSent Unsupervised: Each expression is represented by its pretrained InferSent embedding, and similarity is computed as cosine similarity.
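The first two baselines can be sketched as follows (our own illustrative implementations; tokenization and the vector lookup interface are assumptions):

```python
import numpy as np


def word_overlap_similarity(expr_a: str, expr_b: str) -> float:
    """Word Overlap baseline: Jaccard similarity between the two word sets."""
    a, b = set(expr_a.lower().split()), set(expr_b.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0


def word_averaging_similarity(expr_a: str, expr_b: str, word_vectors: dict) -> float:
    """Word Averaging baseline: cosine similarity of averaged word vectors.
    `word_vectors` is assumed to map a word to a NumPy vector (e.g., GloVe)."""
    def embed(expr):
        vecs = [word_vectors[w] for w in expr.lower().split() if w in word_vectors]
        return np.mean(vecs, axis=0) if vecs else None

    va, vb = embed(expr_a), embed(expr_b)
    if va is None or vb is None:
        return 0.0
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-12))
```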

We also consider supervised models for the referring expression grounding task:

Vision: The Compositional Modular Network (Hu et al., 2017), which grounds an input expression to a bounding box's visual features.

Coreference: The proposed model that grounds an expression to objects represented by previous referring expressions. We evaluate the LSTM, BiLSTM-Max, Attention, and InferSent encoders.

Joint: Jointly trains the Vision and Coreference components of the model. This model uses only the InferSent encoder. We evaluate both the addition and concatenation methods for information fusion.

4.2 Implementation Details

Since the Episodic dataset is much smaller in size, all models (except the joint model with concatenation) are first trained on the Diagnostic training set until the validation error stops decreasing, and then trained on the Episodic training set. For the joint model with concatenation, we found that tuning only the fusion parameters on the Episodic data, while holding the others fixed, performs better.

The large Diagnostic dataset contains at most one past expression for each object. We cannot expect our models trained on the Diagnostic data to handle multiple past referring expressions. As a result, when an object is associated with multiple past expressions, we only consider the expression most similar to the query expression according to the unsupervised InferSent model.
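A sketch of this selection step, assuming precomputed sentence embeddings (e.g., unsupervised InferSent vectors):

```python
import numpy as np


def most_similar_past_expression(query_emb: np.ndarray, past_embs: list) -> int:
    """Return the index of the past expression whose embedding is most
    cosine-similar to the query embedding. Embeddings are 1-D NumPy arrays."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    return int(np.argmax([cosine(query_emb, p) for p in past_embs]))
```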

The models are implemented in PyTorch (Paszke et al., 2017). All models for the Diagnostic dataset are trained with the Adam optimizer (Kingma and Ba, 2015), and the best model is selected based on performance on the development set. We use GloVe vectors (Pennington et al., 2014) and a pretrained VGG19 model to extract visual features (Simonyan and Zisserman, 2015).

4.3 Coreference and Joint Model Evaluation

We evaluate the performance of the coreference and joint models on the STANDARD split of the Diagnostic dataset (see the STANDARD column of Table 2). First, we observe that all unsupervised coreference methods perform poorly when the goal object does not have past referring expressions. However, when past expressions are present, all the coreference baselines outperform the vision model. Coreference using pretrained InferSent embeddings performs the best among the unsupervised methods, possibly because InferSent embeddings have been pretrained on the SNLI dataset. SNLI was created from image captions, making the domain similar to that of referring expressions.

Learned models of coreference start performing better on cases where the goal object has no past referring expression. Note that in the 0-column, the 0 only refers to the ground-truth object not having a referring expression – the other objects may have expressions associated with them. Thus the supervised models can better determine when two expressions are incompatible, leading the model to choose an object without any previous referring expression. However, when a past referring expression is present, it is typically informative, and the Word Overlap model performs the best amongst unsupervised and supervised methods (see the 1-column). InferSent (Unsupervised) still achieves strong performance, yet fine-tuning to the task still helps. We use the coreference model with the InferSent encoder for the joint models.

The jointly trained models achieve high accuracies both in cases where the ground-truth object has previously been referred to (1-column) and in those where it has not (0-column). This indicates that the models can successfully utilize information from both the vision and coreference modalities. The concatenation fusion method outperforms the simple shallow addition of component scores.4

The joint model consistently outperforms the vision model. This is because people usually provide information relevant to how the object can be referred to.

4 Note that different objects in a scene might have a different number of past expressions. At test time, we do not know the goal object, and hence we cannot use the number of past expressions to determine which model to use.


                     Diagnostic                                Episodic (s. 4.5)
                     STANDARD (s. 4.3)    HARD (s. 4.4)        SAME USER                 ACROSS USERS
                     0     1     All      0     1     All      0     1     2     All     All
% of test examples   57%   43%            56%   44%            49%   33%   15%

Vision               64.2  66.1  65.0     59.6  60.6  60.0     32.1  32.1  29.4  31.5    31.5

Unsupervised
  Word Overlap       3.5   74.2  34.1     4.0   72.6  34.3     5.3   85.4  87.3  37.6    58.1
  Word Averaging     3.5   69.6  32.1     4.0   67.8  32.2     5.3   77.8  80.2  42.7    51.8
  Paragram Phrase    3.5   71.3  32.8     4.0   71.3  33.7     5.3   81.1  83.7  44.3    56.9
  InferSent          8.8   73.8  36.9     9.2   73.6  37.6     5.9   85.0  85.7  46.3    47.7

Supervised
  LSTM               28.2  41.1  33.8     28.2  40.3  33.5     6.0   69.4  63.9  37.2    46.8
  BiLSTM-Max         30.0  60.8  43.3     27.0  57.8  40.6     7.0   75.7  74.6  41.8    53.1
  Attention          29.2  61.6  43.2     26.8  56.0  39.7     10.1  67.0  63.9  38.4    47.4
  InferSent          27.9  70.5  46.3     28.8  68.6  46.4     5.8   82.5  85.3  45.4    61.3

Joint
  Addition           65.3  69.4  67.1     62.0  62.3  62.1     22.3  63.8  60.3  42.6    48.7
  Concatenation      68.6  71.3  69.8     58.0  70.7  63.6     19.0  84.7  86.1  52.7    62.7

Table 2: Grounding accuracy under different conditions. Best scores in bold. The columns labeled 0, 1, and 2 correspond to test examples where the goal object has 0, 1, and 2 past expressions respectively (the percentage indicates the fraction of test examples applicable to that column). All refers to the score on the entire test set. Most examples in ACROSS USERS have a large number of past expressions, so we only report the score on the entire test set.

If this information is available, which is true in many realistic scenarios, it is beneficial to utilize coreference grounding.

With more data, the vision model could achieve higher performance. However, we argue that as the ambiguity between objects increases, the properties that distinguish objects become more subtle, and a large dataset would be necessary to learn these intricacies.

When does Joint perform better than Vision and Coreference? Our Joint model performs better than the models trained on a single modality (Vision, Coreference). The joint model can use visual features when the past referring expressions are not sufficient to discriminate between objects. In 25% of examples, the Coreference model predicts the wrong object whereas the Joint model selects the correct object. A majority of these cases are examples where the goal object has no past expression associated with it. In 8% of examples, the Joint and Coreference models are correct even though the Vision model is wrong. Finally, for only 7% of the examples, the Joint model predicts the correct object when both the Vision and Coreference models predict incorrectly. This indicates that the Joint model is primarily merging the gains of the Vision and Coreference models; most correct decisions of Joint correspond to correct decisions either from Vision or Coreference. Figure 3 shows a breakdown of how often the various models outperform each other. Figure 4 shows examples where Joint outperforms the models trained on a single modality.

Figure 3: Proportion of examples of the STANDARD test set where different subsets of the systems ground correctly (green) or incorrectly (red). Instances are arranged along the x-axis.

4.4 Generalizing to New Object Categories

To evaluate how the various models handle generalization to unseen object categories, we use the HARD split of the Diagnostic dataset. The test set of this split contains object categories which have not explicitly been seen during training. We hypothesize that due to pretrained word embeddings (which include embeddings for words describing the unknown categories), the coreference models will be able to successfully ground to new object categories. On the other hand, the performance of the Vision model will decrease on the HARD split, because the pretrained visual features of the new objects are not well aligned with representations of unseen words. As seen in Table 2, this is indeed the case: the Vision model's performance drops between the STANDARD and HARD datasets.


On the other hand, the coreference models' performance on the HARD split is comparable to that on the STANDARD split. Although the aggregate performance is low, the coreference models perform strongly on examples where the goal object has past expressions (see the 1-column). In particular, Coref Supervised with InferSent achieves 70.5% on the STANDARD split, which reduces only to 68.6% on the HARD split. This can be explained by observing that the representation of the object (its past referring expression) and a new referring expression are already aligned within the same vector space (pretrained InferSent embeddings).

4.5 Performance on Episodic Dataset

As the Diagnostic dataset is somewhat artificial in the way objects and properties are grouped, we also evaluate these models on the Episodic dataset. As described in Section 3.2, the key differences in this dataset are that all candidate objects in an example are from the same scene, and previous referring expressions are added sequentially as the user refers to more objects in the scene (thus a previous referring expression truly did occur previously).

We find similar trends in the performance on the Episodic dataset (see the Episodic columns of Table 2). Since the same user is referring to the same object multiple times, we expect expressions for the same object to be similar. This leads to the high performance of the coreference models on the SAME USER split of the Episodic dataset (see the 1-column). Even the word-overlap model does particularly well due to the correlation between expressions from the same user (over 85% accuracy when the goal object has past expressions).

If the past expressions were not provided by the same user but by users unknown to each other, we can still see an improvement over not having any past expressions (see the ACROSS USERS column). In this split, past expressions are always associated with objects, as we assume other users have interacted with the system. In the ACROSS USERS split, even though the Coreference models are less effective than in the SAME USER split, they still outperform the Vision model (e.g., the Supervised InferSent model achieves 61.3%, whereas a pure vision model achieves only 31.5%). This is because users unknown to each other still provide useful knowledge about the properties of objects.

The Joint models benefit from using both modalities, which is particularly important in realistic scenarios as most objects initially will not have any previous referring expressions.

Figure 4: Examples where the Joint model accurately grounds the query, but either Vision or Coreference does not. The past expression is mentioned below each object image. The correct object has a green border and a model's name appears above its predicted object. In the first example (query: "the stuffed bear with blue clothes"), Coref and Joint select the object previously described as "the brown teddy bear with a bow", while Vision selects the one described as "a stuffed animal in blue". In the second example (query: "its a cup of coffee"), Coref selects "a coca cola cup", while Vision and Joint select "the glass tumbler with a drink".

Note that the performance of the Vision model on the Episodic dataset is much lower than its performance on the Diagnostic dataset. This loss in performance is due to the images in the Episodic dataset being more cluttered and annotators often using spatial relationships to refer. As the Diagnostic training dataset does not contain spatial information, our models cannot handle these cases. We can train on existing referring expression datasets to learn spatial relationships, but we do not explore this as it is not central to this paper. More analysis is added to the appendix.

4.6 Robot Demonstration

One of the motivations of this work was to enable robots to use past referring expressions to aid grounding in human-robot interactions. In this section, we provide a demonstration of how our coreference grounding system can be integrated with the Baxter robotic platform. To benefit from coreference grounding, the system must have a natural way to keep track of the past referring expressions of an object. We demonstrate such an interface in Figure 5.

We focus on a scenario where a human asks the robot to pick up specific objects. If the robot incorrectly identifies the object being referred to, on the next turn the user can correct the robot by indicating the correct object (Figure 5b).


Figure 5: Demonstration of a coreference grounding system on the Baxter robot. (a) The cup being identified. (b) A user refers to the cup and the robot grounds incorrectly. The user corrects the robot and associates the referring expression with that object. (c) When a new user refers to the same object, the system's output is correct.

The referring expression can then be associated with the correct object, and this association can be maintained with a tracking system. Future users (with no knowledge of the first user) can then successfully refer to the object (Figure 5c).
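One way such bookkeeping could be maintained (a hypothetical sketch; the paper does not specify the system's data structures):

```python
from collections import defaultdict


class ReferenceMemory:
    """Map tracked object IDs to the referring expressions used for them, and
    reassign an expression when the user corrects a grounding error."""

    def __init__(self):
        self.past_expressions = defaultdict(list)  # track_id -> list of expressions

    def record(self, track_id: int, expression: str) -> None:
        self.past_expressions[track_id].append(expression)

    def correct(self, predicted_id: int, correct_id: int, expression: str) -> None:
        # The user indicates the robot grounded incorrectly: move the expression
        # from the predicted object to the object the user identified.
        if expression in self.past_expressions[predicted_id]:
            self.past_expressions[predicted_id].remove(expression)
        self.past_expressions[correct_id].append(expression)
```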

5 Related Work

Grounding Referring Expressions: There has been a lot of recent work on resolving referring expressions to objects (Mao et al., 2016; Yu et al., 2016; Shridhar and Hsu, 2018; Nagaraja et al., 2016; Cirik et al., 2018). In contrast to these approaches, which are static once trained, our model leverages past referring expressions, which permits an interface to add information during execution. Our task is related to Visual Dialogue (Das et al., 2017; Kottur et al., 2018), where an agent interactively answers questions about visual properties of a scene, and the answers only exist in the image and not in the dialogue context. In contrast, we focus on using complementary knowledge from past referring expressions. There has also been work on learning a compatibility metric dictating whether two words can refer to the same object (Kruszewski and Baroni, 2015). Our work focuses on coreference between entire referring expressions instead of atomic words.

Lexical Choice in Interactions: A large body of work in psycholinguistics (Clark and Wilkes-Gibbs, 1986; Brennan and Clark, 1996; Pickering and Garrod, 2004) shows that participants in a conversation collaboratively come up with lexical terms to refer to objects and, consequently, become entrained with each other and start using similar expressions to refer to objects. These works form the motivation for our problem formulation. Recently, the PhotoBook dataset was proposed to investigate shared dialogue history in conversation (Haber et al., 2019). That dataset differs from ours in that expressions refer to entire scenes instead of objects within a scene. The authors' conclusions support our findings that using past expressions is useful for resolving referring expressions.

Interaction History for Human Robot Interaction: Paul et al. (2017) maintain knowledge provided by users using a closed set of predicates. In contrast, we use raw past referring expressions to handle an open domain of knowledge. There has been work on grounding in dialogue for robots (Tellex et al., 2014; Whitney et al., 2017; Thomason et al., 2017; Padmakumar et al., 2017). In contrast to our work, they focus on refinement or clarification, and hence past expressions only help within the same dialogue episode.

6 Conclusion

In this work, we reformulated the grounding problem as a type of coreference resolution, allowing for the inclusion of past referring expressions, which are typical in many real-world scenarios. We proposed a model that can use both linguistic features from past expressions and visual features of the object to ground a new expression. We showed that this model outperforms a purely vision-based model, as it can use past descriptions of salient features that the vision-based model may have difficulty learning with limited data. It further benefits from having a vision model that can fill in information not provided in past expressions.

Acknowledgements

We gratefully acknowledge funding support in part by the Honda Research Institute and the Toyota Research Institute. However, this article solely reflects the opinions and conclusions of its authors.


References

Susan E. Brennan and Herbert H. Clark. 1996. Conceptual pacts and lexical choice in conversation. Journal of Experimental Psychology: Learning, Memory, and Cognition.

Volkan Cirik, Taylor Berg-Kirkpatrick, and Louis-Philippe Morency. 2018. Using syntax to ground referring expressions in natural images. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI).

Herbert H. Clark and Deanna Wilkes-Gibbs. 1986. Referring as a collaborative process. Cognition, 22(1):1–39.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning (ICML).

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, and Dhruv Batra. 2017. Visual Dialog. In Proceedings of Computer Vision and Pattern Recognition (CVPR).

Janosch Haber, Tim Baumgärtner, Ece Takmaz, Lieke Gelderloos, Elia Bruni, and Raquel Fernández. 2019. The PhotoBook dataset: Building common ground through visually-grounded dialogue. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL).

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Ronghang Hu, Marcus Rohrbach, Jacob Andreas, Trevor Darrell, and Kate Saenko. 2017. Modeling relationships in referential expressions with compositional modular networks. In Proceedings of Computer Vision and Pattern Recognition (CVPR).

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR).

Satwik Kottur, José M. F. Moura, Devi Parikh, Dhruv Batra, and Marcus Rohrbach. 2018. Visual coreference resolution in visual dialog using neural module networks. In Proceedings of the 15th European Conference on Computer Vision (ECCV).

Germán Kruszewski and Marco Baroni. 2015. So similar and yet incompatible: Toward the automated identification of semantically compatible words. In HLT-NAACL.

Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Proceedings of the 13th European Conference on Computer Vision (ECCV).

Zhouhan Lin, Minwei Feng, Cícero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. In Proceedings of the 5th International Conference on Learning Representations (ICLR).

Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan Yuille, and Kevin Murphy. 2016. Generation and comprehension of unambiguous object descriptions. In Proceedings of Computer Vision and Pattern Recognition (CVPR).

Varun K. Nagaraja, Vlad I. Morariu, and Larry S. Davis. 2016. Modeling context between objects for referring expression understanding. In Proceedings of the 14th European Conference on Computer Vision (ECCV).

Aishwarya Padmakumar, Jesse Thomason, and Raymond J. Mooney. 2017. Integrated learning of dialog strategies and semantic parsing. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL).

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.

Rohan Paul, Andrei Barbu, Sue Felshin, Boris Katz, and Nicholas Roy. 2017. Temporal grounding graphs for language understanding with accrued visual-linguistic context. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI).

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Martin J. Pickering and Simon Garrod. 2004. Toward a mechanistic psychology of dialogue. Behavioral and Brain Sciences, 27(2):169–190.

Mohit Shridhar and David Hsu. 2018. Interactive visual grounding of referring expressions for human-robot interaction. In Proceedings of Robotics: Science and Systems (RSS).

Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR).

Stefanie Tellex, Ross Knepper, Adrian Li, Daniela Rus, and Nicholas Roy. 2014. Asking for help using inverse semantics. In Proceedings of Robotics: Science and Systems (RSS).

Jesse Thomason, Aishwarya Padmakumar, Jivko Sinapov, Justin Hart, Peter Stone, and Raymond J. Mooney. 2017. Opportunistic active learning for grounding natural language descriptions. In Proceedings of the 1st Conference on Robot Learning (CoRL).

David Whitney, Eric Rosen, James MacGlashan, Lawson Wong, and Stefanie Tellex. 2017. Reducing errors in object-fetching interactions through social feedback. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA).

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Towards universal paraphrastic sentence embeddings. In Proceedings of the 4th International Conference on Learning Representations (ICLR).

Licheng Yu, Patric Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg. 2016. Modeling context in referring expressions. In Proceedings of the 14th European Conference on Computer Vision (ECCV).