
The PhotoBook Dataset: Building Common Ground through Visually-Grounded Dialogue

Janosch Haber∗ Tim Baumgärtner♣ Ece Takmaz♣ Lieke Gelderloos‡ Elia Bruni† and Raquel Fernández♣

♣University of Amsterdam, ∗Queen Mary University of London, ‡Tilburg University, †Universitat Pompeu Fabra

{raquel.fernandez|ece.takmaz}@uva.nl, [email protected], {baumgaertner.t|elia.bruni}@gmail.com

[email protected]

Abstract

This paper introduces the PhotoBook dataset, a large-scale collection of visually-grounded, task-oriented dialogues in English designed to investigate shared dialogue history accumulating during conversation. Taking inspiration from seminal work on dialogue analysis, we propose a data-collection task formulated as a collaborative game prompting two online participants to refer to images utilising both their visual context as well as previously established referring expressions. We provide a detailed description of the task setup and a thorough analysis of the 2,500 dialogues collected. To further illustrate the novel features of the dataset, we propose a baseline model for reference resolution which uses a simple method to take into account shared information accumulated in a reference chain. Our results show that this information is particularly important to resolve later descriptions and underline the need to develop more sophisticated models of common ground in dialogue interaction.1

1 Introduction

The past few years have seen an increasing interest in developing computational agents for visually grounded dialogue, the task of using natural language to communicate about visual content in a multi-agent setup. The models developed for this task often focus on specific aspects such as image labelling (Mao et al., 2016; Vedantam et al., 2017), object reference (Kazemzadeh et al., 2014; De Vries et al., 2017a), visual question answering (Antol et al., 2015), and first attempts of visual dialogue proper (Das et al., 2017), but fail to produce consistent outputs over a conversation.

1 The PhotoBook dataset is being released by the Dialogue Modelling Group led by Raquel Fernández at the University of Amsterdam. The core of this work was done while Janosch Haber and Elia Bruni were affiliated with the group.

We hypothesise that one of the main reasons for this shortcoming is the models' inability to effectively utilise dialogue history. Human interlocutors are known to collaboratively establish a shared repository of mutual information during a conversation (Clark and Wilkes-Gibbs, 1986; Clark, 1996; Brennan and Clark, 1996). This common ground (Stalnaker, 1978) then is used to optimise understanding and communication efficiency. Equipping artificial dialogue agents with a similar representation of dialogue context thus is a pivotal next step in improving the quality of their dialogue output.

To facilitate progress towards more consistent and effective conversation models, we introduce the PhotoBook dataset: a large collection of 2,500 human-human goal-oriented English conversations between two participants, who are asked to identify shared images in their respective photo books by exchanging messages via written chat. This setup takes inspiration from experimental paradigms extensively used within the psycholinguistics literature to investigate partner-specific common ground (for an overview, see Brown-Schmidt et al., 2015), adapting them to the requirements imposed by online crowdsourcing methods. The task is formulated as a game consisting of five rounds. Figure 1 shows an example of a participant's display. Over the five rounds of a game, a selection of previously displayed images will be visible again, prompting participants to re-refer to images utilising both their visual context as well as previously established referring expressions. The resulting dialogue data therefore allows for tracking the common ground developing between dialogue participants.

We describe in detail the PhotoBook task and the data collection, and present a thorough analysis of the dialogues in the dataset. In addition, to showcase how the new dataset may be exploited for computational modelling, we propose a reference resolution baseline model trained to identify target images being discussed in a given dialogue segment. The model uses a simple method to take into account information accumulated in a reference chain. Our results show that this information is particularly important to resolve later descriptions and highlight the importance of developing more sophisticated models of common ground in dialogue interaction.

The PhotoBook dataset, together with the data collection protocol, the automatically extracted reference chains, and the code used for our analyses and models are available at the following site: https://dmg-photobook.github.io.

2 Related Work

Seminal works on cooperative aspects of dialogue have developed their hypotheses and models based on a relatively small number of samples collected through lab-based conversation tasks (e.g., Krauss and Weinheimer, 1964, 1966; Clark and Wilkes-Gibbs, 1986; Brennan and Clark, 1996; Anderson et al., 1991). Recent datasets inspired by this line of work include the REX corpora (Takenobu et al., 2012) and PentoRef (Zarrieß et al., 2016). With the development of online data collection methods (von Ahn et al., 2006) a new, game-based approach to quick and inexpensive collection of dialogue data became available. PhotoBook builds on these traditions to provide a large-scale dataset suitable for data-driven development of computational dialogue agents.

The computer vision community has recently developed large-scale datasets for visually grounded dialogue (Das et al., 2017; De Vries et al., 2017b). These approaches extend earlier work on visual question answering (Antol et al., 2015) to a multi-turn setup where two agents, each with a pre-determined Questioner or Answerer role, exchange sequences of questions and answers about an image. While data resulting from these tasks provides interesting opportunities to investigate visual grounding, it suffers from fundamental shortcomings with respect to the collaborative aspects of natural goal-oriented dialogue (e.g., fixed, pairwise structuring of questions and answers, no extended dialogue history). In contrast, PhotoBook includes natural and free-form dialogue data with a variety of dialogue acts and opportunities for participant collaboration.

Resolving referring expressions in the visual modality has also been studied in computer vision. Datasets such as ReferIt (Kazemzadeh et al., 2014), Flickr30k Entities (Plummer et al., 2015) and Visual Genome (Krishna et al., 2017) map referring expressions to regions in a single image. Referring expressions in the PhotoBook dataset differ from this type of data in that the candidate referents are independent but similar images and, most importantly, are often part of a reference chain in the participants' dialogue history.

3 Task Description and Setup

In the PhotoBook task, two participants are paired for an online multi-round image identification game. In this game, participants are shown collections of images that resemble the page of a photo book (see Figure 1). Each of these collections is a randomly ordered grid of six similar images depicting everyday scenes extracted from the MS COCO Dataset (Lin et al., 2014). On each page of the photo book, some of the images are present in the displays of both participants (the common images). The other images are each shown to one of the participants only (different). Three of the images in each display are highlighted through a yellow bar under the picture. The participants are tasked to mark these highlighted target images as either common or different by chatting with their partner.2 The PhotoBook task is symmetric, i.e., participants do not have predefined roles such as instruction giver and follower, or questioner and answerer. Consequently, both participants can freely and naturally contribute to the conversation, leading to more natural dialogues.

Once the two participants have made their selections on a given page, they are shown a feedback screen and continue to the next round of the game, a new page of the photo book displaying a different grid of images. Some of the images in this grid will be new to the game while others will have appeared before. A full game consists of labelling three highlighted target images in each of five consecutive rounds.

Each highlighted image is displayed exactly five times throughout a game while the display of images and the order of rounds is randomised to prevent participants from detecting any patterns.

2 Pilot studies showed that labelling all six images took participants about half an hour, which appeared to be too long for the online setting, resulting in large numbers of disconnects and incomplete games.


Figure 1: Screenshot of the Amazon Mechanical Turk user interface designed to collect the PhotoBook dataset.

As a result of this carefully designed setup, dialogues in the PhotoBook dataset contain multiple descriptions of each of the target images and thus provide a valuable resource for investigating participant cooperation, and specifically collaborative referring expression generation and resolution with respect to the conversation's common ground.

Image Sets The task setup requires each game of five rounds to display 12 unique but similar images to elicit non-trivial referring expressions. We use the object category annotations in MS COCO to group all landscape, unmarked, colour images where the two largest objects belong to the same category across all images in the set (e.g., all images in the set prominently feature a person and a cat).3 This produced 30 sets of at least 20 images from which 12 were selected at random. As a given game highlights only half of the images from a given set, each image set produces two different game sets with different target images to be highlighted, for a total of 60 unique games and 360 unique images. More details on the PhotoBook setup and image sets are provided in Appendix A.

3 All images where the two largest objects cover less than 30k pixels (∼10% of an average COCO image) were rejected.
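
The filtering just described can be approximated with pycocotools. The sketch below is illustrative rather than the authors' actual preprocessing script: the annotation path is a placeholder, the checks for unmarked, colour images are omitted, and only the landscape, category and 30k-pixel criteria mentioned in this section are shown.

```python
from collections import defaultdict
from pycocotools.coco import COCO

coco = COCO("annotations/instances_train2014.json")  # placeholder path
MIN_AREA = 30_000  # threshold from footnote 3 (~10% of an average COCO image)

candidate_sets = defaultdict(list)
for img_id in coco.getImgIds():
    img = coco.loadImgs(img_id)[0]
    if img["width"] <= img["height"]:      # keep landscape images only
        continue
    anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id, iscrowd=False))
    if len(anns) < 2:
        continue
    largest = sorted(anns, key=lambda a: a["area"], reverse=True)[:2]
    if sum(a["area"] for a in largest) < MIN_AREA:
        continue                           # the two most prominent objects are too small
    cats = tuple(sorted(c["name"] for c in coco.loadCats([a["category_id"] for a in largest])))
    candidate_sets[cats].append(img_id)    # group by the two most prominent categories

# Keep category pairs with at least 20 candidates; 12 images per set are sampled later.
image_sets = {cats: ids for cats, ids in candidate_sets.items() if len(ids) >= 20}
```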

4 Data Collection

We use the ParlAI framework (Miller et al., 2017) to implement the task and interface with the crowdsourcing platform Amazon Mechanical Turk (AMT) to collect the data. To control the quality of collected dialogues, we require AMT workers to be native English speakers and to have completed at least 100 other tasks on AMT with a minimum acceptance rate of 90%. Workers are paired through an algorithm based on whether or not they have completed the PhotoBook task before and which of the individual games they have played. In order to prevent biased data, workers can complete a maximum of five games, each participant can complete a given game only once, and the same pair of participants cannot complete more than one game.

Participants are instructed about the task and first complete a warming-up round with only three images per participant (two of them highlighted). In order to render reference grounding as clean as possible and facilitate automatic processing of the resulting dialogue data, participants are asked to try to identify the common and different images as quickly as possible, only describe a single image per message, and directly select an image's label when they agree on it. The compensation scheme is based on an average wage of 10 USD per hour (Hara et al., 2018). See Appendix B for a full account of the instructions and further details on participant payment.

During data collection, we recorded anonymised participant IDs, the author, timestamp and content of all sent messages, label selections and button clicks, plus self-reported collaboration performance scores. For a period of two months, data collection produced human-human dialogues for a total of 2,506 completed games. The resulting PhotoBook dataset contains a total of 164,615 utterances, 130,322 actions and spans a vocabulary of 11,805 unique tokens.

Each of the 60 unique game sets was played between 15 and 72 times, with an average of 41 games per set.


Figure 2: (a) Average completion times (solid blue) and scores (dashed red) per game round. (b) Ratio of content tokens over total token count per round with best linear fit. (c) Ratio of new content tokens over total content token count per round. (d) Relative change in distribution of main content POS between the first and last game round.

The task was completed by 1,514 unique workers, of which 472 only completed a single game, 448 completed between two and four games, and 594 the maximum of five games. Completing a full five-round game took an average of 14.2 minutes. With three highlighted images per player per round, during a full game of five rounds 30 labelling decisions have to be made. On average, participants correctly labelled 28.62 out of these 30.

5 Dataset Analysis

In this paper, we focus on the analysis of participants' interaction during a game of five labelling rounds.4 Our data here largely confirms the observations concerning participants' task efficiency and language use during a multi-round communication task made by seminal, small-scale experiments such as those by Krauss and Weinheimer (1964); Clark and Wilkes-Gibbs (1986); Brennan and Clark (1996) and, due to its scale, offers additional aspects for further investigation.

5.1 Task Efficiency

Completing the first round of the PhotoBook task takes participants an average of almost three minutes. Completing the fifth round on the other hand takes them about half that time. As Figure 2a shows, this decline roughly follows a negative logarithmic function, with significant differences between rounds 1, 2, 3 and 4, and plateauing towards the last round. The number of messages sent by participants as well as the average message length follow a similar pattern, significantly decreasing between consecutive game rounds. The average number of correct image labels, on the other hand, significantly increases between the first and last round of the game (cf. the red dashed graph in Figure 2a). As a result, task efficiency as calculated by points per minute significantly increases with each game round.

4 Tracking participant IDs, for example, also allows for an analysis of differences in behaviour across different games.

5.2 Linguistic Properties of Utterances

To get a better understanding of how participants increase task efficiency and shorten their utterances, we analyse how the linguistic characteristics of messages change over a game.

We calculated a normalised content word ratio by dividing the count of content words by the total token count.5 This results in an almost linear increase of the content token over total token ratio throughout a game (average Pearson's r per game of 0.34, p < 0.05, see Figure 2b). With referring expressions and messages in general getting shorter, content words thus appear to be favoured to remain. We also observe that participants reuse these content words. Figure 2c shows the number of novel content tokens per game round, which roughly follows a negative logarithmic function. This supports the hypothesis of participants establishing a conceptual pact on the referring expression attached to a specific referent: Once accepted, a referring expression is typically refined through shortening rather than by reformulating or adding novel information (cf., Brennan and Clark, 1996).
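
A minimal sketch of how such a content-word ratio and its per-game trend could be computed with NLTK and SciPy; the toy `rounds` variable (five lists of messages, one per game round) and the downloads are assumptions, not the authors' analysis code.

```python
import nltk
from nltk.corpus import stopwords
from scipy.stats import pearsonr

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))

def content_ratio(messages):
    """Share of non-stopword tokens over all tokens in a list of messages."""
    tokens = [t.lower() for msg in messages for t in nltk.word_tokenize(msg)]
    content = [t for t in tokens if t.isalpha() and t not in STOPWORDS]
    return len(content) / max(len(tokens), 1)

# Toy example based on the coreference chain shown in Section 5.3.
rounds = [
    ["Do you have a boy with a teal coloured shirt holding a bear with a red shirt?"],
    ["Boy with teal shirt and bear with red shirt?"],
    ["Teal shirt boy?"],
    ["Teal shirt boy again?"],
    ["Same teal shirt boy"],
]
ratios = [content_ratio(msgs) for msgs in rounds]
r, p = pearsonr([1, 2, 3, 4, 5], ratios)   # per-game trend of the ratio
```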

We also analysed in more detail the distribution of word classes per game round by tagging messages with the NLTK POS-Tagger. Figure 2d displays the relative changes in content-word-class usage between the first round and last round of a game. All content word classes but verbs show a relative increase in occurrence, most prominently nouns with a 20% relative increase. The case of adverbs, which show a 12% relative increase, is particular: Manual examination showed that most adverbs are not used to describe images but rather to flag that a given image has already appeared before or to confirm/reject (‘again’ and ‘too’ make up 21% of all adverb occurrences; about 36% are ‘not’, ‘n’t’ and ‘yes’). These results indicate that interlocutors are most likely to retain the nouns and adjectives of a developing referring expression, while increasingly dropping verbs, as well as prepositions and determiners. A special role here is taken by the definite determiner ‘the’, which, in spite of the stark decline of determiners in general, increases by 13% in absolute occurrence counts between the first and last round of a game, suggesting a shift towards known information.

5 We filtered out function words with NLTK's stopword list (http://www.nltk.org/).

Finally, in contrast to current visual dialogue datasets (Das et al., 2017; De Vries et al., 2017b) which exclusively contain sequences of question-answer pairs, the PhotoBook dataset includes diverse dialogue acts. Qualitative examination shows that, not surprisingly, a large proportion of messages include an image description. These descriptions however are interleaved with clarification questions, acceptances/rejections, and acknowledgements. For an example, see the dialogue excerpt in Figure 1. Further data samples are available in Appendix C. A deeper analysis of the task-specific dialogue acts would require manual annotation, which could be added in the future.

5.3 Reference Chains

In a small-scale pilot study, Ilinykh et al. (2018) find that the pragmatics of goal-oriented dialogue leads to consistently more factual scene descriptions and reasonable referring expressions than traditional, context-free labelling of the same images. We argue that in the PhotoBook task referring expressions are not only adapted based on the goal-oriented nature of the interaction but also by incorporating the developing common ground between the participants. This effect becomes most apparent when collecting all referring expressions for a specific target image produced during the different rounds of a game in its coreference chain. The following excerpt displays such a coreference chain extracted from the PhotoBook dataset:

1. A: Do you have a boy with a teal coloured shirt with yellow holding a bear with a red shirt?

2. B: Boy with teal shirt and bear with red shirt?
3. A: Teal shirt boy?

To quantify the effect of referring expression refinement, we compare participants' first and last descriptions of a given target image with the image's captions provided in the MS COCO dataset. For this purpose we manually annotated the first and last expressions referring to a set of six target images across ten random games in the PhotoBook dataset. Several examples are provided in Appendix C. Table 1 shows their differences in token count before and after filtering for content words with the NLTK stopword list.

Source               # Tokens    # Content    Distance
COCO captions        11.167      5.255        –
First description     9.963      5.185        0.091
Last description      5.685      5.128        0.156

Table 1: Avg. token counts in COCO captions and the first and last descriptions in PhotoBook, plus their cosine distance to the caption's cluster mean vector. The distance between first and last descriptions is 0.083.

Before filtering, first referring expressions do not significantly differ in length from the COCO captions. Last descriptions however are significantly shorter than both the COCO captions and first descriptions. After filtering for content words, no significant differences remain. We also calculate the cosine distance between the three different descriptions based on their average word vectors.6 Non-function words here should not significantly alter an utterance's mean word vector, which is confirmed in our results. Before as well as after filtering, the distance between last referring expression and COCO captions is almost double the distance between the first referring expressions and the captions (see last column in Table 1). Comparing the distribution of word classes in the captions and referring expressions finally revealed a similar distribution in first referring expressions and COCO captions, and a significantly different distribution in last referring expressions, among other things doubling the relative frequency of nouns.

6 We average pretrained word vectors from Word2Vec (Mikolov et al., 2013) in gensim (https://radimrehurek.com/gensim/) to generate utterance vectors. The five COCO captions are represented by a cluster mean vector.
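
Under the assumptions of footnote 6, the distance between descriptions could be computed roughly as follows. The vector file path is a placeholder, and the example first and last descriptions are taken from the reference chain discussed in Section 7.4; this is a sketch, not the released analysis code.

```python
import numpy as np
from gensim.models import KeyedVectors
from scipy.spatial.distance import cosine

# Placeholder path to pretrained Word2Vec vectors in word2vec binary format.
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def mean_vector(utterance):
    """Average the word vectors of all in-vocabulary tokens of an utterance."""
    vecs = [kv[w] for w in utterance.lower().split() if w in kv]
    return np.mean(vecs, axis=0)

first = "a woman sitting in front of a monitor with a dog wallpaper while holding a plastic carrot"
last = "the carrot lady"

# The COCO captions of an image would be represented by np.mean over their five mean
# vectors (the cluster mean); here we only compare first and last descriptions
# (Table 1 reports an average distance of 0.083 for the annotated set).
print(cosine(mean_vector(first), mean_vector(last)))
```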

Page 6: The PhotoBook Dataset: Building Common Ground through ...dali.eecs.qmul.ac.uk/repository/papers/the-photobook-dataset-buildin… · use the object category annotations in MS COCO

6 Reference Chain Extraction

Collecting reference chains from dialogue data is a non-trivial task which normally requires manual annotation (Yoshida, 2011). Here we propose a simple procedure to automatically extract reference chains made up of dialogue segments. A dialogue segment is defined as a collection of consecutive utterances that, as a whole, discuss a given target image and include expressions referring to it. All dialogue segments within a game that refer to the same target image form its reference chain.

In order to automatically segment the collected dialogues in this way, we developed rule-based heuristics exploiting participants' image labelling actions to detect segment boundaries and their respective targets. The heuristics are described in detail in Appendix D. Since the task instructs participants to label images as soon as they identify them as either common or different, the majority of registered labelling actions can be assumed to conclude the current dialogue segment. The following excerpt displays a segment extracted from a game's first round, discussing one target image before a participant selects its label:

B: I have two kids (boys) holding surfboards walking.

A: I do not have that one.
B: marks #340331 as different

Image selections however do not always delimit segments in the cleanest way possible. For example, a segment may refer to more than one target image, i.e., the participants may discuss two images and only after this discussion be able to identify them as common/different. 72% of the extracted segments are linked to only one target; 25% to two. Moreover, reference chains do not necessarily contain one segment for each of the five game rounds. They may contain fewer or more segments than rounds in a game, since participants may discuss the same image more than once in a single round and some of the extracted chains may be noisy, as explained in the evaluation section below. 75% of the automatically extracted chains contain three to six segments.
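
The full heuristics are given in Appendix D; the simplified sketch below only illustrates the core idea of closing a segment at each labelling action. The `events` structure (a chronological list of (kind, speaker, payload) tuples for one game round) is a hypothetical representation of the logged messages and actions.

```python
def extract_segments(events):
    """events: chronological (kind, speaker, payload) tuples of one game round,
    where kind is "message" (payload: text) or "label" (payload: image id)."""
    segments, buffer = [], []
    for kind, speaker, payload in events:
        if kind == "message":
            buffer.append((speaker, payload))
        elif kind == "label":
            if buffer:                                   # a labelling action closes the segment
                segments.append({"utterances": buffer, "targets": {payload}})
                buffer = []
            elif segments:                               # consecutive labels share one segment
                segments[-1]["targets"].add(payload)
    if buffer:                                           # trailing chat without a label
        segments.append({"utterances": buffer, "targets": set()})
    return segments

events = [
    ("message", "B", "I have two kids (boys) holding surfboards walking."),
    ("message", "A", "I do not have that one."),
    ("label", "B", 340331),                              # B marks #340331 as different
]
print(extract_segments(events))
```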

Evaluation To evaluate the segmentation, two annotators independently reviewed segments extracted from 20 dialogues. These segments were annotated by marking all utterances u in a segment S with target images I that refer to an image i′ ∉ I to determine precision, and marking all directly preceding and succeeding utterances u′ outside of a segment S that refer to a target image i ∈ I to determine recall. Additionally, if a segment S did not include any references to any of its target images I, it was labelled as improper. 95% of annotated segments were assessed to be proper (Cohen's κ of 0.87), with 28.4% of segments containing non-target references besides target references (Cohen's κ of 0.97). Recall across all reviewed segments is 99% (Cohen's κ of 0.93).
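
Agreement figures of this kind can be reproduced with scikit-learn's implementation of Cohen's κ; the binary judgements below are toy values for illustration only.

```python
from sklearn.metrics import cohen_kappa_score

# One binary judgement per reviewed segment (1 = proper, 0 = improper); toy values.
annotator_1 = [1, 1, 1, 0, 1, 1, 1, 1]
annotator_2 = [1, 1, 1, 0, 1, 0, 1, 1]
print(cohen_kappa_score(annotator_1, annotator_2))
```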

7 Experiments on Reference Resolution

Using the automatically extracted dialogue segments, we develop a reference resolution model that aims at identifying the target images referred to in a dialogue segment. We hypothesise that later segments within a reference chain might be more difficult to resolve, because they rely on referring expressions previously established by the dialogue participants. As a consequence, a model that is able to keep track of the common ground should be less affected by this effect. To investigate these issues, we experiment with two conditions: In the NO-HISTORY condition, the model only has access to the current segment and to the visual features of each of the candidate images. In the HISTORY condition, on the other hand, the model also has access to the previous segments in the reference chain associated with each of the candidate images, containing the linguistic common ground built up by the participants.

We keep our models very simple. Our aim is to propose baselines against which future work can be compared.

7.1 Data

The automatically extracted co-reference chains per target image were split into three disjoint sets for training (70%), validation (15%) and testing (15%), aiming at an equal distribution of target image domains in all three sets. The raw numbers per data split are shown in Table 2.

Split    Chains    Segments    Targets    Non-Targets
Train    12,694    30,992      40,898     226,993
Val       2,811     6,801       9,070      50,383
Test      2,816     6,876       9,025      49,774

Table 2: Number of reference chains, dialogue segments, and image types (targets and non-targets) in each data split.


Figure 3: Diagram of the model in the HISTORY condition. For simplicity, we only show three candidate images. Some candidate images may not have a reference chain associated with them, while others may be linked to chains of different length, reflecting how many times an image has been referred to in the dialogue so far. In this example, the model predicts that the bottom candidate is the target referent of the segment to be resolved.

7.2 Models

Our resolution model encodes the linguistic features of the dialogue segment to be resolved with a recurrent neural network with Long Short-Term Memory (LSTM, Hochreiter and Schmidhuber, 1997). The last hidden state of the LSTM is then used as the representation for the dialogue segment. For each candidate image in the context, we obtain image features using the activations from the penultimate layer of a ResNet-152 (He et al., 2016) pre-trained on ImageNet (Deng et al., 2009). These image features, which are of size 2048, are projected onto a smaller dimension equal to the hidden dimension of the LSTM units. Projected image features go through a ReLU non-linearity and are normalised to unit vectors. To assess which of the candidate images is a target, in the NO-HISTORY condition we take the dot product between the dialogue segment representation and each image feature vector, ending up with scalar predictions for all N images in the context: s = {s0, ..., sN}.

For the HISTORY condition, we propose a simple mechanism for taking into account linguistic common ground about each image. For each candidate image, we consider the sequence of previous segments within its reference chain. This shared linguistic background is encoded with another LSTM, whose last hidden state is added to the corresponding image feature that was projected to the same dimension as the hidden state of the LSTM. The resulting representation goes through ReLU, is normalised, and compared to the target dialogue segment representation via dot product, as in NO-HISTORY (see Figure 3).

As an ablation study, we train a HISTORY model without visual features. This allows us to establish a baseline performance only involving language and to study whether the HISTORY model with visual features learns an efficient multi-modal representation. We hypothesise that some descriptions can be successfully resolved by just comparing the current segment and the reference chain in the history (e.g., when descriptions are detailed and repeated). However, performance should be significantly lower than with visual features, for example when referring expressions are ambiguous.

Sigmoid is applied element-wise over the scalar predictions in all three models. As a result, each image can be assessed independently using a decision threshold (set to 0.5). This allows the model to predict multiple images as referents.7 We use Binary Cross Entropy Loss to train the models. Since distractor images make up 84.74% of the items to be classified in the training set and target images constitute only 15.26% of them, we provided 84.74/15.26 ≈ 5.5 as the weight of the target class in the loss function.

7 As explained in Section 6, 25% of segments are linked to two targets; 3% to more. See Appendix D for further details.

All models were implemented in PyTorch, trained with a learning rate of 0.001 and a batch size of 512. The dimension of the word embeddings and the hidden dimensions of the LSTM units were all set to 512. The parameters were optimised using Adam (Kingma and Ba, 2014). The models were trained until the validation loss stopped improving, after which we selected the model with the best weighted average of the target and non-target F-scores.
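
The following PyTorch sketch condenses the NO-HISTORY and HISTORY baselines described above. It is an approximation of the setup rather than the released code: padding, packing and feature extraction are omitted, and candidate images without a history chain are assumed to receive an all-zero chain encoding. At evaluation time, a sigmoid and a 0.5 threshold turn each logit into an independent common/different decision, and the weighted binary cross-entropy mirrors the ≈ 5.5 class weight given above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReferenceResolver(nn.Module):
    def __init__(self, vocab_size, dim=512, img_feat_dim=2048, use_history=True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.segment_lstm = nn.LSTM(dim, dim, batch_first=True)   # encodes the current segment
        self.chain_lstm = nn.LSTM(dim, dim, batch_first=True)     # encodes a candidate's chain
        self.img_proj = nn.Linear(img_feat_dim, dim)               # projects ResNet-152 features
        self.use_history = use_history

    def forward(self, segment, images, chains=None):
        # segment: (batch, seq_len) token ids; images: (batch, n_cand, 2048)
        # chains: (batch, n_cand, chain_len) token ids of previous segments (zeros if none)
        _, (seg_h, _) = self.segment_lstm(self.embed(segment))
        seg_repr = seg_h[-1]                                       # (batch, dim)
        img_repr = self.img_proj(images)                           # (batch, n_cand, dim)
        if self.use_history and chains is not None:
            b, n, l = chains.shape
            _, (ch_h, _) = self.chain_lstm(self.embed(chains.view(b * n, l)))
            img_repr = img_repr + ch_h[-1].view(b, n, -1)          # add linguistic common ground
        img_repr = F.normalize(torch.relu(img_repr), dim=-1)       # ReLU, then unit vectors
        return torch.bmm(img_repr, seg_repr.unsqueeze(-1)).squeeze(-1)  # one logit per candidate

model = ReferenceResolver(vocab_size=11805)
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(5.5))     # target-class weight from above
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```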

7.3 Results

We report precision, recall, and F-score for the target images in Table 3. Results for non-target images are available in Appendix E. Every candidate image contributes individually to the scores, i.e., the task is not treated as multi-label for evaluation purposes. Random baseline scores are obtained by taking the average of 10 runs with a model that predicts targets and non-targets randomly for the images in the test set.

Given the low ratio of target to distractor images (see Table 2 in Section 7.1), the task of identifying target images is challenging and the random baseline achieves an F-score below 30%. The results show that the resolution capabilities of our model are well above the baseline. The HISTORY model achieves higher recall and F-score than the NO-HISTORY model, while precision is comparable across these two conditions.

Model               Precision    Recall    F1
Random baseline     15.34        49.95     23.47
NO-HISTORY          56.65        75.86     64.86
HISTORY             56.66        77.41     65.43
HISTORY/No image    35.66        63.18     45.59

Table 3: Results for the target images in the test set.

For a more in-depth analysis of the results, we examine how precision and recall vary depending on the position of the to-be-resolved segment within a reference chain. Figure 4 displays this information. As hypothesised, we observe that resolution performance is lower for later segments in a reference chain. For example, while precision is close to 60% for first mentions (position 1 in a chain), it declines by around 20 points for last mentions.


Figure 4: Precision and recall (y axis) for target images, given the position of the segment in a reference chain (x axis).

The plots in Figure 4 also show the impact of taking into account the common ground accumulated over a reference chain. This is most prominent with regard to the recall of target images. The HISTORY model yields higher results than the NO-HISTORY model when it comes to resolving segments that refer to an image that has already been referred to earlier within the dialogue (positions > 1). Yet, the presence of linguistic context does not fully cancel out the effect observed above: The performance of the HISTORY model also declines for later segments in a chain, indicating that more sophisticated methods are needed to fully exploit shared linguistic information.

Experiments with the HISTORY model without visual features (HISTORY/No image) confirm our hypothesis. The HISTORY model outperforms the “blind” model by about 21 points in precision and 14 points in recall. We thus conclude that even our simple fusion mechanism already allows for learning an efficient multimodal encoding and resolution of referring expressions.

7.4 Qualitative Analysis

The quantitative dataset analysis presented in Section 5 showed that referring expressions become shorter over time, with interlocutors being most likely to retain nouns and adjectives. Qualitative inspection of the reference chains reveals that this compression process can lead to very non-standard descriptions. We hypothesise that the degree to which the compressed descriptions rely on visual information has an impact on the performance of the models. For example, the NO-HISTORY model can be effective when the participants converge on a non-standard description which highlights a visual property of the target image that clearly discriminates it from the distractors. This is the case in the example shown on the left-hand side of Figure 5. The target image shows a woman holding what seems to be a plastic carrot. This feature stands out in a domain where all the candidate images include a person and a TV.8 After an initial, longer description (‘a woman sitting in front of a monitor with a dog wallpaper while holding a plastic carrot’), the participants use the much more compact description ‘the carrot lady’. Arguably, given the saliency of the carrot in the given context, relying on the preceding linguistic history is not critical in this case, and thus both the NO-HISTORY and the HISTORY model succeed in identifying the target.

8 The COCO annotations for this image seem to be slightly off, as the image is tagged as including a TV but in fact shows a computer monitor.


Figure 5: Reference chain for each of the two displayed images. The dialogue segments in the chains are slightly simplified for space reasons. Left: Both the HISTORY and the NO-HISTORY models succeed at identifying this image as the target of the segment to be resolved. Right: The NO-HISTORY model fails to recognise this image as the target of the segment to be resolved, while the HISTORY model succeeds. The distractor images for these two examples are available in Appendix E.

We observe that the HISTORY model is particularly helpful when the participants converge on a non-standard description of a target image that cannot easily be grounded on visual information. The image and reference chain on the right-hand side of Figure 5 illustrate this point, where the description to be resolved is the remarkably abstract ‘strange’. Here the HISTORY model succeeds while the NO-HISTORY model fails. As in the previous example, the referring expression in the first segment of the reference chain for this image (‘a strange bike with two visible wheels in the back’) includes more descriptive content – indeed, it is similar to a caption, as shown by our analysis in Section 5.3. By exploiting shared linguistic context, the HISTORY model can not only interpret the non-standard phrase, but also recover additional properties of the image not explicit in the segment to be resolved, which presumably help to ground it.

8 Conclusion

We have presented the first large-scale dataset of goal-oriented, visually grounded dialogues for investigating shared linguistic history. Through the data collection's task setup, participants repeatedly refer to a controlled set of target images, which allows them to improve task efficiency if they utilise their developing common ground and establish conceptual pacts (Brennan and Clark, 1996) on referring expressions. The collected dialogues exhibit a significant shortening of utterances throughout a game, with final referring expressions starkly differing from both standard image captions and initial descriptions. To illustrate the potential of the dataset, we trained a baseline reference resolution model and showed that information accumulated over a reference chain helps to resolve later descriptions. Our results suggest that more sophisticated models are needed to fully exploit shared linguistic history.

The current paper showcases only some of the aspects of the PhotoBook dataset, which we hereby release to the public (https://dmg-photobook.github.io). In future work, the data can be used to further investigate common ground and conceptual pacts; be extended through manual annotations for a more thorough linguistic analysis of co-reference chains; exploit the combination of vision and language to develop computational models for referring expression generation; or use the PhotoBook task in the ParlAI framework for Turing-Test-like evaluation of dialogue agents.

Acknowledgements

The PhotoBook dataset was collected thanks to funding in the form of a Facebook ParlAI Research Award to Raquel Fernández. We are grateful to the ParlAI team, in particular Jack Urbanek and Jason Weston, for their continued support and assistance in using the ParlAI framework during the data collection, which was coordinated by Janosch Haber. We warmly thank the volunteers who took part in pilot experiments. Thanks also go to Aashish Venkatesh for his valuable contributions in brainstorming sessions about the task design and for preliminary modelling efforts. Finally, we are grateful to the participants of two Faculty Summits at Facebook AI Research NYC for their feedback.

References

Luis von Ahn, Mihir Kedia, and Manuel Blum. 2006. Verbosity: A game for collecting common-sense facts. In Proceedings of the Conference on Human Factors in Computing Systems.


Anne H. Anderson, Miles Bader, Ellen Gurman Bard, Elizabeth Boyle, Gwyneth Doherty, Simon Garrod, Stephen Isard, Jacqueline Kowtko, Jan McAllister, Jim Miller, et al. 1991. The HCRC Map Task corpus. Language and Speech, 34(4):351–366.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering. In Proceedings of ICCV.

Susan E. Brennan and Herbert H. Clark. 1996. Conceptual pacts and lexical choice in conversation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22:1482–1493.

Sarah Brown-Schmidt, Si On Yoon, and Rachel Anna Ryskin. 2015. People as contexts in conversation. In Psychology of Learning and Motivation, volume 62, chapter 3, pages 59–99. Elsevier.

Herbert H. Clark. 1996. Using Language. ’Using’ Linguistic Books. Cambridge University Press.

Herbert H. Clark and Deanna Wilkes-Gibbs. 1986. Referring as a collaborative process. Cognition, 22(1):1–39.

Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, and Dhruv Batra. 2017. Visual Dialog. In Proceedings of CVPR.

Harm De Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron Courville. 2017a. GuessWhat?! Visual object discovery through multi-modal dialogue. In Proceedings of CVPR.

Harm De Vries, Florian Strub, Jérémie Mary, Hugo Larochelle, Olivier Pietquin, and Aaron Courville. 2017b. Modulating early visual processing by language. In Proceedings of NIPS.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of CVPR.

Kotaro Hara, Abigail Adams, Kristy Milland, Saiph Savage, Chris Callison-Burch, and Jeffrey P. Bigham. 2018. A Data-Driven Analysis of Workers' Earnings on Amazon Mechanical Turk. In Proceedings of the Conference on Human Factors in Computing Systems.

He He, Anusha Balakrishnan, Mihail Eric, and Percy Liang. 2017. Learning symmetric collaborative dialogue agents with dynamic knowledge graph embeddings. In Proceedings of ACL.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of CVPR.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Nikolai Ilinykh, Sina Zarrieß, and David Schlangen. 2018. The task matters: Comparing image captioning and task-based dialogical image description. In Proceedings of INLG.

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara L. Berg. 2014. ReferIt Game: Referring to Objects in Photographs of Natural Scenes. In Proceedings of EMNLP.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Robert M. Krauss and Sidney Weinheimer. 1964. Changes in reference phrases as a function of frequency of usage in social interaction: a preliminary study. Psychonomic Science, 1(1):113–114.

Robert M. Krauss and Sidney Weinheimer. 1966. Concurrent feedback, confirmation, and the encoding of referents in verbal communication. Journal of Personality and Social Psychology, 4(3):343–346.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73.

Rensis Likert. 1932. A technique for the measurement of attitudes. Archives of Psychology, 22(140):55–55.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In Proceedings of ECCV.

Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan Yuille, and Kevin Murphy. 2016. Generation and comprehension of unambiguous object descriptions. In Proceedings of CVPR.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS.

Alexander Miller, Will Feng, Dhruv Batra, Antoine Bordes, Adam Fisch, Jiasen Lu, Devi Parikh, and Jason Weston. 2017. ParlAI: A Dialog Research Software Platform. In Proceedings of EMNLP.

Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of ICCV.

Robert Stalnaker. 1978. Assertion. In P. Cole, editor, Pragmatics, volume 9 of Syntax and Semantics, pages 315–332. New York Academic Press.

Tokunaga Takenobu, Iida Ryu, Terai Asuka, and Kuriyama Naoko. 2012. The REX corpora: A collection of multimodal corpora of referring expressions in collaborative problem solving dialogues. In Proceedings of LREC.

Ramakrishna Vedantam, Samy Bengio, Kevin Murphy, Devi Parikh, and Gal Chechik. 2017. Context-aware captions from context-agnostic supervision. In Proceedings of CVPR.

Etsuko Yoshida. 2011. Referring expressions in English and Japanese: patterns of use in dialogue processing, volume 208. John Benjamins Publishing.

Sina Zarrieß, Julian Hough, Casey Kennington, Ramesh Manuvinakurike, David DeVault, Raquel Fernández, and David Schlangen. 2016. PentoRef: A Corpus of Spoken References in Task-oriented Dialogues. In Proceedings of LREC.

Appendices

A Task Setup

Image Sets The images used in the PhotoBook task are taken from the MS COCO 2014 Trainset (Lin et al., 2014). Images in MS COCO were collected from the Flickr9 image repository, which contains labelled photos predominantly uploaded by amateur photographers. The pictures largely are snapshots of everyday situations, placing objects in a natural and often rich context (hence the name Common Objects in COntext) instead of showing an iconic view of objects. In the MS COCO Trainset, images are manually annotated with the outlines of the depicted objects as well as their object categories. We use this information to select similar pictures to display in the PhotoBook task. Through the filtering described in Section 3, we obtained 30 sets of similar images with different pairings of their most prominent objects. In total there are 26 unique object categories in the image sets. The most frequent object category in the image sets is person, which is one of the two main objects in 19 sets.

Specification of Games We developed a simple function to select which images of a set should be shown to which participant in which round of a game in order to guarantee that the task setup elicits sufficient image (re-)references for collecting co-reference chains. In this function, the 12 images in a set of similar photographs are randomly indexed and then assigned to a participant's display based on the schema displayed in Table 4.

9https://www.flickr.com/

With this schema, each photograph is displayed exactly five times while the order of images and the order of rounds can be randomised to prevent participants from detecting patterns in the display. Each of these sets then is duplicated and assigned a different selection of highlighted images to obtain the 60 game sets of the PhotoBook task. While most highlighted images recur five times during a game, they can be highlighted for both participants in the same round. As a result, any given image is highlighted in an average of 3.42 rounds of a game (see Table 5 for the highlighting schema).

Round    Participant A            Participant B
1        1, 2, 3, 4, 5, 6         1, 2, 3, 4, 7, 8
2        1, 3, 6, 7, 9, 10        2, 3, 6, 7, 9, 11
3        4, 5, 7, 10, 11, 12      2, 3, 6, 7, 9, 11
4        1, 2, 5, 8, 11, 12       1, 4, 5, 8, 10, 12
5        5, 6, 8, 9, 11, 12       3, 7, 9, 10, 11, 12

Table 4: Assignment of image IDs to the different participants and rounds of a game schema. The order of rounds and the arrangement of images on the participant's display can be randomised without effect on the game setup.

ID    Highlights (rounds 1–5)    T    H: game 1    H: game 2    R
 1    1 1 1 1 1                  5    5            0            3
 2    2 2 2 2 2                  5    0            5            4
 3    1 1 1 1 1                  5    5            0            3
 4    2 2 2 1 2                  5    4            1            3
 5    1 1 1 1 1                  5    0            5            4
 6    2 2 2 2 2                  5    5            0            4
 7    1 1 1 1 1                  5    0            5            3
 8    2 1 2 1 1                  5    2            3            3
 9    2 2 1 2 2                  5    4            1            3
10    2 2 2 2 2                  5    5            0            4
11    1 1 1 1 1                  5    0            5            4
12    2 2 2 2 2                  5    5            0            3

Table 5: Schema of referent image highlighting in the PhotoBook task. The left part of the table indicates whether a given image is highlighted for one of the two participants (A and B) in a given game round in either game 1 or 2. T indicates the total count of highlights (which is always 5), H counts the highlights per game and R the number of rounds that an image is highlighted in.

B Task Instructions

HIT Preview When the PhotoBook task environment is initialised, it publishes a specified number of Human Intelligence Tasks (HITs) titled Game: Detect common images by chatting with another player on Amazon Mechanical Turk (see Figure 6 for a full print of the descriptions). Participants entering the HIT are shown a preview screen with the central task details as shown in Figure 7.

Game Round Mechanics The PhotoBook task AMT user interface is designed in such a way that the six images per round are displayed in a 2 × 3 grid, with a coloured bar under each image: If the image is highlighted, this bar is yellow and contains a radio button option for the common and different labels. If they are not highlighted for a player, the bar is greyed out and empty. The submit button is deactivated as long as not all highlighted images have been labelled. As soon as both players have submitted their selection, a feedback page is shown where the bars under the highlighted images either colour green to indicate a correct selection or red to indicate a wrong one. Figure 8(b) shows a screenshot of the feedback display.

The radio buttons are disabled in the feedback screens so players cannot revise their selection - they can however communicate about their mistakes or pass any other feedback to their partner. The title of a page indicates the current page number so participants can always check their progress; the text input field is limited to a maximum of one hundred characters to prevent listings of multiple images or overly elaborate descriptions - which, if necessary, can be conveyed in a number of subsequent messages.

Feedback Questionnaire In order to facilitate a qualitative analysis of dialogue agents developed for the PhotoBook task, we also collect a gold-standard benchmark of participants' self-reported satisfaction scores. These scores can later be compared with those obtained by pairing human participants with an artificial dialogue agent in order to assess it in a Turing-Test-like setting. Following He et al. (2017), we ask participants to rate three statements on a five-point Likert scale (Likert, 1932), ranging from strongly agree to strongly disagree:

1. Overall collaboration with my partner worked well.

2. I understood my partner’s descriptions well.

3. My partner seemed to understand me well.

Warming-Up Round During an initial series of pilot runs we observed that for new participants the first round of their first game took significantly longer than any other ones. Although we do expect that participants get more efficient over time, we argue that this effect is largely related to the fact that participants need to get familiar with the task's mechanics when it is the first time they are exposed to it. In order to control for this effect, we added a warming-up round with three images per participant10 for each pair of new participants (see Figure 8a). This strongly reduced the completion time of new participants' first game rounds.

Matching Participants In order to collect unbiased samples of the referring expression generation process, we aim to prevent both i) participants completing the same game multiple times (as here they could re-use referring expressions that worked well during the last one) and ii) specific pairs of participants completing multiple games (as they might have established some kind of strategy or code already). We however also aim at designing the task in such a way that the degree of partner-specificity in established canonical expressions can be assessed. To achieve this, the participant matching should create settings where a re-entering participant is assigned a game with the same image set as in the game before, but paired with a different conversation partner (compare for example Brennan and Clark, 1996). In order to maximise the number of this second game setting, we encourage workers to continue playing by paying them a bonus of 0.25 USD for each 2nd, 3rd, 4th and 5th game.

Worker Payment The HIT description also details the worker's payment. We want to provide fair payment to workers, which we calculated based on an average wage of 10 USD per hour (Hara et al., 2018).11 An initial set of runs resulted in an average completion time of 12 minutes, which indicated an expected expense of about 2 USD per participant per game. More experienced workers however managed to complete a full game in six to ten minutes, meaning that for them we would often surpass the 10 USD/h guideline based on this calculation. Other workers – especially new ones – took up to 25 minutes for the first game, which means that they on the other hand would be strongly underpaid with a rigid per-game payment strategy. To mitigate this effect, we developed the following payment schema:

¹⁰ Warming-Up image categories are disjoint from the regular PhotoBook image sets.

¹¹ See also DynamoWiki.


Figure 6: Screenshot of the PhotoBook task AMT HIT details shown to a participant.

Figure 7: Screenshot of the PhotoBook task AMT HIT preview page.

Each worker that completes a full game is paid a base amount of 1.75 USD, which is indicated in the HIT description. If the game took longer than ten minutes, the participants are paid a bonus of 0.10 USD per minute, up to an additional bonus payment of 1.50 USD for 25 or more minutes. In order not to encourage workers to play slowly, we only inform them about this bonus at the end of a HIT. With this bonus and the 20% AMT fee on each transaction, we expected an average cost of about 5 USD per game, which, due to connection problems in the framework, ultimately accumulated to 6 USD per completed game. The total cost of the data collection, including pilot runs, was 16,350 USD.
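As a worked example of this payment scheme (a sketch for illustration only; payments were administered through the AMT interface, and the way the 0.25 USD re-entry bonus mentioned earlier combines with the per-game amounts is our assumption):

def game_payment(duration_minutes, game_index=1):
    base = 1.75                                          # base amount per completed game
    # 0.10 USD per minute beyond ten minutes, capped at 1.50 USD (25+ minutes)
    bonus = 0.10 * max(0.0, min(duration_minutes, 25) - 10)
    # assumed here: 0.25 USD re-entry bonus for each 2nd to 5th game of a worker
    reentry = 0.25 if 2 <= game_index <= 5 else 0.0
    return round(base + bonus + reentry, 2)

print(game_payment(8))                   # fast game:        1.75 USD
print(game_payment(12))                  # 12-minute game:   1.95 USD
print(game_payment(30, game_index=3))    # slow third game:  3.50 USD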

C Dataset Samples

Through the goal-oriented nature of participants' interactions in the PhotoBook dataset, we collect not only image descriptions but also the full, collaborative process of establishing, grounding and refining referring expressions throughout the subsequent rounds of the PhotoBook task. As a result, we capture a wide range of dialogue acts such as clarification questions, corrections, extensions, (self-)repairs, as well as interactions concerning game mechanics. Consider for example the following interactions:

A: Man with dog on lap looking at his computer?
B: I don't have that, but could it be a TV in yours? Mine has a man sitting with his dog watching TV.
A: yes, TV - sorry!
B: Okay.

A: Do you have someone on a big motorcycle and their head isn't visible?

A: There is a blue car in the background
B: No
A: In any of the pictures?
B: No
A: Okay, thank you


(a) Screenshot of the PhotoBook task's display for one of the participants during the warming-up round.

(b) Example screenshot of the PhotoBook AMT feedback display.

Figure 8: Example screenshots for a participant’s display during the warming-up round and feedback screen.

B: Woman with hot dog
A: Older girl with glasses holding a hot dog?
B: sitting
A: Yeah

A: Do you have a picture with a lady in a fancy dress standing by a motorcycle?
B: no
B: wait
B: yes, in black?
A: Yes, it's a black dress with white trim.

A: Is there anything else?
B: Do you have the old lady in the white hat/blue pants reading?
A: Yes, I do.
B: Okay, that's all for me

In most cases, referring expressions agreed upon during the first rounds of a game are further refined and optimised while re-referring to the same target object in later rounds of the game. These refinements often manifest as an omission of detail while retaining the core features of the target object.

A: Do you have a boy with a teal coloured shirt with yellow holding a bear with a red shirt?
B: Yes
–
B: Boy with teal shirt and bear with red shirt?
A: Yes!
–
A: Teal shirt boy?
B: No

Collecting all utterances that refer to a specific target image during a given game creates its co-reference chain. Consider the following examples of first (F) and last (L) referring expressions from co-reference chains manually extracted from the PhotoBook dataset; a minimal sketch of the grouping step follows these examples:

F: Two girls near TV playing wii
   One in white shirt, one in grey

L: Girls in white and grey

F: A person that looks like a monk sitting on a bench
   He's wearing a blue and white ball cap

L: The monk

F: A white, yellow, and blue bus being towed by a blue tow truck

L: Yellow/white bus being towed by blue
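For illustration, the following is a minimal sketch of how such chains can be assembled once each utterance has been associated with a game, a round and a target image; the field names are hypothetical and do not reflect the released dataset's schema.

from collections import defaultdict

def build_chains(utterances):
    # utterances: iterable of dicts with "game_id", "target_image", "round", "text".
    chains = defaultdict(list)
    for utt in sorted(utterances, key=lambda u: u["round"]):
        chains[(utt["game_id"], utt["target_image"])].append(utt["text"])
    return chains

def first_last(chain):
    # First and last referring expression of a chain, e.g.
    # ("Two girls near TV playing wii", "Girls in white and grey").
    return chain[0], chain[-1]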

D Reference Chain Extraction

As explained in Section 6, instead of collecting co-reference chains through manual annotation, we use a heuristic to automatically extract reference chains of dialogue segments likely to contain referring expressions to a chain's target image. We consider participants' image labelling actions to signal that a previously discussed target image was identified as either common or different, thereby concluding the current dialogue segment. Due to the spontaneous and unrestricted nature of the PhotoBook dialogues, however, these labelling actions do not always indicate segment boundaries cleanly. To improve the quality of extracted dialogue segments and reference chains, we therefore developed a more context-sensitive heuristic to automate segmentation. The heuristic is implemented as a binary decision tree that uses labelling actions as well as any preceding and subsequent messages and additional labelling actions to better decide on segment boundaries and associated target images. It considers 32 combinations of eight different factors. The first case of the heuristic, for example, states that if

1. the current turn is a message,
2. the previous turn was an image labelling action,
3. the previous turn was by the other participant,
4. the next turn is an image selection action,
5. the next turn is by the current participant,
6. the next labelling action assigns a common label,
7. the other participant's previous labelling and the current participant's next labelling address the same target image, and
8. there is a non-empty, currently developing dialogue segment,

then we assume that after one speaker selected an image as common, the other speaker makes one utterance and marks the same image as common. This is resolved by saving the currently developing segment with the common image as referent and initialising a new segment with the trailing utterance of the second speaker. This prevents creating a segment consisting only of the trailing utterance (which cannot be a complete segment), which would be the naive decision if segmentation were based solely on labelling actions. Other cases include whether the next turn is a message by the other participant followed by them labelling a second image as different (likely to indicate that two images were discussed and the segment should be extended by the following message as well as the second target image), or whether none of the preceding and subsequent turns contains a labelling action (indicating an ongoing dialogue segment). A sketch of the first case is given below.
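The following sketch illustrates this first case, assuming a simplified turn representation with "type", "speaker", "label" and "image" fields; it illustrates the decision described above and is not the released extraction code.

def is_first_case(prev_turn, cur_turn, next_turn, current_segment):
    return (cur_turn["type"] == "message"                    # 1. current turn is a message
            and prev_turn["type"] == "label"                  # 2. previous turn labels an image
            and prev_turn["speaker"] != cur_turn["speaker"]   # 3. previous turn by the other participant
            and next_turn["type"] == "label"                  # 4. next turn is an image selection action
            and next_turn["speaker"] == cur_turn["speaker"]   # 5. next turn by the current participant
            and next_turn["label"] == "common"                # 6. next labelling assigns a common label
            and prev_turn["image"] == next_turn["image"]      # 7. both labellings address the same image
            and len(current_segment) > 0)                     # 8. a segment is currently developing

def resolve_first_case(cur_turn, next_turn, current_segment, chains):
    # Save the developing segment with the common image as its referent ...
    chains.setdefault(next_turn["image"], []).append(list(current_segment))
    # ... and start a new segment from the second speaker's trailing utterance.
    return [cur_turn]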

The following shows a typical example of an automatically extracted chain of dialogue segments associated with the image in Figure 9:

B: Hello
A: Hi
A: Do you have a woman with a black coat with buttons, glasses and a piece of pizza on table
B: no

Figure 9: Sample image MS COCO #449904.

A: Lady with black shirt, glasses with pizza on table?

B: yes
A: Table with orange bowl with lemons and liquor, cups?
B: no

A: Orange bowl with lemons, liquor?
B: lady pizza
A: No lady pizza
B: yes

B: woman and pizza
A: Empty kitchen wood coloured cabinets?
A: No woman pizza
B: no

About 72% of all segments are assigned to a single co-reference chain, 25% are automatically assigned to the co-reference chains of two different target images, and the remaining 3% to three or more chains.

E Reference Resolution Experiments

Data and Results  In addition to the results reported in Table 3 in Section 7, which concern the target images in the test set, here we report the scores for target images in the validation set (Table 6) and the scores for non-target images (Table 7). The latter constitute the large majority of candidate images, and thus results are substantially higher for this class.

Model        Precision   Recall   F1
NO HISTORY   56.37       75.91    64.70
HISTORY      56.32       78.10    65.45
NO IMAGE     34.61       62.49    44.55

Table 6: Results for target images in the validation set.


Model        Precision        Recall           F1
NO HISTORY   95.34 (95.37)    89.48 (89.42)    92.31 (92.30)
HISTORY      95.61 (95.76)    89.26 (89.10)    92.33 (92.31)
NO IMAGE     92.24 (92.10)    79.33 (78.74)    85.30 (84.90)

Table 7: Results for non-target images in the test set (and the validation set, in brackets).

Finally, Table 8 reports the overall number of reference chains in the dataset broken down by length, that is, by the number of dialogue segments they contain.

        Length (# segments) of the reference chains
Split    1      2      3      4      5      6     7
Train    1783   1340   3400   4736   1322   110   3
Val      398    295    754    1057   281    30    1
Test     400    296    754    1057   281    23    0

Table 8: Total number of reference chains per length (i.e., # segments in the chain) in each of the data splits.

Qualitative Analysis  Figures 10 and 11 show the distractor images for the examples provided in Figure 5 and discussed in Section 7.4.

Figure 10: Set of distractors for the target image and segment to be resolved on the left-hand side of Fig. 5.

Figure 11: Set of distractors for the target image and segment to be resolved on the right-hand side of Fig. 5.