


Quality Estimation for Image Captions Based on Large-scale Human Evaluations

Tomer Levinboim  Ashish Thapliyal  Piyush Sharma  Radu Soricut
Google Research

{tomerl,asht,piyushsharma,rsoricut}@google.com

Abstract

Automatic image captioning has improved significantly in the last few years, but the problem is far from being solved. Furthermore, while the standard automatic metrics, such as CIDEr and SPICE (Vedantam et al., 2015; Anderson et al., 2016), can be used for model selection, they cannot be used at inference time given a previously unseen image, since they require ground-truth references. In this paper, we focus on the related problem called Quality Estimation (QE) of image-captions. In contrast to automatic metrics, QE attempts to model caption quality without relying on ground-truth references. It can thus be applied as a second-pass model (after caption generation) to estimate the quality of captions even for previously unseen images. We conduct a large-scale human evaluation experiment, in which we collect a new dataset of more than 600k ratings of image-caption pairs. Using this dataset, we design and experiment with several QE modeling approaches and provide an analysis of their performance. Our results show that QE is feasible for image captioning.

1 Introduction

Image captioning technology produces automatic image descriptions in natural language (at sentence or paragraph level), with the goal of being consumed by end-users who may not be able to directly access images. This need arises either because the user has a permanent condition (accessibility for visually impaired people), or due to a temporary situation where the user cannot use the visual modality (such as limited bandwidth, or a smart voice-assistant). In any of these situations, exposing the end-users to a generated caption that is incorrect negatively impacts user trust, as it can have undesirable consequences on how they act next (e.g., how they comment on a social-media site based on their misguided understanding).

In this paper, we propose to mitigate this issue through Quality Estimation (QE) of image captions. That is, we propose to automatically compute a quality estimation score QE(image, caption) for a generated caption and use it as a means to control the quality of captions presented to the user. For example, by simply thresholding the score, high-scoring captions will be presented to the end-users, while low-scoring ones will not, thereby minimizing their potential negative effects. In contrast to automatic metrics for image captioning (Vedantam et al., 2015; Anderson et al., 2016), QE does not rely on ground-truth labels to arrive at a quality estimate, and can be applied as a second-pass model (after caption generation) to arbitrary images at serving time.
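As a rough illustration of this serving-time gating (the threshold value, function name, and example captions below are hypothetical, not taken from the paper):

```python
def should_show_caption(qe_score: float, threshold: float = 0.7) -> bool:
    """Gate a generated caption on its estimated quality.

    `qe_score` is assumed to be the output of a QE(image, caption) model
    in [0, 1]; `threshold` is a hypothetical operating point that would be
    tuned on held-out data for the desired precision/coverage trade-off.
    """
    return qe_score >= threshold

# Example: only captions whose QE score clears the threshold are served.
candidates = [("a dog playing in the park", 0.91),
              ("a general view of atmosphere", 0.32)]
served = [caption for caption, score in candidates if should_show_caption(score)]
```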

We emphasize three important aspects about the QE task. First, we define the problem of quality estimation as producing an estimate not from the generation model's view-point, but from a human evaluator's view-point. In other words, we are interested in learning a model QE(image, caption) for the function QE*(image, caption) reflecting some human judgment(s) of how well caption describes image. This is in a similar vein to the QE approach for the Machine Translation task (Specia et al., 2019), where the QE estimate learns the minimum distance (in terms of post-editing operations) to a human-authored translation reference, as defined by the HTER metric (Snover et al., 2006; Specia and Farzindar, 2010). The modeling of the human judgments on caption quality becomes possible as a result of a consistent need to perform human evaluations on image-captioning outputs that can reliably guide modeling decisions and model selection ("is image-captioning Model A better than Model B?"). This need itself arises from the struggle of current image-captioning automatic metrics, such as ROUGE (Lin and Och, 2004), METEOR (Banerjee and Lavie, 2005), CIDEr (Vedantam et al., 2015) and SPICE (Anderson et al., 2016), to correctly and robustly measure caption similarity, see (Kilickaya et al., 2017), or correctly identify modeling advances, see Sharma et al. (2018); Zhao et al. (2019). We have therefore commissioned human evaluations of captions generated by various image-captioning models over 16,000 unique Open Images Dataset (Kuznetsova et al., 2018) images, for a total of 55,000 unique 〈image, caption〉 pairs, over which we have collected close to 600,000 human judgments. We make this data publicly available [1], and provide extensive details in Section 3.

Second, from a model view-point, the problem of quality estimation modeling (commonly referred to as model confidence) for structured outputs is non-trivial compared to the similar problem for single-point prediction (e.g., classification). For instance, for a single atomic label used for image classification, a model-view quality estimate for predicting label is often simply the model conditional probability, P(label|image). In contrast, structured prediction offers several possible candidates for representing model confidence: an aggregate over the conditional probabilities at each decision point; the sequence of such label probabilities; the sequence of full conditional probability distributions at each decision point, etc.

Third, the complexity of structured outputs for a vision-and-language task like image captioning – as it relates natural images to natural-language–based descriptions – implies there is a potential to leverage measures of compatibility between an image and a description. This can be achieved using pretrained representations for both images and text, learned outside the caption-generation model's view-point. Put differently, this paper also explores improving QE by means of transfer learning from pretrained models.

In summary, our contributions are as follows:

1. We release a dataset of approximately 65k human-rated image-caption pairs (English only), obtained by collecting more than 600k human ratings in total.

2. By analyzing the collected ratings, we determine that they encode a stable signal of caption quality.

3. We present and analyze various QE model architectures, including both model-view estimation (confidence for structured prediction) and transfer-learning approaches.

[1] https://github.com/google-research-datasets/Image-Caption-Quality-Dataset

2 Related Work

There has been a lot of attention to the problem of Quality Estimation for structured prediction in the Machine Translation (MT) field, where this problem has been studied for almost a decade, from the early work based on feature engineering (Specia et al., 2009; Soricut and Echihabi, 2010), to more recent neural-network–based approaches (Kreutzer et al., 2015; Kim and Lee, 2016; Kim et al., 2017). The QE track at the WMT conference (Specia et al., 2019) has been running for several years, with multiple participants and notable improvements in model performance over the years. However, there are also significant differences in the formulation of the two problems, most notably the fact that the MT formulation is uni-modal (text-only alignment), and therefore it lends itself to feature engineering that exploits this aspect (Specia et al., 2013; Kreutzer et al., 2015; Martins et al., 2017; Wang et al., 2018). In contrast, QE for Image Captioning is a multi-modal problem (image-and-text alignment), and therefore better suited to approaches based primarily on deep feature representation and integration. In particular, we show in this paper that our approach to QE for image captioning benefits from transfer learning, as it relates to the ability to exploit both image feature representations and text representations trained for different, unrelated tasks.

Recently, Madhyastha et al. (2019) achieve further progress on evaluation metrics for image captioning. They propose VIFIDEL, a learned similarity function between the candidate description and object labels detected in the image. Their similarity function assigns higher weights to object labels that appear frequently in reference captions. Interestingly, VIFIDEL with no references at all correlates with human judgment almost as well as single-reference BLEU or ROUGE.

Besides the work on quality estimation modeling, the issue of effectively using quality estimators to improve the accessibility use-case for blind or visually impaired people (BVIPs) has been previously studied (MacLeod et al., 2017). The main question of their study is how to best inform the BVIP user about the uncertainty around the generated captions, experimenting with framing the captions using phrases like "I'm not really sure but I think it's $CAPTION" or "I'm 98% sure that's $CAPTION". The findings are relevant in that BVIPs have difficulty calibrating themselves into trusting or distrusting $CAPTION, mostly because there is no alternative form of reference for the image content. Therefore, if the caption provided to them (even accompanied by "I'm not really sure but ...") is in dissonance with the rest of the context (as it may be available in text form, e.g., as part of a tweet thread as in the above-cited study), they tend to resolve this dissonance not by believing that the caption is wrong, but by constructing scenarios or explanations that would somehow connect the two sources of information.

3 A Caption-Quality Dataset

A key contribution of this paper is the Caption-Quality dataset, a collection of human judgments on the quality of machine-generated captions over a large set of images. Below, we describe the models used for caption generation and the rating collection process with which we collect approximately 600,000 ratings from unpaid internet users. We further provide an analysis of the ratings, which shows that they contain a consistent signal about the quality of the captions.

3.1 Caption Generation Models

To generate a diverse set of captions, we train several variants of a Transformer-based (Vaswani et al., 2017) image-captioning model on the Conceptual Captions dataset (Sharma et al., 2018), which consists of 3.3M training and ~15,000 validation image-caption pairs. As the authors report, captions generated by models trained on Conceptual Captions are strongly favored over those generated by COCO-trained models (Lin et al., 2014).

All of the models are trained to minimize the ground-truth caption perplexity; however, they differ on three important aspects (which contributes to caption diversity): the image feature extraction model, the number of object detection results they use, and the caption decoding procedure. We discuss these differences below; for further details, see the model descriptions by Sharma et al. (2018); Changpinyo et al. (2019).

Set     Samples   Unique Images   Unique Captions   Unique Models
Train   58,354    11,027          34,532            11
Dev      2,392       654           1,832             4
Test     4,592     1,237           3,359             4

Table 1: The Caption-Quality dataset statistics.

Global Image Representation  Our captioning models use one of the following pretrained image encoders: (1) the Inception-ResNet-v2 model (Szegedy et al., 2016); (2) the Picturebook image encoder (Kiros et al., 2018); or (3) the Graph-RISE model (Juan et al., 2019), a ResNet-101 model (He et al., 2016) trained for an image classification task at ultra-fine granularity levels.

Object Representations  The identification of objects in an image is done using a Faster R-CNN model, training it to predict both 1,600 object and 400 attribute labels in Visual Genome (Krishna et al., 2017), following the standard setting from Anderson et al. (2018). In terms of featurization for the identified bounding boxes, we use a ResNet-101 model that can be pre-trained on ImageNet (Russakovsky et al., 2015) or pre-trained using the Graph-RISE model (Juan et al., 2019).

Object Labels  In addition to object-level representations, we detect object labels over the entire image, using a ResNet object-detection classifier trained on the JFT dataset (Hinton et al., 2015). The classifier produces a list of detected object-label identifiers, sorted in decreasing order by the classifier's confidence score. These identifiers are then mapped to embeddings oj using an object-label embedding layer which is pre-trained to predict label co-occurrences in web documents using a word2vec approach (Mikolov et al., 2013).

Decoding  We use either greedy decoding or beam search with beam width 5.

3.2 Human Evaluation Setup

Our goal with this evaluation is to efficiently obtain caption quality judgments on a large scale. To do that, we leverage Google's crowdsourcing platform [2], on which we present (image, caption) pairs and ask raters a simple binary question: "Is this a good caption for the image?". Raters could then select YES/NO, or skip to the next sample (SKIP) (see Figure 1). We collect 10 ratings per image/caption pair, some of which could be SKIP.

Note that we intentionally do not provide an interpretation of the question prompt, such as an explicit definition of a good caption. In fact, we use the interpretation variability to provide a signal for caption quality: a caption that gets 9/10 YES answers is likely better than one that gets 5/10 YES, which in turn is better than one that gets 0/10.

[2] http://crowdsource.google.com

Figure 1: Our caption evaluation interface. Raters indicate whether the caption is good/bad, or they can skip.

For images, we use the Open Images Dataset (OID) (Kuznetsova et al., 2018). We first randomly subsample 16,000 images and then, for legal and privacy concerns, filter out those which contain faces [3]. We then generate captions over the remaining images using the various image-captioning modeling options described above.

The human-provided results are processed further by: (1) filtering out (image, caption) entries that received more than 2 SKIP ratings, and (2) averaging the 8 to 10 ratings r_i for each of the remaining (image, caption) pairs, and rounding to the closest score y in {0, 1/8, ..., 7/8, 1}, using the equation y = round(mean(r_i) * 8) / 8. The resulting dataset, which we call the Caption-Quality v1.0 dataset, is then split into three image-disjoint subsets, used for train, dev and test purposes in the experiments presented in Section 5.1.
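A minimal sketch of this aggregation step, assuming SKIP-heavy entries have already been filtered out (the function name and example ratings are illustrative):

```python
def aggregate_ratings(ratings):
    """Average 8-10 binary ratings (1 = YES, 0 = NO) and round to the nearest 1/8.

    Mirrors y = round(mean(r_i) * 8) / 8 from the paper; entries with more
    than 2 SKIPs are assumed to have been dropped before this step.
    """
    mean = sum(ratings) / len(ratings)
    return round(mean * 8) / 8

# Example: 7 YES and 3 NO votes -> mean 0.7 -> rounded to 6/8 = 0.75.
print(aggregate_ratings([1, 1, 1, 1, 1, 1, 1, 0, 0, 0]))  # 0.75
```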

We provide statistics for these subsets in Table 1, as well as human rating histograms in Figure 2. Table 2 further shows a small sample of rated image captions taken from the dev set.

3.2.1 Ratings Stability

As described above, in our human evaluations the interpretation of what a "good caption" means is left up to the raters. To verify the stability of this underspecified interpretation, we study the degree of agreement between different sets of 10 raters on the resulting averaged ratings y. To that end, we exploit the fact that similarly trained image-captioning models often generate identical captions for the same image. For instance, two of our models, evaluated 4 weeks apart, generated identical captions for 509 images. Analyzing the difference [4] of scores (y1 - y2) over these image/caption pairs gives a mean of 0.015 and a standard deviation of 0.212. Figure 3 provides a histogram of score differences (y1 - y2), and clearly shows a concentration of the difference about 0. Repeating this analysis across other model pairs reveals similar statistics.

[3] Detected using the Google Cloud Vision API, https://cloud.google.com/vision/

Figure 2: A histogram of the dev, test and train ratings. The train set values were divided by 10 for visualization purposes.

In conclusion, even though the question description is underspecified, a statistical analysis shows that collecting 10 such ratings yields repeatable results, with well-concentrated sample-level human scores.

[4] The evaluation platform is set up such that it is almost guaranteed that the ratings were provided by different subsets of raters.

Figure 3: The same set of 509 captions were evaluated twice, by different sets of 10 raters and 4 weeks apart. The figure shows a histogram of average human score differences (y1 - y2) ∈ [-1, 1], with scores y1 collected in the 1st evaluation and scores y2 in the 2nd. 85% of rating pairs are within a 0.25 distance, indicating stability in the evaluation setup.
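A small sketch of this kind of stability check over paired average ratings; the data here is synthetic, standing in for the paper's 509 repeated image/caption pairs:

```python
import numpy as np

def stability_stats(y1, y2, tol=0.25):
    """Compare two rounds of averaged ratings for the same captions.

    Returns the mean and standard deviation of the differences, plus the
    fraction of pairs whose scores differ by at most `tol` (0.25 in the
    Figure 3 analysis).
    """
    diff = np.asarray(y1) - np.asarray(y2)
    return diff.mean(), diff.std(), float(np.mean(np.abs(diff) <= tol))

# Synthetic example: two evaluation rounds over the same 509 captions.
rng = np.random.default_rng(0)
y1 = np.round(rng.uniform(0, 1, 509) * 8) / 8
y2 = np.round(np.clip(y1 + rng.normal(0, 0.1, 509), 0, 1) * 8) / 8
print(stability_stats(y1, y2))
```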


Generated caption                                           Averaged human rating

a general view of atmosphere .                              0.375
this is a picture of a yacht .                              0.75
the yacht is a great place to take a rest .                 0.875

silhouette of a woman meditating on the beach at sunset     0.625
silhouette of a man watching sunset                         1.0

person and her husband take a walk .                        0.125
people walking along the beach                              0.25
people walking along the beach                              0.5

plants for sale at the local market                         0.625
a selection of plants in the flower market                  0.875
flowers for sale at the market                              1.0

the team at the opening .                                   0.375
the cast performs on stage .                                0.5
the cast of musical film                                    0.75

cat in the grass with a dog                                 0.25
a tiger in the grass                                        0.25
cat lying on the grass                                      0.75

a young girl with a book and a cat on the table .           0.0
a photo of my daughter 's room .                            0.125
the table in the room                                       0.75

a police car in the middle of the road                      0.125
automobile model in the rain .                              0.5
vehicles drive through a flooded street                     0.875

Table 2: Samples from the Caption-Quality dataset dev set. Each image is shown with up to 3 automatically generated captions and their corresponding average human ratings. Note that repeating captions were likely rated by a different set of raters and tend to have similar scores (see Figure 3). (The image thumbnails are not reproduced here; each group of rows corresponds to one image.)

4 Models

4.1 Confidence-based Features QE Model

We first look at the quality estimation problem from a caption generation model's view-point (a.k.a., confidence modeling). The basic idea is to integrate an image-captioning model M as a submodule within the QE model, and expose the uncertainty behind all its decoding decisions as features for quality estimation. An instance of this approach, in the context of quality estimation for machine translation, is the predictor-estimator framework (Kim and Lee, 2016; Kim et al., 2017).

Specifically, a single scalar confidence p(c|input) is insufficient to capture the uncertainty behind a structured-output prediction c = (c1, c2, ..., cn). Instead, one can use the full sequence L = [l1, l2, ..., ln] of unnormalized conditional probability distributions (or rather, the logits) as provided by a caption generation model M, where li ∈ R^|V| denotes the logit values at decoding step i, and |V| denotes the size of the model vocabulary. Note that this sequence of logits is force-decoded against the sequence c, in the sense that each li is generated by M given the prefix c1, ..., ci-1.
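A minimal sketch of this force-decoding step; the `decode_step` interface below is a hypothetical stand-in for the captioning model M, not an API described in the paper:

```python
import torch

def force_decode_logits(model, image_features, caption_ids):
    """Collect the per-step logits l_1..l_n that M assigns to a given caption.

    At step i the model is fed the gold prefix c_1..c_{i-1} (teacher forcing),
    and we record its unnormalized distribution over the vocabulary.
    `model.decode_step(image_features, prefix)` is assumed to return a
    tensor of shape [|V|] holding the logits for the next token.
    """
    logits = []
    for i in range(len(caption_ids)):
        prefix = caption_ids[:i]                      # c_1 .. c_{i-1}
        logits.append(model.decode_step(image_features, prefix))
    return torch.stack(logits)                        # shape [n, |V|]
```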


Figure 4: Confidence-based features QE model: Using a pre-trained image-captioning submodule M, the QE model first extracts confidence-based features such as the sequence of logits li and output log probabilities log pi. These sequences are encoded into a fixed-length vector using a stack of LSTMs. This vector, along with other sequence properties such as the sum of output log probabilities and the length of the output sequence, is fed into a dense layer with sigmoid activation to produce a QE score. (Best viewed in color.)

We apply a stack-LSTM (Dyer et al., 2015) onto L and concatenate its final states and outputs to produce a fixed-size vector. This fixed-size vector is then passed to a dense layer with a sigmoid activation unit, to produce the final quality-estimation prediction y ∈ [0, 1]. This model architecture is presented in Figure 4.
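A minimal PyTorch sketch of this confidence-based head, assuming the logits have already been force-decoded as above; layer sizes are illustrative (the paper tunes the LSTM dimension and depth, see Section 5), and the additional scalar features (sumlogp, seqlen) would be concatenated alongside the LSTM summary before the dense layer:

```python
import torch
import torch.nn as nn

class ConfidenceQEHead(nn.Module):
    """Encode a sequence of per-step logits with a stacked LSTM and map the
    resulting fixed-size vector to a quality score in [0, 1]."""

    def __init__(self, vocab_size, hidden_size=256, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(vocab_size, hidden_size,
                            num_layers=num_layers, batch_first=True)
        self.out = nn.Linear(num_layers * hidden_size + hidden_size, 1)

    def forward(self, logits):                 # logits: [batch, n, |V|]
        outputs, (h_n, _) = self.lstm(logits)  # h_n: [layers, batch, hidden]
        final_states = h_n.permute(1, 0, 2).reshape(logits.size(0), -1)
        last_output = outputs[:, -1, :]        # output at the last time step
        features = torch.cat([final_states, last_output], dim=-1)
        return torch.sigmoid(self.out(features)).squeeze(-1)
```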

In practice, we set M to our best image-captioning model (as determined by human evaluation), which uses Graph-RISE image features and 13 object-label features. We then experiment with four types of model-confidence features:

• logits - the features L, as described above.
• seqlogp - the sequence of values log p1, log p2, ..., log pn, where pi = (SoftMax(li))_{ci} provides the model estimate for P(ci|image, c1, c2, ..., ci-1).
• sumlogp - Σ_{i=1}^{n} log pi, which is M's scalar confidence for the sample.
• seqlen - the number of tokens in caption c.

In Section 5, we report results corresponding to the best single feature and feature combination.
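A short sketch of how the last three features could be derived from the force-decoded logits; this is an illustration under the definitions above, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def confidence_features(logits, caption_ids):
    """Derive seqlogp, sumlogp and seqlen from force-decoded logits.

    logits:       tensor [n, |V|], one row of logits per decoding step.
    caption_ids:  list of n token ids c_1..c_n of the generated caption.
    """
    log_probs = F.log_softmax(logits, dim=-1)          # normalize each step
    ids = torch.tensor(caption_ids)
    seqlogp = log_probs[torch.arange(len(ids)), ids]   # log p_i of each token
    return {
        "logits": logits,
        "seqlogp": seqlogp,
        "sumlogp": seqlogp.sum(),                      # scalar model confidence
        "seqlen": len(caption_ids),
    }
```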

4.2 A Bilinear QE Model using Generation-independent Features

The second class of models we describe here is based on features that are independent of the image-captioning model. Here, in order to arrive at a model estimate QE(image, caption), we use features that (a) directly represent the input image and encode our ability to understand what the image is about, and (b) directly represent the input caption and encode our ability to understand what the caption is about. The goal of the model is to learn to transform these representations into an estimate that reflects how well caption represents the (salient aspects of the) image content.

Specifically, one likely shortcoming of the confidence-feature based QE model lies in its weak power to represent the caption. This weakness originates from the generation model itself: being an auto-regressive decoder, it only has access to its prefix as context. However, the Generation-independent Bilinear QE model is free to exploit text encoders that consider the entire caption as a whole and, moreover, offer an opportunity for transfer learning using pretrained sentence encoders that are trained for different language understanding tasks.

4.2.1 Pretrained Input Features

Below, we describe the Generation-independent Bilinear QE model's image and text input features.

Global Image Embedding  For a global representation of the image, we used the latest Graph-RISE model version (Juan et al., 2019), which provides the ability to achieve transfer learning for our task with respect to image representation. This model produces a compact image embedding i of dimension Di = 64.

Object Labels Embeddings  Objects present in the image (e.g. "woman", "flag", "laptop") can help assess the correctness and informativeness of a candidate caption, where the intuition is that the caption should probably mention the more salient objects. We use the object label model mentioned in Section 3.1, whose resulting sequence of embeddings is denoted O = (o1, ..., o|O|), where each oj has dimension Do = 256.

Figure 5: The Generation-independent Bilinear QE model: The model uses pre-trained image, caption and object-label embeddings. Each embedding pair is ultimately combined into a scalar using one of three bilinear layers, all of which are fed to a dense layer (DNN) with sigmoid activation to produce the QE score. (Best viewed in color.)

Caption Universal Sentence Embedding  The caption is embedded using a pretrained version of the Universal Sentence Encoder (USE) (Cer et al., 2018) into a Ds = 512 dimensional vector s. The USE model itself was trained on vast amounts of English sources (Wikipedia, web news, discussion forums, etc.), and fine-tuned using supervised labels from the SNLI corpus (Bowman et al., 2015). Previous findings (Conneau et al., 2017) report an improvement in transfer-learning performance as a result of this setup.

Given these features, the Bilinear QE model processes each individual feature using a dense layer with a leaky-ReLU activation (Xu et al., 2015), and combines each of the resulting vector pairs using bilinear layers (see below). All bilinear outputs are concatenated and fed to a dense layer with sigmoid activation, to produce the quality estimation y. This model is illustrated in Figure 5.

4.2.2 Bilinear Layers

A bilinear layer models the inner product of its two inputs after applying a linear transformation to the first. This layer is defined as:

b(x, y; B) = x^T B y    (1)

where x ∈ R^{Dx} and y ∈ R^{Dy} are input features, and B ∈ R^{Dx×Dy} is the learned parameter matrix. Linear and bias terms can be added by appending a constant 1 to both x and y.

Our bilinear QE model uses three such parameter matrices:

1. Bo,i ∈ R^{Do×Di}, applied to each of the object-label embeddings [o1, ..., o|O|] and the image embedding i, capturing a fit between the image and the detected objects.

2. Bo,s ∈ R^{Do×Ds}, applied to each of the object-label embeddings [o1, ..., o|O|] and the sentence embedding s, capturing a fit between the detected objects and the caption.

3. Bi,s ∈ R^{Di×Ds}, for the image embedding i and sentence embedding s, capturing a fit between the entire image and the caption.

These 2×|O|+1 bilinear outputs are concatenated and fed to a dense layer with a sigmoid activation to produce a quality estimation score y.
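A compact PyTorch sketch of these three bilinear combinations; the dimensions follow the paper, but the module is illustrative and omits the per-feature dense + leaky-ReLU layers described above:

```python
import torch
import torch.nn as nn

class BilinearQE(nn.Module):
    """Combine image, sentence and object-label embeddings with three
    bilinear forms b(x, y; B) = x^T B y and map the 2*|O|+1 scores to [0, 1]."""

    def __init__(self, d_i=64, d_s=512, d_o=256, num_objects=20):
        super().__init__()
        self.B_oi = nn.Bilinear(d_o, d_i, 1)   # objects vs. image
        self.B_os = nn.Bilinear(d_o, d_s, 1)   # objects vs. caption
        self.B_is = nn.Bilinear(d_i, d_s, 1)   # image vs. caption
        self.out = nn.Linear(2 * num_objects + 1, 1)

    def forward(self, i, s, objects):          # i:[B,64], s:[B,512], objects:[B,O,256]
        batch, num_obj, d_o = objects.shape
        i_rep = i.unsqueeze(1).expand(-1, num_obj, -1).reshape(-1, i.size(-1))
        s_rep = s.unsqueeze(1).expand(-1, num_obj, -1).reshape(-1, s.size(-1))
        flat_obj = objects.reshape(-1, d_o)
        oi = self.B_oi(flat_obj, i_rep).view(batch, num_obj)   # |O| scalars
        os_ = self.B_os(flat_obj, s_rep).view(batch, num_obj)  # |O| scalars
        is_ = self.B_is(i, s)                                  # 1 scalar
        scores = torch.cat([oi, os_, is_], dim=-1)             # 2*|O| + 1
        return torch.sigmoid(self.out(scores)).squeeze(-1)
```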

4.3 A Combined Model

We combine the model-confidence features and bilinear model-independent features together into a single model by simply concatenating the outputs of the LSTM and bilinear layers. The combined outputs are fed to a dense layer with a sigmoid activation to produce the combined model's quality estimation score y.

5 Experimental Results

The quality-estimation models are trained on the Caption-Quality training set (Section 3). We use Mean Squared Error (MSE = Σ_{j=1}^{B} (1/N)(ŷ_j - y_j)^2) as the loss function, where ŷ_j are the predicted scores and y_j the ground-truth human scores. For optimization, we use Adam (Kingma and Ba, 2015) with batch size B = 256 and tune the learning rate lr ∈ {1e-4, 1e-5, 1e-6}. The dropout rate is set to 0.2, and applied to the inputs of all trainable layers. The pre-trained models are frozen during optimization: the image-captioning model M, the image encoder, the USE caption encoder, and the object-label encoder. The LSTM dimension×layers is tuned over {128, 256, 1024, 2048, 4096}×{1, 2, 3}. The Generation-independent Bilinear QE model is tuned over {0, 5, 10, 20} object-labels.

Model category     QE input features                     LSTM d x layers   lr     Spearman ρS_dev   Spearman ρS_test
Confidence-based   logits                                2048x1            1e-6   0.470             0.453
                   +seqlogp +sumlogp +seqlen             2048x3            1e-6   0.494             0.467
Bilinear           Image + USE embedding                 N/A               1e-5   0.486             0.471
                   +20 object-labels                     N/A               1e-5   0.500             0.472
Combined           All bilinear model inputs + seqlogp   128x3             1e-5   0.539             0.519
                   +logits +sumlogp +seqlen              128x3             1e-4   0.516             0.486

Table 3: Spearman's ρS results for each model category. The best confidence-based and bilinear models achieve similar scores. Combining the two approaches improves the Spearman scores further.

Model selection is done by picking the checkpoint that maximizes Spearman's correlation ρS(ŷ, y) over the dev set. This selection criterion better matches the ultimate use of the QE model: at inference time, only images whose QE scores pass some threshold will be served. Since the threshold can be tuned, the absolute value of the predicted scores ŷ is not as important as obtaining a monotonic relationship between the predicted and ground-truth scores. (We also note that, unfortunately, using ρS as a loss function is not feasible due to non-differentiability.)
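A short sketch of this dev-set selection criterion using SciPy; the checkpoint names and scores are illustrative:

```python
from scipy.stats import spearmanr

def select_checkpoint(dev_scores_by_ckpt, human_scores):
    """Pick the checkpoint whose dev-set predictions have the highest
    Spearman correlation with the averaged human ratings."""
    best_ckpt, best_rho = None, float("-inf")
    for ckpt, predicted in dev_scores_by_ckpt.items():
        rho, _ = spearmanr(predicted, human_scores)
        if rho > best_rho:
            best_ckpt, best_rho = ckpt, rho
    return best_ckpt, best_rho

# Example with two hypothetical checkpoints scored on the same dev captions.
human = [0.875, 0.25, 0.75, 0.125, 1.0]
ckpts = {"step-10k": [0.8, 0.3, 0.6, 0.2, 0.9],
         "step-20k": [0.5, 0.6, 0.4, 0.7, 0.3]}
print(select_checkpoint(ckpts, human))  # step-10k ranks the captions better
```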

5.1 Spearman’s ρ Analysis

We present in Table 3 our dev and test Spearman results, based on selecting the best-performing model configurations over the dev set. Generally, the best confidence-based model and the best Generation-independent Bilinear QE model attain similar results on the dev set, and show similar generalization capabilities on the test set. We attribute the success of the confidence-based model to the fact that the image-captioning sub-component was pretrained for the captioning task, and as such, the confidence features it produces capture well the uncertainty in the evaluated captions. On the other hand, we attribute the success of the bilinear model to transfer learning, as it relies on external representation power for the image and the caption features, both of which were obtained by pre-training over huge datasets (about two orders of magnitude larger compared to Conceptual Captions).

Among the confidence-based models, the one using all the features (logits+seqlogp+sumlogp+seqlen) outperforms the logits-only model. Possibly, the seqlogp+sumlogp+seqlen features provide a less noisy signal for optimization, as they are essentially low-dimensional summaries of the high-dimensional logits feature.

Among the bilinear models, although adding 20 object-label embeddings helped improve the dev set score (ρS_dev = 0.5), it did not translate to substantial gains on the test set.

Finally, as one may expect, the combined model leads to further gains on both the dev and test sets, increasing the Spearman scores to ρS_dev = 0.54 and ρS_test = 0.52. We attribute these gains to the fact that the two models are exposed to complementary features.

6 Conclusions

Human judgments regarding image caption quality are necessary to ensure correct decisions regarding modeling improvements. At the same time, discarding them after the correct decision is reached seems wasteful. In this work, we show that there exists a fruitful avenue for making further use of these judgments: as supervised labels for caption quality estimation modeling.

Moreover, our fast-and-simple human evaluations, done using unpaid crowdsourcing, were shown to contain enough signal to serve as supervision for learning QE models. We make this large-scale dataset of human judgments available to encourage further research in this area, and also provide a framework under which we discuss various QE models and evaluate their capabilities.

References

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic propositional image caption evaluation. In ECCV.
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of CVPR.
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference.
Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 169–174, Brussels, Belgium. Association for Computational Linguistics.
Soravit Changpinyo, Bo Pang, Piyush Sharma, and Radu Soricut. 2019. Decoupled box proposal and featurization with ultrafine-grained semantic labels improve image captioning and visual question answering. In EMNLP-IJCNLP.
Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data.
Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 334–343, Beijing, China. Association for Computational Linguistics.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of CVPR.
Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop.
Da-Cheng Juan, Chun-Ta Lu, Zhen Li, Futang Peng, Aleksei Timofeev, Yi-Ting Chen, Yaxi Gao, Tom Duerig, Andrew Tomkins, and Sujith Ravi. 2019. Graph-RISE: Graph-regularized image semantic embedding. CoRR, abs/1902.10814.
Mert Kilickaya, Aykut Erdem, Nazli Ikizler, and Erkut Erdem. 2017. Re-evaluating automatic metrics for image captioning. In Proceedings of the 15th Conference of the European Association for Computational Linguistics, pages 199–209.
Hyun Kim, Hun-Young Jung, Hongseok Kwon, Jong-Hyeok Lee, and Seung-Hoon Na. 2017. Predictor-estimator: Neural quality estimation based on target word prediction for machine translation. ACM Trans. Asian Low-Resour. Lang. Inf. Process., 17(1):3:1–3:22.
Hyun Kim and Jong-Hyeok Lee. 2016. A recurrent neural networks approach for estimating the quality of machine translation output. In Proceedings of the North American Chapter of the Association for Computational Linguistics, pages 494–498.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. ICLR.
Jamie Kiros, William Chan, and Geoffrey Hinton. 2018. Illustrative language understanding: Large-scale visual grounding with image search. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 922–933, Melbourne, Australia. Association for Computational Linguistics.
Julia Kreutzer, Shigehiko Schamoni, and Stefan Riezler. 2015. Quality estimation from scratch (QUETCH): Deep learning for word-level translation quality estimation. In Proceedings of the 10th Workshop on Statistical Machine Translation (WMT).
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael Bernstein, and Li Fei-Fei. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 123(1):32–73.
Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper R. R. Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, and Vittorio Ferrari. 2018. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. CoRR, abs/1811.00982.
Chin-Yew Lin and Franz Josef Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of ACL.
Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Proceedings of ECCV.
Haley MacLeod, Cynthia L. Bennett, Meredith Ringel Morris, and Edward Cutrell. 2017. Understanding blind people's experiences with computer-generated captions of social media images. In CHI.


Pranava Madhyastha, Josiah Wang, and Lucia Specia. 2019. VIFIDEL: Evaluating the visual fidelity of image descriptions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6539–6550, Florence, Italy. Association for Computational Linguistics.
Andre F. T. Martins, Marcin Junczys-Dowmunt, Fabio Kepler, Ramon Fernandez Astudillo, Chris Hokamp, and Roman Grundkiewicz. 2017. Pushing the limits of translation quality estimation. Transactions of the Association for Computational Linguistics, 5:205–218.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of NeurIPS.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252.
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of ACL.
Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of AMTA.
Radu Soricut and Abdessamad Echihabi. 2010. TrustRank: Inducing trust in automatic translations via ranking. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 612–621, Stroudsburg, PA, USA. Association for Computational Linguistics.
Lucia Specia, Frederic Blain, Varvara Logacheva, Ramon Astudillo, and Andre Martins. 2019. Findings of the WMT 2018 shared task on quality estimation. In Proceedings of the Third Conference on Machine Translation (WMT), Volume 2: Shared Task Papers.
Lucia Specia, Nicola Cancedda, Marc Dymetman, Marco Turchi, and Nello Cristianini. 2009. Estimating the sentence-level quality of machine translation systems. Pages 28–37.
Lucia Specia and Atefeh Farzindar. 2010. Estimating machine translation post-editing effort with HTER.
Lucia Specia, Kashif Shah, Jose Guilherme Camargo de Souza, and Trevor Cohn. 2013. QuEst - A translation quality estimation framework. In Proceedings of the 51st Conference of the Association for Computational Linguistics (ACL), Demo Session, Sofia, Bulgaria. Association for Computational Linguistics.
Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke. 2016. Inception-v4, Inception-ResNet and the impact of residual connections on learning. CoRR, abs/1602.07261.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NeurIPS.
Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of CVPR.
Jiayi Wang, Kai Fan, Bo Li, Fengming Zhou, Boxing Chen, Yangbin Shi, and Luo Si. 2018. Alibaba submission for WMT18 quality estimation task. In Proceedings of the Third Conference on Machine Translation (WMT).
Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. 2015. Empirical evaluation of rectified activations in convolutional network. Deep Learning Workshop, ICML.
Sanqiang Zhao, Piyush Sharma, Tomer Levinboim, and Radu Soricut. 2019. Informative image captioning with external sources of information. In ACL.