
Analysis of diversity-accuracy tradeoff in image captioning

Ruotian Luo
TTI-Chicago
[email protected]

Gregory Shakhnarovich
TTI-Chicago
[email protected]

    Abstract

We investigate the effect of different model architectures, training objectives, hyperparameter settings and decoding procedures on the diversity of automatically generated image captions. Our results show that 1) simple decoding by naive sampling, coupled with low temperature, is a competitive and fast method to produce diverse and accurate caption sets; 2) training with CIDEr-based reward using Reinforcement Learning harms the diversity properties of the resulting generator, which cannot be mitigated by manipulating decoding parameters. In addition, we propose a new metric, AllSPICE, for evaluating both accuracy and diversity of a set of captions by a single value.

    1 Introduction

People can produce a diverse yet accurate set of captions for a given image. Sources of diversity include syntactic variations and paraphrasing, focus on different components of the visual scene, and focus on different levels of abstraction, e.g., describing scene composition vs. settings vs. more abstract, non-grounded circumstances, observed or imagined. In this paper we consider the goal of endowing automatic image caption generators with the ability to produce diverse caption sets.

Why is caption diversity important, beyond the fact that "people can do it" (evident in existing data sets with multiple captions per image, such as COCO)? One caption may not be sufficient to describe the whole image, and there is more than one way to describe an image. Producing multiple distinct descriptions is required in tasks like image paragraph generation and dense captioning. Additionally, access to a diverse and accurate caption set may enable better production of a single caption, e.g., when combined with a re-ranking method (Collins and Koo, 2005; Shen and Joshi, 2003; Luo and Shakhnarovich, 2017; Holtzman et al., 2018; Park et al., 2019).

How is diversity achieved in automated captioning? Latent variable models like the Conditional GAN (Shetty et al., 2017; Dai et al., 2017), the Conditional VAE (Wang et al., 2017; Jain et al., 2017) and mixtures of experts (Wang et al., 2016; Shen et al., 2019) generate a set of captions by sampling the latent variable. However, these works did not carefully compare against different sampling methods. For example, the role of sampling temperature in controlling the diversity/accuracy tradeoff was overlooked.

Some efforts have been made to explore different sampling methods that generate diverse outputs, such as diverse beam search (Vijayakumar et al., 2016), Top-K sampling (Fan et al., 2018; Radford et al., 2019), nucleus sampling (Holtzman et al., 2019), and others (Batra et al., 2012; Gimpel et al., 2013; Li and Jurafsky, 2016; Li et al., 2015). We study these methods for captioning.

Wang and Chan (2019) found that existing models exhibit a trade-off between diversity and accuracy. By training with a weighted combination of CIDEr reward and cross-entropy loss, they obtain models at different points of the diversity-accuracy trade-off by varying the weight factor.

In this paper, we take SOTA captioning models, known to achieve good accuracy, and study the diversity-accuracy trade-off under different sampling methods and hyperparameters while fixing the model parameters. In particular, we examine whether a CIDEr-optimized model (high accuracy, low diversity) can reach a better trade-off than a cross-entropy-trained model under different sampling settings.

How is diversity measured? Common metrics for captioning tasks like BLEU, CIDEr, etc., are aimed at measuring similarity between a pair of single captions, and cannot evaluate diversity. On the other hand, some metrics have been proposed to capture diversity, but not accuracy. These include


mBLEU, Distinct unigram, Distinct bigram, and Self-CIDEr (Wang and Chan, 2019). We propose a new metric, AllSPICE, an extension of SPICE able to measure diversity and accuracy of a caption set w.r.t. ground truth captions at the same time. Note that diversity evaluated in AllSPICE is defined as the ability to generate different captions for the same image, similar to Self-CIDEr, as opposed to vocabulary size or word recall (van Miltenburg et al., 2018), where diversity across the test image set is considered.

In addition to the new metric, our primary contribution is the systematic evaluation of the role that different choices – training objectives; hyperparameter values; sampling/decoding procedure – play in the resulting tradeoff between accuracy and the diversity of generated caption sets. This allows us to identify some simple but effective approaches to producing diverse and accurate captions.

    2 Methods for diverse captioning

In most of our experiments we work with the attention-based LSTM model (Rennie et al., 2017) with hidden layer size 512. We consider alternative models (without attention; larger layer size; and a transformer-based model) in the Appendix and in some experiments.

Learning objectives we consider:
XE: standard cross-entropy loss.
RL: pretrained with XE, then finetuned on CIDEr reward using REINFORCE (Williams, 1992; Rennie et al., 2017).
XE+RL: pretrained with XE, then finetuned on a convex combination of XE and CIDEr reward.
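To make the RL and XE+RL objectives concrete, the following PyTorch sketch shows a policy-gradient (REINFORCE) loss with a reward baseline; the function names, tensor shapes, and the particular baseline are illustrative assumptions rather than the paper's released implementation (which uses the baseline of Mnih and Rezende, 2016).

```python
import torch

def rl_loss(sample_log_probs, sample_cider, baseline_cider):
    """REINFORCE loss for CIDEr finetuning -- a minimal sketch.

    sample_log_probs: (batch,) summed log-probabilities of each sampled caption
    sample_cider:     (batch,) CIDEr reward of each sampled caption
    baseline_cider:   (batch,) baseline reward used to reduce gradient variance
    """
    advantage = (sample_cider - baseline_cider).detach()
    # maximizing expected reward == minimizing the negative advantage-weighted log-likelihood
    return -(advantage * sample_log_probs).mean()

def xe_rl_loss(xe_loss, reinforce_loss, gamma):
    """XE+RL: convex combination of the cross-entropy loss and the reward term."""
    return (1.0 - gamma) * xe_loss + gamma * reinforce_loss
```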

Next, we consider different decoding (production) methods for trained generators.
SP: Naive sampling, with temperature T,

\[ p(w_t \mid I, w_{1,\dots,t-1}) = \frac{\exp\big(s(w_t \mid I, w_{1,\dots,t-1})/T\big)}{\sum_{w}\exp\big(s(w \mid I, w_{1,\dots,t-1})/T\big)} \]

where s is the score (logit) of the model for the next word given image I. We are not aware of any discussion of this approach to diverse captioning in the literature.
Top-K sampling (Fan et al., 2018; Radford et al., 2019): limiting sampling to the top K words.
Top-p (nucleus) sampling (Holtzman et al., 2019): limiting sampling to a set with high probability mass p.
BS: Beam search, producing an m-best list.

DBS: Diverse beam search (Vijayakumar et al., 2016) with diversity parameter λ.
Each of these methods can be used with a temperature T affecting the word posterior, as well as its "native" hyperparameters (K, p, m, λ).
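The sampling-based decoders above differ only in how they transform the per-step logits before drawing a word. The following sketch illustrates temperature scaling, Top-K, and Top-p filtering in PyTorch; it assumes a model that exposes a logit vector per step and is not the implementation released with the paper.

```python
import torch
import torch.nn.functional as F

def sample_next_word(logits, temperature=1.0, top_k=None, top_p=None):
    """Sample the next word w_t from the per-step logits s(. | I, w_1..t-1).

    Illustrative: temperature rescales the logits before the softmax; top_k keeps
    only the K highest-scoring words; top_p keeps the smallest set of words whose
    probability mass reaches p.
    """
    logits = logits / temperature
    if top_k is not None:
        kth_best = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth_best, float("-inf"))
    if top_p is not None:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        probs = F.softmax(sorted_logits, dim=-1)
        cumulative = torch.cumsum(probs, dim=-1)
        # drop words outside the nucleus; the top-ranked word is always kept
        outside = cumulative - probs > top_p
        sorted_logits = sorted_logits.masked_fill(outside, float("-inf"))
        logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # index of the sampled word

# usage with a dummy vocabulary-sized logit vector
word = sample_next_word(torch.randn(10000), temperature=0.5)
```

Naive sampling (SP) corresponds to setting only the temperature; beam search and DBS instead maintain m partial captions per step and are not shown here.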

    3 Measuring diversity and accuracy

We will use the following metrics for diverse captioning:
mBLEU: mean of BLEU scores computed between each caption in the set and the rest. Lower = more diverse.
Div-1, Div-2: ratio of the number of unique uni/bigrams in the generated captions to the number of words in the caption set. Higher = more diverse.
Self-CIDEr (Wang and Chan, 2019): a diversity metric derived from latent semantic analysis and CIDEr similarity. Higher = more diverse.
Vocabulary Size: number of unique words used in all generated captions.
% Novel Sentences: percentage of generated captions not seen in the training set.
Oracle/Average scores: taking the maximum/mean of each relevant metric over all the candidates (for one image).
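As an illustration of the simplest of these metrics, the sketch below computes Div-1/Div-2 and vocabulary size from tokenized captions; it is a minimal example, not the coco-caption tooling used for the reported numbers (mBLEU, Self-CIDEr, CIDEr and SPICE require the standard evaluation code).

```python
def distinct_n(captions, n):
    """Div-n: unique n-grams in one image's caption set divided by total words in the set."""
    ngrams, total_words = set(), 0
    for tokens in captions:
        total_words += len(tokens)
        ngrams.update(zip(*[tokens[i:] for i in range(n)]))  # all n-grams of this caption
    return len(ngrams) / max(total_words, 1)

def vocab_size(all_captions):
    """Number of unique words across all generated captions (all images)."""
    return len({w for tokens in all_captions for w in tokens})

# toy caption set for a single image
caps = [c.split() for c in ["a man riding a motorcycle down a dirt road",
                            "a person rides a scooter on a rural road"]]
print(distinct_n(caps, 1), distinct_n(caps, 2))  # Div-1, Div-2
```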

We propose a new metric, AllSPICE. It builds upon SPICE (Anderson et al., 2016), designed for evaluating a single caption. To compute SPICE, a scene graph is constructed for the evaluated caption, with vertices corresponding to objects, attributes, and relations; edges reflect possession of attributes or relations between objects. Another scene graph is constructed from the reference caption; with multiple reference captions, the reference scene graph is the union of the graphs for individual captions, combining synonymous vertices. The SPICE metric is the F-score on matching vertices and edges between the two graphs.

AllSPICE builds a single graph for the generated caption set, in the same way that SPICE treats reference caption sets. Note: repetitions across captions in the set do not change the score, due to the merging of synonymous vertices. Adding a caption that captures part of the reference not captured by previous captions in the set may improve the score (by increasing recall). Finally, wrong content in any caption in the set will harm the score (by reducing precision). This is in contrast to the "oracle score", where only one caption in the set needs to be good, and adding wrong captions does not affect the metric, making it an imperfect measure of set quality.

Thus, higher AllSPICE requires the samples to be semantically diverse and correct; diversity without accuracy is penalized. This is in contrast to other metrics (mBLEU, Self-CIDEr), which separate diversity considerations from those of accuracy.[1]
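Conceptually, AllSPICE can be summarized by the sketch below: one tuple set is built for the whole candidate set, one for the references, and the F-score of the match is returned. The parse_tuples argument is a hypothetical stand-in for the SPICE scene-graph parser (which additionally merges synonyms via WordNet); the released AllSPICE code wraps the standard SPICE tooling rather than this simplified logic.

```python
def allspice(candidate_captions, reference_captions, parse_tuples):
    """Conceptual sketch of AllSPICE; parse_tuples(caption) is assumed to return the
    set of SPICE-style scene-graph tuples (objects, attributes, relations)."""
    # a single graph for the WHOLE candidate set, as SPICE does for reference sets
    cand = set().union(*(parse_tuples(c) for c in candidate_captions))
    ref = set().union(*(parse_tuples(r) for r in reference_captions))
    matched = len(cand & ref)  # exact tuple match stands in for synonym matching
    precision = matched / len(cand) if cand else 0.0
    recall = matched / len(ref) if ref else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```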

    4 Experiments

The aim of our experiments is to explore how different methods (varying training procedures, decoding procedures, and hyper-parameter settings) navigate the diversity/accuracy tradeoff. We use data from COCO captions (Lin et al., 2014), with splits from (Karpathy and Fei-Fei, 2015), reporting results on the test split. (For similar results on Flickr30k (Young et al., 2014), see the Appendix.) Unless stated otherwise, caption sets contain five generated captions for each image.[2]

Naive sampling. Wang and Chan (2019) note that with temperature 1, an XE-trained model has high diversity but low accuracy, and an RL-trained model the opposite. Thus they propose to use a weighted combination of the RL objective and the XE loss to achieve a better trade-off. However, Figure 1 shows that a simple alternative for trading diversity for accuracy is to modulate the sampling temperature. Although the RL-trained model is not diverse at T=1, it generates more diverse captions at higher temperature. The XE-trained model can generate more accurate captions at lower temperature. In Figure 2, we show that naive sampling with T=0.5 performs better than other settings on oracle CIDEr and AllSPICE as well.

We also see, comparing the two panels of Figure 2 to each other and to Figure 1, that AllSPICE allows significant distinctions between settings that would appear very similar under oracle CIDEr. This suggests that AllSPICE is complementary to other metrics, and reflects a different aspect of the accuracy/diversity tradeoff.

Biased sampling. Top-K sampling (Fan et al., 2018; Radford et al., 2019) and Top-p (nucleus) sampling (Holtzman et al., 2019) are used to eliminate low-probability words by only sampling among the top-probability choices. This is intended to prevent the occurrence of "bad" words, which, once emitted, poison subsequent generation.

[1] AllSPICE is released at https://github.com/ruotianluo/coco-caption

[2] The experiment-related code is released at https://github.com/ruotianluo/self-critical.pytorch/tree/master/projects/Diversity

[Figure 1: Diversity-accuracy tradeoff (Avg CIDEr vs. Self-CIDEr). Blue: RL, decoded with different temperatures. Orange: XE+RL, trained with different CIDEr-MLE weights γ, decoded with T = 1. Green: XE, decoded with different temperatures.]

[Figure 2: Additional views of the diversity/accuracy tradeoff: Oracle CIDEr and AllSPICE vs. Self-CIDEr. See the legend of Figure 1.]

Method                avg CIDEr   oracle CIDEr   AllSPICE   Self-CIDEr
DBS λ=0.3, T=1        0.919       1.413          0.271      0.754
BS T=0.75             1.073       1.444          0.261      0.588
Top-K K=3, T=0.75     0.921       1.365          0.258      0.736
Top-p p=0.8, T=0.75   0.929       1.366          0.257      0.744
SP T=0.5              0.941       1.364          0.255      0.717

Table 1: Performance with the best hyperparameters for each method (XE-trained model).

[Figure 3: Oracle CIDEr and AllSPICE of Top-K (orange) and Top-p (blue) sampling, compared to naive sampling with varying T (green), for the XE-trained model.]

In Figure 3, we explore these methods for XE-trained models, with different thresholds and temperatures, compared against SP with temperature. We see that they can slightly outperform SP. In the Appendix, we show that Top-K sampling with an RL-trained model gets better results than SP or Top-p; however, the results are still worse than those of XE-trained models.


[Figure 4: Effect of sample size (5, 10, 20) on Oracle CIDEr, AllSPICE, Avg CIDEr, and Self-CIDEr for the XE-trained model. Colors indicate decoding strategies (BS T=1, DBS λ=3 T=1, SP T=0.5, SP T=1).]

Beam search: we include in the Appendix the detailed curves comparing results of (regular and diverse) beam search with varying temperature on XE and RL models. Briefly, we observe that, as with sampling, RL-trained models exhibit a worse diversity/accuracy tradeoff when BS is used for decoding.

In summary, in all settings, the RL-trained model performs worse than the XE-trained models, suggesting that the RL objective is detrimental to diversity. In the remainder, we only discuss XE-trained models.

Comparing across methods (sample size 5). Table 1 shows the performance of each method under its best-performing hyperparameters (under AllSPICE). Diverse beam search is the best algorithm, with high AllSPICE and Self-CIDEr, indicating both semantic and syntactic diversity.

Beam search performs best on oracle CIDEr and average CIDEr, and it performs well on AllSPICE too. However, although all the generated captions are accurate, the syntactic diversity is missing, as shown by Self-CIDEr. Qualitative results are shown in the supplementary material.

Sampling methods (SP, Top-K, Top-p) are much faster than beam search, and competitive in performance. This suggests them as a compelling alternative to recent methods for speeding up diverse caption generation, like (Deshpande et al., 2018).

Different sample sizes (see Figure 4): Oracle CIDEr tends to increase with sample size, as more captions mean more chances to fit the reference. AllSPICE drops with more samples, because additional captions are more likely to hurt (say something wrong) than help (add something correct not yet said). BS, which explores the caption space more "cautiously" than other methods, is initially resilient to this effect, but with enough samples its AllSPICE drops as well. Average CIDEr: sampling methods' average scores are largely invariant to sample size. BS, and especially DBS, suffer a lot with more samples, because the diversity constraints and the properties of beam search force the additional captions to be of lower quality, hurting precision without improving recall. Self-CIDEr: the relative order in Self-CIDEr is the same across different sample sizes, but SP/T=1 is closer to DBS and SP/T=0.5 to BS in Self-CIDEr. Because of the definition of Self-CIDEr, comparison across sample sizes is not very meaningful.

Comparison with adversarial loss. In (Shetty et al., 2017), the authors obtain more diverse but less accurate captions with an adversarial loss. Table 2 shows that by tuning the sampling temperature, an XE-trained model can generate more diverse and more accurate captions than the adversarially trained model. Following (Shetty et al., 2017), METEOR and SPICE are computed by measuring the score of the highest-likelihood sample among 5 generated captions. For a fair comparison, we use a similar non-attention LSTM model. At temperature 0.8, the generated captions outperform the adversarial method (Adv). While (Shetty et al., 2017) also compare their model to a base model using a sampling method (Base), they get less diverse captions because they use temperature 1/3 (our model with T=0.33 performs similarly to "Base").

          METEOR   SPICE   Div-1   Div-2   mBleu-4   Vocabulary   %Novel Sent
Base      0.265    0.186   0.31    0.44    0.68      1460         55.2
Adv       0.236    0.166   0.41    0.55    0.51      2671         79.8
T=0.33    0.266    0.187   0.31    0.44    0.65      1219         58.3
T=0.5     0.260    0.183   0.37    0.55    0.47      1683         75.6
T=0.6     0.259    0.181   0.41    0.60    0.37      2093         83.7
T=0.7     0.250    0.174   0.45    0.66    0.28      2573         89.9
T=0.8     0.240    0.166   0.49    0.71    0.21      3206         94.4
T=0.9     0.228    0.157   0.54    0.77    0.14      3969         97.1
T=1       0.214    0.144   0.59    0.81    0.08      4875         98.7

Table 2: Base and Adv are from (Shetty et al., 2017). The other models are trained by us. All the methods use naive sampling decoding.

    5 Conclusions

Our results provide both practical guidance and a better understanding of the diversity/accuracy tradeoff in image captioning. We show that caption decoding methods may affect this tradeoff more significantly than the choice of training objectives or model. In particular, simple naive sampling, coupled with a suitably low temperature, is a competitive method with respect to speed and the diversity/accuracy tradeoff. Diverse beam search exhibits the best tradeoff, but is also the slowest. Among training objectives, using CIDEr-based reward reduces diversity in a way that is not mitigated by manipulating decoding parameters. Finally, we introduce a metric, AllSPICE, for directly evaluating the tradeoff between accuracy and semantic diversity.

References

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic propositional image caption evaluation. In European Conference on Computer Vision, pages 382–398. Springer.

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086.

Dhruv Batra, Payman Yadollahpour, Abner Guzman-Rivera, and Gregory Shakhnarovich. 2012. Diverse m-best solutions in Markov random fields. In European Conference on Computer Vision, pages 1–16. Springer.

Michael Collins and Terry Koo. 2005. Discriminative reranking for natural language parsing. Computational Linguistics, 31(1):25–70.

Bo Dai, Sanja Fidler, Raquel Urtasun, and Dahua Lin. 2017. Towards diverse and natural image descriptions via a conditional GAN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2970–2979.

Aditya Deshpande, Jyoti Aneja, Liwei Wang, Alexander Schwing, and David A Forsyth. 2018. Diverse and controllable image captioning with part-of-speech guidance. arXiv preprint arXiv:1805.12589.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. arXiv preprint arXiv:1805.04833.

Kevin Gimpel, Dhruv Batra, Chris Dyer, and Gregory Shakhnarovich. 2013. A systematic exploration of diversity in machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1100–1111.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine Bosselut, David Golub, and Yejin Choi. 2018. Learning to write with cooperative discriminators. arXiv preprint arXiv:1805.06087.

Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.

Unnat Jain, Ziyu Zhang, and Alexander G Schwing. 2017. Creativity: Generating diverse questions using variational autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6485–6494.

Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137.

Yoon Kim, Alexander M Rush, Lei Yu, Adhiguna Kuncoro, Chris Dyer, and Gábor Melis. 2019. Unsupervised recurrent neural network grammars. arXiv preprint arXiv:1904.03746.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.

Jiwei Li and Dan Jurafsky. 2016. Mutual information and diverse decoding improve neural machine translation. arXiv preprint arXiv:1601.00372.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer.

Ruotian Luo and Gregory Shakhnarovich. 2017. Comprehension-guided referring expressions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7102–7111.

Emiel van Miltenburg, Desmond Elliott, and Piek Vossen. 2018. Measuring the diversity of automatic image descriptions. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1730–1741.

Andriy Mnih and Danilo J Rezende. 2016. Variational inference for Monte Carlo objectives. arXiv preprint arXiv:1602.06725.

Jae Sung Park, Marcus Rohrbach, Trevor Darrell, and Anna Rohrbach. 2019. Adversarial inference for multi-sentence video description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6598–6608.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1:8.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99.

Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. 2017. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7008–7024.

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565.

Libin Shen and Aravind K Joshi. 2003. An SVM-based voting algorithm with application to parse reranking. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003.

Tianxiao Shen, Myle Ott, Michael Auli, and Marc'Aurelio Ranzato. 2019. Mixture models for diverse machine translation: Tricks of the trade. arXiv preprint arXiv:1902.07816.

Rakshith Shetty, Marcus Rohrbach, Lisa Anne Hendricks, Mario Fritz, and Bernt Schiele. 2017. Speaking the same language: Matching machine to human captions by adversarial training. In Proceedings of the IEEE International Conference on Computer Vision, pages 4135–4144.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Ashwin K Vijayakumar, Michael Cogswell, Ramprasath R Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2016. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424.

Liwei Wang, Alexander Schwing, and Svetlana Lazebnik. 2017. Diverse and accurate image description using a variational auto-encoder with an additive Gaussian encoding space. In Advances in Neural Information Processing Systems, pages 5756–5766.

Qingzhong Wang and Antoni B Chan. 2019. Describing like humans: On diversity in image captioning. arXiv preprint arXiv:1903.12020.

Zhuhao Wang, Fei Wu, Weiming Lu, Jun Xiao, Xi Li, Zitong Zhang, and Yueting Zhuang. 2016. Diverse image captioning via GroupTalk. In IJCAI, pages 2957–2964.

Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.


Appendices

A Training details

All of our captioning models are trained according to the following scheme. The XE-trained captioning models are trained for 30 epochs (transformer: 15 epochs) with Adam (Kingma and Ba, 2014). RL-trained models are finetuned from the XE-trained model, optimizing the CIDEr score with REINFORCE (Williams, 1992). The baseline used in REINFORCE follows (Mnih and Rezende, 2016; Kim et al., 2019); this baseline performs better than the self-critical baseline of (Rennie et al., 2017). The batch size is 250 for LSTM-based models, and 100 for the transformer. The learning rate is initialized to 5e-4 and decays by a factor of 0.8 every three epochs. During the reinforcement learning phase, the learning rate is fixed at 4e-5.
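For concreteness, the optimization schedule above corresponds roughly to the following PyTorch sketch; the model placeholder and the loop skeleton are illustrative assumptions, not the released training script.

```python
import torch

model = torch.nn.LSTM(512, 512)  # placeholder for the captioning model
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
# decay the learning rate by a factor of 0.8 every three epochs during XE training
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.8)

for epoch in range(30):  # 30 XE epochs (15 for the transformer)
    # ... one epoch of cross-entropy training (batch size 250; 100 for the transformer) ...
    scheduler.step()

# the REINFORCE/CIDEr finetuning phase uses a fixed learning rate
for group in optimizer.param_groups:
    group["lr"] = 4e-5
```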

Visual features. For the non-attention LSTM captioning model, we use the input of the last fully-connected layer of a pretrained ResNet-101 (He et al., 2016) as the image feature.

For the attention-based LSTM and the transformer, the spatial features are extracted from the output of a Faster R-CNN (Anderson et al., 2018; Ren et al., 2015) with ResNet-101 (He et al., 2016), trained with object and attribute annotations from Visual Genome (Krishna et al., 2017). The output feature has shape K×2048, where K is the number of detected objects in the image. Both the fully-connected features and the spatial features are pre-extracted, and no finetuning is applied to the image encoders.

B RL-trained model with Top-K and nucleus sampling

For the RL-trained model, Top-K sampling can outperform SP and nucleus sampling: captions are more accurate at a similar level of diversity, as shown in Figure 5. Our intuition is that CIDEr optimization makes the word posterior very peaky and the tail of the distribution noisy. With a large temperature, sampling fails because the selection falls into the bad tail. Top-K eliminates this problem and keeps the sampling focused on plausible words. Unfortunately, nucleus sampling does not perform as well on the captioning task.

[Figure 5: Oracle CIDEr and AllSPICE of Top-K (orange) and Top-p (blue) sampling, compared to naive sampling with varying T (green), for the RL-trained model.]

    C Beam search

Figure 6 shows how the scores change when tuning the temperature for beam search. It is clear that no matter how BS is tuned for the RL-trained model, its performance is always lower than that of the XE-trained model.

Another interesting finding is that, unlike the other methods, beam search produces a more diverse set at lower temperature. This is because a lower temperature gives new beams more chances to be expanded from different old beams, instead of having one beam dominate (beams sharing the same root).

    D Diverse beam search

Figure 7 shows oracle CIDEr and AllSPICE for different settings of λ and T from an XE-trained model. Figure 8 compares the XE-trained model and the RL-trained model. The RL-trained model is also worse than the XE-trained model when using DBS.

    E Different models

Here we compare naive sampling results of different model architectures. We consider 4 different architectures. FC is a non-attention LSTM model. ATTN (the default in the main paper) and ATTN-L are attention-based LSTMs with hidden sizes 512 and 2048. Trans is the standard transformer model (Vaswani et al., 2017; Sharma et al., 2018). They are all trained with XE.

[Figure 6: Oracle CIDEr and AllSPICE of beam search with different temperatures, from the RL-trained model (blue) and the XE-trained model (orange).]

Table 3 shows the single-caption generation results of these 4 architectures: FC < ATTN < ATTN-L < Trans. (We include both XE- and RL-trained models in the table, but for diversity evaluation we only show results on XE-trained models.) Figure 9 shows the naive sampling results for the different models. AllSPICE performance of the different models has the same order as single-caption generation performance.

              ROUGE   METEOR   CIDEr   SPICE
FC (XE)       0.543   0.258    1.006   0.187
ATTN (XE)     0.562   0.272    1.109   0.201
ATTN-L (XE)   0.561   0.275    1.116   0.205
Trans (XE)    0.563   0.278    1.131   0.208
FC (RL)       0.555   0.263    1.123   0.197
ATTN (RL)     0.572   0.274    1.211   0.209
ATTN-L (RL)   0.584   0.285    1.267   0.219
Trans (RL)    0.589   0.291    1.298   0.230

Table 3: Single-caption results for the different model architectures.

    F Results on Flickr30k

Here we also plot similar figures on the Flickr30k dataset (Young et al., 2014). They show that our conclusions generalize across datasets. These results use the same models as in the main text.

[Figure 7: Diverse beam search: Oracle CIDEr and AllSPICE vs. Self-CIDEr. Curve colors correspond to decoding an XE-trained model with different temperatures (T=0.1, 0.2, 0.5, 1.0); each curve tracks varying λ in DBS.]

    G Qualitative results

We show caption sets sampled by different decoding methods for 5 different images. The model is the attention LSTM model trained with cross-entropy loss.

[Figure 8: Diverse beam search results (Oracle CIDEr and AllSPICE vs. Self-CIDEr) of the XE-trained model and the RL-trained model. Each dot corresponds to a specific (λ, T) pair.]

[Figure 9: Oracle CIDEr and AllSPICE of different models (ATTN, ATTN-L, FC, Trans at T=0.5 and T=1). The numbers indicate the sample size (5, 10, 20).]

[Figure 10: Diversity-accuracy tradeoff (Avg CIDEr vs. Self-CIDEr) of att2in on Flickr30k. Blue: RL, decoded with different temperatures. Orange: XE, decoded with different temperatures.]

[Figure 11: Oracle CIDEr and AllSPICE vs. Self-CIDEr on Flickr30k. See the legend of Figure 10.]

[Figure 12: Oracle CIDEr and AllSPICE of Top-K (orange) and Top-p (blue) sampling, compared to naive sampling with varying T (green) on Flickr30k, for the XE-trained model.]

DBS λ=3 T=1:
a man riding a motorcycle down a dirt road
there is a man riding a motorcycle down the road
man riding a motorcycle down a dirt road
an image of a man riding a motorcycle
the person is riding a motorcycle down the road

BS T=0.75:
a man riding a motorcycle down a dirt road
a man riding a motorcycle on a dirt road
a man riding a motorcycle down a rural road
a man riding a motorcycle down a road
a man riding a motorcycle down a road next to a mountain

Top-K K=3 T=0.75:
a man riding a motorcycle down a dirt road
a person on a motor bike on a road
a man riding a bike on a path with a mountain in the background
a man riding a motorcycle down a dirt road with mountains in the background
a man riding a motorcycle down a road next to a mountain

Top-p p=0.8 T=0.75:
a man riding a motorcycle down a dirt road
a person riding a motorcycle down a dirt road
a man riding a motorcycle down a dirt road
a man riding a motorcycle down a dirt road
a man on a motorcycle in the middle of the road

SP T=0.5:
a man riding a motorcycle down a dirt road
a man on a motorcycle in the middle of a road
a person riding a motorcycle down a dirt road
a man rides a scooter down a dirt road
a man riding a motorcycle on a dirt road

DBS λ=3 T=1:
a woman sitting at a table with a plate of food
two women are sitting at a table with a cake
the woman is cutting the cake on the table
a woman cutting a cake with candles on it
a couple of people that are eating some food

BS T=0.75:
a woman sitting at a table with a plate of food
a woman sitting at a table with a cake
a woman cutting a cake with a candle on it
a couple of women sitting at a table with a cake
a couple of women sitting at a table with a plate of food

Top-K K=3 T=0.75:
a woman sitting at a table with a cake with candles
a woman cutting into a cake with a candle
a woman sitting at a table with a fork and a cake
a woman cutting a birthday cake with a knife
a woman sitting at a table with a plate of food

Top-p p=0.8 T=0.75:
two women eating a cake on a table
a woman cutting a cake with a candle on it
two women sitting at a table with a plate of food
a woman cutting a cake with candles on it
a woman is eating a cake with a fork

SP T=0.5:
a woman is cutting a piece of cake on a table
a woman is cutting a cake with a knife
a woman is eating a piece of cake
a woman is cutting a cake on a table
a group of people eat a piece of cake

DBS λ=3 T=1:
a person riding a bike down a street
there is a man riding a bike down the street
the woman is riding her bike on the street
an image of a person riding a bike
an older man riding a bike down a street

BS T=0.75:
a man riding a bike down a street next to a train
a person riding a bike down a street
a person riding a bike on a street
a man riding a bicycle down a street next to a train
a man riding a bike down a street next to a red train

Top-K K=3 T=0.75:
a person on a bicycle with a red train
a man riding a bike down a street next to a red train
a man riding a bike down a street next to a train
a man riding a bike down a road next to a train
a person on a bike is on a road

Top-p p=0.8 T=0.75:
a woman riding a bicycle in a street
a woman rides a bicycle with a train on it
a man on a bicycle and a bike and a train
a man rides a bike with a train on the back
a woman is riding her bike through a city

SP T=0.5:
a woman on a bike with a bicycle on the side of the road
a person riding a bicycle down a street
a man on a bicycle near a stop sign
a man riding a bike next to a train
a man riding a bike down the street with a bike

DBS λ=3 T=1:
a kitchen with a sink and a window
there is a sink and a window in the kitchen
an empty kitchen with a sink and a window
an image of a kitchen sink and window
a sink in the middle of a kitchen

BS T=0.75:
a kitchen with a sink and a window
a kitchen with a sink a window and a window
a kitchen sink with a window in it
a kitchen with a sink and a sink
a kitchen with a sink and a window in it

Top-K K=3 T=0.75:
a kitchen with a sink and a mirror
the kitchen sink has a sink and a window
a sink and a window in a small kitchen
a kitchen with a sink a window and a window
a kitchen with a sink a window and a mirror

Top-p p=0.8 T=0.75:
a kitchen with a sink a window and a window
a kitchen with a sink a sink and a window
a kitchen with a sink and a window
a kitchen with a sink and a window
a kitchen with a sink and a window

SP T=0.5:
a kitchen with a sink a window and a window
a sink sitting in a kitchen with a window
a kitchen sink with a window on the side of the counter
a kitchen with a sink and a window
a kitchen with a sink and a window

DBS λ=3 T=1:
a wooden table topped with lots of wooden boards
a bunch of different types of food on a cutting board
there is a wooden cutting board on the table
some wood boards on a wooden cutting board
an assortment of vegetables on a wooden cutting board

BS T=0.75:
a wooden table topped with lots of wooden boards
a wooden cutting board topped with lots of wooden boards
a wooden cutting board with a bunch of wooden boards
a wooden cutting board with a wooden cutting board
a wooden cutting board with a bunch of wooden boards on it

Top-K K=3 T=0.75:
a wooden cutting board with a bunch of wooden boards
a wooden table with several different items
a wooden cutting board with some wooden boards
a wooden cutting board with some wooden boards on it
a bunch of different types of food on a cutting board

Top-p p=0.8 T=0.75:
a bunch of wooden boards sitting on top of a wooden table
a wooden cutting board with several pieces of bread
a wooden cutting board with a bunch of food on it
a bunch of different types of different colored UNK
a wooden cutting board with a wooden board on top of it

SP T=0.5:
a wooden cutting board with knife and cheese
a wooden table topped with lots of wooden boards
a wooden cutting board with chopped up and vegetables
a wooden table topped with lots of wooden boards
a wooden cutting board with some wooden boards on it