
Probing Contextualized Sentence Representations with Visual Awareness

Zhuosheng Zhang 1,2,3, Rui Wang 4, Kehai Chen 4, Masao Utiyama 4, Eiichiro Sumita 4, Hai Zhao 1,2,3

1 Department of Computer Science and Engineering, Shanghai Jiao Tong University
2 Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, China
3 MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China
4 National Institute of Information and Communications Technology (NICT)
[email protected], [email protected]
{wangrui,khchen,mutiyama,eiichiro.sumita}@nict.go.jp

Abstract

We present a universal framework to model contextualized sentence representations with visual awareness, motivated by the need to overcome the shortcomings of manually annotated multimodal parallel data. For each sentence, we first retrieve a diverse set of images from a shared cross-modal embedding space, which is pre-trained on a large scale of text-image pairs. The texts and images are then encoded by a Transformer encoder and a convolutional neural network, respectively. The two sequences of representations are further fused by a simple and effective attention layer. The architecture can be easily applied to text-only natural language processing tasks without manually annotating multimodal parallel corpora. We apply the proposed method to three tasks, namely neural machine translation, natural language inference and sequence labeling, and the experimental results verify its effectiveness.

1 Introduction

Learning vector representations of sentence meaning is a long-standing objective in natural language processing (NLP) (Wang et al., 2018b; Chen et al., 2019; Zhang et al., 2019b). Text representation learning has evolved from word-level distributed representations (Mikolov et al., 2013; Pennington et al., 2014) to contextualized language modeling (LM) (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2018; Yang et al., 2019). Despite the success of LMs, NLP models remain impoverished compared to humans because they learn solely from textual features, without grounding in the outside world such as visual perception. Therefore, a line of research has emerged that aims to incorporate non-linguistic modalities into language representations (Bruni et al., 2014; Calixto et al., 2017; Zhang et al., 2018; Ive et al., 2019; Shi et al., 2019).

Previous work mainly integrates visual guidance into word or character representations (Kiela and Bottou, 2014; Silberer and Lapata, 2014; Zablocki et al., 2018; Wu et al., 2019), which requires aligning words with images. Besides, word meaning may vary across sentences depending on the context, so a single aligned image may not be optimal. Recently, there has been a trend of pre-training visual-linguistic (VL) representations for vision-and-language tasks (Su et al., 2019; Lu et al., 2019; Tan and Bansal, 2019; Li et al., 2019; Zhou et al., 2019; Sun et al., 2019). However, these VL studies rely on text-image annotations as paired input and are thus restricted to VL tasks such as image captioning and visual question answering. For most NLP tasks, in contrast, texts come without such annotations. It is therefore essential to find a general method for applying visual information to a wider range of mono-modal, text-only tasks.

Recent studies have verified that the representations of images and texts can be jointly leveraged to build visual-semantic embeddings in a shared representation space (Frome et al., 2013; Karpathy and Fei-Fei, 2015; Ren et al., 2016; Mukherjee and Hospedales, 2016). To this end, a popular approach is to connect the mono-modal text and image encoding paths with fully connected layers (Wang et al., 2018a; Engilberge et al., 2018). The shared deep embedding can be used for cross-modal retrieval, so it can associate sentences with related images. Inspired by this line of research, we incorporate visual awareness into sentence modeling by retrieving a group of images for a given sentence.

According to the Distributional Hypothesis (Harris, 1954), which states that words occurring in similar contexts tend to have similar meanings, we attempt to extend this idea to the visual modality: sentences with similar meanings would likely pair with similar or even the same images in the shared embedding space. In this paper, we propose an approach to model contextualized sentence representations with visual awareness. For each sentence, we retrieve a diverse set of images from a shared text-visual embedding space that is pre-trained on a large scale of text-image pairs to connect the mono-modal paths of text and image embeddings. The texts and images are encoded by a Transformer LM and a pre-trained convolutional neural network (CNN), respectively. A simple and effective attention layer is then designed to fuse the two sequences of representations. In particular, the proposed approach can be easily applied to text-only tasks without manually annotating multimodal parallel corpora. The proposed method was evaluated on three tasks, including neural machine translation (NMT), natural language inference (NLI) and sequence labeling (SL). Experiments and analysis show its effectiveness. In summary, our contributions are primarily three-fold:

1. We present a universal visual representation method that overcomes the shortcomings of manually annotated multimodal parallel data.

2. We propose a multimodal context-driven model to jointly learn sentence-level representations from textual and visual modalities.

3. Experiments on different tasks verified the effectiveness and generality of the proposed approach.

2 Approach

Figure 1 illustrates the architecture of our proposed method. Given a sentence, we first fetch a group of matched images from the cross-modal retrieval model. The text and images are encoded by the text feature extractor and the image feature extractor, respectively. Then the two sequences of representations are integrated by multi-head attention to form a joint representation, which is passed to downstream task-specific layers. Before introducing our visual-aware model, we briefly describe the cross-modal retrieval model that is used for image retrieval given a sentence.

2.1 Cross-modal Retrieval Model

Given an input sentence, our aim is to associate it with a number of images from a candidate corpus.

Figure 1: Overview of our model architecture.

Following Engilberge et al. (2018), we train a semantic-visual embedding on a text-image corpus, which is then used for image retrieval. The semantic-visual embedding architecture comprises two paths that encode the texts and images into vectors, respectively. Based on our preliminary experiments, we choose the simple recurrent unit (SRU) architecture as our text encoder and the fully convolutional residual ResNet-152 (Xie et al., 2017) with Weldon pooling (Durand et al., 2016) as our image encoder.

Both pipelines are learned simultaneously. Each image X is paired with 1) a positive text Y that describes the image and 2) a hard negative Z, which is selected as the text that has the highest similarity with the image while not being associated with it. The architecture of the model is shown in Figure 2. During training, a triplet loss (Wang et al., 2014; Schroff et al., 2015; Gordo et al., 2017) is used to make the model converge correctly and improve its performance, as shown in Equation 1:

loss(x, y, z) = max(0, α − x · y + x · z),    (1)

where x, y, and z are the embeddings of X, Y, and Z, respectively, and α is the minimum margin between the similarity of the correct caption and that of the unrelated caption. The loss function encourages the sentence Y to be closer to the corresponding image X than the unrelated one Z.
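As a minimal sketch, the margin-based objective in Equation 1 can be written as follows in PyTorch; the function name and the default margin value are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the triplet loss in Eq. (1), assuming the text and image embeddings
# are L2-normalized so that dot products act as similarity scores.
import torch

def triplet_loss(x, y, z, margin=0.1):
    """x: image embeddings, y: positive captions, z: hard negative captions.
    All tensors have shape (batch, dim). The margin value is illustrative."""
    pos_sim = (x * y).sum(dim=-1)   # x . y
    neg_sim = (x * z).sum(dim=-1)   # x . z
    return torch.clamp(margin - pos_sim + neg_sim, min=0).mean()
```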

At prediction time, the relationship between texts and images is scored by cosine similarity.
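The retrieval step can be sketched as below, assuming the candidate images have already been embedded into the shared space. The names `retrieve_images` and `image_bank` are illustrative; m = 8 matches the setting reported in Section 4.1.

```python
# Sketch of prediction-time retrieval: score the query sentence against
# pre-computed image embeddings by cosine similarity and keep the top-m images.
import torch
import torch.nn.functional as F

def retrieve_images(sentence_embedding, image_bank, m=8):
    """sentence_embedding: (dim,); image_bank: (num_images, dim)."""
    q = F.normalize(sentence_embedding, dim=-1)
    bank = F.normalize(image_bank, dim=-1)
    scores = bank @ q                        # cosine similarity per image
    return torch.topk(scores, k=m).indices   # indices of the m closest images
```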


Figure 2: Details of the proposed semantic-visual embedding model.

2.2 Visual-aware Model

2.3 Encoding Layer

For each sentence X = {x1, x2, . . . , xI}, we pair it with the top m matched images E = {e1, e2, . . . , em} according to the retrieval method above. The sentence X is fed into a multi-layer Transformer encoder to learn the sentence representation H. Meanwhile, the images E are encoded by a pre-trained ResNet (He et al., 2016) followed by a feed-forward layer to learn the image representation M, which has the same dimension as H.
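A rough sketch of this encoding step is given below, assuming a standard PyTorch Transformer encoder for the text path and a ResNet with its classification head removed for the image path; the module names, the image input size and the omission of positional encodings are simplifications, not the paper's exact implementation.

```python
# Sketch of the encoding layer: text -> H via a Transformer encoder,
# images -> M via a pre-trained ResNet plus a feed-forward projection to d_model.
import torch
import torch.nn as nn
import torchvision.models as models

class VisualAwareEncoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # positional encodings omitted
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=2048)
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        resnet = models.resnet152(pretrained=True)
        self.image_encoder = nn.Sequential(*list(resnet.children())[:-1])  # drop fc
        self.img_proj = nn.Linear(2048, d_model)  # feed-forward layer to match d_model

    def forward(self, token_ids, images):
        # token_ids: (seq_len, batch); images: (batch, m, 3, 224, 224)
        H = self.text_encoder(self.embed(token_ids))         # (seq_len, batch, d_model)
        b, m = images.shape[:2]
        feats = self.image_encoder(images.flatten(0, 1))     # (b*m, 2048, 1, 1)
        M = self.img_proj(feats.flatten(1)).view(b, m, -1)   # (batch, m, d_model)
        return H, M
```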

2.4 Multi-modal Integration Layer

Then, we apply a one-layer multi-head attention mechanism to incorporate the image representation into the text representation:

H′ = ATT(H, K_M, V_M),    (2)

where {K_M, V_M} are packed from the learned image representation M.

Following the same practice as in the Transformer block, we fuse H and H′ with layer normalization to learn the joint representation:

H = LayerNorm(W(H + H′) + b),    (3)

where W and b are parameters.
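Equations 2 and 3 can be sketched as a small PyTorch module as follows; the sequence-first tensor layout and the module names are assumptions for illustration only.

```python
# Sketch of the multi-modal integration layer (Eqs. 2-3): text states attend
# over image features, and the output is fused with H via a linear layer + LayerNorm.
import torch
import torch.nn as nn

class MultimodalIntegration(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads)
        self.fuse = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, H, M):
        # H: (seq_len, batch, d); M: (m, batch, d)
        # Eq. (2): queries come from the text, keys/values are packed from M
        H_img, _ = self.attn(query=H, key=M, value=M)
        # Eq. (3): fuse the two streams and normalize
        return self.norm(self.fuse(H + H_img))
```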

2.5 Task-specific Layer

In this section, we show how the joint representation is used for downstream tasks, taking the NMT, NLI and SL tasks as examples. For NMT, H is fed to the decoder to learn a time-dependent context vector for predicting the target translation. For NLI and SL, H is directly fed to a feed-forward layer to make the prediction.
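For concreteness, the classification and tagging heads might look like the following sketch; the pooling choice for NLI (first position) is an assumption, since the paper does not specify it.

```python
# Sketch of task-specific heads on top of the fused representation H.
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):   # e.g., NLI: entailment / neutral / contradiction
    def __init__(self, d_model=512, n_classes=3):
        super().__init__()
        self.ff = nn.Linear(d_model, n_classes)

    def forward(self, H_fused):        # (seq_len, batch, d)
        return self.ff(H_fused[0])     # pool the first position (assumption)

class TaggingHead(nn.Module):          # e.g., NER: one label per token
    def __init__(self, d_model=512, n_tags=9):
        super().__init__()
        self.ff = nn.Linear(d_model, n_tags)

    def forward(self, H_fused):
        return self.ff(H_fused)        # (seq_len, batch, n_tags)
```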

3 Task Settings

We evaluate our model on three different NLP tasks, namely neural machine translation, natural language inference and sequence labeling. We present these evaluation benchmarks in what follows.

3.1 Neural Machine Translation

We use two translation datasets, WMT'16 English-to-Romanian (EN-RO) and WMT'14 English-to-German (EN-DE), which are standard corpora for NMT evaluation.

1) For the EN-RO task, we experimented with the officially provided parallel corpora, Europarl v7 and SETIMES2 from WMT'16, with 0.6M sentence pairs. We used newsdev2016 as the dev set and newstest2016 as the test set.

2) For the EN-DE translation task, 4.43M bilingual sentence pairs of the WMT'14 dataset were used as training data, including Common Crawl, News Commentary, and Europarl v7. The newstest2013 and newstest2014 datasets were used as the dev set and test set, respectively.

3.2 Natural Language Inference

Natural language inference involves reading a pair of sentences and judging the relationship between their meanings, such as entailment, neutral and contradiction. For this task, we use the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015), which provides approximately 570k hypothesis/premise pairs.

3.3 Sequence Labeling

We use the CoNLL-2003 Named Entity Recognition (NER) dataset (Sang and Meulder, 2003) for the sequence labeling task, which includes four kinds of named entities: Person, Location, Organization and MISC.


Model                             EN-RO   EN-DE
Public Systems
Trans. (Vaswani et al., 2017)       -     27.3
Trans. (Lee et al., 2018)         32.40     -
Our implementation
Trans.                            32.66   27.31
+ VA                              34.63   27.83

Table 1: BLEU scores on EN-RO and EN-DE for the NMT tasks. Trans. is short for Transformer.

4 Model Implementation

We now introduce the specific implementation details of our method. All experiments were run on 8 NVIDIA Tesla V100 GPUs.

4.1 Cross-modal Retrieval Model

The cross-modal retrieval model is trained on the MS-COCO dataset (Lin et al., 2014), which contains 123,287 images with 5 English captions per image. The dataset is split into 82,783 training, 5,000 validation and 5,000 test images; we used the Karpathy split (Karpathy and Fei-Fei, 2015), which forms 113,287 training, 5,000 validation and 5,000 test images. The model is implemented following the same settings as in Engilberge et al. (2018) and achieves state-of-the-art results (94.0% R@10) in cross-modal retrieval. The maximum number of retrieved images m for each sentence is set to 8 according to our preliminary experimental results.

4.2 Baseline

To incorporate our visual-aware model (+VA), we only modify the encoder of the baselines by introducing the image encoder layer and the multi-modal integration layer.

For the NMT tasks, the baseline was the Transformer (Vaswani et al., 2017) implemented in fairseq1 (Ott et al., 2019). We used six layers for both the encoder and the decoder. The dimension of all input and output layers was set to 512, and the inner feed-forward layer size was set to 2048. The number of heads of all multi-head attention modules was set to eight in both encoder and decoder layers. The byte pair encoding algorithm was adopted, and the vocabulary size was set to 40,000. Each training batch contained approximately 4096×4 source tokens and 4096×4 target tokens. During training, the label smoothing value was set to 0.1.

1 https://github.com/pytorch/fairseq

Model                              Acc
Public Systems
GPT (Radford et al., 2018)         89.9
DRCN (Kim et al., 2018)            90.1
MT-DNN (Liu et al., 2019)          91.6
SemBERT (Zhang et al., 2019a)      91.6
BERT (Base) (Liu et al., 2019)     90.8
Our implementation
BERT (Base)                        90.7
+ VA                               91.2

Table 2: Accuracy on the SNLI dataset.

Model                               F1 score
Public Systems
LSTM-CRF (Lample et al., 2016)        90.94
BERT (Base) (Pires et al., 2019)      91.07
Our implementation
BERT (Base)                           91.21
+ VA                                  91.46

Table 3: Results (%) on the CoNLL-2003 NER dataset.

The attention dropout and residual dropout were set to p = 0.1. The Adam optimizer (Kingma and Ba, 2014) was used to tune the parameters of the model. The learning rate was varied under a warm-up strategy with 8,000 steps.
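The warm-up strategy presumably follows the schedule of Vaswani et al. (2017); a sketch with the 8,000 warm-up steps stated above is given below, where the exact scaling constant is an assumption rather than a reported detail.

```python
# Sketch of the Transformer warm-up learning-rate schedule with 8,000 warm-up steps.
def transformer_lr(step, d_model=512, warmup=8000):
    step = max(step, 1)
    # Linearly increase during warm-up, then decay with the inverse square root of the step.
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```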

For the NLI and SL tasks, the baseline was BERT (Base)2. We used the pre-trained weights of BERT and followed the same fine-tuning procedure as BERT without any modification. The initial learning rate was selected from {8e-6, 1e-5, 2e-5, 3e-5} with a warm-up rate of 0.1 and L2 weight decay of 0.01. The batch size was selected from {16, 24, 32}. The maximum number of epochs was set in the range [2, 5]. Texts were tokenized into WordPieces, with a maximum length of 128.
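The search space above can be written as a simple sweep; only the grid values come from the text, while the loop and the `fine_tune` placeholder are illustrative.

```python
# Sketch of the fine-tuning hyperparameter sweep described above.
from itertools import product

learning_rates = [8e-6, 1e-5, 2e-5, 3e-5]
batch_sizes = [16, 24, 32]
epochs = range(2, 6)   # 2 to 5 epochs

for lr, bs, n_epochs in product(learning_rates, batch_sizes, epochs):
    config = dict(lr=lr, batch_size=bs, epochs=n_epochs,
                  warmup_rate=0.1, weight_decay=0.01, max_seq_len=128)
    # fine_tune(config)  # placeholder for the BERT fine-tuning routine
```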

4.3 Results

Table 1 shows the translation results for the WMT'14 EN-DE and WMT'16 EN-RO translation tasks. Our method significantly outperformed the baseline Transformer, demonstrating the effectiveness of modeling visual information for NMT.

Tables 2-3 show the results for the NLI and SL tasks, which also verify the effectiveness of our method. The results show that our method is useful not only for the fundamental tagging task but also for the more advanced translation and inference tasks.

2 https://github.com/huggingface/pytorch-pretrained-BERT


Figure 3: Examples of the retrieved images for sentences.

Figure 4: Concept activation maps with different input words. The orange region indicates the highest peak in the heatmap.


5 Analysis

5.1 Concept Localization

An important advantage of the shared embedding space is that it can localize arbitrary concepts within an image. For an input text, we compute the image localization heatmap derived from the activation map of the last convolutional layer, following Engilberge et al. (2018). Figure 4 shows an example, which indicates that the shared space can not only perform image retrieval but also match language concepts in the image for any text query.
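Under the assumption that the image path exposes the spatial activation map of its last convolutional layer before pooling, the heatmap computation can be sketched as follows; the helper names and the use of per-position cosine similarity are illustrative.

```python
# Sketch of concept localization: project each spatial position of the last-conv
# activation map into the shared space and score it against the text embedding.
import torch
import torch.nn.functional as F

def concept_heatmap(text_emb, conv_map, proj):
    """text_emb: (d,); conv_map: (c, h, w) last-conv activations;
    proj: a linear layer mapping c -> d into the shared space."""
    c, h, w = conv_map.shape
    spatial = proj(conv_map.flatten(1).t())                              # (h*w, d)
    scores = F.cosine_similarity(spatial, text_emb.unsqueeze(0), dim=-1)  # (h*w,)
    return scores.view(h, w)   # high values localize the queried concept
```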

5.2 Examples of Image Retrieval

Although the image retrieval method achieves strong results on the COCO dataset, we are interested in its concrete behavior on our task-specific datasets. We randomly select some examples to illustrate the image retrieval process intuitively, as shown in Figure 3. Besides the "good" examples, whose text and image contents match well, we also observe some "negative" examples whose contents might not be related at the concept level but show some potential connections. To some extent, the alignment of text and image concepts might not be the only effective factor for multimodal modeling, since such concepts are defined by human knowledge and their meanings may vary across people and over time. In contrast, a consistent mapping between the modalities in a shared embedding space is potentially more beneficial, because similar images tend to be retrieved for similar sentences, which can serve as topical hints for sentence modeling.

6 Conclusion

In this work, we present a universal method to incorporate visual information into sentence modeling by conducting image retrieval from a pre-trained shared cross-modal embedding space, overcoming the shortcomings of manually annotated multimodal parallel data. The text and image representations are encoded by a Transformer encoder and a convolutional neural network, respectively, and then integrated in a multi-head attention layer. Empirical studies on a range of NLP tasks including NMT, NLI and SL verify its effectiveness. Our method is general and can be easily integrated into any existing deep learning NLP system. We hope this work will facilitate future multimodal research across vision and language.


References

Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49:1–47.

Iacer Calixto, Qun Liu, and Nick Campbell. 2017. Doubly-attentive decoder for multi-modal neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1913–1924.

Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita, and Tiejun Zhao. 2019. Neural machine translation with sentence-level topic context. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(12):1970–1984.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Thibaut Durand, Nicolas Thome, and Matthieu Cord. 2016. Weldon: Weakly supervised learning of deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4743–4752.

Martin Engilberge, Louis Chevallier, Patrick Perez, and Matthieu Cord. 2018. Finding beans in burgers: Deep semantic-visual embedding with localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3984–3993.

Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. 2013. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pages 2121–2129.

Albert Gordo, Jon Almazan, Jerome Revaud, and Diane Larlus. 2017. End-to-end learning of deep visual representations for image retrieval. International Journal of Computer Vision, 124(2):237–254.

Zellig S Harris. 1954. Distributional structure. Word, 10(2-3):146–162.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Julia Ive, Pranava Madhyastha, and Lucia Specia. 2019. Distilling translations with visual awareness. arXiv preprint arXiv:1906.07701.

Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137.

Douwe Kiela and Leon Bottou. 2014. Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 36–45.

Seonhoon Kim, Jin-Hyuk Hong, Inho Kang, and Nojun Kwak. 2018. Semantic sentence matching with densely-connected recurrent and co-attentive information. arXiv preprint arXiv:1805.11360.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270.

Jason Lee, Elman Mansimov, and Kyunghyun Cho. 2018. Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1173–1182.

Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. 2019. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. arXiv preprint arXiv:1908.06066.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.

Tanmoy Mukherjee and Timothy Hospedales. 2016. Gaussian visual-linguistic embedding for zero-shot recognition. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 912–918.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL-HLT.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? arXiv preprint arXiv:1906.01502.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report.

Zhou Ren, Hailin Jin, Zhe Lin, Chen Fang, and Alan Yuille. 2016. Joint image-text representation by Gaussian visual-semantic embedding. In Proceedings of the 24th ACM International Conference on Multimedia, pages 207–211. ACM.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In HLT-NAACL 2003, pages 1–6.

Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823.

Haoyue Shi, Jiayuan Mao, Kevin Gimpel, and Karen Livescu. 2019. Visually grounded neural syntax acquisition. arXiv preprint arXiv:1906.02890.

Carina Silberer and Mirella Lapata. 2014. Learning grounded meaning representations with autoencoders. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 721–732.

Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2019. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530.

Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019. VideoBERT: A joint model for video and language representation learning. arXiv preprint arXiv:1904.01766.

Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008.

Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, and Ying Wu. 2014. Learning fine-grained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1386–1393.

Liwei Wang, Yin Li, Jing Huang, and Svetlana Lazebnik. 2018a. Learning two-branch neural networks for image-text matching tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):394–407.

Rui Wang, Masao Utiyama, Andrew Finch, Lemao Liu, Kehai Chen, and Eiichiro Sumita. 2018b. Sentence selection and weighting for neural machine translation domain adaptation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(10):1727–1741.

Wei Wu, Yuxian Meng, Qinghong Han, Muyu Li, Xiaoya Li, Jie Mei, Ping Nie, Xiaofei Sun, and Jiwei Li. 2019. Glyce: Glyph-vectors for Chinese character representations. arXiv preprint arXiv:1901.10125.

Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Eloi Zablocki, Benjamin Piwowarski, Laure Soulier, and Patrick Gallinari. 2018. Learning multi-modal word representation grounded in visual context. In Thirty-Second AAAI Conference on Artificial Intelligence.

Kun Zhang, Guangyi Lv, Le Wu, Enhong Chen, Qi Liu, Han Wu, and Fangzhao Wu. 2018. Image-enhanced multi-level sentence representation net for natural language inference. In 2018 IEEE International Conference on Data Mining (ICDM), pages 747–756. IEEE.

Zhuosheng Zhang, Yuwei Wu, Hai Zhao, Zuchao Li, Shuailiang Zhang, Xi Zhou, and Xiang Zhou. 2019a. Semantics-aware BERT for language understanding. arXiv preprint arXiv:1909.02209.

Zhuosheng Zhang, Yuwei Wu, Junru Zhou, Sufeng Duan, Hai Zhao, and Rui Wang. 2019b. SG-Net: Syntax-guided machine reading comprehension. arXiv preprint arXiv:1908.05147.

Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J Corso, and Jianfeng Gao. 2019. Unified vision-language pre-training for image captioning and VQA. arXiv preprint arXiv:1909.11059.