
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), pages 1990–1999, Melbourne, Australia, July 15-20, 2018. © 2018 Association for Computational Linguistics


Visual Attention Model for Name Tagging in Multimodal Social Media

Di Lu*1, Leonardo Neves2, Vitor Carvalho3, Ning Zhang2, Heng Ji1
1 Computer Science, Rensselaer Polytechnic Institute, {lud2,jih}@rpi.edu
2 Snap Research, {lneves, ning.zhang}@snap.com
3 Intuit, vitor [email protected]

Abstract

Every day, billions of multimodal posts containing both images and text are shared on social media sites such as Snapchat, Twitter or Instagram. This combination of image and text in a single message allows for more creative and expressive forms of communication, and has become increasingly common on such sites. This new paradigm brings new challenges for natural language understanding, as the textual component tends to be shorter, more informal, and often is only understood if combined with the visual context. In this paper, we explore the task of name tagging in multimodal social media posts. We start by creating two new multimodal datasets: one based on Twitter posts[1] and the other based on Snapchat captions (exclusively submitted to public and crowd-sourced stories). We then propose a novel model based on Visual Attention that not only provides deeper visual understanding of the decisions of the model, but also significantly outperforms other state-of-the-art baseline methods for this task.[2]

1 Introduction

Social platforms like Snapchat, Twitter, Instagram and Pinterest have become part of our lives and play an important role in making communication easier and more accessible. Once text-centric, social media platforms are becoming increasingly multimodal, with users combining images, videos, audio, and text for better expressiveness. As social media posts become more multimodal, natural language understanding of the textual components of these messages becomes increasingly challenging. In fact, it is often the case that the textual component can only be understood in combination with the visual context of the message.

* This work was mostly done during the first author's internship at Snap Research.
[1] The Twitter data and associated images presented in this paper were downloaded from https://archive.org/details/twitterstream
[2] We will make the annotations on Twitter data available for research purposes upon request.

In this context, we study the task of Name Tagging for social media containing both image and textual contents. Name tagging is a key task for language understanding, and provides input to several other tasks such as Question Answering, Summarization, Search and Recommendation. Despite its importance, most of the research in name tagging has focused on news articles and longer text documents, and not as much on multimodal social media data (Baldwin et al., 2015).

However, multimodality is not the only challenge in performing name tagging on such data. The textual components of these messages are often very short, which limits the context around names. Moreover, linguistic variations, slang, typos and colloquial language are extremely common, such as using ‘looooove’ for ‘love’, ‘LosAngeles’ for ‘Los Angeles’, and ‘#Chicago #Bull’ for ‘Chicago Bulls’. These characteristics of social media data clearly illustrate the higher difficulty of this task compared to traditional newswire name tagging.

In this work, we modify and extend the current state-of-the-art name tagging model (Lample et al., 2016; Ma and Hovy, 2016) to incorporate the visual information of social media posts using an Attention mechanism. Although the usually short textual components of social media posts provide limited contextual information, the accompanying images often provide rich information that can be useful for name tagging. For example, as shown in Figure 1, both captions include the phrase ‘Modern Baseball’. It is not easy to tell whether each ‘Modern Baseball’ refers to a name from the textual evidence alone. However, using the associated images as reference, we can easily infer that ‘Modern Baseball’ in the first sentence is the name of a band, because of the implicit features from objects like instruments and a stage, and that ‘Modern Baseball’ in the second sentence refers to the sport of baseball, because of the pitcher in the image.

Figure 1: Examples of Modern Baseball associated with different images.

In this paper, given an image-sentence pair as input, we explore a new approach to leverage visual context for name tagging in text. First, we propose an attention-based model to extract visual features from the regions of the image that are most related to the text, ignoring irrelevant visual information. Second, we propose to use a gate to combine textual features extracted by a Bidirectional Long Short-Term Memory (BLSTM) with the extracted visual features, before feeding them into a Conditional Random Field (CRF) layer for tag prediction. The proposed gate architecture modulates word-level multimodal features.

We evaluate our model on two labeled datasets collected from Snapchat and Twitter respectively. Our experimental results show that the proposed model outperforms state-of-the-art name taggers on multimodal social media.

The main contributions of this work are as follows:

• We create two new datasets for name tagging in multimedia data, one using Twitter and the other using crowd-sourced Snapchat posts. These new datasets effectively constitute new benchmarks for the task.

• We propose a visual attention model specifically for name tagging in multimodal social media data. The proposed end-to-end model only uses image-sentence pairs as input, without any human-designed features, and includes a Visual Attention component that helps understand the decision making of the model.

2 Model

Figure 2 shows the overall architecture of our model. We describe the three main components of our model in this section: the BLSTM-CRF sequence labeling model (Section 2.1), the Visual Attention Model (Section 2.3) and the Modulation Gate (Section 2.4).

Given a pair of sentence and image as input, the Visual Attention Model extracts regional visual features from the image and computes the weighted sum of the regional visual features as the visual context vector, based on their relatedness with the sentence. The BLSTM-CRF sequence labeling model predicts the label for each word in the sentence based on both the visual context vector and the textual information of the words. The modulation gate controls the combination of the visual context vector and the word representations for each word before the CRF layer.

2.1 BLSTM-CRF Sequence Labeling

We model name tagging as a sequence labeling problem. Given a sequence of words S = {s1, s2, ..., sn}, we aim to predict a sequence of labels L = {l1, l2, ..., ln}, where li ∈ L and L is a pre-defined label set.

Bidirectional LSTM. Long Short-Term Memory networks (LSTMs) (Hochreiter and Schmidhuber, 1997) are variants of Recurrent Neural Networks (RNNs) designed to capture long-range dependencies of the input. The equations of an LSTM cell are as follows:

it = σ(Wxi xt + Whi ht−1 + bi)
ft = σ(Wxf xt + Whf ht−1 + bf)
c̃t = tanh(Wxc xt + Whc ht−1 + bc)
ct = ft ⊙ ct−1 + it ⊙ c̃t
ot = σ(Wxo xt + Who ht−1 + bo)
ht = ot ⊙ tanh(ct)

where xt, ct and ht are the input, memory and hidden state at time t respectively. Wxi, Whi, Wxf, Whf, Wxc, Whc, Wxo, and Who are weight matrices, ⊙ is the element-wise product, and σ is the element-wise sigmoid function.
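For concreteness, here is a minimal PyTorch sketch of a single LSTM step following these equations; it is illustrative only (in practice one would use torch.nn.LSTM), and it folds each pair of weight matrices W_x* and W_h* into one linear layer over the concatenated input [x_t; h_{t-1}].

```python
import torch
import torch.nn as nn

class LSTMCellSketch(nn.Module):
    """Single LSTM step mirroring the equations above (illustrative only)."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # One linear map per gate, each applied to [x_t; h_{t-1}].
        self.i_gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.f_gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.c_cand = nn.Linear(input_size + hidden_size, hidden_size)
        self.o_gate = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x_t, h_prev, c_prev):
        z = torch.cat([x_t, h_prev], dim=-1)
        i_t = torch.sigmoid(self.i_gate(z))       # input gate
        f_t = torch.sigmoid(self.f_gate(z))       # forget gate
        c_tilde = torch.tanh(self.c_cand(z))      # candidate memory
        c_t = f_t * c_prev + i_t * c_tilde        # new memory state
        o_t = torch.sigmoid(self.o_gate(z))       # output gate
        h_t = o_t * torch.tanh(c_t)               # new hidden state
        return h_t, c_t
```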

1992

Figure 2: Overall Architecture of the Visual Attention Name Tagging Model. (The figure illustrates the model on the example caption ‘Florence and the Machine surprises ill teen with private concert’: a CNN extracts visual features from the multimodal input for the Visual Attention Model, character representations and word embeddings feed a forward/backward LSTM, visual modulation gates combine the two modalities, and a CRF layer produces the tags B-PER I-PER I-PER I-PER for ‘Florence and the Machine’.)

Name tagging benefits from both the past (left) and the future (right) context, so we implement a Bidirectional LSTM (Graves et al., 2013; Dyer et al., 2015) by concatenating the left and right context representations, ht = [→ht, ←ht], for each word.

Character-level Representation. Following (Lample et al., 2016), we generate the character-level representation for each word using another BLSTM. It receives character embeddings as input and generates representations combining implicit prefix, suffix and spelling information. The final word representation xi is the concatenation of the word embedding ei and the character-level representation ci:

ci = BLSTMchar(si), si ∈ S
xi = [ei, ci]
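As an illustration of how xi = [ei, ci] can be assembled, a hedged PyTorch sketch follows; the embedding sizes mirror those reported later (100-dimensional word embeddings, character LSTM hidden size 50), but the module itself is an assumption rather than the released code.

```python
import torch
import torch.nn as nn

class WordRepresentation(nn.Module):
    """Concatenate a word embedding with a character-level BLSTM summary."""
    def __init__(self, vocab_size, char_vocab_size,
                 word_dim=100, char_dim=50, char_hidden=50):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.char_emb = nn.Embedding(char_vocab_size, char_dim)
        self.char_blstm = nn.LSTM(char_dim, char_hidden,
                                  batch_first=True, bidirectional=True)

    def forward(self, word_ids, char_ids):
        # word_ids: (num_words,)   char_ids: (num_words, max_chars)
        e = self.word_emb(word_ids)                       # (num_words, word_dim)
        _, (h_n, _) = self.char_blstm(self.char_emb(char_ids))
        # h_n: (2, num_words, char_hidden) -> concatenate forward/backward final states.
        c = torch.cat([h_n[0], h_n[1]], dim=-1)           # (num_words, 2*char_hidden)
        return torch.cat([e, c], dim=-1)                  # x_i = [e_i, c_i]
```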

Conditional Random Fields (CRFs). For name tagging, it is important to consider the constraints between neighboring labels (e.g., I-LOC must follow B-LOC). CRFs (Lafferty et al., 2001) are effective at learning those constraints and jointly predicting the best chain of labels. We follow the implementation of CRFs in (Ma and Hovy, 2016).

2.2 Visual Feature Representation

Figure 3: CNN for visual feature extraction.

We use Convolutional Neural Networks (CNNs) (LeCun et al., 1989) to obtain the representations of images. In particular, we use the Residual Network (ResNet) (He et al., 2016), which achieves state-of-the-art results on ImageNet (Russakovsky et al., 2015) detection, ImageNet localization, COCO (Lin et al., 2014) detection, and COCO segmentation. Given an input pair (S, I), where S represents the word sequence and I represents the image rescaled to 224x224 pixels, we use ResNet to extract visual features for regional areas as well as for the whole image (Figure 3):

Vg = ResNetg(I)

Vr = ResNetr(I)

where the global visual vector Vg, which represents the whole image, is the output before the last fully connected layer.[3] The dimension of Vg is 1,024. Vr contains the visual representations of the regional areas; they are extracted from the last convolutional layer of ResNet, and their dimension is 1,024x7x7, as shown in Figure 3. 7x7 is the number of regions in the image and 1,024 is the dimension of each feature vector. Thus each feature vector of Vr corresponds to a 32x32 pixel region of the rescaled input image.

[3] The last fully connected layer outputs the probabilities over 1,000 classes of objects.
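To make this concrete, a sketch using torchvision is shown below; note that the stock torchvision ResNet-152 produces 2,048-channel features at this stage, so the final linear projection down to the 1,024 dimensions quoted above is an assumption added here to match the stated size, not part of the paper.

```python
import torch
import torch.nn as nn
import torchvision.models as models

resnet = models.resnet152(pretrained=True)   # pretrained ResNet-152, as in the paper
resnet.eval()

# All layers up to (but excluding) the global average pool -> regional feature map.
conv_body = nn.Sequential(*list(resnet.children())[:-2])

with torch.no_grad():
    image = torch.randn(1, 3, 224, 224)        # a rescaled 224x224 RGB image (dummy here)
    feat_map = conv_body(image)                # (1, 2048, 7, 7): 7x7 grid of regional vectors
    v_r = feat_map.flatten(2).transpose(1, 2)  # (1, 49, 2048): one vector per region
    v_g = feat_map.mean(dim=[2, 3])            # (1, 2048): pooled output before the FC layer

# Assumed projection to the feature size reported in the paper.
proj = nn.Linear(2048, 1024)
v_r_1024, v_g_1024 = proj(v_r), proj(v_g)
```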

2.3 Visual Attention Model

Figure 4: Example of a partially related image and sentence. (‘I have just bought Jeremy Pied.’)

The global visual representation is a reasonable representation of the whole input image, but not the best. Sometimes only parts of the image are related to the associated sentence. For example, the visual features from the right part of the image in Figure 4 cannot contribute to inferring the information in the associated sentence ‘I have just bought Jeremy Pied.’ In this work we use a visual attention mechanism to address this problem. Attention has proven effective for vision-language tasks such as Image Captioning (Xu et al., 2015) and Visual Question Answering (Yang et al., 2016b; Lu et al., 2016), by forcing the model to focus on the regions of the image that are most related to the textual context while ignoring irrelevant regions. The visualization of attention also helps us understand the decision making of the model. An attention mechanism maps a query and a set of key-value pairs to an output. The output is a weighted sum of the values, and the weight assigned to each value is computed by a function of the query and the corresponding key. We encode the sentence into a query vector using an LSTM, and use the regional visual representations Vr as both keys and values.

Text Query Vector. We use an LSTM to encode the sentence into a query vector, in which the inputs of the LSTM are the concatenations of word embeddings and character-level word representations. Different from the LSTM model used for sequence labeling in Section 2.1, the LSTM here aims to capture the semantic information of the sentence and is unidirectional:

Q = LSTMquery(S) (1)

Attention Implementation. There are many implementations of the visual attention mechanism, such as Multi-Layer Perceptron (Bahdanau et al., 2014), Bilinear (Luong et al., 2015), Dot Product (Luong et al., 2015), Scaled Dot Product (Vaswani et al., 2017), and linear projection after summation (Yang et al., 2016b). Based on our experimental results, dot product implementations usually result in more concentrated attention, while linear projection after summation results in more dispersed attention. In the context of name tagging, we choose linear projection after summation because it is beneficial for the model to utilize as many related visual features as possible, whereas concentrated attention may bias the model. For the implementation, we first project the text query vector Q and the regional visual features Vr into the same dimensions:

Pt = tanh(WtQ)

Pv = tanh(WvVr)

Then we sum up the projected query vector with each projected regional visual vector respectively:

A = Pt ⊕ Pv

and compute the weights of the regional visual vectors:

E = softmax(Wa A + ba)

where Wa is a weight matrix. The weighted sum of the regional visual features is:

vc = Σi αi vi, where αi ∈ E and vi ∈ Vr.

We use vc as the visual context vector to initialize the BLSTM sequence labeling model in Section 2.1. We compare the performance of models using the global visual vector Vg and the attention-based visual context vector vc for initialization in Section 4.
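A minimal sketch of this ‘linear projection after summation’ attention in PyTorch follows; the projection size of 100 loosely follows the ‘visual vector size’ in Table 2, but that mapping, the names, and the shapes are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class VisualAttention(nn.Module):
    """Score each image region against a sentence query vector and pool them."""
    def __init__(self, query_dim, region_dim, proj_dim=100):
        super().__init__()
        self.W_t = nn.Linear(query_dim, proj_dim)   # project text query
        self.W_v = nn.Linear(region_dim, proj_dim)  # project regional features
        self.W_a = nn.Linear(proj_dim, 1)           # scalar score per region

    def forward(self, query, regions):
        # query: (batch, query_dim); regions: (batch, 49, region_dim)
        p_t = torch.tanh(self.W_t(query)).unsqueeze(1)        # (batch, 1, proj_dim)
        p_v = torch.tanh(self.W_v(regions))                   # (batch, 49, proj_dim)
        a = p_t + p_v                                         # summation, broadcast over regions
        e = torch.softmax(self.W_a(a).squeeze(-1), dim=-1)    # attention weights alpha_i
        v_c = torch.bmm(e.unsqueeze(1), regions).squeeze(1)   # weighted sum of regional features
        return v_c, e
```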

2.4 Visual Modulation Gate

The BLSTM-CRF sequence labeling model benefits from using the visual context vector to initialize the LSTM cell. However, a better way to utilize visual features for sequence labeling is to incorporate the features at the word level individually. Visual features contribute quite differently when they are used to infer the tags of different words. For example, we can easily find matched visual patterns in associated images for verbs such as ‘sing’, ‘run’, and ‘play’. Words and phrases such as the names of basketball players, artists, and buildings are often well aligned with objects in images. However, it is difficult to align function words such as ‘the’, ‘of’ and ‘well’ with visual features. Fortunately, most of the challenging cases in name tagging involve nouns and verbs, the disambiguation of which can benefit more from visual features.

We propose to use a visual modulation gate, similar to (Miyamoto and Cho, 2016; Yang et al., 2016a), to dynamically control the combination of the visual features and the word representation generated by the BLSTM at the word level, before feeding them into the CRF layer for tag prediction. The equations for the modulation gate are as follows:

βv = σ(Wv hi + Uv vc + bv)
βw = σ(Ww hi + Uw vc + bw)
m = tanh(Wm hi + Um vc + bm)
wm = βw · hi + βv · m

where hi is the word representation generated by the BLSTM, vc is the computed visual context vector, Wv, Ww, Wm, Uv, Uw and Um are weight matrices, σ is the element-wise sigmoid function, and wm is the modulated word representation fed into the CRF layer in Section 2.1. We conduct experiments to evaluate the impact of the modulation gate in Section 4.
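A sketch of how such a gate could be written in PyTorch, mirroring the equations above: each linear layer over the concatenation [hi; vc] stands for the corresponding W·hi + U·vc + b term, and the module is an illustrative reimplementation, not the released code.

```python
import torch
import torch.nn as nn

class ModulationGate(nn.Module):
    """Gate the mix of a BLSTM word representation h_i and the visual context v_c."""
    def __init__(self, word_dim, visual_dim):
        super().__init__()
        self.beta_v = nn.Linear(word_dim + visual_dim, word_dim)  # visual gate
        self.beta_w = nn.Linear(word_dim + visual_dim, word_dim)  # word gate
        self.m_proj = nn.Linear(word_dim + visual_dim, word_dim)  # fused candidate m

    def forward(self, h_i, v_c):
        z = torch.cat([h_i, v_c], dim=-1)
        b_v = torch.sigmoid(self.beta_v(z))
        b_w = torch.sigmoid(self.beta_w(z))
        m = torch.tanh(self.m_proj(z))
        # Modulated word representation w_m fed to the CRF layer.
        return b_w * h_i + b_v * m
```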

3 Datasets

We evaluate our model on two multimodal datasets, collected from Twitter and Snapchat respectively. Table 1 summarizes the data statistics. Both datasets contain four types of named entities: Location, Person, Organization and Miscellaneous. Each data instance contains a sentence-image pair, and the names in the sentences are manually tagged by three expert labelers.

Twitter name tagging. The Twitter name tagging dataset contains pairs of tweets and their associated images extracted from May 2016, January 2017 and June 2017. We use sports and social event related keywords, such as concert, festival, soccer and basketball, as queries. We do not consider messages without images for this experiment. If a tweet has more than one associated image, we randomly select one of them.

Snap name tagging. The Snap name tagging dataset consists of caption-image pairs exclusively extracted from snaps submitted to public and live stories. They were collected between May and July of 2017. The data contains captions submitted to multiple community-curated stories like the Electric Daisy Carnival (EDC) music festival and the Golden State Warriors' NBA parade.

Both Twitter and Snapchat are social media platforms with plenty of multimodal posts, but they differ noticeably in sentence length and image style. On Twitter, text plays a more important role, and the sentences in the Twitter dataset are much longer than those in the Snap dataset (16.0 tokens vs. 8.1 tokens). The image is often more related to the content of the text and is added with the purpose of illustrating or giving more context. On the other hand, as users of Snapchat use cameras to communicate, the roles of text and image are switched: captions are often added to complement what is being portrayed by the snap. In the experiment section we show that our proposed model outperforms the baseline on both datasets.

We believe the Twitter dataset can be an important step towards more research in multimodal name tagging and we plan to provide it as a benchmark upon request.

4 Experiment

4.1 Training

Tokenization. To tokenize the sentences, we use the same rules as (Owoputi et al., 2013), except that we separate the hashtag ‘#’ from the words after it.

Labeling Schema. We use the standard BIO schema (Sang and Veenstra, 1999), because we see little difference when we switch to the BIOES schema (Ratinov and Roth, 2009).

Word embeddings. We use the 100-dimensional GloVe[4] (Pennington et al., 2014) embeddings trained on 2 billion tweets to initialize the lookup table, and fine-tune them during training.

Character embeddings. As in (Lample et al., 2016), we randomly initialize the character embeddings with uniform samples. Based on experimental results, the size of the character embeddings has little effect, and we set it to 50.

[4] https://nlp.stanford.edu/projects/glove/
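To make the BIO schema concrete, here is a small hypothetical helper (not part of the authors' pipeline) that converts token-level entity spans into BIO labels, using the example caption from Figure 2:

```python
def bio_tags(tokens, spans):
    """Convert entity spans [(start, end, type), ...] to BIO labels (end is exclusive)."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = "B-" + etype            # first token of the name
        for i in range(start + 1, end):
            tags[i] = "I-" + etype            # remaining tokens of the name
    return tags

tokens = ["Florence", "and", "the", "Machine", "surprises", "ill", "teen"]
print(bio_tags(tokens, [(0, 4, "PER")]))
# ['B-PER', 'I-PER', 'I-PER', 'I-PER', 'O', 'O', 'O']
```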


                         Training   Development   Testing
Snapchat   Sentences        4,817         1,032     1,033
           Tokens          39,035         8,334     8,110
Twitter    Sentences        4,290         1,432     1,459
           Tokens          68,655        22,872    23,051

Table 1: Sizes of the datasets in numbers of sentences and tokens.

Pretrained CNNs. We use the pretrained ResNet-152 (He et al., 2016) from PyTorch.

Early Stopping. We use early stopping (Caruana et al., 2001; Graves et al., 2013) with a patience of 15 to prevent the model from over-fitting.

Fine Tuning. The models are optimized by fine-tuning both the word embeddings and the pretrained ResNet.

Optimization. The models achieve the best performance using mini-batch stochastic gradient descent (SGD) with batch size 20 and momentum 0.9 on both datasets. We set an initial learning rate of η0 = 0.03 with a decay rate of ρ = 0.01. We use gradient clipping of 5.0 to reduce the effects of gradient explosion.

Hyper-parameters. We summarize the hyper-parameters in Table 2.

Hyper-parameter                 Value
LSTM hidden state size            300
Char LSTM hidden state size        50
Visual vector size                100
Dropout rate                      0.5

Table 2: Hyper-parameters of the networks.
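As an illustration of the optimization setup described above, a hedged PyTorch sketch follows; the decay schedule η_t = η_0 / (1 + ρ·t) is an assumption (the paper only reports η_0 and ρ), and the tiny linear model is a stand-in for the full network.

```python
import torch
import torch.nn as nn

# Placeholder network standing in for the full BLSTM-CRF + attention + gate model.
model = nn.Linear(100, 9)

eta0, rho, clip_norm = 0.03, 0.01, 5.0
optimizer = torch.optim.SGD(model.parameters(), lr=eta0, momentum=0.9)

for epoch in range(5):
    # Assumed schedule: eta_t = eta_0 / (1 + rho * epoch).
    for group in optimizer.param_groups:
        group["lr"] = eta0 / (1.0 + rho * epoch)

    x, y = torch.randn(20, 100), torch.randint(0, 9, (20,))   # dummy mini-batch of size 20
    loss = nn.functional.cross_entropy(model(x), y)

    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), clip_norm)   # gradient clipping at 5.0
    optimizer.step()
```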

4.2 Results

Table 3 shows the performance of the baseline, a BLSTM-CRF with sentences as the only input, and of our proposed models on both datasets:

BLSTM-CRF + Global Image Vector: use the global image vector to initialize the BLSTM-CRF.
BLSTM-CRF + Visual attention: use the attention-based visual context vector to initialize the BLSTM-CRF.
BLSTM-CRF + Visual attention + Gate: additionally modulate the word representations with the visual vector.

Model                                    Snap Captions              Twitter
                                      Precision Recall    F1   Precision Recall    F1
BLSTM-CRF                                 57.71  58.65  58.18      78.88  77.47  78.17
BLSTM-CRF + Global Image Vector           61.49  57.84  59.61      79.75  77.32  78.51
BLSTM-CRF + Visual attention              65.53  57.03  60.98      80.81  77.36  79.05
BLSTM-CRF + Visual attention + Gate       66.67  57.84  61.94      81.62  79.90  80.75

Table 3: Results of our models on noisy social media data.

Our final model, BLSTM-CRF + Visual Attention + Gate, which has both the visual attention component and the modulation gate, obtains the best F1 scores on both datasets. Visual features successfully play the role of validating entity types. For example, when there is a person in the image, it is more likely that a person name appears in the associated sentence, but when there is a soccer field in the image, it is more likely that a sports team name appears.

All the models obtain better scores on the Twitter dataset than on the Snap dataset, because the average sentence length in the Snap dataset (8.1 tokens) is much smaller than that of the Twitter dataset (16.0 tokens), which means there is much less contextual information in the Snap dataset.

Comparing the gains from visual features on the two datasets, we find that the model benefits more from visual features on the Twitter dataset, considering its much higher baseline scores. Based on our observation, users of Snapchat often post selfies with captions, which means some of the images are not strongly related to their associated captions. In contrast, users of Twitter prefer to post images to illustrate their text.

4.3 Attention Visualization

Figure 5 shows some good examples of the attention visualization and their corresponding name tagging results. The model can successfully focus on appropriate regions when the images are well aligned with the associated sentences. Based on our observation, the multimodal contexts in posts related to sports, concerts or festivals are usually better aligned with each other, therefore the visual features easily contribute to these cases. For example, the ball and the shooting action in example (a) in Figure 5 indicate that the context should be related to basketball, thus ‘Warriors’ should be the name of a sports team. A singing person with a microphone in example (b) indicates that the name of an artist or a band (‘Radiohead’) may appear in the sentence.

The second and the third rows in Figure 5 show some more challenging cases whose tagging results benefit from visual features. In example (d), the model pays attention to the big Apple logo and thus tags ‘Apple’ in the sentence as an Organization name. In examples (e) and (i), a small group of people indicates that names of bands are likely to appear (‘Florence and the Machine’ and ‘BTS’), and a crowd can indicate an organization (‘Warriorette’ in example (i)). A jersey shirt on the table indicates a sports team (‘Leicester’ in example (h) can refer to both a city and a soccer club based in it).

Figure 5: Examples of visual attentions and NER outputs.

4.4 Error Analysis

Figure 6 shows some failed examples, which fall into three categories: (1) bad alignment between visual and textual information; (2) blurry images; (3) wrong attention made by the model.

Name tagging greatly benefits from visual features when the sentences are well aligned with the associated images, as we show in Section 4.3, but this is not always the case in social media. Example (a) in Figure 6 shows a failure caused by poor alignment between the sentence and the image. In the image, there are two bins standing in front of a wall, but the sentence talks about basketball players. The unrelated visual information makes the model tag ‘Cleveland’ as a Location, although it refers to the basketball team ‘Cleveland Cavaliers’.

The image in example (b) is blurry, so the extracted visual information actually introduces noise instead of additional information. The image in example (c) is about a baseball pitcher, but our model pays attention to the top right corner of the image. The visual context feature computed by our model is not related to the sentence, which results in the missed tagging of ‘SBU’, an organization name.

Figure 6: Examples of Failed Visual Attention. (a) ‘Nice image of [PER Kevin Love] and [PER Kyle Korver] during 1st half #NBAFinals #Cavsin9 #[LOC Cleveland]’; (b) ‘Very drunk in a #magnum concert’; (c) ‘Looking forward to editing some SBU baseball shots from Saturday.’

5 Related Work

In this section, we summarize relevant background on previous work on name tagging and visual attention.

Name Tagging. In recent years, (Chiu and Nichols, 2015; Lample et al., 2016; Ma and Hovy, 2016) proposed several neural network architectures for name tagging that outperform traditional explicit feature based methods (Chieu and Ng, 2002; Florian et al., 2003; Ando and Zhang, 2005; Ratinov and Roth, 2009; Lin and Wu, 2009; Passos et al., 2014; Luo et al., 2015). They all use a Bidirectional LSTM (BLSTM) to extract features from a sequence of words. For character-level representations, (Lample et al., 2016) proposed to use another BLSTM to capture prefix and suffix information of words, while (Chiu and Nichols, 2015; Ma and Hovy, 2016) used a CNN to extract position-independent character features. On top of the BLSTM, (Chiu and Nichols, 2015) used a softmax layer to predict the label for each word, and (Lample et al., 2016; Ma and Hovy, 2016) used a CRF layer for joint prediction. Compared with traditional approaches, neural network based approaches do not require hand-crafted features and achieve state-of-the-art performance on name tagging (Ma and Hovy, 2016). However, these methods were mainly developed for newswire and paid little attention to social media. For name tagging in social media, (Ritter et al., 2011) leveraged a large amount of unlabeled data and many dictionaries in a pipeline model, (Limsopatham and Collier, 2016) adapted the BLSTM-CRF model with additional word shape information, and (Aguilar et al., 2017) utilized an effective multi-task approach. Among these methods, our model is most similar to (Lample et al., 2016), but we design a new visual attention component and a modulation control gate.

Visual Attention. Since the attention mechanism was proposed by (Bahdanau et al., 2014), it has been widely adopted for language and vision related tasks, such as Image Captioning and Visual Question Answering (VQA), by retrieving the visual features most related to the text context (Zhu et al., 2016; Anderson et al., 2017; Xu and Saenko, 2016; Chen et al., 2015). (Xu et al., 2015) proposed to predict a word based on the visual patch that is most related to the last predicted word for image captioning. (Yang et al., 2016b; Lu et al., 2016) applied the attention mechanism to VQA, to find the regions in images that are most related to the questions. (Yu et al., 2016) applied the visual attention mechanism to video captioning. Our attention implementation in this work is similar to those used for VQA: the model finds the regions in images that are most related to the accompanying sentences, and then feeds the visual features into a BLSTM-CRF sequence labeling model. The differences are that (1) we add the visual context feature at each step of sequence labeling; and (2) we propose to use a gate to control the combination of the visual and textual information based on their relatedness.

6 Conclusions and Future Work

We propose a gated Visual Attention model for name tagging in multimodal social media. We construct two multimodal datasets from Twitter and Snapchat. Experiments show an absolute 3%-4% F-score gain. We hope this work will encourage more research on multimodal social media in the future, and we plan on making our benchmark available upon request.

Name tagging for more fine-grained types (e.g., soccer team, basketball team, politician, artist) can benefit even more from visual features. For example, an image including a pitcher indicates that ‘Giants’ in the context should refer to the baseball team ‘San Francisco Giants’. We plan to extend our model to tasks such as fine-grained name tagging or Entity Linking in the future.

Acknowledgments

This work was partially supported by the U.S. DARPA AIDA Program No. FA8750-18-2-0014 and U.S. ARL NS-CTA No. W911NF-09-2-0053. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

References

Gustavo Aguilar, Suraj Maharjan, Adrian Pastor Lopez Monroy, and Thamar Solorio. 2017. A multi-task approach for named entity recognition in social media data. In Proceedings of the 3rd Workshop on Noisy User-generated Text.

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2017. Bottom-up and top-down attention for image captioning and VQA. arXiv preprint arXiv:1707.07998.

Rie Kubota Ando and Tong Zhang. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. In Proceedings of the 2015 International Conference on Learning Representations.

Timothy Baldwin, Marie-Catherine de Marneffe, Bo Han, Young-Bum Kim, Alan Ritter, and Wei Xu. 2015. Shared tasks of the 2015 workshop on noisy user-generated text: Twitter lexical normalization and named entity recognition. In Proceedings of the Workshop on Noisy User-generated Text.

Rich Caruana, Steve Lawrence, and C. Lee Giles. 2001. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Proceedings of the 2001 Advances in Neural Information Processing Systems.

Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, and Ram Nevatia. 2015. ABC-CNN: An attention based convolutional neural network for visual question answering. arXiv preprint arXiv:1511.05960.

Hai Leong Chieu and Hwee Tou Ng. 2002. Named entity recognition: a maximum entropy approach using global information. In Proceedings of the 19th International Conference on Computational Linguistics.

Jason P.C. Chiu and Eric Nichols. 2015. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing.

Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang. 2003. Named entity recognition through classifier combination. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL.

Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.

John Lafferty, Andrew McCallum, and Fernando C.N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Yann LeCun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne Hubbard, and Lawrence D. Jackel. 1989. Backpropagation applied to handwritten zip code recognition. Neural Computation.

Nut Limsopatham and Nigel Henry Collier. 2016. Bidirectional LSTM for named entity recognition in Twitter messages. In Proceedings of the 2nd Workshop on Noisy User-generated Text.

Dekang Lin and Xiaoyun Wu. 2009. Phrase clustering for discriminative learning. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Proceedings of the 2014 European Conference on Computer Vision.

Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. In Proceedings of the 2016 Advances in Neural Information Processing Systems.

Gang Luo, Xiaojiang Huang, Chin-Yew Lin, and Zaiqing Nie. 2015. Joint entity recognition and disambiguation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Yasumasa Miyamoto and Kyunghyun Cho. 2016. Gated word-character recurrent language model. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.

Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Alexandre Passos, Vineet Kumar, and Andrew McCallum. 2014. Lexicon infused phrase embeddings for named entity resolution. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.

Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning.

Alan Ritter, Sam Clark, Oren Etzioni, et al. 2011. Named entity recognition in tweets: an experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision.

Erik F. Sang and Jorn Veenstra. 1999. Representing text chunks. In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 2017 Advances in Neural Information Processing Systems.

Huijuan Xu and Kate Saenko. 2016. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In Proceedings of the 2016 European Conference on Computer Vision.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 2015 International Conference on Machine Learning.

Zhilin Yang, Bhuwan Dhingra, Ye Yuan, Junjie Hu, William W. Cohen, and Ruslan Salakhutdinov. 2016a. Words or characters? Fine-grained gating for reading comprehension. In Proceedings of the 2016 International Conference on Learning Representations.

Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. 2016b. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, and Wei Xu. 2016. Video paragraph captioning using hierarchical recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. 2016. Visual7W: Grounded question answering in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.