
Deep contextualized word representations

Matthew E. Peters†, Mark Neumann†, Mohit Iyyer†, Matt Gardner†
{matthewp,markn,mohiti,mattg}@allenai.org

Christopher Clark∗, Kenton Lee∗, Luke Zettlemoyer†∗
{csquared,kentonl,lsz}@cs.washington.edu

†Allen Institute for Artificial Intelligence
∗Paul G. Allen School of Computer Science & Engineering, University of Washington

Abstract

We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis. We also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.

1 Introduction

Pre-trained word representations (Mikolov et al., 2013; Pennington et al., 2014) are a key component in many neural language understanding models. However, learning high quality representations can be challenging. They should ideally model both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). In this paper, we introduce a new type of deep contextualized word representation that directly addresses both challenges, can be easily integrated into existing models, and significantly improves the state of the art in every considered case across a range of challenging language understanding problems.

Our representations differ from traditional word type embeddings in that each token is assigned a representation that is a function of the entire input sentence. We use vectors derived from a bidirectional LSTM that is trained with a coupled language model (LM) objective on a large text corpus. For this reason, we call them ELMo (Embeddings from Language Models) representations. Unlike previous approaches for learning contextualized word vectors (Peters et al., 2017; McCann et al., 2017), ELMo representations are deep, in the sense that they are a function of all of the internal layers of the biLM. More specifically, we learn a linear combination of the vectors stacked above each input word for each end task, which markedly improves performance over just using the top LSTM layer.

Combining the internal states in this manner allows for very rich word representations. Using intrinsic evaluations, we show that the higher-level LSTM states capture context-dependent aspects of word meaning (e.g., they can be used without modification to perform well on supervised word sense disambiguation tasks) while lower-level states model aspects of syntax (e.g., they can be used to do part-of-speech tagging). Simultaneously exposing all of these signals is highly beneficial, allowing the learned models to select the types of semi-supervision that are most useful for each end task.

Extensive experiments demonstrate that ELMo representations work extremely well in practice. We first show that they can be easily added to existing models for six diverse and challenging language understanding problems, including textual entailment, question answering and sentiment analysis. The addition of ELMo representations alone significantly improves the state of the art in every case, including up to 20% relative error reductions. For tasks where direct comparisons are possible, ELMo outperforms CoVe (McCann et al., 2017), which computes contextualized representations using a neural machine translation encoder. Finally, an analysis of both ELMo and CoVe reveals that deep representations outperform those derived from just the top layer of an LSTM.


Our trained models and code are publicly available, and we expect that ELMo will provide similar gains for many other NLP problems.1

2 Related work

Due to their ability to capture syntactic and semantic information of words from large scale unlabeled text, pretrained word vectors (Turian et al., 2010; Mikolov et al., 2013; Pennington et al., 2014) are a standard component of most state-of-the-art NLP architectures, including for question answering (Liu et al., 2017), textual entailment (Chen et al., 2017) and semantic role labeling (He et al., 2017). However, these approaches for learning word vectors only allow a single context-independent representation for each word.

Previously proposed methods overcome some of the shortcomings of traditional word vectors by either enriching them with subword information (e.g., Wieting et al., 2016; Bojanowski et al., 2017) or learning separate vectors for each word sense (e.g., Neelakantan et al., 2014). Our approach also benefits from subword units through the use of character convolutions, and we seamlessly incorporate multi-sense information into downstream tasks without explicitly training to predict predefined sense classes.

Other recent work has also focused on learning context-dependent representations. context2vec (Melamud et al., 2016) uses a bidirectional Long Short Term Memory (LSTM; Hochreiter and Schmidhuber, 1997) to encode the context around a pivot word. Other approaches for learning contextual embeddings include the pivot word itself in the representation and are computed with the encoder of either a supervised neural machine translation (MT) system (CoVe; McCann et al., 2017) or an unsupervised language model (Peters et al., 2017). Both of these approaches benefit from large datasets, although the MT approach is limited by the size of parallel corpora. In this paper, we take full advantage of access to plentiful monolingual data, and train our biLM on a corpus with approximately 30 million sentences (Chelba et al., 2014). We also generalize these approaches to deep contextual representations, which we show work well across a broad range of diverse NLP tasks.

1 http://allennlp.org/elmo

Previous work has also shown that different layers of deep biRNNs encode different types of information. For example, introducing multi-task syntactic supervision (e.g., part-of-speech tags) at the lower levels of a deep LSTM can improve overall performance of higher level tasks such as dependency parsing (Hashimoto et al., 2017) or CCG super tagging (Søgaard and Goldberg, 2016). In an RNN-based encoder-decoder machine translation system, Belinkov et al. (2017) showed that the representations learned at the first layer in a 2-layer LSTM encoder are better at predicting POS tags than the second layer. Finally, the top layer of an LSTM for encoding word context (Melamud et al., 2016) has been shown to learn representations of word sense. We show that similar signals are also induced by the modified language model objective of our ELMo representations, and it can be very beneficial to learn models for downstream tasks that mix these different types of semi-supervision.

Dai and Le (2015) and Ramachandran et al. (2017) pretrain encoder-decoder pairs using language models and sequence autoencoders and then fine tune with task specific supervision. In contrast, after pretraining the biLM with unlabeled data, we fix the weights and add additional task-specific model capacity, allowing us to leverage large, rich and universal biLM representations for cases where downstream training data size dictates a smaller supervised model.

3 ELMo: Embeddings from Language Models

Unlike most widely used word embeddings (Pennington et al., 2014), ELMo word representations are functions of the entire input sentence, as described in this section. They are computed on top of two-layer biLMs with character convolutions (Sec. 3.1), as a linear function of the internal network states (Sec. 3.2). This setup allows us to do semi-supervised learning, where the biLM is pretrained at a large scale (Sec. 3.4) and easily incorporated into a wide range of existing neural NLP architectures (Sec. 3.3).

3.1 Bidirectional language models

Given a sequence of N tokens, $(t_1, t_2, \ldots, t_N)$, a forward language model computes the probability of the sequence by modeling the probability of token $t_k$ given the history $(t_1, \ldots, t_{k-1})$:

$$p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \ldots, t_{k-1}).$$

Recent state-of-the-art neural language models (Jozefowicz et al., 2016; Melis et al., 2017; Merity et al., 2017) compute a context-independent token representation $\mathbf{x}_k^{LM}$ (via token embeddings or a CNN over characters), then pass it through $L$ layers of forward LSTMs. At each position $k$, each LSTM layer outputs a context-dependent representation $\overrightarrow{\mathbf{h}}_{k,j}^{LM}$ where $j = 1, \ldots, L$. The top layer LSTM output, $\overrightarrow{\mathbf{h}}_{k,L}^{LM}$, is used to predict the next token $t_{k+1}$ with a Softmax layer.

A backward LM is similar to a forward LM, except it runs over the sequence in reverse, predicting the previous token given the future context:

$$p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \ldots, t_N).$$

It can be implemented in an analogous way to a forward LM, with each backward LSTM layer $j$ in an $L$-layer deep model producing representations $\overleftarrow{\mathbf{h}}_{k,j}^{LM}$ of $t_k$ given $(t_{k+1}, \ldots, t_N)$.

A biLM combines both a forward and backward LM. Our formulation jointly maximizes the log likelihood of the forward and backward directions:

$$\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \ldots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big).$$

We tie the parameters for both the token representation ($\Theta_x$) and Softmax layer ($\Theta_s$) in the forward and backward direction while maintaining separate parameters for the LSTMs in each direction. Overall, this formulation is similar to the approach of Peters et al. (2017), with the exception that we share some weights between directions instead of using completely independent parameters. In the next section, we depart from previous work by introducing a new approach for learning word representations that are a linear combination of the biLM layers.
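The joint objective is easy to express in code. Below is a minimal PyTorch-style sketch (not the authors' released implementation) of the tied biLM loss, assuming the forward and backward LSTM stacks have already produced their top-layer states; all names are illustrative.

```python
import torch
import torch.nn as nn

class BiLMLoss(nn.Module):
    """Sketch of the joint biLM objective: the Softmax parameters (Theta_s)
    are shared between directions, the LSTM parameters are not."""
    def __init__(self, vocab_size: int, hidden_dim: int):
        super().__init__()
        self.softmax = nn.Linear(hidden_dim, vocab_size)  # tied output layer
        self.nll = nn.CrossEntropyLoss()

    def forward(self, fwd_top, bwd_top, tokens):
        # fwd_top, bwd_top: (batch, N, hidden_dim) top-layer LSTM states
        # tokens: (batch, N) token ids t_1 ... t_N
        # the forward LM predicts t_{k+1} from the state at position k
        fwd_logits = self.softmax(fwd_top[:, :-1])
        fwd_loss = self.nll(fwd_logits.flatten(0, 1), tokens[:, 1:].flatten())
        # the backward LM predicts t_{k-1} from the state at position k
        bwd_logits = self.softmax(bwd_top[:, 1:])
        bwd_loss = self.nll(bwd_logits.flatten(0, 1), tokens[:, :-1].flatten())
        # maximizing both log likelihoods = minimizing the summed NLL
        return fwd_loss + bwd_loss
```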

3.2 ELMo

ELMo is a task-specific combination of the intermediate layer representations in the biLM. For each token $t_k$, an $L$-layer biLM computes a set of $2L + 1$ representations

$$R_k = \{\mathbf{x}_k^{LM}, \overrightarrow{\mathbf{h}}_{k,j}^{LM}, \overleftarrow{\mathbf{h}}_{k,j}^{LM} \mid j = 1, \ldots, L\} = \{\mathbf{h}_{k,j}^{LM} \mid j = 0, \ldots, L\},$$

where $\mathbf{h}_{k,0}^{LM}$ is the token layer and $\mathbf{h}_{k,j}^{LM} = [\overrightarrow{\mathbf{h}}_{k,j}^{LM}; \overleftarrow{\mathbf{h}}_{k,j}^{LM}]$ for each biLSTM layer.

For inclusion in a downstream model, ELMo collapses all layers in $R$ into a single vector, $\mathbf{ELMo}_k = E(R_k; \Theta_e)$. In the simplest case, ELMo just selects the top layer, $E(R_k) = \mathbf{h}_{k,L}^{LM}$, as in TagLM (Peters et al., 2017) and CoVe (McCann et al., 2017). More generally, we compute a task-specific weighting of all biLM layers:

$$\mathbf{ELMo}_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task} \mathbf{h}_{k,j}^{LM}. \qquad (1)$$

In (1), $\mathbf{s}^{task}$ are softmax-normalized weights and the scalar parameter $\gamma^{task}$ allows the task model to scale the entire ELMo vector. $\gamma$ is of practical importance to aid the optimization process (see supplemental material for details). Considering that the activations of each biLM layer have a different distribution, in some cases it also helped to apply layer normalization (Ba et al., 2016) to each biLM layer before weighting.
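As an illustration of Equation (1), here is a minimal PyTorch sketch of the task-specific layer weighting; the class name and interface are assumptions for exposition, not the released API.

```python
import torch
import torch.nn as nn

class ELMoScalarMix(nn.Module):
    """Collapse the L+1 biLM layers into one vector per token, as in Eq. (1):
    softmax-normalized layer weights s^task and a global scale gamma^task."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.scalars = nn.Parameter(torch.zeros(num_layers))  # pre-softmax s^task
        self.gamma = nn.Parameter(torch.ones(1))               # gamma^task

    def forward(self, layers):
        # layers: list of L+1 tensors, each (batch, tokens, 2 * lstm_dim);
        # layer 0 is the token representation x_k.
        weights = torch.softmax(self.scalars, dim=0)
        mixed = sum(w * h for w, h in zip(weights, layers))
        return self.gamma * mixed

# For the L = 2 biLM used in this paper: mix = ELMoScalarMix(num_layers=3)
```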

3.3 Using biLMs for supervised NLP tasks

Given a pre-trained biLM and a supervised architecture for a target NLP task, it is a simple process to use the biLM to improve the task model. We simply run the biLM and record all of the layer representations for each word. Then, we let the end task model learn a linear combination of these representations, as described below.

First consider the lowest layers of the supervised model without the biLM. Most supervised NLP models share a common architecture at the lowest layers, allowing us to add ELMo in a consistent, unified manner. Given a sequence of tokens $(t_1, \ldots, t_N)$, it is standard to form a context-independent token representation $\mathbf{x}_k$ for each token position using pre-trained word embeddings and optionally character-based representations. Then, the model forms a context-sensitive representation $\mathbf{h}_k$, typically using either bidirectional RNNs, CNNs, or feed forward networks.

To add ELMo to the supervised model, we first freeze the weights of the biLM and then concatenate the ELMo vector $\mathbf{ELMo}_k^{task}$ with $\mathbf{x}_k$ and pass the ELMo enhanced representation $[\mathbf{x}_k; \mathbf{ELMo}_k^{task}]$ into the task RNN. For some tasks (e.g., SNLI, SQuAD), we observe further improvements by also including ELMo at the output of the task RNN by introducing another set of output-specific linear weights and replacing $\mathbf{h}_k$ with $[\mathbf{h}_k; \mathbf{ELMo}_k^{task}]$. As the remainder of the supervised model remains unchanged, these additions can happen within the context of more complex neural models. For example, see the SNLI experiments in Sec. 4 where a bi-attention layer follows the biLSTMs, or the coreference resolution experiments where a clustering model is layered on top of the biLSTMs.

Finally, we found it beneficial to add a moderate amount of dropout to ELMo (Srivastava et al., 2014) and in some cases to regularize the ELMo weights by adding $\lambda \|\mathbf{w}\|_2^2$ to the loss. This imposes an inductive bias on the ELMo weights to stay close to an average of all biLM layers.
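Concretely, wiring ELMo into a supervised model reduces to concatenations around the task RNN. The following is a hedged PyTorch sketch, assuming the task-specific ELMo vectors have already been computed (e.g., with a scalar-mix module like the one sketched in Sec. 3.2); the 0.5 dropout rate is an illustrative value, and the $\lambda \|\mathbf{w}\|_2^2$ penalty on the mix weights would be added to the task loss separately.

```python
import torch
import torch.nn as nn

class TaskEncoderWithELMo(nn.Module):
    """Sketch: the (frozen, already-mixed) ELMo vector is concatenated with the
    context-independent embedding x_k before the task RNN; optionally a second
    ELMo vector is concatenated to the RNN output h_k."""
    def __init__(self, word_dim, elmo_dim, hidden_dim, use_output_elmo=False):
        super().__init__()
        self.use_output_elmo = use_output_elmo
        self.rnn = nn.LSTM(word_dim + elmo_dim, hidden_dim,
                           batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.5)  # illustrative; the paper only says "moderate"

    def forward(self, word_embs, elmo_in, elmo_out=None):
        # word_embs: (batch, tokens, word_dim); elmo_in/elmo_out: ELMo^task vectors
        h, _ = self.rnn(torch.cat([word_embs, self.dropout(elmo_in)], dim=-1))
        if self.use_output_elmo and elmo_out is not None:
            h = torch.cat([h, self.dropout(elmo_out)], dim=-1)
        return h
```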

3.4 Pre-trained bidirectional language model architecture

The pre-trained biLMs in this paper are similar to the architectures in Jozefowicz et al. (2016) and Kim et al. (2015), but modified to support joint training of both directions and to add a residual connection between LSTM layers. We focus on large scale biLMs in this work, as Peters et al. (2017) highlighted the importance of using biLMs over forward-only LMs and large scale training.

To balance overall language model perplexity with model size and computational requirements for downstream tasks while maintaining a purely character-based input representation, we halved all embedding and hidden dimensions from the single best model CNN-BIG-LSTM in Jozefowicz et al. (2016). The final model uses L = 2 biLSTM layers with 4096 units and 512 dimension projections and a residual connection from the first to second layer. The context insensitive type representation uses 2048 character n-gram convolutional filters followed by two highway layers (Srivastava et al., 2015) and a linear projection down to a 512-dimensional representation. As a result, the biLM provides three layers of representations for each input token, including those outside the training set due to the purely character input. In contrast, traditional word embedding methods only provide one layer of representation for tokens in a fixed vocabulary.
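For concreteness, the context-insensitive token encoder described above (character n-gram convolutions, two highway layers, and a linear projection to 512 dimensions) might look roughly like the following sketch; the character vocabulary size and the split of the 2048 filters across n-gram widths are assumptions, not the exact CNN-BIG-LSTM-derived configuration.

```python
import torch
import torch.nn as nn

class CharTokenEncoder(nn.Module):
    """Sketch of the purely character-based type representation."""
    def __init__(self, n_chars=262, char_dim=16, n_filters=2048, out_dim=512):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        widths = [1, 2, 3, 4, 5, 6, 7]                 # assumed n-gram widths
        per_width = n_filters // len(widths)
        self.convs = nn.ModuleList(
            [nn.Conv1d(char_dim, per_width, w) for w in widths])
        conv_dim = per_width * len(widths)
        self.highway = nn.ModuleList(
            [nn.Linear(conv_dim, 2 * conv_dim) for _ in range(2)])
        self.proj = nn.Linear(conv_dim, out_dim)

    def forward(self, char_ids):                        # (batch, tokens, chars)
        B, T, C = char_ids.shape
        x = self.char_emb(char_ids.view(B * T, C)).transpose(1, 2)
        # convolve over characters and max-pool over positions
        x = torch.relu(torch.cat(
            [conv(x).max(dim=-1).values for conv in self.convs], dim=-1))
        for hw in self.highway:                         # gated transform/carry mix
            t, h = hw(x).chunk(2, dim=-1)
            g = torch.sigmoid(t)
            x = g * torch.relu(h) + (1 - g) * x
        return self.proj(x).view(B, T, -1)              # (batch, tokens, 512)
```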

After training for 10 epochs on the 1B Word Benchmark (Chelba et al., 2014), the average of the forward and backward perplexities is 39.7, compared to 30.0 for the forward CNN-BIG-LSTM. Generally, we found the forward and backward perplexities to be approximately equal, with the backward value slightly lower.

Once pretrained, the biLM can compute representations for any task. In some cases, fine tuning the biLM on domain specific data leads to significant drops in perplexity and an increase in downstream task performance. This can be seen as a type of domain transfer for the biLM. As a result, in most cases we used a fine-tuned biLM in the downstream task. See supplemental material for details.

4 Evaluation

Table 1 shows the performance of ELMo across a diverse set of six benchmark NLP tasks. In every task considered, simply adding ELMo establishes a new state-of-the-art result, with relative error reductions ranging from 6-20% over strong base models. This is a very general result across a diverse set of model architectures and language understanding tasks. In the remainder of this section we provide high-level sketches of the individual task results; see the supplemental material for full experimental details.

Question answering The Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) contains 100K+ crowd sourced question-answer pairs where the answer is a span in a given Wikipedia paragraph. Our baseline model (Clark and Gardner, 2017) is an improved version of the Bidirectional Attention Flow model in Seo et al. (BiDAF; 2017). It adds a self-attention layer after the bidirectional attention component, simplifies some of the pooling operations and substitutes the LSTMs for gated recurrent units (GRUs; Cho et al., 2014). After adding ELMo to the baseline model, test set F1 improved by 4.7% from 81.1% to 85.8%, a 24.9% relative error reduction over the baseline, improving the overall single model state-of-the-art by 1.4%. An 11-member ensemble pushes F1 to 87.4, the overall state-of-the-art at time of submission to the leaderboard.2 The increase of 4.7% with ELMo is also significantly larger than the 1.8% improvement from adding CoVe to a baseline model (McCann et al., 2017).

2 As of November 17, 2017.


TASK     PREVIOUS SOTA                         OUR BASELINE   ELMO + BASELINE   INCREASE (ABSOLUTE/RELATIVE)
SQuAD    Liu et al. (2017)      84.4           81.1           85.8              4.7 / 24.9%
SNLI     Chen et al. (2017)     88.6           88.0           88.7 ± 0.17       0.7 / 5.8%
SRL      He et al. (2017)       81.7           81.4           84.6              3.2 / 17.2%
Coref    Lee et al. (2017)      67.2           67.2           70.4              3.2 / 9.8%
NER      Peters et al. (2017)   91.93 ± 0.19   90.15          92.22 ± 0.10      2.06 / 21%
SST-5    McCann et al. (2017)   53.7           51.4           54.7 ± 0.5        3.3 / 6.8%

Table 1: Test set comparison of ELMo enhanced neural models with state-of-the-art single model baselines across six benchmark NLP tasks. The performance metric varies across tasks – accuracy for SNLI and SST-5; F1 for SQuAD, SRL and NER; average F1 for Coref. Due to the small test sizes for NER and SST-5, we report the mean and standard deviation across five runs with different random seeds. The “increase” column lists both the absolute and relative improvements over our baseline.

Textual entailment Textual entailment is the task of determining whether a “hypothesis” is true, given a “premise”. The Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) provides approximately 550K hypothesis/premise pairs. Our baseline, the ESIM sequence model from Chen et al. (2017), uses a biLSTM to encode the premise and hypothesis, followed by a matrix attention layer, a local inference layer, another biLSTM inference composition layer, and finally a pooling operation before the output layer. Overall, adding ELMo to the ESIM model improves accuracy by an average of 0.7% across five random seeds. A five member ensemble pushes the overall accuracy to 89.3%, exceeding the previous ensemble best of 88.9% (Gong et al., 2018).

Semantic role labeling A semantic role labeling (SRL) system models the predicate-argument structure of a sentence, and is often described as answering “Who did what to whom”. He et al. (2017) modeled SRL as a BIO tagging problem and used an 8-layer deep biLSTM with forward and backward directions interleaved, following Zhou and Xu (2015). As shown in Table 1, when adding ELMo to a re-implementation of He et al. (2017) the single model test set F1 jumped 3.2% from 81.4% to 84.6% – a new state-of-the-art on the OntoNotes benchmark (Pradhan et al., 2013), even improving over the previous best ensemble result by 1.2%.

Coreference resolution Coreference resolution is the task of clustering mentions in text that refer to the same underlying real world entities. Our baseline model is the end-to-end span-based neural model of Lee et al. (2017). It uses a biLSTM and attention mechanism to first compute span representations and then applies a softmax mention ranking model to find coreference chains. In our experiments with the OntoNotes coreference annotations from the CoNLL 2012 shared task (Pradhan et al., 2012), adding ELMo improved the average F1 by 3.2% from 67.2 to 70.4, establishing a new state of the art, again improving over the previous best ensemble result by 1.6% F1.

Named entity extraction The CoNLL 2003 NER task (Sang and Meulder, 2003) consists of newswire from the Reuters RCV1 corpus tagged with four different entity types (PER, LOC, ORG, MISC). Following recent state-of-the-art systems (Lample et al., 2016; Peters et al., 2017), the baseline model uses pre-trained word embeddings, a character-based CNN representation, two biLSTM layers and a conditional random field (CRF) loss (Lafferty et al., 2001), similar to Collobert et al. (2011). As shown in Table 1, our ELMo enhanced biLSTM-CRF achieves 92.22% F1 averaged over five runs. The key difference between our system and the previous state of the art from Peters et al. (2017) is that we allowed the task model to learn a weighted average of all biLM layers, whereas Peters et al. (2017) only use the top biLM layer. As shown in Sec. 5.1, using all layers instead of just the last layer improves performance across multiple tasks.

Sentiment analysis The fine-grained sentiment classification task in the Stanford Sentiment Treebank (SST-5; Socher et al., 2013) involves selecting one of five labels (from very negative to very positive) to describe a sentence from a movie review. The sentences contain diverse linguistic phenomena such as idioms and complex


Task     Baseline   Last Only   All layers, λ=1   All layers, λ=0.001
SQuAD    80.8       84.7        85.0              85.2
SNLI     88.1       89.1        89.3              89.5
SRL      81.6       84.1        84.6              84.8

Table 2: Development set performance for SQuAD, SNLI and SRL comparing using all layers of the biLM (with different choices of regularization strength λ) to just the top layer.

Task     Input Only   Input & Output   Output Only
SQuAD    85.1         85.6             84.8
SNLI     88.9         89.5             88.7
SRL      84.7         84.3             80.9

Table 3: Development set performance for SQuAD, SNLI and SRL when including ELMo at different locations in the supervised model.

syntactic constructions such as negations that are difficult for models to learn. Our baseline model is the biattentive classification network (BCN) from McCann et al. (2017), which also held the prior state-of-the-art result when augmented with CoVe embeddings. Replacing CoVe with ELMo in the BCN model results in a 1.0% absolute accuracy improvement over the state of the art.

5 Analysis

This section provides an ablation analysis to validate our chief claims and to elucidate some interesting aspects of ELMo representations. Sec. 5.1 shows that using deep contextual representations in downstream tasks improves performance over previous work that uses just the top layer, regardless of whether they are produced from a biLM or MT encoder, and that ELMo representations provide the best overall performance. Sec. 5.3 explores the different types of contextual information captured in biLMs and uses two intrinsic evaluations to show that syntactic information is better represented at lower layers while semantic information is captured at higher layers, consistent with MT encoders. It also shows that our biLM consistently provides richer representations than CoVe. Additionally, we analyze the sensitivity to where ELMo is included in the task model (Sec. 5.2), training set size (Sec. 5.4), and visualize the ELMo learned weights across the tasks (Sec. 5.5).

5.1 Alternate layer weighting schemes

There are many alternatives to Equation 1 for combining the biLM layers. Previous work on contextual representations used only the last layer, whether it be from a biLM (Peters et al., 2017) or an MT encoder (CoVe; McCann et al., 2017). The choice of the regularization parameter λ is also important, as large values such as λ = 1 effectively reduce the weighting function to a simple average over the layers, while smaller values (e.g., λ = 0.001) allow the layer weights to vary.

Table 2 compares these alternatives for SQuAD, SNLI and SRL. Including representations from all layers improves overall performance over just using the last layer, and including contextual representations from the last layer improves performance over the baseline. For example, in the case of SQuAD, using just the last biLM layer improves development F1 by 3.9% over the baseline. Averaging all biLM layers instead of using just the last layer improves F1 another 0.3% (comparing “Last Only” to the λ=1 column), and allowing the task model to learn individual layer weights improves F1 another 0.2% (λ=1 vs. λ=0.001). A small λ is preferred in most cases with ELMo, although for NER, a task with a smaller training set, the results are insensitive to λ (not shown).

The overall trend is similar with CoVe but with smaller increases over the baseline. For SNLI, averaging all layers with λ=1 improves development accuracy from 88.2 to 88.7% over using just the last layer. SRL F1 increased a marginal 0.1% to 82.2 for the λ=1 case compared to using the last layer only.

5.2 Where to include ELMo?

All of the task architectures in this paper include word embeddings only as input to the lowest layer biRNN. However, we find that including ELMo at the output of the biRNN in task-specific architectures improves overall results for some tasks. As shown in Table 3, including ELMo at both the input and output layers for SNLI and SQuAD improves over just the input layer, but for SRL (and coreference resolution, not shown) performance is highest when it is included at just the input layer. One possible explanation for this result is that both the SNLI and SQuAD architectures use attention layers after the biRNN, so introducing ELMo at this layer allows the model to attend directly to the biLM’s internal representations. In the SRL case,


Source: GloVe “play”
Nearest neighbors: playing, game, games, played, players, plays, player, Play, football, multiplayer

Source: biLM, “Chico Ruiz made a spectacular play on Alusik’s grounder {...}”
Nearest neighbor: Kieffer, the only junior in the group, was commended for his ability to hit in the clutch, as well as his all-round excellent play.

Source: biLM, “Olivia De Havilland signed to do a Broadway play for Garson {...}”
Nearest neighbor: {...} they were actors who had been handed fat roles in a successful play, and had talent enough to fill the roles competently, with nice understatement.

Table 4: Nearest neighbors to “play” using GloVe and the context embeddings from a biLM.

Model                           F1
WordNet 1st Sense Baseline      65.9
Raganato et al. (2017a)         69.9
Iacobacci et al. (2016)         70.1
CoVe, First Layer               59.4
CoVe, Second Layer              64.7
biLM, First Layer               67.4
biLM, Second Layer              69.0

Table 5: All-words fine grained WSD F1. For CoVe and the biLM, we report scores for both the first and second layer biLSTMs.

the task-specific context representations are likely more important than those from the biLM.

5.3 What information is captured by the biLM’s representations?

Since adding ELMo improves task performance over word vectors alone, the biLM’s contextual representations must encode information generally useful for NLP tasks that is not captured in word vectors. Intuitively, the biLM must be disambiguating the meaning of words using their context. Consider “play”, a highly polysemous word. The top of Table 4 lists nearest neighbors to “play” using GloVe vectors. They are spread across several parts of speech (e.g., “played”, “playing” as verbs, and “player”, “game” as nouns) but concentrated in the sports-related senses of “play”. In contrast, the bottom two rows show nearest neighbor sentences from the SemCor dataset (see below) using the biLM’s context representation of “play” in the source sentence. In these cases, the biLM is able to disambiguate both the part of speech and word sense in the source sentence.

These observations can be quantified using an

Model                       Acc.
Collobert et al. (2011)     97.3
Ma and Hovy (2016)          97.6
Ling et al. (2015)          97.8
CoVe, First Layer           93.3
CoVe, Second Layer          92.8
biLM, First Layer           97.3
biLM, Second Layer          96.8

Table 6: Test set POS tagging accuracies for PTB. For CoVe and the biLM, we report scores for both the first and second layer biLSTMs.

intrinsic evaluation of the contextual representations similar to Belinkov et al. (2017). To isolate the information encoded by the biLM, the representations are used to directly make predictions for a fine grained word sense disambiguation (WSD) task and a POS tagging task. Using this approach, it is also possible to compare to CoVe, and across each of the individual layers.

Word sense disambiguation Given a sentence, we can use the biLM representations to predict the sense of a target word using a simple 1-nearest neighbor approach, similar to Melamud et al. (2016). To do so, we first use the biLM to compute representations for all words in SemCor 3.0, our training corpus (Miller et al., 1994), and then take the average representation for each sense. At test time, we again use the biLM to compute representations for a given target word and take the nearest neighbor sense from the training set, falling back to the first sense from WordNet for lemmas not observed during training.
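The 1-nearest-neighbor procedure is simple enough to sketch directly. The following illustrative Python version assumes the biLM representations for the SemCor training occurrences and for the test word have already been computed; the helper names are hypothetical and cosine similarity is an assumed choice of distance.

```python
import numpy as np

def build_sense_centroids(train_vectors, train_senses):
    """Average the biLM context vectors of every training occurrence of a sense."""
    grouped = {}
    for vec, sense in zip(train_vectors, train_senses):
        grouped.setdefault(sense, []).append(vec)
    return {s: np.mean(vs, axis=0) for s, vs in grouped.items()}

def predict_sense(target_vec, lemma, centroids, lemma_to_senses, first_sense):
    """1-NN over sense centroids; fall back to the WordNet first sense
    for lemmas never observed in SemCor."""
    candidates = [s for s in lemma_to_senses.get(lemma, []) if s in centroids]
    if not candidates:
        return first_sense[lemma]
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    sims = [cosine(target_vec, centroids[s]) for s in candidates]
    return candidates[int(np.argmax(sims))]
```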

Table 5 compares WSD results using the evaluation framework from Raganato et al. (2017b) across the same suite of four test sets in Raganato et al. (2017a). Overall, the biLM top layer representations have F1 of 69.0 and are better at WSD than the first layer. This is competitive with a state-of-the-art WSD-specific supervised model using hand crafted features (Iacobacci et al., 2016) and a task specific biLSTM that is also trained with auxiliary coarse-grained semantic labels and POS tags (Raganato et al., 2017a). The CoVe biLSTM layers follow a similar pattern to those from the biLM (higher overall performance at the second layer compared to the first); however, our biLM outperforms the CoVe biLSTM, which trails the WordNet first sense baseline.

POS tagging To examine whether the biLM captures basic syntax, we used the context representations as input to a linear classifier that predicts POS tags with the Wall Street Journal portion of the Penn Treebank (PTB) (Marcus et al., 1993). As the linear classifier adds only a small amount of model capacity, this is a direct test of the biLM’s representations. Similar to WSD, the biLM representations are competitive with carefully tuned, task specific biLSTMs (Ling et al., 2015; Ma and Hovy, 2016). However, unlike WSD, accuracies using the first biLM layer are higher than the top layer, consistent with results from deep biLSTMs in multi-task training (Søgaard and Goldberg, 2016; Hashimoto et al., 2017) and MT (Belinkov et al., 2017). CoVe POS tagging accuracies follow the same pattern as those from the biLM, and just like for WSD, the biLM achieves higher accuracies than the CoVe encoder.
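Since the probe is just a linear classifier on frozen biLM states, it can be sketched in a few lines; how the per-token feature matrices are produced from the biLM and the PTB is left abstract here, and scikit-learn's logistic regression stands in for whatever linear model was actually used.

```python
from sklearn.linear_model import LogisticRegression

def probe_layer(X_train, y_train, X_test, y_test):
    """Fit a linear classifier on frozen per-token representations of one
    biLM layer and return PTB POS tagging accuracy."""
    clf = LogisticRegression(max_iter=1000)  # a linear probe adds little capacity
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)
```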

Implications for supervised tasks Taken together, these experiments confirm different layers in the biLM represent different types of information and explain why including all biLM layers is important for the highest performance in downstream tasks. In addition, the biLM’s representations are more transferable to WSD and POS tagging than those in CoVe, helping to illustrate why ELMo outperforms CoVe in downstream tasks.

5.4 Sample efficiency

Adding ELMo to a model increases the sample efficiency considerably, both in terms of number of parameter updates to reach state-of-the-art performance and the overall training set size. For example, the SRL model reaches a maximum development F1 after 486 epochs of training without ELMo. After adding ELMo, the model exceeds the baseline maximum at epoch 10, a 98% relative decrease in the number of updates needed to reach

Figure 1: Comparison of baseline vs. ELMo performance for SNLI and SRL as the training set size is varied from 0.1% to 100%.

Figure 2: Visualization of softmax normalized biLM layer weights across tasks and ELMo locations. Normalized weights less than 1/3 are hatched with horizontal lines and those greater than 2/3 are speckled.

the same level of performance.

In addition, ELMo-enhanced models use smaller training sets more efficiently than models without ELMo. Figure 1 compares the performance of baseline models with and without ELMo as the percentage of the full training set is varied from 0.1% to 100%. Improvements with ELMo are largest for smaller training sets and significantly reduce the amount of training data needed to reach a given level of performance. In the SRL case, the ELMo model with 1% of the training set has about the same F1 as the baseline model with 10% of the training set.

5.5 Visualization of learned weights

Figure 2 visualizes the softmax-normalized learned layer weights. At the input layer, the task model favors the first biLSTM layer. For coreference and SQuAD, this layer is strongly favored, but the distribution is less peaked for the other tasks. The output layer weights are relatively balanced, with a slight preference for the lower layers.


6 Conclusion

We have introduced a general approach for learning high-quality deep context-dependent representations from biLMs, and shown large improvements when applying ELMo to a broad range of NLP tasks. Through ablations and other controlled experiments, we have also confirmed that the biLM layers efficiently encode different types of syntactic and semantic information about words-in-context, and that using all layers improves overall task performance.

References

Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. CoRR abs/1607.06450.

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James R. Glass. 2017. What do neural machine translation models learn about morphology? In ACL.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. TACL 5:135–146.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2014. One billion word benchmark for measuring progress in statistical language modeling. In INTERSPEECH.

Qian Chen, Xiao-Dan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In ACL.

Jason Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional LSTM-CNNs. In TACL.

Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In SSST@EMNLP.

Christopher Clark and Matthew Gardner. 2017. Simple and effective multi-paragraph reading comprehension. CoRR abs/1710.10723.

Kevin Clark and Christopher D. Manning. 2016. Deep reinforcement learning for mention-ranking coreference models. In EMNLP.

Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. 2011. Natural language processing (almost) from scratch. In JMLR.

Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In NIPS.

Greg Durrett and Dan Klein. 2013. Easy victories and uphill battles in coreference resolution. In EMNLP.

Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In NIPS.

Yichen Gong, Heng Luo, and Jian Zhang. 2018. Natural language inference over interaction space. In ICLR.

Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. 2017. A joint many-task model: Growing a neural network for multiple NLP tasks. In EMNLP 2017.

Luheng He, Kenton Lee, Mike Lewis, and Luke S. Zettlemoyer. 2017. Deep semantic role labeling: What works and what’s next. In ACL.

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9.

Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. Embeddings for word sense disambiguation: An evaluation study. In ACL.

Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. CoRR abs/1602.02410.

Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. 2015. An empirical exploration of recurrent network architectures. In ICML.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2015. Character-aware neural language models. In AAAI 2016.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.

Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, Ishaan Gulrajani, James Bradbury, Victor Zhong, Romain Paulus, and Richard Socher. 2016. Ask me anything: Dynamic memory networks for natural language processing. In ICML.

John D. Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In NAACL-HLT.


Kenton Lee, Luheng He, Mike Lewis, and Luke S. Zettlemoyer. 2017. End-to-end neural coreference resolution. In EMNLP.

Wang Ling, Chris Dyer, Alan W. Black, Isabel Trancoso, Ramon Fermandez, Silvio Amir, Luís Marujo, and Tiago Luís. 2015. Finding function in form: Compositional character models for open vocabulary word representation. In EMNLP.

Xiaodong Liu, Yelong Shen, Kevin Duh, and Jianfeng Gao. 2017. Stochastic answer networks for machine reading comprehension. arXiv preprint arXiv:1712.03556.

Xuezhe Ma and Eduard H. Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In ACL.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19:313–330.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In NIPS 2017.

Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. In CoNLL.

Gabor Melis, Chris Dyer, and Phil Blunsom. 2017. On the state of the art of evaluation in neural language models. CoRR abs/1707.05589.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and optimizing LSTM language models. CoRR abs/1708.02182.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.

George A. Miller, Martin Chodorow, Shari Landes, Claudia Leacock, and Robert G. Thomas. 1994. Using a semantic concordance for sense identification. In HLT.

Tsendsuren Munkhdalai and Hong Yu. 2017. Neural tree indexers for text understanding. In EACL.

Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014. Efficient non-parametric estimation of multiple embeddings per word in vector space. In EMNLP.

Martha Palmer, Paul Kingsbury, and Daniel Gildea. 2005. The proposition bank: An annotated corpus of semantic roles. Computational Linguistics 31:71–106.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In ACL.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Bjorkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using OntoNotes. In CoNLL.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In EMNLP-CoNLL Shared Task.

Alessandro Raganato, Claudio Delli Bovi, and Roberto Navigli. 2017a. Neural sequence learning models for word sense disambiguation. In EMNLP.

Alessandro Raganato, Jose Camacho-Collados, and Roberto Navigli. 2017b. Word sense disambiguation: A unified evaluation framework and empirical comparison. In EACL.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.

Prajit Ramachandran, Peter Liu, and Quoc Le. 2017. Improving sequence to sequence learning with unlabeled data. In EMNLP.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In CoNLL.

Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In ICLR.

Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.

Anders Søgaard and Yoav Goldberg. 2016. Deep multi-task learning with low level tasks supervised at lower layers. In ACL 2016.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15:1929–1958.

Rupesh Kumar Srivastava, Klaus Greff, and Jurgen Schmidhuber. 2015. Training very deep networks. In NIPS.

Joseph P. Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In ACL.


Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In ACL.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Charagram: Embedding words and sentences via character n-grams. In EMNLP.

Sam Wiseman, Alexander M. Rush, and Stuart M. Shieber. 2016. Learning global features for coreference resolution. In HLT-NAACL.

Matthew D. Zeiler. 2012. Adadelta: An adaptive learning rate method. CoRR abs/1212.5701.

Jie Zhou and Wei Xu. 2015. End-to-end learning of semantic role labeling using recurrent neural networks. In ACL.

Peng Zhou, Zhenyu Qi, Suncong Zheng, Jiaming Xu, Hongyun Bao, and Bo Xu. 2016. Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. In COLING.


A Supplemental Material to accompany "Deep contextualized word representations"

This supplement contains details of the model architectures, training routines and hyper-parameter choices for the state-of-the-art models in Section 4.

All of the individual models share a common architecture in the lowest layers with a context independent token representation below several layers of stacked RNNs – LSTMs in every case except the SQuAD model that uses GRUs.

A.1 Fine tuning biLM

As noted in Sec. 3.4, fine tuning the biLM on task specific data typically resulted in significant drops in perplexity. To fine tune on a given task, the supervised labels were temporarily ignored, the biLM fine tuned for one epoch on the training split and evaluated on the development split. Once fine tuned, the biLM weights were fixed during task training.

Table 7 lists the development set perplexities for the considered tasks. In every case except CoNLL 2012, fine tuning results in a large improvement in perplexity, e.g., from 72.1 to 16.8 for SNLI.

The impact of fine tuning on supervised performance is task dependent. In the case of SNLI, fine tuning the biLM increased development accuracy 0.6% from 88.9% to 89.5% for our single best model. However, for sentiment classification development set accuracy is approximately the same regardless of whether a fine tuned biLM was used.

A.2 Importance of γ in Eqn. (1)

The γ parameter in Eqn. (1) was of practical importance to aid optimization, due to the different distributions between the biLM internal representations and the task specific representations. It is especially important in the last-only case in Sec. 5.1. Without this parameter, the last-only case performed poorly (well below the baseline) for SNLI and training failed completely for SRL.

A.3 Textual Entailment

Our baseline SNLI model is the ESIM sequence model from Chen et al. (2017). Following the original implementation, we used 300 dimensions for all LSTM and feed forward layers and pre-trained 300 dimensional GloVe embeddings that were fixed during training. For regularization, we

Dataset                    Before tuning   After tuning
SNLI                       72.1            16.8
CoNLL 2012 (coref/SRL)     92.3            -
CoNLL 2003 (NER)           103.2           46.3
SQuAD, Context             99.1            43.5
SQuAD, Questions           158.2           52.0
SST                        131.5           78.6

Table 7: Development set perplexity before and after fine tuning for one epoch on the training set for various datasets (lower is better). Reported values are the average of the forward and backward perplexities.

added 50% variational dropout (Gal and Ghahramani, 2016) to the input of each LSTM layer and 50% dropout (Srivastava et al., 2014) at the input to the final two fully connected layers. All feed forward layers use ReLU activations. Parameters were optimized using Adam (Kingma and Ba, 2015) with gradient norms clipped at 5.0 and initial learning rate 0.0004, decreasing by half each time accuracy on the development set did not increase in subsequent epochs. The batch size was 32.
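The optimization recipe described here (Adam, gradient norms clipped at 5.0, learning rate 0.0004 halved whenever development accuracy stops improving) maps onto standard PyTorch components; the sketch below leaves the model and data pipeline abstract, and `model`, `train_batches`, `num_epochs`, and `dev_accuracy` are placeholders rather than parts of the original implementation.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=0)  # halve LR when dev accuracy stalls

for epoch in range(num_epochs):
    for batch in train_batches:
        optimizer.zero_grad()
        loss = model(batch)                 # placeholder: returns the batch loss
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        optimizer.step()
    scheduler.step(dev_accuracy())          # track development accuracy each epoch
```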

The best ELMo configuration added ELMo vectors to both the input and output of the lowest layer LSTM, using (1) with layer normalization and λ = 0.001. Due to the increased number of parameters in the ELMo model, we added ℓ2 regularization with regularization coefficient 0.0001 to all recurrent and feed forward weight matrices and 50% dropout after the attention layer.

Table 8 compares test set accuracy of our system to previously published systems. Overall, adding ELMo to the ESIM model improved accuracy by 0.7%, establishing a new single model state-of-the-art of 88.7%, and a five member ensemble pushes the overall accuracy to 89.3%.

A.4 Question Answering

Our QA model is a simplified version of the model from Clark and Gardner (2017). It embeds tokens by concatenating each token's case-sensitive 300 dimensional GloVe word vector (Pennington et al., 2014) with a character-derived embedding produced using a convolutional neural network followed by max-pooling on learned character embeddings. The token embeddings are passed through a shared bi-directional GRU, and then the bi-directional attention mechanism from BiDAF (Seo et al., 2017). The augmented


Model                                    Acc.
Feature based (Bowman et al., 2015)      78.2
DIIN (Gong et al., 2018)                 88.0
BCN+Char+CoVe (McCann et al., 2017)      88.1
ESIM (Chen et al., 2017)                 88.0
ESIM+TreeLSTM (Chen et al., 2017)        88.6
ESIM+ELMo                                88.7 ± 0.17
DIIN ensemble (Gong et al., 2018)        88.9
ESIM+ELMo ensemble                       89.3

Table 8: SNLI test set accuracy.3 Single model results occupy the top portion, with ensemble results at the bottom.

3 A comprehensive comparison can be found at https://nlp.stanford.edu/projects/snli/

context vectors are then passed through a linear layer with ReLU activations, a residual self-attention layer that uses a GRU followed by the same attention mechanism applied context-to-context, and another linear layer with ReLU activations. Finally, the results are fed through linear layers to predict the start and end token of the answer.

Variational dropout is used before the input to the GRUs and the linear layers at a rate of 0.2. A dimensionality of 90 is used for the GRUs, and 180 for the linear layers. We optimize the model using Adadelta with a batch size of 45. At test time we use an exponential moving average of the weights and limit the output span to be of at most size 17. We do not update the word vectors during training.

Performance was highest when adding ELMo without layer normalization to both the input and output of the contextual GRU layer and leaving the ELMo weights unregularized (λ = 0).

Table 9 compares test set results from the SQuAD leaderboard as of November 17, 2017, when we submitted our system. Overall, our submission had the highest single model and ensemble results, improving the previous single model result (SAN) by 1.4% F1 and our baseline by 4.2%. An 11-member ensemble pushes F1 to 87.4%, a 1.0% increase over the previous ensemble best.

A.5 Semantic Role Labeling

Our baseline SRL model is an exact reimplementation of He et al. (2017). Words are represented using a concatenation of 100 dimensional vector representations, initialized using GloVe (Pennington et al., 2014) and a binary, per-word predicate feature, represented using a 100 dimensional embedding. This 200 dimensional token representation is then passed through an 8 layer “interleaved” biLSTM with a 300 dimensional hidden size, in which the directions of the LSTM layers alternate per layer. This deep LSTM uses Highway connections (Srivastava et al., 2015) between layers and variational recurrent dropout (Gal and Ghahramani, 2016). This deep representation is then projected using a final dense layer followed by a softmax activation to form a distribution over all possible tags. Labels consist of semantic roles from PropBank (Palmer et al., 2005) augmented with a BIO labeling scheme to represent argument spans. During training, we minimize the negative log likelihood of the tag sequence using Adadelta with a learning rate of 1.0 and ρ = 0.95 (Zeiler, 2012). At test time, we perform Viterbi decoding to enforce valid spans using BIO constraints. Variational dropout of 10% is added to all LSTM hidden layers. Gradients are clipped if their value exceeds 1.0. Models are trained for 500 epochs or until validation F1 does not improve for 200 epochs, whichever is sooner. The pretrained GloVe vectors are fine-tuned during training. The final dense layer and all cells of all LSTMs are initialized to be orthogonal. The forget gate bias is initialized to 1 for all LSTMs, with all other gates initialized to 0, as per Jozefowicz et al. (2015).
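The orthogonal weight initialization and the forget-gate bias of 1 can be reproduced on a standard LSTM; below is a minimal sketch, assuming a plain nn.LSTM rather than the paper's custom interleaved/highway implementation, and reading "orthogonal" as applying to the stacked gate matrices.

```python
import torch.nn as nn

def init_srl_lstm(lstm: nn.LSTM) -> nn.LSTM:
    """Orthogonal weights; forget gate bias = 1, all other gate biases = 0."""
    for name, param in lstm.named_parameters():
        if "weight" in name:
            nn.init.orthogonal_(param)      # applied to the stacked gate matrices
        elif "bias" in name:
            nn.init.zeros_(param)
            if name.startswith("bias_ih"):
                hidden = lstm.hidden_size
                # PyTorch packs gates as [input, forget, cell, output];
                # bias_ih and bias_hh are summed, so set only one of them to 1.
                param.data[hidden:2 * hidden] = 1.0
    return lstm
```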

Table 10 compares test set F1 scores of our ELMo augmented implementation of He et al. (2017) with previous results. Our single model score of 84.6 F1 represents a new state-of-the-art result on the CoNLL 2012 Semantic Role Labeling task, surpassing the previous single model result by 2.9 F1 and a 5-model ensemble by 1.2 F1.

A.6 Coreference resolution

Our baseline coreference model is the end-to-end neural model from Lee et al. (2017) with all


Model                                       EM     F1
BiDAF (Seo et al., 2017)                    68.0   77.3
BiDAF + Self Attention                      72.1   81.1
DCN+                                        75.1   83.1
Reg-RaSoR                                   75.8   83.3
FusionNet                                   76.0   83.9
r-net (Wang et al., 2017)                   76.5   84.3
SAN (Liu et al., 2017)                      76.8   84.4
BiDAF + Self Attention + ELMo               78.6   85.8
DCN+ Ensemble                               78.9   86.0
FusionNet Ensemble                          79.0   86.0
Interactive AoA Reader+ Ensemble            79.1   86.5
BiDAF + Self Attention + ELMo Ensemble      81.0   87.4

Table 9: Test set results for SQuAD, showing both Exact Match (EM) and F1. The top half of the table contains single model results, with ensembles at the bottom. References provided where available.

Model                          F1
Pradhan et al. (2013)          77.5
Zhou and Xu (2015)             81.3
He et al. (2017), single       81.7
He et al. (2017), ensemble     83.4
He et al. (2017), our impl.    81.4
He et al. (2017) + ELMo        84.6

Table 10: SRL CoNLL 2012 test set F1.

Model                          Average F1
Durrett and Klein (2013)       60.3
Wiseman et al. (2016)          64.2
Clark and Manning (2016)       65.7
Lee et al. (2017) (single)     67.2
Lee et al. (2017) (ensemble)   68.8
Lee et al. (2017) + ELMo       70.4

Table 11: Coreference resolution average F1 on the test set from the CoNLL 2012 shared task.

hyperparameters exactly following the original implementation.

The best configuration added ELMo to the input of the lowest layer biLSTM and weighted the biLM layers using (1) without any regularization (λ = 0) or layer normalization. 50% dropout was added to the ELMo representations.

Table 11 compares our results with previously published results. Overall, we improve the single model state-of-the-art by 3.2% average F1, and our single model result improves the previous ensemble best by 1.6% F1. Adding ELMo to the output from the biLSTM in addition to the biLSTM input reduced F1 by approximately 0.7% (not shown).

A.7 Named Entity Recognition

Our baseline NER model concatenates 50 dimensional pre-trained Senna vectors (Collobert et al., 2011) with a CNN character based representation. The character representation uses 16 dimensional character embeddings and 128 convolutional filters of width three characters, a ReLU activation followed by max pooling. The token representation is passed through two biLSTM layers, the first with 200 hidden units and the second with 100 hidden units before a final dense layer and softmax layer. During training, we use a CRF loss and at test time perform decoding using the Viterbi algorithm while ensuring that the output tag sequence is valid.

Variational dropout is added to the input of both biLSTM layers. During training the gradients are rescaled if their ℓ2 norm exceeds 5.0 and parameters are updated using Adam with a constant learning rate of 0.001. The pre-trained Senna embeddings are fine tuned during training. We employ early stopping on the development set and report the averaged test set score across five runs with different random seeds.

ELMo was added to the input of the lowest layer task biLSTM. As the CoNLL 2003 NER data set is relatively small, we found the best performance by constraining the trainable layer weights to be effectively constant by setting λ = 0.1 with (1).

Table 12 compares test set F1 scores of our ELMo enhanced biLSTM-CRF tagger with previous results. Overall, the 92.22% F1 from our system establishes a new state-of-the-art. When compared to Peters et al. (2017), using representations


Model                           F1 ± std.
Collobert et al. (2011)♣        89.59
Lample et al. (2016)            90.94
Ma and Hovy (2016)              91.2
Chiu and Nichols (2016)♣,♦      91.62 ± 0.33
Peters et al. (2017)♦           91.93 ± 0.19
biLSTM-CRF + ELMo               92.22 ± 0.10

Table 12: Test set F1 for the CoNLL 2003 NER task. Models with ♣ included gazetteers and those with ♦ used both the train and development splits for training.

Model                                   Acc.
DMN (Kumar et al., 2016)                52.1
LSTM-CNN (Zhou et al., 2016)            52.4
NTI (Munkhdalai and Yu, 2017)           53.1
BCN+Char+CoVe (McCann et al., 2017)     53.7
BCN+ELMo                                54.7

Table 13: Test set accuracy for SST-5.

from all layers of the biLM provides a modest improvement.

A.8 Sentiment classification

We use almost the same biattention classification network architecture described in McCann et al. (2017), with the exception of replacing the final maxout network with a simpler feedforward network composed of two ReLU layers with dropout. A BCN model with a batch-normalized maxout network reached significantly lower validation accuracies in our experiments, although there may be discrepancies between our implementation and that of McCann et al. (2017). To match the CoVe training setup, we only train on phrases that contain four or more tokens. We use 300-d hidden states for the biLSTM and optimize the model parameters with Adam (Kingma and Ba, 2015) using a learning rate of 0.0001. The trainable biLM layer weights are regularized by λ = 0.001, and we add ELMo to both the input and output of the biLSTM; the output ELMo vectors are computed with a second biLSTM and concatenated to the input.