
LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 6442–6454, November 16–20, 2020. ©2020 Association for Computational Linguistics

Ikuya Yamada 1,2  [email protected]
Akari Asai 3  [email protected]
Hiroyuki Shindo 4,2  [email protected]
Hideaki Takeda 5  [email protected]
Yuji Matsumoto 2  [email protected]

1 Studio Ousia  2 RIKEN AIP  3 University of Washington  4 Nara Institute of Science and Technology  5 National Institute of Informatics

Abstract

Entity representations are useful in natural language tasks involving entities. In this paper, we propose new pretrained contextualized representations of words and entities based on the bidirectional transformer (Vaswani et al., 2017). The proposed model treats words and entities in a given text as independent tokens, and outputs contextualized representations of them. Our model is trained using a new pretraining task based on the masked language model of BERT (Devlin et al., 2019). The task involves predicting randomly masked words and entities in a large entity-annotated corpus retrieved from Wikipedia. We also propose an entity-aware self-attention mechanism that is an extension of the self-attention mechanism of the transformer, and considers the types of tokens (words or entities) when computing attention scores. The proposed model achieves impressive empirical performance on a wide range of entity-related tasks. In particular, it obtains state-of-the-art results on five well-known datasets: Open Entity (entity typing), TACRED (relation classification), CoNLL-2003 (named entity recognition), ReCoRD (cloze-style question answering), and SQuAD 1.1 (extractive question answering). Our source code and pretrained representations are available at https://github.com/studio-ousia/luke.

1 Introduction

Many natural language tasks involve entities, e.g., relation classification, entity typing, named entity recognition (NER), and question answering (QA). Key to solving such entity-related tasks is a model to learn the effective representations of entities. Conventional entity representations assign each entity a fixed embedding vector that stores information regarding the entity in a knowledge base (KB) (Bordes et al., 2013; Trouillon et al., 2016; Yamada et al., 2016, 2017). Although these models capture the rich information in the KB, they require entity linking to represent entities in a text, and cannot represent entities that do not exist in the KB.

By contrast, contextualized word representations (CWRs) based on the transformer (Vaswani et al., 2017), such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2020), provide effective general-purpose word representations trained with unsupervised pretraining tasks based on language modeling. Many recent studies have solved entity-related tasks using the contextualized representations of entities computed based on CWRs (Zhang et al., 2019; Peters et al., 2019; Joshi et al., 2020). However, the architecture of CWRs is not well suited to representing entities for the following two reasons: (1) Because CWRs do not output the span-level representations of entities, they typically need to learn how to compute such representations based on a downstream dataset that is typically small. (2) Many entity-related tasks, e.g., relation classification and QA, involve reasoning about the relationships between entities. Although the transformer can capture the complex relationships between words by relating them to each other multiple times using the self-attention mechanism (Clark et al., 2019; Reif et al., 2019), it is difficult to perform such reasoning between entities because many entities are split into multiple tokens in the model. Furthermore, the word-based pretraining task of CWRs is not suitable for learning the representations of entities because predicting a masked word given other words in the entity, e.g., predicting "Rings" given "The Lord of the [MASK]", is clearly easier than predicting the entire entity.

In this paper, we propose new pretrained contextualized representations of words and entities by developing LUKE (Language Understanding with Knowledge-based Embeddings). LUKE is based on a transformer (Vaswani et al., 2017) trained using a large entity-annotated corpus obtained from Wikipedia.


Figure 1: Architecture of LUKE using the input sentence "Beyoncé lives in Los Angeles." LUKE outputs a contextualized representation for each word and entity in the text. The model is trained to predict randomly masked words (e.g., lives and Angeles in the figure) and entities (e.g., Los Angeles in the figure). Downstream tasks are solved using its output representations with linear classifiers.

An important difference between LUKE and existing CWRs is that it treats not only words, but also entities as independent tokens, and computes intermediate and output representations for all tokens using the transformer (see Figure 1). Since entities are treated as tokens, LUKE can directly model the relationships between entities.

LUKE is trained using a new pretraining task, a straightforward extension of BERT's masked language model (MLM) (Devlin et al., 2019). The task involves randomly masking entities by replacing them with [MASK] entities, and trains the model by predicting the originals of these masked entities. We use RoBERTa as the base pre-trained model, and conduct pretraining of the model by simultaneously optimizing the objectives of the MLM and our proposed task. When applied to downstream tasks, the resulting model can compute representations of arbitrary entities in the text using [MASK] entities as inputs. Furthermore, if entity annotation is available in the task, the model can compute entity representations based on the rich entity-centric information encoded in the corresponding entity embeddings.

Another key contribution of this paper is that it extends the transformer using our entity-aware self-attention mechanism. Unlike existing CWRs, our model needs to deal with two types of tokens, i.e., words and entities. Therefore, we assume that it is beneficial to enable the mechanism to easily determine the types of tokens. To this end, we enhance the self-attention mechanism by adopting different query mechanisms based on the attending token and the token attended to.

We validate the effectiveness of our proposed model by conducting extensive experiments on five standard entity-related tasks: entity typing, relation classification, NER, cloze-style QA, and extractive QA. Our model outperforms all baseline models, including RoBERTa, in all experiments, and obtains state-of-the-art results on five tasks: entity typing on the Open Entity dataset (Choi et al., 2018), relation classification on the TACRED dataset (Zhang et al., 2017), NER on the CoNLL-2003 dataset (Tjong Kim Sang and De Meulder, 2003), cloze-style QA on the ReCoRD dataset (Zhang et al., 2018a), and extractive QA on the SQuAD 1.1 dataset (Rajpurkar et al., 2016). We release our source code and pretrained representations at https://github.com/studio-ousia/luke.

The main contributions of this paper are summarized as follows:

• We propose LUKE, new contextualized representations specifically designed to address entity-related tasks. LUKE is trained to predict randomly masked words and entities using a large entity-annotated corpus obtained from Wikipedia.

• We introduce an entity-aware self-attention mechanism, an effective extension of the original mechanism of the transformer. The proposed mechanism considers the types of tokens (words or entities) when computing attention scores.

• LUKE achieves strong empirical performance and obtains state-of-the-art results on five popular datasets: Open Entity, TACRED, CoNLL-2003, ReCoRD, and SQuAD 1.1.


2 Related Work

Static Entity Representations  Conventional entity representations assign a fixed embedding to each entity in the KB. They include knowledge embeddings trained on knowledge graphs (Bordes et al., 2013; Yang et al., 2015; Trouillon et al., 2016), and embeddings trained using textual contexts or descriptions of entities retrieved from a KB (Yamada et al., 2016, 2017; Cao et al., 2017; Ganea and Hofmann, 2017). Similar to our pretraining task, NTEE (Yamada et al., 2017) and RELIC (Ling et al., 2020) use an approach that trains entity embeddings by predicting entities given their textual contexts obtained from a KB. The main drawbacks of this line of work, when representing entities in text, are that (1) they need to resolve entities in the text to corresponding KB entries to represent the entities, and (2) they cannot represent entities that do not exist in the KB.

Contextualized Word Representations  Many recent studies have addressed entity-related tasks based on the contextualized representations of entities in text computed using the word representations of CWRs (Zhang et al., 2019; Baldini Soares et al., 2019; Peters et al., 2019; Joshi et al., 2020; Wang et al., 2019b, 2020). Representative examples of CWRs are ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019), which are based on a deep bidirectional long short-term memory (LSTM) and the transformer (Vaswani et al., 2017), respectively. BERT is trained using an MLM, a pretraining task that masks random words in the text and trains the model to predict the masked words. Most recent CWRs, such as RoBERTa (Liu et al., 2020), XLNet (Yang et al., 2019), SpanBERT (Joshi et al., 2020), ALBERT (Lan et al., 2020), BART (Lewis et al., 2020), and T5 (Raffel et al., 2020), are based on the transformer trained using a task equivalent or similar to the MLM. Similar to our proposed pretraining task that masks entities instead of words, several recent CWRs, e.g., SpanBERT, ALBERT, BART, and T5, have extended the MLM by randomly masking word spans instead of single words.

Furthermore, various recent studies have explored methods to enhance CWRs by injecting them with knowledge from external sources, such as KBs. ERNIE (Zhang et al., 2019) and KnowBERT (Peters et al., 2019) use a similar idea to enhance CWRs using static entity embeddings separately learned from a KB. WKLM (Xiong et al., 2020) trains the model to detect whether an entity name in text is replaced by another entity name of the same type. KEPLER (Wang et al., 2019b) conducts pretraining based on the MLM and a knowledge-embedding objective (Bordes et al., 2013). K-Adapter (Wang et al., 2020) was proposed concurrently with our work, and extends CWRs using neural adapters that inject factual and linguistic knowledge. This line of work is related to ours because our pretraining task also enhances the model using information in the KB.

Unlike the CWRs mentioned above, LUKE uses an improved transformer architecture with an entity-aware self-attention mechanism that is designed to effectively solve entity-related tasks. LUKE also outputs entity representations by learning how to compute them during pretraining. It achieves superior empirical results to existing CWRs and knowledge-enhanced CWRs in all of our experiments.

3 LUKE

Figure 1 shows the architecture of LUKE. The model adopts a multi-layer bidirectional transformer (Vaswani et al., 2017). It treats words and entities in the document as input tokens, and computes a representation for each token. Formally, given a sequence consisting of $m$ words $w_1, w_2, \ldots, w_m$ and $n$ entities $e_1, e_2, \ldots, e_n$, our model computes $D$-dimensional word representations $\mathbf{h}_{w_1}, \mathbf{h}_{w_2}, \ldots, \mathbf{h}_{w_m}$, where $\mathbf{h}_w \in \mathbb{R}^D$, and entity representations $\mathbf{h}_{e_1}, \mathbf{h}_{e_2}, \ldots, \mathbf{h}_{e_n}$, where $\mathbf{h}_e \in \mathbb{R}^D$. The entities can be Wikipedia entities (e.g., Beyoncé in Figure 1) or special entities (e.g., [MASK]).

3.1 Input Representation

The input representation of a token (word or entity) is computed using the following three embeddings:

• Token embedding represents the corresponding token. We denote the word token embedding by $\mathbf{A} \in \mathbb{R}^{V_w \times D}$, where $V_w$ is the number of words in our vocabulary. For computational efficiency, we represent the entity token embedding by decomposing it into two small matrices, $\mathbf{B} \in \mathbb{R}^{V_e \times H}$ and $\mathbf{U} \in \mathbb{R}^{H \times D}$, where $V_e$ is the number of entities in our vocabulary. Hence, the full matrix of the entity token embedding can be computed as $\mathbf{B}\mathbf{U}$.

• Position embedding represents the position of the token in a word sequence. A word and an entity appearing at the $i$-th position in the sequence are represented as $\mathbf{C}_i \in \mathbb{R}^D$ and $\mathbf{D}_i \in \mathbb{R}^D$, respectively. If an entity name contains multiple words, its position embedding is computed by averaging the embeddings of the corresponding positions, as shown in Figure 1.

• Entity type embedding represents that the token is an entity. The embedding is a single vector denoted by $\mathbf{e} \in \mathbb{R}^D$.

The input representation of a word and that of an entity are computed by summing the token and position embeddings, and the token, position, and entity type embeddings, respectively. Following past work (Devlin et al., 2019; Liu et al., 2020), we insert the special tokens [CLS] and [SEP] into the word sequence as the first and last words, respectively.
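To make the computation concrete, the following is a minimal sketch of the input representation under the shapes defined above; the module names, toy sizes, and example usage are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

# Toy sizes for illustration; the paper uses D = 1024, H = 256,
# Vw = 50K words, Ve = 500K entities, and up to 512 positions.
D, H, Vw, Ve, MAX_POS = 64, 16, 1000, 2000, 512

word_token_emb = nn.Embedding(Vw, D)               # A
entity_token_emb_B = nn.Embedding(Ve, H)           # B (small factor)
entity_token_proj_U = nn.Linear(H, D, bias=False)  # U, so the full embedding is BU
word_pos_emb = nn.Embedding(MAX_POS, D)            # C_i
entity_pos_emb = nn.Embedding(MAX_POS, D)          # D_i
entity_type_emb = nn.Parameter(torch.zeros(D))     # e

def word_inputs(word_ids, positions):
    # word input = token embedding + position embedding
    return word_token_emb(word_ids) + word_pos_emb(positions)

def entity_inputs(entity_ids, position_lists):
    # entity input = token embedding (through BU) + position embedding + entity type embedding
    tok = entity_token_proj_U(entity_token_emb_B(entity_ids))
    # a multi-word entity averages the position embeddings of its word positions
    pos = torch.stack([entity_pos_emb(p).mean(dim=0) for p in position_lists])
    return tok + pos + entity_type_emb

# example: two entities, the second spanning word positions 4 and 5 (e.g., "Los Angeles")
words = word_inputs(torch.arange(7), torch.arange(7))
entities = entity_inputs(torch.tensor([3, 42]),
                         [torch.tensor([1]), torch.tensor([4, 5])])
```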

3.2 Entity-aware Self-attention

The self-attention mechanism is the foundation of the transformer (Vaswani et al., 2017), and relates tokens to each other based on the attention score between each pair of tokens. Given a sequence of input vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k$, where $\mathbf{x}_i \in \mathbb{R}^D$, each of the output vectors $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_k$, where $\mathbf{y}_i \in \mathbb{R}^L$, is computed based on the weighted sum of the transformed input vectors. Here, each input and output vector corresponds to a token (a word or an entity) in our model; therefore, $k = m + n$. The $i$-th output vector $\mathbf{y}_i$ is computed as:

$$\mathbf{y}_i = \sum_{j=1}^{k} \alpha_{ij} \mathbf{V}\mathbf{x}_j$$

$$e_{ij} = \frac{\mathbf{K}\mathbf{x}_j^\top \mathbf{Q}\mathbf{x}_i}{\sqrt{L}}$$

$$\alpha_{ij} = \mathrm{softmax}(e_{ij})$$

where $\mathbf{Q} \in \mathbb{R}^{L \times D}$, $\mathbf{K} \in \mathbb{R}^{L \times D}$, and $\mathbf{V} \in \mathbb{R}^{L \times D}$ denote the query, key, and value matrices, respectively.

Because LUKE handles two types of tokens (i.e., words and entities), we assume that it is beneficial to use the information of the target token types when computing the attention scores $e_{ij}$. With this in mind, we enhance the mechanism by introducing an entity-aware query mechanism that uses a different query matrix for each possible pair of token types of $\mathbf{x}_i$ and $\mathbf{x}_j$. Formally, the attention score $e_{ij}$ is computed as follows:

$$e_{ij} = \begin{cases}
\mathbf{K}\mathbf{x}_j^\top \mathbf{Q}\mathbf{x}_i, & \text{if both } \mathbf{x}_i \text{ and } \mathbf{x}_j \text{ are words} \\
\mathbf{K}\mathbf{x}_j^\top \mathbf{Q}_{w2e}\mathbf{x}_i, & \text{if } \mathbf{x}_i \text{ is a word and } \mathbf{x}_j \text{ is an entity} \\
\mathbf{K}\mathbf{x}_j^\top \mathbf{Q}_{e2w}\mathbf{x}_i, & \text{if } \mathbf{x}_i \text{ is an entity and } \mathbf{x}_j \text{ is a word} \\
\mathbf{K}\mathbf{x}_j^\top \mathbf{Q}_{e2e}\mathbf{x}_i, & \text{if both } \mathbf{x}_i \text{ and } \mathbf{x}_j \text{ are entities}
\end{cases}$$

where $\mathbf{Q}_{w2e}, \mathbf{Q}_{e2w}, \mathbf{Q}_{e2e} \in \mathbb{R}^{L \times D}$ are query matrices. Note that the computational cost of the original mechanism and that of our proposed mechanism are identical except for the additional cost of computing gradients and updating the parameters of the additional query matrices at training time.
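As an illustration of the four-way query selection, here is a minimal single-head sketch of the entity-aware self-attention; multiple heads, attention masking, dropout, and the exact layout of the released implementation are omitted, and all names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EntityAwareSelfAttention(nn.Module):
    """Single-head sketch: the query matrix is chosen per token-type pair."""

    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.d_head = d_head
        self.key = nn.Linear(d_model, d_head, bias=False)    # K
        self.value = nn.Linear(d_model, d_head, bias=False)  # V
        # one query matrix per (attending type, attended type) pair
        self.q_w2w = nn.Linear(d_model, d_head, bias=False)  # Q
        self.q_w2e = nn.Linear(d_model, d_head, bias=False)  # Q_w2e
        self.q_e2w = nn.Linear(d_model, d_head, bias=False)  # Q_e2w
        self.q_e2e = nn.Linear(d_model, d_head, bias=False)  # Q_e2e

    def forward(self, x: torch.Tensor, is_entity: torch.Tensor) -> torch.Tensor:
        # x: (k, d_model); is_entity: (k,) boolean flag per token
        k_vec, v_vec = self.key(x), self.value(x)              # (k, d_head)
        queries = torch.stack([self.q_w2w(x), self.q_w2e(x),
                               self.q_e2w(x), self.q_e2e(x)])  # (4, k, d_head)
        # 0: word->word, 1: word->entity, 2: entity->word, 3: entity->entity
        pair_type = 2 * is_entity.long()[:, None] + is_entity.long()[None, :]
        scores = torch.einsum("qid,jd->qij", queries, k_vec) / self.d_head ** 0.5
        scores = torch.gather(scores, 0, pair_type[None]).squeeze(0)  # (k, k)
        return F.softmax(scores, dim=-1) @ v_vec

# usage: 7 word tokens followed by 2 entity tokens
layer = EntityAwareSelfAttention(d_model=64, d_head=16)
out = layer(torch.randn(9, 64), torch.tensor([False] * 7 + [True] * 2))
```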

3.3 Pretraining Task

To pretrain LUKE, we use the conventional MLM and a new pretraining task that is an extension of the MLM to learn entity representations. In particular, we treat hyperlinks in Wikipedia as entity annotations, and train the model using a large entity-annotated corpus retrieved from Wikipedia. We randomly mask a certain percentage of the entities by replacing them with special [MASK] entities1 and then train the model to predict the masked entities. Formally, the original entity corresponding to a masked entity is predicted by applying the softmax function over all entities in our vocabulary:

$$\mathbf{y} = \mathrm{softmax}(\mathbf{B}\mathbf{T}\mathbf{m} + \mathbf{b}_o)$$

$$\mathbf{m} = \mathrm{layer\_norm}\big(\mathrm{gelu}(\mathbf{W}_h \mathbf{h}_e + \mathbf{b}_h)\big)$$

where $\mathbf{h}_e$ is the representation corresponding to the masked entity, $\mathbf{T} \in \mathbb{R}^{H \times D}$ and $\mathbf{W}_h \in \mathbb{R}^{D \times D}$ are weight matrices, $\mathbf{b}_o \in \mathbb{R}^{V_e}$ and $\mathbf{b}_h \in \mathbb{R}^{D}$ are bias vectors, $\mathrm{gelu}(\cdot)$ is the GELU activation function (Hendrycks and Gimpel, 2016), and $\mathrm{layer\_norm}(\cdot)$ is the layer normalization function (Lei Ba et al., 2016). Our final loss function is the sum of the MLM loss and the cross-entropy loss on predicting the masked entities, where the latter is computed identically to the former.

1 Note that LUKE uses two different [MASK] tokens: the [MASK] word for the MLM and the [MASK] entity for our proposed pretraining task.
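A minimal sketch of this prediction head, with toy sizes in place of the 500K-entity vocabulary, is shown below; the weight sharing with the entity token embedding B is indicated only by a comment, and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sizes; the paper uses D = 1024, H = 256, and Ve = 500K entities.
D, H, Ve = 64, 16, 2000

W_h = nn.Linear(D, D)                 # W_h and b_h
T = nn.Linear(D, H, bias=False)       # T
layer_norm = nn.LayerNorm(D)
B = torch.randn(Ve, H)                # shared with the entity token embedding B
b_o = torch.zeros(Ve)

def masked_entity_logits(h_e):
    """h_e: (batch, D) representations of [MASK] entities -> (batch, Ve) logits."""
    m = layer_norm(F.gelu(W_h(h_e)))  # m = layer_norm(gelu(W_h h_e + b_h))
    return T(m) @ B.t() + b_o         # softmax over these logits gives y = softmax(B T m + b_o)

# the entity loss is cross-entropy over the logits and is added to the MLM loss
logits = masked_entity_logits(torch.randn(4, D))
loss = F.cross_entropy(logits, torch.randint(0, Ve, (4,)))
```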

3.4 Modeling Details

Our model configuration follows RoBERTa-LARGE (Liu et al., 2020), pretrained CWRs based on a bidirectional transformer and a variant of BERT (Devlin et al., 2019). In particular, our model is based on the bidirectional transformer with $D = 1024$ hidden dimensions, 24 hidden layers, $L = 64$ attention head dimensions, and 16 self-attention heads. The number of dimensions of the entity token embedding is set to $H = 256$. The total number of parameters is approximately 483M, consisting of 355M in RoBERTa and 128M in our entity embeddings. The input text is tokenized into words using RoBERTa's tokenizer with a vocabulary consisting of $V_w$ = 50K words. For computational efficiency, our entity vocabulary does not include all entities but only the $V_e$ = 500K entities most frequently appearing in our entity annotations. The entity vocabulary also includes two special entities, i.e., [MASK] and [UNK].

The model is trained via iterations over Wikipedia pages in a random order for 200K steps. To reduce training time, we initialize the parameters that LUKE has in common with RoBERTa (the parameters in the transformer and the embeddings for words) using RoBERTa. Following past work (Devlin et al., 2019; Liu et al., 2020), we mask 15% of all words and entities at random. If an entity does not exist in the vocabulary, we replace it with the [UNK] entity. We perform pretraining using the original self-attention mechanism rather than our entity-aware self-attention mechanism because we want an ablation study of our mechanism but cannot afford to run pretraining twice. The query matrices of our self-attention mechanism ($\mathbf{Q}_{w2e}$, $\mathbf{Q}_{e2w}$, and $\mathbf{Q}_{e2e}$) are learned using downstream datasets. Further details of our pretraining are described in Appendix A.

4 Experiments

We conduct extensive experiments using five entity-related tasks: entity typing, relation classification, NER, cloze-style QA, and extractive QA. We use similar model architectures for all tasks based on a simple linear classifier on top of the representations of words, entities, or both. Unless otherwise specified, we create the input word sequence by inserting the [CLS] and [SEP] tokens into the original word sequence as the first and the last tokens, respectively. The input entity sequence is built using [MASK] entities, special entities introduced for the task, or Wikipedia entities. The token embedding of a task-specific special entity is initialized using that of the [MASK] entity, and the query matrices of our entity-aware self-attention mechanism ($\mathbf{Q}_{w2e}$, $\mathbf{Q}_{e2w}$, and $\mathbf{Q}_{e2e}$) are initialized using the original query matrix $\mathbf{Q}$.

Name                            Prec.  Rec.  F1
UFET (Zhang et al., 2019)       77.4   60.6  68.0
BERT (Zhang et al., 2019)       76.4   71.0  73.6
ERNIE (Zhang et al., 2019)      78.4   72.9  75.6
KEPLER (Wang et al., 2019b)     77.2   74.2  75.7
KnowBERT (Peters et al., 2019)  78.6   73.7  76.1
K-Adapter (Wang et al., 2020)   79.3   75.8  77.5
RoBERTa (Wang et al., 2020)     77.6   75.0  76.2
LUKE                            79.9   76.6  78.2

Table 1: Results of entity typing on the Open Entity dataset.

Because we use RoBERTa as the base model in our pretraining, we use it as our primary baseline for all tasks. We omit a description of the baseline models in each section if they are described in Section 2. Further details of our experiments are available in Appendix B.

4.1 Entity Typing

We first conduct experiments on entity typing, which is the task of predicting the types of an entity in the given sentence. Following Zhang et al. (2019), we use the Open Entity dataset (Choi et al., 2018), and consider only nine general entity types. Following Wang et al. (2020), we report loose micro-precision, recall, and F1, and employ the micro-F1 as the primary metric.

Model We represent the target entity using the [MASK] entity, and enter words and the entity in each sentence into the model. We then classify the entity using a linear classifier based on the corresponding entity representation. We treat the task as multi-label classification, and train the model using binary cross-entropy loss averaged over all entity types.
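A sketch of this classification head under the setup just described (a single [MASK] entity representation, nine types, binary cross-entropy) follows; class and argument names are assumptions.

```python
import torch
import torch.nn as nn

class EntityTypingHead(nn.Module):
    """Multi-label classifier over the [MASK] entity representation."""

    def __init__(self, hidden_size: int = 1024, num_types: int = 9):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_types)
        self.loss_fn = nn.BCEWithLogitsLoss()  # averaged binary cross-entropy

    def forward(self, mask_entity_repr, labels=None):
        logits = self.classifier(mask_entity_repr)  # (batch, num_types)
        if labels is not None:
            return logits, self.loss_fn(logits, labels.float())
        return logits
```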

Baselines UFET (Choi et al., 2018) is a conventional model that computes context representations using the bidirectional LSTM. We also use BERT, RoBERTa, ERNIE, KnowBERT, KEPLER, and K-Adapter as baselines.

Results Table 1 shows the experimental results. LUKE significantly outperforms our primary baseline, RoBERTa, by 2.0 F1 points, and the previous best published model, KnowBERT, by 2.1 F1 points. Furthermore, LUKE achieves a new state of the art by outperforming K-Adapter by 0.7 F1 points.


Name                               Prec.  Rec.  F1
BERT (Zhang et al., 2019)          67.2   64.8  66.0
C-GCN (Zhang et al., 2018b)        69.9   63.3  66.4
ERNIE (Zhang et al., 2019)         70.0   66.1  68.0
SpanBERT (Joshi et al., 2020)      70.8   70.9  70.8
MTB (Baldini Soares et al., 2019)  -      -     71.5
KnowBERT (Peters et al., 2019)     71.6   71.4  71.5
KEPLER (Wang et al., 2019b)        70.4   73.0  71.7
K-Adapter (Wang et al., 2020)      68.9   75.4  72.0
RoBERTa (Wang et al., 2020)        70.2   72.4  71.3
LUKE                               70.4   75.1  72.7

Table 2: Results of relation classification on the TACRED dataset.

4.2 Relation Classification

Relation classification determines the correct relation between head and tail entities in a sentence. We conduct experiments using the TACRED dataset (Zhang et al., 2017), a large-scale relation classification dataset containing 106,264 sentences with 42 relation types. Following Wang et al. (2020), we report the micro-precision, recall, and F1, and use the micro-F1 as the primary metric.

Model We introduce two special entities, [HEAD] and [TAIL], to represent the head and the tail entities, respectively, and input words and these two entities in each sentence to the model. We then solve the task using a linear classifier based on a concatenated representation of the head and tail entities. The model is trained using cross-entropy loss.
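The corresponding head can be sketched as below, concatenating the [HEAD] and [TAIL] entity representations before a linear classifier trained with cross-entropy; names are illustrative.

```python
import torch
import torch.nn as nn

class RelationClassificationHead(nn.Module):
    """Linear classifier over the concatenated head/tail entity representations."""

    def __init__(self, hidden_size: int = 1024, num_relations: int = 42):
        super().__init__()
        self.classifier = nn.Linear(2 * hidden_size, num_relations)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, head_repr, tail_repr, labels=None):
        logits = self.classifier(torch.cat([head_repr, tail_repr], dim=-1))
        if labels is not None:
            return logits, self.loss_fn(logits, labels)
        return logits
```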

Baselines C-GCN (Zhang et al., 2018b) uses graph convolutional networks over dependency tree structures to solve the task. MTB (Baldini Soares et al., 2019) learns relation representations based on BERT through the matching-the-blanks task using a large amount of entity-annotated text. We also compare LUKE with BERT, RoBERTa, SpanBERT, ERNIE, KnowBERT, KEPLER, and K-Adapter.

Results The experimental results are presented in Table 2. LUKE clearly outperforms our primary baseline, RoBERTa, by 1.4 F1 points, and the previous best published models, namely MTB and KnowBERT, by 1.2 F1 points. Furthermore, it achieves a new state of the art by outperforming K-Adapter by 0.7 F1 points.

4.3 Named Entity Recognition

We conduct experiments on the NER task using the standard CoNLL-2003 dataset (Tjong Kim Sang and De Meulder, 2003). Following past work, we report the span-level F1.

Name                          F1
LSTM-CRF (Lample et al., 2016)  91.0
ELMo (Peters et al., 2018)      92.2
BERT (Devlin et al., 2019)      92.8
Akbik et al. (2018)             93.1
Baevski et al. (2019)           93.5
RoBERTa                         92.4
LUKE                            94.3

Table 3: Results of named entity recognition on the CoNLL-2003 dataset.

Model Following Sohrab and Miwa (2018), we solve the task by enumerating all possible spans (or n-grams) in each sentence as entity name candidates, and classifying them into the target entity types or the non-entity type, which indicates that the span is not an entity. For each sentence in the dataset, we enter words and the [MASK] entities corresponding to all possible spans. The representation of each span is computed by concatenating the word representations of the first and last words in the span, and the entity representation corresponding to the span. We classify each span using a linear classifier with its representation, and train the model using cross-entropy loss. We exclude spans longer than 16 words for computational efficiency. During inference, we first exclude all spans classified into the non-entity type. To avoid selecting overlapping spans, we greedily select a span from the remaining spans based on the logit of its predicted entity type in descending order if the span does not overlap with those already selected. Following Devlin et al. (2019), we include the maximal document context in the target document.
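The span classification and greedy decoding step can be sketched as follows; the helper names and the classifier (a linear layer over the concatenated first-word, last-word, and entity representations) are assumptions about one possible implementation.

```python
import torch

def classify_spans(word_reprs, entity_reprs, spans, classifier):
    # word_reprs: (seq_len, D); entity_reprs: (num_spans, D); spans: [(start, end)]
    # classifier: e.g. nn.Linear(3 * D, num_types + 1), last column = non-entity
    feats = torch.stack([
        torch.cat([word_reprs[s], word_reprs[e], entity_reprs[i]])
        for i, (s, e) in enumerate(spans)
    ])
    return classifier(feats)  # (num_spans, num_types + 1)

def greedy_decode(spans, logits, non_entity_index):
    preds = logits.argmax(dim=-1)
    scores = logits.max(dim=-1).values
    selected = []
    # visit spans by descending logit, skip non-entities, keep non-overlapping ones
    for i in scores.argsort(descending=True).tolist():
        if preds[i].item() == non_entity_index:
            continue
        s, e = spans[i]
        if all(e < s2 or s > e2 for s2, e2 in (spans[j] for j in selected)):
            selected.append(i)
    return [(spans[i], preds[i].item()) for i in selected]
```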

Baselines LSTM-CRF (Lample et al., 2016) is a model based on the bidirectional LSTM with conditional random fields (CRF). Akbik et al. (2018) address the task using the bidirectional LSTM with CRF enhanced with character-level contextualized representations. Similarly, Baevski et al. (2019) use the bidirectional LSTM with CRF enhanced with CWRs based on a bidirectional transformer. We also use ELMo, BERT, and RoBERTa as baselines. To conduct a fair comparison with RoBERTa, we report its performance using the model described above with the span representation computed by concatenating the representations of the first and last words of the span.

Results The experimental results are shown in Table 3. LUKE outperforms RoBERTa by 1.9 F1 points. Furthermore, it achieves a new state of the art on this competitive dataset by outperforming the previous state of the art reported in Baevski et al. (2019) by 0.8 F1 points.

4.4 Cloze-style Question Answering

We evaluate our model on the ReCoRD dataset (Zhang et al., 2018a), a cloze-style QA dataset consisting of over 120K examples. An interesting characteristic of this dataset is that most of its questions cannot be solved without external knowledge. The following is an example question and its answer in the dataset:

Question: According to claims in the suit, "Parts of 'Stairway to Heaven,' instantly recognizable to the music fans across the world, sound almost identical to significant portions of 'X.'"
Answer: Taurus

Given a question and a passage, the task is to find the entity mentioned in the passage that fits the missing entity (denoted by X in the question above). In this dataset, annotations of entity spans (start and end positions) in a passage are provided, and the answer is contained in the provided entity spans one or multiple times. Following past work, we evaluate the models using exact match (EM) and token-level F1 on the development and test sets.

Model We solve this task by assigning a relevance score to each entity in the passage and selecting the entity with the highest score as the answer. Following Liu et al. (2020), given a question $q_1, q_2, \ldots, q_j$ and a passage $p_1, p_2, \ldots, p_l$, the input word sequence is constructed as: [CLS] $q_1, q_2, \ldots, q_j$ [SEP] [SEP] $p_1, p_2, \ldots, p_l$ [SEP]. Further, we input [MASK] entities corresponding to the missing entity and all entities in the passage. We compute the relevance score of each entity in the passage using a linear classifier with the concatenated representation of the missing entity and the corresponding entity. We train the model using binary cross-entropy loss averaged over all entities in the passage, and select the entity with the highest score (logit) as the answer.
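A sketch of the relevance scorer described above follows; it concatenates the missing-entity representation with each passage entity's representation and scores the pair with a linear layer under binary cross-entropy. Names are illustrative.

```python
import torch
import torch.nn as nn

class EntityRelevanceScorer(nn.Module):
    """Scores each passage entity against the missing ([MASK]) entity."""

    def __init__(self, hidden_size: int = 1024):
        super().__init__()
        self.scorer = nn.Linear(2 * hidden_size, 1)
        self.loss_fn = nn.BCEWithLogitsLoss()  # averaged over all passage entities

    def forward(self, missing_entity_repr, passage_entity_reprs, labels=None):
        # missing_entity_repr: (D,); passage_entity_reprs: (num_entities, D)
        pairs = torch.cat([missing_entity_repr.expand_as(passage_entity_reprs),
                           passage_entity_reprs], dim=-1)
        logits = self.scorer(pairs).squeeze(-1)  # (num_entities,)
        if labels is not None:
            return logits, self.loss_fn(logits, labels.float())
        return logits  # at inference, argmax over logits picks the answer entity
```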

Baselines DocQA+ELMo (Clark and Gardner, 2018) is a model based on ELMo, bidirectional attention flow (Seo et al., 2017), and the self-attention mechanism. XLNet+Verifier (Li et al., 2019) is a model based on XLNet with rule-based answer verification, and is the winner of a recent competition based on this dataset (Ostermann et al., 2019). We also use BERT and RoBERTa as baselines.

Name                                   Dev EM  Dev F1  Test EM  Test F1
DocQA+ELMo (Zhang et al., 2018a)       44.1    45.4    45.4     46.7
BERT (Wang et al., 2019a)              -       -       71.3     72.0
XLNet+Verifier (Li et al., 2019)       80.6    82.1    81.5     82.7
RoBERTa (Liu et al., 2020)             89.0    89.5    -        -
RoBERTa (ensemble) (Liu et al., 2020)  -       -       90.0     90.6
LUKE                                   90.8    91.4    90.6     91.2

Table 4: Results of cloze-style question answering on the ReCoRD dataset. All models except RoBERTa (ensemble) are based on a single model.

Results The results are presented in Table 4. LUKE significantly outperforms RoBERTa, the best baseline, on the development set by 1.8 EM points and 1.9 F1 points. Furthermore, it achieves superior results to RoBERTa (ensemble) on the test set without ensembling the models.

4.5 Extractive Question Answering

Finally, we conduct experiments using the well-known Stanford Question Answering Dataset (SQuAD) 1.1 consisting of 100K question/answer pairs (Rajpurkar et al., 2016). Given a question and a Wikipedia passage containing the answer, the task is to predict the answer span in the passage. Following past work, we report the EM and token-level F1 on the development and test sets.

Model We construct the word sequence from the question and the passage in the same way as in the previous experiment. Unlike in the other experiments, we input Wikipedia entities into the model based on entity annotations automatically generated on the question and the passage using a mapping from entity names (e.g., "U.S.") to their referent entities (e.g., United States). The mapping is automatically created using the entity hyperlinks in Wikipedia as described in detail in Appendix C. We solve this task using the same model architecture as that of BERT and RoBERTa. In particular, we use two linear classifiers independently on top of the word representations to predict the span boundary of the answer (i.e., the start and end positions), and train the model using cross-entropy loss.
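The span-boundary head is the standard BERT/RoBERTa formulation; a minimal sketch, with illustrative names, is given below.

```python
import torch
import torch.nn as nn

class SpanBoundaryHead(nn.Module):
    """Two linear classifiers over the word representations for start/end positions."""

    def __init__(self, hidden_size: int = 1024):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden_size, 2)  # start and end logits
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, word_reprs, start_positions=None, end_positions=None):
        # word_reprs: (batch, seq_len, hidden_size)
        start_logits, end_logits = self.qa_outputs(word_reprs).split(1, dim=-1)
        start_logits, end_logits = start_logits.squeeze(-1), end_logits.squeeze(-1)
        if start_positions is not None:
            loss = (self.loss_fn(start_logits, start_positions)
                    + self.loss_fn(end_logits, end_positions)) / 2
            return start_logits, end_logits, loss
        return start_logits, end_logits
```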

Baselines We compare our models with the results of recent CWRs, including BERT, RoBERTa, SpanBERT, XLNet, and ALBERT. Because the results for RoBERTa and ALBERT are reported only on the development set, we conduct a comparison with these models using this set. To conduct a fair comparison with RoBERTa, we use the same model architecture and hyper-parameters as those of RoBERTa (Liu et al., 2020).

Name                           Dev EM  Dev F1  Test EM  Test F1
BERT (Devlin et al., 2019)     84.2    91.1    85.1     91.8
SpanBERT (Joshi et al., 2020)  -       -       88.8     94.6
XLNet (Yang et al., 2019)      89.0    94.5    89.9     95.1
ALBERT (Lan et al., 2020)      89.3    94.8    -        -
RoBERTa (Liu et al., 2020)     88.9    94.6    -        -
LUKE                           89.8    95.0    90.2     95.4

Table 5: Results of extractive question answering on the SQuAD 1.1 dataset.

Name                    CoNLL-2003 (Test F1)  SQuAD (Dev EM)  SQuAD (Dev F1)
LUKE w/o entity inputs  92.9                  89.2            94.8
LUKE                    94.3                  89.8            95.0

Table 6: Ablation study of our entity representations.

Results The experimental results are presented in Table 5. LUKE outperforms our primary baseline, RoBERTa, by 0.9 EM points and 0.4 F1 points on the development set. Furthermore, it achieves a new state of the art on this competitive dataset by outperforming XLNet by 0.3 points both in terms of EM and F1. Note that XLNet uses a more sophisticated model involving beam search than the other models considered here.

5 Analysis

In this section, we provide a detailed analysis of LUKE by reporting three additional experiments.

5.1 Effects of Entity Representations

To investigate how our entity representations influence performance on downstream tasks, we perform an ablation experiment by addressing NER on the CoNLL-2003 dataset and extractive QA on the SQuAD dataset without inputting any entities. In this setting, LUKE uses only the word sequence to compute the representation for each word. We address the tasks using the same model architectures as those for RoBERTa described in the corresponding sections. As shown in Table 6, this setting clearly degrades performance, i.e., 1.4 F1 points on the CoNLL-2003 dataset and 0.6 EM points on the SQuAD dataset, demonstrating the effectiveness of our entity representations on these two tasks.

5.2 Effects of Entity-aware Self-attention

We conduct an ablation study of our entity-aware self-attention mechanism by comparing the performance of LUKE using our mechanism with that using the original mechanism of the transformer. As shown in Table 7, our entity-aware self-attention mechanism consistently outperforms the original mechanism across all tasks. Furthermore, we observe significant improvements on two kinds of tasks, relation classification (TACRED) and QA (ReCoRD and SQuAD). Because these tasks involve reasoning based on relationships between entities, we consider that our mechanism enables the model (i.e., attention heads) to easily focus on capturing the relationships between entities.

5.3 Effects of Extra Pretraining

As mentioned in Section 3.4, LUKE is based on RoBERTa with pretraining for 200K steps using our Wikipedia corpus. Because past studies (Liu et al., 2020; Lan et al., 2020) suggest that simply increasing the number of training steps of CWRs tends to improve performance on downstream tasks, the superior experimental results of LUKE compared with those of RoBERTa may be obtained because of its greater number of pretraining steps. To investigate this, we train another model based on RoBERTa with extra pretraining based on the MLM using the Wikipedia corpus for 200K training steps. The detailed configuration used in the pretraining is available in Appendix A.

We evaluate the performance of this model on the CoNLL-2003 and SQuAD datasets using the same model architectures as those for RoBERTa described in the corresponding sections. As shown in Table 8, the model achieves similar performance to the original RoBERTa on both datasets, which indicates that the superior performance of LUKE is not owing to its longer pretraining.

6 Conclusions

In this paper, we propose LUKE, new pretrained contextualized representations of words and entities based on the transformer. LUKE outputs the contextualized representations of words and entities using an improved transformer architecture with a novel entity-aware self-attention mechanism. The experimental results prove its effectiveness on various entity-related tasks. Future work involves applying LUKE to domain-specific tasks, such as those in biomedical and legal domains.


Name                    Open Entity  TACRED     CoNLL-2003  ReCoRD    ReCoRD    SQuAD     SQuAD
                        (Test F1)    (Test F1)  (Test F1)   (Dev EM)  (Dev F1)  (Dev EM)  (Dev F1)
Original Attention      77.9         72.2       94.1        90.1      90.7      89.2      94.7
Entity-aware Attention  78.2         72.7       94.3        90.8      91.4      89.8      95.0

Table 7: Ablation study of our entity-aware self-attention mechanism.

Name                       CoNLL-2003 (Test F1)  SQuAD (Dev EM)  SQuAD (Dev F1)
RoBERTa w/ extra training  92.5                  89.1            94.7
RoBERTa                    92.4                  88.9            94.6
LUKE                       94.3                  89.8            95.0

Table 8: Results of RoBERTa additionally trained using our Wikipedia corpus.

References

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual String Embeddings for Sequence Labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649.

Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli. 2019. Cloze-driven Pretraining of Self-attention Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 5360–5369.

Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the Blanks: Distributional Similarity for Relation Learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2895–2905.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating Embeddings for Modeling Multi-relational Data. In Advances in Neural Information Processing Systems 26, pages 2787–2795.

Yixin Cao, Lifu Huang, Heng Ji, Xu Chen, and Juanzi Li. 2017. Bridge Text and Knowledge by Learning Multi-Prototype Entity Mention Embedding. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1623–1633.

Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. 2018. Ultra-Fine Entity Typing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 87–96.

Christopher Clark and Matt Gardner. 2018. Simple and Effective Multi-Paragraph Reading Comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 845–855.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What Does BERT Look at? An Analysis of BERT's Attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Octavian-Eugen Ganea and Thomas Hofmann. 2017. Deep Joint Entity Disambiguation with Local Neural Attention. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2619–2629.

Dan Hendrycks and Kevin Gimpel. 2016. Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415v3.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving Pre-training by Representing and Predicting Spans. Transactions of the Association for Computational Linguistics, 8:64–77.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural Architectures for Named Entity Recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In International Conference on Learning Representations.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization. arXiv preprint arXiv:1607.06450v1.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880.


Xiepeng Li, Zhexi Zhang, Wei Zhu, Zheng Li, Yuan Ni, Peng Gao, Junchi Yan, and Guotong Xie. 2019. Pingan Smart Health and SJTU at COIN - Shared Task: utilizing Pre-trained Language Models and Common-sense Knowledge in Machine Reading Tasks. In Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, pages 93–98.

Jeffrey Ling, Nicholas FitzGerald, Zifei Shan, Livio Baldini Soares, Thibault Fevry, David Weiss, and Tom Kwiatkowski. 2020. Learning Cross-Context Entity Representations from Text. arXiv preprint arXiv:2001.03765v1.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2020. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692v1.

Simon Ostermann, Sheng Zhang, Michael Roth, and Peter Clark. 2019. Commonsense Inference in Natural Language Processing (COIN) - Shared Task Report. In Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, pages 66–74.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.

Matthew E. Peters, Mark Neumann, Robert Logan, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge Enhanced Contextual Word Representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 43–54.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140):1–67.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.

Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B. Viegas, Andy Coenen, Adam Pearce, and Been Kim. 2019. Visualizing and Measuring the Geometry of BERT. In Advances in Neural Information Processing Systems 32, pages 8594–8603.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hananneh Hajishirzi. 2017. Bidirectional Attention Flow for Machine Comprehension. In International Conference on Learning Representations.

Mohammad Golam Sohrab and Makoto Miwa. 2018. Deep Exhaustive Model for Nested Named Entity Recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2843–2849.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003.

Theo Trouillon, Johannes Welbl, Sebastian Riedel, Eric Gaussier, and Guillaume Bouchard. 2016. Complex Embeddings for Simple Link Prediction. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 2071–2080.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems 30, pages 5998–6008.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019a. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. In Advances in Neural Information Processing Systems 32, pages 3266–3280.

Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Cuihong Cao, Daxin Jiang, Ming Zhou, and others. 2020. K-Adapter: Infusing Knowledge into Pre-trained Models with Adapters. arXiv preprint arXiv:2002.01808v3.

Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhiyuan Liu, Juanzi Li, and Jian Tang. 2019b. KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation. arXiv preprint arXiv:1911.06136v1.

Wenhan Xiong, Jingfei Du, William Yang Wang, and Veselin Stoyanov. 2020. Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model. In International Conference on Learning Representations.

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2016. Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 250–259.

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2017. Learning Distributed Representations of Texts and Entities from Knowledge Base. Transactions of the Association for Computational Linguistics, 5:397–411.


Bishan Yang, Scott Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2015. Embedding Entities and Relations for Learning and Inference in Knowledge Bases. In Proceedings of the International Conference on Learning Representations.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237v1.

Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018a. ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension. arXiv preprint arXiv:1810.12885v1.

Yuhao Zhang, Peng Qi, and Christopher D. Manning. 2018b. Graph Convolution over Pruned Dependency Trees Improves Relation Extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2205–2215.

Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware Attention and Supervised Data Improve Slot Filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 35–45.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced Language Representation with Informative Entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1441–1451.

Appendix for "LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention"

A Details of Pretraining

As input corpus for pretraining, we use the December 2018 version of Wikipedia, comprising approximately 3.5 billion words and 11 million entity annotations. We generate input sequences by splitting the content of each page into sequences comprising ≤ 512 words and their entity annotations (i.e., hyperlinks). We optimize the model using AdamW with learning rate warmup and linear decay of the learning rate. To stabilize training, we update only those parameters that are randomly initialized (i.e., we fix the parameters that are initialized using RoBERTa) in the first 100K steps, and update all parameters in the remaining 100K steps. We run the pretraining on NVIDIA's PyTorch Docker container 19.02 hosted on a server with two Intel Xeon Platinum 8168 CPUs and 16 NVIDIA Tesla V100 GPUs. The training takes approximately 30 days. The detailed hyper-parameters are shown in Table 9.

Name                                   Value
Maximum word length                    512
Batch size                             2048
Peak learning rate                     1e-5
Peak learning rate (first 100K steps)  5e-4
Learning rate decay                    linear
Warmup steps                           2500
Mask probability for words             15%
Mask probability for entities          15%
Dropout                                0.1
Weight decay                           0.01
Gradient clipping                      none
Adam β1                                0.9
Adam β2                                0.999
Adam ε                                 1e-6

Table 9: Hyper-parameters used to pretrain LUKE.

Table 10 shows the hyper-parameters used for the extra pretraining of RoBERTa on our Wikipedia corpus described in Section 5. As shown in the table, we use the same hyper-parameters as the ones used to train LUKE. We train the model for 200K steps and update all parameters throughout the training.

Name                        Value
Maximum word length         512
Batch size                  2048
Peak learning rate          1e-5
Learning rate decay         linear
Warmup steps                2500
Mask probability for words  15%
Dropout                     0.1
Weight decay                0.01
Gradient clipping           none
Adam β1                     0.9
Adam β2                     0.999
Adam ε                      1e-6

Table 10: Hyper-parameters used for the extra pretraining of RoBERTa on our Wikipedia corpus.
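The schedule above (AdamW, linear warmup and decay, and a first phase in which the RoBERTa-initialized parameters are frozen) can be sketched as follows; the parameter grouping and function names are assumptions, not the released training code.

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def make_optimizer_and_scheduler(params, peak_lr, total_steps, warmup_steps=2500):
    optimizer = AdamW(params, lr=peak_lr, betas=(0.9, 0.999),
                      eps=1e-6, weight_decay=0.01)

    def lr_lambda(step):
        # linear warmup followed by linear decay to zero
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    return optimizer, LambdaLR(optimizer, lr_lambda)

# phase 1 (first 100K steps): update only randomly initialized parameters, peak LR 5e-4
#   for p in roberta_params: p.requires_grad = False
#   opt, sched = make_optimizer_and_scheduler(new_params, 5e-4, total_steps=100_000)
# phase 2 (remaining 100K steps): update all parameters, peak LR 1e-5
#   for p in roberta_params: p.requires_grad = True
#   opt, sched = make_optimizer_and_scheduler(all_params, 1e-5, total_steps=100_000)
```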

B Details of Experiments

We conduct the experiments using NVIDIA's PyTorch Docker container 19.02 hosted on a server with two Intel Xeon E5-2698 v4 CPUs and eight V100 GPUs. For each dataset, excluding SQuAD, we conduct hyper-parameter tuning using grid search based on the performance on the development set. We evaluate performance using EM on the ReCoRD dataset, and F1 on the other datasets. Because our computational resources are limited, we use the following constrained search space:

• learning rate: 1e-5, 2e-5, 3e-5
• batch size: 4, 8, 16, 32, 64
• number of training epochs: 2, 3, 5

We do not tune the hyper-parameters of the SQuAD dataset, and use the ones described in Liu et al. (2020). The hyper-parameters and other details, including the training time, number of GPUs used, and the best score on the development set, are shown in Table 11. For the other hyper-parameters, we simply follow Liu et al. (2020) (see Table 12). We optimize the model using AdamW with learning rate warmup and linear decay of the learning rate. We also use early stopping based on performance on the development set. The details of the datasets used in our experiments are provided below.

Name             Open Entity  TACRED   CoNLL-2003  ReCoRD           SQuAD
Learning rate    1e-5         1e-5     1e-5        1e-5             15e-6
Batch size       4            32       8           32               48
Training epochs  3            5        5           2                2
Training time    10min        190min   203min      92min            42min
Number of GPUs   1            1        1           8                8
Dev score        78.5 F1      72.0 F1  97.1 F1     90.8 EM/91.4 F1  89.8 EM/95.0 F1

Table 11: Hyper-parameters and other details of our experiments.

Name                 Value
Maximum word length  512
Learning rate decay  linear
Warmup ratio         0.06
Dropout              0.1
Weight decay         0.01
Gradient clipping    none
Adam β1              0.9
Adam β2              0.98
Adam ε               1e-6

Table 12: Common hyper-parameters used in our experiments.

B.1 Open Entity

The Open Entity dataset used in Zhang et al. (2019) consists of training, development, and test sets, where each set contains 1,998 examples with labels of nine general entity types. The dataset is downloaded from the website for Zhang et al. (2019).2 We compute the reported results using our code based on that of Zhang et al. (2019).

2 https://github.com/thunlp/ERNIE

B.2 TACRED

The TACRED dataset contains 68,124 training examples, 22,631 development examples, and 15,509 test examples with labels of their relation types. The total number of relation types is 42. The dataset is obtained from the LDC website.3 We compute the reported results using our code based on that of Zhang et al. (2019).

B.3 CoNLL-2003

The CoNLL-2003 dataset comprises training, development, and test sets, containing 14,987, 3,466, and 3,684 sentences, respectively. Each sentence contains annotations of four entity types, namely person, location, organization, and miscellaneous. The dataset is downloaded from the relevant website.4 The reported results are computed using the conlleval script obtained from the website.

B.4 ReCoRD

The ReCoRD dataset consists of 100,730 training, 10,000 development, and 10,000 test questions created based on 80,121 unique news articles. The dataset is obtained from the relevant website.5 We compute the performance on the development set using the official evaluation script downloaded from the website. Performance on the test set is obtained by submitting our model to the leaderboard.

B.5 SQuAD 1.1

The SQuAD 1.1 dataset contains 87,599 training, 10,570 development, and 9,533 test questions created based on 536 Wikipedia articles. The dataset is downloaded from the relevant website.6 We compute performance on the development set using the official evaluation script downloaded from the website. Performance on the test set is obtained by submitting our model to the leaderboard.

3 https://catalog.ldc.upenn.edu/LDC2018T24
4 https://www.clips.uantwerpen.be/conll2003/ner
5 https://sheng-z.github.io/ReCoRD-explorer
6 https://rajpurkar.github.io/SQuAD-explorer


C Adding Entity Annotations to the SQuAD Dataset

For each question–passage pair in the SQuAD dataset, we first create a mapping from the entity names (e.g., "U.S.") to their referent Wikipedia entities (e.g., United States) using the entity hyperlinks on the source Wikipedia page of the passage. We then perform simple string matching to extract all entity names in the question and the passage, and treat all matched entity names as entity annotations for their referent entities. We ignore an entity name if the name refers to multiple entities on the page. Further, to reduce noise, we also exclude an entity name if its link probability, the probability that the name appears as a hyperlink in Wikipedia, is lower than 1%. We use the March 2016 version of Wikipedia to collect the entity hyperlinks and the link probabilities of the entity names.
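The annotation procedure can be sketched as below; the data structures (a name-to-entity mapping already restricted to unambiguous names on the page, and a name-to-link-probability table) are assumptions about how the described steps could be implemented.

```python
def annotate(text, name_to_entity, link_probability, min_link_prob=0.01):
    """Return (start, end, entity) annotations found by simple string matching."""
    annotations = []
    for name, entity in name_to_entity.items():
        # skip noisy names that rarely appear as hyperlinks in Wikipedia
        if link_probability.get(name, 0.0) < min_link_prob:
            continue
        start = text.find(name)
        while start != -1:
            annotations.append((start, start + len(name), entity))
            start = text.find(name, start + 1)
    return annotations
```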