
Knowledge Enhanced Contextual Word Representations

Matthew E. Peters1, Mark Neumann1, Robert L. Logan IV2, Roy Schwartz1,3, Vidur Joshi1, Sameer Singh2, and Noah A. Smith1,3

1Allen Institute for Artificial Intelligence, Seattle, WA, USA
2University of California, Irvine, CA, USA
3Paul G. Allen School of Computer Science & Engineering, University of Washington

{matthewp,markn,roys,noah}@allenai.org

{rlogan,sameer}@uci.edu

Abstract

Contextual word representations, typically trained on unstructured, unlabeled text, do not contain any explicit grounding to real world entities and are often unable to remember facts about those entities. We propose a general method to embed multiple knowledge bases (KBs) into large scale models, and thereby enhance their representations with structured, human-curated knowledge. For each KB, we first use an integrated entity linker to retrieve relevant entity embeddings, then update contextual word representations via a form of word-to-entity attention. In contrast to previous approaches, the entity linkers and self-supervised language modeling objective are jointly trained end-to-end in a multitask setting that combines a small amount of entity linking supervision with a large amount of raw text. After integrating WordNet and a subset of Wikipedia into BERT, the knowledge enhanced BERT (KnowBert) demonstrates improved perplexity, ability to recall facts as measured in a probing task, and downstream performance on relationship extraction, entity typing, and word sense disambiguation. KnowBert's runtime is comparable to BERT's and it scales to large KBs.

1 Introduction

Large pretrained models such as ELMo (Peters et al., 2018), GPT (Radford et al., 2018), and BERT (Devlin et al., 2019) have significantly improved the state of the art for a wide range of NLP tasks. These models are trained on large amounts of raw text using self-supervised objectives. However, they do not contain any explicit grounding to real world entities and as a result have difficulty recovering factual knowledge (Logan et al., 2019).

Knowledge bases (KBs) provide a rich source of high quality, human-curated knowledge that can be used to ground these models. In addition, they often include complementary information to that found in raw text, and can encode factual knowledge that is difficult to learn from selectional preferences either due to infrequent mentions of commonsense knowledge or long range dependencies.

We present a general method to insert multiple KBs into a large pretrained model with a Knowledge Attention and Recontextualization (KAR) mechanism. The key idea is to explicitly model entity spans in the input text and use an entity linker to retrieve relevant entity embeddings from a KB to form knowledge enhanced entity-span representations. Then, the model recontextualizes the entity-span representations with word-to-entity attention to allow long range interactions between contextual word representations and all entity spans in the context. The entire KAR is inserted between two layers in the middle of a pretrained model such as BERT.

In contrast to previous approaches that integrate external knowledge into task-specific models with task supervision (e.g., Yang and Mitchell, 2017; Chen et al., 2018), our approach learns the entity linkers with self-supervision on unlabeled data. This results in general purpose knowledge enhanced representations that can be applied to a wide range of downstream tasks.

Our approach has several other benefits. First, it leaves the top layers of the original model unchanged so we may retain the output loss layers and fine-tune on unlabeled corpora while training the KAR. This also allows us to simply swap out BERT for KnowBert in any downstream application. Second, by taking advantage of the existing high capacity layers in the original model, the KAR is lightweight, adding minimal additional parameters and runtime. Finally, it is easy to incorporate additional KBs by simply inserting them at other locations.

KnowBert is agnostic to the form of the KB, subject to a small set of requirements (see Sec. 3.2). We experiment with integrating both WordNet (Miller, 1995) and Wikipedia, thus explicitly adding word sense knowledge and facts about named entities (including those unseen at training time). However, the method could be extended to commonsense KBs such as ConceptNet (Speer et al., 2017) or domain specific ones (e.g., UMLS; Bodenreider, 2004).

We evaluate KnowBert with a mix of intrinsic and extrinsic tasks. Despite being based on the smaller BERTBASE model, the experiments demonstrate improved masked language model perplexity and ability to recall facts over BERTLARGE. The extrinsic evaluations demonstrate improvements for challenging relationship extraction, entity typing and word sense disambiguation datasets, and often outperform other contemporaneous attempts to incorporate external knowledge into BERT.

2 Related Work

Pretrained word representations Initial work learning word vectors focused on static word embeddings using multi-task learning objectives (Collobert and Weston, 2008) or corpus level co-occurrence statistics (Mikolov et al., 2013a; Pennington et al., 2014). Recently the field has shifted toward learning context-sensitive embeddings (Dai and Le, 2015; McCann et al., 2017; Peters et al., 2018; Devlin et al., 2019). We build upon these by incorporating structured knowledge into these models.

Entity embeddings Entity embedding methods produce continuous vector representations from external knowledge sources. Knowledge graph-based methods optimize the score of observed triples in a knowledge graph. These methods broadly fall into two categories: translational distance models (Bordes et al., 2013; Wang et al., 2014b; Lin et al., 2015; Xiao et al., 2016) which use a distance-based scoring function, and linear models (Nickel et al., 2011; Yang et al., 2014; Trouillon et al., 2016; Dettmers et al., 2018) which use a similarity-based scoring function. We experiment with TuckER (Balazevic et al., 2019) embeddings, a recent linear model which generalizes many of the aforecited models. Other methods combine entity metadata with the graph (Xie et al., 2016), use entity contexts (Chen et al., 2014; Ganea and Hofmann, 2017), or a combination of contexts and the KB (Wang et al., 2014a; Gupta et al., 2017). Our approach is agnostic to the details of the entity embedding method and as a result is able to use any of these methods.

Entity-aware language models Some previous work has focused on adding KBs to generative language models (LMs) (Ahn et al., 2017; Yang et al., 2017; Logan et al., 2019) or building entity-centric LMs (Ji et al., 2017). However, these methods introduce latent variables that require full annotation for training, or marginalization. In contrast, we adopt a method that allows training with large amounts of unannotated text.

Task-specific KB architectures Other work has focused on integrating KBs into neural architectures for specific downstream tasks (Yang and Mitchell, 2017; Sun et al., 2018; Chen et al., 2018; Bauer et al., 2018; Mihaylov and Frank, 2018; Wang and Jiang, 2019; Yang et al., 2019). Our approach instead uses KBs to learn more generally transferable representations that can be used to improve a variety of downstream tasks.

3 KnowBert

KnowBert incorporates knowledge bases into BERT using the Knowledge Attention and Recontextualization component (KAR). We start by describing the BERT and KB components. We then move to introducing KAR. Finally, we describe the training procedure, including the multitask training regime for jointly training KnowBert and an entity linker.

3.1 Pretrained BERT

We describe KnowBert as an extension to (and candidate replacement for) BERT, although the method is general and can be applied to any deep pretrained model including left-to-right and right-to-left LMs such as ELMo and GPT. Formally, BERT accepts as input a sequence of N WordPiece tokens (Sennrich et al., 2016; Wu et al., 2016), (x_1, ..., x_N), and computes L layers of D-dimensional contextual representations H_i ∈ R^{N×D} by successively applying non-linear functions H_i = F_i(H_{i−1}). The non-linear function is a multi-headed self-attention layer followed by a position-wise multilayer perceptron (MLP) (Vaswani et al., 2017):

F_i(H_{i−1}) = TransformerBlock(H_{i−1}) = MLP(MultiHeadAttn(H_{i−1}, H_{i−1}, H_{i−1})).

The multi-headed self-attention uses H_{i−1} as the query, key, and value to allow each vector to attend to every other vector.
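For concreteness, here is a minimal PyTorch sketch of one such transformer block; the dimensions are BERTBASE-like defaults, and layer normalization placement, dropout, and other details of BERT's actual implementation are simplified.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One layer F_i: multi-headed self-attention followed by a position-wise MLP."""
    def __init__(self, hidden_dim=768, num_heads=12, ff_dim=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, ff_dim), nn.GELU(), nn.Linear(ff_dim, hidden_dim))
        self.ln1 = nn.LayerNorm(hidden_dim)
        self.ln2 = nn.LayerNorm(hidden_dim)

    def forward(self, h):
        # Self-attention: h serves as query, key, and value.
        attn_out, _ = self.attn(h, h, h)
        h = self.ln1(h + attn_out)
        # Position-wise MLP applied to each word piece independently.
        return self.ln2(h + self.mlp(h))

h = torch.randn(2, 10, 768)            # (batch, N word pieces, D)
print(TransformerBlock()(h).shape)     # torch.Size([2, 10, 768])
```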

BERT is trained to minimize an objective function that combines both next-sentence prediction (NSP) and masked LM log-likelihood (MLM):

L_{BERT} = L_{NSP} + L_{MLM}.

Given two inputs x_A and x_B, the next-sentence prediction task is binary classification to predict whether x_B is the next sentence following x_A. The masked LM objective randomly replaces a percentage of input word pieces with a special [MASK] token and computes the negative log-likelihood of the missing token with a linear layer and softmax over all possible word pieces.

3.2 Knowledge Bases

The key contribution of this paper is a method to incorporate knowledge bases (KB) into a pretrained BERT model. To encompass as wide a selection of prior knowledge as possible, we adopt a broad definition for a KB in the most general sense as a fixed collection of K entity nodes, e_k, from which it is possible to compute entity embeddings, e_k ∈ R^E. This includes KBs with a typical (subj, rel, obj) graph structure, KBs that contain only entity metadata without a graph, and those that combine both a graph and entity metadata, as long as there is some method for embedding the entities in a low dimensional vector space. We also do not make any assumption that the entities are typed. As we show in Sec. 4.1 this flexibility is beneficial, where we compute entity embeddings from WordNet using both the graph and synset definitions, but link directly to Wikipedia pages without a graph by using embeddings computed from the entity description.

We also assume that the KB is accompanied by an entity candidate selector that takes as input some text and returns a list of C potential entity links, each consisting of the start and end indices of the potential mention span and M_m candidate entities in the KB:

\mathcal{C} = \{ \langle (start_m, end_m), (e_{m,1}, \ldots, e_{m,M_m}) \rangle \mid m \in 1 \ldots C, \; e_{m,k} \in 1 \ldots K \}.

In practice, these are often implemented using precomputed dictionaries (e.g., CrossWikis; Spitkovsky and Chang, 2012), KB specific rules (e.g., a WordNet lemmatizer), or other heuristics (e.g., string match; Mihaylov and Frank, 2018). Ling et al. (2015) showed that incorporating candidate priors into entity linkers can be a powerful signal, so we optionally allow for the candidate selector to return an associated prior probability for each entity candidate. In some cases, it is beneficial to over-generate potential candidates and add a special NULL entity to each candidate list, thereby allowing the linker to discriminate between actual links and false positive candidates. In this work, the entity candidate selectors are fixed but their output is passed to a learned context dependent entity linker to disambiguate the candidate mentions.
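As an illustration, the output of a dictionary-based candidate selector can be thought of as a plain data structure like the following sketch; the mention spans, entity names, and prior values here are invented for the "Prince sang Purple Rain" example of Figure 1, not taken from the real CrossWikis or WordNet selectors.

```python
# Each element of C: ((start, end) word piece indices, candidate entity ids, prior probs).
# Priors come from the precomputed dictionary; NULL lets the linker reject a false positive.
candidates = [
    ((0, 0), ["Prince_(musician)", "Prince_Motor_Company", "NULL"], [0.85, 0.10, 0.05]),
    ((2, 3), ["Purple_Rain_(song)", "Purple_Rain_(album)", "NULL"], [0.60, 0.35, 0.05]),
]
```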

Finally, by restricting the number of candidate entities to a fixed small number (we use 30), KnowBert's runtime is independent of the size of the KB, as it only considers a small subset of all possible entities for any given text. As the candidate selection is rule-based and fixed, it is fast and in our implementation is performed asynchronously on CPU. The only overhead for scaling up the size of the KB is the memory footprint to store the entity embeddings.

3.3 KAR

The Knowledge Attention and Recontextualization component (KAR) is the heart of KnowBert. The KAR accepts as input the contextual representations at a particular layer, H_i, and computes knowledge enhanced representations H'_i = KAR(H_i, \mathcal{C}). This is fed into the next pretrained layer, H_{i+1} = TransformerBlock(H'_i), and the remainder of BERT is run as usual.

In this section, we describe the KAR's key components: mention-span representations, retrieval of relevant entity embeddings using an entity linker, update of mention-span embeddings with retrieved information, and recontextualization of entity-span embeddings with word-to-entity-span attention. We describe the KAR for a single KB, but extension to multiple KBs at different layers is straightforward. See Fig. 1 for an overview.

Figure 1: The Knowledge Attention and Recontextualization (KAR) component. BERT word piece representations (H_i) are first projected to H_i^{proj} (1), then pooled over candidate mention spans (2) to compute S, and contextualized into S^e using mention-span self-attention (3). An integrated entity linker computes weighted average entity embeddings \tilde{E} (4), which are used to enhance the span representations with knowledge from the KB (5), computing S'^e. Finally, the BERT word piece representations are recontextualized with word-to-entity-span attention (6) and projected back to the BERT dimension (7), resulting in H'_i. [The figure illustrates the KAR on the example input "Prince sang Purple Rain", with candidate mention spans linked to Wikipedia candidates such as Prince_(musician), Prince_Motor_Company, Purple_Rain_(album), Purple_Rain_(song), and Rain_(entertainer).]

Mention-span representations The KAR starts with the KB entity candidate selector that provides a list of candidate mentions which it uses to compute mention-span representations. H_i is first projected to the entity dimension (E, typically 200 or 300, see Sec. 4.1) with a linear projection,

H_i^{proj} = H_i W_1^{proj} + b_1^{proj}.   (1)

Then, the KAR computes C mention-span representations s_m ∈ R^E, one for each candidate mention, by pooling over all word pieces in a mention-span using the self-attentive span pooling from Lee et al. (2017). The mention-spans are stacked into a matrix S ∈ R^{C×E}.
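A minimal sketch of the projection (Eq. 1) and the span pooling, assuming inclusive word piece indices per candidate span; the attention scorer here is a single linear layer, a simplification of the self-attentive pooling of Lee et al. (2017).

```python
import torch
import torch.nn as nn

class MentionSpanPooler(nn.Module):
    def __init__(self, bert_dim=768, entity_dim=200):
        super().__init__()
        self.proj = nn.Linear(bert_dim, entity_dim)   # W1_proj, b1_proj (Eq. 1)
        self.attn_score = nn.Linear(entity_dim, 1)    # self-attentive pooling scores

    def forward(self, h, spans):
        """h: (N, D) word pieces for one sequence; spans: list of (start, end), inclusive."""
        h_proj = self.proj(h)                         # (N, E)
        pooled = []
        for start, end in spans:
            piece = h_proj[start:end + 1]             # word pieces inside the span
            alpha = torch.softmax(self.attn_score(piece), dim=0)  # attention over the span
            pooled.append((alpha * piece).sum(dim=0)) # weighted sum -> s_m
        return torch.stack(pooled)                    # S: (C, E)

h = torch.randn(10, 768)
S = MentionSpanPooler()(h, [(0, 0), (2, 3)])
print(S.shape)  # torch.Size([2, 200])
```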

Entity linker The entity linker is responsible for performing entity disambiguation for each potential mention from among the available candidates. It first runs mention-span self-attention to compute

S^e = TransformerBlock(S).   (2)

The span self-attention is identical to the typical transformer layer, except that the self-attention is between mention-span vectors instead of word piece vectors. This allows KnowBert to incorporate global information into each linking decision so that it can take advantage of entity-entity co-occurrence and resolve which of several overlapping candidate mentions should be linked.1

Following Kolitsas et al. (2018), S^e is used to score each of the candidate entities while incorporating the candidate entity prior from the KB. Each candidate span m has an associated mention-span vector s^e_m (computed via Eq. 2), M_m candidate entities with embeddings e_{mk} (from the KB), and prior probabilities p_{mk}. We compute M_m scores using the prior and dot product between the entity-span vectors and entity embeddings,

\psi_{mk} = MLP(p_{mk}, s^e_m \cdot e_{mk}),   (3)

with a two-layer MLP (100 hidden dimensions).

If entity linking (EL) supervision is available, we can compute a loss with the gold entity e_{mg}. The exact form of the loss depends on the KB, and we use both log-likelihood,

L_{EL} = -\sum_m \log \left( \frac{\exp(\psi_{mg})}{\sum_k \exp(\psi_{mk})} \right),   (4)

and max-margin,

L_{EL} = \max(0, \gamma - \psi_{mg}) + \sum_{e_{mk} \neq e_{mg}} \max(0, \gamma + \psi_{mk}),   (5)

formulations (see Sec. 4.1 for details).

1 We found a small transformer layer with four attention heads and a 1024 feed-forward hidden dimension was sufficient, significantly smaller than each of the layers in BERT. Early experiments demonstrated the effectiveness of this layer with improved entity linking performance.
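The sketch below illustrates the scoring MLP (Eq. 3) and both loss formulations (Eqs. 4 and 5) for a single mention; the margin value and the handling of batching or padded candidate lists are assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

score_mlp = nn.Sequential(nn.Linear(2, 100), nn.ReLU(), nn.Linear(100, 1))

def candidate_scores(span_vec, cand_embeds, priors):
    """psi_mk = MLP(p_mk, s_m^e . e_mk) for one mention (Eq. 3)."""
    dots = cand_embeds @ span_vec                 # (M,) dot products with entity embeddings
    feats = torch.stack([priors, dots], dim=-1)   # (M, 2): prior and similarity features
    return score_mlp(feats).squeeze(-1)           # (M,) candidate scores

def log_likelihood_loss(psi, gold_idx):
    # Eq. 4: negative log-likelihood of the gold candidate.
    return F.cross_entropy(psi.unsqueeze(0), torch.tensor([gold_idx]))

def max_margin_loss(psi, gold_idx, gamma=0.1):
    # Eq. 5: hinge on the gold score plus hinges on every non-gold candidate.
    loss = torch.clamp(gamma - psi[gold_idx], min=0)
    mask = torch.ones_like(psi, dtype=torch.bool)
    mask[gold_idx] = False
    return loss + torch.clamp(gamma + psi[mask], min=0).sum()

psi = candidate_scores(torch.randn(200), torch.randn(3, 200), torch.tensor([0.85, 0.10, 0.05]))
print(log_likelihood_loss(psi, 0), max_margin_loss(psi, 0))
```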

Knowledge enhanced entity-span representations KnowBert next injects the KB entity information into the mention-span representations computed from BERT vectors (s^e_m) to form entity-span representations. For a given span m, we first disregard all candidate entities with score ψ below a fixed threshold, and softmax normalize the remaining scores:

\tilde{\psi}_{mk} = \frac{\exp(\psi_{mk})}{\sum_{\psi_{mk} \ge \delta} \exp(\psi_{mk})}   if \psi_{mk} \ge \delta,
\tilde{\psi}_{mk} = 0                                                                    if \psi_{mk} < \delta.

Then the weighted entity embedding is

\tilde{e}_m = \sum_k \tilde{\psi}_{mk} e_{mk}.

If all entity linking scores are below the threshold δ, we substitute a special NULL embedding for \tilde{e}_m. Finally, the entity-span representations are updated with the weighted entity embeddings

s'^e_m = s^e_m + \tilde{e}_m,   (6)

which are packed into a matrix S'^e ∈ R^{C×E}.
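A sketch of the thresholded softmax, the weighted entity embedding, and the update of Eq. 6; the threshold δ and the NULL embedding are taken as given, and the batching over mentions is a simplification.

```python
import torch

def knowledge_enhance(span_vecs, psi, cand_embeds, null_embed, delta=0.0):
    """span_vecs: (C, E); psi: (C, M) scores; cand_embeds: (C, M, E); null_embed: (E,)."""
    keep = psi >= delta                                          # discard low-scoring candidates
    masked = psi.masked_fill(~keep, float("-inf"))
    weights = torch.softmax(masked, dim=-1)                      # normalize the remaining scores
    weights = torch.where(keep, weights, torch.zeros_like(weights))
    e_tilde = (weights.unsqueeze(-1) * cand_embeds).sum(dim=1)   # weighted entity embedding
    # If every candidate fell below delta, fall back to the special NULL embedding.
    no_link = ~keep.any(dim=-1)
    e_tilde[no_link] = null_embed
    return span_vecs + e_tilde                                   # S'^e (Eq. 6)

S_e = torch.randn(2, 200)
out = knowledge_enhance(S_e, torch.randn(2, 3), torch.randn(2, 3, 200), torch.zeros(200))
print(out.shape)  # torch.Size([2, 200])
```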

Recontextualization After updating the entity-span representations with the weighted entity vectors, KnowBert uses them to recontextualize the word piece representations. This is accomplished using a modified transformer layer that substitutes the multi-headed self-attention with a multi-headed attention between the projected word piece representations and knowledge enhanced entity-span vectors. As introduced by Vaswani et al. (2017), the contextual embeddings H_i are used for the query, key, and value in multi-headed self-attention. The word-to-entity-span attention in KnowBert substitutes H_i^{proj} for the query, and S'^e for both the key and value:

H'^{proj}_i = MLP(MultiHeadAttn(H_i^{proj}, S'^e, S'^e)).

This allows each word piece to attend to all entity-spans in the context, so that it can propagate entity information over long contexts. After the multi-headed word-to-entity-span attention, we run a position-wise MLP analogous to the standard transformer layer.2

Finally, H'^{proj}_i is projected back to the BERT dimension with a linear transformation and a residual connection added:

H'_i = H'^{proj}_i W_2^{proj} + b_2^{proj} + H_i.   (7)
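A sketch of the word-to-entity-span attention followed by the projection back to the BERT dimension with a residual connection (Eq. 7); layer normalization and dropout are omitted, and the feed-forward sizes follow the footnotes only loosely.

```python
import torch
import torch.nn as nn

class Recontextualizer(nn.Module):
    def __init__(self, bert_dim=768, entity_dim=200, num_heads=4, ff_dim=1024):
        super().__init__()
        self.word_to_entity_attn = nn.MultiheadAttention(entity_dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(entity_dim, ff_dim), nn.ReLU(),
                                 nn.Linear(ff_dim, entity_dim))
        self.proj_back = nn.Linear(entity_dim, bert_dim)   # W2_proj, b2_proj

    def forward(self, h, h_proj, s_prime_e):
        """h: (B, N, D) original word pieces; h_proj: (B, N, E); s_prime_e: (B, C, E)."""
        # Query = projected word pieces, key/value = knowledge enhanced entity spans.
        attn_out, _ = self.word_to_entity_attn(h_proj, s_prime_e, s_prime_e)
        h_proj_new = self.mlp(attn_out)          # position-wise MLP
        return self.proj_back(h_proj_new) + h    # Eq. 7: project back + residual connection

h = torch.randn(2, 10, 768); h_proj = torch.randn(2, 10, 200); spans = torch.randn(2, 3, 200)
print(Recontextualizer()(h, h_proj, spans).shape)  # torch.Size([2, 10, 768])
```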

Alignment of BERT and entity vectors As KnowBert does not place any restrictions on the entity embeddings, it is essential to align them with the pretrained BERT contextual representations. To encourage this alignment we initialize W_2^{proj} as the matrix inverse of W_1^{proj} (Eq. 1). The use of dot product similarity (Eq. 3) and residual connection (Eq. 7) further aligns the entity-span representations with entity embeddings.

2 As for the multi-headed entity-span self-attention, we found a small transformer layer to be sufficient, with four attention heads and 1024 hidden units in the MLP.
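Because W_1^{proj} maps D to E and is therefore not square, one way to realize this initialization is with the Moore–Penrose pseudo-inverse, as in the sketch below (an assumption about the exact implementation).

```python
import torch

D, E = 768, 200
W1 = torch.randn(D, E) * 0.02        # projection to the entity dimension (Eq. 1)
W2 = torch.linalg.pinv(W1)           # initialize W2_proj ~ (W1_proj)^(-1), shape (E, D)
print(torch.dist(W1 @ W2 @ W1, W1))  # close to 0: W2 approximately inverts W1
```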

Algorithm 1: KnowBert training method

Input: Pretrained BERT and J KBs
Output: KnowBert

for j = 1 ... J do
    Compute entity embeddings for KB_j
    if EL supervision available then
        Freeze all network parameters except those in (Eq. 1–3)
        Train to convergence using (Eq. 4) or (Eq. 5)
    end
    Initialize W_2^{proj} as (W_1^{proj})^{-1}
    Unfreeze all parameters except entity embeddings
    Minimize L_{KnowBert} = L_{BERT} + \sum_{i=1}^{j} L_{EL_i}
end

3.4 Training Procedure

Our training regime incrementally pretrains increasingly larger portions of KnowBert before fine-tuning all trainable parameters in a multitask setting with any available EL supervision. It is similar in spirit to the "chain-thaw" approach in Felbo et al. (2017), and is summarized in Alg. 1.

We assume access to a pretrained BERT model and one or more KBs with their entity candidate selectors. To add the first KB, we begin by pretraining entity embeddings (if not already provided from another source), then freeze them in all subsequent training, including task-specific fine-tuning. If EL supervision is available, it is used to pretrain the KB specific EL parameters, while freezing the remainder of the network. Finally, the entire network is fine-tuned to convergence by minimizing

L_{KnowBert} = L_{BERT} + L_{EL}.

We apply gradient updates to homogeneous batches randomly sampled from either the unlabeled corpus or EL supervision.
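A schematic of this regime (Alg. 1) is sketched below; the model methods and batch iterators are hypothetical interfaces, the EL pretraining runs for a fixed number of steps rather than literally "to convergence", and the 80/20 sampling ratio is the one reported in the appendix.

```python
import random

def train_knowbert(model, kbs, unlabeled_batches, el_batches, steps=500_000, el_pretrain_steps=10_000):
    """Schematic of Algorithm 1; `model` and the batch iterators are hypothetical interfaces."""
    for j, kb in enumerate(kbs):
        model.add_kb(kb)                                    # insert a new KAR above the previous one
        if kb.has_el_supervision:
            model.freeze_all_except_new_kar()               # only the Eq. 1-3 parameters train
            for _ in range(el_pretrain_steps):
                model.entity_linking_loss(next(el_batches[j])).backward()
                model.optimizer_step()
        model.init_w2_as_pinv_of_w1()                       # align the projections (Sec. 3.3)
        model.unfreeze_all_except_entity_embeddings()
        for _ in range(steps):
            # Homogeneous batches: each update uses either raw text (L_BERT) or EL data (L_EL).
            if random.random() < 0.8:                       # 80/20 ratio reported in the appendix
                loss = model.bert_loss(next(unlabeled_batches))
            else:
                loss = model.entity_linking_loss(next(el_batches[j]))
            loss.backward()
            model.optimizer_step()
```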

To add a second KB, we repeat the process, inserting it in any layer above the first one. When adding a KB, the BERT layers above it will experience large gradients early in training, as they are subject to the randomly initialized parameters associated with the new KB. They are thus expected to move further from their pretrained values before convergence compared to parameters below the KB. By adding KBs from bottom to top, we minimize disruption of the network and decrease the likelihood that training will fail. See Sec. 4.1 for details of where each KB was added.

System             PPL   Wikidata MRR   # params. masked LM   # params. KAR   # params. entity embed.   Fwd./Bwd. time
BERTBASE           5.5   0.09           110                   0               0                         0.25
BERTLARGE          4.5   0.11           336                   0               0                         0.75
KnowBert-Wiki      4.3   0.26           110                   2.4             141                       0.27
KnowBert-WordNet   4.1   0.22           110                   4.9             265                       0.31
KnowBert-W+W       3.5   0.31           110                   7.3             406                       0.33

Table 1: Comparison of masked LM perplexity, Wikidata probing MRR, and number of parameters (in millions) in the masked LM (word piece embeddings, transformer layers, and output layers), KAR, and entity embeddings for BERT and KnowBert. The table also includes the total time to run one forward and backward pass (in seconds) on a TITAN Xp GPU (12 GB RAM) for a batch of 32 sentence pairs with total length 80 word pieces. Due to memory constraints, the BERTLARGE batch is accumulated over two smaller batches.

The entity embeddings and selected candidates contain lexical information (especially in the case of WordNet), that will make the masked LM predictions significantly easier. To prevent leaking into the masked word pieces, we adopt the BERT strategy and replace all entity candidates from the selectors with a special [MASK] entity if the candidate mention span overlaps with a masked word piece.3 This prevents KnowBert from relying on the selected candidates to predict masked word pieces.

3 Following BERT, for 80% of masked word pieces all candidates are replaced with [MASK], 10% are replaced with random candidates and 10% left unmasked.
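A small sketch of this rule, assuming candidate lists in the format shown earlier and a placeholder id for the special [MASK] entity.

```python
def mask_overlapping_candidates(candidates, masked_positions, mask_entity_id="@@MASK@@"):
    """Replace every candidate list whose span overlaps a masked word piece."""
    masked_positions = set(masked_positions)
    out = []
    for (start, end), entity_ids, priors in candidates:
        if any(pos in masked_positions for pos in range(start, end + 1)):
            out.append(((start, end), [mask_entity_id], [1.0]))   # hide the lexical signal
        else:
            out.append(((start, end), entity_ids, priors))
    return out

cands = [((0, 0), ["Prince_(musician)"], [1.0]), ((2, 3), ["Purple_Rain_(song)"], [1.0])]
print(mask_overlapping_candidates(cands, masked_positions=[2]))
```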

4 Experiments

4.1 Experimental Setup

We used the English uncased BERTBASE model (Devlin et al., 2019) to train three versions of KnowBert: KnowBert-Wiki, KnowBert-WordNet, and KnowBert-W+W that includes both Wikipedia and WordNet.

KnowBert-Wiki The entity linker in KnowBert-Wiki borrows both the entity candidate selectors and embeddings from Ganea and Hofmann (2017). The candidate selectors and priors are a combination of CrossWikis, a large, precomputed dictionary that combines statistics from Wikipedia and a web corpus (Spitkovsky and Chang, 2012), and the YAGO dictionary (Hoffart et al., 2011). The entity embeddings use a skip-gram like objective (Mikolov et al., 2013b) to learn 300-dimensional embeddings of Wikipedia page titles directly from Wikipedia descriptions without using any explicit graph structure between nodes. As such, nodes in the KB are Wikipedia page titles, e.g., Prince (musician). Ganea and Hofmann (2017) provide pretrained embeddings for a subset of approximately 470K entities. Early experiments with embeddings derived from Wikidata relations4 did not improve results.

4 https://github.com/facebookresearch/PyTorch-BigGraph

We used the AIDA-CoNLL dataset (Hoffart et al., 2011) for supervision, adopting the standard splits. This dataset exhaustively annotates entity links for named entities of person, organization and location types, as well as a miscellaneous type. It does not annotate links to common nouns or other Wikipedia pages. At both train and test time, we consider all selected candidate spans and the top 30 entities, to which we add the special NULL entity to allow KnowBert to discriminate between actual links and false positive links from the selector. As such, KnowBert models both entity mention detection and disambiguation in an end-to-end manner. Eq. 5 was used as the objective.

KnowBert-WordNet Our WordNet KB combines synset metadata, lemma metadata and the relational graph. To construct the graph, we first extracted all synsets, lemmas, and their relationships from WordNet 3.0 using the nltk interface. After disregarding certain symmetric relationships (e.g., we kept the hypernym relationship, but removed the inverse hyponym relationship) we were left with 28 synset-synset and lemma-lemma relationships. From these, we constructed a graph where each node is either a synset or lemma, and introduced the special lemma in synset relationship to link synsets and lemmas. The candidate selector uses a rule-based lemmatizer without part-of-speech (POS) information.5

System                    F1
WN-first sense baseline   65.2
ELMo                      69.2
BERTBASE                  73.1
BERTLARGE                 73.9
KnowBert-WordNet          74.9
KnowBert-W+W              75.1

Table 2: Fine-grained WSD F1.

Our embeddings combine both the graph and synset glosses (definitions), as early experiments indicated improved perplexity when using both vs. just graph-based embeddings. We used TuckER (Balazevic et al., 2019) to compute 200-dimensional vectors for each synset and lemma using the relationship graph. Then, we extracted the gloss for each synset and used an off-the-shelf state-of-the-art sentence embedding method (Subramanian et al., 2018) to produce 2048-dimensional vectors. These are concatenated to the TuckER embeddings. To reduce the dimensionality for use in KnowBert, the frozen 2248-dimensional embeddings are projected to 200 dimensions with a learned linear transformation.

For supervision, we combined the SemCor word sense disambiguation (WSD) dataset (Miller et al., 1994) with all lemma example usages from WordNet6 and link directly to synsets. The loss function is Eq. 4. At train time, we did not provide gold lemmas or POS tags, so KnowBert must learn to implicitly model coarse grained POS tags to disambiguate each word. At test time when evaluating we restricted candidate entities to just those matching the gold lemma and POS tag, consistent with the standard WSD evaluation.

Training details To control for the unlabeled corpus, we concatenated Wikipedia and the Books Corpus (Zhu et al., 2015) and followed the data preparation process in BERT with the exception of heavily biasing our dataset to shorter sequences of 128 word pieces for efficiency. Both KnowBert-Wiki and KnowBert-WordNet insert the KB between layers 10 and 11 of the 12-layer BERTBASE model. KnowBert-W+W adds the Wikipedia KB between layers 10 and 11, with WordNet between layers 11 and 12. Earlier experiments with KnowBert-WordNet in a lower layer had worse perplexity. We generally followed the fine-tuning procedure in Devlin et al. (2019); see supplemental materials for details.

5 https://spacy.io/
6 To provide a fair evaluation on the WiC dataset, which is partially based on the same source, we excluded all WiC train, development and test instances.

System                  AIDA-A   AIDA-B
Daiber et al. (2013)    49.9     52.0
Hoffart et al. (2011)   68.8     71.9
Kolitsas et al. (2018)  86.6     82.6
KnowBert-Wiki           80.2     74.4
KnowBert-W+W            82.1     73.7

Table 3: End-to-end entity linking strong match, micro averaged F1.

4.2 Intrinsic Evaluation

Perplexity Table 1 compares masked LM perplexity for KnowBert with BERTBASE and BERTLARGE. To rule out minor differences due to our data preparation, the BERT models are fine-tuned on our training data before being evaluated. Overall, KnowBert improves the masked LM perplexity, with all KnowBert models outperforming BERTLARGE, despite being derived from BERTBASE.

Factual recall To test KnowBert's ability to recall facts from the KBs, we extracted 90K tuples from Wikidata (Vrandecic and Krotzsch, 2014) for 17 different relationships such as companyFoundedBy. Each tuple was written into natural language such as "Adidas was founded by Adolf Dassler" and used to construct two test instances, one that masks out the subject and one that masks the object. Then, we evaluated whether a model could recover the masked entity by computing the mean reciprocal rank (MRR) of the masked word pieces. Table 1 displays a summary of the results (see supplementary material for results across all relationship types). Overall, KnowBert-Wiki is significantly better at recalling facts than both BERTBASE and BERTLARGE, with KnowBert-W+W better still.
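A sketch of the probing metric, assuming the model returns a distribution over the word piece vocabulary at each masked position; how multi-piece entities are aggregated is not spelled out here, so this simply averages reciprocal ranks over the masked pieces.

```python
import torch

def masked_piece_mrr(logits, gold_ids):
    """logits: (num_masked, vocab) model scores; gold_ids: (num_masked,) gold word piece ids."""
    ranks = []
    for scores, gold in zip(logits, gold_ids):
        # Rank of the gold word piece (1 = best) among all vocabulary items.
        rank = (scores > scores[gold]).sum().item() + 1
        ranks.append(1.0 / rank)
    return sum(ranks) / len(ranks)

print(masked_piece_mrr(torch.randn(3, 30522), torch.tensor([11, 42, 7])))
```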

Speed KnowBert is almost as fast as BERTBASE (8% slower for KnowBert-Wiki, 32% for KnowBert-W+W) despite adding a large number of (frozen) parameters in the entity embeddings (Table 1). KnowBert is much faster than BERTLARGE. By taking advantage of the already high capacity model, the number of trainable parameters added by KnowBert is a fraction of the total parameters in BERT. The faster speed is partially due to the entity parameter efficiency in KnowBert, as only a small fraction of parameters in the entity embeddings are used for any given input due to the sparse linker. Our candidate generators consider the top 30 candidates and produce approximately O(number of tokens) candidate spans. For a typical 25 token sentence, approximately 2M entity embedding parameters are actually used. In contrast, BERTLARGE uses the majority of its 336M parameters for each input.

System                LM          P     R     F1
Zhang et al. (2018)   —           69.9  63.3  66.4
Alt et al. (2019)     GPT         70.1  65.0  67.4
Shi and Lin (2019)    BERTBASE    73.3  63.1  67.8
Zhang et al. (2019)   BERTBASE    70.0  66.1  68.0
Soares et al. (2019)  BERTLARGE   —     —     70.1
Soares et al. (2019)  BERTLARGE†  —     —     71.5
KnowBert-W+W          BERTBASE    71.6  71.4  71.5

Table 4: Single model test set results on the TACRED relationship extraction dataset. † with MTB pretraining.

Integrated EL It is also possible to evaluate the performance of the integrated entity linkers inside KnowBert using diagnostic probes without any further fine-tuning. As these were trained in a multitask setting primarily with raw text, we do not a priori expect high performance as they must balance specializing for the entity linking task and learning general purpose representations suitable for language modeling.

Table 2 displays fine-grained WSD F1 using the evaluation framework from Navigli et al. (2017) and the ALL dataset (combining SemEval 2007, 2013, 2015 and Senseval 2 and 3). By linking to nodes in our WordNet graph and restricting to gold lemmas at test time we can recast the WSD task under our general entity linking framework. The ELMo and BERT baselines use a nearest neighbor approach trained on the SemCor dataset, similar to the evaluation in Melamud et al. (2016), which has previously been shown to be competitive with task-specific architectures (Raganato et al., 2017). As can be seen, KnowBert provides competitive performance, and KnowBert-W+W is able to match the performance of KnowBert-WordNet despite incorporating both Wikipedia and WordNet.

System                LM          F1
Wang et al. (2016)    —           88.0
Wang et al. (2019b)   BERTBASE    89.0
Soares et al. (2019)  BERTLARGE   89.2
Soares et al. (2019)  BERTLARGE†  89.5
KnowBert-W+W          BERTBASE    89.1

Table 5: Test set F1 for SemEval 2010 Task 8 relationship extraction. † with MTB pretraining.

Table 3 reports end-to-end entity linking performance for the AIDA-A and AIDA-B datasets. Here, KnowBert's performance lags behind the current state-of-the-art model from Kolitsas et al. (2018), but still provides strong performance compared to other established systems such as AIDA (Hoffart et al., 2011) and DBpedia Spotlight (Daiber et al., 2013). We believe this is due to the selective annotation in the AIDA data that only annotates named entities. The CrossWikis-based candidate selector used in KnowBert generates candidate mentions for all entities including common nouns from which KnowBert may be learning to extract information, at the detriment of specializing to maximize linking performance for AIDA.

4.3 Downstream Tasks

This section evaluates KnowBert on downstream tasks to validate that the addition of knowledge improves performance on tasks expected to benefit from it. Given the overall superior performance of KnowBert-W+W on the intrinsic evaluations, we focus on it exclusively for evaluation in this section. The main results are included in this section; see the supplementary material for full details.

The baselines we compare against are BERTBASE, BERTLARGE, the pre-BERT state of the art, and two contemporaneous papers that add similar types of knowledge to BERT. ERNIE (Zhang et al., 2019) uses TAGME (Ferragina and Scaiella, 2010) to link entities to Wikidata, retrieves the associated entity embeddings, and fuses them into BERTBASE by fine-tuning. Soares et al. (2019) learn relationship representations by fine-tuning BERTLARGE with large scale "matching the blanks" (MTB) pretraining using entity linked text.


System         Accuracy
ELMo†          57.7
BERTBASE†      65.4
BERTLARGE†     65.5
BERTLARGE††    69.5
KnowBert-W+W   70.9

Table 6: Test set results for the WiC dataset (v1.0). † Pilehvar and Camacho-Collados (2019). †† Wang et al. (2019a).

Relation extraction Our first task is relation extraction using the TACRED (Zhang et al., 2017) and SemEval 2010 Task 8 (Hendrickx et al., 2009) datasets. Systems are given a sentence with a marked subject and object, and asked to predict which of several different relations (or no relation) holds. Following Soares et al. (2019), our KnowBert model uses special entity tokens [E1], [/E1], [E2], [/E2] to mark the location of the subject and object in the input sentence, then concatenates the contextual word representations for [E1] and [E2] to predict the relationship. For TACRED, we also encode the subject and object types with special tokens and concatenate them to the end of the sentence.
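A sketch of this input formatting and classification head, using whitespace tokens instead of word pieces and omitting the TACRED type tokens; it illustrates the marker scheme rather than reproducing the authors' exact code.

```python
import torch
import torch.nn as nn

def add_entity_markers(tokens, subj_span, obj_span):
    """Insert [E1]...[/E1] around the subject and [E2]...[/E2] around the object."""
    marked = []
    for i, tok in enumerate(tokens):
        if i == subj_span[0]: marked.append("[E1]")
        if i == obj_span[0]:  marked.append("[E2]")
        marked.append(tok)
        if i == subj_span[1]: marked.append("[/E1]")
        if i == obj_span[1]:  marked.append("[/E2]")
    return marked

tokens = "Prince wrote Purple Rain".split()
marked = add_entity_markers(tokens, subj_span=(0, 0), obj_span=(2, 3))
print(marked)  # ['[E1]', 'Prince', '[/E1]', 'wrote', '[E2]', 'Purple', 'Rain', '[/E2]']

# The relation classifier concatenates the contextual vectors at [E1] and [E2].
hidden = torch.randn(len(marked), 768)                  # stand-in for KnowBert output
e1_idx, e2_idx = marked.index("[E1]"), marked.index("[E2]")
pair_rep = torch.cat([hidden[e1_idx], hidden[e2_idx]])  # (2 * 768,)
classifier = nn.Linear(2 * 768, 42)                     # 41 TACRED relations + no_relation
print(classifier(pair_rep).shape)                       # torch.Size([42])
```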

For TACRED (Table 4), KnowBert-W+W significantly outperforms the comparable BERTBASE systems including ERNIE by 3.5%, improves over BERTLARGE by 1.4%, and is able to match the performance of the relationship specific MTB pretraining in Soares et al. (2019). For SemEval 2010 Task 8 (Table 5), KnowBert-W+W F1 falls between the entity aware BERTBASE model from Wang et al. (2019b), and the BERTLARGE model from Soares et al. (2019).

Words in Context (WiC) WiC (Pilehvar and Camacho-Collados, 2019) is a challenging task that presents systems with two sentences both containing a word with the same lemma and asks them to determine if they are from the same sense or not. It is designed to test the quality of contextual word representations. We follow standard practice and concatenate both sentences with a [SEP] token and fine-tune the [CLS] embedding. As shown in Table 6, KnowBert-W+W sets a new state of the art for this task, improving over BERTLARGE by 1.4% and reducing the relative gap to 80% human performance by 13.3%.

System         P     R     F1
UFET           68.8  53.3  60.1
BERTBASE       76.4  71.0  73.6
ERNIE          78.4  72.9  75.6
KnowBert-W+W   78.6  73.7  76.1

Table 7: Test set results for entity typing using the nine general types from Choi et al. (2018).

Entity typing We also evaluated KnowBert-W+W using the entity typing dataset from Choi et al. (2018). To directly compare to ERNIE, we adopted the evaluation protocol in Zhang et al. (2019) which considers the nine general entity types.7 Our model marks the location of a target span with the special [E] and [/E] tokens and uses the representation of the [E] token to predict the type. As shown in Table 7, KnowBert-W+W shows an improvement of 0.6% F1 over ERNIE and 2.5% over BERTBASE.

7 Data obtained from https://github.com/thunlp/ERNIE

5 Conclusion

We have presented an efficient and general method to insert prior knowledge into a deep neural model. Intrinsic evaluations demonstrate that the addition of WordNet and Wikipedia to BERT improves the quality of the masked LM and significantly improves its ability to recall facts. Downstream evaluations demonstrate improvements for relationship extraction, entity typing and word sense disambiguation datasets. Future work will involve incorporating a diverse set of domain specific KBs for specialized NLP applications.

Acknowledgements

The authors acknowledge helpful feedback from anonymous reviewers and the AllenNLP team. This research was funded in part by the NSF under awards IIS-1817183 and CNS-1730158.

References

Sungjin Ahn, Heeyoul Choi, Tanel Parnamaa, and Yoshua Bengio. 2017. A neural knowledge language model. arXiv:1608.00318.

Christoph Alt, Marc Hubner, and Leonhard Hennig. 2019. Improving relation extraction by pre-trained language representations. In AKBC.

Ivana Balazevic, Carl Allen, and Timothy M. Hospedales. 2019. TuckER: Tensor factorization for knowledge graph completion. In EMNLP.

Lisa Bauer, Yicheng Wang, and Mohit Bansal. 2018. Commonsense for generative multi-hop question answering tasks. In EMNLP.

Olivier Bodenreider. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32 Database issue:D267–70.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In NeurIPS.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Diana Inkpen, and Si Wei. 2018. Neural natural language inference models enhanced with external knowledge. In ACL.

Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. 2014. A unified model for word sense representation and disambiguation. In EMNLP.

Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. 2018. Ultra-fine entity typing. In ACL.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML.

Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In NeurIPS.

Joachim Daiber, Max Jakob, Chris Hokamp, and Pablo N. Mendes. 2013. Improving efficiency and accuracy in multilingual entity extraction. In I-SEMANTICS.

Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. 2018. Convolutional 2D knowledge graph embeddings. In AAAI.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.

Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. 2017. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In EMNLP.

Paolo Ferragina and Ugo Scaiella. 2010. TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities). In CIKM.

Octavian-Eugen Ganea and Thomas Hofmann. 2017. Deep joint entity disambiguation with local neural attention. In EMNLP.

Nitish Gupta, Sameer Singh, and Dan Roth. 2017. Entity linking via joint encoding of types, descriptions, and context. In EMNLP.

Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid O Seaghdha, Sebastian Pado, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2009. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In HLT-NAACL.

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Furstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust disambiguation of named entities in text. In EMNLP.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In ACL.

Yangfeng Ji, Chenhao Tan, Sebastian Martschat, Yejin Choi, and Noah A. Smith. 2017. Dynamic entity representations in neural language models. In EMNLP.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.

Nikolaos Kolitsas, Octavian-Eugen Ganea, and Thomas Hofmann. 2018. End-to-end neural entity linking. In CoNLL.

Kenton Lee, Luheng He, Mike Lewis, and Luke S. Zettlemoyer. 2017. End-to-end neural coreference resolution. In EMNLP.

Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. 2015. Learning entity and relation embeddings for knowledge graph completion. In AAAI.

Xiao Ling, Sameer Singh, and Daniel S. Weld. 2015. Design challenges for entity linking. Transactions of the Association for Computational Linguistics, 3:315–328.

Robert L. Logan, Nelson F. Liu, Matthew E. Peters, Matt Gardner, and Sameer Singh. 2019. Barack's wife Hillary: Using knowledge graphs for fact-aware language modeling. In ACL.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In NeurIPS.

Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. In CoNLL.

Todor Mihaylov and Anette Frank. 2018. Knowledgeable reader: Enhancing cloze-style reading comprehension with external commonsense knowledge. In ACL.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In NeurIPS.

George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

George A. Miller, Martin Chodorow, Shari Landes, Claudia Leacock, and Robert G. Thomas. 1994. Using a semantic concordance for sense identification. In HLT.

Roberto Navigli, Jose Camacho-Collados, and Alessandro Raganato. 2017. Word sense disambiguation: A unified evaluation framework and empirical comparison. In EACL.

Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A three-way model for collective learning on multi-relational data. In ICML.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL-HLT.

Mohammad Taher Pilehvar and Jose Camacho-Collados. 2019. WiC: the word-in-context dataset for evaluating context-sensitive meaning representations. In NAACL-HLT.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Alessandro Raganato, Claudio Delli Bovi, and Roberto Navigli. 2017. Neural sequence learning models for word sense disambiguation. In EMNLP.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL.

Peng Shi and Jimmy Lin. 2019. Simple BERT models for relation extraction and semantic role labeling.

Livio B. Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In ACL.

Robert Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In AAAI.

Valentin I. Spitkovsky and Angel X. Chang. 2012. A cross-lingual dictionary for English Wikipedia concepts. In LREC.

Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J. Pal. 2018. Learning general purpose distributed sentence representations via large scale multi-task learning. In ICLR.

Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Kathryn Mazaitis, Ruslan R. Salakhutdinov, and William W. Cohen. 2018. Open domain question answering using early fusion of knowledge bases and text. In EMNLP.

Theo Trouillon, Johannes Welbl, Sebastian Riedel, Eric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. In ICML.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS.

Denny Vrandecic and Markus Krotzsch. 2014. Wikidata: A free collaborative knowledgebase. Communications of the ACM, 57(10):78–85.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019a. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv:1905.00537.

Chao Wang and Hui Jiang. 2019. Explicit utilization of general knowledge in machine reading comprehension. In ACL.

Haoyu Wang, Ming Tan, Mo Yu, Shiyu Chang, Dakuo Wang, Kun Xu, Xiaoxiao Guo, and Saloni Potdar. 2019b. Extracting multiple-relations in one-pass with pre-trained transformers. In ACL.

Linlin Wang, Zhu Cao, Gerard de Melo, and Zhiyuan Liu. 2016. Relation classification via multi-level attention CNNs. In ACL.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014a. Knowledge graph and text jointly embedding. In EMNLP.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014b. Knowledge graph embedding by translating on hyperplanes. In AAAI.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144.

Han Xiao, Minlie Huang, and Xiaoyan Zhu. 2016. From one point to a manifold: knowledge graph embedding for precise link prediction. In AAAI.

Ruobing Xie, Zhiyuan Liu, J. J. Jia, Huanbo Luan, and Maosong Sun. 2016. Representation learning of knowledge graphs with entity descriptions. In AAAI.

An Yang, Quan Wang, Jing Liu, Kai Liu, Yajuan Lyu, Hua Wu, Qiaoqiao She, and Sujian Li. 2019. Enhancing pre-trained language representations with rich knowledge for machine reading comprehension. In ACL.

Bishan Yang and Tom Michael Mitchell. 2017. Leveraging knowledge bases in LSTMs for improving machine reading. In ACL.

Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2014. Embedding entities and relations for learning and inference in knowledge bases. arXiv:1412.6575.

Zichao Yang, Phil Blunsom, Chris Dyer, and Wang Ling. 2017. Reference-aware language models. In EMNLP.

Yuhao Zhang, Peng Qi, and Christopher D. Manning. 2018. Graph convolution over pruned dependency trees improves relation extraction. In EMNLP.

Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware attention and supervised data improve slot filling. In EMNLP.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In ACL.

Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan R. Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In ICCV.

A Supplemental Material

A.1 KnowBert training details

We generally followed the fine-tuning procedure in Devlin et al. (2019), using Adam (Kingma and Ba, 2015) with weight decay 0.1 and a learning rate schedule that linearly increases then decreases. Similar to ULMFiT (Howard and Ruder, 2018), we found it beneficial to vary the learning rate in different layers when fine tuning with this simple schedule: the randomly initialized newly added layers in the KAR surrounding each KB had the largest learning rate α; all BERT layers below had learning rate 0.25α; and BERT layers above had 0.5α. KnowBert-Wiki was fine-tuned for a total of 750K gradient updates, KnowBert-WordNet for 500K updates, and KnowBert-W+W for 500K updates (after pretraining KnowBert-Wiki). Fine-tuning was done on a single Titan RTX GPU with batch size of 32 using gradient accumulation for long sequences, except for the final 250K steps of KnowBert-W+W which was trained with batch size of 64 using 4 GPUs. At the end of training, masked LM perplexity continued to decrease, suggesting that further training would improve the results.
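A sketch of the layer-dependent learning rates using optimizer parameter groups; AdamW here stands in for their Adam-with-weight-decay setup, and the parameter partitioning helpers are assumed.

```python
import torch

def build_optimizer(kar_params, bert_below_params, bert_above_params, alpha=1e-4):
    """Parameter groups with the layer-dependent learning rates described above."""
    groups = [
        {"params": kar_params,        "lr": alpha},         # newly added KAR layers
        {"params": bert_below_params, "lr": 0.25 * alpha},  # BERT layers below the KAR
        {"params": bert_above_params, "lr": 0.5 * alpha},   # BERT layers above the KAR
    ]
    return torch.optim.AdamW(groups, weight_decay=0.1)

# Toy parameters standing in for the three regions of the network.
opt = build_optimizer([torch.nn.Parameter(torch.randn(3, 3))],
                      [torch.nn.Parameter(torch.randn(3, 3))],
                      [torch.nn.Parameter(torch.randn(3, 3))])
print([g["lr"] for g in opt.param_groups])  # [0.0001, 2.5e-05, 5e-05]
```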

Due to computational expense, we performed very little hyperparameter tuning during the pretraining stage, and none with the full KnowBert-W+W model. All tuning was performed with partial training runs by monitoring masked LM perplexity in the early portion of training and terminating the underperforming configurations. For each of KnowBert-WordNet and KnowBert-Wiki we ran experiments with two learning rates (maximum learning rate of 1e-4 and 2e-4) and chose the best value after a few hundred thousand gradient updates (2e-4 for KnowBert-WordNet and 1e-4 for KnowBert-Wiki).

When training KnowBert-WordNet and KnowBert-Wiki we chose random batches from the unlabeled corpus and annotated entity linking datasets with a ratio of 4:1 (80% unlabeled, 20% labeled). For KnowBert-W+W we used a 85%/7.5%/7.5% sampling ratio.

A.2 Task fine-tuning details

This section details the fine tuning procedures and hyperparameters for each of the end tasks. All optimization was performed with the Adam optimizer with linear warmup of learning rate over the first 10% of gradient updates to a maximum value, then linear decay over the remainder of training. Gradients were clipped if their norm exceeded 1.0, and weight decay on all non-bias parameters was set to 0.01. Grid search was used for hyperparameter tuning (maximum values bolded below), using five random restarts for each hyperparameter setting for all datasets except TACRED (which used a single seed). Early stopping was performed on the development set. Batch size was 32 in all cases.

TACRED This dataset provides annotations for 106K sentences with typed subject and object spans and relationship labels across 41 different classes (plus the no-relation label). The hyperparameter search space was:

• learning rate: [3e-5, 5e-5]

• number epochs: [1, 2, 3, 4, 5]

• β2: [0.98, 0.999]

Maximum development micro F1 is 71.7%.

SemEval 2010 Task 8 This dataset provides annotations for 10K sentences with untyped subject and object spans and relationship labels across 18 different classes (plus the no-relation label). As the task does not define a standard development split, we randomly sampled 500 of the 8000 training examples for development. The hyperparameter search space was:

• learning rate: [3e-5, 5e-5]

• number epochs: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

with β2 = 0.98. We used the provided semeval2010_task8_scorer-v1.2.pl script to compute F1. The maximum development F1 averaged across the random restarts was 89.1 ± 0.77 (maximum value was 90.5 across the seeds).

WiC WiC is a binary classification task with 7.5K annotated sentence pairs. Due to the small size of the dataset, we found it helpful to use model averaging to reduce the variance in the development accuracy across random restarts. The hyperparameter search space was:

• learning rate: [1e-5, 2e-5, 3e-5, 5e-5]

• number epochs: [2, 3, 4, 5]

• beta2: [0.98, 0.999]

• weight averaging decay: [no averaging, 0.95, 0.99]

The maximum development accuracy was 72.6.

Entity typing As described in Section 4.3, we evaluated on a subset of data corresponding to entities classified by nine different general classes: person, location, object, organization, place, entity, object, time, and event. β2 was set to 0.98. The hyperparameter search space was:

• learning rate: [2e-5, 3e-5]

• number epochs: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

• beta2: [0.98, 0.999]

• weight averaging decay: [no averaging, 0.99]

The maximum development F1 was 75.5 ± 0.38 averaged across five seeds, with maximum value of 76.0.

A.3 Wikidata probing results

Table 8 shows the results for the Wikidata probing task for all relationships.


Relation                   BERT base   BERT large   KnowBert Wiki   KnowBert WordNet   KnowBert W+W
companyFoundedBy           0.08        0.08         0.23            0.21               0.28
movieDirectedBy            0.05        0.04         0.08            0.09               0.10
movieStars                 0.06        0.05         0.09            0.09               0.11
personCityOfBirth          0.18        0.20         0.33            0.35               0.40
personCountryOfBirth       0.16        0.17         0.33            0.36               0.47
personCountryOfDeath       0.18        0.18         0.30            0.30               0.44
personEducatedAt           0.11        0.10         0.37            0.29               0.43
personEmployer             0.03        0.02         0.20            0.13               0.21
personFather               0.15        0.14         0.42            0.40               0.48
personMemberOfBand         0.11        0.11         0.19            0.18               0.23
personMemberOfSportsTeam   0.01        0.03         0.05            0.08               0.13
personMother               0.11        0.11         0.28            0.24               0.32
personOccupation           0.14        0.24         0.24            0.24               0.30
personSpouse               0.05        0.06         0.12            0.08               0.18
songPerformedBy            0.07        0.05         0.10            0.10               0.13
videoGamePlatform          0.16        0.22         0.27            0.20               0.34
writtenTextAuthor          0.06        0.05         0.15            0.16               0.18
Total                      0.09        0.11         0.26            0.22               0.31

Table 8: Full results on the Wikidata probing task including all relations.