

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2346–2357, Florence, Italy, July 28 to August 2, 2019. © 2019 Association for Computational Linguistics


Enhancing Pre-Trained Language Representations with Rich Knowledge for Machine Reading Comprehension

An Yang1*, Quan Wang2, Jing Liu2, Kai Liu2, Yajuan Lyu2, Hua Wu2†, Qiaoqiao She2 and Sujian Li1†

1 Key Laboratory of Computational Linguistics, Peking University, MOE, China
2 Baidu Inc., Beijing, China

{yangan, lisujian}@pku.edu.cn  {wangquan05, liujing46, liukai20, lvyajuan, wu_hua, sheqiaoqiao}@baidu.com

Abstract

Machine reading comprehension (MRC) is a crucial and challenging task in NLP. Recently, pre-trained language models (LMs), especially BERT, have achieved remarkable success, presenting new state-of-the-art results in MRC. In this work, we investigate the potential of leveraging external knowledge bases (KBs) to further improve BERT for MRC. We introduce KT-NET, which employs an attention mechanism to adaptively select desired knowledge from KBs, and then fuses the selected knowledge with BERT to enable context- and knowledge-aware predictions. We believe this would combine the merits of both deep LMs and curated KBs towards better MRC. Experimental results indicate that KT-NET offers significant and consistent improvements over BERT, outperforming competitive baselines on the ReCoRD and SQuAD1.1 benchmarks. Notably, it ranks 1st on the ReCoRD leaderboard, and is also the best single model on the SQuAD1.1 leaderboard at the time of submission (March 4th, 2019).1

1 Introduction

Machine reading comprehension (MRC), which requires machines to comprehend text and answer questions about it, is a crucial task in natural language processing. With the development of deep learning and the increasing availability of datasets (Rajpurkar et al., 2016, 2018; Nguyen et al., 2016; Joshi et al., 2017), MRC has achieved remarkable advancements in the last few years.

Recently, language model (LM) pre-training has caused a stir in the MRC community. These LMs are pre-trained on unlabeled text and then applied to MRC, in either a feature-based (Peters et al., 2018a) or a fine-tuning (Radford et al., 2018) manner, both offering substantial performance boosts.

* This work was done while the first author was an intern at Baidu Inc.
† Co-corresponding authors: Hua Wu and Sujian Li.
1 Our code will be available at http://github.com/paddlepaddle/models/tree/develop/PaddleNLP/Research/ACL2019-KTNET

Passage: The US government has extended its review into whether trade sanctions against Sudan should be repealed. [...] Sudan is committed to the full implementation of UN Security Council resolutions on North Korea. [...] Sudan's past support for North Korea could present an obstacle [...]
Question: Sudan remains a XXX-designated state sponsor of terror and is one of six countries subject to the Trump administration's ban.
Original BERT prediction: UN Security Council
Prediction with background knowledge: US

Background knowledge:
NELL: (Donald Trump, person-leads-organization, US)
WordNet: (government, same-synset-with, administration)
WordNet: (sanctions, common-hypernym-with, ban)

Figure 1: An example from ReCoRD, with answer candidates marked (underlined) in the passage. The vanilla BERT model fails to predict the correct answer. But it succeeds after integrating background knowledge collected from WordNet and NELL.

Among different pre-training mechanisms, BERT (Devlin et al., 2018), which uses a Transformer encoder (Vaswani et al., 2017) and trains a bidirectional LM, is undoubtedly the most successful by far, presenting new state-of-the-art results in MRC and a wide variety of other language understanding tasks. Owing to the large amounts of unlabeled data and the sufficiently deep architectures used during pre-training, advanced LMs such as BERT are able to capture complex linguistic phenomena, understanding language better than previously appreciated (Peters et al., 2018b; Goldberg, 2019).

However, as widely recognized, genuine reading comprehension requires not only language understanding, but also knowledge that supports sophisticated reasoning (Chen et al., 2016; Mihaylov and Frank, 2018; Bauer et al., 2018; Zhong et al., 2018). Thereby, we argue that pre-trained LMs, despite their power, could be further improved for MRC by integrating background knowledge. Fig. 1 gives a motivating example from ReCoRD (Zhang et al., 2018). In this example, the passage describes that Sudan faces trade sanctions from the US due to its past support for North Korea. The cloze-style question states that Sudan is subject to the Trump administration's ban, and asks for the organization by which Sudan is deemed to be a state sponsor of terror. BERT fails on this case, as there is not enough evidence in the text. But after introducing the world knowledge "Trump is the person who leads the US" and the word knowledge "sanctions has a common hypernym with ban", we can reasonably infer that the answer is "US". This example suggests the importance and necessity of integrating knowledge, even on the basis of a rather strong model like BERT. We refer interested readers to Appendix A for another motivating example from SQuAD1.1 (Rajpurkar et al., 2016).

Thus, in this paper, we devise KT-NET (abbr. for Knowledge and Text fusion NET), a new approach to MRC which improves pre-trained LMs with additional knowledge from knowledge bases (KBs). The aim here is to take full advantage of both linguistic regularities covered by deep LMs and high-quality knowledge derived from curated KBs, towards better MRC. We leverage two KBs: WordNet (Miller, 1995), which records lexical relations between words, and NELL (Carlson et al., 2010), which stores beliefs about entities. Both are useful for the task (see Fig. 1). Instead of introducing symbolic facts, we resort to distributed representations (i.e., embeddings) of KBs (Yang and Mitchell, 2017). With such KB embeddings, we could (i) integrate knowledge relevant not only locally to the reading text but also globally about the whole KBs; and (ii) easily incorporate multiple KBs at the same time, with minimal task-specific engineering (see § 2.2 for a detailed explanation).

As depicted in Fig. 2, given a question and passage, KT-NET first retrieves potentially relevant KB embeddings and encodes them in a knowledge memory. Then, it employs, in turn, (i) a BERT encoding layer to compute deep, context-aware representations for the reading text; (ii) a knowledge integration layer to select desired KB embeddings from the memory, and integrate them with BERT representations; (iii) a self-matching layer to fuse BERT and KB representations, so as to enable rich interactions among them; and (iv) an output layer to predict the final answer. In this way we enrich BERT with curated knowledge, combine the merits of both, and make knowledge-aware predictions.

Figure 2: Overall architecture of KT-NET (left), with the knowledge integration module illustrated (right). [Diagram: Question + Passage → BERT Encoding → Knowledge Integration → Self-Matching → Output (start and end probabilities); the knowledge integration module combines each BERT vector with retrieved KB embeddings and a sentinel vector via bilinear attention, softmax, weighted summation, and concatenation.]

We evaluate our approach on two benchmarks: ReCoRD (Zhang et al., 2018) and SQuAD1.1 (Rajpurkar et al., 2016). On ReCoRD, a passage is generated from the first few paragraphs of a news article, and the corresponding question from the rest of the article, which, by design, requires background knowledge and reasoning. On SQuAD1.1, where the best models already outperform humans, the questions remaining unsolved are really difficult ones. Both are appealing testbeds for evaluating genuine reading comprehension capabilities. We show that incorporating knowledge can bring significant and consistent improvements to BERT, which itself is one of the strongest models on both datasets.

The contributions of this paper are two-fold: (i) We investigate and demonstrate the feasibility of enhancing pre-trained LMs with rich knowledge for MRC. To our knowledge, this is the first study of its kind, indicating a potential direction for future research. (ii) We devise a new approach to MRC, KT-NET. It outperforms competitive baselines, ranks 1st on the ReCoRD leaderboard, and is also the best single model on the SQuAD1.1 leaderboard at the time of submission (March 4th, 2019).

2 Our Approach

In this work we consider the extractive MRC task. Given a passage with $m$ tokens $P = \{p_i\}_{i=1}^{m}$ and a question with $n$ tokens $Q = \{q_j\}_{j=1}^{n}$, our goal is to predict an answer $A$ which is constrained to be a contiguous span in the passage, i.e., $A = \{p_i\}_{i=a}^{b}$, with $a$ and $b$ indicating the answer boundary.

We propose KT-NET for this task, the key idea of which is to enhance BERT with curated knowledge from KBs, so as to combine the merits of both. To encode knowledge, we adopt knowledge graph embedding techniques (Yang et al., 2015) and learn vector representations of KB concepts. Given passage $P$ and question $Q$, we retrieve for each token $w \in P \cup Q$ a set of potentially relevant KB concepts $C(w)$, where each concept $c \in C(w)$ is associated with a learned vector embedding $\mathbf{c}$.

Based upon these pre-trained KB embeddings, KT-NET is built, as depicted in Fig. 2, with four major components: (i) a BERT encoding layer that computes deep, context-aware representations for questions and passages; (ii) a knowledge integration layer that employs an attention mechanism to select the most relevant KB embeddings, and integrates them with BERT representations; (iii) a self-matching layer that further enables rich interactions among BERT and KB representations; and (iv) an output layer that predicts the final answer. In what follows, we first introduce the four major components in § 2.1, and leave knowledge embedding and retrieval to § 2.2.

2.1 Major Components of KT-NET

KT-NET consists of four major modules: BERT encoding, knowledge integration, self-matching, and final output, detailed as follows.

BERT Encoding Layer  This layer uses the BERT encoder to model passages and questions. It takes as input passage $P$ and question $Q$, and computes for each token a context-aware representation.

Specifically, given passage $P = \{p_i\}_{i=1}^{m}$ and question $Q = \{q_j\}_{j=1}^{n}$, we first pack them into a single sequence of length $m + n + 3$, i.e.,

$$S = [\langle \mathrm{CLS} \rangle, Q, \langle \mathrm{SEP} \rangle, P, \langle \mathrm{SEP} \rangle],$$

where $\langle \mathrm{SEP} \rangle$ is the token separating $Q$ and $P$, and $\langle \mathrm{CLS} \rangle$ the token for classification (not used in this paper). For each token $s_i$ in $S$, we construct its input representation as:

$$\mathbf{h}_i^0 = \mathbf{s}_i^{\mathrm{tok}} + \mathbf{s}_i^{\mathrm{pos}} + \mathbf{s}_i^{\mathrm{seg}},$$

where $\mathbf{s}_i^{\mathrm{tok}}$, $\mathbf{s}_i^{\mathrm{pos}}$, and $\mathbf{s}_i^{\mathrm{seg}}$ are the token, position, and segment embeddings for $s_i$, respectively. Tokens in $Q$ share the same segment embedding $\mathbf{q}^{\mathrm{seg}}$, and tokens in $P$ the same segment embedding $\mathbf{p}^{\mathrm{seg}}$. Such input representations are then fed into $L$ successive Transformer encoder blocks, i.e.,

$$\mathbf{h}_i^{\ell} = \mathrm{Transformer}(\mathbf{h}_i^{\ell-1}), \quad \ell = 1, 2, \cdots, L,$$

so as to generate deep, context-aware representations for passages and questions. We refer readers to (Devlin et al., 2018; Vaswani et al., 2017) for details. The final hidden states $\{\mathbf{h}_i^L\}_{i=1}^{m+n+3}$, with $\mathbf{h}_i^L \in \mathbb{R}^{d_1}$, are taken as the output of this layer.
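To make the packing step concrete, below is a minimal sketch of how $S$ and the segment ids might be built. It assumes a generic WordPiece tokenizer exposing a `convert_tokens_to_ids` method (a common convention in BERT tooling, not taken from the released KT-NET code), and it is for illustration only.

```python
def pack_inputs(question_tokens, passage_tokens, tokenizer,
                max_seq_len=384, max_query_len=64):
    """Build S = [CLS] Q [SEP] P [SEP] with BERT-style segment ids."""
    # Truncate the question first (maximum question length 64, see Section 3.2).
    q = question_tokens[:max_query_len]
    # Reserve 3 positions for [CLS] and the two [SEP] tokens.
    p = passage_tokens[:max_seq_len - len(q) - 3]

    tokens = ["[CLS]"] + q + ["[SEP]"] + p + ["[SEP]"]
    # [CLS], Q, and the first [SEP] share one segment embedding; P and the final [SEP] share another.
    segment_ids = [0] * (len(q) + 2) + [1] * (len(p) + 1)
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    return input_ids, segment_ids
```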

Knowledge Integration Layer  This layer is designed to further integrate knowledge into BERT, and is a core module of our approach. It takes as input the BERT representations $\{\mathbf{h}_i^L\}$ output from the previous layer, and enriches them with relevant KB embeddings, which makes the representations not only context-aware but also knowledge-aware.

Specifically, for each token $s_i$, we get its BERT representation $\mathbf{h}_i^L \in \mathbb{R}^{d_1}$ and retrieve a set of potentially relevant KB concepts $C(s_i)$, where each concept $c_j$ is associated with a KB embedding $\mathbf{c}_j \in \mathbb{R}^{d_2}$. (We will describe the KB embedding and retrieval process later in § 2.2.) Then we employ an attention mechanism to adaptively select the most relevant KB concepts. We measure the relevance of concept $c_j$ to token $s_i$ with a bilinear operation, and calculate the attention weight as:

$$\alpha_{ij} \propto \exp(\mathbf{c}_j^\top \mathbf{W} \mathbf{h}_i^L), \qquad (1)$$

where $\mathbf{W} \in \mathbb{R}^{d_2 \times d_1}$ is a trainable weight parameter. As these KB concepts are not necessarily relevant to the token, we follow (Yang and Mitchell, 2017) to further introduce a knowledge sentinel $\bar{\mathbf{c}} \in \mathbb{R}^{d_2}$, and calculate its attention weight as:

$$\beta_i \propto \exp(\bar{\mathbf{c}}^\top \mathbf{W} \mathbf{h}_i^L). \qquad (2)$$

The retrieved KB embeddings $\{\mathbf{c}_j\}$ (as well as the sentinel $\bar{\mathbf{c}}$) are then aligned to $s_i$ and aggregated accordingly, i.e.,

$$\mathbf{k}_i = \sum\nolimits_j \alpha_{ij} \mathbf{c}_j + \beta_i \bar{\mathbf{c}}, \qquad (3)$$

with $\sum_j \alpha_{ij} + \beta_i = 1$.2 Here $\mathbf{k}_i$ can be regarded as a knowledge state vector that encodes extra KB information w.r.t. the current token. We concatenate $\mathbf{k}_i$ with the BERT representation $\mathbf{h}_i^L$ and output $\mathbf{u}_i = [\mathbf{h}_i^L, \mathbf{k}_i] \in \mathbb{R}^{d_1 + d_2}$, which is by nature not only context-aware but also knowledge-aware.

2 We set $\mathbf{k}_i = \mathbf{0}$ if $C(s_i) = \emptyset$.
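A possible PyTorch rendering of Eqs. (1)-(3) is sketched below. Per-token concept embeddings are assumed to be zero-padded to a fixed number and accompanied by a mask; tensor names, shapes, and initializations are our own illustrative choices rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class KnowledgeIntegration(nn.Module):
    """Sentinel-gated attention over retrieved KB concept embeddings (Eqs. 1-3)."""

    def __init__(self, d1, d2):
        super().__init__()
        self.W = nn.Parameter(torch.empty(d2, d1))     # bilinear weight of Eq. (1)
        self.sentinel = nn.Parameter(torch.empty(d2))  # knowledge sentinel c-bar
        nn.init.xavier_uniform_(self.W)
        nn.init.normal_(self.sentinel, std=0.02)

    def forward(self, h, concept_emb, concept_mask):
        # h: (B, S, d1) BERT states; concept_emb: (B, S, N, d2); concept_mask: (B, S, N)
        Wh = torch.einsum("bsd,ed->bse", h, self.W)                   # W h_i, shape (B, S, d2)
        scores = torch.einsum("bsnd,bsd->bsn", concept_emb, Wh)       # c_j^T W h_i   (Eq. 1)
        scores = scores.masked_fill(concept_mask == 0, -1e30)         # ignore padded concepts
        sent = torch.einsum("d,bsd->bs", self.sentinel, Wh)           # c-bar^T W h_i (Eq. 2)
        attn = torch.softmax(torch.cat([scores, sent.unsqueeze(-1)], dim=-1), dim=-1)
        alpha, beta = attn[..., :-1], attn[..., -1:]                  # sum(alpha) + beta = 1
        k = torch.einsum("bsn,bsnd->bsd", alpha, concept_emb) + beta * self.sentinel  # Eq. (3)
        # Tokens with no retrieved concepts get k_i = 0 (footnote 2).
        k = k * (concept_mask.sum(-1, keepdim=True) > 0).float()
        return torch.cat([h, k], dim=-1)                              # u_i = [h_i, k_i]
```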


Self-Matching Layer  This layer takes as input the knowledge-enriched representations $\{\mathbf{u}_i\}$, and employs a self-attention mechanism to further enable interactions among the context components $\{\mathbf{h}_i^L\}$ and knowledge components $\{\mathbf{k}_i\}$. It is also an important module of our approach.

We model both direct and indirect interactions. As for direct interactions, given two tokens $s_i$ and $s_j$ (along with their knowledge-enriched representations $\mathbf{u}_i$ and $\mathbf{u}_j$), we measure their similarity with a trilinear function (Seo et al., 2017):

$$r_{ij} = \mathbf{w}^\top [\mathbf{u}_i, \mathbf{u}_j, \mathbf{u}_i \odot \mathbf{u}_j],$$

and accordingly obtain a similarity matrix $\mathbf{R}$ with $r_{ij}$ being the $ij$-th entry. Here $\odot$ denotes element-wise multiplication, and $\mathbf{w} \in \mathbb{R}^{3d_1 + 3d_2}$ is a trainable weight parameter. Then, we apply a row-wise softmax operation on $\mathbf{R}$ to get the self-attention weight matrix $\mathbf{A}$, and compute for each token $s_i$ an attended vector $\mathbf{v}_i$, i.e.,

$$a_{ij} = \frac{\exp(r_{ij})}{\sum_j \exp(r_{ij})}, \qquad \mathbf{v}_i = \sum\nolimits_j a_{ij} \mathbf{u}_j,$$

where $a_{ij}$ is the $ij$-th entry of $\mathbf{A}$. $\mathbf{v}_i$ reflects how each token $s_j$ interacts directly with $s_i$.

Aside from direct interactions, indirect interactions, e.g., the interaction between $s_i$ and $s_j$ via an intermediate token $s_k$, are also useful. To further model such indirect interactions, we conduct a self-multiplication of the original attention matrix $\mathbf{A}$, and compute for each token $s_i$ another attended vector $\bar{\mathbf{v}}_i$, i.e.,

$$\bar{\mathbf{A}} = \mathbf{A}^2, \qquad \bar{\mathbf{v}}_i = \sum\nolimits_j \bar{a}_{ij} \mathbf{u}_j,$$

where $\bar{a}_{ij}$ is the $ij$-th entry of $\bar{\mathbf{A}}$. $\bar{\mathbf{v}}_i$ reflects how each token $s_j$ interacts indirectly with $s_i$, through all possible intermediate tokens. Finally, we build the output for each token by a concatenation $\mathbf{o}_i = [\mathbf{u}_i, \mathbf{v}_i, \mathbf{u}_i - \mathbf{v}_i, \mathbf{u}_i \odot \mathbf{v}_i, \bar{\mathbf{v}}_i, \mathbf{u}_i - \bar{\mathbf{v}}_i] \in \mathbb{R}^{6d_1 + 6d_2}$.
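The self-matching computation can be sketched as follows; this is again an illustrative re-implementation under our own shape conventions, splitting the trilinear weight $\mathbf{w}$ into three parts so that the full similarity matrix is built without materializing $[\mathbf{u}_i, \mathbf{u}_j, \mathbf{u}_i \odot \mathbf{u}_j]$ explicitly.

```python
import torch
import torch.nn as nn

class SelfMatching(nn.Module):
    """Trilinear self-attention with direct (A) and indirect (A^2) interactions."""

    def __init__(self, d):
        super().__init__()
        # d = d1 + d2; the trilinear weight w is split into three d-dimensional parts.
        self.w_u = nn.Parameter(torch.zeros(d))
        self.w_v = nn.Parameter(torch.zeros(d))
        self.w_uv = nn.Parameter(torch.zeros(d))

    def forward(self, u):
        # u: (B, S, d) knowledge-enriched representations
        part_u = (u @ self.w_u).unsqueeze(2)                       # contribution of u_i
        part_v = (u @ self.w_v).unsqueeze(1)                       # contribution of u_j
        part_uv = torch.einsum("bid,bjd->bij", u * self.w_uv, u)   # contribution of u_i * u_j
        r = part_u + part_v + part_uv                              # r_ij, shape (B, S, S)
        A = torch.softmax(r, dim=-1)                               # row-wise softmax
        v = A @ u                                                  # direct interactions
        v_bar = (A @ A) @ u                                        # indirect interactions via A^2
        return torch.cat([u, v, u - v, u * v, v_bar, u - v_bar], dim=-1)   # o_i
```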

Output Layer  We follow BERT and simply use a linear output layer, followed by a standard softmax operation, to predict answer boundaries. The probability of each token $s_i$ to be the start or end position of the answer span is calculated as:

$$p_i^1 = \frac{\exp(\mathbf{w}_1^\top \mathbf{o}_i)}{\sum_j \exp(\mathbf{w}_1^\top \mathbf{o}_j)}, \qquad p_i^2 = \frac{\exp(\mathbf{w}_2^\top \mathbf{o}_i)}{\sum_j \exp(\mathbf{w}_2^\top \mathbf{o}_j)},$$

where $\{\mathbf{o}_i\}$ are output by the self-matching layer, and $\mathbf{w}_1, \mathbf{w}_2 \in \mathbb{R}^{6d_1 + 6d_2}$ are trainable parameters. The training objective is the negative log-likelihood of the true start and end positions:

$$\mathcal{L} = -\frac{1}{N} \sum_{j=1}^{N} \left( \log p^1_{y_j^1} + \log p^2_{y_j^2} \right),$$

where $N$ is the number of examples in the dataset, and $y_j^1$, $y_j^2$ are the true start and end positions of the $j$-th example, respectively. At inference time, the span $(a, b)$ where $a \leq b$ with maximum $p_a^1 p_b^2$ is chosen as the predicted answer.
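At inference time, the span search is a simple nested loop over start and end positions, bounded by the maximum answer length used in § 3.2; a sketch:

```python
import numpy as np

def decode_span(p_start, p_end, max_answer_len=30):
    """Return (a, b) with a <= b maximizing p_start[a] * p_end[b]."""
    best, best_score = (0, 0), -1.0
    for a, ps in enumerate(p_start):
        # only consider end positions within the allowed answer length
        for b in range(a, min(a + max_answer_len, len(p_end))):
            score = ps * p_end[b]
            if score > best_score:
                best, best_score = (a, b), score
    return best

# toy usage
p1 = np.array([0.1, 0.6, 0.2, 0.1])
p2 = np.array([0.05, 0.15, 0.7, 0.1])
print(decode_span(p1, p2))  # -> (1, 2)
```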

2.2 Knowledge Embedding and Retrieval

Now we introduce the knowledge embedding and retrieval process. We use two KBs: WordNet and NELL, both stored as (subject, relation, object) triples, where each triple is a fact indicating a specific relation between two entities. WordNet stores lexical relations between word synsets, e.g., (organism, hypernym of, animal). NELL stores beliefs about entities, where the subjects are usually real-world entities and the objects are either entities, e.g., (Coca Cola, headquartered in, Atlanta), or concepts, e.g., (Coca Cola, is a, company). Below we shall sometimes abuse terminology and refer to synsets, real-world entities, and concepts as "entities". As we have seen in Fig. 1, both KBs are useful for MRC.

KB Embedding  In contrast to directly encoding KBs as symbolic (subject, relation, object) facts, we choose to encode them in a continuous vector space. Specifically, given any triple $(s, r, o)$, we would like to learn vector embeddings of subject $s$, relation $r$, and object $o$, so that the validity of the triple can be measured in the vector space based on the embeddings. We adopt the BILINEAR model (Yang et al., 2015), which measures the validity via a bilinear function $f(s, r, o) = \mathbf{s}^\top \mathrm{diag}(\mathbf{r})\,\mathbf{o}$. Here, $\mathbf{s}, \mathbf{r}, \mathbf{o} \in \mathbb{R}^{d_2}$ are the vector embeddings associated with $s$, $r$, $o$, respectively, and $\mathrm{diag}(\mathbf{r})$ is a diagonal matrix with the main diagonal given by $\mathbf{r}$. Triples already stored in a KB are supposed to have higher validity. A margin-based ranking loss is then accordingly designed to learn the embeddings (refer to (Yang et al., 2015) for details). After this embedding process, we obtain a vector representation for each entity (as well as each relation) of the two KBs.
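For illustration, the BILINEAR scoring function and a margin-based ranking loss of the kind described by Yang et al. (2015) can be sketched as follows; the negative-sampling details are simplified, and the class and argument names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilinearKBEmbedding(nn.Module):
    """f(s, r, o) = s^T diag(r) o, trained with a margin-based ranking loss."""

    def __init__(self, n_entities, n_relations, dim=100):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)
        self.rel = nn.Embedding(n_relations, dim)

    def score(self, s, r, o):
        # the element-wise product implements s^T diag(r) o
        return (self.ent(s) * self.rel(r) * self.ent(o)).sum(dim=-1)

    def loss(self, s, r, o, o_neg, margin=1.0):
        # corrupted triples (s, r, o_neg) should score lower than true ones by at least `margin`
        return F.relu(margin - self.score(s, r, o) + self.score(s, r, o_neg)).mean()
```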

KB Concepts Retrieval  In this work, we treat WordNet synsets and NELL concepts as knowledge to be retrieved from KBs, similar to (Yang and Mitchell, 2017). For WordNet, given a passage or question word, we return its synsets as candidate KB concepts. For NELL, we first recognize named entities from a given passage and question, link the recognized mentions to NELL entities by string matching, and then collect the corresponding NELL concepts as candidates. Words within the same entity name and subwords within the same word will share the same retrieved concepts, e.g., we retrieve the NELL concept "company" for both "Coca" and "Cola". After this retrieval process, we obtain a set of potentially relevant KB concepts for each token in the input sequence, where each KB concept is associated with a vector embedding.
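The WordNet side of this retrieval step can be sketched with NLTK as below; `synset_embeddings` stands for the pre-trained embedding table keyed by synset name (a hypothetical variable), the NELL branch (entity linking by string matching) is omitted, and subwords of the same word would simply share the returned candidate list.

```python
from nltk.corpus import wordnet as wn

def retrieve_wordnet_concepts(word, synset_embeddings):
    """Return (synset_name, embedding) pairs for synsets covered by the pre-trained subset."""
    candidates = []
    for synset in wn.synsets(word):         # e.g. 'ban' -> ban.n.01, ban.v.01, ...
        name = synset.name()
        if name in synset_embeddings:       # keep only synsets with a pre-trained vector
            candidates.append((name, synset_embeddings[name]))
    return candidates
```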

Advantages  Previous attempts that leverage extra knowledge for MRC (Bauer et al., 2018; Mihaylov and Frank, 2018) usually follow a retrieve-then-encode paradigm, i.e., they first retrieve relevant knowledge from KBs, and only the retrieved knowledge, which is relevant locally to the reading text, will be encoded and integrated for MRC. Our approach, by contrast, first learns embeddings for KB concepts with consideration of the whole KBs (or at least sufficiently large subsets of the KBs). The learned embeddings are then retrieved and integrated for MRC, and are thus relevant not only locally to the reading text but also globally about the whole KBs. Such knowledge is more informative and potentially more useful for MRC.

Moreover, our approach offers a highly convenient way to simultaneously integrate knowledge from multiple KBs. For instance, suppose we retrieve for token $s_i$ a set of candidate KB concepts $C_1(s_i)$ from WordNet, and $C_2(s_i)$ from NELL. Then, we can compute a knowledge state vector $\mathbf{k}_i^1$ based on $C_1(s_i)$, and $\mathbf{k}_i^2$ based on $C_2(s_i)$, which are further combined with the BERT hidden state $\mathbf{h}_i^L$ to generate $\mathbf{u}_i = [\mathbf{h}_i^L, \mathbf{k}_i^1, \mathbf{k}_i^2]$. As such, $\mathbf{u}_i$ naturally encodes knowledge from both KBs (see the knowledge integration layer for technical details).
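In code, integrating several KBs then reduces to running one knowledge-integration module per KB and concatenating the resulting knowledge state vectors with the BERT hidden states; a minimal sketch:

```python
import torch

def fuse_multi_kb(h, knowledge_states):
    # h: (B, S, d1) BERT hidden states; knowledge_states: a list of (B, S, d2_k) tensors,
    # one per KB (e.g. WordNet and NELL), each produced by its own attention module.
    return torch.cat([h] + knowledge_states, dim=-1)  # u_i = [h_i, k_i^1, k_i^2, ...]
```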

3 Experiments

3.1 Datasets

In this paper we empirically evaluate our approach on two benchmarks: ReCoRD and SQuAD1.1.

ReCoRD, an acronym for the Reading Comprehension with Commonsense Reasoning Dataset, is a large-scale MRC dataset requiring commonsense reasoning (Zhang et al., 2018). It consists of passage-question-answer tuples, collected from CNN and Daily Mail news articles. In each tuple, the passage is formed by the first few paragraphs of a news article, with named entities recognized and marked. The question is a sentence from the rest of the article, with a missing entity specified as the golden answer. The goal is to find the golden answer among the entities marked in the passage, which can be deemed as an extractive MRC task. This data collection process by design generates questions that require external knowledge and reasoning. It also filters out questions that can be answered simply by pattern matching, posing further challenges to current MRC systems. We take it as the major testbed for evaluating our approach.

Dataset       Train     Dev      Test
ReCoRD      100,730   10,000   10,000
SQuAD1.1     87,599   10,570    9,533

Table 1: The number of training, development, and test examples of ReCoRD and SQuAD1.1.

SQuAD1.1 (Rajpurkar et al., 2016) is a well-known extractive MRC dataset that consists of questions created by crowdworkers for Wikipedia articles. The golden answer to each question is a span from the corresponding passage. In this paper, we focus more on answerable questions than unanswerable ones. Hence, we choose SQuAD1.1 rather than SQuAD2.0 (Rajpurkar et al., 2018).

Table 1 provides the statistics of ReCoRD and SQuAD1.1. On both datasets, the training and development (dev) sets are publicly available, but the test set is hidden. One has to submit the code to retrieve the final test score. As frequent submissions to probe the unseen test set are not encouraged, we only submit our best single model for testing,3 and conduct further analysis on the dev set. Both datasets use Exact Match (EM) and (macro-averaged) F1 as the evaluation metrics (Zhang et al., 2018).
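For reference, a simplified sketch of the EM and token-level F1 metrics used by both benchmarks (the official evaluation scripts additionally normalize articles and punctuation, which is omitted here):

```python
from collections import Counter

def exact_match(prediction, gold):
    return float(prediction.strip().lower() == gold.strip().lower())

def f1_score(prediction, gold):
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_score("the US government", "US government"))  # -> 0.8 (token-level F1)
```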

3.2 Experimental Setups

Data Preprocessing  We first prepare pre-trained KB embeddings. We use the resources provided by Yang and Mitchell (2017), where the WordNet embeddings were pre-trained on a subset consisting of 151,442 triples with 40,943 synsets and 18 relations, and the NELL embeddings were pre-trained on a subset containing 180,107 entities and 258 concepts. Both groups of embeddings are 100-D. Refer to (Yang and Mitchell, 2017) for details.

3 In this paper, we restrict ourselves to improvements involving a single model, and hence do not consider ensembles.

Then we retrieve knowledge from the two KBs. For WordNet, we employ the BasicTokenizer built in BERT to tokenize text, and look up synsets for each word using NLTK (Bird and Loper, 2004). Synsets within the 40,943 subset are returned as candidate KB concepts for the word. For NELL, we link entity mentions to the whole KB, and return associated concepts within the 258 subset as candidate KB concepts. Entity mentions are given as answer candidates on ReCoRD, and recognized by Stanford CoreNLP (Manning et al., 2014) on SQuAD1.1.

Finally, we follow Devlin et al. (2018) and use the FullTokenizer built in BERT to segment words into wordpieces. The maximum question length is set to 64. Questions longer than that are truncated. The maximum input length ($|S|$) is set to 384. Input sequences longer than that are segmented into chunks with a stride of 128. The maximum answer length at inference time is set to 30.
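A sketch of the sliding-window segmentation described above, applied to the passage portion of the packed sequence (the stride of 128 follows the setting above; the helper itself is ours):

```python
def split_into_chunks(passage_tokens, max_passage_len, stride=128):
    """Segment a long passage into overlapping chunks of at most max_passage_len tokens."""
    chunks, start = [], 0
    while start < len(passage_tokens):
        chunks.append(passage_tokens[start:start + max_passage_len])
        if start + max_passage_len >= len(passage_tokens):
            break
        start += stride
    return chunks

# e.g. a 600-wordpiece passage with room for 320 passage tokens per chunk
print([len(c) for c in split_into_chunks(list(range(600)), 320)])  # [320, 320, 320, 216]
```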

Comparison Setting  We evaluate our approach in three settings: KT-NET_WordNet, KT-NET_NELL, and KT-NET_BOTH, to incorporate knowledge from WordNet, NELL, and both of the two KBs, respectively. We take BERT as a direct baseline, in which only the BERT encoding layer and output layer are used, and no knowledge will be incorporated. Our BERT follows exactly the same design as the original paper (Devlin et al., 2018). Besides BERT, we further take top-ranked systems on each dataset as additional baselines (to be detailed in § 3.3).

Training Details  For all three settings of KT-NET (as well as BERT), we initialize the parameters of the BERT encoding layer with pre-trained models officially released by Google.4 These models were pre-trained on the concatenation of BooksCorpus (800M words) and Wikipedia (2,500M words), using the tasks of masked language model and next sentence prediction (Devlin et al., 2018). We empirically find that the cased, large model, which is case sensitive and contains 24 Transformer encoding blocks, each with 16 self-attention heads and 1024 hidden units, performs the best on both datasets. Throughout our experiments, we use this setting unless specified otherwise. Other trainable parameters are randomly initialized.

4 https://github.com/google-research/bert

Model                      Dev EM   Dev F1   Test EM   Test F1
Leaderboard (Mar. 4th, 2019)
Human                       91.28    91.64    91.31     91.69
#1 DCReader+BERT              –        –      70.49     71.98
#2 BERT_BASE                  –        –      55.99     57.99
#3 DocQA w/ ELMo            44.13    45.39    45.44     46.65
#4 SAN                      38.14    39.09    39.77     40.72
#5 DocQA                    36.59    37.89    38.52     39.76
Ours
BERT                        70.22    72.16      –         –
KT-NET_WordNet              70.56    72.75      –         –
KT-NET_NELL                 70.54    72.52      –         –
KT-NET_BOTH                 71.60    73.61    73.01     74.76

Table 2: Results on ReCoRD. The top 5 systems are all single models and chosen for comparison.

Model                      Dev EM   Dev F1   Test EM   Test F1
Leaderboard (Mar. 4th, 2019)
Human                       80.3     90.5     82.30     91.22
#1 BERT+TriviaQA            84.2     91.1     85.08     91.83
#2 WD                         –        –      84.40     90.56
#3 nlnet                      –        –      83.47     90.13
#4 MARS                       –        –      83.19     89.55
#5 QANet                      –        –      82.47     89.31
Ours
BERT                        84.41    91.24      –         –
KT-NET_WordNet              85.15    91.70    85.94     92.43
KT-NET_NELL                 85.02    91.69      –         –
KT-NET_BOTH                 84.96    91.64      –         –

Table 3: Results on SQuAD1.1. The top 5 single models are chosen for comparison.

We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 3e-5 and a batch size of 24. The number of training epochs is chosen from {2, 3, 4}, according to the best EM+F1 score on the dev set of each dataset. During training, the pre-trained BERT parameters will be fine-tuned with other trainable parameters, while the KB embeddings will be kept fixed, which is empirically observed to offer the best performance.
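A sketch of this optimization setup in PyTorch, assuming a model object with a `kb_embeddings` submodule (a hypothetical name) that holds the pre-trained KB vectors:

```python
import torch

def build_optimizer(model, lr=3e-5):
    # Keep the pre-trained KB embeddings fixed; fine-tune everything else jointly with BERT.
    for p in model.kb_embeddings.parameters():   # hypothetical attribute name
        p.requires_grad = False
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)
```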

3.3 Results

On ReCoRD and SQuAD1.1, we compare our approach to BERT and the top 5 (single) models on the leaderboard (exclusive of ours). The results are given in Table 2 and Table 3, respectively, where the scores of the non-BERT baselines are taken directly from the leaderboard and/or literature.

On ReCoRD5 (Table 2): (i) DCReader+BERT is the former top leaderboard system (unpublished) prior to our submission; (ii) BERT_BASE is BERT with the base setting (12 Transformer blocks, each with 12 self-attention heads and 768 hidden units); (iii) DocQA (Clark and Gardner, 2018) and SAN (Liu et al., 2018) are two previous state-of-the-art MRC models; (iv) the pre-trained LM ELMo (Peters et al., 2018a) is further used in DocQA. All these models, except for DCReader+BERT, were re-implemented by the creators of the dataset and provided as official baselines (Zhang et al., 2018).

5 https://sheng-z.github.io/ReCoRD-explorer/

(a) KT-NET. Concepts from NELL: US (geopoliticalorganization 0.874, geopoliticallocation 0.122, organization 0.003); UN (nongovorganization 0.986, sentinel 0.012, terroristorganization 0.001). Concepts from WordNet: ban (forbidding_NN_1 0.861, proscription_NN_1 0.135, ban_VB_2 0.002); sanctions (sanction_VB_1 0.336, sanction_NN_3 0.310, sanction_NN_4 0.282).
(b) BERT.

Figure 3: Case study. Heat maps present similarities between question (row) and passage (column) words. Line charts show probabilities of answer boundaries. In KT-NET, the top 3 most relevant KB concepts are further given.

On SQuAD6 (Table 3): (i) BERT+TriviaQA is the former best model officially submitted by Google. It is an uncased, large model, and further uses data augmentation with TriviaQA (Joshi et al., 2017); (ii) WD, nlnet, and MARS are three competitive models that have not been published; (iii) QANet is a well-performing MRC model proposed by Yu et al. (2018), and later re-implemented and submitted by Google Brain & CMU.

Results on the dev sets show that (i) KT-NET consistently outperforms BERT (which itself already surpasses all the other baselines), irrespective of which KB is used, and on both datasets. Our best KT-NET model offers a 1.38/1.45 improvement in EM/F1 over BERT on ReCoRD, and a 0.74/0.46 improvement in EM/F1 on SQuAD1.1. (ii) Both KBs are capable of improving BERT for MRC, but the best setting varies across datasets. Integrating both KBs performs best on ReCoRD, while using WordNet alone is a better choice on SQuAD1.1.

Results on the test sets further demonstrate the superiority of our approach. It significantly outperforms the former top leaderboard system by +2.52 EM/+2.78 F1 on ReCoRD. And on SQuAD1.1, although there is little room for improvement, it still gets a meaningful gain of +0.86 EM/+0.60 F1 over the former best single model.

6 https://rajpurkar.github.io/SQuAD-explorer/

4 Case Study

This section provides a case study, using the motivating example described in Fig. 1, to vividly show the effectiveness of KT-NET and to make a direct comparison with BERT. For both methods, we use the optimal configurations that offer their respective best performance on ReCoRD (where the example comes from).

Relevant Knowledge Selection  We first explore how KT-NET can adaptively select the most relevant knowledge w.r.t. the reading text. Recall that given a token $s_i$, the relevance of a retrieved KB concept $c_j$ is measured by the attention weight $\alpha_{ij}$ (Eq. (1)), according to which we can pick the most relevant KB concepts for this token. Fig. 3(a) (left) presents 4 tokens from the question/passage, each associated with the top 3 most relevant concepts from NELL or WordNet. As we can see, these attention distributions are quite meaningful, with "US" and "UN" attending mainly to the NELL concepts of "geopoliticalorganization" and "nongovorganization", respectively, "ban" mainly to the WordNet synset "forbidding_NN_1", and "sanction" almost uniformly to the three highly relevant synsets.

Question/Passage Representations  We further examine how such knowledge will affect the final representations learned for the question/passage. We consider all sentences listed in Fig. 1, and content words (nouns, verbs, adjectives, and adverbs) therein. For each word $s_i$, we take its final representation $\mathbf{o}_i$, obtained right before the output layer. Then we calculate the cosine similarity $\cos(\mathbf{o}_i, \mathbf{o}_j)$ between each question word $s_i$ and passage word $s_j$. The resultant similarity matrices are visualized in Fig. 3(a) and Fig. 3(b) (heat maps), obtained by KT-NET and BERT, respectively.7

7 During visualization, we use a row-wise softmax operation to normalize similarity scores over all passage tokens.
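The similarity analysis behind the heat maps can be reproduced in a few lines; below is a sketch that computes cosine similarities between final question-word and passage-word representations and applies the row-wise softmax of footnote 7.

```python
import numpy as np

def similarity_heatmap(question_reprs, passage_reprs):
    # question_reprs: (n_q, d), passage_reprs: (n_p, d) final representations o_i
    q = question_reprs / np.linalg.norm(question_reprs, axis=1, keepdims=True)
    p = passage_reprs / np.linalg.norm(passage_reprs, axis=1, keepdims=True)
    cos = q @ p.T                                    # (n_q, n_p) cosine similarities
    exp = np.exp(cos - cos.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)      # row-wise softmax over passage words
```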

For BERT (Fig. 3(b)), given any passage word, all question words tend to have similar similarities to the given word, e.g., all the words in the question have a low degree of similarity to the passage word "US", while a relatively high degree of similarity to "repealed". Such a phenomenon indicates that after fine-tuning on the MRC task, BERT tends to learn similar representations for question words, all of which approximately express the meaning of the whole question and are hard to distinguish.

For KT-NET (Fig. 3(a)), by contrast, different question words can exhibit diverse similarities to a passage word, and these similarities may perfectly reflect their relationships encoded in KBs. For example, we can observe relatively high similarities between: (i) "administration" and "government", which share a same synset; (ii) "ban" and "sanctions", which have a common hypernym; and (iii) "sponsor" and "support", where a synset of the former has the relation "derivationally related form" with the latter, all in WordNet. Such a phenomenon indicates that after integrating knowledge, KT-NET can learn more accurate representations which enable better question-passage matching.

Final Answer Prediction  Fig. 3(a) and Fig. 3(b) (line charts) list the probability of each word to be the start/end of the answer, predicted by KT-NET and BERT, respectively. BERT mistakenly predicts the answer as "UN Security Council", but our method successfully gets the correct answer "US".

We observed similar phenomena on SQuAD1.1 and report the results in Appendix B.

5 Related Work

Machine Reading Comprehension  In the last few years, a number of datasets have been created for MRC, e.g., CNN/DM (Hermann et al., 2015), SQuAD (Rajpurkar et al., 2016, 2018), SearchQA (Dunn et al., 2017), TriviaQA (Joshi et al., 2017), and MS-MARCO (Nguyen et al., 2016). These datasets have led to advances like Match-LSTM (Wang and Jiang, 2017), BiDAF (Seo et al., 2017), AoA Reader (Cui et al., 2017), DCN (Xiong et al., 2017), R-Net (Wang et al., 2017), and QANet (Yu et al., 2018). These end-to-end neural models have similar architectures, starting off with an encoding layer to encode every question/passage word as a vector, passing through various attention-based interaction layers, and finally a prediction layer.

More recently, LMs such as ELMo (Peters et al., 2018b), GPT (Radford et al., 2018), and BERT (Devlin et al., 2018) have been devised. They pre-train deep LMs on large-scale unlabeled corpora to obtain contextual representations of text. When used in downstream tasks including MRC, the pre-trained contextual representations greatly improve performance in either a fine-tuning or a feature-based way. Built upon pre-trained LMs, our work further explores the potential of incorporating structured knowledge from KBs, combining the strengths of both text and knowledge representations.

Incorporating KBs  Several MRC datasets that require external knowledge have been proposed, such as ReCoRD (Zhang et al., 2018), ARC (Clark et al., 2018), MCScript (Ostermann et al., 2018), OpenBookQA (Mihaylov et al., 2018), and CommonsenseQA (Talmor et al., 2018). ReCoRD can be viewed as an extractive MRC dataset, while the latter four are multi-choice MRC datasets, with relatively smaller sizes than ReCoRD. In this paper, we focus on the extractive MRC task. Hence, we choose ReCoRD and SQuAD in the experiments.

Some previous work attempts to leverage structured knowledge from KBs to deal with the tasks of MRC and QA. Weissenborn et al. (2017), Bauer et al. (2018), Mihaylov and Frank (2018), Pan et al. (2019), Chen et al. (2018), and Wang et al. (2018) follow a retrieve-then-encode paradigm, i.e., they first retrieve relevant knowledge from KBs, and only the retrieved knowledge relevant locally to the reading text will be encoded and integrated. By contrast, we leverage pre-trained KB embeddings which encode whole KBs. Then we use attention mechanisms to select and integrate knowledge that is relevant locally to the reading text. Zhong et al. (2018) try to leverage pre-trained KB embeddings to solve the multi-choice MRC task. However, their knowledge and text modules are not integrated, but used independently to predict the answer. And the model cannot be applied to extractive MRC.


6 Conclusion

This paper introduces KT-NET for MRC, which enhances BERT with structured knowledge from KBs and combines the merits of both. We use two KBs: WordNet and NELL. We learn embeddings for the two KBs, select desired embeddings from them, and fuse the selected embeddings with BERT hidden states, so as to enable context- and knowledge-aware predictions. Our model achieves significant improvements over previous methods, becoming the best single model on the ReCoRD and SQuAD1.1 benchmarks. This work demonstrates the feasibility of further enhancing advanced LMs with knowledge from KBs, which indicates a potential direction for future research.

Acknowledgments

This work is supported by the Natural Science Foundation of China (No. 61533018, 61876009 and 61876223) and the Baidu-Peking University Joint Project. We would like to thank the ReCoRD and SQuAD teams for evaluating our results on the anonymous test sets. We are grateful to Tom M. Mitchell and Bishan Yang for sharing with us the valuable KB resources. We would also like to thank the anonymous reviewers for their insightful suggestions.

References

Lisa Bauer, Yicheng Wang, and Mohit Bansal. 2018. Commonsense for generative multi-hop question answering tasks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4220–4230. Association for Computational Linguistics.

Steven Bird and Edward Loper. 2004. NLTK: The natural language toolkit. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, page 31. Association for Computational Linguistics.

Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka, and Tom M. Mitchell. 2010. Toward an architecture for never-ending language learning. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, pages 1306–1313. AAAI Press.

Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2358–2367. Association for Computational Linguistics.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Diana Inkpen, and Si Wei. 2018. Neural natural language inference models enhanced with external knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2406–2417. Association for Computational Linguistics.

Christopher Clark and Matt Gardner. 2018. Simple and effective multi-paragraph reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 845–855. Association for Computational Linguistics.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv e-prints, arXiv:1803.05457.

Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. 2017. Attention-over-attention neural networks for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 593–602. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv e-prints, arXiv:1810.04805.

Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Guney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A new Q&A dataset augmented with context from a search engine. arXiv e-prints, arXiv:1704.05179.

Yoav Goldberg. 2019. Assessing BERT's syntactic abilities. arXiv e-prints, arXiv:1901.05287.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28, pages 1693–1701. Curran Associates, Inc.

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).

Xiaodong Liu, Yelong Shen, Kevin Duh, and Jianfeng Gao. 2018. Stochastic answer networks for machine reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1694–1704. Association for Computational Linguistics.

Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60. Association for Computational Linguistics.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391. Association for Computational Linguistics.

Todor Mihaylov and Anette Frank. 2018. Knowledgeable reader: Enhancing cloze-style reading comprehension with external commonsense knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 821–832. Association for Computational Linguistics.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. In Proceedings of the Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches, pages 96–105.

Simon Ostermann, Michael Roth, Ashutosh Modi, Stefan Thater, and Manfred Pinkal. 2018. SemEval-2018 Task 11: Machine comprehension using commonsense knowledge. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 747–757. Association for Computational Linguistics.

Xiaoman Pan, Kai Sun, Dian Yu, Heng Ji, and Dong Yu. 2019. Improving question answering with external knowledge. arXiv e-prints, arXiv:1902.00993.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509. Association for Computational Linguistics.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789. Association for Computational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392. Association for Computational Linguistics.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In International Conference on Learning Representations (ICLR).

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2018. CommonsenseQA: A question answering challenge targeting commonsense knowledge. arXiv e-prints, arXiv:1811.00937.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Liang Wang, Meng Sun, Wei Zhao, Kewei Shen, and Jingming Liu. 2018. Yuanfudao at SemEval-2018 Task 11: Three-way attention and relational knowledge for commonsense machine comprehension. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 758–762. Association for Computational Linguistics.

Shuohang Wang and Jing Jiang. 2017. Machine comprehension using Match-LSTM and answer pointer. In International Conference on Learning Representations (ICLR).

Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 189–198. Association for Computational Linguistics.

Dirk Weissenborn, Tomas Kocisky, and Chris Dyer. 2017. Dynamic integration of background knowledge in neural NLU systems. arXiv e-prints, arXiv:1706.02596.

Caiming Xiong, Victor Zhong, and Richard Socher. 2017. Dynamic coattention networks for question answering. In International Conference on Learning Representations (ICLR).

Bishan Yang and Tom Mitchell. 2017. Leveraging knowledge bases in LSTMs for improving machine reading. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1436–1446. Association for Computational Linguistics.

Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2015. Embedding entities and relations for learning and inference in knowledge bases. In International Conference on Learning Representations (ICLR).

Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. In International Conference on Learning Representations (ICLR).

Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv e-prints, arXiv:1810.12885.

Wanjun Zhong, Duyu Tang, Nan Duan, Ming Zhou, Jiahai Wang, and Jian Yin. 2018. Improving question answering by commonsense-based pre-training. arXiv e-prints, arXiv:1809.03568.

A Motivating Example from SQuAD1.1

We provide a motivating example from SQuAD1.1 to show the importance and necessity of integrating background knowledge. We restrict ourselves to knowledge from WordNet, which offers the best performance on this dataset according to our experimental results (Table 3).

Fig. 4 presents the example. The passage states that the congress aimed to formalize a unified front in trade and negotiations with various Indians, but the plan was never ratified by the colonial legislatures nor approved of by the crown. And the question asks whether the plan was formalized. BERT fails on this case by spuriously matching the two "formalize" appearing in the passage and question. But after introducing the word knowledge "ratified is a hypernym of formalized" and "approved has a common hypernym with formalized", we can successfully predict that the correct answer is "never ratified by the colonial legislatures nor approved of by the crown".

Passage: [...] The goal of the congress was to formalize a unified front in trade and negotiations with various Indians, since allegiance of the various tribes and nations was seen to be pivotal in the success in the war that was unfolding. The plan that the delegates agreed to was never ratified by the colonial legislatures nor approved of by the crown. [...]
Question: Was the plan formalized?
Original BERT prediction: formalize a unified front in trade and negotiations with various Indians
Prediction with background knowledge: never ratified by the colonial legislatures nor approved of by the crown

Background knowledge:
(ratified, hypernym-of, formalized)
(approved, common-hypernym-with, formalized)

Figure 4: An example from SQuAD1.1. The vanilla BERT model fails to predict the correct answer. But it succeeds after integrating background knowledge collected from WordNet.

B Case Study on SQuAD1.1

We further provide a case study, using the above example, to vividly show the effectiveness of our method KT-NET and to make a direct comparison with BERT. We use the same analytical strategy as described in § 4. For both KT-NET and BERT, we use the optimal configurations that offer their respective best performance on SQuAD1.1 (where the example comes from).

Relevant Knowledge Selection  We first explore how KT-NET can adaptively select the most relevant knowledge w.r.t. the reading text. Fig. 5(a) (left) presents 3 words from the question/passage, each associated with the top 3 most relevant synsets from WordNet.8 Here the relevance of synset $c_j$ to word $s_i$ is measured by the attention weight $\alpha_{ij}$ (Eq. (1)).9 As we can see, these attention distributions are quite meaningful, with "ratified" attending mainly to the WordNet synset "sign_VB_2", "formalized" mainly to the synset "formalize_VB_1", and "approved" mainly to the synsets "approve_VB_2" and "sanction_VB_1".

8 We retrieve a single synset "sign_VB_2" for "ratified".
9 If word $s_i$ consists of multiple subwords, we average the relevance of $c_j$ over these subwords.

(a) KT-NET. Concepts from WordNet: formalized (formalize_VB_1 0.948, validate_VB_1 0.046, formalized_JJ_1 0.006); ratified (sign_VB_2 0.991, sentinel 0.009); approved (approve_VB_2 0.791, sanction_VB_1 0.206, sentinel 0.003).
(b) BERT.

Figure 5: Case study. Heat maps present similarities between question (row) and passage (column) words. Line charts show probabilities of answer boundaries. In KT-NET, the top 3 most relevant KB concepts are further given.

Question/Passage Representations  We further examine how such knowledge will affect the final representations learned for the question/passage. We consider all sentences listed in Fig. 4, and content words (nouns, verbs, adjectives, and adverbs) therein. For each word $s_i$, we take its final representation $\mathbf{o}_i$, obtained right before the output layer. Then we calculate the cosine similarity $\cos(\mathbf{o}_i, \mathbf{o}_j)$ between each question word $s_i$ and passage word $s_j$. The resultant similarity matrices are visualized in Fig. 5(a) and Fig. 5(b) (heat maps), obtained by KT-NET and BERT, respectively.10

For BERT (Fig. 5(b)), we observe very similar patterns as in the ReCoRD example (§ 4). Given any passage word, all question words tend to have similar similarities to the given word, e.g., all the words in the question have a low degree of similarity to the passage word "never", while a relatively high degree of similarity to "various". Such a phenomenon indicates, again, that after fine-tuning on the MRC task, BERT tends to learn similar representations for question words, all of which approximately express the meaning of the whole question and are hard to distinguish.

For KT-NET (Fig. 5(a)), although the similarities between question and passage words are generally higher, these similarities may still perfectly reflect their relationships encoded in KBs. For example, we can observe relatively high similarities between: (i) "formalized" and "ratified", where the latter is a hypernym of the former; and (ii) "formalized" and "approved", which share a common hypernym in WordNet. Such a phenomenon indicates, again, that after integrating knowledge, KT-NET can learn more accurate representations which enable better question-passage matching.

Final Answer Prediction  With the learned representations, predicting final answers is a natural next step. Fig. 5(a) and Fig. 5(b) (line charts) list the probability of each word to be the start/end of the answer, predicted by KT-NET and BERT, respectively. As we can see, BERT mistakenly predicts the answer as "formalize a unified front in trade and negotiations with various Indians", but our method successfully gets the correct answer "never ratified by the colonial legislatures nor approved of by the crown".

The phenomena observed here are quite similar to those observed in the ReCoRD example, both demonstrating the effectiveness of our method and its superiority over BERT.

10 During visualization, we take the averaged cosine similarity if word $s_i$ or word $s_j$ has subwords. And we use a row-wise softmax operation to normalize similarity scores over all passage tokens.