
Improved Neural Relation Detection for Knowledge Base Question Answering

Mo Yu† Wenpeng Yin* Kazi Saidul Hasan‡ Cicero dos Santos† Bing Xiang‡ Bowen Zhou†

†AI Foundations, IBM Research, USA
*Center for Information and Language Processing, LMU Munich
‡IBM Watson, USA

{yum,kshasan,cicerons,bingxia,zhou}@us.ibm.com, [email protected]

Abstract

Relation detection is a core component of many NLP applications including Knowledge Base Question Answering (KBQA). In this paper, we propose a hierarchical recurrent neural network enhanced by residual learning which detects KB relations given an input question. Our method uses deep residual bidirectional LSTMs to compare questions and relation names via different levels of abstraction. Additionally, we propose a simple KBQA system that integrates entity linking and our proposed relation detector to make the two components enhance each other. Our experimental results show that our approach not only achieves outstanding relation detection performance, but more importantly, it helps our KBQA system achieve state-of-the-art accuracy for both single-relation (SimpleQuestions) and multi-relation (WebQSP) QA benchmarks.

1 Introduction

Knowledge Base Question Answering (KBQA) systems answer questions by obtaining information from KB tuples (Berant et al., 2013; Yao et al., 2014; Bordes et al., 2015; Bast and Haussmann, 2015; Yih et al., 2015; Xu et al., 2016). For an input question, these systems typically generate a KB query, which can be executed to retrieve the answers from a KB. Figure 1 illustrates the process used to parse two sample questions in a KBQA system: (a) a single-relation question, which can be answered with a single <head-entity, relation, tail-entity> KB tuple (Fader et al., 2013; Yih et al., 2014; Bordes et al., 2015); and (b) a more complex case, where some constraints need to be handled for multiple entities in the question. The KBQA system in the figure performs two key tasks: (1) entity linking, which links n-grams in questions to KB entities, and (2) relation detection, which identifies the KB relation(s) a question refers to.

The main focus of this work is to improve the relation detection subtask and further explore how it can contribute to the KBQA system. Although general relation detection1 methods are well studied in the NLP community, such studies usually do not take the end task of KBQA into consideration. As a result, there is a significant gap between general relation detection studies and KB-specific relation detection. First, in most general relation detection tasks, the number of target relations is limited, normally smaller than 100. In contrast, in KBQA even a small KB, like Freebase2M (Bordes et al., 2015), contains more than 6,000 relation types. Second, relation detection for KBQA often becomes a zero-shot learning task, since some test instances may have unseen relations in the training data. For example, the SimpleQuestions (Bordes et al., 2015) data set has 14% of the golden test relations not observed in golden training tuples. Third, as shown in Figure 1(b), for some KBQA tasks like WebQuestions (Berant et al., 2013), we need to predict a chain of relations instead of a single relation. This increases the number of target relation types and the sizes of candidate relation pools, further increasing the difficulty of KB relation detection. Owing to these reasons, KB relation detection is significantly more challenging compared to general relation detection tasks.

This paper improves KB relation detection to cope with the problems mentioned above. First, in order to deal with unseen relations, we propose to break the relation names into word sequences for question-relation matching.

1 In the information extraction field, such tasks are usually called relation extraction or relation classification.


[Figure 1 appears here: two KBQA examples, (a) the question "what episode was mike kelley the writer of" resolved via entity linking (TV writer vs. baseball player "Mike Kelley") and relation detection (episodes_written vs. position_played), and (b) the question "what tv show did grant show play on in 2008" resolved via entity linking ("Grant Show"), relation detection (starring_roles-series), and constraint detection (date: from 2008).]

Figure 1: KBQA examples and their three key components. (a) A single-relation example. We first identify the topic entity with entity linking and then detect the relation asked by the question with relation detection (from all relations connecting the topic entity). Based on the detected entity and relation, we form a query to search the KB for the correct answer "Love Will Find a Way". (b) A more complex question containing two entities. By using "Grant Show" as the topic entity, we can detect a chain of relations "starring_roles-series" pointing to the answer. An additional constraint detection step takes the other entity "2008" as a constraint, to filter the correct answer "SwingTown" from all candidates found by the topic entity and relation.

Second, noticing that original relation names can sometimes help to match longer question contexts, we propose to build both relation-level and word-level relation representations. Third, we use deep bidirectional LSTMs (BiLSTMs) to learn different levels of question representations in order to match the different levels of relation information. Finally, we propose a residual learning method for sequence matching, which makes the model training easier and results in more abstract (deeper) question representations, thus improving hierarchical matching.

In order to assess how the proposed improved relation detection could benefit the KBQA end task, we also propose a simple KBQA implementation composed of two-step relation detection. Given an input question and a set of candidate entities retrieved by an entity linker based on the question, our proposed relation detection model plays a key role in the KBQA process: (1) re-ranking the entity candidates according to whether they connect to high-confidence relations detected from the raw question text by the relation detection model. This step is important to deal with the ambiguities normally present in entity linking results. (2) Finding the core relation (chain) for each topic entity2, selected from a much smaller candidate entity set after re-ranking. The above steps are followed by an optional constraint detection step, when the question cannot be answered by single relations (e.g., multiple entities in the question). Finally, the highest scored query from the above steps is used to query the KB for answers.

2 Following Yih et al. (2015), here topic entity refers to the root of the (directed) query tree; and core-chain is the directed path of relations from the root to the answer node.

Our main contributions include: (i) an improved relation detection model based on hierarchical matching between questions and relations with residual learning; (ii) a demonstration that the improved relation detector enables our simple KBQA system to achieve state-of-the-art results on both single-relation and multi-relation KBQA tasks.

2 Related Work

Relation Extraction Relation extraction (RE) is an important sub-field of information extraction. General research in this field usually works on a (small) pre-defined relation set, where, given a text paragraph and two target entities, the goal is to determine whether the text indicates any type of relation between the entities. As a result, RE is usually formulated as a classification task. Traditional RE methods rely on a large number of hand-crafted features (Zhou et al., 2005; Rink and Harabagiu, 2010; Sun et al., 2011). Recent research benefits a lot from the advancement of deep learning: from word embeddings (Nguyen and Grishman, 2014; Gormley et al., 2015) to deep networks like CNNs and LSTMs (Zeng et al., 2014; dos Santos et al., 2015; Vu et al., 2016) and attention models (Zhou et al., 2016; Wang et al., 2016).

The above research assumes there is a fixed (closed) set of relation types, thus no zero-shot learning capability is required. The number of relations is usually not large: the widely used ACE2005 has 11/32 coarse/fine-grained relations; SemEval2010 Task 8 has 19 relations; TAC-KBP2015 has 74 relations, although it considers open-domain Wikipedia relations. All are much fewer than the thousands of relations in KBQA. As a result, little work in this field focuses on dealing with a large number of relations or unseen relations. Yu et al. (2016) proposed to use relation embeddings in a low-rank tensor method. However, their relation embeddings are still trained in a supervised way, and the number of relations in their experiments is not large.

Relation Detection in KBQA Systems Relation detection for KBQA has also evolved from feature-rich approaches (Yao and Van Durme, 2014; Bast and Haussmann, 2015) to deep networks (Yih et al., 2015; Xu et al., 2016; Dai et al., 2016) and attention models (Yin et al., 2016; Golub and He, 2016). Much of the above relation detection research could naturally support large relation vocabularies and open relation sets (especially for QA with OpenIE KBs like ParaLex (Fader et al., 2013)), in order to fit the goal of open-domain question answering.

Different KBQA data sets require this open-domain capacity to different degrees. For example, most of the gold test relations in WebQuestions can be observed during training, thus some prior work on this task adopted a closed-domain assumption as in general RE research. For data sets like SimpleQuestions and ParaLex, however, the capacity to support large relation sets and unseen relations becomes more necessary. To this end, there are two main solutions: (1) use pre-trained relation embeddings (e.g., from TransE (Bordes et al., 2013)), as in (Dai et al., 2016); (2) factorize the relation names into sequences and formulate relation detection as a sequence matching and ranking task. Such factorization works because relation names usually comprise meaningful word sequences. For example, Yin et al. (2016) split relations into word sequences for single-relation detection. Liang et al. (2016) also achieve good performance on WebQSP with a word-level relation representation in an end-to-end neural programmer model. Yih et al. (2015) use character tri-grams as inputs on both the question and relation sides. Golub and He (2016) propose a generative framework for single-relation KBQA which predicts the relation with a character-level sequence-to-sequence model.

Another difference between relation detection in KBQA and general RE is that general RE research assumes that the two argument entities are both available. Thus it usually benefits from features (Nguyen and Grishman, 2014; Gormley et al., 2015) or attention mechanisms (Wang et al., 2016) based on the entity information (e.g., entity types or entity embeddings). For relation detection in KBQA, such information is mostly missing because: (1) one question usually contains a single argument (the topic entity), and (2) one KB entity can have multiple types (with a type vocabulary size larger than 1,500). This makes KB entity typing itself a difficult problem, so no previous work used entity information in the relation detection model.3

3 Background: Different Granularity in KB Relations

Previous research (Yih et al., 2015; Yin et al., 2016) formulates KB relation detection as a sequence matching problem. However, while the questions are natural word sequences, how to represent relations as sequences remains a challenging problem. Here we give an overview of two types of relation sequence representations commonly used in previous work.

(1) Relation Name as a Single Token (relation-level). In this case, each relation name is treated as a unique token. The problem with this approach is that it suffers from low relation coverage due to the limited amount of training data, and thus cannot generalize well to large numbers of open-domain relations. For example, in Figure 1, when treating relation names as single tokens, it will be difficult to match the questions to the relation names "episodes_written" and "starring_roles" if these names do not appear in the training data: their relation embeddings h_r will be random vectors and thus not comparable to the question embeddings h_q.

(2) Relation as Word Sequence (word-level). In this case, the relation is treated as a sequence of words from the tokenized relation name. It has better generalization, but suffers from a lack of global information from the original relation names. For example, in Figure 1(b), when doing only word-level matching, it is difficult to rank the target relation "starring_roles" higher than the incorrect relation "plays_produced", because the incorrect relation contains the word "plays", which is more similar to the question (which contains the word "play") in the embedding space.

3 Such entity information has been used in KBQA systems as features for the final answer re-rankers.


Relation Token | Question 1: what tv episodes were <e> the writer of | Question 2: what episode was written by <e>
relation-level: episodes_written | tv episodes were <e> the writer of | episode was written by <e>
word-level: episodes | tv episodes | episode
word-level: written | the writer of | written

Table 1: An example of a KB relation (episodes_written) with two types of relation tokens (relation names and words), and two questions asking about this relation. The topic entity is replaced with the token <e>, which gives position information to the deep networks. Each row shows the evidence phrase in each question for the corresponding relation token.

On the other hand, if the target relation co-occurs with questions related to "tv appearance" during training, by treating the whole relation as a token (i.e., a relation id), we can better learn the correspondence between this token and phrases like "tv show" and "play on".

The two types of relation representation contain different levels of abstraction. As shown in Table 1, the word-level focuses more on local information (words and short phrases), while the relation-level focuses more on global information (long phrases and skip-grams) but suffers from data sparsity. Since both levels of granularity have their own pros and cons, we propose a hierarchical matching approach for KB relation detection: for a candidate relation, our approach matches the input question to both word-level and relation-level representations to get the final ranking score. Section 4 gives the details of our proposed approach.
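For concreteness, the following is a minimal plain-Python sketch of the two token views; the underscore-based tokenization of Freebase relation names and the function name are our assumptions, not part of the paper.

```python
def relation_tokens(relation_chain):
    """Decompose a KB relation (or relation chain) into the two token
    views of Section 3: word-level tokens and relation-level tokens."""
    word_tokens = []
    for name in relation_chain:
        # word-level view: split each relation name into its words
        word_tokens.extend(name.split("_"))
    # relation-level view: each whole relation name is a single token
    rel_tokens = list(relation_chain)
    return word_tokens, rel_tokens

# The chain from Figure 1(b):
words, rels = relation_tokens(["starring_roles", "series"])
# words == ["starring", "roles", "series"]
# rels  == ["starring_roles", "series"]
```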

4 Improved KB Relation Detection

This section describes our hierarchical sequence matching with residual learning approach for relation detection. In order to match the question to different aspects of a relation (at different abstraction levels), we address the following three problems in learning question/relation representations.

4.1 Relation Representations from Different Granularity

We provide our model with both types of relation representation: word-level and relation-level. Therefore, the input relation becomes r = {r^word_1, ..., r^word_M1} ∪ {r^rel_1, ..., r^rel_M2}, where the first M1 tokens are words (e.g., {episode, written}), and the last M2 tokens are relation names, e.g., {episode_written} or {starring_roles, series} (when the target is a chain like in Figure 1(b)). We transform each token above to its word embedding, then use two BiLSTMs (with shared parameters) to get their hidden representations [B^word_{1:M1} : B^rel_{1:M2}] (each row vector β_i is the concatenation of the forward/backward representations at position i). We initialize the relation sequence LSTMs with the final state representations of the word sequence, as a back-off for unseen relations. We apply one max-pooling on these two sets of vectors and get the final relation representation h_r.
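To make the encoder concrete, here is a minimal PyTorch sketch of our reading of this relation encoder; the framework choice, class name, and dimensions are assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class RelationEncoder(nn.Module):
    """Encode word-level and relation-level token sequences with a
    shared BiLSTM and max-pool them into a single vector h_r."""
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=200):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # one BiLSTM, shared between the two token views
        self.bilstm = nn.LSTM(emb_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, word_ids, rel_ids):
        # word_ids: (batch, M1); rel_ids: (batch, M2)
        B_word, (h_n, c_n) = self.bilstm(self.emb(word_ids))
        # initialize the relation-name LSTM with the word sequence's
        # final states, as a back-off for unseen relation names
        B_rel, _ = self.bilstm(self.emb(rel_ids), (h_n, c_n))
        # max-pool over both sets of hidden vectors to obtain h_r
        h_r = torch.cat([B_word, B_rel], dim=1).max(dim=1).values
        return h_r  # (batch, 2 * hidden_dim)
```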

4.2 Different Abstractions of Question Representations

From Table 1, we can see that different parts of a relation can match different contexts of the question text. Usually relation names match longer phrases in the question, while relation words match shorter phrases. Yet different words might match phrases of different lengths.

As a result, we want the question representations to also comprise vectors that summarize phrase information of various lengths (different levels of abstraction), in order to match relation representations of different granularity. We deal with this problem by applying deep BiLSTMs on questions. The first-layer BiLSTM works on the word embeddings of the question words q = {q_1, ..., q_N} and gets hidden representations Γ^(1)_{1:N} = [γ^(1)_1; ...; γ^(1)_N]. The second-layer BiLSTM works on Γ^(1)_{1:N} to get the second set of hidden representations Γ^(2)_{1:N}. Since the second BiLSTM starts with the hidden vectors from the first layer, intuitively it can learn more general and abstract information compared to the first layer.
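Under the same assumptions as the relation encoder sketch above, the question encoder could look as follows; note that, unlike a standard stacked LSTM, the hidden states of both layers are returned.

```python
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Two stacked BiLSTMs over question word embeddings, keeping the
    hidden states of BOTH layers (Gamma^(1) and Gamma^(2))."""
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=200):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.layer1 = nn.LSTM(emb_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.layer2 = nn.LSTM(2 * hidden_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, q_ids):
        gamma1, _ = self.layer1(self.emb(q_ids))  # (batch, N, 2*hidden)
        gamma2, _ = self.layer2(gamma1)           # (batch, N, 2*hidden)
        return gamma1, gamma2
```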

Note that the first(second)-layer of question representations does not necessarily correspond to the word(relation)-level relation representations; instead, either layer of question representations could potentially match either level of relation representations. This raises the difficulty of matching between different levels of relation/question representations; the following section gives our proposal to deal with this problem.


[Figure 2 appears here: the question side (two BiLSTM layers with shortcut connections and point-wise summation over γ^(1)_i and γ^(2)_i, followed by max-pooling into the question representation) and the relation side (relation-level and word-level inputs to a shared BiLSTM, followed by max-pooling into the relation representation), with the two representations compared by cosine similarity.]

Figure 2: The proposed Hierarchical Residual BiLSTM (HR-BiLSTM) model for relation detection. Note that without the dotted arrows of shortcut connections between the two layers, the model would only compute the similarity between the second layer of question representations and the relation, and thus would not be doing hierarchical matching.

4.3 Hierarchical Matching between Relation and Question

Now we have question contexts of different lengths encoded in Γ^(1)_{1:N} and Γ^(2)_{1:N}. Unlike the standard usage of deep BiLSTMs, which employs the representations in the final layer for prediction, here we expect the two layers of question representations to be complementary to each other, and both should be compared to the relation representation space (Hierarchical Matching). This is important for our task since each relation token can correspond to phrases of different lengths, mainly because of syntactic variations. For example, in Table 1, the relation word written could be matched to either the same single word in the question or a much longer phrase, be the writer of.

We could perform the above hierarchical matching by computing the similarity between each layer of Γ and h_r separately and taking the (weighted) sum of the two scores. However, this does not give a significant improvement (see Table 2). Our analysis in Section 6.2 shows that this naive method suffers from training difficulty, evidenced by the fact that the converged training loss of this model is much higher than that of a single-layer baseline model. This is mainly because (1) deep BiLSTMs do not guarantee that the two levels of question hidden representations are comparable, so training usually falls into local optima where one layer has good matching scores and the other always has a weight close to 0; and (2) the training of deeper architectures is itself more difficult.

To overcome the above difficulties, we adopt the idea from Residual Networks (He et al., 2016) for hierarchical matching by adding shortcut connections between the two BiLSTM layers. We propose two ways of such Hierarchical Residual Matching: (1) Connecting each γ^(1)_i and γ^(2)_i, resulting in γ'_i = γ^(1)_i + γ^(2)_i for each position i. The final question representation h_q then becomes a max-pooling over all γ'_i, 1 ≤ i ≤ N. (2) Applying max-pooling on Γ^(1)_{1:N} and Γ^(2)_{1:N} to get h^(1)_max and h^(2)_max, respectively, and then setting h_q = h^(1)_max + h^(2)_max. Finally, we compute the matching score of r given q as s_rel(r; q) = cos(h_r, h_q).

Intuitively, the proposed method should benefit from hierarchical training, since the second layer fits the residues from the first layer of matching, so the two layers of representations are more likely to be complementary to each other. This also ensures that the vector spaces of the two layers are comparable, and makes the second-layer training easier.
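A sketch of the two shortcut variants and the cosine scoring, continuing the PyTorch sketches above; the shapes (batch, N, d) and the variant names are our assumptions.

```python
import torch.nn.functional as F

def question_repr(gamma1, gamma2, variant="hidden"):
    """Combine two BiLSTM layers of question states into h_q.

    gamma1, gamma2: (batch, N, d) hidden states of layers 1 and 2.
    "hidden": shortcut on hidden states (gamma' = gamma1 + gamma2),
              then max-pool over positions;
    "pooled": max-pool each layer, then sum the pooled vectors.
    """
    if variant == "hidden":
        return (gamma1 + gamma2).max(dim=1).values
    return gamma1.max(dim=1).values + gamma2.max(dim=1).values

def s_rel(h_r, h_q):
    # matching score: cosine similarity between relation and question
    return F.cosine_similarity(h_r, h_q, dim=-1)
```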

During training we adopt a ranking loss to maximize the margin between the gold relation r+ and other relations r− in the candidate pool R:

l_rel = max{0, γ − s_rel(r+; q) + s_rel(r−; q)}

where γ is a constant margin parameter. Figure 2 summarizes the above Hierarchical Residual BiLSTM (HR-BiLSTM) model.
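The hinge loss translates directly into code; in this sketch the margin value is a placeholder, since the paper does not report the tuned value here.

```python
import torch

def ranking_loss(s_pos, s_neg, margin=0.1):
    """Hinge ranking loss: max{0, gamma - s_rel(r+; q) + s_rel(r-; q)}.
    s_pos: scores of gold relations; s_neg: scores of sampled negatives.
    margin: the constant gamma (0.1 is a placeholder, not the paper's value).
    """
    return torch.clamp(margin - s_pos + s_neg, min=0).mean()
```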


Remark: Another way of performing hierarchical matching is to rely on an attention mechanism, e.g., (Parikh et al., 2016), to find the correspondence between different levels of representations. This performs below the HR-BiLSTM (see Table 2).

5 KBQA Enhanced by Relation Detection

This section describes our KBQA pipeline system. We make minimal efforts beyond the training of the relation detection model, making the whole system easy to build.

Following previous work (Yih et al., 2015; Xu et al., 2016), our KBQA system takes an existing entity linker to produce the top-K linked entities, EL_K(q), for a question q ("initial entity linking"). Then we generate the KB queries for q following the four steps illustrated in Algorithm 1.

Algorithm 1: KBQA with two-step relation detection
Input: Question q, knowledge base KB, the initial top-K entity candidates EL_K(q)
Output: Top query tuple (e, r, {(c, r_c)})
1. Entity Re-Ranking (first-step relation detection): Use the raw question text as input for a relation detector to score all relations in the KB that are associated with the entities in EL_K(q); use the relation scores to re-rank EL_K(q) and generate a shorter list EL'_{K'}(q) containing the top-K' entity candidates (Section 5.1).
2. Relation Detection: Detect relation(s) using the reformatted question text, in which the topic entity is replaced by a special token <e> (Section 5.2).
3. Query Generation: Combine the scores from steps 1 and 2, and select the top pair (e, r) (Section 5.3).
4. Constraint Detection (optional): Compute the similarity between q and any neighbor entity c of the entities along r (connected by a relation r_c), and add the high-scoring c and r_c to the query (Section 5.4).

Compared to previous approaches, the main difference is that we have an additional entity re-ranking step after the initial entity linking. We include this step because we have observed that entity linking sometimes becomes a bottleneck in KBQA systems. For example, on SimpleQuestions the best reported linker only gets 72.7% top-1 accuracy on identifying topic entities. This is usually due to the ambiguity of entity names; e.g., in Figure 1(a), there are a TV writer and a baseball player named "Mike Kelley", who are impossible to distinguish with entity name matching alone.

Having observed that different entity candidates usually connect to different relations, we propose to help entity disambiguation in the initial entity linking with relations detected in questions.

Sections 5.1 and 5.2 elaborate how our relation detection helps to re-rank entities in the initial entity linking, and how those re-ranked entities then enable more accurate relation detection. The KBQA end task, as a result, benefits from this process.

5.1 Entity Re-Ranking

In this step, we use the raw question text as input for a relation detector to score all relations in the KB with connections to at least one of the entity candidates in EL_K(q). We call this step relation detection on entity set, since it does not work on a single topic entity as in the usual setting. We use the HR-BiLSTM as described in Section 4. For each question q, after generating a score s_rel(r; q) for each relation using HR-BiLSTM, we use the top l best-scoring relations (R^l_q) to re-rank the original entity candidates. Concretely, for each entity e and its associated relations R_e, given the original entity linker score s_linker and the score of the most confident relation r ∈ R^l_q ∩ R_e, we sum these two scores to re-rank the entities:

s_rerank(e; q) = α · s_linker(e; q) + (1 − α) · max_{r ∈ R^l_q ∩ R_e} s_rel(r; q).

Finally, we select the top K' < K entities according to the score s_rerank to form the re-ranked list EL'_{K'}(q).
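A plain-Python sketch of this re-ranking score; the fall-back to 0 when no detected relation attaches to the entity, and the value of α, are our assumptions (the paper treats α as tunable).

```python
def s_rerank(linker_score, entity_rels, rel_scores, top_l_rels, alpha=0.6):
    """Re-rank score for one entity candidate e:
    alpha * s_linker(e; q) + (1 - alpha) * max over r in R_q^l ∩ R_e.

    entity_rels:  set of KB relations R_e attached to e.
    rel_scores:   dict mapping relation -> s_rel(r; q) from HR-BiLSTM.
    top_l_rels:   the l best-scoring relations R_q^l for the question.
    alpha:        interpolation weight; 0.6 is a placeholder value.
    """
    shared = entity_rels & set(top_l_rels)
    # assumption: contribute 0 when no detected relation attaches to e
    best_rel = max((rel_scores[r] for r in shared), default=0.0)
    return alpha * linker_score + (1 - alpha) * best_rel
```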

We use the same example from Figure 1(a) to illustrate the idea. Given the input question in the example, a relation detector is very likely to assign high scores to relations such as "episodes_written", "author_of" and "profession". Then, according to the connections of the entity candidates in the KB, we find that the TV writer "Mike Kelley" will be scored higher than the baseball player "Mike Kelley", because the former has the relations "episodes_written" and "profession". This method can be viewed as exploiting entity-relation collocation for entity linking.

5.2 Relation Detection

In this step, for each candidate entity e ∈ EL'_{K'}(q), we use the question text as the input to a relation detector to score all the relations r ∈ R_e that are associated with the entity e in the KB.4 Because we have a single topic entity as input in this step, we reformat the question as follows: we replace the candidate e's entity mention in q with a token "<e>". This helps the model better distinguish the relative position of each word compared to the entity. We use the HR-BiLSTM model to predict the score of each relation r ∈ R_e: s_rel(r; e, q).

4 Note that the number of entities and the number of relation candidates will be much smaller than those in the previous step.

5.3 Query Generation

Finally, the system outputs the <entity, relation (or core-chain)> pair (e, r) according to:

s(e, r; q) = max_{e ∈ EL'_{K'}(q), r ∈ R_e} (β · s_rerank(e; q) + (1 − β) · s_rel(r; e, q)),

where β is a hyperparameter to be tuned.
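As a sketch, this selection step amounts to a single argmax over candidate pairs; the dictionary-based inputs and the β value below are our assumptions.

```python
def best_query(entities, entity_rels, rerank_scores, rel_scores, beta=0.5):
    """Select the top <entity, relation (or core-chain)> pair by
    s(e, r; q) = beta * s_rerank(e; q) + (1 - beta) * s_rel(r; e, q).

    entities:      re-ranked candidate list EL'_{K'}(q).
    entity_rels:   dict e -> candidate relations R_e.
    rerank_scores: dict e -> s_rerank(e; q).
    rel_scores:    dict (e, r) -> s_rel(r; e, q).
    beta:          tuned on the dev set; 0.5 is a placeholder.
    """
    return max(((e, r) for e in entities for r in entity_rels[e]),
               key=lambda p: beta * rerank_scores[p[0]]
                             + (1 - beta) * rel_scores[p])
```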

5.4 Constraint Detection

Similar to (Yih et al., 2015), we adopt an additional constraint detection step based on text matching. Our method can be viewed as entity linking on a KB sub-graph. It contains two steps: (1) Sub-graph generation: given the top-scored query generated by the previous three steps5, for each node v (the answer node or a CVT node like in Figure 1(b)), we collect all the nodes c connecting to v (with relation r_c) via any relation, and generate a sub-graph associated with the original query. (2) Entity linking on sub-graph nodes: we compute a matching score between each n-gram in the input question (without overlapping the topic entity) and the entity name of c (except for nodes in the original query) by taking into account the maximum overlapping sequence of characters between them (see Appendix A for details and Appendix B for special rules dealing with date/answer type constraints). If the matching score is larger than a threshold θ (tuned on the training set), we add the constraint entity c (and r_c) to the query by attaching it to the corresponding node v on the core-chain.

6 Experiments

6.1 Task Introduction & Settings

We use the SimpleQuestions (Bordes et al., 2015) and WebQSP (Yih et al., 2016) datasets. Each question in these datasets is labeled with its gold semantic parse. Hence we can directly evaluate relation detection performance independently, as well as evaluate the KBQA end task.

5 Starting with the top-1 query suffers more from error propagation. However, we still achieve state-of-the-art results on WebQSP in Section 6, showing the advantage of our relation detection model. We leave beam search, and feature extraction on the beam for final answer re-ranking as in previous research, to future work.

SimpleQuestions (SQ): a single-relation KBQA task. The KB we use consists of a Freebase subset with 2M entities (FB2M) (Bordes et al., 2015), in order to compare with previous research. Yin et al. (2016) also evaluated their relation extractor on this data set and released their proposed question-relation pairs, so we run our relation detection model on their data set. For the KBQA evaluation, we also start from their entity linking results6. Therefore, our results can be compared with their reported results on both tasks.

WebQSP (WQ): a multi-relation KBQA task. We use the entire Freebase KB for evaluation purposes. Following Yih et al. (2016), we use S-MART (Yang and Chang, 2015) entity-linking outputs.7 In order to evaluate the relation detection models, we create a new relation detection task from the WebQSP data set.8 For each question and its labeled semantic parse: (1) we first select the topic entity from the parse; and then (2) select all the relations and relation chains (length ≤ 2) connected to the topic entity, and set the core-chain labeled in the parse as the positive label and all the others as negative examples.

We tune the following hyper-parameters on the development sets: (1) the size of the hidden states for LSTMs ({50, 100, 200, 400})9; (2) the learning rate ({0.1, 0.5, 1.0, 2.0}); (3) whether the shortcut connections are between hidden states or between max-pooling results (see Section 4.3); and (4) the number of training epochs.

For both the relation detection experiments and the second-step relation detection in KBQA, we apply entity replacement first (see Section 5.2 and Figure 1). All word vectors are initialized with 300-d pretrained word embeddings (Mikolov et al., 2013). The embeddings of relation names are randomly initialized, since existing pre-trained relation embeddings (e.g., TransE) usually support limited sets of relation names. We leave the usage of pre-trained relation embeddings to future work.

6.2 Relation Detection Results

Table 2 shows the results on the two relation detection tasks. The AMPCNN result is from (Yin et al., 2016), which yielded state-of-the-art scores by outperforming several attention-based methods.

6 The two resources have been downloaded from https://github.com/Gorov/SimpleQuestions-EntityLinking
7 https://github.com/scottyih/STAGG
8 The dataset is available at https://github.com/Gorov/KBQA_RE_data.
9 For CNNs we double the size for a fair comparison.


Model | Relation Input Views | SimpleQuestions | WebQSP
AMPCNN (Yin et al., 2016) | words | 91.3 | -
BiCNN (Yih et al., 2015) | char-3-gram | 90.0 | 77.74
BiLSTM w/ words | words | 91.2 | 79.32
BiLSTM w/ relation names | rel names | 88.9 | 78.96
Hier-Res-BiLSTM (HR-BiLSTM) | words + rel names | 93.3 | 82.53
  w/o rel name | words | 91.3 | 81.69
  w/o rel words | rel names | 88.8 | 79.68
  w/o residual learning (weighted sum on two layers) | words + rel names | 92.5 | 80.65
  replacing residual with attention (Parikh et al., 2016) | words + rel names | 92.6 | 81.38
  single-layer BiLSTM question encoder | words + rel names | 92.8 | 78.41
  replacing BiLSTM with CNN (HR-CNN) | words + rel names | 92.9 | 79.08

Table 2: Accuracy on the SimpleQuestions and WebQSP relation detection tasks (test sets). The top shows the performance of baselines. At the bottom we give the results of our proposed model together with the ablation tests.

We re-implemented the BiCNN model from (Yih et al., 2015), where both questions and relations are represented with the word hash trick on character tri-grams. The baseline BiLSTM with relation word sequences appears to be the best baseline on WebQSP and is close to the previous best result of AMPCNN on SimpleQuestions. Our proposed HR-BiLSTM outperformed the best baselines on both tasks by margins of 2-3% (p < 0.001 and p < 0.01 compared to the best baseline BiLSTM w/ words on SQ and WQ, respectively).

Note that using only relation names instead of words results in a weaker BiLSTM baseline: the model yields a significant performance drop on SimpleQuestions (91.2% to 88.9%). However, the drop is much smaller on WebQSP, which suggests that unseen relations have a much bigger impact on SimpleQuestions.

Ablation Test: The bottom of Table 2 shows the ablation results for the proposed HR-BiLSTM. First, hierarchical matching between questions and both relation names and relation words yields improvements on both datasets, especially on SimpleQuestions (93.3% vs. 91.2%/88.8%). Second, residual learning helps hierarchical matching compared to the weighted-sum and attention-based baselines (see Section 4.3). For the attention-based baseline, we tried the model from (Parikh et al., 2016) and its one-way variations, where the one-way model gives better results10. Note that residual learning significantly helps on WebQSP (80.65% to 82.53%), while it does not help as much on SimpleQuestions. On SimpleQuestions, even removing the deep layers only causes a small drop in performance. WebQSP benefits more from the residual and deeper architecture, possibly because in this dataset it is more important to handle a larger scope of context matching.

10 We also tried to apply the same attention method on a deep BiLSTM with residual connections, but it did not lead to better results compared to HR-BiLSTM. We hypothesize that the idea of hierarchical matching with an attention mechanism may work better for long sequences, and that newer advanced attention mechanisms (Wang and Jiang, 2016; Wang et al., 2017) might help hierarchical matching. We leave these directions to future work.

Finally, on WebQSP, replacing the BiLSTM with a CNN in our hierarchical matching framework results in a large performance drop, while on SimpleQuestions the gap is much smaller. We believe this is because the LSTM relation encoder can better learn the composition of chains of relations in WebQSP, as it is better at dealing with longer dependencies.

Analysis: Next, we present empirical evidence showing why our HR-BiLSTM model achieves the best scores. We use WebQSP for the analysis. First, we hypothesize that training of the weighted-sum model usually falls into local optima, since deep BiLSTMs do not guarantee that the two levels of question hidden representations are comparable. This is evidenced by the fact that during training one layer usually gets a weight close to 0 and is thus ignored. For example, one run gives us weights of -75.39/0.14 for the two layers (we take the exponential for the final weighted sum). It also gives a much lower training accuracy (91.94%) compared to HR-BiLSTM (95.67%), suffering from training difficulty.

Second, compared to our deep BiLSTM with shortcut connections, we hypothesize that for KB relation detection, training deep BiLSTMs is more difficult without shortcut connections. Our experiments suggest that a deeper BiLSTM does not always fit the training data better: a two-layer BiLSTM converges to 94.99% training accuracy, even lower than the 95.25% achieved by a single-layer BiLSTM. Since under our setting the two-layer model captures the single-layer model as a special case (so it could potentially fit the training data better), this result suggests that the deep BiLSTM without shortcut connections suffers more from training difficulty.

Finally, we hypothesize that HR-BiLSTM is more than a combination of two BiLSTMs with residual connections, because it encourages the hierarchical architecture to learn different levels of abstraction. To verify this, we replace the deep BiLSTM question encoder with two single-layer BiLSTMs (both on words) with shortcut connections between their hidden states. This decreases test accuracy to 76.11%. It gives similar training accuracy compared to HR-BiLSTM, indicating a more serious over-fitting problem. This shows that the residual and deep structures both contribute to the good performance of HR-BiLSTM.

6.3 KBQA End-Task Results

Table 3 compares our system with two published baselines: (1) STAGG (Yih et al., 2015), the state of the art on WebQSP11, and (2) AMPCNN (Yin et al., 2016), the state of the art on SimpleQuestions. Since these two baselines are specially designed/tuned for one particular dataset, they do not generalize well when applied to the other dataset. In order to highlight the effect of different relation detection models on the KBQA end task, we also implemented another baseline that uses our KBQA system but replaces HR-BiLSTM with our implementation of AMPCNN (for SimpleQuestions) or the char-3-gram BiCNN (for WebQSP) relation detector (second block in Table 3).

Compared to the baseline relation detector (third row of results), our method, which includes an improved relation detector (HR-BiLSTM), improves the KBQA end task by 2-3% (fourth row). Note that, in contrast to previous KBQA systems, our system does not use joint inference or a feature-based re-ranking step; nevertheless, it still achieves better or comparable results to the state of the art.

The third block of the table details two ablation tests for the proposed components in our KBQA system: (1) Removing the entity re-ranking step significantly decreases the scores. Since the re-ranking step relies on the relation detection models, this shows that our HR-BiLSTM model contributes to the good performance in multiple ways.

11 The STAGG score on SQ is from (Bao et al., 2016).

System | SQ | WQ
STAGG | 72.8 | 63.9
AMPCNN (Yin et al., 2016) | 76.4 | -
Baseline: Our Method w/ baseline relation detector | 75.1 | 60.0
Our Method | 77.0 | 63.0
  w/o entity re-ranking | 74.9 | 60.6
  w/o constraints | - | 58.0
Our Method (multi-detectors) | 78.7 | 63.9

Table 3: KBQA accuracy on the SimpleQuestions (SQ) and WebQSP (WQ) test sets. The numbers shown in green in the original paper are directly comparable to our results, since we start with the same entity linking results.

Appendix C gives the detailed performance of the re-ranking step. (2) In contrast to the conclusion in (Yih et al., 2015), constraint detection is crucial for our system12. This is probably because our joint performance on topic entity and core-chain detection is more accurate (77.5% top-1 accuracy), leaving a large margin (77.5% vs. 58.0%) for the constraint detection module to improve upon.

Finally, like STAGG, which uses multiple relation detectors (see Yih et al. (2015) for the three models used), we also try using the top-3 relation detectors from Section 6.2. As shown in the last row of Table 3, this gives a significant performance boost, resulting in a new state-of-the-art result on SimpleQuestions and a result comparable to the state of the art on WebQSP.

7 Conclusion

KB relation detection is a key step in KBQA and is significantly different from general relation extraction tasks. We propose a novel KB relation detection model, HR-BiLSTM, that performs hierarchical matching between questions and KB relations. Our model outperforms previous methods on KB relation detection tasks and allows our KBQA system to achieve state-of-the-art results. For future work, we will investigate the integration of our HR-BiLSTM into end-to-end systems. For example, our model could be integrated into the decoder in (Liang et al., 2016) to provide better sequence prediction. We will also investigate newly emerging datasets like GraphQuestions (Su et al., 2016) and ComplexQuestions (Bao et al., 2016) to handle more characteristics of general QA.

12 Note that another reason is that we are evaluating accuracy here. When evaluating F1, the gap will be smaller.


References

Junwei Bao, Nan Duan, Zhao Yan, Ming Zhou, and Tiejun Zhao. 2016. Constraint-based question answering with knowledge graph. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. The COLING 2016 Organizing Committee, Osaka, Japan, pages 2503–2514.

Hannah Bast and Elmar Haussmann. 2015. More accurate question answering on Freebase. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management. ACM, pages 1431–1440.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Seattle, Washington, USA, pages 1533–1544.

Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795.

Zihang Dai, Lei Li, and Wei Xu. 2016. CFO: Conditional focused neural question answering with large-scale knowledge bases. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 800–810.

Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2015. Classifying relations by ranking with convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Beijing, China, pages 626–634.

Anthony Fader, Luke S. Zettlemoyer, and Oren Etzioni. 2013. Paraphrase-driven learning for open question answering. In ACL (1). Citeseer, pages 1608–1618.

David Golub and Xiaodong He. 2016. Character-level question answering with attention. arXiv preprint arXiv:1604.00727.

Matthew R. Gormley, Mo Yu, and Mark Dredze. 2015. Improved relation extraction with feature-rich compositional embedding models. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, pages 1774–1784.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, and Ni Lao. 2016. Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. arXiv preprint arXiv:1611.00020.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Thien Huu Nguyen and Ralph Grishman. 2014. Employing word representations and regularization for domain adaptation of relation extraction. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Baltimore, Maryland, pages 68–74.

Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, pages 2249–2255.

Bryan Rink and Sanda Harabagiu. 2010. UTD: Classifying semantic relations by combining lexical and semantic resources. In Proceedings of the 5th International Workshop on Semantic Evaluation. Association for Computational Linguistics, Uppsala, Sweden, pages 256–259.

Yu Su, Huan Sun, Brian Sadler, Mudhakar Srivatsa, Izzeddin Gur, Zenghui Yan, and Xifeng Yan. 2016. On generating characteristic-rich question sets for QA evaluation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, pages 562–572. https://aclweb.org/anthology/D16-1054.

Ang Sun, Ralph Grishman, and Satoshi Sekine. 2011. Semi-supervised relation extraction with large-scale word clustering. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Portland, Oregon, USA, pages 521–529.

Ngoc Thang Vu, Heike Adel, Pankaj Gupta, and Hinrich Schütze. 2016. Combining recurrent and convolutional neural networks for relation classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, San Diego, California, pages 534–539.

Linlin Wang, Zhu Cao, Gerard de Melo, and Zhiyuan Liu. 2016. Relation classification via multi-level attention CNNs. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 1298–1307.

Shuohang Wang and Jing Jiang. 2016. Learning natural language inference with LSTM. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, San Diego, California, pages 1442–1451. http://www.aclweb.org/anthology/N16-1170.

Zhiguo Wang, Wael Hamza, and Radu Florian. 2017. Bilateral multi-perspective matching for natural language sentences. arXiv preprint arXiv:1702.03814.

Kun Xu, Siva Reddy, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2016. Question answering on Freebase via relation extraction and textual evidence. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 2326–2336.

Yi Yang and Ming-Wei Chang. 2015. S-MART: Novel tree-based structured learning algorithms applied to tweet entity linking. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Beijing, China, pages 504–513.

Xuchen Yao, Jonathan Berant, and Benjamin Van Durme. 2014. Freebase QA: Information extraction or semantic parsing? ACL 2014, page 82.

Xuchen Yao and Benjamin Van Durme. 2014. Information extraction over structured data: Question answering with Freebase. In ACL (1). Citeseer, pages 956–966.

Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Association for Computational Linguistics (ACL).

Wen-tau Yih, Xiaodong He, and Christopher Meek. 2014. Semantic parsing for single-relation question answering. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Baltimore, Maryland, pages 643–648.

Wen-tau Yih, Matthew Richardson, Chris Meek, Ming-Wei Chang, and Jina Suh. 2016. The value of semantic parse labeling for knowledge base question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Berlin, Germany, pages 201–206.

Wenpeng Yin, Mo Yu, Bing Xiang, Bowen Zhou, and Hinrich Schütze. 2016. Simple question answering by attentive convolutional neural network. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. The COLING 2016 Organizing Committee, Osaka, Japan, pages 1746–1756.

Mo Yu, Mark Dredze, Raman Arora, and Matthew R. Gormley. 2016. Embedding lexical features via low-rank tensors. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, San Diego, California, pages 1019–1029. http://www.aclweb.org/anthology/N16-1117.

Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation classification via convolutional deep neural network. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. Dublin City University and Association for Computational Linguistics, Dublin, Ireland, pages 2335–2344.

GuoDong Zhou, Jian Su, Jie Zhang, and Min Zhang. 2005. Exploring various knowledge in relation extraction. In Association for Computational Linguistics, pages 427–434.

Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Berlin, Germany, pages 207–212.


Appendix A: Detailed Score Computation for Constraint Detection

Given an input question q and an entity name e in the KB, we denote the lengths of the question and the entity name as |q| and |e|. For a mention m of the entity e, which is an n-gram in q, we compute the longest consecutive common sub-sequence between m and e, and denote its length as |m ∩ e|. All the lengths above are measured in characters.

Based on the above quantities, we compute the proportion of the overlap between the entity mention and the entity name relative to the entity name, |m ∩ e| / |e|, and relative to the question, |m ∩ e| / |q|. The final score that the question has a mention linking to e is:

s_linker(e; q) = max_m (|m ∩ e| / |q| + |m ∩ e| / |e|)
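A plain-Python sketch of this computation; the case-folding is our assumption, and overlap_len implements the longest consecutive common sub-sequence with standard dynamic programming.

```python
def overlap_len(mention, entity_name):
    """|m ∩ e|: length in characters of the longest consecutive common
    sub-sequence between a mention and an entity name.
    (Case-folding is an assumption, not specified in the paper.)"""
    m, e = mention.lower(), entity_name.lower()
    best = 0
    prev = [0] * (len(e) + 1)
    for i in range(1, len(m) + 1):
        cur = [0] * (len(e) + 1)
        for j in range(1, len(e) + 1):
            if m[i - 1] == e[j - 1]:
                # extend the common substring ending at (i-1, j-1)
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def s_linker(question, entity_name, candidate_mentions):
    """s_linker(e; q) = max over mentions m of |m∩e|/|q| + |m∩e|/|e|."""
    best = 0.0
    for m in candidate_mentions:  # n-grams of the question
        ov = overlap_len(m, entity_name)
        best = max(best, ov / len(question) + ov / len(entity_name))
    return best
```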

Appendix B: Special Rules for Constraint Detection

1. Special threshold for date constraints. The time stamps in the KB usually follow the year-month-day format, while the times in WebQSP are usually years. This makes the overlap between the date entities in questions and the KB entity names small (the length of the overlap is usually 4). To deal with this, we only check whether the dates in questions match the years in the KB, and thus use a special threshold of θ = 1 for date constraints.

2. Filtering the constraints for answer nodes. Sometimes the answer node can connect to a huge number of other nodes, e.g., when the question is asking for a country and we have an answer candidate like the U.S. From observations on the WebQSP datasets, we found that most of the time the gold constraints on answers are their entity types (e.g., whether the question is asking for a country or a city). Based on this observation, in the constraint detection step, for the answer nodes we only keep the tuples with type relations (i.e., where the relation name contains the word "type"), such as common.topic.notable_types, education.educational_institution.school_type, etc.

Appendix C: Effects of Entity Re-Ranking on SimpleQuestions

Removing the entity re-ranking step results in a significant performance drop (see Table 3, the row "w/o entity re-ranking"). Table 4 evaluates our re-ranker as a separate task. Our re-ranker yields large improvements, especially when the beam sizes are smaller than 10. This indicates another important use of our proposed improved relation detection model: entity linking re-ranking.

Top-K | FreeBase API | (Golub and He, 2016) | (Yin et al., 2016) | Our Re-Ranker
1 | 40.9 | 52.9 | 72.7 | 79.0
10 | 64.3 | 74.0 | 86.9 | 89.5
20 | 69.7 | 77.8 | 88.4 | 90.9
50 | 75.7 | 82.0 | 90.2 | 92.5

Table 4: Entity re-ranking on SimpleQuestions (test set).