

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 2829–2839, Hong Kong, China, November 3–7, 2019. ©2019 Association for Computational Linguistics


KagNet: Knowledge-Aware Graph Networks for Commonsense Reasoning

Bill Yuchen Lin† Xinyue Chen‡ Jamin Chen† Xiang Ren†
{yuchen.lin,jaminche,xiangren}@usc.edu, [email protected]

†Computer Science Department, University of Southern California
‡Computer Science Department, Shanghai Jiao Tong University

    Abstract

Commonsense reasoning aims to empower machines with the human ability to make presumptions about ordinary situations in our daily life. In this paper, we propose a textual inference framework for answering commonsense questions, which effectively utilizes external, structured commonsense knowledge graphs to perform explainable inferences. The framework first grounds a question-answer pair from the semantic space to the knowledge-based symbolic space as a schema graph, a related sub-graph of external knowledge graphs. It represents schema graphs with a novel knowledge-aware graph network module named KAGNET, and finally scores answers with graph representations. Our model is based on graph convolutional networks and LSTMs, with a hierarchical path-based attention mechanism. The intermediate attention scores make it transparent and interpretable, thus producing trustworthy inferences. Using ConceptNet as the only external resource for BERT-based models, we achieved state-of-the-art performance on CommonsenseQA, a large-scale dataset for commonsense reasoning. We open-source our code1 to the community for future research in knowledge-aware commonsense reasoning.

    1 Introduction

Human beings are rational, and a major component of rationality is the ability to reason. Reasoning is the process of combining facts and beliefs to make new decisions (Johnson-Laird, 1980), as well as the ability to manipulate knowledge to draw inferences (Hudson and Manning, 2018). Commonsense reasoning utilizes the basic knowledge that reflects our natural understanding of the world and human behaviors, which is common to all humans.

1 https://github.com/INK-USC/KagNet

[Figure 1 shows the running example: the question “Where do adults use glue sticks?” with choices A: classroom, B: office, C: desk drawer is grounded from the semantic space into a schema graph in the symbolic space, linking the concepts adult, glue_stick, work, and use to office via relations such as CapableOf, AtLocation, HasSubevent, and ReceiveAction.]

Figure 1: An example of using external commonsense knowledge (symbolic space) for inference in natural language commonsense questions (semantic space).

Empowering machines with the ability to perform commonsense reasoning has been seen as the bottleneck of artificial general intelligence (Davis and Marcus, 2015). Recently, there have been a few emerging large-scale datasets for testing machine commonsense with various focuses (Zellers et al., 2018; Sap et al., 2019b; Zellers et al., 2019). In a typical dataset, CommonsenseQA (Talmor et al., 2019), given a question like “Where do adults use glue sticks?”, with the answer choices being {classroom (✗), office (✓), desk drawer (✗)}, a commonsense reasoner is expected to differentiate the correct choice from the other “distractive” candidates. False choices are usually highly related to the question context, but just less plausible in real-world scenarios, making the task even more challenging. This paper aims to tackle the research question of how we can teach machines to make such commonsense inferences, particularly in the question-answering setting.

It has been shown that simply fine-tuning large, pre-trained language models such as GPT (Radford et al., 2018) and BERT (Devlin et al., 2019)



can be a very strong baseline method. However, there still exists a large gap between the performance of said baselines and human performance. Reasoning with neural models is also lacking in transparency and interpretability: there is no clear way as to how they manage to answer commonsense questions, thus making their inferences dubious.

Merely relying on pre-training large language models on corpora cannot provide well-defined or reusable structures for explainable commonsense reasoning. We argue that it would be more beneficial to propose reasoners that can exploit commonsense knowledge bases (Speer et al., 2017; Tandon et al., 2017; Sap et al., 2019a). Knowledge-aware models can explicitly incorporate external knowledge as relational inductive biases (Battaglia et al., 2018) to enhance their reasoning capacity, as well as to increase the transparency of model behaviors for more interpretable results. Furthermore, a knowledge-centric approach is extensible through commonsense knowledge acquisition techniques (Li et al., 2016; Xu et al., 2018).

We propose a knowledge-aware reasoning framework for learning to answer commonsense questions, which has two major steps: schema graph grounding (§3) and graph modeling for inference (§4). As shown in Fig. 1, for each pair of question and answer candidate, we retrieve a graph from external knowledge graphs (e.g. ConceptNet) in order to capture the relevant knowledge for determining the plausibility of a given answer choice. The graphs are named “schema graphs”, inspired by the schema theory proposed by Gestalt psychologists (Axelrod, 1973). The grounded schema graphs are usually much more complicated and noisier than the ideal case shown in the figure.

Therefore, we propose a knowledge-aware graph network module to effectively model schema graphs. Our model KAGNET is a combination of graph convolutional networks (Kipf and Welling, 2017) and LSTMs, with a hierarchical path-based attention mechanism, which forms a GCN-LSTM-HPA architecture for path-based relational graph representation. Experiments show that our framework achieved a new state-of-the-art performance2 on the CommonsenseQA dataset. Our model also works better than other methods with limited supervision, and provides human-readable results via intermediate attention scores.

2 The highest score on the leaderboard as of the time when we submitted the paper (May 2019).

[Figure 2 depicts the pipeline: a question and an answer pass through concept recognition to obtain question concepts and answer concepts; a schema graph is constructed via path finding; a language encoder (e.g. BERT) produces a statement vector; the GCN-LSTM-HPA module (KagNet) produces a graph vector; and an MLP outputs the plausibility score.]

Figure 2: The overall workflow of the proposed framework with knowledge-aware graph network module.

    2 Overview

In this section, we first formalize the commonsense question answering problem in a knowledge-aware setting, and then introduce the overall workflow of our framework.

2.1 Problem Statement

Given a commonsense-required natural language question q and a set of N candidate answers {ai}, the task is to choose one answer from the set. From a knowledge-aware perspective, we additionally assume that the question q and choices {ai} can be grounded as a schema graph (denoted as g) extracted from a large external knowledge graph G, which is helpful for measuring the plausibility of answer candidates. The knowledge graph G = (V, E) can be defined as a fixed set of concepts V, and typed edges E describing semantic relations between concepts. Therefore, our goal is to effectively ground and model schema graphs to improve the reasoning process.

    2.2 Reasoning Workflow

As shown in Fig. 2, our framework accepts a pair of question and answer (QA-pair) denoted as q and a. It first recognizes the mentioned concepts within them respectively from the concept set V of the knowledge graph. We then algorithmically construct the schema graph g by finding paths between pairs of mentioned concepts (§3).

The grounded schema graph is further encoded with our proposed knowledge-aware graph network module (§4). We first use a model-agnostic language encoder, which can either be trainable or a fixed feature extractor, to represent the QA-pair


as a statement vector. The statement vector serves as an additional input to a GCN-LSTM-HPA architecture for path-based attentive graph modeling to obtain a graph vector. The graph vector is finally fed into a simple multi-layer perceptron to score this QA-pair into a scalar ranging from 0 to 1, representing the plausibility of the inference. The answer candidate with the maximum plausibility score to the same question becomes the final choice of our framework.
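The decision rule described above reduces to an argmax over per-candidate plausibility scores. The sketch below is a minimal illustration of that rule only: `score_pair` is a hypothetical placeholder for the full KagNet pipeline, not the actual model.

```python
# Minimal sketch of the decision rule in §2.2: each (question, answer)
# pair receives a plausibility score, and the candidate with the maximum
# score is chosen. `score_pair` is a toy stand-in for the real scorer.
def score_pair(question: str, answer: str) -> float:
    # Placeholder heuristic: word overlap. The real model encodes the
    # QA-pair and its schema graph and applies sigmoid(MLP(g)).
    return float(len(set(question.split()) & set(answer.split())))

def choose_answer(question, candidates, scorer=score_pair):
    # Final choice = candidate with maximum plausibility score.
    return max(candidates, key=lambda a: scorer(question, a))
```

With a scorer that rates "office" highest for the glue-stick question, `choose_answer` returns "office".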

    3 Schema Graph Grounding

The grounding stage is three-fold: recognizing concepts mentioned in text (§3.1), constructing schema graphs by retrieving paths in the knowledge graph (§3.2), and pruning noisy paths (§3.3).

    3.1 Concept Recognition

We match tokens in questions and answers to sets of mentioned concepts (Cq and Ca respectively) from the knowledge graph G (for this paper we chose to use ConceptNet due to its generality).

A naive approach to mentioned concept recognition is to exactly match n-grams in sentences with the surface tokens of concepts in V. For example, in the question “Sitting too close to watch tv can cause what sort of pain?”, the exact matching result Cq would be {sitting, close, watch tv, watch, tv, sort, pain, etc.}. We are aware that such retrieved mentioned concepts are not always perfect (e.g., “sort” is not a semantically related concept, and “close” is a polysemous concept). How to efficiently retrieve contextually related knowledge from noisy knowledge resources is still an open research question by itself (Weissenborn et al., 2017; Khashabi et al., 2017), and thus most prior works choose to stop here (Zhong et al., 2018; Wang et al., 2019b). We enhance this straightforward approach with some rules, such as soft matching with lemmatization and filtering of stop words, and further deal with noise by pruning paths (§3.3) and reducing their importance with attention mechanisms (§4.3).
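The naive matching step can be sketched as plain n-gram lookup against the concept vocabulary. The tiny vocabulary and stop-word list below are illustrative assumptions, and the lemmatization-based soft matching from the enhanced version is omitted.

```python
# Sketch of naive concept recognition (§3.1): exact n-gram matching of
# sentence tokens against the concept vocabulary V, with simple stop-word
# filtering. Vocabulary and stop words here are toy examples.
STOP_WORDS = {"to", "too", "can", "what", "of", "the", "a"}

def recognize_concepts(sentence, vocab, max_ngram=3):
    tokens = sentence.lower().replace("?", "").split()
    found = set()
    for n in range(max_ngram, 0, -1):          # try longer n-grams first
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            if gram in vocab and gram not in STOP_WORDS:
                found.add(gram)
    return found

vocab = {"sitting", "close", "watch tv", "watch", "tv", "sort", "pain"}
print(sorted(recognize_concepts(
    "Sitting too close to watch tv can cause what sort of pain?", vocab)))
```

Note that the noisy matches “sort” and “close” survive, mirroring the paper's observation that this step alone is imperfect.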

    3.2 Schema Graph Construction

ConceptNet. Before diving into the construction of schema graphs, we briefly introduce our target knowledge graph ConceptNet. ConceptNet can be seen as a large set of triples of the form (h, r, t), like (ice, HasProperty, cold), where h and t represent head and tail concepts in the concept set V and r is a relation type from the pre-defined set R. We delete and merge the original 42 relation types into 17 types, in order to increase the density of the knowledge graph3 for grounding and modeling.

Sub-graph Matching via Path Finding. We define a schema graph as a sub-graph g of the whole knowledge graph G, which represents the related knowledge for reasoning about a given question-answer pair with minimal additional concepts and edges. One may want to find a minimal spanning sub-graph covering all the question and answer concepts, which is actually the NP-complete “Steiner tree problem” in graphs (Garey and Johnson, 1977). Due to the incompleteness and tremendous size of ConceptNet, we find that it is impractical to retrieve a comprehensive but helpful set of knowledge facts this way. Therefore, we propose a straightforward yet effective graph construction algorithm via path finding among the mentioned concepts (Cq ∪ Ca).

Specifically, for each question concept ci ∈ Cq and answer concept cj ∈ Ca, we can efficiently find paths between them that are shorter than k concepts4. Then, we add edges, if any, between the concept pairs within Cq or Ca.
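The path-finding step can be sketched as a depth-limited search for simple paths with at most k concepts (k = 4 gives up to three hops). The toy adjacency list below is hypothetical, not real ConceptNet data.

```python
# Sketch of schema-graph construction via path finding (§3.2): for each
# (question concept, answer concept) pair, collect simple paths with at
# most k concepts using a depth-limited DFS.
def find_paths(graph, src, dst, k=4):
    paths, stack = [], [[src]]
    while stack:
        path = stack.pop()
        if path[-1] == dst:
            paths.append(path)
            continue
        if len(path) >= k:            # already k concepts; stop extending
            continue
        for nxt in graph.get(path[-1], []):
            if nxt not in path:       # simple paths only (no revisits)
                stack.append(path + [nxt])
    return paths

toy_graph = {                          # hypothetical adjacency list
    "adult": ["work"],
    "work": ["office"],
    "glue_stick": ["use"],
    "use": ["office"],
}
print(find_paths(toy_graph, "adult", "office"))  # -> [['adult', 'work', 'office']]
```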

    3.3 Path Pruning via KG Embedding

To prune irrelevant paths from potentially noisy schema graphs, we first utilize knowledge graph embedding (KGE) techniques, like TransE (Wang et al., 2014), to pre-train concept embeddings V and relation type embeddings R, which are also used as initialization for KAGNET (§4). In order to measure the quality of a path, we decompose it into a set of triples, the confidence of which can be directly measured by the scoring function of the KGE method (i.e. the confidence of triple classification). Thus, we score a path with the multiplication product of the scores of each triple in the path, and then we empirically set a threshold for pruning (§5.3).
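A hedged sketch of this pruning rule: score each triple with a TransE-style energy mapped into (0, 1], multiply along the path, and keep paths above a threshold. The 2-d embeddings and the threshold value below are toy assumptions, not the pre-trained ones.

```python
# Sketch of path pruning via KG embeddings (§3.3). TransE scores a triple
# (h, r, t) by the energy ||h + r - t||; smaller energy means higher
# confidence, which we map to (0, 1] with exp(-energy). A path's score is
# the product of its triples' scores. All embeddings here are toy values.
import math

def triple_score(h_vec, r_vec, t_vec):
    energy = math.sqrt(sum((h + r - t) ** 2
                           for h, r, t in zip(h_vec, r_vec, t_vec)))
    return math.exp(-energy)

def path_score(path_triples, emb):
    score = 1.0
    for h, r, t in path_triples:
        score *= triple_score(emb[h], emb[r], emb[t])
    return score

emb = {  # hypothetical 2-d embeddings
    "adult": [0.0, 0.0], "CapableOf": [1.0, 0.0], "work": [1.0, 0.0],
    "AtLocation": [0.0, 1.0], "office": [1.0, 1.0],
}
path = [("adult", "CapableOf", "work"), ("work", "AtLocation", "office")]
keep = path_score(path, emb) > 0.5     # threshold set empirically
print(path_score(path, emb), keep)     # -> 1.0 True
```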

    4 Knowledge-Aware Graph Network

The core component of our reasoning framework is the knowledge-aware graph network module KAGNET. KAGNET first encodes the plain structures of schema graphs with graph convolutional networks (§4.1) to accommodate pre-trained concept embeddings in their particular context within schema graphs. It then utilizes LSTMs to encode the paths between Cq and Ca, capturing multi-hop relational information (§4.2). Finally, we apply a hierarchical path-based attention mechanism (§4.3) to complete the GCN-LSTM-HPA architecture, which models relational schema graphs with respect to the paths between question and answer concepts.

3 The full mapping list is in the appendix.
4 We set k = 4 in experiments to gather three-hop paths.

[Figure 3 sketches the GCN-LSTM-HPA architecture: GCNs encode the unlabeled schema graph; an LSTM path encoder turns each path Pi,j[k] between a question concept and an answer concept into a path vector; path-level attention (scores α(i,j,k)) and concept-pair-level attention (scores β(i,j), computed from the statement vector s with matrices W1 and W2) aggregate the path vectors Ri,j and pair vectors Ti,j into the graph vector.]

Figure 3: Illustration of the GCN-LSTM-HPA architecture for the proposed KAGNET module.

    4.1 Graph Convolutional Networks

Graph convolutional networks (GCNs) encode graph-structured data by updating node vectors via pooling features of their adjacent nodes (Kipf and Welling, 2017). Our intuition for applying GCNs to schema graphs is to 1) contextually refine the concept vectors and 2) capture structural patterns of schema graphs for generalization.

Although we have obtained concept vectors by pre-training (§3.3), the representations of concepts still need to be further accommodated to their specific schema graph context. Think of polysemous concepts such as “close” (§3.1), which can either be a verb concept as in “close the door” or an adjective concept meaning “a short distance apart”. Using GCNs to update concept vectors with their neighbors is thus helpful for disambiguation and contextualized concept embedding. Also, the pattern of schema graph structures provides potentially valuable information for reasoning. For instance, shorter and denser connections between question and answer concepts could mean higher plausibility under specific contexts.

As many works show (Marcheggiani and Titov, 2017; Zhang et al., 2018), relational GCNs (Schlichtkrull et al., 2018) usually over-parameterize the model and cannot effectively utilize multi-hop relational information. We thus apply GCNs on the plain version (unlabeled, non-directional) of schema graphs, ignoring the relation types on the edges. Specifically, the vector for concept ci ∈ Vg in the schema graph g is initialized with its pre-trained embedding (h(0)i = Vi). We then update it at the (l+1)-th layer by pooling features of its neighboring nodes (Ni) and its own at the l-th layer, with a non-linear activation function σ:

$$h_i^{(l+1)} = \sigma\Big( W_{\text{self}}^{(l)} h_i^{(l)} + \sum_{j \in N_i} \frac{1}{|N_i|} W^{(l)} h_j^{(l)} \Big)$$
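The GCN layer update of §4.1 can be sketched in pure Python: each node's new vector is the activation of a self-transform plus the mean-pooled transform of its neighbors. The identity weight matrices and two-node graph below are toy assumptions; real KagNet initializes node vectors with pre-trained concept embeddings.

```python
# Sketch of one GCN layer (§4.1):
#   h_i^(l+1) = relu( W_self h_i^(l) + (1/|N_i|) * sum_j W h_j^(l) )
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, x) for x in v]

def gcn_layer(h, neighbors, W_self, W):
    new_h = {}
    for i, hi in h.items():
        agg = matvec(W_self, hi)          # self transform
        Ni = neighbors.get(i, [])
        for j in Ni:                      # mean-pooled neighbor messages
            msg = matvec(W, h[j])
            agg = [a + m / len(Ni) for a, m in zip(agg, msg)]
        new_h[i] = relu(agg)
    return new_h

I2 = [[1.0, 0.0], [0.0, 1.0]]             # toy identity weights
h0 = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
print(gcn_layer(h0, {"a": ["b"], "b": ["a"]}, I2, I2))
```

With identity weights, each node's vector simply gains the average of its neighbors' vectors, illustrating the contextual refinement described above.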

4.2 Relational Path Encoding

In order to capture the relational information in schema graphs, we propose an LSTM-based path encoder on top of the outputs of the GCNs. Recall that our graph representation has a special purpose: to measure the plausibility of a candidate answer to a given question. Thus, we propose to represent graphs with respect to the paths between question concepts Cq and answer concepts Ca.

We denote the k-th path between the i-th question concept $c_i^{(q)} \in C_q$ and the j-th answer concept $c_j^{(a)} \in C_a$ as $P_{i,j}[k]$, which is a sequence of triples:

$$P_{i,j}[k] = \big[(c_i^{(q)}, r_0, t_0), \ldots, (t_{n-1}, r_n, c_j^{(a)})\big]$$

Note that the relations are represented with trainable relation vectors (initialized with pre-trained relation embeddings), and concept vectors are the GCNs’ outputs ($h^{(l)}$). Thus, each triple can be represented by the concatenation of the three


corresponding vectors. We employ LSTM networks to encode these paths as sequences of triple vectors, taking the concatenation of the first and the last hidden states:

$$R_{i,j} = \frac{1}{|P_{i,j}|} \sum_k \text{LSTM}(P_{i,j}[k])$$
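The pooling in this equation is a plain average over encoded path vectors. In the sketch below, precomputed toy vectors stand in for the LSTM outputs.

```python
# Sketch of the mean pooling in §4.2: the latent relation R_ij between a
# concept pair is the average of its encoded path vectors. The vectors
# here are toy stand-ins for LSTM(P_ij[k]) outputs.
def mean_pool(path_vectors):
    n, dim = len(path_vectors), len(path_vectors[0])
    return [sum(v[d] for v in path_vectors) / n for d in range(dim)]

print(mean_pool([[1.0, 0.0], [0.0, 1.0]]))  # -> [0.5, 0.5]
```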

The above $R_{i,j}$ can be viewed as the latent relation between the question concept $c_i^{(q)}$ and the answer concept $c_j^{(a)}$, for which we aggregate the representations of all the paths between them in the schema graph. We can now finalize the vector representation of a schema graph g by aggregating all vectors in the matrix R using mean pooling:

$$T_{i,j} = \text{MLP}\big([\,s\,;\,c_i^{(q)}\,;\,c_j^{(a)}\,]\big)$$

$$g = \frac{\sum_{i,j} [\,R_{i,j}\,;\,T_{i,j}\,]}{|C_q| \times |C_a|}$$

where [· ; ·] denotes the concatenation of two vectors. The statement vector s in the above equation is obtained from a certain language encoder, which can either be a trainable sequence encoder like an LSTM, or features extracted from pre-trained universal language encoders like GPT/BERT. To encode a question-answer pair with universal language encoders, we simply create a sentence combining the question and the answer with a special token (“question + [sep] + answer”), and then use the vector of ‘[cls]’ as suggested by prior works (Talmor et al., 2019).

We concatenate $R_{i,j}$ with an additional vector $T_{i,j}$ before doing average pooling. The $T_{i,j}$ is inspired by the Relation Network (Santoro et al., 2017), which also encodes latent relational information, yet from the context in the statement s instead of the schema graph g. Simply put, we want to combine the relational representations of a pair of question/answer concepts from both the schema graph side (symbolic space) and the language side (semantic space).

Finally, the plausibility score of the answer candidate a to the question q can be computed as score(q, a) = sigmoid(MLP(g)).

4.3 Hierarchical Attention Mechanism

A natural argument against the above GCN-LSTM-mean architecture is that mean pooling over the path vectors does not always make sense, since some paths are more important than others for reasoning. Also, it is usually not the case that all pairs of question and answer concepts contribute equally to the reasoning. Therefore, we propose a hierarchical path-based attention mechanism to selectively aggregate important path vectors, and then more important question-answer concept pairs. This core idea is similar to the work of Yang et al. (2016), which proposes a document encoder with two levels of attention applied at the word level and the sentence level. In our case, we have path-level and concept-pair-level attention for learning to contextually model graph representations. We learn a parameter matrix $W_1$ for path-level attention scores, and the importance of the path $P_{i,j}[k]$ is denoted as $\hat{\alpha}_{(i,j,k)}$:

$$\alpha_{(i,j,k)} = T_{i,j}\, W_1\, \text{LSTM}(P_{i,j}[k])$$
$$\hat{\alpha}_{(i,j,\cdot)} = \text{SoftMax}(\alpha_{(i,j,\cdot)})$$
$$\hat{R}_{i,j} = \sum_k \hat{\alpha}_{(i,j,k)} \cdot \text{LSTM}(P_{i,j}[k])$$

    Afterwards, we similarly obtain the attention overconcept-pairs.

$$\beta_{(i,j)} = s\, W_2\, T_{i,j}$$
$$\hat{\beta}_{(\cdot,\cdot)} = \text{SoftMax}(\beta_{(\cdot,\cdot)})$$
$$\hat{g} = \sum_{i,j} \hat{\beta}_{(i,j)}\, [\,\hat{R}_{i,j}\,;\,T_{i,j}\,]$$
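Both levels of the §4.3 attention share one primitive: softmax-normalized scores used to take a weighted sum of vectors. The sketch below shows that primitive with the bilinear scores simplified to dot products (the trainable $W_1$/$W_2$ are folded away) and precomputed toy vectors standing in for the LSTM path encodings.

```python
# Sketch of the attentive pooling primitive used at both levels of §4.3:
# scores softmax-normalized over candidates, then a weighted sum.
import math

def softmax(xs):
    m = max(xs)                            # subtract max for stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attentive_pool(query, vectors):
    # alpha_k proportional to exp(query . v_k); return weighted sum of v_k.
    scores = softmax([sum(q * v for q, v in zip(query, vec))
                      for vec in vectors])
    dim = len(vectors[0])
    return [sum(a * vec[d] for a, vec in zip(scores, vectors))
            for d in range(dim)]

# Path level: pool path vectors into R̂_ij with T_ij as the query;
# pair level: pool pair vectors with the statement vector s as the query.
paths = [[1.0, 0.0], [0.0, 1.0]]           # toy path encodings
print(attentive_pool([2.0, 0.0], paths))
```

A query aligned with the first path vector assigns it most of the attention mass, which is exactly the selectivity that motivates replacing mean pooling.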

The whole GCN-LSTM-HPA architecture is illustrated in Figure 3. To sum up, we claim that KAGNET is a graph neural network module with the GCN-LSTM-HPA architecture that models relational graphs for relational reasoning under the context of both the knowledge symbolic space and the language semantic space.

    5 Experiments

We introduce our setup for the CommonsenseQA dataset (Talmor et al., 2019), present the baseline methods, and finally analyze the experimental results.

    5.1 Dataset and Experiment Setup

The CommonsenseQA dataset consists of 12,102 (v1.11) natural language questions in total that require human commonsense reasoning ability to answer, where each question has five candidate answers (hard mode). The authors also release an easy version of the dataset by picking two random terms/phrases for a sanity check. CommonsenseQA is gathered directly from real human annotators and covers a broad range of types of commonsense, including spatial, social, causal, physical, temporal, etc. To the best of our knowledge, CommonsenseQA may be the most suitable choice for evaluating supervised learning models for question answering.

                           10% of IHtrain          50% of IHtrain          100% of IHtrain
Model                      IHdev(%)  IHtest(%)     IHdev(%)  IHtest(%)     IHdev(%)  IHtest(%)
Random guess               20.0      20.0          20.0      20.0          20.0      20.0
GPT-FINETUNING             27.55     26.51         32.46     31.28         47.35     45.58
GPT-KAGNET                 28.13     26.98         33.72     32.33         48.95     46.79
BERT-BASE-FINETUNING       30.11     29.78         38.66     36.83         53.48     53.26
BERT-BASE-KAGNET           31.05     30.94         40.32     39.01         55.57     56.19
BERT-LARGE-FINETUNING      35.71     32.88         55.45     49.88         60.61     55.84
BERT-LARGE-KAGNET          36.82     33.91         58.73     51.13         62.35     57.16
Human Performance          -         88.9          -         88.9          -         88.9

Table 1: Comparisons with large pre-trained language model fine-tuning using different amounts of training data (accuracy, %).

For the comparisons with the results reported in the CommonsenseQA paper and leaderboard, we use the official split (9,741/1,221/1,140), denoted as (OFtrain/OFdev/OFtest). Note that performance on OFtest can only be evaluated by submitting predictions to the organizers. To efficiently test other baseline methods and run ablation studies, we use 1,241 randomly selected examples from the training data as our in-house test data, forming an (8,500/1,221/1,241) split denoted as (IHtrain/IHdev/IHtest). All experiments use the random-split setting as the authors suggested, and three or more random states are tested on development sets to pick the best-performing one.
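The in-house split described above can be reproduced in spirit as follows; the integer lists are stand-ins for the actual dataset examples, and the seed is arbitrary.

```python
import random

def make_inhouse_split(train_examples, dev_examples, n_test=1241, seed=42):
    """Carve an in-house test set out of the official training data,
    mirroring the (8,500 / 1,221 / 1,241) split described above."""
    rng = random.Random(seed)
    shuffled = train_examples[:]
    rng.shuffle(shuffled)
    ih_test = shuffled[:n_test]
    ih_train = shuffled[n_test:]
    return ih_train, dev_examples, ih_test   # IHtrain / IHdev / IHtest

train = list(range(9741))                    # stand-in for the 9,741 OFtrain examples
dev = list(range(1221))                      # stand-in for the 1,221 OFdev examples
ih_train, ih_dev, ih_test = make_inhouse_split(train, dev)
# → sizes (8500, 1221, 1241)
```

Fixing the seed keeps the in-house split reproducible across runs.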

    5.2 Compared Methods

We consider two kinds of baseline methods, as follows:

• Knowledge-agnostic methods. These methods either use no external resources or use only unstructured textual corpora as additional information, e.g., textual snippets gathered from a search engine or large pre-trained language models like BERT-LARGE. QABILINEAR, QACOMPARE, and ESIM are three supervised learning models for natural language inference that can be equipped with different word embeddings, including GloVe and ELMo. BIDAF++ utilizes Google web snippets as context and is further augmented with a self-attention layer while using ELMo as input features. GPT/BERT-LARGE are fine-tuning methods with an additional linear layer for classification, as the authors suggested. They both add a special token '[sep]' to the input and use the hidden state of '[cls]' as the input to the linear layer. More details about them can be found in the dataset paper (Talmor et al., 2019).

• Knowledge-aware methods. We also adopt some recently proposed methods of incorporating knowledge graphs for question answering. KV-MEM (Mihaylov and Frank, 2018) incorporates retrieved triples from ConceptNet at the word level, using a key-value memory module to improve the representation of each token individually by learning an attentive aggregation of related triple vectors. CBPT (Zhong et al., 2018) is a plug-in method for assembling the predictions of any model with a straightforward method of utilizing pre-trained concept embeddings from ConceptNet. TEXTGRAPHCAT (Wang et al., 2019c) concatenates the graph-based and text-based representations of the statement and then feeds them into a classifier. As a baseline method TRIPLESTRING, we create sentence templates for generating sentences and then feed retrieved triples as additional text inputs. Rajani et al. (2019) propose to collect human explanations for commonsense reasoning from annotators as additional knowledge (COS-E), and then train a language model on such human annotations to improve model performance.

    5.3 Implementation Details of KagNet

Our best settings of KAGNET (tuned on OFdev) have two GCN layers (100 and 50 dimensions, respectively) and one bidirectional LSTM (128 dimensions). We pre-train KGE using TransE (100 dimensions) initialized with GloVe embeddings. The statement encoder in use is BERT-LARGE, which works as a pre-trained sentence encoder to obtain fixed features for each pair of question and answer candidate. The paths are pruned with the path-score threshold set to 0.15, keeping 67.21% of the original


Model                               OFdev-Acc.(%)   OFtest-Acc.(%)
Random guess                        20.0            20.0
BIDAF++                             -               32.0
QACOMPARE+GLOVE                     -               25.7
QABILINEAR+GLOVE                    -               31.5
ESIM+ELMO                           -               32.8
ESIM+GLOVE                          -               34.1
GPT-FINETUNING                      47.11           45.5
BERT-BASE-FINETUNING                53.57           53.0
BERT-LARGE-FINETUNING               62.34           56.7
COS-E (w/ additional annotations)   -               58.2
KAGNET (Ours)                       64.46           58.9
Human Performance                   -               88.9

Table 2: Comparison with official benchmark baseline methods using the official split on the leaderboard.

paths. We did not prune concept pairs with fewer than three paths. For the very few pairs with no paths at all, R̂i,j is a randomly sampled vector. We train our KAGNET models with the Adam optimizer (Kingma and Ba, 2015). In our experiments, we found that the recall of ConceptNet on commonsense questions and answers is very high (over 98% of QA pairs have more than one grounded concept).
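The path-pruning step can be sketched as a simple filter over scored paths. The scoring function below is a toy stand-in for the paper's KGE-based path score; only the thresholding logic mirrors the description above.

```python
def prune_paths(paths, path_score, threshold=0.15):
    """Keep only paths whose score exceeds the threshold (0.15 above)."""
    return [p for p in paths if path_score(p) > threshold]

# Toy example with a hypothetical scoring table in place of the KGE-based scorer.
paths = [("ink", "PartOf", "fountain_pen"),
         ("ink", "RelatedTo", "container")]
scores = {paths[0]: 0.9, paths[1]: 0.1}
kept = prune_paths(paths, scores.get)
# → only the high-scoring path survives
```

Any callable mapping a path to a score can be plugged in as `path_score`.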

    5.4 Performance Comparisons and Analysis

Comparison with standard baselines.
As shown in Table 2, we first use the official split to compare our model with the baseline methods reported in the paper and on the leaderboard. BERT- and GPT-based pre-training methods score much higher than the other baselines, demonstrating the ability of language models to store commonsense knowledge in an implicit way. This presumption is also investigated by Trinh and Le (2019) and Wang et al. (2019a). Our proposed framework achieves an absolute improvement of 2.2% in accuracy on the test data, a state-of-the-art performance.

We conduct experiments with our in-house splits to investigate whether KAGNET also works well with other universal language encoders (GPT and BERT-BASE), particularly with different fractions of the training data (10%, 50%, 100%). Table 1 shows that our KAGNET-based methods using fixed pre-trained language encoders outperform fine-tuning the encoders themselves in all settings. Furthermore, we find that the improvements in the small-data setting (10%) are relatively limited, and we believe an important future research direction is thus few-shot learning

                     Easy Mode               Hard Mode
Model                IHdev.(%)  IHtest.(%)   IHdev.(%)  IHtest.(%)
Random guess         33.3       33.3         20.0       20.0
BLSTMS               80.15      78.01        34.79      32.12
+ KV-MN              81.71      79.63        35.70      33.43
+ CSPT               81.79      80.01        35.31      33.61
+ TEXTGRAPHCAT       82.68      81.03        34.72      33.15
+ TRIPLESTRING       79.11      76.02        33.19      31.02
+ KAGNET             83.26      82.15        36.38      34.57
Human Performance    -          99.5         -          88.9

Table 3: Comparisons with knowledge-aware baseline methods using the in-house split (both easy and hard modes), on top of a BLSTM sentence encoder.

    for commonsense reasoning.

Comparison with knowledge-aware baselines.
To compare our model with the other adopted baseline methods that also incorporate ConceptNet, we set up a bidirectional LSTM-based model for our in-house dataset. Then, we add the baseline methods and KAGNET on top of the BLSTMs to compare their abilities to utilize external knowledge.⁵ Table 3 shows the comparisons under both easy mode and hard mode; our method outperforms all knowledge-aware baselines by a large margin in terms of accuracy. Note that we compare our model with CoS-E in Table 2. Although CoS-E also achieves a better result than fine-tuning BERT alone by training with human-generated explanations, our proposed KAGNET does not utilize any additional human effort to provide more supervision.

Ablation study on model components.
To better understand the effectiveness of each component of our method, we conduct an ablation study, shown in Table 4. We find that replacing our GCN-LSTM-HPA architecture with traditional relational GCNs, which use separate weight matrices for different relation types, results in worse performance due to their over-parameterization. The attention mechanisms matter almost equally at the two levels, and pruning also effectively filters noisy paths.

⁵ We use an LSTM-based setup because it is non-trivial to apply token-level knowledge-aware baseline methods to complicated pre-trained encoders like BERT.

Model                                    IHdev.(%)  IHtest.(%)
KAGNET (STANDARD)                        62.35      57.16
 : replace GCN-LSTM-HPA w/ R-GCN         60.01      55.08
 : w/o GCN                               61.84      56.11
 : #GCN Layers = 1                       62.05      57.03
 : w/o Path-level Attention              60.12      56.05
 : w/o QAPair-level Attention            60.39      56.13
 : using all paths (w/o pruning)         59.96      55.27

Table 4: Ablation study on the KAGNET framework.

Error analysis.
Among the failed cases, there are three kinds of hard problems that KAGNET is still not good at:

• Negative reasoning: the grounding stage is not sensitive to negation words, and the model can thus choose exactly the opposite answer.

• Comparative reasoning: for questions with more than one highly plausible answer, a commonsense reasoner should benefit from explicitly investigating the differences between answer candidates, which the KAGNET training method is not capable of doing.

• Subjective reasoning: many answers actually depend on the "personality" of the reasoner. For instance, for "Traveling from new place to new place is likely to be what?", the dataset gives the answer as "exhilarating" instead of "exhausting", which we think is more a personalized, subjective inference than common sense.

5.5 Case Study on Interpretability

Our framework enjoys the merit of being more transparent and thus provides a more interpretable inference process. We can understand our model's behaviors by analyzing the hierarchical attention scores on the question-answer concept pairs and the paths between them.

Figure 4 shows an example of how we can analyze our KAGNET framework through both pair-level and path-level attention scores. We first select the concept pairs with the highest attention scores and then look at the (one or two) top-ranked paths for each selected pair. We find that the paths located in this way are highly related to the inference process, and that noisy concepts like 'fountain' are diminished during modeling.
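The inspection procedure above (top concept pairs first, then top paths per pair) can be sketched with a couple of argsorts; the attention values below are made up for illustration.

```python
import numpy as np

def top_attended(beta_hat, alpha_hat, n_pairs=2, n_paths=1):
    """Pick the highest-scoring concept pairs under pair-level attention,
    then the top-ranked paths for each, mirroring the two-step inspection
    described above."""
    pair_idx = np.argsort(beta_hat)[::-1][:n_pairs]
    return {int(p): np.argsort(alpha_hat[p])[::-1][:n_paths].tolist()
            for p in pair_idx}

beta_hat = np.array([0.1, 0.6, 0.3])        # pair-level attention (toy values)
alpha_hat = np.array([[0.2, 0.8],           # path-level attention per pair
                      [0.7, 0.3],
                      [0.6, 0.4]])
top = top_attended(beta_hat, alpha_hat)
# → {1: [0], 2: [0]}: pairs 1 and 2 dominate, each with its first path on top
```

Mapping the returned indices back to concept names and path triples yields explanations like the one in Figure 4.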

5.6 Model Transferability

We study the transferability of a model trained on CommonsenseQA (CSQA) by directly testing it on another task while fixing its parameters. Recall that we have obtained a BERT-LARGE model and a KAGNET model trained on CSQA. We now denote them as CSQA-BL and CSQA-KN to indicate that they are no longer trainable.

To investigate their transferability, we separately test them on SWAG (Zellers et al., 2018)

[Figure 4: An example of interpreting model behavior via hierarchical attention scores. Question: "What do you fill with ink to write on an A4 paper?" Answers: A: fountain pen ✔ (KagNet); B: printer (BERT); C: squid; D: pencil case (GPT); E: newspaper. Top-attended paths include "ink —PartOf→ fountain_pen" and "ink —RelatedTo→ container".]


and OpenBookQA (Mihaylov et al., 2018).

An interesting fact is that the best performance on the OFtest set is still achieved by the original fine-tuned RoBERTa model, which is pre-trained with corpora much larger than BERT's. All other RoBERTa-extended methods yield negative improvements. We also use statement vectors from RoBERTa as the input vectors for KAGNET, and find that the performance on OFdev marginally improves from 77.47% to 77.56%. Based on the failed cases in our error analysis, we believe fine-tuning RoBERTa has reached a limit imposed by the annotator biases of the dataset and the lack of comparative reasoning strategies.

    6 Related Work

Commonsense knowledge and reasoning.
There is a recent surge of novel large-scale datasets for testing machine commonsense with various focuses, such as situation prediction (SWAG) (Zellers et al., 2018), social behavior understanding (Sap et al., 2019a,b), visual scene comprehension (Zellers et al., 2019), and general commonsense reasoning (Talmor et al., 2019), which encourages the study of supervised learning methods for commonsense reasoning. Trinh and Le (2018) find that large language models show promising results on the WSC resolution task (Levesque, 2011), but this approach can hardly be applied in a more general question-answering setting, nor does it provide the explicit knowledge used in inference. A unique merit of our KAGNET method is that it provides grounded explicit knowledge triples and paths with scores, so that users can better understand and put trust in the behaviors and inferences of the model.

Injecting external knowledge for NLU. Our work also lies in the general context of using external knowledge to encode sentences or answer questions. Yang and Mitchell (2017) are among the first to propose encoding sentences by iteratively retrieving related entities from knowledge bases and then merging their embeddings into LSTM computations, achieving better performance on entity/event extraction tasks. Weissenborn et al. (2017), Mihaylov and Frank (2018), and Annervaz et al. (2018) follow this line of work to incorporate the embeddings of related knowledge triples at the word level and improve the performance of natural language understanding tasks. In contrast to our work, they do not explicitly impose graph-structured knowledge on models, but limit its potential to transforming word embeddings into concept embeddings.

Some other recent attempts (Zhong et al., 2018; Wang et al., 2019c) to use ConceptNet graph embeddings are adopted and compared in our experiments (§5). Rajani et al. (2019) propose to manually collect more human explanations for correct answers as additional supervision for auxiliary training. The KAGNET-based framework focuses on injecting external knowledge as an explicit graph structure, and enjoys relational reasoning capacity over the graphs.

Relational reasoning. KAGNET can be seen as a knowledge-augmented Relation Network (RN) module (Santoro et al., 2017), which was proposed for the visual question answering task requiring relational reasoning (i.e., questions about the relations between multiple 3D objects in an image). We view the concepts in questions and answers as objects and effectively utilize external knowledge graphs to model their relations from both the semantic and symbolic spaces (§4.2), while prior methods mainly work in the semantic one.

    7 Conclusion

We propose a knowledge-aware framework for learning to answer commonsense questions. The framework first constructs schema graphs to represent relevant commonsense knowledge, and then models the graphs with our KAGNET module. The module is based on a GCN-LSTM-HPA architecture, which effectively represents graphs for relational reasoning in a transparent, interpretable way, yielding new state-of-the-art results on a large-scale general dataset for testing machine commonsense. Future directions include better question-parsing methods to deal with negation and comparative question answering, as well as incorporating knowledge into visual reasoning.

    Acknowledgments

This work has been supported in part by National Science Foundation SMA 18-29268, DARPA MCS and GAILA, IARPA BETTER, the Schmidt Family Foundation, an Amazon Faculty Award, a Google Research Award, a Snapchat Gift, and a JP Morgan AI Research Award. We would like to thank all the collaborators in the INK research lab for their constructive feedback on this work.


References

K. M. Annervaz, Somnath Basu Roy Chowdhury, and Ambedkar Dukkipati. 2018. Learning beyond datasets: Knowledge graph augmented neural networks for natural language processing. In Proc. of NAACL-HLT.

Robert Axelrod. 1973. Schema theory: An information processing model of perception and cognition. American Political Science Review, 67(4):1248–1266.

Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinícius Flores Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Çaglar Gülçehre, Francis Song, Andrew J. Ballard, Justin Gilmer, George E. Dahl, Ashish Vaswani, Kelsey R. Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matthew Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. 2018. Relational inductive biases, deep learning, and graph networks. CoRR, abs/1806.01261.

Ernest Davis and Gary Marcus. 2015. Commonsense reasoning and commonsense knowledge in artificial intelligence. Commun. ACM, 58(9):92–103.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL-HLT.

Michael R. Garey and David S. Johnson. 1977. The rectilinear Steiner tree problem is NP-complete. SIAM Journal on Applied Mathematics, 32(4):826–834.

Drew A. Hudson and Christopher D. Manning. 2018. Compositional attention networks for machine reasoning. In Proc. of ICLR.

Philip N. Johnson-Laird. 1980. Mental models in cognitive science. Cognitive Science, 4(1):71–115.

Daniel Khashabi, Tushar Khot, Ashish Sabharwal, and Dan Roth. 2017. Learning what is essential in questions. In Proc. of CoNLL.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proc. of ICLR.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proc. of ICLR.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard H. Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. In Proc. of EMNLP.

Hector J. Levesque. 2011. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning.

Xiang Li, Aynaz Taheri, Lifu Tu, and Kevin Gimpel. 2016. Commonsense knowledge base completion. In Proc. of ACL.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar S. Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke S. Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692.

Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proc. of EMNLP.

Todor Mihaylov and Anette Frank. 2018. Knowledgeable reader: Enhancing cloze-style reading comprehension with external commonsense knowledge. In Proc. of ACL.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proc. of EMNLP.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Explain yourself! Leveraging language models for commonsense reasoning. In Proc. of ACL.

Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter W. Battaglia, and Timothy P. Lillicrap. 2017. A simple neural network module for relational reasoning. In Proc. of NIPS.

Maarten Sap, Ronan LeBras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A. Smith, and Yejin Choi. 2019a. ATOMIC: An atlas of machine commonsense for if-then reasoning. In Proc. of AAAI.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019b. SocialIQA: Commonsense reasoning about social interactions. CoRR, abs/1904.09728.

Michael Sejr Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In Proc. of European Semantic Web Conference.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In Proc. of AAAI.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proc. of NAACL-HLT.

Niket Tandon, Gerard de Melo, and Gerhard Weikum. 2017. WebChild 2.0: Fine-grained commonsense knowledge distillation. In Proc. of ACL.

Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. CoRR, abs/1806.02847.

Trieu H. Trinh and Quoc V. Le. 2019. Do language models have common sense? OpenReview, ICLR submissions.

Cunxiang Wang, Shuailong Liang, Yue Zhang, Xiaonan Li, and Tian Gao. 2019a. Does it make sense? And why? A pilot study for sense making and explanation. In Proc. of ACL.

Xiaoyan Wang, Pavan Kapanipathi, Ryan Musa, Mo Yu, Kartik Talamadupula, Ibrahim Abdelaziz, Maria Chang, Achille Fokoue, Bassem Makni, Nicholas Mattei, and Michael Witbrock. 2019b. Improving natural language inference using external knowledge in the science questions domain. In Proc. of AAAI.

Xiaoyan Wang, Pavan Kapanipathi, Ryan Musa, Mo Yu, Kartik Talamadupula, Ibrahim Abdelaziz, Maria Chang, Achille Fokoue, Bassem Makni, Nicholas Mattei, and Michael Witbrock. 2019c. Improving natural language inference using external knowledge in the science questions domain.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph embedding by translating on hyperplanes. In Proc. of AAAI.

Dirk Weissenborn, Tomáš Kočiský, and Chris Dyer. 2017. Dynamic integration of background knowledge in neural NLU systems. arXiv preprint arXiv:1706.02596.

Frank Xu, Bill Yuchen Lin, and Kenny Q. Zhu. 2018. Automatic extraction of commonsense LocatedNear knowledge. In Proc. of ACL.

Bishan Yang and Tom Michael Mitchell. 2017. Leveraging knowledge bases in LSTMs for improving machine reading. In Proc. of ACL.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. ArXiv, abs/1906.08237.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J. Smola, and Eduard H. Hovy. 2016. Hierarchical attention networks for document classification. In Proc. of NAACL-HLT.

Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. From recognition to cognition: Visual commonsense reasoning. In Proc. of CVPR.

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proc. of EMNLP.

Yuhao Zhang, Peng Qi, and Christopher D. Manning. 2018. Graph convolution over pruned dependency trees improves relation extraction. In Proc. of EMNLP.

Wanjun Zhong, Duyu Tang, Nan Duan, Ming Zhou, Jiahai Wang, and Jian Yin. 2018. Improving question answering by commonsense-based pre-training. ArXiv, abs/1809.03568.